An Introduction to Data Analysis

Fabio Nelli

In this chapter, you take your first steps in the world of data analysis, learning in detail about the concepts and processes that make up this discipline. The concepts discussed here provide the background for the following chapters, where they are put into practice as Python code using several libraries, each covered in its own chapter.

Nelli, F. (2018). An Introduction to Data Analysis. In: Python Data Analytics. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-3913-1_1

Introduction to Research Statistical Analysis: An Overview of the Basics

Christian Vandever

HCA Healthcare Graduate Medical Education

Description

This article covers many statistical ideas essential to research statistical analysis. Sample size is explained through the concepts of statistical significance level and power. Variable types and definitions are included to clarify necessities for how the analysis will be interpreted. Categorical and quantitative variable types are defined, as well as response and predictor variables. Statistical tests described include t-tests, ANOVA and chi-square tests. Multiple regression is also explored for both logistic and linear regression. Finally, the most common statistics produced by these methods are explored.

Introduction

Statistical analysis is necessary for any research project seeking to draw quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, without diving too deep into any specific methodology. Some of the information is more applicable to retrospective projects, where analysis is performed on data that has already been collected, but most of it is suitable for any type of research. This primer is meant to help the reader understand research results and work with a statistician, not to perform the actual analysis. Analysis is commonly performed using statistical programming software such as R, SAS or SPSS, which allow analyses to be replicated while minimizing the risk of error. Resources are listed later for those working on analysis without a statistician.

After coming up with a hypothesis for a study, including any variables to be used, one of the first steps is to think about the patient population to which the question applies. Results are only relevant to the population that the underlying data represents. Since it is impractical to include everyone with a certain condition, a subset of the population of interest should be taken. This subset should be large enough to have power, which means there is enough data to deliver significant results and accurately reflect the study’s population.

The first statistics of interest are related to significance level and power, alpha and beta. Alpha (α) is the significance level and probability of a type I error, the rejection of the null hypothesis when it is true. The null hypothesis is generally that there is no difference between the groups compared. A type I error is also known as a false positive. An example would be an analysis that finds one medication statistically better than another, when in reality there is no difference in efficacy between the two. Beta (β) is the probability of a type II error, the failure to reject the null hypothesis when it is actually false. A type II error is also known as a false negative. This occurs when the analysis finds there is no difference in two medications when in reality one works better than the other. Power is defined as 1-β and should be calculated prior to running any sort of statistical testing. Ideally, alpha should be as small as possible while power should be as large as possible. Power generally increases with a larger sample size, but so do cost and the effect of any bias in the study design. Additionally, as the sample size gets bigger, the chance of a statistically significant result goes up even though these results can be small differences that do not matter practically. Power calculators include the magnitude of the effect in order to combat the potential for exaggeration and only give significant results that have an actual impact. The calculators take inputs like the mean, effect size and desired power, and output the required minimum sample size for analysis. Effect size is calculated using statistical information on the variables of interest. If that information is not available, most tests have commonly used values for small, medium or large effect sizes.
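As a rough illustration of such a calculation, the following Python sketch uses the statsmodels library; the effect size, alpha, and power values are placeholders for the sake of the example, not recommendations, and in a real study they would come from prior data or published estimates.

```python
# Illustrative sample-size calculation for a two-group comparison.
# All input values below are placeholders chosen for the example.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n_per_group = analysis.solve_power(
    effect_size=0.5,   # assumed "medium" standardized difference (Cohen's d)
    alpha=0.05,        # significance level (probability of a type I error)
    power=0.80,        # 1 - beta (probability of detecting a true effect)
    alternative="two-sided",
)
print(f"Minimum sample size per group: {n_per_group:.0f}")
```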

When the desired patient population is decided, the next step is to define the variables previously chosen to be included. Variables come in different types that determine which statistical methods are appropriate and useful. One way variables can be split is into categorical and quantitative variables. ( Table 1 ) Categorical variables place patients into groups, such as gender, race and smoking status. Quantitative variables measure or count some quantity of interest. Common quantitative variables in research include age and weight. An important note is that there can often be a choice for whether to treat a variable as quantitative or categorical. For example, in a study looking at body mass index (BMI), BMI could be defined as a quantitative variable or as a categorical variable, with each patient’s BMI listed as a category (underweight, normal, overweight, and obese) rather than the discrete value. The decision whether a variable is quantitative or categorical will affect what conclusions can be made when interpreting results from statistical tests. Keep in mind that since quantitative variables are treated on a continuous scale it would be inappropriate to transform a variable like which medication was given into a quantitative variable with values 1, 2 and 3.

Table 1. Categorical vs. Quantitative Variables
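As a small illustration of the BMI example above, the following Python sketch (using pandas, with made-up values) treats the same measurements first as a quantitative variable and then as a categorical one; the cut points are the conventional BMI thresholds.

```python
# Treating the same measurement as quantitative or categorical (BMI example).
# The data below are invented; the cut points are the standard BMI thresholds.
import pandas as pd

bmi = pd.Series([17.8, 22.4, 27.1, 31.6, 24.9, 29.3], name="bmi")

# Quantitative: keep the discrete values as-is.
print(bmi.describe())

# Categorical: bin the same values into named groups.
bmi_cat = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["underweight", "normal", "overweight", "obese"],
    right=False,  # intervals closed on the left, e.g. a BMI of 25 counts as overweight
)
print(bmi_cat.value_counts())
```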

Both of these types of variables can also be split into response and predictor variables. ( Table 2 ) Predictor variables are explanatory, or independent, variables that help explain changes in a response variable. Conversely, response variables are outcome, or dependent, variables whose changes can be partially explained by the predictor variables.

Table 2. Response vs. Predictor Variables

Choosing the correct statistical test depends on the types of variables defined and the question being answered. The appropriate test is determined by the variables being compared. Some common statistical tests include t-tests, ANOVA and chi-square tests.

T-tests compare whether there are differences in a quantitative variable between two values of a categorical variable. For example, a t-test could be useful to compare the length of stay for knee replacement surgery patients between those that took apixaban and those that took rivaroxaban. A t-test could examine whether there is a statistically significant difference in the length of stay between the two groups. The t-test will output a p-value, a number between zero and one, which represents the probability that the two groups could be as different as they are in the data, if they were actually the same. A value closer to zero suggests that the difference, in this case for length of stay, is more statistically significant than a number closer to one. Prior to collecting the data, set a significance level, the previously defined alpha. Alpha is typically set at 0.05, but is commonly reduced in order to limit the chance of a type I error, or false positive. Going back to the example above, if alpha is set at 0.05 and the analysis gives a p-value of 0.039, then a statistically significant difference in length of stay is observed between apixaban and rivaroxaban patients. If the analysis gives a p-value of 0.91, then there was no statistical evidence of a difference in length of stay between the two medications. Other statistical summaries or methods examine how big of a difference that might be. These other summaries are known as post-hoc analysis since they are performed after the original test to provide additional context to the results.
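As a hedged sketch of how the comparison above might be run in code (Python and SciPy are one possible toolset; the lengths of stay below are invented for illustration):

```python
# Two-sample t-test on hypothetical length-of-stay data (days).
from scipy import stats

los_apixaban    = [2.1, 3.0, 2.8, 3.5, 2.4, 3.1, 2.9, 3.3]   # made-up values
los_rivaroxaban = [3.4, 2.9, 3.8, 3.6, 3.2, 4.0, 3.1, 3.7]   # made-up values

t_stat, p_value = stats.ttest_ind(los_apixaban, los_rivaroxaban)
alpha = 0.05  # significance level chosen before looking at the data
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
print("Statistically significant difference" if p_value < alpha
      else "No statistical evidence of a difference")
```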

Analysis of variance, or ANOVA, tests can observe mean differences in a quantitative variable between values of a categorical variable, typically with three or more values to distinguish from a t-test. ANOVA could add patients given dabigatran to the previous population and evaluate whether the length of stay was significantly different across the three medications. If the p-value is lower than the designated significance level then the hypothesis that length of stay was the same across the three medications is rejected. Summaries and post-hoc tests also could be performed to look at the differences between length of stay and which individual medications may have observed statistically significant differences in length of stay from the other medications. A chi-square test examines the association between two categorical variables. An example would be to consider whether the rate of having a post-operative bleed is the same across patients provided with apixaban, rivaroxaban and dabigatran. A chi-square test can compute a p-value determining whether the bleeding rates were significantly different or not. Post-hoc tests could then give the bleeding rate for each medication, as well as a breakdown as to which specific medications may have a significantly different bleeding rate from each other.
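A similarly hedged sketch of the ANOVA and chi-square examples, again with fabricated numbers purely to make the code runnable:

```python
# One-way ANOVA across three medications, plus a chi-square test of bleeding rates.
from scipy import stats

los_apixaban    = [2.1, 3.0, 2.8, 3.5, 2.4, 3.1]   # hypothetical days
los_rivaroxaban = [3.4, 2.9, 3.8, 3.6, 3.2, 4.0]
los_dabigatran  = [2.6, 3.3, 2.9, 3.0, 3.4, 2.7]

f_stat, p_anova = stats.f_oneway(los_apixaban, los_rivaroxaban, los_dabigatran)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.3f}")

# Chi-square: rows are medications, columns are counts of [bleed, no bleed].
bleed_table = [
    [4, 96],   # apixaban (hypothetical counts)
    [7, 93],   # rivaroxaban
    [5, 95],   # dabigatran
]
chi2, p_chi2, dof, expected = stats.chi2_contingency(bleed_table)
print(f"Chi-square: chi2 = {chi2:.2f}, p = {p_chi2:.3f}")
```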

A slightly more advanced way of examining a question can come through multiple regression. Regression allows more predictor variables to be analyzed and can act as a control when looking at associations between variables. Common control variables are age, sex and any comorbidities likely to affect the outcome variable that are not closely related to the other explanatory variables. Control variables can be especially important in reducing the effect of bias in a retrospective population. Since retrospective data was not built with the research question in mind, it is important to eliminate threats to the validity of the analysis. Testing that controls for confounding variables, such as regression, is often more valuable with retrospective data because it can ease these concerns. The two main types of regression are linear and logistic. Linear regression is used to predict differences in a quantitative, continuous response variable, such as length of stay. Logistic regression predicts differences in a dichotomous, categorical response variable, such as 90-day readmission. So whether the outcome variable is categorical or quantitative, regression can be appropriate. An example for each of these types could be found in two similar cases. For both examples define the predictor variables as age, gender and anticoagulant usage. In the first, use the predictor variables in a linear regression to evaluate their individual effects on length of stay, a quantitative variable. For the second, use the same predictor variables in a logistic regression to evaluate their individual effects on whether the patient had a 90-day readmission, a dichotomous categorical variable. Analysis can compute a p-value for each included predictor variable to determine whether they are significantly associated. The statistical tests in this article generate an associated test statistic which determines the probability the results could be acquired given that there is no association between the compared variables. These results often come with coefficients which can give the degree of the association and the degree to which one variable changes with another. Most tests, including all listed in this article, also have confidence intervals, which give a range for the correlation with a specified level of confidence. Even if these tests do not give statistically significant results, the results are still important. Not reporting statistically insignificant findings creates a bias in research. Ideas can be repeated enough times that eventually statistically significant results are reached, even though there is no true significance. In some cases with very large sample sizes, p-values will almost always be significant. In this case the effect size is critical as even the smallest, meaningless differences can be found to be statistically significant.
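To make the two regression examples more concrete, here is a hedged Python sketch using the statsmodels library (one of several tools that could be used). The DataFrame is synthetic and the column names (age, gender, anticoagulant, los, readmit_90d) are assumptions chosen to mirror the prose, not a real dataset.

```python
# Linear and logistic regression on a synthetic dataset.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(40, 90, n),
    "gender": rng.choice(["F", "M"], n),
    "anticoagulant": rng.choice(["apixaban", "rivaroxaban", "dabigatran"], n),
    "los": rng.normal(3.2, 0.8, n),         # length of stay in days
    "readmit_90d": rng.integers(0, 2, n),   # 1 = readmitted within 90 days
})

# Linear regression: quantitative response (length of stay).
linear = smf.ols("los ~ age + C(gender) + C(anticoagulant)", data=df).fit()
print(linear.summary().tables[1])   # coefficients, p-values, confidence intervals

# Logistic regression: dichotomous response (90-day readmission).
logistic = smf.logit("readmit_90d ~ age + C(gender) + C(anticoagulant)", data=df).fit()
print(logistic.summary().tables[1])
```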

These variables and tests are just some things to keep in mind before, during and after the analysis process in order to make sure that the statistical reports are supporting the questions being answered. The patient population, types of variables and statistical tests are all important things to consider in the process of statistical analysis. Any results are only as useful as the process used to obtain them. This primer can be used as a reference to help ensure appropriate statistical analysis.

Funding Statement

This research was supported (in whole or in part) by HCA Healthcare and/or an HCA Healthcare affiliated entity.

Conflicts of Interest

The author declares he has no conflicts of interest.

Christian Vandever is an employee of HCA Healthcare Graduate Medical Education, an organization affiliated with the journal’s publisher.

The views expressed in this publication represent those of the author(s) and do not necessarily represent the official views of HCA Healthcare or any of its affiliated entities.

Data analysis in qualitative research

Sally Thorne, RN, PhD
School of Nursing, University of British Columbia, Vancouver, British Columbia, Canada

https://doi.org/10.1136/ebn.3.3.68


Unquestionably, data analysis is the most complex and mysterious of all of the phases of a qualitative project, and the one that receives the least thoughtful discussion in the literature. For neophyte nurse researchers, many of the data collection strategies involved in a qualitative project may feel familiar and comfortable. After all, nurses have always based their clinical practice on learning as much as possible about the people they work with, and detecting commonalities and variations among and between them in order to provide individualised care. However, creating a database is not sufficient to conduct a qualitative study. In order to generate findings that transform raw data into new knowledge, a qualitative researcher must engage in active and demanding analytic processes throughout all phases of the research. Understanding these processes is therefore an important aspect not only of doing qualitative research, but also of reading, understanding, and interpreting it.

For readers of qualitative studies, the language of analysis can be confusing. It is sometimes difficult to know what the researchers actually did during this phase and to understand how their findings evolved out of the data that were collected or constructed. Furthermore, in describing their processes, some authors use language that accentuates this sense of mystery and magic. For example, they may claim that their conceptual categories “emerged” from the data 1 —almost as if they left the raw data out overnight and awoke to find that the data analysis fairies had organised the data into a coherent new structure that explained everything! In this EBN notebook, I will try to help readers make sense of some of the assertions that are made about qualitative data analysis so that they can develop a critical eye for when an analytical claim is convincing and when it is not.

Qualitative data

Qualitative data come in various forms. In many qualitative nursing studies, the database consists of interview transcripts from open ended, focused, but exploratory interviews. However, there is no limit to what might possibly constitute a qualitative database, and increasingly we are seeing more and more creative use of such sources as recorded observations (both video and participatory), focus groups, texts and documents, multi-media or public domain sources, policy manuals, photographs, and lay autobiographical accounts.

Qualitative analytic reasoning processes

What makes a study qualitative is that it usually relies on inductive reasoning processes to interpret and structure the meanings that can be derived from data. Distinguishing inductive from deductive inquiry processes is an important step in identifying what counts as qualitative research. Generally, inductive reasoning uses the data to generate ideas (hypothesis generating), whereas deductive reasoning begins with the idea and uses the data to confirm or negate the idea (hypothesis testing). 2 In actual practice, however, many quantitative studies involve much inductive reasoning, whereas good qualitative analysis often requires access to a full range of strategies. 3 A traditional quantitative study in the health sciences typically begins with a theoretical grounding, takes direction from hypotheses or explicit study questions, and uses a predetermined (and auditable) set of steps to confirm or refute the hypothesis. It does this to add evidence to the development of specific, causal, and theoretical explanations of phenomena. 3 In contrast, qualitative research often takes the position that an interpretive understanding is only possible by way of uncovering or deconstructing the meanings of a phenomenon. Thus, a distinction between explaining how something operates (explanation) and why it operates in the manner that it does (interpretation) may be a more effective way to distinguish quantitative from qualitative analytic processes involved in any particular study.

Because data collection and analysis processes tend to be concurrent, with new analytic steps informing the process of additional data collection and new data informing the analytic processes, it is important to recognise that qualitative data analysis processes are not entirely distinguishable from the actual data. The theoretical lens from which the researcher approaches the phenomenon, the strategies that the researcher uses to collect or construct data, and the understandings that the researcher has about what might count as relevant or important data in answering the research question are all analytic processes that influence the data. Analysis also occurs as an explicit step in conceptually interpreting the data set as a whole, using specific analytic strategies to transform the raw data into a new and coherent depiction of the thing being studied. Although there are many qualitative data analysis computer programs available on the market today, these are essentially aids to sorting and organising sets of qualitative data, and none are capable of the intellectual and conceptualising processes required to transform data into meaningful findings.

Specific analytic strategies

Although a description of the actual procedural details and nuances of every qualitative data analysis strategy is well beyond the scope of a short paper, a general appreciation of the theoretical assumptions underlying some of the more common approaches can be helpful in understanding what a researcher is trying to say about how data were sorted, organised, conceptualised, refined, and interpreted.

CONSTANT COMPARATIVE ANALYSIS

Many qualitative analytic strategies rely on a general approach called “constant comparative analysis”. Originally developed for use in the grounded theory methodology of Glaser and Strauss, 4 which itself evolved out of the sociological theory of symbolic interactionism, this strategy involves taking one piece of data (one interview, one statement, one theme) and comparing it with all others that may be similar or different in order to develop conceptualisations of the possible relations between various pieces of data. For example, by comparing the accounts of 2 different people who had a similar experience, a researcher might pose analytical questions like: why is this different from that? and how are these 2 related? In many qualitative studies whose purpose it is to generate knowledge about common patterns and themes within human experience, this process continues with the comparison of each new interview or account until all have been compared with each other. A good example of this process is reported in a grounded theory study of how adults with brain injury cope with the social attitudes they face (see Evidence-Based Nursing , April 1999, p64).

Constant comparison analysis is well suited to grounded theory because this design is specifically used to study those human phenomena for which the researcher assumes that fundamental social processes explain something of human behaviour and experience, such as stages of grieving or processes of recovery. However, many other methodologies draw from this analytical strategy to create knowledge that is more generally descriptive or interpretive, such as coping with cancer, or living with illness. Naturalistic inquiry, thematic analysis, and interpretive description are methods that depend on constant comparative analysis processes to develop ways of understanding human phenomena within the context in which they are experienced.

PHENOMENOLOGICAL APPROACHES

Constant comparative analysis is not the only approach in qualitative research. Some qualitative methods are not oriented toward finding patterns and commonalities within human experience, but instead seek to discover some of the underlying structure or essence of that experience through the intensive study of individual cases. For example, rather than explain the stages and transitions within grieving that are common to people in various circumstances, a phenomenological study might attempt to uncover and describe the essential nature of grieving and represent it in such a manner that a person who had not grieved might begin to appreciate the phenomenon. The analytic methods that would be employed in these studies explicitly avoid cross comparisons and instead orient the researcher toward the depth and detail that can be appreciated only through an exhaustive, systematic, and reflective study of experiences as they are lived.

Although constant comparative methods might well permit the analyst to use some pre-existing or emergent theory against which to test all new pieces of data that are collected, these more phenomenological approaches typically challenge the researcher to set aside or “bracket” all such preconceptions so that they can work inductively with the data to generate entirely new descriptions and conceptualisations. There are numerous forms of phenomenological research; however, many of the most popular approaches used by nurses derive from the philosophical work of Husserl on modes of awareness (epistemology) and the hermeneutic tradition of Heidegger, which emphasises modes of being (ontology). 5 These approaches differ from one another in the degree to which interpretation is acceptable, but both represent strategies for immersing oneself in data, engaging with data reflectively, and generating a rich description that will enlighten a reader as to the deeper essential structures underlying a particular human experience. Examples of the kinds of human experience that are amenable to this type of inquiry are the suffering experienced by individuals who have a drinking problem (see Evidence-Based Nursing , October 1998, p134) and the emotional experiences of parents of terminally ill adolescents (see Evidence-Based Nursing , October 1999, p132). Sometimes authors explain their approaches not by the phenomenological position they have adopted, but by naming the theorist whose specific techniques they are borrowing. Colaizzi and Giorgi are phenomenologists who have rendered the phenomenological attitude into a set of manageable steps and processes for working with such data and have therefore become popular reference sources among phenomenological nurse researchers.

ETHNOGRAPHIC METHODS

Ethnographic research methods derive from anthropology's tradition of interpreting the processes and products of cultural behaviour. Ethnographers documented such aspects of human experience as beliefs, kinship patterns and ways of living. In the healthcare field, nurses and others have used ethnographic methods to uncover and record variations in how different social and cultural groups understand and enact health and illness. An example of this kind of study is an investigation of how older adults adjust to living in a nursing home environment (see Evidence-Based Nursing , October 1999, p136). When a researcher claims to have used ethnographic methods, we can assume that he or she has come to know a culture or group through immersion and engagement in fieldwork or participant observation and has also undertaken to portray that culture through text. 6 Ethnographic analysis uses an iterative process in which cultural ideas that arise during active involvement “in the field” are transformed, translated, or represented in a written document. It involves sifting and sorting through pieces of data to detect and interpret thematic categorisations, search for inconsistencies and contradictions, and generate conclusions about what is happening and why.

NARRATIVE ANALYSIS AND DISCOURSE ANALYSIS

Many qualitative nurse researchers have discovered the extent to which human experience is shaped, transformed, and understood through linguistic representation. The vague and subjective sensations that characterise cognitively unstructured life experiences take on meaning and order when we try to articulate them in communication. Putting experience into words, whether we do this verbally, in writing, or in thought, transforms the actual experience into a communicable representation of it. Thus, speech forms are not the experiences themselves, but a socially and culturally constructed device for creating shared understandings about them. Narrative analysis is a strategy that recognises the extent to which the stories we tell provide insights about our lived experiences. 7 For example, it was used as a strategy to learn more about the experiences of women who discover that they have a breast lump (see Evidence-Based Nursing , July 1999, p93). Through analytic processes that help us detect the main narrative themes within the accounts people give about their lives, we discover how they understand and make sense of their lives.

By contrast, discourse analysis recognises speech not as a direct representation of human experience, but as an explicit linguistic tool constructed and shaped by numerous social or ideological influences. Discourse analysis strategies draw heavily upon theories developed in such fields as sociolinguistics and cognitive psychology to try to understand what is represented by the various ways in which people communicate ideas. They capitalise on critical inquiry into the language that is used and the way that it is used to uncover the societal influences underlying our behaviours and thoughts. 8 Thus, although discourse analysis and narrative analysis both rely heavily on speech as the most relevant data form, their reasons for analysing speech differ. The table below illustrates the distinctions among the analytic strategies described above, using breast cancer research as an example.


Table. General distinctions between selected qualitative research approaches: an illustration using breast cancer research

Cognitive processes inherent in qualitative analysis

The term “qualitative research” encompasses a wide range of philosophical positions, methodological strategies, and analytical procedures. Morse 1 has summarised the cognitive processes involved in qualitative research in a way that can help us to better understand how the researcher's cognitive processes interact with qualitative data to bring about findings and generate new knowledge. Morse believes that all qualitative analysis, regardless of the specific approach, involves:

  • comprehending the phenomenon under study
  • synthesising a portrait of the phenomenon that accounts for relations and linkages within its aspects
  • theorising about how and why these relations appear as they do, and
  • recontextualising, or putting the new knowledge about phenomena and relations back into the context of how others have articulated the evolving knowledge.

Although the form that each of these steps will take may vary according to such factors as the research question, the researcher's orientation to the inquiry, or the setting and context of the study, this set of steps helps to depict a series of intellectual processes by which data in their raw form are considered, examined, and reformulated to become a research product.

Quality measures in qualitative analysis

It used to be a tradition among qualitative nurse researchers to claim that such issues as reliability and validity were irrelevant to the qualitative enterprise. Instead, they might say that the proof of the quality of the work rested entirely on the reader's acceptance or rejection of the claims that were made. If the findings “rang true” to the intended audience, then the qualitative study was considered successful. More recently, nurse researchers have taken a lead among their colleagues in other disciplines in trying to work out more formally how the quality of a piece of qualitative research might be judged. Many of these researchers have concluded that systematic, rigorous, and auditable analytical processes are among the most significant factors distinguishing good from poor quality research. 9 Researchers are therefore encouraged to articulate their findings in such a manner that the logical processes by which they were developed are accessible to a critical reader, the relation between the actual data and the conclusions about data is explicit, and the claims made in relation to the data set are rendered credible and believable. Through this short description of analytical approaches, readers will be in a better position to critically evaluate individual qualitative studies, and decide whether and when to apply the findings of such studies to their nursing practice.

1. Morse JM. “Emerging from the data”: the cognitive processes of analysis in qualitative inquiry. In: Morse JM, editor. Critical issues in qualitative research methods. Thousand Oaks, CA: Sage, 1994:23–43.
2. Holloway I. Basic concepts for qualitative research. Oxford: Blackwell Science, 1997.
3. Schwandt TA. Qualitative inquiry: a dictionary of terms. Thousand Oaks, CA: Sage, 1997.
4. Glaser BG, Strauss AL. The discovery of grounded theory. Hawthorne, NY: Aldine, 1967.
5. Ray MA. The richness of phenomenology: philosophic, theoretic, and methodologic concerns. In: Morse JM, editor. Critical issues in qualitative research methods. Thousand Oaks, CA: Sage, 1994:117–33.
6. Boyle JS. Styles of ethnography. In: Morse JM, editor. Critical issues in qualitative research methods. Thousand Oaks, CA: Sage, 1994:159–85.
7. Sandelowski M. We are the stories we tell: narrative knowing in nursing practice. J Holist Nurs 1994;12:23–33.
8. Boutain DM. Critical language and discourse study: their transformative relevance for critical nursing inquiry. ANS Adv Nurs Sci 1999;21:1–8.
9. Thorne S. The art (and science) of critiquing qualitative research. In: Morse JM, editor. Completing a qualitative project: details and dialogue. Thousand Oaks, CA: Sage, 1997:117–32.



14 Quantitative analysis: Descriptive statistics

Numeric data collected in a research project can be analysed quantitatively using statistical tools in two different ways. Descriptive analysis refers to statistically describing, aggregating, and presenting the constructs of interest or associations between these constructs. Inferential analysis refers to the statistical testing of hypotheses (theory testing). In this chapter, we will examine statistical techniques used for descriptive analysis, and the next chapter will examine statistical techniques for inferential analysis. Much of today’s quantitative data analysis is conducted using software programs such as SPSS or SAS. Readers are advised to familiarise themselves with one of these programs for understanding the concepts described in this chapter.

Data preparation

In research projects, data may be collected from a variety of sources: postal surveys, interviews, pretest or posttest experimental data, observational data, and so forth. These data must be converted into a machine-readable, numeric format, such as a spreadsheet or a text file, so that they can be analysed by computer programs like SPSS or SAS. Data preparation usually involves the following steps:

Data coding. Coding is the process of converting data into numeric format. A codebook should be created to guide the coding process. A codebook is a comprehensive document containing a detailed description of each variable in a research study, items or measures for that variable, the format of each item (numeric, text, etc.), the response scale for each item (i.e., whether it is measured on a nominal, ordinal, interval, or ratio scale, and whether this scale is a five-point, seven-point scale, etc.), and how to code each value into a numeric format. For instance, if we have a measurement item on a seven-point Likert scale with anchors ranging from ‘strongly disagree’ to ‘strongly agree’, we may code that item as 1 for strongly disagree, 4 for neutral, and 7 for strongly agree, with the intermediate anchors in between. Nominal data such as industry type can be coded in numeric form using a coding scheme such as: 1 for manufacturing, 2 for retailing, 3 for financial, 4 for healthcare, and so forth (of course, nominal data cannot be analysed statistically). Ratio scale data such as age, income, or test scores can be coded as entered by the respondent. Sometimes, data may need to be aggregated into a different form than the format used for data collection. For instance, if a survey measuring a construct such as ‘benefits of computers’ provided respondents with a checklist of benefits that they could select from, and respondents were encouraged to choose as many of those benefits as they wanted, then the total number of checked items could be used as an aggregate measure of benefits. Note that many other forms of data—such as interview transcripts—cannot be converted into a numeric format for statistical analysis. Codebooks are especially important for large complex studies involving many variables and measurement items, where the coding process is conducted by different people, to help the coding team code data in a consistent manner, and also to help others understand and interpret the coded data.
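As a minimal sketch of how a codebook translates into practice, the following Python snippet maps text responses onto their numeric codes using pandas; the anchor labels, industry categories, and variable names are illustrative assumptions rather than entries from an actual codebook.

```python
# Applying a codebook: map text responses to numeric codes.
import pandas as pd

likert_7 = {
    "strongly disagree": 1, "disagree": 2, "somewhat disagree": 3,
    "neutral": 4, "somewhat agree": 5, "agree": 6, "strongly agree": 7,
}
industry_codes = {"manufacturing": 1, "retailing": 2, "financial": 3, "healthcare": 4}

raw = pd.DataFrame({
    "computers_useful": ["agree", "strongly agree", "neutral"],
    "industry": ["retailing", "healthcare", "manufacturing"],
    "age": [34, 51, 28],   # ratio-scale data is entered as reported
})

coded = raw.assign(
    computers_useful=raw["computers_useful"].map(likert_7),
    industry=raw["industry"].map(industry_codes),
)
print(coded)
```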

Data entry. Coded data can be entered into a spreadsheet, database, text file, or directly into a statistical program like SPSS. Most statistical programs provide a data editor for entering data. However, these programs store data in their own native format—e.g., SPSS stores data as .sav files—which makes it difficult to share that data with other statistical programs. Hence, it is often better to enter data into a spreadsheet or database where it can be reorganised as needed, shared across programs, and subsets of data can be extracted for analysis. Smaller data sets with less than 65,000 observations and 256 items can be stored in a spreadsheet created using a program such as Microsoft Excel, while larger datasets with millions of observations will require a database. Each observation can be entered as one row in the spreadsheet, and each measurement item can be represented as one column. Data should be checked for accuracy during and after entry via occasional spot checks on a set of items or observations. Furthermore, while entering data, the coder should watch out for obvious evidence of bad data, such as the respondent selecting the ‘strongly agree’ response to all items irrespective of content, including reverse-coded items. If so, such data can be entered but should be excluded from subsequent analysis.


Data transformation. Sometimes, it is necessary to transform data values before they can be meaningfully interpreted. For instance, reverse coded items—where items convey the opposite meaning of that of their underlying construct—should be reversed (e.g., in a 1-7 interval scale, 8 minus the observed value will reverse the value) before they can be compared or combined with items that are not reverse coded. Other kinds of transformations may include creating scale measures by adding individual scale items, creating a weighted index from a set of observed measures, and collapsing multiple values into fewer categories (e.g., collapsing incomes into income ranges).
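A short sketch of these two transformations in Python, assuming pandas; the item values and the income cut points are made up for illustration.

```python
# Reverse-coding a 1-7 item and collapsing income into ranges.
import pandas as pd

df = pd.DataFrame({
    "item_reverse_coded": [1, 7, 3, 5, 2],   # 1-7 scale, negatively worded item
    "income": [28000, 54000, 91000, 140000, 67000],
})

# Reverse a 1-7 interval item: 8 minus the observed value.
df["item_recoded"] = 8 - df["item_reverse_coded"]

# Collapse income into categories (illustrative cut points).
df["income_range"] = pd.cut(
    df["income"],
    bins=[0, 40000, 80000, 120000, float("inf")],
    labels=["low", "middle", "upper-middle", "high"],
)
print(df)
```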

Univariate analysis

Univariate analysis—or analysis of a single variable—refers to a set of statistical techniques that can describe the general properties of one variable. Univariate statistics include: frequency distribution, central tendency, and dispersion. The frequency distribution of a variable is a summary of the frequency—or percentages—of individual values or ranges of values for that variable. For instance, we can measure how many times a sample of respondents attend religious services—as a gauge of their ‘religiosity’—using a categorical scale: never, once per year, several times per year, about once a month, several times per month, several times per week, and an optional category for ‘did not answer’. If we count the number or percentage of observations within each category—except ‘did not answer’ which is really a missing value rather than a category—and display it in the form of a table, as shown in Figure 14.1, what we have is a frequency distribution. This distribution can also be depicted in the form of a bar chart, as shown on the right panel of Figure 14.1, with the horizontal axis representing each category of that variable and the vertical axis representing the frequency or percentage of observations within each category.

Figure 14.1. Frequency distribution of religiosity
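A frequency distribution like the one in Figure 14.1 can be produced with a few lines of code. The sketch below assumes pandas and uses invented responses; 'did not answer' is treated as missing rather than as a category, as described above.

```python
# Frequency distribution (counts and percentages) for a categorical variable.
import pandas as pd

religiosity = pd.Series([
    "never", "once per year", "several times per year", "about once a month",
    "several times per month", "never", "several times per week",
    "once per year", "did not answer", "about once a month",
])

valid = religiosity.replace("did not answer", pd.NA).dropna()
counts = valid.value_counts()
percentages = valid.value_counts(normalize=True).mul(100).round(1)

print(pd.DataFrame({"count": counts, "percent": percentages}))
# counts.plot(kind="bar")  # bar chart version of the same distribution
```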

With very large samples, where observations are independent and random, the frequency distribution tends to follow a plot that looks like a bell-shaped curve—a smoothed bar chart of the frequency distribution—similar to that shown in Figure 14.2. Here most observations are clustered toward the centre of the range of values, with fewer and fewer observations clustered toward the extreme ends of the range. Such a curve is called a normal distribution .

Central tendency is an estimate of the centre of a distribution of values, and the most common measures are the mean, the median, and the mode. Consider a set of eight test scores: 15, 20, 21, 20, 36, 15, 25, and 15. The mean (arithmetic average) of these scores is (15 + 20 + 21 + 20 + 36 + 15 + 25 + 15)/8 = 20.875. The median is the middle value when the scores are sorted; the sorted scores are 15, 15, 15, 20, 20, 21, 25, and 36, so the median is 20.

Lastly, the mode is the most frequently occurring value in a distribution of values. In the previous example, the most frequently occurring value is 15, which is the mode of the above set of test scores. Note that any value that is estimated from a sample, such as mean, median, mode, or any of the later estimates are called a statistic .

Dispersion refers to how spread out the values are around the central tendency. The simplest measure of dispersion is the range, the difference between the highest and lowest values; for the test scores above, the range is 36 − 15 = 21.
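These univariate summaries are easy to verify in code; the snippet below uses Python's standard statistics module on the same eight test scores.

```python
# Central tendency and dispersion for the test scores discussed in the text.
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # mean: 20.875
print(statistics.median(scores))  # median: 20.0
print(statistics.mode(scores))    # mode: 15
print(max(scores) - min(scores))  # range: 36 - 15 = 21
```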

Bivariate analysis

Bivariate analysis examines how two variables are related to one another. The most common bivariate statistic is the bivariate correlation—often simply called ‘correlation’—which is a number between -1 and +1 denoting the strength of the relationship between two variables. Say that we wish to study how age is related to self-esteem in a sample of 20 respondents—i.e., as age increases, does self-esteem increase, decrease, or remain unchanged? If self-esteem increases, then we have a positive correlation between the two variables; if self-esteem decreases, we have a negative correlation; and if it remains the same, we have a zero correlation. To calculate the value of this correlation, consider the hypothetical dataset shown in Table 14.1.


After computing bivariate correlation, researchers are often interested in knowing whether the correlation is significant (i.e., a real one) or caused by mere chance. Answering such a question would require testing the following hypothesis:

\[H_0:\quad r = 0 \]
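As a sketch of how the correlation and this significance test might be computed (Table 14.1 is not reproduced here, so the 20 observations below are fabricated purely to make the example runnable):

```python
# Pearson correlation between age and self-esteem, with a test of H0: r = 0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
age = rng.integers(18, 65, 20)
self_esteem = 3.5 + 0.02 * age + rng.normal(0, 0.4, 20)  # weak positive trend

r, p_value = stats.pearsonr(age, self_esteem)
print(f"r = {r:.2f}, p = {p_value:.3f}")
# A small p-value would lead us to reject H0 and call the correlation significant.
```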

Social Science Research: Principles, Methods and Practices (Revised edition) Copyright © 2019 by Anol Bhattacherjee is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.


Your Modern Business Guide To Data Analysis Methods And Techniques


Table of Contents

1) What Is Data Analysis?

2) Why Is Data Analysis Important?

3) What Is The Data Analysis Process?

4) Types Of Data Analysis Methods

5) Top Data Analysis Techniques To Apply

6) Quality Criteria For Data Analysis

7) Data Analysis Limitations & Barriers

8) Data Analysis Skills

9) Data Analysis In The Big Data Environment

In our data-rich age, understanding how to analyze and extract true meaning from our business’s digital insights is one of the primary drivers of success.

Despite the colossal volume of data we create every day, a mere 0.5% is actually analyzed and used for data discovery , improvement, and intelligence. While that may not seem like much, considering the amount of digital information we have at our fingertips, half a percent still accounts for a vast amount of data.

With so much data and so little time, knowing how to collect, curate, organize, and make sense of all of this potentially business-boosting information can be a minefield – but online data analysis is the solution.

In science, data analysis uses a more complex approach with advanced techniques to explore and experiment with data. On the other hand, in a business context, data is used to make data-driven decisions that will enable the company to improve its overall performance. In this post, we will cover the analysis of data from an organizational point of view while still going through the scientific and statistical foundations that are fundamental to understanding the basics of data analysis. 

To put all of that into perspective, we will answer a host of important analytical questions, explore analytical methods and techniques, while demonstrating how to perform analysis in the real world with a 17-step blueprint for success.

What Is Data Analysis?

Data analysis is the process of collecting, modeling, and analyzing data using various statistical and logical methods and techniques. Businesses rely on analytics processes and tools to extract insights that support strategic and operational decision-making.

All these various methods are largely based on two core areas: quantitative and qualitative research.


Gaining a better understanding of different techniques and methods in quantitative research as well as qualitative insights will give your analyzing efforts a more clearly defined direction, so it’s worth taking the time to allow this particular knowledge to sink in. Additionally, you will be able to create a comprehensive analytical report that will skyrocket your analysis.

Apart from qualitative and quantitative categories, there are also other types of data that you should be aware of before diving into complex data analysis processes. These categories include:

  • Big data: Refers to massive data sets that need to be analyzed using advanced software to reveal patterns and trends. It is considered to be one of the best analytical assets as it provides larger volumes of data at a faster rate. 
  • Metadata: Putting it simply, metadata is data that provides insights about other data. It summarizes key information about specific data that makes it easier to find and reuse for later purposes. 
  • Real time data: As its name suggests, real time data is presented as soon as it is acquired. From an organizational perspective, this is the most valuable data as it can help you make important decisions based on the latest developments. Our guide on real time analytics will tell you more about the topic. 
  • Machine data: This is more complex data that is generated solely by machines, such as phones, computers, websites, and embedded systems, without prior human interaction.

Why Is Data Analysis Important?

Before we go into detail about the categories of analysis along with its methods and techniques, you must understand the potential that analyzing data can bring to your organization.

  • Informed decision-making : From a management perspective, you can benefit from analyzing your data as it helps you make decisions based on facts and not simple intuition. For instance, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems. Through this, you can extract relevant insights from all areas in your organization, and with the help of dashboard software , present the data in a professional and interactive way to different stakeholders.
  • Reduce costs : Another great benefit is to reduce costs. With the help of advanced technologies such as predictive analytics, businesses can spot improvement opportunities, trends, and patterns in their data and plan their strategies accordingly. In time, this will help you save money and resources on implementing the wrong strategies. And not just that, by predicting different scenarios such as sales and demand you can also anticipate production and supply. 
  • Target customers better : Customers are arguably the most crucial element in any business. By using analytics to get a 360° vision of all aspects related to your customers, you can understand which channels they use to communicate with you, their demographics, interests, habits, purchasing behaviors, and more. In the long run, it will drive success to your marketing strategies, allow you to identify new potential customers, and avoid wasting resources on targeting the wrong people or sending the wrong message. You can also track customer satisfaction by analyzing your client’s reviews or your customer service department’s performance.

What Is The Data Analysis Process?

Data analysis process graphic

When we talk about analyzing data there is an order to follow in order to extract the needed conclusions. The analysis process consists of 5 key stages. We will cover each of them more in detail later in the post, but to start providing the needed context to understand what is coming next, here is a rundown of the 5 essential steps of data analysis. 

  • Identify: Before you get your hands dirty with data, you first need to identify why you need it in the first place. The identification is the stage in which you establish the questions you will need to answer. For example, what is the customer's perception of our brand? Or what type of packaging is more engaging to our potential customers? Once the questions are outlined you are ready for the next step. 
  • Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you define which sources of data you will use and how you will use them. The collection of data can come in different forms such as internal or external sources, surveys, interviews, questionnaires, and focus groups, among others.  An important note here is that the way you collect the data will be different in a quantitative and qualitative scenario. 
  • Clean: Once you have the necessary data, it is time to clean it and leave it ready for analysis. Not all the data you collect will be useful; when collecting large amounts of data in different formats, it is very likely that you will find yourself with duplicate or badly formatted records. To avoid this, before you start working with your data you need to make sure to remove any stray white space, duplicate records, or formatting errors. This way you avoid hurting your analysis with bad-quality data (a minimal cleaning sketch in code follows this list).
  • Analyze : With the help of various techniques such as statistical analysis, regressions, neural networks, text analysis, and more, you can start analyzing and manipulating your data to extract relevant conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you answer the questions you first thought of in the identify stage. Various technologies in the market assist researchers and average users with the management of their data. Some of them include business intelligence and visualization software, predictive analytics, and data mining, among others. 
  • Interpret: Last but not least you have one of the most important steps: it is time to interpret your results. This stage is where the researcher comes up with courses of action based on the findings. For example, here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc. Additionally, at this stage, you can also find some limitations and work on them. 
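To make the cleaning stage a bit more concrete, here is a minimal Python sketch using pandas; the file name and the brand_perception column are made-up placeholders for whatever your own survey export contains.

```python
# Basic cleaning pass over a hypothetical survey export.
import pandas as pd

df = pd.read_csv("survey_responses.csv")          # hypothetical file name

# Trim stray white space in text columns.
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())

# Drop exact duplicate records and rows missing the key question.
df = df.drop_duplicates()
df = df.dropna(subset=["brand_perception"])       # assumed column name

print(df.shape)
```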

Now that you have a basic understanding of the key data analysis steps, let’s look at the top 17 essential methods.

17 Essential Types Of Data Analysis Methods

Before diving into the 17 essential types of methods, it is important that we quickly go over the main analysis categories. Moving from descriptive up to prescriptive analysis, the complexity and effort of data evaluation increase, but so does the added value for the company.

a) Descriptive analysis - What happened.

The descriptive analysis method is the starting point for any analytic reflection, and it aims to answer the question of what happened? It does this by ordering, manipulating, and interpreting raw data from various sources to turn it into valuable insights for your organization.

Performing descriptive analysis is essential, as it enables us to present our insights in a meaningful way. Although it is relevant to mention that this analysis on its own will not allow you to predict future outcomes or tell you the answer to questions like why something happened, it will leave your data organized and ready to conduct further investigations.

b) Exploratory analysis - How to explore data relationships.

As its name suggests, the main aim of exploratory analysis is to explore. Prior to it, there is still no notion of the relationship between the data and the variables. Once the data is investigated, exploratory analysis helps you to find connections and generate hypotheses and solutions for specific problems. A typical area of application for it is data mining.

c) Diagnostic analysis - Why it happened.

Diagnostic data analytics empowers analysts and executives by helping them gain a firm contextual understanding of why something happened. If you know why something happened as well as how it happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.

Designed to provide direct and actionable answers to specific questions, diagnostic analysis is one of the most important methods in research, and it also serves key organizational functions in areas such as retail analytics.

d) Predictive analysis - What will happen.

The predictive method allows you to look into the future to answer the question: what will happen? In order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic analysis, in addition to machine learning (ML) and artificial intelligence (AI). Through this, you can uncover future trends, potential problems or inefficiencies, connections, and causalities in your data.

With predictive analysis, you can unfold and develop initiatives that will not only enhance your various operational processes but also help you gain an all-important edge over the competition. If you understand why a trend, pattern, or event happened through data, you will be able to develop an informed projection of how things may unfold in particular areas of the business.

e) Prescriptive analysis - How will it happen.

Prescriptive analysis is another of the most effective types of analysis methods in research. Prescriptive data techniques cross over from predictive analysis in that they revolve around using patterns or trends to develop responsive, practical business strategies.

By drilling down into prescriptive analysis, you will play an active role in the data consumption process by taking well-arranged sets of visual data and using it as a powerful fix to emerging issues in a number of key areas, including marketing, sales, customer experience, HR, fulfillment, finance, logistics analytics , and others.

Top 17 data analysis methods

As mentioned at the beginning of the post, data analysis methods can be divided into two big categories: quantitative and qualitative. Each of these categories holds a powerful analytical value that changes depending on the scenario and type of data you are working with. Below, we will discuss 17 methods that are divided into qualitative and quantitative approaches. 

Without further ado, here are the 17 essential types of data analysis methods with some use cases in the business world: 

A. Quantitative Methods 

To put it simply, quantitative analysis refers to all methods that use numerical data, or data that can be turned into numbers (e.g. categorical variables like gender, age group, etc.), to extract valuable insights. It is used to test hypotheses and to draw conclusions about relationships and differences. Below we discuss some of the key quantitative methods.

1. Cluster analysis

The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’ Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.

Let's look at it from an organizational perspective. In a perfect world, marketers would be able to analyze each customer separately and give them the best personalized service, but let's face it, with a large customer base, it is practically impossible to do that. That's where clustering comes in. By grouping customers into clusters based on demographics, purchasing behaviors, monetary value, or any other factor that might be relevant for your company, you will be able to immediately optimize your efforts and give your customers the best experience based on their needs.
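To make this more tangible, here is a hedged Python sketch of customer segmentation with k-means using scikit-learn; the features, the synthetic values, and the choice of three clusters are illustrative assumptions, not a recipe.

```python
# Segmenting synthetic customers into three clusters with k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
customers = np.column_stack([
    rng.integers(18, 75, 200),        # age
    rng.gamma(2.0, 150.0, 200),       # yearly spend (monetary value)
    rng.poisson(6, 200),              # number of purchases
])

scaled = StandardScaler().fit_transform(customers)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scaled)

for cluster in range(3):
    print(f"Cluster {cluster}: {np.sum(labels == cluster)} customers")
```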

2. Cohort analysis

This type of data analysis approach uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics. By using this methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.

Cohort analysis can be really useful for performing analysis in marketing, as it allows you to understand the impact of your campaigns on specific groups of customers. For example, imagine you send an email campaign encouraging customers to sign up for your site. You create two versions of the campaign with different designs, CTAs, and ad content. Later on, you can use cohort analysis to track the performance of the campaign over a longer period of time and understand which type of content is driving your customers to sign up, repurchase, or engage in other ways.  

A useful tool for getting started with cohort analysis is Google Analytics. You can learn more about the benefits and limitations of using cohorts in GA in this useful guide. In the image below, you can see an example of how a cohort is visualized in this tool. The segments (devices traffic) are divided into date cohorts (usage of devices) and then analyzed week by week to extract insights into performance.

Cohort analysis chart example from google analytics
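If you prefer to build cohorts yourself rather than relying on GA, the sketch below shows one common way to do it with pandas. The orders table and its column names are invented for illustration; cohorts are defined by the month of each customer's first purchase:

```python
# Minimal cohort retention table with pandas; the orders DataFrame and its
# column names (customer_id, order_date) are illustrative assumptions.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "order_date": pd.to_datetime([
        "2023-01-05", "2023-02-10", "2023-01-20", "2023-03-02",
        "2023-04-15", "2023-02-07", "2023-03-11", "2023-04-01",
    ]),
})

orders["order_month"] = orders["order_date"].dt.to_period("M")
# Each customer's cohort is the month of their first purchase.
orders["cohort"] = orders.groupby("customer_id")["order_date"].transform("min").dt.to_period("M")
orders["period"] = (orders["order_month"] - orders["cohort"]).apply(lambda d: d.n)

# Count unique active customers per cohort and per month since first purchase.
retention = (orders.groupby(["cohort", "period"])["customer_id"]
                   .nunique()
                   .unstack(fill_value=0))
print(retention)
```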

3. Regression analysis

Regression uses historical data to understand how a dependent variable's value is affected when one (simple linear regression) or more (multiple regression) independent variables change or stay the same. By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better decisions in the future.

Let's break it down with an example. Imagine you did a regression analysis of your sales in 2019 and discovered that variables like product quality, store design, customer service, marketing campaigns, and sales channels affected the overall result. Now you want to use regression to analyze which of these variables changed or if any new ones appeared during 2020. For example, you couldn't sell as much in your physical store due to COVID lockdowns. Therefore, your sales could've either dropped in general or increased in your online channels. Through this, you can understand which independent variables affected the overall performance of your dependent variable, annual sales.
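As a rough illustration of how such an analysis could be set up in code, here is a small multiple regression sketch with scikit-learn. The predictor variables, sales figures, and units are entirely made up:

```python
# Illustrative multiple regression of annual sales on a few drivers.
# The dataset and resulting coefficients are fabricated for demonstration.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [marketing_spend, store_traffic, avg_customer_rating]
X = np.array([
    [10, 1200, 4.1], [15, 1500, 4.3], [12, 1300, 3.9],
    [20, 1800, 4.6], [18, 1700, 4.4], [25, 2100, 4.8],
])
y = np.array([110, 150, 120, 200, 180, 250])  # annual sales (in thousands)

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)        # effect of each independent variable
print("Intercept:", model.intercept_)
print("Predicted sales:", model.predict([[22, 1900, 4.5]]))
```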

If you want to go deeper into this type of analysis, check out this article and learn more about how you can benefit from regression.

4. Neural networks

The neural network forms the basis for the intelligent algorithms of machine learning. It is a form of analytics that attempts, with minimal intervention, to understand how the human brain would generate insights and predict values. Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.

A typical area of application for neural networks is predictive analytics. There are BI reporting tools that have this feature implemented within them, such as the Predictive Analytics Tool from datapine. This tool enables users to quickly and easily generate all kinds of predictions. All you have to do is select the data to be processed based on your KPIs, and the software automatically calculates forecasts based on historical and current data. Thanks to its user-friendly interface, anyone in your organization can manage it; there’s no need to be an advanced scientist. 

Here is an example of how you can use the predictive analysis tool from datapine:

Example on how to use predictive analytics tool from datapine


5. Factor analysis

Factor analysis, also called “dimension reduction,” is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The aim here is to uncover independent latent variables, making it an ideal method for streamlining specific segments.

A good way to understand this data analysis method is a customer evaluation of a product. The initial assessment is based on different variables like color, shape, wearability, current trends, materials, comfort, the place where they bought the product, and frequency of usage. The list can be endless, depending on what you want to track. In this case, factor analysis comes into the picture by summarizing all of these variables into homogeneous groups, for example, by grouping the variables color, materials, quality, and trends into a broader latent variable of design.
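The following sketch shows what this grouping can look like in code, using scikit-learn's FactorAnalysis on simulated survey ratings. The items, the two latent factors, and the noise levels are assumptions chosen purely for illustration:

```python
# A small factor analysis sketch with scikit-learn on simulated survey data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n = 200
design = rng.normal(size=n)    # latent "design" factor
comfort = rng.normal(size=n)   # latent "comfort" factor

# Observed survey items driven by the two latent factors plus noise.
X = np.column_stack([
    design  + 0.3 * rng.normal(size=n),   # color rating
    design  + 0.3 * rng.normal(size=n),   # materials rating
    design  + 0.3 * rng.normal(size=n),   # trendiness rating
    comfort + 0.3 * rng.normal(size=n),   # wearability rating
    comfort + 0.3 * rng.normal(size=n),   # fit rating
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_.round(2))  # loadings: which items group into which factor
```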

If you want to start analyzing data using factor analysis we recommend you take a look at this practical guide from UCLA.

6. Data mining

Data mining is an umbrella term for methods of data analysis that engineer metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge. When considering how to analyze data, adopting a data mining mindset is essential to success - as such, it’s an area that is worth exploring in greater detail.

An excellent use case of data mining is datapine intelligent data alerts . With the help of artificial intelligence and machine learning, they provide automated signals based on particular commands or occurrences within a dataset. For example, if you’re monitoring supply chain KPIs , you could set an intelligent alarm to trigger when invalid or low-quality data appears. By doing so, you will be able to drill down deep into the issue and fix it swiftly and effectively.

In the following picture, you can see how the intelligent alarms from datapine work. By setting up ranges on daily orders, sessions, and revenues, the alarms will notify you if the goal was not completed or if it exceeded expectations.

Example on how to use intelligent alerts from datapine

7. Time series analysis

As its name suggests, time series analysis is used to analyze a set of data points collected over a specified period of time. Although analysts use this method to monitor the data points in a specific interval of time rather than just monitoring them intermittently, time series analysis is not used solely for the purpose of collecting data over time. Instead, it allows researchers to understand whether variables changed during the duration of the study, how the different variables depend on one another, and how the data arrived at the end result. 

In a business context, this method is used to understand the causes of different trends and patterns to extract valuable insights. Another way of using this method is with the help of time series forecasting. Powered by predictive technologies, businesses can analyze various data sets over a period of time and forecast different future events. 

A great use case to put time series analysis into perspective is seasonality effects on sales. By using time series forecasting to analyze sales data of a specific product over time, you can understand if sales rise over a specific period of time (e.g. swimwear during summertime, or candy during Halloween). These insights allow you to predict demand and prepare production accordingly.  
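A brief sketch of how seasonality can be separated from the underlying trend is shown below, using the seasonal_decompose function from statsmodels on fabricated monthly sales. The numbers and the yearly period are assumptions for demonstration:

```python
# Sketch: detecting seasonality in monthly sales with statsmodels.
# The sales figures are fabricated; assumes statsmodels is installed.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

months = pd.date_range("2021-01-01", periods=36, freq="MS")
trend = np.linspace(100, 160, 36)                                   # slow growth
seasonality = 20 * np.sin(2 * np.pi * months.month.to_numpy() / 12)  # summer peak
noise = np.random.default_rng(1).normal(0, 5, 36)
sales = pd.Series(trend + seasonality + noise, index=months)

result = seasonal_decompose(sales, model="additive", period=12)
print(result.seasonal.head(12))       # the repeating within-year pattern
print(result.trend.dropna().tail())   # the long-run direction of sales
```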

8. Decision Trees 

The decision tree analysis aims to act as a support tool to make smart and strategic decisions. By visually displaying potential outcomes, consequences, and costs in a tree-like model, researchers and company users can easily evaluate all factors involved and choose the best course of action. Decision trees are helpful to analyze quantitative data and they allow for an improved decision-making process by helping you spot improvement opportunities, reduce costs, and enhance operational efficiency and production.

But how does a decision tree actually work? This method works like a flowchart that starts with the main decision that you need to make and branches out based on the different outcomes and consequences of each decision. Each outcome will outline its own consequences, costs, and gains and, at the end of the analysis, you can compare each of them and make the smartest decision. 

Businesses can use them to understand which project is more cost-effective and will bring more earnings in the long run. For example, imagine you need to decide if you want to update your software app or build a new app entirely.  Here you would compare the total costs, the time needed to be invested, potential revenue, and any other factor that might affect your decision.  In the end, you would be able to see which of these two options is more realistic and attainable for your company or research.
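For a sense of how this looks in practice, here is a minimal decision tree sketch with scikit-learn, trained on an invented table of past projects. The features, labels, and depth limit are illustrative assumptions:

```python
# Minimal decision tree sketch with scikit-learn; features and labels are invented.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Whether past projects were profitable, based on cost and expected revenue.
projects = pd.DataFrame({
    "estimated_cost":   [50, 120, 30, 200, 80, 150, 60, 90],
    "expected_revenue": [80, 100, 70, 400, 90, 300, 65, 200],
    "profitable":       [1, 0, 1, 1, 0, 1, 0, 1],
})

X = projects[["estimated_cost", "expected_revenue"]]
y = projects["profitable"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))  # human-readable decision rules
```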

9. Conjoint analysis 

Last but not least, we have the conjoint analysis. This approach is usually used in surveys to understand how individuals value different attributes of a product or service and it is one of the most effective methods to extract consumer preferences. When it comes to purchasing, some clients might be more price-focused, others more features-focused, and others might have a sustainable focus. Whatever your customer's preferences are, you can find them with conjoint analysis. Through this, companies can define pricing strategies, packaging options, subscription packages, and more. 

A great example of conjoint analysis is in marketing and sales. For instance, a cupcake brand might use conjoint analysis and find that its clients prefer gluten-free options and cupcakes with healthier toppings over super sugary ones. Thus, the cupcake brand can turn these insights into advertisements and promotions to increase sales of this particular type of product. And not just that, conjoint analysis can also help businesses segment their customers based on their interests. This allows them to send different messaging that will bring value to each of the segments. 
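A full conjoint study involves careful experimental design, but the core estimation step can be sketched as a regression of respondent ratings on dummy-coded attribute levels. Everything in the example below, from the attribute names to the ratings, is invented for illustration:

```python
# Simplified conjoint sketch: estimating attribute part-worths by regressing
# respondent ratings on dummy-coded product attributes. All data is invented.
import pandas as pd
from sklearn.linear_model import LinearRegression

profiles = pd.DataFrame({
    "topping":     ["sugary", "healthy", "healthy", "sugary", "healthy", "sugary"],
    "gluten_free": ["no", "yes", "no", "yes", "yes", "no"],
    "price":       ["low", "low", "high", "high", "low", "high"],
    "rating":      [6, 9, 7, 5, 10, 4],   # respondent preference scores
})

X = pd.get_dummies(profiles[["topping", "gluten_free", "price"]], drop_first=True)
y = profiles["rating"]

model = LinearRegression().fit(X, y)
# Positive coefficients indicate attribute levels that raise preference.
print(dict(zip(X.columns, model.coef_.round(2))))
```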

10. Correspondence Analysis

Also known as reciprocal averaging, correspondence analysis is a method used to analyze the relationship between categorical variables presented within a contingency table. A contingency table is a table that displays two (simple correspondence analysis) or more (multiple correspondence analysis) categorical variables across rows and columns that show the distribution of the data, which is usually answers to a survey or questionnaire on a specific topic. 

This method starts by calculating an “expected value” for each cell, obtained by multiplying the corresponding row and column totals and dividing by the grand total of the table. The “expected value” is then subtracted from the observed value, resulting in a “residual,” which is what allows you to extract conclusions about relationships and distribution. The results of this analysis are later displayed using a map that represents the relationship between the different values: the closer two values are on the map, the stronger the relationship. Let’s put it into perspective with an example. 

Imagine you are carrying out a market research analysis about outdoor clothing brands and how they are perceived by the public. For this analysis, you ask a group of people to match each brand with a certain attribute which can be durability, innovation, quality materials, etc. When calculating the residual numbers, you can see that brand A has a positive residual for innovation but a negative one for durability. This means that brand A is not positioned as a durable brand in the market, something that competitors could take advantage of. 
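The first computational step described above can be sketched in a few lines of Python. The brand-by-attribute contingency table below is invented; the code simply derives expected counts under independence and the residuals that correspondence analysis builds on:

```python
# Sketch of the first step of correspondence analysis: expected values and
# residuals from a brand-by-attribute contingency table (numbers are invented).
import numpy as np
import pandas as pd

observed = pd.DataFrame(
    [[30, 10, 20],
     [15, 25, 10],
     [ 5, 15, 30]],
    index=["Brand A", "Brand B", "Brand C"],
    columns=["innovation", "durability", "quality materials"],
)

grand_total = observed.values.sum()
row_totals = observed.sum(axis=1).values.reshape(-1, 1)
col_totals = observed.sum(axis=0).values.reshape(1, -1)

# Expected count under independence: (row total * column total) / grand total.
expected = row_totals @ col_totals / grand_total
residuals = observed - expected   # positive = stronger-than-expected association

print(residuals.round(1))
```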

11. Multidimensional Scaling (MDS)

MDS is a method used to observe the similarities or disparities between objects, which can be colors, brands, people, geographical coordinates, and more. The objects are plotted using an “MDS map” that positions similar objects together and disparate ones far apart. The (dis)similarities between objects are represented using one or more dimensions that can be observed using a numerical scale. For example, if you want to know how people feel about the COVID-19 vaccine, you can use 1 for “don’t believe in the vaccine at all”, 10 for “firmly believe in the vaccine”, and a scale of 2 to 9 for in-between responses. When analyzing an MDS map, the only thing that matters is the distance between the objects; the orientation of the dimensions is arbitrary and has no meaning at all. 

Multidimensional scaling is a valuable technique for market research, especially when it comes to evaluating product or brand positioning. For instance, if a cupcake brand wants to know how they are positioned compared to competitors, it can define 2-3 dimensions such as taste, ingredients, shopping experience, or more, and do a multidimensional scaling analysis to find improvement opportunities as well as areas in which competitors are currently leading. 

Another business example is in procurement when deciding on different suppliers. Decision makers can generate an MDS map to see how the different prices, delivery times, technical services, and more of the different suppliers differ and pick the one that suits their needs the best. 

A final example comes from a research paper on "An Improved Study of Multilevel Semantic Network Visualization for Analyzing Sentiment Word of Movie Review Data". The researchers picked a two-dimensional MDS map to display the distances and relationships between different sentiments in movie reviews. They used 36 sentiment words and distributed them based on their emotional distance, as we can see in the image below, where the words "outraged" and "sweet" are on opposite sides of the map, marking the distance between the two emotions very clearly.

Example of multidimensional scaling analysis

Aside from being a valuable technique to analyze dissimilarities, MDS also serves as a dimension-reduction technique for large dimensional data. 
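Here is a minimal MDS sketch with scikit-learn that turns a fabricated pairwise dissimilarity matrix between brands into a 2-D map. The brands and the dissimilarity values are assumptions for illustration:

```python
# Minimal MDS sketch: projecting brands into a 2-D map from a pairwise
# dissimilarity matrix (values are invented for demonstration).
import numpy as np
from sklearn.manifold import MDS

brands = ["Brand A", "Brand B", "Brand C", "Brand D"]
# Symmetric dissimilarities (0 = identical perception, higher = more different).
dissimilarities = np.array([
    [0.0, 2.0, 5.0, 4.0],
    [2.0, 0.0, 4.5, 3.5],
    [5.0, 4.5, 0.0, 1.5],
    [4.0, 3.5, 1.5, 0.0],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarities)

for brand, (x, y) in zip(brands, coords):
    print(f"{brand}: ({x:.2f}, {y:.2f})")  # nearby points = similarly perceived brands
```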

B. Qualitative Methods

Qualitative data analysis methods deal with non-numerical data gathered through techniques such as interviews, focus groups, questionnaires, and more. As opposed to quantitative methods, qualitative data is more subjective and highly valuable in analyzing customer retention and product development.

12. Text analysis

Text analysis, also known in the industry as text mining, works by taking large sets of textual data and arranging them in a way that makes it easier to manage. By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your organization and use it to develop actionable insights that will propel you forward.

Modern software accelerates the application of text analytics. Thanks to the combination of machine learning and intelligent algorithms, you can perform advanced analytical processes such as sentiment analysis. This technique allows you to understand the intentions and emotions of a text, for example, whether it's positive, negative, or neutral, and then give it a score depending on certain factors and categories that are relevant to your brand. Sentiment analysis is often used to monitor brand and product reputation and to understand how successful your customer experience is. To learn more about the topic check out this insightful article.
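As a small, hedged example of what a sentiment scoring step can look like, the sketch below uses NLTK's VADER analyzer on a handful of invented reviews. It assumes nltk is installed and that the vader_lexicon resource can be downloaded; the labeling thresholds are a common convention, not a fixed rule:

```python
# Sketch of sentiment scoring on customer reviews using NLTK's VADER analyzer.
# Assumes nltk is installed and the 'vader_lexicon' resource can be downloaded.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
analyzer = SentimentIntensityAnalyzer()

reviews = [
    "Absolutely love this product, great quality!",
    "Delivery was late and the support team never answered.",
    "It's okay, does the job.",
]

for text in reviews:
    scores = analyzer.polarity_scores(text)   # neg / neu / pos / compound
    compound = scores["compound"]
    if compound > 0.05:
        label = "positive"
    elif compound < -0.05:
        label = "negative"
    else:
        label = "neutral"
    print(f"{label:>8}  {compound:+.2f}  {text}")
```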

By analyzing data from various word-based sources, including product reviews, articles, social media communications, and survey responses, you will gain invaluable insights into your audience, as well as their needs, preferences, and pain points. This will allow you to create campaigns, services, and communications that meet your prospects’ needs on a personal level, growing your audience while boosting customer retention. There are various other “sub-methods” that are an extension of text analysis. Each of them serves a more specific purpose and we will look at them in detail next. 

13. Content Analysis

This is a straightforward and very popular method that examines the presence and frequency of certain words, concepts, and subjects in different content formats such as text, image, audio, or video. For example, the number of times the name of a celebrity is mentioned on social media or online tabloids. It does this by coding text data that is later categorized and tabulated in a way that can provide valuable insights, making it the perfect mix of quantitative and qualitative analysis.

There are two types of content analysis. The first one is the conceptual analysis which focuses on explicit data, for instance, the number of times a concept or word is mentioned in a piece of content. The second one is relational analysis, which focuses on the relationship between different concepts or words and how they are connected within a specific context. 
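A bare-bones conceptual analysis can be as simple as counting how often predefined concepts appear in a body of text. The reviews and the concept list in the sketch below are invented for illustration:

```python
# Minimal conceptual content analysis: counting how often chosen concepts
# appear in a set of texts. The reviews and concept list are invented.
import re
from collections import Counter

reviews = [
    "The battery life is great, but the battery takes long to charge.",
    "Great design and great battery.",
    "The design feels cheap and the charger broke.",
]
concepts = ["battery", "design", "charger", "price"]

counts = Counter()
for text in reviews:
    tokens = re.findall(r"[a-z']+", text.lower())
    counts.update(t for t in tokens if t in concepts)

for concept in concepts:
    print(f"{concept:>8}: {counts[concept]}")
```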

Content analysis is often used by marketers to measure brand reputation and customer behavior, for example, by analyzing customer reviews. It can also be used to analyze customer interviews and find directions for new product development. It is also important to note that, in order to extract the maximum potential out of this analysis method, it is necessary to have a clearly defined research question. 

14. Thematic Analysis

Very similar to content analysis, thematic analysis also helps in identifying and interpreting patterns in qualitative data, with the main difference being that the former can also be applied to quantitative analysis. The thematic method analyzes large pieces of text data such as focus group transcripts or interviews and groups them into themes or categories that come up frequently within the text. It is a great method when trying to figure out people's views and opinions about a certain topic. For example, if you are a brand that cares about sustainability, you can do a survey of your customers to analyze their views and opinions about sustainability and how they apply it to their lives. You can also analyze customer service call transcripts to find common issues and improve your service. 

Thematic analysis is a very subjective technique that relies on the researcher's judgment. Therefore, to avoid biases, it follows six steps: familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. It is also important to note that, because it is a flexible approach, the data can be interpreted in multiple ways and it can be hard to select which data is most important to emphasize. 

15. Narrative Analysis 

A bit more complex in nature than the two previous ones, narrative analysis is used to explore the meaning behind the stories that people tell and most importantly, how they tell them. By looking into the words that people use to describe a situation you can extract valuable conclusions about their perspective on a specific topic. Common sources for narrative data include autobiographies, family stories, opinion pieces, and testimonials, among others. 

From a business perspective, narrative analysis can be useful to analyze customer behaviors and feelings towards a specific product, service, feature, or others. It provides unique and deep insights that can be extremely valuable. However, it has some drawbacks.  

The biggest weakness of this method is that the sample sizes are usually very small due to the complexity and time-consuming nature of the collection of narrative data. Plus, the way a subject tells a story will be significantly influenced by his or her specific experiences, making it very hard to replicate in a subsequent study. 

16. Discourse Analysis

Discourse analysis is used to understand the meaning behind any type of written, verbal, or symbolic discourse based on its political, social, or cultural context. It mixes the analysis of languages and situations together. This means that the way the content is constructed and the meaning behind it is significantly influenced by the culture and society it takes place in. For example, if you are analyzing political speeches you need to consider different context elements such as the politician's background, the current political context of the country, the audience to which the speech is directed, and so on. 

From a business point of view, discourse analysis is a great market research tool. It allows marketers to understand how the norms and ideas of the specific market work and how their customers relate to those ideas. It can be very useful to build a brand mission or develop a unique tone of voice. 

17. Grounded Theory Analysis

Traditionally, researchers decide on a method and hypothesis and start to collect the data to prove that hypothesis. Grounded theory, by contrast, doesn't require an initial research question or hypothesis, as its value lies in the generation of new theories. With the grounded theory method, you can go into the analysis process with an open mind and explore the data to generate new theories through tests and revisions. In fact, it is not necessary to collect all the data before starting to analyze it; researchers usually begin to find valuable insights as they are gathering the data. 

All of these elements make grounded theory a very valuable method as theories are fully backed by data instead of initial assumptions. It is a great technique to analyze poorly researched topics or find the causes behind specific company outcomes. For example, product managers and marketers might use the grounded theory to find the causes of high levels of customer churn and look into customer surveys and reviews to develop new theories about the causes. 

How To Analyze Data? Top 17 Data Analysis Techniques To Apply

17 top data analysis techniques by datapine

Now that we've answered the questions "what is data analysis?" and "why is it important?", and covered the different data analysis types, it's time to dig deeper into how to perform your analysis by working through these 17 essential techniques.

1. Collaborate your needs

Before you begin analyzing or drilling down into any techniques, it’s crucial to sit down collaboratively with all key stakeholders within your organization, decide on your primary campaign or strategic goals, and gain a fundamental understanding of the types of insights that will best benefit your progress or provide you with the level of vision you need to evolve your organization.

2. Establish your questions

Once you’ve outlined your core objectives, you should consider which questions will need answering to help you achieve your mission. This is one of the most important techniques as it will shape the very foundations of your success.

To make sure your data works for you and that you are asking the right things, start by framing the right data analysis questions.

3. Data democratization

After giving your data analytics methodology some real direction, and knowing which questions need answering to extract optimum value from the information available to your organization, you should continue with democratization.

Data democratization is an action that aims to connect data from various sources efficiently and quickly so that anyone in your organization can access it at any given moment. You can extract data in text, images, videos, numbers, or any other format, and then perform cross-database analysis to achieve more advanced insights to share with the rest of the company interactively.  

Once you have decided on your most valuable sources, you need to take all of this into a structured format to start collecting your insights. For this purpose, datapine offers an easy all-in-one data connectors feature to integrate all your internal and external sources and manage them at your will. Additionally, datapine’s end-to-end solution automatically updates your data, allowing you to save time and focus on performing the right analysis to grow your company.

data connectors from datapine

4. Think of governance 

When collecting data in a business or research context you always need to think about security and privacy. With data breaches becoming a topic of concern for businesses, the need to protect your client's or subject’s sensitive information becomes critical. 

To ensure that all this is taken care of, you need to think of a data governance strategy. According to Gartner , this concept refers to “ the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics .” In simpler words, data governance is a collection of processes, roles, and policies, that ensure the efficient use of data while still achieving the main company goals. It ensures that clear roles are in place for who can access the information and how they can access it. In time, this not only ensures that sensitive information is protected but also allows for an efficient analysis as a whole. 

5. Clean your data

After harvesting data from so many sources, you will be left with a vast amount of information that can be overwhelming to deal with. At the same time, you may be faced with incorrect data that can be misleading to your analysis. The smartest thing you can do to avoid dealing with this in the future is to clean the data. This is fundamental before visualizing it, as it will ensure that the insights you extract from it are correct.

There are many things that you need to look for in the cleaning process. The most important one is to eliminate any duplicate observations; this usually appears when using multiple internal and external sources of information. You can also add any missing codes, fix empty fields, and eliminate incorrectly formatted data.
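To ground these steps, here is a short pandas sketch that removes duplicates, normalizes formats, and fills empty fields. The table, column names, and imputation choice (median) are illustrative assumptions; the right strategy depends on your data:

```python
# Common cleaning steps with pandas: dropping duplicates, fixing empty fields,
# and normalizing formats. The DataFrame and column names are illustrative.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 104],
    "email": ["a@x.com", "a@x.com", "B@X.COM ", None, "d@x.com"],
    "revenue": ["1,200", "1,200", "850", "430", None],
})

clean = (
    raw.drop_duplicates()                                          # remove repeated rows
       .assign(
           email=lambda d: d["email"].str.strip().str.lower(),     # consistent format
           revenue=lambda d: pd.to_numeric(
               d["revenue"].str.replace(",", "", regex=False), errors="coerce"),
       )
)
clean["revenue"] = clean["revenue"].fillna(clean["revenue"].median())  # impute gaps
print(clean)
```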

Another usual form of cleaning is done with text data. As we mentioned earlier, most companies today analyze customer reviews, social media comments, questionnaires, and several other text inputs. In order for algorithms to detect patterns, text data needs to be revised to avoid invalid characters or any syntax or spelling errors. 

Most importantly, the aim of cleaning is to prevent you from arriving at false conclusions that can damage your company in the long run. By using clean data, you will also help BI solutions to interact better with your information and create better reports for your organization.

6. Set your KPIs

Once you’ve set your sources, cleaned your data, and established clear-cut questions you want your insights to answer, you need to set a host of key performance indicators (KPIs) that will help you track, measure, and shape your progress in a number of key areas.

KPIs are critical to both qualitative and quantitative analysis research. This is one of the primary methods of data analysis you certainly shouldn’t overlook.

To help you set the best possible KPIs for your initiatives and activities, here is an example of a relevant logistics KPI : transportation-related costs. If you want to see more go explore our collection of key performance indicator examples .

Transportation costs logistics KPIs

7. Omit useless data

Having bestowed your data analysis tools and techniques with true purpose and defined your mission, you should explore the raw data you’ve collected from all sources and use your KPIs as a reference for chopping out any information you deem to be useless.

Trimming the informational fat is one of the most crucial methods of analysis as it will allow you to focus your analytical efforts and squeeze every drop of value from the remaining ‘lean’ information.

Any stats, facts, figures, or metrics that don’t align with your business goals or fit with your KPI management strategies should be eliminated from the equation.

8. Build a data management roadmap

While, at this point, this particular step is optional (you will have already gained a wealth of insight and formed a fairly sound strategy by now), creating a data management roadmap will help your data analysis methods and techniques become successful on a more sustainable basis. These roadmaps, if developed properly, are also built so they can be tweaked and scaled over time.

Invest ample time in developing a roadmap that will help you store, manage, and handle your data internally, and you will make your analysis techniques all the more fluid and functional – one of the most powerful types of data analysis methods available today.

9. Integrate technology

There are many ways to analyze data, but one of the most vital aspects of analytical success in a business context is integrating the right decision support software and technology.

Robust analysis platforms will not only allow you to pull critical data from your most valuable sources while working with dynamic KPIs that will offer you actionable insights; they will also present the data in a digestible, visual, interactive format from one central, live dashboard. A data methodology you can count on.

By integrating the right technology within your data analysis methodology, you’ll avoid fragmenting your insights, saving you time and effort while allowing you to enjoy the maximum value from your business’s most valuable insights.

For a look at the power of software for the purpose of analysis and to enhance your methods of analyzing, glance over our selection of dashboard examples .

10. Answer your questions

By considering each of the above efforts, working with the right technology, and fostering a cohesive internal culture where everyone buys into the different ways to analyze data as well as the power of digital intelligence, you will swiftly start to answer your most burning business questions. Arguably, the best way to make your data concepts accessible across the organization is through data visualization.

11. Visualize your data

Online data visualization is a powerful tool as it lets you tell a story with your metrics, allowing users across the organization to extract meaningful insights that aid business evolution – and it covers all the different ways to analyze data.

The purpose of analyzing is to make your entire organization more informed and intelligent, and with the right platform or dashboard, this is simpler than you think, as demonstrated by our marketing dashboard .

An executive dashboard example showcasing high-level marketing KPIs such as cost per lead, MQL, SQL, and cost per customer.

This visual, dynamic, and interactive online dashboard is a data analysis example designed to give Chief Marketing Officers (CMO) an overview of relevant metrics to help them understand if they achieved their monthly goals.

In detail, this example generated with a modern dashboard creator displays interactive charts for monthly revenues, costs, net income, and net income per customer; all of them are compared with the previous month so that you can understand how the data fluctuated. In addition, it shows a detailed summary of the number of users, customers, SQLs, and MQLs per month to visualize the whole picture and extract relevant insights or trends for your marketing reports .

The CMO dashboard is perfect for c-level management as it can help them monitor the strategic outcome of their marketing efforts and make data-driven decisions that can benefit the company exponentially.

12. Be careful with the interpretation

We already dedicated an entire post to data interpretation as it is a fundamental part of the process of data analysis. It gives meaning to the analytical information and aims to drive a concise conclusion from the analysis results. Since most of the time companies are dealing with data from many different sources, the interpretation stage needs to be done carefully and properly in order to avoid misinterpretations. 

To help you through the process, here we list three common practices that you need to avoid at all costs when looking at your data:

  • Correlation vs. causation: The human brain is wired to find patterns. This behavior leads to one of the most common mistakes when performing interpretation: confusing correlation with causation. Although these two aspects can exist simultaneously, it is not correct to assume that because two things happened together, one provoked the other. A piece of advice to avoid falling into this mistake is never to trust intuition alone; trust the data. If there is no objective evidence of causation, then always stick to correlation. 
  • Confirmation bias: This phenomenon describes the tendency to select and interpret only the data necessary to prove one hypothesis, often ignoring the elements that might disprove it. Even if it's not done on purpose, confirmation bias can represent a real problem, as excluding relevant information can lead to false conclusions and, therefore, bad business decisions. To avoid it, always try to disprove your hypothesis instead of proving it, share your analysis with other team members, and avoid drawing any conclusions before the entire analytical project is finalized.
  • Statistical significance: In short, statistical significance helps analysts understand whether a result is actually accurate or whether it happened because of a sampling error or pure chance. The level of statistical significance needed might depend on the sample size and the industry being analyzed. In any case, ignoring the significance of a result when it might influence decision-making can be a huge mistake; the short sketch after this list shows what a simple significance check can look like in practice.
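As referenced in the last point above, here is a minimal sketch of a significance check using SciPy's independent-samples t-test on two invented campaign variants. The data, the choice of test, and the 0.05 threshold are assumptions for illustration:

```python
# Quick illustration of checking statistical significance with SciPy:
# comparing two campaign variants (all numbers are invented).
from scipy import stats

variant_a = [2.1, 2.4, 1.9, 2.6, 2.3, 2.8, 2.2, 2.5]   # e.g. daily conversion rates (%)
variant_b = [2.9, 3.1, 2.7, 3.4, 3.0, 2.8, 3.2, 3.3]

t_stat, p_value = stats.ttest_ind(variant_a, variant_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

alpha = 0.05  # conventional significance level; adjust to your context
if p_value < alpha:
    print("The difference is statistically significant at the chosen level.")
else:
    print("The difference could plausibly be due to chance or sampling error.")
```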

13. Build a narrative

Now, we’re going to look at how you can bring all of these elements together in a way that will benefit your business - starting with a little something called data storytelling.

The human brain responds incredibly well to strong stories or narratives. Once you’ve cleansed, shaped, and visualized your most invaluable data using various BI dashboard tools , you should strive to tell a story - one with a clear-cut beginning, middle, and end.

By doing so, you will make your analytical efforts more accessible, digestible, and universal, empowering more people within your organization to use your discoveries to their actionable advantage.

14. Consider autonomous technology

Autonomous technologies, such as artificial intelligence (AI) and machine learning (ML), play a significant role in the advancement of understanding how to analyze data more effectively.

Gartner predicts that by the end of this year, 80% of emerging technologies will be developed with AI foundations. This is a testament to the ever-growing power and value of autonomous technologies.

At the moment, these technologies are revolutionizing the analysis industry. Some examples that we mentioned earlier are neural networks, intelligent alarms, and sentiment analysis.

15. Share the load

If you work with the right tools and dashboards, you will be able to present your metrics in a digestible, value-driven format, allowing almost everyone in the organization to connect with and use relevant data to their advantage.

Modern dashboards consolidate data from various sources, providing access to a wealth of insights in one centralized location, no matter if you need to monitor recruitment metrics or generate reports that need to be sent across numerous departments. Moreover, these cutting-edge tools offer access to dashboards from a multitude of devices, meaning that everyone within the business can connect with practical insights remotely - and share the load.

Once everyone is able to work with a data-driven mindset, you will catalyze the success of your business in ways you never thought possible. And when it comes to knowing how to analyze data, this kind of collaborative approach is essential.

16. Data analysis tools

In order to perform high-quality analysis of data, it is fundamental to use tools and software that will ensure the best results. Here we leave you a small summary of four fundamental categories of data analysis tools for your organization.

  • Business Intelligence: BI tools allow you to process significant amounts of data from several sources in any format. Through this, you can not only analyze and monitor your data to extract relevant insights but also create interactive reports and dashboards to visualize your KPIs and use them for your company's good. datapine is an amazing online BI software that is focused on delivering powerful online analysis features that are accessible to beginner and advanced users. Like this, it offers a full-service solution that includes cutting-edge analysis of data, KPIs visualization, live dashboards, reporting, and artificial intelligence technologies to predict trends and minimize risk.
  • Statistical analysis: These tools are usually designed for scientists, statisticians, market researchers, and mathematicians, as they allow them to perform complex statistical analyses with methods like regression analysis, predictive analysis, and statistical modeling. A good tool to perform this type of analysis is R-Studio as it offers a powerful data modeling and hypothesis testing feature that can cover both academic and general data analysis. This tool is one of the favorite ones in the industry, due to its capability for data cleaning, data reduction, and performing advanced analysis with several statistical methods. Another relevant tool to mention is SPSS from IBM. The software offers advanced statistical analysis for users of all skill levels. Thanks to a vast library of machine learning algorithms, text analysis, and a hypothesis testing approach it can help your company find relevant insights to drive better decisions. SPSS also works as a cloud service that enables you to run it anywhere.
  • SQL Consoles: SQL is a programming language often used to handle structured data in relational databases. Tools like these are popular among data scientists as they are extremely effective in unlocking these databases' value. Undoubtedly, one of the most used SQL software in the market is MySQL Workbench . This tool offers several features such as a visual tool for database modeling and monitoring, complete SQL optimization, administration tools, and visual performance dashboards to keep track of KPIs.
  • Data Visualization: These tools are used to represent your data through charts, graphs, and maps that allow you to find patterns and trends in the data. datapine's already mentioned BI platform also offers a wealth of powerful online data visualization tools with several benefits. Some of them include: delivering compelling data-driven presentations to share with your entire company, the ability to see your data online with any device wherever you are, an interactive dashboard design feature that enables you to showcase your results in an interactive and understandable way, and to perform online self-service reports that can be used simultaneously with several other people to enhance team productivity.

17. Refine your process constantly 

Last is a step that might seem obvious to some people, but it can be easily ignored if you think you are done. Once you have extracted the needed results, you should always take a retrospective look at your project and think about what you can improve. As you saw throughout this long list of techniques, data analysis is a complex process that requires constant refinement. For this reason, you should always go one step further and keep improving. 

Quality Criteria For Data Analysis

So far we’ve covered a list of methods and techniques that should help you perform efficient data analysis. But how do you measure the quality and validity of your results? This is done with the help of some science quality criteria. Here we will go into a more theoretical area that is critical to understanding the fundamentals of statistical analysis in science. However, you should also be aware of these steps in a business context, as they will allow you to assess the quality of your results in the correct way. Let’s dig in. 

  • Internal validity: The results of a survey are internally valid if they measure what they are supposed to measure and thus provide credible results. In other words , internal validity measures the trustworthiness of the results and how they can be affected by factors such as the research design, operational definitions, how the variables are measured, and more. For instance, imagine you are doing an interview to ask people if they brush their teeth two times a day. While most of them will answer yes, you can still notice that their answers correspond to what is socially acceptable, which is to brush your teeth at least twice a day. In this case, you can’t be 100% sure if respondents actually brush their teeth twice a day or if they just say that they do, therefore, the internal validity of this interview is very low. 
  • External validity: Essentially, external validity refers to the extent to which the results of your research can be applied to a broader context. It basically aims to prove that the findings of a study can be applied in the real world. If the research can be applied to other settings, individuals, and times, then the external validity is high. 
  • Reliability : If your research is reliable, it means that it can be reproduced. If your measurement were repeated under the same conditions, it would produce similar results. This means that your measuring instrument consistently produces reliable results. For example, imagine a doctor building a symptoms questionnaire to detect a specific disease in a patient. Then, various other doctors use this questionnaire but end up diagnosing the same patient with a different condition. This means the questionnaire is not reliable in detecting the initial disease. Another important note here is that in order for your research to be reliable, it also needs to be objective. If the results of a study are the same, independent of who assesses them or interprets them, the study can be considered reliable. Let’s see the objectivity criteria in more detail now. 
  • Objectivity: In data science, objectivity means that the researcher needs to stay fully objective when it comes to the analysis. The results of a study need to be driven by objective criteria and not by the beliefs, personality, or values of the researcher. Objectivity needs to be ensured when you are gathering the data, for example, when interviewing individuals, the questions need to be asked in a way that doesn't influence the results. Paired with this, objectivity also needs to be thought of when interpreting the data. If different researchers reach the same conclusions, then the study is objective. For this last point, you can set predefined criteria to interpret the results to ensure all researchers follow the same steps. 

The discussed quality criteria cover mostly potential influences in a quantitative context. Analysis in qualitative research has by default additional subjective influences that must be controlled in a different way. Therefore, there are other quality criteria for this kind of research such as credibility, transferability, dependability, and confirmability. You can see each of them more in detail on this resource . 

Data Analysis Limitations & Barriers

Analyzing data is not an easy task. As you’ve seen throughout this post, there are many steps and techniques that you need to apply in order to extract useful information from your research. While a well-performed analysis can bring various benefits to your organization it doesn't come without limitations. In this section, we will discuss some of the main barriers you might encounter when conducting an analysis. Let’s see them more in detail. 

  • Lack of clear goals: No matter how good your data or analysis might be if you don’t have clear goals or a hypothesis the process might be worthless. While we mentioned some methods that don’t require a predefined hypothesis, it is always better to enter the analytical process with some clear guidelines of what you are expecting to get out of it, especially in a business context in which data is utilized to support important strategic decisions. 
  • Objectivity: Arguably one of the biggest barriers when it comes to data analysis in research is to stay objective. When trying to prove a hypothesis, researchers might find themselves, intentionally or unintentionally, directing the results toward an outcome that they want. To avoid this, always question your assumptions and avoid confusing facts with opinions. You can also show your findings to a research partner or external person to confirm that your results are objective. 
  • Data representation: A fundamental part of the analytical procedure is the way you represent your data. You can use various graphs and charts to represent your findings, but not all of them will work for all purposes. Choosing the wrong visual can not only damage your analysis but can mislead your audience, therefore, it is important to understand when to use each type of data depending on your analytical goals. Our complete guide on the types of graphs and charts lists 20 different visuals with examples of when to use them. 
  • Flawed correlation : Misleading statistics can significantly damage your research. We’ve already pointed out a few interpretation issues previously in the post, but it is an important barrier that we can't avoid addressing here as well. Flawed correlations occur when two variables appear related to each other but they are not. Confusing correlations with causation can lead to a wrong interpretation of results which can lead to building wrong strategies and loss of resources, therefore, it is very important to identify the different interpretation mistakes and avoid them. 
  • Sample size: A very common barrier to a reliable and efficient analysis process is the sample size. In order for the results to be trustworthy, the sample size should be representative of what you are analyzing. For example, imagine you have a company of 1000 employees and you ask the question “do you like working here?” to 50 employees, of which 49 say yes, which means 98%. Now, imagine you ask the same question to all 1000 employees and 980 say yes, which also means 98%. Saying that 98% of employees like working in the company when the sample size was only 50 is not a representative or trustworthy conclusion. The results are far more reliable when they come from a larger, representative sample.   
  • Privacy concerns: In some cases, data collection can be subjected to privacy regulations. Businesses gather all kinds of information from their customers from purchasing behaviors to addresses and phone numbers. If this falls into the wrong hands due to a breach, it can affect the security and confidentiality of your clients. To avoid this issue, you need to collect only the data that is needed for your research and, if you are using sensitive facts, make it anonymous so customers are protected. The misuse of customer data can severely damage a business's reputation, so it is important to keep an eye on privacy. 
  • Lack of communication between teams : When it comes to performing data analysis on a business level, it is very likely that each department and team will have different goals and strategies. However, they are all working for the same common goal of helping the business run smoothly and keep growing. When teams are not connected and communicating with each other, it can directly affect the way general strategies are built. To avoid these issues, tools such as data dashboards enable teams to stay connected through data in a visually appealing way. 
  • Innumeracy : Businesses are working with data more and more every day. While there are many BI tools available to perform effective analysis, data literacy is still a constant barrier. Not all employees know how to apply analysis techniques or extract insights from them. To prevent this from happening, you can implement different training opportunities that will prepare every relevant user to deal with data. 

Key Data Analysis Skills

As you've learned throughout this lengthy guide, analyzing data is a complex task that requires a lot of knowledge and skills. That said, thanks to the rise of self-service tools the process is way more accessible and agile than it once was. Regardless, there are still some key skills that are valuable to have when working with data, we list the most important ones below.

  • Critical and statistical thinking: To successfully analyze data you need to be creative and think outside the box. Yes, that might sound like a strange statement considering that data is often tied to facts. However, a great level of critical thinking is required to uncover connections, come up with a valuable hypothesis, and extract conclusions that go a step further than the surface. This, of course, needs to be complemented by statistical thinking and an understanding of numbers. 
  • Data cleaning: Anyone who has ever worked with data before will tell you that the cleaning and preparation process accounts for 80% of a data analyst's work, therefore, the skill is fundamental. But not just that, not cleaning the data adequately can also significantly damage the analysis, which can lead to poor decision-making in a business scenario. While there are multiple tools that automate the cleaning process and eliminate the possibility of human error, it is still a valuable skill to master. 
  • Data visualization: Visuals make the information easier to understand and analyze, not only for professional users but especially for non-technical ones. Having the necessary skills to not only choose the right chart type but know when to apply it correctly is key. This also means being able to design visually compelling charts that make the data exploration process more efficient. 
  • SQL: The Structured Query Language or SQL is a programming language used to communicate with databases. It is fundamental knowledge as it enables you to update, manipulate, and organize data from relational databases which are the most common databases used by companies. It is fairly easy to learn and one of the most valuable skills when it comes to data analysis. 
  • Communication skills: This is a skill that is especially valuable in a business environment. Being able to clearly communicate analytical outcomes to colleagues is incredibly important, especially when the information you are trying to convey is complex for non-technical people. This applies to in-person communication as well as written format, for example, when generating a dashboard or report. While this might be considered a “soft” skill compared to the other ones we mentioned, it should not be ignored as you most likely will need to share analytical findings with others no matter the context. 

Data Analysis In The Big Data Environment

Big data is invaluable to today’s businesses, and by using different methods for data analysis, it’s possible to view your data in a way that can help you turn insight into positive action.

To inspire your efforts and put the importance of big data into context, here are some insights that you should know:

  • By 2026 the industry of big data is expected to be worth approximately $273.4 billion.
  • 94% of enterprises say that analyzing data is important for their growth and digital transformation. 
  • Companies that exploit the full potential of their data can increase their operating margins by 60% .
  • We already told you about the benefits of artificial intelligence through this article. This industry's financial impact is expected to grow to $40 billion by 2025.

Data analysis concepts may come in many forms, but fundamentally, any solid methodology will help to make your business more streamlined, cohesive, insightful, and successful than ever before.

Key Takeaways From Data Analysis 

As we reach the end of our data analysis journey, we leave a small summary of the main methods and techniques to perform excellent analysis and grow your business.

17 Essential Types of Data Analysis Methods:

  • Cluster analysis
  • Cohort analysis
  • Regression analysis
  • Factor analysis
  • Neural Networks
  • Data Mining
  • Text analysis
  • Time series analysis
  • Decision trees
  • Conjoint analysis 
  • Correspondence Analysis
  • Multidimensional Scaling 
  • Content analysis 
  • Thematic analysis
  • Narrative analysis 
  • Grounded theory analysis
  • Discourse analysis 

Top 17 Data Analysis Techniques:

  • Collaborate your needs
  • Establish your questions
  • Data democratization
  • Think of data governance 
  • Clean your data
  • Set your KPIs
  • Omit useless data
  • Build a data management roadmap
  • Integrate technology
  • Answer your questions
  • Visualize your data
  • Interpretation of data
  • Consider autonomous technology
  • Build a narrative
  • Share the load
  • Data Analysis tools
  • Refine your process constantly 

We’ve pondered the data analysis definition and drilled down into the practical applications of data-centric analytics, and one thing is clear: by taking measures to arrange your data and making your metrics work for you, it’s possible to transform raw information into action - the kind of action that will push your business to the next level.

Yes, good data analytics techniques result in enhanced business intelligence (BI). To help you understand this notion in more detail, read our exploration of business intelligence reporting .

And, if you’re ready to perform your own analysis, drill down into your facts and figures while interacting with your data on astonishing visuals, you can try our software for a free, 14-day trial .


  • Open access
  • Published: 02 May 2024

Underwater acoustic analysis reveals unique pressure signals associated with aircraft crashes in the sea: revisiting MH370

  • Usama Kadri 1  

Scientific Reports volume 14, Article number: 10102 (2024)

24k Accesses

191 Altmetric


  • Physical oceanography
  • Scientific data

Data analysis from the hydroacoustic stations of the Comprehensive Nuclear-Test-Ban Treaty Organization has unveiled distinctive pressure signals linked to aircraft crashes of varying sizes in the ocean. Notably, these signals were detected at distances ranging from two to five thousand kilometres, highlighting the efficacy of underwater acoustic technology in event identification and classification in marine environments. In this study, we investigate the plausibility of an aircraft, such as Malaysian Airlines Flight 370 (MH370), crashing into the sea and leaving a discernible pressure signal at distant hydrophones. Consequently, we focus on recordings obtained from the hydroacoustic monitoring stations located at Cape Leeuwin and Diego Garcia, within a few minutes of the last satellite ping on the 7th arc, associated with the assumed crash time and location. Among the available data, only one relevant signal has emerged as a potential candidate, albeit recorded at a single station out of the two stations available. To ensure a comprehensive analysis, we also examine the time frame and location of the airplane along its initial route, though no corresponding signal was observed there. Nevertheless, the findings in this study narrow down the range of possibilities and present a novel scientific approach to investigate such incidents. These findings contribute to our understanding of acoustic signals associated with aircraft crashes at sea. They emphasise the potential for hydrophones to detect events even when the signal travels long distances through land. Ultimately, this research offers recommendations for conducting on-site experiments involving controlled explosions with energy levels similar to the impact of MH370 along the 7th arc. The aim is to encourage pertinent authorities to implement actions that could reveal insights into the destiny of MH370 specifically. Additionally, this initiative seeks to establish a comprehensive framework for addressing comparable incidents in the broader ocean context.


Introduction.

This article delves into the perplexing enigma surrounding the Malaysian Airlines Flight 370 (MH370) by conducting an in-depth analysis of previous aircraft crashes. Out of about two decades of hydroacoustic data available from the CTBTO, a comprehensive study was undertaken of 100 hours of data identified as having the potential for containing signals pertaining to acoustic signatures of aircraft crashing into the sea. Ten historical aircraft accidents that occurred in open sea locations were selected for this study. Figures 1 , 2 and 3 depict maps illustrating the positions of these aircraft and their respective distances (ranging from 2211 to 4695 km) and their bearings (directions) relative to the hydroacoustic stations. Further background information on each crash is provided in the supplementary material S1 . Figure 1 showcases the impact locations of three aircraft, namely F-35a, Transair Flight 810, and Asiana Flight 991, along with their distances from the northern and southern parts of the hydroacoustic station at Wake Island, stations H11N and H11S, in the Pacific Ocean. Notably, each hydroacoustic station consists of three hydrophones arranged in a triangular configuration with distances of a few kilometres between them. By calculating the time differences in signal arrival, it is possible to determine the bearing of the signal (see “ Methods ”). In Fig. 2 , the impact locations of five aircrafts are depicted, all of which were recorded at hydroacoustic stations in the Indian Ocean, namely Diego Garcia (H08S and H08N) and Cape Leeuwin (H01W). Yemenia Flight 626 was recorded at both H08S and H08N, while AB Aviation Flight 1103 and Sriwijaya Air Flight 182 were recorded at H08S. Air Asia Flight 8501 was recorded at H01W, and Lion Air Flight 904 was recorded on both H01W and H08S, providing a rare opportunity to study signal variations due to propagation through land. Low-frequency acoustic signals coupling to land and subsequently to the ocean have been reasonably well-documented 1 , 2 . The concept that hydroacoustic or more generally acoustic-gravity wave signals originating from objects impacting the sea surface can transition between land and water has been proposed by Ref. 3 . To demonstrate the detectability of such signals even after traveling hundreds of kilometres inland, the case of the ARA San Juan, an Argentine submarine that vanished off the coast of Argentina on November 15, 2017, is considered. An explosion at the disappearance site was detected on CTBTO stations Ascension (H10) and Crozet Islands (H04), located thousands of kilometres away 4 . In Fig. 3 , the location of the submarine and the vast stretch of land separating it from the Juan Fernandez Island hydroacoustic station (H03S) is depicted. A similar distance, but without any land in between, separates Air France Flight 447 from H10S.
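To illustrate the bearing idea mentioned above in the simplest possible terms, the sketch below estimates a back-azimuth from arrival-time differences at a three-hydrophone array under a plane-wave assumption. This is not the processing chain used in this study; the sensor coordinates, sound speed, and delays are invented for demonstration only:

```python
# Illustrative sketch (not from the paper): estimating the back-azimuth of a
# plane acoustic wave from arrival-time differences at a triangular hydrophone
# array. Coordinates, sound speed, and delays are invented for demonstration.
import numpy as np

c = 1480.0  # approximate speed of sound in water, m/s

# Hydrophone positions in metres (east, north) relative to the first sensor.
positions = np.array([[0.0, 0.0], [2000.0, 0.0], [1000.0, 1800.0]])

# Measured arrival-time differences relative to hydrophone 0, in seconds.
delays = np.array([0.0, -0.9, 0.4])

# Plane-wave model: delay_i = (r_i . s) / c, where s is the propagation vector.
# Solve the overdetermined system in a least-squares sense, then normalize.
A = positions / c
s, *_ = np.linalg.lstsq(A, delays, rcond=None)
s /= np.linalg.norm(s)

# The signal arrives from the direction opposite to the propagation vector.
back_azimuth = np.degrees(np.arctan2(-s[0], -s[1])) % 360
print(f"Estimated bearing toward the source: {back_azimuth:.1f} deg from north")
```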

figure 1

Location of the CTBTO’s hydroacoustic stations H11N and H11S (white triangles); the impact location of three aircraft (indicated in yellow): F-35a, Transair Flight 810, and Asiana Flight 991; and the distances and bearings relative to the hydroacoustic stations (presented in magenta). The cyan star shows the location of earthquake M 4.8–9 km S of Yōkaichiba, Japan, 2014-03-07 18:34:20 (UTC) 35.611 \(^{\circ }\) N 140.552 \(^{\circ }\) E 23.9 km depth.

figure 2

Location of the CTBTO’s hydroacoustic stations H08S, H08N and H01W (white triangles); the impact location of five aircraft, Yemenia Flight 626, AB Aviation Flight 1103, Sriwijaya Air Flight 182, Air Asia Flight 8501, Lion Air Flight 904 (indicated in yellow); and the distances and bearings relative to the hydroacoustic stations (presented in magenta). The cyan star shows the location of earthquake M 2.7–85 km NE of Sinabang, Indonesia, 2014-03-07 18:55:12 (UTC) 3.072 \(^{\circ }\) N 96.873 \(^{\circ }\) E 10.0 km depth.

figure 3

Location of the CTBTO’s hydroacoustic stations H10S, H10N and H03S (white triangles); impact location of Air France Flight 447, and disappearance location of ARA San Juan Submarine (indicated in yellow); distances and bearings relative to the hydroacoustic stations (presented in magenta); and shortest time path for the acoustic signal to travel inland (cyan).

Past aircraft crashes with known locations

The spectrograms and pressure time series of signals from ten aircraft incidents and a single submarine disappearance, recorded on seven hydroacoustic stations of the CTBTO, are presented in Figs. 4 , 5 and 6 . Detailed information about each incident can be found in the supplementary material S1 , and technical notes on the calculation methods employed are given in the “ Methods ” section. A consistent observation in this study, across all aircraft incidents, is the detection of acoustic signals even at distances of thousands of kilometres from the hydrophones.

The spectrograms of aircraft F-35a exhibit a distinctive, sharp, and rapid (6 s) signal following the supersonic impact (Fig. 4 a). This signal is captured at both stations, H11N and H11S, which are over 3300 km away. The result is in agreement with Ref. 5 who studied this event in detail. A similar pattern is observed in the case of Yemenia Flight 626, also recorded at two stations, H08N and H08S (Fig. 4 b), although the signal appears fainter compared to F-35a, likely due to a less energetic impact. It is worth noting that the background noise at H08S is particularly high, posing a challenge for isolating the signal of interest. Conversely, H08N displays two subsequent signals that appear to be local events. For Sriwijaya Air Flight 182 (Fig. 4 c), data was only available at H08S within the specified time window. Despite the significant distance from the hydrophone station, the signal of interest is sharp and clear.

The signal associated with Air France Flight 447 (Fig. 5 a) is also short, but due to relatively high background noise between 18 and 28 Hz, the signal appears faint. It is important to emphasise that Flight 447 experienced a high-impact event, as evidenced by the pressure values reaching 75% of that observed for F-35a, even over a distance of 2200 km. Note that the above mentioned flights had powerful impacts, some due to aerodynamic stall prior to crashing.

The incident involving Transair Flight 810 is noteworthy as data is available from both H11N and H11S (Fig. 5 b), and there appears to be a double signal, potentially indicating a ditching event, a detached engine, or signal arrivals from different locations. The duration of the signal is longer compared to previous cases, suggesting either signal dispersion or an extended impact time due to ditching. (When sound propagates through the ocean, it encounters variations in water temperature, pressure, and salinity. These environmental factors can lead to changes in the speed of sound at different depths and locations. As a result, different frequency components of a sound signal may travel at different speeds, causing the signal to spread out, or disperse, over time.) A similar observation, with more pronounced dispersion indicated by white arrows, is noticed in the case of Lion Air Flight 904 (Fig. 5 c). The associated signals, within the expected time windows, were recorded at two distant stations, namely H01W and H08S. Notably, station H08S is nearly twice as far from the incident location, resulting in stronger signal dispersion. The shape and arrival time of the signal at H01W indicate that it primarily travelled through water, although the bearing suggests a landward direction. The high pressure recorded at H01W indicates that the signal might be associated with a different event, possibly local, especially considering reports of Lion Air Flight 904’s relatively low-impact crash. It is also noteworthy that approximately 8 min later, two faint signals (not shown in the figure) were observed from feasible bearings, indicating propagation predominantly in the water layer.

In the final set of plots, Fig. 6 , signal dispersion is most pronounced. Panels (a), (b), and (c) of Fig. 6 present signal recordings associated with aircraft incidents involving AB Aviation Flight 1103 (recorded at H08S), Air Asia Flight 8501 (recorded at H01W), and Asiana Flight 991 (recorded at both H11N and H11S). Nevertheless, the differences in the mode of crash are rather pronounced. The signal in Fig. 6 b is distorted with noise (possibly due to airgun shots, as shown in Figs. 3(a) and 4(c) of Ref. 6 ), and the magnitude of the pressure induced by the aircraft crash is not clear. AB 1103 is a small aeroplane, whereas Asiana 991 (a heavy cargo aeroplane) and Air Asia 8501 had much more impactful crashes.

In summary, the analysis of the recorded signals from past aircraft incidents reveals three different types of acoustic signature. The first type is characterised by distinct and rapid signals following energetic impacts, whereas the second type includes signals with a longer impact time. The third type involves signals with higher uncertainty, due to multiple factors such as large distances crossed through land and unknown modes of impact (e.g., nose, main body, or wings first). However, the pressure signatures radiated during a high-energy impact were found to be observable even a few thousand kilometres away. The effect of the signal travelling through land is discussed in the following note, considering the case of the ARA San Juan Submarine.

figure 4

Spectrogram (left) and pressure time series (right) of: ( a ) F-35a; ( b ) Yemenia Flight 626; and ( c ) Sriwijaya Air Flight 182. Rectangles highlight the signals of interest.

figure 5

Spectrogram (left) and pressure time series (right) of: ( a ) Air France Flight 447; ( b ) Transair Flight 810; and (c) Lion Air Flight 904. Rectangles highlight the signals of interest, and white arrows highlight dispersion.

figure 6

Spectrogram (left) and pressure time series (right) of: ( a ) AB Aviation Flight 1103; ( b ) Air Asia Flight 8501; and ( c ) Asiana Flight 991. Rectangles highlight the signals of interest, and white arrows highlight dispersion.

figure 7

Spectrogram and pressure time series of ARA San Juan Submarine explosion recorded at stations H03S (top), H04S (middle), and H10N (bottom). White arrow highlights dispersion.

figure 8

Spectrogram and pressure time series of a calibration grenade recorded at stations H03S (top), H04S (middle), and H10N (bottom). White arrow highlights dispersion.

A note on ARA San Juan Submarine

The ARA San Juan Submarine case is a good example demonstrating that acoustic signals radiating in water can couple to the elastic seabed when approaching shallow water, travel large distances through land, and then return to the sea, where they can still be recorded at hydrophones. Specifically, Fig. 7 presents the spectrograms and pressure time series of acoustic waves generated by explosions in the submarine that were observed not only at stations H10 and H04, which have a direct water connection with the location of the submarine (in agreement with Fig. 2 of Ref. 4 ), but also at station H03, which is separated from it by 2000 km of land, as illustrated in Fig. 3 . This observation highlights the effectiveness of employing hydrophones to capture events in the ocean beyond current convention, namely when signals cross land. However, the higher frequencies of the signal attenuate due to scattering and absorption in the elastic layer when a critical depth is reached. This observation is in agreement with acoustic-gravity wave theory 7 , 8 : while higher acoustic modes might be absorbed in the elastic layer when a critical water height is reached, the leading mode couples with the elastic seabed, turns into a Scholte wave, then a Rayleigh-type wave, and thus preserves most of the energy. Notably, even though H03S is much closer to the event location than the other two stations, i.e., about one third of the distance to H10N and one quarter of the distance to H04S, the pressure signal recorded at H03S is only 21% of that at H04S and 11% of that at H10N. Thus, radiation over large distances through land and complex topography can significantly reduce the pressure signature.

Moreover, the propagation path that is shorter in terms of duration, not necessarily the shortest in distance, experiences less attenuation during transmission, resulting in a greater concentration of acoustic energy. The cyan curve in Fig. 3 qualitatively indicates the path the acoustic signals would travel through land, which minimises the duration of the journey by favouring travel in the solid earth (much higher speed) over the water. Specifically, with an average speed of 3550 m/s over roughly 1900 km \(\pm 100\) km, and an average speed of 1480 m/s over about 330 km \(\pm 100\) km, it takes a total time of between 12 and 13.5 min to reach H03S. This choice of shortest-time path results in a different relative bearing, i.e., 105.6\(^{\circ }\), which is in agreement with the actual arrival direction at H03S (with an uncertainty of 0.5\(^{\circ }\)), as opposed to the expected 135.5\(^{\circ }\) in the case of the shortest distance, or 173\(^{\circ }\) considering the acoustic signals travelling only in the water.
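As a quick numerical check of the shortest-time argument, the following sketch (Python; the leg lengths and average speeds are the nominal values quoted above) reproduces the 12–13.5 min estimate by combining the solid-earth and water legs and shifting the \(\pm 100\) km between them.

```python
# Back-of-envelope travel-time estimate for the land-coupled path to H03S,
# using the average speeds and leg lengths quoted in the text (nominal values).
def travel_time_minutes(land_km, water_km, v_land=3550.0, v_water=1480.0):
    """Total propagation time in minutes for a solid-earth leg plus a water leg."""
    return (land_km * 1e3 / v_land + water_km * 1e3 / v_water) / 60.0

nominal = travel_time_minutes(1900, 330)   # about 12.6 min
# +/-100 km traded between the two legs (total path length roughly fixed)
fast = travel_time_minutes(2000, 230)      # more distance in the faster crust
slow = travel_time_minutes(1800, 430)      # more distance in the slower water

print(f"nominal: {nominal:.1f} min, range: {fast:.1f}-{slow:.1f} min")
```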

Further support for the argument above can be found by examining an event that took place in close proximity to the site of the submarine’s explosion. On December 1, 2017, two weeks after the ARA San Juan submarine vanished, the Argentine Navy dropped a calibration grenade into the sea within the vicinity of the submarine’s last known location. Comparing the temporal and spectral features of both events helps confirm the impulsive nature of the San Juan event, as rigorously done by Ref. 4 . Here, the grenade event provides further validation (Fig. 8 ) that acoustic signals in water can travel across large distances through land before attenuating significantly, in particular for lower-frequency modes 3 , 8 . Figure 8 presents the spectrogram and pressure time series of acoustic waves generated by the grenade. As before, the signals observed at stations H10 and H04 are in agreement with Fig. 2 of Ref. 4 , and the signal recorded at station H03 matches the expected arrival time of 13 min after the explosion, with a duration of 30 s, the same frequency range, and a secondary smaller signal appearing exactly 1.5 min after the main signal, just as observed in Fig. 7 for the ARA San Juan Submarine case.

The unsolved case of MH370

Malaysia Airlines Flight 370, known as MH370, captivated the world upon its disappearance on March 8, 2014, during a journey from Kuala Lumpur to Beijing. Despite an extensive multinational search effort, the whereabouts of the aircraft and the fate of its 239 passengers and crew remain shrouded in uncertainty. To delve into the perplexing enigma surrounding MH370, attention is focused on the southern Indian Ocean, which is, arguably, associated with the final stage of the journey, searching for signals generated near the 7th arc (the last communication with the Inmarsat satellite), following the official search recommendations. In addition, data associated with the disappearance stage of the flight around the time of the last communication, i.e., in the Gulf of Thailand, was analysed in general to check for any unusual signals. No signals of interest were found to be associated with the early disappearance stage (see supplementary material S1 ).

The final search stage: southern Indian Ocean

According to the official search team, data that emerged later from the International Maritime Satellite Organization (Inmarsat), alongside calculations by Boeing search teams, led to the conclusion that the MH370 crash impact has to be in the Indian Ocean in the vicinity of the last handshake between Inmarsat and MH370 at 00:19:29 UTC, known as the 7th arc 9 . Moreover, it has been concluded that during the last minutes of the flight the aircraft was on autopilot at a regular altitude of 36,000 ft when the two engines subsequently flamed out after the fuel was exhausted, which caused the aircraft to go through a spiral stall. Consequently, the crash impact was highly energetic, as also supported by evidence from debris 9 . Bayesian methods employed for analysing potential routes support these findings as well 10 . Given all this information, not only are the time window and bearing being sought very narrow, but one also expects a significant signal to appear on both H01W and H08S (note that H04 was not operational at the time, and did not become operational until 2017). The distance from the 7th arc to H01W is about 1600 km, which is less than three quarters of the distance of Air France Flight 447 relative to H10S (the signal-to-hydrophone distance analysed); and the distance to H08S is around 3700 km, still within the average distance of the other studied cases.

A number of acoustic signals were analysed previously in several studies, in particular Refs. 11 , 12 . One of these, a signal of interest with a bearing of 301.4 \(\pm 0.4^{\circ }\) relative to H01W, was recorded at 01:34:40 UTC on 8 March 2014. A later study by Refs. 3 , 13 suggested that the signal is at a distance of 1900 km ( \(\pm 200\) km) from H01W, centred at \(-23.662^{\circ }\) , \(96.676^{\circ }\) , with the source generated between 01:11 and 01:16 UTC on 8 March 2014. However, following the recommendations of the official investigation, the signal of interest associated with the impact has to occur around the 7th arc, both in time and location. Around 00:30 UTC there are a few repeating signals (Fig. 9 a), followed by a period with no observed signals (Fig. 9 b) around the requested time window. After that, only two signals were observed within the requested time window, as shown in Fig. 9 c. The first signal has a bearing of \(57^{\circ }\) and thus cannot be related to MH370, but the second, which was recorded at 00:52 UTC at H01W, arrives from a plausible direction, i.e., \(306^{\circ }\) . Notably, the transect from the 7th arc to H01W has no bathymetric irregularities (see Fig. 15 c in “ Methods ”), and thus minimum loss of energy is expected due to scattering. Confirming the same signal of interest at H08S would be a challenging task, if possible at all, due to the very high noise distortion (most likely airguns) and the significant bathymetric rise midway along the transect, which is expected to significantly reduce the transmission of the signal. The expected pressure amplitude at H08S, if it had a value similar to that recorded at H01W, would be 0.5 [Pa], whereas the airgun noise ranges from \(2-4\) [Pa], which is 4–8 times stronger (Fig. 9 d–e).

In summary, adhering to the official narrative, it can be asserted with confidence that if MH370 did crash in the vicinity of the 7th arc, then hydroacoustic signals should have been detected at least at H01W. Within the specified time window, only a single signal of interest at a bearing of \(306^{\circ }\) has been identified.

figure 9

Spectrogram and pressure time series of signals observed at H08S and H01W on 8 March 2014 shortly after the last handshake (7th arc) of MH370. The only signal of interest is found in rectangle (2) of panel ( c ), which has a bearing of 306 \(^{\circ }\) relative to H01W; signal in rectangle (1) has a bearing of 57 \(^{\circ }\) . For a comprehensive list of signals shortly after the 7th arc see Table 1 .

figure 10

Location of the CTBTO’s hydroacoustic station H01W (white triangle); Inmarsat 6th arc at 00:10:59 UTC (yellow curve); Inmarsat 7th arc at 00:19:29 UTC (white curve). The possible impact location of the source of signal (2) of Fig. 9 c is presented in a gradient rectangle (red for earlier impact, and blue for later impact). Suggested future controlled explosions of comparable energy to MH370 are illustrated by yellow explosion symbol.

Discussion and practical suggestions

The limited number of analysed signals from aircraft crashes in this study hinders a comprehensive examination of acoustic signatures arising from various modes of aircraft crashes occurring under complex conditions, including diverse topography and the presence of natural and manmade noise sources. Nonetheless, it is implausible that a significant crash of an aircraft on the ocean surface would fail to generate a discernible pressure signature, even at distant hydrophones, let alone at those in closer proximity. The official investigation of MH370 supports the conclusion that a substantial crash should have taken place near the 7th arc 9 . In this case, the acoustic signal would have travelled approximately 1600 km to reach the nearest hydrophone station (H01W), a considerably shorter distance than any other acoustic signal analysed here. An aircraft with a mass of 200 tons impacting the water surface at a velocity of 200 m/s possesses a kinetic energy of 4 GJ, equivalent to 956 kg of TNT or an earthquake of magnitude 3.2. Considering half that velocity, the released energy would still be large, i.e., 1 GJ, equivalent to 239 kg of TNT or an earthquake of magnitude 2.8. Even with a significantly lower impact velocity of 30 m/s, the resulting energy release would be 89 MJ, equivalent to 21 kg of TNT or an earthquake of magnitude 2.1. Evidently, such earthquakes can be detected by distant hydrophones, just as the M 2.7 earthquake was (see Fig. 2 for the location of the earthquake on the map, and Fig. 17 in “ Methods ” for the bearing). Therefore, it is highly unlikely for MH370 to have crashed near the 7th arc without leaving a discernible acoustic signature. Within the constraints imposed by the official search team, only one signal (bearing \(306^{\circ }\) ) has been identified at H01W, suggesting a potential impact location (see “ Methods ”), as illustrated in Fig. 10 . However, this same signal was not observed at H08S. Whether the signal at bearing \(306^{\circ }\) and the later signal at \(301^{\circ }\) are related remains subject to future analysis.
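To make the energy comparison reproducible, the sketch below recomputes the figures quoted above. The TNT equivalence assumes 4.184 MJ per kg, and the magnitude uses a standard Gutenberg–Richter-type energy relation, \(\log _{10} E = 1.5M + 4.8\); the paper does not state which relation it used, so this choice is an assumption that happens to reproduce the quoted values.

```python
import math

TNT_J_PER_KG = 4.184e6  # energy of 1 kg of TNT

def impact_energy(mass_kg, speed_ms):
    """Kinetic energy of the impacting aircraft in joules."""
    return 0.5 * mass_kg * speed_ms**2

def tnt_equivalent_kg(energy_j):
    return energy_j / TNT_J_PER_KG

def equivalent_magnitude(energy_j):
    # Gutenberg-Richter-type energy-magnitude relation: log10(E) = 1.5*M + 4.8
    return (math.log10(energy_j) - 4.8) / 1.5

for v in (200.0, 100.0, 30.0):           # impact velocities considered in the text
    E = impact_energy(200e3, v)          # 200-ton aircraft
    print(f"v = {v:5.1f} m/s -> E = {E/1e9:6.3f} GJ, "
          f"TNT = {tnt_equivalent_kg(E):7.1f} kg, M = {equivalent_magnitude(E):.1f}")
```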

A practical suggestion to help resolve the ongoing debate regarding the detectability of the acoustic signal radiated by MH370 is to conduct controlled explosions along the 7th arc (similarly to ARA San Juan Submarine 4 , 14 ), containing an equivalent amount of energy believed to be associated with MH370. If the received signals from the explosions exhibit comparable pressure amplitudes to that of the signal of interest, that would support the notion that the identified signal should be a focal point in future search efforts. Conversely, if the signals received on both H01W and H08S are significantly stronger than the signal of interest discussed here, it would necessitate further analysis of the signals from both stations. This may also warrant a reassessment of the data that led to the determination of the 7th arc, allowing for the consideration of new scenarios consistent with the updated findings. Additionally, variations in the strength of recorded signals could offer insights into the underlying conditions influencing such variability, potentially enabling a more refined localisation of potential impact areas based on specific terrains and associated transects.

It is important to note that the present study does not definitively pinpoint the impact location of MH370. However, it significantly enhances our understanding of the acoustic signals associated with aircraft crashes at sea in general. It narrows down the signals of interest and their respective locations, and importantly, it highlights the effectiveness of hydrophones in classifying and detecting signals even after they have travelled long distances, including over land. It is hoped that these practical suggestions will inspire relevant authorities to take the next significant steps towards unraveling the greatest aviation mystery of all time.

Analysis approach

Conducting a detailed analysis of aeroplane crashes in the ocean is challenging due to the limited number of occurrences. Despite the scarcity of data, the unique and energetic nature of such events, coupled with a relatively low frequency of occurrence of signals with similar characteristics, presents an opportunity for insightful analysis. For instance, the highest rate of relevant signals observed in the analysis was four signals in 10 min, in Fig. 9 a, highlighting a constrained set of possibilities. Considering the 10-min window and allowing for a 2 \(^{\circ }\) uncertainty in the bearing (which is much larger than the maximum deviation), the probability of a signal occurring by chance, unrelated to the event, is less than 3%. This high level of confidence (97%) in the signal’s relevance to the event arises from the limited occurrences and distinctive characteristics of the signals. Even with a broader 20-min window, featuring eight signals and a more lenient 5 \(^{\circ }\) bearing uncertainty, the probability of the signal being unrelated to the event remains below 12%. However, it is acknowledged that the disadvantage of this approach emerges when the impact location is unknown, leading to an increased degree of bearing uncertainty.
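The probability model behind these percentages is not spelled out; a simple estimate consistent with the quoted bounds assumes that unrelated signals arrive at uniformly random bearings, so the chance that at least one of the \(n\) signals in the window falls inside the accepted bearing tolerance is roughly \(n\) times the fraction of the compass that the tolerance covers. The sketch below implements this back-of-envelope estimate (an illustrative assumption, not the author's stated method).

```python
def chance_of_unrelated_match(n_signals, bearing_window_deg):
    """Probability that at least one of n unrelated signals falls, by chance,
    inside a bearing window of the given width (uniform-bearing assumption)."""
    p_single = bearing_window_deg / 360.0
    return 1.0 - (1.0 - p_single) ** n_signals

# 4 signals in a 10-min window, ~2 degree tolerance  -> ~2%  (below the 3% quoted)
print(f"{chance_of_unrelated_match(4, 2):.3f}")
# 8 signals in a 20-min window, ~5 degree tolerance  -> ~11% (below the 12% quoted)
print(f"{chance_of_unrelated_match(8, 5):.3f}")
```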

It is worth noting that in the calculation of the acoustic speed, although average speeds can be calculated with precision, the methodology employed here takes a broader perspective. Rather than focusing solely on accuracy, the approach involves exploring the entire spectrum of possibilities, even if it means incorporating seemingly irrelevant signals. This ensures that no potential signal of interest is inadvertently overlooked. Therefore, prior to diving into the sufficient conditions to correlate the signal with the incident, the necessary conditions are first examined. This approach is chosen due to the limited number of signals that meet the necessary conditions as described above.

Signal processing and bearing calculation

Signals from past aircraft crashes were presented around the middle of a 10 min time window, to allow visual comparison. Spectrograms indicate the frequency band of the noise as opposed to the frequency bands of the signals. Most processed signals show a significant amount of noise in the 0–4 Hz band. Noting that the broadband frequency content is typical for the propagation of acoustic-gravity waves 15 that can travel long distances from the event location, a high-pass Butterworth IIR filter was used (in addition to the CTBTO high-pass filter) below 5 Hz. Moreover, a 2–40 [Hz] band-pass filter was applied in general. Filtering noise, enhancing the signal appearance and stabilising the time-waveform can be crucial for identifying the bearing of the signal of interest, in particular when the source is an impulse that generates short signals of a few seconds in length.
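A minimal sketch of the preprocessing described above, using SciPy and the 250 Hz hydrophone sampling rate; the filter orders are not stated in the text, so those used here are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

FS = 250.0  # hydrophone sampling rate [Hz]

def preprocess(pressure, fs=FS):
    """High-pass below 5 Hz, then band-pass 2-40 Hz (illustrative filter orders)."""
    sos_hp = butter(4, 5.0, btype="highpass", fs=fs, output="sos")
    sos_bp = butter(4, [2.0, 40.0], btype="bandpass", fs=fs, output="sos")
    x = sosfiltfilt(sos_hp, pressure)   # suppress low-frequency background noise
    return sosfiltfilt(sos_bp, x)       # keep the broadband acoustic-gravity band

# Example with synthetic data standing in for a 10-min hydrophone record
t = np.arange(0, 600, 1 / FS)
raw = np.random.randn(t.size) + 5 * np.sin(2 * np.pi * 0.5 * t)  # noise + 0.5 Hz drift
clean = preprocess(raw)
```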

Each hydrophone station has three hydrophones configured in a triangular shape, spaced by around 2 km. Exact locations of all hydrophones can be obtained from the CTBTO. In information theory, entropy serves as a measure quantifying the amount of information encapsulated within a signal. Let \(P(x)\) denote the probability distribution function of the discrete-time signal \(x\); then the entropy is defined as

\(H(x) = -\sum _{i} P(x_i)\,\log P(x_i). \quad (1)\)

Additionally, log energy entropy 16 , represented as \(H_L\), is defined as:

\(H_L(x) = \sum _{i} \log \left( x_i^{2}\right). \quad (2)\)

Windowed entropy calculations can be employed to identify transient signals against a noisy background. This is predicated on the assumption that the randomness measure of the signal will exhibit changes when the nature of the signal undergoes alterations. Entropy values have been computed across a window size of a few seconds and a step size of 0.5 s. As depicted in Figs. 11 and 12 , peaks in the entropy trace emerge where transient signals are detected. A threshold of \(2.3 \times 10^4\) has been established; all peaks within a 2-second window across all signals are considered for subsequent bearing calculations. Note that the frequencies of the signal of interest lie within a frequency band, say below 20 Hz, which is an order of magnitude lower than the sampling rate of the hydrophones (250 Hz); thus, related errors are expected to be small 17 .
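The sketch below illustrates the windowed log-energy entropy detector described above (window of a few seconds, 0.5 s step, fixed threshold). The exact window length and the small epsilon guard against log(0) are illustrative assumptions.

```python
import numpy as np

FS = 250.0  # samples per second

def log_energy_entropy(x, eps=1e-12):
    """H_L = sum(log(x_i^2)); eps avoids log(0) for silent samples."""
    return np.sum(np.log(x.astype(float) ** 2 + eps))

def transient_peaks(x, fs=FS, win_s=4.0, step_s=0.5, threshold=2.3e4):
    """Return window start times whose entropy exceeds the detection threshold."""
    win, step = int(win_s * fs), int(step_s * fs)
    starts = range(0, len(x) - win, step)
    h = np.array([log_energy_entropy(x[s:s + win]) for s in starts])
    return np.array([s / fs for s, v in zip(starts, h) if v > threshold])

# peak_times = transient_peaks(clean)   # 'clean' from the filtering sketch above
```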

Following the isolation of signals of interest, bearing determination is carried out using time-of-arrival-based triangulation. The time of arrival is estimated by identifying the maximum of the cross-correlation function across channel pairs, enabling the derivation of pairwise time of arrival differences \(t_i - t_j = \Delta _{ij}\) . Geometric parameters for the array are derived from latitude and longitude position data corresponding to the hydrophones. Assuming a constant arrival velocity v an expression of the geometric parameters is given by 13

with parameters defined in Fig. 13 , with no loss of generality. Using a 99.5% confidence interval, the bearing calculations are considered to be accurate to \(\pm 0.4^{\circ }\) (see Ref. 13 for more details).
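The exact expression and its parameters follow Ref. 13 and Fig. 13 and are not reproduced here; the sketch below shows a generic version of the same idea: estimate pairwise arrival-time differences from cross-correlation peaks, fit a plane-wave slowness vector to the three-hydrophone geometry by least squares, and convert it to a back-bearing. Coordinates, speed and function names are illustrative assumptions.

```python
import numpy as np

FS = 250.0

def pairwise_delay(a, b, fs=FS):
    """Arrival-time difference t_a - t_b (s) estimated from the cross-correlation peak."""
    corr = np.correlate(a - a.mean(), b - b.mean(), mode="full")
    lag = np.argmax(corr) - (len(b) - 1)
    return lag / fs

def plane_wave_bearing(positions_xy, delays, speed=1480.0):
    """Least-squares plane-wave fit.

    positions_xy : (3, 2) array of hydrophone coordinates in metres (local east, north)
    delays       : dict {(i, j): t_i - t_j} of pairwise arrival-time differences
    Returns the back-bearing (degrees clockwise from north) towards the source.
    """
    rows, rhs = [], []
    for (i, j), dt in delays.items():
        rows.append(positions_xy[i] - positions_xy[j])  # baseline vector r_i - r_j
        rhs.append(dt)
    # t_i - t_j = (r_i - r_j) . s, with slowness s = d / speed (d = propagation direction)
    s, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    d = s * speed                      # propagation direction (away from the source)
    back = -d / np.linalg.norm(d)      # unit vector pointing back towards the source
    return float(np.degrees(np.arctan2(back[0], back[1])) % 360.0)
```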

Figure 11 illustrates how a high noise-to-signal ratio can prevent calculating the bearing of the signal of interest. In particular, the figure concerns bearing calculations of signals from airgun shots in Fig. S1 c. Clearly, the airgun bearing of \(210^{\circ }\) (highlighted in magenta) dominates the picture, which prevents calculating the bearing of the signal of interest at 19:05 UTC. On the other hand, Fig. 12 provides a successful capture of the bearing of the signal of interest, \(269^{\circ }\) (highlighted in magenta), at 19:41 UTC.

figure 11

Bearing calculations of signals on Fig. S1 c. From top to bottom: recordings from three channels at station H08S between 19:00 and 19:10 UTC on March 7th 2014. Windowed entropy shows peaks (black triangles) where a transient signal is found. Each peak defines an event, for which bearing is calculated from differences in arrival times in the bearing subplot. The map demonstrates the dominating bearing from airgun signals, \(210^{\circ }\) (highlighted in magenta), which prevent calculating the bearing of the signal of interest at 19:05 UTC.

figure 12

Bearing calculations of signals on Fig. S1 e. From top to bottom: recordings from three channels at station H11S between 19:41 and 19:51 UTC on March 7th 2014. Windowed entropy shows peaks (black triangles) where a transient signal is found. Each peak defines an event, for which bearing is calculated from differences in arrival times in the bearing subplot. The map illustrates the direction of the calculated bearings. The bearing of the signal of interest at 19:46 UTC is successfully calculated as \(269^{\circ }\) (highlighted in magenta).

figure 13

Geometry of the three-hydrophone array and bearing calculation. Credit: Fig. 9 of Ref. 13 .

Possible impact location near the 7th Arc

A comprehensive list of signals potentially recorded near the 7th arc shortly after the last handshake is given in Table 1 . The only major signal has a bearing of \(306.18^{\circ }\) , and it was recorded at 00:54 UTC. In that direction the distance of the 7th arc from H01W is \(R_7=1586\) km. For an acoustic signal travelling with a speed of \(c=1500 \pm 70\) m/s, the required time to travel from the 7th arc to H01W is \(17.45\pm 0.65\) min. The time difference between the 7th arc and the time the signal was recorded is 34.5 min. Thus, if the signal is associated with the aircraft crash, then there is an extra time of \(\Delta t = 17.02\pm 0.65\) min. Therefore, if the airplane was travelling at an average velocity \(\bar{v}\) (defined positive in the direction of the bearing), it will have travelled away from the 7th arc a distance \(\Delta \bar{r} = \bar{v}\,\Delta T\) during a time \(\Delta T\) aloft, following

\(\Delta t = \Delta T + \Delta \bar{r}/c = \Delta T\left(1 + \bar{v}/c\right). \quad (4)\)

Figure 14 shows the possible combinations of distance and time that would have been travelled (Eq. 4 ) as a function of the aircraft average velocity. Note that both \(\Delta \bar{r}\) and \(\bar{v}\) are defined positive in the bearing direction, i.e., from H01W towards and beyond the 7th arc. Hence, for this event to be associated with MH370, the aircraft must have remained in the air for at least 10 min prior to impacting the water surface. Even for a considerably lower average velocity of 450 m/s, it would require the aircraft to remain in the air for about 13 min after the last handshake and to travel an extra distance of about 350 km. Finally, adding a tolerance of 0.5 \(^{\circ }\) in the bearing calculations provides the possible impact scenario depicted in Fig. 10 . A later impact (blue area) is associated with a lower average velocity travelled in the direction away from H01W, and thus a closer distance to the 7th arc, whereas an earlier impact is associated with spending a shorter time in the air at a higher average velocity, resulting in travelling further away from the 7th arc, away from H01W. It is possible that the aircraft crashed on the H01W side of the 7th arc, but that would require the aircraft to remain in the air for a much longer time (above 17 min) before finally impacting the water.
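A short numerical sketch of Eq. (4) as used above: given the extra time \(\Delta t\) and a nominal sound speed, the time aloft and the distance flown beyond the 7th arc follow directly from the assumed average aircraft velocity; the 450 m/s row reproduces the roughly 13 min and 350 km quoted above.

```python
DT_EXTRA_MIN = 17.02   # extra time relative to acoustic travel from the 7th arc [min]
C_SOUND = 1500.0       # nominal sound speed in water [m/s]

def time_and_distance(v_avg, dt_extra_min=DT_EXTRA_MIN, c=C_SOUND):
    """Time in the air (min) and distance flown beyond the 7th arc (km)
    for an average aircraft velocity v_avg (m/s), positive along the bearing."""
    dT_min = dt_extra_min / (1.0 + v_avg / c)   # Eq. (4): dt = dT * (1 + v/c)
    dr_km = v_avg * dT_min * 60.0 / 1e3
    return dT_min, dr_km

for v in (150.0, 300.0, 450.0):
    dT, dr = time_and_distance(v)
    print(f"v = {v:5.0f} m/s -> {dT:4.1f} min aloft, {dr:5.0f} km beyond the arc")
```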

It is worth noting that the signal with bearing \(268.24^{\circ }\) recorded at 00:39:02 UTC is the only signal with \(\Delta t = 0\); thus, if it originated at the exact time of the 7th arc, it would also correspond to the exact location of the 7th arc. However, this signal is faint and would require further analysis, perhaps alongside the suggested field experiment.

figure 14

Distance (top) and time (bottom) that should have been travelled by MH370 after the last handshake (7th arc), as a function of the average velocity (defined positive in the bearing direction).

Transmission losses due to changing bathymetry

Ref. 4 studied transmission losses in the case of the ARA San Juan Submarine. The authors reported a reduction of 20 dB in the transmitted signal due to scattering by the Rio Grande Rise along the H10N path. They asserted that this substantial bathymetric rise accounted for the higher recorded levels at H04S compared to H10N, noting that H04S is 8000 km from the incident location, approximately 2000 km further than the path to H10N. The transects of these paths are shown in Fig. 15 a.

Examining the transect of the F-35a case reveals a clear path, suggesting minimal losses due to bathymetric scattering and potentially explaining the distinct signal reception (see Fig. 15 b). In the case of the 7th arc, the path to H01W is very similar to that of the F-35a, though the distance is about half as long. Consequently, one would anticipate a clear, distinct signal with only minor losses due to a slight bathymetric rise near H01W (Fig. 15 c). On the other hand, the path from the 7th arc to H08S presents a bathymetric barrier of comparable size half way along the route, resembling the scenario of the ARA San Juan Submarine. Therefore, in the case of MH370, which arguably had a significantly less energetic impact compared to the Argentinian submarine explosion, observing the signal at H08S becomes a challenging task, if feasible at all, due to transmission losses induced by bathymetry.

figure 15

Transects of events to corresponding hydrophones. ( a ) ARA San Juan Submarine: a large barrier (Rio Grande Rise) half way to H10N; smaller barriers observed in the direction of H04S. ( b ) F-35a: clear path with no barriers observed. ( c ) Bearing 306 \(^{\circ }\) : clear path with a minor barrier towards the end in the direction of H01W; a large barrier half way to H08S.

Signals at the early stage of flight MH370

The main signal of interest at the early stage of the flight was recorded at H11S, with a bearing of \(269^{\circ }\), as shown in Fig. 12 . Signals that appear to be generated by airguns are also observed, at a bearing of \(265^{\circ }\) relative to H11S. In addition, there are two earthquakes that occurred within time windows and at locations relevant to the early stage of flight MH370. The first is the M 4.8 earthquake 9 km south of Yōkaichiba, Japan, which occurred at 18:34:20 UTC on 7 March 2014. The earthquake epicentre is 3200 km away from H11S, which is a 36 min travelling distance for acoustic waves propagating at 1480 m/s. The bearing of the earthquake is \(311.83^{\circ }\) as calculated from the geographic locations, which matches well the bearing calculated from the signal, \(311.827^{\circ }\) (see Fig. 16 ), with almost no deviation. The second is the M 2.7 earthquake 85 km NE of Sinabang, Indonesia, which occurred at 18:55:12 UTC on 7 March 2014. The earthquake epicentre is 3000 km away from H08S, which is a 34 min travelling distance for acoustic waves propagating at 1480 m/s. The bearing of the earthquake is \(67.12^{\circ }\) as calculated from the geographic locations. Due to the extremely high noise in the direction of \(210^{\circ }\), and since most of the energy of the earthquake signal appears to be between 2.5 and 6 Hz, filtering allows calculating a bearing of \(63.7^{\circ }\) with a relatively large deviation of \(3.3^{\circ }\) (see Fig. 17 ), which is about 140 km off the epicentre.

figure 16

Bearing calculations of the earthquake M 4.8–9 km S of Yōkaichiba, Japan, that occurred at 18:34:20 UTC on 7 March 2014 (Fig. 1 ). From top to bottom: recordings from three channels at station H11S between 19:05 and 19:15 UTC on March 7th 2014. Windowed entropy shows peaks (black triangles) where a transient signal is found. Each peak defines an event, for which the bearing is calculated from differences in arrival times in the bearing subplot. The map demonstrates a perfect match with the direction of the earthquake, \(312^{\circ }\).

figure 17

Bearing calculation of \(210^{\circ }\) for the background noise, and of \(63.7^{\circ }\) for the earthquake M 2.7–85 km NE of Sinabang, Indonesia, that occurred at 18:55:12 UTC on 7 March 2014 (Fig. 2 ). From top to bottom: spectrogram and pressure recordings from three channels at station H08S between 19:27 and 19:29 UTC. Windowed entropy shows peaks (black triangles) where a transient signal is found. Each peak defines an event, for which the bearing is calculated from differences in arrival times in the bearing subplot.

Data availability

The hydroacoustic data analysed in this paper were obtained from the CTBTO and cannot be shared by the author with third parties, though they can be requested directly from the CTBTO. Access to the IMS network’s data from the hydroacoustic stations is available to National Data Centres of the CTBTO and can be made available to others on request through the virtual Data Exploitation Center (vDEC) at https://www.ctbto.org/specials/vdec. Other data requests can be made by direct inquiries to the author.

1. Kosuga, M. Localization of T-wave energy on land revealed by a dense seismic network in Japan. Geophys. J. Int. 187(1), 338–354. https://doi.org/10.1111/j.1365-246X.2011.05143.x (2011).
2. Oliveira, T. C. et al. Megameter propagation and correlation of T-waves from Kermadec Trench and Islands. Front. Mar. Sci. 9, 1009013. https://doi.org/10.3389/fmars.2022.1009013 (2022).
3. Kadri, U. Effect of sea-bottom elasticity on the propagation of acoustic-gravity waves from impacting objects. Sci. Rep. 9, 919. https://doi.org/10.1038/s41598-018-37626-z (2019).
4. Vergoz, J. et al. Analysis of hydroacoustic signals associated to the loss of the Argentinian ARA San Juan submarine. Pure Appl. Geophys. 178, 2527–2556. https://doi.org/10.1007/s00024-020-02625-7 (2021).
5. Metz, D., Obana, K. & Fukao, Y. Remote hydroacoustic detection of an airplane crash. Pure Appl. Geophys. 180, 1343–1351. https://doi.org/10.1007/s00024-022-03117-6 (2023).
6. Schwardt, M. et al. Natural and anthropogenic sources of seismic, hydroacoustic, and infrasonic waves: Waveforms and spectral characteristics (and their applicability for sensor calibration). Surv. Geophys. 43, 1265–1361. https://doi.org/10.1007/s10712-022-09713-4 (2022).
7. Eyov, E., Klar, A., Kadri, U. & Stiassnie, M. Progressive waves in a compressible-ocean with an elastic bottom. Wave Motion 50, 929–939. https://doi.org/10.1016/j.wavemoti.2013.03.003 (2013).
8. Williams, B. & Kadri, U. On the propagation of acoustic-gravity waves due to a slender rupture in an elastic seabed. J. Fluid Mech. 956, A6. https://doi.org/10.1017/jfm.2022.1091 (2023).
9. AE-2014-054: Technical assistance to the Department of Civil Aviation Malaysia in support of missing Malaysia Airlines flight MH370 on 7 March 2014 UTC. Available online at https://www.atsb.gov.au/mh370-pages/updates/reports
10. Davey, S., Gordon, N., Holland, I., Rutten, M. & Williams, J. Bayesian Methods in the Search for MH370, SpringerBriefs in Electrical and Computer Engineering, 114 (Springer Open, 2016).
11. Dall’Osto, D. R. & Dahl, P. H. Analysis of hydroacoustic signals from Diego Garcia South CTBTO station in support of Curtin University analysis of Cape Leeuwin CTBTO data regarding possible location of MH370 crash (University of Washington, Seattle, Applied Physics Laboratory, 2014).
12. Duncan, A. J., Gavrilov, A. N. & McCauley, R. D. Analysis of Low Frequency Underwater Acoustic Signals Possibly Related to the Loss of Malaysian Airlines Flight MH370. Centre for Marine Science and Technology Report 2014-30 (2014).
13. Kadri, U., Crivelli, D., Parsons, W., Colbourne, B. & Ryan, A. Rewinding the waves: Tracking underwater signals to their source. Sci. Rep. 7, 13949. https://doi.org/10.1038/s41598-017-14177-3 (2017).
14. Dall’Osto, D. R. Source triangulation utilizing three-dimensional arrivals: Application to the search for the ARA San Juan submarine. J. Acoust. Soc. Am. 146(1), 2104. https://doi.org/10.1121/1.5125251 (2019).
15. Kadri, U. & Stiassnie, M. Acoustic-gravity waves interacting with the shelf break. J. Geophys. Res. Oceans 117. https://doi.org/10.1029/2011JC007674 (2012).
16. Coifman, R. R. & Wickerhauser, M. V. Entropy-based algorithms for best basis selection. IEEE Trans. Inf. Theory 38, 713–718. https://doi.org/10.1109/18.119732 (1992).
17. Céspedes, I., Huang, Y., Ophir, J. & Spratt, S. Methods for estimation of subsample time delays of digitized echo signals. Ultrason. Imaging 17(1), 42–171. https://doi.org/10.1006/uimg.1995.1007 (1995).


Acknowledgements

The author is grateful to Dr. D. Crivelli for a useful discussion and help in the bearing code.

The views expressed in this study are those of the authors and do not necessarily reflect those of the Preparatory Commission for the CTBTO.

Author information

Authors and Affiliations

School of Mathematics, Cardiff University, Cardiff, CF24 4AG, UK

Usama Kadri

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Usama Kadri .

Ethics declarations

Competing interests

The author declares no competing interests.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

Cite this article

Kadri, U. Underwater acoustic analysis reveals unique pressure signals associated with aircraft crashes in the sea: revisiting MH370. Sci Rep 14 , 10102 (2024). https://doi.org/10.1038/s41598-024-60529-1

Download citation

Received : 16 October 2023

Accepted : 24 April 2024

Published : 02 May 2024

DOI : https://doi.org/10.1038/s41598-024-60529-1


When Online Content Disappears

38% of webpages that existed in 2013 are no longer accessible a decade later

Table of contents

  • Webpages from the last decade
  • Links on government websites
  • Links on news websites
  • Reference links on Wikipedia
  • Posts on Twitter
  • Acknowledgments
  • Collection and analysis of Twitter data
  • Data collection for World Wide Web websites, government websites and news websites
  • Data collection for Wikipedia source links
  • Evaluating the status of pages and links
  • Definition of links

Pew Research Center conducted the analysis to examine how often online content that once existed becomes inaccessible. One part of the study looks at a representative sample of webpages that existed over the past decade to see how many are still accessible today. For this analysis, we collected a sample of pages from the Common Crawl web repository for each year from 2013 to 2023. We then tried to access those pages to see how many still exist.

A second part of the study looks at the links on existing webpages to see how many of those links are still functional. We did this by collecting a large sample of pages from government websites, news websites and the online encyclopedia Wikipedia .

We identified relevant news domains using data from the audience metrics company comScore and relevant government domains (at multiple levels of government) using data from get.gov , the official administrator for the .gov domain. We collected the news and government pages via Common Crawl and the Wikipedia pages from an archive maintained by the Wikimedia Foundation . For each collection, we identified the links on those pages and followed them to their destination to see what share of those links point to sites that are no longer accessible.

A third part of the study looks at how often individual posts on social media sites are deleted or otherwise removed from public view. We did this by collecting a large sample of public tweets on the social media platform X (then known as Twitter) in real time using the Twitter Streaming API. We then tracked the status of those tweets for a period of three months using the Twitter Search API to monitor how many were still publicly available. Refer to the report methodology for more details.

The internet is an unimaginably vast repository of modern life, with hundreds of billions of indexed webpages. But even as users across the world rely on the web to access books, images, news articles and other resources, this content sometimes disappears from view.

A new Pew Research Center analysis shows just how fleeting online content actually is:

  • A quarter of all webpages that existed at one point between 2013 and 2023 are no longer accessible, as of October 2023. In most cases, this is because an individual page was deleted or removed on an otherwise functional website.

A line chart showing that 38% of webpages from 2013 are no longer accessible

  • For older content, this trend is even starker. Some 38% of webpages that existed in 2013 are not available today, compared with 8% of pages that existed in 2023.

This “digital decay” occurs in many different online spaces. We examined the links that appear on government and news websites, as well as in the “References” section of Wikipedia pages as of spring 2023. This analysis found that:

  • 23% of news webpages contain at least one broken link, as do 21% of webpages from government sites. News sites with a high level of site traffic and those with less are about equally likely to contain broken links. Local-level government webpages (those belonging to city governments) are especially likely to have broken links.
  • 54% of Wikipedia pages contain at least one link in their “References” section that points to a page that no longer exists.

To see how digital decay plays out on social media, we also collected a real-time sample of tweets during spring 2023 on the social media platform X (then known as Twitter) and followed them for three months. We found that:

  • Nearly one-in-five tweets are no longer publicly visible on the site just months after being posted. In 60% of these cases, the account that originally posted the tweet was made private, suspended or deleted entirely. In the other 40%, the account holder deleted the individual tweet, but the account itself still existed.
  • Certain types of tweets tend to go away more often than others. More than 40% of tweets written in Turkish or Arabic are no longer visible on the site within three months of being posted. And tweets from accounts with the default profile settings are especially likely to disappear from public view.

How this report defines inaccessible links and webpages

There are many ways of defining whether something on the internet that used to exist is now inaccessible to people trying to reach it today. For instance, “inaccessible” could mean that:

  • The page no longer exists on its host server, or the host server itself no longer exists. Someone visiting this type of page would typically receive a variation on the “404 Not Found” server error instead of the content they were looking for.
  • The page address exists but its content has been changed – sometimes dramatically – from what it was originally.
  • The page exists but certain users – such as those with blindness or other visual impairments – might find it difficult or impossible to read.

For this report, we focused on the first of these: pages that no longer exist. The other definitions of accessibility are beyond the scope of this research.

Our approach is a straightforward way of measuring whether something online is accessible or not. But even so, there is some ambiguity.

First, there are dozens of status codes indicating a problem that a user might encounter when they try to access a page. Not all of them definitively indicate whether the page is permanently defunct or just temporarily unavailable. Second, for security reasons, many sites actively try to prevent the sort of automated data collection that we used to test our full list of links.

For these reasons, we used the most conservative estimate possible for deciding whether a site was actually accessible or not. We counted pages as inaccessible only if they returned one of nine error codes that definitively indicate that the page and/or its host server no longer exist or have become nonfunctional – regardless of how they are being accessed, and by whom. The full list of error codes that we included in our definition is in the methodology.
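As an illustration of this conservative approach (not the Center's actual code), the sketch below requests each URL and counts a page as inaccessible only on a definitive error response or a dead host; the specific set of status codes shown is a placeholder, since the report's list of nine codes is given only in its methodology.

```python
import requests

# Illustrative set of "definitively gone" responses -- the report's actual list of
# nine codes is documented in its methodology and may differ from this example.
DEFINITIVE_ERRORS = {404, 410, 500, 502, 503}

def is_inaccessible(url, timeout=10):
    """Conservative check: only a definitive error (or a dead host) counts as gone."""
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.exceptions.ConnectionError:
        return True            # host no longer resolves or refuses connections
    except requests.exceptions.RequestException:
        return False           # timeouts, TLS issues, etc. are treated as accessible
    return resp.status_code in DEFINITIVE_ERRORS

# share_broken = sum(is_inaccessible(u) for u in sampled_urls) / len(sampled_urls)
```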

Here are some of the findings from our analysis of digital decay in various online spaces.

To conduct this part of our analysis, we collected a random sample of just under 1 million webpages from the archives of Common Crawl , an internet archive service that periodically collects snapshots of the internet as it exists at different points in time. We sampled pages collected by Common Crawl each year from 2013 through 2023 (approximately 90,000 pages per year) and checked to see if those pages still exist today.

We found that 25% of all the pages we collected from 2013 through 2023 were no longer accessible as of October 2023. This figure is the sum of two different types of broken pages: 16% of pages are individually inaccessible but come from an otherwise functional root-level domain; the other 9% are inaccessible because their entire root domain is no longer functional.

Not surprisingly, the older snapshots in our collection had the largest share of inaccessible links. Of the pages collected from the 2013 snapshot, 38% were no longer accessible in 2023. But even for pages collected in the 2021 snapshot, about one-in-five were no longer accessible just two years later.

A bar chart showing that around 1 in 5 government webpages contain at least one broken link

We sampled around 500,000 pages from government websites using the Common Crawl March/April 2023 snapshot of the internet, including a mix of different levels of government (federal, state, local and others). We found every link on each page and followed a random selection of those links to their destination to see if the pages they refer to still exist.

Across the government websites we sampled, there were 42 million links. The vast majority of those links (86%) were internal, meaning they link to a different page on the same website. An explainer resource on the IRS website that links to other documents or forms on the IRS site would be an example of an internal link.

Around three-quarters of government webpages we sampled contained at least one on-page link. The typical (median) page contains 50 links, but many pages contain far more. A page in the 90th percentile contains 190 links, and a page in the 99th percentile (that is, the top 1% of pages by number of links) has 740 links.

Other facts about government webpage links:

  • The vast majority go to secure HTTP pages (and have a URL starting with “https://”).
  • 6% go to a static file, like a PDF document.
  • 16% now redirect to a different URL than the one they originally pointed to.

When we followed these links, we found that 6% point to pages that are no longer accessible. Similar shares of internal and external links are no longer functional.

Overall, 21% of all the government webpages we examined contained at least one broken link. Across every level of government we looked at, there were broken links on at least 14% of pages; city government pages had the highest rates of broken links.

A bar chart showing that 23% of news webpages have at least one broken link

For this analysis, we sampled 500,000 pages from 2,063 websites classified as “News/Information” by the audience metrics firm comScore. The pages were collected from the Common Crawl March/April 2023 snapshot of the internet.

Across the news sites sampled, this collection contained more than 14 million links pointing to an outside website. 1 Some 94% of these pages contain at least one external-facing link. The median page contains 20 links, and pages in the top 10% by link count have 56 links.

Like government websites, the vast majority of these links go to secure HTTP pages (those with a URL beginning with “https://”). Around 12% of links on these news sites point to a static file, like a PDF document. And 32% of links on news sites redirected to a different URL than the one they originally pointed to – slightly less than the 39% of external links on government sites that redirect.

When we tracked these links to their destination, we found that 5% of all links on news site pages are no longer accessible. And 23% of all the pages we sampled contained at least one broken link.

Broken links are about as prevalent on the most-trafficked news websites as they are on the least-trafficked sites. Some 25% of pages on news websites in the top 20% by site traffic have at least one broken link. That is nearly identical to the 26% of sites in the bottom 20% by site traffic.

For this analysis, we collected a random sample of 50,000 English-language Wikipedia pages and examined the links in their “References” section. The vast majority of these pages (82%) contain at least one reference link – that is, one that directs the reader to a webpage other than Wikipedia itself.

In total, there are just over 1 million reference links across all the pages we collected. The typical page has four reference links.

The analysis indicates that 11% of all references linked on Wikipedia are no longer accessible. On about 2% of source pages containing reference links, every link on the page was broken or otherwise inaccessible, while another 53% of pages contained at least one broken link.

A pie chart showing that around 1 in 5 tweets disappear from public view within months

For this analysis, we collected nearly 5 million tweets posted from March 8 to April 27, 2023, on the social media platform X, which at the time was known as Twitter. We did this using Twitter’s Streaming API, collecting 3,000 public tweets every 30 minutes in real time. This provided us with a representative sample of all tweets posted on the platform during that period. We monitored those tweets until June 15, 2023, and checked each day to see if they were still available on the site or not.

At the end of the observation period, we found that 18% of the tweets from our initial collection window were no longer publicly visible on the site . In a majority of cases, this was because the account that originally posted the tweet was made private, suspended or deleted entirely. For the remaining tweets, the account that posted the tweet was still visible on the site, but the individual tweet had been deleted.

Which tweets tend to disappear?

A bar chart showing that inaccessible tweets often come from accounts with default profile settings

Tweets were especially likely to be deleted or removed over the course of our collection period if they were:

  • Written in certain languages. Nearly half of all the Turkish-language tweets we collected – and a slightly smaller share of those written in Arabic – were no longer available at the end of the tracking period.
  • Posted by accounts using the site’s default profile settings. More than half of tweets from accounts using the default profile image were no longer available at the end of the tracking period, as were more than a third from accounts with a default bio field. Tweets from these accounts tend to disappear because the entire account has been deleted or made private, as opposed to the individual tweet being deleted.
  • Posted by unverified accounts.

We also found that removed or deleted tweets tended to come from newer accounts with relatively few followers and modest activity on the site. On average, tweets that were no longer visible on the site were posted by accounts around eight months younger than those whose tweets stayed on the site.

And when we analyzed the types of tweets that were no longer available, we found that retweets, quote tweets and original tweets did not differ much from the overall average. But replies were relatively unlikely to be removed – just 12% of replies were inaccessible at the end of our monitoring period.

Most tweets that are removed from the site tend to disappear soon after being posted. In addition to looking at how many tweets from our collection were still available at the end of our tracking period, we conducted a survival analysis to see how long these tweets tended to remain available. We found that:

  • 1% of tweets are removed within one hour
  • 3% within a day
  • 10% within a week
  • 15% within a month

Put another way: Half of tweets that are eventually removed from the platform are unavailable within the first six days of being posted. And 90% of these tweets are unavailable within 46 days.
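
Figures like these are what a standard Kaplan-Meier survival analysis produces, with tweets still visible at the end of monitoring treated as right-censored. A minimal sketch using the lifelines library, under assumed column names:

    # Kaplan-Meier estimate of how long tweets stay visible, using the
    # lifelines library. 'duration_days' is time from posting to removal
    # (or to the end of monitoring for tweets still visible) and 'removed'
    # marks whether removal was observed; still-visible tweets are treated
    # as right-censored. Both column names are assumptions of this sketch.
    from lifelines import KaplanMeierFitter

    def tweet_survival(df):
        kmf = KaplanMeierFitter()
        kmf.fit(durations=df["duration_days"], event_observed=df["removed"])
        # Estimated probability that a tweet is still visible after 1, 7 and 30 days.
        return kmf.survival_function_at_times([1, 7, 30])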

Tweets don’t always disappear forever, though. Some 6% of the tweets we collected disappeared and then became available again at a later point. This could be due to an account going private and then returning to public status, or to the account being suspended and later reinstated. Of those “reappeared” tweets, the vast majority (90%) were still accessible on Twitter at the end of the monitoring period.
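
Detecting such "reappeared" tweets from a daily visibility log amounts to flagging any tweet whose status ever flips from not visible back to visible. A short sketch, reusing the per-day log format assumed earlier (the schema is likewise an assumption):

    # Flag "reappeared" tweets from a daily visibility log: any tweet whose
    # status ever flips from not visible back to visible.
    import pandas as pd

    def reappeared_ids(log):
        """log has one row per (date, tweet_id) with a boolean 'visible' column."""
        log = log.sort_values(["tweet_id", "date"])

        def went_dark_then_back(visible):
            seen_gap = False
            for v in visible:
                if not v:
                    seen_gap = True
                elif seen_gap:
                    return True
            return False

        flags = log.groupby("tweet_id")["visible"].apply(went_dark_then_back)
        return flags[flags].index.tolist()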

  • For our analysis of news sites, we did not collect or check the functionality of internal-facing on-page links – those that point to another page on the same root domain.
