Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

  • Knowledge Base

The Beginner's Guide to Statistical Analysis | 5 Steps & Examples

Statistical analysis means investigating trends, patterns, and relationships using quantitative data . It is an important research tool used by scientists, governments, businesses, and other organizations.

To draw valid conclusions, statistical analysis requires careful planning from the very start of the research process . You need to specify your hypotheses and make decisions about your research design, sample size, and sampling procedure.

After collecting data from your sample, you can organize and summarize the data using descriptive statistics . Then, you can use inferential statistics to formally test hypotheses and make estimates about the population. Finally, you can interpret and generalize your findings.

This article is a practical introduction to statistical analysis for students and researchers. We’ll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables.

Table of contents

Step 1: write your hypotheses and plan your research design, step 2: collect data from a sample, step 3: summarize your data with descriptive statistics, step 4: test hypotheses or make estimates with inferential statistics, step 5: interpret your results, other interesting articles.

To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.

Writing statistical hypotheses

The goal of research is often to investigate a relationship between variables within a population . You start with a prediction, and use statistical analysis to test that prediction.

A statistical hypothesis is a formal way of writing a prediction about a population. Every research prediction is rephrased into null and alternative hypotheses that can be tested using sample data.

While the null hypothesis always predicts no effect or no relationship between variables, the alternative hypothesis states your research prediction of an effect or relationship.

  • Null hypothesis: A 5-minute meditation exercise will have no effect on math test scores in teenagers.
  • Alternative hypothesis: A 5-minute meditation exercise will improve math test scores in teenagers.
  • Null hypothesis: Parental income and GPA have no relationship with each other in college students.
  • Alternative hypothesis: Parental income and GPA are positively correlated in college students.

Planning your research design

A research design is your overall strategy for data collection and analysis. It determines the statistical tests you can use to test your hypothesis later on.

First, decide whether your research will use a descriptive, correlational, or experimental design. Experiments directly influence variables, whereas descriptive and correlational studies only measure variables.

  • In an experimental design , you can assess a cause-and-effect relationship (e.g., the effect of meditation on test scores) using statistical tests of comparison or regression.
  • In a correlational design , you can explore relationships between variables (e.g., parental income and GPA) without any assumption of causality using correlation coefficients and significance tests.
  • In a descriptive design , you can study the characteristics of a population or phenomenon (e.g., the prevalence of anxiety in U.S. college students) using statistical tests to draw inferences from sample data.

Your research design also concerns whether you’ll compare participants at the group level or individual level, or both.

  • In a between-subjects design , you compare the group-level outcomes of participants who have been exposed to different treatments (e.g., those who performed a meditation exercise vs those who didn’t).
  • In a within-subjects design , you compare repeated measures from participants who have participated in all treatments of a study (e.g., scores from before and after performing a meditation exercise).
  • In a mixed (factorial) design , one variable is altered between subjects and another is altered within subjects (e.g., pretest and posttest scores from participants who either did or didn’t do a meditation exercise).
  • Experimental
  • Correlational

First, you’ll take baseline test scores from participants. Then, your participants will undergo a 5-minute meditation exercise. Finally, you’ll record participants’ scores from a second math test.

In this experiment, the independent variable is the 5-minute meditation exercise, and the dependent variable is the math test score from before and after the intervention. Example: Correlational research design In a correlational study, you test whether there is a relationship between parental income and GPA in graduating college students. To collect your data, you will ask participants to fill in a survey and self-report their parents’ incomes and their own GPA.

Measuring variables

When planning a research design, you should operationalize your variables and decide exactly how you will measure them.

For statistical analysis, it’s important to consider the level of measurement of your variables, which tells you what kind of data they contain:

  • Categorical data represents groupings. These may be nominal (e.g., gender) or ordinal (e.g. level of language ability).
  • Quantitative data represents amounts. These may be on an interval scale (e.g. test score) or a ratio scale (e.g. age).

Many variables can be measured at different levels of precision. For example, age data can be quantitative (8 years old) or categorical (young). If a variable is coded numerically (e.g., level of agreement from 1–5), it doesn’t automatically mean that it’s quantitative instead of categorical.

Identifying the measurement level is important for choosing appropriate statistics and hypothesis tests. For example, you can calculate a mean score with quantitative data, but not with categorical data.

In a research study, along with measures of your variables of interest, you’ll often collect data on relevant participant characteristics.

Receive feedback on language, structure, and formatting

Professional editors proofread and edit your paper by focusing on:

  • Academic style
  • Vague sentences
  • Style consistency

See an example

data analysis in research scribd

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalizable findings, you should use a probability sampling method. Random selection reduces several types of research bias , like sampling bias , and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it’s rarely possible to gather the ideal sample. While non-probability samples are more likely to at risk for biases like self-selection bias , they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalizing your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalize your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialized, Rich and Democratic samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalized in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Your participants are self-selected by their schools. Although you’re using a non-probability sample, you aim for a diverse and representative sample. Example: Sampling (correlational study) Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or using statistics. A sample that’s too small may be unrepresentative of the sample, while a sample that’s too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units or more per subgroup is necessary.

To use these calculators, you have to understand and input these key components:

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardized indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarize them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organizing data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualizing the relationship between two variables using a scatter plot .

By visualizing your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

Mean, median, mode, and standard deviation in a normal distribution

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.

Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population. Example: Descriptive statistics (correlational study) After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

A number that describes a sample is called a statistic , while a number describing a population is called a parameter . Using inferential statistics , you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable results in changes in outcome variable(s).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or less).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.

You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you:

  • a t value (test statistic) of 3.00
  • a p value of 0.0028

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you:

  • a t value of 3.08
  • a p value of 0.001

The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores. Example: Interpret your results (correlational study) You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores. Example: Effect size (correlational study) To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimize the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasizes null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis rather than making a conclusion about rejecting the null hypothesis or not.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Student’s  t -distribution
  • Normal distribution
  • Null and Alternative Hypotheses
  • Chi square tests
  • Confidence interval

Methodology

  • Cluster sampling
  • Stratified sampling
  • Data cleansing
  • Reproducibility vs Replicability
  • Peer review
  • Likert scale

Research bias

  • Implicit bias
  • Framing effect
  • Cognitive bias
  • Placebo effect
  • Hawthorne effect
  • Hostile attribution bias
  • Affect heuristic

Is this article helpful?

Other students also liked.

  • Descriptive Statistics | Definitions, Types, Examples
  • Inferential Statistics | An Easy Introduction & Examples
  • Choosing the Right Statistical Test | Types & Examples

More interesting articles

  • Akaike Information Criterion | When & How to Use It (Example)
  • An Easy Introduction to Statistical Significance (With Examples)
  • An Introduction to t Tests | Definitions, Formula and Examples
  • ANOVA in R | A Complete Step-by-Step Guide with Examples
  • Central Limit Theorem | Formula, Definition & Examples
  • Central Tendency | Understanding the Mean, Median & Mode
  • Chi-Square (Χ²) Distributions | Definition & Examples
  • Chi-Square (Χ²) Table | Examples & Downloadable Table
  • Chi-Square (Χ²) Tests | Types, Formula & Examples
  • Chi-Square Goodness of Fit Test | Formula, Guide & Examples
  • Chi-Square Test of Independence | Formula, Guide & Examples
  • Coefficient of Determination (R²) | Calculation & Interpretation
  • Correlation Coefficient | Types, Formulas & Examples
  • Frequency Distribution | Tables, Types & Examples
  • How to Calculate Standard Deviation (Guide) | Calculator & Examples
  • How to Calculate Variance | Calculator, Analysis & Examples
  • How to Find Degrees of Freedom | Definition & Formula
  • How to Find Interquartile Range (IQR) | Calculator & Examples
  • How to Find Outliers | 4 Ways with Examples & Explanation
  • How to Find the Geometric Mean | Calculator & Formula
  • How to Find the Mean | Definition, Examples & Calculator
  • How to Find the Median | Definition, Examples & Calculator
  • How to Find the Mode | Definition, Examples & Calculator
  • How to Find the Range of a Data Set | Calculator & Formula
  • Hypothesis Testing | A Step-by-Step Guide with Easy Examples
  • Interval Data and How to Analyze It | Definitions & Examples
  • Levels of Measurement | Nominal, Ordinal, Interval and Ratio
  • Linear Regression in R | A Step-by-Step Guide & Examples
  • Missing Data | Types, Explanation, & Imputation
  • Multiple Linear Regression | A Quick Guide (Examples)
  • Nominal Data | Definition, Examples, Data Collection & Analysis
  • Normal Distribution | Examples, Formulas, & Uses
  • Null and Alternative Hypotheses | Definitions & Examples
  • One-way ANOVA | When and How to Use It (With Examples)
  • Ordinal Data | Definition, Examples, Data Collection & Analysis
  • Parameter vs Statistic | Definitions, Differences & Examples
  • Pearson Correlation Coefficient (r) | Guide & Examples
  • Poisson Distributions | Definition, Formula & Examples
  • Probability Distribution | Formula, Types, & Examples
  • Quartiles & Quantiles | Calculation, Definition & Interpretation
  • Ratio Scales | Definition, Examples, & Data Analysis
  • Simple Linear Regression | An Easy Introduction & Examples
  • Skewness | Definition, Examples & Formula
  • Statistical Power and Why It Matters | A Simple Introduction
  • Student's t Table (Free Download) | Guide & Examples
  • T-distribution: What it is and how to use it
  • Test statistics | Definition, Interpretation, and Examples
  • The Standard Normal Distribution | Calculator, Examples & Uses
  • Two-Way ANOVA | Examples & When To Use It
  • Type I & Type II Errors | Differences, Examples, Visualizations
  • Understanding Confidence Intervals | Easy Examples & Formulas
  • Understanding P values | Definition and Examples
  • Variability | Calculating Range, IQR, Variance, Standard Deviation
  • What is Effect Size and Why Does It Matter? (Examples)
  • What Is Kurtosis? | Definition, Examples & Formula
  • What Is Standard Error? | How to Calculate (Guide with Examples)

What is your plagiarism score?

Your Modern Business Guide To Data Analysis Methods And Techniques

Data analysis methods and techniques blog post by datapine

Table of Contents

1) What Is Data Analysis?

2) Why Is Data Analysis Important?

3) What Is The Data Analysis Process?

4) Types Of Data Analysis Methods

5) Top Data Analysis Techniques To Apply

6) Quality Criteria For Data Analysis

7) Data Analysis Limitations & Barriers

8) Data Analysis Skills

9) Data Analysis In The Big Data Environment

In our data-rich age, understanding how to analyze and extract true meaning from our business’s digital insights is one of the primary drivers of success.

Despite the colossal volume of data we create every day, a mere 0.5% is actually analyzed and used for data discovery , improvement, and intelligence. While that may not seem like much, considering the amount of digital information we have at our fingertips, half a percent still accounts for a vast amount of data.

With so much data and so little time, knowing how to collect, curate, organize, and make sense of all of this potentially business-boosting information can be a minefield – but online data analysis is the solution.

In science, data analysis uses a more complex approach with advanced techniques to explore and experiment with data. On the other hand, in a business context, data is used to make data-driven decisions that will enable the company to improve its overall performance. In this post, we will cover the analysis of data from an organizational point of view while still going through the scientific and statistical foundations that are fundamental to understanding the basics of data analysis. 

To put all of that into perspective, we will answer a host of important analytical questions, explore analytical methods and techniques, while demonstrating how to perform analysis in the real world with a 17-step blueprint for success.

What Is Data Analysis?

Data analysis is the process of collecting, modeling, and analyzing data using various statistical and logical methods and techniques. Businesses rely on analytics processes and tools to extract insights that support strategic and operational decision-making.

All these various methods are largely based on two core areas: quantitative and qualitative research.

To explain the key differences between qualitative and quantitative research, here’s a video for your viewing pleasure:

Gaining a better understanding of different techniques and methods in quantitative research as well as qualitative insights will give your analyzing efforts a more clearly defined direction, so it’s worth taking the time to allow this particular knowledge to sink in. Additionally, you will be able to create a comprehensive analytical report that will skyrocket your analysis.

Apart from qualitative and quantitative categories, there are also other types of data that you should be aware of before dividing into complex data analysis processes. These categories include: 

  • Big data: Refers to massive data sets that need to be analyzed using advanced software to reveal patterns and trends. It is considered to be one of the best analytical assets as it provides larger volumes of data at a faster rate. 
  • Metadata: Putting it simply, metadata is data that provides insights about other data. It summarizes key information about specific data that makes it easier to find and reuse for later purposes. 
  • Real time data: As its name suggests, real time data is presented as soon as it is acquired. From an organizational perspective, this is the most valuable data as it can help you make important decisions based on the latest developments. Our guide on real time analytics will tell you more about the topic. 
  • Machine data: This is more complex data that is generated solely by a machine such as phones, computers, or even websites and embedded systems, without previous human interaction.

Why Is Data Analysis Important?

Before we go into detail about the categories of analysis along with its methods and techniques, you must understand the potential that analyzing data can bring to your organization.

  • Informed decision-making : From a management perspective, you can benefit from analyzing your data as it helps you make decisions based on facts and not simple intuition. For instance, you can understand where to invest your capital, detect growth opportunities, predict your income, or tackle uncommon situations before they become problems. Through this, you can extract relevant insights from all areas in your organization, and with the help of dashboard software , present the data in a professional and interactive way to different stakeholders.
  • Reduce costs : Another great benefit is to reduce costs. With the help of advanced technologies such as predictive analytics, businesses can spot improvement opportunities, trends, and patterns in their data and plan their strategies accordingly. In time, this will help you save money and resources on implementing the wrong strategies. And not just that, by predicting different scenarios such as sales and demand you can also anticipate production and supply. 
  • Target customers better : Customers are arguably the most crucial element in any business. By using analytics to get a 360° vision of all aspects related to your customers, you can understand which channels they use to communicate with you, their demographics, interests, habits, purchasing behaviors, and more. In the long run, it will drive success to your marketing strategies, allow you to identify new potential customers, and avoid wasting resources on targeting the wrong people or sending the wrong message. You can also track customer satisfaction by analyzing your client’s reviews or your customer service department’s performance.

What Is The Data Analysis Process?

Data analysis process graphic

When we talk about analyzing data there is an order to follow in order to extract the needed conclusions. The analysis process consists of 5 key stages. We will cover each of them more in detail later in the post, but to start providing the needed context to understand what is coming next, here is a rundown of the 5 essential steps of data analysis. 

  • Identify: Before you get your hands dirty with data, you first need to identify why you need it in the first place. The identification is the stage in which you establish the questions you will need to answer. For example, what is the customer's perception of our brand? Or what type of packaging is more engaging to our potential customers? Once the questions are outlined you are ready for the next step. 
  • Collect: As its name suggests, this is the stage where you start collecting the needed data. Here, you define which sources of data you will use and how you will use them. The collection of data can come in different forms such as internal or external sources, surveys, interviews, questionnaires, and focus groups, among others.  An important note here is that the way you collect the data will be different in a quantitative and qualitative scenario. 
  • Clean: Once you have the necessary data it is time to clean it and leave it ready for analysis. Not all the data you collect will be useful, when collecting big amounts of data in different formats it is very likely that you will find yourself with duplicate or badly formatted data. To avoid this, before you start working with your data you need to make sure to erase any white spaces, duplicate records, or formatting errors. This way you avoid hurting your analysis with bad-quality data. 
  • Analyze : With the help of various techniques such as statistical analysis, regressions, neural networks, text analysis, and more, you can start analyzing and manipulating your data to extract relevant conclusions. At this stage, you find trends, correlations, variations, and patterns that can help you answer the questions you first thought of in the identify stage. Various technologies in the market assist researchers and average users with the management of their data. Some of them include business intelligence and visualization software, predictive analytics, and data mining, among others. 
  • Interpret: Last but not least you have one of the most important steps: it is time to interpret your results. This stage is where the researcher comes up with courses of action based on the findings. For example, here you would understand if your clients prefer packaging that is red or green, plastic or paper, etc. Additionally, at this stage, you can also find some limitations and work on them. 

Now that you have a basic understanding of the key data analysis steps, let’s look at the top 17 essential methods.

17 Essential Types Of Data Analysis Methods

Before diving into the 17 essential types of methods, it is important that we go over really fast through the main analysis categories. Starting with the category of descriptive up to prescriptive analysis, the complexity and effort of data evaluation increases, but also the added value for the company.

a) Descriptive analysis - What happened.

The descriptive analysis method is the starting point for any analytic reflection, and it aims to answer the question of what happened? It does this by ordering, manipulating, and interpreting raw data from various sources to turn it into valuable insights for your organization.

Performing descriptive analysis is essential, as it enables us to present our insights in a meaningful way. Although it is relevant to mention that this analysis on its own will not allow you to predict future outcomes or tell you the answer to questions like why something happened, it will leave your data organized and ready to conduct further investigations.

b) Exploratory analysis - How to explore data relationships.

As its name suggests, the main aim of the exploratory analysis is to explore. Prior to it, there is still no notion of the relationship between the data and the variables. Once the data is investigated, exploratory analysis helps you to find connections and generate hypotheses and solutions for specific problems. A typical area of ​​application for it is data mining.

c) Diagnostic analysis - Why it happened.

Diagnostic data analytics empowers analysts and executives by helping them gain a firm contextual understanding of why something happened. If you know why something happened as well as how it happened, you will be able to pinpoint the exact ways of tackling the issue or challenge.

Designed to provide direct and actionable answers to specific questions, this is one of the world’s most important methods in research, among its other key organizational functions such as retail analytics , e.g.

c) Predictive analysis - What will happen.

The predictive method allows you to look into the future to answer the question: what will happen? In order to do this, it uses the results of the previously mentioned descriptive, exploratory, and diagnostic analysis, in addition to machine learning (ML) and artificial intelligence (AI). Through this, you can uncover future trends, potential problems or inefficiencies, connections, and casualties in your data.

With predictive analysis, you can unfold and develop initiatives that will not only enhance your various operational processes but also help you gain an all-important edge over the competition. If you understand why a trend, pattern, or event happened through data, you will be able to develop an informed projection of how things may unfold in particular areas of the business.

e) Prescriptive analysis - How will it happen.

Another of the most effective types of analysis methods in research. Prescriptive data techniques cross over from predictive analysis in the way that it revolves around using patterns or trends to develop responsive, practical business strategies.

By drilling down into prescriptive analysis, you will play an active role in the data consumption process by taking well-arranged sets of visual data and using it as a powerful fix to emerging issues in a number of key areas, including marketing, sales, customer experience, HR, fulfillment, finance, logistics analytics , and others.

Top 17 data analysis methods

As mentioned at the beginning of the post, data analysis methods can be divided into two big categories: quantitative and qualitative. Each of these categories holds a powerful analytical value that changes depending on the scenario and type of data you are working with. Below, we will discuss 17 methods that are divided into qualitative and quantitative approaches. 

Without further ado, here are the 17 essential types of data analysis methods with some use cases in the business world: 

A. Quantitative Methods 

To put it simply, quantitative analysis refers to all methods that use numerical data or data that can be turned into numbers (e.g. category variables like gender, age, etc.) to extract valuable insights. It is used to extract valuable conclusions about relationships, differences, and test hypotheses. Below we discuss some of the key quantitative methods. 

1. Cluster analysis

The action of grouping a set of data elements in a way that said elements are more similar (in a particular sense) to each other than to those in other groups – hence the term ‘cluster.’ Since there is no target variable when clustering, the method is often used to find hidden patterns in the data. The approach is also used to provide additional context to a trend or dataset.

Let's look at it from an organizational perspective. In a perfect world, marketers would be able to analyze each customer separately and give them the best-personalized service, but let's face it, with a large customer base, it is timely impossible to do that. That's where clustering comes in. By grouping customers into clusters based on demographics, purchasing behaviors, monetary value, or any other factor that might be relevant for your company, you will be able to immediately optimize your efforts and give your customers the best experience based on their needs.

2. Cohort analysis

This type of data analysis approach uses historical data to examine and compare a determined segment of users' behavior, which can then be grouped with others with similar characteristics. By using this methodology, it's possible to gain a wealth of insight into consumer needs or a firm understanding of a broader target group.

Cohort analysis can be really useful for performing analysis in marketing as it will allow you to understand the impact of your campaigns on specific groups of customers. To exemplify, imagine you send an email campaign encouraging customers to sign up for your site. For this, you create two versions of the campaign with different designs, CTAs, and ad content. Later on, you can use cohort analysis to track the performance of the campaign for a longer period of time and understand which type of content is driving your customers to sign up, repurchase, or engage in other ways.  

A useful tool to start performing cohort analysis method is Google Analytics. You can learn more about the benefits and limitations of using cohorts in GA in this useful guide . In the bottom image, you see an example of how you visualize a cohort in this tool. The segments (devices traffic) are divided into date cohorts (usage of devices) and then analyzed week by week to extract insights into performance.

Cohort analysis chart example from google analytics

3. Regression analysis

Regression uses historical data to understand how a dependent variable's value is affected when one (linear regression) or more independent variables (multiple regression) change or stay the same. By understanding each variable's relationship and how it developed in the past, you can anticipate possible outcomes and make better decisions in the future.

Let's bring it down with an example. Imagine you did a regression analysis of your sales in 2019 and discovered that variables like product quality, store design, customer service, marketing campaigns, and sales channels affected the overall result. Now you want to use regression to analyze which of these variables changed or if any new ones appeared during 2020. For example, you couldn’t sell as much in your physical store due to COVID lockdowns. Therefore, your sales could’ve either dropped in general or increased in your online channels. Through this, you can understand which independent variables affected the overall performance of your dependent variable, annual sales.

If you want to go deeper into this type of analysis, check out this article and learn more about how you can benefit from regression.

4. Neural networks

The neural network forms the basis for the intelligent algorithms of machine learning. It is a form of analytics that attempts, with minimal intervention, to understand how the human brain would generate insights and predict values. Neural networks learn from each and every data transaction, meaning that they evolve and advance over time.

A typical area of application for neural networks is predictive analytics. There are BI reporting tools that have this feature implemented within them, such as the Predictive Analytics Tool from datapine. This tool enables users to quickly and easily generate all kinds of predictions. All you have to do is select the data to be processed based on your KPIs, and the software automatically calculates forecasts based on historical and current data. Thanks to its user-friendly interface, anyone in your organization can manage it; there’s no need to be an advanced scientist. 

Here is an example of how you can use the predictive analysis tool from datapine:

Example on how to use predictive analytics tool from datapine

**click to enlarge**

5. Factor analysis

The factor analysis also called “dimension reduction” is a type of data analysis used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. The aim here is to uncover independent latent variables, an ideal method for streamlining specific segments.

A good way to understand this data analysis method is a customer evaluation of a product. The initial assessment is based on different variables like color, shape, wearability, current trends, materials, comfort, the place where they bought the product, and frequency of usage. Like this, the list can be endless, depending on what you want to track. In this case, factor analysis comes into the picture by summarizing all of these variables into homogenous groups, for example, by grouping the variables color, materials, quality, and trends into a brother latent variable of design.

If you want to start analyzing data using factor analysis we recommend you take a look at this practical guide from UCLA.

6. Data mining

A method of data analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge.  When considering how to analyze data, adopting a data mining mindset is essential to success - as such, it’s an area that is worth exploring in greater detail.

An excellent use case of data mining is datapine intelligent data alerts . With the help of artificial intelligence and machine learning, they provide automated signals based on particular commands or occurrences within a dataset. For example, if you’re monitoring supply chain KPIs , you could set an intelligent alarm to trigger when invalid or low-quality data appears. By doing so, you will be able to drill down deep into the issue and fix it swiftly and effectively.

In the following picture, you can see how the intelligent alarms from datapine work. By setting up ranges on daily orders, sessions, and revenues, the alarms will notify you if the goal was not completed or if it exceeded expectations.

Example on how to use intelligent alerts from datapine

7. Time series analysis

As its name suggests, time series analysis is used to analyze a set of data points collected over a specified period of time. Although analysts use this method to monitor the data points in a specific interval of time rather than just monitoring them intermittently, the time series analysis is not uniquely used for the purpose of collecting data over time. Instead, it allows researchers to understand if variables changed during the duration of the study, how the different variables are dependent, and how did it reach the end result. 

In a business context, this method is used to understand the causes of different trends and patterns to extract valuable insights. Another way of using this method is with the help of time series forecasting. Powered by predictive technologies, businesses can analyze various data sets over a period of time and forecast different future events. 

A great use case to put time series analysis into perspective is seasonality effects on sales. By using time series forecasting to analyze sales data of a specific product over time, you can understand if sales rise over a specific period of time (e.g. swimwear during summertime, or candy during Halloween). These insights allow you to predict demand and prepare production accordingly.  

8. Decision Trees 

The decision tree analysis aims to act as a support tool to make smart and strategic decisions. By visually displaying potential outcomes, consequences, and costs in a tree-like model, researchers and company users can easily evaluate all factors involved and choose the best course of action. Decision trees are helpful to analyze quantitative data and they allow for an improved decision-making process by helping you spot improvement opportunities, reduce costs, and enhance operational efficiency and production.

But how does a decision tree actually works? This method works like a flowchart that starts with the main decision that you need to make and branches out based on the different outcomes and consequences of each decision. Each outcome will outline its own consequences, costs, and gains and, at the end of the analysis, you can compare each of them and make the smartest decision. 

Businesses can use them to understand which project is more cost-effective and will bring more earnings in the long run. For example, imagine you need to decide if you want to update your software app or build a new app entirely.  Here you would compare the total costs, the time needed to be invested, potential revenue, and any other factor that might affect your decision.  In the end, you would be able to see which of these two options is more realistic and attainable for your company or research.

9. Conjoint analysis 

Last but not least, we have the conjoint analysis. This approach is usually used in surveys to understand how individuals value different attributes of a product or service and it is one of the most effective methods to extract consumer preferences. When it comes to purchasing, some clients might be more price-focused, others more features-focused, and others might have a sustainable focus. Whatever your customer's preferences are, you can find them with conjoint analysis. Through this, companies can define pricing strategies, packaging options, subscription packages, and more. 

A great example of conjoint analysis is in marketing and sales. For instance, a cupcake brand might use conjoint analysis and find that its clients prefer gluten-free options and cupcakes with healthier toppings over super sugary ones. Thus, the cupcake brand can turn these insights into advertisements and promotions to increase sales of this particular type of product. And not just that, conjoint analysis can also help businesses segment their customers based on their interests. This allows them to send different messaging that will bring value to each of the segments. 

10. Correspondence Analysis

Also known as reciprocal averaging, correspondence analysis is a method used to analyze the relationship between categorical variables presented within a contingency table. A contingency table is a table that displays two (simple correspondence analysis) or more (multiple correspondence analysis) categorical variables across rows and columns that show the distribution of the data, which is usually answers to a survey or questionnaire on a specific topic. 

This method starts by calculating an “expected value” which is done by multiplying row and column averages and dividing it by the overall original value of the specific table cell. The “expected value” is then subtracted from the original value resulting in a “residual number” which is what allows you to extract conclusions about relationships and distribution. The results of this analysis are later displayed using a map that represents the relationship between the different values. The closest two values are in the map, the bigger the relationship. Let’s put it into perspective with an example. 

Imagine you are carrying out a market research analysis about outdoor clothing brands and how they are perceived by the public. For this analysis, you ask a group of people to match each brand with a certain attribute which can be durability, innovation, quality materials, etc. When calculating the residual numbers, you can see that brand A has a positive residual for innovation but a negative one for durability. This means that brand A is not positioned as a durable brand in the market, something that competitors could take advantage of. 

11. Multidimensional Scaling (MDS)

MDS is a method used to observe the similarities or disparities between objects which can be colors, brands, people, geographical coordinates, and more. The objects are plotted using an “MDS map” that positions similar objects together and disparate ones far apart. The (dis) similarities between objects are represented using one or more dimensions that can be observed using a numerical scale. For example, if you want to know how people feel about the COVID-19 vaccine, you can use 1 for “don’t believe in the vaccine at all”  and 10 for “firmly believe in the vaccine” and a scale of 2 to 9 for in between responses.  When analyzing an MDS map the only thing that matters is the distance between the objects, the orientation of the dimensions is arbitrary and has no meaning at all. 

Multidimensional scaling is a valuable technique for market research, especially when it comes to evaluating product or brand positioning. For instance, if a cupcake brand wants to know how they are positioned compared to competitors, it can define 2-3 dimensions such as taste, ingredients, shopping experience, or more, and do a multidimensional scaling analysis to find improvement opportunities as well as areas in which competitors are currently leading. 

Another business example is in procurement when deciding on different suppliers. Decision makers can generate an MDS map to see how the different prices, delivery times, technical services, and more of the different suppliers differ and pick the one that suits their needs the best. 

A final example proposed by a research paper on "An Improved Study of Multilevel Semantic Network Visualization for Analyzing Sentiment Word of Movie Review Data". Researchers picked a two-dimensional MDS map to display the distances and relationships between different sentiments in movie reviews. They used 36 sentiment words and distributed them based on their emotional distance as we can see in the image below where the words "outraged" and "sweet" are on opposite sides of the map, marking the distance between the two emotions very clearly.

Example of multidimensional scaling analysis

Aside from being a valuable technique to analyze dissimilarities, MDS also serves as a dimension-reduction technique for large dimensional data. 

B. Qualitative Methods

Qualitative data analysis methods are defined as the observation of non-numerical data that is gathered and produced using methods of observation such as interviews, focus groups, questionnaires, and more. As opposed to quantitative methods, qualitative data is more subjective and highly valuable in analyzing customer retention and product development.

12. Text analysis

Text analysis, also known in the industry as text mining, works by taking large sets of textual data and arranging them in a way that makes it easier to manage. By working through this cleansing process in stringent detail, you will be able to extract the data that is truly relevant to your organization and use it to develop actionable insights that will propel you forward.

Modern software accelerate the application of text analytics. Thanks to the combination of machine learning and intelligent algorithms, you can perform advanced analytical processes such as sentiment analysis. This technique allows you to understand the intentions and emotions of a text, for example, if it's positive, negative, or neutral, and then give it a score depending on certain factors and categories that are relevant to your brand. Sentiment analysis is often used to monitor brand and product reputation and to understand how successful your customer experience is. To learn more about the topic check out this insightful article .

By analyzing data from various word-based sources, including product reviews, articles, social media communications, and survey responses, you will gain invaluable insights into your audience, as well as their needs, preferences, and pain points. This will allow you to create campaigns, services, and communications that meet your prospects’ needs on a personal level, growing your audience while boosting customer retention. There are various other “sub-methods” that are an extension of text analysis. Each of them serves a more specific purpose and we will look at them in detail next. 

13. Content Analysis

This is a straightforward and very popular method that examines the presence and frequency of certain words, concepts, and subjects in different content formats such as text, image, audio, or video. For example, the number of times the name of a celebrity is mentioned on social media or online tabloids. It does this by coding text data that is later categorized and tabulated in a way that can provide valuable insights, making it the perfect mix of quantitative and qualitative analysis.

There are two types of content analysis. The first one is the conceptual analysis which focuses on explicit data, for instance, the number of times a concept or word is mentioned in a piece of content. The second one is relational analysis, which focuses on the relationship between different concepts or words and how they are connected within a specific context. 

Content analysis is often used by marketers to measure brand reputation and customer behavior. For example, by analyzing customer reviews. It can also be used to analyze customer interviews and find directions for new product development. It is also important to note, that in order to extract the maximum potential out of this analysis method, it is necessary to have a clearly defined research question. 

14. Thematic Analysis

Very similar to content analysis, thematic analysis also helps in identifying and interpreting patterns in qualitative data with the main difference being that the first one can also be applied to quantitative analysis. The thematic method analyzes large pieces of text data such as focus group transcripts or interviews and groups them into themes or categories that come up frequently within the text. It is a great method when trying to figure out peoples view’s and opinions about a certain topic. For example, if you are a brand that cares about sustainability, you can do a survey of your customers to analyze their views and opinions about sustainability and how they apply it to their lives. You can also analyze customer service calls transcripts to find common issues and improve your service. 

Thematic analysis is a very subjective technique that relies on the researcher’s judgment. Therefore,  to avoid biases, it has 6 steps that include familiarization, coding, generating themes, reviewing themes, defining and naming themes, and writing up. It is also important to note that, because it is a flexible approach, the data can be interpreted in multiple ways and it can be hard to select what data is more important to emphasize. 

15. Narrative Analysis 

A bit more complex in nature than the two previous ones, narrative analysis is used to explore the meaning behind the stories that people tell and most importantly, how they tell them. By looking into the words that people use to describe a situation you can extract valuable conclusions about their perspective on a specific topic. Common sources for narrative data include autobiographies, family stories, opinion pieces, and testimonials, among others. 

From a business perspective, narrative analysis can be useful to analyze customer behaviors and feelings towards a specific product, service, feature, or others. It provides unique and deep insights that can be extremely valuable. However, it has some drawbacks.  

The biggest weakness of this method is that the sample sizes are usually very small due to the complexity and time-consuming nature of the collection of narrative data. Plus, the way a subject tells a story will be significantly influenced by his or her specific experiences, making it very hard to replicate in a subsequent study. 

16. Discourse Analysis

Discourse analysis is used to understand the meaning behind any type of written, verbal, or symbolic discourse based on its political, social, or cultural context. It mixes the analysis of languages and situations together. This means that the way the content is constructed and the meaning behind it is significantly influenced by the culture and society it takes place in. For example, if you are analyzing political speeches you need to consider different context elements such as the politician's background, the current political context of the country, the audience to which the speech is directed, and so on. 

From a business point of view, discourse analysis is a great market research tool. It allows marketers to understand how the norms and ideas of the specific market work and how their customers relate to those ideas. It can be very useful to build a brand mission or develop a unique tone of voice. 

17. Grounded Theory Analysis

Traditionally, researchers decide on a method and hypothesis and start to collect the data to prove that hypothesis. The grounded theory is the only method that doesn’t require an initial research question or hypothesis as its value lies in the generation of new theories. With the grounded theory method, you can go into the analysis process with an open mind and explore the data to generate new theories through tests and revisions. In fact, it is not necessary to collect the data and then start to analyze it. Researchers usually start to find valuable insights as they are gathering the data. 

All of these elements make grounded theory a very valuable method as theories are fully backed by data instead of initial assumptions. It is a great technique to analyze poorly researched topics or find the causes behind specific company outcomes. For example, product managers and marketers might use the grounded theory to find the causes of high levels of customer churn and look into customer surveys and reviews to develop new theories about the causes. 

How To Analyze Data? Top 17 Data Analysis Techniques To Apply

17 top data analysis techniques by datapine

Now that we’ve answered the questions “what is data analysis’”, why is it important, and covered the different data analysis types, it’s time to dig deeper into how to perform your analysis by working through these 17 essential techniques.

1. Collaborate your needs

Before you begin analyzing or drilling down into any techniques, it’s crucial to sit down collaboratively with all key stakeholders within your organization, decide on your primary campaign or strategic goals, and gain a fundamental understanding of the types of insights that will best benefit your progress or provide you with the level of vision you need to evolve your organization.

2. Establish your questions

Once you’ve outlined your core objectives, you should consider which questions will need answering to help you achieve your mission. This is one of the most important techniques as it will shape the very foundations of your success.

To help you ask the right things and ensure your data works for you, you have to ask the right data analysis questions .

3. Data democratization

After giving your data analytics methodology some real direction, and knowing which questions need answering to extract optimum value from the information available to your organization, you should continue with democratization.

Data democratization is an action that aims to connect data from various sources efficiently and quickly so that anyone in your organization can access it at any given moment. You can extract data in text, images, videos, numbers, or any other format. And then perform cross-database analysis to achieve more advanced insights to share with the rest of the company interactively.  

Once you have decided on your most valuable sources, you need to take all of this into a structured format to start collecting your insights. For this purpose, datapine offers an easy all-in-one data connectors feature to integrate all your internal and external sources and manage them at your will. Additionally, datapine’s end-to-end solution automatically updates your data, allowing you to save time and focus on performing the right analysis to grow your company.

data connectors from datapine

4. Think of governance 

When collecting data in a business or research context you always need to think about security and privacy. With data breaches becoming a topic of concern for businesses, the need to protect your client's or subject’s sensitive information becomes critical. 

To ensure that all this is taken care of, you need to think of a data governance strategy. According to Gartner , this concept refers to “ the specification of decision rights and an accountability framework to ensure the appropriate behavior in the valuation, creation, consumption, and control of data and analytics .” In simpler words, data governance is a collection of processes, roles, and policies, that ensure the efficient use of data while still achieving the main company goals. It ensures that clear roles are in place for who can access the information and how they can access it. In time, this not only ensures that sensitive information is protected but also allows for an efficient analysis as a whole. 

5. Clean your data

After harvesting from so many sources you will be left with a vast amount of information that can be overwhelming to deal with. At the same time, you can be faced with incorrect data that can be misleading to your analysis. The smartest thing you can do to avoid dealing with this in the future is to clean the data. This is fundamental before visualizing it, as it will ensure that the insights you extract from it are correct.

There are many things that you need to look for in the cleaning process. The most important one is to eliminate any duplicate observations; this usually appears when using multiple internal and external sources of information. You can also add any missing codes, fix empty fields, and eliminate incorrectly formatted data.

Another usual form of cleaning is done with text data. As we mentioned earlier, most companies today analyze customer reviews, social media comments, questionnaires, and several other text inputs. In order for algorithms to detect patterns, text data needs to be revised to avoid invalid characters or any syntax or spelling errors. 

Most importantly, the aim of cleaning is to prevent you from arriving at false conclusions that can damage your company in the long run. By using clean data, you will also help BI solutions to interact better with your information and create better reports for your organization.

6. Set your KPIs

Once you’ve set your sources, cleaned your data, and established clear-cut questions you want your insights to answer, you need to set a host of key performance indicators (KPIs) that will help you track, measure, and shape your progress in a number of key areas.

KPIs are critical to both qualitative and quantitative analysis research. This is one of the primary methods of data analysis you certainly shouldn’t overlook.

To help you set the best possible KPIs for your initiatives and activities, here is an example of a relevant logistics KPI : transportation-related costs. If you want to see more go explore our collection of key performance indicator examples .

Transportation costs logistics KPIs

7. Omit useless data

Having bestowed your data analysis tools and techniques with true purpose and defined your mission, you should explore the raw data you’ve collected from all sources and use your KPIs as a reference for chopping out any information you deem to be useless.

Trimming the informational fat is one of the most crucial methods of analysis as it will allow you to focus your analytical efforts and squeeze every drop of value from the remaining ‘lean’ information.

Any stats, facts, figures, or metrics that don’t align with your business goals or fit with your KPI management strategies should be eliminated from the equation.

8. Build a data management roadmap

While, at this point, this particular step is optional (you will have already gained a wealth of insight and formed a fairly sound strategy by now), creating a data governance roadmap will help your data analysis methods and techniques become successful on a more sustainable basis. These roadmaps, if developed properly, are also built so they can be tweaked and scaled over time.

Invest ample time in developing a roadmap that will help you store, manage, and handle your data internally, and you will make your analysis techniques all the more fluid and functional – one of the most powerful types of data analysis methods available today.

9. Integrate technology

There are many ways to analyze data, but one of the most vital aspects of analytical success in a business context is integrating the right decision support software and technology.

Robust analysis platforms will not only allow you to pull critical data from your most valuable sources while working with dynamic KPIs that will offer you actionable insights; it will also present them in a digestible, visual, interactive format from one central, live dashboard . A data methodology you can count on.

By integrating the right technology within your data analysis methodology, you’ll avoid fragmenting your insights, saving you time and effort while allowing you to enjoy the maximum value from your business’s most valuable insights.

For a look at the power of software for the purpose of analysis and to enhance your methods of analyzing, glance over our selection of dashboard examples .

10. Answer your questions

By considering each of the above efforts, working with the right technology, and fostering a cohesive internal culture where everyone buys into the different ways to analyze data as well as the power of digital intelligence, you will swiftly start to answer your most burning business questions. Arguably, the best way to make your data concepts accessible across the organization is through data visualization.

11. Visualize your data

Online data visualization is a powerful tool as it lets you tell a story with your metrics, allowing users across the organization to extract meaningful insights that aid business evolution – and it covers all the different ways to analyze data.

The purpose of analyzing is to make your entire organization more informed and intelligent, and with the right platform or dashboard, this is simpler than you think, as demonstrated by our marketing dashboard .

An executive dashboard example showcasing high-level marketing KPIs such as cost per lead, MQL, SQL, and cost per customer.

This visual, dynamic, and interactive online dashboard is a data analysis example designed to give Chief Marketing Officers (CMO) an overview of relevant metrics to help them understand if they achieved their monthly goals.

In detail, this example generated with a modern dashboard creator displays interactive charts for monthly revenues, costs, net income, and net income per customer; all of them are compared with the previous month so that you can understand how the data fluctuated. In addition, it shows a detailed summary of the number of users, customers, SQLs, and MQLs per month to visualize the whole picture and extract relevant insights or trends for your marketing reports .

The CMO dashboard is perfect for c-level management as it can help them monitor the strategic outcome of their marketing efforts and make data-driven decisions that can benefit the company exponentially.

12. Be careful with the interpretation

We already dedicated an entire post to data interpretation as it is a fundamental part of the process of data analysis. It gives meaning to the analytical information and aims to drive a concise conclusion from the analysis results. Since most of the time companies are dealing with data from many different sources, the interpretation stage needs to be done carefully and properly in order to avoid misinterpretations. 

To help you through the process, here we list three common practices that you need to avoid at all costs when looking at your data:

  • Correlation vs. causation: The human brain is formatted to find patterns. This behavior leads to one of the most common mistakes when performing interpretation: confusing correlation with causation. Although these two aspects can exist simultaneously, it is not correct to assume that because two things happened together, one provoked the other. A piece of advice to avoid falling into this mistake is never to trust just intuition, trust the data. If there is no objective evidence of causation, then always stick to correlation. 
  • Confirmation bias: This phenomenon describes the tendency to select and interpret only the data necessary to prove one hypothesis, often ignoring the elements that might disprove it. Even if it's not done on purpose, confirmation bias can represent a real problem, as excluding relevant information can lead to false conclusions and, therefore, bad business decisions. To avoid it, always try to disprove your hypothesis instead of proving it, share your analysis with other team members, and avoid drawing any conclusions before the entire analytical project is finalized.
  • Statistical significance: To put it in short words, statistical significance helps analysts understand if a result is actually accurate or if it happened because of a sampling error or pure chance. The level of statistical significance needed might depend on the sample size and the industry being analyzed. In any case, ignoring the significance of a result when it might influence decision-making can be a huge mistake.

13. Build a narrative

Now, we’re going to look at how you can bring all of these elements together in a way that will benefit your business - starting with a little something called data storytelling.

The human brain responds incredibly well to strong stories or narratives. Once you’ve cleansed, shaped, and visualized your most invaluable data using various BI dashboard tools , you should strive to tell a story - one with a clear-cut beginning, middle, and end.

By doing so, you will make your analytical efforts more accessible, digestible, and universal, empowering more people within your organization to use your discoveries to their actionable advantage.

14. Consider autonomous technology

Autonomous technologies, such as artificial intelligence (AI) and machine learning (ML), play a significant role in the advancement of understanding how to analyze data more effectively.

Gartner predicts that by the end of this year, 80% of emerging technologies will be developed with AI foundations. This is a testament to the ever-growing power and value of autonomous technologies.

At the moment, these technologies are revolutionizing the analysis industry. Some examples that we mentioned earlier are neural networks, intelligent alarms, and sentiment analysis.

15. Share the load

If you work with the right tools and dashboards, you will be able to present your metrics in a digestible, value-driven format, allowing almost everyone in the organization to connect with and use relevant data to their advantage.

Modern dashboards consolidate data from various sources, providing access to a wealth of insights in one centralized location, no matter if you need to monitor recruitment metrics or generate reports that need to be sent across numerous departments. Moreover, these cutting-edge tools offer access to dashboards from a multitude of devices, meaning that everyone within the business can connect with practical insights remotely - and share the load.

Once everyone is able to work with a data-driven mindset, you will catalyze the success of your business in ways you never thought possible. And when it comes to knowing how to analyze data, this kind of collaborative approach is essential.

16. Data analysis tools

In order to perform high-quality analysis of data, it is fundamental to use tools and software that will ensure the best results. Here we leave you a small summary of four fundamental categories of data analysis tools for your organization.

  • Business Intelligence: BI tools allow you to process significant amounts of data from several sources in any format. Through this, you can not only analyze and monitor your data to extract relevant insights but also create interactive reports and dashboards to visualize your KPIs and use them for your company's good. datapine is an amazing online BI software that is focused on delivering powerful online analysis features that are accessible to beginner and advanced users. Like this, it offers a full-service solution that includes cutting-edge analysis of data, KPIs visualization, live dashboards, reporting, and artificial intelligence technologies to predict trends and minimize risk.
  • Statistical analysis: These tools are usually designed for scientists, statisticians, market researchers, and mathematicians, as they allow them to perform complex statistical analyses with methods like regression analysis, predictive analysis, and statistical modeling. A good tool to perform this type of analysis is R-Studio as it offers a powerful data modeling and hypothesis testing feature that can cover both academic and general data analysis. This tool is one of the favorite ones in the industry, due to its capability for data cleaning, data reduction, and performing advanced analysis with several statistical methods. Another relevant tool to mention is SPSS from IBM. The software offers advanced statistical analysis for users of all skill levels. Thanks to a vast library of machine learning algorithms, text analysis, and a hypothesis testing approach it can help your company find relevant insights to drive better decisions. SPSS also works as a cloud service that enables you to run it anywhere.
  • SQL Consoles: SQL is a programming language often used to handle structured data in relational databases. Tools like these are popular among data scientists as they are extremely effective in unlocking these databases' value. Undoubtedly, one of the most used SQL software in the market is MySQL Workbench . This tool offers several features such as a visual tool for database modeling and monitoring, complete SQL optimization, administration tools, and visual performance dashboards to keep track of KPIs.
  • Data Visualization: These tools are used to represent your data through charts, graphs, and maps that allow you to find patterns and trends in the data. datapine's already mentioned BI platform also offers a wealth of powerful online data visualization tools with several benefits. Some of them include: delivering compelling data-driven presentations to share with your entire company, the ability to see your data online with any device wherever you are, an interactive dashboard design feature that enables you to showcase your results in an interactive and understandable way, and to perform online self-service reports that can be used simultaneously with several other people to enhance team productivity.

17. Refine your process constantly 

Last is a step that might seem obvious to some people, but it can be easily ignored if you think you are done. Once you have extracted the needed results, you should always take a retrospective look at your project and think about what you can improve. As you saw throughout this long list of techniques, data analysis is a complex process that requires constant refinement. For this reason, you should always go one step further and keep improving. 

Quality Criteria For Data Analysis

So far we’ve covered a list of methods and techniques that should help you perform efficient data analysis. But how do you measure the quality and validity of your results? This is done with the help of some science quality criteria. Here we will go into a more theoretical area that is critical to understanding the fundamentals of statistical analysis in science. However, you should also be aware of these steps in a business context, as they will allow you to assess the quality of your results in the correct way. Let’s dig in. 

  • Internal validity: The results of a survey are internally valid if they measure what they are supposed to measure and thus provide credible results. In other words , internal validity measures the trustworthiness of the results and how they can be affected by factors such as the research design, operational definitions, how the variables are measured, and more. For instance, imagine you are doing an interview to ask people if they brush their teeth two times a day. While most of them will answer yes, you can still notice that their answers correspond to what is socially acceptable, which is to brush your teeth at least twice a day. In this case, you can’t be 100% sure if respondents actually brush their teeth twice a day or if they just say that they do, therefore, the internal validity of this interview is very low. 
  • External validity: Essentially, external validity refers to the extent to which the results of your research can be applied to a broader context. It basically aims to prove that the findings of a study can be applied in the real world. If the research can be applied to other settings, individuals, and times, then the external validity is high. 
  • Reliability : If your research is reliable, it means that it can be reproduced. If your measurement were repeated under the same conditions, it would produce similar results. This means that your measuring instrument consistently produces reliable results. For example, imagine a doctor building a symptoms questionnaire to detect a specific disease in a patient. Then, various other doctors use this questionnaire but end up diagnosing the same patient with a different condition. This means the questionnaire is not reliable in detecting the initial disease. Another important note here is that in order for your research to be reliable, it also needs to be objective. If the results of a study are the same, independent of who assesses them or interprets them, the study can be considered reliable. Let’s see the objectivity criteria in more detail now. 
  • Objectivity: In data science, objectivity means that the researcher needs to stay fully objective when it comes to its analysis. The results of a study need to be affected by objective criteria and not by the beliefs, personality, or values of the researcher. Objectivity needs to be ensured when you are gathering the data, for example, when interviewing individuals, the questions need to be asked in a way that doesn't influence the results. Paired with this, objectivity also needs to be thought of when interpreting the data. If different researchers reach the same conclusions, then the study is objective. For this last point, you can set predefined criteria to interpret the results to ensure all researchers follow the same steps. 

The discussed quality criteria cover mostly potential influences in a quantitative context. Analysis in qualitative research has by default additional subjective influences that must be controlled in a different way. Therefore, there are other quality criteria for this kind of research such as credibility, transferability, dependability, and confirmability. You can see each of them more in detail on this resource . 

Data Analysis Limitations & Barriers

Analyzing data is not an easy task. As you’ve seen throughout this post, there are many steps and techniques that you need to apply in order to extract useful information from your research. While a well-performed analysis can bring various benefits to your organization it doesn't come without limitations. In this section, we will discuss some of the main barriers you might encounter when conducting an analysis. Let’s see them more in detail. 

  • Lack of clear goals: No matter how good your data or analysis might be if you don’t have clear goals or a hypothesis the process might be worthless. While we mentioned some methods that don’t require a predefined hypothesis, it is always better to enter the analytical process with some clear guidelines of what you are expecting to get out of it, especially in a business context in which data is utilized to support important strategic decisions. 
  • Objectivity: Arguably one of the biggest barriers when it comes to data analysis in research is to stay objective. When trying to prove a hypothesis, researchers might find themselves, intentionally or unintentionally, directing the results toward an outcome that they want. To avoid this, always question your assumptions and avoid confusing facts with opinions. You can also show your findings to a research partner or external person to confirm that your results are objective. 
  • Data representation: A fundamental part of the analytical procedure is the way you represent your data. You can use various graphs and charts to represent your findings, but not all of them will work for all purposes. Choosing the wrong visual can not only damage your analysis but can mislead your audience, therefore, it is important to understand when to use each type of data depending on your analytical goals. Our complete guide on the types of graphs and charts lists 20 different visuals with examples of when to use them. 
  • Flawed correlation : Misleading statistics can significantly damage your research. We’ve already pointed out a few interpretation issues previously in the post, but it is an important barrier that we can't avoid addressing here as well. Flawed correlations occur when two variables appear related to each other but they are not. Confusing correlations with causation can lead to a wrong interpretation of results which can lead to building wrong strategies and loss of resources, therefore, it is very important to identify the different interpretation mistakes and avoid them. 
  • Sample size: A very common barrier to a reliable and efficient analysis process is the sample size. In order for the results to be trustworthy, the sample size should be representative of what you are analyzing. For example, imagine you have a company of 1000 employees and you ask the question “do you like working here?” to 50 employees of which 49 say yes, which means 95%. Now, imagine you ask the same question to the 1000 employees and 950 say yes, which also means 95%. Saying that 95% of employees like working in the company when the sample size was only 50 is not a representative or trustworthy conclusion. The significance of the results is way more accurate when surveying a bigger sample size.   
  • Privacy concerns: In some cases, data collection can be subjected to privacy regulations. Businesses gather all kinds of information from their customers from purchasing behaviors to addresses and phone numbers. If this falls into the wrong hands due to a breach, it can affect the security and confidentiality of your clients. To avoid this issue, you need to collect only the data that is needed for your research and, if you are using sensitive facts, make it anonymous so customers are protected. The misuse of customer data can severely damage a business's reputation, so it is important to keep an eye on privacy. 
  • Lack of communication between teams : When it comes to performing data analysis on a business level, it is very likely that each department and team will have different goals and strategies. However, they are all working for the same common goal of helping the business run smoothly and keep growing. When teams are not connected and communicating with each other, it can directly affect the way general strategies are built. To avoid these issues, tools such as data dashboards enable teams to stay connected through data in a visually appealing way. 
  • Innumeracy : Businesses are working with data more and more every day. While there are many BI tools available to perform effective analysis, data literacy is still a constant barrier. Not all employees know how to apply analysis techniques or extract insights from them. To prevent this from happening, you can implement different training opportunities that will prepare every relevant user to deal with data. 

Key Data Analysis Skills

As you've learned throughout this lengthy guide, analyzing data is a complex task that requires a lot of knowledge and skills. That said, thanks to the rise of self-service tools the process is way more accessible and agile than it once was. Regardless, there are still some key skills that are valuable to have when working with data, we list the most important ones below.

  • Critical and statistical thinking: To successfully analyze data you need to be creative and think out of the box. Yes, that might sound like a weird statement considering that data is often tight to facts. However, a great level of critical thinking is required to uncover connections, come up with a valuable hypothesis, and extract conclusions that go a step further from the surface. This, of course, needs to be complemented by statistical thinking and an understanding of numbers. 
  • Data cleaning: Anyone who has ever worked with data before will tell you that the cleaning and preparation process accounts for 80% of a data analyst's work, therefore, the skill is fundamental. But not just that, not cleaning the data adequately can also significantly damage the analysis which can lead to poor decision-making in a business scenario. While there are multiple tools that automate the cleaning process and eliminate the possibility of human error, it is still a valuable skill to dominate. 
  • Data visualization: Visuals make the information easier to understand and analyze, not only for professional users but especially for non-technical ones. Having the necessary skills to not only choose the right chart type but know when to apply it correctly is key. This also means being able to design visually compelling charts that make the data exploration process more efficient. 
  • SQL: The Structured Query Language or SQL is a programming language used to communicate with databases. It is fundamental knowledge as it enables you to update, manipulate, and organize data from relational databases which are the most common databases used by companies. It is fairly easy to learn and one of the most valuable skills when it comes to data analysis. 
  • Communication skills: This is a skill that is especially valuable in a business environment. Being able to clearly communicate analytical outcomes to colleagues is incredibly important, especially when the information you are trying to convey is complex for non-technical people. This applies to in-person communication as well as written format, for example, when generating a dashboard or report. While this might be considered a “soft” skill compared to the other ones we mentioned, it should not be ignored as you most likely will need to share analytical findings with others no matter the context. 

Data Analysis In The Big Data Environment

Big data is invaluable to today’s businesses, and by using different methods for data analysis, it’s possible to view your data in a way that can help you turn insight into positive action.

To inspire your efforts and put the importance of big data into context, here are some insights that you should know:

  • By 2026 the industry of big data is expected to be worth approximately $273.4 billion.
  • 94% of enterprises say that analyzing data is important for their growth and digital transformation. 
  • Companies that exploit the full potential of their data can increase their operating margins by 60% .
  • We already told you the benefits of Artificial Intelligence through this article. This industry's financial impact is expected to grow up to $40 billion by 2025.

Data analysis concepts may come in many forms, but fundamentally, any solid methodology will help to make your business more streamlined, cohesive, insightful, and successful than ever before.

Key Takeaways From Data Analysis 

As we reach the end of our data analysis journey, we leave a small summary of the main methods and techniques to perform excellent analysis and grow your business.

17 Essential Types of Data Analysis Methods:

  • Cluster analysis
  • Cohort analysis
  • Regression analysis
  • Factor analysis
  • Neural Networks
  • Data Mining
  • Text analysis
  • Time series analysis
  • Decision trees
  • Conjoint analysis 
  • Correspondence Analysis
  • Multidimensional Scaling 
  • Content analysis 
  • Thematic analysis
  • Narrative analysis 
  • Grounded theory analysis
  • Discourse analysis 

Top 17 Data Analysis Techniques:

  • Collaborate your needs
  • Establish your questions
  • Data democratization
  • Think of data governance 
  • Clean your data
  • Set your KPIs
  • Omit useless data
  • Build a data management roadmap
  • Integrate technology
  • Answer your questions
  • Visualize your data
  • Interpretation of data
  • Consider autonomous technology
  • Build a narrative
  • Share the load
  • Data Analysis tools
  • Refine your process constantly 

We’ve pondered the data analysis definition and drilled down into the practical applications of data-centric analytics, and one thing is clear: by taking measures to arrange your data and making your metrics work for you, it’s possible to transform raw information into action - the kind of that will push your business to the next level.

Yes, good data analytics techniques result in enhanced business intelligence (BI). To help you understand this notion in more detail, read our exploration of business intelligence reporting .

And, if you’re ready to perform your own analysis, drill down into your facts and figures while interacting with your data on astonishing visuals, you can try our software for a free, 14-day trial .

Logo for University of Southern Queensland

Want to create or adapt books like this? Learn more about how Pressbooks supports open publishing practices.

14 Quantitative analysis: Descriptive statistics

Numeric data collected in a research project can be analysed quantitatively using statistical tools in two different ways. Descriptive analysis refers to statistically describing, aggregating, and presenting the constructs of interest or associations between these constructs. Inferential analysis refers to the statistical testing of hypotheses (theory testing). In this chapter, we will examine statistical techniques used for descriptive analysis, and the next chapter will examine statistical techniques for inferential analysis. Much of today’s quantitative data analysis is conducted using software programs such as SPSS or SAS. Readers are advised to familiarise themselves with one of these programs for understanding the concepts described in this chapter.

Data preparation

In research projects, data may be collected from a variety of sources: postal surveys, interviews, pretest or posttest experimental data, observational data, and so forth. This data must be converted into a machine-readable, numeric format, such as in a spreadsheet or a text file, so that they can be analysed by computer programs like SPSS or SAS. Data preparation usually follows the following steps:

Data coding. Coding is the process of converting data into numeric format. A codebook should be created to guide the coding process. A codebook is a comprehensive document containing a detailed description of each variable in a research study, items or measures for that variable, the format of each item (numeric, text, etc.), the response scale for each item (i.e., whether it is measured on a nominal, ordinal, interval, or ratio scale, and whether this scale is a five-point, seven-point scale, etc.), and how to code each value into a numeric format. For instance, if we have a measurement item on a seven-point Likert scale with anchors ranging from ‘strongly disagree’ to ‘strongly agree’, we may code that item as 1 for strongly disagree, 4 for neutral, and 7 for strongly agree, with the intermediate anchors in between. Nominal data such as industry type can be coded in numeric form using a coding scheme such as: 1 for manufacturing, 2 for retailing, 3 for financial, 4 for healthcare, and so forth (of course, nominal data cannot be analysed statistically). Ratio scale data such as age, income, or test scores can be coded as entered by the respondent. Sometimes, data may need to be aggregated into a different form than the format used for data collection. For instance, if a survey measuring a construct such as ‘benefits of computers’ provided respondents with a checklist of benefits that they could select from, and respondents were encouraged to choose as many of those benefits as they wanted, then the total number of checked items could be used as an aggregate measure of benefits. Note that many other forms of data—such as interview transcripts—cannot be converted into a numeric format for statistical analysis. Codebooks are especially important for large complex studies involving many variables and measurement items, where the coding process is conducted by different people, to help the coding team code data in a consistent manner, and also to help others understand and interpret the coded data.

Data entry. Coded data can be entered into a spreadsheet, database, text file, or directly into a statistical program like SPSS. Most statistical programs provide a data editor for entering data. However, these programs store data in their own native format—e.g., SPSS stores data as .sav files—which makes it difficult to share that data with other statistical programs. Hence, it is often better to enter data into a spreadsheet or database where it can be reorganised as needed, shared across programs, and subsets of data can be extracted for analysis. Smaller data sets with less than 65,000 observations and 256 items can be stored in a spreadsheet created using a program such as Microsoft Excel, while larger datasets with millions of observations will require a database. Each observation can be entered as one row in the spreadsheet, and each measurement item can be represented as one column. Data should be checked for accuracy during and after entry via occasional spot checks on a set of items or observations. Furthermore, while entering data, the coder should watch out for obvious evidence of bad data, such as the respondent selecting the ‘strongly agree’ response to all items irrespective of content, including reverse-coded items. If so, such data can be entered but should be excluded from subsequent analysis.

-1

Data transformation. Sometimes, it is necessary to transform data values before they can be meaningfully interpreted. For instance, reverse coded items—where items convey the opposite meaning of that of their underlying construct—should be reversed (e.g., in a 1-7 interval scale, 8 minus the observed value will reverse the value) before they can be compared or combined with items that are not reverse coded. Other kinds of transformations may include creating scale measures by adding individual scale items, creating a weighted index from a set of observed measures, and collapsing multiple values into fewer categories (e.g., collapsing incomes into income ranges).

Univariate analysis

Univariate analysis—or analysis of a single variable—refers to a set of statistical techniques that can describe the general properties of one variable. Univariate statistics include: frequency distribution, central tendency, and dispersion. The frequency distribution of a variable is a summary of the frequency—or percentages—of individual values or ranges of values for that variable. For instance, we can measure how many times a sample of respondents attend religious services—as a gauge of their ‘religiosity’—using a categorical scale: never, once per year, several times per year, about once a month, several times per month, several times per week, and an optional category for ‘did not answer’. If we count the number or percentage of observations within each category—except ‘did not answer’ which is really a missing value rather than a category—and display it in the form of a table, as shown in Figure 14.1, what we have is a frequency distribution. This distribution can also be depicted in the form of a bar chart, as shown on the right panel of Figure 14.1, with the horizontal axis representing each category of that variable and the vertical axis representing the frequency or percentage of observations within each category.

Frequency distribution of religiosity

With very large samples, where observations are independent and random, the frequency distribution tends to follow a plot that looks like a bell-shaped curve—a smoothed bar chart of the frequency distribution—similar to that shown in Figure 14.2. Here most observations are clustered toward the centre of the range of values, with fewer and fewer observations clustered toward the extreme ends of the range. Such a curve is called a normal distribution .

(15 + 20 + 21 + 20 + 36 + 15 + 25 + 15)/8=20.875

Lastly, the mode is the most frequently occurring value in a distribution of values. In the previous example, the most frequently occurring value is 15, which is the mode of the above set of test scores. Note that any value that is estimated from a sample, such as mean, median, mode, or any of the later estimates are called a statistic .

36-15=21

Bivariate analysis

Bivariate analysis examines how two variables are related to one another. The most common bivariate statistic is the bivariate correlation —often, simply called ‘correlation’—which is a number between -1 and +1 denoting the strength of the relationship between two variables. Say that we wish to study how age is related to self-esteem in a sample of 20 respondents—i.e., as age increases, does self-esteem increase, decrease, or remain unchanged?. If self-esteem increases, then we have a positive correlation between the two variables, if self-esteem decreases, then we have a negative correlation, and if it remains the same, we have a zero correlation. To calculate the value of this correlation, consider the hypothetical dataset shown in Table 14.1.

Normal distribution

After computing bivariate correlation, researchers are often interested in knowing whether the correlation is significant (i.e., a real one) or caused by mere chance. Answering such a question would require testing the following hypothesis:

\[H_0:\quad r = 0 \]

Social Science Research: Principles, Methods and Practices (Revised edition) Copyright © 2019 by Anol Bhattacherjee is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License , except where otherwise noted.

Share This Book

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • BMJ Open Access

Logo of bmjgroup

How to write statistical analysis section in medical research

Alok kumar dwivedi.

Department of Molecular and Translational Medicine, Division of Biostatistics and Epidemiology, Texas Tech University Health Sciences Center El Paso, El Paso, Texas, USA

Associated Data

jim-2022-002479supp001.pdf

Data sharing not applicable as no datasets generated and/or analyzed for this study.

Reporting of statistical analysis is essential in any clinical and translational research study. However, medical research studies sometimes report statistical analysis that is either inappropriate or insufficient to attest to the accuracy and validity of findings and conclusions. Published works involving inaccurate statistical analyses and insufficient reporting influence the conduct of future scientific studies, including meta-analyses and medical decisions. Although the biostatistical practice has been improved over the years due to the involvement of statistical reviewers and collaborators in research studies, there remain areas of improvement for transparent reporting of the statistical analysis section in a study. Evidence-based biostatistics practice throughout the research is useful for generating reliable data and translating meaningful data to meaningful interpretation and decisions in medical research. Most existing research reporting guidelines do not provide guidance for reporting methods in the statistical analysis section that helps in evaluating the quality of findings and data interpretation. In this report, we highlight the global and critical steps to be reported in the statistical analysis of grants and research articles. We provide clarity and the importance of understanding study objective types, data generation process, effect size use, evidence-based biostatistical methods use, and development of statistical models through several thematic frameworks. We also provide published examples of adherence or non-adherence to methodological standards related to each step in the statistical analysis and their implications. We believe the suggestions provided in this report can have far-reaching implications for education and strengthening the quality of statistical reporting and biostatistical practice in medical research.

Introduction

Biostatistics is the overall approach to how we realistically and feasibly execute a research idea to produce meaningful data and translate data to meaningful interpretation and decisions. In this era of evidence-based medicine and practice, basic biostatistical knowledge becomes essential for critically appraising research articles and implementing findings for better patient management, improving healthcare, and research planning. 1 However, it may not be sufficient for the proper execution and reporting of statistical analyses in studies. 2 3 Three things are required for statistical analyses, namely knowledge of the conceptual framework of variables, research design, and evidence-based applications of statistical analysis with statistical software. 4 5 The conceptual framework provides possible biological and clinical pathways between independent variables and outcomes with role specification of variables. The research design provides a protocol of study design and data generation process (DGP), whereas the evidence-based statistical analysis approach provides guidance for selecting and implementing approaches after evaluating data with the research design. 2 5 Ocaña-Riola 6 reported a substantial percentage of articles from high-impact medical journals contained errors in statistical analysis or data interpretation. These errors in statistical analyses and interpretation of results do not only impact the reliability of research findings but also influence the medical decision-making and planning and execution of other related studies. A survey of consulting biostatisticians in the USA reported that researchers frequently request biostatisticians for performing inappropriate statistical analyses and inappropriate reporting of data. 7 This implies that there is a need to enforce standardized reporting of the statistical analysis section in medical research which can also help rreviewers and investigators to improve the methodological standards of the study.

Biostatistical practice in medicine has been improving over the years due to continuous efforts in promoting awareness and involving expert services on biostatistics, epidemiology, and research design in clinical and translational research. 8–11 Despite these efforts, the quality of reporting of statistical analysis in research studies has often been suboptimal. 12 13 We noticed that none of the methods reporting documents were developed using evidence-based biostatistics (EBB) theory and practice. The EBB practice implies that the selection of statistical analysis methods for statistical analyses and the steps of results reporting and interpretation should be grounded based on the evidence generated in the scientific literature and according to the study objective type and design. 5 Previous works have not properly elucidated the importance of understanding EBB concepts and related reporting in the write-up of statistical analyses. As a result, reviewers sometimes ask to present data or execute analyses that do not match the study objective type. 14 We summarize the statistical analysis steps to be reported in the statistical analysis section based on review and thematic frameworks.

We identified articles describing statistical reporting problems in medicine using different search terms ( online supplemental table 1 ). Based on these studies, we prioritized commonly reported statistical errors in analytical strategies and developed essential components to be reported in the statistical analysis section of research grants and studies. We also clarified the purpose and the overall implication of reporting each step in statistical analyses through various examples.

Supplementary data

Although biostatistical inputs are critical for the entire research study ( online supplemental table 2 ), biostatistical consultations were mostly used for statistical analyses only 15 . Even though the conduct of statistical analysis mismatched with the study objective and DGP was identified as the major problem in articles submitted to high-impact medical journals. 16 In addition, multivariable analyses were often inappropriately conducted and reported in published studies. 17 18 In light of these statistical errors, we describe the reporting of the following components in the statistical analysis section of the study.

Step 1: specify study objective type and outcomes (overall approach)

The study objective type provides the role of important variables for a specified outcome in statistical analyses and the overall approach of the model building and model reporting steps in a study. In the statistical framework, the problems are classified into descriptive and inferential/analytical/confirmatory objectives. In the epidemiological framework, the analytical and prognostic problems are broadly classified into association, explanatory, and predictive objectives. 19 These study objectives ( figure 1 ) may be classified into six categories: (1) exploratory, (2) association, (3) causal, (4) intervention, (5) prediction and (6) clinical decision models in medical research. 20

An external file that holds a picture, illustration, etc.
Object name is jim-2022-002479f01.jpg

Comparative assessments of developing and reporting of study objective types and models. Association measures include odds ratio, risk ratio, or hazard ratio. AUC, area under the curve; C, confounder; CI, confidence interval; E, exposure; HbA1C: hemoglobin A1c; M, mediator; MFT, model fit test; MST, model specification test; PI, predictive interval; R 2 , coefficient of determinant; X, independent variable; Y, outcome.

The exploratory objective type is a specific type of determinant study and is commonly known as risk factors or correlates study in medical research. In an exploratory study, all covariates are considered equally important for the outcome of interest in the study. The goal of the exploratory study is to present the results of a model which gives higher accuracy after satisfying all model-related assumptions. In the association study, the investigator identifies predefined exposures of interest for the outcome, and variables other than exposures are also important for the interpretation and considered as covariates. The goal of an association study is to present the adjusted association of exposure with outcome. 20 In the causal objective study, the investigator is interested in determining the impact of exposure(s) on outcome using the conceptual framework. In this study objective, all variables should have a predefined role (exposures, confounders, mediators, covariates, and predictors) in a conceptual framework. A study with a causal objective is known as an explanatory or a confirmatory study in medical research. The goal is to present the direct or indirect effects of exposure(s) on an outcome after assessing the model’s fitness in the conceptual framework. 19 21 The objective of an interventional study is to determine the effect of an intervention on outcomes and is often known as randomized or non-randomized clinical trials in medical research. In the intervention objective model, all variables other than the intervention are treated as nuisance variables for primary analyses. The goal is to present the direct effect of the intervention on the outcomes by eliminating biases. 22–24 In the predictive study, the goal is to determine an optimum set of variables that can predict the outcome, particularly in external settings. The clinical decision models are a special case of prognostic models in which high dimensional data at various levels are used for risk stratification, classification, and prediction. In this model, all variables are considered input features. The goal is to present a decision tool that has high accuracy in training, testing, and validation data sets. 20 25 Biostatisticians or applied researchers should properly discuss the intention of the study objective type before proceeding with statistical analyses. In addition, it would be a good idea to prepare a conceptual model framework regardless of study objective type to understand study concepts.

A study 26 showed a favorable effect of the beta-blocker intervention on survival outcome in patients with advanced human epidermal growth factor receptor (HER2)-negative breast cancer without adjusting for all the potential confounding effects (age or menopausal status and Eastern Cooperative Oncology Performance Status) in primary analyses or validation analyses or using a propensity score-adjusted analysis, which is an EBB preferred method for analyzing non-randomized studies. 27 Similarly, another study had the goal of developing a predictive model for prediction of Alzheimer’s disease progression. 28 However, this study did not internally or externally validate the performance of the model as per the requirement of a predictive objective study. In another study, 29 investigators were interested in determining an association between metabolic syndrome and hepatitis C virus. However, the authors did not clearly specify the outcome in the analysis and produced conflicting associations with different analyses. 30 Thus, the outcome should be clearly specified as per the study objective type.

Step 2: specify effect size measure according to study design (interpretation and practical value)

The study design provides information on the selection of study participants and the process of data collection conditioned on either exposure or outcome ( figure 2 ). The appropriate use of effect size measure, tabular presentation of results, and the level of evidence are mostly determined by the study design. 31 32 In cohort or clinical trial study designs, the participants are selected based on exposure status and are followed up for the development of the outcome. These study designs can provide multiple outcomes, produce incidence or incidence density, and are preferred to be analyzed with risk ratio (RR) or hazards models. In a case–control study, the selection of participants is conditioned on outcome status. This type of study can have only one outcome and is preferred to be analyzed with an odds ratio (OR) model. In a cross-sectional study design, there is no selection restriction on outcomes or exposures. All data are collected simultaneously and can be analyzed with a prevalence ratio model, which is mathematically equivalent to the RR model. 33 The reporting of effect size measure also depends on the study objective type. For example, predictive models typically require reporting of regression coefficients or weight of variables in the model instead of association measures, which are required in other objective types. There are agreements and disagreements between OR and RR measures. Due to the constancy and symmetricity properties of OR, some researchers prefer to use OR in studies with common events. Similarly, the collapsibility and interpretability properties of RR make it more appealing to use in studies with common events. 34 To avoid variable practice and interpretation issues with OR, it is recommended to use RR models in all studies except for case–control and nested case–control studies, where OR approximates RR and thus OR models should be used. Otherwise, investigators may report sufficient data to compute any ratio measure. Biostatisticians should educate investigators on the proper interpretation of ratio measures in the light of study design and their reporting. 34 35

An external file that holds a picture, illustration, etc.
Object name is jim-2022-002479f02.jpg

Effect size according to study design.

Investigators sometimes either inappropriately label their study design 36 37 or report effect size measures not aligned with the study design, 38 39 leading to difficulty in results interpretation and evaluation of the level of evidence. The proper labeling of study design and the appropriate use of effect size measure have substantial implications for results interpretation, including the conduct of systematic review and meta-analysis. 40 A study 31 reviewed the frequency of reporting OR instead of RR in cohort studies and randomized clinical trials (RCTs) and found that one-third of the cohort studies used an OR model, whereas 5% of RCTs used an OR model. The majority of estimated ORs from these studies had a 20% or higher deviation from the corresponding RR.

Step 3: specify study hypothesis, reporting of p values, and interval estimates (interpretation and decision)

The clinical hypothesis provides information for evaluating formal claims specified in the study objectives, while the statistical hypothesis provides information about the population parameters/statistics being used to test the formal claims. The inference about the study hypothesis is typically measured by p value and confidence interval (CI). A smaller p value indicates that the data support against the null hypothesis. Since the p value is a conditional probability, it can never tell about the acceptance or rejection of the null hypothesis. Therefore, multiple alternative strategies of p values have been proposed to strengthen the credibility of conclusions. 41 42 Adaption of these alternative strategies is only needed in the explanatory objective studies. Although exact p values are recommended to be reported in research studies, p values do not provide any information about the effect size. Compared with p values, the CI provides a confidence range of the effect size that contains the true effect size if the study were repeated and can be used to determine whether the results are statistically significant or not. 43 Both p value and 95% CI provide complementary information and thus need to be specified in the statistical analysis section. 24 44

Researchers often test one or more comparisons or hypotheses. Accordingly, the side and the level of significance for considering results to be statistically significant may change. Furthermore, studies may include more than one primary outcome that requires an adjustment in the level of significance for multiplicity. All studies should provide the interval estimate of the effect size/regression coefficient in the primary analyses. Since the interpretation of data analysis depends on the study hypothesis, researchers are required to specify the level of significance along with the side (one-sided or two-sided) of the p value in the test for considering statistically significant results, adjustment of the level of significance due to multiple comparisons or multiplicity, and reporting of interval estimates of the effect size in the statistical analysis section. 45

A study 46 showed a significant effect of fluoxetine on relapse rates in obsessive-compulsive disorder based on a one-sided p value of 0.04. Clearly, there was no reason for using a one-sided p value as opposed to a two-sided p value. A review of the appropriate use of multiple test correction methods in multiarm clinical trials published in major medical journals in 2012 identified over 50% of the articles did not perform multiple-testing correction. 47 Similar to controlling a familywise error rate due to multiple comparisons, adjustment of the false discovery rate is also critical in studies involving multiple related outcomes. A review of RCTs for depression between 2007 and 2008 from six journals reported that only limited studies (5.8%) accounted for multiplicity in the analyses due to multiple outcomes. 48

Step 4: account for DGP in the statistical analysis (accuracy)

The study design also requires the specification of the selection of participants and outcome measurement processes in different design settings. We referred to this specific design feature as DGP. Understanding DGP helps in determining appropriate modeling of outcome distribution in statistical analyses and setting up model premises and units of analysis. 4 DGP ( figure 3 ) involves information on data generation and data measures, including the number of measurements after random selection, complex selection, consecutive selection, pragmatic selection, or systematic selection. Specifically, DGP depends on a sampling setting (participants are selected using survey sampling methods and one subject may represent multiple participants in the population), clustered setting (participants are clustered through a recruitment setting or hierarchical setting or multiple hospitals), pragmatic setting (participants are selected through mixed approaches), or systematic review setting (participants are selected from published studies). DGP also depends on the measurements of outcomes in an unpaired setting (measured on one occasion only in independent groups), paired setting (measured on more than one occasion or participants are matched on certain subject characteristics), or mixed setting (measured on more than one occasion but interested in comparing independent groups). It also involves information regarding outcomes or exposure generation processes using quantitative or categorical variables, quantitative values using labs or validated instruments, and self-reported or administered tests yielding a variety of data distributions, including individual distribution, mixed-type distribution, mixed distributions, and latent distributions. Due to different DGPs, study data may include messy or missing data, incomplete/partial measurements, time-varying measurements, surrogate measures, latent measures, imbalances, unknown confounders, instrument variables, correlated responses, various levels of clustering, qualitative data, or mixed data outcomes, competing events, individual and higher-level variables, etc. The performance of statistical analysis, appropriate estimation of standard errors of estimates and subsequently computation of p values, the generalizability of findings, and the graphical display of data rely on DGP. Accounting for DGP in the analyses requires proper communication between investigators and biostatisticians about each aspect of participant selection and data collection, including measurements, occasions of measurements, and instruments used in the research study.

An external file that holds a picture, illustration, etc.
Object name is jim-2022-002479f03.jpg

Common features of the data generation process.

A study 49 compared the intake of fresh fruit and komatsuna juice with the intake of commercial vegetable juice on metabolic parameters in middle-aged men using an RCT. The study was criticized for many reasons, but primarily for incorrect statistical methods not aligned with the study DGP. 50 Similarly, another study 51 highlighted that 80% of published studies using the Korean National Health and Nutrition Examination Survey did not incorporate survey sampling structure in statistical analyses, producing biased estimates and inappropriate findings. Likewise, another study 52 highlighted the need for maintaining methodological standards while analyzing data from the National Inpatient Sample. A systematic review 53 identified that over 50% of studies did not specify whether a paired t-test or an unpaired t-test was performed in statistical analysis in the top 25% of physiology journals, indicating poor transparency in reporting of statistical analysis as per the data type. Another study 54 also highlighted the data displaying errors not aligned with DGP. As per DGP, delay in treatment initiation of patients with cancer defined from the onset of symptom to treatment initiation should be analyzed into three components: patient/primary delay, secondary delay, and tertiary delay. 55 Similarly, the number of cancerous nodes should be analyzed with count data models. 56 However, several studies did not analyze such data according to DGP. 57 58

Step 5: apply EBB methods specific to study design features and DGP (efficiency and robustness)

The continuous growth in the development of robust statistical methods for dealing with a specific problem produced various methods to analyze specific data types. Since multiple methods are available for handling a specific problem yet with varying performances, heterogeneous practices among applied researchers have been noticed. Variable practices could also be due to a lack of consensus on statistical methods in literature, unawareness, and the unavailability of standardized statistical guidelines. 2 5 59 However, it becomes sometimes difficult to differentiate whether a specific method was used due to its robustness, lack of awareness, lack of accessibility of statistical software to apply an alternative appropriate method, intention to produce expected results, or ignorance of model diagnostics. To avoid heterogeneous practices, the selection of statistical methodology and their reporting at each stage of data analysis should be conducted using methods according to EBB practice. 5 Since it is hard for applied researchers to optimally select statistical methodology at each step, we encourage investigators to involve biostatisticians at the very early stage in basic, clinical, population, translational, and database research. We also appeal to biostatisticians to develop guidelines, checklists, and educational tools to promote the concept of EBB. As an effort, we developed the statistical analysis and methods in biomedical research (SAMBR) guidelines for applied researchers to use EBB methods for data analysis. 5 The EBB practice is essential for applying recent cutting-edge robust methodologies to yield accurate and unbiased results. The efficiency of statistical methodologies depends on the assumptions and DGP. Therefore, investigators may attempt to specify the choice of specific models in the primary analysis as per the EBB.

Although details of evidence-based preferred methods are provided in the SAMBR checklists for each study design/objective, 5 we have presented a simplified version of evidence-based preferred methods for common statistical analysis ( online supplemental table 3 ). Several examples are available in the literature where inefficient methods not according to EBB practice have been used. 31 57 60

Step 6: report variable selection method in the multivariable analysis according to study objective type (unbiased)

Multivariable analysis can be used for association, prediction or classification or risk stratification, adjustment, propensity score development, and effect size estimation. 61 Some biological, clinical, behavioral, and environmental factors may directly associate or influence the relationship between exposure and outcome. Therefore, almost all health studies require multivariable analyses for accurate and unbiased interpretations of findings ( figure 1 ). Analysts should develop an adjusted model if the sample size permits. It is a misconception that the analysis of RCT does not require adjusted analysis. Analysis of RCT may require adjustment for prognostic variables. 23 The foremost step in model building is the entry of variables after finalizing the appropriate parametric or non-parametric regression model. In the exploratory model building process due to no preference of exposures, a backward automated approach after including any variables that are significant at 25% in the unadjusted analysis can be used for variable selection. 62 63 In the association model, a manual selection of covariates based on the relevance of the variables should be included in a fully adjusted model. 63 In a causal model, clinically guided methods should be used for variable selection and their adjustments. 20 In a non-randomized interventional model, efforts should be made to eliminate confounding effects through propensity score methods and the final propensity score-adjusted multivariable model may adjust any prognostic variables, while a randomized study simply should adjust any prognostic variables. 27 Maintaining the event per variable (EVR) is important to avoid overfitting in any type of modeling; therefore, screening of variables may be required in some association and explanatory studies, which may be accomplished using a backward stepwise method that needs to be clarified in the statistical analyses. 10 In a predictive study, a model with an optimum set of variables producing the highest accuracy should be used. The optimum set of variables may be screened with the random forest method or bootstrap or machine learning methods. 64 65 Different methods of variable selection and adjustments may lead to different results. The screening process of variables and their adjustments in the final multivariable model should be clearly mentioned in the statistical analysis section.

A study 66 evaluating the effect of hydroxychloroquine (HDQ) showed unfavorable events (intubation or death) in patients who received HDQ compared with those who did not (hazard ratio (HR): 2.37, 95% CI 1.84 to 3.02) in an unadjusted analysis. However, the propensity score-adjusted analyses as appropriate with the interventional objective model showed no significant association between HDQ use and unfavorable events (HR: 1.04, 95% CI 0.82 to 1.32), which was also confirmed in multivariable and other propensity score-adjusted analyses. This study clearly suggests that results interpretation should be based on a multivariable analysis only in observational studies if feasible. A recent study 10 noted that approximately 6% of multivariable analyses based on either logistic or Cox regression used an inappropriate selection method of variables in medical research. This practice was more commonly noted in studies that did not involve an expert biostatistician. Another review 61 of 316 articles from high-impact Chinese medical journals revealed that 30.7% of articles did not report the selection of variables in multivariable models. Indeed, this inappropriate practice could have been identified more commonly if classified according to the study objective type. 18 In RCTs, it is uncommon to report an adjusted analysis based on prognostic variables, even though an adjusted analysis may produce an efficient estimate compared with an unadjusted analysis. A study assessing the effect of preemptive intervention on development outcomes showed a significant effect of an intervention on reducing autism spectrum disorder symptoms. 67 However, this study was criticized by Ware 68 for not reporting non-significant results in unadjusted analyses. If possible, unadjusted estimates should also be reported in any study, particularly in RCTs. 23 68

Step 7: provide evidence for exploring effect modifiers (applicability)

Any variable that modifies the effect of exposure on the outcome is called an effect modifier or modifier or an interacting variable. Exploring the effect modifiers in multivariable analyses helps in (1) determining the applicability/generalizability of findings in the overall or specific subpopulation, (2) generating ideas for new hypotheses, (3) explaining uninterpretable findings between unadjusted and adjusted analyses, (4) guiding to present combined or separate models for each specific subpopulation, and (5) explaining heterogeneity in treatment effect. Often, investigators present adjusted stratified results according to the presence or absence of an effect modifier. If the exposure interacts with multiple variables statistically or conceptually in the model, then the stratified findings (subgroup) according to each effect modifier may be presented. Otherwise, stratified analysis substantially reduces the power of the study due to the lower sample size in each stratum and may produce significant results by inflating type I error. 69 Therefore, a multivariable analysis involving an interaction term as opposed to a stratified analysis may be presented in the presence of an effect modifier. 70 Sometimes, a quantitative variable may emerge as a potential effect modifier for exposure and an outcome relationship. In such a situation, the quantitative variable should not be categorized unless a clinically meaningful threshold is not available in the study. In fact, the practice of categorizing quantitative variables should be avoided in the analysis unless a clinically meaningful cut-off is available or a hypothesis requires for it. 71 In an exploratory objective type, any possible interaction may be obtained in a study; however, the interpretation should be guided based on clinical implications. Similarly, some objective models may have more than one exposure or intervention and the association of each exposure according to the level of other exposure should be presented through adjusted analyses as suggested in the presence of interaction effects. 70

A review of 428 articles from MEDLINE on the quality of reporting from statistical analyses of three (linear, logistic, and Cox) commonly used regression models reported that only 18.5% of the published articles provided interaction analyses, 17 even though interaction analyses can provide a lot of useful information.

Step 8: assessment of assumptions, specifically the distribution of outcome, linearity, multicollinearity, sparsity, and overfitting (reliability)

The assessment and reporting of model diagnostics are important in assessing the efficiency, validity, and usefulness of the model. Model diagnostics include satisfying model-specific assumptions and the assessment of sparsity, linearity, distribution of outcome, multicollinearity, and overfitting. 61 72 Model-specific assumptions such as normal residuals, heteroscedasticity and independence of errors in linear regression, proportionality in Cox regression, proportionality odds assumption in ordinal logistic regression, and distribution fit in other types of continuous and count models are required. In addition, sparsity should also be examined prior to selecting an appropriate model. Sparsity indicates many zero observations in the data set. 73 In the presence of sparsity, the effect size is difficult to interpret. Except for machine learning models, most of the parametric and semiparametric models require a linear relationship between independent variables and a functional form of an outcome. Linearity should be assessed using a multivariable polynomial in all model objectives. 62 Similarly, the appropriate choice of the distribution of outcome is required for model building in all study objective models. Multicollinearity assessment is also useful in all objective models. Assessment of EVR in multivariable analysis can be used to avoid the overfitting issue of a multivariable model. 18

Some review studies highlighted that 73.8%–92% of the articles published in MEDLINE had not assessed the model diagnostics of the multivariable regression models. 17 61 72 Contrary to the monotonically, linearly increasing relationship between systolic blood pressure (SBP) and mortality established using the Framingham’s study, 74 Port et al 75 reported a non-linear relationship between SBP and all-cause mortality or cardiovascular deaths by reanalysis of the Framingham’s study data set. This study identified a different threshold for treating hypertension, indicating the role of linearity assessment in multivariable models. Although a non-Gaussian distribution model may be required for modeling patient delay outcome data in cancer, 55 a study analyzed patient delay data using an ordinary linear regression model. 57 An investigation of the development of predictive models and their reporting in medical journals identified that 53% of the articles had fewer EVR than the recommended EVR, indicating over half of the published articles may have an overfitting model. 18 Another study 76 attempted to identify the anthropometric variables associated with non-insulin-dependent diabetes and found that none of the anthropometric variables were significant after adjusting for waist circumference, age, and sex, indicating the presence of collinearity. A study reported detailed sparse data problems in published studies and potential solutions. 73

Step 9: report type of primary and sensitivity analyses (consistency)

Numerous considerations and assumptions are made throughout the research processes that require assessment, evaluation, and validation. Some assumptions, executions, and errors made at the beginning of the study data collection may not be fixable 13 ; however, additional information collected during the study and data processing, including data distribution obtained at the end of the study, may facilitate additional considerations that need to be verified in the statistical analyses. Consistencies in the research findings via modifications in the outcome or exposure definition, study population, accounting for missing data, model-related assumptions, variables and their forms, and accounting for adherence to protocol in the models can be evaluated and reported in research studies using sensitivity analyses. 77 The purpose and type of supporting analyses need to be specified clearly in the statistical analyses to differentiate the main findings from the supporting findings. Sensitivity analyses are different from secondary or interim or subgroup analyses. 78 Data analyses for secondary outcomes are often referred to as secondary analyses, while data analyses of an ongoing study are called interim analyses and data analyses according to groups based on patient characteristics are known as subgroup analyses.

Almost all studies require some form of sensitivity analysis to validate the findings under different conditions. However, it is often underutilized in medical journals. Only 18%–20.3% of studies reported some forms of sensitivity analyses. 77 78 A review of nutritional trials from high-quality journals reflected that 17% of the conclusions were reported inappropriately using findings from sensitivity analyses not based on the primary/main analyses. 77

Step 10: provide methods for summarizing, displaying, and interpreting data (transparency and usability)

Data presentation includes data summary, data display, and data from statistical model analyses. The primary purpose of the data summary is to understand the distribution of outcome status and other characteristics in the total sample and by primary exposure status or outcome status. Column-wise data presentation should be preferred according to exposure status in all study designs, while row-wise data presentation for the outcome should be preferred in all study designs except for a case–control study. 24 32 Summary statistics should be used to provide maximum information on data distribution aligned with DGP and variable type. The purpose of results presentation primarily from regression analyses or statistical models is to convey results interpretation and implications of findings. The results should be presented according to the study objective type. Accordingly, the reporting of unadjusted and adjusted associations of each factor with the outcome may be preferred in the determinant objective model, while unadjusted and adjusted effects of primary exposure on the outcome may be preferred in the explanatory objective model. In prognostic models, the final predictive models may be presented in such a way that users can use models to predict an outcome. In the exploratory objective model, a final multivariable model should be reported with R 2 or area under the curve (AUC). In the association and interventional models, the assessment of internal validation is critically important through various sensitivity and validation analyses. A model with better fit indices (in terms of R 2 or AUC, Akaike information criterion, Bayesian information criterion, fit index, root mean square error) should be finalized and reported in the causal model objective study. In the predictive objective type, the model performance in terms of R 2 or AUC in training and validation data sets needs to be reported ( figure 1 ). 20 21 There are multiple purposes of data display, including data distribution using bar diagram or histogram or frequency polygons or box plots, comparisons using cluster bar diagram or scatter dot plot or stacked bar diagram or Kaplan-Meier plot, correlation or model assessment using scatter plot or scatter matrix, clustering or pattern using heatmap or line plots, the effect of predictors with fitted models using marginsplot, and comparative evaluation of effect sizes from regression models using forest plot. Although the key purpose of data display is to highlight critical issues or findings in the study, data display should essentially follow DGP and variable types and should be user-friendly. 54 79 Data interpretation heavily relies on the effect size measure along with study design and specified hypotheses. Sometimes, variables require standardization for descriptive comparison of effect sizes among exposures or interpreting small effect size, or centralization for interpreting intercept or avoiding collinearity due to interaction terms, or transformation for achieving model-related assumptions. 80 Appropriate methods of data reporting and interpretation aligned with study design, study hypothesis, and effect size measure should be specified in the statistical analysis section of research studies.

Published articles from reputed journals inappropriately summarized a categorized variable with mean and range, 81 summarized a highly skewed variable with mean and standard deviation, 57 and treated a categorized variable as a continuous variable in regression analyses. 82 Similarly, numerous examples from published studies reporting inappropriate graphical display or inappropriate interpretation of data not aligned with DGP or variable types are illustrated in a book published by Bland and Peacock. 83 84 A study used qualitative data on MRI but inappropriately presented with a Box-Whisker plot. 81 Another study reported unusually high OR for an association between high breast parenchymal enhancement and breast cancer in both premenopausal and postmenopausal women. 85 This reporting makes suspicious findings and may include sparse data bias. 86 A poor tabular presentation without proper scaling or standardization of a variable, missing CI for some variables, missing unit and sample size, and inconsistent reporting of decimal places could be easily noticed in table 4 of a published study. 29 Some published predictive models 87 do not report intercept or baseline survival estimates to use their predictive models in clinical use. Although a direct comparison of effect sizes obtained from the same model may be avoided if the units are different among variables, 35 a study had an objective to compare effect sizes across variables but the authors performed comparisons without standardization of variables or using statistical tests. 88

A sample for writing statistical analysis section in medical journals/research studies

Our primary study objective type was to develop a (select from figure 1 ) model to assess the relationship of risk factors (list critical variables or exposures) with outcomes (specify type from continuous/discrete/count/binary/polytomous/time-to-event). To address this objective, we conducted a (select from figure 2 or any other) study design to test the hypotheses of (equality or superiority or non-inferiority or equivalence or futility) or develop prediction. Accordingly, the other variables were adjusted or considered as (specify role of variables from confounders, covariates, or predictors or independent variables) as reflected in the conceptual framework. In the unadjusted or preliminary analyses as per the (select from figure 3 or any other design features) DGP, (specify EBB preferred tests from online supplemental table 3 or any other appropriate tests) were used for (specify variables and types) in unadjusted analyses. According to the EBB practice for the outcome (specify type) and DGP of (select from figure 3 or any other), we used (select from online supplemental table 1 or specify a multivariable approach) as the primary model in the multivariable analysis. We used (select from figure 1 ) variable selection method in the multivariable analysis and explored the interaction effects between (specify variables). The model diagnostics including (list all applicable, including model-related assumptions, linearity, or multicollinearity or overfitting or distribution of outcome or sparsity) were also assessed using (specify appropriate methods) respectively. In such exploration, we identified (specify diagnostic issues if any) and therefore the multivariable models were developed using (specify potential methods used to handle diagnostic issues). The other outcomes were analyzed with (list names of multivariable approaches with respective outcomes). All the models used the same procedure (or specify from figure 1 ) for variable selection, exploration of interaction effects, and model diagnostics using (specify statistical approaches) depending on the statistical models. As per the study design, hypothesis, and multivariable analysis, the results were summarized with effect size (select as appropriate or from figure 2 ) along with (specify 95% CI or other interval estimates) and considered statistically significant using (specify the side of p value or alternatives) at (specify the level of significance) due to (provide reasons for choosing a significance level). We presented unadjusted and/or adjusted estimates of primary outcome according to (list primary exposures or variables). Additional analyses were conducted for (specific reasons from step 9) using (specify methods) to validate findings obtained in the primary analyses. The data were summarized with (list summary measures and appropriate graphs from step 10), whereas the final multivariable model performance was summarized with (fit indices if applicable from step 10). We also used (list graphs) as appropriate with DGP (specify from figure 3 ) to present the critical findings or highlight (specify data issues) using (list graphs/methods) in the study. The exposures or variables were used in (specify the form of the variables) and therefore the effect or association of (list exposures or variables) on outcome should be interpreted in terms of changes in (specify interpretation unit) exposures/variables. List all other additional analyses if performed (with full details of all models in a supplementary file along with statistical codes if possible).

Concluding remarks

We highlighted 10 essential steps to be reported in the statistical analysis section of any analytical study ( figure 4 ). Adherence to minimum reporting of the steps specified in this report may enforce investigators to understand concepts and approach biostatisticians timely to apply these concepts in their study to improve the overall quality of methodological standards in grant proposals and research studies. The order of reporting information in statistical analyses specified in this report is not mandatory; however, clear reporting of analytical steps applicable to the specific study type should be mentioned somewhere in the manuscript. Since the entire approach of statistical analyses is dependent on the study objective type and EBB practice, proper execution and reporting of statistical models can be taught to the next generation of statisticians by the study objective type in statistical education courses. In fact, some disciplines ( figure 5 ) are strictly aligned with specific study objective types. Bioinformaticians are oriented in studying determinant and prognostic models toward precision medicine, while epidemiologists are oriented in studying association and causal models, particularly in population-based observational and pragmatic settings. Data scientists are heavily involved in prediction and classification models in personalized medicine. A common thing across disciplines is using biostatistical principles and computation tools to address any research question. Sometimes, one discipline expert does the part of others. 89 We strongly recommend using a team science approach that includes an epidemiologist, biostatistician, data scientist, and bioinformatician depending on the study objectives and needs. Clear reporting of data analyses as per the study objective type should be encouraged among all researchers to minimize heterogeneous practices and improve scientific quality and outcomes. In addition, we also encourage investigators to strictly follow transparent reporting and quality assessment guidelines according to the study design ( https://www.equator-network.org/ ) to improve the overall quality of the study, accordingly STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) for observational studies, CONSORT (Consolidated Standards of Reporting Trials) for clinical trials, STARD (Standards for Reporting Diagnostic Accuracy Studies) for diagnostic studies, TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis OR Diagnosis) for prediction modeling, and ARRIVE (Animal Research: Reporting of In Vivo Experiments) for preclinical studies. The steps provided in this document for writing the statistical analysis section is essentially different from other guidance documents, including SAMBR. 5 SAMBR provides a guidance document for selecting evidence-based preferred methods of statistical analysis according to different study designs, while this report suggests the global reporting of essential information in the statistical analysis section according to study objective type. In this guidance report, our suggestion strictly pertains to the reporting of methods in the statistical analysis section and their implications on the interpretation of results. Our document does not provide guidance on the reporting of sample size or results or statistical analysis section for meta-analysis. The examples and reviews reported in this study may be used to emphasize the concepts and related implications in medical research.

An external file that holds a picture, illustration, etc.
Object name is jim-2022-002479f04.jpg

Summary of reporting steps, purpose, and evaluation measures in the statistical analysis section.

An external file that holds a picture, illustration, etc.
Object name is jim-2022-002479f05.jpg

Role of interrelated disciplines according to study objective type.

Acknowledgments

The author would like to thank the reviewers for their careful review and insightful suggestions.

Contributors: AKD developed the concept and design and wrote the manuscript.

Funding: The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

Competing interests: AKD is a Journal of Investigative Medicine Editorial Board member. No other competing interests declared.

Provenance and peer review: Commissioned; externally peer reviewed.

Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.

Data availability statement

Ethics statements, patient consent for publication.

Not required.

  • Skip to primary navigation
  • Skip to main content

Aggie Research Programs

Texas A&M University

Spring 2024: NASA Mars rover “Perseverance”: images and composition data analysis

WORLD TRADE ORGANIZATION

Home   |  About WTO   |  News & events   |  Trade topics   |  WTO membership   |  Documents & resources   |  External relations

Contact us   |  Site map   |  A-Z   |  Search

español   français

ECONOMIC RESEARCH AND ANALYSIS

  

Four separate WTO datasets are available, covering digitally delivered services, trade in services by mode of supply, trade in commercial services, and the WTO-OECD Balanced Trade in Services dataset. Users can create and download individual charts and tables or bulk download data series according to their needs.

The dataset on digitally delivered services trade contains WTO estimates on services traded through computer networks, such as the Internet, apps, emails, voice and video calls, and digital intermediation platforms. The coverage of digitally delivered services reflects the definition of the  IMF-OECD-UNCTAD-WTO Handbook on Measuring Digital Trade .  It covers over 200 economies and regions and eight sub-sectors for the period 2005-23.

The Trade in Services by Mode of Supply (TISMOS) dataset, produced by the WTO, contains estimates of services trade through the four modes of supply identified in the WTO's General Agreement on Trade in Services (GATS), as of April 2024. It covers over 200 economies and regions for the period 2005-22 for 55 sectors. It provides valuable insights into how services trade has modified over the years, including the impact of digitalization and of the COVID-19 pandemic.

The Trade in Commercial Services dataset contains first estimates on commercial services trade until 2023 for some 150 economies and regions based on quarterly annualized statistics and for eight broad services sectors. It is jointly produced with UNCTAD. Final estimates for over 50 sectors will be released in July 2024.

The Balanced Trade in Services (BaTiS) dataset, developed with the OECD, contains a balanced matrix of international trade in services for over 200 reporting economies and their partners by sector for 2005-21. The dataset is the only source of comprehensive bilateral data on trade in services. The dataset will be updated and expanded in 2024.

The Hub will be further enhanced with new features and content and will be regularly updated.

The Global Services Trade Data Hub can be accessed here .

Photo gallery

Problems viewing this page? If so, please contact [email protected] giving details of the operating system and web browser you are using.

IMAGES

  1. 5 Steps of the Data Analysis Process

    data analysis in research scribd

  2. Standard statistical tools in research and data analysis

    data analysis in research scribd

  3. Data Analysis in Research| Types of Data Analysis in Research| Data

    data analysis in research scribd

  4. Tools for data analysis in research example

    data analysis in research scribd

  5. 6+ Data Analysis Report Templates

    data analysis in research scribd

  6. Data Analysis Introduction Example

    data analysis in research scribd

VIDEO

  1. Data Analysis

  2. Epidata version 3.1 for data entry

  3. Analysis in SPSS

  4. Data Analysis Using #SPSS (Part 1)

  5. Module 3: Data Import and Package Installation in R Studio

  6. What is Data Analysis in research

COMMENTS

  1. Data Analysis in Research

    Data Analysis in Research - Free download as Word Doc (.doc), PDF File (.pdf), Text File (.txt) or read online for free. Data Analysis is a process of inspecting, cleaning, transforming, and modeling data. The goal of Data Analysis is to highlight useful information, suggesting conclusions. Data Analysis has multiple facets and approaches, in different business, science, and social science ...

  2. How to Write a Results Section

    The most logical way to structure quantitative results is to frame them around your research questions or hypotheses. For each question or hypothesis, share: A reminder of the type of analysis you used (e.g., a two-sample t test or simple linear regression). A more detailed description of your analysis should go in your methodology section.

  3. A practical guide to data analysis in general literature reviews

    This article is a practical guide to conducting data analysis in general literature reviews. The general literature review is a synthesis and analysis of published research on a relevant clinical issue, and is a common format for academic theses at the bachelor's and master's levels in nursing, physiotherapy, occupational therapy, public health and other related fields.

  4. The Beginner's Guide to Statistical Analysis

    Step 1: Write your hypotheses and plan your research design. To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design. Writing statistical hypotheses. The goal of research is often to investigate a relationship between variables within a population. You start with a prediction ...

  5. Learning to Do Qualitative Data Analysis: A Starting Point

    For many researchers unfamiliar with qualitative research, determining how to conduct qualitative analyses is often quite challenging. Part of this challenge is due to the seemingly limitless approaches that a qualitative researcher might leverage, as well as simply learning to think like a qualitative researcher when analyzing data. From framework analysis (Ritchie & Spencer, 1994) to content ...

  6. Creating a Data Analysis Plan: What to Consider When Choosing

    For those interested in conducting qualitative research, previous articles in this Research Primer series have provided information on the design and analysis of such studies. 2, 3 Information in the current article is divided into 3 main sections: an overview of terms and concepts used in data analysis, a review of common methods used to ...

  7. PDF 2 An Introduction to Data Analysis

    acquiring skills in data analysis. • List the components of data analysis and how they fit together. • Form hypotheses from descriptions of data. • Explain the connection between hypotheses, models, and estimates. • Define diagnostics and explain their role in data analysis. • Formulate new questions. 2 An Introduction to Data ...

  8. Research Design: Decide on your Data Analysis Strategy

    The last step of designing your research is planning your data analysis strategies. In this video, we'll take a look at some common approaches for both quant...

  9. What is data analysis? Methods, techniques, types & how-to

    A method of data analysis that is the umbrella term for engineering metrics and insights for additional value, direction, and context. By using exploratory statistical evaluation, data mining aims to identify dependencies, relations, patterns, and trends to generate advanced knowledge.

  10. Quantitative analysis: Descriptive statistics

    Numeric data collected in a research project can be analysed quantitatively using statistical tools in two different ways. Descriptive analysis refers to statistically describing, aggregating, and presenting the constructs of interest or associations between these constructs.Inferential analysis refers to the statistical testing of hypotheses (theory testing).

  11. PDF CHAPTER 4 QUALITATIVE DATA ANALYSIS

    4.1 INTRODUCTION. In this chapter, I describe the qualitative analysis of the data, including the practical steps involved in the analysis. A quantitative analysis of the data follows in Chapter 5. In the qualitative phase, I analyzed the data into generative themes, which will be described individually. I describe how the themes overlap.

  12. How to write statistical analysis section in medical research

    Results. Although biostatistical inputs are critical for the entire research study (online supplemental table 2), biostatistical consultations were mostly used for statistical analyses only 15.Even though the conduct of statistical analysis mismatched with the study objective and DGP was identified as the major problem in articles submitted to high-impact medical journals. 16 In addition ...

  13. PDF 3.1 Introduction: the qualitative research paradigm

    In qualitative research, the related processes of collecting, analyzing, and interpreting data is not necessarily a chronological, step-by-step process as it tends to be in the quantitative research design (Neuman, 2006, p. 15). Instead, Neuman (p. 15) explains that all three processes of data collection, data analysis, and interpretation occur

  14. PDF Chapter 4 DATA ANALYSIS AND RESEARCH FINDINGS

    4.1 INTRODUCTION. This chapter describes the analysis of data followed by a discussion of the research findings. The findings relate to the research questions that guided the study. Data were analyzed to identify, describe and explore the relationship between death anxiety and death attitudes of nurses in a private acute care hospital and to ...

  15. Introduction to Data Analysis in Qualitative Research

    The author offers guidance for data analysis while distinguishing among the processes for each of the research types. The book includes a link to the "Narralizer" -unique and friendly software, accompanied by a detailed guide, to help researchers conduct qualitative research.

  16. Basic Statistical Tools in Research and Data Analysis

    Basic Statistical Tools in Research and Data Analysis - Free download as PDF File (.pdf), Text File (.txt) or read online for free. Basic Statistical Tools in Research and Data Analysis

  17. PDF Chapter 4: Analysis and Interpretation of Results

    To complete this study properly, it is necessary to analyse the data collected in order to test the hypothesis and answer the research questions. As already indicated in the preceding chapter, data is interpreted in a descriptive form. This chapter comprises the analysis, presentation and interpretation of the findings resulting from this study.

  18. PDF Chapter 6: Data Analysis and Interpretation 6.1. Introduction

    methods research design, (cf. par. 5.7, p. 321, p. Fig. 16, p. 318; 17, p. 326; 18, p. 327). The mixed methods research design were applied in this research study to acquire an experiential ... data analysis well, when he provides the following definition of qualitative data analysis that serves

  19. Spring 2024: NASA Mars rover "Perseverance": images and composition

    The analysis of the rover *images* will require using a ImageJ and/or a GIS system (such as ArcGIS, QGIS); prior experience with processing and analyzing scientific images or with a GIS is not required, but would help the research be more efficient. The analysis of the compositional data of martian rocks (from the SuperCam instrument) requires ...

  20. New Global Services Trade Data Hub offers extensive, tailored ...

    The new Global Services Trade Data Hub now available on the WTO website provides access to comprehensive data on services trade. It provides visualizations and customizable features for download — catering to the diverse needs of trade negotiators, analysts, researchers and decision-makers — to deliver insights into global services trade.

  21. Data Center Transformers Market Report 2024: Edge Computing

    Dublin, April 11, 2024 (GLOBE NEWSWIRE) -- The "Global Data Center Transformer Market Size, Market Share, Application Analysis, Regional Outlook, Growth Trends, Key Players, Competitive Strategies ...