Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • View all journals
  • Explore content
  • About the journal
  • Publish with us
  • Sign up for alerts

Statistics articles within Scientific Reports

Article 10 April 2024 | Open Access

Response times are affected by mispredictions in a stochastic game

  • Paulo Roberto Cabral-Passos
  • , Antonio Galves
  •  &  Claudia D. Vargas

The effect of city reputation on Chinese corporate risk-taking

  •  &  Haifeng Jiang

Article 06 April 2024 | Open Access

Improvement in variance estimation using transformed auxiliary variable under simple random sampling

  • , Syed Muhammad Asim
  •  &  Soofia Iftikhar

Article 28 March 2024 | Open Access

Fatty liver classification via risk controlled neural networks trained on grouped ultrasound image data

  • Tso-Jung Yen
  • , Chih-Ting Yang
  •  &  Hsin-Chou Yang

Article 27 March 2024 | Open Access

A new unit distribution: properties, estimation, and regression analysis

  • Kadir Karakaya
  • , C. S. Rajitha
  •  &  Ahmed M. Gemeay

Article 26 March 2024 | Open Access

GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides

  • Jaskaran Singh
  • , Narendra N. Khanna
  •  &  Jasjit S. Suri

On topological indices and entropy measures of beryllonitrene network via logarithmic regression model

  • , Muhammad Kamran Siddiqui
  •  &  Fikre Bogale Petros

Article 22 March 2024 | Open Access

Measuring the similarity of charts in graphical statistics

  • Krzysztof Górnisiewicz
  • , Zbigniew Palka
  •  &  Waldemar Ratajczak

Article 21 March 2024 | Open Access

Risk prediction and interaction analysis using polygenic risk score of type 2 diabetes in a Korean population

  • Minsun Song
  • , Soo Heon Kwak
  •  &  Jihyun Kim

A longitudinal causal graph analysis investigating modifiable risk factors and obesity in a European cohort of children and adolescents

  • Ronja Foraita
  • , Janine Witte
  •  &  Vanessa Didelez

Article 19 March 2024 | Open Access

A novel group decision making method based on CoCoSo and interval-valued Q-rung orthopair fuzzy sets

  • , Hongwu Qin
  •  &  Xiuqin Ma

Impact of using virtual avatars in educational videos on user experience

  • Ruyuan Zhang
  •  &  Qun Wu

A generalisation of the method of regression calibration and comparison with Bayesian and frequentist model averaging methods

  • Mark P. Little
  • , Nobuyuki Hamada
  •  &  Lydia B. Zablotska

Article 18 March 2024 | Open Access

Monitoring gamma type-I censored data using an exponentially weighted moving average control chart based on deep learning networks

  • Pei-Hsi Lee
  •  &  Shih-Lung Liao

Article 15 March 2024 | Open Access

Statistical detection of selfish mining in proof-of-work blockchain systems

  • Sheng-Nan Li
  • , Carlo Campajola
  •  &  Claudio J. Tessone

Article 13 March 2024 | Open Access

Evaluation metrics and statistical tests for machine learning

  • Oona Rainio
  • , Jarmo Teuho
  •  &  Riku Klén

PARSEG: a computationally efficient approach for statistical validation of botanical seeds’ images

  • Luca Frigau
  • , Claudio Conversano
  •  &  Jaromír Antoch

Application of analysis of variance to determine important features of signals for diagnostic classifiers of displacement pumps

  • Jarosław Konieczny
  • , Waldemar Łatas
  •  &  Jerzy Stojek

Article 12 March 2024 | Open Access

Prediction and detection of side effects severity following COVID-19 and influenza vaccinations: utilizing smartwatches and smartphones

  • , Margaret L. Brandeau
  •  &  Dan Yamin

Article 08 March 2024 | Open Access

Evaluating the lifetime performance index of omega distribution based on progressive type-II censored samples

  • N. M. Kilany
  •  &  Lobna H. El-Refai

Article 07 March 2024 | Open Access

Development of risk models of incident hypertension using machine learning on the HUNT study data

  • Filip Emil Schjerven
  • , Emma Maria Lovisa Ingeström
  •  &  Frank Lindseth

Article 06 March 2024 | Open Access

Online trend estimation and detection of trend deviations in sub-sewershed time series of SARS-CoV-2 RNA measured in wastewater

  • Katherine B. Ensor
  • , Julia C. Schedler
  •  &  Loren Hopkins

Article 05 March 2024 | Open Access

Machine learning and XAI approaches highlight the strong connection between \(O_3\) and \(NO_2\) pollutants and Alzheimer’s disease

  • Alessandro Fania
  • , Alfonso Monaco
  •  &  Roberto Bellotti

Article 04 March 2024 | Open Access

High-precision regressors for particle physics

  • Fady Bishara
  • , Ayan Paul
  •  &  Jennifer Dy

Applying explainable artificial intelligence methods to models for diagnosing personal traits and cognitive abilities by social network data

  • Anastasia S. Panfilova
  •  &  Denis Yu. Turdakov

Application of machine learning with large-scale data for an effective vaccination against classical swine fever for wild boar in Japan

  • Satoshi Ito
  • , Cecilia Aguilar-Vega
  •  &  José Manuel Sánchez-Vizcaíno

Article 01 March 2024 | Open Access

Evaluating public opinions: informing public health policy adaptations in China amid the COVID-19 pandemic

  • Chenyang Wang
  • , Xinzhi Wang
  •  &  Hui Zhang

Article 26 February 2024 | Open Access

Quality assessment and community detection methods for anonymized mobility data in the Italian Covid context

  • Jules Morand
  • , Shoichi Yip
  •  &  Luca Tubiana

Article 23 February 2024 | Open Access

The disparate impacts of college admissions policies on Asian American applicants

  • Joshua Grossman
  • , Sabina Tomkins
  •  &  Sharad Goel

Article 22 February 2024 | Open Access

Spectrum analysis of digital UPWM signals generated from random modulating signals

  • Konstantinos Kaleris
  • , Emmanouil Psarakis
  •  &  John Mourjopoulos

Article 21 February 2024 | Open Access

A novel extension of half-logistic distribution with statistical inference, estimation and applications

  • , S. P. Ahmad
  •  &  Oluwafemi Samson Balogun

Article 20 February 2024 | Open Access

Bayesian spatio-temporal analysis of the COVID-19 pandemic in Catalonia

  • Pau Satorra
  •  &  Cristian Tebé

Article 19 February 2024 | Open Access

A raster-based spatial clustering method with robustness to spatial outliers

  • , Changqing Song
  •  &  Peichao Gao

A bi-level framework for real-time crash risk forecasting using artificial intelligence-based video analytics

  • Fizza Hussain
  • , Yasir Ali
  •  &  Md Mazharul Haque

Article 14 February 2024 | Open Access

A machine learning model to predict privacy fatigued users from social media personalized advertisements

  • Ghadeer Alwafi
  •  &  Bahjat Fakieh

Article 13 February 2024 | Open Access

Commodity-specific triads in the Dutch inter-industry production network

  • Marzio Di Vece
  • , Frank P. Pijpers
  •  &  Diego Garlaschelli

Article 12 February 2024 | Open Access

A new function for drug combination dose finding trials

  • Jiacheng Xiao
  •  &  Weijia Zhang

Article 08 February 2024 | Open Access

Impact of material supply chain on the productivity optimization for the construction of roads projects

  • Shrouk Awaad
  • , Dina M. Mansour
  •  &  Ibrahim Abdelrasheed

Article 07 February 2024 | Open Access

Memory type Max-EWMA control chart for the Weibull process under the Bayesian theory

  • Muhammad Noor-ul-Amin
  • , Imad Khan
  •  &  Bakhtyar Ahmad

Global poverty estimation using private and public sector big data sources

  • Robert Marty
  •  &  Alice Duhaut

Saddlepoint p-values for a class of location-scale tests under randomized block design

  • Haidy N. Mohamed
  • , Ehab F. Abd-Elfattah
  •  &  Abd El-Raheem M. Abd El-Raheem

Article 05 February 2024 | Open Access

Categorization of disaster-related deaths in Minamisoma city after the Fukushima nuclear disaster using clustering analysis

  • Hiroki Yoshimura
  • , Toyoaki Sawano
  •  &  Masaharu Tsubokura

Prediction of the axial compression capacity of stub CFST columns using machine learning techniques

  • Khaled Megahed
  • , Nabil Said Mahmoud
  •  &  Saad Elden Mostafa Abd-Rabou

Article 28 January 2024 | Open Access

Identifying the need for infection-related consultations in intensive care patients using machine learning models

  • Leslie R. Zwerwer
  • , Christian F. Luz
  •  &  Bhanu Sinha

Article 27 January 2024 | Open Access

A fragmented neural network ensemble method and its application to image classification

  • , Shuai Liu
  •  &  Yumei Li

Article 22 January 2024 | Open Access

Drivers’ behavior confronting fixed and point-to-point speed enforcement camera: agent-based simulation and translation to crash relative risk change

  • Seyed Amir Ahmad Safavi-Naini
  • , Shayan Sobhani
  •  &  Farshad Farzadfar

Article 19 January 2024 | Open Access

Dataset meta-level and statistical features affect machine learning performance

  • Shahadat Uddin
  •  &  Haohui Lu

Improving long-term multivariate time series forecasting with a seasonal-trend decomposition-based 2-dimensional temporal convolution dense network

  • Jianhua Hao
  •  &  Fangai Liu

A hierarchical approach for evaluating athlete performance with an application in elite basketball

  • Thiago de Paula Oliveira
  •  &  John Newell

A novel generalized Weibull Poisson G class of continuous probabilistic distributions with some copulas, properties and applications to real-life datasets

  • Atef F. Hashem
  • , M. A. Abdelkawy
  •  &  Haitham M. Yousof

Advertisement

Browse broader subjects

  • Mathematics and computing

Quick links

  • Explore articles by subject
  • Guide to authors
  • Editorial policies

research papers in statistics

Logo for The Wharton School

  • Youth Program
  • Wharton Online

Research Papers / Publications

Have a language expert improve your writing

Run a free plagiarism check in 10 minutes, generate accurate citations for free.

  • Knowledge Base

The Beginner's Guide to Statistical Analysis | 5 Steps & Examples

Statistical analysis means investigating trends, patterns, and relationships using quantitative data . It is an important research tool used by scientists, governments, businesses, and other organizations.

To draw valid conclusions, statistical analysis requires careful planning from the very start of the research process . You need to specify your hypotheses and make decisions about your research design, sample size, and sampling procedure.

After collecting data from your sample, you can organize and summarize the data using descriptive statistics . Then, you can use inferential statistics to formally test hypotheses and make estimates about the population. Finally, you can interpret and generalize your findings.

This article is a practical introduction to statistical analysis for students and researchers. We’ll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables.

Table of contents

Step 1: write your hypotheses and plan your research design, step 2: collect data from a sample, step 3: summarize your data with descriptive statistics, step 4: test hypotheses or make estimates with inferential statistics, step 5: interpret your results, other interesting articles.

To collect valid data for statistical analysis, you first need to specify your hypotheses and plan out your research design.

Writing statistical hypotheses

The goal of research is often to investigate a relationship between variables within a population . You start with a prediction, and use statistical analysis to test that prediction.

A statistical hypothesis is a formal way of writing a prediction about a population. Every research prediction is rephrased into null and alternative hypotheses that can be tested using sample data.

While the null hypothesis always predicts no effect or no relationship between variables, the alternative hypothesis states your research prediction of an effect or relationship.

  • Null hypothesis: A 5-minute meditation exercise will have no effect on math test scores in teenagers.
  • Alternative hypothesis: A 5-minute meditation exercise will improve math test scores in teenagers.
  • Null hypothesis: Parental income and GPA have no relationship with each other in college students.
  • Alternative hypothesis: Parental income and GPA are positively correlated in college students.

Planning your research design

A research design is your overall strategy for data collection and analysis. It determines the statistical tests you can use to test your hypothesis later on.

First, decide whether your research will use a descriptive, correlational, or experimental design. Experiments directly influence variables, whereas descriptive and correlational studies only measure variables.

  • In an experimental design , you can assess a cause-and-effect relationship (e.g., the effect of meditation on test scores) using statistical tests of comparison or regression.
  • In a correlational design , you can explore relationships between variables (e.g., parental income and GPA) without any assumption of causality using correlation coefficients and significance tests.
  • In a descriptive design , you can study the characteristics of a population or phenomenon (e.g., the prevalence of anxiety in U.S. college students) using statistical tests to draw inferences from sample data.

Your research design also concerns whether you’ll compare participants at the group level or individual level, or both.

  • In a between-subjects design , you compare the group-level outcomes of participants who have been exposed to different treatments (e.g., those who performed a meditation exercise vs those who didn’t).
  • In a within-subjects design , you compare repeated measures from participants who have participated in all treatments of a study (e.g., scores from before and after performing a meditation exercise).
  • In a mixed (factorial) design , one variable is altered between subjects and another is altered within subjects (e.g., pretest and posttest scores from participants who either did or didn’t do a meditation exercise).
  • Experimental
  • Correlational

First, you’ll take baseline test scores from participants. Then, your participants will undergo a 5-minute meditation exercise. Finally, you’ll record participants’ scores from a second math test.

In this experiment, the independent variable is the 5-minute meditation exercise, and the dependent variable is the math test score from before and after the intervention. Example: Correlational research design In a correlational study, you test whether there is a relationship between parental income and GPA in graduating college students. To collect your data, you will ask participants to fill in a survey and self-report their parents’ incomes and their own GPA.

Measuring variables

When planning a research design, you should operationalize your variables and decide exactly how you will measure them.

For statistical analysis, it’s important to consider the level of measurement of your variables, which tells you what kind of data they contain:

  • Categorical data represents groupings. These may be nominal (e.g., gender) or ordinal (e.g. level of language ability).
  • Quantitative data represents amounts. These may be on an interval scale (e.g. test score) or a ratio scale (e.g. age).

Many variables can be measured at different levels of precision. For example, age data can be quantitative (8 years old) or categorical (young). If a variable is coded numerically (e.g., level of agreement from 1–5), it doesn’t automatically mean that it’s quantitative instead of categorical.

Identifying the measurement level is important for choosing appropriate statistics and hypothesis tests. For example, you can calculate a mean score with quantitative data, but not with categorical data.

In a research study, along with measures of your variables of interest, you’ll often collect data on relevant participant characteristics.

Prevent plagiarism. Run a free check.

Population vs sample

In most cases, it’s too difficult or expensive to collect data from every member of the population you’re interested in studying. Instead, you’ll collect data from a sample.

Statistical analysis allows you to apply your findings beyond your own sample as long as you use appropriate sampling procedures . You should aim for a sample that is representative of the population.

Sampling for statistical analysis

There are two main approaches to selecting a sample.

  • Probability sampling: every member of the population has a chance of being selected for the study through random selection.
  • Non-probability sampling: some members of the population are more likely than others to be selected for the study because of criteria such as convenience or voluntary self-selection.

In theory, for highly generalizable findings, you should use a probability sampling method. Random selection reduces several types of research bias , like sampling bias , and ensures that data from your sample is actually typical of the population. Parametric tests can be used to make strong statistical inferences when data are collected using probability sampling.

But in practice, it’s rarely possible to gather the ideal sample. While non-probability samples are more likely to at risk for biases like self-selection bias , they are much easier to recruit and collect data from. Non-parametric tests are more appropriate for non-probability samples, but they result in weaker inferences about the population.

If you want to use parametric tests for non-probability samples, you have to make the case that:

  • your sample is representative of the population you’re generalizing your findings to.
  • your sample lacks systematic bias.

Keep in mind that external validity means that you can only generalize your conclusions to others who share the characteristics of your sample. For instance, results from Western, Educated, Industrialized, Rich and Democratic samples (e.g., college students in the US) aren’t automatically applicable to all non-WEIRD populations.

If you apply parametric tests to data from non-probability samples, be sure to elaborate on the limitations of how far your results can be generalized in your discussion section .

Create an appropriate sampling procedure

Based on the resources available for your research, decide on how you’ll recruit participants.

  • Will you have resources to advertise your study widely, including outside of your university setting?
  • Will you have the means to recruit a diverse sample that represents a broad population?
  • Do you have time to contact and follow up with members of hard-to-reach groups?

Your participants are self-selected by their schools. Although you’re using a non-probability sample, you aim for a diverse and representative sample. Example: Sampling (correlational study) Your main population of interest is male college students in the US. Using social media advertising, you recruit senior-year male college students from a smaller subpopulation: seven universities in the Boston area.

Calculate sufficient sample size

Before recruiting participants, decide on your sample size either by looking at other studies in your field or using statistics. A sample that’s too small may be unrepresentative of the sample, while a sample that’s too large will be more costly than necessary.

There are many sample size calculators online. Different formulas are used depending on whether you have subgroups or how rigorous your study should be (e.g., in clinical research). As a rule of thumb, a minimum of 30 units or more per subgroup is necessary.

To use these calculators, you have to understand and input these key components:

  • Significance level (alpha): the risk of rejecting a true null hypothesis that you are willing to take, usually set at 5%.
  • Statistical power : the probability of your study detecting an effect of a certain size if there is one, usually 80% or higher.
  • Expected effect size : a standardized indication of how large the expected result of your study will be, usually based on other similar studies.
  • Population standard deviation: an estimate of the population parameter based on a previous study or a pilot study of your own.

Once you’ve collected all of your data, you can inspect them and calculate descriptive statistics that summarize them.

Inspect your data

There are various ways to inspect your data, including the following:

  • Organizing data from each variable in frequency distribution tables .
  • Displaying data from a key variable in a bar chart to view the distribution of responses.
  • Visualizing the relationship between two variables using a scatter plot .

By visualizing your data in tables and graphs, you can assess whether your data follow a skewed or normal distribution and whether there are any outliers or missing data.

A normal distribution means that your data are symmetrically distributed around a center where most values lie, with the values tapering off at the tail ends.

Mean, median, mode, and standard deviation in a normal distribution

In contrast, a skewed distribution is asymmetric and has more values on one end than the other. The shape of the distribution is important to keep in mind because only some descriptive statistics should be used with skewed distributions.

Extreme outliers can also produce misleading statistics, so you may need a systematic approach to dealing with these values.

Calculate measures of central tendency

Measures of central tendency describe where most of the values in a data set lie. Three main measures of central tendency are often reported:

  • Mode : the most popular response or value in the data set.
  • Median : the value in the exact middle of the data set when ordered from low to high.
  • Mean : the sum of all values divided by the number of values.

However, depending on the shape of the distribution and level of measurement, only one or two of these measures may be appropriate. For example, many demographic characteristics can only be described using the mode or proportions, while a variable like reaction time may not have a mode at all.

Calculate measures of variability

Measures of variability tell you how spread out the values in a data set are. Four main measures of variability are often reported:

  • Range : the highest value minus the lowest value of the data set.
  • Interquartile range : the range of the middle half of the data set.
  • Standard deviation : the average distance between each value in your data set and the mean.
  • Variance : the square of the standard deviation.

Once again, the shape of the distribution and level of measurement should guide your choice of variability statistics. The interquartile range is the best measure for skewed distributions, while standard deviation and variance provide the best information for normal distributions.

Using your table, you should check whether the units of the descriptive statistics are comparable for pretest and posttest scores. For example, are the variance levels similar across the groups? Are there any extreme values? If there are, you may need to identify and remove extreme outliers in your data set or transform your data before performing a statistical test.

From this table, we can see that the mean score increased after the meditation exercise, and the variances of the two scores are comparable. Next, we can perform a statistical test to find out if this improvement in test scores is statistically significant in the population. Example: Descriptive statistics (correlational study) After collecting data from 653 students, you tabulate descriptive statistics for annual parental income and GPA.

It’s important to check whether you have a broad range of data points. If you don’t, your data may be skewed towards some groups more than others (e.g., high academic achievers), and only limited inferences can be made about a relationship.

A number that describes a sample is called a statistic , while a number describing a population is called a parameter . Using inferential statistics , you can make conclusions about population parameters based on sample statistics.

Researchers often use two main methods (simultaneously) to make inferences in statistics.

  • Estimation: calculating population parameters based on sample statistics.
  • Hypothesis testing: a formal process for testing research predictions about the population using samples.

You can make two types of estimates of population parameters from sample statistics:

  • A point estimate : a value that represents your best guess of the exact parameter.
  • An interval estimate : a range of values that represent your best guess of where the parameter lies.

If your aim is to infer and report population characteristics from sample data, it’s best to use both point and interval estimates in your paper.

You can consider a sample statistic a point estimate for the population parameter when you have a representative sample (e.g., in a wide public opinion poll, the proportion of a sample that supports the current government is taken as the population proportion of government supporters).

There’s always error involved in estimation, so you should also provide a confidence interval as an interval estimate to show the variability around a point estimate.

A confidence interval uses the standard error and the z score from the standard normal distribution to convey where you’d generally expect to find the population parameter most of the time.

Hypothesis testing

Using data from a sample, you can test hypotheses about relationships between variables in the population. Hypothesis testing starts with the assumption that the null hypothesis is true in the population, and you use statistical tests to assess whether the null hypothesis can be rejected or not.

Statistical tests determine where your sample data would lie on an expected distribution of sample data if the null hypothesis were true. These tests give two main outputs:

  • A test statistic tells you how much your data differs from the null hypothesis of the test.
  • A p value tells you the likelihood of obtaining your results if the null hypothesis is actually true in the population.

Statistical tests come in three main varieties:

  • Comparison tests assess group differences in outcomes.
  • Regression tests assess cause-and-effect relationships between variables.
  • Correlation tests assess relationships between variables without assuming causation.

Your choice of statistical test depends on your research questions, research design, sampling method, and data characteristics.

Parametric tests

Parametric tests make powerful inferences about the population based on sample data. But to use them, some assumptions must be met, and only some types of variables can be used. If your data violate these assumptions, you can perform appropriate data transformations or use alternative non-parametric tests instead.

A regression models the extent to which changes in a predictor variable results in changes in outcome variable(s).

  • A simple linear regression includes one predictor variable and one outcome variable.
  • A multiple linear regression includes two or more predictor variables and one outcome variable.

Comparison tests usually compare the means of groups. These may be the means of different groups within a sample (e.g., a treatment and control group), the means of one sample group taken at different times (e.g., pretest and posttest scores), or a sample mean and a population mean.

  • A t test is for exactly 1 or 2 groups when the sample is small (30 or less).
  • A z test is for exactly 1 or 2 groups when the sample is large.
  • An ANOVA is for 3 or more groups.

The z and t tests have subtypes based on the number and types of samples and the hypotheses:

  • If you have only one sample that you want to compare to a population mean, use a one-sample test .
  • If you have paired measurements (within-subjects design), use a dependent (paired) samples test .
  • If you have completely separate measurements from two unmatched groups (between-subjects design), use an independent (unpaired) samples test .
  • If you expect a difference between groups in a specific direction, use a one-tailed test .
  • If you don’t have any expectations for the direction of a difference between groups, use a two-tailed test .

The only parametric correlation test is Pearson’s r . The correlation coefficient ( r ) tells you the strength of a linear relationship between two quantitative variables.

However, to test whether the correlation in the sample is strong enough to be important in the population, you also need to perform a significance test of the correlation coefficient, usually a t test, to obtain a p value. This test uses your sample size to calculate how much the correlation coefficient differs from zero in the population.

You use a dependent-samples, one-tailed t test to assess whether the meditation exercise significantly improved math test scores. The test gives you:

  • a t value (test statistic) of 3.00
  • a p value of 0.0028

Although Pearson’s r is a test statistic, it doesn’t tell you anything about how significant the correlation is in the population. You also need to test whether this sample correlation coefficient is large enough to demonstrate a correlation in the population.

A t test can also determine how significantly a correlation coefficient differs from zero based on sample size. Since you expect a positive correlation between parental income and GPA, you use a one-sample, one-tailed t test. The t test gives you:

  • a t value of 3.08
  • a p value of 0.001

The final step of statistical analysis is interpreting your results.

Statistical significance

In hypothesis testing, statistical significance is the main criterion for forming conclusions. You compare your p value to a set significance level (usually 0.05) to decide whether your results are statistically significant or non-significant.

Statistically significant results are considered unlikely to have arisen solely due to chance. There is only a very low chance of such a result occurring if the null hypothesis is true in the population.

This means that you believe the meditation intervention, rather than random factors, directly caused the increase in test scores. Example: Interpret your results (correlational study) You compare your p value of 0.001 to your significance threshold of 0.05. With a p value under this threshold, you can reject the null hypothesis. This indicates a statistically significant correlation between parental income and GPA in male college students.

Note that correlation doesn’t always mean causation, because there are often many underlying factors contributing to a complex variable like GPA. Even if one variable is related to another, this may be because of a third variable influencing both of them, or indirect links between the two variables.

Effect size

A statistically significant result doesn’t necessarily mean that there are important real life applications or clinical outcomes for a finding.

In contrast, the effect size indicates the practical significance of your results. It’s important to report effect sizes along with your inferential statistics for a complete picture of your results. You should also report interval estimates of effect sizes if you’re writing an APA style paper .

With a Cohen’s d of 0.72, there’s medium to high practical significance to your finding that the meditation exercise improved test scores. Example: Effect size (correlational study) To determine the effect size of the correlation coefficient, you compare your Pearson’s r value to Cohen’s effect size criteria.

Decision errors

Type I and Type II errors are mistakes made in research conclusions. A Type I error means rejecting the null hypothesis when it’s actually true, while a Type II error means failing to reject the null hypothesis when it’s false.

You can aim to minimize the risk of these errors by selecting an optimal significance level and ensuring high power . However, there’s a trade-off between the two errors, so a fine balance is necessary.

Frequentist versus Bayesian statistics

Traditionally, frequentist statistics emphasizes null hypothesis significance testing and always starts with the assumption of a true null hypothesis.

However, Bayesian statistics has grown in popularity as an alternative approach in the last few decades. In this approach, you use previous research to continually update your hypotheses based on your expectations and observations.

Bayes factor compares the relative strength of evidence for the null versus the alternative hypothesis rather than making a conclusion about rejecting the null hypothesis or not.

If you want to know more about statistics , methodology , or research bias , make sure to check out some of our other articles with explanations and examples.

  • Student’s  t -distribution
  • Normal distribution
  • Null and Alternative Hypotheses
  • Chi square tests
  • Confidence interval

Methodology

  • Cluster sampling
  • Stratified sampling
  • Data cleansing
  • Reproducibility vs Replicability
  • Peer review
  • Likert scale

Research bias

  • Implicit bias
  • Framing effect
  • Cognitive bias
  • Placebo effect
  • Hawthorne effect
  • Hostile attribution bias
  • Affect heuristic

Is this article helpful?

Other students also liked.

  • Descriptive Statistics | Definitions, Types, Examples
  • Inferential Statistics | An Easy Introduction & Examples
  • Choosing the Right Statistical Test | Types & Examples

More interesting articles

  • Akaike Information Criterion | When & How to Use It (Example)
  • An Easy Introduction to Statistical Significance (With Examples)
  • An Introduction to t Tests | Definitions, Formula and Examples
  • ANOVA in R | A Complete Step-by-Step Guide with Examples
  • Central Limit Theorem | Formula, Definition & Examples
  • Central Tendency | Understanding the Mean, Median & Mode
  • Chi-Square (Χ²) Distributions | Definition & Examples
  • Chi-Square (Χ²) Table | Examples & Downloadable Table
  • Chi-Square (Χ²) Tests | Types, Formula & Examples
  • Chi-Square Goodness of Fit Test | Formula, Guide & Examples
  • Chi-Square Test of Independence | Formula, Guide & Examples
  • Coefficient of Determination (R²) | Calculation & Interpretation
  • Correlation Coefficient | Types, Formulas & Examples
  • Frequency Distribution | Tables, Types & Examples
  • How to Calculate Standard Deviation (Guide) | Calculator & Examples
  • How to Calculate Variance | Calculator, Analysis & Examples
  • How to Find Degrees of Freedom | Definition & Formula
  • How to Find Interquartile Range (IQR) | Calculator & Examples
  • How to Find Outliers | 4 Ways with Examples & Explanation
  • How to Find the Geometric Mean | Calculator & Formula
  • How to Find the Mean | Definition, Examples & Calculator
  • How to Find the Median | Definition, Examples & Calculator
  • How to Find the Mode | Definition, Examples & Calculator
  • How to Find the Range of a Data Set | Calculator & Formula
  • Hypothesis Testing | A Step-by-Step Guide with Easy Examples
  • Interval Data and How to Analyze It | Definitions & Examples
  • Levels of Measurement | Nominal, Ordinal, Interval and Ratio
  • Linear Regression in R | A Step-by-Step Guide & Examples
  • Missing Data | Types, Explanation, & Imputation
  • Multiple Linear Regression | A Quick Guide (Examples)
  • Nominal Data | Definition, Examples, Data Collection & Analysis
  • Normal Distribution | Examples, Formulas, & Uses
  • Null and Alternative Hypotheses | Definitions & Examples
  • One-way ANOVA | When and How to Use It (With Examples)
  • Ordinal Data | Definition, Examples, Data Collection & Analysis
  • Parameter vs Statistic | Definitions, Differences & Examples
  • Pearson Correlation Coefficient (r) | Guide & Examples
  • Poisson Distributions | Definition, Formula & Examples
  • Probability Distribution | Formula, Types, & Examples
  • Quartiles & Quantiles | Calculation, Definition & Interpretation
  • Ratio Scales | Definition, Examples, & Data Analysis
  • Simple Linear Regression | An Easy Introduction & Examples
  • Skewness | Definition, Examples & Formula
  • Statistical Power and Why It Matters | A Simple Introduction
  • Student's t Table (Free Download) | Guide & Examples
  • T-distribution: What it is and how to use it
  • Test statistics | Definition, Interpretation, and Examples
  • The Standard Normal Distribution | Calculator, Examples & Uses
  • Two-Way ANOVA | Examples & When To Use It
  • Type I & Type II Errors | Differences, Examples, Visualizations
  • Understanding Confidence Intervals | Easy Examples & Formulas
  • Understanding P values | Definition and Examples
  • Variability | Calculating Range, IQR, Variance, Standard Deviation
  • What is Effect Size and Why Does It Matter? (Examples)
  • What Is Kurtosis? | Definition, Examples & Formula
  • What Is Standard Error? | How to Calculate (Guide with Examples)

What is your plagiarism score?

Data Science: the impact of statistics

  • Regular Paper
  • Open access
  • Published: 16 February 2018
  • Volume 6 , pages 189–194, ( 2018 )

Cite this article

You have full access to this open access article

  • Claus Weihs 1 &
  • Katja Ickstadt 2  

39k Accesses

46 Citations

17 Altmetric

Explore all metrics

In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview over different proposed structures of Data Science and address the impact of statistics on such steps as data acquisition and enrichment, data exploration, data analysis and modeling, validation and representation and reporting. Also, we indicate fallacies when neglecting statistical reasoning.

Similar content being viewed by others

research papers in statistics

Data Analysis

research papers in statistics

Data science vs. statistics: two cultures?

Iain Carmichael & J. S. Marron

research papers in statistics

Data Science: An Introduction

Avoid common mistakes on your manuscript.

1 Introduction and premise

Data Science as a scientific discipline is influenced by informatics, computer science, mathematics, operations research, and statistics as well as the applied sciences.

In 1996, for the first time, the term Data Science was included in the title of a statistical conference (International Federation of Classification Societies (IFCS) “Data Science, classification, and related methods”) [ 37 ]. Even though the term was founded by statisticians, in the public image of Data Science, the importance of computer science and business applications is often much more stressed, in particular in the era of Big Data.

Already in the 1970s, the ideas of John Tukey [ 43 ] changed the viewpoint of statistics from a purely mathematical setting , e.g., statistical testing, to deriving hypotheses from data ( exploratory setting ), i.e., trying to understand the data before hypothesizing.

Another root of Data Science is Knowledge Discovery in Databases (KDD) [ 36 ] with its sub-topic Data Mining . KDD already brings together many different approaches to knowledge discovery, including inductive learning, (Bayesian) statistics, query optimization, expert systems, information theory, and fuzzy sets. Thus, KDD is a big building block for fostering interaction between different fields for the overall goal of identifying knowledge in data.

Nowadays, these ideas are combined in the notion of Data Science, leading to different definitions. One of the most comprehensive definitions of Data Science was recently given by Cao as the formula [ 12 ]:

data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking) .

In this formula, sociology stands for the social aspects and | (data + environment + thinking) means that all the mentioned sciences act on the basis of data, the environment and the so-called data-to-knowledge-to-wisdom thinking.

A recent, comprehensive overview of Data Science provided by Donoho in 2015 [ 16 ] focuses on the evolution of Data Science from statistics. Indeed, as early as 1997, there was an even more radical view suggesting to rename statistics to Data Science [ 50 ]. And in 2015, a number of ASA leaders [ 17 ] released a statement about the role of statistics in Data Science, saying that “statistics and machine learning play a central role in data science.”

In our view, statistical methods are crucial in most fundamental steps of Data Science. Hence, the premise of our contribution is:

Statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty.

This paper aims at addressing the major impact of statistics on the most important steps in Data Science.

2 Steps in data science

One of forerunners of Data Science from a structural perspective is the famous CRISP-DM (Cross Industry Standard Process for Data Mining) which is organized in six main steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment [ 10 ], see Table  1 , left column. Ideas like CRISP-DM are now fundamental for applied statistics.

In our view, the main steps in Data Science have been inspired by CRISP-DM and have evolved, leading to, e.g., our definition of Data Science as a sequence of the following steps: Data Acquisition and Enrichment, Data Storage and Access , Data Exploration, Data Analysis and Modeling, Optimization of Algorithms , Model Validation and Selection, Representation and Reporting of Results, and Business Deployment of Results . Note that topics in small capitals indicate steps where statistics is less involved, cp. Table  1 , right column.

Usually, these steps are not just conducted once but are iterated in a cyclic loop. In addition, it is common to alternate between two or more steps. This holds especially for the steps Data Acquisition and Enrichment , Data Exploration , and Statistical Data Analysis , as well as for Statistical Data Analysis and Modeling and Model Validation and Selection .

Table  1 compares different definitions of steps in Data Science. The relationship of terms is indicated by horizontal blocks. The missing step Data Acquisition and Enrichment in CRISP-DM indicates that that scheme deals with observational data only. Moreover, in our proposal, the steps Data Storage and Access and Optimization of Algorithms are added to CRISP-DM, where statistics is less involved.

The list of steps for Data Science may even be enlarged, see, e.g., Cao in [ 12 ], Figure 6, cp. also Table  1 , middle column, for the following recent list: Domain-specific Data Applications and Problems, Data Storage and Management, Data Quality Enhancement, Data Modeling and Representation, Deep Analytics, Learning and Discovery, Simulation and Experiment Design, High-performance Processing and Analytics, Networking, Communication, Data-to-Decision and Actions.

In principle, Cao’s and our proposal cover the same main steps. However, in parts, Cao’s formulation is more detailed; e.g., our step Data Analysis and Modeling corresponds to Data Modeling and Representation, Deep Analytics, Learning and Discovery . Also, the vocabularies differ slightly, depending on whether the respective background is computer science or statistics. In that respect note that Experiment Design in Cao’s definition means the design of the simulation experiments.

In what follows, we will highlight the role of statistics discussing all the steps, where it is heavily involved, in Sects.  2.1 – 2.6 . These coincide with all steps in our proposal in Table  1 except steps in small capitals. The corresponding entries Data Storage and Access and Optimization of Algorithms are mainly covered by informatics and computer science , whereas Business Deployment of Results is covered by Business Management .

2.1 Data acquisition and enrichment

Design of experiments (DOE) is essential for a systematic generation of data when the effect of noisy factors has to be identified. Controlled experiments are fundamental for robust process engineering to produce reliable products despite variation in the process variables. On the one hand, even controllable factors contain a certain amount of uncontrollable variation that affects the response. On the other hand, some factors, like environmental factors, cannot be controlled at all. Nevertheless, at least the effect of such noisy influencing factors should be controlled by, e.g., DOE.

DOE can be utilized, e.g.,

to systematically generate new data ( data acquisition ) [ 33 ],

for systematically reducing data bases [ 41 ], and

for tuning (i.e., optimizing) parameters of algorithms [ 1 ], i.e., for improving the data analysis methods (see Sect.  2.3 ) themselves.

Simulations [ 7 ] may also be used to generate new data. A tool for the enrichment of data bases to fill data gaps is the imputation of missing data [ 31 ].

Such statistical methods for data generation and enrichment need to be part of the backbone of Data Science. The exclusive use of observational data without any noise control distinctly diminishes the quality of data analysis results and may even lead to wrong result interpretation. The hope for “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” [ 4 ] appears to be wrong due to noise in the data.

Thus, experimental design is crucial for the reliability, validity, and replicability of our results.

2.2 Data exploration

Exploratory statistics is essential for data preprocessing to learn about the contents of a data base. Exploration and visualization of observed data was, in a way, initiated by John Tukey [ 43 ]. Since that time, the most laborious part of data analysis, namely data understanding and transformation, became an important part in statistical science.

Data exploration or data mining is fundamental for the proper usage of analytical methods in Data Science. The most important contribution of statistics is the notion of distribution . It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytic models and methods.

2.3 Statistical data analysis

Finding structure in data and making predictions are the most important steps in Data Science. Here, in particular, statistical methods are essential since they are able to handle many different analytical tasks. Important examples of statistical data analysis methods are the following.

Hypothesis testing is one of the pillars of statistical analysis. Questions arising in data driven problems can often be translated to hypotheses. Also, hypotheses are the natural links between underlying theory and statistics. Since statistical hypotheses are related to statistical tests, questions and theory can be tested for the available data. Multiple usage of the same data in different tests often leads to the necessity to correct significance levels. In applied statistics, correct multiple testing is one of the most important problems, e.g., in pharmaceutical studies [ 15 ]. Ignoring such techniques would lead to many more significant results than justified.

Classification methods are basic for finding and predicting subpopulations from data. In the so-called unsupervised case, such subpopulations are to be found from a data set without a-priori knowledge of any cases of such subpopulations. This is often called clustering.

In the so-called supervised case, classification rules should be found from a labeled data set for the prediction of unknown labels when only influential factors are available.

Nowadays, there is a plethora of methods for the unsupervised [ 22 ] as well for the supervised case [ 2 ].

In the age of Big Data, a new look at the classical methods appears to be necessary, though, since most of the time the calculation effort of complex analysis methods grows stronger than linear with the number of observations n or the number of features p . In the case of Big Data, i.e., if n or p is large, this leads to too high calculation times and to numerical problems. This results both, in the comeback of simpler optimization algorithms with low time-complexity [ 9 ] and in re-examining the traditional methods in statistics and machine learning for Big Data [ 46 ].

Regression methods are the main tool to find global and local relationships between features when the target variable is measured. Depending on the distributional assumption for the underlying data, different approaches may be applied. Under the normality assumption, linear regression is the most common method, while generalized linear regression is usually employed for other distributions from the exponential family [ 18 ]. More advanced methods comprise functional regression for functional data [ 38 ], quantile regression [ 25 ], and regression based on loss functions other than squared error loss like, e.g., Lasso regression [ 11 , 21 ]. In the context of Big Data, the challenges are similar to those for classification methods given large numbers of observations n (e.g., in data streams) and / or large numbers of features p . For the reduction of n , data reduction techniques like compressed sensing, random projection methods [ 20 ] or sampling-based procedures [ 28 ] enable faster computations. For decreasing the number p to the most influential features, variable selection or shrinkage approaches like the Lasso [ 21 ] can be employed, keeping the interpretability of the features. (Sparse) principal component analysis [ 21 ] may also be used.

Time series analysis aims at understanding and predicting temporal structure [ 42 ]. Time series are very common in studies of observational data, and prediction is the most important challenge for such data. Typical application areas are the behavioral sciences and economics as well as the natural sciences and engineering. As an example, let us have a look at signal analysis, e.g., speech or music data analysis. Here, statistical methods comprise the analysis of models in the time and frequency domains. The main aim is the prediction of future values of the time series itself or of its properties. For example, the vibrato of an audio time series might be modeled in order to realistically predict the tone in the future [ 24 ] and the fundamental frequency of a musical tone might be predicted by rules learned from elapsed time periods [ 29 ].

In econometrics, multiple time series and their co-integration are often analyzed [ 27 ]. In technical applications, process control is a common aim of time series analysis [ 34 ].

2.4 Statistical modeling

Complex interactions between factors can be modeled by graphs or networks . Here, an interaction between two factors is modeled by a connection in the graph or network [ 26 , 35 ]. The graphs can be undirected as, e.g., in Gaussian graphical models, or directed as, e.g., in Bayesian networks. The main goal in network analysis is deriving the network structure. Sometimes, it is necessary to separate (unmix) subpopulation specific network topologies [ 49 ].

Stochastic differential and difference equations can represent models from the natural and engineering sciences [ 3 , 39 ]. The finding of approximate statistical models solving such equations can lead to valuable insights for, e.g., the statistical control of such processes, e.g., in mechanical engineering [ 48 ]. Such methods can build a bridge between the applied sciences and Data Science.

Local models and globalization Typically, statistical models are only valid in sub-regions of the domain of the involved variables. Then, local models can be used [ 8 ]. The analysis of structural breaks can be basic to identify the regions for local modeling in time series [ 5 ]. Also, the analysis of concept drifts can be used to investigate model changes over time [ 30 ].

In time series, there are often hierarchies of more and more global structures. For example, in music, a basic local structure is given by the notes and more and more global ones by bars, motifs, phrases, parts etc. In order to find global properties of a time series, properties of the local models can be combined to more global characteristics [ 47 ].

Mixture models can also be used for the generalization of local to global models [ 19 , 23 ]. Model combination is essential for the characterization of real relationships since standard mathematical models are often much too simple to be valid for heterogeneous data or bigger regions of interest.

2.5 Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing models are helpful to structure the models, e.g., concerning their predictive power [ 45 ].

Predictive power is typically assessed by means of so-called resampling methods where the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection [ 7 ].

Perturbation experiments offer another possibility to evaluate the performance of models. In this way, the stability of the different models against noise is assessed [ 32 , 44 ].

Meta-analysis as well as model averaging are methods to evaluate combined models [ 13 , 14 ].

Model selection became more and more important in the last years since the number of classification and regression models proposed in the literature increased with higher and higher speed.

2.6 Representation and reporting

Visualization to interpret found structures and storing of models in an easy-to-update form are very important tasks in statistical analyses to communicate the results and safeguard data analysis deployment. Deployment is decisive for obtaining interpretable results in Data Science. It is the last step in CRISP-DM [ 10 ] and underlying the data-to-decision and action step in Cao [ 12 ].

Besides visualization and adequate model storing, for statistics, the main task is reporting of uncertainties and review [ 6 ].

3 Fallacies

The statistical methods described in Sect.  2 are fundamental for finding structure in data and for obtaining deeper insight into data, and thus, for a successful data analysis. Ignoring modern statistical thinking or using simplistic data analytics/statistical methods may lead to avoidable fallacies. This holds, in particular, for the analysis of big and/or complex data.

As mentioned at the end of Sect.  2.2 , the notion of distribution is the key contribution of statistics. Not taking into account distributions in data exploration and in modeling restricts us to report values and parameter estimates without their corresponding variability. Only the notion of distributions enables us to predict with corresponding error bands.

Moreover, distributions are the key to model-based data analytics. For example, unsupervised learning can be employed to find clusters in data. If additional structure like dependency on space or time is present, it is often important to infer parameters like cluster radii and their spatio-temporal evolution. Such model-based analysis heavily depends on the notion of distributions (see [ 40 ] for an application to protein clusters).

If more than one parameter is of interest, it is advisable to compare univariate hypothesis testing approaches to multiple procedures, e.g., in multiple regression, and choose the most adequate model by variable selection. Restricting oneself to univariate testing, would ignore relationships between variables.

Deeper insight into data might require more complex models, like, e.g., mixture models for detecting heterogeneous groups in data. When ignoring the mixture, the result often represents a meaningless average, and learning the subgroups by unmixing the components might be needed. In a Bayesian framework, this is enabled by, e.g., latent allocation variables in a Dirichlet mixture model. For an application of decomposing a mixture of different networks in a heterogeneous cell population in molecular biology see [ 49 ].

A mixture model might represent mixtures of components of very unequal sizes, with small components (outliers) being of particular importance. In the context of Big Data, naïve sampling procedures are often employed for model estimation. However, these have the risk of missing small mixture components. Hence, model validation or sampling according to a more suitable distribution as well as resampling methods for predictive power are important.

4 Conclusion

Following the above assessment of the capabilities and impacts of statistics our conclusion is:

The role of statistics in Data Science is under-estimated as, e.g., compared to computer science. This yields, in particular, for the areas of data acquisition and enrichment as well as for advanced modeling needed for prediction.

Stimulated by this conclusion, statisticians are well-advised to more offensively play their role in this modern and well accepted field of Data Science.

Only complementing and/or combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will lead to scientific results based on suitable approaches. Ultimately, only a balanced interplay of all sciences involved will lead to successful solutions in Data Science.

Adenso-Diaz, B., Laguna, M.: Fine-tuning of algorithms using fractional experimental designs and local search. Oper. Res. 54 (1), 99–114 (2006)

Article   Google Scholar  

Aggarwal, C.C. (ed.): Data Classification: Algorithms and Applications. CRC Press, Boca Raton (2014)

Google Scholar  

Allen, E., Allen, L., Arciniega, A., Greenwood, P.: Construction of equivalent stochastic differential equation models. Stoch. Anal. Appl. 26 , 274–297 (2008)

Article   MathSciNet   Google Scholar  

Anderson, C.: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine https://www.wired.com/2008/06/pb-theory/ (2008)

Aue, A., Horváth, L.: Structural breaks in time series. J. Time Ser. Anal. 34 (1), 1–16 (2013)

Berger, R.E.: A scientific approach to writing for engineers and scientists. IEEE PCS Professional Engineering Communication Series IEEE Press, Wiley (2014)

Book   Google Scholar  

Bischl, B., Mersmann, O., Trautmann, H., Weihs, C.: Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 20 (2), 249–275 (2012)

Bischl, B., Schiffner, J., Weihs, C.: Benchmarking local classification methods. Comput. Stat. 28 (6), 2599–2619 (2013)

Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)

Brown, M.S.: Data Mining for Dummies. Wiley, London (2014)

Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (2017). https://doi.org/10.1145/3076253

Claeskens, G., Hjort, N.L.: Model Selection and Model Averaging. Cambridge University Press, Cambridge (2008)

Cooper, H., Hedges, L.V., Valentine, J.C.: The Handbook of Research Synthesis and Meta-analysis. Russell Sage Foundation, New York City (2009)

Dmitrienko, A., Tamhane, A.C., Bretz, F.: Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, London (2009)

Donoho, D.: 50 Years of Data Science. http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf (2015)

Dyk, D.V., Fuentes, M., Jordan, M.I., Newton, M., Ray, B.K., Lang, D.T., Wickham, H.: ASA Statement on the Role of Statistics in Data Science. http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ (2015)

Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)

Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, Berlin (2006)

MATH   Google Scholar  

Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., Sohler, C.: Random projections for Bayesian regression. Stat. Comput. 27 (1), 79–101 (2017). https://doi.org/10.1007/s11222-015-9608-z

Article   MathSciNet   MATH   Google Scholar  

Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall, London (2015)

Klein, H.U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K., Dugas, M.: Integrative analysis of histone chip-seq and transcription data using Bayesian mixture models. Bioinformatics 30 (8), 1154–1162 (2014)

Knoche, S., Ebeling, M.: The musical signal: physically and psychologically, chap 2. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 15–68. CRC Press, Boca Raton (2017)

Koenker, R.: Quantile Regression. Econometric Society Monographs, vol. 38 (2010)

Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)

Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Berlin (2010)

Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. In: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 91–99. http://jmlr.org/proceedings/papers/v32/ma14.html (2014)

Martin, R., Nagathil, A.: Digital filters and spectral analysis, chap 4. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 111–143. CRC Press, Boca Raton (2017)

Mejri, D., Limam, M., Weihs, C.: A new dynamic weighted majority control chart for data streams. Soft Comput. 22(2), 511–522. https://doi.org/10.1007/s00500-016-2351-3

Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)

Molinelli, E.J., Korkut, A., Wang, W.Q., Miller, M.L., Gauthier, N.P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D.B., Pratilas, C.A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., Sander, C.: Perturbation Biology: Inferring Signaling Networks in Cellular Systems. arXiv preprint arXiv:1308.5193 (2013)

Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. Wiley, London (2013)

Oakland, J.: Statistical Process Control. Routledge, London (2007)

Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos (1988)

Chapter   Google Scholar  

Piateski, G., Frawley, W.: Knowledge Discovery in Databases. MIT Press, Cambridge (1991)

Press, G.: A Very Short History of Data Science. https://www.forbescom/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#5c515ed055cf (2013). [last visit: March 19, 2017]

Ramsay, J., Silverman, B.W.: Functional Data Analysis. Springer, Berlin (2005)

Särkkä, S.: Applied Stochastic Differential Equations. https://users.aalto.fi/~ssarkka/course_s2012/pdf/sde_course_booklet_2012.pdf (2012). [last visit: March 6, 2017]

Schäfer, M., Radon, Y., Klein, T., Herrmann, S., Schwender, H., Verveer, P.J., Ickstadt, K.: A Bayesian mixture model to quantify parameters of spatial clustering. Comput. Stat. Data Anal. 92 , 163–176 (2015). https://doi.org/10.1016/j.csda.2015.07.004

Schiffner, J., Weihs, C.: D-optimal plans for variable selection in data bases. Technical Report, 14/09, SFB 475 (2009)

Shumway, R.H., Stoffer, D.S.: Time Series Analysis and Its Applications: With R Examples. Springer, Berlin (2010)

Tukey, J.W.: Exploratory Data Analysis. Pearson, London (1977)

Vatcheva, I., de Jong, H., Mars, N.: Selection of perturbation experiments for model discrimination. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence, ECAI-2000, IOS Press, pp 191–195 (2000)

Vatolkin, I., Weihs, C.: Evaluation, chap 13. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 329–363. CRC Press, Boca Raton (2017)

Weihs, C.: Big data classification — aspects on many features. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds.) Solving Large Scale Learning Tasks: Challenges and Algorithms, Springer Lecture Notes in Artificial Intelligence, vol. 9580, pp. 139–147 (2016)

Weihs, C., Ligges, U.: From local to global analysis of music time series. In: Morik, K., Siebes, A., Boulicault, J.F. (eds.) Detecting Local Patterns, Springer Lecture Notes in Artificial Intelligence, vol. 3539, pp. 233–245 (2005)

Weihs, C., Messaoud, A., Raabe, N.: Control charts based on models derived from differential equations. Qual. Reliab. Eng. Int. 26 (8), 807–816 (2010)

Wieczorek, J., Malik-Sheriff, R.S., Fermin, Y., Grecco, H.E., Zamir, E., Ickstadt, K.: Uncovering distinct protein-network topologies in heterogeneous cell populations. BMC Syst. Biol. 9 (1), 24 (2015)

Wu, J.: Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf (1997)

Download references

Acknowledgements

The authors would like to thank the editor, the guest editors and all reviewers for valuable comments on an earlier version of the manuscript. They also thank Leo Geppert for fruitful discussions.

Author information

Authors and affiliations.

Computational Statistics, TU Dortmund University, 44221, Dortmund, Germany

Claus Weihs

Mathematical Statistics and Biometric Applications, TU Dortmund University, 44221, Dortmund, Germany

Katja Ickstadt

You can also search for this author in PubMed   Google Scholar

Corresponding author

Correspondence to Claus Weihs .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0 /), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Weihs, C., Ickstadt, K. Data Science: the impact of statistics. Int J Data Sci Anal 6 , 189–194 (2018). https://doi.org/10.1007/s41060-018-0102-5

Download citation

Received : 20 March 2017

Accepted : 25 January 2018

Published : 16 February 2018

Issue Date : November 2018

DOI : https://doi.org/10.1007/s41060-018-0102-5

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

  • Structures of data science
  • Impact of statistics on data science
  • Fallacies in data science
  • Find a journal
  • Publish with us
  • Track your research

StatAnalytica

How to Write Statistics Research Paper | Easy Guide

Statistics Research Paper

A statistics research paper is an academic document presenting original findings or analyses derived from the data’s collection, organization, analysis, and interpretation. It addresses research questions or hypotheses within the field of statistics.

As a rule, college students get such papers assigned during a semester to assess their knowledge of statistics. However, any statistician specialist can also write research papers and publish them in academic journals, thus developing and promoting this field.

Want to master the art of statistics research paper writing?

We’ve got expert tips from a professional research paper writing service on crafting such studies. In this article, you’ll find a step-by-step guide on writing a statistics research paper that your educators, colleagues, or clients will approve.

How to write a statistics research paper: Steps

Table of Contents

State the problem

Collect the data, write an introductory paragraph, craft an abstract, describe your methodology, present your findings: evaluate and illustrate, revise and proofread.

Research papers aren’t about describing the existing knowledge on the topic. You should state your intellectual concern with it, indicating why it’s worth studying. When choosing the problem you’ll research in the paper, emphasize its ongoing nature:

What have other researchers already studied about it? Cite at least one previous publication related to your research and provide your statistical motivation to continue researching the topic. (You’ll refer to those researches in footnotes or within the text of your paper.)

Once you have the topic (problem) to research in your paper, it’s time to collect sources you’ll use as evidence and references. For statistics papers, consider the following:

  • Published research from experts in Statistics (academic journals, newspapers, books, online publications, etc.)
  • Statistical data from reliable sources (Google’s Public Data and Scholar, FedStats, and others)
  • Your personal hypothesis, experiments, and info-gathering activities

The last one is a must-have! Your statistics research paper requires new information gathered by you as a researcher and not previously published anywhere. The massive block of your research paper will be about the data collection methods you used to investigate the problem and come to the conclusions you’ll provide.

Some underestimate the introductory paragraphs of research papers , but they are wrong. The introduction is the first thing a reader sees to understand if your research is worth their attention and time. With that in mind, ensure your introductory paragraph is intriguing yet informative enough for the audience to continue reading.

Start with a writing hook, a sentence grabbing a reader’s attention. Also, an intro needs background information: your topic and the scientific motivation for the new research methods. (What’s wrong with existing ones? Or, what do they miss?) Finally, move on to your thesis statement: 1-2 sentences summarizing the primary idea behind your research paper.

It’s an overview of your statistics research paper where you establish notation and outline the methods and the results. Abstracts are integral for all academic studies and research, giving readers enough details to decide whether your paper is relevant to them.

What do you include in an abstract?

Introduce your topic and explain why it’s significant in your field. State the gap present in the research at the moment and reveal the aim of your paper. Then, briefly describe your research methods and approach, summarize your findings, and explain their contribution to the field.

The methods section is the most extensive one in your research paper. Here, you provide sufficient information about how you collected data for your research, what methodologies you used, and how you evaluated the results.

Be specific; describe everything so the audience can repeat your research (experiments) and reconstruct your results. It’s the value your paper brings to the academic community.

Further paragraphs of your research paper present your findings. Try to stick to one idea per paragraph to make it clear and easy for readers to consume.

Prepare and add supporting materials that will help you illustrate findings: graphs, diagrams, charts, tables, etc. You can use them in a paper’s body or add them as an appendix; these elements will support your points and serve as extra proof for your findings.

Last but not least:

Write a concluding paragraph for your statistics research paper. Repeat your thesis, summarize your findings, and conclude whether they have proved or contradicted your initial theory (hypothesis). Also, you can make suggestions for further research in the same area.

Re-read your paper several times before publishing or submitting it for review. Ensure all the information is logical and coherent, all the terms are correct, and all the elements are present and accurately placed.

Also, proofread your final draft: Spelling, grammar, and punctuation mistakes are a no-no here! Re-check the list of references again; ensure you follow the required citation style and use the proper format.

So, now you know seven easy steps for writing a statistical research paper. Whether you’re a college student or a statistician willing to make a scientific contribution to a niche, follow them to craft a professionally structured academic document:

State a problem, choose methods of analyzing it, evaluate your findings, and illustrate them to engage the audience in discussion.

If you still need clarification or have questions about writing a statistics paper, don’t hesitate to ask for assistance. Professional writers with experience in statistics are ready to help you improve your writing skills.

Related Posts

best way to finance car

Step by Step Guide on The Best Way to Finance Car

how to get fund for business

The Best Way on How to Get Fund For Business to Grow it Efficiently

Leave a comment cancel reply.

Your email address will not be published. Required fields are marked *

Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries

Affiliations.

  • 1 Cancer Surveillance Branch, International Agency for Research on Cancer, Lyon, France.
  • 2 Surveillance and Health Equity Science, American Cancer Society, Atlanta, Georgia, USA.
  • PMID: 38572751
  • DOI: 10.3322/caac.21834

This article presents global cancer statistics by world region for the year 2022 based on updated estimates from the International Agency for Research on Cancer (IARC). There were close to 20 million new cases of cancer in the year 2022 (including nonmelanoma skin cancers [NMSCs]) alongside 9.7 million deaths from cancer (including NMSC). The estimates suggest that approximately one in five men or women develop cancer in a lifetime, whereas around one in nine men and one in 12 women die from it. Lung cancer was the most frequently diagnosed cancer in 2022, responsible for almost 2.5 million new cases, or one in eight cancers worldwide (12.4% of all cancers globally), followed by cancers of the female breast (11.6%), colorectum (9.6%), prostate (7.3%), and stomach (4.9%). Lung cancer was also the leading cause of cancer death, with an estimated 1.8 million deaths (18.7%), followed by colorectal (9.3%), liver (7.8%), female breast (6.9%), and stomach (6.8%) cancers. Breast cancer and lung cancer were the most frequent cancers in women and men, respectively (both cases and deaths). Incidence rates (including NMSC) varied from four-fold to five-fold across world regions, from over 500 in Australia/New Zealand (507.9 per 100,000) to under 100 in Western Africa (97.1 per 100,000) among men, and from over 400 in Australia/New Zealand (410.5 per 100,000) to close to 100 in South-Central Asia (103.3 per 100,000) among women. The authors examine the geographic variability across 20 world regions for the 10 leading cancer types, discussing recent trends, the underlying determinants, and the prospects for global cancer prevention and control. With demographics-based predictions indicating that the number of new cases of cancer will reach 35 million by 2050, investments in prevention, including the targeting of key risk factors for cancer (including smoking, overweight and obesity, and infection), could avert millions of future cancer diagnoses and save many lives worldwide, bringing huge economic as well as societal dividends to countries over the forthcoming decades.

Keywords: cancer burden; cancer control; epidemiology; incidence; mortality.

© 2024 The Authors. CA: A Cancer Journal for Clinicians published by Wiley Periodicals LLC on behalf of American Cancer Society.

Grants and funding

  • International Agency for Research on Cancer/World Health Organization

U.S. flag

An official website of the United States government

The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

  • Publications
  • Account settings

Preview improvements coming to the PMC website in October 2024. Learn More or Try it out now .

  • Advanced Search
  • Journal List
  • Indian J Anaesth
  • v.60(9); 2016 Sep

Basic statistical tools in research and data analysis

Zulfiqar ali.

Department of Anaesthesiology, Division of Neuroanaesthesiology, Sheri Kashmir Institute of Medical Sciences, Soura, Srinagar, Jammu and Kashmir, India

S Bala Bhaskar

1 Department of Anaesthesiology and Critical Care, Vijayanagar Institute of Medical Sciences, Bellary, Karnataka, India

Statistical methods involved in carrying out a study include planning, designing, collecting data, analysing, drawing meaningful interpretation and reporting of the research findings. The statistical analysis gives meaning to the meaningless numbers, thereby breathing life into a lifeless data. The results and inferences are precise only if proper statistical tests are used. This article will try to acquaint the reader with the basic research tools that are utilised while conducting various studies. The article covers a brief outline of the variables, an understanding of quantitative and qualitative variables and the measures of central tendency. An idea of the sample size estimation, power analysis and the statistical errors is given. Finally, there is a summary of parametric and non-parametric tests used for data analysis.

INTRODUCTION

Statistics is a branch of science that deals with the collection, organisation, analysis of data and drawing of inferences from the samples to the whole population.[ 1 ] This requires a proper design of the study, an appropriate selection of the study sample and choice of a suitable statistical test. An adequate knowledge of statistics is necessary for proper designing of an epidemiological study or a clinical trial. Improper statistical methods may result in erroneous conclusions which may lead to unethical practice.[ 2 ]

Variable is a characteristic that varies from one individual member of population to another individual.[ 3 ] Variables such as height and weight are measured by some type of scale, convey quantitative information and are called as quantitative variables. Sex and eye colour give qualitative information and are called as qualitative variables[ 3 ] [ Figure 1 ].

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g001.jpg

Classification of variables

Quantitative variables

Quantitative or numerical data are subdivided into discrete and continuous measurements. Discrete numerical data are recorded as a whole number such as 0, 1, 2, 3,… (integer), whereas continuous data can assume any value. Observations that can be counted constitute the discrete data and observations that can be measured constitute the continuous data. Examples of discrete data are number of episodes of respiratory arrests or the number of re-intubations in an intensive care unit. Similarly, examples of continuous data are the serial serum glucose levels, partial pressure of oxygen in arterial blood and the oesophageal temperature.

A hierarchical scale of increasing precision can be used for observing and recording the data which is based on categorical, ordinal, interval and ratio scales [ Figure 1 ].

Categorical or nominal variables are unordered. The data are merely classified into categories and cannot be arranged in any particular order. If only two categories exist (as in gender male and female), it is called as a dichotomous (or binary) data. The various causes of re-intubation in an intensive care unit due to upper airway obstruction, impaired clearance of secretions, hypoxemia, hypercapnia, pulmonary oedema and neurological impairment are examples of categorical variables.

Ordinal variables have a clear ordering between the variables. However, the ordered data may not have equal intervals. Examples are the American Society of Anesthesiologists status or Richmond agitation-sedation scale.

Interval variables are similar to an ordinal variable, except that the intervals between the values of the interval variable are equally spaced. A good example of an interval scale is the Fahrenheit degree scale used to measure temperature. With the Fahrenheit scale, the difference between 70° and 75° is equal to the difference between 80° and 85°: The units of measurement are equal throughout the full range of the scale.

Ratio scales are similar to interval scales, in that equal differences between scale values have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property. For example, the system of centimetres is an example of a ratio scale. There is a true zero point and the value of 0 cm means a complete absence of length. The thyromental distance of 6 cm in an adult may be twice that of a child in whom it may be 3 cm.

STATISTICS: DESCRIPTIVE AND INFERENTIAL STATISTICS

Descriptive statistics[ 4 ] try to describe the relationship between variables in a sample or population. Descriptive statistics provide a summary of data in the form of mean, median and mode. Inferential statistics[ 4 ] use a random sample of data taken from a population to describe and make inferences about the whole population. It is valuable when it is not possible to examine each member of an entire population. The examples if descriptive and inferential statistics are illustrated in Table 1 .

Example of descriptive and inferential statistics

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g002.jpg

Descriptive statistics

The extent to which the observations cluster around a central location is described by the central tendency and the spread towards the extremes is described by the degree of dispersion.

Measures of central tendency

The measures of central tendency are mean, median and mode.[ 6 ] Mean (or the arithmetic average) is the sum of all the scores divided by the number of scores. Mean may be influenced profoundly by the extreme variables. For example, the average stay of organophosphorus poisoning patients in ICU may be influenced by a single patient who stays in ICU for around 5 months because of septicaemia. The extreme values are called outliers. The formula for the mean is

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g003.jpg

where x = each observation and n = number of observations. Median[ 6 ] is defined as the middle of a distribution in a ranked data (with half of the variables in the sample above and half below the median value) while mode is the most frequently occurring variable in a distribution. Range defines the spread, or variability, of a sample.[ 7 ] It is described by the minimum and maximum values of the variables. If we rank the data and after ranking, group the observations into percentiles, we can get better information of the pattern of spread of the variables. In percentiles, we rank the observations into 100 equal parts. We can then describe 25%, 50%, 75% or any other percentile amount. The median is the 50 th percentile. The interquartile range will be the observations in the middle 50% of the observations about the median (25 th -75 th percentile). Variance[ 7 ] is a measure of how spread out is the distribution. It gives an indication of how close an individual observation clusters about the mean value. The variance of a population is defined by the following formula:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g004.jpg

where σ 2 is the population variance, X is the population mean, X i is the i th element from the population and N is the number of elements in the population. The variance of a sample is defined by slightly different formula:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g005.jpg

where s 2 is the sample variance, x is the sample mean, x i is the i th element from the sample and n is the number of elements in the sample. The formula for the variance of a population has the value ‘ n ’ as the denominator. The expression ‘ n −1’ is known as the degrees of freedom and is one less than the number of parameters. Each observation is free to vary, except the last one which must be a defined value. The variance is measured in squared units. To make the interpretation of the data simple and to retain the basic unit of observation, the square root of variance is used. The square root of the variance is the standard deviation (SD).[ 8 ] The SD of a population is defined by the following formula:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g006.jpg

where σ is the population SD, X is the population mean, X i is the i th element from the population and N is the number of elements in the population. The SD of a sample is defined by slightly different formula:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g007.jpg

where s is the sample SD, x is the sample mean, x i is the i th element from the sample and n is the number of elements in the sample. An example for calculation of variation and SD is illustrated in Table 2 .

Example of mean, variance, standard deviation

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g008.jpg

Normal distribution or Gaussian distribution

Most of the biological variables usually cluster around a central value, with symmetrical positive and negative deviations about this point.[ 1 ] The standard normal distribution curve is a symmetrical bell-shaped. In a normal distribution curve, about 68% of the scores are within 1 SD of the mean. Around 95% of the scores are within 2 SDs of the mean and 99% within 3 SDs of the mean [ Figure 2 ].

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g009.jpg

Normal distribution curve

Skewed distribution

It is a distribution with an asymmetry of the variables about its mean. In a negatively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the right of Figure 1 . In a positively skewed distribution [ Figure 3 ], the mass of the distribution is concentrated on the left of the figure leading to a longer right tail.

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g010.jpg

Curves showing negatively skewed and positively skewed distribution

Inferential statistics

In inferential statistics, data are analysed from a sample to make inferences in the larger collection of the population. The purpose is to answer or test the hypotheses. A hypothesis (plural hypotheses) is a proposed explanation for a phenomenon. Hypothesis tests are thus procedures for making rational decisions about the reality of observed effects.

Probability is the measure of the likelihood that an event will occur. Probability is quantified as a number between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty).

In inferential statistics, the term ‘null hypothesis’ ( H 0 ‘ H-naught ,’ ‘ H-null ’) denotes that there is no relationship (difference) between the population variables in question.[ 9 ]

Alternative hypothesis ( H 1 and H a ) denotes that a statement between the variables is expected to be true.[ 9 ]

The P value (or the calculated probability) is the probability of the event occurring by chance if the null hypothesis is true. The P value is a numerical between 0 and 1 and is interpreted by researchers in deciding whether to reject or retain the null hypothesis [ Table 3 ].

P values with interpretation

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g011.jpg

If P value is less than the arbitrarily chosen value (known as α or the significance level), the null hypothesis (H0) is rejected [ Table 4 ]. However, if null hypotheses (H0) is incorrectly rejected, this is known as a Type I error.[ 11 ] Further details regarding alpha error, beta error and sample size calculation and factors influencing them are dealt with in another section of this issue by Das S et al .[ 12 ]

Illustration for null hypothesis

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g012.jpg

PARAMETRIC AND NON-PARAMETRIC TESTS

Numerical data (quantitative variables) that are normally distributed are analysed with parametric tests.[ 13 ]

Two most basic prerequisites for parametric statistical analysis are:

  • The assumption of normality which specifies that the means of the sample group are normally distributed
  • The assumption of equal variance which specifies that the variances of the samples and of their corresponding population are equal.

However, if the distribution of the sample is skewed towards one side or the distribution is unknown due to the small sample size, non-parametric[ 14 ] statistical techniques are used. Non-parametric tests are used to analyse ordinal and categorical data.

Parametric tests

The parametric tests assume that the data are on a quantitative (numerical) scale, with a normal distribution of the underlying population. The samples have the same variance (homogeneity of variances). The samples are randomly drawn from the population, and the observations within a group are independent of each other. The commonly used parametric tests are the Student's t -test, analysis of variance (ANOVA) and repeated measures ANOVA.

Student's t -test

Student's t -test is used to test the null hypothesis that there is no difference between the means of the two groups. It is used in three circumstances:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g013.jpg

where X = sample mean, u = population mean and SE = standard error of mean

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g014.jpg

where X 1 − X 2 is the difference between the means of the two groups and SE denotes the standard error of the difference.

  • To test if the population means estimated by two dependent samples differ significantly (the paired t -test). A usual setting for paired t -test is when measurements are made on the same subjects before and after a treatment.

The formula for paired t -test is:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g015.jpg

where d is the mean difference and SE denotes the standard error of this difference.

The group variances can be compared using the F -test. The F -test is the ratio of variances (var l/var 2). If F differs significantly from 1.0, then it is concluded that the group variances differ significantly.

Analysis of variance

The Student's t -test cannot be used for comparison of three or more groups. The purpose of ANOVA is to test if there is any significant difference between the means of two or more groups.

In ANOVA, we study two variances – (a) between-group variability and (b) within-group variability. The within-group variability (error variance) is the variation that cannot be accounted for in the study design. It is based on random differences present in our samples.

However, the between-group (or effect variance) is the result of our treatment. These two estimates of variances are compared using the F-test.

A simplified formula for the F statistic is:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g016.jpg

where MS b is the mean squares between the groups and MS w is the mean squares within groups.

Repeated measures analysis of variance

As with ANOVA, repeated measures ANOVA analyses the equality of means of three or more groups. However, a repeated measure ANOVA is used when all variables of a sample are measured under different conditions or at different points in time.

As the variables are measured from a sample at different points of time, the measurement of the dependent variable is repeated. Using a standard ANOVA in this case is not appropriate because it fails to model the correlation between the repeated measures: The data violate the ANOVA assumption of independence. Hence, in the measurement of repeated dependent variables, repeated measures ANOVA should be used.

Non-parametric tests

When the assumptions of normality are not met, and the sample means are not normally, distributed parametric tests can lead to erroneous results. Non-parametric tests (distribution-free test) are used in such situation as they do not require the normality assumption.[ 15 ] Non-parametric tests may fail to detect a significant difference when compared with a parametric test. That is, they usually have less power.

As is done for the parametric tests, the test statistic is compared with known values for the sampling distribution of that statistic and the null hypothesis is accepted or rejected. The types of non-parametric analysis techniques and the corresponding parametric analysis techniques are delineated in Table 5 .

Analogue of parametric and non-parametric tests

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g017.jpg

Median test for one sample: The sign test and Wilcoxon's signed rank test

The sign test and Wilcoxon's signed rank test are used for median tests of one sample. These tests examine whether one instance of sample data is greater or smaller than the median reference value.

This test examines the hypothesis about the median θ0 of a population. It tests the null hypothesis H0 = θ0. When the observed value (Xi) is greater than the reference value (θ0), it is marked as+. If the observed value is smaller than the reference value, it is marked as − sign. If the observed value is equal to the reference value (θ0), it is eliminated from the sample.

If the null hypothesis is true, there will be an equal number of + signs and − signs.

The sign test ignores the actual values of the data and only uses + or − signs. Therefore, it is useful when it is difficult to measure the values.

Wilcoxon's signed rank test

There is a major limitation of sign test as we lose the quantitative information of the given data and merely use the + or – signs. Wilcoxon's signed rank test not only examines the observed values in comparison with θ0 but also takes into consideration the relative sizes, adding more statistical power to the test. As in the sign test, if there is an observed value that is equal to the reference value θ0, this observed value is eliminated from the sample.

Wilcoxon's rank sum test ranks all data points in order, calculates the rank sum of each sample and compares the difference in the rank sums.

Mann-Whitney test

It is used to test the null hypothesis that two samples have the same median or, alternatively, whether observations in one sample tend to be larger than observations in the other.

Mann–Whitney test compares all data (xi) belonging to the X group and all data (yi) belonging to the Y group and calculates the probability of xi being greater than yi: P (xi > yi). The null hypothesis states that P (xi > yi) = P (xi < yi) =1/2 while the alternative hypothesis states that P (xi > yi) ≠1/2.

Kolmogorov-Smirnov test

The two-sample Kolmogorov-Smirnov (KS) test was designed as a generic method to test whether two random samples are drawn from the same distribution. The null hypothesis of the KS test is that both distributions are identical. The statistic of the KS test is a distance between the two empirical distributions, computed as the maximum absolute difference between their cumulative curves.

Kruskal-Wallis test

The Kruskal–Wallis test is a non-parametric test to analyse the variance.[ 14 ] It analyses if there is any difference in the median values of three or more independent samples. The data values are ranked in an increasing order, and the rank sums calculated followed by calculation of the test statistic.

Jonckheere test

In contrast to Kruskal–Wallis test, in Jonckheere test, there is an a priori ordering that gives it a more statistical power than the Kruskal–Wallis test.[ 14 ]

Friedman test

The Friedman test is a non-parametric test for testing the difference between several related samples. The Friedman test is an alternative for repeated measures ANOVAs which is used when the same parameter has been measured under different conditions on the same subjects.[ 13 ]

Tests to analyse the categorical data

Chi-square test, Fischer's exact test and McNemar's test are used to analyse the categorical or nominal variables. The Chi-square test compares the frequencies and tests whether the observed data differ significantly from that of the expected data if there were no differences between groups (i.e., the null hypothesis). It is calculated by the sum of the squared difference between observed ( O ) and the expected ( E ) data (or the deviation, d ) divided by the expected data by the following formula:

An external file that holds a picture, illustration, etc.
Object name is IJA-60-662-g018.jpg

A Yates correction factor is used when the sample size is small. Fischer's exact test is used to determine if there are non-random associations between two categorical variables. It does not assume random sampling, and instead of referring a calculated statistic to a sampling distribution, it calculates an exact probability. McNemar's test is used for paired nominal data. It is applied to 2 × 2 table with paired-dependent samples. It is used to determine whether the row and column frequencies are equal (that is, whether there is ‘marginal homogeneity’). The null hypothesis is that the paired proportions are equal. The Mantel-Haenszel Chi-square test is a multivariate test as it analyses multiple grouping variables. It stratifies according to the nominated confounding variables and identifies any that affects the primary outcome variable. If the outcome variable is dichotomous, then logistic regression is used.

SOFTWARES AVAILABLE FOR STATISTICS, SAMPLE SIZE CALCULATION AND POWER ANALYSIS

Numerous statistical software systems are available currently. The commonly used software systems are Statistical Package for the Social Sciences (SPSS – manufactured by IBM corporation), Statistical Analysis System ((SAS – developed by SAS Institute North Carolina, United States of America), R (designed by Ross Ihaka and Robert Gentleman from R core team), Minitab (developed by Minitab Inc), Stata (developed by StataCorp) and the MS Excel (developed by Microsoft).

There are a number of web resources which are related to statistical power analyses. A few are:

  • StatPages.net – provides links to a number of online power calculators
  • G-Power – provides a downloadable power analysis program that runs under DOS
  • Power analysis for ANOVA designs an interactive site that calculates power or sample size needed to attain a given power for one effect in a factorial ANOVA design
  • SPSS makes a program called SamplePower. It gives an output of a complete report on the computer screen which can be cut and paste into another document.

It is important that a researcher knows the concepts of the basic statistical methods used for conduct of a research study. This will help to conduct an appropriately well-designed study leading to valid and reliable results. Inappropriate use of statistical techniques may lead to faulty conclusions, inducing errors and undermining the significance of the article. Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research which can be utilised for formulating the evidence-based guidelines.

Financial support and sponsorship

Conflicts of interest.

There are no conflicts of interest.

IMAGES

  1. (PDF) A case study report on integrating statistics, problem-based

    research papers in statistics

  2. Statistics research statement sample by researchstatement74 on DeviantArt

    research papers in statistics

  3. (PDF) Adaptation of Survey of Attitudes Towards Statistics (SATS 36

    research papers in statistics

  4. How to Use Tables & Graphs in a Research Paper

    research papers in statistics

  5. Sample of Chapter 3 in Research Paper (Probability and Statistics

    research papers in statistics

  6. 😊 Statistical analysis paper. Free statistics project Essays and Papers

    research papers in statistics

VIDEO

  1. Research Methods, Kinesiology 370 Statistics Chapter 1-4 TEST BANK

  2. || How to find Research Papers & Identify Research Gap || AI tools || Research Beginners Guide ||

  3. Very imp for Annual Exam ; Stat 22 districts prep 1M Question Answers

  4. Lesson 1: Writing a Research Paper

  5. How to find academic research papers and analyze them using Keymate AI and ChatGPT

  6. Second year// Past Papers 12//Long Questions// Statistics 2014 to 2023//Solved/Sampling distribution

COMMENTS

  1. Statistics

    Read the latest Research articles in Statistics from Scientific Reports. ... Measuring the similarity of charts in graphical statistics. Krzysztof Górnisiewicz, ... Calls for Papers

  2. Research Papers / Publications

    Research Papers / Publications. Xianli Zeng, Guang Cheng, Edgar Dobriban Minimax Optimal Fair Classification with Bounded Demographic Disparity. Gen Li, Yu Huang, Timofey Efimov, Yuting Wei, Yuejie Chi, Yuxin Chen, Accelerating Convergence of Score-Based Diffusion Models, Provably.

  3. Home

    Overview. Statistical Papers is a forum for presentation and critical assessment of statistical methods encouraging the discussion of methodological foundations and potential applications. The Journal stresses statistical methods that have broad applications, giving special attention to those relevant to the economic and social sciences.

  4. Research in Statistics

    Taylor & Francis are currently supporting a 100% APC discount for all authors. Research in Statistics is a broad open access journal publishing original research in all areas of statistics and probability.The journal focuses on broadening existing research fields, and in facilitating international collaboration, and is devoted to the international advancement of the theory and application of ...

  5. The Beginner's Guide to Statistical Analysis

    Table of contents. Step 1: Write your hypotheses and plan your research design. Step 2: Collect data from a sample. Step 3: Summarize your data with descriptive statistics. Step 4: Test hypotheses or make estimates with inferential statistics.

  6. Journals

    Journal of Educational and Behavioral Statistics. Co-sponsored by the ASA and American Educational Research Association, JEBS includes papers that present new methods of analysis, critical reviews of current practice, tutorial presentations of less well-known methods, and novel applications of already-known methods.

  7. The Annals of Statistics

    1973-2020 • The Annals of Statistics. 1930-1972 •. The Annals of Statistics publishes research papers of the highest quality reflecting the many facets of contemporary statistics. Primary emphasis is placed on importance and originality, not on formalism. The discipline of statistics has deep roots in both mathematics and in substantive ...

  8. Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview over different proposed structures of Data Science and address the impact of statistics on such steps as data ...

  9. Journal of Applied Statistics

    The Journal publishes original research papers, review articles, and short application notes. In general, ... The Journal of Applied Statistics Best Paper Prize is awarded annually, as decided by the Editor-in-Chief with the support of the Associate Editors. The winning article receives a £500 prize, and their paper will be made free to view ...

  10. Introduction to Research Statistical Analysis: An Overview of the

    Introduction. Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology.

  11. (PDF) Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods. to find structure in and to give deeper insight into data, and ...

  12. Statistics: Vol 58, No 1 (Current issue)

    Poisson-modification of the Quasi Lindley optimal generalized regression estimator. Ramajeyam Tharshan et al. Article | Published online: 25 Mar 2024. Explore the current issue of Statistics, Volume 58, Issue 1, 2024.

  13. Journal of Probability and Statistics

    13 Nov 2023. 03 Nov 2023. 07 Oct 2023. 30 Sep 2023. 31 Aug 2023. Journal of Probability and Statistics publishes papers on the theory and application of probability and statistics that consider new methods and approaches to their implementation, or report significant results for the field.

  14. Descriptive Statistics for Summarising Data

    Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s - the fastest quality decision to 17.10 - the slowest quality decision).

  15. An Introduction to Statistics: Understanding Hypothesis Testing and

    Ranganathan P, Pramesh CS. An Introduction to Statistics: Understanding Hypothesis Testing and Statistical Errors. Indian J Crit Care Med 2019;23 (Suppl 3):S230-S231. Keywords: Biostatistics, Research design, Statistical bias. Two papers quoted in this issue of the Indian Journal of Critical Care Medicine report.

  16. (PDF) Use of Statistics in Research

    The function of statistics in research is to purpose as a tool in conniving research, analyzing its data and portrayal of conclusions. there from. Most research studies result in a extensive ...

  17. How to Write Statistics Research Paper

    A statistics research paper is an academic document presenting original findings or analyses derived from the data's collection, organization, analysis, and interpretation. It addresses research questions or hypotheses within the field of statistics.

  18. Statistical Research Papers by Topic

    The Statistical Research Report Series (RR) covers research in statistical methodology and estimation. Page Last Revised - October 8, 2021. View Statistical Research reports by their topics.

  19. Global cancer statistics 2022: GLOBOCAN estimates of incidence ...

    This article presents global cancer statistics by world region for the year 2022 based on updated estimates from the International Agency for Research on Cancer (IARC). There were close to 20 million new cases of cancer in the year 2022 (including nonmelanoma skin cancers [NMSCs]) alongside 9.7 mill …

  20. Basics of statistics for primary care research

    The following are the general steps for statistical analysis: (1) formulate a hypothesis, (2) select an appropriate statistical test, (3) conduct a power analysis, (4) prepare data for analysis, (5) start with descriptive statistics, (6) check assumptions of tests, (7) run the analysis, (8) examine the statistical model, (9) report the results ...

  21. After Courts: Democratizing Statutory Law

    Download This Paper. Open PDF in Browser. Share: Permalink. ... Do you have negative results from your research you'd like to share? Submit Negative Results. Paper statistics. Downloads. 266. Abstract Views. 603. PlumX Metrics. Related eJournals. U.S. Constitutional Law: Separation of Powers & Federalism eJournal.

  22. Basic statistical tools in research and data analysis

    Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research ...