Data Science: the impact of statistics

  • Regular Paper
  • Open access
  • Published: 16 February 2018
  • Volume 6, pages 189–194 (2018)


  • Claus Weihs 1
  • Katja Ickstadt 2


Abstract

In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview of different proposed structures of Data Science and address the impact of statistics on such steps as data acquisition and enrichment, data exploration, data analysis and modeling, validation, and representation and reporting. We also indicate fallacies that arise when statistical reasoning is neglected.


1 Introduction and premise

Data Science as a scientific discipline is influenced by informatics, computer science, mathematics, operations research, and statistics as well as the applied sciences.

In 1996, for the first time, the term Data Science was included in the title of a statistical conference (International Federation of Classification Societies (IFCS) “Data Science, classification, and related methods”) [ 37 ]. Even though the term was coined by statisticians, the public image of Data Science often stresses the importance of computer science and business applications much more strongly, in particular in the era of Big Data.

Already in the 1970s, the ideas of John Tukey [ 43 ] changed the viewpoint of statistics from a purely mathematical setting (e.g., statistical testing) to deriving hypotheses from data (the exploratory setting), i.e., trying to understand the data before hypothesizing.

Another root of Data Science is Knowledge Discovery in Databases (KDD) [ 36 ] with its sub-topic Data Mining . KDD already brings together many different approaches to knowledge discovery, including inductive learning, (Bayesian) statistics, query optimization, expert systems, information theory, and fuzzy sets. Thus, KDD is a big building block for fostering interaction between different fields for the overall goal of identifying knowledge in data.

Nowadays, these ideas are combined in the notion of Data Science, leading to different definitions. One of the most comprehensive definitions of Data Science was recently given by Cao as the formula [ 12 ]:

data science = (statistics + informatics + computing + communication + sociology + management) | (data + environment + thinking).

In this formula, sociology stands for the social aspects and | (data + environment + thinking) means that all the mentioned sciences act on the basis of data, the environment and the so-called data-to-knowledge-to-wisdom thinking.

A recent, comprehensive overview of Data Science, provided by Donoho in 2015 [ 16 ], focuses on the evolution of Data Science from statistics. Indeed, as early as 1997, there was an even more radical view suggesting that statistics be renamed Data Science [ 50 ]. And in 2015, a number of ASA leaders [ 17 ] released a statement about the role of statistics in Data Science, saying that “statistics and machine learning play a central role in data science.”

In our view, statistical methods are crucial in most fundamental steps of Data Science. Hence, the premise of our contribution is:

Statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty.

This paper aims at addressing the major impact of statistics on the most important steps in Data Science.

2 Steps in data science

One of the forerunners of Data Science from a structural perspective is the famous CRISP-DM (Cross Industry Standard Process for Data Mining), which is organized in six main steps: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment [ 10 ], see Table  1 , left column. Ideas like CRISP-DM are now fundamental for applied statistics.

In our view, the main steps in Data Science have been inspired by CRISP-DM and have evolved, leading, e.g., to our definition of Data Science as a sequence of the following steps: Data Acquisition and Enrichment, Data Storage and Access, Data Exploration, Data Analysis and Modeling, Optimization of Algorithms, Model Validation and Selection, Representation and Reporting of Results, and Business Deployment of Results. Note that the steps set in small capitals in Table  1 (Data Storage and Access, Optimization of Algorithms, and Business Deployment of Results) are those where statistics is less involved, cp. Table  1 , right column.

Usually, these steps are not just conducted once but are iterated in a cyclic loop. In addition, it is common to alternate between two or more steps. This holds especially for the steps Data Acquisition and Enrichment , Data Exploration , and Statistical Data Analysis , as well as for Statistical Data Analysis and Modeling and Model Validation and Selection .

Table  1 compares different definitions of steps in Data Science. The relationship of terms is indicated by horizontal blocks. The missing step Data Acquisition and Enrichment in CRISP-DM indicates that this scheme deals with observational data only. Moreover, in our proposal, the steps Data Storage and Access and Optimization of Algorithms, where statistics is less involved, are added to CRISP-DM.

The list of steps for Data Science may even be enlarged, see, e.g., Cao in [ 12 ], Figure 6, cp. also Table  1 , middle column, for the following recent list: Domain-specific Data Applications and Problems, Data Storage and Management, Data Quality Enhancement, Data Modeling and Representation, Deep Analytics, Learning and Discovery, Simulation and Experiment Design, High-performance Processing and Analytics, Networking, Communication, Data-to-Decision and Actions.

In principle, Cao’s and our proposal cover the same main steps. However, in parts, Cao’s formulation is more detailed; e.g., our step Data Analysis and Modeling corresponds to Data Modeling and Representation, Deep Analytics, Learning and Discovery . Also, the vocabularies differ slightly, depending on whether the respective background is computer science or statistics. In that respect note that Experiment Design in Cao’s definition means the design of the simulation experiments.

In what follows, we will highlight the role of statistics by discussing, in Sects.  2.1 – 2.6 , all the steps where it is heavily involved. These coincide with all steps in our proposal in Table  1 except the steps in small capitals. The corresponding entries Data Storage and Access and Optimization of Algorithms are mainly covered by informatics and computer science, whereas Business Deployment of Results is covered by business management.

2.1 Data acquisition and enrichment

Design of experiments (DOE) is essential for a systematic generation of data when the effect of noisy factors has to be identified. Controlled experiments are fundamental for robust process engineering to produce reliable products despite variation in the process variables. On the one hand, even controllable factors contain a certain amount of uncontrollable variation that affects the response. On the other hand, some factors, like environmental factors, cannot be controlled at all. Nevertheless, at least the effect of such noisy influencing factors should be controlled by, e.g., DOE.

DOE can be utilized, e.g.,

to systematically generate new data ( data acquisition ) [ 33 ],

for systematically reducing data bases [ 41 ], and

for tuning (i.e., optimizing) parameters of algorithms [ 1 ], i.e., for improving the data analysis methods (see Sect.  2.3 ) themselves, as sketched below.
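
As an illustrative sketch of the tuning use case, the following Python snippet evaluates a small full factorial design over two algorithm parameters; the specific learner, parameter levels, and scoring function are assumptions chosen for illustration, not taken from [ 1 ].

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Full factorial design over two tuning parameters (illustrative levels).
design = list(product([50, 100, 200], [2, 5, 10]))  # (n_estimators, max_depth)

results = []
for n_estimators, max_depth in design:
    model = RandomForestClassifier(n_estimators=n_estimators,
                                   max_depth=max_depth, random_state=0)
    score = cross_val_score(model, X, y, cv=5).mean()
    results.append(((n_estimators, max_depth), score))

# Pick the factor combination with the best cross-validated score.
best_params, best_score = max(results, key=lambda r: r[1])
print("best (n_estimators, max_depth):", best_params, "score:", round(best_score, 3))
```

In a real study, a fractional design as in [ 1 ] would reduce the number of runs further.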

Simulations [ 7 ] may also be used to generate new data. A tool for the enrichment of data bases to fill data gaps is the imputation of missing data [ 31 ].
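
As a minimal sketch of data enrichment by imputation, the snippet below fills gaps with column means using scikit-learn; mean imputation is only one simple strategy and the toy values are assumptions, see [ 31 ] for a thorough treatment of missing-data methodology.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data base with gaps (np.nan marks missing entries).
X = np.array([[7.2, 140.0],
              [np.nan, 150.0],
              [8.1, np.nan],
              [9.0, 135.0]])

# Replace each missing entry by the mean of its column.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```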

Such statistical methods for data generation and enrichment need to be part of the backbone of Data Science. The exclusive use of observational data without any noise control distinctly diminishes the quality of data analysis results and may even lead to wrong interpretations of the results. The hope for “The End of Theory: The Data Deluge Makes the Scientific Method Obsolete” [ 4 ] appears to be wrong due to the noise in the data.

Thus, experimental design is crucial for the reliability, validity, and replicability of our results.

2.2 Data exploration

Exploratory statistics is essential for data preprocessing to learn about the contents of a data base. Exploration and visualization of observed data were, in a way, initiated by John Tukey [ 43 ]. Since that time, the most laborious part of data analysis, namely data understanding and transformation, has become an important part of statistical science.

Data exploration or data mining is fundamental for the proper usage of analytical methods in Data Science. The most important contribution of statistics is the notion of distribution . It allows us to represent variability in the data as well as (a-priori) knowledge of parameters, the concept underlying Bayesian statistics. Distributions also enable us to choose adequate subsequent analytic models and methods.

2.3 Statistical data analysis

Finding structure in data and making predictions are the most important steps in Data Science. Here, in particular, statistical methods are essential since they are able to handle many different analytical tasks. Important examples of statistical data analysis methods are the following.

Hypothesis testing is one of the pillars of statistical analysis. Questions arising in data-driven problems can often be translated into hypotheses. Also, hypotheses are the natural links between underlying theory and statistics. Since statistical hypotheses are related to statistical tests, questions and theory can be tested on the available data. Using the same data in several different tests often makes it necessary to correct the significance levels. In applied statistics, correct multiple testing is one of the most important problems, e.g., in pharmaceutical studies [ 15 ]. Ignoring such techniques would lead to many more significant results than are justified.
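
A minimal sketch of such a correction, assuming a family of two-sample t tests and a Holm adjustment via statsmodels; neither the tests nor the adjustment method are prescribed by [ 15 ], they merely illustrate the principle.

```python
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, size=(10, 30))   # 10 endpoints, 30 patients each
group_b = rng.normal(0.2, 1.0, size=(10, 30))

# One two-sample t test per endpoint gives 10 raw p values.
p_raw = np.array([stats.ttest_ind(a, b).pvalue for a, b in zip(group_a, group_b)])

# Holm correction controls the family-wise error rate at 5%.
reject, p_adj, _, _ = multipletests(p_raw, alpha=0.05, method="holm")
print("significant after correction:", reject.sum(), "of", len(p_raw))
```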

Classification methods are basic for finding and predicting subpopulations from data. In the so-called unsupervised case, such subpopulations are to be found from a data set without a-priori knowledge of any cases of such subpopulations. This is often called clustering.

In the so-called supervised case, classification rules should be found from a labeled data set for the prediction of unknown labels when only influential factors are available.

Nowadays, there is a plethora of methods for the unsupervised [ 22 ] as well as for the supervised case [ 2 ].
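
A compact sketch of both cases on synthetic data; k-means for the unsupervised case and a random forest for the supervised case are only two of the many methods surveyed in [ 22 ] and [ 2 ] and are chosen here purely for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with three subpopulations.
X, y = make_blobs(n_samples=600, centers=3, random_state=0)

# Unsupervised: find subpopulations without using the labels (clustering).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Supervised: learn a classification rule from labeled data and predict new labels.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```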

In the age of Big Data, a new look at the classical methods appears to be necessary, though, since most of the time the computational effort of complex analysis methods grows faster than linearly with the number of observations n or the number of features p . In the case of Big Data, i.e., if n or p is large, this leads to prohibitively long computation times and to numerical problems. This results both in a comeback of simpler optimization algorithms with low time complexity [ 9 ] and in a re-examination of the traditional methods in statistics and machine learning for Big Data [ 46 ].
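
One illustration of such a low-complexity optimizer is stochastic gradient descent, which processes the data in mini-batches so that time and memory grow only linearly in n; the scikit-learn estimator below is an assumed example, not a method singled out in [ 9 ].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=100_000, n_features=50, random_state=0)

# Stochastic gradient descent: one pass over the data in mini-batches.
clf = SGDClassifier(random_state=0)
classes = np.unique(y)
for start in range(0, len(X), 10_000):
    batch = slice(start, start + 10_000)
    clf.partial_fit(X[batch], y[batch], classes=classes)

print("training accuracy:", round(clf.score(X, y), 3))
```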

Regression methods are the main tool to find global and local relationships between features when the target variable is measured. Depending on the distributional assumption for the underlying data, different approaches may be applied. Under the normality assumption, linear regression is the most common method, while generalized linear regression is usually employed for other distributions from the exponential family [ 18 ]. More advanced methods comprise functional regression for functional data [ 38 ], quantile regression [ 25 ], and regression based on loss functions other than squared error loss like, e.g., Lasso regression [ 11 , 21 ]. In the context of Big Data, the challenges are similar to those for classification methods given large numbers of observations n (e.g., in data streams) and / or large numbers of features p . For the reduction of n , data reduction techniques like compressed sensing, random projection methods [ 20 ] or sampling-based procedures [ 28 ] enable faster computations. For decreasing the number p to the most influential features, variable selection or shrinkage approaches like the Lasso [ 21 ] can be employed, keeping the interpretability of the features. (Sparse) principal component analysis [ 21 ] may also be used.
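
As a small sketch of reducing p by variable selection, the snippet below fits a Lasso to a regression problem with many irrelevant features; the simulated data and the regularization strength are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Regression with many features, only a few of which are truly influential.
X, y, coef = make_regression(n_samples=200, n_features=50, n_informative=5,
                             coef=True, noise=5.0, random_state=0)

# The Lasso shrinks most coefficients exactly to zero, selecting variables
# while keeping the remaining features interpretable.
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by the Lasso:", np.flatnonzero(lasso.coef_))
print("truly informative features:", np.flatnonzero(coef))
```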

Time series analysis aims at understanding and predicting temporal structure [ 42 ]. Time series are very common in studies of observational data, and prediction is the most important challenge for such data. Typical application areas are the behavioral sciences and economics as well as the natural sciences and engineering. As an example, let us have a look at signal analysis, e.g., speech or music data analysis. Here, statistical methods comprise the analysis of models in the time and frequency domains. The main aim is the prediction of future values of the time series itself or of its properties. For example, the vibrato of an audio time series might be modeled in order to realistically predict the tone in the future [ 24 ] and the fundamental frequency of a musical tone might be predicted by rules learned from elapsed time periods [ 29 ].

In econometrics, multiple time series and their co-integration are often analyzed [ 27 ]. In technical applications, process control is a common aim of time series analysis [ 34 ].
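
As a minimal illustration of temporal prediction, the sketch below fits a low-order autoregressive model to a simulated series and forecasts the next values; the AR(2) structure and statsmodels are assumptions for illustration, not the models used in [ 24 , 29 ].

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)

# Simulate an AR(2) series: each value depends on the two previous ones plus noise.
n = 300
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

# Fit an ARIMA(2, 0, 0) model and predict the next five time points.
fit = ARIMA(y, order=(2, 0, 0)).fit()
print(fit.forecast(steps=5))
```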

2.4 Statistical modeling

Complex interactions between factors can be modeled by graphs or networks . Here, an interaction between two factors is modeled by a connection in the graph or network [ 26 , 35 ]. The graphs can be undirected as, e.g., in Gaussian graphical models, or directed as, e.g., in Bayesian networks. The main goal in network analysis is deriving the network structure. Sometimes, it is necessary to separate (unmix) subpopulation specific network topologies [ 49 ].
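
A brief sketch of estimating such an undirected network: the graphical lasso yields a sparse precision matrix whose nonzero off-diagonal entries correspond to edges; the simulated chain graph and the scikit-learn estimator are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(0)

# True sparse precision matrix: only neighbours on a chain are conditionally dependent.
precision = np.eye(5) + 0.4 * (np.eye(5, k=1) + np.eye(5, k=-1))
cov = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(5), cov, size=1000)

# The graphical lasso estimates a sparse precision matrix; nonzero
# off-diagonal entries correspond to edges of the undirected graph.
model = GraphicalLassoCV().fit(X)
edges = (np.abs(model.precision_) > 1e-2).astype(int)
print(edges)
```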

Stochastic differential and difference equations can represent models from the natural and engineering sciences [ 3 , 39 ]. The finding of approximate statistical models solving such equations can lead to valuable insights for, e.g., the statistical control of such processes, e.g., in mechanical engineering [ 48 ]. Such methods can build a bridge between the applied sciences and Data Science.

Local models and globalization. Typically, statistical models are only valid in sub-regions of the domain of the involved variables. In such cases, local models can be used [ 8 ]. The analysis of structural breaks can be fundamental for identifying the regions for local modeling in time series [ 5 ]. Also, the analysis of concept drift can be used to investigate model changes over time [ 30 ].

In time series, there are often hierarchies of more and more global structures. For example, in music, a basic local structure is given by the notes and more and more global ones by bars, motifs, phrases, parts etc. In order to find global properties of a time series, properties of the local models can be combined to more global characteristics [ 47 ].

Mixture models can also be used for the generalization of local to global models [ 19 , 23 ]. Model combination is essential for the characterization of real relationships since standard mathematical models are often much too simple to be valid for heterogeneous data or bigger regions of interest.
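
A minimal sketch of unmixing heterogeneous data with a two-component Gaussian mixture; the simulated subgroups and the scikit-learn estimator are assumptions used only to illustrate why the pooled mean can be misleading.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Heterogeneous data: two subgroups with different means, pooled together.
data = np.concatenate([rng.normal(0.0, 1.0, 300),
                       rng.normal(5.0, 1.0, 200)]).reshape(-1, 1)

# The overall mean describes neither subgroup well ...
print("pooled mean:", data.mean().round(2))

# ... whereas the mixture model recovers both components and their weights.
gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("component means:", gmm.means_.ravel().round(2))
print("component weights:", gmm.weights_.round(2))
```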

2.5 Model validation and model selection

In cases where more than one model is proposed for, e.g., prediction, statistical tests for comparing models are helpful to structure the models, e.g., concerning their predictive power [ 45 ].

Predictive power is typically assessed by means of so-called resampling methods where the distribution of power characteristics is studied by artificially varying the subpopulation used to learn the model. Characteristics of such distributions can be used for model selection [ 7 ].
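
A short sketch of resampling-based comparison: repeated cross-validation produces a distribution of accuracies per candidate model, whose mean and spread can guide model selection; the two candidate models below are arbitrary illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Repeated cross-validation: refit each model on many resampled training sets.
cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("decision tree", DecisionTreeClassifier(random_state=0))]:
    scores = cross_val_score(model, X, y, cv=cv)
    # Mean and spread of the accuracy distribution support model selection.
    print(f"{name}: mean={scores.mean():.3f}, sd={scores.std():.3f}")
```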

Perturbation experiments offer another possibility to evaluate the performance of models. In this way, the stability of the different models against noise is assessed [ 32 , 44 ].

Meta-analysis as well as model averaging are methods to evaluate combined models [ 13 , 14 ].

Model selection has become more and more important in recent years, since the number of classification and regression models proposed in the literature is growing at an ever-increasing pace.

2.6 Representation and reporting

Visualization of the structures found and storage of the models in an easy-to-update form are very important tasks in statistical analyses for communicating the results and safeguarding the deployment of the data analysis. Deployment is decisive for obtaining interpretable results in Data Science. It is the last step in CRISP-DM [ 10 ] and underlies the data-to-decision and action step in Cao [ 12 ].

Besides visualization and adequate model storage, the main task for statistics is the reporting of uncertainties and review [ 6 ].

3 Fallacies

The statistical methods described in Sect.  2 are fundamental for finding structure in data and for obtaining deeper insight into data, and thus, for a successful data analysis. Ignoring modern statistical thinking or using simplistic data analytics/statistical methods may lead to avoidable fallacies. This holds, in particular, for the analysis of big and/or complex data.

As mentioned at the end of Sect.  2.2 , the notion of distribution is the key contribution of statistics. Not taking distributions into account in data exploration and in modeling restricts us to reporting values and parameter estimates without their corresponding variability. Only the notion of distributions enables us to predict with corresponding error bands.
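
As a small worked example of reporting an estimate together with its variability, the sketch below computes a sample mean with a t-based 95% confidence interval; the data are simulated and purely illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=10.0, scale=2.0, size=40)    # simulated measurements

mean = sample.mean()
sem = sample.std(ddof=1) / np.sqrt(len(sample))      # standard error of the mean

# 95% confidence interval based on the t distribution with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
lower, upper = mean - t_crit * sem, mean + t_crit * sem
print(f"mean = {mean:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```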

Moreover, distributions are the key to model-based data analytics. For example, unsupervised learning can be employed to find clusters in data. If additional structure like dependency on space or time is present, it is often important to infer parameters like cluster radii and their spatio-temporal evolution. Such model-based analysis heavily depends on the notion of distributions (see [ 40 ] for an application to protein clusters).

If more than one parameter is of interest, it is advisable to compare univariate hypothesis testing approaches to multiple procedures, e.g., in multiple regression, and to choose the most adequate model by variable selection. Restricting oneself to univariate testing would ignore relationships between variables.

Deeper insight into data might require more complex models, like, e.g., mixture models for detecting heterogeneous groups in data. When ignoring the mixture, the result often represents a meaningless average, and learning the subgroups by unmixing the components might be needed. In a Bayesian framework, this is enabled by, e.g., latent allocation variables in a Dirichlet mixture model. For an application of decomposing a mixture of different networks in a heterogeneous cell population in molecular biology see [ 49 ].

A mixture model might represent mixtures of components of very unequal sizes, with small components (outliers) being of particular importance. In the context of Big Data, naïve sampling procedures are often employed for model estimation. However, these have the risk of missing small mixture components. Hence, model validation or sampling according to a more suitable distribution as well as resampling methods for predictive power are important.

4 Conclusion

Following the above assessment of the capabilities and impacts of statistics, our conclusion is:

The role of statistics in Data Science is underestimated, e.g., in comparison with computer science. This holds, in particular, for the areas of data acquisition and enrichment as well as for the advanced modeling needed for prediction.

Stimulated by this conclusion, statisticians are well advised to play their role more proactively in this modern and well-accepted field of Data Science.

Only complementing and/or combining mathematical methods and computational algorithms with statistical reasoning, particularly for Big Data, will lead to scientific results based on suitable approaches. Ultimately, only a balanced interplay of all sciences involved will lead to successful solutions in Data Science.

References

Adenso-Diaz, B., Laguna, M.: Fine-tuning of algorithms using fractional experimental designs and local search. Oper. Res. 54 (1), 99–114 (2006)

Aggarwal, C.C. (ed.): Data Classification: Algorithms and Applications. CRC Press, Boca Raton (2014)

Allen, E., Allen, L., Arciniega, A., Greenwood, P.: Construction of equivalent stochastic differential equation models. Stoch. Anal. Appl. 26 , 274–297 (2008)

Anderson, C.: The End of Theory: The Data Deluge Makes the Scientific Method Obsolete. Wired Magazine https://www.wired.com/2008/06/pb-theory/ (2008)

Aue, A., Horváth, L.: Structural breaks in time series. J. Time Ser. Anal. 34 (1), 1–16 (2013)

Berger, R.E.: A scientific approach to writing for engineers and scientists. IEEE PCS Professional Engineering Communication Series IEEE Press, Wiley (2014)

Bischl, B., Mersmann, O., Trautmann, H., Weihs, C.: Resampling methods for meta-model validation with recommendations for evolutionary computation. Evol. Comput. 20 (2), 249–275 (2012)

Bischl, B., Schiffner, J., Weihs, C.: Benchmarking local classification methods. Comput. Stat. 28 (6), 2599–2619 (2013)

Bottou, L., Curtis, F.E., Nocedal, J.: Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838 (2016)

Brown, M.S.: Data Mining for Dummies. Wiley, London (2014)

Bühlmann, P., Van De Geer, S.: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin (2011)

Cao, L.: Data science: a comprehensive overview. ACM Comput. Surv. (2017). https://doi.org/10.1145/3076253

Claeskens, G., Hjort, N.L.: Model Selection and Model Averaging. Cambridge University Press, Cambridge (2008)

Cooper, H., Hedges, L.V., Valentine, J.C.: The Handbook of Research Synthesis and Meta-analysis. Russell Sage Foundation, New York City (2009)

Dmitrienko, A., Tamhane, A.C., Bretz, F.: Multiple Testing Problems in Pharmaceutical Statistics. Chapman and Hall/CRC, London (2009)

Donoho, D.: 50 Years of Data Science. http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf (2015)

Dyk, D.V., Fuentes, M., Jordan, M.I., Newton, M., Ray, B.K., Lang, D.T., Wickham, H.: ASA Statement on the Role of Statistics in Data Science. http://magazine.amstat.org/blog/2015/10/01/asa-statement-on-the-role-of-statistics-in-data-science/ (2015)

Fahrmeir, L., Kneib, T., Lang, S., Marx, B.: Regression: Models, Methods and Applications. Springer, Berlin (2013)

Frühwirth-Schnatter, S.: Finite Mixture and Markov Switching Models. Springer, Berlin (2006)

Geppert, L., Ickstadt, K., Munteanu, A., Quedenfeld, J., Sohler, C.: Random projections for Bayesian regression. Stat. Comput. 27 (1), 79–101 (2017). https://doi.org/10.1007/s11222-015-9608-z

Hastie, T., Tibshirani, R., Wainwright, M.: Statistical Learning with Sparsity: The Lasso and Generalizations. CRC Press, Boca Raton (2015)

Hennig, C., Meila, M., Murtagh, F., Rocci, R.: Handbook of Cluster Analysis. Chapman & Hall, London (2015)

Klein, H.U., Schäfer, M., Porse, B.T., Hasemann, M.S., Ickstadt, K., Dugas, M.: Integrative analysis of histone chip-seq and transcription data using Bayesian mixture models. Bioinformatics 30 (8), 1154–1162 (2014)

Knoche, S., Ebeling, M.: The musical signal: physically and psychologically, chap 2. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 15–68. CRC Press, Boca Raton (2017)

Koenker, R.: Quantile Regression. Econometric Society Monographs, vol. 38 (2010)

Koller, D., Friedman, N.: Probabilistic Graphical Models: Principles and Techniques. MIT press, Cambridge (2009)

Lütkepohl, H.: New Introduction to Multiple Time Series Analysis. Springer, Berlin (2010)

Ma, P., Mahoney, M.W., Yu, B.: A statistical perspective on algorithmic leveraging. In: Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pp 91–99. http://jmlr.org/proceedings/papers/v32/ma14.html (2014)

Martin, R., Nagathil, A.: Digital filters and spectral analysis, chap 4. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 111–143. CRC Press, Boca Raton (2017)

Mejri, D., Limam, M., Weihs, C.: A new dynamic weighted majority control chart for data streams. Soft Comput. 22(2), 511–522. https://doi.org/10.1007/s00500-016-2351-3

Molenberghs, G., Fitzmaurice, G., Kenward, M.G., Tsiatis, A., Verbeke, G.: Handbook of Missing Data Methodology. CRC Press, Boca Raton (2014)

Molinelli, E.J., Korkut, A., Wang, W.Q., Miller, M.L., Gauthier, N.P., Jing, X., Kaushik, P., He, Q., Mills, G., Solit, D.B., Pratilas, C.A., Weigt, M., Braunstein, A., Pagnani, A., Zecchina, R., Sander, C.: Perturbation Biology: Inferring Signaling Networks in Cellular Systems. arXiv preprint arXiv:1308.5193 (2013)

Montgomery, D.C.: Design and Analysis of Experiments, 8th edn. Wiley, London (2013)

Oakland, J.: Statistical Process Control. Routledge, London (2007)

Pearl, J.: Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, Los Altos (1988)

Piateski, G., Frawley, W.: Knowledge Discovery in Databases. MIT Press, Cambridge (1991)

Press, G.: A Very Short History of Data Science. https://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#5c515ed055cf (2013). [last visit: March 19, 2017]

Ramsay, J., Silverman, B.W.: Functional Data Analysis. Springer, Berlin (2005)

Särkkä, S.: Applied Stochastic Differential Equations. https://users.aalto.fi/~ssarkka/course_s2012/pdf/sde_course_booklet_2012.pdf (2012). [last visit: March 6, 2017]

Schäfer, M., Radon, Y., Klein, T., Herrmann, S., Schwender, H., Verveer, P.J., Ickstadt, K.: A Bayesian mixture model to quantify parameters of spatial clustering. Comput. Stat. Data Anal. 92 , 163–176 (2015). https://doi.org/10.1016/j.csda.2015.07.004

Schiffner, J., Weihs, C.: D-optimal plans for variable selection in data bases. Technical Report, 14/09, SFB 475 (2009)

Shumway, R.H., Stoffer, D.S.: Time Series Analysis and Its Applications: With R Examples. Springer, Berlin (2010)

Tukey, J.W.: Exploratory Data Analysis. Pearson, London (1977)

Vatcheva, I., de Jong, H., Mars, N.: Selection of perturbation experiments for model discrimination. In: Horn, W. (ed.) Proceedings of the 14th European Conference on Artificial Intelligence, ECAI-2000, IOS Press, pp 191–195 (2000)

Vatolkin, I., Weihs, C.: Evaluation, chap 13. In: Weihs, C., Jannach, D., Vatolkin, I., Rudolph, G. (eds.) Music Data Analysis—Foundations and Applications, pp. 329–363. CRC Press, Boca Raton (2017)

Weihs, C.: Big data classification — aspects on many features. In: Michaelis, S., Piatkowski, N., Stolpe, M. (eds.) Solving Large Scale Learning Tasks: Challenges and Algorithms, Springer Lecture Notes in Artificial Intelligence, vol. 9580, pp. 139–147 (2016)

Weihs, C., Ligges, U.: From local to global analysis of music time series. In: Morik, K., Siebes, A., Boulicault, J.F. (eds.) Detecting Local Patterns, Springer Lecture Notes in Artificial Intelligence, vol. 3539, pp. 233–245 (2005)

Weihs, C., Messaoud, A., Raabe, N.: Control charts based on models derived from differential equations. Qual. Reliab. Eng. Int. 26 (8), 807–816 (2010)

Wieczorek, J., Malik-Sheriff, R.S., Fermin, Y., Grecco, H.E., Zamir, E., Ickstadt, K.: Uncovering distinct protein-network topologies in heterogeneous cell populations. BMC Syst. Biol. 9 (1), 24 (2015)

Wu, J.: Statistics = data science? http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf (1997)

Acknowledgements

The authors would like to thank the editor, the guest editors and all reviewers for valuable comments on an earlier version of the manuscript. They also thank Leo Geppert for fruitful discussions.

Author information

Authors and Affiliations

Computational Statistics, TU Dortmund University, 44221, Dortmund, Germany

Claus Weihs

Mathematical Statistics and Biometric Applications, TU Dortmund University, 44221, Dortmund, Germany

Katja Ickstadt

Corresponding author

Correspondence to Claus Weihs .

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License ( http://creativecommons.org/licenses/by/4.0/ ), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Weihs, C., Ickstadt, K. Data Science: the impact of statistics. Int J Data Sci Anal 6 , 189–194 (2018). https://doi.org/10.1007/s41060-018-0102-5

Received : 20 March 2017

Accepted : 25 January 2018

Published : 16 February 2018

Issue Date : November 2018

DOI : https://doi.org/10.1007/s41060-018-0102-5


Keywords

  • Structures of data science
  • Impact of statistics on data science
  • Fallacies in data science

Biostatistics articles from across Nature Portfolio

Biostatistics is the application of statistical methods in studies in biology, and encompasses the design of experiments, the collection of data from them, and the analysis and interpretation of data. The data come from a wide range of sources, including genomic studies, experiments with cells and organisms, and clinical trials.


MIT Sloan Management Review

AI and Statistics: Perfect Together

Many companies develop AI models without a solid foundation on which to base predictions — leading to mistrust and failures. Here’s how statistics can help improve results.


People are often unsure why artificial intelligence and machine learning algorithms work. More importantly, people can’t always anticipate when they won’t work. Ali Rahimi, an AI researcher at Google, received a standing ovation at a 2017 conference when he referred to much of what is done in AI as “ alchemy ,” meaning that developers don’t have solid grounds for predicting which algorithms will work and which won’t, or for choosing one AI architecture over another. To put it succinctly, AI lacks a basis for inference: a solid foundation on which to base predictions and decisions.

This makes AI decisions tough (or impossible) to explain and hurts trust in AI models and technologies — trust that is necessary for AI to reach its potential. As noted by Rahimi, this is an unsolved problem in AI and machine learning that keeps tech and business leaders up at night because it dooms many AI models to fail in deployment.

Fortunately, help for AI teams and projects is available from an unlikely source: classical statistics. This article will explore how business leaders can apply statistical methods and statistics experts to address the problem.

Holdout Data: A Tempting but Flawed Approach

Some AI teams view a trained AI model as the basis for inference, especially when that model predicts well on a holdout set of the original data. It’s tempting to make such an argument, but it’s a stretch. Holdout data is nothing more than a sample of the data collected at the same time, and under the same circumstances, as the training data. Thus, a trained AI model, in and of itself, does not provide a trusted basis for inference for predictions on future data observed under different circumstances.
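
To make this concrete, here is a hedged sketch (not from the article) in which a model looks excellent on a random holdout split of the training-time data but degrades on later data collected under shifted circumstances; the simulated drift is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Training-time data: two features, a simple linear decision rule.
X = rng.normal(size=(2000, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

X_train, X_hold, y_train, y_hold = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("holdout accuracy:", round(model.score(X_hold, y_hold), 3))

# "Future" data collected under different circumstances: inputs and rule have shifted.
X_future = rng.normal(loc=1.5, size=(2000, 2))
y_future = (X_future[:, 0] - X_future[:, 1] > 0).astype(int)
print("accuracy on shifted future data:", round(model.score(X_future, y_future), 3))
```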

What’s worse, many teams working on AI models fail to clearly define the business problem to be solved . This means that the team members are hard-pressed to tell business leaders whether the training data is the right data . Any one of these three issues (bad foundation, wrong problem, or wrong data) can prove disastrous in deployment — and statistics experts on AI teams can help prevent them.

Many IT leaders and data scientists feel that statistics is an old technology that is no longer needed in a big data and AI era.

About the Authors

Thomas C. Redman is president of Data Quality Solutions and author of People and Data: Uniting to Transform Your Organization (KoganPage, 2023). Roger W. Hoerl is the Brate-Peschel Professor of Statistics at Union College in Schenectady, New York, and coauthor with Ronald D. Snee of Leading Holistic Improvement With Lean Six Sigma 2.0 , 2nd ed. (Pearson FT Press, 2018).



What the data says about crime in the U.S.

A growing share of Americans say reducing crime should be a top priority for the president and Congress to address this year. Around six-in-ten U.S. adults (58%) hold that view today, up from 47% at the beginning of Joe Biden’s presidency in 2021.

We conducted this analysis to learn more about U.S. crime patterns and how those patterns have changed over time.

The analysis relies on statistics published by the FBI, which we accessed through the Crime Data Explorer, and the Bureau of Justice Statistics (BJS), which we accessed through the National Crime Victimization Survey data analysis tool.

To measure public attitudes about crime in the U.S., we relied on survey data from Pew Research Center and Gallup.

Additional details about each data source, including survey methodologies, are available by following the links in the text of this analysis.

A line chart showing that, since 2021, concerns about crime have grown among both Republicans and Democrats.

With the issue likely to come up in this year’s presidential election, here’s what we know about crime in the United States, based on the latest available data from the federal government and other sources.

How much crime is there in the U.S.?

It’s difficult to say for certain. The  two primary sources of government crime statistics  – the Federal Bureau of Investigation (FBI) and the Bureau of Justice Statistics (BJS) – paint an incomplete picture.

The FBI publishes  annual data  on crimes that have been reported to law enforcement, but not crimes that haven’t been reported. Historically, the FBI has also only published statistics about a handful of specific violent and property crimes, but not many other types of crime, such as drug crime. And while the FBI’s data is based on information from thousands of federal, state, county, city and other police departments, not all law enforcement agencies participate every year. In 2022, the most recent full year with available statistics, the FBI received data from 83% of participating agencies .

BJS, for its part, tracks crime by fielding a  large annual survey of Americans ages 12 and older and asking them whether they were the victim of certain types of crime in the past six months. One advantage of this approach is that it captures both reported and unreported crimes. But the BJS survey has limitations of its own. Like the FBI, it focuses mainly on a handful of violent and property crimes. And since the BJS data is based on after-the-fact interviews with crime victims, it cannot provide information about one especially high-profile type of offense: murder.

All those caveats aside, looking at the FBI and BJS statistics side-by-side  does  give researchers a good picture of U.S. violent and property crime rates and how they have changed over time. In addition, the FBI is transitioning to a new data collection system – known as the National Incident-Based Reporting System – that eventually will provide national information on a much larger set of crimes , as well as details such as the time and place they occur and the types of weapons involved, if applicable.

Which kinds of crime are most and least common?

A bar chart showing that theft is most common property crime, and assault is most common violent crime.

Property crime in the U.S. is much more common than violent crime. In 2022, the FBI reported a total of 1,954.4 property crimes per 100,000 people, compared with 380.7 violent crimes per 100,000 people.  

By far the most common form of property crime in 2022 was larceny/theft, followed by motor vehicle theft and burglary. Among violent crimes, aggravated assault was the most common offense, followed by robbery, rape, and murder/nonnegligent manslaughter.

BJS tracks a slightly different set of offenses from the FBI, but it finds the same overall patterns, with theft the most common form of property crime in 2022 and assault the most common form of violent crime.

How have crime rates in the U.S. changed over time?

Both the FBI and BJS data show dramatic declines in U.S. violent and property crime rates since the early 1990s, when crime spiked across much of the nation.

Using the FBI data, the violent crime rate fell 49% between 1993 and 2022, with large decreases in the rates of robbery (-74%), aggravated assault (-39%) and murder/nonnegligent manslaughter (-34%). It’s not possible to calculate the change in the rape rate during this period because the FBI  revised its definition of the offense in 2013 .

Line charts showing that U.S. violent and property crime rates have plunged since 1990s, regardless of data source.

The FBI data also shows a 59% reduction in the U.S. property crime rate between 1993 and 2022, with big declines in the rates of burglary (-75%), larceny/theft (-54%) and motor vehicle theft (-53%).

Using the BJS statistics, the declines in the violent and property crime rates are even steeper than those captured in the FBI data. Per BJS, the U.S. violent and property crime rates each fell 71% between 1993 and 2022.

While crime rates have fallen sharply over the long term, the decline hasn’t always been steady. There have been notable increases in certain kinds of crime in some years, including recently.

In 2020, for example, the U.S. murder rate saw its largest single-year increase on record – and by 2022, it remained considerably higher than before the coronavirus pandemic. Preliminary data for 2023, however, suggests that the murder rate fell substantially last year .

How do Americans perceive crime in their country?

Americans tend to believe crime is up, even when official data shows it is down.

In 23 of 27 Gallup surveys conducted since 1993 , at least 60% of U.S. adults have said there is more crime nationally than there was the year before, despite the downward trend in crime rates during most of that period.

A line chart showing that Americans tend to believe crime is up nationally, less so locally.

While perceptions of rising crime at the national level are common, fewer Americans believe crime is up in their own communities. In every Gallup crime survey since the 1990s, Americans have been much less likely to say crime is up in their area than to say the same about crime nationally.

Public attitudes about crime differ widely by Americans’ party affiliation, race and ethnicity, and other factors . For example, Republicans and Republican-leaning independents are much more likely than Democrats and Democratic leaners to say reducing crime should be a top priority for the president and Congress this year (68% vs. 47%), according to a recent Pew Research Center survey.

How does crime in the U.S. differ by demographic characteristics?

Some groups of Americans are more likely than others to be victims of crime. In the  2022 BJS survey , for example, younger people and those with lower incomes were far more likely to report being the victim of a violent crime than older and higher-income people.

There were no major differences in violent crime victimization rates between male and female respondents or between those who identified as White, Black or Hispanic. But the victimization rate among Asian Americans (a category that includes Native Hawaiians and other Pacific Islanders) was substantially lower than among other racial and ethnic groups.

The same BJS survey asks victims about the demographic characteristics of the offenders in the incidents they experienced.

In 2022, those who are male, younger people and those who are Black accounted for considerably larger shares of perceived offenders in violent incidents than their respective shares of the U.S. population. Men, for instance, accounted for 79% of perceived offenders in violent incidents, compared with 49% of the nation’s 12-and-older population that year. Black Americans accounted for 25% of perceived offenders in violent incidents, about twice their share of the 12-and-older population (12%).

As with all surveys, however, there are several potential sources of error, including the possibility that crime victims’ perceptions about offenders are incorrect.

How does crime in the U.S. differ geographically?

There are big geographic differences in violent and property crime rates.

For example, in 2022, there were more than 700 violent crimes per 100,000 residents in New Mexico and Alaska. That compares with fewer than 200 per 100,000 people in Rhode Island, Connecticut, New Hampshire and Maine, according to the FBI.

The FBI notes that various factors might influence an area’s crime rate, including its population density and economic conditions.

What percentage of crimes are reported to police? What percentage are solved?

Line charts showing that fewer than half of crimes in the U.S. are reported, and fewer than half of reported crimes are solved.

Most violent and property crimes in the U.S. are not reported to police, and most of the crimes that  are  reported are not solved.

In its annual survey, BJS asks crime victims whether they reported their crime to police. It found that in 2022, only 41.5% of violent crimes and 31.8% of household property crimes were reported to authorities. BJS notes that there are many reasons why crime might not be reported, including fear of reprisal or of “getting the offender in trouble,” a feeling that police “would not or could not do anything to help,” or a belief that the crime is “a personal issue or too trivial to report.”

Most of the crimes that are reported to police, meanwhile,  are not solved , at least based on an FBI measure known as the clearance rate . That’s the share of cases each year that are closed, or “cleared,” through the arrest, charging and referral of a suspect for prosecution, or due to “exceptional” circumstances such as the death of a suspect or a victim’s refusal to cooperate with a prosecution. In 2022, police nationwide cleared 36.7% of violent crimes that were reported to them and 12.1% of the property crimes that came to their attention.

Which crimes are most likely to be reported to police? Which are most likely to be solved?

Bar charts showing that most vehicle thefts are reported to police, but relatively few result in arrest.

Around eight-in-ten motor vehicle thefts (80.9%) were reported to police in 2022, making them by far the most commonly reported property crime tracked by BJS. Household burglaries and trespassing offenses were reported to police at much lower rates (44.9% and 41.2%, respectively), while personal theft/larceny and other types of theft were only reported around a quarter of the time.

Among violent crimes – excluding homicide, which BJS doesn’t track – robbery was the most likely to be reported to law enforcement in 2022 (64.0%). It was followed by aggravated assault (49.9%), simple assault (36.8%) and rape/sexual assault (21.4%).

The list of crimes  cleared  by police in 2022 looks different from the list of crimes reported. Law enforcement officers were generally much more likely to solve violent crimes than property crimes, according to the FBI.

The most frequently solved violent crime tends to be homicide. Police cleared around half of murders and nonnegligent manslaughters (52.3%) in 2022. The clearance rates were lower for aggravated assault (41.4%), rape (26.1%) and robbery (23.2%).

When it comes to property crime, law enforcement agencies cleared 13.0% of burglaries, 12.4% of larcenies/thefts and 9.3% of motor vehicle thefts in 2022.

Are police solving more or fewer crimes than they used to?

Nationwide clearance rates for both violent and property crime are at their lowest levels since at least 1993, the FBI data shows.

Police cleared a little over a third (36.7%) of the violent crimes that came to their attention in 2022, down from nearly half (48.1%) as recently as 2013. During the same period, there were decreases for each of the four types of violent crime the FBI tracks:

Line charts showing that police clearance rates for violent crimes have declined in recent years.

  • Police cleared 52.3% of reported murders and nonnegligent homicides in 2022, down from 64.1% in 2013.
  • They cleared 41.4% of aggravated assaults, down from 57.7%.
  • They cleared 26.1% of rapes, down from 40.6%.
  • They cleared 23.2% of robberies, down from 29.4%.

The pattern is less pronounced for property crime. Overall, law enforcement agencies cleared 12.1% of reported property crimes in 2022, down from 19.7% in 2013. The clearance rate for burglary didn’t change much, but it fell for larceny/theft (to 12.4% in 2022 from 22.4% in 2013) and motor vehicle theft (to 9.3% from 14.2%).

Note: This is an update of a post originally published on Nov. 20, 2020.


John Gramlich is an associate director at Pew Research Center




Statistics review 1: Presenting and summarising data

Elise Whitley

1 Lecturer in Medical Statistics, University of Bristol, Bristol, UK

Jonathan Ball

2 Lecturer in Intensive Care Medicine, St George's Hospital Medical School, London, UK

The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care. The first step in any analysis is to describe and summarize the data. As well as becoming familiar with the data, this is also an opportunity to look for unusually high or low values (outliers), to check the assumptions required for statistical tests, and to decide the best way to categorize the data if this is necessary. In addition to tables and graphs, summary values are a convenient way to summarize large amounts of information. This review introduces some of these measures. It describes and gives examples of qualitative data (unordered and ordered) and quantitative data (discrete and continuous); how these types of data can be represented figuratively; the two important features of a quantitative dataset (location and variability); the measures of location (mean, median and mode); the measures of variability (range, interquartile range, standard deviation and variance); common distributions of clinical data; and simple transformations of positively skewed data.

Introduction

Data description is a vital part of any research project and should not be ignored in the rush to start testing hypotheses. There are many reasons for this important process, such as gaining familiarity with the data, looking for unusually high or low values (outliers) and checking the assumptions required for statistical testing. The two most common types of data are qualitative and quantitative (Fig. 1). Qualitative data fall into two categories: unordered qualitative data, such as ventilatory support (none, non-invasive, intermittent positive-pressure ventilation, oscillatory); and ordered qualitative data, such as severity of disease (mild, moderate, severe). Quantitative data are numerical and fall into two categories: discrete quantitative data, such as the number of days spent in intensive care; and continuous quantitative data, such as blood pressure or haemoglobin concentrations. Tables are a useful way of describing both qualitative and grouped quantitative data and there are also many types of graph that provide a convenient summary. Qualitative data are commonly described using bar or pie charts, whereas quantitative data can be represented using histograms or box and whisker plots.

Figure 1. Types of data. ICU = intensive care unit.

Tables and graphs provide a convenient simple picture of a set of data (dataset), but it is often necessary to further summarize quantitative data, for example for hypothesis testing. The two most important elements of a dataset are its location (where on average the data lie) and its variability (the extent to which individual data values deviate from the location). There are several different measures of location and variability that can be calculated, and the choice of which to use depends on individual circumstances.

Measuring location

The mean is the most well known average value. It is calculated by summing all of the values in a dataset and dividing them by the total number of values. The algebraic notation for the mean of a set of n values (X_1, X_2, ..., X_n) is:

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Of all the measures of location, the mean is the most commonly used because it is easily understood and has useful mathematical properties that make it convenient for use in many statistical contexts. It is strongly influenced by extreme values (outliers), however, and is most representative when the data are symmetrically distributed (see below).

The median is the central value when all observations are sorted in order. If there is an odd number of observations then it is simply the middle value; if there is an even number of observations then it is the average of the middle two. The median does not have the beneficial mathematical properties of the mean. However, it is not generally influenced by extreme values (outliers), and as a result it is particularly useful in situations where there are unusually low or high values that would render the mean unrepresentative of the data.

The mode is simply the most commonly occurring value in the data. It is not generally used because it is often not representative of the data, particularly when the dataset is small.
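As a quick illustration of these three measures, the short Python sketch below computes them with the standard library's statistics module. The haemoglobin values used here are purely illustrative and are not the Table 1 data.

```python
# A minimal sketch: the three measures of location for a small,
# illustrative set of haemoglobin values (g/dl).
import statistics

haemoglobin = [8.7, 9.1, 9.1, 9.6, 10.2, 10.8, 12.4]  # illustrative values only

mean = statistics.mean(haemoglobin)      # sum of values / number of values
median = statistics.median(haemoglobin)  # middle value of the sorted data
mode = statistics.mode(haemoglobin)      # most frequently occurring value

print(f"mean={mean:.2f}, median={median}, mode={mode}")
```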

Example of calculating location

To see how these quantities are calculated in practice, consider the data shown in Table 1. These are haemoglobin concentration measurements taken from 48 patients on admission to an intensive care unit, listed here in ascending order.

Table 1. Haemoglobin (g/dl) from 48 intensive care patients.

The first step in exploring these data is to construct a histogram to illustrate the shape of the distribution. Rather than plot the frequency of each value separately (e.g. one patient with haemoglobin 5.4 g/dl, two patients with haemoglobin 6.4 g/dl, one patient with haemoglobin 7.0 g/dl, and so on), continuous data are generally grouped or categorized before plotting (e.g. one patient with haemoglobin between 5.0 and 5.9 g/dl, two patients with haemoglobin between 6.0 and 6.9 g/dl, four patients with haemoglobin between 7.0 and 7.9 g/dl, and so on). These categories can be defined in any way and need not necessarily be of the same width, although it is generally more convenient to have equally sized groups. However, the categories must be exhaustive (the categories must cover the full range of values in the dataset) and exclusive (there should be no overlap between categories). Therefore, if one category ends with 6.9 g/dl then the next must begin with 7.0 g/dl rather than 6.9 g/dl. Fig. 2 shows the data in Table 1 grouped into 1 g/dl categories (5.0–5.9, 6.0–6.9, ..., 14.0–14.9 g/dl).

Figure 2. Histogram of admission haemoglobin measurements from 48 intensive care patients.
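For readers who script this grouping step, the following minimal sketch (using illustrative values, not the full Table 1 dataset) shows one way to assign continuous haemoglobin measurements to exhaustive, non-overlapping 1 g/dl categories before plotting a histogram.

```python
# A rough sketch of the grouping step described above: continuous haemoglobin
# values are assigned to 1 g/dl categories (5.0-5.9, 6.0-6.9, ...) by taking
# the integer part of each value. Values are illustrative only.
import math
from collections import Counter

values = [5.4, 6.4, 6.4, 7.0, 7.3, 8.1, 9.2, 9.9, 10.5, 14.1]  # illustrative

bins = Counter(math.floor(v) for v in values)  # 5 -> 5.0-5.9, 6 -> 6.0-6.9, ...

for lower in sorted(bins):
    print(f"{lower:.1f}-{lower + 0.9:.1f} g/dl: {bins[lower]} patient(s)")
```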

Fig. 2 shows that the data are roughly symmetrically distributed; more common values are clustered around a peak in the middle of the distribution, with decreasing numbers of smaller and larger values on either side. The mean, median and mode of these data are shown in Table 2.

Table 2. Mean, median and mode of haemoglobin measurements from the 48 intensive care patients listed in Table 1.

Notice that the mean and the median are similar. This is because the data are approximately symmetrical. In general, the mean, median and mode will be similar in a dataset that has a symmetrical distribution with a single peak, such as that shown in Fig. 2. However, the dataset presented here is rather small and so the mode is not such a good measure of location.

Measuring variability

As with location, there are a number of different measures of variability. The simplest of these is probably the range, which is the difference between the largest and smallest observation in the dataset. The disadvantage of this measure is that it is based on only two of the observations and may not be representative of the whole dataset, particularly if there are outliers. In addition, it gives no information regarding how the data are distributed between the two extremes.

Interquartile range

An alternative to the range is the interquartile range. Quartiles are calculated in a similar way to the median; the median splits a dataset into two equally sized groups, tertiles split the data into three (approximately) equally sized groups, quartiles into four, quintiles into five, and so on. The interquartile range is the range between the bottom and top quartiles, and indicates where the middle 50% of the data lie. Like the median, the interquartile range is not influenced by unusually high or low values and may be particularly useful when data are not symmetrically distributed. Ranges based on alternative subdivisions of the data can also be calculated; for example, if the data are split into deciles, 80% of the data will lie between the bottom and top deciles and so on.
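A minimal sketch of this calculation is given below, again using illustrative values; note that several conventions exist for estimating quartiles, so different software packages may return slightly different cut-points.

```python
# A minimal sketch of the interquartile range on illustrative data.
import statistics

values = sorted([5.4, 7.3, 8.7, 9.0, 9.4, 9.9, 10.2, 10.8, 11.5, 14.1])

q1, q2, q3 = statistics.quantiles(values, n=4)  # the three quartile cut-points
iqr = q3 - q1                                   # width of the middle 50% of the data

print(f"median={q2:.2f}, interquartile range: {q1:.2f} to {q3:.2f} (width {iqr:.2f})")
```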

Standard deviation

The standard deviation is a measure of the degree to which individual observations in a dataset deviate from the mean value. Broadly, it is the average deviation from the mean across all observations. It is calculated by squaring the difference of each individual observation from the mean (squared to remove any negative differences), adding them together, dividing by the total number of observations minus 1, and taking the square root of the result.

Algebraically, the standard deviation for a set of $n$ values $(X_1, X_2, \ldots, X_n)$ is written as follows:

$$SD = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

where $\bar{X}$ is the mean of the $n$ values.

Another measure of variability that may be encountered is the variance. This is simply the square of the standard deviation:

$$\mathrm{Variance} = SD^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

The variance is not generally used in data description but is central to analysis of variance (covered in a subsequent review in this series).
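The sketch below mirrors the formula given above (squared deviations from the mean, summed, divided by n - 1, square-rooted), using illustrative values, and checks the result against Python's statistics module.

```python
# A minimal sketch of the sample standard deviation and variance.
# The explicit loop mirrors the formula; statistics.stdev/variance agree.
import math
import statistics

values = [8.7, 9.1, 9.6, 10.2, 10.8, 12.4]  # illustrative values only
n = len(values)
mean = sum(values) / n

sum_sq_dev = sum((x - mean) ** 2 for x in values)  # squared deviations from the mean
variance = sum_sq_dev / (n - 1)
sd = math.sqrt(variance)

assert math.isclose(sd, statistics.stdev(values))
assert math.isclose(variance, statistics.variance(values))
print(f"sd={sd:.2f}, variance={variance:.2f}")
```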

Example of calculating variability

Table 3 shows the calculation of the range, interquartile range and standard deviation of the data shown in Table 1. The range, from 5.4 to 14.1 g/dl, indicates the full extent of the data, but does not give any information regarding how the remaining observations are distributed between these extremes. For example, it may be that the lower value of 5.4 g/dl is an outlier and the remainder of the observations are all over 10.0 g/dl, or that most values lie at the lower end of the range with substantially fewer at the other extreme. It is impossible to tell this from the range alone.

Table 3. Range, interquartile range and standard deviation of haemoglobin measurements from the 48 intensive care patients listed in Table 1.

The interquartile range (which contains the central 50% of the data) gives a better indication of the general shape of the distribution, and indicates that 50% of all observations fall in a rather narrower range (from 8.7 to 10.8 g/dl). In addition, the median and mean both fall approximately in the centre of the interquartile range, which suggests that the distribution is reasonably symmetrical.

The standard deviation in isolation does not provide a great deal of information, although it is sometimes expressed as a percentage of the mean, known as the coefficient of variation. However, it is often used to calculate another extremely useful quantity known as the reference range; this will be covered in more detail in the next article.
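As a small illustration of the coefficient of variation mentioned above (the standard deviation expressed as a percentage of the mean), again on illustrative values:

```python
# A small sketch of the coefficient of variation: SD as a percentage of the mean.
import statistics

values = [8.7, 9.1, 9.6, 10.2, 10.8, 12.4]  # illustrative values only
cv = statistics.stdev(values) / statistics.mean(values) * 100

print(f"coefficient of variation = {cv:.1f}%")
```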

Common distributions and simple transformations

Quantitative clinical data follow a wide variety of distributions, but the majority are unimodal, meaning that the data have a single (modal) peak with a tail on either side. The most common of these unimodal distributions are symmetrical, as shown in Fig. 2, with a peak in the centre of the data and evenly balanced tails on the right and left.

However, not all unimodal distributions are symmetrical; some are skewed, with a substantially longer tail on one side. The type of skew is determined by which tail is longer. A positively skewed distribution has a longer tail on the right; in other words, the majority of values are relatively low with a smaller number of extreme high values. Fig. 3 shows the admission serum urea levels of 100 intensive care patients. The majority have a serum urea level below 20 mmol/l, with a peak between 4.0 and 7.9 mmol/l. However, an important minority of patients have levels above 20 mmol/l and some have levels as high as 60 mmol/l.

Figure 3. Histogram of admission serum urea levels from 100 intensive care patients. A = mean; B = median; C = geometric mean.

The mean of these data is 12.25 mmol/l (A) and the median is 9 mmol/l (B), as indicated in Fig. 3. In a positively skewed distribution the median will always be smaller than the mean because the mean is strongly influenced by the extreme values in the right-hand tail, and may therefore be less representative of the data as a whole. However, it is possible to transform data of this type in order to obtain a more representative mean value. This type of transformation is also useful when statistical tests require data to be more symmetrically distributed (see subsequent reviews in this series for details). There is a wide range of transformations that can be used in this context [2], but the most commonly used with positively skewed data is the logarithmic transformation.

In a logarithmic transformation, every value in the dataset is replaced by its logarithm. Logarithms are defined to a base, the most common being base e (natural logarithms) or base 10. The end result of a logarithmic transformation is independent of the base chosen, although the same base must be used throughout. As an example, consider the data shown in Fig. 3. Although the majority of values are below 20, there is also a substantial number of values above this. Table 4 shows a sample of the raw numbers along with their logarithmically transformed values (to base e).

Table 4. Raw and logarithmically transformed serum urea levels.

Notice that the differences between the raw values are always the same (1), whereas the differences in the transformed values are larger at the lower end of the scale (0.18 and 0.16) than at the upper end (0.02 and 0.01). The logarithmic transformation stretches out the lower end and compresses the upper end of a distribution, with the result that positively skewed data will tend to become more symmetrical in shape. The transformed data from Fig. 3 are shown in Fig. 4, in which it can be seen that there is a single peak at around 2.4 with similar tails to the right and left.
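The compression effect described above can be illustrated with a short sketch: equal gaps of 1 on the raw scale correspond to much smaller gaps on the log scale at the upper end of the distribution than at the lower end. The raw values below are illustrative and are not those of Table 4.

```python
# A brief sketch of how the natural log transformation stretches the lower end
# of a scale and compresses the upper end, pulling in a long right-hand tail.
import math

raw = [5, 6, 7, 40, 41, 42]  # equal gaps of 1 at the low and the high end

logged = [math.log(x) for x in raw]
low_gap = logged[1] - logged[0]     # gap between log(6) and log(5)
high_gap = logged[-1] - logged[-2]  # gap between log(42) and log(41)

print(f"log-scale gap at low end: {low_gap:.2f}, at high end: {high_gap:.2f}")
```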

Figure 4. Logarithmically transformed admission serum urea levels from 100 intensive care patients.

Calculations and statistical tests can now be carried out on the transformed data before converting the results back to the original scale. For example, the mean of the transformed serum urea data is 2.19. To transform this value back to the original scale, the antilog (or exponential in the case of natural, base e logarithms) is applied. This gives a 'geometric mean' of 8.94 mmol/l on the original scale (C in Fig. 3), the term 'geometric' indicating that calculations have been carried out on the logarithmic scale. This is in contrast to the standard (arithmetic) mean value (calculated on the original scale) of 12.25 mmol/l (A in Fig. 3). Looking at Fig. 3, it is clear that the geometric mean is more representative of the data than the arithmetic mean.
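A minimal sketch of this back-transformation step is shown below; the urea values are illustrative only and will not reproduce the 2.19 and 8.94 mmol/l figures quoted above.

```python
# Analyse on the log scale, then take the antilog (exponential, for natural logs)
# of the mean to obtain the geometric mean on the original scale.
import math
import statistics

urea = [3.2, 4.5, 5.1, 6.0, 7.9, 9.0, 12.5, 21.0, 38.0, 60.0]  # mmol/l, illustrative

arithmetic_mean = statistics.mean(urea)
geometric_mean = math.exp(statistics.mean(math.log(x) for x in urea))
# statistics.geometric_mean(urea) should give the same result (Python 3.8+)

print(f"arithmetic mean = {arithmetic_mean:.2f} mmol/l")
print(f"geometric mean  = {geometric_mean:.2f} mmol/l (smaller, less affected by the tail)")
```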

Similarly, a negatively skewed distribution has a longer tail to the left; in other words, the extreme values are at the lower end of the scale. Fig. 5 shows a negatively skewed distribution of admission arterial blood pH from 100 intensive care patients. In this case the mean will be unduly influenced by the extreme low values and the median (which is always greater than the mean in this setting) may be a more representative measure. However, as in the positively skewed case, it is possible to transform this type of data in order to make it more symmetrical, although the function used in this setting is not the logarithm (for more details, see Kirkwood [2]).

Figure 5. Admission arterial blood pH from 100 intensive care patients.

Finally, it is possible that data may arise with more than one (modal) peak. These data can be difficult to manage and it may be the case that neither the mean nor the median is a representative measure. However, such distributions are rare and may well be artefactual. For example, a (bimodal) distribution with two peaks may actually be a combination of two unimodal distributions (such as hormone levels in men and women). Alternatively, a (multimodal) distribution with multiple peaks may be due to digit preference (rounding observations up or down) during data collection, where peaks appear at round numbers, for example peaks in systolic blood pressure at 90, 100, 110 and 120 mmHg, and so on. In such cases appropriate subdivision, categorization, or even re-collection of the data may be required to eliminate the problem.

Competing interests

None declared.

References

1. Altman DG. Practical Statistics for Medical Research. London: Chapman & Hall; 1991.
2. Kirkwood BR. Essentials of Medical Statistics. London: Blackwell Science Ltd; 1988.

E-Books Still No Match for Printed Books

E-books vs. printed books.

Happy World Book Day! While UNESCO's General Conference probably had ink on paper in mind when it first celebrated the event in 1995, some 21st-century book lovers have moved on to enjoying the pastime in electronic form. The following chart compares just how popular e-books are relative to printed books.

According to data from Statista's Market Insights: Media & Advertising, e-book penetration still trails that of printed books in the vast majority of countries around the world. In the United States, for example, 20 percent of the population are estimated to have purchased an e-book last year, compared with 30 percent who bought a printed book. China is the only country of those studied that shows the opposite trend: only 24 percent of people bought a printed book in the 12 months prior to the survey, while around 27 percent bought an e-book in that time frame.

Looking at forecasts for the book market on a worldwide scale, Statista analysts predict that while e-books have grown in popularity, they will not be the final nail in the coffin of printed books but rather a complementary product that should ultimately benefit the publishing industry.

Description

This chart shows the estimated share of the population in selected countries that purchased an e-book / a printed book in 2023.

COMMENTS

  1. Home

    Overview. Statistical Papers is a forum for presentation and critical assessment of statistical methods encouraging the discussion of methodological foundations and potential applications. The Journal stresses statistical methods that have broad applications, giving special attention to those relevant to the economic and social sciences.

  2. Statistics

    Improved data quality and statistical power of trial-level event-related potentials with Bayesian random-shift Gaussian processes. Dustin Pluta. , Beniamino Hadj-Amar. & Marina Vannucci. Article ...

  3. Research Papers / Publications

    Research Papers / Publications. Xinmeng Huang, Shuo Li, Mengxin Yu, Matteo Sesia, Seyed Hamed Hassani, Insup Lee, Osbert Bastani, Edgar Dobriban, Uncertainty in Language Models: Assessment through Rank-Calibration. Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas ...

  4. Introduction to Research Statistical Analysis: An Overview of the

    Introduction. Statistical analysis is necessary for any research project seeking to make quantitative conclusions. The following is a primer for research-based statistical analysis. It is intended to be a high-level overview of appropriate statistical testing, while not diving too deep into any specific methodology.

  5. Statistics

    A paper in Physical Review X presents a method for numerically generating data sequences that are as likely to be observed under a power law as a given observed dataset. Zoe Budrikis Research ...

  6. The Beginner's Guide to Statistical Analysis

    This article is a practical introduction to statistical analysis for students and researchers. We'll walk you through the steps using two research examples. The first investigates a potential cause-and-effect relationship, while the second investigates a potential correlation between variables. Example: Causal research question.

  7. Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods to find structure in and to give deeper insight into data, and the most important discipline to analyze and quantify uncertainty. We give an overview over different proposed structures of Data Science and address the impact of statistics on such steps as data ...

  8. Research in Statistics

    Taylor & Francis are currently supporting a 100% APC discount for all authors. Research in Statistics is a broad open access journal publishing original research in all areas of statistics and probability.The journal focuses on broadening existing research fields, and in facilitating international collaboration, and is devoted to the international advancement of the theory and application of ...

  9. (PDF) Data Science: the impact of statistics

    In this paper, we substantiate our premise that statistics is one of the most important disciplines to provide tools and methods. to find structure in and to give deeper insight into data, and ...

  10. Journal of Applied Statistics

    The Journal publishes original research papers, review articles, and short application notes. In general, ... The Journal of Applied Statistics Best Paper Prize is awarded annually, as decided by the Editor-in-Chief with the support of the Associate Editors. The winning article receives a £500 prize, and their paper will be made free to view ...

  11. Journals

    Journal of Educational and Behavioral Statistics. Co-sponsored by the ASA and American Educational Research Association, JEBS includes papers that present new methods of analysis, critical reviews of current practice, tutorial presentations of less well-known methods, and novel applications of already-known methods.

  12. Descriptive Statistics for Summarising Data

    Using the data from these three rows, we can draw the following descriptive picture. Mentabil scores spanned a range of 50 (from a minimum score of 85 to a maximum score of 135). Speed scores had a range of 16.05 s (from 1.05 s - the fastest quality decision to 17.10 - the slowest quality decision).

  13. Journal of Probability and Statistics

    13 Nov 2023. 03 Nov 2023. 07 Oct 2023. 30 Sep 2023. 31 Aug 2023. Journal of Probability and Statistics publishes papers on the theory and application of probability and statistics that consider new methods and approaches to their implementation, or report significant results for the field.

  14. Basic statistical tools in research and data analysis

    Bad statistics may lead to bad research, and bad research may lead to unethical practice. Hence, an adequate knowledge of statistics and the appropriate use of statistical tests are important. An appropriate knowledge about the basic statistical methods will go a long way in improving the research designs and producing quality medical research ...

  15. (PDF) The most-cited statistical papers

    Only a few of the most influential papers on the field of statistics are included on our list. through papers in statistics'. Four of our most cited papers, Duncan (1955), Kramer. (1956), and ...

  16. Journal of Applied Statistics: Vol 51, No 6 (Current issue)

    Kanae Takahashi et al. Article | Published online: 4 Apr 2024. Matching a discrete distribution by Poisson matching quantiles estimation. Hyungjun Lim et al. Article | Published online: 4 Apr 2024. Explore the current issue of Journal of Applied Statistics, Volume 51, Issue 6, 2024.

  17. Statistical Research Papers by Topic

    The Statistical Research Report Series (RR) covers research in statistical methodology and estimation. Page Last Revised - October 8, 2021. View Statistical Research reports by their topics.

  18. (PDF) Use of Statistics in Research

    The function of statistics in research is to purpose as a tool in conniving research, analyzing its data and portrayal of conclusions. there from. Most research studies result in a extensive ...

  19. Biostatistics

    Biostatistics is the application of statistical methods in studies in biology, and encompasses the design of experiments, the collection of data from them, and the analysis and interpretation of ...

  20. Inferential Statistics

    Inferential Statistics | An Easy Introduction & Examples. Published on September 4, 2020 by Pritha Bhandari.Revised on June 22, 2023. While descriptive statistics summarize the characteristics of a data set, inferential statistics help you come to conclusions and make predictions based on your data. When you have collected data from a sample, you can use inferential statistics to understand ...

  21. Reporting Statistics in APA Style

    Reporting Statistics in APA Style | Guidelines & Examples. Published on April 1, 2021 by Pritha Bhandari.Revised on January 17, 2024. The APA Publication Manual is commonly used for reporting research results in the social and natural sciences. This article walks you through APA Style standards for reporting statistics in academic writing.

  22. Statistics: Vol 58, No 1 (Current issue)

    New continuous bivariate distributions generated from shock models. Hyunju Lee et al. Article | Published online: 17 Apr 2024. View all latest articles. Explore the current issue of Statistics, Volume 58, Issue 1, 2024.

  23. AI and Statistics: Perfect Together

    Carolyn Geason-Beissel/MIT SMR | Getty Images. People are often unsure why artificial intelligence and machine learning algorithms work. More importantly, people can't always anticipate when they won't work. Ali Rahimi, an AI researcher at Google, received a standing ovation at a 2017 conference when he referred to much of what is done in AI as "alchemy," meaning that developers don ...

  24. Crime in the U.S.: Key questions answered

    The analysis relies on statistics published by the FBI, which we accessed through the Crime Data Explorer, and the Bureau of Justice Statistics (BJS), which we accessed through the National Crime Victimization Survey data analysis tool. To measure public attitudes about crime in the U.S., we relied on survey data from Pew Research Center and ...

  25. Statistics review 1: Presenting and summarising data

    The present review is the first in an ongoing guide to medical statistics, using specific examples from intensive care. The first step in any analysis is to describe and summarize the data. As well as becoming familiar with the data, this is also an opportunity to look for unusually high or low values (outliers), to check the assumptions ...

  26. Instructors as Innovators: A future-focused approach to new AI ...

    This paper explores how instructors can leverage generative AI to create personalized learning experiences for students that transform teaching and learning. We present a range of AI-based exercises that enable novel forms of practice and application including simulations, mentoring, coaching, and co-creation.

  27. Chart: E-Books Still No Match for Printed Books

    The Statista "Chart of the Day" currently focuses on two sectors: "Media and Technology", updated daily and featuring the latest statistics from the media, internet, telecommunications and ...