Challenges and Opportunities in Statistics and Data Science: Ten Research Areas
Issue date: Summer 2020.
As a discipline that deals with many aspects of data, statistics is a critical pillar in the rapidly evolving landscape of data science. The increasingly vital role of data, especially big data, in many applications presents the field of statistics with unparalleled challenges and exciting opportunities. Statistics plays a pivotal role in data science by assisting with the use of data and decision making in the face of uncertainty. In this article, we present ten research areas that could make statistics and data science more impactful on science and society. Focusing on these areas will help better transform data into knowledge, actionable insights and deliverables, and promote more collaboration with computer and other quantitative scientists and domain scientists.
Keywords: Biostatistics, Causal Inference, Cloud Computing, Data Visualization, Distributed Inference and Learning, Differential Privacy, Federated Learning, Integrative Analysis, Interpretable Machine Learning, Replicability and Reproducibility, Scalable Statistical Inference, Study Design
1. Introduction
As we enter the digital age, data science has been growing rapidly into a vital field that has revolutionized many disciplines along with our everyday lives. Data science integrates a plethora of disciplines to produce a holistic, thorough, and insightful look into complex data, and helps to effectively sift through muddled masses of information to extract knowledge and deliver actionable insights and innovations. The NSF report Statistics at a Crossroads: Who is for the Challenge? 1 calls for statistics to play a central leadership role in data science, in partnership with other equally important quantitative disciplines such as computer science and informatics. Indeed, as a data-driven discipline, the field of statistics plays a pivotal role in advancing data science, especially by assisting with the keystone of data science, analysis, for use in decision making in the face of uncertainty.
In computer science, the appearance of the term “data science” has been traced back to Peter Naur (1974), who provided a more specific definition than those used today, as “the science of dealing with data once they have been established”. 2 In statistics, the term can be attributed to C.F. Jeff Wu, whose 1997 inaugural lecture for his appointment to the H.C. Carver Professorship at the University of Michigan, entitled “Statistics = Data Science?”, called for shifting the focus of statistics to center on “large/complex data, empirical-physical approach, representation and exploitation of knowledge”. Leo Breiman and William Cleveland (2001) 3, 4 and Donoho (2017) 5 were also instrumental to the modern conception of the field. Today, data science, in the spirit of data + science, has become an interdisciplinary enterprise about data-enabled discoveries and inference with scientific theory and methods, algorithms, and systems. A number of departments of statistics and data science, as well as schools and institutes of data science, have been established in the US and around the world. Data science undergraduate majors and Master’s degree programs are now available in many leading institutions.
The emergence of data science has begun to reshape the research enterprise in the field of statistics and biostatistics. In this article, we present a list of research challenge areas that have piqued the interest of data scientists from a statistical perspective, and vice versa. The selection of these areas takes advantage of the collective wisdom in the NSF report quoted above, but also reflects the personal research experience and views of the authors. Our list, organized in no particular order, is by no means exhaustive, but aims to stimulate discussion. We certainly expect other important research areas to emerge and flourish. We also excuse ourselves for not citing references related to the research discussions; the body of work from which we have drawn inspiration is simply too large for this article.
2. Ten research areas
1). Quantitative Precision-X.
From personalized health to personalized learning, a common research goal is to identify and develop prevention, intervention, and treatment strategies tailored to individuals or subgroups of a population. Identification and validation of such subgroups using high-throughput genetic and genomic data, demographic variables, lifestyles, and other idiosyncratic factors is a challenging task. It calls for statistical and machine learning methods that explore data heterogeneity, borrow information from individuals with similar characteristics, and integrate domain sciences. Subgroup analysis calls for the development of integrated approaches to subgroup identification, confirmation, and quantification of differential treatment effects using different types of data that may come from the same or different sources. Dynamic treatment regimes are increasingly appreciated as adaptive and personalized intervention strategies, but quantification of their uncertainty, as well as their construction in the presence of high-dimensional data, requires more study.
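As a deliberately simplified illustration of exploring treatment-effect heterogeneity, the sketch below clusters subjects on their covariates and compares naive treatment-control differences within each cluster. The simulated data, the variable names, and the choice of k-means are assumptions made here for illustration only, not a method proposed by the authors.

```python
# Minimal sketch: explore treatment-effect heterogeneity by clustering covariates
# and comparing within-cluster outcome differences. Illustrative assumptions only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))                    # covariates (e.g., demographics, biomarkers)
treatment = rng.integers(0, 2, size=n)         # 0/1 treatment indicator
# outcome whose treatment effect depends on the first covariate (heterogeneity)
outcome = X[:, 0] + treatment * (1.0 + X[:, 0]) + rng.normal(size=n)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

for g in range(3):
    mask = labels == g
    effect = outcome[mask & (treatment == 1)].mean() - outcome[mask & (treatment == 0)].mean()
    print(f"subgroup {g}: n={mask.sum():4d}, naive effect estimate={effect:.2f}")
```

A real analysis would replace the naive within-cluster difference with a properly designed confirmation and inference step, as the section emphasizes.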
2). Fair and Interpretable Learning and Decision Making.
Machine learning has established its value in the data-centric world. From business analytics to genomics, machine learning algorithms are increasingly prevalent. Machine learning methods take a variety of forms; some are based on traditional statistical tools as simple as principal component analysis, while others can be ad hoc and are sometimes referred to as black boxes, which raises issues such as implicit bias and interpretability. Algorithmic fairness is now widely recognized as an important concern, as many decisions rely on automatic learning from existing data. One may argue that interpretability is of secondary importance if prediction is the primary interest. However, in many high-stakes cases (e.g., major policy recommendations or treatment choices involving an invasive operation), a good domain understanding is clearly needed to ensure that the results are interpretable and that insights and recommendations are actionable. By promoting fair and interpretable machine learning methods and taking ethics and replicability as important metrics for evaluation, statisticians have much to contribute to data science.
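As one hedged illustration of what an algorithmic-fairness check can look like, the sketch below fits a simple classifier to synthetic data and reports the gap in positive-prediction rates across a protected group (a demographic-parity style audit). The data, the model, and the choice of metric are assumptions for illustration; fairness auditing in practice involves many more criteria.

```python
# Minimal sketch of one common group-fairness check (demographic parity):
# compare positive-prediction rates across a protected attribute.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 5000
group = rng.integers(0, 2, size=n)                       # protected attribute (assumed binary)
X = np.column_stack([rng.normal(size=n), group + rng.normal(size=n)])
y = (X[:, 0] + 0.5 * group + rng.normal(size=n) > 0).astype(int)

clf = LogisticRegression().fit(X, y)
pred = clf.predict(X)

rates = [pred[group == g].mean() for g in (0, 1)]
print(f"positive rate by group: {rates[0]:.2f} vs {rates[1]:.2f}, "
      f"demographic parity gap = {abs(rates[0] - rates[1]):.2f}")
```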
3). Post-selection inference.
Statistical inference is best justified when carefully collected data (and an appropriately chosen model) are used to infer and learn about an intrinsic quantity of interest. Such a quantity (e.g., a well-defined treatment effect) is not data- or model-dependent. In the big data era, however, statistical inference is often made in practice when the model, and sometimes even the quantity of interest, is chosen after the data are explored, leading to post-selection inference. The interpretability of such quantities and the validity of post-selection inference have to be carefully examined. We must ensure that post-selection inference avoids the bias from data snooping, maintains statistical validity without unnecessary efficiency losses, and, moreover, that the conclusions from such inference have a high level of replicability.
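One classical safeguard, shown in the minimal sketch below, is sample splitting: the variable to report on is selected on one half of the data, and inference for it is carried out on the untouched half. This is only one of several remedies in the post-selection inference literature, and the simulated data and the correlation-based selection rule are illustrative assumptions.

```python
# Minimal sketch of sample splitting as a guard against post-selection bias:
# select on the first half of the data, infer on the second half.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 400, 50
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 3] + rng.normal(size=n)          # only variable 3 truly matters

half = n // 2
# Selection step on the first half: pick the covariate most correlated with y.
corrs = [abs(np.corrcoef(X[:half, j], y[:half])[0, 1]) for j in range(p)]
j_hat = int(np.argmax(corrs))

# Inference step on the untouched second half: simple regression t-test.
result = stats.linregress(X[half:, j_hat], y[half:])
print(f"selected variable {j_hat}, slope = {result.slope:.2f}, p-value = {result.pvalue:.3g}")
```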
4). Balance of statistical and computational efficiencies.
When we have limited data, the emphasis on statistical efficiency, making the best use of the available data, has naturally been an important focus of statistics research. We do not think statistical efficiency will become irrelevant in the big data era; often inference is made locally, and the data available for a specific sub-population remain limited. On the other hand, useful statistical modeling and data analysis must take into account constraints on data storage, communication across sites, and the quality of numerical approximations in the computation. An “optimally efficient” statistical approach is far from optimal in practice if it relies, for instance, on optimization of a highly nonconvex and non-smooth objective function. The need to work with streaming data for real-time actions also calls for a balanced approach. This is where statisticians and computer scientists, as well as experts from related domains (e.g., operations research, mathematics, and subject-matter science), can work together to address efficiency in a holistic way.
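To make the streaming-data constraint concrete, the sketch below uses Welford's one-pass update for the mean and variance, which processes each observation once and stores only a few running summaries. It is a toy example of computation under storage constraints, not a method advocated in the article.

```python
# Minimal sketch of one-pass (streaming) estimation: Welford's online update
# for the mean and variance, which never stores the full data stream.
import numpy as np

def welford_stream(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return mean, m2 / (n - 1)                   # sample mean and sample variance

data = np.random.default_rng(3).normal(loc=2.0, scale=3.0, size=100_000)
mean, var = welford_stream(data)
print(f"streaming estimates: mean={mean:.3f}, var={var:.3f}")
```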
5). Cloud-based scalable and distributed statistical inference.
It is of high importance to develop practical scalable statistical inference for the analysis of real-world massive data. This requires multi-faceted strategies. Examples include sparse matrix construction and manipulation, distributed computing and distributed statistical inference and learning, and cloud-based analytic methods. A range of statistical methods with attractive theoretical properties have been developed for the analysis of high-dimensional data. However, many of these methods are not readily scalable in real-world settings for analyzing massive data and making statistical inference at scale. Examples include atmospheric data, astronomical data, large-scale biobanks with whole genome sequencing data, electronic health records, and radiomics. Statistical and computational methods, software, and at-scale modules that are suitable for cloud-based open-source distributed computing frameworks, such as Hadoop and Spark, need to be developed and deployed for analyzing massive data. In addition, there is a rapidly increasing trend of moving towards cloud-based data sharing and analysis using the federated data ecosystem 6, where data may be distributed across many databases and computer systems around the world. Distributed statistical inference will help researchers to virtually connect, integrate, and analyze data through software interfaces and efficient communications that allow seamless and authorized data access from different places.
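A minimal sketch of one common distributed-inference idea, one-shot (divide-and-conquer) averaging, appears below: the same regression is fit on each data shard and the coefficient estimates are averaged. Real deployments on frameworks such as Spark involve communication protocols and variance corrections that are omitted here; the simulated shards are purely illustrative.

```python
# Minimal sketch of divide-and-conquer (one-shot averaging) estimation:
# fit the same model on each shard, then average the coefficient estimates.
import numpy as np

rng = np.random.default_rng(4)
beta_true = np.array([1.0, -2.0, 0.5])

def fit_ols(X, y):
    """Ordinary least squares on one shard."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

shard_estimates = []
for _ in range(10):                              # 10 shards, e.g., 10 worker nodes
    X = rng.normal(size=(5000, 3))
    y = X @ beta_true + rng.normal(size=5000)
    shard_estimates.append(fit_ols(X, y))

beta_avg = np.mean(shard_estimates, axis=0)      # single round of communication
print("averaged estimate:", np.round(beta_avg, 3))
```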
6). Study design and statistical methods for reproducibility and replicability.
Reproducibility and replicability in science are pivotal for improving rigor and transparency in scientific research, especially when dealing with big data. 7 This includes data reproducibility, analysis reproducibility/stability, and result replicability. 8 A rigorous study design and carefully thought-out sampling plans facilitate reproducible and replicable science by considering key factors at the design stage, including incorporation of both a discovery and a replication phase. Common data models, such as that used in the large-scale All of Us research program 9, have become increasingly popular for building federated data ecosystems, especially using the cloud, to assist with data standardization, quality control, harmonization, and data sharing, as well as the development of community standards. Although issues with replicability using statistical significance based on a classical p-value cutoff of, say, 0.05, have been identified and widely debated, there has not been much consensus on what the new norm should be and how to make statistical significance replicable between studies. Limited work has been done on developing formal statistical procedures to investigate whether findings are replicable, especially in the presence of a large number of hypotheses. More such efforts are needed, in collaboration with informaticians and computer scientists.
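As one common building block for screening a large number of hypotheses in a discovery phase, before findings are carried forward to replication, the sketch below applies the Benjamini-Hochberg false discovery rate procedure to simulated p-values. It is not itself a formal replicability procedure, and the p-values are made up for illustration.

```python
# Minimal sketch of Benjamini-Hochberg FDR control for large-scale screening.
import numpy as np

rng = np.random.default_rng(5)
pvals = np.concatenate([rng.uniform(size=950),             # null hypotheses
                        rng.uniform(0, 0.001, size=50)])    # true signals
alpha = 0.05

order = np.argsort(pvals)
m = len(pvals)
thresholds = alpha * np.arange(1, m + 1) / m
passed = pvals[order] <= thresholds
k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0   # largest i with p_(i) <= alpha*i/m
print(f"{k} of {m} hypotheses declared significant at FDR level {alpha}")
```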
7). Causal inference for big data.
Causal inference for conventional observational studies has been well developed within the potential outcome framework using parametric and semiparametric methods, such as maximum likelihood estimation (MLE), propensity score matching, and G-estimation. As big data are often observational in the real world, they have brought many emerging challenges and opportunities for causal inference, such as causal inference for network data, construction of real-time individualized sequences of treatments using mobile technologies, and adjustment for high-dimensional confounders. For example, in infectious disease network data and social network data, such as Facebook data, subjects are connected with each other. As a result, the classical causal inference assumption, the Stable Unit Treatment Value Assumption (SUTVA), which assumes independent subjects, will not hold. Machine-learning based causal inference procedures have emerged in response to such issues, and integration of these procedures into the causal inference framework will be important.
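For readers less familiar with the potential outcome framework mentioned above, the sketch below estimates an average treatment effect by inverse-propensity weighting on simulated data, relying on the standard no-unmeasured-confounding and SUTVA assumptions. The data-generating process and the logistic propensity model are illustrative assumptions, not a prescription from the article.

```python
# Minimal sketch of inverse-propensity weighting (IPW) for the average treatment effect.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(6)
n = 10_000
X = rng.normal(size=(n, 3))                              # observed confounders
p_treat = 1 / (1 + np.exp(-(X[:, 0] - 0.5 * X[:, 1])))   # true treatment probabilities
A = rng.binomial(1, p_treat)                             # treatment assignment
Y = 2.0 * A + X[:, 0] + rng.normal(size=n)               # true effect = 2.0

e_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]   # estimated propensity scores
ate_ipw = np.mean(A * Y / e_hat) - np.mean((1 - A) * Y / (1 - e_hat))
print(f"IPW estimate of the average treatment effect: {ate_ipw:.2f}")
```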
8). Integrative analysis of different types and sources of data.
Massive data often consist of different types of data from the same subjects or from different subjects and sources. For example, the UK Biobank has collected whole genome array and sequencing (soon to be available) data, electronic health records (EHRs), and epidemiological, biomarker, activity monitor, and imaging data from about 500,000 study participants. Furthermore, data from other sources and different subjects (not from the UK Biobank) are also available, including data from genome-wide association studies, genomic data such as ENCODE and GTEx data, and drug target data from Open Targets 10. There is a strong need to develop statistical methods and tools for integrating different types of data from one or multiple sources. Examples of such methods include causal mediation analysis, Mendelian randomization, and transportable statistical methods for data linkage. It is important to emphasize that statistical methods for data integration need to be driven by scientific questions. Blanket-style data integration methods are likely to be less useful. The importance of close and deep collaborations between statisticians and domain researchers cannot be over-emphasized.
9). Statistical analysis of privatized data.
With growing emphasis on privacy, data sanitization methods, such as differential privacy, will remain a challenge for statistical analysis. Census data in particular, which are used frequently in social science, public health, internet research, and many other disciplines, have raised serious questions regarding the adequacy of available theory and methods for ensuring a desired level of privacy and precision. Current differential privacy frameworks are designed to protect data privacy from any kind of inquiry, which is necessary to guard against the most sophisticated hacking while still allowing for valid analysis. Researchers advocating differential privacy need to understand the practical concerns of data utility in determining how to balance privacy and precision, as well as the adoption of federated systems. Indeed, another area of data privacy analytics is federated statistical and machine learning, which allows for analyzing data that cannot leave individual warehouses, such as banks and health care systems. By building a common data model and analysis protocol, statisticians and data scientists can bring the analysis protocol to the data instead of the traditional way of bringing centralized data to the analysis protocol. To protect privacy, data in different sites are analyzed individually using the common analysis protocol, and the updates are then combined through one or multiple rounds of communication. The field of privacy-preserving data science is rapidly evolving. Theoretical and empirical investigations that reflect real-world privacy concerns, as well as approaches to address ethics, social goods, and public policy in computer science and statistics, will be a hallmark of future research in this area.
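To illustrate the "bring the analysis to the data" idea in its simplest form, the sketch below has each site run the same local analysis and share only its estimate and variance, which a coordinator combines by inverse-variance weighting in a single round of communication. The site data, the one-round protocol, and the combination rule are assumptions made for illustration; real federated systems add privacy protections and iteration not shown here.

```python
# Minimal sketch of federated analysis: sites share only summary updates,
# never raw data, and the coordinator combines the updates.
import numpy as np

rng = np.random.default_rng(7)
true_mean = 1.5

def local_update(site_data):
    """Run the common analysis protocol locally; return only summaries."""
    est = site_data.mean()
    var = site_data.var(ddof=1) / len(site_data)
    return est, var

site_sizes = [300, 1200, 800]                    # three hypothetical data warehouses
updates = [local_update(rng.normal(true_mean, 2.0, size=n)) for n in site_sizes]

weights = np.array([1.0 / v for _, v in updates])
combined = sum(w * e for (e, _), w in zip(updates, weights)) / weights.sum()
print(f"federated (inverse-variance weighted) estimate: {combined:.3f}")
```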
10). Emerging data challenges.
Statistical research needs to keep pace with the growing needs for the analysis of new and complex data types, including deliberately generated false data and misinformation. Statisticians have in recent years taken up the challenge of developing new tools and methods for emerging data types, including network data analysis, natural language processing, video, image and object-oriented data analysis, music, and flow detection. Statisticians need to embrace data engineering to address data challenges. Emerging challenges arising from adversarial machine learning argue for the engagement of statisticians too, and this is becoming more important in the age of information and misinformation. In addition, data visualization and statistical inference for data visualization (e.g., addressing the question “Is what we see really there?”) will play increasingly greater roles in data science, especially with massive data in the digital age.
3. Closing remarks
Statistics as an ever-growing discipline has always been rooted in and advanced by real-world problems. Statisticians have played vital roles in the agricultural revolution, the industrial revolution, the big data era, and now in the broad digital age. Statistics cannot live successfully outside data science, and data science is incomplete without statistics. We believe that research in statistics and biostatistics should respond to the major challenges of our time by keeping a disciplinary identity, promoting valuable statistical principles, working with other quantitative scientists and domain scientists, and pushing the boundaries of data-enabled learning and discovery. To do this well, we need substantially more young talent to join the statistical profession, as well as the other disciplines that contribute to data science, especially computer science, informatics, and ethical studies. This in turn calls for earlier and broader statistical and scientific data education on a global scale. We therefore encourage members of our field to lead, collaborate, and communicate in data science research and education with open minds, not only as statisticians, but also as scientists.
- 1. Berger J, He X, Madigan C, Wellner J, and Yu B (2019). Statistics at a Crossroads: Who is for the Challenge? NSF Workshop report. https://www.nsf.gov/mps/dms/documents/Statistics_at_a_Crossroads_Workshop_Report_2019.pdf
- 2. Naur P (1974). Concise Survey of Computer Methods. Studentlitteratur, Lund, Sweden.
- 3. Breiman L (2001). Statistical modeling: The two cultures. Statistical Science 16(3), 199–231.
- 4. Cleveland WS (2001). Data science: An action plan for expanding the technical areas of the field of statistics. International Statistical Review 69(1), 21–26. DOI: 10.1111/j.1751-5823.2001.tb00477.x
- 5. Donoho D (2017). 50 years of data science. Journal of Computational and Graphical Statistics 26(4), 745–766.
- 6. Global Alliance for Genomics and Health (2016). A federated ecosystem for sharing genomic, clinical data. Science 352(6291), 1278–1280.
- 7. National Academies of Sciences, Engineering, and Medicine (2019). Reproducibility and Replicability in Science. National Academies Press.
- 8. Lin X (2019). Reproducibility and replicability in large scale genetic studies. National Academies Press.
- 9. All of Us Research Program Investigators (2019). The “All of Us” research program. New England Journal of Medicine 381(7), 668–676.
- 10. Koscielny G, An P, Carvalho-Silva D, Cham JA, Fumis L, Gasparyan R, Hasan S, Karamanis N, Maguire M, Papa E, and Pierleoni A (2017). Open Targets: a platform for therapeutic target identification and validation. Nucleic Acids Research 45(D1), D985–D994.
Research and Industry
Theory to application.
Departmental research interests are wide-ranging, bridging theory to application. Eighteen of our faculty hold joint positions in other departments such as Biostatistics, Demography, Electrical Engineering and Computer Sciences, Integrative Biology, and Mathematics. Faculty members are frequently involved with research collaborations throughout campus, from small designed studies to large-scale data analysis projects, including ENCODE, the Cancer Genome Atlas Project, and the Cyber Discovery Initiative (CDI). Many faculty are also actively engaged in consulting with outside industries, as well as serving as experts for legal proceedings and policy makers.
Click on one of the primary research areas to learn more about the work we are doing, the faculty involved, and the collaborations we have on and off campus in this area.
Primary Research Areas
Applications in Biology & Medicine
Applications in the Physical & Environmental Sciences
Applications in the Social Sciences
Artificial Intelligence/Machine Learning
Causal Inference & Graphical Models
High Dimensional Data Analysis
Non-parametric Inference
Probability
Statistical Computing
Research Areas of Expertise
Asymptotic Statistics | Bayesian Analysis | Causal Inference | Clinical Trials | Econometrics | Empirical Processes | Functional Data Analysis | Graphical Models | High-dimensional Statistics | Machine Learning | Model Selection | Spatial Analysis or Spatial Statistics | Statistical Genetics | Stochastic Processes | Statistical Optimal Transport in High Dimensions
Asymptotic Statistics
Asymptotic statistics studies the properties of statistical estimators, tests, and procedures as the sample size tends to infinity, and finds approximations that can be used in practical applications when the sample size is finite.
Faculty Researchers : Sumanta Basu, Florentina Bunea, Ahmed El Alaoui, Ziv Goldfeld, Thorsten Joachims, Amy Kuceyeski, Yang Ning, Karthik Sridharan, Y. Samuel Wang, Kilian Weinberger, and Dana Yang.
Bayesian Analysis
Bayesian statistics provides a mathematical data analysis framework for representing uncertainty and incorporating prior knowledge into statistical inference. In Bayesian statistics, probabilities are used to represent the uncertainty in parameters rather than the data itself. This approach allows for the incorporation of prior information and the use of subjective and objective beliefs about the parameters.
Faculty Researchers : Tom Loredo, David Matteson, David Ruppert, and Martin Wells.
Causal Inference
The goal of causal inference is to develop a formal statistical framework for answering causal questions from real-world data. Examples include identification and estimation of causal effects from observational studies, optimal treatment decisions for patients, and causal structure learning in networks.
Faculty Researchers : Yang Ning
Clinical Trials
Clinical trials are research studies designed to test the safety and effectiveness of medical treatments, drugs, or medical devices. Clinical trials aim to determine whether a new intervention is effective and safe. The results of clinical trials provide important information for regulatory agencies, healthcare providers, and patients in making decisions about the use of new medical interventions.
Faculty Researchers : Karla Ballman
Econometrics
Econometrics applies statistical methods to analyze and model economic data. It provides a way to test economic theories and make predictions about economic events. Econometric research extends methods from regression, time series, panel data, and multivariate analysis.
Faculty Researchers : Sumanta Basu, Kengo Kato, Nicholas Kiefer, David Matteson, and Francesca Molinari.
Empirical Processes
Empirical process theory provides a rigorous mathematical basis for central limit theorems, large deviation theory, weak convergence, and convergence rates. Empirical process theory is widely used in many areas of statistics and has applications in fields such as machine learning, probability theory, and mathematical finance.
Faculty Researchers : Kengo Kato and Marten Wegkamp.
Functional Data Analysis
Functional data analysis is a branch of statistics that analyzes data providing information about curves, surfaces, or anything else varying over a continuum. In its most general form, under an FDA framework, each sample element of functional data is considered to be a random function (Wikipedia). Many of the tools of FDA were developed by generalizing multivariate statistical analysis from finite- to infinite-dimensional spaces using mathematical theory taken from functional analysis and operator theory.
Faculty Researchers : James Booth, David Matteson, and David Ruppert.
Graphical Models
A graphical model, or probabilistic graphical model, is a statistical model that can be represented by a graph: the vertices correspond to random variables, and edges between vertices indicate a conditional dependence relationship. Oftentimes, the problem of interest is estimating the graph structure from data. When using directed edges, graphical models can be used to express causal relationships. The area draws contributions from a variety of disciplines including statistics, computer science, mathematics, philosophy, and biology.
Faculty Researchers : Ahmed El Alaoui, Yang Ning, Felix Thoemmes, Y. Samuel Wang, Martin Wells, and Dana Yang.
High-dimensional Statistics
In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis (Wikipedia).
Faculty Researchers : Marten Wegkamp
Machine Learning
Machine learning is a field of inquiry devoted to understanding and building methods that 'learn,' that is, methods that leverage data to improve performance on some set of tasks. It is at the intersection of Statistics and Computer Science, and is seen as a part of Artificial Intelligence.
Faculty Researchers : Sumanta Basu, Florentina Bunea, Ahmed El Alaoui, Ziv Goldfeld, Thorsten Joachims, Kengo Kato, Amy Kuceyeski, Yang Ning, Karthik Sridharan, Y. Samuel Wang, Marten Wegkamp, Kilian Weinberger, and Dana Yang.
Model Selection
Model selection is the task of selecting a statistical model from a set of candidate models, given data. Instances include tuning parameter selection, feature selection in regression and classification, pattern recovery, and nonparametric estimation. Model selection can also be viewed as a particular case of model aggregation.
Faculty Researchers : Sumanta Basu, James Booth, Florentina Bunea, Yang Ning, Marten Wegkamp, and Marty Wells.
Spatial Analysis or Spatial Statistics
Spatial statistics is the study of models, methods, and computational techniques for spatially-referenced data, with the goal of making predictions of and drawing inferences about spatial processes.
Faculty Researchers : Joe Guinness
Statistical Genetics
The field of statistical genetics focuses on the development and application of quantitative methods for drawing inferences from genetic data. Using techniques from statistics, computer science, and bioinformatics, statistical geneticists help gain insight into the genetic basis of phenotypes and diseases.
Faculty Researchers : Sumanta Basu, James Booth, Jason Mezey, and Yang Ning.
Stochastic Processes
A stochastic process is a family of random variables, where each member is associated with an index from an index set. The type of variable is general, but common specifications are scalar, vector, matrix, or function valued random variables. The index set is also general, but common specifications are the natural numbers or the real numbers, which define discretely indexed and continuously indexed processes, respectively. For the former, the indexed family is considered a sequence, and the index is often associated with a time ordering. For the latter, the index may be associated with continuous time, but it is also commonly associated with continuous space (location) in one, two, or three dimensions. Through such associations, the study and application of stochastic processes is frequently linked to both time series analysis and spatial statistics.
Faculty Researchers : Ahmed El Alaoui, Ziv Goldfeld, Kengo Kato, David Matteson, and Gennady Samorodnitsky.
Statistical Optimal Transport in High Dimensions
Statistical optimal transport is the area of statistics and machine learning devoted to coupling two separate distributions in an optimal way, with applications including nonparametric inference, goodness-of-fit testing, and domain adaptation and data alignment, among many others. The study of its high-dimensional aspects arises from the need to address modern challenges in this area, and involves developing new notions of optimal transport that avoid the curse of dimensionality and the computational burden associated with classical approaches.
Faculty Researchers : Florentina Bunea and Kengo Kato.
Research Areas
Bayesian Statistics
The Bayesian paradigm for statistical inference uses expert knowledge, formulated in terms of probability distributions of unknown parameters of interest. These distributions, called prior distributions, are combined with data to provide new information about parameters, via new parameter distributions called posterior distributions. One research theme centers on devising new Bayesian methodologies, i.e., new statistical models with which Bayesian inferences can provide particular scientific insight. Quantifying the statistical properties of such methods and contrasting with non-Bayesian alternatives is an active area of research. Bayesian methods can lead to computational challenges, and another research theme centers on efficient computation of Bayesian solutions. The development of computational techniques for determining posterior distributions, such as Monte Carlo methods, is a rich area of research activity, with particular emphasis on Markov Chain Monte Carlo methods and sequential Monte Carlo methods.
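As a small worked illustration of the prior-to-posterior update described above, the sketch below combines a conjugate Beta prior with binomial data, so the posterior is available in closed form and no Monte Carlo computation is needed. The prior parameters and the data values are made up for illustration.

```python
# Minimal sketch of a conjugate Bayesian update: Beta prior + binomial data -> Beta posterior.
from scipy import stats

a_prior, b_prior = 2, 2            # prior: Beta(2, 2), mildly centered at 0.5
successes, trials = 37, 100        # observed data (hypothetical)

a_post = a_prior + successes
b_post = b_prior + trials - successes
posterior = stats.beta(a_post, b_post)

lo, hi = posterior.interval(0.95)
print(f"posterior mean = {posterior.mean():.3f}, 95% credible interval = ({lo:.3f}, {hi:.3f})")
```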
- Ben Bloem-Reddy
- Alexandre Bouchard-Côté
- Creagh Briercliffe
- Trevor Campbell
- Naitong Chen
- Kevin Chern
- Anthony Christidis
- Gian Carlo Di-Luvi
- Fanny DUPONT
- Nathaniel Wu Dyrkton
- Paul Gustafson
- Miguel Biron Lattes
- Matteo Lepur
- Xinglong Li
- Tiange (Ivy) Liu
- Yongjin Park
- Hyeongcheol (Tom) Park
- Geoff Pleiss
- Evan Sidrow
- Nikola Surjanovic
- Quanhan (Johnny) Xi
- Zuheng (David) Xu
- Yichen Zhang
Bioinformatics/Genomics/Genetics
Recent advances in -omics technologies have stimulated a large body of biomedical studies focused on the discovery and characterization of molecular mechanisms of various diseases. For example, many studies have focused on the identification of genes to diagnose or predict cancer. The rapid expansion of complex and large -omics datasets has nourished the development of tailored statistical methods to address the challenges that have arisen in the field. Some examples are detection and correction of biases and artifacts in raw high-throughput -omics data, identification of true signal among a large number of variables measured on a much smaller number of subjects, modeling of complex covariance structures, and integration of diverse -omics datasets. Research in this area is characterized by multidisciplinary collaborations among researchers from Statistics, Computer Science, Medical Genetics, Molecular Biology, and other related fields.
- Gabriela V. Cohen Freue
- Keegan Korthauer
- Daniel J. McDonald
- Giuliano Netto Flores Cruz
Biostatistics
Many faculty members work on applying statistical methods to biomedical problems, ranging from analysing gene expression data to public health issues. Much of this work is done in conjunction with local hospitals (such as St Paul's) and research institutes (such as the BC Cancer Agency and the BC Genome Sciences Center). In the fall of 2009, we introduced the biostatistics option to our MSc program, an option that is joint with the School of Population and Public Health .
- Jonathan O.K. Agyeman
- Harlan Campbell
- Harper Xiaolan Cheng
- Sihaoyu Gao
- Wakeel Adekunle Kasali
- John Petkau
- Marc Wettengel
- Xinyuan (Chloe) You
Data Science
Data Science information page.
- Katie Burak
- Andy Man Yeung Tai
- Tiffany Timbers
Environmental and Spatial Statistics
The Department has a long history of research and collaborations in Environmental Statistics and in Spatial Statistics, beginning with Jim Zidek's pioneering work with the United States Environmental Protection Agency. Since that time, faculty have been involved in many research projects, such as the development of statistical techniques for the analysis of air pollution data to study concerns such as public health issues and global climate models. Current work modelling air pollution has resulted in an interactive map for the World Health Organization, developed by an international team of researchers. Other recent research activities involve collaboration with marine mammal biologists to study locations and behavior via continuous-time tracking devices.
- Marie Auger-Méthé
- Tomas Beuzen
- Rowenna Gryba
- Nancy E. Heckman
- Adrian L Jones
- G. Alexi Rodríguez-Arelis
- William J. Welch
- James V. Zidek FRSC, O.C.
Forest Products Stochastic Modeling Group
Since 2009, more than 60 researchers have been a part of this group, studying the properties of wood products, working on projects such as the development of engineering standards, monitoring for changes in product properties over time, subset selection methods for species grouping in the marketing of lumber and the duration of load effect in construction. The group is made up of statisticians from UBC and SFU - faculty, students and staff - and collaborating scientists at FPInnovations Vancouver, funded by Collaborative Research and Development Grants awards under NSERC’s Forest Sector R & D Initiative.
Forest products have a complex variability and, as a biomaterial, are inherently stochastic. Therefore, the group has analyzed forest product data using advanced statistical methods in areas such as survey sampling, survival analysis, nonparametric Bayesian analysis, and the handling of big data. The group has made novel contributions to statistical science that transfer to other domains and has solved long-standing problems in wood science. And, something that is rarely the case, the statisticians have run their own experiments and data collection.
Read more about the Forest Products Stochastic Modeling Group .
- Jiahua Chen
- Shuxian (Trinity) Fan
- Carolyn Taylor
Modern Multivariate and Time Series Analysis
Modern multivariate and time series analyses go beyond the classical normality assumption by modelling data that could combine binary, categorical, extreme and heavy-tailed distributions. Dependence is modeled non-linearly, often in terms of copula functions or stochastic representations. Models for multivariate extremes arise from asymptotic limits. Characterization and modelling of dependence among extremes as well as estimation of probabilities of rare events are topics of on-going research. Advances in high-dimensional multivariate modelling have been achieved by the use of vine pair-copula constructions. Areas of application include biostatistics, psychometrics, genetics, machine learning, econometrics, quantitative risk management in finance and insurance, hydrology and geoscience.
- Daniel Hadley
- Jiaping(Olivia) Liu
- Natalia Nolde
Robust Statistics
Statistical procedures are called robust if they remain informative and efficient in the presence of outliers and other departures from typical model assumptions on the data. Ignoring unusual observations can play havoc with standard statistical methods and can also result in losing the valuable information gained from unusual data points. Robust procedures prevent this. These procedures are more important than ever, since data are now often collected without following established experimental protocols. As a result, data may not represent a single well-defined population. Analyzing these data with non-robust methods may result in biased conclusions. To perform reliable and informative inference based on such a heterogeneous data set, we need statistical methods that can fit models and identify patterns, focusing on the dominant homogeneous subset of the data without being affected by structurally different small subgroups. Robust Statistics does exactly this. Some examples of applications are finding exceptional athletes (e.g., hockey players), detecting intrusion in computer networks, and constructing reliable single nucleotide polymorphism (SNP) genotyping.
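To make the contrast concrete, the toy sketch below compares a non-robust location estimate (the mean) with a robust one (the median) on data contaminated by a few gross outliers. The contamination scheme is invented for illustration; real robust procedures (e.g., M-estimators) go well beyond the median.

```python
# Minimal sketch: a few gross outliers drag the mean but barely move the median.
import numpy as np

rng = np.random.default_rng(8)
clean = rng.normal(loc=10.0, scale=1.0, size=195)
outliers = rng.normal(loc=100.0, scale=5.0, size=5)      # 2.5% contamination
data = np.concatenate([clean, outliers])

print(f"mean   (non-robust): {data.mean():.2f}")          # pulled toward the outliers
print(f"median (robust)    : {np.median(data):.2f}")      # stays near the bulk at 10
```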
- Matías Salibián-Barrera
- Ruben H Zamar
Statistical Learning
Statistical learning, sometimes called machine learning, is becoming ever more important as a component of data science, and department members have had active research in this area for more than a decade. Statistical learning methods include classification and regression (supervised learning) and clustering (unsupervised learning). Current research topics of faculty members and their graduate students include construction of phylogenetic trees in evolution, ensembles of models and sparse clustering. Applications include the search for novel pharmaceutical drugs and detection of biogenic patterns.
- Pramoda Sachinthana Jayasinghe
- Saifuddin Syed
UBC Department of Statistics
Nonparametric Bayesian Statistics
Bayesian nonparametrics provides modeling solutions by replacing the finite-dimensional prior distributions of classical Bayesian analysis with infinite-dimensional stochastic processes.
Causal inference and applications to learning gene regulatory networks
Causal inference: Geometry of conditional independence structures for 3-node directed Gaussian graphical models.
Combinatorial learning with set functions
Learning problems that involve combinatorial objects are ubiquitous. They include the prediction of graphs, assignments, rankings, trees, groups of discrete labels, or preferred sets of a user; the expression of prior structural knowledge for regularization; the identification of sets of important variables; and inference in discrete probabilistic models.
Online Learning
In this line of research, we develop strategies to optimize utility in dynamic environments in an optimal and efficient fashion.
Statistical and Computational Tradeoffs
Computational limitations of statistical problems have largely been ignored or simply overcome by ad hoc relaxation techniques.
MIT Statistics + Data Science Center, Massachusetts Institute of Technology
Research Areas
The world is being transformed by data and data-driven analysis is rapidly becoming an integral part of science and society. Stanford Data Science is a collaborative effort across many departments in all seven schools. We strive to unite existing data science research initiatives and create interdisciplinary collaborations, connecting the data science and related methodologists with disciplines that are being transformed by data science and computation.
Our work supports research in a variety of fields where incredible advances are being made through the facilitation of meaningful collaborations between domain researchers, who have deep expertise in societal and fundamental research challenges, and methods researchers, who are developing next-generation computational tools and techniques, including:
Data Science for Wildland Fire Research
In recent years, wildfire has gone from an infrequent and distant news item to a center-stage issue spanning many consecutive weeks for urban and suburban communities. Frequent wildfires are changing everyday life in California in numerous ways, from public safety power shutoffs to hazardous air quality, that seemed inconceivable as recently as 2015. Moreover, elevated wildfire risk in the western United States (and similar climates globally) is here to stay into the foreseeable future. There is a plethora of problems that need solutions in the wildland fire arena; many of them are well suited to a data-driven approach.
Data Science for Physics
Astrophysicists and particle physicists at Stanford and at the SLAC National Accelerator Laboratory are deeply engaged in studying the Universe at both the largest and smallest scales, with state-of-the-art instrumentation at telescopes and accelerator facilities.
Data Science for Economics
Many of the most pressing questions in empirical economics concern causal questions, such as the impact, both short and long run, of educational choices on labor market outcomes, and of economic policies on distributions of outcomes. This makes them conceptually quite different from the predictive type of questions that many of the recently developed methods in machine learning are primarily designed for.
Data Science for Education
Educational data spans K-12 school and district records, digital archives of instructional materials and gradebooks, as well as student responses on course surveys. Data science of actual classroom interaction is also of increasing interest and reality.
Data Science for Human Health
It is clear that data science will be a driving force in transitioning the world’s healthcare systems from reactive “sick-based” care to proactive, preventive care.
Data Science for Humanity
Our modern era is characterized by massive amounts of data documenting the behaviors of individuals, groups, organizations, cultures, and indeed entire societies. This wealth of data on modern humanity is accompanied by massive digitization of historical data, both textual and numeric, in the form of historic newspapers, literary and linguistic corpora, economic data, censuses, and other government data, gathered and preserved over centuries, and newly digitized, acquired, and provisioned by libraries, scholars, and commercial entities.
Data Science for Linguistics
The impact of data science on linguistics has been profound. All areas of the field depend on having a rich picture of the true range of variation, within dialects, across dialects, and among different languages. The subfield of corpus linguistics is arguably as old as the field itself and, with the advent of computers, gave rise to many core techniques in data science.
Data Science for Nature and Sustainability
Many key sustainability issues translate into decision and optimization problems and could greatly benefit from data-driven decision making tools. In fact, the impact of modern information technology has been highly uneven, mainly benefiting large firms in profitable sectors, with little or no benefit in terms of the environment. Our vision is that data-driven methods can — and should — play a key role in increasing the efficiency and effectiveness of the way we manage and allocate our natural resources.
Ethics and Data Science
With the emergence of new techniques of machine learning, and the possibility of using algorithms to perform tasks previously done by human beings, as well as to generate new knowledge, we again face a set of new ethical questions.
The Science of Data Science
The practice of data analysis has changed enormously. Data science needs to find new inferential paradigms that allow data exploration prior to the formulation of hypotheses.
Department of Statistics and Applied Probability - UC Santa Barbara
The major research areas of our department can be divided into theoretical statistics and statistical methodology, applied statistics, and probability. Our faculty members are actively engaged in interdisciplinary research in such areas as mathematics, computer science, biostatistics, environmental sciences, and financial mathematics and statistics. These dynamic interactions with researchers in other disciplines are both personal, through joint projects supported by the NSF, NIH, and other government agencies, and through the activities of the DataLab situated in the Department.
The Department also hosts the Center for Financial Mathematics and Actuarial Research (CFMAR), which provides national and international leadership in quantitative finance. Its research activities are directed toward the study of financial markets, asset prices, risk management, investment strategies, derivatives pricing and hedging, and systemic risk, among other topics.
Our faculty's applied statistics research spans a wide range of fields including environmental sciences, different biological and biomedical fields, population genetics, and finance.
Research Areas of Special Emphasis
Theoretical Statistics and Statistical Methodology
Bayesian inference & computational methods
Bayesian networks
Resampling techniques
Directional data analysis
Functional data analysis
Data mining
Computational statistics
Nonparametric inference
Asymptotic statistical methods
Linear models and generalized linear models
Smoothing spline methods
Time series and spatial/temporal data models
Mixed effects models
Probability
Stochastic processes
Stochastic control
Sequential detection
Interacting particle systems
Financial Mathematics
Systemic risk in financial markets
Stochastic games
Stochastic portfolio theory
Stochastic volatility modeling
Risk management and Actuarial Applications
Applied Statistics
Environmental statistics
Ecological statistics
Geophysical statistics
Statistical education
Image data analysis
Spatial statistics
Biostatistics
Survival analysis
Clinical trials
Longitudinal data analysis
Bayesian methods for Biosurveillance
Stochastic modeling of biomedical events
Our research informs the direction of inquiry in statistics at Rice and brings the latest discoveries into the classroom.
Research Focus Areas & Applications
Research areas in the Department of Statistics are diverse and multidisciplinary, with application areas that range from finance to the social sciences. Together with our partners in the Engineering School, we are leading the way toward the next big discoveries.
Learn more about our research focus areas and applications .
Graduate Student Research
Research work by graduate students also contributes significantly to the body of knowledge in the field, while many of our undergraduates have significant research opportunities as well. Statistics is a cornerstone of the campus-wide Data Science Initiative .
Biostatistics Programs
Biostatistics is the science of bringing statistical and probabilistic reasoning to bear on the complex problems presented in research areas such as biology, genetics, human health, medical and environmental science. Our partnership with the MD Anderson Cancer Center allows our faculty and students research opportunities to contribute significantly to the improvement of human health throughout the world.
Computational Finance and Economic Systems
Rice's Center for Computational Finance and Economic Systems (CoFES) is dedicated to the quantitative study of financial markets and economic systems and their ultimate impact on society. CoFES represents a cooperative effort between the School of Engineering, the Jones Graduate School of Business, and the School of Social Sciences.
Learn more about CoFES .
Faculty Research Groups
Our research groups may consist of a faculty member, their graduate and undergraduate students, as well as collaborators from outside the department. They are the core of the department's research and education agenda. Individual groups meet regularly throughout the semester. If you are interested in joining one of the groups, contact the faculty member based on their research focus areas and applications .
PhD Degree in Statistics
The Department of Statistics offers an exciting and recently revamped PhD program that involves students in cutting-edge interdisciplinary research in a wide variety of fields. Statistics has become a core component of research in the biological, physical, and social sciences, as well as in traditional computer science domains such as artificial intelligence and machine learning. The massive increase in the data acquired, through scientific measurement on one hand and through web-based collection on the other, makes the development of statistical analysis and prediction methodologies more relevant than ever.
Our graduate program prepares students to address these issues through rigorous training in scientific computation, and in the theory, methodology, and applications of statistics. The course work includes four core sequences:
- Probability (STAT 30400, 38100, 38300)
- Mathematical statistics (STAT 30400, 30100, 30210)
- Applied statistics (STAT 34300, 34700, 34800)
- Computational mathematics and machine learning (STAT 30900, 31015/31020, 37710).
All students must take the Applied Statistics and Theoretical Statistics sequences. In addition, it is highly recommended that students take a third core sequence based on their interests and in consultation with the Department Graduate Advisor (DGA). At the start of their second year, students take two preliminary examinations. All students must take the Applied Statistics prelim. For the second, students can choose to take either the Theoretical Statistics or the Probability prelim. Students planning to take the Probability prelim should take the Probability sequence as their third sequence.
Incoming first-year students have the option of taking any or all of these exams; if an incoming student passes one or more of these, then he/she will be excused from the requirement of taking the first-year courses in that subject. During the second and subsequent years, students can take more advanced courses, and perform research, with world-class faculty in a wide variety of research areas .
In recent years, a large majority of our students have completed the PhD within four or five years of entering the program. Students who have significant graduate training before entering the program can (and do) obtain their doctoral degree in three years.
Most students receiving a doctorate proceed to faculty or postdoctoral appointments in research universities. A substantial number take positions in government or industry, such as in research groups in the government labs, in communications, in commercial pharmaceutical companies, and in banking/financial institutions. The department has an excellent track record in placing new PhDs.
Prerequisites for the Program
A student applying to the PhD program normally should have taken courses in advanced calculus, linear algebra, probability, and statistics. Additional courses in mathematics, especially a course in real analysis, will be helpful. Some facility with computer programming is expected. Students without background in all of these areas, however, should not be discouraged from applying, especially if they have a substantial background, through study or experience, in some area of science or other discipline involving quantitative reasoning and empirical investigation. Statistics is an empirical and interdisciplinary field, and a strong background in some area of potential application of statistics is a considerable asset. Indeed, a student's background in mathematics and in science or another quantitative discipline is more important than his or her background in statistics.
To obtain more information about applying, see the Guide For Applicants .
Students with questions may contact Yali Amit for PhD Studies, Mei Wang for Masters Studies, and Keisha Prowoznik for all other questions, Bahareh Lampert (Dean of Students in the Physical Sciences Division), or Amanda Young (Associate Director, Graduate Student Affairs) in UChicagoGRAD.
Handbook for PhD Students in Statistics
Information for First and Second Year PhD Students in Statistics
Fundamental Research in Statistics
Fundamental research in statistics includes developing new theory to validate statistical procedures, generalizing probability models for random processes, developing new nonparametric methodology for machine learning applications, establishing the asymptotic theory behind new statistical methods, and devising new approaches for experiments that cannot be handled by traditional statistical methods. Areas of interest among our faculty include:
- Nonparametric and semiparametric statistics
- Bayesian methods and machine learning
- Inferential methods for dependent and longitudinal data
- High dimensional data analysis and model selection
- Monte Carlo methods in statistical computing
- Time series and spatial-temporal models
Statistics and Statistical Theory
Researchers at UCI are concerned with developing and studying methods for collecting, analyzing, interpreting and presenting empirical data. Statistical principles and methods are important for addressing questions in public policy, medicine, industry and virtually every branch of science.
Faculty: Wanrong Zhu, Wenzhuo Zhou, David Armstrong, Brigitte Baldi, Pierre Baldi, Veronica Berrocal, Hengrui Cai, Mine Dogucu, Daniel Gillen, Koko Gulesserian, Wesley Johnson, Volodymyr Minin, Tianchen Qian, Babak Shahbaba, Weining Shen, Padhraic Smyth, Erik Sudderth, and Jessica Utts.

Recent News about Statistics and Statistical Theory
- Mandt Appointed Program Chair for AISTATS 2024
- Faculty Spotlight: Veronica Berrocal Speaks to the Impact of Statistics
- UCI Researchers Aim to Diversify Clinical Research Participation with $3.7M NIH Grant
- Professors Berrocal and Shahbaba Named American Statistical Association Fellows
From ANOVA to regression: 10 key statistical analysis methods explained
Last updated: 24 October 2024. Reviewed by Miroslav Damyanov.
Every action we take generates data. When you stream a video, browse a website, or even make a purchase, valuable data is created. However, without statistical analysis, the potential of this information remains untapped.
Understanding how different statistical analysis methods work can help you make the right choice. Each is applicable to a certain situation, data type, and goal.
- What is statistical analysis?
Statistical analysis is the process of collecting, organizing, and interpreting data. The goal is to identify trends and relationships. These insights help analysts forecast outcomes and make strategic business decisions.
This type of analysis can apply to multiple business functions and industries, including the following:
Finance: helps companies assess investment risks and performance
Marketing: enables marketers to identify customer behavior patterns, segment markets, and measure the effectiveness of advertising campaigns
Operations: helps streamline process optimization and reduce waste
Human resources: helps track employee performance trends or analyze turnover rates
Product development: helps with feature prioritization, evaluating A/B test results, and improving product iterations based on user data
Scientific research: supports hypothesis testing, experiment validation, and the identification of significant relations in data
Government: informs public policy decisions, such as understanding population demographics or analyzing inflation
With high-quality statistical analysis, businesses can base their decisions on data-driven insights rather than assumptions. This helps build more effective strategies and ultimately improves the bottom line.
- Importance of statistical analysis
Statistical analysis is an integral part of working with data. Implementing it at different stages of operations or research helps you gain insights that prevent costly errors.
Here are the key benefits of statistical analysis:
Informed decision-making
Statistical analysis allows businesses to base their decisions on solid data rather than assumptions.
By collecting and interpreting data, decision-makers can evaluate the potential outcomes of their strategies before they implement them. This approach reduces risks and increases the chances of success.
Understanding relationships and trends
In many complex environments, the key to insights is understanding relationships between different variables. Statistical methods such as regression or factor analysis help uncover these relationships.
Uncovering correlations through statistical methods can pave the way for breakthroughs in fields like medicine, but the true impact lies in identifying and validating cause-effect relationships. By distinguishing between simple associations and meaningful patterns, statistical analysis helps guide critical decisions, such as developing potentially life-saving treatments.
Predicting future outcomes
Statistical analysis, particularly predictive analysis and time series analysis, provides businesses with tools to forecast events based on historical data.
These forecasts help organizations prepare for future challenges (such as fluctuations in demand, market trends, or operational bottlenecks). Being able to predict outcomes allows for better resource allocation and risk mitigation.
Improving efficiency and reducing waste
Using statistical analysis can lead to improved efficiency in areas where waste occurs. In operations, this can result in streamlining processes.
For example, manufacturers can use causal analysis to identify the factors contributing to defective products and then implement targeted improvements to eliminate the causes.
Enhancing accuracy in research
In scientific research, statistical methods ensure accurate results by validating hypotheses and analyzing experimental data.
Methods such as regression analysis and ANOVA (analysis of variance) allow researchers to draw conclusions from experiments by examining relationships between variables and identifying key factors that influence outcomes.
Without statistical analysis, research findings may not be reliable. This could result in teams drawing incorrect conclusions and forming strategies that cost more than they’re worth.
Validating business assumptions
When businesses make assumptions about customer preferences, market conditions, or operational outcomes, statistical analysis can validate them.
For example, hypothesis testing can provide a framework to either confirm or reject an assumption. With these results at hand, businesses reduce the likelihood of pursuing incorrect strategies and improve their overall performance.
- Types of statistical analysis
The two main types of statistical analysis are descriptive and inferential. However, there are also other types. Here’s a short breakdown:
Descriptive analysis
Descriptive analysis focuses on summarizing and presenting data in a clear and understandable way. You can do this with simple tools like graphs and charts.
This type of statistical analysis helps break down large datasets into smaller, digestible pieces. This is usually done by calculating averages, frequencies, and ranges. The goal is to present the data in an orderly fashion and answer the question, “What happened?”
Businesses can use descriptive analysis to evaluate customer demographics or sales trends. A clear visual breakdown of complex data is often enough for people to draw practical conclusions.
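As a quick illustration, here is a minimal sketch in Python with pandas of the kind of summary a descriptive analysis produces; the sales figures are invented for the example.

```python
import pandas as pd

# Invented sales records, for illustration only.
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "West", "West"],
    "revenue": [1200, 950, 1430, 1100, 800, 1250],
})

# Count, mean, standard deviation, min, quartiles, and max for the numeric column.
print(sales["revenue"].describe())

# Averages and ranges broken out by a categorical variable.
print(sales.groupby("region")["revenue"].agg(["mean", "min", "max"]))
```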
Diagnostic statistics
This analysis is used to determine the cause of a particular outcome or behavior by examining relationships between variables. It answers the question, “Why did this happen?”
This approach often involves identifying anomalies or trends in data to understand underlying issues.
Inferential analysis
Inferential analysis involves drawing conclusions about a larger population based on a sample of data. It helps predict trends and test hypotheses by accounting for uncertainty and potential errors in the data.
For example, a marketing team can arrive at a conclusion about their potential audience’s demographics by analyzing their existing customer base. Another example is vaccine trials, which allow researchers to come to conclusions about side effects based on how the trial group reacts.
Predictive analysis
Predictive analysis uses historical data to forecast future outcomes. It answers the question, “What might happen in the future?”
For example, a business owner can predict future customer behavior by analyzing their past interactions with the company. Meanwhile, marketers can anticipate which products are likely to succeed based on past sales data.
This type of analysis relies on more sophisticated techniques, and even then its outputs are educated guesses rather than error-free conclusions.
Prescriptive analysis
Prescriptive analysis goes beyond predicting outcomes. It suggests actionable steps to achieve desired results.
This type of statistical analysis combines data, algorithms, and business rules to recommend actual strategies. It often uses optimization techniques to suggest the best course of action in a given scenario, answering the question, “What should we do next?”
For example, in supply chain management, prescriptive analysis helps optimize inventory levels by providing specific recommendations based on forecasts. A bank can use this analysis to predict loan defaults based on economic trends and adjust lending policies accordingly.
Exploratory data analysis
Exploratory data analysis (EDA) allows you to investigate datasets to discover patterns or anomalies without predefined hypotheses. This approach can summarize a dataset’s main characteristics, often using visual methods.
EDA is particularly useful for uncovering new insights that weren’t anticipated during initial data collection.
Causal analysis
Causal analysis seeks to identify cause-and-effect relationships between variables. It helps determine why certain events happen, often employing techniques such as experiments or quasi-experimental designs to establish causality.
Understanding the “why” of specific events can help design accurate proactive and reactive strategies.
For example, in marketing, causal analysis can be applied to understand the impact of a new advertising campaign on sales.
Bayesian statistics
This approach incorporates prior knowledge or beliefs into the statistical analysis. It involves updating the probability of a hypothesis as more evidence becomes available.
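For instance, a conjugate Beta-Binomial model updates a prior belief about a conversion rate as new trials arrive. The sketch below uses SciPy; the prior and the counts are invented for illustration.

```python
from scipy import stats

# Prior belief about a conversion rate: Beta(2, 8), i.e. roughly 20% expected.
prior_a, prior_b = 2, 8

# New evidence: 30 conversions observed in 100 trials (invented numbers).
conversions, trials = 30, 100

# Conjugate update: with a Beta prior and Binomial data, the posterior is Beta again.
posterior = stats.beta(prior_a + conversions, prior_b + (trials - conversions))

lo, hi = posterior.interval(0.95)
print(f"Posterior mean: {posterior.mean():.3f}")
print(f"95% credible interval: ({lo:.3f}, {hi:.3f})")
```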
- Statistical analysis methods
Depending on your industry, needs, and budget, you can implement different statistical analysis methods. Here are some of the most common techniques:
1. T-tests
A t-test helps determine if there’s a significant difference between the means of two groups. It works well when you want to compare the average performance of two groups under different conditions.
There are different types of t-tests, including independent (unpaired) and dependent (paired) t-tests.
T-tests are often used in research experiments and quality control processes. For example, they work well in drug testing when one group receives a real drug and another receives a placebo. If the treated group improves more on average, a t-test helps determine whether that difference reflects a real effect or just chance.
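To make this concrete, here is a minimal sketch of an independent-samples t-test with SciPy; the improvement scores are invented for illustration.

```python
from scipy import stats

drug = [8.2, 7.9, 9.1, 8.5, 8.8, 9.4, 7.6, 8.9]      # treatment group scores (invented)
placebo = [7.1, 7.8, 6.9, 7.4, 8.0, 7.2, 6.8, 7.5]   # control group scores (invented)

# Welch's t-test (equal_var=False) avoids assuming the two groups have equal variances.
t_stat, p_value = stats.ttest_ind(drug, placebo, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the difference in means is unlikely to be chance.
```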
2. Chi-square tests
Chi-square tests examine the relationship between categorical variables. They compare observed results with expected results. The goal is to understand if the difference between the two is due to chance or the relationship between the variables.
For instance, a company might use a chi-square test to analyze whether customer preferences for a product differ by region.
It’s particularly useful in market research, where businesses analyze responses to surveys.
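A minimal sketch of a chi-square test of independence with SciPy is shown below; the survey counts by region are invented.

```python
from scipy.stats import chi2_contingency

# Rows: regions; columns: customers preferring product A vs. product B (invented counts).
observed = [
    [90, 60],   # Region 1
    [70, 80],   # Region 2
    [50, 100],  # Region 3
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}, degrees of freedom = {dof}")
# A small p-value suggests product preference is not independent of region.
```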
3. ANOVA
ANOVA, which stands for analysis of variance, compares the means of three or more groups to determine if there are statistically significant differences among them.
Unlike t-tests, which are limited to two groups, ANOVA is ideal when comparing multiple groups at once. Common variants include:
One-way ANOVA: analysis with one independent variable and one dependent variable
Two-way ANOVA: analysis with two independent variables
Multivariate ANOVA (MANOVA): analysis with more than one dependent variable
Businesses often use ANOVA to compare product performance across different markets and evaluate customer satisfaction across various demographics. The method is also common in experimental research, where multiple groups are exposed to different conditions.
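Here is a minimal sketch of a one-way ANOVA with SciPy; the satisfaction scores for three markets are invented.

```python
from scipy.stats import f_oneway

market_a = [7.1, 6.8, 7.5, 7.0, 6.9]   # invented satisfaction scores
market_b = [7.9, 8.2, 7.7, 8.0, 8.4]
market_c = [6.2, 6.5, 6.0, 6.7, 6.3]

f_stat, p_value = f_oneway(market_a, market_b, market_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value indicates at least one market's mean differs; post hoc comparisons
# (e.g. Tukey's HSD) are needed to find out which one.
```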
4. Regression analysis
Regression analysis examines the relationship between one dependent variable and one or more independent variables. It helps businesses and researchers predict outcomes and understand which factors influence results the most.
This method determines a best-fit line and allows the researcher to observe how the data is distributed around this line.
It helps economists with asset valuations and predictions. It can also help marketers determine how variables like advertising affect sales.
A company might use regression analysis to forecast future sales based on marketing spend, product price, and customer demographics.
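The sketch below fits such a model with statsmodels; the spend, price, and sales figures are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

marketing_spend = np.array([10, 15, 12, 20, 25, 18, 30, 28])   # in $1,000s (invented)
product_price = np.array([50, 48, 52, 47, 45, 49, 44, 46])     # in $ (invented)
sales = np.array([200, 260, 220, 330, 400, 300, 470, 440])     # units sold (invented)

# Ordinary least squares with an intercept and two predictors.
X = sm.add_constant(np.column_stack([marketing_spend, product_price]))
model = sm.OLS(sales, X).fit()

print(model.params)     # intercept and coefficients for spend and price
print(model.rsquared)   # share of the variation in sales explained by the model
```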
6. Time series analysis
Time series analysis evaluates data points collected over time to identify trends. An analyst records data points at equal intervals over a certain period rather than at arbitrary times.
This method can help businesses and researchers forecast future outcomes based on historical data. For example, retailers might use time series analysis to plan inventory around holiday shopping trends, while financial institutions rely on it to track stock market trends. An energy company can use it to evaluate consumption trends and streamline the production schedule.
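As a small sketch, the pandas snippet below smooths an invented monthly demand series with a moving average to expose its trend; real forecasting would typically use dedicated models such as ARIMA or exponential smoothing.

```python
import pandas as pd

# Invented monthly demand figures for one year.
demand = pd.Series(
    [120, 130, 150, 170, 160, 180, 210, 230, 220, 250, 300, 360],
    index=pd.date_range("2023-01-01", periods=12, freq="MS"),  # month-start timestamps
)

# A 3-month moving average smooths short-term noise and highlights the underlying trend.
trend = demand.rolling(window=3).mean()
print(trend)
```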
7. Survival analysis
Survival analysis focuses on time-to-event data, such as the time it takes for a machine to break down or for a customer to churn. It looks at a variable with a start time and end time. The time between them is the focus of the analysis.
This method is highly useful in medical research—for example, when studying the time between the beginning of a patient’s cancer remission and relapse. It can help doctors understand which treatments have desired or unexpected effects.
This analysis also has important applications in business. For example, companies use survival analysis to predict customer retention, product lifespan, or time until product failure.
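A minimal sketch of a Kaplan-Meier estimate is shown below. It assumes the third-party lifelines package is installed; the churn durations are invented.

```python
from lifelines import KaplanMeierFitter

# Months each customer was observed, and whether they churned (1) or were still active (0).
durations = [3, 5, 8, 12, 12, 15, 20, 24, 24, 30]
churned = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]

kmf = KaplanMeierFitter()
kmf.fit(durations, event_observed=churned)

print(kmf.median_survival_time_)   # typical time until churn
print(kmf.survival_function_)      # estimated probability of remaining active over time
```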
8. Factor analysis
Factor analysis (FA) reduces large sets of variables into a smaller number of underlying factors. It’s useful when dealing with complex datasets because it helps identify underlying structures and simplify data interpretation. It works by extracting the common variance shared across variables and summarizing it in a few factor scores.
For example, in market research, businesses use factor analysis to group customer responses into broad categories. This helps reveal hidden patterns in consumer behavior.
It’s also helpful in product development, where it can use survey data to identify which product features are most important to customers.
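Here is a minimal sketch with scikit-learn's FactorAnalysis; the survey responses are simulated from two latent factors purely for illustration.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Simulate 200 respondents answering 6 survey items driven by 2 latent factors.
latent = rng.normal(size=(200, 2))
loadings = rng.normal(size=(2, 6))
responses = latent @ loadings + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(responses)   # each respondent reduced to 2 factor scores

print(fa.components_.shape)  # (2, 6): how strongly each item loads on each factor
print(scores.shape)          # (200, 2)
```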
9. Cluster analysis
Cluster analysis groups objects or individuals based on their similarities. This technique works great for customer segmentation, where businesses group customers based on common factors (such as purchasing behavior, demographics, and location).
Distinct clusters help companies tailor marketing strategies and develop personalized services. In education, this analysis can help identify groups of students who require additional assistance based on their achievement data. In medicine, it can help identify patients with similar symptoms to create targeted treatment plans.
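The sketch below segments invented customers with k-means using scikit-learn.

```python
import numpy as np
from sklearn.cluster import KMeans

# Columns: annual spend (in $1,000s) and purchases per year (invented values).
customers = np.array([
    [2, 4], [3, 5], [2, 3],        # low spend, low frequency
    [20, 30], [22, 28], [19, 35],  # high spend, high frequency
    [10, 8], [11, 10], [9, 9],     # mid-range
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(customers)
print(kmeans.labels_)           # cluster assignment for each customer
print(kmeans.cluster_centers_)  # the "typical" customer in each segment
```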
10. Principal component analysis
Principal component analysis (PCA) is a dimensionality-reduction technique that simplifies large datasets by converting them into a smaller number of components. It removes redundant, highly correlated information while preserving as much of the original variation as possible.
PCA is widely used in fields like finance, marketing, and genetics because it helps handle large datasets with many variables. For example, marketers can use PCA to identify which factors most influence customer buying decisions.
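Here is a minimal scikit-learn sketch; the customer data is simulated so that 8 observed variables are driven by 2 underlying dimensions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# 100 customers described by 8 correlated behavioral variables (simulated).
base = rng.normal(size=(100, 2))
data = base @ rng.normal(size=(2, 8)) + 0.1 * rng.normal(size=(100, 8))

pca = PCA(n_components=2)
reduced = pca.fit_transform(data)

print(pca.explained_variance_ratio_)  # share of total variation captured by each component
print(reduced.shape)                  # (100, 2): same customers, far fewer dimensions
```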
- How to choose the right statistical analysis method
Since numerous statistical analysis methods exist, choosing the right one for your needs may be complicated. Several methods can often be applied to the same situation, so understanding where to start can save time and money.
Define your objective
Before choosing any statistical method, clearly define the objective of your analysis. What do you want to find out? Are you looking to compare groups, predict outcomes, or identify relationships between variables?
For example, if your goal is to compare averages between two groups, you can use a t-test. If you want to understand the effect of multiple factors on a single outcome, regression analysis could be the right choice for you.
Identify your data type
Data can be categorical (like yes/no or product types) or numerical (like sales figures or temperature readings).
For example, if you’re analyzing the relationship between two categorical variables, you may need a chi-square test. If you’re working with numerical data and need to predict future outcomes, you could use a time series analysis.
Evaluate the number of variables
The number of variables involved in your analysis influences the method you should choose. If you’re working with one dependent variable and one or more independent variables, regression analysis or ANOVA may be appropriate.
If you’re handling multiple variables, factor analysis or PCA can help simplify your dataset.
Determine sample size and data availability
Consider the assumptions of each method.
Each statistical method has its own set of assumptions, such as the distribution of the data or the relationship between variables.
For example, ANOVA assumes that the groups being compared have similar variances, while regression assumes a linear relationship between independent and dependent variables.
Understand if observations are paired or unpaired
When choosing a statistical test, you need to figure out if the data is paired or unpaired.
Paired data: the same subjects are measured more than once, like before and after a treatment or when using different methods.
Unpaired data: each group has different subjects.
For example, if you’re comparing the average scores of two groups, use a paired t-test for paired data and an independent t-test for unpaired data.
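The SciPy sketch below contrasts the two calls; the scores are invented for illustration.

```python
from scipy import stats

# Paired: the same six subjects measured before and after a treatment (invented scores).
before = [72, 68, 75, 70, 74, 69]
after = [75, 70, 78, 71, 77, 72]
print(stats.ttest_rel(before, after))     # paired (dependent) t-test

# Unpaired: two different groups of subjects (invented scores).
group_a = [72, 68, 75, 70, 74, 69]
group_b = [75, 70, 78, 71, 77, 72]
print(stats.ttest_ind(group_a, group_b))  # independent t-test
```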
- Making the most of key statistical analysis methods
Each statistical analysis method is designed to simplify the process of gaining insights from a specific dataset. Understanding which data you need to analyze and which results you want to see can help you choose the right method.
With a comprehensive approach to analytics, you can maximize the benefits of insights and streamline decision-making. This isn’t just applicable in research and science. Businesses across multiple industries can reap significant benefits from well-structured statistical analysis.
What the data says about immigrants in the U.S.
The United States has long had more immigrants than any other country. In fact, the U.S. is home to one-fifth of the world’s international migrants. These immigrants come from just about every country in the world.
Pew Research Center regularly publishes research on U.S. immigrants. Based on this research, here are answers to some key questions about the U.S. immigrant population.
Pew Research Center conducted this analysis to answer common questions about immigration to the United States and the U.S. immigrant population.
Data for 2023 comes from Census Bureau tabulations of the 2023 American Community Survey. The remaining data in this analysis comes mainly from Center tabulations of Census Bureau microdata from the American Community Surveys (IPUMS) and historical data from decennial censuses.
This analysis also features estimates of the size of the U.S. unauthorized immigrant population. The estimates presented in this research for 2022 are the Center’s latest. Estimates of annual changes in the foreign-born population are from the Current Population Survey for 1994-2023 and the American Community Surveys for 2001-2022 (IPUMS), with adjustments for changes in the bureau’s survey methodology over time.
How many people in the U.S. are immigrants?
The U.S. foreign-born population reached a record 47.8 million in 2023, an increase of 1.6 million from the previous year. This is the largest annual increase in more than 20 years, since 2000.
In 1970, the number of immigrants living in the U.S. was about a fifth of what it is today. Growth of this population accelerated after Congress made changes to U.S. immigration laws in 1965.
Immigrants today account for 14.3% of the U.S. population, a roughly threefold increase from 4.7% in 1970. The immigrant share of the population today is the highest since 1910 but remains below the record 14.8% in 1890.
(Because only limited data from the 2023 American Community Survey has been released as of mid-September 2024, the rest of this post focuses on data from 2022.)
Where are U.S. immigrants from?
Mexico is the top country of birth for U.S. immigrants. In 2022, roughly 10.6 million immigrants living in the U.S. were born there, making up 23% of all U.S. immigrants. The next largest origin groups were those from India (6%), China (5%), the Philippines (4%) and El Salvador (3%).
By region of birth, immigrants from Asia accounted for 28% of all immigrants. Other regions make up smaller shares:
- Latin America (27%), excluding Mexico but including the Caribbean (10%), Central America (9%) and South America (9%)
- Europe, Canada and other North America (12%)
- Sub-Saharan Africa (5%)
- Middle East and North Africa (4%)
How have immigrants’ origin countries changed in recent decades?
Before 1965, U.S. immigration law favored immigrants from Northern and Western Europe and mostly barred immigration from Asia. The 1965 Immigration and Nationality Act opened up immigration from Asia and Latin America. The Immigration Act of 1990 further increased legal immigration and allowed immigrants from more countries to enter the U.S. legally.
Since 1965, about 72 million immigrants have come to the United States, from a larger and more varied set of countries than their predecessors:
- From 1840 to 1889, about 90% of U.S. immigrants came from Europe, including about 70% from Germany, Ireland and the United Kingdom.
- Almost 90% of the immigrants who arrived from 1890 to 1919 came from Europe. Nearly 60% came from Italy, Austria-Hungary and Russia-Poland.
- Since 1965, about half of U.S. immigrants have come from Latin America, with about a quarter from Mexico alone. About another quarter have come from Asia. Large numbers have come from China, India, the Philippines, Central America and the Caribbean.
The newest wave of immigrants has dramatically changed states’ immigrant populations. In 1980, German immigrants were the largest group in 19 states, Canadian immigrants were the largest in 11 states and Mexicans were the largest in 10 states. By 2000, Mexicans were the largest group in 31 states.
Today, Mexico remains the largest origin country for U.S. immigrants. However, immigration from Mexico has slowed since 2007 and the Mexican-born population in the U.S. has dropped. The Mexican share of the U.S. immigrant population dropped from 29% in 2010 to 23% in 2022.
Where are recent immigrants coming from?
In 2022, Mexico was the top country of birth for immigrants who arrived in the last year, with about 150,000 people. India (about 145,000) and China (about 90,000) were the next largest sources of immigrants. Venezuela, Cuba, Brazil and Canada each had about 50,000 to 60,000 new immigrant arrivals.
The main sources of immigrants have shifted twice in the 21st century. The first was caused by the Great Recession (2007-2009). Until 2007, more Hispanics than Asians arrived in the U.S. each year. From 2009 to 2018, the opposite was true.
Since 2019, immigration from Latin America – much of it unauthorized – has reversed the pattern again. More Hispanics than Asians have come each year.
What is the legal status of immigrants in the U.S.?
Most immigrants (77%) are in the country legally. As of 2022:
- 49% were naturalized U.S. citizens.
- 24% were lawful permanent residents.
- 4% were legal temporary residents.
- 23% were unauthorized immigrants.
From 1990 to 2007, the unauthorized immigrant population more than tripled in size, from 3.5 million to a record high of 12.2 million. From there, the number slowly declined to about 10.2 million in 2019.
In 2022, the number of unauthorized immigrants in the U.S. showed sustained growth for the first time since 2007, to 11.0 million.
As of 2022, about 4 million unauthorized immigrants in the U.S. are Mexican. This is the largest number of any origin country, representing more than one-third of all unauthorized immigrants. However, the Mexican unauthorized immigrant population is down from a peak of almost 7 million in 2007, when Mexicans accounted for 57% of all unauthorized immigrants.
The drop in the number of unauthorized immigrants from Mexico has been partly offset by growth from other parts of the world, especially Asia and other parts of Latin America.
The 2022 estimates of the unauthorized immigrant population are our latest comprehensive estimates. Other partial data sources suggest continued growth in 2023 and 2024.
Who are unauthorized immigrants?
Virtually all unauthorized immigrants living in the U.S. entered the country without legal permission or arrived on a nonpermanent visa and stayed after it expired.
A growing number of unauthorized immigrants have permission to live and work in the U.S. and are temporarily protected from deportation. In 2022, about 3 million unauthorized immigrants had these temporary legal protections. These immigrants fall into several groups:
- Temporary Protected Status (TPS): About 650,000 immigrants have TPS as of July 2022. TPS is offered to individuals who cannot safely return to their home country because of civil unrest, violence, natural disaster or other extraordinary and temporary conditions.
- Deferred Action for Childhood Arrivals program (DACA): Almost 600,000 immigrants are beneficiaries of DACA. This program allows individuals brought to the U.S. as children before 2007 to remain in the U.S.
- Asylum applicants: About 1.6 million immigrants have pending applications for asylum in the U.S. as of mid-2022 because of dangers faced in their home country. These immigrants can stay in the U.S. legally while they wait for a decision on their case.
- Other protections: Several hundred thousand individuals have applied for special visas to become lawful immigrants. These types of visas are offered to victims of trafficking and certain other criminal activities.
In addition, about 500,000 immigrants arrived in the U.S. by the end of 2023 under programs created for Ukrainians (U4U or Uniting for Ukraine) and people from Cuba, Haiti, Nicaragua and Venezuela (CHNV parole). These immigrants mainly arrived too late to be counted in the 2022 estimates but may be included in future estimates.
Do all lawful immigrants choose to become U.S. citizens?
Immigrants who are lawful permanent residents can apply to become U.S. citizens if they meet certain requirements. In fiscal year 2022, almost 1 million lawful immigrants became U.S. citizens through naturalization. This is only slightly below record highs in 1996 and 2008.
Most immigrants eligible for naturalization apply for citizenship, but not all do. Top reasons for not applying include language and personal barriers, lack of interest and not being able to afford it, according to a 2015 Pew Research Center survey.
Where do most U.S. immigrants live?
In 2022, most of the nation’s 46.1 million immigrants lived in four states: California (10.4 million or 23% of the national total), Texas (5.2 million or 11%), Florida (4.8 million or 10%) and New York (4.5 million or 10%).
Most immigrants lived in the South (35%) and West (33%). Another 21% lived in the Northeast and 11% were in the Midwest.
In 2022, more than 29 million immigrants – 63% of the nation’s foreign-born population – lived in just 20 major metropolitan areas. The largest populations were in the New York, Los Angeles and Miami metro areas. Most of the nation’s unauthorized immigrant population (60%) lived in these metro areas as well.
How many immigrants are working in the U.S.?
In 2022, over 30 million immigrants were in the U.S. workforce. Lawful immigrants made up the majority of the immigrant workforce, at 22.2 million. An additional 8.3 million immigrant workers are unauthorized. This is a notable increase over 2019 but about the same as in 2007.
The share of workers who are immigrants increased slightly from 17% in 2007 to 18% in 2022. By contrast, the share of immigrant workers who are unauthorized declined from a peak of 5.4% in 2007 to 4.8% in 2022. Immigrants and their children are projected to add about 18 million people of working age between 2015 and 2035. This would offset an expected decline in the working-age population from retiring Baby Boomers.
How educated are immigrants compared with the U.S. population overall?
On average, U.S. immigrants have lower levels of education than the U.S.-born population. In 2022, immigrants ages 25 and older were about three times as likely as the U.S. born to have not completed high school (25% vs. 7%). However, immigrants were as likely as the U.S. born to have a bachelor’s degree or more (35% vs. 36%).
Immigrant educational attainment varies by origin. About half of immigrants from Mexico (51%) had not completed high school, and the same was true for 46% of those from Central America and 21% from the Caribbean. Immigrants from these three regions were also less likely than the U.S. born to have a bachelor’s degree or more.
On the other hand, immigrants from all other regions were about as likely as or more likely than the U.S. born to have at least a bachelor’s degree. Immigrants from South Asia (72%) were the most likely to have a bachelor’s degree or more.
How well do immigrants speak English?
About half of immigrants ages 5 and older (54%) are proficient English speakers – they either speak English very well (37%) or speak only English at home (17%).
Immigrants from Canada (97%), Oceania (82%), sub-Saharan Africa (76%), Europe (75%) and South Asia (73%) have the highest rates of English proficiency.
Immigrants from Mexico (36%) and Central America (35%) have the lowest proficiency rates.
Immigrants who have lived in the U.S. longer are somewhat more likely to be English proficient. Some 45% of immigrants who have lived in the U.S. for five years or less are proficient, compared with 56% of immigrants who have lived in the U.S. for 20 years or more.
Spanish is the most commonly spoken language among U.S. immigrants. About four-in-ten immigrants (41%) speak Spanish at home. Besides Spanish, the top languages immigrants speak at home are English only (17%), Chinese (6%), Filipino/Tagalog (4%), French or Haitian Creole (3%), and Vietnamese (2%).
Note: This is an update of a post originally published May 3, 2017.
Mohamad Moslimani is a former research analyst focusing on race and ethnicity at Pew Research Center.
Jeffrey S. Passel is a senior demographer at Pew Research Center.