
Data-driven hypothesis generation in clinical research: what we learned from a human subject study


Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it often goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in the other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first review the literature on scientific thinking, reasoning, medical reasoning, and literature-based discovery, as well as a field study exploring scientific thinking and discovery. Research on scientific thinking has progressed considerably over the years in cognitive science and its applied areas: education, medicine, and biomedical research. However, the literature reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the average time participants need to generate a hypothesis and reduce the number of cognitive events required per hypothesis. However, the hypotheses generated with VIADS received significantly lower quality ratings for feasibility. Despite its small scale, the study confirmed the feasibility of directly conducting a human participant study to explore the hypothesis generation process in clinical research. It provides supporting evidence for a larger-scale study with a specifically designed tool to facilitate hypothesis generation among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn could improve clinical research productivity and the overall clinical research enterprise.


Hypothesis

Hiroshi Ishikawa

Part of the book series: Studies in Big Data (SBD, volume 139)


This chapter will explain the definition and properties of a hypothesis, the related concepts, and basic methods of hypothesis generation, as follows.

  • Describe the definition, properties, and life cycle of a hypothesis.
  • Describe the relationships between a hypothesis and a theory, a model, and data.
  • Categorize and explain research questions that provide hints for hypothesis generation.
  • Explain how to visualize data and analysis results.
  • Explain the philosophy of science and scientific methods in relation to hypothesis generation in science.
  • Explain deduction, induction, plausible reasoning, and analogy concretely as reasoning methods useful for hypothesis generation.
  • Explain problem solving as a hypothesis generation method, using familiar examples.



Author information

Authors and Affiliations

Department of Systems Design, Tokyo Metropolitan University, Hino, Tokyo, Japan

Hiroshi Ishikawa


Corresponding author

Correspondence to Hiroshi Ishikawa.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Ishikawa, H. (2024). Hypothesis. In: Hypothesis Generation and Interpretation. Studies in Big Data, vol 139. Springer, Cham. https://doi.org/10.1007/978-3-031-43540-9_2


DOI: https://doi.org/10.1007/978-3-031-43540-9_2

Published: 01 January 2024

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-43539-3

Online ISBN: 978-3-031-43540-9


Hypothesis Testing: Principles and Methods

Learn about hypothesis testing: types of tests, common errors, and best practices for researchers at any level.


Hypothesis testing is a fundamental tool used in scientific research to validate or reject hypotheses about population parameters based on sample data. It provides a structured framework for evaluating the statistical significance of a hypothesis and drawing conclusions about the true nature of a population. Hypothesis testing is widely used in fields such as biology, psychology, economics, and engineering to determine the effectiveness of new treatments, explore relationships between variables, and make data-driven decisions. However, despite its importance, hypothesis testing can be a challenging topic to understand and apply correctly.

In this article, we will provide an introduction to hypothesis testing, including its purpose, types of tests, steps involved, common errors, and best practices. Whether you are a beginner or an experienced researcher, this article will serve as a valuable guide to mastering hypothesis testing in your work.

Introduction to Hypothesis Testing

Hypothesis testing is a statistical tool that is commonly used in research to determine whether there is enough evidence to support or reject a hypothesis. It involves formulating a hypothesis about a population parameter, collecting data, and analyzing the data to determine the likelihood of the hypothesis being true. It is a critical component of the scientific method, and it is used in a wide range of fields.

The process of hypothesis testing typically involves two hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is a statement that there is no significant difference between two variables or no relationship between them, while the alternative hypothesis suggests the presence of a relationship or difference. Researchers collect data and perform statistical analysis to determine if the null hypothesis can be rejected in favor of the alternative hypothesis.

Hypothesis testing is used to make decisions based on data, and it is important to understand the underlying assumptions and limitations of the process. It is crucial to choose appropriate statistical tests and sample sizes to ensure that the results are accurate and reliable, and it can be a powerful tool for researchers to validate their theories and make evidence-based decisions.

Types of Hypothesis Tests

Hypothesis testing can be broadly classified into two categories: one-sample hypothesis tests and two-sample hypothesis tests. Let’s take a closer look at each of these categories:

One Sample Hypothesis Tests

In a one-sample hypothesis test, a researcher collects data from a single population and compares it to a known or hypothesized value. The null hypothesis usually assumes that there is no significant difference between the population mean and that value. The researcher then performs a statistical test to determine whether the observed difference is statistically significant. Some examples of one-sample hypothesis tests are:

One Sample t-test: This test is used to determine whether the sample mean is significantly different from the hypothesized mean of the population.


One Sample z-test: This test is used to determine whether the sample mean is significantly different from the hypothesized mean of the population when the population standard deviation is known.

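To make these two tests concrete, here is a minimal Python sketch, assuming NumPy and SciPy are installed; the sample data, the hypothesized mean of 50, and the known sigma of 10 are simulated stand-ins rather than values from any real study.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
sample = rng.normal(loc=52.0, scale=10.0, size=30)  # simulated measurements
mu0 = 50.0  # hypothesized population mean (illustrative)

# One-sample t-test: population standard deviation unknown, estimated from data.
t_stat, p_t = stats.ttest_1samp(sample, popmean=mu0)
print(f"t = {t_stat:.3f}, p = {p_t:.4f}")

# One-sample z-test: population standard deviation assumed known (sigma = 10).
# SciPy ships no dedicated one-sample z-test, so the definition is applied directly.
sigma = 10.0
z_stat = (sample.mean() - mu0) / (sigma / np.sqrt(len(sample)))
p_z = 2 * stats.norm.sf(abs(z_stat))  # two-sided p-value
print(f"z = {z_stat:.3f}, p = {p_z:.4f}")
```

With 30 observations the two tests give similar answers; the z-test is mainly of interest when the population standard deviation really is known or the sample is large.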

Two Sample Hypothesis Tests

In a two-sample hypothesis test, a researcher collects data from two different populations and compares them to each other. The null hypothesis typically assumes that there is no significant difference between the two populations, and the researcher performs a statistical test to determine whether the observed difference is statistically significant. Some examples of two-sample hypothesis tests are:

Independent Samples t-test: This test is used to compare the means of two independent samples to determine whether they are significantly different from each other.


Paired Samples t-test: This test is used to compare the means of two related samples, such as pre-test and post-test scores of the same group of subjects.

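A minimal sketch of both two-sample designs, again assuming NumPy and SciPy; the control, treatment, and pre/post scores below are simulated stand-ins.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Independent samples t-test: two unrelated groups (e.g., control vs. treatment).
control = rng.normal(5.0, 1.0, 40)    # simulated control group
treatment = rng.normal(5.6, 1.0, 40)  # simulated treatment group
t_ind, p_ind = stats.ttest_ind(control, treatment)
print(f"independent: t = {t_ind:.3f}, p = {p_ind:.4f}")

# Paired samples t-test: two related measurements on the same subjects.
pre = rng.normal(10.0, 2.0, 25)        # simulated pre-test scores
post = pre + rng.normal(0.5, 1.0, 25)  # post-test scores of the same subjects
t_rel, p_rel = stats.ttest_rel(pre, post)
print(f"paired:      t = {t_rel:.3f}, p = {p_rel:.4f}")
```

Note that `ttest_rel` operates on the within-subject differences, which is what distinguishes it from treating the two measurement sets as independent.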

In summary, one-sample hypothesis tests are used to test hypotheses about a single population, while two-sample hypothesis tests are used to compare two populations. The appropriate test to use depends on the nature of the data and the research question being investigated.

Steps of Hypothesis Testing

Hypothesis testing involves a series of steps that help researchers determine whether there is enough evidence to support or reject a hypothesis. These steps can be broadly classified into four categories:

Formulating the Hypothesis

The first step in hypothesis testing is to formulate the null hypothesis and alternative hypothesis. The null hypothesis usually assumes that there is no significant difference between two variables, while the alternative hypothesis suggests the presence of a relationship or difference. It is important to formulate clear and testable hypotheses before proceeding with data collection.

Collecting Data

The second step is to collect relevant data that can be used to test the hypotheses. The data collection process should be carefully designed to ensure that the sample is representative of the population of interest. The sample size should be large enough to produce statistically valid results.

Analyzing Data

The third step is to analyze the data using appropriate statistical tests. The choice of test depends on the nature of the data and the research question being investigated. The results of the statistical analysis will provide information on whether the null hypothesis can be rejected in favor of the alternative hypothesis.

Interpreting Results

The final step is to interpret the results of the statistical analysis. The researcher needs to determine whether the results are statistically significant and whether they support or reject the hypothesis. The researcher should also consider the limitations of the study and the potential implications of the results.
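The four steps can be traced in one compact, hypothetical example, assuming SciPy; the scores are simulated and the hypotheses are illustrative only.

```python
import numpy as np
from scipy import stats

# Step 1: Formulate the hypotheses.
# H0: the mean test score equals 70.  Ha: the mean test score differs from 70.
mu0 = 70.0

# Step 2: Collect data (simulated here as a stand-in for a real sample).
rng = np.random.default_rng(7)
scores = rng.normal(73.0, 8.0, 35)

# Step 3: Analyze the data with an appropriate test (one-sample t-test).
t_stat, p_value = stats.ttest_1samp(scores, popmean=mu0)

# Step 4: Interpret the result at the chosen significance level.
alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```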

Common Errors in Hypothesis Testing

Hypothesis testing is a statistical method used to determine if there is enough evidence to support or reject a specific hypothesis about a population parameter based on a sample of data. The two types of errors that can occur in hypothesis testing are:

Type I error: This occurs when the researcher rejects the null hypothesis even though it is true. Type I error is also known as a false positive.

Type II error: This occurs when the researcher fails to reject the null hypothesis even though it is false. Type II error is also known as a false negative.
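Both error types can be illustrated by simulation. The sketch below, assuming NumPy and SciPy, repeatedly runs a one-sample t-test: first on data where the null hypothesis is true, so every rejection is a Type I error, and then on data where it is false, so every non-rejection is a Type II error. The effect size of 0.3 and sample size of 30 are arbitrary illustrative choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, trials = 0.05, 30, 10_000

# Type I error: H0 is true (the mean really is 0), yet the test rejects it.
false_positives = sum(
    stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue < alpha
    for _ in range(trials)
)

# Type II error: H0 is false (the mean is 0.3), yet the test fails to reject it.
false_negatives = sum(
    stats.ttest_1samp(rng.normal(0.3, 1.0, n), 0.0).pvalue >= alpha
    for _ in range(trials)
)

print(f"Type I rate:  {false_positives / trials:.3f}  (close to alpha = {alpha})")
print(f"Type II rate: {false_negatives / trials:.3f}  (large: this test is underpowered)")
```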

To minimize these errors, it is important to carefully design and conduct the study, choose appropriate statistical tests, and properly interpret the results. Researchers should also acknowledge the limitations of their study and consider the potential sources of error when drawing conclusions.

Null and Alternative Hypotheses

In hypothesis testing, there are two types of hypotheses: null hypothesis and alternative hypothesis.

The Null Hypothesis

The null hypothesis (H0) is a statement that assumes there is no significant difference or relationship between two variables. It is the default hypothesis that is assumed to be true until there is sufficient evidence to reject it. The null hypothesis is often written as a statement of equality, such as “the mean of Group A is equal to the mean of Group B.”

The Alternative Hypothesis

The alternative hypothesis (Ha) is a statement that suggests the presence of a significant difference or relationship between two variables. It is the hypothesis that the researcher is interested in testing. The alternative hypothesis is often written as a statement of inequality, such as “the mean of Group A is not equal to the mean of Group B.”

The null and alternative hypotheses are complementary and mutually exclusive. If the null hypothesis is rejected, the alternative hypothesis is accepted. If the null hypothesis cannot be rejected, the alternative hypothesis is not supported.

It is important to note that the null hypothesis is not necessarily true. It is simply a statement that assumes there is no significant difference or relationship between the variables being studied. The purpose of hypothesis testing is to determine whether there is sufficient evidence to reject the null hypothesis in favor of the alternative hypothesis.

Significance Level and P Value

In hypothesis testing, the significance level (alpha) is the probability of making a Type I error, which is rejecting the null hypothesis when it is actually true. The most commonly used significance level in scientific research is 0.05, meaning that there is a 5% chance of making a Type I error.

The p-value is a statistical measure that indicates the probability of obtaining the observed results or more extreme results if the null hypothesis is true. It is a measure of the strength of evidence against the null hypothesis. A small p-value (typically less than the chosen significance level of 0.05) suggests that there is strong evidence against the null hypothesis, while a large p-value suggests that there is not enough evidence to reject the null hypothesis.

If the p-value is less than the significance level (p < alpha), then the null hypothesis is rejected and the alternative hypothesis is accepted. This means that there is sufficient evidence to suggest that there is a significant difference or relationship between the variables being studied. On the other hand, if the p-value is greater than the significance level (p > alpha), then the null hypothesis is not rejected and the alternative hypothesis is not supported.


It is important to note that statistical significance does not necessarily imply practical significance or importance. A small difference or relationship between variables may be statistically significant but may not be practically significant. Additionally, statistical significance depends on sample size and effect size, among other factors, and should be interpreted in the context of the study design and research question.
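That last point is easy to demonstrate with simulated data, assuming NumPy and SciPy: with a very large sample, a negligible true difference of 0.15 points on a scale whose standard deviation is 15 (an effect size of 0.01) still produces a tiny p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 500_000  # very large samples make even negligible effects "significant"
a = rng.normal(100.00, 15.0, n)
b = rng.normal(100.15, 15.0, n)  # true difference: 0.15 points on an SD-15 scale

t_stat, p_value = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (b.mean() - a.mean()) / pooled_sd

print(f"p = {p_value:.2e}")           # typically far below 0.05
print(f"Cohen's d = {cohens_d:.3f}")  # about 0.01: practically negligible
```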

Power Analysis for Hypothesis Testing

Power analysis is a statistical method used in hypothesis testing to determine the sample size needed to detect a specific effect size with a certain level of confidence. The power of a statistical test is the probability of correctly rejecting the null hypothesis when it is false or the probability of avoiding a Type II error.

Power analysis is important because it helps researchers determine the appropriate sample size needed to achieve a desired level of power. A study with low power may fail to detect a true effect, leading to a Type II error, while a study with high power is more likely to detect a true effect, leading to more accurate and reliable results.

To conduct a power analysis, researchers need to specify the desired power level, significance level, effect size, and sample size. Effect size is a measure of the magnitude of the difference or relationship between variables being studied, and is typically estimated from previous research or pilot studies. The power analysis can then determine the necessary sample size needed to achieve the desired power level.

Power analysis can also be used retrospectively to determine the power of a completed study, based on the sample size, effect size, and significance level. This can help researchers evaluate the strength of their conclusions and determine whether additional research is needed.

Overall, power analysis is an important tool in hypothesis testing, as it helps researchers design studies that are adequately powered to detect true effects and avoid Type II errors.
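As a concrete illustration, the statsmodels package (assumed to be installed) provides power calculations for common tests; the effect size, power, and sample sizes below are illustrative values, not recommendations.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Prospective use: sample size per group needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at a two-sided 5% significance level.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                                   alternative='two-sided')
print(f"required n per group: {n_per_group:.1f}")  # about 64

# Retrospective use: power achieved by a completed study with 40 per group.
achieved = analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=40)
print(f"achieved power: {achieved:.2f}")  # about 0.6
```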

Bayesian Hypothesis Testing

Bayesian hypothesis testing is a statistical method that allows researchers to evaluate the evidence for and against competing hypotheses, based on the likelihood of the observed data under each hypothesis, as well as the prior probability of each hypothesis. Unlike classical hypothesis testing, which focuses on rejecting null hypotheses based on p-values, Bayesian hypothesis testing provides a more nuanced and informative approach to hypothesis testing, by allowing researchers to quantify the strength of evidence for and against each hypothesis.

In Bayesian hypothesis testing, researchers start with a prior probability distribution for each hypothesis, based on existing knowledge or beliefs. They then update the prior probability distribution based on the likelihood of the observed data under each hypothesis, using Bayes’ theorem. The resulting posterior probability distribution represents the probability of each hypothesis, given the observed data.

The strength of evidence for one hypothesis versus another can be quantified by calculating the Bayes factor, which is the ratio of the likelihood of the observed data under one hypothesis versus another, weighted by their prior probabilities. A Bayes factor greater than 1 indicates evidence in favor of one hypothesis, while a Bayes factor less than 1 indicates evidence in favor of the other hypothesis.
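A worked example under simple illustrative assumptions: a coin lands heads 62 times in 100 tosses, and we compare H0 (a fair coin, p = 0.5) against H1 (an unknown bias with a uniform Beta(1, 1) prior). Both marginal likelihoods have closed forms, so the Bayes factor can be computed exactly with SciPy.

```python
import numpy as np
from scipy import stats
from scipy.special import betaln, comb

k, n = 62, 100  # observed heads and total tosses (illustrative data)

# Marginal likelihood under H0: a fixed fair coin.
m0 = stats.binom.pmf(k, n, 0.5)

# Marginal likelihood under H1: the binomial likelihood integrated over the
# Beta(1, 1) prior, which equals C(n, k) * B(k + 1, n - k + 1) in closed form.
m1 = comb(n, k) * np.exp(betaln(k + 1, n - k + 1))

bf10 = m1 / m0
print(f"BF10 = {bf10:.2f}")  # about 2.2: only mild evidence for a biased coin
```

Here BF10 comes out to roughly 2.2, mild evidence for a biased coin, even though a classical two-sided binomial test of the same data gives p of about 0.02 and would reject H0 at the 0.05 level; this contrast illustrates why Bayes factors are often considered more informative than p-values alone.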

Bayesian hypothesis testing has several advantages over classical hypothesis testing. First, it allows researchers to update their prior beliefs based on observed data, which can lead to more accurate and reliable conclusions. Second, it provides a more informative measure of evidence than p-values, which only indicate whether the observed data is statistically significant at a predetermined level. Finally, it can accommodate complex models with multiple parameters and hypotheses, which may be difficult to analyze using classical methods.

Overall, Bayesian hypothesis testing is a powerful and flexible statistical method that can help researchers make more informed decisions and draw more accurate conclusions from their data.


J Korean Med Sci. 2022 Apr 25;37(16).

A Practical Guide to Writing Quantitative and Qualitative Research Questions and Hypotheses in Scholarly Articles

Edward Barroga

1 Department of General Education, Graduate School of Nursing Science, St. Luke’s International University, Tokyo, Japan.

Glafera Janet Matanguihan

2 Department of Biological Sciences, Messiah University, Mechanicsburg, PA, USA.

The development of research questions and the subsequent hypotheses are prerequisites to defining the main research purpose and specific objectives of a study. Consequently, these objectives determine the study design and research outcome. The development of research questions is a process based on knowledge of current trends, cutting-edge studies, and technological advances in the research field. Excellent research questions are focused and require a comprehensive literature search and in-depth understanding of the problem being investigated. Initially, research questions may be written as descriptive questions which could be developed into inferential questions. These questions must be specific and concise to provide a clear foundation for developing hypotheses. Hypotheses are more formal predictions about the research outcomes. These specify the possible results that may or may not be expected regarding the relationship between groups. Thus, research questions and hypotheses clarify the main purpose and specific objectives of the study, which in turn dictate the design of the study, its direction, and outcome. Studies developed from good research questions and hypotheses will have trustworthy outcomes with wide-ranging social and health implications.

INTRODUCTION

Scientific research is usually initiated by posing evidence-based research questions which are then explicitly restated as hypotheses. 1 , 2 The hypotheses provide directions to guide the study, solutions, explanations, and expected results. 3 , 4 Both research questions and hypotheses are essentially formulated based on conventional theories and real-world processes, which allow the inception of novel studies and the ethical testing of ideas. 5 , 6

It is crucial to have knowledge of both quantitative and qualitative research 2 as both types of research involve writing research questions and hypotheses. 7 However, these crucial elements of research are sometimes overlooked; when they are not overlooked, they are often framed without the forethought and meticulous attention they need. Planning and careful consideration are needed when developing quantitative or qualitative research, particularly when conceptualizing research questions and hypotheses. 4

There is a continuing need to support researchers in the creation of innovative research questions and hypotheses, as well as for journal articles that carefully review these elements. 1 When research questions and hypotheses are not carefully thought of, unethical studies and poor outcomes usually ensue. Carefully formulated research questions and hypotheses define well-founded objectives, which in turn determine the appropriate design, course, and outcome of the study. This article then aims to discuss in detail the various aspects of crafting research questions and hypotheses, with the goal of guiding researchers as they develop their own. Examples from the authors and peer-reviewed scientific articles in the healthcare field are provided to illustrate key points.

DEFINITIONS AND RELATIONSHIP OF RESEARCH QUESTIONS AND HYPOTHESES

A research question is what a study aims to answer after data analysis and interpretation. The answer is written in length in the discussion section of the paper. Thus, the research question gives a preview of the different parts and variables of the study meant to address the problem posed in the research question. 1 An excellent research question clarifies the research writing while facilitating understanding of the research topic, objective, scope, and limitations of the study. 5

On the other hand, a research hypothesis is an educated statement of an expected outcome. This statement is based on background research and current knowledge. 8 , 9 The research hypothesis makes a specific prediction about a new phenomenon 10 or a formal statement on the expected relationship between an independent variable and a dependent variable. 3 , 11 It provides a tentative answer to the research question to be tested or explored. 4

Hypotheses employ reasoning to predict a theory-based outcome. 10 These can also be developed from theories by focusing on components of theories that have not yet been observed. 10 The validity of hypotheses is often based on the testability of the prediction made in a reproducible experiment. 8

Conversely, hypotheses can also be rephrased as research questions. Several hypotheses based on existing theories and knowledge may be needed to answer a research question. Developing ethical research questions and hypotheses creates a research design that has logical relationships among variables. These relationships serve as a solid foundation for the conduct of the study. 4 , 11 Haphazardly constructed research questions can result in poorly formulated hypotheses and improper study designs, leading to unreliable results. Thus, the formulations of relevant research questions and verifiable hypotheses are crucial when beginning research. 12

CHARACTERISTICS OF GOOD RESEARCH QUESTIONS AND HYPOTHESES

Excellent research questions are specific and focused. These integrate collective data and observations to confirm or refute the subsequent hypotheses. Well-constructed hypotheses are based on previous reports and verify the research context. These are realistic, in-depth, sufficiently complex, and reproducible. More importantly, these hypotheses can be addressed and tested. 13

There are several characteristics of well-developed hypotheses. Good hypotheses are 1) empirically testable 7 , 10 , 11 , 13 ; 2) backed by preliminary evidence 9 ; 3) testable by ethical research 7 , 9 ; 4) based on original ideas 9 ; 5) have evidence-based logical reasoning 10 ; and 6) can be predicted. 11 Good hypotheses can infer ethical and positive implications, indicating the presence of a relationship or effect relevant to the research theme. 7 , 11 These are initially developed from a general theory and branch into specific hypotheses by deductive reasoning. In the absence of a theory to base the hypotheses on, inductive reasoning based on specific observations or findings forms more general hypotheses. 10

TYPES OF RESEARCH QUESTIONS AND HYPOTHESES

Research questions and hypotheses are developed according to the type of research, which can be broadly classified into quantitative and qualitative research. We provide a summary of the types of research questions and hypotheses under quantitative and qualitative research categories in Table 1 .

Research questions in quantitative research

In quantitative research, research questions inquire about the relationships among variables being investigated and are usually framed at the start of the study. These are precise and typically linked to the subject population, dependent and independent variables, and research design. 1 Research questions may also attempt to describe the behavior of a population in relation to one or more variables, or describe the characteristics of variables to be measured ( descriptive research questions ). 1 , 5 , 14 These questions may also aim to discover differences between groups within the context of an outcome variable ( comparative research questions ), 1 , 5 , 14 or elucidate trends and interactions among variables ( relationship research questions ). 1 , 5 We provide examples of descriptive, comparative, and relationship research questions in quantitative research in Table 2 .

Hypotheses in quantitative research

In quantitative research, hypotheses predict the expected relationships among variables. 15 Relationships among variables that can be predicted include 1) between a single dependent variable and a single independent variable ( simple hypothesis ) or 2) between two or more independent and dependent variables ( complex hypothesis ). 4 , 11 Hypotheses may also specify the expected direction to be followed and imply an intellectual commitment to a particular outcome ( directional hypothesis ). 4 On the other hand, hypotheses may not predict the exact direction and are used in the absence of a theory, or when findings contradict previous studies ( non-directional hypothesis ). 4 In addition, hypotheses can 1) define interdependency between variables ( associative hypothesis ), 4 2) propose an effect on the dependent variable from manipulation of the independent variable ( causal hypothesis ), 4 3) state a negative relationship between two variables ( null hypothesis ), 4 , 11 , 15 4) replace the working hypothesis if rejected ( alternative hypothesis ), 15 5) explain the relationship of phenomena to possibly generate a theory ( working hypothesis ), 11 6) involve quantifiable variables that can be tested statistically ( statistical hypothesis ), 11 or 7) express a relationship whose interlinks can be verified logically ( logical hypothesis ). 11 We provide examples of simple, complex, directional, non-directional, associative, causal, null, alternative, working, statistical, and logical hypotheses in quantitative research, as well as the definition of quantitative hypothesis-testing research, in Table 3 .

Research questions in qualitative research

Unlike research questions in quantitative research, research questions in qualitative research are usually continuously reviewed and reformulated. The central question and associated subquestions are stated more than the hypotheses. 15 The central question broadly explores a complex set of factors surrounding the central phenomenon, aiming to present the varied perspectives of participants. 15

There are varied goals for which qualitative research questions are developed. These questions can function in several ways, such as to 1) identify and describe existing conditions ( contextual research questions ); 2) describe a phenomenon ( descriptive research questions ); 3) assess the effectiveness of existing methods, protocols, theories, or procedures ( evaluation research questions ); 4) examine a phenomenon or analyze the reasons or relationships between subjects or phenomena ( explanatory research questions ); or 5) focus on unknown aspects of a particular topic ( exploratory research questions ). 5 In addition, some qualitative research questions provide new ideas for the development of theories and actions ( generative research questions ) or advance specific ideologies of a position ( ideological research questions ). 1 Other qualitative research questions may build on a body of existing literature and become working guidelines ( ethnographic research questions ). Research questions may also be broadly stated without specific reference to the existing literature or a typology of questions ( phenomenological research questions ), may be directed towards generating a theory of some process ( grounded theory questions ), or may address a description of the case and the emerging themes ( qualitative case study questions ). 15 We provide examples of contextual, descriptive, evaluation, explanatory, exploratory, generative, ideological, ethnographic, phenomenological, grounded theory, and qualitative case study research questions in qualitative research in Table 4 , and the definition of qualitative hypothesis-generating research in Table 5 .

Qualitative studies usually pose at least one central research question and several subquestions starting with How or What . These research questions use exploratory verbs such as explore or describe . These also focus on one central phenomenon of interest, and may mention the participants and research site. 15

Hypotheses in qualitative research

Hypotheses in qualitative research are stated in the form of a clear statement concerning the problem to be investigated. Unlike in quantitative research where hypotheses are usually developed to be tested, qualitative research can lead to both hypothesis-testing and hypothesis-generating outcomes. 2 When studies require both quantitative and qualitative research questions, this suggests an integrative process between both research methods wherein a single mixed-methods research question can be developed. 1

FRAMEWORKS FOR DEVELOPING RESEARCH QUESTIONS AND HYPOTHESES

Research questions followed by hypotheses should be developed before the start of the study. 1 , 12 , 14 It is crucial to develop feasible research questions on a topic that is interesting to both the researcher and the scientific community. This can be achieved by a meticulous review of previous and current studies to establish a novel topic. Specific areas are subsequently focused on to generate ethical research questions. The relevance of the research questions is evaluated in terms of clarity of the resulting data, specificity of the methodology, objectivity of the outcome, depth of the research, and impact of the study. 1 , 5 These aspects constitute the FINER criteria (i.e., Feasible, Interesting, Novel, Ethical, and Relevant). 1 Clarity and effectiveness are achieved if research questions meet the FINER criteria. In addition to the FINER criteria, Ratan et al. described focus, complexity, novelty, feasibility, and measurability for evaluating the effectiveness of research questions. 14

The PICOT and PEO frameworks are also used when developing research questions. 1 These frameworks address the following elements. PICOT: P-population/patients/problem, I-intervention or indicator being studied, C-comparison group, O-outcome of interest, and T-timeframe of the study. PEO: P-population being studied, E-exposure to preexisting conditions, and O-outcome of interest. 1 Research questions are also considered good if they meet the “FINERMAPS” framework: Feasible, Interesting, Novel, Ethical, Relevant, Manageable, Appropriate, Potential value/publishable, and Systematic. 14

As we indicated earlier, research questions and hypotheses that are not carefully formulated result in unethical studies or poor outcomes. To illustrate this, we provide some examples of ambiguous research question and hypotheses that result in unclear and weak research objectives in quantitative research ( Table 6 ) 16 and qualitative research ( Table 7 ) 17 , and how to transform these ambiguous research question(s) and hypothesis(es) into clear and good statements.


CONSTRUCTING RESEARCH QUESTIONS AND HYPOTHESES

To construct effective research questions and hypotheses, it is very important to 1) clarify the background and 2) identify the research problem at the outset of the research, within a specific timeframe. 9 Then, 3) review or conduct preliminary research to collect all available knowledge about the possible research questions by studying theories and previous studies. 18 Afterwards, 4) construct research questions to investigate the research problem. Identify variables to be accessed from the research questions 4 and make operational definitions of constructs from the research problem and questions. Thereafter, 5) construct specific deductive or inductive predictions in the form of hypotheses. 4 Finally, 6) state the study aims . This general flow for constructing effective research questions and hypotheses prior to conducting research is shown in Fig. 1 .

Fig. 1. General flow for constructing effective research questions and hypotheses prior to conducting research.

Research questions are used more frequently in qualitative research than objectives or hypotheses. 3 These questions seek to discover, understand, explore or describe experiences by asking “What” or “How.” The questions are open-ended to elicit a description rather than to relate variables or compare groups. The questions are continually reviewed, reformulated, and changed during the qualitative study. 3 Research questions are also used more frequently in survey projects than hypotheses in experiments in quantitative research to compare variables and their relationships.

Hypotheses are constructed based on the variables identified and as an if-then statement, following the template, ‘If a specific action is taken, then a certain outcome is expected.’ At this stage, some ideas regarding expectations from the research to be conducted must be drawn. 18 Then, the variables to be manipulated (independent) and influenced (dependent) are defined. 4 Thereafter, the hypothesis is stated and refined, and reproducible data tailored to the hypothesis are identified, collected, and analyzed. 4 The hypotheses must be testable and specific, 18 and should describe the variables and their relationships, the specific group being studied, and the predicted research outcome. 18 Hypotheses construction involves a testable proposition to be deduced from theory, and independent and dependent variables to be separated and measured separately. 3 Therefore, good hypotheses must be based on good research questions constructed at the start of a study or trial. 12

In summary, research questions are constructed after establishing the background of the study. Hypotheses are then developed based on the research questions. Thus, it is crucial to have excellent research questions to generate superior hypotheses. In turn, these would determine the research objectives and the design of the study, and ultimately, the outcome of the research. 12 Algorithms for building research questions and hypotheses are shown in Fig. 2 for quantitative research and in Fig. 3 for qualitative research.

Fig. 2. Algorithm for building research questions and hypotheses in quantitative research.

EXAMPLES OF RESEARCH QUESTIONS FROM PUBLISHED ARTICLES

  • EXAMPLE 1. Descriptive research question (quantitative research)
  • - Presents research variables to be assessed (distinct phenotypes and subphenotypes)
  • “BACKGROUND: Since COVID-19 was identified, its clinical and biological heterogeneity has been recognized. Identifying COVID-19 phenotypes might help guide basic, clinical, and translational research efforts.
  • RESEARCH QUESTION: Does the clinical spectrum of patients with COVID-19 contain distinct phenotypes and subphenotypes? ” 19
  • EXAMPLE 2. Relationship research question (quantitative research)
  • - Shows interactions between dependent variable (static postural control) and independent variable (peripheral visual field loss)
  • “Background: Integration of visual, vestibular, and proprioceptive sensations contributes to postural control. People with peripheral visual field loss have serious postural instability. However, the directional specificity of postural stability and sensory reweighting caused by gradual peripheral visual field loss remain unclear.
  • Research question: What are the effects of peripheral visual field loss on static postural control ?” 20
  • EXAMPLE 3. Comparative research question (quantitative research)
  • - Clarifies the difference among groups with an outcome variable (patients enrolled in COMPERA with moderate PH or severe PH in COPD) and another group without the outcome variable (patients with idiopathic pulmonary arterial hypertension (IPAH))
  • “BACKGROUND: Pulmonary hypertension (PH) in COPD is a poorly investigated clinical condition.
  • RESEARCH QUESTION: Which factors determine the outcome of PH in COPD?
  • STUDY DESIGN AND METHODS: We analyzed the characteristics and outcome of patients enrolled in the Comparative, Prospective Registry of Newly Initiated Therapies for Pulmonary Hypertension (COMPERA) with moderate or severe PH in COPD as defined during the 6th PH World Symposium who received medical therapy for PH and compared them with patients with idiopathic pulmonary arterial hypertension (IPAH) .” 21
  • EXAMPLE 4. Exploratory research question (qualitative research)
  • - Explores areas that have not been fully investigated (perspectives of families and children who receive care in clinic-based child obesity treatment) to have a deeper understanding of the research problem
  • “Problem: Interventions for children with obesity lead to only modest improvements in BMI and long-term outcomes, and data are limited on the perspectives of families of children with obesity in clinic-based treatment. This scoping review seeks to answer the question: What is known about the perspectives of families and children who receive care in clinic-based child obesity treatment? This review aims to explore the scope of perspectives reported by families of children with obesity who have received individualized outpatient clinic-based obesity treatment.” 22
  • EXAMPLE 5. Relationship research question (quantitative research)
  • - Defines interactions between dependent variable (use of ankle strategies) and independent variable (changes in muscle tone)
  • “Background: To maintain an upright standing posture against external disturbances, the human body mainly employs two types of postural control strategies: “ankle strategy” and “hip strategy.” While it has been reported that the magnitude of the disturbance alters the use of postural control strategies, it has not been elucidated how the level of muscle tone, one of the crucial parameters of bodily function, determines the use of each strategy. We have previously confirmed using forward dynamics simulations of human musculoskeletal models that an increased muscle tone promotes the use of ankle strategies. The objective of the present study was to experimentally evaluate a hypothesis: an increased muscle tone promotes the use of ankle strategies. Research question: Do changes in the muscle tone affect the use of ankle strategies ?” 23

EXAMPLES OF HYPOTHESES IN PUBLISHED ARTICLES

  • EXAMPLE 1. Working hypothesis (quantitative research)
  • - A hypothesis that is initially accepted for further research to produce a feasible theory
  • “As fever may have benefit in shortening the duration of viral illness, it is plausible to hypothesize that the antipyretic efficacy of ibuprofen may be hindering the benefits of a fever response when taken during the early stages of COVID-19 illness .” 24
  • “In conclusion, it is plausible to hypothesize that the antipyretic efficacy of ibuprofen may be hindering the benefits of a fever response . The difference in perceived safety of these agents in COVID-19 illness could be related to the more potent efficacy to reduce fever with ibuprofen compared to acetaminophen. Compelling data on the benefit of fever warrant further research and review to determine when to treat or withhold ibuprofen for early stage fever for COVID-19 and other related viral illnesses .” 24
  • EXAMPLE 2. Exploratory hypothesis (qualitative research)
  • - Explores particular areas deeper to clarify subjective experience and develop a formal hypothesis potentially testable in a future quantitative approach
  • “We hypothesized that when thinking about a past experience of help-seeking, a self distancing prompt would cause increased help-seeking intentions and more favorable help-seeking outcome expectations .” 25
  • “Conclusion
  • Although a priori hypotheses were not supported, further research is warranted as results indicate the potential for using self-distancing approaches to increasing help-seeking among some people with depressive symptomatology.” 25
  • EXAMPLE 3. Hypothesis-generating research to establish a framework for hypothesis testing (qualitative research)
  • “We hypothesize that compassionate care is beneficial for patients (better outcomes), healthcare systems and payers (lower costs), and healthcare providers (lower burnout). ” 26
  • Compassionomics is the branch of knowledge and scientific study of the effects of compassionate healthcare. Our main hypotheses are that compassionate healthcare is beneficial for (1) patients, by improving clinical outcomes, (2) healthcare systems and payers, by supporting financial sustainability, and (3) HCPs, by lowering burnout and promoting resilience and well-being. The purpose of this paper is to establish a scientific framework for testing the hypotheses above . If these hypotheses are confirmed through rigorous research, compassionomics will belong in the science of evidence-based medicine, with major implications for all healthcare domains.” 26
  • EXAMPLE 4. Statistical hypothesis (quantitative research)
  • - An assumption is made about the relationship among several population characteristics ( gender differences in sociodemographic and clinical characteristics of adults with ADHD ). Validity is tested by statistical experiment or analysis ( chi-square test, Students t-test, and logistic regression analysis)
  • “Our research investigated gender differences in sociodemographic and clinical characteristics of adults with ADHD in a Japanese clinical sample. Due to unique Japanese cultural ideals and expectations of women's behavior that are in opposition to ADHD symptoms, we hypothesized that women with ADHD experience more difficulties and present more dysfunctions than men . We tested the following hypotheses: first, women with ADHD have more comorbidities than men with ADHD; second, women with ADHD experience more social hardships than men, such as having less full-time employment and being more likely to be divorced.” 27
  • “Statistical Analysis
  • ( text omitted ) Between-gender comparisons were made using the chi-squared test for categorical variables and Students t-test for continuous variables…( text omitted ). A logistic regression analysis was performed for employment status, marital status, and comorbidity to evaluate the independent effects of gender on these dependent variables.” 27

EXAMPLES OF HYPOTHESIS AS WRITTEN IN PUBLISHED ARTICLES IN RELATION TO OTHER PARTS

  • EXAMPLE 1. Background, hypotheses, and aims are provided
  • “Pregnant women need skilled care during pregnancy and childbirth, but that skilled care is often delayed in some countries …( text omitted ). The focused antenatal care (FANC) model of WHO recommends that nurses provide information or counseling to all pregnant women …( text omitted ). Job aids are visual support materials that provide the right kind of information using graphics and words in a simple and yet effective manner. When nurses are not highly trained or have many work details to attend to, these job aids can serve as a content reminder for the nurses and can be used for educating their patients (Jennings, Yebadokpo, Affo, & Agbogbe, 2010) ( text omitted ). Importantly, additional evidence is needed to confirm how job aids can further improve the quality of ANC counseling by health workers in maternal care …( text omitted )” 28
  • “ This has led us to hypothesize that the quality of ANC counseling would be better if supported by job aids. Consequently, a better quality of ANC counseling is expected to produce higher levels of awareness concerning the danger signs of pregnancy and a more favorable impression of the caring behavior of nurses .” 28
  • “This study aimed to examine the differences in the responses of pregnant women to a job aid-supported intervention during ANC visit in terms of 1) their understanding of the danger signs of pregnancy and 2) their impression of the caring behaviors of nurses to pregnant women in rural Tanzania.” 28
  • EXAMPLE 2. Background, hypotheses, and aims are provided
  • “We conducted a two-arm randomized controlled trial (RCT) to evaluate and compare changes in salivary cortisol and oxytocin levels of first-time pregnant women between experimental and control groups. The women in the experimental group touched and held an infant for 30 min (experimental intervention protocol), whereas those in the control group watched a DVD movie of an infant (control intervention protocol). The primary outcome was salivary cortisol level and the secondary outcome was salivary oxytocin level.” 29
  • “ We hypothesize that at 30 min after touching and holding an infant, the salivary cortisol level will significantly decrease and the salivary oxytocin level will increase in the experimental group compared with the control group .” 29
  • EXAMPLE 3. Background, aim, and hypothesis are provided
  • “In countries where the maternal mortality ratio remains high, antenatal education to increase Birth Preparedness and Complication Readiness (BPCR) is considered one of the top priorities [1]. BPCR includes birth plans during the antenatal period, such as the birthplace, birth attendant, transportation, health facility for complications, expenses, and birth materials, as well as family coordination to achieve such birth plans. In Tanzania, although increasing, only about half of all pregnant women attend an antenatal clinic more than four times [4]. Moreover, the information provided during antenatal care (ANC) is insufficient. In the resource-poor settings, antenatal group education is a potential approach because of the limited time for individual counseling at antenatal clinics.” 30
  • “This study aimed to evaluate an antenatal group education program among pregnant women and their families with respect to birth-preparedness and maternal and infant outcomes in rural villages of Tanzania.” 30
  • “ The study hypothesis was if Tanzanian pregnant women and their families received a family-oriented antenatal group education, they would (1) have a higher level of BPCR, (2) attend antenatal clinic four or more times, (3) give birth in a health facility, (4) have less complications of women at birth, and (5) have less complications and deaths of infants than those who did not receive the education .” 30

Research questions and hypotheses are crucial components to any type of research, whether quantitative or qualitative. These questions should be developed at the very beginning of the study. Excellent research questions lead to superior hypotheses, which, like a compass, set the direction of research, and can often determine the successful conduct of the study. Many research studies have floundered because the development of research questions and subsequent hypotheses was not given the thought and meticulous attention needed. The development of research questions and hypotheses is an iterative process based on extensive knowledge of the literature and insightful grasp of the knowledge gap. Focused, concise, and specific research questions provide a strong foundation for constructing hypotheses which serve as formal predictions about the research outcomes. Research questions and hypotheses are crucial elements of research that should not be overlooked. They should be carefully thought of and constructed when planning research. This avoids unethical studies and poor outcomes by defining well-founded objectives that determine the design, course, and outcome of the study.

Disclosure: The authors have no potential conflicts of interest to disclose.

Author Contributions:

  • Conceptualization: Barroga E, Matanguihan GJ.
  • Methodology: Barroga E, Matanguihan GJ.
  • Writing - original draft: Barroga E, Matanguihan GJ.
  • Writing - review & editing: Barroga E, Matanguihan GJ.
  • Open access
  • Published: 10 October 2012

Approaches to informed consent for hypothesis-testing and hypothesis-generating clinical genomics research

  • Flavia M Facio,
  • Julie C Sapp,
  • Amy Linn &
  • Leslie G Biesecker

BMC Medical Genomics volume 5, Article number: 45 (2012)


Massively-parallel sequencing (MPS) technologies create challenges for informed consent of research participants given the enormous scale of the data and the wide range of potential results.

We propose that the consent process in these studies be based on whether they use MPS to test a hypothesis or to generate hypotheses. To demonstrate the differences in these approaches to informed consent, we describe the consent processes for two MPS studies. The purpose of our hypothesis-testing study is to elucidate the etiology of rare phenotypes using MPS. The purpose of our hypothesis-generating study is to test the feasibility of using MPS to generate clinical hypotheses, and to approach the return of results as an experimental manipulation. Issues to consider in both designs include: volume and nature of the potential results, primary versus secondary results, return of individual results, duty to warn, length of interaction, target population, and privacy and confidentiality.

The categorization of MPS studies as hypothesis-testing versus hypothesis-generating can help to clarify the issue of so-called incidental or secondary results for the consent process, and aid the communication of the research goals to study participants.


Advances in DNA sequencing technologies and concomitant cost reductions have made the use of massively-parallel sequencing (MPS) in clinical research practicable for many researchers. Implementations of MPS include whole genome sequencing and whole exome sequencing, which we consider equivalent for the purposes of informed consent. A challenge for researchers employing these technologies is to develop appropriate informed consent [1, 2], given the enormous amount of information generated for each research participant and the wide range of medically-relevant genetic results. Most of the informed consent challenges raised by MPS are not novel; what is novel is the scale and scope of genetic interrogation, and the opportunity to develop novel clinical research paradigms.

Massively-parallel sequencing has the capacity to detect nearly any disease-causing gene variant, including variants for late-onset disorders, such as neurologic or cancer-susceptibility syndromes; subclinical disease or endophenotypes, such as impaired fasting glucose; and heterozygous carrier states for traits inherited in a recessive pattern. Not only is the range of disorders broad, but the variants span a wide range of relative risks, from very high to nearly zero. This is a key distinction of MPS when compared to common SNP variant detection (using so-called gene chips). Because some variants discovered by MPS can be highly penetrant, their detection can have enormous medical and counseling impact. While many of these informed consent issues have been addressed previously [1, 3], the use of MPS in clinical research combines these issues on a scale that is orders of magnitude greater than previous study designs.

The initial clinical research uses of MPS were a brute-force approach to the identification of mutations for rare Mendelian disorders [4]. This is a variation of positional cloning (also known as gene mapping) and thus a form of classical hypothesis-testing research. The hypothesis is that the phenotype under study is caused by a genetic variant, and a suite of techniques (in this case MPS) is employed to identify that causative variant. The application of this technology in this setting holds great promise and will identify causative gene variants for numerous traits, with some predicting that the majority of Mendelian disorders will be elucidated in 5–10 years.

The second of these pathways to discovery is a more novel approach of generating and then sifting MPS results as the raw material for generating clinical hypotheses, which are in turn used to design clinical experiments to discover the phenotype associated with a given genotype. We term this approach hypothesis-generating clinical genomics. These hypothesis-generating studies require a consent process that provides the participant with an understanding of the scale and scope of the interrogation, based on a contextual understanding of the goal and overall organization of the research, since specific risks and benefits can be difficult to delineate [5, 6]. Importantly, participants need to understand that the researcher is exploring their genomes in an open-ended fashion, that the goal of the experiment is not predictable at the outset, and that the participant will be presented with downstream situations that are not currently foreseeable.

We outline here our approaches to informed consent for our hypothesis-testing and hypothesis-generating MPS research studies. We propose that the consent process be tailored depending on which of these two designs is used, and whether the research aims include study of the return of results.

General issues regarding return of results

Participants in our protocols have the option to learn their potentially clinically relevant genetic variant results. The issue of return of results is controversial, and the theoretical arguments for and against returning results have been extensively debated [7]. Although an increasing body of literature describes the approaches taken by a few groups, no clear consensus exists in either the clinical genomics or bioethics community [8]. At one end of the spectrum are those who argue that no results should be returned [9]; at the other, some contend that the entire sequence should be presented to the research participant [10–12]. In between these extremes lies a qualified or intermediate disclosure policy [13, 14]. We take the intermediate position in both of our protocols by giving research participants the choice to receive results, including variants deemed to be clinically actionable [3, 15]. Additionally, both protocols are investigating participants’ intentions towards receiving different types of results in order to inform the disclosure policies within the projects and in the broader community [16]. Because one of our research goals is to study the issues surrounding return of results, it is appropriate and necessary to return results. Thus, the following discussion focuses on issues pertinent to studies that plan to return results.

Issues to consider

Issue #1: Primary versus secondary variant results and the open-ended nature of clinical genomics

In our hypothesis-testing study we distinguish variants as either primary or secondary variants, the distinction reflecting the purpose of the study. A primary variant is a mutation that causes the phenotype that is under study, i.e., the hypothesis that is being tested in the study. A secondary variant is any mutation result not related to the disorder under study, but discovered as part of the quest for the primary variant.

We prefer the term ‘secondary’ to ‘incidental’ because the latter is an adjective indicating chance occurrence, and the discovery of a disease-causing mutation by MPS cannot be considered a chance occurrence. The word ‘incidental’ also suggests a lesser degree of importance or impact, and it is important to recognize that secondary findings can be of greater medical or personal impact than primary findings.

The consent discussion about results potentially available from participation in a hypothesis-testing study is framed in terms of the study goal, and we assume a high degree of alignment between participants’ goals and the researchers’ aims with respect to primary variants. Participants are, in general, highly motivated to learn the primary variant result and we presume that this motivation contributed to their decision to enroll in the study, similar to motivations for those who have been involved in positional cloning studies. This motivation may not hold for secondary variants, but our approach is to offer them the opportunity to learn secondary and actionable variants that may substantially alter susceptibility to, or reproductive risk for, disease.

In the hypothesis-generating study design, no categorical distinction (primary vs. secondary) is made among pathogenic variants; i.e., all variants are treated the same, without the label of ‘primary’ or ‘secondary’. This is because we are not using MPS to uncover genetic variants for a specific disease, and any of the variants could potentially be used for hypothesis generation. We suggest that this is the most novel issue with respect to informed consent, as the study is open-ended regarding its goals and downstream research activities. This is challenging for informed consent because it is impossible to know, at the time of enrollment and consent, what types of hypotheses may be generated.

Because the downstream research topics and activities are impossible to predict in hypothesis-generating research, subjects must initially be consented to the open-ended nature of the project. During the course of the study, they must be iteratively re-consented as hypotheses are generated from the genomic data and more specific follow-up studies are designed and proposed to test those newly generated hypotheses. These downstream, iterative consents will vary in their formality and in the degree to which they need to be reviewed and approved. Some general procedures can be approved in advance; for example, it may be anticipated that segregation studies would be useful to determine causality for sequence variants, or the investigator may simply wish to obtain additional targeted medical or family history from the research subject. Such activities could be approved prospectively by the IRB, with the iterative consent consisting of a verbal discussion with the subject about the nature of the trait for which the segregation analysis or additional information is sought. More specific, more invasive, or riskier iterative analyses would necessitate review and approval by the IRB with written informed consent, as sketched below.
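
To make the tiering concrete, the following minimal sketch encodes the two levels of re-consent formality as a single decision rule. It is an illustration only: the activity names and the pre-approved set are hypothetical placeholders, not the protocol’s actual vocabulary.

```python
# A minimal sketch of the tiered re-consent logic described above.
# Activity names and the pre-approved set are hypothetical labels.

def reconsent_formality(activity: str, invasive_or_risky: bool) -> str:
    """Map a downstream research activity to the consent formality it needs."""
    prospectively_approved = {"segregation_analysis", "targeted_history"}  # assumed examples
    if activity in prospectively_approved and not invasive_or_risky:
        # General procedures the IRB can approve in advance: a verbal
        # discussion of the trait under study suffices.
        return "verbal iterative consent (prospectively IRB-approved)"
    # More specific, invasive, or risky analyses need fresh review.
    return "IRB review plus written informed consent"

print(reconsent_formality("segregation_analysis", invasive_or_risky=False))
print(reconsent_formality("imaging_substudy", invasive_or_risky=True))
```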

Informed consent approach

The informed consent process must reflect the fundamental study design distinction of hypothesis-testing versus hypothesis-generating clinical genomics research. For the latter, the challenge is to help the research subjects understand that they are enrolling in a study that could lead to innumerable downstream research activities and goals. The informed consent process must be, like the research, iterative, and involve ongoing communication and consent with respect to those downstream activities.

Issue #2: Volume and nature of information

Whole genome sequencing can elucidate an enormous number of variations for a given individual. A typical whole genome sequence yields roughly 4,000,000 sequence variants. A whole exome sequence limits the interrogation to the coding regions of genes (about 1–1.5% of the genome) and typically generates 30,000–50,000 gene variants. While most are benign or of unknown consequence, some are associated with a significantly increased risk of disease for the individual and/or their family members. For example, the typical human is a carrier of three to five deleterious genetic variants or mutations that cause severe recessive diseases [17, 18]. In addition, there are over 30 known cancer susceptibility syndromes, which in aggregate may affect more than 1 in 500 patients, and the sequence variants that cause these disorders can be readily detected with MPS. These variants can have extremely high relative risks: for some disorders, a rare variant can be associated with a relative risk of greater than 1,000. This is in contrast with common SNP typing, which detects variants associated with small relative risks (typically on the order of 1.2–1.5). It is arguable whether the latter type of variant has any clinical utility as an individual test.
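
The magnitudes above can be laid out as explicit arithmetic. The short sketch below only re-states the round figures quoted in this paragraph; none of its numbers are new data.

```python
# Back-of-envelope arithmetic using only the round figures quoted above.

genome_variants = 4_000_000           # typical whole-genome yield
exome_variants = (30_000, 50_000)     # typical whole-exome yield
carriers_per_person = (3, 5)          # severe recessive carrier variants per person
cancer_syndrome_prevalence = 1 / 500  # aggregate of >30 known syndromes

low, high = exome_variants
print(f"Exome yield is {low / genome_variants:.2%}-{high / genome_variants:.2%} "
      f"of the whole-genome yield")
print(f"Every participant is expected to carry {carriers_per_person[0]}-"
      f"{carriers_per_person[1]} severe recessive variants")
print(f"Aggregate cancer-syndrome rate: about {cancer_syndrome_prevalence:.2%} of patients")

# Rare, highly penetrant variants can carry relative risks above 1,000,
# versus roughly 1.2-1.5 for typical SNP-chip associations.
print(f"Relative-risk gap: >{1000 / 1.5:,.0f}-fold")
```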

Conveying the full scope of genomic interrogation planned for each sample and the volume of information generated for a given participant is impossible. The goal and challenge in this instance is to give the participant as realistic a picture as possible of the likely amount of clinically actionable results the technology can generate. Our approach is two-fold: to give the subjects the clear message that the number and nature of the findings are enormous and impossible to describe comprehensively, and to use illustrative examples of the spectrum of these results.

To provide examples, we bin genetic variants into broad categories, as follows: heterozygous carriers of genetic variants implicated in recessive conditions (e.g., CFTR p.Phe508del and cystic fibrosis); variants that cause a treatable disorder that may be present, but asymptomatic or undiagnosed (e.g., LDLR p.Trp87X, familial hypercholesterolemia); variants that predispose to later-onset conditions (e.g., BRCA2 c.5946delT (commonly known as c.6174delT), breast and ovarian cancer susceptibility); variants that predispose to late-onset but untreatable disorders (e.g., frontotemporal dementia MAPT p.Pro301Leu).
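
Since these bins anchor the consent discussion, it can help to see them as a simple lookup structure. The sketch below uses exactly the four categories and exemplar variants named above; the dictionary representation itself is our illustration, not part of any consent material.

```python
# The four example bins from the paragraph above, as a lookup table.
# Category labels paraphrase the text; variants are the text's examples.

variant_bins = {
    "recessive carrier state": ("CFTR p.Phe508del", "cystic fibrosis"),
    "treatable, possibly undiagnosed": ("LDLR p.Trp87X", "familial hypercholesterolemia"),
    "later-onset predisposition": ("BRCA2 c.5946delT", "breast/ovarian cancer susceptibility"),
    "late-onset, untreatable": ("MAPT p.Pro301Leu", "frontotemporal dementia"),
}

for category, (variant, condition) in variant_bins.items():
    print(f"{category:32s} e.g. {variant} -> {condition}")
```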

Additionally, the scale and scope of the results make it a near certainty that all participants will be found to harbor disease-causing mutations, because the interrogation of all genes brings to light the fact that the average human carries 3–5 recessive deleterious genes, in addition to risks for later-onset or incompletely penetrant dominant disorders. This reality can be unsettling and surprising to research subjects, and we believe it is important to address it early in the process, not downstream in the iterative phase. It is essential for participants to choose whether MPS research is appropriate for them, taking into account their personal views and values.

Communicate to participants both the overwhelming scale and scope of the genetic results they may opt to receive, and provide them with specific disease examples that illustrate the kinds of decisions they may need to make as results become available. These examples should also assist research subjects in deciding whether to participate in the study and, if so, in anticipating the kinds of decisions they may need to make in the future.

Issue #3: Return of individual genotype results

The return of individual genotype results from MPS presents a new challenge in the clinical research environment, again because of the scale and breadth of the results. The genetic and medical counseling can be challenging because of the volume of results generated, participants’ expectations, the many different categories of results, and the length of time over which the information becomes available. We suggest that the most reasonable practice is to take a conservative approach and disclose only clinically actionable results. To this end, the absence of a deleterious gene variant (i.e., a negative result) would not be disclosed to research participants. It is our understanding that it is mandatory to validate, in a CLIA-certified laboratory, any individual results that are returned to research subjects. Using current clinical practice as a standard or benchmark, we suggest that until other approaches are shown to be appropriate and effective, disclosure should take place during a face-to-face encounter involving a multidisciplinary team (geneticist, genetic counselor, and specialists on an ad-hoc basis depending on the phenotype in question).

During the initial consent, participants are alerted to the fact that in the future the study team will contact them by telephone and their previously-stated preferences and impressions about receiving primary and secondary variant results will be reviewed. The logistics and details of this future conversation feature prominently in the initial informed consent session, as it is challenging to make and to receive such calls. Participants make a choice to learn or not learn a result each time a result becomes available. Once a participant makes the decision to learn a genotype result, the variant is confirmed in a CLIA lab, and a report is generated. The results are communicated to the participant during a face-to-face meeting with a geneticist and genetic counselor, and with the participation of other specialists depending on the case and the participant’s preferences. These phone discussions are seen as an extension of the initial informed consent process and as opportunities for the participants to make decisions in a more relevant and current context (compared to the original informed consent session). We see this as an iterative approach to consent, also known as circular consent [ 5 ]. Participants who opt not to learn a specific result can still be contacted later if other results become available, unless they choose not to be contacted by us any longer.
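
The workflow just described follows a fixed order: phone recontact, a fresh per-result choice, CLIA confirmation, then in-person disclosure. A minimal, self-contained sketch of that loop follows; the class, function names, and result identifiers are our own stand-ins, not the protocol’s actual systems.

```python
from dataclasses import dataclass, field

# Illustrative sketch of the per-result disclosure loop described above.

@dataclass
class Participant:
    pid: str
    declined_all_contact: bool = False
    opted_in: set = field(default_factory=set)  # results the participant chose to learn

def clia_confirm(result_id: str) -> bool:
    """Stand-in for confirming a research finding in a CLIA-certified lab."""
    return True  # assume confirmation succeeds for this sketch

def handle_new_result(p: Participant, result_id: str) -> str:
    if p.declined_all_contact:
        return "no contact"
    # Phone call: review previously stated preferences (an extension of consent).
    if result_id not in p.opted_in:
        # A fresh choice is made per result; declining one result does not
        # prevent recontact when later results become available.
        return "deferred"
    if not clia_confirm(result_id):
        return "not returned"
    # Face-to-face disclosure with geneticist, counselor, ad-hoc specialists.
    return "disclosed in person"

p = Participant("P001", opted_in={"LDLR_p.Trp87X"})
print(handle_new_result(p, "LDLR_p.Trp87X"))    # disclosed in person
print(handle_new_result(p, "MAPT_p.Pro301Leu")) # deferred
```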

This approach to returning results is challenged by the hypothesis-generating genomics research approach. Participants in our hypothesis-generating protocol are not asked to make a decision about learning individual genotype results at the time of consent, because we cannot know the nature of a future potential finding at the time of the original consent. Rather, they are engaged in a discussion of what they currently imagine their preferences might be at some future date, again using exemplar disorders and hypothetical scenarios of hypothesis-generating studies.

In the hypothesis-generating study, we have distinct approaches for variants in known disease-causing genes versus variants in genes that are hypothesized to cause disease (the latter being the operative hypothesis-generating activity). For the former, results are handled in a manner quite similar to the hypothesis-testing study. In the latter case, the participant may be asked whether they would be willing to return for further phenotyping to help us determine the nature of the variant of uncertain clinical significance (VUCS). The participant is typically informed that they have a sequence variant and that we would like to learn, through clinical research, whether this variant has any phenotypic or clinical significance. It is emphasized that current knowledge does not show that the variant causes any phenotype and that the chances are high that the variant is benign. However, neither the gene nor the sequence variant is disclosed, and the research finding is not confirmed in a CLIA-certified lab. This type of VUCS would only be communicated back to the participant if the clinical research showed that the variant was causative, if the return of the result was determined to be medically appropriate by our Mutation Advisory Committee, and following confirmation in a CLIA-certified laboratory.
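
The conditions for disclosing a VUCS thus act as three sequential gates. A hedged sketch follows; the function and flag names are our shorthand for the conditions named in the text.

```python
# Gates for returning a variant of uncertain clinical significance (VUCS)
# in the hypothesis-generating study, per the text: the follow-up research
# shows causation, the Mutation Advisory Committee deems return medically
# appropriate, and a CLIA-certified lab confirms the finding.

def may_disclose_vucs(shown_causative: bool,
                      committee_approved: bool,
                      clia_confirmed: bool) -> bool:
    # All three gates must pass before the gene or variant is disclosed.
    return shown_causative and committee_approved and clia_confirmed

print(may_disclose_vucs(True, True, True))   # True: result may be returned
print(may_disclose_vucs(True, False, True))  # False: committee gate fails
```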

For the return of mutations in known disease-causing genes, the initial consent cannot comprehensively inform subjects of the nature of the diseases, because of the scale and scope of the potential results. Instead, exemplars are given to elicit general preferences, which are then affirmed or refined at the time results become available. Hypothesis-generating studies require that subjects receive sufficient information to make an informed choice about participation in each specific follow-up study, with return of individual results only once a cause-and-effect relationship is established, and with appropriate oversight.

Issue #4: Duty to warn

Given the breadth of MPS gene interrogation, it is reasonable to anticipate that occasional participants will have mutations that pose a likely severe negative consequence, which we classify as “panic” results. This models clinical and research practice for the return of results such as a pulmonary mass or a high serum potassium level. In contrast to the above-mentioned autosomal recessive carrier states, which are expected to be nearly universal, genetic panic results should be uncommon. However, they should not be considered unanticipated: it is obvious that such variants will be detected, and the informed consent process should anticipate them. Examples would be deleterious variants for malignant hyperthermia or Long QT syndrome, each of which carries a substantial risk of sudden death that can be mitigated.

Both our hypothesis-testing and hypothesis-generating studies include mechanisms for participants to indicate the types of results that they wish to have returned to them. In the hypothesis-testing mode of research this is primarily to respect the autonomy of the participants; in addition, in the hypothesis-generating study we are assessing the motivations and interests of the subjects in various types of results and manipulating the return of results as an experimental aim. It is our clinical research experience that participants are challenged by making decisions regarding possible future results that are rare but potentially severe. As well, the medical and social contexts of the subjects evolve over time, and the consent that was obtained at enrollment may not be relevant or appropriate at a later time when such a result arises. This is particularly relevant for a research study that is ongoing for substantial periods of time (see also Issue #5, below).

To address these issues, we have consented the subjects to the potential return of “panic” results, irrespective of their preferences at the initial consent session. In effect, for some participants the consent process includes consenting to have a stated preference overridden.
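
Logically, this carve-out amounts to a single override in the preference check, as in the minimal sketch below (the is_panic flag is our illustrative assumption):

```python
# Minimal sketch of the duty-to-warn override described above: "panic"
# results (e.g., variants for malignant hyperthermia or Long QT syndrome)
# are returned irrespective of preferences recorded at initial consent.

def should_return(opted_in: bool, is_panic: bool) -> bool:
    return is_panic or opted_in

print(should_return(opted_in=False, is_panic=True))   # True: override applies
print(should_return(opted_in=False, is_panic=False))  # False: preference honored
```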

In both hypothesis-testing and hypothesis-generating research, it is important to outline circumstances in which the researchers’ duty to warn may result in the return of results contrary to the preferences of the subject. It is essential that subjects understand this approach to unusually severe mutation results. Subjects who are uncomfortable with this approach to the return of results are encouraged to decline enrollment.

Issue #5: Length of researcher and participant interaction

Approaches to MPS data are evolving rapidly, and it is anticipated that this ongoing research into the significance of DNA variants will continue for years or decades. The different purposes of the two study designs lead to different endpoints in terms of the researcher’s responsibility to analyze results. In our hypothesis-testing research, discussion of the relationship of the participants to the researchers is framed in terms of the discovery of the primary variant. We ask participants to be willing to interact with us for a period of months or years, as it is impossible for us to set a specific timeline for determining the cause of the disorder under investigation (if it is ever discovered). While attempts to elucidate the primary variant are underway, participants’ genomic data are periodically annotated using the most current bioinformatic methodologies available. We conceptualize our commitment to return re-annotated and updated results to participants as diminishing, but not disappearing, after the initial disclosure of results. As the primary aim of the study has been accomplished, less attention will be directed to the characterization of ancillary genomic data, yet we believe we retain an obligation to share highly clinically actionable findings with participants should we obtain them.

In the hypothesis-generating study, the researcher’s responsibility to annotate participants’ genomes/exomes is ongoing because, as noted above, one of the experimental aims is to study the motivations and interests of the subjects in these types of results; determining how this motivation and interest fare over time is an important research goal. During the informed consent discussion it is emphasized that the iterative nature of result interpretation will lead to multiple meetings for the disclosure of clinically actionable results, and that the participant may be contacted months or years after the date of enrollment. Additionally, it is outlined that the participant will make a choice about learning the result each time he/she is re-contacted about the availability of a research finding, and that a finding will only be confirmed in a CLIA-certified laboratory if the participant opts to learn the information. Participants who return to discuss results are reminded that they will be contacted in the future if and when other clinically actionable results are found for them.

Describe to participants the nature, mutual commitments, and duration of the researcher-participant relationship. For hypothesis-testing studies it is appropriate that the intensity of the clinical annotation of secondary variants may decline once the primary goal of the study is met. For hypothesis-generating studies, such interactions may continue for as long as there are variants to be further evaluated and the subject retains an interest in participating.

Issue #6: Target population

The informed consent process needs to take into account the target population in terms of their disease phenotype, their age, and whether the goal is to enroll individual participants or families. These considerations represent the greatest divergence in approaches to informed consent when comparing hypothesis-testing and hypothesis-generating research. In our two studies, the hypothesis-testing study focuses on rare diseases and often on family participation, whereas the hypothesis-generating study focuses on more common diseases and unrelated index cases. Possible study designs are innumerable, and investigators may adapt our approaches to informed consent for their own designs.

Our hypothesis-testing protocol enrolls both individual participants and families (most commonly trios), the latter being more common. In hypothesis-testing research, many participants are either affected by a genetic disease or are a close relative (typically a parent) of a person with a genetic disease. Research participants must weigh their hope for, and the personal meaning they ascribe to, learning the genetic cause of their disorder against the possibility of learning a significant amount of unanticipated information. Discussing and addressing the potential discrepancy between participants’ expectations of the value of their results and what they may realistically stand to learn (both desired and undesired information) is a central component of the informed consent process.

In our hypothesis-testing protocol, when parents are consenting on behalf of a minor child, we review with them the issues surrounding genetic testing of children and discuss their attitudes regarding their child’s autonomy and their parental decision-making values. Because family trios (most often mother-father-child) are enrolled together, we discuss how one individual’s preferences regarding results may be disrupted or superseded by another family member’s choice and communication of that individual’s knowledge.

In contrast, our hypothesis-generating protocol enrolls older, unrelated individuals as probands or primary participants [19]. Most participants are self-selected in terms of their decision to enroll and are not enrolled because they or a relative have a rare disease. Participants in the hypothesis-generating protocol are consented for future exploration of any and all possible phenotypes. This is a key distinguishing feature of this hypothesis-generating approach to research, which is a different paradigm: going from genotype to phenotype. The participants may be invited for additional phenotyping; in fact, multiple satellite studies are ongoing to evaluate various subsets of participants for different phenotypes. The key consent task for these subjects is to communicate the general approach at the outset: their genome will be explored, variants will be identified, and they may be re-contacted for potential follow-up studies to understand the relationship of a given variant to their phenotype. These subsequent consents for follow-up studies constitute an iterative consent process, similar to the Informed Cohort concept [20].

Hypothesis-generating research is a novel approach to clinical research design and requires an ongoing, iterative approach to informed consent. For hypothesis-testing research, a key informed consent issue is for subjects to balance the desire for information on the primary disease-causing mutation against the pros and cons of obtaining possibly undesired information on secondary variants.

Issue #7: Privacy and confidentiality

In MPS studies, privacy and confidentiality present a complex and multifaceted set of issues. Potential challenges include the deposition of genetic and phenotypic data in public databases, the placement of CLIA-validated results in the individual’s medical chart, and the discovery of secondary variants in relatives of affected probands in family-based (typically hypothesis-testing) research.

The field of genomics has a tradition of depositing data in publicly accessible databases. Participants in our protocols are informed that the goal of sharing de-identified information in public databases is to advance research, and that methods are in place to maximize the privacy and confidentiality of personally identifiable information. However, the deposition of genomic-scale data for an individual participant, such as an MPS sequence, far exceeds the minimal amount of data needed to uniquely identify the sample [21, 22]. Therefore, participants should be made aware that the scale of the data could allow analysts to connect sequence data to individuals by matching variants in the deposited research data to other data from that person. As well, the public deposition of data is in some cases an irrevocable decision: once the data are deposited and distributed, it may be impossible to remove them from all computer servers should the subject decide to withdraw from the study.
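
A back-of-envelope, information-theoretic calculation makes the gap vivid. The sketch below is our own illustration, assuming independent and maximally informative biallelic SNPs; it is consistent with, but not drawn from, the cited re-identification analyses [21, 22].

```python
import math

# Back-of-envelope: how little data uniquely identifies a person, versus
# how much a deposited MPS dataset contains. Assumes ideal, independent,
# maximally informative biallelic SNPs; all numbers are our assumptions.

population = 8e9
bits_needed = math.log2(population)  # ~33 bits to single out one individual
bits_per_ideal_snp = 1.5             # genotype entropy of a biallelic site at p = 0.5
snps_needed = math.ceil(bits_needed / bits_per_ideal_snp)

print(f"~{bits_needed:.0f} bits, i.e. roughly {snps_needed} ideal independent SNPs")
# A whole-genome dataset carries millions of variant calls, exceeding this
# identification threshold by several orders of magnitude.
```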

Additionally, participants are informed that once a result is CLIA-certified, it is placed in the individual’s medical chart at the clinical research institution and may be accessible to third parties. Although there are state and federal laws to protect individuals against genetic discrimination, including the Genetic Information Nondiscrimination Act (GINA), the latter has not yet been tested in the courts. This is explained to participants up front at the time of enrollment, and a more detailed discussion takes place at the time of results disclosure. To offer additional protection in the event of a court subpoena, a Certificate of Confidentiality has been obtained for both the hypothesis-testing and hypothesis-generating protocols. The discussion surrounding privacy and confidentiality is approached in a similar manner in both protocols.

The third issue regarding confidentiality is that MPS can generate many results for each individual, and it is highly likely that some, if not all, of the variants detected in one research participant will be present in another research participant (e.g., a parent). This is again a consequence of the scale and breadth of MPS: the large number of variants that can be detected in each participant makes it exceedingly likely that their relatives share many of these variants and that the relatives’ genetic risks of rare diseases may be measurably altered. It is important to communicate to participants that such variants are likely to be detected, that they may have implications for other members of the family, and that the consented individuals, or their parents, may need to communicate those results to other family members.

The informed consent should include discussion of public deposition of data, the entry of CLIA-validated results into medical records, and the likely discovery of variants with implications for family members.

We describe an approach to the informed consent process as a mutual opportunity for researchers and participants to assess one another’s goals in MPS protocols that employ both hypothesis-generating and hypothesis-testing methodologies. The use of MPS in clinical research requires adaptation of established processes for human subjects protections. The potentially overwhelming scale of information generated by MPS necessitates that investigators and IRBs adapt traditional approaches to consent. Because nearly all subjects will have a clinically actionable result, investigators must implement a thoughtful plan for consent regarding results disclosure, including setting a threshold for the types of information that should be disclosed to participants.

While some of the informed consent issues for MPS are independent of the study design, others should be adapted based on whether the research study is employing MPS to test a hypothesis (i.e., find the cause of a rare condition in an affected cohort), or to generate hypotheses (i.e., find deleterious or potentially deleterious variants that warrant participant follow-up and further investigation). For example, the health-related attributes of the study cohort (healthy individuals versus disease patients) are likely to influence participants’ motivations and expectations of MPS, and in the case of a disease cohort create the need to dichotomize the genetic variants into primary and secondary. Conversely, issues inherent to MPS technology are central to the informed consent approach in both types of studies. The availability of MPS allows a paradigm shift in genetics research – no longer are investigators constrained to long-standing approaches of hypothesis-testing modes of research. The scale of MPS allows investigators to proceed from genotype to phenotype, and leads to new challenges for genetic and medical counseling. Research participants receiving results from MPS might not present with a personal and/or family history suggestive of conditions revealed by their genotypic variants, and consequently might not perceive their a priori risk to be elevated for those conditions.

Participants’ motivations to have whole genome/exome sequencing at this early stage are important to take into consideration in the informed consent process. Initial qualitative data suggest that individuals enroll in the hypothesis-generating study because of altruism in promoting research, and a desire to learn about genetic factors that contribute to their own health and disease risk [ 23 ]. Most participants expect that genomic information will improve the overall knowledge of disease causes and treatments. Moreover, data on research participants’ preferences to receive different types of genetic results suggest that they have strong intentions to receive all types of results [ 16 ]. However, they are able to discern between the types and quality of information they could learn, and demonstrate stronger attitudes to learn clinically actionable and carrier status results when compared to results that are uncertain or not clinically actionable. These findings provide initial insights into the value these early adopters place on information generated by high-throughput sequencing studies, and help us tailor the informed consent process to this group of individuals. However, more empirical data are needed to guide the informed consent process, including data on research participants’ ability to receive results for multiple disorders and traits.

Participants in both types of studies are engaged in a discussion of the complex and dynamic nature of genomic annotation so that they may make an informed decision about participation and may be aware of the need to revisit results learned at additional time points in the future. As well, we advocate a process whereby investigators retain some latitude with respect to the most serious, potentially life-threatening mutations. While it is mandatory to respect the autonomy of research subjects, this does not mean that investigators must accede to the research subject’s views of these “panic” results. In a paradoxical way, the research participant and the researcher can agree that the latter can maintain a small, but initially ambiguous degree of latitude with respect to these most serious variants. In the course of utilizing MPS technology for further elucidation of the genetic architecture of health and disease, it is imperative that research participants and researchers be engaged in a continuous discussion about the state of scientific knowledge and the types of information that could potentially be learned from MPS. Although resource-intensive, this “partnership model” [ 2 ] or informed cohort approach to informed consent promotes respect for participants, and allows evaluation of the benefits and harms of disclosure in a more timely and relevant manner.

We have here proposed a categorization of massively-parallel clinical genomics research studies as hypothesis-testing versus hypothesis-generating to help clarify the issue of so-called incidental or secondary results for the consent process, and to aid the communication of the research goals to study participants. By using this categorization and considering seven important features of this kind of research (primary versus secondary variant results and the open-ended nature of clinical genomics; volume and nature of information; return of individual genotype results; duty to warn; length of researcher and participant interaction; target population; and privacy and confidentiality), researchers can design an informed consent process that is open, transparent, and appropriately balances the risks and benefits of this exciting approach to heritable disease research.

This study was supported by funding from the Intramural Research Program of the National Human Genome Research Institute. The authors have no conflicts to declare.

References

1. Netzer C, Klein C, Kohlhase J, Kubisch C: New challenges for informed consent through whole genome array testing. J Med Genet. 2009, 46: 495-496. doi:10.1136/jmg.2009.068015.
2. McGuire AL, Beskow LM: Informed consent in genomics and genetic research. Annu Rev Genomics Hum Genet. 2010, 11: 361-381. doi:10.1146/annurev-genom-082509-141711.
3. Bookman EB, Langehorne AA, Eckfeldt JH, Glass KC, Jarvik GP, Klag M, Koski G, Motulsky A, Wilfond B, Manolio TA, Fabsitz RR, Luepker RV, NHLBI Working Group: Reporting genetic results in research studies: summary and recommendations of an NHLBI Working Group. Am J Med Genet A. 2006, 140: 1033-1040.
4. Ng PC, Kirkness EF: Whole genome sequencing. Methods Mol Biol. 2010, 628: 215-226. doi:10.1007/978-1-60327-367-1_12.
5. Mascalzoni D, Hicks A, Pramstaller P, Wjst M: Informed consent in the genomics era. PLoS Med. 2008, 5: e192. doi:10.1371/journal.pmed.0050192.
6. Rotimi CN, Marshall PA: Tailoring the process of informed consent in genetic and genomic research. Genome Med. 2010, 2: 20. doi:10.1186/gm141.
7. Bredenoord AL, Kroes HY, Cuppen E, Parker M, van Delden JJ: Disclosure of individual genetic data to research participants: the debate reconsidered. Trends Genet. 2011, 27: 41-47. doi:10.1016/j.tig.2010.11.004.
8. Kronenthal C, Delaney SK, Christman MF: Broadening research consent in the era of genome-informed medicine. Genet Med. 2012, 14: 432-436. doi:10.1038/gim.2011.76.
9. Forsberg JS, Hansson MG, Eriksson S: Changing perspectives in biobank research: from individual rights to concerns about public health regarding the return of results. Eur J Hum Genet. 2009, 17: 1544-1549. doi:10.1038/ejhg.2009.87.
10. Shalowitz DI, Miller FG: Disclosing individual results of clinical research: implications of respect for participants. JAMA. 2005, 294: 737-740. doi:10.1001/jama.294.6.737.
11. Fernandez CV, Kodish E, Weijer C: Informing study participants of research results: an ethical imperative. IRB. 2003, 25: 12-19.
12. McGuire AL, Lupski JR: Personal genome research: what should the participant be told? Trends Genet. 2010, 26: 199-201. doi:10.1016/j.tig.2009.12.007.
13. Wolf SM, Lawrenz FP, Nelson CA, Kahn JP, Cho MK, Clayton EW, Fletcher JG, Georgieff MK, Hammerschmidt D, Hudson K, Illes J, Kapur V, Keane MA, Koenig BA, Leroy BS, McFarland EG, Paradise J, Parker LS, Terry SF, Van Ness B, Wilfond BS: Managing incidental findings in human subjects research: analysis and recommendations. J Law Med Ethics. 2008, 36: 219-248. doi:10.1111/j.1748-720X.2008.00266.x.
14. Kohane IS, Taylor PL: Multidimensional results reporting to participants in genomic studies: getting it right. Sci Transl Med. 2010, 2: 37cm19. doi:10.1126/scitranslmed.3000809.
15. Fabsitz RR, McGuire A, Sharp RR, Puggal M, Beskow LM, Biesecker LG, Bookman E, Burke W, Burchard EG, Church G, Clayton EW, Eckfeldt JH, Fernandez CV, Fisher R, Fullerton SM, Gabriel S, Gachupin F, James C, Jarvik GP, Kittles R, Leib JR, O'Donnell C, O'Rourke PP, Rodriguez LL, Schully SD, Shuldiner AR, Sze RK, Thakuria JV, Wolf SM, Burke GL, National Heart, Lung, and Blood Institute working group: Ethical and practical guidelines for reporting genetic research results to study participants: updated guidelines from a National Heart, Lung, and Blood Institute working group. Circ Cardiovasc Genet. 2010, 3: 574-580. doi:10.1161/CIRCGENETICS.110.958827.
16. Facio FM, Fisher T, Eidem H, Brooks S, Linn A, Biesecker LG, Biesecker BB: Intentions to receive individual results from whole-genome sequencing among participants in the ClinSeq study. Eur J Hum Genet, in press.
17. Morton NE: The detection and estimation of linkage between the genes for elliptocytosis and the Rh blood type. Am J Hum Genet. 1956, 8: 80-96.
18. Morton NE: The mutational load due to detrimental genes in man. Am J Hum Genet. 1960, 12: 348-364.
19. Biesecker LG, Mullikin JC, Facio FM, Turner C, Cherukuri PF, Blakesley RW, Bouffard GG, Chines PS, Cruz P, Hansen NF, Teer JK, Maskeri B, Young AC, Manolio TA, Wilson AF, Finkel T, Hwang P, Arai A, Remaley AT, Sachdev V, Shamburek R, Cannon RO, Green ED, NISC Comparative Sequencing Program: The ClinSeq Project: piloting large-scale genome sequencing for research in genomic medicine. Genome Res. 2009, 19: 1665-1674. doi:10.1101/gr.092841.109.
20. Kohane IS, Mandl KD, Taylor PL, Holm IA, Nigrin DJ, Kunkel LM: Reestablishing the researcher-patient compact. Science. 2007, 316: 836-837. doi:10.1126/science.1135489.
21. Lin Z, Owen AB, Altman RB: Genomic research and human subject privacy. Science. 2004, 305: 183. doi:10.1126/science.1095019.
22. Homer N, Szelinger S, Redman M, Duggan D, Tembe W, Muehling J, Pearson JV, Stephan DA, Nelson SF, Craig DW: Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet. 2008, 4: e1000167. doi:10.1371/journal.pgen.1000167.
23. Facio FM, Brooks S, Loewenstein J, Green S, Biesecker LG, Biesecker BB: Motivators for participation in a whole-genome sequencing study: implications for translational genomics research. Eur J Hum Genet. 2011, 19: 1213-1217. doi:10.1038/ejhg.2011.123.

Pre-publication history

The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1755-8794/5/45/prepub


Author information

Authors and Affiliations

National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
Flavia M Facio, Julie C Sapp, Amy Linn & Leslie G Biesecker

Kennedy Krieger Institute, Baltimore, MD, USA
Amy Linn

Corresponding author

Correspondence to Leslie G Biesecker .

Additional information

Competing interests

LGB is an uncompensated consultant to, and collaborates with, the Illumina Corp.

Authors’ contributions

FMF and JCS drafted the initial manuscript. LGB organized and edited the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article

Facio, F.M., Sapp, J.C., Linn, A. et al. Approaches to informed consent for hypothesis-testing and hypothesis-generating clinical genomics research. BMC Med Genomics 5 , 45 (2012). https://doi.org/10.1186/1755-8794-5-45


Received : 07 November 2011

Accepted : 05 October 2012

Published : 10 October 2012

DOI : https://doi.org/10.1186/1755-8794-5-45


Keywords
  • Whole genome sequencing
  • Whole exome sequencing
  • Informed consent



  • J78 - Public Policy
  • Browse content in J8 - Labor Standards: National and International
  • J81 - Working Conditions
  • J88 - Public Policy
  • Browse content in K - Law and Economics
  • Browse content in K0 - General
  • K00 - General
  • Browse content in K1 - Basic Areas of Law
  • K14 - Criminal Law
  • K2 - Regulation and Business Law
  • Browse content in K3 - Other Substantive Areas of Law
  • K31 - Labor Law
  • Browse content in K4 - Legal Procedure, the Legal System, and Illegal Behavior
  • K40 - General
  • K41 - Litigation Process
  • K42 - Illegal Behavior and the Enforcement of Law
  • Browse content in L - Industrial Organization
  • Browse content in L0 - General
  • L00 - General
  • Browse content in L1 - Market Structure, Firm Strategy, and Market Performance
  • L10 - General
  • L11 - Production, Pricing, and Market Structure; Size Distribution of Firms
  • L13 - Oligopoly and Other Imperfect Markets
  • L14 - Transactional Relationships; Contracts and Reputation; Networks
  • L15 - Information and Product Quality; Standardization and Compatibility
  • L16 - Industrial Organization and Macroeconomics: Industrial Structure and Structural Change; Industrial Price Indices
  • L19 - Other
  • Browse content in L2 - Firm Objectives, Organization, and Behavior
  • L21 - Business Objectives of the Firm
  • L22 - Firm Organization and Market Structure
  • L23 - Organization of Production
  • L24 - Contracting Out; Joint Ventures; Technology Licensing
  • L25 - Firm Performance: Size, Diversification, and Scope
  • L26 - Entrepreneurship
  • Browse content in L3 - Nonprofit Organizations and Public Enterprise
  • L33 - Comparison of Public and Private Enterprises and Nonprofit Institutions; Privatization; Contracting Out
  • Browse content in L4 - Antitrust Issues and Policies
  • L40 - General
  • L41 - Monopolization; Horizontal Anticompetitive Practices
  • L42 - Vertical Restraints; Resale Price Maintenance; Quantity Discounts
  • Browse content in L5 - Regulation and Industrial Policy
  • L50 - General
  • L51 - Economics of Regulation
  • Browse content in L6 - Industry Studies: Manufacturing
  • L60 - General
  • L62 - Automobiles; Other Transportation Equipment; Related Parts and Equipment
  • L63 - Microelectronics; Computers; Communications Equipment
  • L66 - Food; Beverages; Cosmetics; Tobacco; Wine and Spirits
  • Browse content in L7 - Industry Studies: Primary Products and Construction
  • L71 - Mining, Extraction, and Refining: Hydrocarbon Fuels
  • L73 - Forest Products
  • Browse content in L8 - Industry Studies: Services
  • L81 - Retail and Wholesale Trade; e-Commerce
  • L83 - Sports; Gambling; Recreation; Tourism
  • L84 - Personal, Professional, and Business Services
  • L86 - Information and Internet Services; Computer Software
  • Browse content in L9 - Industry Studies: Transportation and Utilities
  • L91 - Transportation: General
  • L93 - Air Transportation
  • L94 - Electric Utilities
  • Browse content in M - Business Administration and Business Economics; Marketing; Accounting; Personnel Economics
  • Browse content in M1 - Business Administration
  • M11 - Production Management
  • M12 - Personnel Management; Executives; Executive Compensation
  • M14 - Corporate Culture; Social Responsibility
  • Browse content in M2 - Business Economics
  • M21 - Business Economics
  • Browse content in M3 - Marketing and Advertising
  • M31 - Marketing
  • M37 - Advertising
  • Browse content in M4 - Accounting and Auditing
  • M42 - Auditing
  • M48 - Government Policy and Regulation
  • Browse content in M5 - Personnel Economics
  • M50 - General
  • M51 - Firm Employment Decisions; Promotions
  • M52 - Compensation and Compensation Methods and Their Effects
  • M53 - Training
  • M54 - Labor Management
  • Browse content in N - Economic History
  • Browse content in N0 - General
  • N00 - General
  • N01 - Development of the Discipline: Historiographical; Sources and Methods
  • Browse content in N1 - Macroeconomics and Monetary Economics; Industrial Structure; Growth; Fluctuations
  • N10 - General, International, or Comparative
  • N11 - U.S.; Canada: Pre-1913
  • N12 - U.S.; Canada: 1913-
  • N13 - Europe: Pre-1913
  • N17 - Africa; Oceania
  • Browse content in N2 - Financial Markets and Institutions
  • N20 - General, International, or Comparative
  • N22 - U.S.; Canada: 1913-
  • N23 - Europe: Pre-1913
  • Browse content in N3 - Labor and Consumers, Demography, Education, Health, Welfare, Income, Wealth, Religion, and Philanthropy
  • N30 - General, International, or Comparative
  • N31 - U.S.; Canada: Pre-1913
  • N32 - U.S.; Canada: 1913-
  • N33 - Europe: Pre-1913
  • N34 - Europe: 1913-
  • N36 - Latin America; Caribbean
  • N37 - Africa; Oceania
  • Browse content in N4 - Government, War, Law, International Relations, and Regulation
  • N40 - General, International, or Comparative
  • N41 - U.S.; Canada: Pre-1913
  • N42 - U.S.; Canada: 1913-
  • N43 - Europe: Pre-1913
  • N44 - Europe: 1913-
  • N45 - Asia including Middle East
  • N47 - Africa; Oceania
  • Browse content in N5 - Agriculture, Natural Resources, Environment, and Extractive Industries
  • N50 - General, International, or Comparative
  • N51 - U.S.; Canada: Pre-1913
  • Browse content in N6 - Manufacturing and Construction
  • N63 - Europe: Pre-1913
  • Browse content in N7 - Transport, Trade, Energy, Technology, and Other Services
  • N71 - U.S.; Canada: Pre-1913
  • Browse content in N8 - Micro-Business History
  • N82 - U.S.; Canada: 1913-
  • Browse content in N9 - Regional and Urban History
  • N91 - U.S.; Canada: Pre-1913
  • N92 - U.S.; Canada: 1913-
  • N93 - Europe: Pre-1913
  • N94 - Europe: 1913-
  • Browse content in O - Economic Development, Innovation, Technological Change, and Growth
  • Browse content in O1 - Economic Development
  • O10 - General
  • O11 - Macroeconomic Analyses of Economic Development
  • O12 - Microeconomic Analyses of Economic Development
  • O13 - Agriculture; Natural Resources; Energy; Environment; Other Primary Products
  • O14 - Industrialization; Manufacturing and Service Industries; Choice of Technology
  • O15 - Human Resources; Human Development; Income Distribution; Migration
  • O16 - Financial Markets; Saving and Capital Investment; Corporate Finance and Governance
  • O17 - Formal and Informal Sectors; Shadow Economy; Institutional Arrangements
  • O18 - Urban, Rural, Regional, and Transportation Analysis; Housing; Infrastructure
  • O19 - International Linkages to Development; Role of International Organizations
  • Browse content in O2 - Development Planning and Policy
  • O23 - Fiscal and Monetary Policy in Development
  • O25 - Industrial Policy
  • Browse content in O3 - Innovation; Research and Development; Technological Change; Intellectual Property Rights
  • O30 - General
  • O31 - Innovation and Invention: Processes and Incentives
  • O32 - Management of Technological Innovation and R&D
  • O33 - Technological Change: Choices and Consequences; Diffusion Processes
  • O34 - Intellectual Property and Intellectual Capital
  • O38 - Government Policy
  • Browse content in O4 - Economic Growth and Aggregate Productivity
  • O40 - General
  • O41 - One, Two, and Multisector Growth Models
  • O43 - Institutions and Growth
  • O44 - Environment and Growth
  • O47 - Empirical Studies of Economic Growth; Aggregate Productivity; Cross-Country Output Convergence
  • Browse content in O5 - Economywide Country Studies
  • O52 - Europe
  • O53 - Asia including Middle East
  • O55 - Africa
  • Browse content in P - Economic Systems
  • Browse content in P0 - General
  • P00 - General
  • Browse content in P1 - Capitalist Systems
  • P10 - General
  • P16 - Political Economy
  • P17 - Performance and Prospects
  • P18 - Energy: Environment
  • Browse content in P2 - Socialist Systems and Transitional Economies
  • P26 - Political Economy; Property Rights
  • Browse content in P3 - Socialist Institutions and Their Transitions
  • P37 - Legal Institutions; Illegal Behavior
  • Browse content in P4 - Other Economic Systems
  • P48 - Political Economy; Legal Institutions; Property Rights; Natural Resources; Energy; Environment; Regional Studies
  • Browse content in P5 - Comparative Economic Systems
  • P51 - Comparative Analysis of Economic Systems
  • Browse content in Q - Agricultural and Natural Resource Economics; Environmental and Ecological Economics
  • Browse content in Q1 - Agriculture
  • Q10 - General
  • Q12 - Micro Analysis of Farm Firms, Farm Households, and Farm Input Markets
  • Q13 - Agricultural Markets and Marketing; Cooperatives; Agribusiness
  • Q14 - Agricultural Finance
  • Q15 - Land Ownership and Tenure; Land Reform; Land Use; Irrigation; Agriculture and Environment
  • Q16 - R&D; Agricultural Technology; Biofuels; Agricultural Extension Services
  • Browse content in Q2 - Renewable Resources and Conservation
  • Q25 - Water
  • Browse content in Q3 - Nonrenewable Resources and Conservation
  • Q32 - Exhaustible Resources and Economic Development
  • Q34 - Natural Resources and Domestic and International Conflicts
  • Browse content in Q4 - Energy
  • Q41 - Demand and Supply; Prices
  • Q48 - Government Policy
  • Browse content in Q5 - Environmental Economics
  • Q50 - General
  • Q51 - Valuation of Environmental Effects
  • Q53 - Air Pollution; Water Pollution; Noise; Hazardous Waste; Solid Waste; Recycling
  • Q54 - Climate; Natural Disasters; Global Warming
  • Q56 - Environment and Development; Environment and Trade; Sustainability; Environmental Accounts and Accounting; Environmental Equity; Population Growth
  • Q58 - Government Policy
  • Browse content in R - Urban, Rural, Regional, Real Estate, and Transportation Economics
  • Browse content in R0 - General
  • R00 - General
  • Browse content in R1 - General Regional Economics
  • R11 - Regional Economic Activity: Growth, Development, Environmental Issues, and Changes
  • R12 - Size and Spatial Distributions of Regional Economic Activity
  • R13 - General Equilibrium and Welfare Economic Analysis of Regional Economies
  • Browse content in R2 - Household Analysis
  • R20 - General
  • R23 - Regional Migration; Regional Labor Markets; Population; Neighborhood Characteristics
  • R28 - Government Policy
  • Browse content in R3 - Real Estate Markets, Spatial Production Analysis, and Firm Location
  • R30 - General
  • R31 - Housing Supply and Markets
  • R38 - Government Policy
  • Browse content in R4 - Transportation Economics
  • R40 - General
  • R41 - Transportation: Demand, Supply, and Congestion; Travel Time; Safety and Accidents; Transportation Noise
  • R48 - Government Pricing and Policy
  • Browse content in Z - Other Special Topics
  • Browse content in Z1 - Cultural Economics; Economic Sociology; Economic Anthropology
  • Z10 - General
  • Z12 - Religion
  • Z13 - Economic Sociology; Economic Anthropology; Social and Economic Stratification
  • Advance Articles
  • Editor's Choice
  • Author Guidelines
  • Submission Site
  • Open Access Options
  • Self-Archiving Policy
  • Why Submit?
  • About The Quarterly Journal of Economics
  • Editorial Board
  • Advertising and Corporate Services
  • Journals Career Network
  • Dispatch Dates
  • Journals on Oxford Academic
  • Books on Oxford Academic

Article Contents: I. Introduction; II. A Simple Framework for Discovery; III. Application and Data; IV. The Surprising Importance of the Face; V. Algorithm-Human Communication; VI. Evaluating These New Hypotheses; VII. Conclusion; Data Availability

Machine Learning as a Tool for Hypothesis Generation *

Jens Ludwig, Sendhil Mullainathan, Machine Learning as a Tool for Hypothesis Generation, The Quarterly Journal of Economics, Volume 139, Issue 2, May 2024, Pages 751–827, https://doi.org/10.1093/qje/qjad055


While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not. We illustrate the procedure with a concrete application: judge decisions about whom to jail. We begin with a striking fact: the defendant’s face alone matters greatly for the judge’s jailing decision. In fact, an algorithm given only the pixels in the defendant’s mug shot accounts for up to half of the predictable variation. We develop a procedure that allows human subjects to interact with this black-box algorithm to produce hypotheses about what in the face influences judge decisions. The procedure generates hypotheses that are both interpretable and novel: they are not explained by demographics (e.g., race) or existing psychology research, nor are they already known (even if tacitly) to people or experts. Though these results are specific, our procedure is general. It provides a way to produce novel, interpretable hypotheses from any high-dimensional data set (e.g., cell phones, satellites, online behavior, news headlines, corporate filings, and high-frequency time series). A central tenet of our article is that hypothesis generation is a valuable activity, and we hope this encourages future work in this largely “prescientific” stage of science.

Science is curiously asymmetric. New ideas are meticulously tested using data, statistics, and formal models. Yet those ideas originate in a notably less meticulous process involving intuition, inspiration, and creativity. The asymmetry between how ideas are generated versus tested is noteworthy because idea generation is also, at its core, an empirical activity. Creativity begins with “data” (albeit data stored in the mind), which are then “analyzed” (through a purely psychological process of pattern recognition). What feels like inspiration is actually the output of a data analysis run by the human brain. Despite this, idea generation largely happens off stage, something that typically happens before “actual science” begins. 1 Things are likely this way because there is no obvious alternative. The creative process is so human and idiosyncratic that it would seem to resist formalism.

That may be about to change because of two developments. First, human cognition is no longer the only way to notice patterns in the world. Machine learning algorithms can also find patterns, including patterns people might not notice themselves. These algorithms can work not just with structured, tabular data but also with the kinds of inputs that traditionally could only be processed by the mind, like images or text. Second, data on human behavior is exploding: second-by-second price and volume data in asset markets, high-frequency cellphone data on location and usage, CCTV camera and police bodycam footage, news stories, children’s books, the entire text of corporate filings, and so on. The kind of information researchers once relied on for inspiration is now machine readable: what was once solely mental data is increasingly becoming actual data. 2

We suggest that these changes can be leveraged to expand how hypotheses are generated. Currently, researchers do of course look at data to generate hypotheses, as in exploratory data analysis, but this depends on the idiosyncratic creativity of investigators who must decide what statistics to calculate. In contrast, we suggest capitalizing on the capacity of machine learning algorithms to automatically detect patterns, especially ones people might never have considered. A key challenge is that we require hypotheses that are interpretable to people. One important goal of science is to generalize knowledge to new contexts. Predictive patterns in a single data set alone are rarely useful; they become insightful when they can be generalized. Currently, that generalization is done by people, and people can only generalize things they understand. The predictors produced by machine learning algorithms are, however, notoriously opaque—hard-to-decipher “black boxes.” We propose a procedure that integrates these algorithms into a pipeline that results in human-interpretable hypotheses that are both novel and testable.

While our procedure is broadly applicable, we illustrate it in a concrete application: judicial decision making. Specifically we study pretrial decisions about which defendants are jailed versus set free awaiting trial, a decision that by law is supposed to hinge on a prediction of the defendant’s risk ( Dobbie and Yang 2021 ). 3 This is also a substantively interesting application in its own right because of the high stakes involved and mounting evidence that judges make these decisions less than perfectly ( Kleinberg et al. 2018 ; Rambachan et al. 2021 ; Angelova, Dobbie, and Yang 2023 ).

We begin with a striking fact. When we build a deep learning model of the judge—one that predicts whether the judge will detain a given defendant—a single factor emerges as having large explanatory power: the defendant’s face. A predictor that uses only the pixels in the defendant’s mug shot explains from one-quarter to nearly one-half of the predictable variation in detention. 4 Defendants whose mug shots fall in the bottom quartile of predicted detention are 20.4 percentage points more likely to be jailed than those in the top quartile. By comparison, the difference in detention rates between those arrested for violent versus nonviolent crimes is 4.8 percentage points. Notice what this finding is and is not. We are not claiming the mug shot predicts defendant behavior; that would be the long-discredited field of phrenology ( Schlag 1997 ). We instead claim the mug shot predicts judge behavior: how the defendant looks correlates strongly with whether the judge chooses to jail them. 5

Has the algorithm found something new in the pixels of the mug shot or simply rediscovered something long known or intuitively understood? After all, psychologists have been studying people’s reactions to faces for at least 100 years (Todorov et al. 2015; Todorov and Oh 2021), while economists have shown that judges are influenced by factors (like race) that can be seen from someone’s face (Arnold, Dobbie, and Yang 2018; Arnold, Dobbie, and Hull 2020). When we control for age, gender, race, skin color, and even the facial features suggested by previous psychology research (dominance, trustworthiness, attractiveness, and competence), none of these factors (individually or jointly) meaningfully diminishes the algorithm’s predictive power (see Figure I, Panel A). It is perhaps worth noting that the algorithm on its own does rediscover some of the signal from these features: in fact, collectively these known features explain 22.3% of the variation in predicted detention (see Figure I, Panel B). The key point is that the algorithm has discovered a great deal more as well.
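To make the comparison concrete, the following sketch mimics the incremental-R² exercise on synthetic data. Everything here is illustrative: the variable names, data-generating process, and coefficients are ours, not the article's actual code or estimates.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 5_000
known = rng.normal(size=(n, 4))        # stand-ins for age, race, skin color, etc.
# A hypothetical m(x) that overlaps partly, but only partly, with known features:
algo_pred = known @ np.array([0.2, 0.1, 0.1, 0.0]) + rng.normal(size=n)
detained = (algo_pred + rng.normal(size=n) > 0).astype(float)

def r2_of(X, y):
    """R^2 from a linear probability model of the detention decision."""
    return r2_score(y, LinearRegression().fit(X, y).predict(X))

print("known features alone:", round(r2_of(known, detained), 3))
print("known + algorithm:   ",
      round(r2_of(np.column_stack([known, algo_pred]), detained), 3))
```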

Figure I: Correlates of Judge Detention Decision and Algorithmic Prediction of Judge Decision

Panel A summarizes the explanatory power of a regression model in explaining judge detention decisions, controlling for the different explanatory variables indicated at left (shaded tiles), either on their own (dark circles) or together with the algorithmic prediction of the judge decisions (triangles). Each row represents a different regression specification. By “other facial features,” we mean variables that previous psychology research suggests matter for how faces influence people’s reactions to others (dominance, trustworthiness, competence, and attractiveness). Ninety-five percent confidence intervals around our R² estimates come from drawing 10,000 bootstrap samples from the validation data set. Panel B shows the relationship between the different explanatory variables as indicated at left by the shaded tiles with the algorithmic prediction itself as the outcome variable in the regressions. Panel C examines the correlation with judge decisions of the two novel hypotheses generated by our procedure about what facial features affect judge detention decisions: well-groomed and heavy-faced.

Perhaps we should control for something else? Figuring out that “something else” is itself a form of hypothesis generation. To avoid a possibly endless—and misleading—process of generating other controls, we take a different approach. We show mug shots to subjects and ask them to guess whom the judge will detain and incentivize them for accuracy. These guesses summarize the facial features people readily (if implicitly) believe influence jailing. Although subjects are modestly good at this task, the algorithm is much better. It remains highly predictive even after controlling for these guesses. The algorithm seems to have found something novel beyond what scientists have previously hypothesized and beyond whatever patterns people can even recognize in data (whether or not they can articulate them).

What, then, are the novel facial features the algorithm has discovered? If we are unable to answer that question, we will have simply replaced one black box (the judge’s mind) with another (an algorithmic model of the judge’s mind). We propose a solution whereby the algorithm can communicate what it “sees.” Specifically, our procedure begins with a mug shot and “morphs” it to create a mug shot that maximally increases (or decreases) the algorithm’s predicted detention probability. The result is pairs of synthetic mug shots that can be examined to understand and articulate what differs within the pairs. The algorithm discovers, and people name that discovery. In principle we could have just shown subjects actual mug shots with higher versus lower predicted detention odds. But faces are so rich that between any pair of actual mug shots, many things will happen to be different and most will be unrelated to detention (akin to the curse of dimensionality). Simply looking at pairs of actual faces can, as a result, lead to many spurious observations. Morphing creates counterfactual synthetic images that are as similar as possible except with respect to detention odds, to minimize extraneous differences and help focus on what truly matters for judge detention decisions.

Importantly, we do not generate hypotheses by looking at the morphs ourselves; instead, they are shown to independent study subjects (MTurk or Prolific workers) in an experimental design. Specifically, we showed pairs of morphed images and asked participants to guess which image the algorithm predicts to have higher detention risk. Subjects were given both incentives and feedback, so they had motivation and opportunity to learn the underlying patterns. While subjects initially guess the judge’s decision correctly from these morphed mug shots at about the same rate as they do when looking at “raw data,” that is, actual mug shots (modestly above the 50% random guessing mark), they quickly learn from these morphed images what the algorithm is seeing and reach an accuracy of nearly 70%. At the end, participants are asked to put words to the differences they see across images in each pair, that is, to name what they think are the key facial features the algorithm is relying on to predict judge decisions. Comfortingly, there is substantial agreement on what subjects see: a sizable share of subjects all name the same feature. To verify whether the feature they identify is used by the algorithm, a separate sample of subjects independently coded mug shots for this new feature. We show that the new feature is indeed correlated with the algorithm’s predictions. What subjects think they’re seeing is indeed what the algorithm is also “seeing.”

Having discovered a single feature, we can iterate the procedure—the first feature explains only a fraction of what the algorithm has captured, suggesting there are many other factors to be discovered. We again produce morphs, but this time hold the first feature constant: that is, we orthogonalize so that the pairs of morphs do not differ on the first feature. When these new morphs are shown to subjects, they consistently name a second feature, which again correlates with the algorithm’s prediction. Both features are quite important. They explain a far larger share of what the algorithm sees than all the other variables (including race and skin color) besides gender. These results establish our main goals: show that the procedure produces meaningful communication, and that it can be iterated.
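The orthogonalization step has a simple geometric reading: before decoding the next round of morphs, remove from each latent update its component along the direction associated with the already-named feature. A minimal sketch, with a randomly drawn stand-in for that direction:

```python
import torch

def project_out(step, d_feature):
    """Return a latent update with its component along d_feature removed,
    so the resulting morph pair holds that feature (approximately) fixed."""
    d = d_feature / d_feature.norm()
    return step - torch.dot(step, d) * d

step = torch.randn(512)           # candidate morph update
d_groomed = torch.randn(512)      # stand-in for the "well-groomed" direction
orthogonal_step = project_out(step, d_groomed)
print(torch.dot(orthogonal_step, d_groomed))  # ~0 by construction
```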

What are the two discovered features? The first can be called “well-groomed” (e.g., tidy, clean, groomed, versus unkempt, disheveled, sloppy look), and the second can be called “heavy-faced” (e.g., wide facial shape, puffier face, wider face, rounder face, heavier). These features are not just predictive of what the algorithm sees, but also of what judges actually do (Figure I, Panel C). We find that both well-groomed and heavy-faced defendants are more likely to be released, even controlling for demographic features and known facial features from psychology. Detention rates of defendants in the top and bottom quartile of well-groomedness differ by 5.5 percentage points (24% of the base rate), while the top versus bottom quartile difference in heavy-facedness is 7 percentage points (about 30% of the base rate). Both differences are larger than the 4.8 percentage point detention rate difference between those arrested for violent versus nonviolent crimes. Not only are these magnitudes substantial, these hypotheses are novel even to practitioners who work in the criminal justice system (in a public defender’s office and a legal aid society).

Establishing whether these hypotheses are truly causally related to judge decisions is obviously beyond the scope of the present article. But we nonetheless present a few additional findings that are at least suggestive. These novel features do not appear to be simply proxies for factors like substance abuse, mental health, or socioeconomic status. Moreover, we carried out a lab experiment in which subjects are asked to make hypothetical pretrial release decisions as if they were a judge. They are shown information about criminal records (current charge, prior arrests) along with mug shots that are randomly morphed in the direction of higher or lower values of well-groomed (or heavy-faced). Subjects tend to detain those with higher-risk structured variables (criminal records), all else equal, suggesting they are taking the task seriously. These same subjects, though, are also more likely to detain defendants who are less heavy-faced or well-groomed, even though these were randomly assigned.

Ultimately, though, this is not a study about well-groomed or heavy-faced defendants, nor are its implications limited to faces or judges. It develops a general procedure that can be applied wherever behavior can be predicted using rich (especially high-dimensional) data. Development of such a procedure has required overcoming two key challenges.

First, to generate interpretable hypotheses, we must overcome the notorious black box nature of most machine learning algorithms. Unlike with a regression, one cannot simply inspect the coefficients. A modern deep-learning algorithm, for example, can have tens of millions of parameters. Noninspectability is especially problematic when the data are rich and high dimensional since the parameters are associated with primitives such as pixels. This problem of interpretation is fundamental and remains an active area of research. 6 Part of our procedure here draws on the recent literature in computer science that uses generative models to create counterfactual explanations. Most of those methods are designed for AI applications that seek to automate tasks humans do nearly perfectly, like image classification, where predictability of the outcome (is this image of a dog or a cat?) is typically quite high. 7 Interpretability techniques are used to ensure the algorithm is not picking up on spurious signal. 8 We developed our method, which has similar conceptual underpinnings to this existing literature, for social science applications where the outcome (human behavior) is typically more challenging to predict. 9 To what degree existing methods (as they currently stand or with some modification) could perform as well or better in social science applications like ours is a question we leave to future work.

Second, we must overcome what we might call the Rorschach test problem. Suppose we, the authors, were to look at these morphs and generate a hypothesis. We would not know if the procedure played any meaningful role. Perhaps the morphs, like ink blots, are merely canvases onto which we project our creativity. 10 Put differently, a single research team’s idiosyncratic judgments lack the kind of replicability we desire of a scientific procedure. To overcome this problem, it is key that we use independent (nonresearcher) subjects to inspect the morphs. The fact that a sizable share of subjects all name the same discovery suggests that human-algorithm communication has occurred and the procedure is replicable, rather than reflecting some unique spark of creativity.

At the same time, the fact that our procedure is not fully automatic implies that it will be shaped and constrained by people. Human participants are needed to name the discoveries. So whole new concepts that humans do not yet understand cannot be produced. Such breakthroughs clearly happen (e.g., gravity or probability) but are beyond the scope of procedures like ours. People also play a crucial role in curating the data the algorithm sees. Here, for example, we chose to include mug shots. The creative acquisition of rich data is an important human input into this hypothesis generation procedure. 11

Our procedure can be applied to a broad range of settings and will be particularly useful for data that are not already intrinsically interpretable. Many data sets contain a few variables that already have clear, fixed meanings and are unlikely to lead to novel discoveries. In contrast, images, text, and time series are rich high-dimensional data with many possible interpretations. Just as there is an ocean of plausible facial features, these sorts of data contain a large set of potential hypotheses that an algorithm can search through. Such data are increasingly available and used by economists, including news headlines, legislative deliberations, annual corporate reports, Federal Open Market Committee statements, Google searches, student essays, résumés, court transcripts, doctors’ notes, satellite images, housing photos, and medical images. Our procedure could, for example, raise hypotheses about what kinds of news lead to over- or underreaction of stock prices, which features of a job interview increase racial disparities, or what features of an X-ray drive misdiagnosis.

Central to this work is the belief that hypothesis generation is a valuable activity in and of itself. Beyond whatever the value might be of our specific procedure and empirical application, we hope these results also inspire greater attention to this traditionally “prescientific” stage of science.

II. A Simple Framework for Discovery

We develop a simple framework to clarify the goals of hypothesis generation and how it differs from testing, how algorithms might help, and how our specific approach to algorithmic hypothesis generation differs from existing methods. 12

II.A. The Goals of Hypothesis Generation

What criteria should we use for assessing hypothesis generation procedures? Two common goals for hypothesis generation are ones that we ensure ex post. First is novelty. In our application, we aim to orthogonalize against known factors, recognizing that it may be hard to orthogonalize against all known hypotheses. Second, we require that hypotheses be testable ( Popper 2002 ). But what can be tested is hard to define ex ante, in part because it depends on the specific hypothesis and the potential experimental setups. Creative empiricists over time often find ways to test hypotheses that previously seemed untestable. 13 To these, we add two more: interpretability and empirical plausibility.

What do we mean by empirically plausible? Let y be some outcome of interest, which for simplicity we assume is binary, and let h(x) be some hypothesis that maps the features of each instance, x, to [0,1]. By empirical plausibility we mean some correlation between y and h(x). Our ultimate aim is to uncover causal relationships. But causality can only be known after causal testing. That raises the question of how to come up with ideas worth causally testing, and how we would recognize them when we see them. Many true hypotheses need not be visible in raw correlations. Those can only be identified with background knowledge (e.g., theory). Other procedures would be required to surface those. Our focus here is on searching for true hypotheses that are visible in raw correlations. Of course not every correlation will turn out to be a true hypothesis, but even in those cases, generating such hypotheses and then invalidating them can be a valuable activity. Debunking spurious correlations has long been one of the most useful roles of empirical work. Understanding what confounders produce those correlations can also be useful.
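One minimal way to formalize this criterion is the following; the threshold δ is our notation, not the article's:

```latex
\[
  h : \mathcal{X} \to [0,1], \qquad y \in \{0,1\}, \qquad
  h \text{ is empirically plausible} \iff
  \bigl|\operatorname{Corr}\bigl(y, h(x)\bigr)\bigr| \ge \delta
  \ \text{ for some } \delta > 0 .
\]
```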

We care about our final goal for hypothesis generation, interpretability, because science is largely about helping people make forecasts into new contexts, and people can only do that with hypotheses they meaningfully understand. Consider an uninterpretable hypothesis like “this set of defendants is more likely to be jailed than that set,” where we cannot articulate a reason why. From that hypothesis, nothing could be said about a new set of courtroom defendants. In contrast, an interpretable hypothesis like “skin color affects detention” has implications for other samples of defendants and for entirely different settings. We could ask whether skin color also affects, say, police enforcement choices or whether these effects differ by time of day. By virtue of being interpretable, these hypotheses let us use a wider set of knowledge (police may share racial biases; skin color is not as easily detected at night). 14 Interpretable descriptions let us generalize to novel situations, in addition to being easier to communicate to key stakeholders and lending themselves to interpretable solutions.

II.B. Human versus Algorithmic Hypothesis Generation

Human hypothesis generation has the advantage of generating hypotheses that are interpretable. By construction, the ideas that humans come up with are understandable by humans. But as a procedure for generating new ideas, human creativity has the drawback of often being idiosyncratic and not necessarily replicable. A novel hypothesis is novel exactly because one person noticed it when many others did not. A large body of evidence shows that human judgments have a great deal of “noise.” It is not just that different people draw different conclusions from the same observations, but the same person may notice different things at different times (Kahneman, Sibony, and Sunstein 2022). A large body of psychology research shows that people typically are not able to introspect and understand why they notice specific things on the occasions when they do notice them. 15

There is also no guarantee that human-generated hypotheses need be empirically plausible. The intuition is related to “overfitting.” Suppose that people look at a subset of all data and look for something that differentiates positive ( y  = 1) from negative ( y  = 0) cases. Even with no noise in y , there is randomness in which observations are in the data. That can lead to idiosyncratic differences between y  = 0 and y  = 1 cases. As the number of comprehensible hypotheses gets large, there is a “curse of dimensionality”: many plausible hypotheses for these idiosyncratic differences. That is, many different hypotheses can look good in sample but need not work out of sample. 16
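A small simulation makes the overfitting point tangible: when outcomes are pure noise, the best of many candidate hypotheses still looks good in sample and then collapses out of sample. The numbers below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_hypotheses = 200, 5_000
y_in = rng.integers(0, 2, size=n)           # in-sample outcomes (pure noise)
y_out = rng.integers(0, 2, size=n)          # fresh outcomes for new cases
H_in = rng.integers(0, 2, size=(n_hypotheses, n))   # each row: one hypothesis'
H_out = rng.integers(0, 2, size=(n_hypotheses, n))  # predictions, old/new cases

acc_in = (H_in == y_in).mean(axis=1)        # in-sample hit rate of each h
best = acc_in.argmax()                      # the hypothesis that "looks good"
print(f"best in-sample accuracy: {acc_in[best]:.2f}")                 # well above 0.5
print(f"same h, out of sample:   {(H_out[best] == y_out).mean():.2f}")  # ~0.5
```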

In contrast, supervised learning tools in machine learning are designed to generate predictions in new (out-of-sample) data. 17 That is, algorithms generate hypotheses that are empirically plausible by construction. 18 Moreover, machine learning can detect patterns in data that humans cannot. Algorithms can notice, for example, that livestock all tend to be oriented north (Begall et al. 2008), whether someone is about to have a heart attack based on subtle indications in an electrocardiogram (Mullainathan and Obermeyer 2022), or that a piece of machinery is about to break (Mobley 2002). We call these machine learning prediction functions m(x), which for a binary outcome y map to [0, 1].

The challenge is that most m(x) are not interpretable. For this type of statistical model to yield an interpretable hypothesis, its parameters must be interpretable. That can happen in some simple cases. For example, if we had a data set where each dimension of x was interpretable (such as individual structured variables in a tabular data set) and we used a predictor such as OLS (or LASSO), we could just read the hypotheses from the nonzero coefficients: which variables are significant? Even in that case, interpretation is challenging because machine learning tools, built to generate accurate predictions rather than apportion explanatory power across explanatory variables, yield coefficients that can be unstable across realizations of the data (Mullainathan and Spiess 2017). 19 Often interpretation is much less straightforward than that. If x is an image, text, or time series, the estimated models (such as convolutional neural networks) can have literally millions of parameters. The models are defined on granular inputs with no particular meaning: if we knew m(x) weighted a particular pixel, what have we learned? In these cases, the estimated model m(x) is not interpretable. Our focus is on these contexts where algorithms, as black-box models, are not readily interpreted.
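The instability point is easy to demonstrate. In the sketch below (illustrative data and tuning, not the article's), three predictors are near-copies of one latent signal; which of them LASSO "selects" can change from one draw of the data to the next.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
for draw in range(3):
    z = rng.normal(size=n)                       # one latent signal
    # Columns 0-2 are near-duplicates of z; the rest are pure noise.
    X = np.column_stack([z + 0.1 * rng.normal(size=n) for _ in range(3)] +
                        [rng.normal(size=n) for _ in range(p - 3)])
    y = z + rng.normal(size=n)
    coef = Lasso(alpha=0.1).fit(X, y).coef_
    print(f"draw {draw}: nonzero coefficients at columns "
          f"{np.flatnonzero(coef).tolist()}")
```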

Ideally one might marry people’s unique knowledge of what is comprehensible with an algorithm’s superior capacity to find meaningful correlations in data: to have the algorithm discover new signal and then have humans name that discovery. How to do so is not straightforward. We might imagine formalizing the set of interpretable prediction functions, and then focus on creating machine learning techniques that search over functions in that set. But mathematically characterizing those functions is typically not possible. Or we might consider seeking insight from a low-dimensional representation of face space, or “eigenfaces,” which are a common teaching tool for principal components analysis ( Sirovich and Kirby 1987 ). But those turn out not to provide much useful insight for our purposes. 20 In some sense it is obvious why: the subset of actual faces is unlikely to be a linear subspace of the space of pixels. If we took two faces and linearly interpolated them the resulting image would not look like a face. Some other approach is needed. We build on methods in computer science that use generative models to generate counterfactual explanations.

II.C. Related Methods

Our hypothesis generation procedure is part of a growing literature that aims to integrate machine learning into the way science is conducted. A common use (outside of economics) is in what could be called “closed world problems”: situations where the fundamental laws are known, but drawing out predictions is computationally hard. For example, the biochemical rules of how proteins fold are known, but it is hard to predict the final shape of a protein. Machine learning has provided fundamental breakthroughs, in effect by making very hard-to-compute outcomes computable in a feasible timeframe. 21

Progress has been far more limited with applications where the relationship between x and y is unknown (“open world” problems), like human behavior. First, machine learning here has been useful at generating unexpected findings, although these are not hypotheses themselves. Pierson et al. (2021) show that a deep-learning algorithm is better able to predict patient pain from an X-ray than clinicians can: there are physical knee defects that medicine currently does not understand. But that study is not able to isolate what those defects are. 22 Second, machine learning has also been used to explore investigator-generated hypotheses, such as Mullainathan and Obermeyer (2022) , who examine whether physicians suffer from limited attention when diagnosing patients. 23

Finally, a few papers take on the same problem that we do. Fudenberg and Liang (2019) and Peterson et al. (2021) have used algorithms to predict play in games and choices between lotteries. They inspected those algorithms to produce their insights. Similarly, Kleinberg et al. (2018) and Sunstein (2021) use algorithmic models of judges and inspect those models to generate hypotheses. 24 Our proposal builds on these papers. Rather than focusing on generating an insight for a specific application, we suggest a procedure that can be broadly used for many applications. Importantly, our procedure does not rely on researcher inspection of algorithmic output. When an expert researcher with a track record of generating scientific ideas uses some procedure to generate an idea, how do we know whether the result is due to the procedure or the researcher? By relying on a fixed algorithmic procedure that human subjects can interface with, hypothesis generation goes from being an idiosyncratic act of individuals to a replicable process.

III. Application and Data

III.A. Judicial Decision Making

Although our procedure is broadly applicable, we illustrate it through a specific application to the U.S. criminal justice system. We choose this application partly because of its social relevance. It is also an exemplar of the type of application where our hypothesis generation procedure can be helpful. Its key ingredients—a clear decision maker, a large number of choices (over 10 million people are arrested each year in the United States) that are recorded in data, and, increasingly, high-dimensional data that can also be used to model those choices, such as mug shot images, police body cameras, and text from arrest reports or court transcripts—are shared with a variety of other applications.

Our specific focus is on pretrial hearings. Within 24–48 hours after arrest, a judge must decide where the defendant will await trial, in jail or at home. This is a consequential decision. Cases typically take 2–4 months to resolve, sometimes up to 9–12 months. Jail affects people’s families, their livelihoods, and the chances of a guilty plea ( Dobbie, Goldin, and Yang 2018 ). On the other hand, someone who is released could potentially reoffend. 25

While pretrial decisions are by law supposed to hinge on the defendant’s risk of flight or rearrest if released ( Dobbie and Yang 2021 ), studies show that judges’ decisions deviate from those guidelines in a number of ways. For starters, judges seem to systematically mispredict defendant risk ( Jung et al. 2017 ; Kleinberg et al. 2018 ; Rambachan 2021 ; Angelova, Dobbie, and Yang 2023 ), partly because judges overweight the charge for which people are arrested ( Sunstein 2021 ). Judge decisions can also depend on extralegal factors like race ( Arnold, Dobbie, and Yang 2018 ; Arnold, Dobbie, and Hull 2020 ), whether the judge’s favorite football team lost ( Eren and Mocan 2018 ), weather ( Heyes and Saberian 2019 ), the cases the judge just heard ( Chen, Moskowitz, and Shue 2016 ), and if the hearing is on the defendant’s birthday ( Chen and Philippe 2023 ). These studies test hypotheses that some human being was clever enough to think up. But there remains a great deal of unexplained variation in judges’ decisions. The challenge of expanding the set of hypotheses for understanding this variation without losing the benefit of interpretability is the motivation for our own analysis here.

III.B. Administrative Data

We obtained data from Mecklenburg County, North Carolina, the second most populated county in the state (over 1 million residents) that includes North Carolina’s largest city (Charlotte). The county is similar to the rest of the United States in terms of economic conditions (2021 poverty rates were 11.0% versus 11.4%, respectively), although the share of Mecklenburg County’s population that is non-Hispanic white is lower than the United States as a whole (56.6% versus 75.8%). 26 We rely on three sources of administrative data: 27

The Mecklenburg County Sheriff’s Office (MCSO) publicly posts arrest data for the past three years, which provides information on defendant demographics like age, gender, and race, as well as the charge for which someone was arrested.

The North Carolina Administrative Office of the Courts (NCAOC) maintains records on the judge’s pretrial decisions (detain, release, etc.).

Data from the North Carolina Department of Public Safety includes information about the defendant’s prior convictions and incarceration spells, if any.

We also downloaded photos of the defendants from the MCSO public website (so-called mug shots), 28 which capture a frontal view of each person from the shoulders up in front of a gray background. These images are 400 pixels wide by 480 pixels high, but we pad them with a black boundary to be square 512 × 512 images to conform with the requirements of some of the machine learning tools. In Figure II , we give readers a sense of what these mug shots look like, with two important caveats. First, given concerns about how the overrepresentation of disadvantaged groups in discussions of crime can contribute to stereotyping ( Bjornstrom et al. 2010 ), we illustrate the key ideas of the paper using images for non-Hispanic white males. Second, out of sensitivity to actual arrestees, we do not wish to display actual mug shots (which are available at the MCSO website). 29 Instead, the article only shows mug shots that are synthetic, generated using generative adversarial networks as described in Section V.B .
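The padding step is straightforward; a minimal PIL sketch (the file name is hypothetical) centers the 400 × 480 photo on a black 512 × 512 canvas:

```python
from PIL import Image

def pad_to_square(img, size=512, fill=(0, 0, 0)):
    """Center an image on a black square canvas of the given size."""
    canvas = Image.new("RGB", (size, size), fill)
    canvas.paste(img, ((size - img.width) // 2, (size - img.height) // 2))
    return canvas

# mug = Image.open("mugshot.jpg")   # 400x480 in the raw data
# mug_512 = pad_to_square(mug)      # 512x512 with a black boundary
```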

Figure II: Illustrative Facial Images

This figure shows facial images that illustrate the format of the mug shots posted publicly on the Mecklenburg County, North Carolina, sheriff’s office website. These are not real mug shots of actual people who have been arrested, but are synthetic. Moreover, given concerns about how the overrepresentation of disadvantaged groups in discussions of crime can exacerbate stereotyping, we illustrate our key ideas using images for non-Hispanic white men. However, in our human intelligence tasks that ask participants to provide labels (ratings for different image features), we show images that are representative of the Mecklenburg County defendant population as a whole.

These data capture much of the information the judge has available at the time of the pretrial hearing, but not all of it. Both the judge and the algorithm see structured variables about each defendant like defendant demographics, current charge, and prior record. Because the mug shot (which the algorithm uses) is taken not long before the pretrial hearing, it should be a reasonable proxy for what the judge sees in court. The additional information the judge has but the algorithm does not includes the narrative arrest report from the police and what happens in court. While pretrial hearings can be quite brief in many jurisdictions (often not more than just a few minutes), the judge may nonetheless hear statements from police, prosecutors, defense lawyers, and sometimes family members. Defendants usually have their lawyers speak for them and do not say much at these hearings.

We downloaded 81,166 arrests made between January 18, 2017, and January 17, 2020, involving 42,353 unique defendants. We apply several data filters, like dropping cases without mug shots (Online Appendix Table A.I), leaving 51,751 observations. Because our goal is inference about new out-of-sample (OOS) observations, we partition our data as follows (a minimal sketch of the grouped split appears after the list):

A train set of N = 22,696 cases, constructed by taking arrests through July 17, 2019, grouping arrests by arrestee, 30 randomly selecting 70% to the training-plus-validation data set, then randomly selecting 70% of those arrestees for the training data specifically.

A validation set of N = 9,604 cases used to report OOS performance in the article’s main exhibits, consisting of the remaining 30% in the combined training-plus-validation data frame.

A lock-box hold-out set of N = 19,009 cases that we did not touch until the article was accepted for final publication, to avoid what one might call researcher overfitting: we run lots of models over the course of writing the article, and the results on the validation data set may overstate our findings. This data set consists of the N = 4,759 valid cases for the last six months of our data period (July 17, 2019, to January 17, 2020) plus a random sample of 30% of those arrested before July 17, 2019, so that we can present results that are OOS with respect to individuals and time. Once this article was officially accepted, we replicated the findings presented in our main exhibits (see Online Appendix D and Online Appendix Tables A.XVIII–A.XXXII). We see that our core findings are qualitatively similar. 31
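The essential feature of this partition is that it splits by arrestee, not by arrest, so the same person never appears on both sides. A minimal sketch (column names are hypothetical; the lock-box time cutoff is omitted):

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

def split_by_arrestee(df, test_size=0.3, seed=0):
    """70/30 split at the person level: all of an arrestee's cases land
    on one side of the split."""
    gss = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    train_idx, valid_idx = next(gss.split(df, groups=df["arrestee_id"]))
    return df.iloc[train_idx], df.iloc[valid_idx]

# Toy data to show the invariant:
df = pd.DataFrame({"arrestee_id": [1, 1, 2, 3, 3, 3, 4, 5],
                   "detained":    [0, 1, 0, 1, 0, 1, 0, 0]})
train_df, valid_df = split_by_arrestee(df)
assert not set(train_df["arrestee_id"]) & set(valid_df["arrestee_id"])
```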

Descriptive statistics are shown in Table I. Relative to the county as a whole, the arrested population substantially overrepresents men (78.7%) and Black residents (69.4%). The average age of arrestees is 31.8 years. Judges detain 23.3% of cases, and in 25.1% of arrests the person is rearrested before their case is resolved (about one-third of those released). Randomization of arrestees to the training versus validation data sets seems to have been successful, as shown in Table I. None of the pairwise comparisons has a p-value below .05 (see Online Appendix Table A.II). A permutation multivariate analysis of variance test of the joint null hypothesis that the training-validation differences for all variables are all zero yields p = .963. 32 A test of the same joint null hypothesis for the differences between the training sample and the lock-box hold-out data set (out of sample by individual) yields p = .537.
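The logic of such a permutation test is simple: compute an omnibus statistic for the observed split, then compare it to that statistic's distribution under random relabeling. The sketch below uses a plain sum-of-squared-mean-gaps statistic on standardized variables rather than the PERMANOVA used in the article; it conveys the idea, not the exact test.

```python
import numpy as np

def permutation_joint_test(A, B, n_perm=10_000, seed=0):
    """Permutation p-value for the joint null that group means are equal
    across all columns of A and B (rows = cases, columns = variables)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([A, B])
    pooled = (pooled - pooled.mean(0)) / pooled.std(0)   # standardize columns
    n_a = len(A)
    stat = lambda X, Y: ((X.mean(0) - Y.mean(0)) ** 2).sum()
    observed = stat(pooled[:n_a], pooled[n_a:])
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        hits += stat(pooled[idx[:n_a]], pooled[idx[n_a:]]) >= observed
    return hits / n_perm

# p = permutation_joint_test(train_vars, valid_vars)  # hypothetical arrays
```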

Table I: Summary Statistics for Mecklenburg County, NC Data, 2017–2020

Notes. This table reports descriptive statistics for our full data set and analysis subsets, which cover the period January 18, 2017, through January 17, 2020, from Mecklenburg County, NC. The lock-box hold-out data set consists of data from the last six months of our study period (July 17, 2019–January 17, 2020) plus a subset of cases through July 16, 2019, selected by randomly selecting arrestees. The remainder of the data set is then randomly assigned by arrestee to our training data set (used to build our algorithms) or to our validation set (which we use to report results in the article’s main exhibits). For additional details of our data filters and partitioning procedures, see Online Appendix Table A.I . We define pretrial release as being released on the defendant’s own recognizance or having been assigned and then posting cash bail requirements within three days of arrest. We define rearrest as experiencing a new arrest before adjudication of the focal arrest, with detained defendants being assigned zero values for the purposes of this table. Arrest charge categories reflect the most serious criminal charge for which a person was arrested, using the FBI Uniform Crime Reporting hierarchy rule in cases where someone is arrested and charged with multiple offenses. For analyses of variance for the test of the joint null hypothesis that the difference in means across each variable is zero, see Online Appendix Table A.II .

III.C. Human Labels

The administrative data capture many key features of each case but omit some other important ones. We solve these data insufficiency problems through a series of human intelligence tasks (HITs), which involve having study subjects on one of two possible platforms (Amazon’s Mechanical Turk or Prolific) assign labels to each case from looking at the mug shots. More details are in Online Appendix Table A.III . We use data from these HITs mostly to understand how the algorithm’s predictions relate to already-known determinants of human decision making, and hence the degree to which the algorithm is discovering something novel.

One set of HITs filled in demographic-related data: ethnicity; skin tone (since people are often stereotyped on skin color, or “colorism”; Hunter 2007 ), reported on an 18-point scale; the degree to which defendants appear more stereotypically Black on a 9-point scale ( Eberhardt et al. 2006 show this affects criminal justice decisions); and age, to compare to administrative data for label quality checks. 33 Because demographics tend to be easy for people to see in images, we collect just one label per image for each of these variables. To confirm one label is enough, we repeated the labeling task for 100 images but collected 10 labels for each image; we see that additional labels add little information. 34 Another data quality check comes from the fact that the distributions of skin color ratings do systematically differ by defendant race ( Online Appendix Figure A.III ).

A second type of HIT measured facial features that previous psychology research has shown affect human judgments. The specific set of facial features we focus on comes from the influential study by Oosterhof and Todorov (2008) of people’s perceptions of the facial features of others. When subjects are asked to provide descriptions of different faces, principal components analysis suggests just two dimensions account for about 80% of the variation: (i) trustworthiness and (ii) dominance. We also collected data on two other facial features shown to be associated with real-world decisions like hiring or whom to vote for: (iii) attractiveness and (iv) competence (Frieze, Olson, and Russell 1991; Little, Jones, and DeBruine 2011; Todorov and Oh 2021). 35

We asked subjects to rate images for each of these psychological features on a nine-point scale. Because psychological features may be less obvious than demographic features, we collected three labels per training–data set image and five per validation–data set image. 36 There is substantial variation in the ratings that subjects assign to different images for each feature (see Online Appendix Figure A.VI ). The ratings from different subjects for the same feature and image are highly correlated: interrater reliability measures (Cronbach’s α) range from 0.87 to 0.98 ( Online Appendix Figure A.VII ), similar to those reported in studies like Oosterhof and Todorov (2008) . 37 The information gain from collecting more than a few labels per image is modest. 38 For summary statistics, see Online Appendix Table A.IV .

Finally, we also tried to capture people’s implicit or tacit understanding of the determinants of judges’ decisions by asking subjects to predict which mug shot out of a pair would be detained, with images in each pair matched on gender, race, and five-year age brackets. 39 We incentivized study subjects for correct predictions and gave them feedback over the course of the 50 image pairs to facilitate learning. We treat the first 10 responses per subject as a “learning set” that we exclude from our analysis.

The first step of our hypothesis generation procedure is to build an algorithmic model of some behavior, in our case the judge’s detention decision. A sizable share of the predictable variation in judge decisions comes from a surprising source: the defendant’s face. Facial features implicated by past research explain just a modest share of this predictable variation. The algorithm appears to have made a novel discovery.

IV.A. What Drives Judge Decisions?

We begin by predicting judge pretrial detention decisions (y = 1 if detain, y = 0 if release) using all the inputs available (x). We use the training data set to construct two separate models for the two types of data available. We apply gradient-boosted decision trees to predict judge decisions using the structured administrative data (current charge, prior record, age, gender), m_s(x); for the unstructured data (raw pixel values from the mug shots), we train a convolutional neural network, m_u(x). Each model returns an estimate of y (a predicted detention probability) for a given x. Because these initial steps of our procedure use standard machine learning methods, we relegate their discussion to the Online Appendix.

We pool the signal from both models to form a single weighted-average model m_p(x) = β̂_s m_s(x) + β̂_u m_u(x) using a so-called stacking procedure, where the data are used to estimate the relevant weights. 40 Combining structured and unstructured data is an active area of deep-learning research, often called fusion modeling (Yuhas, Goldstein, and Sejnowski 1989; Lahat, Adali, and Jutten 2015; Ramachandram and Taylor 2017; Baltrušaitis, Ahuja, and Morency 2019). We have tried several of the latest fusion architectures; none improves on our ensemble approach.
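To make the stacking step concrete, the following is a minimal sketch in Python with hypothetical variable names; the paper’s exact estimator may differ (see its footnote and appendix for details):

```python
# A minimal sketch of stacking: treat the two model scores m_s(x) and m_u(x)
# as features and estimate the pooled weights on held-out data. Variable
# names (ms_val, mu_val, y_val) are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LinearRegression

def stack_weights(ms_val, mu_val, y_val):
    """Estimate beta_s, beta_u in m_p(x) = beta_s*m_s(x) + beta_u*m_u(x)."""
    X = np.column_stack([ms_val, mu_val])
    fit = LinearRegression(fit_intercept=False).fit(X, y_val)
    return fit.coef_  # (beta_s_hat, beta_u_hat)

# Pooled prediction for a new case:
# beta_s, beta_u = stack_weights(ms_val, mu_val, y_val)
# m_p_new = beta_s * ms_new + beta_u * mu_new
```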

Judge decisions do have some predictable structure. We report predictive performance as the area under the receiver operating characteristic curve, or AUC, which is a measure of how well the algorithm rank-orders cases, with values from 0.5 (random guessing) to 1.0 (perfect prediction). Intuitively, AUC can be thought of as the chance that a uniformly randomly selected detained defendant has a higher predicted detention likelihood than a uniformly randomly selected released defendant. The algorithm built using all candidate features, m_p(x), has an AUC of 0.780 (see Online Appendix Figure A.X).
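The pairwise interpretation of AUC is easy to verify directly; here is a small self-contained sketch using synthetic data (not our actual scores):

```python
# Verify that AUC equals the share of (detained, released) pairs in which
# the detained case receives the higher score. Data here are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)            # 1 = detained, 0 = released
scores = 0.3 * y + rng.uniform(size=1000)    # noisy scores correlated with y

auc = roc_auc_score(y, scores)

pos, neg = scores[y == 1], scores[y == 0]
share_correctly_ranked = (pos[:, None] > neg[None, :]).mean()
print(auc, share_correctly_ranked)           # identical up to ties
```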

What is the algorithm using to make its predictions? A single type of input captures a sizable share of the total signal: the defendant’s face. The algorithm built using only the mug shot image, m_u(x), has an AUC of 0.625 (see Online Appendix Figure A.X). Since an AUC of 0.5 represents random prediction, in AUC terms the mug shot accounts for (0.625 − 0.5)/(0.780 − 0.5) = 44.6% of the predictive signal about judicial decisions.

Another common way to think about predictive accuracy is in R² terms. While our data are high dimensional (because the facial image is a high-dimensional object), the algorithm’s prediction of the judge’s decision based on the facial image, m_u(x), is a scalar and can be easily included in a familiar regression framework. Like AUC, measures such as R² and mean squared error capture how well a model rank-orders observations by predicted probabilities, but R², unlike AUC, also captures how close predictions are to observed outcomes (calibration). 41 The R² from regressing y against m_s(x) and m_u(x) in the validation data is 0.11. Regressing y against m_u(x) alone yields an R² of 0.03. So depending on how we measure predictive accuracy, around a quarter (0.03/0.11 = 27.3%) to a half (44.6%) of the predictable signal about judges’ decisions is captured by the face.
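In regression form, the calculation looks like the following sketch (all quantities here are synthetic stand-ins for the validation-set variables; in our data the ratio is about 0.27):

```python
# Sketch of the R^2 comparison: regress the judge decision on the scalar
# model predictions, with and without the structured-data model.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
y_val = rng.integers(0, 2, size=1000).astype(float)    # judge decision
ms_val = 0.2 * y_val + rng.normal(0, 0.3, size=1000)   # structured-data score
mu_val = 0.1 * y_val + rng.normal(0, 0.3, size=1000)   # mug shot score

X_full = np.column_stack([ms_val, mu_val])
r2_full = LinearRegression().fit(X_full, y_val).score(X_full, y_val)

X_face = mu_val.reshape(-1, 1)
r2_face = LinearRegression().fit(X_face, y_val).score(X_face, y_val)

print(r2_face / r2_full)   # share of predictable signal from the face alone
```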

Average differences are another way to see what drives judges’ decisions. For any given feature x_k, we can calculate the average detention rate for different values of the feature. For example, for the variable measuring whether the defendant is male (x_k = 1) versus female (x_k = 0), we can calculate and plot E[y | x_k = 1] versus E[y | x_k = 0]. As shown in Online Appendix Figure A.XI, the difference in detention rates equals 4.8 percentage points for those arrested for violent versus nonviolent crimes, 10.2 percentage points for men versus women, and 4.3 percentage points for bottom versus top quartile of skin tone, all sizable relative to the baseline detention rate of 23.3% in our validation data set. By way of comparison, average detention rates for the bottom versus top quartile of the mug shot algorithm’s predictions, m_u(x), differ by 20.4 percentage points.

In what follows, we seek to understand more about the mug shot–based prediction of the judge’s decision, which we refer to simply as m(x) in the remainder of the article.

IV.B. Judicial Error?

So far we have shown that the face predicts judges’ behavior. Are judges right to use face information? To be precise, by “right” we do not mean a broader ethical judgment; for many reasons, one could argue it is never ethical to use the face. But suppose we take a rather narrow (exceedingly narrow) formulation of “right.” Recall the judge is meant to make jailing decisions based on the defendant’s risk. Is the use of these facial characteristics consistent with that objective? Put differently, if we account for defendant risk differences, do these facial characteristics still predict judge decisions? The fact that judges rely on the face in making detention decisions is in itself a striking insight regardless of whether the judges use appearance as a proxy for risk or are committing a cognitive error.

At first glance, the most straightforward way to answer this question would be to regress rearrest against the algorithm’s mug shot–based detention prediction. That yields a statistically significant relationship: The coefficient (and standard error) for the mug shot equals 0.6127 (0.0460) with no other explanatory variables in the regression versus 0.5735 (0.0521) with all the explanatory variables (as in the final column, Table III ). But the interpretation here is not so straightforward.

The challenge of interpretation comes from the fact that we have only measured crime rates for the released defendants. The problem with having measured crime, not actual crime, is that whether someone is charged with a crime is itself a human choice, made by police. If the choices police make about when to make an arrest are affected by the same biases that might afflict judges, then measured rearrest rates may correlate with facial characteristics simply due to measurement bias. The problem created by having measures of rearrest only for released defendants is that if judges have access to private information (defendant characteristics not captured by our data set), and judges use that information to inform detention decisions, then the released and detained defendants may be different in unobservable ways that are relevant for rearrest risk ( Kleinberg et al. 2018 ).

With these caveats in mind, we can at least perform a bounding exercise. We created a predictor of rearrest risk (see Online Appendix B) and then regress judges’ decisions on predicted rearrest risk. We find that a one-unit change in predicted rearrest risk changes judge detention rates by 0.6103 (standard error 0.0213). By comparison, a one-unit change in the mug shot (by which we mean the algorithm’s mug shot–based prediction of the judge detention decision) changes judge detention rates by 0.6963 (standard error 0.0383; see Table III, column (1)). That means if judges were reacting to the defendant’s face only because the face is a proxy for rearrest risk, the difference in rearrest risk for those with a one-unit difference in the mug shot would need to be 0.6963/0.6103 = 1.141. But when we directly regress rearrest against the algorithm’s mug shot–based detention prediction, we get a coefficient of 0.6127 (standard error 0.0460). Clearly 0.6127 < 1.141; that is, the mug shot does not seem strongly enough related to rearrest risk to explain the judge’s use of it in making detention decisions. 42

Of course this leaves us with the second problem with our data: we only have crime data on the released. It is possible the relationship between the mug shot and risk is very different among the 23.3% of defendants who are detained (which we cannot observe). Put differently, the mug shot–risk relationship among the 76.7% of defendants who are released is 0.6127; let A be the (unknown) mug shot–risk relationship among the jailed. What we really want to know is the mug shot–risk relationship among all defendants, which equals (0.767 · 0.6127) + (0.233 · A). For this mug shot–risk relationship among all defendants to equal 1.141, A would need to be 2.880, nearly five times as great among the detained defendants as among the released. This would imply an implausibly large effect of the mug shot on rearrest risk relative to the size of the effects on rearrest risk of other defendant characteristics. 43
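In equation form, the bounding exercise solves for A as follows:

```latex
% Solving the bounding equation from the text for A, the unobserved
% mug shot-risk relationship among detained defendants:
\begin{align*}
(0.767 \times 0.6127) + (0.233 \times A) &= 1.141 \\
0.233\,A &= 1.141 - 0.470 \\
A &\approx 2.880, \quad \text{where } \tfrac{2.880}{0.6127} \approx 4.7 \text{ times the released-defendant relationship.}
\end{align*}
```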

In addition, the results from Section VI.B call into question whether these characteristics are well-understood proxies for risk. As we show there, experts who understand pretrial decision making (public defenders and legal aid society staff) do not recognize the signal about judge decision making that the algorithm has discovered in the mug shot. These considerations as a whole—that measured rearrest is itself biased, the bounding exercise, and the failure of experts to re-create this signal—together lead us to tentatively conclude that what the algorithm is finding in the face is unlikely to be merely a well-understood proxy for risk; more plausibly, it reflects errors in the judicial decision-making process. Of course, that presumption is not essential for the rest of the article, which asks: what exactly has the algorithm discovered in the face?

IV.C. Is the Algorithm Discovering Something New?

Previous studies already tell us a number of things about what shapes the decisions of judges and other people. For example, we know people stereotype by gender ( Avitzour et al. 2020 ), age ( Neumark, Burn, and Button 2016 ; Dahl and Knepper 2020 ), and race or ethnicity ( Bertrand and Mullainathan 2004 ; Arnold, Dobbie, and Yang 2018 ; Arnold, Dobbie, and Hull 2020 ; Fryer 2020 ; Hoekstra and Sloan 2022 ; Goncalves and Mello 2021 ). Is the algorithm just rediscovering known determinants of people’s decisions, or discovering something new? We address this in two ways. We first ask how much of the algorithm’s predictions can be explained by already-known features ( Table II ). We then ask how much of the algorithm’s predictive power in explaining actual judges’ decisions is diminished when we control for known factors ( Table III ). We carry out both analyses for three sets of known facial features: (i) demographic characteristics, (ii) psychological features, and (iii) incentivized human guesses. 44

Table II. Is the Algorithm Rediscovering Known Facial Features?

Notes. The table presents the results of regressing an algorithmic prediction of judge detention decisions against the different explanatory variables listed in the rows, where each column represents a different regression specification (the specific explanatory variables in each regression are indicated by the filled-in coefficients and standard errors in the table). The algorithm was trained using mug shots from the training data set; the regressions reported here are carried out using data from the validation data set. Data on skin tone, attractiveness, competence, dominance, and trustworthiness come from asking subjects to assign feature ratings to mug shot images from the Mecklenburg County, NC, Sheriff’s Office public website (see the text). The human guess about the judges’ decision comes from showing workers on the Prolific platform pairs of mug shot images and asking them to report which defendant they believe the judge would be more likely to detain. Regressions follow a linear probability model and also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Table III. Does the Algorithm Predict Judge Behavior after Controlling for Known Factors?

Notes. This table reports the results of estimating a linear probability specification of judges’ detain decisions against different explanatory variables in the validation set described in Table I . Each row represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). The algorithmic predictions of the judges’ detain decision come from our convolutional neural network algorithm built using the defendants’ face image as the only feature, using data from the training data set. Measures of defendant demographics and current arrest charge come from government administrative data obtained from a combination of Mecklenburg County, NC, and state agencies. Measures of skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Table II , columns (1)–(3) show the relationship of the algorithm’s predictions to demographics. The predictions vary enormously by gender (men have predicted detention likelihoods 11.9 percentage points higher than women), less so by age, 45 and by different indicators of race or ethnicity. With skin tone scored on a 0−1 continuum, defendants whom independent raters judge to be at the lightest end of the continuum are 4.4 percentage points less likely to be detained than those rated to have the darkest skin tone (column (3)). Conditional on skin tone, Black defendants have a 1.9 percentage point lower predicted likelihood of detention compared with whites. 46

Table II, column (4) shows how the algorithm’s predictions relate to facial features implicated by past psychological studies as shaping people’s judgments of one another. These features also help explain the algorithm’s predictions of judges’ detention decisions: people judged by independent raters to be one standard deviation more attractive, competent, or trustworthy have lower predicted likelihoods of detention equal to 0.55, 0.91, and 0.48 percentage points, respectively, or 2.2%, 3.6%, and 1.8% of the base rate. 47 Those whom subjects judge to be one standard deviation more dominant-looking have a higher predicted likelihood of detention of 0.37 percentage points (or 1.5%).

How do we know we have controlled for everything relevant from past research? The literature on what shapes human judgments in general is vast; perhaps there are things that are relevant for judges’ decisions specifically that we have inadvertently excluded? One way to solve this problem would be to do a comprehensive scan of past studies of human judgment and decision making, and then decide which results from different non–criminal justice contexts might be relevant for criminal justice. But that itself is a form of human-driven hypothesis generation, bringing us right back to where we started.

To get out of this box, we take a different approach. Instead of enumerating individual characteristics, we ask people to embody their beliefs in a guess, which ought to be the compound of all these characteristics. Then we can ask whether the algorithm has rediscovered this human guess (and later whether it has discovered more). We ask independent subjects to look at pairs of mug shots matched by gender, race, and five-year age bins and forecast which defendant is more likely to be detained by a judge. We provide a financial incentive for accurate guesses to increase the chances that subjects take the exercise seriously. 48 We also provide subjects with an opportunity to learn by showing subjects 50 image pairs with feedback after each pair about which defendant the judge detained. We treat the first 10 image pairs from each subject as learning trials and only use data from the last 40 image pairs. This approach is intended to capture anything that influences judges’ decisions that subjects could recognize, from subtle signs of things like socioeconomic status or drug use or mood, to things people can recognize but not articulate.

It turns out subjects are modestly good at this task (Table II). Participants correctly guess which defendant is more likely to be detained 51.4% of the time, which differs to a statistically significant degree from the 50% random-guessing benchmark. When we regress the algorithm’s predicted detention rate against these subject guesses, the coefficient is 3.99 percentage points, equal to 17.1% of the base rate.

The findings in Table II are somewhat remarkable. The only input the algorithm had access to was the raw pixel values of each mug shot, yet it has rediscovered findings from decades of previous research and human intuition.

Interestingly, these features collectively explain only a fraction of the variation in the algorithm’s predictions: the R² is only 0.2228. That by itself does not necessarily mean the algorithm has discovered additional useful signal. It is possible that the remaining variation is prediction error—components of the prediction that do not explain actual judges’ decisions.

In Table III, we test whether the algorithm uncovers any additional signal for actual judge decisions, above and beyond the influence of these known factors. The algorithm by itself produces an R² of 0.0331 (column (1)), substantially higher than all previously known features taken together, which produce an R² of 0.0162 (column (5)), or the human guesses alone, which produce an R² of 0.0025 (so the algorithm is much better at predicting detention from faces than people are). Another way to see that the algorithm has detected signal above and beyond these known features is that the coefficient on the algorithm’s prediction when included alone in the regression, 0.6963 (column (1)), changes only modestly when we condition on everything else, to 0.6171 (column (7)). The algorithm seems to have discovered some novel source of signal that better predicts judge detention decisions. 49

The algorithm has made a discovery: something about the defendant’s face explains judge decisions, above and beyond the facial features implicated by existing research. But what is it about the face that matters? Without an answer, we are left with a discovery of an unsatisfying sort. We have simply replaced one black box hypothesis generation procedure (human creativity) with another (the algorithm). In what follows we demonstrate how existing methods like saliency maps cannot solve this challenge in our application and then discuss our solution to that problem.

V.A. The Challenge of Explanation

The problem of algorithm-human communication stems from the fact that we cannot simply look inside the algorithm’s “black box” and see what it is doing, because m(x), the algorithmic predictor, is so complicated. A common solution in computer science is to forget about looking inside the algorithmic black box and focus instead on drawing inferences from curated outputs of that box. Many of these methods involve gradients: given a prediction function m(x), we can calculate the gradient ∇m(x) = dm/dx. This lets us determine, at any input value, what change in the input vector maximally changes the prediction. 50 The idea of gradients is useful for image classification tasks because it allows us to tell which pixel values are most important for changing the predicted outcome.

For example, a widely used method known as saliency maps uses gradient information to highlight the specific pixels that are most important for predicting the outcome of interest ( Baehrens et al. 2010 ; Simonyan, Vedaldi, and Zisserman 2014 ). This approach works well for many applications like determining whether a given picture contains a given type of animal, a common task in ecology ( Norouzzadeh et al. 2018 ). What distinguishes a cat from a dog? A saliency map for a cat detector might highlight pixels around, say, the cat’s head: what is most cat-like is not the tail, paws, or torso, but the eyes, ears, and whiskers. But more complicated outcomes of the sort social scientists study may depend on complicated functions of the entire image.
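For concreteness, here is a minimal sketch of the gradient computation behind a saliency map, written in PyTorch with a stand-in classifier (not the network used in the article):

```python
# Saliency-map sketch: compute d m / d x at one input and use the gradient
# magnitude as a per-pixel importance map. The model is a stand-in for m(x).
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Flatten(), nn.Linear(8 * 64 * 64, 1), nn.Sigmoid(),
)

x = torch.rand(1, 3, 64, 64, requires_grad=True)  # one 64x64 RGB image
score = model(x).sum()                            # scalar predicted outcome
score.backward()                                  # gradient of m w.r.t. pixels

saliency = x.grad.abs().max(dim=1).values         # (1, 64, 64) importance map
```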

Even if saliency maps were more selective in highlighting pixels in applications like ours, for hypothesis generation they suffer from a second limitation: they do not convey enough information to enable people to articulate interpretable hypotheses. In the cat detector example, a saliency map can tell us that something about the cat’s (say) whiskers is key for distinguishing cats from dogs. But what about that feature matters? Would a cat look more like a dog if its whiskers were longer? Or shorter? More (or less) even in length? People need to know not just which features matter but how those features must change to change the prediction. For hypothesis generation, the saliency map undercommunicates with humans.

To test the ability of saliency maps to help with our application, we focused on a facial feature that people already understand and can easily recognize from a photo: age. We first build an algorithm that predicts each defendant’s age from their mug shot. For a representative image, as in the top left of Figure III , we can highlight which pixels are most important for predicting age, shown in the top right. 51 A key limitation of saliency maps is easy to see: because age (like many human facial features) is a function of almost every part of a person’s face, the saliency map highlights almost everything.

Figure III. Candidate Algorithm-Human Communication Vehicles for a Known Facial Feature: Age

Panel A shows a randomly selected point in the GAN latent space for a non-Hispanic white male defendant. Panel B shows a saliency map that highlights the pixels that are most important for an algorithmic model that predicts the defendant’s age from the mug shot image. Panel C shows an image changed or “morphed” in the direction of older age, based on the gradient of the image-based age prediction, using the “naive” morphing procedure that does not constrain the new image to lie on the face manifold (see the text). Panel D shows the image morphed to the maximum age using our actual preferred morphing procedure.

An alternative to simply highlighting high-leverage pixels is to change them in the direction of the gradient of the predicted outcome, to—ideally—create a new face that now has a different predicted outcome, what we call “morphing.” This new image answers the counterfactual question: “How would this person’s face change to increase their predicted outcome?” Our approach builds on the ability of people to comprehend ideas through comparisons, so we can show morphed image pairs to subjects to have them name the differences that they see. Figure IV summarizes our semiautomated hypothesis generation pipeline. (For more details see Online Appendix B .) The benefit of morphed images over actual mug shot images is to isolate the differences across faces that matter for the outcome of interest. By reducing noise, morphing also reduces the risk of spurious discoveries.

Figure IV. Hypothesis Generation Pipeline

The diagram illustrates all the algorithmic components in our procedure by presenting a full pipeline for algorithmic interpretation.

Figure V illustrates how this morphing procedure works in practice and highlights some of the technical challenges that arise. Let the box in the top panel represent the space of all possible images—all possible combinations of pixel values for, say, a 512 × 512 image. Within this space, we can apply our mug shot–based predictor of the known facial feature, age, to identify all images with the same predicted age, as shown by the contour map of the prediction function. Imagine picking some random initial mug shot image. We could follow the gradient to find an image with a higher predicted value of the outcome y .

Figure V. Morphing Images for Detention Risk On and Off the Face Manifold

The figure shows the difference between an unconstrained (naive) morphing procedure and our preferred new morphing approach. In both panels, the background represents the image space (set of all possible pixel values) and the blue line (color version available online) represents the set of all pixel values that correspond to any face image (the face manifold). The orange lines show all images that have the same predicted outcome (isoquants in predicted outcome). The initial face (point on the outermost contour line) is a randomly selected face in GAN face space. From there we can naively follow the gradients of an algorithm that predicts some outcome of interest from face images. As shown in Panel A, this takes us off the face manifold and yields a nonface image. Alternatively, with a model of the face manifold, we can follow the gradient for the predicted outcome while ensuring that the new image is again a realistic instance as shown in Panel B.

The challenge is that most points in this image space are not actually face images. Simply following the gradient will usually take us off the data distribution of face images, as illustrated abstractly in the top panel of Figure V . What this means in practice is shown in the bottom left panel of Figure III : the result is an image that has a different predicted outcome (in the figure, illustrated for age) but no longer looks like a real instance—that is, no longer looks like a realistic face image. This “naive” morphing procedure will not work without some way to ensure the new point we wind up on in image space corresponds to a realistic face image.

V.B. Building a Model of the Data Distribution

To ensure morphing leads to realistic face images, we need a model of the data distribution p(x)—in our specific application, the set of images that are faces. We rely on an unsupervised learning approach to this problem. 52 Specifically, we use generative adversarial networks (GANs), originally introduced to generate realistic new images for a variety of tasks (see Goodfellow et al. 2014). 53

A GAN is built by training two algorithms that “compete” with each other: the generator G and the classifier C. The generator creates synthetic images, and the classifier (or “discriminator”), presented with synthetic or real images, tries to distinguish which is which. A good discriminator pressures the generator to produce images that are harder to distinguish from real ones; in turn, a good generator pressures the classifier to get better at discriminating real from synthetic images. Data on actual faces are used to train the discriminator, and the generator is trained as it seeks to fool the discriminator. The performance of C and G improves with successive iterations of training. A perfect G would output images on which the classifier C does no better than random guessing. Such a generator would by definition limit itself to the same input space that defines real images, that is, the data distribution of faces. (Additional discussion of GANs in general, and of how we construct our GAN specifically, is in Online Appendix B.)
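A schematic training step for this generator-classifier game looks like the following (a minimal sketch; the networks and optimizers here are small stand-ins, not our actual architecture, which is described in Online Appendix B):

```python
# One schematic GAN training step: the classifier C learns to separate real
# from synthetic images; the generator G learns to fool C.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
C = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 1))
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_C = torch.optim.Adam(C.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_images):                  # real_images: (batch, 784)
    batch = real_images.size(0)
    fake = G(torch.randn(batch, 64))

    # Classifier update: label real images as 1, synthetic images as 0
    opt_C.zero_grad()
    loss_C = (bce(C(real_images), torch.ones(batch, 1))
              + bce(C(fake.detach()), torch.zeros(batch, 1)))
    loss_C.backward()
    opt_C.step()

    # Generator update: try to make C label synthetic images as real
    opt_G.zero_grad()
    loss_G = bce(C(fake), torch.ones(batch, 1))
    loss_G.backward()
    opt_G.step()
```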

To build our GAN and evaluate its expressiveness we use standard training metrics, which turn out to compare favorably to what we see with other widely used GAN models on other data sets (see Online Appendix B.C for details). A more qualitative way to judge our GAN comes from visual inspection; some examples of synthetic face images are in Figure II . Most importantly, the GAN we build (as is true of GANs in general) is not generic. GANs are specific. They do not generate “faces” but instead seek to match the distribution of pixel combinations in the training data. For example, our GAN trained using mug shots would never generate generic Facebook profile photos or celebrity headshots.

Figure V illustrates how having a model such as the GAN lets morphing stay on the data distribution of faces and produce realistic images. We pick a random point in the space of faces (mug shots) and then use the algorithmic predictor of the outcome of interest, m(x), to identify nearby faces that are similar in all respects except those relevant for the outcome. Notice this procedure requires that faces that are closer together in GAN latent space also look more similar to a human in pixel space. Otherwise we might make a small movement along the gradient and wind up with a face that looks different in all sorts of other ways that are irrelevant to the outcome. That is, we need the GAN not just to model the support of the data but also to provide a meaningful distance metric.
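Concretely, the morphing step can be sketched as gradient ascent in the latent space rather than in pixel space (a simplified sketch; G and m are stand-ins for the generator and outcome predictor, with shapes assumed compatible):

```python
# Morph in GAN latent space: follow the gradient of m(G(z)) with respect to
# the latent code z, so every intermediate image remains a realistic face.
# G (latent -> image) and m (image -> predicted outcome) are stand-ins.
import torch

def morph(z, G, m, steps=20, step_size=0.1):
    z = z.clone()
    for _ in range(steps):
        z.requires_grad_(True)
        score = m(G(z)).sum()                  # predicted outcome of current face
        grad, = torch.autograd.grad(score, z)  # gradient in latent space
        z = (z + step_size * grad).detach()    # step toward a higher outcome
    return G(z).detach(), z                    # morphed image and latent code
```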

When we produce these morphs, what can possibly change as we morph? In principle there is no limit. The changes need not be local: features such as skin color, which involves many pixels, could change. So could features such as attractiveness, where the pixels that need to change to make a face more attractive vary from face to face: the “same” change may make one face more attractive and another less so. Anything represented in the face could change, as could anything else in the image beyond the face that matters for the outcome (if, for example, localities varied in both detention rates and the type of background they have someone stand in front of for mug shots).

In practice, though, there is a limit. What can change depends on how rich and expressive the estimated GAN is. If the GAN fails to capture a certain kind of face or a dimension of the face, then we are unlikely to be able to morph along that dimension. The morphing procedure is only as complete as the GAN is expressive. If the GAN does express a feature and m(x) truly depends on that feature, morphing will likely display it. There is also no guarantee that in any given application the classifier m(x) will find novel signal for the outcome y, that the GAN successfully learns the data distribution (Nalisnick et al. 2018), or that subjects can detect and articulate whatever signal the classifier algorithm has discovered. Determining the general conditions under which our procedure will work is something we leave to future research. Whether our procedure can work for the specific application of judge decisions is the question to which we turn next. 54

V.C. Validating the Morphing Procedure

We return to our algorithmic prediction of a known facial feature—age—and examine what morphing by age produces as a way to validate our procedure. When we follow the gradient of the predicted outcome (age) while constraining ourselves to stay on the GAN’s latent space of faces, we wind up with a new age-morphed face that does indeed look like a realistic face image, as shown in the bottom right of Figure III. We seem to have successfully developed a model of the data distribution and a way to move around on that surface to create realistic new instances.

To figure out if algorithm-human communication occurs, we run these age-morphed image pairs through our experimental pipeline ( Figure IV ). Our procedure is only useful if it is replicable—that is, if it does not depend on the idiosyncratic insights of any particular person. For that reason, the people looking at these images and articulating what they see should not be us (the investigators) but a sample of external, independent study subjects. In our application, we use Prolific workers (see Online Appendix Table A.III ). Reliability or replicability is indicated by the agreement in the subject responses: lots of subjects see and articulate the same thing in the morphed images.

We asked subjects to look at 50 age-morphed image pairs selected at random from a population of 100 pairs, and told them the images in each pair differ on some hidden dimension but did not tell them what that dimension was. 55 We asked subjects to guess which image expresses that hidden feature more, gave them feedback about the right answer, treated the first 10 image pairs as learning examples, and calculated accuracy on the remaining 40 images. Subjects correctly selected the older image 97.8% of the time.

The final step was to ask subjects to name what differs in image pairs. Making sense of these responses requires some way to group them into semantic categories. Each subject comment could include several concepts (e.g., “wrinkles, gray hair, tired”). We standardized these verbal descriptions by removing punctuation, using only lowercase characters, and removing stop words. We gave these responses to three research assistants not otherwise involved in the project and asked them to create their own categories that would capture all the responses (see Online Appendix Figure A.XIII). We also gave them an illustrative subject comment and highlighted the different types of categories (a descriptive physical feature, e.g., “thick eyebrows”; a descriptive impression, e.g., “energetic”; and an example of a comment too vague to lend itself to useful measurement, e.g., “ears”). In our validation exercise, 81.5% of subject reports fall into the semantic categories of either age or the closely related feature of hair color. 56
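The standardization step is simple string processing; a minimal sketch follows (the stop-word list and responses here are illustrative):

```python
# Standardize free-text responses: lowercase, strip punctuation, drop stop
# words, then count tokens across subjects.
import string
from collections import Counter

STOP_WORDS = {"the", "a", "an", "is", "and", "more", "looks"}

def clean_tokens(response):
    text = response.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in text.split() if w not in STOP_WORDS]

responses = ["Wrinkles, gray hair, tired", "The hair looks more gray"]
counts = Counter(tok for r in responses for tok in clean_tokens(r))
print(counts.most_common())   # [('gray', 2), ('hair', 2), ('wrinkles', 1), ...]
```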

V.D. Understanding the Judge Detention Predictor

Having validated our algorithm-human communication procedure on the known facial feature of age, we are ready to apply it to generate a new hypothesis about what drives judge detention decisions. To do this we combine the mug shot–based algorithmic predictor of judges’ detention decisions, m(x), with our GAN model of the data distribution of mug shot images, then create new synthetic image pairs morphed with respect to the likelihood that the judge would detain the defendant (see Figure IV).

The top panel of Figure VI shows a pair of such images. Underneath we show an “image strip” of intermediate steps, along with each image’s predicted detention rate. With an overall detention rate of 23.3% in our validation data set, morphing takes us from about one-half the base rate (13%) up to nearly twice the base rate (41%). Additional examples of morphed image pairs are shown in Figure VII.

Figure VI. Illustration of Morphed Faces along the Detention Gradient

Panel A shows the result of selecting a random point on the GAN latent face space for a white non-Hispanic male defendant, then using our new morphing procedure to increase the predicted detention risk of the image to 0.41 (left) or reduce the predicted detention risk down to 0.13 (right). The overall average detention rate in the validation data set of actual mug shot images is 0.23 by comparison. Panel B shows the different intermediate images between these two end points, while Panel C shows the predicted detention risk for each of the images in the middle panel.

Figure VII. Examples of Morphing along the Gradients of the Face-Based Detention Predictor

We showed 54 subjects 50 detention-risk-morphed image pairs each, asked them to predict which defendant would be detained, offered financial incentives for correct answers, 57 and gave feedback on the right answer. Online Appendix Figure A.XV shows how subjects’ accuracy improves with practice across successive morphed image pairs. On the initial image-pair trials, subjects are not much better than random guessing, in the range of what we see when subjects look at pairs of actual mug shots (where accuracy is 51.4% across the final 40 mug shot pairs people see). But unlike what happens with actual images, when looking at morphed image pairs subjects seem to quickly learn what the algorithm is trying to communicate to them. Accuracy increased by over 10 percentage points after 20 morphed image pairs and reached 67% after 30 image pairs. Compared to looking at actual mug shots, the morphing procedure accomplished its goal of making it easier for subjects to see what in the face matters most for detention risk.

We asked subjects to articulate the key differences they saw across morphed image pairs. The result seems to be a reliable hypothesis—a facial feature that a sizable share of subjects name. In the top panel of Figure VIII, we present a histogram of individual tokens (cleaned words from worker comments) in “word cloud” form, where word size is approximately proportional to frequency. 58 Some of the most common words are “shaved,” “cleaner,” “length,” “shorter,” “moustache,” and “scruffy.” To form semantic categories, we use a procedure similar to the one described for our validation exercise on the known feature of age. 59 Grouping tokens into semantic categories, we see that nearly 40% of the subjects see and name a similar feature that they think helps explain judge detention decisions: how well-groomed the defendant is (see the bottom panel of Figure VIII). 60

Figure VIII. Subject Reports of What They See between Detention-Risk-Morphed Image Pairs

Panel A shows a word cloud of subject reports about what they see as the key difference between image pairs where one is a randomly selected point in the GAN latent space and the other is morphed in the direction of a higher predicted detention risk. Words are approximately proportionately sized to the frequency of subject mentions. Panel B shows the frequency of semantic groupings of those open-ended subject reports (see the text for additional details).

Can we confirm that what the subjects think the algorithm is seeing is what the algorithm actually sees? We asked a separate set of 343 independent subjects (MTurk workers) to label the 32,881 mug shots in our combined training and validation data sets for how well-groomed each image was perceived to be on a nine-point scale. 61 For data sets of our size, these labeling costs are fairly modest, but in principle those costs could be much more substantial (or even prohibitive) in some applications.

Table IV suggests algorithm-human communication has successfully occurred: our new hypothesis, call it h_1(x), is correlated with the algorithm’s prediction of the judge, m(x). If subjects were mistaken in thinking they saw well-groomed differences across images, there would be no relationship between well-groomed and the detention predictions. Yet what we actually see is that the R² from regressing the algorithm’s predictions against well-groomed equals 0.0247, or 11% of the R² we get from a model with all the explanatory variables (0.2361). In a bivariate regression, the coefficient (−0.0172) implies that a one standard deviation increase in well-groomed (1.0118 points on our 9-point scale) is associated with a decline in predicted detention risk of 1.74 percentage points, or 7.5% of the base rate. Another way to see the explanatory power of this hypothesis is to note that this coefficient hardly changes when we add all the other explanatory variables to the regression (equal to −0.0153 in the final column), despite the substantial increase in the model’s R².
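The implied effect size follows directly from the reported coefficient, the standard deviation of the rating, and the base detention rate:

```latex
% Effect size implied by the bivariate well-groomed coefficient (a decline):
0.0172 \times 1.0118 \approx 0.0174
\qquad\Longrightarrow\qquad
\frac{0.0174}{0.233} \approx 7.5\% \text{ of the base rate.}
```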

Table IV. Correlation between Well-Groomed and the Algorithm’s Prediction

Notes. This table shows the results of estimating a linear probability specification regressing algorithmic predictions of judges’ detain decision against different explanatory variables, using data from the validation set of cases from Mecklenburg County, NC. Each row of the table represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). Algorithmic predictions of judges’ decisions come from applying an algorithm built with face images in the training data set to validation set observations. Data on well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

V.E. Iteration

Our procedure is iterable. The first novel feature we discovered, well-groomed, explains some—but only some—of the variation in the algorithm’s predictions of the judge. We can iterate our procedure to generate hypotheses about the remaining residual variation as well. Note that the order in which features are discovered will depend on how important each feature is in explaining the judge’s detention decision and on how salient each feature is to the subjects who are viewing the morphed image pairs. So explanatory power for the judge’s decisions need not monotonically decline as we iterate and discover new features.

To isolate the algorithm’s signal above and beyond what is explained by well-groomed, we wish to generate a new set of morphed image pairs that differ in predicted detention but hold well-groomed constant. That would help subjects see other novel features that might differ across the detention-risk-morphed images, without subjects getting distracted by differences in well-groomed. 62 But iterating the procedure raises several technical challenges. To see these challenges, consider what would in principle seem to be the most straightforward way to orthogonalize, in the GAN’s latent face space:

use training data to build predictors of detention risk, m(x), and the facial features to orthogonalize against, h_1(x);

pick a point on the GAN latent space of faces;

collect the gradients with respect to m(x) and h_1(x);

use the Gram-Schmidt process to move within the latent space toward higher predicted detention risk m(x) but orthogonal to h_1(x) (a sketch of this step follows the list); and

show new morphed image pairs to subjects, have them name a new feature.
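The Gram-Schmidt step in this playbook amounts to projecting one gradient off another; a minimal sketch (the gradient inputs here are stand-in tensors):

```python
# Gram-Schmidt step: remove from the detention-risk gradient its component
# along the well-groomed gradient, so a morphing step raises m(x) while (to
# first order) holding h_1(x) fixed. Inputs are stand-in 1-D tensors.
import torch

def orthogonal_step(z, grad_m, grad_h1, step_size=0.1):
    proj = (grad_m @ grad_h1) / (grad_h1 @ grad_h1) * grad_h1
    grad_perp = grad_m - proj          # orthogonal to grad_h1 by construction
    return z + step_size * grad_perp
```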

The challenge with implementing this playbook in practice is that we do not have labels for well-groomed for the GAN-generated synthetic faces. Moreover, it would be infeasible to collect this feature for use in this type of orthogonalization procedure. 63 That means we cannot orthogonalize against well-groomed, only against predictions of well-groomed. And orthogonalizing with respect to a prediction is an error-prone process whenever the predictor is imperfect (as it is here). 64 The errors in the process accumulate as we take many morphing steps. Worse, that accumulated error is not expected to be zero on average. Because we are morphing in the direction of predicted detention and we know predicted detention is correlated with well-groomed, the prediction error will itself be correlated with well-groomed.

Instead we use a different approach. We build a new detention-risk predictor with a curated training data set, limited to pairs of images matched on the features to be orthogonalized against. For each detained observation i (such that y_i = 1), we find a released observation j (such that y_j = 0) where h_1(x_i) = h_1(x_j). In that training data set, y is now orthogonal to h_1(x), so we can use the gradient of the orthogonalized detention-risk predictor to move in GAN latent space and create new morphed images that differ in detention odds but are similar with respect to well-groomed. 65 We call these “orthogonalized morphs,” which we then feed into the experimental pipeline shown in Figure IV. 66 An open question for future work is how many iterations are possible before the dimensionality of the matching problem required for this procedure creates problems.
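A sketch of the matched-pair construction is below (data structures are hypothetical; in practice the matching would also need to handle near-matches and multiple features):

```python
# Pair each detained case with a released case that has the same
# well-groomed label, so y is orthogonal to h_1(x) in the curated training
# data. Inputs: y (0/1 outcomes), h1 (feature labels), ids (case IDs).
def build_matched_training_set(y, h1, ids):
    released_by_h1, used, pairs = {}, set(), []
    for j in range(len(y)):
        if y[j] == 0:
            released_by_h1.setdefault(h1[j], []).append(j)
    for i in range(len(y)):
        if y[i] == 1:                              # detained case
            for j in released_by_h1.get(h1[i], []):
                if j not in used:                  # first unused exact match
                    used.add(j)
                    pairs.append((ids[i], ids[j]))
                    break
    return pairs   # train the orthogonalized detention predictor on these
```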

Examples from this orthogonalized image-morphing procedure are in Figure IX. Changes in facial features across these morphed images are notably different from those in the first iteration of morphs in Figure VI. The examples suggest the orthogonalization may be slightly imperfect: some pairs show subtle differences in well-groomed and perhaps age. As with the first iteration of the morphing procedure, the second (orthogonalized) iteration again generates images that vary substantially in their predicted risk, from 0.07 up to 0.27 (see Online Appendix Figure A.XVIII).

Figure IX. Examples of Morphing along the Orthogonal Gradients of the Face-Based Detention Predictor

Still, there is a salient new signal: when the orthogonalized morphs are presented to subjects, they name a second facial feature, as shown in Figure X. We showed 52 subjects (Prolific workers) 50 orthogonalized morphed image pairs and asked them to name the differences they see. The word cloud in the top panel of Figure X shows that some of the most common terms reported by subjects include “big,” “wider,” “presence,” “rounded,” “body,” “jaw,” and “head.” When we ask independent research assistants to group the subject tokens into semantic groups, we see in the bottom panel that a sizable share of subject comments (around 22%) refer to a similar facial feature, h_2(x): how “heavy-faced” or “full-faced” the defendant is.

Figure X. Subject Reports of What They See between Detention-Risk-Morphed Image Pairs, Orthogonalized to the First Novel Feature Discovered (Well-Groomed)

Panel A shows a word cloud of subject reports about what they see as the key difference between image pairs, where one is a randomly selected point in the GAN latent space and the other is morphed in the direction of a higher predicted detention risk, where we are moving along the detention gradient orthogonal to well-groomed and skin tone (see the text). Panel B shows the frequency of semantic groupings of these open-ended subject reports (see the text for additional details).

This second facial feature (like the first) is again related to the algorithm’s prediction of the judge. When we ask a separate sample of subjects (343 MTurk workers, see Online Appendix Table A.III) to independently label our validation images for heavy-facedness, regressing the algorithm’s predictions against heavy-faced yields an R² of 0.0384 (Table V, column (1)). With a coefficient of −0.0182 (0.0009), the results imply that a one standard deviation change in heavy-facedness (1.1946 points on our 9-point scale) is associated with a reduced predicted detention risk of 2.17 percentage points, or 9.3% of the base rate. Adding in other facial features implicated by past research substantially boosts the adjusted R² of the regression but barely changes the coefficient on heavy-facedness.

Table V. Correlation between Heavy-Faced and the Algorithm’s Prediction

Notes. This table shows the results of estimating a linear probability specification regressing algorithmic predictions of judges’ detain decision against different explanatory variables, using data from the validation set of cases from Mecklenburg County, NC. Each row of the table represents a different explanatory variable for the regression, while each column reports the results of a separate regression with different combinations of explanatory variables (as indicated by the filled-in coefficients and standard errors in the table). Algorithmic predictions of judges’ decisions come from applying the algorithm built with face images in the training data set to validation set observations. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

In principle, the procedure could be iterated further. After all, well-groomed and heavy-faced plus all previously known facial features together still explain only 27% of the variation in the algorithm’s predictions of judges’ decisions. As long as there is residual variation, the hypothesis-generation crank could be turned again and again. Because our goal is not to fully explain judges’ decisions but to illustrate that the procedure works and is iterable, we leave this for future work (ideally done on data from other jurisdictions as well).

Here we consider whether the new hypotheses our procedure has generated meet our final criterion: empirical plausibility. We show that these facial features are new not just to the scientific literature but also apparently to criminal justice practitioners, before turning to whether these correlations might reflect some underlying causal relationship.

VI.A. Do These Hypotheses Predict What Judges Actually Do?

Empirical plausibility need not be implied by the fact that our new facial features are correlated with the algorithm’s predictions of judges’ decisions. The algorithm, after all, is not a perfect predictor. In principle, well-groomed and heavy-faced might be correlated with the part of the algorithm’s prediction that is unrelated to judge behavior, m(x) − y.

In Table VI, we show that our two new hypotheses are indeed empirically plausible. The adjusted R² from regressing judges’ decisions against heavy-faced equals 0.0042 (column (1)); for well-groomed the figure is 0.0021 (column (2)), and for both together it is 0.0061 (column (3)). As a benchmark, the adjusted R² from all variables (other than the algorithm’s overall mug shot–based prediction) in explaining judges’ decisions equals 0.0218 (column (6)). So the explanatory power of our two novel hypotheses alone equals about 28% of what we get from all the variables together.

Table VI. Do Well-Groomed and Heavy-Faced Correlate with Judge Decisions?

Notes. This table reports the results of estimating a linear probability specification of judges’ detain decisions against different explanatory variables in the validation set described in Table I . The algorithmic predictions of the judges’ detain decision come from our convolutional neural network algorithm built using the defendants’ face image as the only feature, using data from the training data set. Measures of defendant demographics and current arrest charge come from Mecklenburg County, NC, administrative data. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Human guess variable comes from showing subjects pairs of mug shot images and asking subjects to identify the defendant they think the judge would be more likely to detain. Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

For a sense of the magnitude of these correlations, the coefficients on heavy-faced of −0.0234 (0.0036) in column (1) and on well-groomed of −0.0198 (0.0043) in column (2) imply that one standard deviation changes in each variable are associated with reduced detention rates of 2.8 and 2.0 percentage points, respectively, or 12.0% and 8.9% of the base rate. Interestingly, column (7) shows that heavy-faced remains statistically significant even when we control for the algorithm’s prediction. The discovery procedure led us to a facial feature that, when measured independently, captures signal above and beyond what the algorithm found. 67

VI.B. Do Practitioners Already Know This?

Our procedure has identified two hypotheses that are new to the existing research literature and to our study subjects. Yet the study subjects we have collected data from so far likely have relatively little experience with the criminal justice system. A reader might wonder: do experienced criminal justice practitioners already know that these “new” hypotheses affect judge decisions? The practitioners might have learned the influence of these facial features from day-to-day experience.

To answer this question, we carried out two smaller-scale data collections with a sample of N  = 15 staff at a public defender’s office and a legal aid society. We first asked an open-ended question: on what basis do judges decide to detain versus release defendants pretrial? Practitioners talked about judge misunderstandings of the law, people’s prior criminal records, and judge underappreciation for the social contexts in which criminal records arise. Aside from the defendant’s race, nothing about the appearance of defendants was mentioned.

We showed practitioners pairs of actual mug shots and asked them to guess which person is more likely to be detained by a judge (as we had done with MTurk and Prolific workers). This yields a sample of 360 detention forecasts. After seeing these mug shots practitioners were asked an open-ended question about what they think matters about the defendant’s appearance for judge detention decisions. There were a few mentions of well-groomed and one mention of something related to heavy-faced, but these were far from the most frequently mentioned features, as seen in Online Appendix Figure A.XX .

The practitioner forecasts do indeed seem to be more accurate than those of “regular” study subjects. Table VII, column (5) shows that defendants whom the practitioners predict will be detained are 29.2 percentage points more likely to actually be detained, even after controlling for the other known determinants of detention from past research. This is nearly four times the effect of forecasts made by Prolific workers, as shown in the last column of Table VI. The practitioner guesses (unlike those of the regular study subjects) are even about as accurate as the algorithm; the R² from the practitioner guess (0.0165 in column (1)) is similar to the R² from the algorithm’s predictions (0.0166 in column (6)).

TABLE VII
Results from the Criminal Justice Practitioner Sample

Notes. This table shows the results of estimating judges’ detain decisions using a linear probability specification of different explanatory variables on a subset of the validation set. The criminal justice practitioner’s guess about the judge’s decision comes from showing 15 different public defenders and legal aid society members actual mug shot images of defendants and asking them to report which defendant they believe the judge would be more likely to detain. The pairs are selected to be congruent in gender and race but discordant in detention outcome. The algorithmic predictions of judges’ detain decisions come from applying the algorithm, which is built with face images in the training data set, to validation set observations. Measures of defendant demographics and current arrest charge come from Mecklenburg County, NC, administrative data. Data on heavy-faced, well-groomed, skin tone, attractiveness, competence, dominance, and trustworthiness come from subject ratings of mug shot images (see the text). Regression specifications also include indicators for unknown race and unknown gender. * p < .1; ** p < .05; *** p < .01.

Yet practitioners do not seem to already know what the algorithm has discovered. We can see this in several ways in Table VII. First, the sum of the adjusted R² values from the bivariate regressions of judge decisions against practitioner guesses and judge decisions against the algorithm’s mug shot–based prediction is not so different from the adjusted R² from including both variables in the same regression (0.0165 + 0.0166 = 0.0331 from columns (1) plus (6), versus 0.0338 in column (7)). We see something similar for the novel features of well-groomed and heavy-faced specifically as well. 68 The practitioners and the algorithm seem to be tapping into largely unrelated signal.

VI.C. Exploring Causality

Are these novel features actually causally related to judge decisions? Fully answering that question is clearly beyond the scope of the present article. But we can present some additional evidence that is at least suggestive.

For starters we can rule out some obvious potential confounders. With the specific hypotheses in hand, identifying the most important concerns with confounding becomes much easier. In our application, well-groomed and heavy-faced could in principle be related to things like (say) the degree to which the defendant has a substance-abuse problem, is struggling with mental health, or their socioeconomic status. But as shown in a series of Online Appendix tables, we find that when we have study subjects independently label the mug shots in our validation data set for these features and then control for them, our novel hypotheses remain correlated with the algorithmic predictions of the judge and actual judge decisions. 69 We might wonder whether heavy-faced is simply a proxy for something that previous mock-trial-type studies suggest might matter for criminal justice decisions, “baby-faced” (Berry and Zebrowitz-McArthur 1988). 70 But when we have subjects rate mug shots for baby-facedness, our heavy-faced measure remains strongly predictive of the algorithm’s predictions and actual judge decisions; see Online Appendix Tables A.XII and A.XVI.

In addition, we carried out a laboratory-style experiment with Prolific workers. We randomly morphed synthetic mug shot images in the direction of either higher or lower well-groomed (or heavy-faced), randomly assigned structured variables (current charge and prior record) to each image, explained to subjects the detention decision judges are asked to make, and then asked them which member of each pair they would be more likely to detain if they were the judge. The framework from Mobius and Rosenblat (2006) helps clarify what this lab experiment gets us: appearance might affect how others treat us because others are reacting to something about our appearance directly, because our appearance affects our own confidence, or because our appearance affects our effectiveness in oral communication. The experiment’s results shut down these latter two mechanisms and isolate the effects of something about appearance per se, recognizing it remains possible that well-groomed and heavy-faced are correlated with some other aspect of appearance. 71

The study subjects recommend for detention those defendants with higher-risk structured variables (like current charge and prior record), which at the very least suggests they are taking the task seriously. Holding these other case characteristics constant, we find that the subjects are more likely to recommend for detention those defendants who are less well-groomed or less heavy-faced (see Online Appendix Table A.XVII). Qualitatively, these results support the idea that well-groomed and heavy-faced could have a causal effect. It is not clear that the magnitudes in these experiments necessarily have much meaning: the subjects are not actual judges, and the context and structure of the choice are very different from real detention decisions. Still, it is worth noting that the magnitudes implied by our results are nontrivial. Changing well-groomed or heavy-faced has the same effect on subject decisions as a movement within the predicted rearrest risk distribution of 4 and 6 percentile points, respectively (see Online Appendix C for details). Of course only an actual field experiment could conclusively determine causality here, but carrying out that type of field experiment might seem more worthwhile to an investigator in light of the lab experiment’s results.

Is this enough empirical support for these hypotheses to justify incurring the costs of causal testing? The empirical basis for these hypotheses would seem to be at least as strong as (or perhaps stronger than) the informal standard currently used to decide whether an idea is promising enough to test, which in our experience comes from some combination of observing the world, brainstorming, and perhaps some exploratory investigator-driven correlational analysis.

What might such causal testing look like? One possibility would follow in the spirit of Goldin and Rouse (2000) and compare detention decisions in settings where the defendant is more versus less visible to the judge to alter the salience of appearance. For example, many jurisdictions have continued to use some version of virtual hearings even after the pandemic. 72 In Chicago the court system has the defendant appear virtually but everyone else is in person, and the court system of its own volition has changed the size of the monitors used to display the defendant to court participants. One could imagine adding some planned variation to screen size or distance or angle to the judge. These video feeds could in principle be randomly selected for AI adjustment to the defendant’s level of well-groomedness or heavy-facedness (this would probably fall into a legal gray area). In the case of well-groomed, one could imagine a field experiment that changed this aspect of the defendant’s actual appearance prior to the court hearing. We are not claiming these are the right designs but intend only to illustrate that with new hypotheses in hand, economists are positioned to deploy the sort of creativity and rigorous testing that have become the hallmark of the field’s efforts at causal inference.

VII. Conclusion

We have presented a new semi-automated procedure for hypothesis generation. We applied this new procedure to a concrete, socially important application: why judges jail some defendants and not others. Our procedure suggests two novel hypotheses: judges appear to respond to how well-groomed and how heavy-faced defendants look, detaining the less well-groomed and less heavy-faced at higher rates.

Beyond the specific findings from our illustrative application, our empirical analysis also illustrates a playbook for other applications. Start with a high-dimensional predictor m(x) of some behavior of interest. Build an unsupervised model of the data distribution, p(x). Then combine the models for m(x) and p(x) in a morphing procedure to generate new instances that answer the counterfactual question: what would a given instance look like with higher or lower likelihood of the outcome? Show morphed pairs of instances to participants and get them to name what they see as the differences between morphed instances. Get others to independently rate instances for whatever the new hypothesis is; do these labels correlate with both m(x) and the behavior of interest, y? If so, we have a new hypothesis worth causal testing. This playbook is broadly applicable whenever three conditions are met.
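
To make the playbook concrete, the following is a minimal, self-contained sketch on synthetic data. It is our stylization rather than the paper’s implementation: logistic regression stands in for m(x), PCA stands in for the unsupervised model p(x), and morphing is a simple gradient move in the latent space (the paper uses a deep network and a GAN).

```python
# A toy version of the playbook (our stylization, not the authors' code).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic high-dimensional "instances"; the outcome depends on coordinate 0.
X = rng.normal(size=(2000, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=2000) > 0).astype(int)

m = LogisticRegression().fit(X, y)   # step 1: predictor m(x)
p = PCA(n_components=5).fit(X)       # step 2: model of the data distribution p(x)

def morph(x, direction, steps=20, lr=0.5):
    """Step 3: move x within p's latent space so m's prediction rises or falls,
    yielding a counterfactual instance that stays near the data manifold."""
    z = p.transform(x.reshape(1, -1))
    w = m.coef_ @ p.components_.T    # gradient of the score with respect to z
    for _ in range(steps):
        z = z + direction * lr * w / np.linalg.norm(w)
    return p.inverse_transform(z).ravel()

lo, hi = morph(X[0], direction=-1), morph(X[0], direction=+1)
print(m.predict_proba(lo.reshape(1, -1))[0, 1],   # low-likelihood counterfactual
      m.predict_proba(hi.reshape(1, -1))[0, 1])   # high-likelihood counterfactual
# Steps 4 and 5 -- having subjects name the difference between morphed pairs and
# having independent raters label instances for the named feature -- need humans.
```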

The first condition is that we have a behavior we can statistically predict. The application we examine here fits because the behavior is clearly defined and measured for many cases. A study of, say, human creativity would be more challenging because it is not clear that it can be measured ( Said-Metwaly, Van den Noortgate, and Kyndt 2017 ). A study of why U.S. presidents use nuclear weapons during wartime would be challenging because there have been so few cases.

The second condition relates to what input data are available to predict behavior. Our procedure is likely to add only modest value in applications where we only have traditional structured variables, because those structured variables already make sense to people. Moreover the structured variables are usually already hypothesized to affect different behaviors, which is why economists ask about them on surveys. Our procedure will be more helpful with unstructured, high-dimensional data like images, language, and time series. The deeper point is that the collection of such high-dimensional data is often incidental to the scientific enterprise. We have images because the justice system photographs defendants during booking. Schools collect text from students as part of required assignments. Cellphones create location data as part of cell tower “pings.” These high-dimensional data implicitly contain an endless number of “features.”

Such high-dimensional data have already been found to predict outcomes in many economically relevant applications. Student essays predict graduation. Newspaper text predicts political slant of writers and editors. Federal Open Market Committee notes predict asset returns or volatility. X-ray images or EKG results predict doctor diagnoses (or misdiagnoses). Satellite images predict the income or health of a place. Many more relationships like these remain to be explored. From such prediction models, one could readily imagine human inspection of morphs leading to novel features. For example, suppose high-frequency data on volume and stock prices are used to predict future excess returns, for example, to understand when the market over- or undervalues a stock. Morphs of these time series might lead us to discover the kinds of price paths that produce overreaction. After all, some investors have even named such patterns (e.g., “head and shoulders,” “double bottom”) and trade on them.

The final condition is to be able to morph the input data to create new cases that differ in the predicted outcome. This requires some unsupervised learning technique to model the data distribution. The good news is that a number of such techniques are now available that work well with different types of high-dimensional data. We happen to use GANs here because they work well with images. But our procedure can accommodate a variety of unsupervised models. For example, for text we could use methods like Bidirectional Encoder Representations from Transformers (Devlin et al. 2018), and for time series we could use variational auto-encoders (Kingma and Welling 2013).
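
Whichever generative model is used, the morphing step only requires that it map instances to a latent code and back. A hypothetical sketch of that interface (the names are ours, not from the paper):

```python
# Hypothetical interface for the unsupervised model used in morphing (our naming).
from typing import Protocol
import numpy as np

class LatentModel(Protocol):
    def encode(self, x: np.ndarray) -> np.ndarray: ...  # instance -> latent code z
    def decode(self, z: np.ndarray) -> np.ndarray: ...  # latent code z -> instance

def morph_step(x: np.ndarray, model: LatentModel, grad_z, lr: float = 0.1):
    """One morph step: nudge the latent code along the predictor's gradient,
    then decode back to a realistic instance. A GAN generator, a VAE, or a
    BERT-style autoencoder could each play the role of `model`."""
    z = model.encode(x)
    return model.decode(z + lr * grad_z(z))
```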

An open question is the degree to which our experimental pipeline could be changed by new technologies, and in particular by recent innovations in generative modeling. For example, several recent models allow people to create new synthetic images from text descriptions, and so could perhaps (eventually) provide alternative approaches to the creation of counterfactual instances. 73 Similarly, recent generative language models appear to be able to process images (e.g., GPT-4), although they have only recently become publicly available. There is inevitably some uncertainty in forecasting what those tools will be able to do in the future, but they seem unlikely to help with the first stage of our procedure’s pipeline—building a predictive model of some behavior of interest. To see why, notice that methods like GPT-4 are unlikely to have access to data on judge decisions linked to mug shots. The stage of our pipeline where GPT-4 could potentially help is in substituting for humans in “naming” the contrasts between the morphed pairs of counterfactual instances. Though speculative, such innovations potentially allow for more of the hypothesis generation procedure to be automated. We leave the exploration of these possibilities to future work.

Finally, it is worth emphasizing that hypothesis generation is not hypothesis testing. Each follows its own logic, and one procedure should not be expected to do both. Each requires different methods and approaches. What is needed to creatively produce new hypotheses is different from what is needed to carefully test a given hypothesis. Testing is about the curation of data, an effort to compare comparable subsets from the universe of all observations. But the carefully controlled experiment’s focus on isolating the role of a single prespecified factor limits the ability to generate new hypotheses. Generation is instead about bringing as much data to bear as possible, since the algorithm can only consider signal within the data available to it. The more diverse the data sources, the more scope for discovery. An algorithm could have discovered judge decisions are influenced by football losses, as in Eren and Mocan (2018) , but only if we thought to merge court records with massive archives of news stories as for example assembled by Leskovec, Backstrom, and Kleinberg (2009) . For generating ideas, creativity in experimental design useful for testing is replaced with creativity in data assembly and merging.

More generally, we hope to raise interest in the curious asymmetry we began with. Idea generation need not remain such an idiosyncratic or nebulous process. Our framework hopefully illustrates that this process can also be modeled. Our results illustrate that such activity could bear actual empirical fruit. At a minimum, these results will hopefully spur more theoretical and empirical work on hypothesis generation rather than leave this as a largely “prescientific” activity.

This is a revised version of Chicago Booth working paper 22-15 “Algorithmic Behavioral Science: Machine Learning as a Tool for Scientific Discovery.” We gratefully acknowledge support from the Alfred P. Sloan Foundation, Emmanuel Roman, and the Center for Applied Artificial Intelligence at the University of Chicago, and we thank Stephen Billings for generously sharing data. For valuable comments we thank Andrei Shleifer, Larry Katz, and five anonymous referees, as well as Marianne Bertrand, Jesse Bruhn, Steven Durlauf, Joel Ferguson, Emma Harrington, Supreet Kaur, Matteo Magnaricotte, Dev Patel, Betsy Levy Paluck, Roberto Rocha, Evan Rose, Suproteem Sarkar, Josh Schwartzstein, Nick Swanson, Nadav Tadelis, Richard Thaler, Alex Todorov, Jenny Wang, and Heather Yang, plus seminar participants at Bocconi, Brown, Columbia, ETH Zurich, Harvard, the London School of Economics, MIT, Stanford, the University of California Berkeley, the University of Chicago, the University of Pennsylvania, the University of Toronto, the 2022 Behavioral Economics Annual Meetings, and the 2022 NBER Summer Institute. For invaluable assistance with the data and analysis we thank Celia Cook, Logan Crowl, Arshia Elyaderani, and especially Jonas Knecht and James Ross. This research was reviewed by the University of Chicago Social and Behavioral Sciences Institutional Review Board (IRB20-0917) and deemed exempt because the project relies on secondary analysis of public data sources. All opinions and any errors are our own.

The question of hypothesis generation has been a vexing one in philosophy, as it appears to follow a process distinct from deduction and has been sometimes called “abduction” (see Schickore 2018 for an overview). A fascinating economic exploration of this topic can be found in Heckman and Singer (2017) , which outlines a strategy for how economists should proceed in the face of surprising empirical results. Finally, there is a small but growing literature that uses machine learning in science. In the next section we discuss how our approach is similar in some ways and different in others.

See Einav and Levin (2014) , Varian (2014) , Athey (2017) , Mullainathan and Spiess (2017) , Gentzkow, Kelly, and Taddy (2019) , and Adukia et al. (2023) on how these changes can affect economics.

In practice, there are a number of additional nuances, as discussed in Section III.A and Online Appendix A.A .

This is calculated for some of the most commonly used measures of predictive accuracy, area under the curve (AUC) and R², recognizing that different measures could yield somewhat different shares of variation explained. We emphasize the word predictable here: past work has shown that judges are “noisy” and decisions are hard to predict (Kahneman, Sibony, and Sunstein 2022). As a consequence, a predictive model of the judge can do better than the judge themselves (Kleinberg et al. 2018).

In Section IV.B , we examine whether the mug shot’s predictive power can be explained by underlying risk differences. There, we tentatively conclude that the predictive power of the face likely reflects judicial error, but that working assumption is not essential to either our results or the ultimate goal of the article: uncovering hypotheses for later careful testing.

For reviews of the interpretability literature, see Doshi-Velez and Kim (2017) and Marcinkevičs and Vogt (2020) .

See Liu et al. (2019) , Narayanaswamy et al. (2020) , Lang et al. (2021) , and Ghandeharioun et al. (2022) .

For example, if every dog photo in a given training data set had been taken outdoors and every cat photo was taken indoors, the algorithm might learn what animal is in the image based in part on features of the background, which would lead the algorithm to perform poorly in a new data set of more representative images.

For example, for canonical computer science applications like image classification (does this photo contain an image of a dog or of a cat?), predictive accuracy (AUC) can be on the order of 0.99. In contrast, our model of judge decisions using the face only achieves an AUC of 0.625.

Of course even if the hypotheses that are generated are the result of idiosyncratic creativity, this can still be useful. For example, Swanson (1986, 1988) generated two novel medical hypotheses: the possibility that magnesium affects migraines and that fish oil may alleviate Raynaud’s syndrome.

Conversely, given a data set, our procedure has a built-in advantage: one could imagine a huge number of hypotheses that, while possible, are not especially useful because they are not measurable. Our procedure is by construction guaranteed to generate hypotheses that are measurable in a data set.

For additional discussion, see Ludwig and Mullainathan (2023a) .

For example, isolating the causal effects of gender on labor market outcomes is a daunting task, but the clever test in Goldin and Rouse (2000) overcomes the identification challenges by using variation in screening of orchestra applicants.

See the clever paper by Grogger and Ridgeway (2006) that uses this source of variation to examine this question.

This is related to what Autor (2014) called “Polanyi’s paradox,” the idea that people’s understanding of how the world works is beyond our capacity to explicitly describe it. For discussions in psychology about the difficulty for people to access their own cognition, see Wilson (2004) and Pronin (2009) .

Consider a simple example. Suppose x = (x_1, …, x_k) is a k-dimensional binary vector, all possible values of x are equally likely, and the true function in nature relating x to y only depends on the first dimension of x, so the function h_1 is the only true hypothesis and the only empirically plausible hypothesis. Even with such a simple true hypothesis, people can generate nonplausible hypotheses. Imagine a pair of data points (x^0, 0) and (x^1, 1). Since the data distribution is uniform, x^0 and x^1 will differ on k/2 dimensions in expectation. A person looking at only one pair of observations would have a high chance of generating an empirically implausible hypothesis. Looking at more data, the probability of discovering an implausible hypothesis declines. But the problem remains.
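
A quick simulation makes this concrete (our illustration, under the assumption that the person hypothesizes about one randomly chosen coordinate on which the pair differs):

```python
# Simulating the footnote's example: how often does a single (y=0, y=1) pair
# suggest a spurious dimension? (Our illustration, not from the paper.)
import numpy as np

rng = np.random.default_rng(0)
k, n_sims = 20, 10_000
implausible = 0
for _ in range(n_sims):
    x0 = rng.integers(0, 2, size=k); x0[0] = 0   # y depends on dim 0, so y0 = 0
    x1 = rng.integers(0, 2, size=k); x1[0] = 1   # and y1 = 1
    diff = np.flatnonzero(x0 != x1)              # ~k/2 coordinates in expectation
    if rng.choice(diff) != 0:                    # guessed a spurious dimension
        implausible += 1
print(implausible / n_sims)   # ~0.9: most single-pair hypotheses are implausible
```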

Some canonical references include Breiman et al. (1984) , Breiman (2001) , Hastie et al. (2009) , and Jordan and Mitchell (2015) . For discussions about how machine learning connects to economics, see Belloni, Chernozhukov, and Hansen (2014) , Varian (2014) , Mullainathan and Spiess (2017) , Athey (2018) , and Athey and Imbens (2019) .

Of course there is not always a predictive signal in any given data application. But that is equally an issue for human hypothesis generation. At least with machine learning, we have formal procedures for determining whether there is any signal that holds out of sample.

The intuition here is quite straightforward. If two predictor variables are highly correlated, the weight that the algorithm puts on one versus the other can change from one draw of the data to the next depending on the idiosyncratic noise in the training data set, but since the variables are highly correlated, the predicted outcome values themselves (hence predictive accuracy) can be quite stable.
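
A small simulation (ours, purely illustrative) makes the point: with two nearly collinear predictors, estimated weights swing wildly across training draws while the predictions barely move.

```python
# Unstable coefficients, stable predictions under near-collinearity (our sketch).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
coefs, preds = [], []
x_test = np.array([[1.0, 1.0]])
for _ in range(200):
    x1 = rng.normal(size=500)
    x2 = x1 + 0.05 * rng.normal(size=500)        # corr(x1, x2) close to 1
    y = x1 + rng.normal(size=500)
    fit = LinearRegression().fit(np.column_stack([x1, x2]), y)
    coefs.append(fit.coef_)
    preds.append(fit.predict(x_test)[0])
print(np.std(coefs, axis=0))   # large: weight swings between the two variables
print(np.std(preds))           # small: predicted values stay stable
```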

See Online Appendix Figure A.I , which shows the top nine eigenfaces for the data set we describe below, which together explain |$62\%$| of the variation.

Examples of applications of this type include Carleo et al. (2019) , He et al. (2019) , Davies et al. (2021) , Jumper et al. (2021) , and Pion-Tonachini et al. (2021) .

As other examples, researchers have found that retinal images alone can unexpectedly predict gender of patient or macular edema ( Narayanaswamy et al. 2020 ; Korot et al. 2021 ).

Sheetal, Feng, and Savani (2020) use machine learning to determine which of the long list of other survey variables collected as part of the World Values Survey best predict people’s support for unethical behavior. This application sits somewhat in between an investigator-generated hypothesis and the development of an entirely new hypothesis, in the sense that the procedure can only choose candidate hypotheses for unethical behavior from the set of variables the World Values Survey investigators thought to include on their questionnaire.

Closest is Miller et al. (2019) , which morphs EKG output but stops at the point of generating realistic morphs and does not carry this through to generating interpretable hypotheses.

Additional details about how the system works are found in Online Appendix A .

For Black non-Hispanics, the figures for Mecklenburg County versus the United States were 33.3% versus 13.6%. See https://www.census.gov/programs-surveys/sis/resources/data-tools/quickfacts.html .

Details on how we operationalize these variables are found in Online Appendix A .

The mug shot seems to have originated in Paris in the 1800s ( https://law.marquette.edu/facultyblog/2013/10/a-history-of-the-mug-shot/ ). The etymology of the term is unclear, possibly based on “mug” as slang for either the face or an “incompetent person” or “sucker” since only those who get caught are photographed by police ( https://www.etymonline.com/word/mug-shot ).

See https://mecksheriffweb.mecklenburgcountync.gov/ .

We partition the data by arrestee, not arrest, to ensure people show up in only one of the partitions to avoid inadvertent information “leakage” across data partitions.

As the Online Appendix  tables show, while there are some changes to a few of the coefficients that relate the algorithm’s predictions to factors known from past research to shape human decisions, the core findings and conclusions about the importance of the defendant’s appearance and the two specific novel facial features we identify are similar.

Using the data on arrests up to July 17, 2019, we randomly reassign arrestees to three groups of similar size to our training, validation, and lock-box hold-out data sets, convert the data to long format (with one row for each arrest-and-variable), and calculate an F-test statistic for the joint null hypothesis that the differences in baseline characteristics are all zero, clustering standard errors by arrestee. We store that F-test statistic, rerun this procedure 1,000 times, and then report the share of splits with an F-statistic larger than the one observed for the original data partition.
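
Schematically, the randomization test looks like the following (our sketch; the actual implementation reassigns whole arrestees rather than rows and clusters standard errors by arrestee):

```python
# Schematic permutation balance test (our simplification of the procedure).
import numpy as np

def permutation_balance_pvalue(balance_stat, covariates, split_labels,
                               n_splits=1000, seed=0):
    """Share of random re-partitions whose joint balance F-statistic exceeds
    the one observed for the original train/validation/lock-box partition.
    `balance_stat` computes the F-statistic for the null that all baseline
    differences across groups are zero."""
    rng = np.random.default_rng(seed)
    observed = balance_stat(covariates, split_labels)
    draws = [balance_stat(covariates, rng.permutation(split_labels))
             for _ in range(n_splits)]
    return float(np.mean([d > observed for d in draws]))
```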

For an example HIT task, see Online Appendix Figure A.II .

For age and skin tone, we calculated the average pairwise correlation between two labels sampled (without replacement) from the 10 possibilities, repeated across different random pairs. The Pearson correlation was 0.765 for skin tone, 0.741 for age, and 0.789 between age-assigned labels and administrative data. The correlation between the average of the first k labels collected and the (k + 1)th label is not all that much higher for k = 9 than for k = 1 (0.837 versus 0.733).
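
These reliability diagnostics are simple to compute; a sketch, assuming the labels for one feature are stored as the columns of an (images × raters) array:

```python
# Label-reliability diagnostics (our reconstruction of the computations).
import numpy as np

def avg_pairwise_corr(labels, n_draws=1000, seed=0):
    """labels: (n_images, 10) array of subject ratings for one feature.
    Average correlation between two labels sampled without replacement."""
    rng = np.random.default_rng(seed)
    cors = []
    for _ in range(n_draws):
        i, j = rng.choice(labels.shape[1], size=2, replace=False)
        cors.append(np.corrcoef(labels[:, i], labels[:, j])[0, 1])
    return float(np.mean(cors))

def next_label_corr(labels, k):
    """Correlation between the mean of the first k labels and the (k+1)-th."""
    return float(np.corrcoef(labels[:, :k].mean(axis=1), labels[:, k])[0, 1])
```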

For an example of the consent form and instructions given to labelers, see Online Appendix Figures A.IV and A.V .

We actually collected at least three and at least five, but the averages turned out to be very close to the minimums, equal to 3.17 and 5.07, respectively.

For example, in Oosterhof and Todorov (2008) , Supplemental Materials Table S2, they report Cronbach’s α values of 0.95 for attractiveness, and 0.93 for both trustworthy and dominant.

See Online Appendix Figure A.VIII , which shows that the change in the correlation between the ( k + 1)th label with the mean of the first k labels declines after three labels.

For an example, see Online Appendix Figure A.IX .

We use the validation data set to estimate β̂ and then evaluate the accuracy of m_p(x). Although this could lead to overfitting in principle, since we are only estimating a single parameter, this does not matter much in practice; we get very similar results if we randomly partition the validation data set by arrestee, use a random 30% of the validation data set to estimate the weights, then measure predictive performance in the other random 70% of the validation data set.

The mean squared error for a linear probability model’s predictions is related to the Brier score (Brier 1950). For a discussion of how this relates to AUC and calibration, see Murphy (1973).

Note how this comparison helps mitigate the problem that police arrest decisions could depend on a person’s face. When we regress rearrest against the mug shot, that estimated coefficient may be heavily influenced by how police arrest decisions respond to the defendant’s appearance. In contrast when we regress judge detention decisions against predicted rearrest risk, some of the variation across defendants in rearrest risk might come from the effect of the defendant’s appearance on the probability a police officer makes an arrest, but a great deal of the variation in predicted risk presumably comes from people’s behavior.

The average mug shot–predicted detention risk for the bottom and top quartiles equal 0.127 and 0.332; that difference times 2.880 implies a rearrest risk difference of 59.0 percentage points. By way of comparison, the difference in rearrest risk between those who are arrested for a felony crime rather than a less serious misdemeanor crime is equal to just 7.8 percentage points.

In our main exhibits, we impose a simple linear relationship between the algorithm’s predicted detention risk and known facial features like age or psychological variables, for ease of presentation. We show our results are qualitatively similar with less parametric specifications in Online Appendix Tables A.VI, A.VII, and A.VIII .

With a coefficient value of 0.0006 on age (measured in years), the algorithm tells us that even a full decade’s difference in age has 5% the impact on detention likelihood compared to the effects of gender (10 × 0.0006 = 0.6 percentage point higher likelihood of detention, versus 11.9 percentage points).

Online Appendix Table A.V shows that Hispanic ethnicity, which we measure from subject ratings from looking at mug shots, is not statistically significantly related to the algorithm’s predictions. Table II, column (2) showed that conditional on gender, Black defendants have slightly higher predicted detention odds than white defendants (0.3 percentage points), but this is not quite significant (t = 1.3). Online Appendix Table A.V, column (1) shows that conditioning on Hispanic ethnicity and having stereotypically Black facial features—as measured in Eberhardt et al. (2006)—increases the size of the Black-white difference in predicted detention odds (now equal to 0.8 percentage points) as well as the difference’s statistical significance (t = 2.2).

This comes from multiplying the effect of each one-unit change on our 9-point scale (equal to 0.55, 0.91, and 0.48 percentage points, respectively) by the standard deviation of the average label for each psychological feature for each image (equal to 0.923, 0.911, and 0.844, respectively).

As discussed in Online Appendix Table A.III, we offer subjects a $3.00 base rate for participation plus an incentive of 5 cents per correct guess. With 50 image pairs shown to each participant, they could increase their earnings by another $2.50, or up to 83% above the base compensation.

Table III gives us another way to see how much of previously known features are rediscovered by the algorithm. That the algorithm’s prediction plus all previously known features yields an R² of just 0.0380 (column (7)), not much larger than with the algorithm alone, suggests the algorithm has discovered most of the signal in these known features. But not necessarily all: these other known features often do remain statistically significant predictors of judges’ decisions even after controlling for the algorithm’s predictions (last column). One possible reason is that, given finite samples, the algorithm has only imperfectly reconstructed factors such as “age” or “human guess.” Controlling for these factors directly adds additional signal.

Imagine a linear prediction function like m(x_1, x_2) = β̂_1 x_1 + β̂_2 x_2. If our best estimates suggested β̂_2 = 0, the maximum change to the prediction comes from incrementally changing x_1.

As noted already, to avoid contributing to the stereotyping of minorities in discussions of crime, in our exhibits we show images for non-Hispanic white men, although in our HITs we use images representative of the larger defendant population.

Modeling p ( x ) through a supervised learning task would involve assembling a large set of images, having subjects label each image for whether they contain a realistic face, and then predicting those labels using the image pixels as inputs. But this supervised learning approach is costly because it requires extensive annotation of a large training data set.

Kaji, Manresa, and Pouliot (2020) and Athey et al. (2021 , 2022) are recent uses of GANs in economics.

Some ethical issues are worth considering. One is bias. With human hypothesis generation there is the risk people “see” an association that impugns some group yet has no basis in fact. In contrast our procedure by construction only produces empirically plausible hypotheses. A different concern is the vulnerability of deep learning to adversarial examples: tiny, almost imperceptible changes in an image changing its classification for the outcome y , so that mug shots that look almost identical (that is, are very “similar” in some visual image metric) have dramatically different m ( x ). This is a problem because tiny changes to an image don’t change the nature of the object; see Szegedy et al. (2013) and Goodfellow, Shlens, and Szegedy (2014) . In practice such instances are quite rare in nature, indeed, so rare they usually occur only if intentionally (maliciously) generated.

Online Appendix Figure A.XII gives an example of this task and the instructions given to participating subjects to complete it. Each subject was tested on 50 image pairs selected at random from a population of 100 images. Subjects were told that for every pair, one image was higher in some unknown feature, but not given details as to what the feature might be. As in the exercise for predicting detention, feedback was given immediately after selecting an image, and a 5 cent bonus was paid for every correct answer.

In principle this semantic grouping could be carried out in other ways, for example, with automated procedures involving natural-language processing.
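
For instance, one such automated variant (our sketch, not a procedure used in the paper) would vectorize the free-text comments and cluster them:

```python
# Automated semantic grouping of subject comments (our illustrative alternative).
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def group_comments(comments, n_groups=20, seed=0):
    """Cluster free-text comments into semantic groups via TF-IDF + k-means."""
    X = TfidfVectorizer(stop_words="english").fit_transform(comments)
    return KMeans(n_clusters=n_groups, random_state=seed, n_init=10).fit_predict(X)
```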

See Online Appendix Table A.III for a high-level description of this human intelligence task, and Online Appendix Figure A.XIV for a sample of the task and the subject instructions.

We drop every token of just one or two characters in length, connector words without real meaning for this purpose, like “had,” “the,” and “and,” and words that are relevant to our exercise but generic, like “jailed,” “judge,” and “image.”
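
A minimal version of this token-cleaning step might look as follows (the stopword list here is abbreviated and illustrative, not the paper’s full list):

```python
# Token cleaning for subject comments (abbreviated, illustrative stopword list).
STOPWORDS = {"had", "the", "and", "jailed", "judge", "image"}

def clean_tokens(comment: str) -> list[str]:
    tokens = [t.strip(".,!?;:") for t in comment.lower().split()]
    return [t for t in tokens
            if len(t) > 2              # drop one- and two-character tokens
            and t not in STOPWORDS]    # drop connectors and generic task words

print(clean_tokens("The judge jailed him and he had a scruffy beard"))
# ['him', 'scruffy', 'beard']
```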

We enlisted three research assistants blinded to the findings of this study and asked them to come up with semantic categories that captured all subject comments. Since each assistant mapped each subject comment to 5% of semantic categories on average, if the assistant mappings were totally uncorrelated, we would expect to see agreement of at least two assistant categorizations about 5% of the time. What we actually see is that if one research assistant made an association, 60% of the time another assistant would make the same association. We assign a comment to a semantic category when at least two of the assistants agree on the categorization.

Moreover, what subjects see does not seem to be particularly sensitive to which images they see. (As a reminder, each subject sees 50 morphed image pairs randomly selected from a larger bank of 100 morphed image pairs.) If we start with a subject who says they saw “well-groomed” in the morphed image pairs they saw, then among other subjects who saw 21 or fewer images in common (so saw mostly different images), 31% also report seeing well-groomed, versus 35% among the population. We select the threshold of 21 images because this is the smallest threshold at which at least 50 pairs of raters are considered.

See Online Appendix Table A.III and Online Appendix Figure A.XVI . This comes to a total of 192,280 individual labels, an average of 3.2 labels per image in the training set and an average of 10.8 labels per image in the validation set. Sampling labels from different workers on the same image, these ratings have a correlation of 0.14.

It turns out that skin tone is another feature that is correlated with well-groomed, so we orthogonalize against it as well. To simplify the discussion, we use “well-groomed” as a stand-in for both features we orthogonalize against, well-groomed plus skin tone.

To see why, consider the mechanics of the procedure. Since we orthogonalize as we create morphs, we would need labels at each morphing step. This would entail us producing candidate steps (new morphs), collecting data on each of the candidates, picking one that has the same well-groomed value, and then repeating. Moreover, until the labels are collected at a given step, the next step could not be taken. Since producing a final morph requires hundreds of such intermediate morphing steps, the whole process would be so time- and resource-consuming as to be infeasible.
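
What is feasible instead is to hold a model’s predicted well-groomed score fixed at each step (the approach described next). A stylized version of one such latent-space step (our simplification, assuming gradient functions for predicted detention risk and predicted well-groomedness are available):

```python
# One orthogonalized morphing step (our stylization of the procedure).
import numpy as np

def orthogonalized_morph_step(z, grad_risk, grad_groomed, lr=0.1):
    """Move the latent code z to raise predicted detention risk while holding
    predicted well-groomed (approximately) constant: project the risk gradient
    off the well-groomed gradient before stepping."""
    g, h = grad_risk(z), grad_groomed(z)
    g_perp = g - (g @ h) / (h @ h) * h   # remove the well-groomed component
    return z + lr * g_perp / np.linalg.norm(g_perp)
```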

While we can predict demographic features like race and age (above/below median age) nearly perfectly, with AUC values close to 1, for predicting well-groomed the mean absolute error of our out-of-sample prediction is 0.63, which is more than half a slider step on this 9-point scale. One reason well-groomed is harder to predict is that the labels, which come from human subjects looking at and labeling mug shots, are themselves noisy, which introduces irreducible error.

For additional details see Online Appendix Figure A.XVII and Online Appendix B .

There are a few additional technical steps required, discussed in Online Appendix B . For details on the HIT we use to get subjects to name the new hypothesis from looking at orthogonalized morphs, and the follow-up HIT to generate independent labels for that new hypothesis or facial feature, see Online Appendix Table A.III .

See Online Appendix Figure A.XIX .

The adjusted R² from including the practitioner forecasts plus well-groomed and heavy-faced together (column (3), equal to 0.0246) is not that different from the sum of the R² values from including just the practitioner forecasts (0.0165 in column (1)) plus that from including just well-groomed and heavy-faced (equal to 0.0131 in Table VII, column (2)).

In Online Appendix Table A.IX we show that controlling for one obvious indicator of a substance abuse issue—arrest for drugs—does not seem to substantially change the relationship between heavy-faced or well-groomed and the predicted detention decision. Online Appendix Tables A.X and A.XI show a qualitatively similar pattern of results for the defendant’s mental health and socioeconomic status, which we measure by getting a separate sample of subjects to independently rate validation–data set mug shots. We see qualitatively similar results when the dependent variable is the actual rather than the predicted judge decision; see Online Appendix Tables A.XIII, A.XIV, and A.XV.

Characteristics of having a baby face included large eyes, narrow chin, small nose, and high, raised eyebrows. For a discussion of some of the larger literature on how that feature shapes the reactions of other people generally, see Zebrowitz et al. (2009) .

For additional details, see Online Appendix C .

See https://www.nolo.com/covid-19/virtual-criminal-court-appearances-in-the-time-of-the-covid-19.html .

See https://stablediffusionweb.com/ and https://openai.com/product/dall-e-2 .

The data underlying this article are available in the Harvard Dataverse, https://doi.org/10.7910/DVN/ILO46V ( Ludwig and Mullainathan 2023b ).

Adukia   Anjali , Eble   Alex , Harrison   Emileigh , Birali Runesha   Hakizumwami , Szasz   Teodora , “ What We Teach about Race and Gender: Representation in Images and Text of Children’s Books ,” Quarterly Journal of Economics , 138 ( 2023 ), 2225 – 2285 . https://doi.org/10.1093/qje/qjad028

Angelova   Victoria , Dobbie   Will S. , Yang   Crystal S. , “ Algorithmic Recommendations and Human Discretion ,” NBER Working Paper no. 31747, 2023 . https://doi.org/10.3386/w31747

Arnold   David , Dobbie   Will S. , Hull   Peter , “ Measuring Racial Discrimination in Bail Decisions ,” NBER Working Paper no. 26999, 2020.   https://doi.org/10.3386/w26999

Arnold   David , Dobbie   Will , Yang   Crystal S. , “ Racial Bias in Bail Decisions ,” Quarterly Journal of Economics , 133 ( 2018 ), 1885 – 1932 . https://doi.org/10.1093/qje/qjy012

Athey   Susan , “ Beyond Prediction: Using Big Data for Policy Problems ,” Science , 355 ( 2017 ), 483 – 485 . https://doi.org/10.1126/science.aal4321

Athey   Susan , “ The Impact of Machine Learning on Economics ,” in The Economics of Artificial Intelligence: An Agenda , Ajay Agrawal, Joshua Gans, and Avi Goldfarb, eds. (Chicago: University of Chicago Press , 2018 ), 507 – 547 .

Athey   Susan , Imbens   Guido W. , “ Machine Learning Methods That Economists Should Know About ,” Annual Review of Economics , 11 ( 2019 ), 685 – 725 . https://doi.org/10.1146/annurev-economics-080217-053433

Athey   Susan , Imbens   Guido W. , Metzger   Jonas , Munro   Evan , “ Using Wasserstein Generative Adversarial Networks for the Design of Monte Carlo Simulations ,” Journal of Econometrics , ( 2021 ), 105076. https://doi.org/10.1016/j.jeconom.2020.09.013

Athey   Susan , Karlan   Dean , Palikot   Emil , Yuan   Yuan , “ Smiles in Profiles: Improving Fairness and Efficiency Using Estimates of User Preferences in Online Marketplaces ,” NBER Working Paper no. 30633 , 2022 . https://doi.org/10.3386/w30633

Autor   David , “ Polanyi’s Paradox and the Shape of Employment Growth ,” NBER Working Paper no. 20485 , 2014 . https://doi.org/10.3386/w20485

Avitzour   Eliana , Choen   Adi , Joel   Daphna , Lavy   Victor , “ On the Origins of Gender-Biased Behavior: The Role of Explicit and Implicit Stereotypes ,” NBER Working Paper no. 27818 , 2020 . https://doi.org/10.3386/w27818

Baehrens   David , Schroeter   Timon , Harmeling   Stefan , Kawanabe   Motoaki , Hansen   Katja , Müller   Klaus-Robert , “ How to Explain Individual Classification Decisions ,” Journal of Machine Learning Research , 11 ( 2010 ), 1803 – 1831 .

Baltrušaitis   Tadas , Ahuja   Chaitanya , Morency   Louis-Philippe , “ Multimodal Machine Learning: A Survey and Taxonomy ,” IEEE Transactions on Pattern Analysis and Machine Intelligence , 41 ( 2019 ), 423 – 443 . https://doi.org/10.1109/TPAMI.2018.2798607

Begall   Sabine , Červený   Jaroslav , Neef   Julia , Vojtěch   Oldřich , Burda   Hynek , “ Magnetic Alignment in Grazing and Resting Cattle and Deer ,” Proceedings of the National Academy of Sciences , 105 ( 2008 ), 13451 – 13455 . https://doi.org/10.1073/pnas.0803650105

Belloni   Alexandre , Chernozhukov   Victor , Hansen   Christian , “ High-Dimensional Methods and Inference on Structural and Treatment Effects ,” Journal of Economic Perspectives , 28 ( 2014 ), 29 – 50 . https://doi.org/10.1257/jep.28.2.29

Berry   Diane S. , Zebrowitz-McArthur   Leslie , “ What’s in a Face? Facial Maturity and the Attribution of Legal Responsibility ,” Personality and Social Psychology Bulletin , 14 ( 1988 ), 23 – 33 . https://doi.org/10.1177/0146167288141003

Bertrand   Marianne , Mullainathan   Sendhil , “ Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination ,” American Economic Review , 94 ( 2004 ), 991 – 1013 . https://doi.org/10.1257/0002828042002561

Bjornstrom   Eileen E. S. , Kaufman   Robert L. , Peterson   Ruth D. , Slater   Michael D. , “ Race and Ethnic Representations of Lawbreakers and Victims in Crime News: A National Study of Television Coverage ,” Social Problems , 57 ( 2010 ), 269 – 293 . https://doi.org/10.1525/sp.2010.57.2.269

Breiman   Leo , “ Random Forests ,” Machine Learning , 45 ( 2001 ), 5 – 32 . https://doi.org/10.1023/A:1010933404324

Breiman   Leo , Friedman   Jerome H. , Olshen   Richard A. , Stone   Charles J. , Classification and Regression Trees (London: Routledge , 1984 ). https://doi.org/10.1201/9781315139470

Brier   Glenn W. , “ Verification of Forecasts Expressed in Terms of Probability ,” Monthly Weather Review , 78 ( 1950 ), 1 – 3 . https://doi.org/10.1175/1520-0493(1950)078<0001:VOFEIT>2.0.CO;2

Carleo   Giuseppe , Cirac   Ignacio , Cranmer   Kyle , Daudet   Laurent , Schuld   Maria , Tishby   Naftali , Vogt-Maranto   Leslie , Zdeborová   Lenka , “ Machine Learning and the Physical Sciences ,” Reviews of Modern Physics , 91 ( 2019 ), 045002 . https://doi.org/10.1103/RevModPhys.91.045002

Chen   Daniel L. , Moskowitz   Tobias J. , Shue   Kelly , “ Decision Making under the Gambler’s Fallacy: Evidence from Asylum Judges, Loan Officers, and Baseball Umpires ,” Quarterly Journal of Economics , 131 ( 2016 ), 1181 – 1242 . https://doi.org/10.1093/qje/qjw017

Chen   Daniel L. , Philippe   Arnaud , “ Clash of Norms: Judicial Leniency on Defendant Birthdays ,” Journal of Economic Behavior & Organization , 211 ( 2023 ), 324 – 344 . https://doi.org/10.1016/j.jebo.2023.05.002

Dahl   Gordon B. , Knepper   Matthew M. , “ Age Discrimination across the Business Cycle ,” NBER Working Paper no. 27581 , 2020 . https://doi.org/10.3386/w27581

Davies   Alex , Veličković   Petar , Buesing   Lars , Blackwell   Sam , Zheng   Daniel , Tomašev   Nenad , Tanburn   Richard , Battaglia   Peter , Blundell   Charles , Juhász   András , Lackenby   Marc , Williamson   Geordie , Hassabis   Demis , Kohli   Pushmeet , “ Advancing Mathematics by Guiding Human Intuition with AI ,” Nature , 600 ( 2021 ), 70 – 74 . https://doi.org/10.1038/s41586-021-04086-x

Devlin   Jacob , Chang   Ming-Wei , Lee   Kenton , Toutanova   Kristina , “ BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding ,” arXiv preprint arXiv:1810.04805 , 2018 . https://doi.org/10.48550/arXiv.1810.04805

Dobbie   Will , Goldin   Jacob , Yang   Crystal S. , “ The Effects of Pretrial Detention on Conviction, Future Crime, and Employment: Evidence from Randomly Assigned Judges ,” American Economic Review , 108 ( 2018 ), 201 – 240 . https://doi.org/10.1257/aer.20161503

Dobbie   Will , Yang   Crystal S. , “ The US Pretrial System: Balancing Individual Rights and Public Interests ,” Journal of Economic Perspectives , 35 ( 2021 ), 49 – 70 . https://doi.org/10.1257/jep.35.4.49

Doshi-Velez   Finale , Kim   Been , “ Towards a Rigorous Science of Interpretable Machine Learning ,” arXiv preprint arXiv:1702.08608 , 2017 . https://doi.org/10.48550/arXiv.1702.08608

Eberhardt   Jennifer L. , Davies   Paul G. , Purdie-Vaughns   Valerie J. , Lynn Johnson   Sheri , “ Looking Deathworthy: Perceived Stereotypicality of Black Defendants Predicts Capital-Sentencing Outcomes ,” Psychological Science , 17 ( 2006 ), 383 – 386 . https://doi.org/10.1111/j.1467-9280.2006.01716.x

Einav   Liran , Levin   Jonathan , “ The Data Revolution and Economic Analysis ,” Innovation Policy and the Economy , 14 ( 2014 ), 1 – 24 . https://doi.org/10.1086/674019

Eren   Ozkan , Mocan   Naci , “ Emotional Judges and Unlucky Juveniles ,” American Economic Journal: Applied Economics , 10 ( 2018 ), 171 – 205 . https://doi.org/10.1257/app.20160390

Frieze   Irene Hanson , Olson   Josephine E. , Russell   June , “ Attractiveness and Income for Men and Women in Management ,” Journal of Applied Social Psychology , 21 ( 1991 ), 1039 – 1057 . https://doi.org/10.1111/j.1559-1816.1991.tb00458.x

Fryer   Roland G., Jr , “ An Empirical Analysis of Racial Differences in Police Use of Force: A Response ,” Journal of Political Economy , 128 ( 2020 ), 4003 – 4008 . https://doi.org/10.1086/710977

Fudenberg   Drew , Liang   Annie , “ Predicting and Understanding Initial Play ,” American Economic Review , 109 ( 2019 ), 4112 – 4141 . https://doi.org/10.1257/aer.20180654

Gentzkow   Matthew , Kelly   Bryan , Taddy   Matt , “ Text as Data ,” Journal of Economic Literature , 57 ( 2019 ), 535 – 574 . https://doi.org/10.1257/jel.20181020

Ghandeharioun   Asma , Kim   Been , Li   Chun-Liang , Jou   Brendan , Eoff   Brian , Picard   Rosalind W. , “ DISSECT: Disentangled Simultaneous Explanations via Concept Traversals ,” arXiv preprint arXiv:2105.15164   2022 . https://doi.org/10.48550/arXiv.2105.15164

Goldin   Claudia , Rouse   Cecilia , “ Orchestrating Impartiality: The Impact of ‘Blind’ Auditions on Female Musicians ,” American Economic Review , 90 ( 2000 ), 715 – 741 . https://doi.org/10.1257/aer.90.4.715

Goncalves   Felipe , Mello   Steven , “ A Few Bad Apples? Racial Bias in Policing ,” American Economic Review , 111 ( 2021 ), 1406 – 1441 . https://doi.org/10.1257/aer.20181607

Goodfellow   Ian , Pouget-Abadie   Jean , Mirza   Mehdi , Xu   Bing , Warde-Farley   David , Ozair   Sherjil , Courville   Aaron , Bengio   Yoshua , “ Generative Adversarial Nets ,” Advances in Neural Information Processing Systems , 27 ( 2014 ), 2672 – 2680 .

Goodfellow   Ian J. , Shlens   Jonathon , Szegedy   Christian , “ Explaining and Harnessing Adversarial Examples ,” arXiv preprint arXiv:1412.6572 , 2014 . https://doi.org/10.48550/arXiv.1412.6572

Grogger   Jeffrey , Ridgeway   Greg , “ Testing for Racial Profiling in Traffic Stops from Behind a Veil of Darkness ,” Journal of the American Statistical Association , 101 ( 2006 ), 878 – 887 . https://doi.org/10.1198/016214506000000168

Hastie   Trevor , Tibshirani   Robert , Friedman   Jerome H. , Friedman   Jerome H. , The Elements of Statistical Learning: Data Mining, Inference, and Prediction , vol. 2 (Berlin: Springer , 2009 ).

He   Siyu , Li   Yin , Feng   Yu , Ho   Shirley , Ravanbakhsh   Siamak , Chen   Wei , Póczos   Barnabás , “ Learning to Predict the Cosmological Structure Formation ,” Proceedings of the National Academy of Sciences , 116 ( 2019 ), 13825 – 13832 . https://doi.org/10.1073/pnas.1821458116

Heckman   James J. , Singer   Burton , “ Abducting Economics ,” American Economic Review , 107 ( 2017 ), 298 – 302 . https://doi.org/10.1257/aer.p20171118

Heyes   Anthony , Saberian   Soodeh , “ Temperature and Decisions: Evidence from 207,000 Court Cases ,” American Economic Journal: Applied Economics , 11 ( 2019 ), 238 – 265 . https://doi.org/10.1257/app.20170223

Hoekstra   Mark , Sloan   CarlyWill , “ Does Race Matter for Police Use of Force? Evidence from 911 Calls ,” American Economic Review , 112 ( 2022 ), 827 – 860 . https://doi.org/10.1257/aer.20201292

Hunter   Margaret , “ The Persistent Problem of Colorism: Skin Tone, Status, and Inequality ,” Sociology Compass , 1 ( 2007 ), 237 – 254 . https://doi.org/10.1111/j.1751-9020.2007.00006.x

Jordan   Michael I. , Mitchell   Tom M. , “ Machine Learning: Trends, Perspectives, and Prospects ,” Science , 349 ( 2015 ), 255 – 260 . https://doi.org/10.1126/science.aaa8415

Jumper   John , Evans   Richard , Pritzel   Alexander , Green   Tim , Figurnov   Michael , Ronneberger   Olaf , Tunyasuvunakool   Kathryn , Bates   Russ , Žídek   Augustin , Potapenko   Anna  et al.  , “ Highly Accurate Protein Structure Prediction with AlphaFold ,” Nature , 596 ( 2021 ), 583 – 589 . https://doi.org/10.1038/s41586-021-03819-2

Jung   Jongbin , Concannon   Connor , Shroff   Ravi , Goel   Sharad , Goldstein   Daniel G. , “ Simple Rules for Complex Decisions ,” SSRN working paper , 2017 . https://doi.org/10.2139/ssrn.2919024

Kahneman   Daniel , Sibony   Olivier , Sunstein   Cass R. , Noise (London: HarperCollins , 2022 ).

Kaji   Tetsuya , Manresa   Elena , Pouliot   Guillaume , “ An Adversarial Approach to Structural Estimation ,” University of Chicago, Becker Friedman Institute for Economics Working Paper No. 2020-144 , 2020 . https://doi.org/10.2139/ssrn.3706365

Kingma   Diederik P. , Welling   Max , “ Auto-Encoding Variational Bayes ,” arXiv preprint arXiv:1312.6114 , 2013 . https://doi.org/10.48550/arXiv.1312.6114

Kleinberg   Jon , Lakkaraju   Himabindu , Leskovec   Jure , Ludwig   Jens , Mullainathan   Sendhil , “ Human Decisions and Machine Predictions ,” Quarterly Journal of Economics , 133 ( 2018 ), 237 – 293 . https://doi.org/10.1093/qje/qjx032

Korot   Edward , Pontikos   Nikolas , Liu   Xiaoxuan , Wagner   Siegfried K. , Faes   Livia , Huemer   Josef , Balaskas   Konstantinos , Denniston   Alastair K. , Khawaja   Anthony , Keane   Pearse A. , “ Predicting Sex from Retinal Fundus Photographs Using Automated Deep Learning ,” Scientific Reports , 11 ( 2021 ), 10286 . https://doi.org/10.1038/s41598-021-89743-x

Lahat   Dana , Adali   Tülay , Jutten   Christian , “ Multimodal Data Fusion: An Overview of Methods, Challenges, and Prospects ,” Proceedings of the IEEE , 103 ( 2015 ), 1449 – 1477 . https://doi.org/10.1109/JPROC.2015.2460697

Lang   Oran , Gandelsman   Yossi , Yarom   Michal , Wald   Yoav , Elidan   Gal , Hassidim   Avinatan , Freeman   William T , Isola   Phillip , Globerson   Amir , Irani   Michal , et al.  , “ Explaining in Style: Training a GAN to Explain a Classifier in StyleSpace ,” paper presented at the IEEE/CVF International Conference on Computer Vision , 2021. https://doi.org/10.1109/ICCV48922.2021.00073

Leskovec   Jure , Backstrom   Lars , Kleinberg   Jon , “ Meme-Tracking and the Dynamics of the News Cycle ,” paper presented at the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, 2009. https://doi.org/10.1145/1557019.1557077

Little   Anthony C. , Jones   Benedict C. , DeBruine   Lisa M. , “ Facial Attractiveness: Evolutionary Based Research ,” Philosophical Transactions of the Royal Society B: Biological Sciences , 366 ( 2011 ), 1638 – 1659 . https://doi.org/10.1098/rstb.2010.0404

Liu   Shusen , Kailkhura   Bhavya , Loveland   Donald , Han   Yong , “ Generative Counterfactual Introspection for Explainable Deep Learning ,” paper presented at the IEEE Global Conference on Signal and Information Processing (GlobalSIP) , 2019. https://doi.org/10.1109/GlobalSIP45357.2019.8969491

Ludwig   Jens , Mullainathan   Sendhil , “ Machine Learning as a Tool for Hypothesis Generation ,” NBER Working Paper no. 31017 , 2023a . https://doi.org/10.3386/w31017



What Is Hypothesis Testing in Statistics? Types and Examples

By Avijeet Biswal

In today’s data-driven world, decisions are based on data all the time. Hypotheses play a crucial role in that process, whether in business decisions, the health sector, academia, or quality improvement. Without hypotheses and hypothesis tests, you risk drawing the wrong conclusions and making bad decisions. In this tutorial, you will look at hypothesis testing in statistics.

What Is Hypothesis Testing in Statistics?

Hypothesis Testing is a type of statistical analysis in which you put your assumptions about a population parameter to the test. It is used to estimate the relationship between two statistical variables.

Let's discuss a few examples of statistical hypotheses from real life:

  • A teacher assumes that 60% of his college's students come from lower-middle-class families.
  • A doctor believes that 3D (Diet, Dose, and Discipline) is 90% effective for diabetic patients.

Now that you know what a statistical hypothesis is, look at the formula and the main types of hypothesis testing in statistics.

Hypothesis Testing Formula

Z = ( x̅ – μ0 ) / (σ /√n)

  • Here, x̅ is the sample mean,
  • μ0 is the population mean,
  • σ is the standard deviation,
  • n is the sample size.
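To make the formula concrete, here is a minimal Python sketch of the same calculation (the function name z_statistic and the example numbers are illustrative, not from the original tutorial):

    import math

    def z_statistic(sample_mean, mu0, sigma, n):
        # Z = (x̄ - μ0) / (σ / √n) for a one-sample z-test.
        return (sample_mean - mu0) / (sigma / math.sqrt(n))

    # Hypothetical numbers: sample mean 52, hypothesized mean 50, σ = 5, n = 100.
    print(z_statistic(52, 50, 5, 100))  # 4.0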

How Does Hypothesis Testing Work?

An analyst performs hypothesis testing on a statistical sample to present evidence of the plausibility of the null hypothesis. Measurements and analyses are conducted on a random sample of the population to test a theory. Analysts use a random population sample to test two hypotheses: the null and alternative hypotheses.

The null hypothesis is typically an equality hypothesis between population parameters; for example, a null hypothesis may claim that the population mean return equals zero. The alternate hypothesis is essentially the inverse of the null hypothesis (e.g., the population mean return is not equal to zero). As a result, the two hypotheses are mutually exclusive, and exactly one of them is correct.


Null Hypothesis and Alternate Hypothesis

The Null Hypothesis is the assumption that the event will not occur. A null hypothesis has no bearing on the study's outcome unless it is rejected.

H0 is the symbol for it, and it is pronounced H-naught.

The Alternate Hypothesis is the logical opposite of the null hypothesis. The acceptance of the alternative hypothesis follows the rejection of the null hypothesis. H1 is the symbol for it.

Let's understand this with an example.

A sanitizer manufacturer claims that its product kills 95 percent of germs on average. 

To put this company's claim to the test, create a null and alternate hypothesis.

H0 (Null Hypothesis): Average = 95%.

Alternative Hypothesis (H1): The average is less than 95%.

Another straightforward example to understand this concept is determining whether or not a coin is fair and balanced. The null hypothesis states that the probability of heads is equal to the probability of tails. In contrast, the alternative hypothesis states that the probabilities of heads and tails would be very different.


Hypothesis Testing Calculation With Examples

Let's consider a hypothesis test for the average height of women in the United States. Suppose our null hypothesis is that the average height is 5'4". We gather a sample of 100 women and determine that their average height is 5'5". The population standard deviation is 2 inches.

To calculate the z-score, we would use the following formula:

z = ( x̅ – μ0 ) / (σ /√n)

z = (5'5" - 5'4") / (2" / √100)

z = 0.5 / (0.045)

 We will reject the null hypothesis as the z-score of 11.11 is very large and conclude that there is evidence to suggest that the average height of women in the US is greater than 5'4".
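As a quick check of this arithmetic, here is a hedged Python sketch using scipy.stats.norm; expressing both heights in inches (65 and 64) is our assumption about units:

    from scipy.stats import norm

    x_bar, mu0 = 65.0, 64.0   # 5'5" and 5'4" in inches (our unit choice)
    sigma, n = 2.0, 100

    z = (x_bar - mu0) / (sigma / n ** 0.5)   # 1 / 0.2 = 5.0
    p_one_sided = norm.sf(z)                 # P(Z > 5) ≈ 2.9e-07
    print(z, p_one_sided)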

Steps of Hypothesis Testing

Step 1: Specify Your Null and Alternate Hypotheses

It is critical to rephrase your original research hypothesis (the prediction that you wish to study) as a null (H0) and alternative (H1) hypothesis so that you can test it quantitatively. Your first hypothesis, which predicts a link between variables, is generally your alternate hypothesis. The null hypothesis predicts no link between the variables of interest.

Step 2: Gather Data

For a statistical test to be legitimate, sampling and data collection must be done in a way that is meant to test your hypothesis. You cannot draw statistical conclusions about the population you are interested in if your data is not representative.

Step 3: Conduct a Statistical Test

A variety of statistical tests are available, but they all compare within-group variance (how spread out the data are within a category) against between-group variance (how different the categories are from one another). If the between-group variance is large enough that there is little or no overlap between groups, your statistical test will display a low p-value. This suggests that the differences between these groups are unlikely to have occurred by chance. Alternatively, if there is high within-group variance and low between-group variance, your statistical test will show a high p-value: any difference you find across groups is most likely attributable to chance. The variety of variables and the level of measurement of your collected data will influence your choice of statistical test.

Step 4: Determine Rejection Of Your Null Hypothesis

Your statistical test results must determine whether your null hypothesis should be rejected or not. In most circumstances, you will base your judgment on the p-value provided by the statistical test. In most circumstances, your preset level of significance for rejecting the null hypothesis will be 0.05 - that is, when there is less than a 5% likelihood that these data would be seen if the null hypothesis were true. In other circumstances, researchers use a lower level of significance, such as 0.01 (1%). This reduces the possibility of wrongly rejecting the null hypothesis.

Step 5: Present Your Results 

The findings of hypothesis testing will be discussed in the results and discussion sections of your research paper, dissertation, or thesis. You should include a concise overview of the data and a summary of the findings of your statistical test in the results section. In the discussion, you can talk about whether your results confirmed your initial hypothesis or not. "Rejecting" or "failing to reject" the null hypothesis is the formal language used in hypothesis testing, and it is likely a must for your statistics assignments.

Types of Hypothesis Testing

Z Test

To determine whether a discovery or relationship is statistically significant, hypothesis testing uses a z-test. It usually checks to see if two means are the same (the null hypothesis). A z-test can be applied only when the population standard deviation is known and the sample size is 30 data points or more.

T Test

A statistical test called a t-test is employed to compare the means of two groups. To determine whether two groups differ, or whether a procedure or treatment affects the population of interest, it is frequently used in hypothesis testing.

Chi-Square 

You utilize a Chi-square test for hypothesis testing concerning whether your data is as predicted. To determine if the expected and observed results are well-fitted, the Chi-square test analyzes the differences between categorical variables from a random sample. The test's fundamental premise is that the observed values in your data should be compared to the predicted values that would be present if the null hypothesis were true.
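As an illustration (not from the original article), here is a minimal chi-square goodness-of-fit sketch with scipy.stats.chisquare; the die-roll counts are invented for the example:

    from scipy.stats import chisquare

    # Hypothetical counts from 60 rolls of a die vs. a fair-die expectation.
    observed = [8, 9, 13, 7, 12, 11]
    expected = [10, 10, 10, 10, 10, 10]

    stat, p = chisquare(f_obs=observed, f_exp=expected)
    print(stat, p)  # a large p-value here means no evidence the die is unfair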

Hypothesis Testing and Confidence Intervals

Both confidence intervals and hypothesis tests are inferential techniques that depend on approximating the sampling distribution. Data from a sample is used to estimate a population parameter with confidence intervals. Data from a sample is used in hypothesis testing to examine a given hypothesis. We must have a hypothesized parameter to conduct hypothesis testing.

Bootstrap distributions and randomization distributions are created using comparable simulation techniques. The observed sample statistic is the focal point of a bootstrap distribution, whereas the null hypothesis value is the focal point of a randomization distribution.

A confidence interval contains a range of plausible estimates of the population parameter. In this lesson, we created just two-tailed confidence intervals. There is a direct connection between two-tailed confidence intervals and two-tailed hypothesis tests: they typically provide the same results. In other words, a hypothesis test at the 0.05 level will virtually always fail to reject the null hypothesis if the 95% confidence interval contains the hypothesized value, and it will nearly certainly reject the null hypothesis if the 95% confidence interval does not contain the hypothesized parameter.
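A small sketch of this correspondence, with invented numbers: at α = 0.05, the two-sided z-test rejects exactly when the hypothesized mean falls outside the 95% confidence interval.

    x_bar, mu0, sigma, n = 202.0, 200.0, 5.0, 25   # hypothetical values
    se = sigma / n ** 0.5

    lo, hi = x_bar - 1.96 * se, x_bar + 1.96 * se  # 95% confidence interval
    z = (x_bar - mu0) / se                         # two-sided z-test statistic
    rejected = abs(z) > 1.96

    # The test rejects exactly when mu0 lies outside the interval.
    print((lo, hi), rejected, not (lo <= mu0 <= hi))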

Simple and Composite Hypothesis Testing

Depending on the population distribution, you can classify the statistical hypothesis into two types.

Simple Hypothesis: A simple hypothesis specifies an exact value for the parameter.

Composite Hypothesis: A composite hypothesis specifies a range of values.

A company is claiming that their average sales for this quarter are 1000 units. This is an example of a simple hypothesis.

Suppose the company claims that the sales are in the range of 900 to 1000 units. Then this is a case of a composite hypothesis.

One-Tailed and Two-Tailed Hypothesis Testing

The One-Tailed test, also called a directional test, considers a critical region of data that would result in the null hypothesis being rejected if the test sample falls into it, inevitably meaning the acceptance of the alternate hypothesis.

In a one-tailed test, the critical distribution area is one-sided, meaning the test sample is either greater or lesser than a specific value.

A Two-Tailed test checks whether the test sample is greater or less than a range of values, implying that the critical distribution area is two-sided.

If the sample falls into either critical region, the alternate hypothesis will be accepted, and the null hypothesis will be rejected.


Right Tailed Hypothesis Testing

If the greater-than (>) sign appears in your hypothesis statement, you are using a right-tailed test, also known as an upper test. Or, to put it another way, the disparity is to the right. For instance, you can contrast the battery life before and after a change in production. If you want to know whether the battery life is longer than the original (let's say 90 hours), your hypothesis statements can be the following:

  • The null hypothesis is H0: ≤ 90 (no change or a decrease).
  • The alternate hypothesis is H1: > 90 (battery life has risen).

The crucial point in this situation is that the alternate hypothesis (H1), not the null hypothesis, decides whether you get a right-tailed test.

Left Tailed Hypothesis Testing

Alternative hypotheses that assert the true value of a parameter is lower than the null hypothesis are tested with a left-tailed test; they are indicated by the "<" sign.

Suppose H0: mean = 50 and H1: mean not equal to 50

According to the H1, the mean can be greater than or less than 50. This is an example of a Two-tailed test.

In a similar manner, if H0: mean >=50, then H1: mean <50

Here the alternative states that the mean is less than 50. It is called a One-tailed (left-tailed) test.

Type 1 and Type 2 Error

A hypothesis test can result in two types of errors.

Type 1 Error: A Type-I error occurs when the sample results reject the null hypothesis even though it is true.

Type 2 Error: A Type-II error occurs when the null hypothesis is not rejected even though it is false.

Suppose a teacher evaluates the examination paper to decide whether a student passes or fails.

H0: Student has passed

H1: Student has failed

Type I error will be the teacher failing the student [rejects H0] although the student scored the passing marks [H0 was true]. 

Type II error will be the case where the teacher passes the student [fails to reject H0] although the student did not score the passing marks [H1 is true].

Level of Significance

The alpha value is a criterion for determining whether a test statistic is statistically significant. In a statistical test, Alpha represents an acceptable probability of a Type I error. Because alpha is a probability, it can be anywhere between 0 and 1. In practice, the most commonly used alpha values are 0.01, 0.05, and 0.1, which represent a 1%, 5%, and 10% chance of a Type I error, respectively (i.e. rejecting the null hypothesis when it is in fact correct).


P-Value

A p-value is a metric that expresses the likelihood that an observed difference could have occurred by chance. As the p-value decreases, the statistical significance of the observed difference increases. If the p-value is too low, you reject the null hypothesis.

Consider an example in which you are testing whether a new advertising campaign has increased the product's sales. Here, the null hypothesis states that there is no change in sales due to the new advertising campaign, and the p-value is the probability of observing a sales change at least as large as the one measured if that null hypothesis were true. If the p-value is 0.30, then there is a 30% chance of seeing such a change even with no real effect; if the p-value is 0.03, there is only a 3% chance. As you can see, the lower the p-value, the stronger the evidence against the null hypothesis, and the more confident you can be that the new advertising campaign caused the change in sales.

Why is Hypothesis Testing Important in Research Methodology?

Hypothesis testing is crucial in research methodology for several reasons:

  • Provides evidence-based conclusions: It allows researchers to make objective conclusions based on empirical data, providing evidence to support or refute their research hypotheses.
  • Supports decision-making: It helps make informed decisions, such as accepting or rejecting a new treatment, implementing policy changes, or adopting new practices.
  • Adds rigor and validity: It adds scientific rigor to research using statistical methods to analyze data, ensuring that conclusions are based on sound statistical evidence.
  • Contributes to the advancement of knowledge: By testing hypotheses, researchers contribute to the growth of knowledge in their respective fields by confirming existing theories or discovering new patterns and relationships.

Limitations of Hypothesis Testing

Hypothesis testing has some limitations that researchers should be aware of:

  • It cannot prove or establish the truth: Hypothesis testing provides evidence to support or reject a hypothesis, but it cannot confirm the absolute truth of the research question.
  • Results are sample-specific: Hypothesis testing is based on analyzing a sample from a population, and the conclusions drawn are specific to that particular sample.
  • Possible errors: During hypothesis testing, there is a chance of committing type I error (rejecting a true null hypothesis) or type II error (failing to reject a false null hypothesis).
  • Assumptions and requirements: Different tests have specific assumptions and requirements that must be met to accurately interpret results.

After reading this tutorial, you should have a much better understanding of hypothesis testing, one of the most important concepts in the field of Data Science. The majority of hypotheses are based on speculation about observed behavior, natural phenomena, or established theories.

If you are interested in the statistics behind data science and the skills needed for such a career, you ought to explore Simplilearn’s Post Graduate Program in Data Science.

If you have any questions regarding this ‘Hypothesis Testing In Statistics’ tutorial, do share them in the comment section. Our subject matter expert will respond to your queries. Happy learning!

1. What is hypothesis testing in statistics with example?

Hypothesis testing is a statistical method used to determine if there is enough evidence in a sample data to draw conclusions about a population. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then collecting data to assess the evidence. An example: testing if a new drug improves patient recovery (Ha) compared to the standard treatment (H0) based on collected patient data.

2. What is hypothesis testing and its types?

Hypothesis testing is a statistical method used to make inferences about a population based on sample data. It involves formulating two hypotheses: the null hypothesis (H0), which represents the default assumption, and the alternative hypothesis (Ha), which contradicts H0. The goal is to assess the evidence and determine whether there is enough statistical significance to reject the null hypothesis in favor of the alternative hypothesis.

Types of hypothesis testing (a sketch of the corresponding SciPy calls follows this list):

  • One-sample test: Used to compare a sample to a known value or a hypothesized value.
  • Two-sample test: Compares two independent samples to assess if there is a significant difference between their means or distributions.
  • Paired-sample test: Compares two related samples, such as pre-test and post-test data, to evaluate changes within the same subjects over time or under different conditions.
  • Chi-square test: Used to analyze categorical data and determine if there is a significant association between variables.
  • ANOVA (Analysis of Variance): Compares means across multiple groups to check if there is a significant difference between them.
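As a rough map from these test types to code (an illustration, not part of the original FAQ), SciPy exposes one function per test; the sample data below are invented:

    from scipy import stats

    a = [12.1, 11.8, 12.5, 12.3, 11.9]   # hypothetical sample 1
    b = [11.2, 11.5, 11.7, 11.3, 11.6]   # hypothetical sample 2

    print(stats.ttest_1samp(a, popmean=12.0))   # one-sample t-test
    print(stats.ttest_ind(a, b))                # two-sample (independent) t-test
    print(stats.ttest_rel(a, b))                # paired-sample t-test
    print(stats.chi2_contingency([[10, 20], [30, 40]]))  # chi-square independence
    print(stats.f_oneway(a, b, [10.9, 11.1, 11.0, 11.2, 10.8]))  # one-way ANOVA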

3. What are the steps of hypothesis testing?

The steps of hypothesis testing are as follows:

  • Formulate the hypotheses: State the null hypothesis (H0) and the alternative hypothesis (Ha) based on the research question.
  • Set the significance level: Determine the acceptable level of error (alpha) for making a decision.
  • Collect and analyze data: Gather and process the sample data.
  • Compute test statistic: Calculate the appropriate statistical test to assess the evidence.
  • Make a decision: Compare the test statistic with critical values or p-values and determine whether to reject H0 in favor of Ha or not.
  • Draw conclusions: Interpret the results and communicate the findings in the context of the research question.

4. What are the 2 types of hypothesis testing?

  • One-tailed (or one-sided) test: Tests for the significance of an effect in only one direction, either positive or negative.
  • Two-tailed (or two-sided) test: Tests for the significance of an effect in both directions, allowing for the possibility of a positive or negative effect.

The choice between one-tailed and two-tailed tests depends on the specific research question and the directionality of the expected effect.

5. What are the 3 major types of hypothesis?

The three major types of hypotheses are:

  • Null Hypothesis (H0): Represents the default assumption, stating that there is no significant effect or relationship in the data.
  • Alternative Hypothesis (Ha): Contradicts the null hypothesis and proposes a specific effect or relationship that researchers want to investigate.
  • Nondirectional Hypothesis: An alternative hypothesis that doesn't specify the direction of the effect, leaving it open for both positive and negative possibilities.


About the Author

Avijeet Biswal

Avijeet is a Senior Research Analyst at Simplilearn. Passionate about Data Analytics, Machine Learning, and Deep Learning, Avijeet is also interested in politics, cricket, and football.



Generating and Testing Hypotheses

Brief Description of Category

"Generating and testing hypotheses" are often thought of as strategies that are only used in the sciences. However, students engage in this process in many content areas and as part of the natural learning process.

Sometimes we refer to hypothesizing as predicting, inferring, deducing, theorizing, etc. in other content areas, but the mental processes are still the same.

"When students create thesis statements, make predictions based on evidence, or consider the consequences of certain action, they are using the process of generating and testing hypotheses." (McREL, 2012)


Evidence of Implementation

  • Learned concepts are applied at high levels (Analyze, Evaluate, Create).
  • Students use knowledge in "real-world" contexts.
  • Students brainstorm, problem solve, and/or troubleshoot.
  • Students are heard saying "What if?" or "Let's try this."
  • Students are observed making and testing predictions and hypotheses.

(Adapted from Tools for Classroom Instruction That Works; McREL, 2018)



Understanding Hypothesis Testing


Hypothesis testing involves formulating assumptions about population parameters based on sample statistics and rigorously evaluating these assumptions against empirical evidence. This article sheds light on the significance of hypothesis testing and the critical steps involved in the process.

What is Hypothesis Testing?

Hypothesis testing is a statistical method that is used to make a statistical decision using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. 

Example: You say the average height in the class is 30, or that a boy is taller than a girl. These are assumptions we are making, and we need some statistical way to prove them; we need a mathematical conclusion that whatever we are assuming is true.

Defining Hypotheses

  • Null hypothesis (H0): the default assumption that there is no effect or no difference, stated in terms of a population parameter such as the mean μ (for example, H0: μ = μ0).
  • Alternative hypothesis (H1): the statement that contradicts H0 and represents the effect or difference we want to detect (for example, H1: μ ≠ μ0).

Key Terms of Hypothesis Testing

  • Level of significance (α): the probability of rejecting the null hypothesis when it is true; a significance level of 0.05 means accepting a 5% risk of that error.
  • P-value: The P value, or calculated probability, is the probability of finding the observed/extreme results when the null hypothesis (H0) of a study-given problem is true. If your P-value is less than the chosen significance level, you reject the null hypothesis, i.e., conclude that your sample provides evidence supporting the alternative hypothesis.
  • Test Statistic: The test statistic is a numerical value calculated from sample data during a hypothesis test, used to determine whether to reject the null hypothesis. It is compared to a critical value or p-value to make decisions about the statistical significance of the observed results.
  • Critical value: The critical value in statistics is a threshold or cutoff point used to determine whether to reject the null hypothesis in a hypothesis test.
  • Degrees of freedom: Degrees of freedom reflect the variability or freedom one has in estimating a parameter. They are related to the sample size and determine the shape of the relevant test distribution.

Why do we use Hypothesis Testing?

Hypothesis testing is an important procedure in statistics. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data. When we can say that findings are statistically significant, it is thanks to hypothesis testing.

One-Tailed and Two-Tailed Test

A one-tailed test focuses on one direction, either greater than or less than a specified value. We use a one-tailed test when there is a clear directional expectation based on prior knowledge or theory. The critical region is located on only one side of the distribution curve. If the sample falls into this critical region, the null hypothesis is rejected in favor of the alternative hypothesis.

One-Tailed Test

There are two types of one-tailed test:

  • Left-tailed test: H0: μ ≥ 50 versus H1: μ < 50; the critical region lies in the left tail.
  • Right-tailed test: H0: μ ≤ 50 versus H1: μ > 50; the critical region lies in the right tail.

Two-Tailed Test

A two-tailed test considers both directions, greater than and less than a specified value. We use a two-tailed test when there is no specific directional expectation and we want to detect any significant difference.

For example: H0: μ = 50 versus H1: μ ≠ 50; the critical region lies in both tails.

What are Type 1 and Type 2 errors in Hypothesis Testing?

In hypothesis testing, Type I and Type II errors are two possible errors that researchers can make when drawing conclusions about a population based on a sample of data. These errors are associated with the decisions made regarding the null hypothesis and the alternative hypothesis.

  • Type I error (α): rejecting the null hypothesis when it is actually true; the significance level α is the probability of making this error.
  • Type II error (β): failing to reject the null hypothesis when it is actually false.

How does Hypothesis Testing work?

Step 1 – Define Null and Alternative Hypotheses

State the null hypothesis (H0), the claim to be tested, and the alternative hypothesis (H1), the claim accepted if the evidence contradicts H0.

We first identify the problem about which we want to make an assumption, keeping in mind that the two hypotheses must contradict one another, and assuming normally distributed data.

Step 2 – Choose Significance Level

Select a significance level (α), the probability of rejecting the null hypothesis when it is true. A common choice is α = 0.05.

Step 3 – Collect and Analyze Data

Gather relevant data through observation or experimentation. Analyze the data using appropriate statistical methods to obtain a test statistic.

Step 4 – Calculate Test Statistic

In this step, the data are evaluated and we compute a score based on the characteristics of the data. The choice of test statistic depends on the type of hypothesis test being conducted.

There are various hypothesis tests, each appropriate for a different goal. The test could be a Z-test, Chi-square, T-test, and so on.

  • Z-test: If the population mean and standard deviation are known, the z-statistic is commonly used.
  • t-test: If the population standard deviation is unknown and the sample size is small, the t-statistic is more appropriate.
  • Chi-square test: The chi-square test is used for categorical data or for testing independence in contingency tables.
  • F-test: The F-test is often used in analysis of variance (ANOVA) to compare variances or test the equality of means across multiple groups.

In the example below we have a small dataset, so a t-test is more appropriate to test our hypothesis.

T-statistic is a measure of the difference between the means of two groups relative to the variability within each group. It is calculated as the difference between the sample means divided by the standard error of the difference. It is also known as the t-value or t-score.

Step 5 – Compare Test Statistic

In this stage, we decide whether to reject the null hypothesis or fail to reject it. There are two ways to make this decision.

Method A: Using Critical Values

Comparing the test statistic with the tabulated critical value, we have:

  • If Test Statistic>Critical Value: Reject the null hypothesis.
  • If Test Statistic≤Critical Value: Fail to reject the null hypothesis.

Note: Critical values are predetermined threshold values used to make a decision in hypothesis testing. To determine critical values, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution table, depending on the test.
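Rather than reading tables by hand, critical values can also be computed with SciPy; a minimal sketch (the α and df values are just examples):

    from scipy.stats import norm, t

    alpha = 0.05

    # Two-sided critical values: reject H0 when |test statistic| exceeds these.
    z_crit = norm.ppf(1 - alpha / 2)       # ≈ 1.96 for a z-test
    t_crit = t.ppf(1 - alpha / 2, df=9)    # ≈ 2.262 for a t-test with df = 9
    print(z_crit, t_crit)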

Method B: Using P-values

We can also come to a conclusion using the p-value:

  • If p ≤ α: reject the null hypothesis.
  • If p > α: fail to reject the null hypothesis.

Note: The p-value is the probability of obtaining a test statistic as extreme as, or more extreme than, the one observed in the sample, assuming the null hypothesis is true. To determine the p-value, we typically refer to a statistical distribution table, such as the normal distribution or t-distribution table, depending on the test.

Step 6 – Interpret the Results

At last, we can conclude our experiment using method A or B.

Calculating test statistic

To validate our hypothesis about a population parameter, we use statistical functions. We use the z-score, p-value, and level of significance (alpha) to build evidence for our hypothesis for normally distributed data.

1. Z-statistics:

When population means and standard deviations are known.

z = \frac{\bar{x} - \mu}{\frac{\sigma}{\sqrt{n}}}

  • x̄ is the sample mean,
  • μ represents the population mean,
  • σ is the population standard deviation,
  • n is the size of the sample.

2. T-Statistics

A t-test is used when the sample size is small (n < 30) and the population standard deviation is unknown. The t-statistic is given by:

t = \frac{\bar{x} - \mu}{s/\sqrt{n}}

  • t = t-score,
  • x̄ = sample mean
  • μ = population mean,
  • s = standard deviation of the sample,
  • n = sample size

3. Chi-Square Test

The chi-square test for independence of categorical data (non-normally distributed) uses:

\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}

  • i and j are the row and column indices, respectively,
  • O_{ij} is the observed count in cell (i, j), and E_{ij} is the expected count in cell (i, j) under the null hypothesis of independence.

Real-Life Hypothesis Testing Examples

Let’s examine hypothesis testing using two real-life situations.

Case A: Does a New Drug Affect Blood Pressure?

Imagine a pharmaceutical company has developed a new drug that they believe can effectively lower blood pressure in patients with hypertension. Before bringing the drug to market, they need to conduct a study to assess its impact on blood pressure.

  • Before Treatment: 120, 122, 118, 130, 125, 128, 115, 121, 123, 119
  • After Treatment: 115, 120, 112, 128, 122, 125, 110, 117, 119, 114

Step 1: Define the Hypothesis

  • Null Hypothesis (H0): The new drug has no effect on blood pressure.
  • Alternate Hypothesis (H1): The new drug has an effect on blood pressure.

Step 2: Define the Significance level

Let’s set the significance level at 0.05, meaning we will reject the null hypothesis if the evidence suggests less than a 5% chance of observing the results due to random variation.

Step 3: Compute the Test Statistic

Using a paired t-test, analyze the data to obtain a test statistic and a p-value.

The test statistic (e.g., T-statistic) is calculated based on the differences between blood pressure measurements before and after treatment.

t = m/(s/√n)

  • m = mean of the differences, d_i = X_after,i − X_before,i
  • s = standard deviation of the differences
  • n = sample size

Then m = -3.9, s ≈ 1.37, and n = 10.

We calculate the T-statistic from the paired t-test formula: t = -3.9 / (1.37/√10) ≈ -9.

Step 4: Find the p-value

With a calculated t-statistic of -9 and degrees of freedom df = 9, you can find the p-value using statistical software or a t-distribution table.

Thus, the p-value ≈ 8.54e-06.

Step 5: Result

  • If the p-value is less than or equal to 0.05, the researchers reject the null hypothesis.
  • If the p-value is greater than 0.05, they fail to reject the null hypothesis.

Conclusion: Since the p-value (≈ 8.54e-06) is less than the significance level (0.05), the researchers reject the null hypothesis. There is statistically significant evidence that the average blood pressure before and after treatment with the new drug is different.

Python Implementation of Hypothesis Testing

Let’s implement hypothesis testing in Python, testing whether the new drug affects blood pressure. For this example, we will use a paired T-test from the scipy.stats library.

SciPy is a scientific computing library for Python that provides, among other things, statistical tests and distributions.

We will implement our first real-life problem in Python below.
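The article's original code listing did not survive in this copy, so here is a minimal sketch consistent with the numbers reported for Case A, using scipy.stats.ttest_rel (the variable names are ours):

    from scipy import stats

    before = [120, 122, 118, 130, 125, 128, 115, 121, 123, 119]
    after = [115, 120, 112, 128, 122, 125, 110, 117, 119, 114]

    # Paired t-test on the before/after blood-pressure readings.
    t_stat, p_value = stats.ttest_rel(after, before)
    print(t_stat)   # ≈ -9.0
    print(p_value)  # ≈ 8.54e-06

    alpha = 0.05
    if p_value <= alpha:
        print("Reject the null hypothesis: the drug affects blood pressure.")
    else:
        print("Fail to reject the null hypothesis.")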

In the above example, given the T-statistic of approximately -9 and an extremely small p-value, the results indicate a strong case to reject the null hypothesis at a significance level of 0.05. 

  • The results suggest that the new drug, treatment, or intervention has a significant effect on lowering blood pressure.
  • The negative T-statistic indicates that the mean blood pressure after treatment is significantly lower than the assumed population mean before treatment.

Case B: Cholesterol Level in a Population

Data: A sample of 25 individuals is taken, and their cholesterol levels are measured.

Cholesterol Levels (mg/dL): 205, 198, 210, 190, 215, 205, 200, 192, 198, 205, 198, 202, 208, 200, 205, 198, 205, 210, 192, 205, 198, 205, 210, 192, 205.

Population Mean (μ0): 200 mg/dL

Population Standard Deviation (σ): 5 mg/dL (given for this problem)

Step 1: Define the Hypothesis

  • Null Hypothesis (H0): The average cholesterol level in the population is 200 mg/dL.
  • Alternate Hypothesis (H1): The average cholesterol level in the population is different from 200 mg/dL.

Step 2: Define the Significance Level

As the direction of deviation is not given, we assume a two-tailed test at a significance level of 0.05. Based on the normal distribution (z) table, the two-tailed critical values are approximately -1.96 and 1.96.

Step 3: Compute the Test Statistic

The sample mean is x̄ = 202.04 mg/dL, so

z = (202.04 − 200) / (5/√25) = 2.04 / 1 = 2.04

Step 4: Result

Since the absolute value of the test statistic (2.04) is greater than the critical value (1.96), we reject the null hypothesis and conclude that there is statistically significant evidence that the average cholesterol level in the population is different from 200 mg/dL.
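For Case B there is no one-line z-test in scipy.stats, so a minimal sketch (our own arrangement) computes the statistic directly and uses norm.sf for the tail probability:

    from scipy.stats import norm

    levels = [205, 198, 210, 190, 215, 205, 200, 192, 198, 205,
              198, 202, 208, 200, 205, 198, 205, 210, 192, 205,
              198, 205, 210, 192, 205]
    mu0, sigma = 200, 5

    x_bar = sum(levels) / len(levels)                  # 202.04
    z = (x_bar - mu0) / (sigma / len(levels) ** 0.5)   # 2.04
    p = 2 * norm.sf(abs(z))                            # two-tailed p ≈ 0.041
    print(x_bar, z, p)

Since p ≈ 0.041 < 0.05, the code agrees with the critical-value decision above.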

Limitations of Hypothesis Testing

  • Although a useful technique, hypothesis testing does not offer a comprehensive grasp of the topic being studied. It concentrates on particular hypotheses and statistical significance without fully reflecting the intricacy or whole context of the phenomenon.
  • The accuracy of hypothesis testing results is contingent on the quality of available data and the appropriateness of statistical methods used. Inaccurate data or poorly formulated hypotheses can lead to incorrect conclusions.
  • Relying solely on hypothesis testing may cause analysts to overlook significant patterns or relationships in the data that are not captured by the specific hypotheses being tested. This limitation underscores the importance of complementing hypothesis testing with other analytical approaches.

Hypothesis testing stands as a cornerstone in statistical analysis, enabling data scientists to navigate uncertainties and draw credible inferences from sample data. By systematically defining null and alternative hypotheses, choosing significance levels, and leveraging statistical tests, researchers can assess the validity of their assumptions. The article also elucidates the critical distinction between Type I and Type II errors, providing a comprehensive understanding of the nuanced decision-making process inherent in hypothesis testing. The real-life example of testing a new drug’s effect on blood pressure using a paired T-test showcases the practical application of these principles, underscoring the importance of statistical rigor in data-driven decision-making.

Frequently Asked Questions (FAQs)

1. What are the 3 types of hypothesis tests?

There are three types of hypothesis tests: right-tailed, left-tailed, and two-tailed. Right-tailed tests assess if a parameter is greater, left-tailed if lesser. Two-tailed tests check for non-directional differences, greater or lesser.

2. What are the 4 components of hypothesis testing?

Null Hypothesis (H0): No effect or difference exists. Alternative Hypothesis (H1): An effect or difference exists. Significance Level (α): The risk of rejecting the null hypothesis when it is true (Type I error). Test Statistic: A numerical value representing the observed evidence against the null hypothesis.

3. What is hypothesis testing in ML?

A statistical method to evaluate the performance and validity of machine learning models. It tests specific hypotheses about model behavior, such as whether features influence predictions or whether a model generalizes well to unseen data.

4. What is the difference between Pytest and Hypothesis in Python?

Pytest is a general-purpose testing framework for Python code, while Hypothesis is a property-based testing framework for Python that focuses on generating test cases from specified properties of the code.


Out on a Lim

Reflections on learning at a distance.

Marzano: Generating and Testing Hypotheses

May 8, 2009.

This post is part of a series on integrating the McREL research on classroom instruction that works with videoconferencing .

Generalizations: The generating and testing of hypotheses can be approached in an inductive or deductive manner. Teachers should ask students to clearly explain their hypotheses and their conclusions.

Recommendations: Use these to improve your practice. Make sure students can explain their hypotheses. Use a variety of structured tasks to guide students through generating and testing hypotheses (Pitler et al., 2007, pp. 202-203).

Brainstorming for Videoconferencing

This is a strategy where you miss quite a bit of the depth if you only use the technology reference book; the handbook has much more detail. While you might think that generating and testing hypotheses is only done in science experiments (think of COSI’s Gadget Works program), there are actually six types of student tasks, each of which makes for a great videoconference.

1. Systems Analysis

Have students predict what might happen if one part of the system changes. To do this, students need to:

  • What are the parts of the system?
  • How does each part work?
  • How do the parts affect one another?
  • Pick a part of the system. What might happen if that part did something differently?
  • Change the part to test your hypothesis, or act it out or think it through. (Marzano et al., 2001, p. 201)

These steps should be taught to students with content that is familiar to them.

What are some systems that student could analyze and share their analysis with each other via videoconference?

  • A computer & its parts
  • An ecosystem
  • A transportation system
  • A quadratic equation
  • Can you think of more?

After explaining their analysis, students could ask each other:

  • What did you learn as a result of doing the analysis?
  • What did you learn as a result of listening to our analysis?
  • What other system is this like? ( metaphor )

2. Problem Solving

This strategy is particularly designed for unstructured problems – those with no obvious solution, no clear goals or constraints – messy problems! To solve these problems, students need to ask these questions:

  • What am I trying to do?
  • What things are in my way?
  • What are some things I can do to get around these things?
  • Which solution seems to be the best?
  • Did this solution work? Should I try another solution? (Marzano et al., 2001, p. 211)

What are some unstructured problems for various content areas? I think I want to look at our MI curriculum more to get better ideas. But here are a few:

  • carbon emissions
  • playground conflict
  • build/create something with unlimited resources
  • any current event problem…
  • Roxanne’s Imagine It project/videoconference adaptation
  • What other ideas can you think of?

3. Decision Making

To use this strategy, students need to understand criteria as value-laden preferences on which decisions are based. The process includes these steps:

  • What am I trying to decide?
  • What are my choices?
  • What are important criteria for making this decision?
  • How important is each criterion?
  • How well does each of my choices match my criteria?
  • Which choice matches best?
  • How do I feel about the decision? Should I change any criteria and try again? (Marzano et al., 2001, p. 221)

This format makes me think of Tammy Worcester’s decision making spreadsheet. It’s a great tool for following this process. The Marzano Handbook suggests using a similar graphic organizer to assist in decision making.

What topics could students practice decision making and then share their logic/rationale with another class?

  • elections – national, state, and local
  • current issues
  • deciding on a major, career, or college
  • the most important invention in your content area
  • the most ____ character in 3-4 books (fill in the blank with a desired characteristic)
  • what other ideas do you have?

4. Historical Investigation

This is done on events where there is no clear agreement on what exactly happened. There should not be any quick answers. Students will have to construct a possible resolution to conflicting scenarios. Students will need to be able to collect and analyze evidence to make a decision. They will probably need instruction on the difference between evidence and opinion and how to interpret different materials. Students could practice on a simple event in the local paper and make hypotheses about what really happened. They will need to follow these steps:

  • What historical event do I want to explain?
  • What do people already know about this event?
  • What confuses people about this event?
  • What suggestions do I have for clearing up these confusions?
  • How can I explain my suggestions?
  • Is there evidence that my scenario is plausible? (Marzano et al., 2001, p. 232)

What topics could be used for this? These are from the Handbook (2001).

  • Did George Washington really chop down a cherry tree?
  • What happened to Amelia Earhart?
  • What happened when the Titanic sank?

Clearly these would need to be topics that are included in the curriculum, and not just obscure little known events. A careful review of the curriculum for your history class may find more topics.

5. Experimental Inquiry

This isn’t just for science!! Inquiry can be used to describe observations, generate hypotheses, make predictions, and test them – in many content areas. Students will have to use their prior knowledge to make the predictions and then be able to apply their knowledge to new situations. The steps for this process are:

  • Observe something and describe what occurred.
  • Explain what was observed.
  • Based on the explanation, make a prediction.
  • Set up an experiment to test the prediction.
  • Explain the results of the experiment and compare them to your earlier explanation. (Marzano et al., 2001, p. 232)

It’s easy to think of examples and scenarios for doing this with science – LEARNnco’s programs, Science Seekers, etc. Imagine the students doing an experiment together; designing an experiment for the other class to do; comparing their solutions and results to an experiment.

6. Invention

Invention isn’t just creating intimidating big things! It’s also any solution for anything. “Isn’t there a better way to…?” Students could brainstorm solutions/inventions to improve a situation. The steps for this are:

  • What do I want to make? What do I want to make better?
  • What standards do I want to set for my invention?
  • What is the best way to make a rough draft of my invention?
  • How can I make my rough draft better?
  • Does my invention meet the standards I have set? (Marzano et al., 2001, p. 253)

The challenge with this strategy is that there aren’t any really good examples out there for regular content. Work needs to be done!

I think if I was in the classroom, I would want to try this about half way through the semester or school year, and ask students – what do you see around school that could be made better? Wouldn’t it be interesting to have classes share their results on this? Do we have time for this with state testing? Maybe not. Maybe that’s why there aren’t very many examples. Still, check out a few here: Generating and testing hypotheses & technology resources. This Invention at Play site is kind of cool. Could students share their little inventions with each other & share what they learned? What ideas do you have?

Any comments or thoughts? New ideas? Content you want to try these out with? Please comment!

References:

Pitler, H., Hubbell, E. R., Kuhn, M., & Malenoski, K. (2007). Using technology with classroom instruction that works. Alexandria, VA: Association for Supervision and Curriculum Development.

Marzano, R. J., Norford, J. S., Paynter, D. E., Pickering, D. J., & Gaddy, B. B. (2001). A handbook for classroom instruction that works. Alexandria, VA: Association for Supervision and Curriculum Development.


Hypothesis Maker Online

Looking for a hypothesis maker? This online tool for students will help you formulate a beautiful hypothesis quickly, efficiently, and for free.



📄 Hypothesis Maker: How to Use It

Our hypothesis maker is a simple and efficient tool you can access online for free.

If you want to create a research hypothesis quickly, you should fill out the research details in the given fields on the hypothesis generator.

Below are the fields you should complete to generate your hypothesis:

  • Who or what is your research based on? For instance, the subject can be research group 1.
  • What does the subject (research group 1) do?
  • What does the subject affect? This is the predicted outcome (the object).
  • Who or what will be compared with research group 1? (This is research group 2.)

Once you fill in the fields, you can click the 'Make a hypothesis' tab and get your results. A toy sketch of how such a generator might work is shown below.
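Under the hood, a tool like this amounts to filling your answers into a sentence template. Here is a minimal sketch of that idea in Python; the function name, field names, and template wording are assumptions for illustration, not the site's actual code:

```python
# Minimal sketch of a template-based hypothesis generator.
# All names and the template wording are illustrative assumptions,
# not the actual implementation behind the online tool.

def make_hypothesis(subject: str, action: str, outcome: str, comparison: str) -> str:
    """Fill the four research details into an if-then hypothesis template."""
    return (f"If {subject} {action}, then {outcome} will change "
            f"compared with {comparison}.")

print(make_hypothesis(
    subject="research group 1",
    action="follows the new study schedule",
    outcome="average test scores",
    comparison="research group 2",
))
# -> If research group 1 follows the new study schedule, then average
#    test scores will change compared with research group 2.
```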

⚗️ What Is a Hypothesis in the Scientific Method?

A hypothesis is a statement describing what you expect or predict your research will show, based on observation.

It is similar to academic speculation and reasoning in that it anticipates the outcome of your scientific test. An effective hypothesis, therefore, should be crafted carefully and with precision.

A good hypothesis should have dependent and independent variables. These variables are the elements you will test in your research method; each can be a concept, an event, or an object, as long as it is observable.

You observe how the dependent variables respond while the independent variables change during the experiment.

In a nutshell, a hypothesis directs and organizes the research methods you will use, and it forms a substantial part of research paper writing.

Hypothesis vs. Theory

A hypothesis is a realistic expectation that researchers form before any investigation. It is formulated and tested to determine whether the statement is true. A theory, on the other hand, is a factual principle supported by evidence. Thus, a theory is more fact-backed than a hypothesis.

Another difference is that a hypothesis is presented as a single statement, while a theory can comprise an assortment of interrelated statements. A hypothesis projects a specific future possibility whose results remain uncertain; a theory has been verified by indisputable, well-substantiated results.

When it comes to data, a hypothesis relies on limited information, while a theory is established on an extensive data set tested under various conditions.

To confirm a hypothesis, you observe whether the stated assumption holds in practice.

Since hypotheses have observable variables, their outcome is usually tied to a specific occurrence. Conversely, theories are grounded in general principles involving multiple experiments and research tests, and such a principle can apply to many specific cases.

The primary purpose of formulating a hypothesis is to present a tentative prediction for researchers to explore further through tests and observations. Theories, in turn, aim to explain plausible occurrences through established scientific study.

Several criteria help to establish whether a hypothesis is a good one; the steps below will help you craft a hypothesis that meets them.

🧭 6 Steps to Making a Good Hypothesis

Writing a hypothesis becomes way simpler if you follow a tried-and-tested algorithm. Let’s explore how you can formulate a good hypothesis in a few steps:

Step #1: Ask Questions

The first step in hypothesis creation is asking real questions about the surrounding reality.

Why do things happen as they do? What are the causes of some occurrences?

Your curiosity will trigger great questions that you can use to formulate a stellar hypothesis. So, ensure you pick a research topic of interest to scrutinize the world’s phenomena, processes, and events.

Step #2: Do Initial Research

Carry out preliminary research and gather essential background information about your topic of choice.

The extent of the information you collect will depend on what you want to prove.

Your initial research may require only a few academic books or a simple Internet search for quick answers with relevant statistics.

Still, keep in mind that at this phase it is too early to prove or disprove your hypothesis.

Step #3: Identify Your Variables

Now that you have a basic understanding of the topic, choose the dependent and independent variables.

Take note of which variables you can deliberately change or manipulate (the independent variables) and which you can only measure in response (the dependent variables), and understand the limitations of your test before settling on a final hypothesis.

Step #4: Formulate Your Hypothesis

You can write your hypothesis as an 'if-then' statement. Presenting a hypothesis in this format is reliable because it spells out the cause-and-effect relationship you want to test.

For instance: If I study every day, then I will get good grades.

Step #5: Gather Relevant Data

Once you have identified your variables and formulated the hypothesis, you can start the experiment. Remember, the conclusion you reach will either support or rebut your initial assumption.

So, gather relevant information, whether for a simple or a statistical hypothesis, because you need evidence to back your statement. For a statistical hypothesis, that usually means running a test like the sketch below.
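As a hedged illustration of what "backing your statement" can look like for a statistical hypothesis, here is a Python sketch built on the Step #4 example. The grade numbers are simulated for demonstration; a real study would need genuine data:

```python
# Sketch: testing "If I study every day, then I will get good grades"
# as a two-sample comparison. Grades here are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
daily_studiers = rng.normal(loc=85, scale=5, size=30)  # simulated grades
occasional     = rng.normal(loc=78, scale=5, size=30)  # simulated grades

# Independent-samples t-test: do the two group means differ?
t_stat, p_value = stats.ttest_ind(daily_studiers, occasional)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (commonly < 0.05) would count as evidence supporting
# the hypothesis in this simulated sample.
```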

Step #6: Record Your Findings

Finally, write down your conclusions in a research paper.

Outline in detail whether the test has proved or disproved your hypothesis.

Edit and proofread your work, using a plagiarism checker to ensure the authenticity of your text.

We hope that the above tips will be useful for you.

🔗 References

  • How to Write a Hypothesis in 6 Steps - Grammarly
  • Forming a Good Hypothesis for Scientific Research
  • The Hypothesis in Science Writing
  • Scientific Method: Step 3: HYPOTHESIS - Subject Guides
  • Hypothesis Template & Examples - Video & Lesson Transcript

Use our hypothesis maker whenever you need to formulate a hypothesis for your study. We offer a very simple tool where you just need to provide basic info about your variables, subjects, and predicted outcomes. The rest is on us. Get a perfect hypothesis in no time!

Related Articles

  1. Hypothesis-generating research and predictive medicine

    The paradigm of hypothesis-generating research does not replace or undermine hypothesis-testing modes of research; instead, it complements them and has facilitated discoveries that may not have been possible with hypothesis-testing research. The hypothesis-generating mode of research has been primarily practiced in basic science but has ...

  2. Hypothesis Testing

There are 5 main steps in hypothesis testing: State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1). Collect data in a way designed to test the hypothesis. Perform an appropriate statistical test. Decide whether to reject or fail to reject your null hypothesis. Present the findings in your results ...

  3. General Principles of Preclinical Study Design

1. An Overview. Broadly, preclinical research can be classified into two distinct categories depending on the aim and purpose of the experiment, namely, "hypothesis generating" (exploratory) and "hypothesis testing" (confirmatory) research (Fig. 1). Hypothesis-generating studies are often scientifically informed, curiosity- and intuition-driven explorations which may generate testable ...

  4. The Importance of Hypothesis Generation vs. Hypothesis Testing: Case

    Why it Matters. The problem with transposing the concepts of "hypothesis generation" and "hypothesis testing" is that hypothesis generation is relatively easy. It's basically coming up ...

  5. Data-Driven Hypothesis Generation in Clinical Research: What We Learned

    Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve the process goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in other steps of the study, e.g., study ...

  6. Hypothesis

Generate a hypothesis in advance through pre-analyzing a problem (i.e., generation of a prestage hypothesis). 3. Collect data related to the prestage hypothesis by appropriate means such as experiment, observation, database search, and Web search (i.e., data collection). 4. Process and transform the collected data as needed. 5.

  7. RESEARCH REPORT: Hypothesis Testing and Hypothesis Generating ...

… generation represent two distinct research objectives. In hypothesis testing research, the researcher specifies one or more a priori hypotheses, based on existing theory and/or data, and then puts these hypotheses to an empirical test with a new set of data. In hypothesis generating research, the researcher explores a set of data searching …

  8. Hypothesis Testing: Principles and Methods

    Hypothesis testing is a fundamental tool used in scientific research to validate or reject hypotheses about population parameters based on sample data. It provides a structured framework for evaluating the statistical significance of a hypothesis and drawing conclusions about the true nature of a population. Hypothesis testing is widely used in ...

  9. PDF Scientific hypothesis generation process in clinical research: a

    Background: Scientific hypothesis generation is a critical step in scientific research that determines the direction and impact of any investigation. Despite its vital role, we have limited ... After formulating a scientific hypothesis, researchers design studies to test the scientific hypothesis to determine the answer to research questions 2,4.

  10. A Practical Guide to Writing Quantitative and Qualitative Research

    Hypothesis-generating (Qualitative hypothesis-generating research) - Qualitative research uses inductive reasoning. - This involves data collection from study participants or the literature regarding a phenomenon of interest, using the collected data to develop a formal hypothesis, and using the formal hypothesis as a framework for testing the ...

  11. Approaches to informed consent for hypothesis-testing and hypothesis

    The purpose of our hypothesis-generating study is to test the feasibility of using MPS to generate clinical hypotheses, and to approach the return of results as an experimental manipulation. Issues to consider in both designs include: volume and nature of the potential results, primary versus secondary results, return of individual results ...

  12. Machine Learning as a Tool for Hypothesis Generation*

    Abstract. While hypothesis testing is a highly formalized activity, hypothesis generation remains largely informal. We propose a systematic procedure to generate novel hypotheses about human behavior, which uses the capacity of machine learning algorithms to notice patterns people might not.

  13. What is Hypothesis Testing in Statistics? Types and Examples

    Hypothesis testing is a statistical method used to determine if there is enough evidence in a sample data to draw conclusions about a population. It involves formulating two competing hypotheses, the null hypothesis (H0) and the alternative hypothesis (Ha), and then collecting data to assess the evidence.

  14. Generating & Testing Hypotheses / Strategies In Action

    "Generating and testing hypotheses" are often thought of as strategies that are only used in the sciences. However, students engage in this process in many content areas and as part of the natural learning process. Sometimes we refer to hypothesizing as predicting, inferring, deducing, theorizing, etc. in other content areas, but the mental ...

  15. Understanding Hypothesis Testing

    Hypothesis testing is a statistical method that is used to make a statistical decision using experimental data. Hypothesis testing is basically an assumption that we make about a population parameter. It evaluates two mutually exclusive statements about a population to determine which statement is best supported by the sample data.

  16. Marzano: Generating and Testing Hypotheses

The generating and testing of hypotheses can be approached in an inductive or deductive manner. ... Change the part to test your hypothesis or act it out or think it through (Marzano et al., 2001, p. 201). These steps should be taught to students with content that is familiar to them.

  17. Hypothesis Maker

    Our hypothesis maker is a simple and efficient tool you can access online for free. If you want to create a research hypothesis quickly, you should fill out the research details in the given fields on the hypothesis generator. Below are the fields you should complete to generate your hypothesis: