Multiple Hypothesis Testing


Terminology

  • Per comparison error rate (PCER) : It is an estimate of the rate of false positives per hypothesis \[\mathrm{PCER} = \frac{\mathbb{E}(V)}{m}\]
  • Per-family error rate (PFER) : It is the expected number of type I errors (per family denotes the family of null hypotheses under consideration) \[\mathrm{PFER} = \mathbb{E}(V)\]
  • Family-wise error rate (FWER) : It is the probability of making at least one Type I error. This measure is useful in many of the techniques we will discuss later \[\mathrm{FWER} = P(V \geq 1)\]
  • False Discovery Rate (FDR) : It is the expected proportion of Type I errors among the rejected hypotheses. The factor \(P(R > 0)\) is included because the ratio \(V/R\) is undefined when \(R = 0\); the FDR is defined to be 0 in that case. \[\mathrm{FDR} = \mathbb{E}\left(\frac{V}{R} \,\middle|\, R > 0\right) P(R > 0)\]
  • Positive false discovery rate (pFDR) : The rate at which rejected discoveries are false positives, given \(R\) is positive \[\mathrm{pFDR} = \mathbb{E}(\frac{V}{R} | R > 0)\]

FWER based methods

  • Single Step : Equal adjustments made to all \(p\)-values based on the threshold \(\alpha\)
  • Sequential : Adaptive adjustments made sequentially to each \(p\)-value

Bonferroni Correction

Each of the \(m\) unadjusted \(p\)-values is compared against \(\frac{\alpha}{m}\); equivalently, each \(p\)-value is multiplied by \(m\). This single-step adjustment controls the FWER at level \(\alpha\).

Holm-Bonferroni Correction

  • Order the unadjusted \(p\)-values such that \(p_1 \leq p_2 \leq \ldots \leq p_m\)
  • Given a type I error rate \(\alpha\), let \(k\) be the minimal index such that \[p_k > \frac{\alpha}{m - k + 1}\]
  • Reject the null hypotheses \(H_1, \ldots, H_{k-1}\) and accept the hypotheses \(H_k, \ldots, H_{m}\)
  • In case \(k = 1\), accept all null hypotheses; if no such \(k\) exists, reject all null hypotheses (a short code sketch follows this list)
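A minimal Python sketch of this step-down procedure, assuming NumPy is available; the p-values below are made up for illustration:

```python
import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                   # indices that sort p-values ascending
    reject = np.zeros(m, dtype=bool)
    for step, idx in enumerate(order):      # step = 0, 1, ..., m - 1
        if p[idx] <= alpha / (m - step):    # threshold alpha / (m - k + 1) for the k-th smallest
            reject[idx] = True
        else:
            break                           # first failure: accept this and all larger p-values
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]  # hypothetical
print(holm_bonferroni(pvals))
```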

FDR based methods

Benjamini-Hochberg.

  • Given a target FDR level \(\delta\), let \(k\) be the maximal index such that \[p_k \leq \delta\frac{k}{m}\]
  • Reject the null hypotheses \(H_1, \ldots, H_{k}\) and accept the hypotheses \(H_{k+1}, \ldots, H_{m}\); if no such \(k\) exists, accept all null hypotheses


6.3 - Issues with Multiple Testing

If we are conducting a hypothesis test with an \(\alpha\) level of 0.05, then we are accepting a 5% chance of making a Type I error (i.e., rejecting the null hypothesis when the null hypothesis is really true). If we conducted 100 hypothesis tests at a 0.05 \(\alpha\) level in which the null hypotheses were all really true, we would expect to reject the null and make a Type I error in about 5 of those tests. 

Later in this course you will learn about some statistical procedures that may be used instead of performing multiple tests. For example, to compare the means of more than two groups you can use an analysis of variance ("ANOVA"). To compare the proportions of more than two groups you can conduct a chi-square test of independence. 

A related issue is publication bias. Research studies with statistically significant results are published much more often than studies without statistically significant results. This means that if 100 studies are performed in which there is really no difference in the population, the 5 studies that found statistically significant results may be published while the 95 studies that did not will not be published. Thus, when you review the published literature you will only read about the studies that found statistically significant results; you would not find the studies that did not.

One quick method for correcting for multiple tests is to divide the alpha level by the number of tests being conducted. For instance, if you are comparing three groups using a series of three pairwise tests, you could divide your overall alpha level (the "family-wise alpha level") by three. If we were using a standard alpha level of 0.05, then our pairwise alpha level would be \(\frac{0.05}{3}=0.016667\). We would then compare each of our three p-values to 0.016667 to determine statistical significance. This is known as the Bonferroni method, one of the most conservative approaches to controlling for multiple tests (i.e., it is more likely to lead to a Type II error); a small sketch of this adjustment is shown below. Later in the course you will learn how to use the Tukey method when comparing the means of three or more groups; that approach is often preferred because it is more liberal.
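As a rough sketch of this calculation (the three pairwise p-values below are hypothetical, not from any real study):

```python
# Compare three pairwise p-values against a Bonferroni-adjusted alpha level.
family_alpha = 0.05
n_tests = 3
pairwise_alpha = family_alpha / n_tests          # 0.05 / 3 = 0.016667

pairwise_pvalues = {"A vs B": 0.012, "A vs C": 0.030, "B vs C": 0.450}  # hypothetical
for pair, p in pairwise_pvalues.items():
    verdict = "significant" if p <= pairwise_alpha else "not significant"
    print(f"{pair}: p = {p:.3f} -> {verdict} at pairwise alpha = {pairwise_alpha:.6f}")
```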


Statistical Methods for Microarray Data Analysis, pp 37–55

Multiple Hypothesis Testing: A Methodological Overview

  • Anthony Almudevar
  • First Online: 01 January 2013


Part of the book series: Methods in Molecular Biology (MIMB, volume 972)

The process of screening for differentially expressed genes using microarray samples can usually be reduced to a large set of statistical hypothesis tests. In this situation, statistical issues arise which are not encountered in a single hypothesis test, related to the need to identify the specific hypotheses to be rejected and to report an associated error. As in any complex testing problem, it is rarely the case that a single method is always to be preferred, leaving the analyst with the problem of selecting the most appropriate method for the particular task at hand. In this chapter, an introduction to current multiple testing methodology is presented, with the objective of clarifying the methodological issues involved and providing the reader with some basis on which to compare and select methods.

  • Multiple hypothesis testing
  • Gene expression profiles
  • Stepwise procedures
  • Bayesian inference




Cite this protocol.

Almudevar, A. (2013). Multiple Hypothesis Testing: A Methodological Overview. In: Yakovlev, A., Klebanov, L., Gaile, D. (eds) Statistical Methods for Microarray Data Analysis. Methods in Molecular Biology, vol 972. Humana Press, New York, NY. https://doi.org/10.1007/978-1-60327-337-4_3


Hypothesis Testing | A Step-by-Step Guide with Easy Examples

Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.

Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.

There are 5 main steps in hypothesis testing:

  • State your research hypothesis as a null hypothesis (H₀) and an alternate hypothesis (Hₐ or H₁).
  • Collect data in a way designed to test the hypothesis.
  • Perform an appropriate statistical test .
  • Decide whether to reject or fail to reject your null hypothesis.
  • Present the findings in your results and discussion section.

Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.


After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H o ) and alternate (H a ) hypothesis so that you can test it mathematically.

The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.

  • H₀: Men are, on average, not taller than women. Hₐ: Men are, on average, taller than women.


For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.

There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).

If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.

Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.

Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .

In the height example, your statistical test will give you:

  • an estimate of the difference in average height between the two groups.
  • a p -value showing how likely you are to see this difference if the null hypothesis of no difference is true.

Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.

In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.

In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).


The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .

In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.

In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.

However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.

If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”

These are superficial differences; you can see that they mean the same thing.

You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.

If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .


Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.

A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.

A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).

Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.

Cite this Scribbr article


Bevans, R. (2023, June 22). Hypothesis Testing | A Step-by-Step Guide with Easy Examples. Scribbr. Retrieved April 3, 2024, from https://www.scribbr.com/statistics/hypothesis-testing/


Multiple Hypothesis Testing

Aug 20, 2020

1. Motivation

It is important to control for false discoveries when multiple hypotheses are tested. Under the Neyman-Pearson formulation, each hypothesis test involves a decision rule with false positive rate (FPR) at most $\alpha$ (e.g. $\alpha = 0.05$). However, if $m$ such $\alpha$-level tests are performed and all null hypotheses are true, the probability of at least one false discovery can be as high as $\min(m\alpha, 1)$; even for independent tests it is $1 - (1 - \alpha)^m$, which approaches 1 as $m$ grows. Multiple hypothesis testing correction involves adjusting the significance level(s) to control error related to false discoveries. Some of the material presented is based on UC Berkeley's Data102 course.

2. P-values

Consider $H_{0}: \theta = \theta_{0}$ versus $H_{1}: \theta = \theta_{1}$. Let \(\mathbb{P}_{\theta_{0}}(x)\) be the distribution of data \(X \in \mathbb{R}^{p}\) under the null, and let \(S = \{X^{(i)}\}_{i = 1}^{m}\) be the observed dataset. Additionally, denote $S_{0}$ as the unobserved dataset drawn from \(\mathbb{P}_{\theta_{0}}(x)\).

If the statistic $T(S_{0})$ has tail cumulative distribution function (CDF) \(F(t) = \mathbb{P}_{\theta_{0}}(T(S_{0}) > t)\), then the p-value is defined as the random variable $P = F(T(S))$. The graphical illustration of the density of $T$ (short for $T(S)$) is shown below.

[Figure: density of the null statistic $T$, with the tail area to the right of the observed statistic shaded; this tail area is the p-value $P = F(T(S))$.]

An important fact about the p-value $P$ is that it has a $Unif(0, 1)$ distribution under the null. A random variable has a $Unif(0, 1)$ distribution if and only if it has CDF $F(p) = p$ for $p \in [0, 1]$. We now show that $P$ has this CDF:

\[\mathbb{P}_{\theta_{0}}(P \leq p) = \mathbb{P}_{\theta_{0}}(F(T) \leq p) = \mathbb{P}_{\theta_{0}}(T \geq F^{-1}(p)) = F(F^{-1}(p)) = p,\]

where the first equality is by definition of $P$. For the second equality, it is helpful to recall that for the 1-to-1 function $F(\cdot)$, $F: T \rightarrow u$ and $F^{-1}: u \rightarrow T$; from the diagram above, $F(T)$ is decreasing with respect to $T$, so $F(T) \leq p$ exactly when $T \geq F^{-1}(p)$. The third equality is from the definition of $F(\cdot)$.
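As a quick numerical check of this fact (not part of the original derivation), the sketch below simulates one-sided z-test statistics under the null, computes p-values with the tail CDF, and confirms they look uniform; NumPy and SciPy are assumed, and the sample size and seed are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
T = rng.normal(size=100_000)        # test statistics drawn under the null
pvals = stats.norm.sf(T)            # p = F(T), with F the tail CDF of N(0, 1)

# Under the null, the p-values should be Uniform(0, 1).
print(stats.kstest(pvals, "uniform"))                  # large p-value -> consistent with uniformity
print(np.histogram(pvals, bins=10, range=(0, 1))[0])   # roughly equal counts per bin
```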

3. Bonferroni Correction

Let $V$ be the number of false positives. Then the probability of at least one false discovery ($V > 0$) among $m$ tests (not necessarily independent) is defined as the family-wise error rate (FWER). Bonferroni correction adjusts the significance level to $\alpha / m$. This controls the FWER to be at most $\alpha$. If there are $m_{0} \leq m$ true null hypotheses, then

\[\mathrm{FWER} = \mathbb{P}\Big(\bigcup_{i \,\in\, \text{true nulls}} \big\{p_{i} \leq \tfrac{\alpha}{m}\big\}\Big) \leq \sum_{i \,\in\, \text{true nulls}} \mathbb{P}\big(p_{i} \leq \tfrac{\alpha}{m}\big) = \frac{m_{0}\,\alpha}{m} \leq \alpha,\]

where the first inequality is from the union bound (Boole's inequality) and the equality uses the fact that null p-values are $Unif(0, 1)$. In practice, the observed p-value $p_{i}$ is adjusted according to

\[p_{i}^{adj} = \min(m\, p_{i},\ 1)\]

for $i = 1, \dots, m$. Then the $i$-th null hypothesis is rejected if $p_{i}^{adj} \leq \alpha$. Let us simulate 10 p-values from $unif(0, 0.3)$ and implement Bonferroni corrected p-values (a minimal sketch is shown below).
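A minimal sketch of that simulation with NumPy; the seed is arbitrary, and only the Uniform(0, 0.3) sample of 10 p-values comes from the text.

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = 0.05
m = 10
pvals = rng.uniform(0, 0.3, size=m)       # simulate 10 p-values from Unif(0, 0.3)

p_adj = np.minimum(m * pvals, 1.0)        # Bonferroni: p_adj = min(m * p, 1)
reject = p_adj <= alpha

for p, pa, r in zip(pvals, p_adj, reject):
    print(f"p = {p:.4f}  adjusted = {pa:.4f}  reject = {r}")
```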

4. Benjamini-Hochberg

A major criticism of Bonferroni correction is that it is too conservative: false positives are avoided at the expense of false negatives. The Benjamini-Hochberg (BH) procedure instead controls the FDR to avoid more false negatives. The FDR among $m$ tests is defined as

\[\mathrm{FDR} = \mathbb{E}\Big[\frac{V}{\max(R, 1)}\Big] = \mathbb{E}\Big[\frac{V}{R}\,\Big|\, R > 0\Big]\,\mathbb{P}(R > 0),\]

where $R$ is the number of rejections among $m$ tests. The BH procedure adjusts the p-value cutoff by allowing looser cutoffs given earlier discoveries. This is graphically illustrated below.

[Figure: sorted p-values $p_{(i)}$ plotted against their rank $i$, together with the BH rejection boundary $\frac{i\alpha}{m}$; all hypotheses up to the largest crossing point $R$ are rejected.]

The BH procedure is as follows

  • For each independent test, compute the p-value $p_{i}$. Sort the p-values from smallest to largest, $p_{(1)} < \cdots < p_{(m)}$.
  • Select \(R = \max\big\{i: p_{(i)} < \frac{i\alpha}{m}\big\}\).
  • Reject null hypotheses with p-value $\leq p_{(R)}$.

By construction, this procedure rejects exactly $R$ hypotheses, and

\[p_{(R)} < \frac{R\alpha}{m}.\]

Let $m_{0} \leq m$ be the number of true null hypotheses. Let $X_{i} = \mathbb{1}(p_{i} \leq p_{(R)})$ indicate whether hypothesis $i$ is rejected or not. Since $p_{i} \sim unif(0, 1)$ under the null, $X_{i} \sim bernoulli(p_{(R)})$. Under the assumption that tests are independent, $V = \sum_{i = 1}^{m_{0}}X_{i} \sim binomial(m_{0}, p_{(R)})$. Then by definition

\[\mathrm{FDR} = \mathbb{E}\Big[\frac{V}{R}\Big] = \frac{m_{0}\, p_{(R)}}{R} < \frac{m_{0}}{R}\cdot\frac{R\alpha}{m} = \frac{m_{0}\,\alpha}{m} \leq \alpha.\]

In practice, the observed p-values are adjusted according to

\[p_{(i)}^{adj} = \min_{j \geq i} \min\Big(\frac{m}{j}\, p_{(j)},\ 1\Big)\]

for $i = 1, \dots ,m$. The $i$-th null hypothesis is rejected if $p_{(i)}^{adj} \leq \alpha$. This results in exactly $R$ rejected null hypotheses, because if $i \leq R$, then $p_{(i)}^{adj} < \alpha$:

\[p_{(i)}^{adj} = \min_{j \geq i} \frac{m}{j}\, p_{(j)} \leq \frac{m}{R}\, p_{(R)} < \alpha.\]

The first inequality is from the definition of the minimum over a set that includes \(\frac{m p_{(R)}}{R}\) (since $i \leq R$), and the second inequality is by construction of \(\frac{m}{R}p_{(R)}\), because \(p_{(R)} < \frac{R\alpha}{m}\). If $i > R$, then $p_{(i)}^{adj} > \alpha$ because $p_{(R)}$ is defined as the last p-value in the sorted list with $p_{(i)} < \frac{i\alpha}{m}$. Let us simulate 10 p-values from $unif(0, 0.3)$ and implement BH corrected p-values (a minimal sketch is shown below).
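A minimal sketch of that simulation, implementing the BH adjustment above with NumPy (seed and helper name are arbitrary):

```python
import numpy as np

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values: p_adj_(i) = min_{j >= i} min(m * p_(j) / j, 1)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                                  # ascending order of p-values
    ranked = p[order] * m / np.arange(1, m + 1)            # m * p_(j) / j
    adjusted = np.minimum.accumulate(ranked[::-1])[::-1]   # running minimum from the largest rank down
    adjusted = np.minimum(adjusted, 1.0)
    out = np.empty(m)
    out[order] = adjusted                                  # put back in the original order
    return out

rng = np.random.default_rng(42)
alpha = 0.05
pvals = rng.uniform(0, 0.3, size=10)      # simulate 10 p-values from Unif(0, 0.3)
p_adj = bh_adjust(pvals)
print(np.column_stack([pvals, p_adj]))
print("rejected:", np.sum(p_adj <= alpha))
```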

Multiple testing: when is many too much?

  • PMID: 33300887
  • DOI: 10.1530/EJE-20-1375

In almost all medical research, more than a single hypothesis is being tested or more than a single relation is being estimated. Testing multiple hypotheses increases the risk of drawing a false-positive conclusion. We briefly discuss this phenomenon, which is often called multiple testing. Also, methods to mitigate the risk of false-positive conclusions are discussed.


  • Published: December 2009

How does multiple testing correction work?

  • William S Noble

Nature Biotechnology volume 27, pages 1135–1137 (2009)


When prioritizing hits from a high-throughput experiment, it is important to correct for random events that falsely appear significant. How is this done and what methods should be used?


Imagine that you have just invested a substantial amount of time and money in a shotgun proteomics experiment designed to identify proteins involved in a particular biological process. The experiment successfully identifies most of the proteins that you already know to be involved in the process and implicates a few more. Each of these novel candidates will need to be verified with a follow-up assay. How do you decide how many candidates to pursue?

The answer lies in the tradeoff between the cost associated with a false positive versus the benefit of identifying a novel participant in the biological process that you are studying. False positives tend to be particularly problematic in genomic or proteomic studies where many candidates must be statistically tested.

Such studies may include identifying genes that are differentially expressed on the basis of microarray or RNA-Seq experiments, scanning a genome for occurrences of candidate transcription factor binding sites, searching a protein database for homologs of a query protein or evaluating the results of a genome-wide association study. In a nutshell, the property that makes these experiments so attractive—their massive scale—also creates many opportunities for spurious discoveries, which must be guarded against.

In assessing the cost-benefit tradeoff, it is helpful to associate with each discovery a statistical confidence measure. These measures may be stated in terms of P -values, false discovery rates or q -values. The goal of this article is to provide an intuitive understanding of these confidence measures, a sense for how they are computed and some guidelines for how to select an appropriate measure for a given experiment.

As a motivating example, suppose that you are studying CTCF, a highly conserved zinc-finger DNA-binding protein that exhibits diverse regulatory functions and that may play a major role in the global organization of the chromatin architecture of the human genome 1 . To better understand this protein, you want to identify candidate CTCF binding sites in human chromosome 21. Using a previously published model of the CTCF binding motif ( Fig. 1a ) 2 , each 20 nucleotide (nt) sub-sequence of chromosome 21 can be scored for its similarity to the CTCF motif. Considering both DNA strands, there are 68 million such subsequences. Figure 1b lists the top 20 scores from such a search.

Figure 1. (a) The binding preference of CTCF 2 represented as a sequence logo 9 , in which the height of each letter is proportional to the information content at that position. (b) The 20 top-scoring occurrences of the CTCF binding site in human chromosome 21. Coordinates of the starting position of each occurrence are given with respect to human genome assembly NCBI 36.1. (c) A histogram of scores produced by scanning a shuffled version of human chromosome 21 with the CTCF motif. (d) This panel zooms in on the right tail of the distribution shown in c. The blue histogram is the empirical null distribution of scores observed from scanning a shuffled chromosome. The gray line is the analytic distribution. The P-value associated with an observed score of 17.0 is equal to the area under the curve to the right of 17.0 (shaded pink). (e) The false discovery rate is estimated from the empirical null distribution for a score threshold of 17.0. There are 35 null scores >17.0 and 519 observed scores >17.0, leading to an estimate of 6.7%. This procedure assumes that the number of observed scores equals the number of null scores.

Interpreting scores with the null hypothesis and the P -value

How biologically meaningful are these scores? One way to answer this question is to assess the probability that a particular score would occur by chance. This probability can be estimated by defining a 'null hypothesis' that represents, essentially, the scenario that we are not interested in (that is, the random occurrence of 20 nucleotides that match the CTCF binding site).

The first step in defining the null hypothesis might be to shuffle the bases of chromosome 21. After this shuffling procedure, high-scoring occurrences of the CTCF motif will only appear because of random chance. Then, the shuffled chromosome can be rescanned with the same CTCF matrix. Performing this procedure results in the distribution of scores shown in Figure 1c .

Although it is not visible in Figure 1c, out of the 68 million 20-nt sequences in the shuffled chromosome, only one had a score ≥26.30. In statistics, we say that the probability of observing this score under the null hypothesis is 1/68 million, or 1.5 × 10⁻⁸. This probability—the probability that a score at least as large as the observed score would occur in data drawn according to the null hypothesis—is called the P-value.

Likewise, the P-value of a candidate CTCF binding site with a score of 17.0 is equal to the percentage of scores in the null distribution that are ≥17.0. Among the 68 million null scores shown in Figure 1c, 35 are ≥17.0, leading to a P-value of 5.5 × 10⁻⁷ (35/68 million). The P-value associated with score x corresponds to the area under the null distribution to the right of x (Fig. 1d).
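In code, an empirical P-value of this kind is just a counting exercise. The sketch below uses simulated Gaussian scores as a stand-in for the motif-scan and shuffled-genome scores, which are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
null_scores = rng.normal(size=1_000_000)     # stand-in for scores from a shuffled sequence
observed_score = 4.0                         # stand-in for a candidate site's score

# Empirical P-value: fraction of null scores at least as large as the observed score.
p_empirical = np.mean(null_scores >= observed_score)
print(p_empirical)
```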

Shuffling the human genome and rescanning with the CTCF motif is an example of an 'empirical null model'. Such an approach can be inefficient because a large number of scores must be computed. In some cases, however, it is possible to analytically calculate the form of the null distribution and calculate corresponding P -values (that is, by defining the null distribution with mathematical formulae rather than by estimating it from measured data).

In the case of scanning for CTCF motif occurrences, an analytic null distribution (gray line in Fig. 1d) can be calculated using a dynamic programming algorithm, assuming that the sequence being scanned is generated randomly with a specified frequency of each of the four nucleotides 3 . This distribution allows us to compute, for example, that the P-value associated with the top score in Figure 1b is 2.3 × 10⁻¹⁰ (compared to 1.5 × 10⁻⁸ under the empirical null model). This P-value is more accurate and much cheaper to compute than the P-value estimated from the empirical null model.

In practice, determining whether an observed score is statistically significant requires comparing the corresponding statistical confidence measure (the P -value) to a confidence threshold α. For historical reasons, many studies use thresholds of α = 0.01 or α = 0.05, though there is nothing magical about these values. The choice of the significance threshold depends on the costs associated with false positives and false negatives, and these costs may differ from one experiment to the next.

Why P -values are problematic in a high-throughput experiment

Unfortunately, in the context of an experiment that produces many scores, such as scanning a chromosome for CTCF binding sites, reporting a P -value is inappropriate. This is because the P -value is only statistically valid when a single score is computed. For instance, if a single 20-nt sequence had been tested as a match to the CTCF binding site, rather than scanning all of chromosome 21, the P -value could be used directly as a statistical confidence measure.

In contrast, in the example above, 68 million 20-nt sequences were tested. In the case of a score of 17.0, even though it is associated with a seemingly small P-value of 5.5 × 10⁻⁷ (the chance of obtaining such a P-value from null data is less than one in a million), scores of 17.0 or larger were in fact observed in a scan of the shuffled genome, owing to the large number of tests performed. We therefore need a 'multiple testing correction' procedure to adjust our statistical confidence measures based on the number of tests performed.

Correcting for multiple hypothesis tests

Perhaps the simplest and most widely used method of multiple testing correction is the Bonferroni adjustment. If a significance threshold of α is used, but n separate tests are performed, then the Bonferroni adjustment deems a score significant only if the corresponding P-value is ≤ α/n. In the CTCF example, we considered 68 million distinct 20-mers as candidate CTCF sites, so achieving statistical significance at α = 0.01 according to the Bonferroni criterion would require a P-value < 0.01/(68 × 10⁶) = 1.5 × 10⁻¹⁰. Because the smallest observed P-value in Figure 1b is 2.3 × 10⁻¹⁰, no scores are deemed significant after correction.

The Bonferroni adjustment, when applied using a threshold of α to a collection of n scores, controls the 'family-wise error rate'. That is, the adjustment ensures that for a given score threshold, one or more larger scores would be expected to be observed in the null distribution with a probability of α. Practically speaking, this means that, given a set of CTCF sites with a Bonferroni adjusted significance threshold of α = 0.01, we can be 99% sure that none of the scores would be observed by chance when drawn according to the null hypothesis.

In many multiple testing settings, minimizing the family-wise error rate is too strict. Rather than saying that we want to be 99% sure that none of the observed scores is drawn according to the null, it is frequently sufficient to identify a set of scores for which a specified percentage of scores are drawn according to the null. This is the basis of multiple testing correction using false discovery rate (FDR) estimation.

The simplest form of FDR estimation is illustrated in Figure 1e, again using an empirical null distribution for the CTCF scan. For a specified score threshold t = 17.0, we count the number s_obs of observed scores ≥ t and the number s_null of null scores ≥ t. Assuming that the total numbers of observed scores and null scores are equal, the estimated FDR is simply s_null/s_obs. In the case of our CTCF scan, the FDR associated with a score of 17.0 is 35/519 = 6.7%.
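A minimal sketch of this counting-based FDR estimate, again with simulated scores standing in for the real motif-scan and shuffled-genome scores:

```python
import numpy as np

def simple_fdr(observed_scores, null_scores, threshold):
    """Estimate FDR at a score threshold as (# null scores >= t) / (# observed scores >= t),
    assuming the observed and null score sets are the same size."""
    s_null = np.sum(np.asarray(null_scores) >= threshold)
    s_obs = np.sum(np.asarray(observed_scores) >= threshold)
    return s_null / s_obs if s_obs > 0 else 0.0

rng = np.random.default_rng(0)
null_scores = rng.normal(size=100_000)                            # scores from shuffled data
observed_scores = np.concatenate([rng.normal(size=99_000),        # mostly null-like scores ...
                                  rng.normal(loc=4, size=1_000)]) # ... plus some true signal
print(simple_fdr(observed_scores, null_scores, threshold=3.0))
```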

Note that, in Figure 1e , FDR estimates were computed directly from the score. It is also possible to compute FDRs from P -values using the Benjamini-Hochberg procedure, which relies on the P -values being uniformly distributed under the null hypothesis 4 . For example, if the P -values are uniformly distributed, then the P -value 5% of the way down the sorted list should be ∼ 0.05. Accordingly, the procedure consists of sorting the P -values in ascending order, and then dividing each observed P -value by its percentile rank to get an estimated FDR. In this way, small P -values that appear far down the sorted list will result in small FDR estimates, and vice versa.

In general, when an analytical null model is available, you should use it to compute P -values and then use the Benjamini-Hochberg procedure because the resulting estimated FDRs will be more accurate. However, if you only have an empirical null model, then there is no need to estimate P -values in an intermediate step; instead you may directly compare your score distribution to the empirical null, as in Figure 1e .

These simple FDR estimation methods are sufficient for many studies, and the resulting estimates are provably conservative with respect to a specified null hypothesis; that is, if the simple method estimates that the FDR associated with a collection of scores is 5%, then on average the true FDR is ≤5%. However, a variety of more sophisticated methods have been developed for achieving more accurate FDR estimates (reviewed in ref. 5 ). Most of these methods focus on estimating a parameter π0, which represents the percentage of the observed scores that are drawn according to the null distribution. Depending on the data, applying such methods may make a big difference or almost no difference at all. For the CTCF scan, one such method 6 assigns slightly lower estimated FDRs to each observed score, but the number of sites identified at a 5% FDR threshold remains unchanged relative to the simpler method.

Complementary to the FDR, Storey6 proposed defining the q -value as an analog of the P -value that incorporates FDR-based multiple testing correction. The q -value is motivated, in part, by a somewhat unfortunate mathematical property of the FDR: when considering a ranked list of scores, it is possible for the FDR associated with the first m scores to be higher than the FDR associated with the first m + 1 scores. For example, the FDR associated with the first 84 candidate CTCF sites in our ranked list is 0.0119, but the FDR associated with the first 85 sites is 0.0111. Unfortunately, this property (called nonmonotonicity, meaning that the FDR does not consistently get bigger) can make the resulting FDR estimates difficult to interpret. Consequently, Storey proposed defining the q -value as the minimum FDR attained at or above a given score. If we use a score threshold of T , then the q -value associated with T is the expected proportion of false positives among all of the scores above the threshold. This definition yields a well-behaved measure that is a function of the underlying score. We saw, above, that the Bonferroni adjustment yielded no significant matches at α = 0.05. If we use FDR analysis instead, then we are able to identify a collection of 519 sites at a q -value threshold of 0.05.
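The sketch below illustrates both ideas on simulated p-values (not the CTCF data): each sorted P-value divided by its percentile rank gives a rank-based FDR estimate, and taking the running minimum from the bottom of the sorted list upward turns those estimates into monotone q-values.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 10_000
# Simulated p-values: mostly uniform nulls plus a block of small "signal" p-values.
pvals = np.concatenate([rng.uniform(size=9_000), rng.beta(0.1, 10, size=1_000)])

order = np.argsort(pvals)
p_sorted = pvals[order]
ranks = np.arange(1, m + 1)

fdr = p_sorted * m / ranks                       # rank-based (BH-style) FDR estimate at each p-value
qvals = np.minimum.accumulate(fdr[::-1])[::-1]   # q-value: minimum FDR over thresholds at or beyond this rank
qvals = np.minimum(qvals, 1.0)

print("discoveries at q <= 0.05:", np.sum(qvals <= 0.05))
```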

In general, for a fixed significance threshold and fixed null hypothesis, performing multiple testing correction by means of FDR estimation will always yield at least as many significant scores as using the Bonferroni adjustment. In most cases, FDR analysis will yield many more significant scores, as in our CTCF analysis. The question naturally arises, then, whether a Bonferroni adjustment is ever appropriate.

Costs and benefits help determine the best correction method

Like choosing a significance threshold, choosing which multiple testing correction method to use depends upon the costs associated with false positives and false negatives. In particular, FDR analysis is appropriate if follow-up analyses will depend upon groups of scores. For example, if you plan to perform a collection of follow-up experiments and are willing to tolerate having a fixed percentage of those experiments fail, then FDR analysis may be appropriate. Alternatively, if follow-up will focus on a single example, then the Bonferroni adjustment is more appropriate.

It is worth noting that the statistics literature describes a related probability score, known as the 'local FDR' 7 . Unlike the FDR, which is calculated with respect to a collection of scores, the local FDR is calculated with respect to a single score. The local FDR is the probability that a particular test gives rise to a false positive. In many situations, especially if we are interested in following up on a single gene or protein, this score may be precisely what is desired. However, in general, the local FDR is quite difficult to estimate accurately.

Furthermore, all methods for calculating P -values or for performing multiple testing correction assume a valid statistical model—either analytic or empirical—that captures dependencies in the data. For example, scanning a chromosome with the CTCF motif leads to dependencies among overlapping 20-nt sequences. Also, the simple null model produced by shuffling assumes that nucleotides are independent. If these assumptions are not met, we risk introducing inaccuracies in our statistical confidence measures.

In summary, in any experimental setting in which multiple tests are performed, P -values must be adjusted appropriately. The Bonferroni adjustment controls the probability of making one false positive call. In contrast, false discovery rate estimation, as summarized in a q -value, controls the error rate among a set of tests. In general, multiple testing correction can be much more complex than is implied by the simple methods described here. In particular, it is often possible to design strategies that minimize the number of tests performed for a particular hypothesis or set of hypotheses. For more in-depth treatment of multiple testing issues, see reference 8 .

References

1. Phillips, J.E. & Corces, V.G. Cell 137, 1194–1211 (2009).

2. Kim, T.H. et al. Cell 128, 1231–1245 (2007).

3. Staden, R. Methods Mol. Biol. 25, 93–102 (1994).

4. Benjamini, Y. & Hochberg, Y. J. R. Stat. Soc. B 57, 289–300 (1995).

5. Kerr, K.F. Bioinformatics 25, 2035–2041 (2009).

6. Storey, J.D. J. R. Stat. Soc. Ser. B 64, 479–498 (2002).

7. Efron, B., Tibshirani, R., Storey, J. & Tusher, V. J. Am. Stat. Assoc. 96, 1151–1161 (2001).

8. Dudoit, S. & van der Laan, M.J. Multiple Testing Procedures with Applications to Genomics (Springer, New York, 2008).

9. Schneider, T.D. & Stephens, R.M. Nucleic Acids Res. 18, 6097–6100 (1990).


Acknowledgements

National Institutes of Health award P41 RR0011823.


Cite this article.

Noble, W. How does multiple testing correction work?. Nat Biotechnol 27 , 1135–1137 (2009). https://doi.org/10.1038/nbt1209-1135


A general framework for multiple testing dependence

Jeffrey T. Leek

a Department of Oncology, Johns Hopkins University School of Medicine, Baltimore, MD 21287; and

John D. Storey

b Lewis-Sigler Institute and Department of Molecular Biology, Princeton University, Princeton, NJ 08544

Author contributions: J.T.L. and J.D.S. designed research, performed research, contributed new reagents/analytic tools, analyzed data, and wrote the paper.


We develop a general framework for performing large-scale significance testing in the presence of arbitrarily strong dependence. We derive a low-dimensional set of random vectors, called a dependence kernel, that fully captures the dependence structure in an observed high-dimensional dataset. This result shows a surprising reversal of the “curse of dimensionality” in the high-dimensional hypothesis testing setting. We show theoretically that conditioning on a dependence kernel is sufficient to render statistical tests independent regardless of the level of dependence in the observed data. This framework for multiple testing dependence has implications in a variety of common multiple testing problems, such as in gene expression studies, brain imaging, and spatial epidemiology.

In many areas of science, there has been a rapid increase in the amount of data collected in any given study. This increase is due in part to the ability to computationally handle large datasets and the introduction of various high-throughput technologies. Analyzing data from such high-dimensional studies is often carried out by performing simultaneous hypothesis tests for some behavior of interest, on each of thousands or more measured variables. Large-scale multiple testing has been applied in fields such as genomics ( 1 – 3 ), astrophysics ( 4 , 5 ), brain imaging ( 6 – 8 ), and spatial epidemiology ( 9 ). By their very definition, high-dimensional studies rarely involve the analysis of independent variables, rather, many related variables are analyzed simultaneously. However, most statistical methods for performing multiple testing rely on independence, or some form of weak dependence, among the data corresponding to the variables being tested. Ignoring the dependence among hypothesis tests can result in both highly variable significance measures and bias caused by the confounding of dependent noise and the signal of interest.

Here, we develop an approach for addressing arbitrarily strong multiple testing dependence at the level of the original data collected in a high-dimensional study, before test statistics or P values have been calculated. We derive a low-dimensional set of random vectors that fully captures multiple testing dependence in any fixed dataset. By including this low-dimensional set of vectors in the model-fitting process, one may remove arbitrarily strong dependence resulting in independent parameter estimates, test statistics, and P values. This result represents a surprising reversal of the “curse of dimensionality” ( 10 ), because of the relatively small sample size in relation to the large number of tests being performed. Essentially, we show that the manifestation of the dependence cannot be too complex and must exist in a low-dimensional subspace of the data, driven by the sample size rather than by the number of hypothesis tests. This approach provides a sharp contrast to currently available approaches to this problem, such as the estimation of a problematically large covariance matrix, the conservative adjustment of P values, or the empirical warping of the test statistics' null distribution.

The main contributions of this article can be summarized as follows. We provide a precise definition of multiple testing dependence in terms of the original data, rather than in terms of P values or test statistics. We also state and prove a theoretical result showing how to account for arbitrarily strong dependence among multiple tests; no assumptions about a restricted dependence structure are required. By exploiting the dimensionality of the problem, we are able to account for dependence on each specific dataset, rather than relying on a population-level solution. We introduce a model that, when fit, makes the tests independent for all subsequent inference steps. Utilizing our framework allows all existing multiple testing procedures requiring independence to be extended so that they now provide strong control in the presence of general dependence. Our general characterization of multiple testing dependence directly shows that latent structure in high-dimensional datasets, such as population genetic substructure ( 11 ) or expression heterogeneity ( 12 ), is a special case of multiple testing dependence. We propose and demonstrate an estimation technique for implementing our framework in practice, which is applicable to a large class of problems considered here.

Notation and Assumptions

We assume that m related hypothesis tests are simultaneously performed, each based on an n-vector of data sampled from a common probability space on ℝ^n. The data corresponding to hypothesis test i are x_i = (x_{i1}, x_{i2}, …, x_{in}), for i = 1, 2, …, m. The overall data can be arranged into an m × n matrix X whose i th row is x_i. We assume that there are "primary variables" Y = (y_1, …, y_n) collected, describing the study design or experimental outcomes of interest, and any other covariates that will be employed. Primary variables are those that are both measured and included in the model used to test the hypotheses.

We assume that the goal is to perform a hypothesis test on E[x_i | Y]. We will also assume that E[x_i | Y] can be modeled with a standard basis-function model, which would include linear models, nonparametric smoothers, longitudinal models, and others. To this end, we write E[x_i | Y] = b_i S(Y), where b_i is a 1 × d vector and S(Y) is a d × n matrix of basis functions evaluated at Y, with d < n. When there is no ambiguity, we will write S = S(Y) to simplify notation. Note that Y can be composed of variables such as time, a treatment, experimental conditions, and demographic variables. The basis S can be arbitrarily flexible to incorporate most of the models commonly used in statistics for continuous data.

The residuals of the model are then e_i = x_i − E[x_i | Y] = x_i − b_i S. Analogously, we let E be the m × n matrix whose i th row is e_i. We make no assumptions about what distribution the residuals follow, although by construction E[e_i | S(Y)] = 0. We allow for arbitrary dependence across the tests, i.e., dependence across the rows of E. We assume that the marginal model for each e_i is known or approximated sufficiently well when performing the hypothesis tests. That is, we assume that the marginal null model for each test is correctly specified.

In matrix form, the model can be written as

X = BS(Y) + E,     [1]

where B is the m × d matrix whose i th row is b_i and E is the m × n matrix of residuals.

The goal is then to test m hypotheses of the form:

[Equation 2: for each i = 1, 2, …, m, a null hypothesis about the coefficient vector b_i versus its alternative, identically defined across tests.]

where the null and alternative hypothesis tests are identically defined for each of the tests. This setup encompasses what is typically employed in practice, such as in gene expression studies and other applications of microarrays, brain imaging, spatial epidemiology, astrophysics, and environmental modeling ( 4 , 7 , 9 , 13 , 14 ).
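To make this setup concrete, the following sketch (not from the paper) simulates data from the basis-function model, fits B by row-wise least squares, and forms the residual matrix R = X − B̂S; all dimensions, distributions, and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 1000, 20, 2           # m tests, n samples, d basis functions (illustrative sizes)

y = rng.normal(size=n)          # primary variable of interest
S = np.vstack([np.ones(n), y])  # d x n basis matrix S(Y): intercept + linear term

B_true = rng.normal(size=(m, d))          # true coefficients, one row per test
E = rng.normal(scale=0.5, size=(m, n))    # residual noise (independent here for simplicity)
X = B_true @ S + E                        # m x n data matrix, model X = B S(Y) + E

# Fit B by least squares for every row of X at once: B_hat = X S^T (S S^T)^{-1}
B_hat = X @ S.T @ np.linalg.inv(S @ S.T)
R = X - B_hat @ S                         # residual matrix; dependence across its rows is
                                          # what the paper calls estimation-level dependence
print(B_hat.shape, R.shape)               # (1000, 2) (1000, 20)
```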

Two Open Problems

The classical approach to testing multiple hypotheses is to first perform each test individually. This involves calculating a 1-dimensional statistic for each test, usually as some comparison of the model fit under the constraint of the null hypothesis to that under no constraints. By utilizing the observed test statistics and their null distributions, we calculate a P value for each test ( 15 ). An algorithm or point estimate is then applied to the set of P values to determine a significance threshold that controls a specific error measure at a specific level ( 16 ), such as the false discovery rate (FDR) at 10% ( 17 , 18 ). Variations on this approach have been suggested, such as estimating a q-value for each test ( 19 ) or a posterior error probability ( 20 ). Regardless of the approach, the validity and accuracy of these procedures are essentially determined by whether the null distributions are correctly specified (or conservatively specified) and whether the data are independent (or weakly dependent) across tests ( 21 , 22 ).

Two open problems in multiple testing have received a lot of recent attention. The first is concerned with controlling multiple testing error measures, such as the FDR, in the presence of dependence among the P values ( 23 , 24 ). This dependence is usually formulated as being present in the “noise” component of the models used to obtain the P values. The second open problem is concerned with the fact that latent structure among the tests can distort what would usually be the correct null distribution of the test statistics ( 11 , 25 – 27 ). The approach proposed here shows that both problems actually stem from sources of variation that are common among tests, which we show is multiple testing dependence, and both problems can be simultaneously resolved through one framework.

The current paradigm for addressing these two problems can be seen in Fig. 1 , where the steps taken to get from the original data X to a set of significant tests are shown. It can be seen that existing approaches are applied far downstream in the process. Specifically, adjustments are performed after 1-dimensional summaries of each test have been formed, either to the test statistics or P values. As we show below, the information about noise dependence and latent structure is found in the original data X by modeling common sources of variation among tests. Our proposed approach addresses multiple testing dependence (from either noise dependence or latent structure) at the early model-fitting stage ( Fig. 1 ), at which point the tests have been made stochastically independent and the null distribution is no longer distorted.

Fig. 1.

A schematic of the general steps of multiple hypothesis testing. We directly account for multiple testing dependence in the model-fitting step, so that all the downstream steps in the analysis are unaffected by dependence and have the same operating characteristics as independent tests. Our approach differs from current methods, which address dependence indirectly by modifying the test statistics, adaptively modifying the null distribution, or altering significance cutoffs. For these downstream methods the multiple testing dependence is not directly modeled from the data, so distortions of the signal of interest and the null distribution may be present regardless of which correction is implemented.

Proposed Framework

Definition of Multiple Testing Dependence.

Multiple testing dependence has typically been defined in terms of P values or test statistics resulting from multiple tests ( 21 , 24 , 26 , 28 , 29 ). Here, we form population-level and estimation-level definitions that apply directly to the full dataset, \(X\). The estimation-level definition also explicitly involves the model assumption and fit utilized in the significance analysis. When fitting model 1 , we denote the estimate of \(B\) by \(\hat{B}\).

Definition: We say that population-level multiple testing dependence exists when it is the case that:

\[\Pr(x_1, x_2, \ldots, x_m \mid S(Y)) \neq \prod_{i=1}^{m} \Pr(x_i \mid S(Y))\]

We say that estimation-level multiple testing dependence exists when it is the case that:

\[\Pr(x_1, x_2, \ldots, x_m \mid \hat{B}S(Y)) \neq \prod_{i=1}^{m} \Pr(x_i \mid \hat{B}S(Y))\]

Multiple testing dependence at the population level is therefore any probabilistic dependence among the \(x_i\), after conditioning on \(Y\). In terms of model 1 , this is equivalent to the existence of dependence across the rows of \(E\); i.e., dependence among the \(e_1, e_2, \ldots, e_m\). Estimation-level dependence is equivalent to dependence among the rows of the residual matrix \(R = X - \hat{B}S\). It will usually be the case that if population-level multiple testing dependence exists, then this will lead to estimation-level multiple testing dependence. The framework we introduce in this article is aimed at addressing both types of multiple testing dependence.
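As a concrete, entirely simulated illustration of estimation-level dependence, the sketch below fits model 1 by least squares and inspects the correlation among rows of the residual matrix; the latent factor z that induces the dependence is an assumption of the simulation, not part of the original framework.

```python
import numpy as np

# Simulated illustration of estimation-level dependence: an unmodeled factor z
# shared across tests leaves correlation among the rows of R = X - B_hat S.
rng = np.random.default_rng(1)
m, n = 500, 20
y = np.repeat([0.0, 1.0], n // 2)                   # modeled primary variable
z = rng.normal(size=n)                              # unmodeled latent factor

S = np.vstack([np.ones(n), y])                      # design used in the analysis
Gamma = rng.normal(size=(m, 1))                     # per-test loadings on z
X = Gamma @ z[None, :] + rng.normal(size=(m, n))    # all nulls true, shared variation

B_hat = X @ np.linalg.pinv(S)                       # least-squares fit of model 1
R = X - B_hat @ S                                   # residual matrix

C = np.corrcoef(R)                                  # correlations among residual rows
print(np.mean(np.abs(C[np.triu_indices(m, k=1)])))  # well above 0 under dependence
```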

A General Decomposition of Dependence.

Dependence among the rows of E and among the rows of R = X − B ^ S are types of multivariate dependence among vectors. The standard approach for modeling multivariate dependence is to estimate a population-level parameterization of the dependence and then include estimates of these parameters when performing inference ( 30 ). For example, if the e i are assumed to be Normally distributed with the columns of E being independently and identically distributed random m -vectors, then one would estimate the m × m covariance matrix which parameterizes dependence across the rows of E . One immediate problem is that because n ≪ m , it is computationally and statistically problematic to estimate the covariance matrix ( 31 ).

A key feature is that, in the multiple testing scenario, the dimension along which the sampling occurs is different than the dimension along which the multivariate inference occurs. In terms of our notation, the sampling occurs with respect to the columns of X , whereas the multiple tests occur across the rows of X . This sampling-to-inference structure requires one to develop a specialized approach to multivariate dependence that is different from the classical scenarios. For example, the classical construction and interpretation of a P value threshold is such that a true null test is called significant with P value ≤ α at a rate of α over many independent replications of the study. However, in the multiple testing scenario, the P values that we utilize are not P values corresponding to a single hypothesis test over m independent replications of the study. Rather, the P values result from m related variables that have all been observed in a single study from a single sample of size n . The “sampling variation” that forms the backbone of most statistical thinking is different in our case: we observe one instance of sampling variation among the variables being tested. Therefore, even if each hypothesis test's P value behaves as expected over repeated studies, the set of P values from multiple tests in a single study will not necessarily exhibit the same behavior. Whereas this phenomenon prevents us from invoking well-established statistical principles, such as the classical interpretation of a P value, the fact that we have measured thousands of related variables from this single instance of sampling variation allows us to capture and model the common sources of variation across all tests. Multiple testing dependence is variation that is common among hypothesis tests.

Thus, rather than proposing a population-level approach to this problem (which includes the population of all hypothetical studies that could take place in terms of sampling of the columns of X ), we directly model the random manifestation of dependence in the observed data from a given study, by aggregating the common sampling variation across all tests' data. Including this information in the model during subsequent significance analyses removes the dependence within the study. Therefore dependence is removed across all studies, providing study-specific and population-level solutions. To directly model the random manifestation of dependence in the observed data, we do the following: ( i ) additively partition E into dependent and independent components, ( ii ) take the singular value decomposition of the dependent component, and ( iii ) treat the right singular values as covariates in the model fitting and subsequent hypothesis testing. To this end, we provide the following result, which shows that any dependence can be additively decomposed into a dependent component and an independent component. It is important to note that this is both for an arbitrary distribution for E and an arbitrary (up to degeneracy) level of dependence across the rows of E .
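A minimal version of steps i-iii might look like the sketch below, which approximates the dependent component from least-squares residuals and keeps the top right singular vectors as a working kernel. This is a simplified stand-in under our own assumptions, not the estimation algorithm from the SI Appendix, and the choice of r is left to the analyst.

```python
import numpy as np

# Simplified sketch of steps (i)-(iii): approximate the dependent component of
# E by the residuals of the primary model, take its singular value
# decomposition, and keep the top r right singular vectors as a working
# dependence kernel. This is not the SI Appendix algorithm.
def estimate_dependence_kernel(X, S, r):
    """Return an r x n candidate dependence kernel from the residuals of X on S."""
    B_hat = X @ np.linalg.pinv(S)            # least-squares fit of model 1
    R = X - B_hat @ S                        # residual matrix
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    return Vt[:r]                            # right singular vectors as covariates
```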

Proposition 1. Let the data corresponding to multiple hypothesis tests be modeled according to Eq. 1 . Suppose that for each \(e_i\), there is no Borel measurable function \(g\) such that \(e_i = g(e_1, \ldots, e_{i-1}, e_{i+1}, \ldots, e_m)\) almost surely. Then, there exist matrices \(\Gamma_{m \times r}\), \(G_{r \times n}\) (\(r \leq n\)), and \(U_{m \times n}\) such that

\[E = \Gamma G + U\]

where the rows of U are jointly independent random vectors so that

\[\Pr(u_1, u_2, \ldots, u_m) = \prod_{i=1}^{m} \Pr(u_i)\]

Also, for all \(i = 1, 2, \ldots, m\), \(u_i \neq 0\) and \(u_i = h_i(e_i)\) for a non-random Borel measurable function \(h_i\).

A formal proof of Proposition 1 and all subsequent theoretical results can be found in the supporting information (SI) Appendix . Note that if we let r = n and then set U = 0 or set U equal to an arbitrary m × n matrix of independently distributed random variables, then the independence of the rows of U is trivially satisfied. However, our added assumption regarding e i allows us to show that a nontrivial U exists where u i ≠ 0 and u i = h i ( e i ) for a deterministic function h i . In other words, u i is a function of e i in a nondegenerate fashion, which means that U truly represents a row-independent component of E . The intuition behind these properties is that our assumption guarantees that e i does indeed contain some variation that is independent from the other tests. For hypothesis tests where there does exist a Borel measurable g such that e i = g ( e 1 , …, e i −1 , e i +1 , …, e m ), then the variation of e i is completely dependent with that of the other tests' data. In this case, one can set u i = 0 and the above decomposition is still meaningful.

The decomposition of Proposition 1 immediately indicates one direction to take in solving the multiple testing dependence problem, namely to account for the Γ G component, thereby removing dependence. To this end, we now define a “dependence kernel” for the data X .

Definition: An r × n matrix G forms a dependence kernel for the high-dimensional data X , if the following equality holds:

\[X = BS + \Gamma G + U\]

where the rows of U are jointly independent as in Proposition 1 .

In practice, one would be interested in minimal dependence kernels, which are those satisfying the above definition and having the smallest number of rows, \(r\). Proposition 1 shows that at least one such \(G\) exists with \(r \leq n\) rows. As we discuss below in Scientific Applications , the manner in which one incorporates additional information beyond the original observations to estimate and utilize \(\Gamma\) and \(G\) is context specific. In the SI Appendix , we provide explicit descriptions for two scientific applications: latent structure as encountered in genomics and spatial dependence as encountered in brain imaging. We propose a new algorithm for estimating \(G\) in the genomics application and demonstrate that it has favorable operating characteristics.

Dependence Kernel Accounts for Dependence.

An important question arises from Proposition 1 . Is including G , in addition to S ( Y ), in the model used to perform the hypothesis tests sufficient to remove the dependence from the tests? If this is the case, then only an r × n matrix must be known to fully capture the dependence. This is in contrast to the m ( m −1)/2 parameters that must be known for a covariance matrix among tests, for example. To put this into context, consider a microarray experiment with 1,000 genes and 20 arrays. In this case, the covariance has ∼500,000 unknown parameters, whereas G has, at most, 400 unknown values. The following two results show that including G in addition to S ( Y ) in the modeling is sufficient to remove all multiple hypothesis testing dependence.
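The parameter counts in this example work out as follows:

\[\frac{m(m-1)}{2} = \frac{1{,}000 \times 999}{2} = 499{,}500 \qquad \text{versus} \qquad r \times n \leq 20 \times 20 = 400.\]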

Corollary 1. Under the assumptions of Proposition 1, all population-level multiple testing dependence is removed when conditioning on both Y and a dependence kernel G . That is ,

\[\Pr(x_1, x_2, \ldots, x_m \mid S(Y), G) = \prod_{i=1}^{m} \Pr(x_i \mid S(Y), G)\]

Suppose that, instead of fitting model 1 , we fit the decomposition from Proposition 1 , where we assume that \(S\) and \(G\) are known:

\[X = BS + \Gamma G + U \tag{3}\]

It follows that estimation-level multiple testing independence may then be achieved.

Proposition 2. Assume the data for multiple tests follow model 1 , and let \(G\) be any valid dependence kernel. Suppose that model 3 is fit by least squares, resulting in residuals \(r_i = x_i - \hat{b}_i S - \hat{\gamma}_i G\). When the row space jointly spanned by \(S\) and \(G\) has dimension less than \(n\), the residuals \(r_1, r_2, \ldots, r_m\) are jointly independent given \(S\) and \(G\), the \(\hat{b}_1, \hat{b}_2, \ldots, \hat{b}_m\) are jointly independent given \(S\) and \(G\), and

\[\Pr(\hat{b}_1, r_1, \hat{b}_2, r_2, \ldots, \hat{b}_m, r_m \mid S, G) = \prod_{i=1}^{m} \Pr(\hat{b}_i, r_i \mid S, G)\]

The analogous results hold for the residuals and parameter estimates when fitting the model under the constraints of the null hypothesis.

Since \(G\) will be unknown in practice, the practical implication of this proposition is that we only have to estimate the relatively small \(r \times n\) matrix \(G\) well in order to account for all of the dependence, while the simple least-squares solution for \(\Gamma\) suffices. When the row space jointly spanned by \(S\) and \(G\) has dimension equal to \(n\), the above proposition becomes trivially true. However, if we assume that \(S\), \(G\), and \(\Gamma\) are known, then the analogous estimation-level independence holds. In this case, we have to estimate both \(\Gamma\) and \(G\) well in order to account for dependence. These \((m + r)n\) parameters are still far fewer than the \(m(m-1)/2\) unknown parameters of a covariance matrix, for example.
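In practice, fitting model 3 by least squares amounts to regressing each row of X on the stacked design formed from S and G, as in the sketch below; the helper name fit_model3 is our own.

```python
import numpy as np

# Least-squares fit of model 3: regress each row of X on the stacked design
# formed from S and a dependence kernel G, then form the residuals
# r_i = x_i - b_hat_i S - gamma_hat_i G used by Proposition 2.
# Assumes the rows of the stacked design are linearly independent.
def fit_model3(X, S, G):
    D = np.vstack([S, G])                    # (d + r) x n combined design
    coef = X @ np.linalg.pinv(D)             # m x (d + r) coefficient estimates
    d = S.shape[0]
    B_hat, Gamma_hat = coef[:, :d], coef[:, d:]
    residuals = X - B_hat @ S - Gamma_hat @ G
    return B_hat, Gamma_hat, residuals
```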

Strong Control of Multiple Testing Error Rates.

Many methods exist for strongly controlling the family-wise error rate (FWER) or FDR ( 16 , 18 , 19 , 21 , 24 , 32 ). These methods are applied to the P values calculated from multiple hypothesis tests. Most of these methods require the P values corresponding to true null hypotheses to be independent in order for the procedure to provide strong control. For example, finite-sample strong control of several FDR procedures ( 21 , 24 ) and the conservative point estimation of the FDR ( 19 ) all require the true null P values to be independent. Several methods exist for controlling FWER or FDR when dependence is present. However, these either tend to be quite conservative or require special restrictions on the dependence structure ( 21 , 24 ).

When utilizing model 3 , the statistics formed for testing the hypotheses should be based on a function of the model fits and residuals. When this is the case, we achieve the desired independence of the P values.

Corollary 2. Suppose that the assumptions of Proposition 2 hold, model 3 is utilized to perform multiple hypothesis tests, and G is a known dependence kernel. If P values are calculated from test statistics based on a function of the model fits and residuals, then the resulting P values and test statistics are independent across tests.

In other words, Corollary 2 extends all existing multiple testing procedures that have been shown to provide strong control when the null P values are independent to the general dependence case. Instead of deriving new multiple testing procedures for dependence at the level of P values, we can use the existing ones by including G into the model fitting and inference carried out to get the P values themselves.
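One conventional way to obtain such P values is a row-wise nested F-test of the full design against the null design, with the dependence kernel included in both, as sketched below; the resulting P values can then be passed to any standard procedure (for example, the bh_threshold sketch above). The function name and arguments are illustrative, and the designs are assumed to be nested.

```python
import numpy as np
from scipy import stats

# Row-wise nested F-test with the dependence kernel G included in both the
# full and null designs; the resulting P values feed any standard FDR or FWER
# procedure developed for independent tests. Assumes the null design spans a
# subspace of the full design and that both have full row rank.
def f_test_pvalues(X, S_full, S_null, G):
    def rss(design):                          # residual sum of squares per row
        fit = X @ np.linalg.pinv(design) @ design
        return np.sum((X - fit) ** 2, axis=1)
    full = np.vstack([S_full, G])
    null = np.vstack([S_null, G])
    df_resid = X.shape[1] - full.shape[0]
    df_diff = full.shape[0] - null.shape[0]
    f = ((rss(null) - rss(full)) / df_diff) / (rss(full) / df_resid)
    return stats.f.sf(f, df_diff, df_resid)   # upper-tail F probabilities
```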

Scientific Applications

Two causes of multiple testing dependence can be derived directly from scientific problems of interest. In each case, the dependence kernel \(G\) has a practical scientific interpretation.

Spatial Dependence.

Spatial dependence usually arises as dependence in the noise because of a structural relationship among the tests. In this case, we will consider the \(e_i\) of model 1 to simply represent "noise," an example being the spatial dependence for noise that is typically assumed for brain-imaging data ( 6 – 8 ). In this setting, the activity levels of thousands of points in the brain are simultaneously measured, where the goal is to identify regions of the brain that are active. A common model for the measured intensities is a Gaussian random field ( 6 ). It is assumed that the Gaussian noise among neighboring points in the brain is dependent, where the covariance between two points in the brain is usually a function of their distance.

In Fig. 2 A and B , we show two datasets generated from a simplified 2-dimensional version of this model. It can be seen that the manifestation of dependence changes notably between the two studies, even though they come from the same data-generating distribution. Using model 3 for each dataset, we removed the \(\Gamma G\) term. In both cases, the noise among points in the 2-dimensional space becomes independent and the P value distributions of points corresponding to true null hypotheses follow the Uniform distribution. It has been shown that null P values following the Uniform (0,1) distribution is the property confirming that the assumed null distribution is correct ( 22 ). Additionally, it can be seen that the null P values from the unadjusted data fluctuate substantially between the two studies, and neither set follows the Uniform (0,1) null distribution. This is due to varying levels of correlation between \(S\) and \(G\) from model 3 . In one case, \(S\) and \(G\) are correlated, producing spurious signal among the true null hypotheses; this would lead to a major inflation of significance. In the other case, they are uncorrelated, leading to a major loss of power. By accounting for the \(\Gamma G\) term, we have resolved these issues.
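The following simplified one-dimensional analogue (our own construction, not the Gaussian random field used for the figure) shows the same qualitative behavior: each true null test is marginally valid, yet the single-study null P values need not look Uniform (0,1) because neighboring tests share autoregressive noise.

```python
import numpy as np
from scipy import stats

# 1-D analogue of the spatial example: autoregressive noise shared across
# neighboring tests. Every null hypothesis is true and each test is marginally
# valid, but the null P values within a single study need not look Uniform(0,1).
rng = np.random.default_rng(2)
m, n, rho = 400, 20, 0.9
y = np.repeat([0.0, 1.0], n // 2)

E = np.zeros((m, n))
E[0] = rng.normal(size=n)
for i in range(1, m):                               # AR(1) dependence across tests
    E[i] = rho * E[i - 1] + np.sqrt(1 - rho ** 2) * rng.normal(size=n)

_, p_null = stats.ttest_ind(E[:, y == 1.0], E[:, y == 0.0], axis=1)
# A histogram of p_null fluctuates markedly from one simulated study to the next.
```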

Fig. 2.

Simulated examples of multiple testing dependence. A and B consist of spatial dependence examples as simplified versions of that encountered in brain imaging, and C and D consist of latent structure examples as encountered in gene expression studies. In all examples, the data and the null P values are plotted both before and after subtracting the dependence kernel. The data are plotted in the form of a heat map (red, high numerical value; white, middle; blue, low). The signal is clearer and the true null tests' P values are unbiased after the dependence kernel is subtracted. ( A and B ) Each point in the heat map represents the data for one spatial variable. The two true signals are in the diamond and circle shapes, and there is autoregressive spatial dependence between the pixels. ( A ) An example where the spatial dependence confounds the true signal, and the null P values are anticonservatively biased. ( B ) An example where the spatial dependence is nearly orthogonal to the true signal, and the null P values are conservatively biased. ( C and D ) Each row of the heat map corresponds to a gene's expression values, where the first 400 rows are genes simulated to be truly associated with the dichotomous primary variable. Dependence across tests is induced by common unmodeled variables that also influence expression, as described in the text. ( C ) An example where dependence due to latent structure confounds the true signal, and the null P values are anticonservatively biased. ( D ) An example where dependence due to latent structure is nearly orthogonal to the true signal, and the null P values are conservatively biased.

Latent Structure.

A second source of multiple testing dependence is latent structure due to relevant factors not being included in the model. It is possible for there to be unmodeled factors that are common among the multiple tests but that are not included in \(S\). Suppose there exist unmodeled factors \(Z\) such that \(\mathbb{E}(x_i \mid Y) \neq \mathbb{E}(x_i \mid Y, Z)\) for more than one test. If we utilize model 1 when performing the significance analysis, there will be dependence across the rows of \(E\) induced by the common factor \(Z\), causing population-level multiple testing dependence. Likewise, there will be dependence across the rows of \(R\), causing estimation-level multiple testing dependence. A similar case can arise when the model for \(x_i\) in terms of \(Y\) is incorrect. For example, it could be the case that \(\mathbb{E}[x_i \mid Y] = b_i S^*(Y)\), where the differences between \(S\) and \(S^*\) are nontrivial among multiple tests. Here, there will be dependence across the rows of \(R\) induced by the variation common to multiple tests due to \(S^*\) but not captured by \(S\), which would cause estimation-level multiple testing dependence. Failing to include all relevant factors is a common issue in genomics leading to latent structure ( 11 , 12 ). The adverse effects of latent structure due to unmodeled factors on differential expression significance analyses have only recently been recognized ( 12 ).

Fig. 2 C and D shows independently simulated microarray studies in this scenario, where we have simulated a treatment effect plus effects from several unmodeled variables. The unmodeled factors were simulated as being independently distributed with respect to the treatment, which is equivalent to a study in which the treatment is randomized. As in Fig. 2 A and B , it can be seen that the P values corresponding to true null hypotheses (i.e., genes not differentially expressed with respect to the treatment) are not Uniformly distributed. When utilizing model 3 for these data and subtracting the term Γ G , the residuals are now made independent and the null P values are Uniform (0,1) distributed.
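A sketch of this kind of simulation, reusing the f_test_pvalues helper defined above, is shown below; the helper, the effect sizes, and all other settings are our own illustrative choices rather than the exact simulation behind Fig. 2.

```python
import numpy as np

# Latent-structure scenario in the spirit of Fig. 2 C and D: a treatment effect
# plus an unmodeled variable z shared across genes; z is independent of the
# treatment, mimicking a randomized study. Requires f_test_pvalues from the
# sketch above.
rng = np.random.default_rng(3)
m, n = 1000, 20
treat = np.repeat([0.0, 1.0], n // 2)
z = rng.normal(size=n)                              # unmodeled variable

effect = np.zeros(m)
effect[:400] = 1.5                                  # 400 truly associated genes
Gamma = rng.normal(size=(m, 1))
X = np.outer(effect, treat) + Gamma @ z[None, :] + rng.normal(size=(m, n))

ones = np.ones((1, n))
S_full, S_null = np.vstack([ones, treat]), ones
p_unadj = f_test_pvalues(X, S_full, S_null, G=np.empty((0, n)))  # model 1 only
p_adj = f_test_pvalues(X, S_full, S_null, G=z[None, :])          # model 3 with G
# Within a single study, only p_adj[400:] (the true nulls) is close to Uniform(0,1).
```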

Estimating G in Practice

There are a number of scenarios where estimating \(G\) is feasible in practice. One scenario is when nothing is known about the dependence structure, but it is also the case that \(d + r < n\), where \(d\) and \(r\) are the number of rows of the model \(S\) and dependence kernel \(G\), respectively. This is likely when the dependence is driven by latent variables, such as in gene expression heterogeneity ( 12 ). In the SI Appendix , we present an algorithm for estimating \(G\) in this scenario. It is shown that the proposed algorithm, called iteratively reweighted surrogate variable analysis (IRW-SVA), exhibits favorable operating characteristics, and we provide evidence for this over a broad range of simulations. Another scenario is when the dependence structure is well characterized at the population level. Here, it may even be the case that \(d + r \approx n\). This scenario is common in brain imaging ( 6 , 7 ) and other spatial dependence problems ( 9 ), as discussed above. Because \(\Gamma\) is largely determined by the known spatial structure, we can overcome the difficulty that \(d + r \approx n\) ( SI Appendix ).
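For the \(d + r < n\) scenario, the pieces sketched earlier can be chained end to end as shown below. This is a crude stand-in for IRW-SVA under our own simplifying assumptions, reusing the simulated data and helpers from the preceding sketches, with \(r\) fixed by hand rather than estimated as in the SI Appendix.

```python
# Crude end-to-end stand-in for the d + r < n scenario (not IRW-SVA): estimate a
# working dependence kernel from the residuals, refit model 3, and compute
# P values with G_hat included. Uses X, S_full, S_null and the helper functions
# from the sketches above; r is fixed by hand here rather than estimated.
G_hat = estimate_dependence_kernel(X, S_full, r=1)
B_hat, Gamma_hat, resid = fit_model3(X, S_full, G_hat)
p = f_test_pvalues(X, S_full, S_null, G=G_hat)
```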

We have described a general framework for multiple testing dependence in high-dimensional studies. Our framework defines multiple testing dependence as stochastic dependence among tests that remains after conditioning on the model used in the significance analysis. We presented an approach for addressing the problem of multiple testing dependence based on estimating the dependence kernel, a low-dimensional set of vectors that completely defines the dependence in any high-throughput dataset. We have shown that if the dependence kernel is known and included in the model, then the hypothesis tests can be made stochastically independent. This work extends existing results regarding error rate control under independence to the case of general dependence. An additional advantage of our approach is that we can not only estimate dependence at the level of the data, which is intuitively more appealing than estimating dependence at the level of P values or test statistics, but also directly adjust for that dependence in each specific study. We presented an algorithm with favorable operating characteristics for estimating the dependence kernel for one of the two main scientific areas of interest that we discussed. We anticipate that well-behaved estimates of the dependence kernel in other scientific areas are feasible.

One important implication of this work is that multiple testing dependence is tractable at the level of the original data. Downstream approaches to dealing with multiple testing dependence are not able to directly capture general dependence structure ( Fig. 1 ). Another implication of this work is that, for a fixed complexity, the stronger the dependence is among tests, the more feasible it is to properly estimate and model it. It has also been shown that the weaker multiple testing dependence is, the more appropriate it is to utilize methods that are designed for the independence case ( 21 ). Therefore, there is promise that the full range of multiple testing dependence levels is tractable for a large class of relevant scientific problems.

Supplementary Material

Acknowledgments.

We thank the editor and several anonymous referees for helpful comments. This research was supported in part by National Institutes of Health Grant R01 HG002913.

The authors declare no conflict of interest.

This article contains supporting information online at www.pnas.org/cgi/content/full/0808709105/DCSupplemental .

