Multiple Hypothesis Testing
Roger Higdon, Seattle Children’s Research Institute, Seattle, WA, USA
Synonyms: Multiple comparisons; Multiple testing
The multiple hypothesis testing problem occurs when a number of individual hypothesis tests are considered simultaneously. In this case, the significance or the error rate of individual tests no longer represents the error rate of the combined set of tests. Multiple hypothesis testing methods correct error rates for this issue.
Characteristics
In conventional hypothesis testing, the level of significance or type I error rate (the probability of wrongly rejecting the null hypothesis) for a single test is less than the probability of making an error on at least one test in a multiple hypothesis testing situation. While this is typically not an issue when testing a small number of preplanned hypotheses, the likelihood of making false discoveries is greatly increased when there are large numbers of unplanned or exploratory tests conducted based on the significance level or type I error rate from a single test. Therefore, it is...
Multiple Hypothesis Testing
Terminology
- Per-comparison error rate (PCER) : It is the expected rate of false positives per hypothesis, where \(m\) is the number of hypotheses tested and \(V\) is the number of type I errors (false rejections) \[\mathrm{PCER} = \frac{\mathbb{E}(V)}{m}\]
- Per-family error rate (PFER) : It is the expected number of type I errors (per family denotes the family of null hypotheses under consideration) \[\mathrm{PFER} = \mathbb{E}(V)\]
- Family-wise error rate (FWER) : It is the probability of making at least one Type I error. This measure is useful in many of the techniques we will discuss later \[\mathrm{FWER} = P(V \geq 1)\]
- False Discovery Rate (FDR) : It is the expected proportion of Type I errors among the rejected hypotheses, where \(R\) is the number of rejected hypotheses. The factor \(P(R > 0)\) makes the FDR equal to 0 when nothing is rejected (\(R = 0\)), where the ratio \(V/R\) would otherwise be undefined. \[\mathrm{FDR} = \mathbb{E}(\frac{V}{R} | R > 0) P(R > 0)\]
- Positive false discovery rate (pFDR) : The rate at which rejected discoveries are false positives, given \(R\) is positive \[\mathrm{pFDR} = \mathbb{E}(\frac{V}{R} | R > 0)\]
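To make these quantities concrete, here is a minimal Python sketch that estimates each error rate by simulation. All settings (100 tests, 90 true nulls, a shift of 3 standard deviations for the false nulls, unadjusted testing at \(\alpha = 0.05\)) are illustrative assumptions rather than anything prescribed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m, m0, alpha, n_sims = 100, 90, 0.05, 2000   # illustrative settings

V_all, R_all = [], []
for _ in range(n_sims):
    z = np.concatenate([rng.normal(0, 1, m0),        # true null hypotheses
                        rng.normal(3, 1, m - m0)])    # false null hypotheses
    p = 2 * stats.norm.sf(np.abs(z))                  # two-sided p-values
    reject = p < alpha                                # unadjusted testing
    V_all.append(reject[:m0].sum())                   # V: false rejections
    R_all.append(reject.sum())                        # R: total rejections

V, R = np.array(V_all), np.array(R_all)
print("PCER ~", V.mean() / m)                                      # E(V)/m
print("PFER ~", V.mean())                                          # E(V)
print("FWER ~", (V >= 1).mean())                                   # P(V >= 1)
print("FDR  ~", np.where(R > 0, V / np.maximum(R, 1), 0).mean())   # E(V/R | R>0) P(R>0)
```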
Multiple Testing
FWER based methods
- Single Step : Equal adjustments made to all \(p\)-values based on the threshold \(\alpha\)
- Sequential : Adaptive adjustments made sequentially to each \(p\)-value
Bonferroni Correction
- Compare each of the \(m\) unadjusted \(p\)-values against \(\frac{\alpha}{m}\) (equivalently, multiply each \(p\)-value by \(m\) and compare against \(\alpha\))
Holm-Bonferroni Correction
- Order the unadjusted \(p\)-values such that \(p_1 \leq p_2 \leq \ldots \leq p_m\)
- Given a type I error rate \(\alpha\), let \(k\) be the minimal index such that \[p_k > \frac{\alpha}{m - k + 1}\]
- Reject the null hypotheses \(H_1, \ldots, H_{k-1}\) and accept the hypotheses \(H_k, \ldots, H_{m}\); if no such \(k\) exists, reject all null hypotheses
- In case \(k = 1\), accept all null hypotheses (a short code sketch of this rule follows the list)
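A minimal Python sketch of this step-down rule; the p-values in the example are hypothetical.

```python
import numpy as np

def holm_bonferroni(pvals, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                    # indices of p-values, ascending
    reject = np.zeros(m, dtype=bool)
    for rank, idx in enumerate(order, start=1):  # rank k = 1, ..., m
        if pvals[idx] <= alpha / (m - rank + 1):
            reject[idx] = True                   # keep rejecting while the bound holds
        else:
            break                                # first failure stops the step-down pass
    return reject

print(holm_bonferroni([0.001, 0.010, 0.020, 0.400]))
```

With these inputs the three smallest p-values are rejected and 0.400 is retained.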
FDR based methods
Benjamini and Hochberg
- Given a type I error rate \(\delta\), let \(k\) be the maximal index such that \[p_k \leq \delta\frac{k}{m}\]
- Reject the null hypotheses \(H_1, \ldots, H_{k}\) and accept the hypotheses \(H_{k+1}, \ldots, H_{m}\); if no such \(k\) exists, accept all null hypotheses (see the sketch after this list)
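And a matching sketch of the Benjamini-Hochberg step-up rule, again with made-up p-values.

```python
import numpy as np

def benjamini_hochberg(pvals, delta=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                        # ascending p-values
    sorted_p = pvals[order]
    below = sorted_p <= delta * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])             # largest rank meeting the bound
        reject[order[: k + 1]] = True                # reject H_1, ..., H_k
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.60]))
```

Here only the two smallest p-values are rejected; none of the larger p-values meets its threshold \(\delta k / m\).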
Multiple hypothesis testing refers to the statistical method of conducting several tests simultaneously to evaluate the validity of multiple hypotheses. This approach is essential in scenarios where numerous hypotheses are being tested, particularly in fields such as communication systems, where decisions are often made based on uncertain signals and competing information. The challenge lies in controlling the overall error rate, which can increase significantly when multiple tests are performed.
5 Must Know Facts For Your Next Test
- Multiple hypothesis testing is crucial in communication systems where many signals or models are evaluated at once, increasing the likelihood of finding significant results by chance.
- Controlling the family-wise error rate (FWER) is important to maintain the integrity of results when multiple hypotheses are tested, as this can lead to more reliable conclusions.
- Common methods for adjusting p-values in multiple hypothesis testing include the Bonferroni correction and the Benjamini-Hochberg procedure, each with different implications for error control.
- The process of conducting multiple hypothesis tests can lead to inflated Type I errors, making it necessary to apply corrections to avoid erroneous conclusions.
- In practical applications, especially in fields like genomics and communications, managing the trade-off between discovering true effects and limiting false discoveries is vital for effective decision-making.
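In practice these corrections are rarely coded by hand. As a hedged illustration, the multipletests helper from the Python statsmodels package applies the Bonferroni, Holm, and Benjamini-Hochberg adjustments mentioned above; the package choice and the p-values below are illustrative assumptions, not something this guide specifies.

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.001, 0.012, 0.034, 0.045, 0.210, 0.780]   # made-up p-values

for method in ("bonferroni", "holm", "fdr_bh"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, reject.sum(), "rejections; adjusted p:", p_adj.round(3))
```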
Review Questions
- Multiple hypothesis testing impacts decision-making in communication systems by introducing the possibility of making erroneous conclusions when evaluating numerous signals at once. As various hypotheses are tested simultaneously, the risk of Type I errors increases, leading to potentially false positives. Therefore, it becomes crucial to apply statistical methods to control these errors, ensuring that decisions based on testing results remain reliable and valid.
- Using the Bonferroni correction in multiple hypothesis testing has significant implications for communication systems, as it helps control the family-wise error rate when many hypotheses are tested. While this method reduces the chances of false positives by adjusting the significance level for individual tests, it can also increase the risk of Type II errors by making it harder to detect true effects. In communication systems, where detecting actual signal patterns is critical, this trade-off must be carefully considered to balance sensitivity and specificity.
- Controlling the False Discovery Rate (FDR) can greatly enhance outcomes in multiple hypothesis testing scenarios within communication systems by allowing for more discoveries while maintaining a reasonable rate of false positives. By focusing on controlling FDR rather than family-wise error rates, researchers can strike a better balance between sensitivity and specificity. This approach enables more effective signal detection and interpretation, leading to improved performance in systems that rely on accurate decision-making under uncertainty, ultimately yielding better operational results.
Related terms
Type I Error : The error that occurs when a true null hypothesis is incorrectly rejected, leading to a false positive conclusion.
False Discovery Rate (FDR) : The expected proportion of false discoveries among all discoveries or rejections made in multiple hypothesis testing.
Bonferroni Correction : A statistical adjustment method used to counteract the problem of multiple comparisons by lowering the significance threshold for each individual test.
" Multiple hypothesis testing " also found in:
Subjects ( 1 ).
- Theoretical Statistics
© 2024 Fiveable Inc. All rights reserved.
Ap® and sat® are trademarks registered by the college board, which is not affiliated with, and does not endorse this website..
Pathway Guide
- I. Multiple Testing
- II. Hypothesis Testing Errors
- III. Multiple Testing Control
- IV. Controlling the Family-Wise Error Rate (FWER)
- V. Controlling the False Discovery Rate (FDR)
- Appendix A. Proof of Lemma 1
Large-scale approaches have enabled routine tracking of the entire mRNA complement of a cell, genome-wide methylation patterns, and the ability to enumerate DNA sequence alterations across the genome. Software tools have been developed to unearth recurrent themes within the data relevant to the biological context at hand. Invariably, the power of these tools rests upon statistical procedures to filter through the data and sort the search results.
The broad reach of these approaches presents challenges not previously encountered in the laboratory. In particular, errors associated with testing any particular observable aspect of biology will be amplified when many such tests are performed. In statistical terms, each testing procedure is referred to as a hypothesis test and performing many tests simultaneously is referred to as multiple testing or multiple comparison . Multiple testing arises, for example, in enrichment analyses, which draw upon databases of annotated gene sets with shared themes and determine whether there is ‘enrichment’ or ‘depletion’ in an experimentally derived gene list following perturbation; performing tests across many gene sets increases the chance of mistaking noise for true signal.
The goal of this section is to introduce concepts related to quantifying and controlling errors in multiple testing. By the end of this section you should:
- Be familiar with the conditions in which multiple testing can arise
- Understand what a Type I error and false discovery are
- Be familiar with multiple testing control procedures
- Be familiar with the Bonferroni control of family-wise error rate
- Be familiar with Benjamini-Hochberg control of false discovery rates
For better or worse, hypothesis testing as it is known today represents a gatekeeper for much of the knowledge appearing in scientific publications. A considered review of hypothesis testing is beyond the scope of this primer and we refer the reader elsewhere (Whitley 2002a). Below we provide an intuitive example that introduces the various concepts we will need for a more rigorous description of error control in section III .
Example 1: A coin flip
To illustrate errors incurred in hypothesis testing, suppose we wish to assess whether a five cent coin is fair. Fairness here is defined as an equal probability of heads and tails after a toss. Our hypothesis test involves an experiment (i.e. trial) whereby 20 identically minted nickels are tossed and the number of heads counted. We take the a priori position corresponding to the null hypothesis : The nickels are fair. The null hypothesis would be put into doubt if we observed trials where the number of heads was larger (or smaller) than some predefined threshold that we considered reasonable.
Let us pause to more deeply consider our hypothesis testing strategy. We have no notion of how many heads an unfair coin might generate. Thus, rather than trying to ascertain the unknown distribution of heads for some unfair nickel, we stick to what we do know: The probability distribution under the null hypothesis for a fair nickel. We then take our experimental results and compare them to this null hypothesis distribution and look for discrepancies.
Conveniently, we can use the binomial distribution to model the exact probability of observing any possible number of heads (0 to 20) in a single test where 20 fair nickels are flipped (Figure 1).
Figure 1. Probability distribution for the number of heads. The binomial probability distribution models the number of heads in a single test where 20 fair coins are tossed. Each coin has equal probability of being heads or tails. The vertical line demarcates our arbitrary decision threshold beyond which results would be labelled 'significant'.
In an attempt to standardize our decision making, we arbitrarily set a threshold of doubt: Observing 14 or more heads in a test will cause us to label that test as ‘significant’ and worthy of further consideration. In modern hypothesis testing terms, we would ‘reject’ the null hypothesis beyond this threshold in favour of some alternative, which in this case would be that the coin was unfair. Note that in principle we should set a lower threshold in the case that the coin is unfairly weighted towards tails but omit this for simplicity.
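As a check on this threshold, the implied type I error rate can be computed directly from the binomial distribution; a quick Python snippet using scipy (tooling assumed, not part of the original guide):

```python
from scipy import stats

# P(14 or more heads in 20 tosses of a fair coin), i.e. the decision threshold above
p_extreme = stats.binom.sf(13, n=20, p=0.5)   # sf(13) = P(X >= 14)
print(round(p_extreme, 4))                    # approximately 0.0577
```

So a single fair-coin test crosses the threshold roughly 5.8% of the time.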
Multiple testing correction methods attempt to control, or at least quantify, the flood of type I errors that arise when multiple hypothesis tests are performed simultaneously.
Definition The p-value is the probability of observing a result at least as extreme as that observed, given the null hypothesis is true.
Definition A type I error is the incorrect rejection of a true null hypothesis.
Definition A type II error is the incorrect failure to reject a false null hypothesis.
Consider an extension of our nickel flipping protocol whereby multiple trials are performed and a hypothesis test is performed for each trial. In an alternative setup, we could have some of our friends each perform our nickel flipping trial once, each performing their own hypothesis test. How many type I errors would we encounter? Figure 2 shows a simulation where we repeatedly perform coin flip experiments as before.
Figure 2. Number of tests where 14 or more heads are observed. Simulations showing the number of times 14 or more heads were counted in an individual test when we performed 1, 2, 10, 100, and 250 simultaneous tests.
Figure 2 shows that with an increasing number of tests we see more trials with 14 or more heads. This makes intuitive sense: performing more tests boosts the chances that we are going to see rare events, purely by chance. Technically speaking, buying more lottery tickets does in fact increase the chances of a win (however slightly). This means that the errors start to pile up.
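A rough Python re-creation of this simulation; the random seed is arbitrary, so exact counts will differ from the figure, and the point is the qualitative pattern.

```python
import numpy as np

rng = np.random.default_rng(1)
threshold = 14                                         # 'significant' if heads >= 14

for n_tests in (1, 2, 10, 100, 250):
    heads = rng.binomial(n=20, p=0.5, size=n_tests)    # one fair 20-coin trial per test
    n_hits = int((heads >= threshold).sum())
    print(f"{n_tests:>4} tests -> {n_hits} 'significant' results by chance")
```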
Example 2: Pathway analyses
Multiple testing commonly arises in the statistical procedures underlying several pathway analysis software tools. In this guide, we provide a primer on Gene Set Enrichment Analysis.
Gene Set Enrichment Analysis derives p-values associated with an enrichment score which reflects the degree to which a gene set is overrepresented at the top or bottom of a ranked list of genes. The nominal p-value estimates the statistical significance of the enrichment score for a single gene set. However, evaluating multiple gene sets requires correction for gene set size and multiple testing.
When does multiple testing apply?
Defining the family of hypotheses.
In general, sources of multiplicity arise in cases where one considers using the same data to assess more than one hypothesis.
There are cases where the applicability of multiple testing may be less clear:
- Multiple research groups work on the same problem and only those successful ones publish
- One researcher tests differential expression of 1 000 genes while a thousand different researchers each test 1 of a possible 1 000 genes
- One researcher performs 20 tests versus another performing 20 tests then an additional 80 tests for a total of 100
In these cases identical data sets are achieved in more than one way but the particular statistical procedure used could result in different claims regarding significance. A convention that has been proposed is that the collection or family of hypotheses that should be considered for correction are those tested in support of a finding in a single publication (Goeman 2014). For a family of hypotheses, it is meaningful to take into account some combined measure of error.
The severity of errors
The use of microarrays, enrichment analyses, or other large-scale approaches is most often performed under the auspices of exploratory investigations. In such cases, the results are typically used as a first step upon which to justify more detailed investigations to corroborate or validate any significant results. The penalty for being wrong in such multiple testing scenarios is minor, provided that the time and effort required to dismiss a false lead is minimal or that claims extending directly from such a result are conservative.
On the other hand, there are numerous examples where errors can have profound negative consequences. Consider a clinical test applied to determine the presence of HIV infection or any other life-threatening affliction that might require immediate and potentially injurious medical intervention. Control for any errors in testing is important for those patients tested.
The take home message is that there is no substitute for considered and careful thought on the part of researchers who must interpret experimental results in the context of their wider understanding of the field.
The concept that the scientific worker can regard himself as an inert item in a vast co-operative concern working according to accepted rules, is encouraged by directing attention away from his duty to form correct scientific conclusions, to summarize them and to communicate them to his scientific colleagues, and by stressing his supposed duty mechanically to make a succession of automatic ‘decisions’…The idea that this responsibility can be delegated to a giant computer programmed with Decision Functions belongs to a phantasy of circles rather remote from scientific research. -R. A. Fisher (Goodman 1998)
The introductory section provides an intuitive feel for the errors associated with multiple testing. In this section our goal is to put those concepts on more rigorous footing and examine some perspectives on error control.
Type I errors increase with the number of tests
Table 1. Multiple hypothesis testing summary. Of the m hypotheses tested, m0 null hypotheses are true and m1 = m - m0 are false. R hypotheses are declared significant ('discoveries'); of these, V are false discoveries (true nulls rejected) and S are true discoveries.
Definition The family-wise error rate (FWER) is the probability of at least one (1 or more) type I error
The Bonferroni Correction
Caveats, concerns, and objections.
Definition The statistical power of a test is the probability of rejecting a null hypothesis when the alternative is true
Indeed our discussion above would indicate that large-scale experiments are exploratory in nature and that we should be assured that testing errors are of minor consequence. We could accept more potential errors as a reasonable trade-off for identifying more significant genes. There are many other arguments made over the past few decades against using such control procedures, some of which border on the philosophical (Goodman 1998, Savitz 1995). Some even have gone as far as to call for the abandonment of correction procedures altogether (Rothman 1990). At least two arguments are relevant to the context of multiple testing involving large-scale experimental data.
1. The composite “universal” null hypothesis is irrelevant
The origin of the Bonferroni correction is predicated on the universal null hypothesis that only purely random processes govern all the variability of all the observations in hand. The omnibus alternative hypothesis is that some associations are present in the data. Rejection of the null hypothesis amounts merely to a statement that at least one of the assumptions underlying the null hypothesis is invalid; it does not specify which.
Concretely, testing a multitude of genes for differential expression in treatment and control cells on a microarray could be grounds for Bonferroni correction. However, rejecting the composite null hypothesis that purely random processes govern expression of all genes represented on the array is not very interesting. Rather, researchers are more interested in which genes or subsets demonstrate non-random expression patterns following treatment.
2. Penalty for peeking and ‘p hacking’
This argument boils down to a simple question: Why should one independent test result impact the outcome of another?
V. Controlling the False Discovery Rate (FDR)
Figure 3. Depiction of false discoveries. Variable names are as in Table 1. The m hypotheses consist of true (m0) and false (m1=m-m0) null hypotheses. In multiple hypothesis testing procedures a fraction of these hypotheses are declared significant (R, shaded light grey) and are termed 'discoveries'. The subset of true null hypotheses are termed 'false discoveries' (V) in contrast to 'true discoveries' (S).
In an exploratory analysis, we are happy to sacrifice strict control of type I errors for a wider net of discovery. This is the underlying rationale behind the second control procedure.
Benjamini-Hochberg control
A landmark paper by Yoav Benjamini and Yosef Hochberg (Benjamini 1995) rationalized an alternative view of the errors associated with multiple testing:
In this work we suggest a new point of view on the problem of multiplicity. In many multiplicity problems the number of erroneous rejections should be taken into account and not only the question of whether any error was made. Yet, at the same time, the seriousness of the loss incurred by erroneous rejections is inversely related to the number of hypotheses rejected. From this point of view, a desirable error rate to control may be the expected proportion of errors among the rejected hypotheses, which we term the false discovery rate (FDR).
The Benjamini-Hochberg procedure
A sketch(y) proof.
Here, we provide an intuitive explanation for the choice of the BH procedure bound.
Proof of Theorem 1 The theorem follows from Lemma 1 whose proof is added as Appendix A at the conclusion of this section.
Proof of Lemma 1. This is provided as Appendix A .
From Lemma 1, if we integrate the inequality over the distribution of the p-values corresponding to the false null hypotheses, we can state \[\mathrm{FDR} = \mathbb{E}(\frac{V}{R} | R > 0) P(R > 0) \leq \frac{m_0}{m}\delta \leq \delta\] and the FDR is thus bounded by the chosen error rate \(\delta\).
Two properties of FDR
When all null hypotheses are true, every rejection is a false discovery, so \(V = R\) whenever \(R > 0\) and \(\mathrm{FDR} = \mathbb{E}(\frac{V}{R} | R > 0) P(R > 0) = P(R > 0) = P(V \geq 1)\). This last term is precisely the expression for FWER. This means that when all null hypotheses are true, control of FDR implies control of FWER. You will often see this referred to as control in the weak sense, which is another way of referring to the case where all null hypotheses are true.
The key here is to note that the expected value of an indicator function is the probability of the event in the indicator.
Example of BH procedure
Table 2. Example BH calculations
Practical implications of BH compared to Bonferroni correction
The BH procedure overcomes some of the caveats associated with FWER control procedures.
Caveats and limitations
Since the original publication of the BH procedure in 1995, there have been a number of discussions regarding the conditions and limitations surrounding the use of the method for genomics data. In particular, the assumption of independence between tests is unlikely to hold in large-scale genomic measurements. We leave it to the reader to explore more deeply the various discussions surrounding the use of BH or its variants (Goeman 2014).
Appendix A: Proof of Lemma 1
We intend to prove Lemma 1, which underlies the BH procedure for control of the FDR. The proof is adapted from the original publication by Benjamini and Hochberg (Benjamini 1995) with variant notation and diagrams for clarification purposes. We provide some notation and restate the lemma, followed by the proof.
There are a few not so obvious details that we will need along the way. We present these as a set of numbered ‘asides’ that we will refer back to.
1. Distribution of true null hypothesis p-values
Let’s rearrange this.
If the cdf is monotonic increasing then
The last two results allow us to say that
2. Distribution of the largest order statistic
The first integral
Substitute this back into the first integral.
The second integral
Let us remind ourselves what we wish to evaluate.
The next part of the proof relies on a description of p-values and indices but is often described in a very compact fashion. Keeping track of everything can outstrip intuition, so we pause to reflect on a schematic of the ordered false null hypothesis p-values and relevant indices (Figure 4).
Figure 4. Schematic of p-values and indices. Shown are the p-values ordered in ascending value from left to right corresponding to the false null hypotheses (z). Indices j0 and m1 for true null hypotheses are as described in main text. Blue segment represents region where z' can lie. Green demarcates regions larger than z' where p-values corresponding to hypotheses (true or false) that will not be rejected lie.
Now we are ready to tackle the expectation inside the integral.
Let us now place this result inside the original integral.
- Benjamini Y and Hochberg Y. Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B, v57(1) pp289-300, 1995.
- Glickman ME et al. False Discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. Journal of Clinical Epidemiology, v67, pp850-857, 2014.
- Goeman JJ and Solari A. Multiple hypothesis testing in genomics. Stat. Med., 33(11) pp1946-1978, 2014.
- Goodman SN. Multiple Comparisons, Explained. Amer. J. Epid., v147(9) pp807-812, 1998.
- Rothman KJ. No Adjustments Are Needed for Multiple Comparisons. Epidemiology, v1(1) pp. 43-46, 1990.
- Savitz DA and Olshan AF. Multiple Comparisons and Related Issues in the Interpretation of Epidemiologic Data. Amer. J. Epid., v142(9) pp904-908, 1995.
- Whitley E and Ball J. Statistics review 3: Hypothesis testing and P values. Critical Care, v6(3) pp. 222-225, 2002a.
- Whitley E and Ball J. Statistics review 4: Sample size calculations. Critical Care, v6(4) pp. 335-341, 2002b.
Hypothesis Testing | A Step-by-Step Guide with Easy Examples
Published on November 8, 2019 by Rebecca Bevans . Revised on June 22, 2023.
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics . It is most often used by scientists to test specific predictions, called hypotheses, that arise from theories.
There are 5 main steps in hypothesis testing:
- State your research hypothesis as a null hypothesis (H0) and an alternate hypothesis (Ha or H1).
- Collect data in a way designed to test the hypothesis.
- Perform an appropriate statistical test .
- Decide whether to reject or fail to reject your null hypothesis.
- Present the findings in your results and discussion section.
Though the specific details might vary, the procedure you will use when testing a hypothesis will always follow some version of these steps.
Table of contents
- Step 1: State your null and alternate hypothesis
- Step 2: Collect data
- Step 3: Perform a statistical test
- Step 4: Decide whether to reject or fail to reject your null hypothesis
- Step 5: Present your findings
- Frequently asked questions about hypothesis testing
After developing your initial research hypothesis (the prediction that you want to investigate), it is important to restate it as a null (H o ) and alternate (H a ) hypothesis so that you can test it mathematically.
The alternate hypothesis is usually your initial hypothesis that predicts a relationship between variables. The null hypothesis is a prediction of no relationship between the variables you are interested in.
- H0: Men are, on average, not taller than women. Ha: Men are, on average, taller than women.
For a statistical test to be valid , it is important to perform sampling and collect data in a way that is designed to test your hypothesis. If your data are not representative, then you cannot make statistical inferences about the population you are interested in.
There are a variety of statistical tests available, but they are all based on the comparison of within-group variance (how spread out the data is within a category) versus between-group variance (how different the categories are from one another).
If the between-group variance is large enough that there is little or no overlap between groups, then your statistical test will reflect that by showing a low p -value . This means it is unlikely that the differences between these groups came about by chance.
Alternatively, if there is high within-group variance and low between-group variance, then your statistical test will reflect that with a high p -value. This means it is likely that any difference you measure between groups is due to chance.
Your choice of statistical test will be based on the type of variables and the level of measurement of your collected data .
In the height example, your statistical test will give you (as sketched in the code example below):
- an estimate of the difference in average height between the two groups.
- a p -value showing how likely you are to see this difference if the null hypothesis of no difference is true.
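Here is a minimal sketch of such a test in Python, using invented height data and a Welch two-sample t test; the sample sizes, means, and the choice of scipy are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Hypothetical height data in cm (sample sizes and means are made up)
rng = np.random.default_rng(2)
men = rng.normal(178, 7, 60)
women = rng.normal(170, 7, 60)

t_stat, p_value = stats.ttest_ind(men, women, equal_var=False)  # Welch two-sample t test
print("estimated difference in means:", round(men.mean() - women.mean(), 1), "cm")
print("p-value:", p_value)
```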
Based on the outcome of your statistical test, you will have to decide whether to reject or fail to reject your null hypothesis.
In most cases you will use the p -value generated by your statistical test to guide your decision. And in most cases, your predetermined level of significance for rejecting the null hypothesis will be 0.05 – that is, when there is a less than 5% chance that you would see these results if the null hypothesis were true.
In some cases, researchers choose a more conservative level of significance, such as 0.01 (1%). This minimizes the risk of incorrectly rejecting the null hypothesis ( Type I error ).
The results of hypothesis testing will be presented in the results and discussion sections of your research paper , dissertation or thesis .
In the results section you should give a brief summary of the data and a summary of the results of your statistical test (for example, the estimated difference between group means and associated p -value). In the discussion , you can discuss whether your initial hypothesis was supported by your results or not.
In the formal language of hypothesis testing, we talk about rejecting or failing to reject the null hypothesis. You will probably be asked to do this in your statistics assignments.
However, when presenting research results in academic papers we rarely talk this way. Instead, we go back to our alternate hypothesis (in this case, the hypothesis that men are on average taller than women) and state whether the result of our test did or did not support the alternate hypothesis.
If your null hypothesis was rejected, this result is interpreted as “supported the alternate hypothesis.”
These are superficial differences; you can see that they mean the same thing.
You might notice that we don’t say that we reject or fail to reject the alternate hypothesis . This is because hypothesis testing is not designed to prove or disprove anything. It is only designed to test whether a pattern we measure could have arisen spuriously, or by chance.
If we reject the null hypothesis based on our research (i.e., we find that it is unlikely that the pattern arose by chance), then we can say our test lends support to our hypothesis . But if the pattern does not pass our decision rule, meaning that it could have arisen by chance, then we say the test is inconsistent with our hypothesis .
Hypothesis testing is a formal procedure for investigating our ideas about the world using statistics. It is used by scientists to test specific predictions, called hypotheses , by calculating how likely it is that a pattern or relationship between variables could have arisen by chance.
A hypothesis states your predictions about what your research will find. It is a tentative answer to your research question that has not yet been tested. For some research projects, you might have to write several hypotheses that address different aspects of your research question.
A hypothesis is not just a guess — it should be based on existing theories and knowledge. It also has to be testable, which means you can support or refute it through scientific research methods (such as experiments, observations and statistical analysis of data).
Null and alternative hypotheses are used in statistical hypothesis testing . The null hypothesis of a test always predicts no effect or no relationship between variables, while the alternative hypothesis states your research prediction of an effect or relationship.
Multiple Hypothesis Testing: A Methodological Overview
Series: Methods In Molecular Biology > Book: Statistical Methods for Microarray Data Analysis
Overview | DOI: 10.1007/978-1-60327-337-4_3
- Department of Biostatistics and Computational Biology, University of Rochester, Rochester, NY, USA
The process of screening for differentially expressed genes using microarray samples can usually be reduced to a large set of statistical hypothesis tests. In this situation, statistical issues arise which are not encountered in a single hypothesis test, related to the need to identify the specific hypotheses to be rejected, and to report an associated error. As in any complex testing problem, it is rarely the case that a single method is always to be preferred, leaving the analysts with the problem of selecting the most appropriate method for the particular task at hand. In this chapter, an introduction to current multiple testing methodology was presented, with the objective of clarifying the methodological issues involved, and hopefully providing the reader with some basis with which to compare and select methods.
- Scherzer CR, Eklund AC, Morse LJ, Liao Z, Locascio JJ, Fefer D, Schwarzschild MA, Schlossmacher MG, Hauser MA, Vance JM, Sudarsky LR, Standaert DG, Growdon JH, Jensen RV, Gullans SR (2007) Molecular markers of early Parkinson’s disease based on gene expression in blood. PNAS 104:955–960
- Benjamini Y, Braun H (2002) John W. Tukey’s contributions to multiple comparisons. Ann Stat 30:1576–1594
- Yang YH, Speed T (2003) Statistical analysis of gene expression microarray data. In: Speed T (ed) Design and analysis of comparative microarray experiments. Chapman and Hall, Boca Raton, FL, pp 35–92
- Dudoit S, Shaffer JP, Boldrick JC (2003) Multiple hypothesis testing in microarray experiments. Stat Sci 18:71–103
- Dudoit S, van der Laan MJ (2008) Multiple testing procedures with applications to genomics. Springer, New York, NY
- Chu T, Glymour C, Scheines R, Spirtes P (2003) A statistical problem for inference to regulatory structure from associations of gene expression measurements with microarrays. Bioinformatics 19:1147–1152
- Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
- Shaffer JP (1986) Modified sequentially rejective test procedures. JASA 81:826–830
- Šidák Z (1967) Rectangular confidence regions for the means of multivariate normal distribution. JASA 62:626–633
- Šidák Z (1971) On probabilities of rectangles in multivariate Student distributions: their dependence on correlations. Ann Math Stat 42:169–175
- Jogdeo K (1977) Association and probability inequalities. Ann Stat 5:495–504
- Holland BS, Copenhaver MD (1987) An improved sequentially rejective Bonferroni test procedure. Biometrics 43:417–423
- Dykstra RL, Hewett JE, Thompson WA (1973) Events which are almost independent. Ann Stat 1:674–681
- Simes RJ (1986) An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751–754
- Hommel G (1988) A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 75:383–386
- Hochberg Y (1988) A sharper Bonferroni procedure for multiple tests of significance. Biometrika 75:800–802
- Sarkar SK (1998) Some probability inequalities for ordered MTP 2 random variable: a proof of the Simes conjecture. Ann Stat 26:494–504
- Sarkar SK, Chang C-K (1997) The Simes method for multiple hypothesis testing with positively dependent test statistics. JASA 92:1601–1608
- Rom DR (1990) A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika 77:663–665
- Huang Y, Hsu JC (2007) Hochberg’s step-up method: cutting corners off Holm’s step-down methods. Biometrika 94:965–975
- Westfall PH, Young S (1993) Resampling-based multiple testing. Wiley, New York, NY
- Pollard KS, Dudoit S, van der Laan MJ (2005) Multiple testing procedures: the multtest package and applications to genomics. In: Gentleman R, Huber W, Carey VJ, Irizarry RA, Dudoit S (eds) Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Springer, New York, NY, pp 249–271
- Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300
- Benjamini Y, Yekutieli D (2001) The control of the false discovery rate in multiple testing under dependency. Ann Stat 29:1165–1188
- Storey JD (2003) The positive false discovery rate: a Bayesian interpretation and the q -value. Ann Stat 31:2013–2035
- Storey JD (2002) A direct approach to false discovery rates. J R Stat Soc Ser B 64:479–498
- Efron B (2003) Robbins, empirical Bayes and microarrays. Ann Stat 31:366–378
- Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. JASA 99:96–104
- Kall L, Storey JD, MacCoss MJ, Noble WS (2008) Posterior error probabilities and false discovery rates: two sides of the same coin. J Proteome Res 7:40–44
- Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. JASA 96:1151–1160
- Allison DB, Gadbury GL, Moonseong H, Fernandez JR, Cheol-Koo L, Prolla TA, Weindruch R (2002) A mixture model approach for the analysis of microarray gene expression data. Comput Stat Data Anal 39:1–20
- Newton MA, Wang P, Kendziorski C (2006) Hierarchical mixture models for expression profiles. In: Do K, Muller P, Vannucci M (eds) Bayesian inference for gene expression and proteomics. Cambridge University Press, New York, NY, pp 40–52
- Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW (2001) On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data. J Comput Biol 8:37–52
- Newton MA, Noueiry A, Sarkar D, Ahlquist P (2004) Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics 5:155–176
- Lewin A, Richardson S, Marshall C, Glazier A, Aitman T (2006) Bayesian modeling of differential gene expression. Biometrics 62:1–9
- Gottardo R, Raftery AE, Yeung KY, Bumgarner RE (2006) Bayesian robust inference for differential gene expression in microarrays with multiple samples. Biometrics 62:10–18
- Do K, Muller P, Vannucci M (2006) Bayesian inference for gene expression and proteomics. Cambridge University Press, New York, NY
6.3 - Issues with Multiple Testing
If we are conducting a hypothesis test with an \(\alpha\) level of 0.05, then we are accepting a 5% chance of making a Type I error (i.e., rejecting the null hypothesis when the null hypothesis is really true). If we would conduct 100 hypothesis tests at a 0.05 \(\alpha\) level where the null hypotheses are really true, we would expect to reject the null and make a Type I error in about 5 of those tests.
Later in this course you will learn about some statistical procedures that may be used instead of performing multiple tests. For example, to compare the means of more than two groups you can use an analysis of variance ("ANOVA"). To compare the proportions of more than two groups you can conduct a chi-square goodness-of-fit test.
A related issue is publication bias. Research studies with statistically significant results are published much more often than studies without statistically significant results. This means that if 100 studies are performed in which there is really no difference in the population, the 5 studies that found statistically significant results may be published while the 95 studies that did not find statistically significant results will not be published. Thus, when you perform a review of published literature you will only read about the studies that found statistically significance results. You would not find the studies that did not find statistically significant results.
One quick method for correcting for multiple tests is to divide the alpha level by the number of tests being conducted. For instance, if you are comparing three groups using a series of three pairwise tests, you could divide your overall alpha level ("family-wise alpha level") by three. If we were using a standard alpha level of 0.05, then our pairwise alpha level would be \(\frac{0.05}{3}=0.016667\). We would then compare each of our three p-values to 0.016667 to determine statistical significance. This is known as the Bonferroni method. It is one of the most conservative approaches to controlling for multiple tests (i.e., more likely to make a Type II error). Later in the course you will learn how to use the Tukey method when comparing the means of three or more groups; that approach is often preferred because it is more liberal.
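A tiny Python sketch of this Bonferroni comparison; the three pairwise p-values are invented for illustration.

```python
family_alpha = 0.05
n_tests = 3
per_test_alpha = family_alpha / n_tests          # 0.016667

pairwise_p = {"A vs B": 0.004, "A vs C": 0.030, "B vs C": 0.210}
for comparison, p in pairwise_p.items():
    verdict = "significant" if p < per_test_alpha else "not significant"
    print(comparison, "->", verdict)
```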
Multiple Testing Problem / Multiple Comparisons
What is the Multiple Testing Problem?
If you run a hypothesis test, there’s a small chance (usually about 5%) that you’ll get a bogus significant result. If you run thousands of tests, then the number of false alarms increases dramatically. For example, let’s say you run 10,000 separate hypothesis tests (which is common in fields like genomics). If you use the standard alpha level of 5% (which is the probability of getting a false positive), you’re going to get around 500 significant results — most of which will be false alarms . This large number of false alarms produced when you run multiple hypothesis tests is called the multiple testing problem. (Or multiple comparisons problem).
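A quick simulation of this scenario in Python, assuming all 10,000 null hypotheses are true so that the p-values are uniform on [0, 1]; exact counts vary with the seed.

```python
import numpy as np

rng = np.random.default_rng(3)
pvals = rng.uniform(0, 1, 10_000)        # p-values under true nulls
print((pvals < 0.05).sum())              # roughly 500 'significant' results, all false alarms
```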
Correcting for Multiple Testing
When you run multiple tests, the p-values have to be adjusted for how many hypothesis tests you are running. In other words, you have to control the Type I error rate (a Type I error is another name for incorrectly rejecting the null hypothesis ). There isn’t a universally-accepted way to control for the problem of multiple testing.
- Single step methods like the Bonferroni correction and sequential methods like Holm’s method control the Family-wise Error Rate (FWER) . The FWER is the probability of making at least one false positive across the whole set of tests. These methods are usually used when it’s important not to make any Type I errors at all.
- The Benjamini-Hochberg procedure and Storey’s positive FDR control the False Discovery rate. These procedures limit the number of false discoveries , but you’ll still get some, so use these procedures if a small number of Type I errors is acceptable.
When Not to Control for Multiple Comparisons
An unfortunate “side effect” of controlling for multiple comparisons is that you’ll probably increase the number of false negatives — that is, there really is something significant happening but you fail to detect it. False negatives ( “Type II Errors” ) can be very costly (for example, in pharmaceutical research — where missing an important discovery can put research behind for decades). So if that’s the case, you may not even want to try to control for multiple comparisons. The alternative would be to note in your research results that there is a possibility your findings may be a false positive.
Multiple Comparisons for Non Parametric Tests
For non parametric tests, use the Bonferroni correction — which is your only viable option.
Comparisons of Methods for Multiple Hypothesis Testing in Neuropsychological Research
Richard E. Blakesley
Department of Biostatistics, University of Pittsburgh
Sati Mazumdar
Department of Biostatistics, University of Pittsburgh, and Department of Psychiatry, University of Pittsburgh School of Medicine
Mary Amanda Dew
Department of Psychiatry, University of Pittsburgh School of Medicine, and Departments of Epidemiology and Psychology, University of Pittsburgh
Patricia R. Houck
Department of Psychiatry, University of Pittsburgh School of Medicine
Charles F. Reynolds, III
Meryl A. Butters
Hypothesis testing with multiple outcomes requires adjustments to control Type I error inflation, which reduces power to detect significant differences. Maintaining the prechosen Type I error level is challenging when outcomes are correlated. This problem concerns many research areas, including neuropsychological research in which multiple, interrelated assessment measures are common. Standard p value adjustment methods include Bonferroni-, Sidak-, and resampling-class methods. In this report, the authors aimed to develop a multiple hypothesis testing strategy to maximize power while controlling Type I error. The authors conducted a sensitivity analysis, using a neuropsychological dataset, to offer a relative comparison of the methods and a simulation study to compare the robustness of the methods with respect to varying patterns and magnitudes of correlation between outcomes. The results lead them to recommend the Hochberg and Hommel methods (step-up modifications of the Bonferroni method) for mildly correlated outcomes and the step-down minP method (a resampling-based method) for highly correlated outcomes. The authors note caveats regarding the implementation of these methods using available software.
Neuropsychological datasets typically consist of multiple, partially overlapping measures, henceforth termed outcomes . A given neuropsychological domain—for example, executive function—is composed of multiple interrelated subfunctions, and frequently all subfunction outcomes of interest are subject to hypothesis testing. At a given α (critical threshold), the risk of incorrectly rejecting a null hypothesis, a Type I error, increases as more hypotheses are tested. This applies to all types of hypotheses, including a set of two-group comparisons across multiple outcomes (e.g., differences between two groups across several cognitive measures) or multiple-group comparisons within an analysis of variance framework (e.g., cognitive performance differences between several treatment groups and a control group). Collectively, we define these issues as the multiplicity problem ( Pocock, 1997 ).
Controlling Type I error at a desired level is a statistical challenge, further complicated by the correlated outcomes prevalent in neuropsychological data. By making adjustments to control Type I error, we increase the risk of incorrectly accepting a null hypothesis, a Type II error. In other words, we reduce power. Failure to control Type I error when examining multiple outcomes may yield false inferences, which may slow or sidetrack research progress. Researchers need strategies that maximize power while ensuring an acceptable Type I error rate.
Many methods exist to manage the multiplicity problem. Several methods are based on the Bonferroni and Sidak inequalities ( Sidak, 1967 ; Simes, 1986 ). These methods adjust α values or p values using simple functions of the number of tested hypotheses ( Sankoh, Huque, & Dubey, 1997 ; Westfall & Young, 1993 ). Holm (1979) , Hochberg (1988) , and Hommel (1988) developed Bonferroni derivatives incorporating stepwise components. Using rank-ordered p values, stepwise methods alter the magnitude of change as a function of p value order. Mathematical proofs order these methods, from least to most power, as Bonferroni, Holm, Hochberg, and Hommel ( Hochberg, 1988 ; Hommel, 1989 ; Sankoh et al., 1997 ). The Tukey-Ciminera-Heyse (TCH), Dubey/Armitage-Parmar (D/AP), and R 2 -adjustment (RSA) methods are single-step Sidak derivatives ( Sankoh et al., 1997 ). Another class of methods uses resampling methodology. The bootstrap (single-step) minP and step-down minP methods adjust p values using the nonparametrically estimated null distribution of the minimum p value ( Westfall & Young, 1993 ).
The Bonferroni-class methods and the Sidak method are theoretically valid with independent, uncorrelated outcomes only ( Hochberg, 1988 ; Holm, 1979 ; Hommel, 1988 ; Westfall & Young, 1993 ). The D/AP and RSA methods incorporate measures of correlation ( Sankoh et al., 1997 ), and the resampling-class methods incorporate correlational characteristics via bootstrapping procedures ( Westfall & Young, 1993 ). However, it is unclear which methods perform better when analyzing correlated outcomes. Theoretical and empirical comparisons of these p value adjustment methods have been limited in the breadth of methods compared and correlation structures explored ( Hochberg & Benjamini, 1990 ; Hommel, 1988 , 1989 ; Sankoh, D'Agostino, & Huque, 2003 ; Sankoh et al., 1997 ; Simes, 1986 ). We aimed to identify the optimal method(s) for multiple hypothesis testing in neuropsychological research.
We organized this article into several sections. First, we provide definitions and illustrations of 10 p value adjustment methods. Next, we describe a sensitivity analysis, defined as using statistical techniques in parallel to compare estimates, hypothesis inferences, and relative plausibility of the inferences ( Saltelli, Chan, & Scott, 2000 ; Verbeke & Molenberghs, 2001 ). Using a neuropsychological dataset, we compare the p value adjustment methods by the adjusted p value and inferences patterns. After the sensitivity analysis, we detail a simulation study, which, by definition, permits the examination of measures of interest under controlled conditions. We examined the Type I error and power rates of the p value adjustment methods under a systematic series of correlation and null hypothesis conditions. This allows us to compare the methods’ performance relative to simulation conditions, that is, when the truth is known. Last, we offer guidelines for using these methods when analyzing multiple correlated outcomes.
p Value Adjustment Method
Multiple testing adjustment methods may be formulated as either p value adjustment (with higher adjusted p values) or α-value adjustment (with lower adjusted α values). We focus on p value adjustment method formulas because adjusted p values allow direct interpretation against a chosen α value and eliminate the need for lookup tables or knowledge of complex hypothesis rejection rules ( Westfall & Young, 1993 ; Wright, 1992 ). Furthermore, adjusted α values are not supported by statistical software.
We describe the methods assuming a neuropsychological dataset with N participants, belonging to one of two groups, with M outcomes observed for each participant. The objective is to determine which outcomes are different between groups using two-sample t tests. For the j th outcome V ( j ), where j = {1, 2,…, M }, there exists a null hypothesis and an observed p value resulting from testing the null hypothesis, denoted H 0 j and p j , respectively. The observed p values are arranged such that p 1 ≥…≥ p j ≥…≥ p M . For each outcome, we test the null hypothesis of no difference between the groups, that is, that the groups come from the same population. For any method, we calculate a sequence of adjusted p values in which we denote p aj as the adjusted p value corresponding to p j .
Bonferroni-Class Method
The parametric Bonferroni-class methods consist of the Bonferroni method and its derivatives. The Bonferroni method, defined as p aj = min{ Mp j , 1}, increases each p value by a factor of M to a maximum value of 1. Holm (1979) and Hochberg (1988) enhanced this single-step approach with stepwise adjustments that adjust p values sequentially and maintain the observed p value order. Holm’s step-down approach begins by adjusting the smallest p value p M as p aM = min{ Mp M , 1}. For each subsequent p j , with j = { M − 1, M − 2,…, 1}, p aj is defined as min{ jp j , 1} if min{ jp j , 1} is greater than or equal to all previously adjusted p values, p aM through p a ( j + 1) . Otherwise, it is the maximum of these previously adjusted p values. Therefore, we define Holm p values as p aj = min{1, max[ jp j , ( j + 1) p j + 1 , …, Mp M ]}, all of which are between 0 and 1. Hochberg’s method uses a step-up approach, such that p aj = min{1 p 1 , 2 p 2 , …, jp j }. Converse to Holm’s method, adjustment begins with the largest p value, p a 1 = 1 p 1 , and steps up to more significant p values, where each subsequent p aj is the minimum of jp j and the previously adjusted p values, p a 1 through p a ( j − 1) .
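A short Python sketch of the Bonferroni, Holm, and Hochberg adjusted p-values as defined in this paragraph, using the descending ordering convention p 1 ≥ … ≥ p M; the example p-values are hypothetical, and this is an illustrative reading rather than the authors' code.

```python
import numpy as np

def holm_adjust(p):
    """Holm step-down adjusted p-values for the ordering p_1 >= ... >= p_M."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(-p)                   # indices giving p_1 >= ... >= p_M
    M = len(p)
    scaled = np.arange(1, M + 1) * p[order]  # j * p_j
    # p_aj = min{1, max[j*p_j, (j+1)*p_{j+1}, ..., M*p_M]}
    adj = np.minimum(1, np.maximum.accumulate(scaled[::-1])[::-1])
    out = np.empty(M)
    out[order] = adj
    return out

def hochberg_adjust(p):
    """Hochberg step-up adjusted p-values: p_aj = min{1*p_1, 2*p_2, ..., j*p_j}."""
    p = np.asarray(p, dtype=float)
    order = np.argsort(-p)
    M = len(p)
    scaled = np.arange(1, M + 1) * p[order]
    adj = np.minimum(1, np.minimum.accumulate(scaled))
    out = np.empty(M)
    out[order] = adj
    return out

pvals = [0.010, 0.020, 0.025, 0.400]                       # hypothetical outcome p-values
print("Bonferroni:", np.minimum(1, len(pvals) * np.asarray(pvals)))
print("Holm      :", holm_adjust(pvals).round(3))
print("Hochberg  :", hochberg_adjust(pvals).round(3))
```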
Hommel’s (1988) method is a derivative of Simes’s (1986) global test, which is derived from the Bonferroni method. For a subset of S null hypotheses, 1 ≤ S ≤ M , we define p Simes = min{( S / S ) p 1 , …, ( S /[ S − i + 1]) p i , …, ( S /1) p S }, for i = {1, 2, …, S }, where the p i s are the ordered p values corresponding to the S hypotheses within the subset. Hommel extended this method, permitting individual adjusted p values, defining p aj as the maximum p Simes calculated for all subsets of hypotheses containing the j th null hypothesis, H 0 j . Consider a simple case of M = 2 hypotheses, H 01 and H 02 . We calculate p a 1 as the maximum of the Simes p values for the subsets {H 01 } and {H 01 , H 02 }, such that p a 1 = max[(1/1) p 1 , min{(2/2) p 1 , (2/1) p 2 }]. We calculate p a 2 similarly with subsets {H 02 } and {H 01 , H 02 }. Wright (1992) provided an illustrative example and an efficient algorithm for Hommel p value calculations.
Sidak-Class Method
The Sidak method and its derivatives make up the parametric Sidak-class methods. The Sidak method defines p aj = 1 − (1 − p j ) M , which is approximately equal to Mp j for small values of p j , resembling the Bonferroni method ( Westfall & Young, 1993 ). Like the Bonferroni method, the Sidak method reduces Type I error in the presence of M hypothesis tests with independent outcomes. The Sidak derivatives have the general adjusted p value form, p aj = 1 − (1 − p j ) g ( j ) , where g ( j ) is some function defined per each method with 1 ≤ g ( j ) ≤ M . Some Sidak derivatives define g ( j ) to depend on measures of correlation between outcomes, where g ( j ) would range between M , for completely uncorrelated outcomes, and 1, for completely correlated outcomes. In turn, the magnitude of p value adjustment would range from the maximum adjustment (Sidak level) to no adjustment at all.
The TCH method defines g ( j ) = √ M ( Sankoh et al., 1997 ). The D/AP and the RSA methods incorporate measures of correlation between outcomes ( Sankoh et al., 1997 ). The j th adjusted D/AP p value is calculated using the mean correlation between the j th outcome and the remaining M − 1 outcomes, denoted mean.ρ( j ), such that g ( j ) = M 1 − mean.ρ( j ) . The j th adjusted RSA p value uses the value of R 2 from an intercept-free linear regression with the j th variable as the outcome and the remaining M − 1 variables as the predictors, denoted R 2( j ), such that g ( j ) = M 1 − R 2( j ) .
Resampling-Class Methods
Resampling-class methods use a nonparametric approach to adjusting p values. We examined the bootstrap variants of the minP and step-down minP (sd.minP) methods proposed by Westfall and Young (1993) . The minP method defines p aj = P [ X ≤ p j | X ~ minP(1, …, M )], the probability of observing a random variable X as extreme as p j , where X follows the empirical null distribution of the minimum p value. This is similar to the calculation of a p value using a z value statistic against the standard normal distribution, except that the distribution of X is derived through resampling. We generate the distribution of X by the following algorithm. Assume the original dataset has M outcomes for each of the N participants. We transform the original dataset by centering all observations by the group- and outcome-specific means. Next, we generate a bootstrap sample with N observations by sampling observation vectors with replacement from this mean-centered dataset. We then calculate p values by conducting hypothesis tests on each bootstrap sample. These M p values are considered an observation vector of a matrix consisting of outcomes B (1) through B ( M ), where B ( j ) are p values corresponding to outcome V ( j ) of the bootstrap dataset. Unlike the p values calculated from the original dataset, these p values are not reordered by rank. A total of N boot bootstrap datasets are generated, creating N boot observations in each B ( j ). The minimum p value from each observation vector defines the N boot values of empirical minP null distribution for the minP method, from which the adjusted p values are calculated.
The sd.minP method alters this general algorithm by using different empirical distributions for each p j . The matrix with outcomes B (1) through B ( j ) are calculated as before. For p j , we form an empirical minP null distribution from the minimum p values, not from the entire observation vectors with outcomes B (1) through B ( M ), but the subset corresponding to outcomes B (1) through B ( j ), and determine the values of P [ X ≤ p j | X ~ minP(1, …, j )]. The last step of the sd.minP method is a stepwise procedure that ensures the observed p value order as in the Holm method. That is, p aj is the maximum of the value P [ X ≤ p j | X ~ minP(1, …, j )] and the values P [ X ≤ p j + 1 | X ~ minP(1,…, j + 1)] through P [ X ≤ p M | X ~ minP(1, …, M )] .
Illustrative Example
We demonstrate these methods with an illustrative example, with values summarized in Table 1 . In practice, we would calculate most of these adjusted p values via efficient computer algorithms available in several statistical packages, including R ( R Development Core Team, 2006 ) and SAS/STAT software ( SAS Institute Inc., 2002–2006 ). Suppose we conduct two-sample t tests with M = 4 outcomes and observe ordered p values p 1 = 0.3587, p 2 = 0.1663, p 3 = 0.1365, and p 4 = 0.0117. Using the Bonferroni method, these unadjusted p values are each multiplied by 4, producing the values 1.4348, 0.6653, 0.5462, and 0.0470, respectively. By the minimum function, p a 1 is set to 1 rather than 1.4348, ensuring adjusted p values between 0 and 1.
Illustrative Example: Observed p Values and Adjusted p Values by Class and Method
Bonferroni | Sidak | Resampling | ||||||||
---|---|---|---|---|---|---|---|---|---|---|
Observed | Bonferroni | Holm | Hochberg | Hommel | Sidak | TCH | D/AP | RSA | minP | sd.minP |
0.3587 | 1.0000 | 0.4096 | 0.3587 | 0.3587 | 0.8309 | 0.5887 | 0.6622 | 0.7362 | 0.7980 | 0.3616 |
0.1663 | 0.6653 | 0.4096 | 0.3326 | 0.3326 | 0.5169 | 0.3050 | 0.3448 | 0.3919 | 0.4749 | 0.3328 |
0.1365 | 0.5462 | 0.4096 | 0.3326 | 0.2731 | 0.4441 | 0.2544 | 0.3017 | 0.3486 | 0.4055 | 0.3328 |
0.0117 | 0.0470 | 0.0470 | 0.0470 | 0.0470 | 0.0462 | 0.0234 | 0.0274 | 0.0323 | 0.0434 | 0.0434 |
Note. TCH = Tukey-Ciminera-Heyse; D/AP = Dubey/Armitage-Parmar; RSA = R 2 adjustment.
The Holm (1979) and Hochberg (1988) methods begin by computing the values where jp j , which are 0.3587, 0.3326, 0.4096, and 0.0470. These are potential adjusted p values, determined ultimately by the stepwise procedures. Per the Holm method, we note 0.3326 < 0.4096. Because the method requires that p a 2 ≥ p a 3 , we set p a 2 = 0.4096, not the initial potential value 0.3326. Similarly, with the requirement p a 1 ≥ p a 2 , we set p a 1 = 0.4096, resulting in the Holm p values of 0.4096, 0.4096, 0.4096, and 0.0470. Per the Hochberg method, we again note that 0.3326 < 0.4096 and that the requirement p a 2 ≥ p a 3 exists. Under the Hochberg method, we set p a 3 = 0.3326 rather than to the initial potential value 0.4096, resulting in the Hochberg p values 0.3587, 0.3326, 0.3326, and 0.0470.
The Hommel (1988) method requires the calculation of Simes (1986) p values for subsets of hypotheses for each adjusted p value. For example, p a 3 requires the calculation of Simes p values for the following four hypothesis subsets: {H 01 , H 02 , H 03 , H 04 }, {H 01 , H 02 , H 03 }, {H 01 , H 03 }, and {H 03 }. The Simes p values for these subsets are 0.0470, 0.2495, 0.2731, and 0.1365, respectively, where p a 3 is the maximum of these values, 0.2731. The Hommel p values are 0.3587, 0.3326, 0.2731, and 0.0470, respectively.
The Sidak-class methods have the same general form, p aj = 1 − (1 − p j ) g ( j ) . Using g ( j ) = M = 4, the Sidak p values are 0.8309, 0.5169, 0.4441, and 0.0462, respectively, for the four hypothesis subsets. Using g ( j ) = √ M = 2, the TCH p values are 0.5887, 0.3050, 0.2544, and 0.0234, respectively. The D/AP and RSA methods require correlation information. Suppose the values of mean.ρ( j ), the mean correlation for the j th outcome with all other outcomes, are 0.3558, 0.3915, 0.3546, and 0.3841 for outcomes V(1)– V(4), respectively. Using the D/AP formula, the adjusted p values are 0.6622, 0.3448, 0.3017, and 0.0274, respectively. Similarly, with R 2( j ) values of 0.2077, 0.2744, 0.2271, and 0.2618, the RSA p values are 0.7362, 0.3919, 0.3486, and 0.0323, respectively.
The resampling-class methods rely on the empirical minP null distributions. We generated the distributions on the basis of N boot = 100,000 resamples. By the minP method, p aj is the probability of observing a value X ≤ p j , where X follows the empirical minP null distribution derived using all four outcomes. In a graphical representation, this corresponds to the area under the empirical distribution plot to the left of the value of p j . The minP p values based on our generated distribution are 0.7980, 0.4748, 0.4055, and 0.0434. Per the sd.minP method, we compare only p 4 , the smallest p value, against this distribution. Recall that each p j is compared with the distribution derived from using only outcomes B (1)– B ( j ). Thus, p a 3 is calculated using the distribution based only on B (1)– B (3), and so forth. On the basis of these distributions, the potential value for each p aj is the area to the left of p j and below the appropriate distribution curve. These potential values are 0.3616, 0.2925, 0.3328, and 0.0434. Similar to the Holm (1979) method, we note 0.2925 < 0.3328 and thus adjust p a 2 upward to the value of p a 3 , resulting in sd.minP p values of 0.3616, 0.3328, 0.3328, and 0.0434. We provide a graphical representation in Figure S1 of the supplemental materials.
Sensitivity Analysis
We used a dataset from a study of neuropsychological performance conducted through the University of Pittsburgh’s Advanced Center for Interventions and Services Research for Late-Life Mood Disorders, Western Psychiatric Institute and Clinic in Pittsburgh, PA ( Butters et al., 2004 ). The study used a group of 140 participants (100 depressed participants and 40 nondepressed comparison participants), ages 60 and older, group matched in terms of age and education. We conducted our sensitivity analysis with respect to 17 interrelated neuropsychological test (i.e., outcome measures) from this dataset, with tests detailed and cited in Butters et al. These outcome measures were grouped into five theoretical domains. The outcome correlation matrix is shown in Table 2 .
Neuropsychological Outcome Correlation Matrix
ID and outcome | 1.1 | 1.2 | 1.3 | 2.1 | 2.2 | 2.3 | 3.1 | 3.2 | 3.3 | 3.4 | 4.1 | 4.2 | 4.3 | 5.1 | 5.2 | 5.3 | 5.4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.1. Grooved pegboard | — | ||||||||||||||||
1.2. Digit Symbol | .61 | — | |||||||||||||||
1.3. Trails Making Test–A (Trails A) | .63 | .62 | — | ||||||||||||||
2.1. Block design | .48 | .53 | .43 | — | |||||||||||||
2.2. Simple drawings | .55 | .46 | .41 | .54 | — | ||||||||||||
2.3. Clock drawing | .40 | .38 | .39 | .49 | .51 | — | |||||||||||
3.1. Trails Making Test–B (Trails B) | .62 | .61 | .69 | .49 | .52 | .40 | — | ||||||||||
3.2. Wisconsin Card Sorting Test | .43 | .48 | .40 | .47 | .35 | .42 | .44 | — | |||||||||
3.3. Executive Interview | .47 | .42 | .36 | .48 | .35 | .23 | .40 | .36 | — | ||||||||
3.4. Stroop | .60 | .40 | .50 | .32 | .32 | .32 | .60 | .36 | .23 | — | |||||||
4.1. California Verbal Learning Test | .42 | .49 | .39 | .38 | .30 | .38 | .40 | .38 | .43 | .36 | — | ||||||
4.2. Modified Rey-Osterrieth Figure | .47 | .32 | .40 | .49 | .38 | .25 | .37 | .22 | .35 | .29 | .38 | — | |||||
4.3. Logical Memory | .28 | .33 | .24 | .38 | .34 | .22 | .32 | .14 | .33 | .09 | .41 | .44 | — | ||||
5.1. Boston Naming Test | .54 | .40 | .36 | .38 | .48 | .30 | .36 | .33 | .38 | .22 | .34 | .47 | .33 | — | |||
5.2. Animal Fluency | .38 | .48 | .27 | .36 | .33 | .22 | .39 | .25 | .27 | .11 | .35 | .38 | .37 | .46 | — | ||
5.3. Letter Fluency | .34 | .47 | .30 | .22 | .35 | .22 | .37 | .24 | .44 | .12 | .36 | .23 | .27 | .41 | .50 | — | |
5.4. Spot-the-Word | .06 | .17 | .09 | .24 | .28 | .14 | .12 | .09 | .23 | .08 | .18 | .17 | .19 | .40 | .16 | .31 | — |
Note. For ID, x.y indicates the yth outcome of domain x. Domain 1 = information-processing speed; Domain 2 = visuospatial; Domain 3 = executive; Domain 4 = memory; Domain 5 = language.
We compared the sensitivity analysis to compare the 10 adjustment methods, described in the p -Value Adjustment Methods section, with respect to patterns of hypothesis rejection and inference. We conducted two-sample t tests to test the null hypothesis of no difference between the depressed and comparison groups for each of the 17 outcome measures. The p value adjustment methods were applied using the multtest procedure, available in the SAS/STAT software ( SAS Institute Inc., 2002–2006 ). This procedure allowed for the computation of adjusted p values for the Bonferroni- and resampling-class methods, as well as the Sidak method. For the resampling methods, we used 100,000 bootstrap samples in the calculations. The Sidak derivatives (TCH, D/AP, and RSA) were programmed in a SAS macro (available on request).
Figure 1 compares the adjusted p values for each method across all outcomes. The legend indicates the total number of rejected hypotheses per method. We used a square-root scale for the y -axis to reduce the quantity of overlapping points. Adjusted p values based on the smaller unadjusted p values, primarily in the information-processing speed and visuospatial ability domains, remained difficult to distinguish; the numerical values are shown in Table S1 in the supplemental materials. Among Bonferroni-class methods, the Bonferroni method had the largest p values and thus was the most conservative of the methods, followed by the Holm (1979) , Hochberg (1988) , and Hommel (1988) methods, which were the least conservative. The Sidak method produced similar results to the Bonferroni method. The Sidak derivatives were more liberal, all producing results similar to the Hochberg and Hommel methods; D/AP was most conservative of the three. Generally, TCH was the least conservative, although RSA produced some smaller p values, mostly when the observed p value was also quite small.
Adjusted p values by method across neuropsychological outcomes. There are 17 observed p values for a set of 17 neuropsychological measures and adjusted p values per each method. A square-root scale is used to reduce overlapping points. Numbers in parentheses in the legend indicate the number of rejected hypotheses for that method. Symbols for outcomes with a null hypothesis rejected without adjustment indicate the following: + = null hypothesis rejected using each adjustment method; x = null hypothesis not rejected using any adjustment method; o = null hypothesis rejected by some adjustment methods. A full color version of this figure is included in the supplemental materials online.
The resampling methods produced relatively conservative results, with overall inferences similar to the Bonferroni and Sidak methods. The sd.minP method rejected the null hypothesis for the Clock Drawing Test, which was not rejected by the Bonferroni or Sidak methods. Whereas the order relations of the Bonferroni- and Sidak-class adjusted p values were highly consistent, this failed to hold for the resampling-class methods. The adjusted resampling-class p values were smaller than the Hommel counterpart for some outcomes and larger than the Bonferroni counterpart for others. Compared against each other, the sd.minP p values were smaller than the minP p values.
The importance of multiple hypothesis testing is highlighted by these results. Of the 17 outcomes and corresponding null hypotheses, we rejected 14 null hypotheses without adjustment. Of these 14, only 6 null hypotheses were rejected using each p value adjustment method. The null hypotheses regarding Animal Fluency and Stroop were not rejected using any method. Therefore, of the 14 null hypotheses rejected without adjustment, we can say confidently that 2 hypothesis decisions were Type I errors, 6 null hypotheses were rejected correctly, and 6 hypothesis decisions remain unclear. Without knowing the true differences (or lack thereof) between the populations regarding these seven outcomes, we gain confidence in our hypothesis rejection criteria by evaluating the Type I error and power of the p value adjustment methods.
Simulation Study
The premise of the simulation study, conducted using the R statistical package ( R Development Core Team, 2006 ), was to assess adjustment method performance across two series of trials. Performance included both Type I error protection and power to detect true effects. We defined each trial by a combination of hypothesis set and correlation structure conditions, defined below and summarized in Table 3 . In a given trial, we generated 10,000 random datasets, termed replicates , with two groups of size N = 100 observations each. We chose to generate M = 4 outcome variables, termed V1 through V4, to represent an average neuropsychological domain. Outcomes were generated to follow multivariate normal distribution using the mvrnorm function ( Venables & Ripley, 2002 ). Type I error and power estimates were calculated using the method-specific adjusted p values, based on two-sample, equal-variance, two-sided t test p values from each replicate. The number of resampled datasets, N boot , nontrivially affects computation time but has less impact on performance estimation accuracy compared with the number of replicates ( Westfall & Young, 1993 ). We set N boot = 500 for efficiency.
Compound–Symmetry Simulation Series Parameters
Outcome types | ||||
---|---|---|---|---|
Hypothesis sets | V1 | V2 | V3 | V4 |
Uniform–true | TN | TN | TN | TN |
Uniform–false | FN | FN | FN | FN |
Split (split–uniform) | TN | TN | FN | FN |
Correlation structure | ||||
Correlation structure | V1 | V2 | V3 | V4 |
V1 | 1 | ρ | ρ | ρ |
V2 | ρ | 1 | ρ | ρ |
V3 | ρ | ρ | 1 | ρ |
V4 | ρ | ρ | ρ | 1 |
Note. Outcomes types: TN = true null; FN = false null; V1–V4 = ?.
Compound symmetry: ρ = {0.0, 0.1, …, 0.9}.
We defined a true null (TN) as a simulated outcome with no difference between groups. The null hypothesis is actually true, and the p value for the hypothesis test should be nonsignificant. True null outcomes were simulated with an effect size of 0.0 between the two groups and were used for Type I error estimation. We defined a false null (FN) as a simulated outcome with a significant difference between the groups, or, alternatively, the null hypothesis is false. False null outcomes were simulated with an effect size of 0.5 between groups and were used for power estimation. Varying combinations of TNs and FNs, termed hypothesis sets , defined the outcomes V1–V4. The uniform hypothesis sets defined all four outcomes to be the same type, either all true nulls or all false nulls, allowing only Type I error or power estimation, respectively. The split hypothesis set defined two outcomes as TNs and the other two as FNs and allows both Type I error and power estimation using the relevant simulated outcomes. These hypothesis sets defined the truth in a given trial, allowing for absolute comparisons of the p value adjustment methods against the truth instead of only the relative comparisons afforded by the sensitivity analysis.
For all trials, we defined the significance threshold for all p values at α = .05. We used several performance measures detailed by Dudoit, Shaffer, and Boldrick (2003) with adapted nomenclature. Using TN outcomes, we defined Type I error as the familywise error rate, meaning the probability of rejecting at least one TN hypothesis. We defined minimal power as the probability of rejecting at least one FN. We defined maximal power as the probability of rejecting all FNs. These performance measures were calculated as the proportion of replicates satisfying the respective conditions. We defined average power as the average probability of rejecting the FNs across outcomes. This measure was calculated as the mean proportion of rejected FNs across outcomes.
To examine the effect of correlation between outcomes on p value adjustment method performance, we varied the correlation levels in the two simulation series systematically. The first simulation series, the compound-symmetry (CS) series, used a CS correlation structure in which all outcomes were equicorrelated with each other. We varied the correlation parameter ρ from 0.0 to 0.9 with an interval of 0.1 for 10 possible values. With three specified hypothesis sets (uniform–true, uniform–false, and split) and 10 CS structures, 30 trials were conducted in this series, summarized in Table 3 .
The second simulation series, block symmetry (BS), defined the outcomes V1–V2 and V3–V4 as constituting Blocks 1 and 2. Outcomes were equicorrelated within and between blocks, but with different levels. Within- and between-block correlation parameters W and B were varied among the values 0.0, 0.2, 0.5, and 0.8 (no, low, moderate, and high correlation), where within-block correlation was held strictly greater than between-block correlation, that is, W > B . The correlation structure of the sensitivity analysis data indicated higher correlation magnitude between outcomes within a block (domain) than between outcomes from different blocks. The BS correlation structure allows for the variation of these magnitudes in a simpler, four-outcome, two-block setting. In addition, the split–split hypothesis set was used, which defined a mix of outcome types overall and within blocks. This differed from the split, or split–uniform, hypothesis set in which block-specific hypothesis subsets were uniform. With four hypothesis sets and six correlation structures, 24 trials were conducted in this series. Table S2 in the supplemental materials summarizes the BS series parameters.
These structures represent correlation patterns observed between outcomes within and across several domains in the sensitivity analysis data. The CS structure is relevant to studies that focus on a single domain, for example, visuospatial ability, with multiple outcomes, for example, block design, simple drawings, and clock drawing. Although less intuitive compared with the CS structure, the BS structure is relevant for studies with multiple domains, for example, visuospatial ability and memory. Although correlation structures of real data are more complicated, these structures provided a relevant and convenient basis for evaluating the p value adjustment methods.
For brevity, we report the simulation results for the CS series in full. BS series results exhibited similar patterns, and thus we provide BS series performance results in Figures S2, S3, and S4 in the supplemental materials. We also note that the primary purpose of the p value adjustment methods is to control Type I error, that is, they maintain Type I error near or below α =.05. When viewing the power plots, take note of Type I error as well, as methods with power greater than others but with insufficient Type I error control fail the primary purpose and render them suboptimal.
CS–uniform hypothesis set
In Figure 2 , we show the performance across CS correlation structures for the p value adjustment methods under the uniform hypothesis sets (four TNs for Type I error, four FNs for power). Type I error performance is shown in the upper left panel. The resampling-class methods demonstrated stable Type I error around α = .05 as the CS correlation ρ increased. The Bonferroni-class methods demonstrated a decreasing trend in Type I error with increasing correlation between outcomes. The Bonferroni and Holm (1979) methods showed the lowest Type I error, whereas the Hochberg (1988) and Hommel (1988) methods allowed more error but were still conservative when ρ exceeded 0.5. The Sidak method exhibited marginally higher Type I error than the Bonferroni method. The TCH method followed a decreasing, but elevated trend; in the case of independence, it demonstrated high Type I error with values nearly double the threshold α = .05. However, in the case of high correlation, ρ = 0.9, it was the only method that reasonably approached α = .05. The D/AP and RSA methods followed liberal nonmonotonic trends. These methods showed increasing Type I error up to around ρ = 0.6–0.7, after which the trends decreased.
p value adjustment method performance across compound-symmetry correlation structures, Type I error, and power estimates for uniform hypothesis set. The upper left panel shows Type I error rates of the p value adjustment methods across increasing values of the compound-symmetry correlation parameter ρ. In this case, all M = 4 hypotheses are simulated to be true. Values near α = .05 are optimal. Values well above α = .05 indicate failure to protect Type I error at α. The remaining panels show different measures of power, where the four hypotheses are simulated to be false. Higher power is optimal, conditional on Type I error not exceeding α. A full color version of this figure is included in the supplemental materials online.
For average power, shown in the lower left panel, all the methods exhibited acceptable rates greater than 0.8. The Bonferroni and Sidak methods exhibited low, stable power near 0.85. The stepwise Bonferroni derivatives exhibited high power that decreased slowly with increasing correlation. The Hommel (1988) method was slightly more powerful than the Hochberg (1988) method, which was more powerful than the Holm (1979) method. The TCH method showed reasonably stable power around 0.9. The D/AP and RSA methods increased in average power as ρ increased and, at high correlation, were more powerful than the Bonferroni derivatives. However, as noted before, the power for the Sidak derivatives is irrelevant considering the Type I error rates well above α = .05. The minP method showed an increasing trend in average power with increasing correlation. The sd.minP method demonstrated an increase in power associated with a stepwise approach.
For minimal power, shown in the upper right panel, all methods were able to detect a difference between groups for at least one of four outcomes across all correlations with power greater than 0.9. The original Bonferroni and Sidak methods had the least power, followed by the Bonferroni derivatives, the resampling-class methods, and finally the Sidak derivatives.
For maximal power, shown in the lower right panel, all methods exhibited less power in comparison to the minimal and average power and demonstrated monotonic increasing trends with higher correlation with differing rates of change. The Bonferroni and Sidak methods again demonstrated the least power. The Bonferroni derivatives and the sd.minP performed generally well, ranging from just below 0.8 for low correlation and approaching 0.9 for high correlation. As before, the Holm (1979) method was less powerful than the Hochberg (1988) method, which was equivalent to the Hommel (1979) method, with the sd.minP method in between. Again, the TCH method followed the Sidak pattern in an elevated fashion. The D/AP and RSA methods demonstrated a steep rate of increase with increasing correlation, with power levels near Sidak with low correlation and power similar to the Bonferroni derivatives and the sd.minP method at high correlation.
CS–split hypothesis set
Figure 3 shows the results for the split hypothesis set across CS correlation structures. Similar relationships were found in comparison to the uniform hypothesis set, although the overall magnitudes decreased for all methods. Of note is the relative lack of decrease seen among stepwise methods, the Bonferroni derivatives and the sd.minP methods. The Type I error rates of the other methods were nearly halved in many instances. The D/AP and RSA methods exceeded α = .05 for high values of ρ.
p value adjustment method performance across compound-symmetry correlation structures, Type I error, and power estimates for split hypothesis set. The upper left panel shows Type I error rates of the p value adjustment methods across increasing values of the CS correlation parameter ρ. In this case, all only two of the M = 4 hypotheses are simulated to be true. Values near α = .05 are optimal. Values well above α = .05 indicate failure to protect Type I error at α. The remaining panels show different measures of power, using the two hypotheses simulated to be false. Higher power is optimal, conditional on Type I error not exceeding α. A full color version of this figure is included in the supplemental materials online.
Compared with the uniform hypothesis set power estimates, the Bonferroni derivatives exhibited lower average power, whereas the other methods performed similarly. The sd.minP method also showed a decrease in average power, although it increased with correlation. For minimal power, all methods exhibited a small reduction in power, although less pronounced for the Sidak derivatives. In terms of maximal power, the results for the Bonferroni derivatives were similar to the uniform hypothesis set counterparts, and all other methods exhibited greater power. The Bonferroni and Sidak methods continued to be the most conservative, but the Sidak derivatives exhibited higher power than all other methods for CS correlation ρ > 0.3.
The simulation results indicated that the Bonferroni and Sidak methods, although protecting Type I error, became increasingly conservative with high correlation between outcomes and were under-powered, particularly with regard to maximal power. The Bonferroni derivatives, although not improving the Type I error issue, notably improved average and maximal power. The single-step Sidak derivatives did not exhibit power similar to the stepwise methods. The average power of the D/AP and RSA methods increased with increasing correlation. However, these methods did not maintain acceptable Type I error. The resampling-class methods demonstrated consistent Type I error across the correlation structures and levels explored. The sd.minP method again demonstrated the advantage of a stepwise approach with similar power to the Bonferroni derivatives. Among methods examined, the Hochberg (1988) , Hommel (1979), and sd.minP methods exhibited the best performance, with considerable power and reasonable Type I error protection. With higher outcome correlation, the sd.minP method demonstrated higher power, particularly in the split hypothesis experiments. Thus, for lower correlation between neuropsychological outcomes, that is, average ρ < 0.5, we recommend either the Hochberg or the Hommel methods for reasons of easy implementation and exact replicability. For higher correlation between neuropsychological outcomes, we recommend the sd.minP method for increased power.
However, we must note a caveat to this simple guideline. With the implementation of the SAS/STAT multtest procedure ( SAS Institute Inc., 2002–2006 ), the equal-variance assumption was the only option for the test statistics used with the minP and sd.minP methods. When the equal-variance assumption is violated, using equal-variance t tests may yield inaccurate observed p values and inaccurate empirical minP null distributions, thus producing the conservative results shown in our sensitivity analysis.
Ideally, one might wish to use the sd.minP method without assuming equal variances for all outcomes, although to our knowledge current statistical software packages do not support this feature. Whereas the parametric methods are simple formulas that produce identical results across packages, the resampling-class methods may vary in their implementation from package to package, specifically with respect to the type of tests that may be conducted. If equality-of-variance tests are rejected for many outcomes, current software implementations may yield lower power. In this case, for average ρ ≥ 0.5, we prefer the Hochberg (1988) and Hommel (1979) methods. For the neuropsychological data examined in the sensitivity analysis, with high correlation between outcomes and many outcomes with unequal variances between groups, the Hochberg and Hommel methods are most appropriate.
Another important caveat with regard to the resampling-class methods is the number of N boot samples used to generate the empirically derived null minimum p value distributions. Westfall and Young (1993) recommended at least 10,000. In practice, this may not be enough. One cannot estimate small p values with a reasonable amount of precision without enough samples to estimate the tails of the distribution. With too few resamples, repeated applications of these methods may yield different inferences. Although we used 100,000 for our sensitivity analysis, admittedly the smallest unadjusted p value could not have been precisely estimated with 100,000, although the adjusted counterpart was still quite below α = .05.
The D/AP and RSA methods, designed to incorporate correlation into the adjustment, proved insufficient in protecting Type I error. The average power of these methods was adequate, but maximal power was weak for low correlation between outcomes. Further research in this area may yield another function that overcomes these deficiencies.
More methods might have been considered in this investigation. Dunnett and Tamhane (1992) and Rom (1990) both developed stepwise procedures with the motivation of lowering Type II error. Both methods make strong distributional assumptions and require complicated, iterative calculation. Furthermore, neither method has been implemented in any statistical software. The resampling-class methods also include permutation methods, which yield similar results to bootstrap methods when both methods can be easily applied but are extremely complicated to apply in many analytical situations ( Westfall & Young, 1993 ). Thus, we excluded these methods from consideration.
We chose to simulate only four outcomes to obtain a perspective of the performance of these methods. It is likely that the trends would simply become more pronounced and exaggerated with a higher number of outcomes, although this could be confirmed by another extensive simulation study.
The sensitivity analysis and simulation study were conducted in SAS and R because many of the methods used were built into the software and the remaining methods could be programmed with relative ease. SPSS and Stata, software preferred by some researchers, have a limited selection of methods available for analysis of variance–type comparisons, and none for multiple, two-sample tests as explored in this study ( SPSS Inc., 2006 ; Stata Press, 2007 ). The Hochberg (1988) method could be programmed with relative ease in either package; in fact, it could be programmed in spreadsheet software. The Hommel (1979) and sd.minP methods, however, would be more complicated. Reprogramming these methods for SPSS or Stata would likely be less efficient than learning the comparatively few commands necessary to conduct the p value adjustments in SAS or R.
Currently, there exists no perfect adjustment method for multiple hypothesis testing with neuropsychological data. The sd.minP, Hochberg (1988) , and Hommel (1979) methods demonstrated Type I error protection with good power, although new research may yield methods that surpass their performance.
Supplementary Material
Acknowledgments.
This research was supported by the National Institute of Mental Health Grants T32 MH073451, P30 MH071944, and R01 MH072947 and National Institute on Aging Grant P01 AG020677. We thank Sanat Sarkar of Temple University for his input on this article.
Supplemental materials : http://dx.doi.org/10.1037/a0012850.supp
Contributor Information
Richard E. Blakesley, Department of Biostatistics, University of Pittsburgh.
Sati Mazumdar, Department of Biostatistics, University of Pittsburgh, and Department of Psychiatry, University of Pittsburgh School of Medicine.
Mary Amanda Dew, Department of Psychiatry, University of Pittsburgh School of Medicine, and Departments of Epidemiology and Psychology, University of Pittsburgh.
Patricia R. Houck, Department of Psychiatry, University of Pittsburgh School of Medicine.
Gong Tang, Department of Biostatistics, University of Pittsburgh.
Charles F. Reynolds, III, Department of Psychiatry, University of Pittsburgh School of Medicine.
Meryl A. Butters, Department of Psychiatry, University of Pittsburgh School of Medicine.
- Butters MA, Whyte EM, Nebes RD, Begley AE, Dew MA, Mulsant BH, et al. The nature and determinants of neuropsychological functioning in late-life depression. Archives of General Psychiatry. 2004; 61 :587–595. [ PubMed ] [ Google Scholar ]
- Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Statistical Science. 2003; 18 :71–103. [ Google Scholar ]
- Dunnett CW, Tamhane AC. A step-up multiple test procedure. Journal of the American Statistical Association. 1992; 87 :162–170. [ Google Scholar ]
- Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988; 75 :800–802. [ Google Scholar ]
- Hochberg Y, Benjamini Y. More powerful procedures for multiple significance testing. Statistics in Medicine. 1990; 9 :811–818. [ PubMed ] [ Google Scholar ]
- Holm S. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics. 1979; 6 :65–70. [ Google Scholar ]
- Hommel G. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika. 1988; 75 :383–386. [ Google Scholar ]
- Hommel G. A comparison of two modified Bonferroni procedures. Biometrika. 1989; 76 :624–625. [ Google Scholar ]
- Pocock SJ. Clinical trials with multiple outcomes: A statistical perspective on their design, analysis, and interpretation. Controlled Clinical Trials. 1997; 18 :530–545. [ PubMed ] [ Google Scholar ]
- R Development Core Team. Vienna: R Foundation for Statistical Computing; 2006. R: A language and environment for statistical computing. Available at http://www.R-project.org . [ Google Scholar ]
- Rom DM. A sequentially rejective test procedure based on a modified Bonferroni inequality. Biometrika. 1990; 77 :663–665. [ Google Scholar ]
- Saltelli A, Chan K, Scott EM, editors. Sensitivity analysis: Gauging the worth of scientific models. New York: Wiley; 2000. [ Google Scholar ]
- Sankoh AJ, D’Agostino RB, Huque MF. Efficacy endpoint selection and multiplicity adjustment methods in clinical trials with inherent multiple endpoint issues. Statistics in Medicine. 2003; 22 :3133–3150. [ PubMed ] [ Google Scholar ]
- Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine. 1997; 16 :2529–2542. [ PubMed ] [ Google Scholar ]
- SAS Institute Inc. SA S OnlineDoc 9.1.3. Cary, NC: Author; 2002–2006. [ Google Scholar ]
- Sidak Z. Rectangular confidence regions for the means of multivariate normal distributions. Journal of the American Statistical Association. 1967; 62 :626–633. [ Google Scholar ]
- Simes RJ. An improved Bonferroni procedure for multiple tests of significance. Biometrika. 1986; 73 :751–754. [ Google Scholar ]
- SPSS Inc. SPSS base 1 5.0 user’s guide. Chicago: Author; 2006. [ Google Scholar ]
- Stata Press. Stata 10 base documentation set. College Station, TX: Author; 2007. [ Google Scholar ]
- Venables WN, Ripley BD. Modern applied statistics with S. 4th ed. New York: Springer; 2002. [ Google Scholar ]
- Verbeke G, Molenberghs G. Linear mixed models for longitudinal data. New York: Springer; 2001. [ Google Scholar ]
- Westfall PH, Young SS. Resampling-based multiple testing: Examples and methods for p-value adjustment. New York: Wiley; 1993. [ Google Scholar ]
- Wright SP. Adjusted P-values for simultaneous inference. Biometrics. 1992; 48 :1005–1013. [ Google Scholar ]
Multiple Hypothesis Testing
Aug 20, 2020
1. Motivation
It is important to control for false discoveries when multiple hypotheses are tested. Under the Neyman-Pearson formulation, each hypothesis test involves a decision rule with false positive rate (FPR) less than $\alpha$ (e.g. $\alpha = 0.05$). However, if there are $m$ $\alpha$-level independent tests, the probability of at least one false discovery could be as high as $\min(m\alpha, 1)$. Multiple hypothesis testing correction involves adjusting the significance level(s) to control error related to false discoveries. Some of the material presented is based on UC Berkeley’s Data102 course .
2. P-values
Consider $H_{0}: \theta = \theta_{0}$ versus $H_{1}: \theta = \theta_{1}$. Let \(\mathbb{P}_{\theta_{0}}(x)\) be the distribution of data \(X \in \mathbb{R}^{p}\) under the null, and let \(S = \{X^{(i)}\}_{i = 1}^{m}\) be the observed dataset. Additionally, denote $S_{0}$ as the unobserved dataset drawn from \(\mathbb{P}_{\theta_{0}}(x)\).
If the statistic $T(S_{0})$ has tail cumulative distribution function (CDF) \(F(t) = \mathbb{P}_{\theta_{0}}(T(S_{0}) > t)\), then the p-value is defined as the random variable $P = F(T(S))$. The graphical illustration of the density of $T$ (short for $T(S)$) is shown below.
An important fact about p-value $P$ is that it has $Unif(0, 1)$ distribution under the null. A random variable has $Unif(0, 1)$ distribution if and only if it has CDF $F(p) = p$ for $p \in [0, 1]$. We now show $P$ has CDF $F(p) = p$.
where the first equality is by definition of $P$. For the second equality, it is helpful to recall that for the 1-to-1 function $F(\cdot)$, $F: T \rightarrow u$ and $F^{-1}: u \rightarrow T$. Then from diagram above, notice that $F(T)$ is decreasing with respect to $T$. The third equality is from definition of $F(\cdot)$.
3. Bonferroni Correction
Let $V$ be the number of false positives. Then the probability of at least one false discovery $V > 0$ among $m$ tests (not necessarily independent) is defined as the family-wise error rate (FWER). Bonferroni correction adjusts the significance level to $\alpha / m$. This controls the FWER to be at most $\alpha$. If there are $m_{0} \leq m$ true null hypotheses, then
where the first inequality is from union bound (Boole’s inequality). In practice, the observed p-value $p_{i}$ is adjusted according to
for $i = 1, \dots, m$. Then the $i$-th null hypothesis is rejected if $p_{i}^{adj} \leq \alpha$. Let us simulate 10 p-values from $unif(0, 0.3)$ and implement Bonferroni corrected p-values.
4. Benjamini-Hochberg
A major criticism of Bonferroni correction is that it is too conservative - false positives are avoided at the expense of false negatives. The Benjamini-Hochberg (BH) procedure instead controls the FDR to avoid more false negatives. The FDR among $m$ tests is defined as
where $R$ is number of rejections among $m$ tests. BH procedure adjusts the p-value cutoff by allowing looser p-value cutoffs provided given earlier discoveries. This is graphically illustrated below.
The BH procedure is as follows
- For each independent test, compute the p-value $p_{i}$. Sort the p-value from smallest to largest $p_{(1)} < \cdots < p_{(m)}$.
- Select \(R = \max\big\{i: p_{(i)} < \frac{i\alpha}{m}\big\}\).
- Reject null hypotheses with p-value $\leq p_{(R)}$.
By construction, this procedure rejects exactly $R$ hypotheses, and
Let $m_{0} \leq m$ be the number true null hypotheses. Let $X_{i} = \mathbb{1}(p_{i} \leq p_{(R)})$ be whether hypothesis $i$ is rejected or not. Since $p_{i} \sim unif(0, 1)$, $X_{i} \sim bernoulli(p_{(R)})$. Under the assumption that tests are independent, $V = \sum_{i = 1}^{m_{0}}X_{i} \sim binomial(m_{0}, p_{(R)})$. Then by definition
In practice, the observed p-value $p_{i}$ is adjusted according to
for $i = 1, \dots ,m$. The $i$-th null hypothesis is rejected if $p_{i}^{adj} \leq \alpha$. This results in exactly $R$ rejected null hypotheses because if $i \leq R$, then $p_{i}^{adj} < \alpha$, because
The first inequality is from definition of minimum over a set that includes \(\frac{mp_{(R)}}{R}\), and the second inequality is by construction of \(\frac{m}{R}p_{(R)}\). If $i > R$, then $p_{i}^{adj} > \alpha$ because $p_{(R)}$ is defined as the last p-value in sorted p-values with $p_{(i)} < \frac{i\alpha}{m}$. Let us simulate 10 p-values from $unif(0, 0.3)$ and implement BH corrected p-values.
The Informaticists
A take on the science of information, lecture 9: multiple hypothesis testing: more examples.
(Warning: These materials may be subject to lots of typos and errors. We are grateful if you could spot errors and leave suggestions in the comments, or contact the author at [email protected] .)
In the last lecture we have seen the general tools and concrete examples of reducing the statistical estimation problem to multiple hypothesis testing. In these examples, the loss function is typically straightforward and the construction of the hypotheses is natural. However, other than statistical estimation the loss function may become more complicated (e.g., the excess risk in learning theory), and the hypothesis constructions may be implicit. To better illustrate the power of the multiple hypothesis testing, this lecture will be exclusively devoted to more examples in potentially non-statistical problems.
1. Example I: Density Estimation
The main result of this section is as follows:
Next we check the conditions of Fano’s inequality. Since
2. Example II: Aggregation
In estimation or learning problems, sometimes the learner is given a set of candidate estimators or predictors, and she aims to aggregate them into a new estimate based on the observed data. In scenarios where the candidates are not explicit, aggregation procedures can still be employed based on sample splitting, where the learner splits the data into independent parts, uses the first part to construct the candidates and the second part to aggregate them.
Some special cases are in order:
The main result in this section is summarized in the following theorem.
Theorem 1 If there is a cube such that admits a density lower bounded from below w.r.t. the Lebesgue measure on , then
We remark that the rates in Theorem 1 are all tight. In the upcoming subsections we will show that although the loss function of aggregation becomes more complicated, the idea of multiple hypothesis testing can still lead to tight lower bounds.
2.1. Linear aggregation
To apply the Assoaud’s lemma, note that for the loss function
Therefore, the Assoud’s lemma (with Pinsker’s inequality) gives
2.2. Convex aggregation
Again we consider the well-specified case where
2.3. Model selection aggregation
Hence, Fano’s inequality gives
Exercise 1 Under the same assumptions of Theorem 1, show that
3. Example III: Learning Theory
The central claim of this section is the following:
Theorem 2 Let the VC dimension of be . Then
Recall that the definition of VC dimension is as follows:
Definition 3 For a given function class consisting of mappings from to , the VC dimension of is the largest integer such that there exist points from which can be shattered by . Mathematically, it is the largest such that there exist , and for all , there exists a function such that for all .
VC dimension plays a significant role in statistical learning theory. For example, it is well-known that for the empirical risk minimization (ERM) classifier
Hence, Theorem 2 shows that the ERM classifier attains the minimax excess risk for all function classes, and the VC dimension exactly characterizes the difficulty of the learning problem.
3.1. Optimistic case
We first examine the separation condition in Assoaud’s lemma, where the loss function here is
Therefore, Assoaud’s lemma gives
3.2. Pessimistic case
and therefore tensorization gives
3.3. General case
Exercise 2 Show that when the VC dimension of is , then
4. Example IV: Stochastic Optimization
where the expectation is taken over the randomness in the oracle output. The main result in this section is summarized in the following theorem:
and therefore Assoaud’s lemma gives
5. Bibliographic Notes
The linear and convex aggregations are proposed in Nemirovski (2000) for the adaptive nonparametric thresholding estimators, and the concept of model selection aggregation is due to Yang (2000). For the optimal rates of different aggregations (together with upper bounds), we refer to Tsybakov (2003) and Leung and Barron (2006).
The examples from statistical learning theory and stochastic optimization are similar in nature. The results of Theorem 2 and the corresponding upper bounds are taken from Vapnik (1998), albeit with a different proof language. For general results of the oracle complexity of convex optimization, we refer to the wonderful book Nemirovksi and Yudin (1983) and the lecture note Nemirovski (1995). The current proof of Theorem 4 is due to Agarwal et al. (2009).
- Arkadi Nemirovski, Topics in non-parametric statistics. Ecole d’Eté de Probabilités de Saint-Flour 28 (2000): 85.
- Oleg V. Lepski, Enno Mammen, and Vladimir G. Spokoiny, Optimal spatial adaptation to inhomogeneous smoothness: an approach based on kernel estimates with variable bandwidth selectors. The Annals of Statistics 25.3 (1997): 929–947.
- David L. Donoho, Iain M. Johnstone, Gérard Kerkyacharian, and Dominique Picard, Density estimation by wavelet thresholding. The Annals of Statistics (1996): 508–539.
- Yuhong Yang, Combining different procedures for adaptive regression. Journal of multivariate analysis 74.1 (2000): 135–161.
- Alexandre B. Tsybakov, Optimal rates of aggregation. Learning theory and kernel machines. Springer, Berlin, Heidelberg, 2003. 303–313.
- Gilbert Leung, and Andrew R. Barron. Information theory and mixing least-squares regressions. IEEE Transactions on Information Theory 52.8 (2006): 3396–3410.
- Vlamimir Vapnik, Statistical learning theory. Wiley, New York (1998): 156–160.
- Arkadi Nemirovski, and David Borisovich Yudin. Problem complexity and method efficiency in optimization. 1983.
- Arkadi Nemirovski, Information-based complexity of convex programming. Lecture Notes, 1995.
- Alekh Agarwal, Martin J. Wainwright, Peter L. Bartlett, and Pradeep K. Ravikumar, Information-theoretic lower bounds on the oracle complexity of convex optimization. Advances in Neural Information Processing Systems, 2009.
Share this:
2 thoughts on “ lecture 9: multiple hypothesis testing: more examples ”.
These notes are excellent! I think it’d be worth considering combining them into a survey monograph
Really glad you like them Jonathan! We’ll definitely combine these lecture notes at the end (also plan to teach a course based on these materials soon). Also this is not the end of the lecture series – more (tools and examples) to follow!
Leave a Reply Cancel reply
Discover more from the informaticists.
Subscribe now to keep reading and get access to the full archive.
Type your email…
Continue reading
An official website of the United States government
The .gov means it’s official. Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.
The site is secure. The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.
- Publications
- Account settings
- My Bibliography
- Collections
- Citation manager
Save citation to file
Email citation, add to collections.
- Create a new collection
- Add to an existing collection
Add to My Bibliography
Your saved search, create a file for external citation management software, your rss feed.
- Search in PubMed
- Search in NLM Catalog
- Add to Search
Multiple testing: when is many too much?
Affiliations.
- 1 Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, the Netherlands.
- 2 Department of Biomedical Data Sciences, Leiden University Medical Center, Leiden, the Netherlands.
- 3 Department of Endocrinology, Leiden University Medical Center, Leiden, the Netherlands.
- PMID: 33300887
- DOI: 10.1530/EJE-20-1375
In almost all medical research, more than a single hypothesis is being tested or more than a single relation is being estimated. Testing multiple hypotheses increases the risk of drawing a false-positive conclusion. We briefly discuss this phenomenon, which is often called multiple testing. Also, methods to mitigate the risk of false-positive conclusions are discussed.
PubMed Disclaimer
Similar articles
- Strategies in adjusting for multiple comparisons: A primer for pediatric surgeons. Staffa SJ, Zurakowski D. Staffa SJ, et al. J Pediatr Surg. 2020 Sep;55(9):1699-1705. doi: 10.1016/j.jpedsurg.2020.01.003. Epub 2020 Jan 23. J Pediatr Surg. 2020. PMID: 32029234
- False discovery rate control is a recommended alternative to Bonferroni-type adjustments in health studies. Glickman ME, Rao SR, Schultz MR. Glickman ME, et al. J Clin Epidemiol. 2014 Aug;67(8):850-7. doi: 10.1016/j.jclinepi.2014.03.012. Epub 2014 May 13. J Clin Epidemiol. 2014. PMID: 24831050 Review.
- The problem with unadjusted multiple and sequential statistical testing. Albers C. Albers C. Nat Commun. 2019 Apr 23;10(1):1921. doi: 10.1038/s41467-019-09941-0. Nat Commun. 2019. PMID: 31015469 Free PMC article.
- The fallback procedure for evaluating a single family of hypotheses. Wiens BL, Dmitrienko A. Wiens BL, et al. J Biopharm Stat. 2005;15(6):929-42. doi: 10.1080/10543400500265660. J Biopharm Stat. 2005. PMID: 16279352
- A patient called Medical Research. Furst T, Strojil J. Furst T, et al. Biomed Pap Med Fac Univ Palacky Olomouc Czech Repub. 2017 Mar;161(1):54-57. doi: 10.5507/bp.2017.005. Epub 2017 Mar 14. Biomed Pap Med Fac Univ Palacky Olomouc Czech Repub. 2017. PMID: 28323292 Review.
- Mediators and moderators of the effects of a school-based intervention on adolescents' fruit and vegetable consumption: the HEIA study. Daas MC, Gebremariam MK, Poelman MP, Andersen LF, Klepp KI, Bjelland M, Lien N. Daas MC, et al. Public Health Nutr. 2024 Jan 25;27(1):e50. doi: 10.1017/S1368980024000260. Public Health Nutr. 2024. PMID: 38269621 Free PMC article. Clinical Trial.
- Exploring the Association Between Trauma, Instability, and Youth Cardiometabolic Health Outcomes Over Three Years. Schuler BR, Gardenhire RA, Jones SD, Spilsbury JC, Moore SM, Borawski EA. Schuler BR, et al. J Adolesc Health. 2024 Feb;74(2):301-311. doi: 10.1016/j.jadohealth.2023.08.049. Epub 2023 Oct 15. J Adolesc Health. 2024. PMID: 37843478 Free PMC article.
- Analytical HDR prostate brachytherapy planning with automatic catheter and isotope selection. Frank CH, Ramesh P, Lyu Q, Ruan D, Park SJ, Chang AJ, Venkat PS, Kishan AU, Sheng K. Frank CH, et al. Med Phys. 2023 Oct;50(10):6525-6534. doi: 10.1002/mp.16677. Epub 2023 Aug 31. Med Phys. 2023. PMID: 37650773
- Applying WHO2013 diagnostic criteria for gestational diabetes mellitus reveals currently untreated women at increased risk. Scheuer CM, Jensen DM, McIntyre HD, Ringholm L, Mathiesen ER, Nielsen CPK, Nolsöe RLM, Milbak J, Hillig T, Damm P, Overgaard M, Clausen TD. Scheuer CM, et al. Acta Diabetol. 2023 Dec;60(12):1663-1673. doi: 10.1007/s00592-023-02148-2. Epub 2023 Jul 18. Acta Diabetol. 2023. PMID: 37462764 Free PMC article.
- The impact of smartphone app-based interventions on adolescents' dietary intake: a systematic review and evaluation of equity factor reporting in intervention studies. Schaafsma HN, Jantzi HA, Seabrook JA, McEachern LW, Burke SM, Irwin JD, Gilliland JA. Schaafsma HN, et al. Nutr Rev. 2024 Mar 11;82(4):467-486. doi: 10.1093/nutrit/nuad058. Nutr Rev. 2024. PMID: 37330675 Free PMC article.
Publication types
- Search in MeSH
Related information
Linkout - more resources, full text sources.
- Ovid Technologies, Inc.
- Sheridan PubFactory
- Silverchair Information Systems
Other Literature Sources
- scite Smart Citations
- Citation Manager
NCBI Literature Resources
MeSH PMC Bookshelf Disclaimer
The PubMed wordmark and PubMed logo are registered trademarks of the U.S. Department of Health and Human Services (HHS). Unauthorized use of these marks is strictly prohibited.
Multiple Hypothesis Testing in R
In the first article of this series , we looked at understanding type I and type II errors in the context of an A/B test, and highlighted the issue of “peeking”. In the second , we illustrated a way to calculate always-valid p-values that were immune to peeking. We will now explore multiple hypothesis testing, or what happens when multiple tests are conducted on the same family of data.
We will set things up as before, with the false positive rate \(\alpha = 0.05\) and false negative rate \(\beta=0.20\) .
To illustrate the concepts in this article, we are going to use the same monte_carlo utility function that we used previously:
We’ll use the monte_carlo utility function to run 1000 experiments, measuring whether the p.value is less than alpha after n_obs observations . If it is, we reject the null hypothesis. We will set the effect size to 0; we know that there is no effect and that the null hypothesis is globally true. In this case, we expect about 50 rejections and about 950 non-rejections, since 50/1000 would represent our expected maximum false positive rate of 5%.
In practice, we don’t usually test the same thing 1000 times; instead, we test it once and state that there is a maximum 5% chance that we have falsely said there was an effect when there wasn’t one 1 .
The Family-Wise Error Rate (FWER)
Now imagine we test two separate statistics using the same source data, with each test constrained by the same \(\alpha\) and \(\beta\) as before. What is the probability that we will detect at least one false positive considering the results of both tests? This is known as the family-wise error rate (FWER 2 3 ), and would apply to the case where a researcher claims there is a difference between the populations if any of the tests yields a positive result. It’s clear that this could present issues, as the family-wise error rate Wikipedia page illustrates:
Suppose the treatment is a new way of teaching writing to students, and the control is the standard way of teaching writing. Students in the two groups can be compared in terms of grammar, spelling, organization, content, and so on. As more attributes are compared, it becomes increasingly likely that the treatment and control groups will appear to differ on at least one attribute due to random sampling error alone.
What is the FWER for the two tests? To calculate the probability that at least one false positive will arise in our two-test example, consider that the probability that one test will not reject the null is \(1-\alpha\). Thus, the probability that both tests will not reject the null is \((1-\alpha)^2\), and the probability that at least one test will reject the null is \(1-(1-\alpha)^2\). For \(m\) tests, this generalizes to \(1-(1-\alpha)^m\). With \(\alpha=0.05\) and \(m=2\), we have:
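The arithmetic, for instance in R:

```r
alpha <- 0.05
1 - (1 - alpha)^2  #> 0.0975 -- roughly a 1-in-10 chance of at least one false positive
```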
Let’s see if we can produce the same result with a Monte Carlo simulation. We will run the Monte Carlo simulation for n_trials trials, performing n_tests_per_trial tests in each trial. For each trial, if at least one of the n_tests_per_trial tests results in a rejection of the null, we consider that the trial rejects the null. We should see that about 1 in 10 trials reject the null. This is implemented below:
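A sketch of that simulation, reusing the hypothetical reject_at_i and monte_carlo helpers defined in the earlier sketches:

```r
n_tests_per_trial <- 2

# A trial counts as a rejection if ANY of its tests rejects the null.
trial_rejects_at_i <- function(i) {
  any(replicate(n_tests_per_trial, reject_at_i(i)))
}

monte_carlo(1000, trial_rejects_at_i)  # roughly 0.0975, i.e. about 1 in 10 trials
```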
Both results show that evaluating two tests on the same family of data will lead to a ~10% chance that a researcher will claim a “significant” result if they look for either test to reject the null. Any claim there is a maximum 5% false positive rate would be mistaken. As an exercise, verify that doing the same on \(m=4\) tests will lead to an ~18% chance!
A bad testing platform would be one that claims a maximum 5% false positive rate when any one of multiple tests on the same family of data show significance at the 5% level. Clearly, if a researcher is going to claim that the FWER is no more than \(\alpha\) , then they must control for the FWER and carefully consider how individual tests reject the null.
Controlling the FWER
There are many ways to control for the FWER, and the most conservative is the Bonferroni correction . The “Bonferroni method” will reject null hypotheses if \(p_i \le \frac{\alpha}{m}\) . Let’s switch our reject_at_i function for a p_value_at_i function, and then add in the Bonferroni correction:
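A sketch of that change under the same assumptions as before; p_value_at_i returns the raw p-value and the Bonferroni rule compares each p-value against alpha / m.

```r
# Return the raw p-value instead of a reject/accept decision.
p_value_at_i <- function(i) {
  a <- rnorm(n_obs, mean = 0)
  b <- rnorm(n_obs, mean = 0)
  t.test(a, b)$p.value
}

m <- n_tests_per_trial

# Bonferroni: reject hypothesis i only if p_i <= alpha / m; the trial counts
# as a (false) rejection if any hypothesis is rejected.
bonferroni_trial_rejects <- function(i) {
  p <- replicate(m, p_value_at_i(i))
  any(p <= alpha / m)
}

monte_carlo(1000, bonferroni_trial_rejects)  # back near 0.05
```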
With the Bonferroni correction, we see that the realized false positive rate is back near the 5% level. Note that we use any(...) to add 1 if any hypothesis is rejected.
Until now, we have only shown that the Bonferroni correction controls the FWER for the case that all null hypotheses are actually true: the effect is set to zero. This is called controlling in the weak sense . Next, let’s use R’s p.adjust function to illustrate the Bonferroni and Holm adjustments to the p-values:
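A sketch using base R's p.adjust, which returns adjusted p-values that can be compared directly against alpha; the wrapper function is illustrative and reuses the helpers above.

```r
fwer_for_method <- function(method) {
  monte_carlo(1000, function(i) {
    p <- replicate(m, p_value_at_i(i))
    any(p.adjust(p, method = method) <= alpha)
  })
}

fwer_for_method("bonferroni")
fwer_for_method("holm")        # very close to Bonferroni when all nulls are true
```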
We see that the Holm correction is very similar to the Bonferroni correction in the case that the null hypothesis is always true.
Strongly controlling the FWER
Both the Bonferroni and Holm corrections guarantee that the FWER is controlled in the strong sense, that is, under any configuration of true and false null hypotheses. This is ideal because, in reality, we do not know whether there is an effect or not.
The Holm correction is uniformly more powerful than the Bonferroni correction, meaning that in the case that there is an effect and the null is false, using the Holm correction will be more likely to detect positives.
Let’s test this by randomly setting the effect size to the minimum detectable effect in about half the cases. Note the slightly modified p_value_at_i function as well as the null_true variable, which will randomly decide if there is a minimum detectable effect size or not for that particular trial.
Note that in the example below, we will not calculate the FWER using the same any(...) construct from the previous code segments. If we were to do this, we would see that both corrections have the same FWER and the same power (since the outcome of the trial is then decided by whether at least one of the hypotheses was rejected for the trial). Instead, we will tabulate the result for each of the hypotheses. We should see the same false positive rate [4], but greater power for the Holm method.
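A sketch of such a simulation; the effect size mde, the per-trial null_true draw, and the tabulation into table_bf and table_holm follow the description above, but the details are assumptions rather than the original code.

```r
mde <- 0.1         # assumed minimum detectable effect size
n_trials <- 1000

run_trial <- function(method) {
  null_true <- runif(1) < 0.5                # is the null actually true in this trial?
  effect    <- if (null_true) 0 else mde
  p <- replicate(m, {
    a <- rnorm(n_obs, mean = 0)
    b <- rnorm(n_obs, mean = effect)
    t.test(a, b)$p.value
  })
  data.frame(null_true = null_true,
             rejected  = p.adjust(p, method = method) <= alpha)
}

results_bf   <- do.call(rbind, replicate(n_trials, run_trial("bonferroni"), simplify = FALSE))
results_holm <- do.call(rbind, replicate(n_trials, run_trial("holm"),       simplify = FALSE))

# Per-hypothesis tabulation: rejections where null_true is TRUE are false
# positives; rejections where null_true is FALSE measure power.
table_bf   <- table(results_bf$null_true,   results_bf$rejected)
table_holm <- table(results_holm$null_true, results_holm$rejected)
```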
Indeed, we observe that while the realized false positive rates of both the Bonferroni and Holm methods are very similar, the Holm method has greater power. These corrections essentially reduce our threshold for each test so that across the family of tests, we produce false positives with a probability of no more than \(\alpha\) . This comes at the expense of a reduction in power from the optimal power ( \(1-\beta\) ).
We have illustrated two methods for deciding what null hypotheses in a family of tests to reject. The Bonferroni method rejects hypotheses at the \(\alpha/m\) level. The Holm method has a more involved algorithm for which hypotheses to reject. The Bonferroni and Holm methods have the property that they do control the FWER at \(\alpha\) , and Holm is uniformly more powerful than Bonferroni.
This raises an interesting question: What if we are not concerned about controlling the probability of detecting at least one false positive, but something else? We might be more interested in controlling the expected proportion of false discoveries amongst all discoveries, known as the false discovery rate. As a quick preview, let’s calculate the false discovery rate for our two cases:
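As a sketch, the false discovery proportion can be computed from the per-hypothesis results built above: of all rejected hypotheses, what fraction had a true null?

```r
fdr_of <- function(results) {
  discoveries <- results$rejected
  sum(discoveries & results$null_true) / max(sum(discoveries), 1)
}

fdr_of(results_bf)
fdr_of(results_holm)
```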
By choosing to control for a metric other than the FWER, we may be able to produce results with power closer to the optimal power ( \(1-\beta\) ). We will look at the false discovery rate and other measures in a future article.
Roland Stevenson is a data scientist and consultant who may be reached on LinkedIn .
[1] And a maximum 20% chance that we said there wasn’t an effect when there was one.
[2] Hochberg, Y.; Tamhane, A. C. (1987). Multiple Comparison Procedures. New York: Wiley. p. 5. ISBN 978-0-471-82222-6
[3] A sharper Bonferroni procedure for multiple tests of significance.
[4] Verify this by inspecting the table_bf and table_holm variables.
The multiple hypothesis testing problem
I must admit that I only learnt about the “multiple testing” problem in statistical inference when I started reading about A/B testing. In many ways I knew about it already, since the essence of it can be captured by a basic example in probability theory: suppose a particular event has a chance of 1% of happening. Now, if we make N attempts what is the probability that this event will have happened at least once among the N attempts?
As we will see more in detail below, the answer is $1-0.99^N$. For $N=5$, the event has already almost a $5\%$ chance of happening at least once in five attempts.
Fine… so what’s the problem with this? Well, here is why it can complicate things in statistical inference: suppose this 1% event is “rejecting the null hypothesis $H_0$ when the null is actually true”. In other words, committing a Type-I error in hypothesis testing. Continuing with the parallel, “making N attempts” would mean making N hypothesis tests. That’s the problem: if we are not careful, making multiple hypothesis tests can dangerously imply underestimating the Type-I error. Things are not that funny anymore right?
The problem becomes particularly important now that streaming data are becoming the norm. In this case it may be very tempting to continue collecting data and perform test after test… until we reach statistical significance. Uh… that’s exactly when the analysis becomes biased and things get bad.
The problem of bias in hypothesis testing is much more general than what is in the example above. In particle physics searches, the need for reducing bias goes as far as performing blind analysis , in which the data are scrambled and altered until a final statistical analysis is performed.
Let’s now go back to our “multiple testing” problem and be a bit more precise about the implications. Suppose we formulate a hypothesis test by defining a null hypothesis $H_0$ and alternative hypothesis $H_1$. We then set a type-I error $\alpha$, which means that if the null hypothesis $H_0$ were true, we would incorrectly reject the null with probability $\alpha$.
In general, given $n$ tests the probability of rejecting the null in any of the tests can be written as $$\begin{equation} P(\mathrm{rejecting\ the\ null\ in \ any \ of \ the \ tests})=P(r_1\lor r_2\lor\dots\lor r_n) \label{eq:prob} \end{equation}$$ in which $r_j$ denotes the event “the null is rejected at the j-th test”.
While it is difficult to evaluate eq. (\ref{eq:prob}) in general, the expression greatly simplifies for independent tests as it will be clear in the next section.
1. Independent tests
For two independent tests $A$ and $B$ we have that $P(A\land B)=P(A)P(B)$. The hypothesis of independent tests can thus be used to simplify the expression (\ref{eq:prob}) as $$\begin{equation} P(r_1\lor r_2\lor\dots\lor r_n) = 1 - P(r_1^* \land r_2^* \land\dots\land r_n^* ) = 1 - \prod_{j=1}^n P(r_j^* ), \end{equation}$$
where $ r_j^* $ denotes the event “the null is NOT rejected at the j-th test”.
What is the consequence of all this? Let’s give an example.
Suppose that we do a test where we fix the type-I error $\alpha=5\%$. By definition, if we do one test only we will reject the null 5% of the time if the null is actually true (making an error…). What if we make 2 tests? What are the chances of committing a type-I error then? The error will be $1-(1-\alpha)^2 = 1-0.95^2 \approx 9.75\%$, almost double the nominal 5%.
What if we do $n$ tests then? Well, the effective type-I error will be
\begin{equation} \bbox[lightblue,5px,border:2px solid red]{\mathrm{Type \ I \ error} = 1-(1-\alpha)^n} \ \ \ (\mathrm{independent \ tests}) \label{eq:typeI_independent} \end{equation}
What can we do to prevent this from happening?
- In many cases, it is not even necessary to do multiple tests
- If multiple testing is unavoidable (for example, we are testing multiple hypotheses in a single test because we have multiple groups), then we can just correct the type-I error as an effective type-I error $\alpha_\mathrm{eff}$. In order to recover the original type-I error (thereby “controlling” the multiple testing), we must ensure that $$\begin{equation} 1-(1-\alpha_\mathrm{eff})^n = \alpha \quad\Longrightarrow\quad \alpha_\mathrm{eff} \approx \frac{\alpha}{n}. \label{eq:alpha_eff} \end{equation}$$
Note that in order to arrive at the approximation in eq. (\ref{eq:alpha_eff}), we have assumed $\alpha\ll 1$ and used Taylor’s expansion $$(1-x)^m\approx 1-mx$$
What we have just shown is that, for independent tests, we can take into account the multiple hypothesis testing by correcting the type-I error by a factor $1/n$ such that $$\bbox[lightblue,5px,border:2px solid red]{\alpha_\mathrm{eff} = \frac{\alpha}{n} }$$
This is what goes under the name of Bonferroni correction , from the name of the Italian mathematician Carlo Emilio Bonferroni.
2. Practical application
In this section we are going to run several t-tests on independent sets of data with the null hypothesis being true. We will see that eq. (\ref{eq:typeI_independent}) fits the observed type-I error rate well.
We will draw samples from two Bernoulli distributions A and B, each with a probability $p=0.5$ of success. Each hypothesis test looks like $$\begin{eqnarray} H_0 &:& \Delta \mu = \mu_B -\mu_A = 0 \nonumber \\ H_1 &:& \Delta \mu = \mu_B -\mu_A \neq 0 \nonumber \end{eqnarray}$$ where $\mu_A$ and $\mu_B$ are the two sample means.
By our definition, the null $H_0$ is true as we are going to set $\mu_A=\mu_B=0.5$ (hence $\Delta \mu=0$). The figure below shows the probability of committing a type-I error as a function of the number of independent t-tests, assuming $\alpha=0.05$. Without correction, the Monte Carlo results are well fitted by eq. (\ref{eq:typeI_independent}) and show a rapid increase of the type-I error rate. Applying the Bonferroni correction does succeed in controlling the error at the nominal 5%.
Below is the Python code that I used to produce the figure above (code that you can also download). The parameter nsims that I used for the figure was 5000, but I had my machine running for a couple of hours so I decided to use 1000 as default. Give it a try, if you are curious!
```python
import scipy as sp
from scipy.stats import bernoulli
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import os

matplotlib.rc('font', size=18)
matplotlib.rc('font', family='Arial')

np.random.seed(seed=1)  # set random seed for replicability

alpha = 0.05  # Type I error rate
p1 = 0.5      # Probability for population 1
p2 = 0.5      # Probability for population 2

# simulation parameters
nsamples = 500  # each test will use nsamples
ntests = 20     # max number of tests
nsims = 1000    # number of simulations from which to draw average

# simulations without Bonferroni correction
def run_exp(nsamples, ntests):
    for i in range(ntests):
        testA = bernoulli(p1).rvs(nsamples)
        testB = bernoulli(p2).rvs(nsamples)
        _, p = sp.stats.ttest_ind(testA, testB)  # perform t-test and get p-value
        if p < alpha:  # do not apply Bonferroni correction
            return True
    return False

# simulations with Bonferroni correction
def run_exp_corrected(nsamples, ntests):
    for i in range(ntests):
        testA = bernoulli(p1).rvs(nsamples)
        testB = bernoulli(p2).rvs(nsamples)
        _, p = sp.stats.ttest_ind(testA, testB)  # perform t-test and get p-value
        if p < alpha / ntests:  # apply Bonferroni correction
            return True
    return False

p_reject = []
p_reject_corrected = []
for nt in range(1, ntests + 1):
    print("ntests =", nt)
    # results without using Bonferroni correction
    p_reject.append(np.mean([run_exp(nsamples, nt) for k in range(nsims)]))
    # results using Bonferroni correction
    p_reject_corrected.append(np.mean([run_exp_corrected(nsamples, nt) for k in range(nsims)]))
    print("p_reject = %f, p_reject_corrected = %f" % (p_reject[nt - 1], p_reject_corrected[nt - 1]))

# plot results
f = plt.figure()
ax = f.add_subplot(111)
n = np.arange(1, ntests + 1)
ax.semilogy(n, p_reject, 'ko', markersize=8, label="Monte Carlo")
ax.semilogy(n, 1 - (1 - alpha)**n, linewidth=3, label=r"$1-(1-\alpha)^n$")
ax.semilogy(n, p_reject_corrected, 'ro', markersize=8)
ax.tick_params('both', length=10, width=1, which='major')
ax.tick_params('both', length=5, width=1, which='minor')
plt.yticks([0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7],
           ["0.01", "0.05", "0.1", "0.2", "0.3", "0.4", "0.5", "0.6", "0.7"])
plt.ylim([0.01, 0.7])
plt.xlabel('Number of tests')
plt.ylabel('Probability of Type I error')
plt.legend(loc=4)
plt.show()
```
Open Access
Peer-reviewed
Research Article
Can expected error costs justify testing a hypothesis at multiple alpha levels rather than searching for an elusive optimal alpha?
Roles Conceptualization, Formal analysis, Methodology, Software, Writing – original draft, Writing – review & editing
* E-mail: [email protected]
Affiliation Complexity Science, Meraglim Holdings Corporation, Palm Beach Gardens, FL, United States of America
- Janet Aisbett
- Published: September 25, 2024
- https://doi.org/10.1371/journal.pone.0304675
Simultaneous testing of one hypothesis at multiple alpha levels can be performed within a conventional Neyman-Pearson framework. This is achieved by treating the hypothesis as a family of hypotheses, each member of which explicitly concerns test level as well as effect size. Such testing encourages researchers to think about error rates and strength of evidence in both the statistical design and reporting stages of a study. Here, we show that these multi-alpha level tests can deliver acceptable expected total error costs. We first present formulas for expected error costs from single alpha and multiple alpha level tests, given prior probabilities of effect sizes that have either dichotomous or continuous distributions. Error costs are tied to decisions, with different decisions assumed for each of the potential outcomes in the multi-alpha level case. Expected total costs for tests at single and multiple alpha levels are then compared with optimal costs. This comparison highlights how sensitive optimization is to estimated error costs and to assumptions about prevalence. Testing at multiple default thresholds removes the need to formally identify decisions, or to model costs and prevalence as required in optimization approaches. Although total expected error costs with this approach will not be optimal, our results suggest they may be lower, on average, than when “optimal” test levels are based on mis-specified models.
Citation: Aisbett J (2024) Can expected error costs justify testing a hypothesis at multiple alpha levels rather than searching for an elusive optimal alpha? PLoS ONE 19(9): e0304675. https://doi.org/10.1371/journal.pone.0304675
Editor: Stephan Leitner, University of Klagenfurt, AUSTRIA
Received: November 2, 2023; Accepted: May 15, 2024; Published: September 25, 2024
Copyright: © 2024 Janet Aisbett. This is an open access article distributed under the terms of the Creative Commons Attribution License , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All relevant data are within the manuscript and its Supporting Information files.
Funding: The author(s) received no specific funding for this work.
Competing interests: The authors have declared that no competing interests exist.
1. Introduction
A long-standing debate concerns which, if any, alpha levels are appropriate to use as thresholds in statistical hypothesis testing, e.g., [ 1 ]. Among the issues raised is how the costs of errors should affect thresholds [ 2 ]. One approach is to determine optimal alpha levels based on these costs, or on broader decision costs, in a given research context.
When Type II error rates are presented as functions of the alpha level, a level can be selected that minimizes the sum of Type I and Type II error costs, contingent upon parameter settings such as effect and sample sizes. This is a local view of optimization. Optimization can also be interpreted in a global sense, as minimizing the expected total error cost given the prior probability of the hypothesis being true [ 3 , 4 ]. In either case, the total cost may also incorporate payoffs from correct decisions [ 5 , 6 ].
In the global approach, estimates of the prior probability—also called the base rate or prevalence of true hypotheses—may be based on knowledge about the relationships under consideration, on replication rates in similar studies, or on other domain knowledge [ 7 – 9 ]. The prevalence estimate is then assumed to be the probability mass at the hypothesized effect size, with the remaining mass assigned according to the test hypothesis that the researcher hopes to reject—typically the null hypothesis of no effect. This dichotomous model has been extended to continuous distributions of true effect sizes in a research scenario [ 10 , 11 ].
Despite their mathematical and philosophical appeal, optimization strategies have not been widely adopted by researchers when selecting statistical test levels. This is arguably due to the difficulty of scoping, estimating, and justifying the various costs/payoffs [ 3 ], and of even defining a research scenario in which to estimate priors [ 12 ]. Greenland says that costs cannot be estimated without understanding the goals behind the analysis [ 2 ]. In view of the domain knowledge required, he suggests stakeholders other than researchers may be better placed to determine costs.
When studies involve well understood processes, domain knowledge will also guide estimation of prior probabilities of true hypotheses. Mayo & Morey argue, however, that such estimates require identifying a unique reference class, which is impossible, and, if it were possible, the prevalence proportion or probability would be unknowable [ 12 ].
And yet researchers are told they cannot hope to identify optimal alpha levels without good information about the base rate of true hypotheses and the cost of errors ([ 6 ]: S1 File ). Relying on a standard threshold such as 0.05 is also not recommended, given the broad spread in optimal alpha values seen in plausible scenarios. For example, Miller & Ulrich find very small values are optimal when the prevalence of true hypotheses is low and the relative cost of Type I to Type II errors is high, and larger alphas are optimal when the converse holds [ 5 , 6 ].
An alternative approach
We propose that, rather than try to find an optimum level in the face of such difficulties, researchers should simultaneously test at multiple test levels and report results in terms of these. This approach may deliver acceptable error costs without having to grapple with ill-defined notions of costs and research scenarios.
Simultaneous tests at multiple alpha levels lead to logical inconsistencies—when a hypothesis about a parameter value is rejected at one level but not rejected at another level, what can we say about the parameter value? This logical inconsistency can be overcome by extending the parameter space with an independent parameter that acts as a proxy for the test level [ 13 ]. The test hypothesis is extended with a conjecture about the value of the new parameter, so findings must also say something about it. This construction is simply to generate copies of the original hypothesis that can be identified by their proposed test level. The data are therefore extended so that all the added conjectures are rejected at their designated alpha level.
This construction puts testing at multiple alpha levels within the established field of multiple hypothesis testing. It is formalized mathematically in the S1 File , which also gives an example and shows how results from tests at multiple levels should be reported.
The advantage of such testing over simply reporting a raw P-value is that a priori attention must be paid to expected losses and the relative importance of Type I and Type II error rates and hence sample sizes. The advantage over a conventional Neyman-Pearson approach is that, at the study design stage, researchers have to consider more than one test level, and, at the reporting stage, they must tie findings to test levels.
We will refer to tests using this construction as multi-alpha tests . The remainder of the paper does not refer to the extended hypotheses and will speak of the original hypothesis being rejected at one alpha level but not at another. This is shorthand for saying that the extended hypothesis linked to the first alpha level is rejected but the extended hypothesis linked to the second alpha level is not.
Error considerations in multi-alpha testing concern a family of hypotheses, even if the original research design only concerned one hypothesis. The family-wise error rate is the probability of one or more incorrect decisions. Obviously, family-wise error rates in multi-alpha testing are driven by the worst performers, namely Type I errors due to tests at the least stringent levels and Type II errors at the most stringent levels.
Looking at rates is misleading when alpha levels are explicitly reported alongside findings, because the costs of errors may vary with level. Many journals require that P-values be reported, since different values have different evidential meaning [ 14 ] and may lead to different practical outcomes. For example, when two portfolios are found to outperform the S&P 500 with respective P-values of 0.10 and 0.01, an investment web site [ 15 ] advises that “the investor can be much more confident” of the second portfolio and so might invest more in it.
Thus, rather than a statistical test leading to dichotomous decisions, there may be a range of decisions, each associated with different error costs. A small P-value may be interpreted as strong evidence that triggers a decision bringing a large positive payoff when the finding is correct and a large loss when the finding is incorrect as compared with the payoff when no evidence level is reported. Likewise, finding an effect only at a relatively weak level may lower the cost/payoff. As a result, total costs could on average be lower when hypotheses are tested at multiple alpha levels than when a single default such as 0.05 or 0.005 is applied over many studies.
Below, we generalize error cost calculations to scenarios with continuous effect size distributions and to multi-alpha testing. Examples based on the resulting formulas then compare expected total error costs from multi-alpha tests with costs from tests at single alpha levels, including optimal levels. Our findings support previous conclusions about the difficulty of choosing optimal alpha levels and they again illustrate how an alpha level that under some conditions is close to optimal can, with different prevalence and cost assumptions, bring comparatively high costs.
Testing the same hypothesis at two or more alpha levels smooths expected total error costs. The costs will not be optimal but may be lower on average than costs from tests using “optimal” alpha levels determined using inappropriate models.
2. Expected total cost of errors in a research scenario
We review and generalize conventional cost calculations to cover continuous effect size distributions. Then we adapt cost considerations to deal with multi-alpha tests and show that, in important cases, the expected error costs are a weighted sum of the expected costs of the individual tests.
To illustrate the mathematical formulas, and to highlight issues with their application, we draw on studies in which the subjects are overweight pet dogs and the study purpose is to investigate diets designed to promote weight loss. The primary outcome measure is percentage weight loss. Examples of such research are a large international study into high fiber/high protein diets [ 16 ], a comparison of high fiber/high protein with high fiber/low fat diets [ 17 ], and a study into the effects of enriching a diet with fish oil and soy germ meal [ 18 ]. Numerous published studies into nutritional effects on the weight of companion animals provide the basis for a relevant research scenario. Studies are frequently funded by pet food manufacturers that advertise their products as being based on expert scientific knowledge, as is the case with the studies cited above.
Expected total error costs when testing at a single alpha level
Dichotomous scenario.
Given a research scenario, suppose that proportion P of research hypotheses are true (or, for local optimization, set P = 0.5). Suppose “true” and “false” effect sizes are dichotomously distributed, with d the difference between the two values in a particular study. In our example, P is the proportion of dietary interventions on overweight pets that lead to meaningful weight loss and d is the standardized mean difference in weight loss.
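The article's Eq (1) is not reproduced in this excerpt; under the setup just described, an expected total error cost of this general form is presumably intended (a sketch, writing β(d, α) for the Type II error rate at level α, C0 for the cost of a Type I error and C1 for the cost of a Type II error):

$$ \varpi(P,\alpha) \;=\; (1-P)\,C_0\,\alpha \;+\; P\,C_1\,\beta(d,\alpha). $$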
The alpha level that minimizes this cost can be estimated by searching over alpha levels and sample sizes, given prevalence P , effect difference d and relative error cost C 0 / C 1 [ 3 ]. If optimization is against error rates rather than costs, this relative cost is fixed at 1.
In our example, a Type I error might lead veterinarians and dog food manufacturers to promote high fiber/low fat diet for weight loss when a high fiber/high protein diet would work as well and possibly have fewer adverse effects. Conversely, a Type II error might mean that a cheap and easily implemented weight loss approach was overlooked, with possibly less effective or more expensive high protein diets being promoted. Understanding the relative costs of these errors obviously requires expert knowledge of canine diet, obesity issues, weight-loss alternatives and so on. The costs also depend on how large an effect the low-fat diet has on canine weight compared with the other diet.
Continuous scenario.
Suppose a histogram could be formed from the true size of effects of interventions targeting weight loss in companion animals. This histogram would approximate the density function that describes the prevalence of effect sizes in the research scenario.
More generally, let p ( e ) describe the prevalence of effect size e in a domain R (typically, an interval). Let E be the subset of effect sizes which satisfy the test hypothesis that the researcher is seeking to reject. R–E is therefore the subset of effects for which the research hypothesis is true. For example, if veterinarians consider a standardized weight loss of more than 0.1 to be practically meaningful, and the researcher wants to show a diet leads to meaningful weight loss, R–E is anything larger than 0.1.
Although the assumption of a continuous prevalence distribution is appealing, estimating the distribution is potentially even more difficult than estimating a prevalence proportion P in the dichotomous modelling. Any prevalence model compiled from relevant published literature will be distorted by publication bias and by the discrepancies between reported and true effect sizes highlighted by the “reproducibility crisis”.
Expected total error costs when testing one hypothesis at multiple alpha levels (multi-alpha testing)
Decisions, costs, and test levels.
Greenland argues that decisions cannot be justified without explicit costs [ 2 ]. We therefore relate decisions to costs and test levels before extending the cost expressions to multi-alpha testing.
Consider a hypothesis, such as that higher fiber diets increase weight loss in overweight dogs or that an intervention to control an invasive plant species is more efficacious than a standard treatment under some specified conditions. Suppose a set of decisions D(m), m = 0, 1, 2, …, k, concerns potential actions to be taken, with D (0) the decision to take no new action.
For instance, decisions might be directly research-related, such as terms used in reporting the study findings or choices of publication vehicle. Or they might concern practical actions, such as the study team encouraging local veterinarians to recommend low fat diets over high protein diets, or the research sponsor publicizing the benefits of switching to low fat diets, or the sponsor manufacturing low fat products and promoting them as better than high protein products.
The decisions are ordered according to their anticipated payoff C 1 ( m ) compared with the decision D (0) to take no new action, given that the intervention has a meaningful effect. Thus, D ( k ) will bring the greatest payoff C 1 ( k ). For example, publication in a more highly ranked journal may improve the researchers’ resumés. In the canine example, the overall health benefit from weight loss in overweight dogs increases with the number of dogs that switch from high protein to low fat diets; thus the research sponsor offering a low fat product as well as advertising the benefits will have greater payoff than the other actions.
However, deciding D ( m ) when the effect is not meaningful brings a cost, call it C 0 ( m ). These costs are assumed to increase with level m so that decision D ( k ) carries the greatest cost if the intervention has no meaningful effect. This cost might include more critical attention if published findings are not supported by later studies. In the canine example, costs include promotional costs, the continuing health costs for overweight dogs which might have otherwise had a more effective high protein diet, and potential impacts on the researchers’ and sponsor’s reputations.
Finally, assume that a decreasing set of alpha levels α m , m = 1, …, k has been identified, such that decision D ( m ) will be made if the test hypothesis is rejected at alpha level α m but not at level α m +1 . That is, only one decision is to be made, and it will be selected according to the most stringent alpha level at which the test hypothesis is rejected. If the test is not rejected at any of the alphas, then the decision is D (0), do nothing.
Thus, the research team and sponsor might decide to encourage local veterinarians to promote low fat diets over high protein ones if the findings of the canine dietary study are statistically significant at alpha level 0.05; to recommend such diets on the sponsor’s website and to veterinarian societies if findings are significant at level 0.01; and to manufacture low fat products and promote switching to them if findings are significant at level 0.001.
One approach to making such choices is provided at the end of this section, and the Discussion section further looks at how such a set of alpha levels might be identified.
Expected total costs of errors.
Suppose P is the prevalence of true hypotheses in a research scenario in which a dichotomous distribution of effects is assumed (or if error costs are considered locally, set P = 0.5).
Eq ( 1 ) is the expected error cost when a test will result either in a decision involving an action, or in doing nothing, with the action taken when the P-value calculated from the data is below α . In the multiple alpha level testing scenario, however, decision D ( m ) is only made when the P-value lies between α m +1 and α m . For example, local veterinarians would only be enlisted by the canine diet researchers if the findings are statistically significant at level 0.05 but are not at level 0.01, since at the higher level a wider campaign would be undertaken involving veterinarian organizations.
When the effect of an intervention is meaningful, but the test hypothesis is not rejected by a test at some alpha level, the error cost depends on the difference between the largest payoff C 1 ( k ) and the payoff brought by what was decided. That is, if the decision is D ( m ), the loss is C 1 ( k )− C 1 ( m ). In our example, if low fat canine diets are more effective that high protein diets yet the decision was to only promote them through local veterinarians, the loss would be due to the benefit difference between that and the wider advertising campaign and offering of products that saw more overweight dogs put on such diets.
In the resulting cost expression, Δ C 0 ( m ; e ) is the difference in costs for decisions D ( m ) and D ( m – 1) when the effect size is e ∈ E , and Δ C 1 ( m ; e ) is the difference between payoffs when e ∈ R–E .
Expected error cost of a multi-alpha test as the weighted sum of the costs of the single level tests
In standard single-level test situations within a conventional Neyman-Pearson framework, the potential research outcomes are independent of test level, according to the “all or nothing” nature of findings when a test threshold is applied. These dichotomous outcomes can plausibly be set to D (0) (do nothing if the test hypothesis is not rejected) and D ( k ). Thus, if the canine diet study gets a significant result, the sponsor will go ahead with manufacturing and marketing a product; otherwise, say, the researchers and the sponsor’s product innovation team may be asked to identify factors affecting the result. The Type I and Type II costs for these decisions are C 0 ( k ) and C 1 ( k ) respectively.
Now suppose that the relative cost r of Type II to Type I errors is the same at each test level, so that C 1 ( m ) = rC 0 ( m ) for m = 1, …, k. For example, if both cost types are proportional to the number of dogs switching to a low fat diet on the basis of a study, and decisions made at the different test levels primarily determine this number, then the ratio would be approximately constant.
In such a case, it is straightforward to show (see S2 File ) that the total cost of the multi-alpha test is a weighted average of the costs of the individual tests given in (1). It follows that the multi-alpha test will always be less costly than the worst, and more costly than the best, of the single level tests. It is thus more costly than the optimal single level test. The case with a continuous distribution of true effects is analogous.
The value of the information obtained from a test
What might be said about the relationship between costs at different test levels? Rafi & Greenland [ 19 ] suggest that the information conveyed by a test with P-value p is represented by the surprisal value −log 2 ( p ). The smaller the p , the more informative the test. In a multi-alpha test, the information conveyed on rejecting a test hypothesis at level α m might therefore be proportional to −log 2 ( α m ).
Therefore, when error costs at each test level are proportional to the surprisal value of that level, the total cost of the multi-alpha test is a weighted average of the costs of the component tests. This analysis carries over to the case of a continuous distribution of true effect sizes.
The S2 File has mathematical details. It also shows how the first equality in (8) can be used to set alpha levels if costs are known, or conversely can be used to guide appropriate decisions given pre-set alpha levels.
The surprisal values of test levels 0.05, 0.01 and 0.001 are respectively 4.3, 6.6 and 10.0. The information brought by the test at level 0.01 is therefore 50% more than that of the test at 0.05, and that brought by the stringent test at level 0.001 is 50% larger again. If experts agree that the error costs associated with the hypothetical decisions in our low fat versus high protein canine diet study increase by much more than 50% between levels, the range of test levels would need to be increased.
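These values can be checked directly, for instance in R:

```r
-log2(c(0.05, 0.01, 0.001))  #> 4.32 6.64 9.97
6.64 / 4.32                  #> about 1.5, i.e. ~50% more information at 0.01 than at 0.05
9.97 / 6.64                  #> about 1.5 again going from 0.01 to 0.001
```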
Note that transforming P-values to information values assumes no prior knowledge. A small P-value for a non-inferiority test contrasting the effectiveness of drug A with drug B would not be “surprising” if drug A had been shown to be superior in many previous trials. Indeed, over a very large sample, a large P-value would be surprising in this case.
3. Applying the expected cost formulations
We compare expected error costs of multi-alpha tests with those from conventional testing in studies investigating whether an effect size is practically meaningful. This offers insights into the cost behavior of multi-alpha tests, as well as illustrating the sensitivity of optimization and the potential pitfalls of using a single default test level. Research scenarios are modelled using both dichotomous and normally distributed true effect sizes, and costs from tests at multiple alpha levels are compared with those from single level tests and tests at optimal levels.
The first subsection uses cost estimates drawn from a simplified example in which Type II error costs vary with effect size and can be much higher than Type I costs. The second subsection investigates a research scenario in which different research teams anticipate different effect sizes, and so conduct trials with different sample sizes. Both true effect sizes and anticipated effect sizes are assumed to be normally distributed, and costs are reported as the total expected costs averaged over all the research teams.
We apply test alpha levels ranging from extremely weak through to strong to highlight how each of the levels can give a lower expected total error cost than the others under some parameter settings. Throughout, a two-group design with equal groups is assumed, where n is the total sample size, M is the boundary between meaningful and non-meaningful effect sizes, and all alpha levels are with respect to one-sided tests. For the multi-alpha tests, error costs are assumed to be proportional to the surprisal value of the component test level. This simplifying assumption implies no prior knowledge about test outcomes.
S3 File further illustrates the sensitivity of cost computations and the smoothing effect that multi-alpha testing has on error rates. It summarizes results from simulations in which cost differences between test levels are randomly assigned but Type I errors are on average more costly. The simulations apply one-sided test levels commonly found in the literature.
Cost comparisons in an example research scenario
Consider a research scenario of clinical trials investigating Molnupiravir as therapy in non-hospitalized patients with mild to moderate symptoms of Covid 19. Suppose the primary outcome is all-cause hospitalization or death by day 29.
Following Ruggeri et al [ 20 ] and Goswami et al [ 21 ], only consider the economic burden when estimating costs. If an ineffective drug is administered, economic costs stem from the direct price of the drug and from its distribution. If a drug that would reduce hospitalizations is not used, the economic costs stem from hospitalizations and deaths that were not avoided; the lower the risk ratio, the more hospitalizations could have been prevented using the drug, and hence the greater the costs of Type II errors.
Given a study of Molnupiravir in a particular population, suppose the true risk of hospitalization or death is r 1 in the untreated group and r 2 in the treatment group. Let I be the risk of getting mild to moderate Covid in the population of interest. If the average cost of administering Molnupiravir to an individual is c T then, over the population, the per capita cost of administering it to those diagnosed with mild to moderate Covid is C 0 = c T I .
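The corresponding Type II cost expression is not reproduced in this excerpt. A sketch of the implied comparison, writing c_H for the average cost of a Covid hospitalization (the symbol used in the ratio c_T / c_H below): the per capita hospitalization cost avoided by treating is roughly c_H I ( r 1 – r 2 ), so the net per capita benefit of administering the drug is

$$ c_H\, I\,(r_1 - r_2) \;-\; c_T\, I \;=\; I\,\bigl[\,c_H (r_1 - r_2) - c_T\,\bigr]. $$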
The difference can be negative. The drug would therefore only be economical to administer if the absolute risk reduction exceeds c T / c H , which in this simplified example we take to be the definition of practically meaningful.
A Molnupiravir cost-benefit analysis using US data [ 21 ] set the cost of a treatment course with the drug at $707 and the cost of a Covid hospitalization (without ICU) at $40,000, with the risk of hospitalization for the untreated population about 0.092. On these conservative notions of cost, the risk must be reduced by at least 707/40,000 ≈ 0.018 for the drug rollout to be cost-effective. These values are used in the calculations reported below.
Dichotomous distribution of effects in the research scenario
Table 1 reports expected total error costs for various parameter settings calculated from (1) with β ( d , α ) and costs defined as above (dropping the scale factor I which appears in all terms). Single level tests are at the one-sided alphas shown in columns 2–4, and multi-alpha tests are formed from these three levels. Note that group size 1000 gives a prior calculated power of 80% to detect effect size –0.046 if testing against the break-even point M at alpha level 0.05, assuming risk in the untreated group is 0.092.
The table reveals a range of optimal alpha levels, allowing each of the selected single test levels to out-perform the others in some setting. The values P = 0.5 and P = 0.1 respectively model the local or “no information on prevalence” case and the pessimistic research scenario [ 8 ]. If the true risk difference is near the break-even point and prevalence is low, Type I errors dominate and small alphas are optimal. For larger risk differences or higher prevalence, larger alphas help limit the more costly Type II errors.
For the multi-alpha tests, costs are assumed to be proportional to the surprisal value of the component test level, as described earlier. Proportionality could result from deciding only to administer Molnupiravir to a proportion of people with Covid symptoms, for example, limiting the roll-out to some locations, where the proportion depends on the test level at which the hypothesis of no effect is to be rejected. Then, as discussed earlier, the total expected cost of the multi-alpha tests is a weighted sum of the total expected costs of the one-sided tests at each level. Because of the wide range of alphas, these weighted averages can be substantially higher than the optimal costs. We will return to this point in the Discussion.
Continuous distribution of effects in the research scenario
Suppose the risk difference between treatment and control groups in the Molnupiravir studies approximately follows a normal distribution, and that in a study the true risk in the untreated group is r 1 .
Given critical value M + s z α and standard deviations s and ss defined as above, the standard normal cumulative probability at ( r 1 – r 2 + M + s z α )/ ss is the expected Type I error rate for effect sizes r 2 – r 1 larger than M and is the expected power for effect sizes smaller than M . This formulation equates to that given by (10) for calculating expected Type II error rates.
Based on these calculations, Table 2 reports expected total error costs for different research scenario distributions. For the multi-alpha tests, costs are again assumed to be proportional to the surprisal value of the test level, so that the total expected cost is a weighted sum of the single level test costs. This table is striking for the very large alphas it shows optimizing expected total error costs. The wide range of alpha levels involved in the multi-alpha tests again lead to costs that can be far from optimal.
A normal distribution of true effects about –0.02 is an optimistic research scenario, in that interventions have more than 50% chance of being meaningful (i.e., cost effective). A normal distribution about zero with a tight standard deviation of 0.015 represents a pessimistic scenario, with only 12% chance of meaningful intervention. The test level 0.05 is close to optimal in the pessimistic scenario because Type II errors are rare. Costs are also lower for each test level in this scenario compared with the optimistic scenario because of the low probability of costly Type II errors. This is illustrated graphically in the S4 File .
Note that when Type II errors have higher cost, Neyman advises that the test direction should be reversed because Type I error rates are better controlled [ 22 ]. In this example, decisions would then be made according to the rejection level, if any, of the tests of the hypothesis that Molnupiravir treatment was cost-effective.
A research scenario in which different research teams anticipate different standardized effect sizes
Consider a research scenario in which studies collect evidence about whether an intervention is practically meaningful, in the sense of a standardized effect difference exceeding some boundary value M .
In the study design stages, different research teams make different predictions about the effect size. For example, the core literature might support a value such as M + 0.4, say, but each research team may apply other evidence to adjust its prediction. Suppose these predictions approximately follow a distribution with density function p ′( x ).
Given an anticipated effect size, each research team calculates sample sizes to achieve 80% power at a one-sided test level of 0.025, using the standard formula. For simplicity, suppose each team selects equal sized groups and estimates the same error costs. Finally, assume the true standardized effects in the research scenario follow a continuous distribution with density function p ( x ).
Fig 1A illustrates distributions of true and anticipated effect sizes. Fig 1B shows how the research teams’ sample sizes vary as a function of the effect sizes they anticipate. Fig 1C converts Fig 1B into a density function describing the probability that a sample size is selected in the research scenario. The requirement to achieve 80% power means that teams who are pessimistic about the anticipated effect size may need very large samples.
(a) Distributions of true (solid curve) and anticipated (dashed curve) standardized effect sizes, assuming true effects are centered on M. The dotted curve is the sampling distribution of an effect on the boundary of meaningful effects when total sample size is 98. The heavy vertical line is at the critical value for level 0.025 tests on this sample, indicating a high Type II error rate. (b) Sample sizes as function of anticipated effect size, calculated using normal distribution approximations with α = 0.025 and 80% power. (c) Probability distribution of sample sizes in the research scenario given the distribution of anticipated effect sizes shown in (a).
The differing sample sizes in the studies affect both Type I and Type II error rates, denoted β 0 and β 1 in Eq ( 2 ). These error rates, and hence the expected total error cost ϖ ( p , α ), are functions of the anticipated effect size x . The expected total cost over all studies at a single test level α is then ∫ ϖ ( p , α ) p ′( x ) dx , for ϖ ( p , α ) as defined in Eq ( 2 ). The multi-alpha expected total cost over all studies is similarly obtained by integrating the expression in (7).
Table 3 reports expected total costs averaged over all studies, for different means of the true effect distribution and different ratios of Type I to Type II error costs. The costs for the tests at the more stringent level hardly vary with cost ratio because the Type I error rate is negligible. The higher Type I error rates for the weak level tests cause costs to decrease with smaller cost ratios.
Averaged expected costs for the multi-alpha test are intermediate between the costs of the single level tests and can thus be seen as smoothing costs compared with testing at either level as a default. However, the optimal test levels can be even more lenient than we saw in Table 2 , making the optimal average expected costs substantially lower than for the other tests reported in Table 3 .
4. Discussion
A single alpha level cannot be appropriate for all test situations. Yet it is difficult to establish the level at which costs will be approximately minimized in any given research context. As Miller & Ulrich noted [ 6 ], and as our examples show, optimization is very sensitive to the proportion of true test hypotheses in the research scenario. Even allowing for perfect knowledge of the various costs and of the nature of the distribution of true effects, very different alpha levels can be close to optimal with different sample sizes and different parameters of the effect size distribution.
The impact of the research scenario is not surprising, given how true effect size distributions weight expected Type I versus Type II error rates and hence weight relative costs. Previous investigations into optimal test levels modelled effect sizes as taking one of two possible values. We followed [ 11 ] in also investigating research scenarios in which effect sizes are continuously distributed.
We accepted the convention that rejection of the test hypothesis leads to stakeholders “acting as if” the alternate hypothesis is true, and hence assumed that costs in single level tests are independent of the test level. In the multi-alpha tests, findings are tied to their test level. The costs of a false rejection were therefore assumed to be lower if tests at more stringent levels were not rejected, since it is not plausible to “act as if” in the same way. We suggested that the costs at each alpha level might be proportional to the surprisal value (also called information content) of a finding with a P-value at that level. Our reasoning was that, with less information, any response would be more muted and therefore Type I costs would be lower. On the other hand, Type II costs would be lowered when the test hypothesis was rejected at some, but not all, the test levels, because a little information is better than none.
Research is needed into how reporting findings against test levels affects their interpretation, and hence affects costs. This is related to how reporting P-values affects decisions, although multi-alpha test reporting goes further, in explicitly stating, for example, that a study did not provide any information about the efficacy of an intervention at level 0.005 but did at level 0.05. The results presented in section 3 and in the S3 File obviously give just a glimpse into how expected total error costs vary with modelling assumptions. Nevertheless, the results suggest costs from testing at multiple alpha levels are smoothed from the extremes of the costs when tests in different research scenarios are made at one fixed alpha level. Testing at an “optimal” level calculated using models based on incorrect assumptions may also lead to larger costs than testing at multiple levels.
Theoretically, multiple test levels could be chosen to optimize expected total costs, using Eq ( 6 ), say, and searching over the multidimensional space formed by sample size and the vector of alpha levels. The problems of estimating unknowables would obviously be worse, given our cost formulations assign different error costs at each of the levels. Aisbett [ 13 ]: Appendix 2 adapts an optimization strategy to the multi-alpha case in part by making simple assumptions about cost behavior. However, we do not recommend trying to optimize error costs across tests at multiple alphas.
How then should the alpha levels be selected? When an alpha that incurs high costs is included in the set of test levels, the multi-alpha test costs are increased. For example, in the low prevalence conditions in Table 1 , the high Type I error rate from testing at 0.25 blows out the difference between the costs of the optimal or best-performing single level and those of the multi-alpha test.
With this in mind, we suggest three strategies for setting the alpha levels in a multi-alpha test.
The first approach is to calculate optimal alphas for single level tests against a range of cost and prevalence models that are reasonable for the research scenario, and then use the smallest and largest of these alphas. The expected error costs from the multi-alpha tests will smooth out high costs if incorrect models are chosen.
A second strategy appropriate for applied research is to follow the procedure presented, without justification, in section 2. In this, potential decisions/courses of actions are identified, associated costs are estimated in terms of the consequences of incorrect decisions, and then test alpha levels are assigned. We envisage this to be an informal process rather than a search for a mathematically optimal solution. Greenland [ 2 ] recommends that researchers identify potential stakeholders who may have competing interests and hence differing views on costs, as Neyman illustrated with manufacturers and users of a possibly carcinogenic chemical [ 22 ]. Testing at multiple alpha levels selected a priori may allow these differing views to be incorporated at the study design stage. In the reporting stage too, different stakeholders can act differently on the findings according to their perceptions of costs, since findings are tied to test levels and weaker levels are associated with lower costs.
A third approach that avoids estimating error costs is the pragmatic strategy of assigning default values, guided by common practice in the researcher’s disciplinary area. Common practice may also be incorporated in standards such as those of the U.S. Census Bureau which define “weak”, “moderate”, and “strong” evidence categories of findings through the test levels 0.10, 0.05 and 0.01 [ 23 ]. Default alpha levels might also be assigned indirectly using default effect sizes, such as the Cohen’s d values for “small”, “medium” and “large” effects, given the functional relationships between P-values and effect sizes that depend, inter alia, on sample size and estimated variance. This pragmatic approach is open to the criticisms made about default alphas in the single level case, but, again, it smooths out high costs that arise when a default alpha level is far from optimal.
An appealing variation on this approach builds on links between Bayes factors (BFs) and P-values [ 4 ]. BFs measure the relative evidence provided by the data for the test and alternate hypotheses. Formulas relating P-values to BFs have been provided for various tests and assumed priors [ 24 ]. Supporting software computes alpha levels that deliver a nominated BF for a given sample size, subject to the assumptions. Alpha levels in some multi-alpha tests could therefore be set from multiple BF nominations. Intervals on the BF scale are labelled “weak”, “moderate” and “strong” categories of evidence favoring one hypothesis over the other. The boundaries of these intervals are potential default BF nominations. The corresponding alphas are functions of sample size, avoiding Lindley’s Paradox when sample sizes are large.
Whatever levels are chosen, testing at multiple alphas is subject to the qualifications needed for any application of P-values, which only indicate the degree to which data are unusual if a test hypothesis together with all other modelling assumptions are correct. Data from a repeat of a trial may still, with the same assumptions and tests, yield a different conclusion.
Nevertheless, testing at more than one alpha level discourages dichotomous interpretations of findings and encourages researchers to move beyond p < 0.05. We have shown that multi-alpha tests need not substantially raise total error costs; rather, they can be seen as a compromise offering adequate rather than optimal performance, and their costs may even be lower than those of optimization approaches based on mis-specified models. Empirical studies with committed practitioners in diverse fields such as health, ecology and management science are needed to better understand this relative cost performance and to evaluate practical strategies for setting test levels.
Supporting information
S1 File. Testing one hypothesis at multiple alpha levels: theoretical foundation and indicative reporting.
https://doi.org/10.1371/journal.pone.0304675.s001
S2 File. Total cost of the multi-alpha test as a weighted average of the costs of the individual tests.
https://doi.org/10.1371/journal.pone.0304675.s002
S3 File. Averaged results from simulations with random costs.
https://doi.org/10.1371/journal.pone.0304675.s003
S4 File. Illustration of impact of the distribution of effect sizes in the research scenario on Type I and Type II error rates.
https://doi.org/10.1371/journal.pone.0304675.s004
S5 File. R code to produce all Tables and Figures: see github.com/JA090/ErrorCosts.
https://doi.org/10.1371/journal.pone.0304675.s005