Mulling over adjustments for multiple testing
Having worked in trials for almost eight years (two in pharma, the rest in academia), one topic I had taken for granted was multiple testing and the adjustments required for it. It was not that I never considered the issue; I either adjusted accordingly where multiple treatment arms were concerned, or did not where the issue related to secondary or exploratory outcomes, but I never really investigated why such choices were made. I simply followed what I thought to be a consensus. However, an innocuous Twitter post by Andrew Althouse, PhD, brought the issue to the front of my mind and made me re-assess why I had taken such steps. After studying a few papers on the topic, I have realised that there is a multitude of varying opinions on the matter. I present a couple of these opinions below, alongside my own thoughts.
The first manuscript, the aforementioned paper by A. D. Althouse, entitled “Adjust for Multiple Comparisons? It's Not That Simple”, asks why we should not automatically assume that adjustments must be made, and provides common examples where we do not consider adjustment at all. The rationale behind this is simple: if we do not adjust in these situations, why do we automatically assume adjustments are required under other scenarios? Occasionally a researcher may add adjustments where they are not actually necessary, or a reviewer might ask why adjustments have not been considered. The paper aims to reassure readers that adjustments are not a universal requirement and that such technicalities should not cloud what is ultimately being explained by the analysis.
The paper's three points are as follows:
1) What constitutes a test that requires adjustment? Baseline comparisons and health-economic analyses are often pre-specified within a study, yet would never be considered for adjustment. Why? Likewise, most studies will publish a primary paper and then return to do additional analyses. Should we then go back to the primary paper and re-analyse it to account for these? When you think about it like this, the selection process behind choosing which analyses to adjust for is “spurious”.
2) “Adjusting p values for multiple comparisons effectively penalizes an association for being found in a large study rather than in a small study.” To put it simply: bigger studies = more analyses = more adjustment. In principle this does not seem right, which leads to point 3…
3) Following on from the two points above, if the consensus is to adjust only on a “per-paper” basis, then this surely encourages the “slicing” of data to produce as many single-hypothesis papers as possible. Multiple papers and significant results are then more likely to be presented without the context of the study as a whole.
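Points 2 and 3 can be made concrete with a small sketch of my own (the figures are hypothetical, not from the paper): the same nominal p-value passes when it is the only test in a “sliced” single-hypothesis paper, but fails a per-paper Bonferroni correction when reported alongside many other comparisons in one large study.

```python
# Hypothetical illustration: one comparison yields p = 0.01.
alpha = 0.05
p_observed = 0.01

# Reported alone, in its own single-hypothesis paper:
significant_alone = p_observed < alpha  # 0.01 < 0.05 -> True

# Reported inside one large study with 20 comparisons,
# Bonferroni-adjusted "per paper":
n_tests = 20
significant_adjusted = p_observed < alpha / n_tests  # 0.01 < 0.0025 -> False

print(significant_alone, significant_adjusted)
```

The association itself has not changed; only the size of the paper it appears in has, which is exactly the penalty Althouse objects to.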
The points press the reader to think about adjustments in the context of the whole study. Do we need to adjust for exploratory and subgroup analyses? The author concludes: “In this reader’s opinion, the best approach is simply to (1) describe what was done in a study; (2) report effect sizes, confidence intervals, and p values; and (3) let readers use their own judgment about the relative weight of the conclusions.” Personally, I would lean towards not producing p-values for exploratory outcomes at all, but I support the idea that we should explain our approach in advance and not allow the study to be weighed down by a desire for a dichotomised significant/non-significant result. It is far more productive to explain the true story presented by a study than to worry about whether an exploratory outcome is significant or not depending on whether adjustments were made.
Whilst the first manuscript covered scenarios where adjustments are not necessary, the second, “Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification” by Parker and Weir, describes a situation where adjustment would be the common approach. The paper then explains the authors’ view that, in this setting, adjustment is counter-productive.
The scenario in question is the multi-arm trial in which the treatments are distinct in nature. The authors agree with the general consensus that where treatments are related (e.g. multiple arms of the same treatment at different dose levels), adjustments are required in order to control the family-wise type I error rate (FWER). However, they believe that where treatments are distinct, such adjustments are no longer required. Their rationale boils down to the question of ‘what is a family’, and whether controlling the type I error rate under multiple testing of distinct treatments falls under the FWER or under the comparison-wise error rate, which is not inflated by multiple testing.
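The distinction between the two error rates is easy to see in a quick simulation of my own (not taken from Parker and Weir): with several independent comparisons all under the null, the comparison-wise error rate stays at α regardless of how many tests there are, while the family-wise rate climbs towards 1 − (1 − α)^k.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, k, n_sim = 0.05, 5, 100_000

# z-statistics under the null for k comparisons in each simulated "study"
z = rng.standard_normal((n_sim, k))
reject = np.abs(z) > 1.96  # two-sided test at alpha = 0.05

per_comparison = reject.mean()        # ~0.05, unaffected by k
fwer = reject.any(axis=1).mean()      # ~1 - 0.95**5, i.e. ~0.23

print(per_comparison, fwer)
```

Whether the five comparisons constitute one “family” (so the ~0.23 figure matters) or five separate questions (so only the ~0.05 figure matters) is precisely the point of disagreement.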
One argument for adjustment is “the confirmatory trials argument”, which puts the case that FWER adjustments should be made in confirmatory multi-arm trials but are not necessary in exploratory studies. This would make sense for multiple reasons (hypothesis testing, acceptance into clinical practice, etc.). However, the authors feel that even for confirmatory trials the correction is problematic, stating: “There is also the danger that by focussing too much on controlling type I error, we overlook the type II error rate (failing to reject the null hypothesis even though it is false). The consequence of reducing the type I error rate is that the type II error rate is increased”. This point is important and potentially overlooked when considering adjustment. For example, if you apply a Bonferroni correction for two comparisons, leaving a significance threshold of α = 0.025, how do you then interpret a result of p = 0.03? Had you run a study comparing that distinct treatment alone against the control, you would now be looking at an entirely different result under hypothesis testing. Are such measures counter-productive?
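To put a rough number on that type II cost, here is a back-of-envelope calculation of my own (the 80%-power figure is an assumption for illustration): take a comparison powered at ~80% for a two-sided test at α = 0.05, then halve α to 0.025 as a two-arm Bonferroni correction and see what happens to power.

```python
from scipy.stats import norm

# Standardised mean of the test statistic giving ~80% power at alpha = 0.05
delta = 2.80

def power(alpha, delta):
    """Approximate power of a two-sided z-test (the far-tail term is negligible)."""
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta - z_crit)

p_unadjusted = power(0.05, delta)   # ~0.80
p_bonferroni = power(0.025, delta)  # ~0.71

print(p_unadjusted, p_bonferroni)
```

Without changing the data or the sample size, the type II error rate rises from roughly 20% to roughly 29%, which is the trade-off Parker and Weir are pointing at.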
Parker and Weir refer to this issue, stating: “An argument often presented in favour of non-adjustment is that if independent trials were done, then no multiple testing correction would be performed in this case. In fact, the reasons for non-adjustment go deeper than this. Although adjusting for multiple testing enables control of the FWER at the study level and may make sense theoretically, we would argue that it does not make good sense from the perspective of clinical practice and can lead to difficulties with interpretation. It also tends to be logically incompatible with the main clinical questions of interest.” It is a salient and fair point to make. However, Frank Bretz and Franz Koenig published a commentary in 2020 arguing against it.
Bretz and Koenig argue that the approach put forward by Parker and Weir equates to ‘weak’ FWER control, which falls short of pharmaceutical regulatory practice. Weak FWER control bounds the probability of declaring a significant treatment effect when no treatments are effective; in other words, the global null hypothesis of no effect is assumed to hold. Bretz and Koenig argue that in multi-treatment studies this assumption is incorrect, and that the type I error rate should be controlled under any configuration of the null hypotheses, not just under the assumption of no effect at all. This is strong FWER control, and in their view it is what should be applied. Personally, I feel this is a fair comment, but the commentary fails to address the type II problem presented by Parker and Weir. As a result, you are left without a definitive answer, deciding whether to sacrifice type I or type II error.
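The weak-versus-strong distinction is easiest to see with a toy simulation of my own (not from the commentary): a “protected” procedure that tests each of three arms at an unadjusted α only after a global chi-squared test rejects. Under the global null this controls the FWER, but when one treatment truly works the global gate almost always opens, and the two remaining true nulls are each tested at 0.05, inflating the FWER among them above 0.05.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
alpha, k, n_sim = 0.05, 3, 100_000
z_crit = 1.96
chi2_crit = chi2.ppf(1 - alpha, df=k)

def fwer_among_true_nulls(means):
    """FWER of the 'global gate then unadjusted tests' procedure,
    counted only over the arms whose null hypothesis is actually true."""
    z = rng.standard_normal((n_sim, k)) + means   # arm-level test statistics
    global_reject = (z ** 2).sum(axis=1) > chi2_crit   # gatekeeping test
    null_arms = np.array(means) == 0                   # which nulls are true
    false_pos = (np.abs(z[:, null_arms]) > z_crit).any(axis=1)
    return (global_reject & false_pos).mean()

fwer_global_null = fwer_among_true_nulls([0.0, 0.0, 0.0])   # below 0.05
fwer_partial_null = fwer_among_true_nulls([5.0, 0.0, 0.0])  # ~0.0975, above 0.05

print(fwer_global_null, fwer_partial_null)
```

This is the configuration-of-the-null point in miniature: control that holds only when no treatment works (weak) breaks down as soon as one treatment does, which is why regulators ask for strong control.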
The Bretz and Koenig “rebuttal” is interesting in that it comes from a pharmaceutical, drug-regulatory perspective rather than an academic one. As someone who has experience in both sectors, I can understand why one may be more concerned with controlling type I error conservatively whilst the other worries about inflating the type II error rate.
When I raised these points for discussion across the unit, the ensuing discussions were very interesting. One conclusion was that Parker and Weir’s concerns regarding the type II error rate may well be under-considered in multi-arm studies. Are we potentially ruling out findings by being overly conservative with adjustments? This brings us back full circle to the Althouse paper and the idea that such concerns about the rights and wrongs of adjustments and hypothesis testing get in the way of the ‘bigger picture’. An interesting point was made in the discussion session by Dr Graham Wheeler, who wondered whether it was worth delving back into the history of why adjustments were required in the first place. Was it related to observational studies which explored tens or hundreds of endpoints before obtaining a ‘significant’ relationship? One thing was clear: whatever your position on the topic, it is essential to state clearly in advance, in the Statistical Analysis Plan (SAP), what you plan to do, even if that is to state that no adjustments will take place.
This is an interesting topic and one I plan to investigate in more detail. It would be interesting to establish the specific scenarios where a consensus for adjustment in multiple testing has and has not been reached. Once that has been established, a further plan would be to investigate why we currently do not have a consensus under certain scenarios (and whether one can ever be reached).
By N A Johnson, ICTU Clinical Trial Statistician
A. D. Althouse. Adjust for Multiple Comparisons? It's Not That Simple. Ann Thorac Surg. 2016 May;101(5):1644-5.
R. A. Parker, C. J. Weir. Non-adjustment for multiple testing in multi-arm trials of distinct treatments: Rationale and justification. Clin Trials. 2020 Oct;17(5):562-566.
F. Bretz, F. Koenig. Commentary on Parker and Weir. Clin Trials. 2020 Oct;17(5):567-569.
A. Dmitrienko, F. Bretz, P. H. Westfall, et al. Multiple testing methodology. In: A. Dmitrienko, A. C. Tamhane, F. Bretz (eds), Multiple Testing Problems in Pharmaceutical Statistics. 1st ed. Chapman & Hall/CRC Biostatistics Series, Boca Raton, Florida, USA, 2010. Pages 35-41.