Multiplicity Control in Oncology Trials

Definition

Multiplicity arises whenever a clinical trial tests more than one hypothesis and the trial may be declared positive if any test reaches significance. Without adjustment, the probability of at least one false-positive finding (the family-wise error rate, FWER) exceeds the nominal alpha level.

"Failure to account for multiplicity when there are several clinical endpoints evaluated in a study can lead to false conclusions regarding the effects of the drug. The regulatory concern regarding multiplicity arises principally in the evaluation of clinical trials intended to demonstrate effectiveness and support drug approval." — FDA Multiple Endpoints Guidance (draft January 2017; finalized October 2022), §III

Sources of multiplicity in oncology trials:

Source                            Example                              Typical magnitude
Multiple endpoints                PFS + OS + ORR + PRO                 2–6 hypotheses
Multiple populations              ITT + biomarker-positive subgroup    2–3 populations
Multiple treatment arms / doses   High dose vs. low dose vs. control   2–4 comparisons
Interim analyses                  1–3 interim looks for efficacy       2–4 analyses
Multiple time points              12-month vs. 24-month landmark       Rarely >2

FWER vs. FDR:

  • FWER (Family-Wise Error Rate): Probability of at least one false rejection among all tested hypotheses. FDA requires strong FWER control at alpha = 0.05 (two-sided) or 0.025 (one-sided) for confirmatory trials. This is the standard for Phase 3 oncology submissions.
  • FDR (False Discovery Rate): Expected proportion of false rejections among all rejections. More liberal than FWER; appropriate for exploratory biomarker screening (e.g., testing 200 genes for differential expression) but not acceptable for confirmatory efficacy endpoints in registrational trials.
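The inflation described above is easy to quantify. A base-R sketch, assuming k independent tests each at unadjusted alpha = 0.05:

```r
# Probability of at least one false positive among k independent tests,
# each at unadjusted alpha = 0.05: FWER = 1 - (1 - alpha)^k
alpha <- 0.05
k <- 1:6
fwer <- 1 - (1 - alpha)^k
round(fwer, 3)
# rises from 0.05 (k = 1) to about 0.26 (k = 6)
```

With six unadjusted endpoints, roughly one trial in four would produce a false-positive claim under the global null.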

"An important principle for controlling multiplicity is to prospectively specify all planned endpoints, time points, analysis populations, and analyses." — FDA Multiple Endpoints Guidance (draft January 2017; finalized October 2022), §III

Regulatory Position

FDA requires strong FWER control for all confirmatory claims:

  • Primary endpoints supporting the efficacy claim for approval
  • Key secondary endpoints intended for labeling with inferential language
  • Co-primary endpoints (both must be significant, or alpha must be split)
  • Pre-specified subgroup analyses used for labeling (e.g., biomarker-positive population)
  • Interim analyses that can lead to early stopping for efficacy

FDA does not require multiplicity adjustment for:

  • Exploratory endpoints (descriptive only; p-values should not be presented as confirmatory)
  • Sensitivity analyses labeled as "supportive"
  • Safety analyses (unless pre-specified safety endpoints in dedicated safety studies)
  • Post hoc subgroup analyses (hypothesis-generating only)

"Although post hoc analyses of trials that fail on their prospectively specified endpoints may be useful for generating hypotheses for future testing, they do not yield definitive results... post hoc analyses by themselves cannot establish effectiveness." — FDA Multiple Endpoints Guidance (draft January 2017; finalized October 2022)

ICH E8(R1) reinforces that confirmatory studies must evaluate clinical endpoints relevant to disease burden with pre-specified statistical methods to ensure scientific rigor. (ICH E8(R1) General Considerations for Clinical Studies, adopted October 2021, final guideline)

Source: FDA Multiple Endpoints Guidance (draft January 2017; finalized October 2022) = Final; ICH E8(R1) General Considerations for Clinical Studies (October 2021) = Final

Methods for Multiplicity Control

1. Bonferroni Method (Single-Step)

The simplest and most conservative approach. Divide alpha equally across k hypotheses: each tested at alpha/k.

Procedure:

For k hypotheses at overall alpha = 0.05:
  Reject H_i if p_i <= alpha/k

Example (k = 2, co-primary PFS + OS):
  Test PFS at alpha = 0.025
  Test OS  at alpha = 0.025
  Both must achieve p < 0.025 for their respective claims

Properties:

  • Strongly controls FWER under any dependence structure
  • Conservative when endpoints are positively correlated (PFS and OS typically rho = 0.4-0.6)
  • Power loss: each endpoint tested at reduced alpha
  • Unequal splits allowed (e.g., alpha_1 = 0.04, alpha_2 = 0.01) if clinically justified

R implementation:

p_values <- c(0.018, 0.032)
p.adjust(p_values, method = "bonferroni")
# [1] 0.036 0.064  → PFS significant, OS not significant

2. Holm Procedure (Step-Down)

A step-down refinement of Bonferroni: uniformly at least as powerful, with the same assumption-free FWER control, and cited in FDA guidance as a standard improvement over Bonferroni.

Procedure:

  1. Order p-values from smallest to largest: p_(1) <= p_(2) <= ... <= p_(k)
  2. Reject H_(1) if p_(1) <= alpha/k
  3. Reject H_(2) if p_(2) <= alpha/(k-1)
  4. Continue until first non-rejection; retain all subsequent hypotheses

Example (3 endpoints, alpha = 0.05):

p-values: p_PFS = 0.008, p_OS = 0.022, p_ORR = 0.045
Sorted:   0.008 <= 0.022 <= 0.045

Step 1: p_(1) = 0.008 <= 0.05/3 = 0.0167?  YES -> Reject H_PFS
Step 2: p_(2) = 0.022 <= 0.05/2 = 0.025?   YES -> Reject H_OS
Step 3: p_(3) = 0.045 <= 0.05/1 = 0.05?     YES -> Reject H_ORR

Conclusion: All three endpoints claimed
(Bonferroni would have failed OS: 0.022 > 0.0167)

Properties:

  • Strongly controls FWER under any dependence structure (no assumptions needed)
  • Always at least as powerful as Bonferroni; often strictly more powerful
  • Step-down: starts with the most significant hypothesis
  • Recommended by FDA as a default improvement over Bonferroni

R implementation:

p_values <- c(PFS = 0.008, OS = 0.022, ORR = 0.045)
p.adjust(p_values, method = "holm")
#   PFS    OS   ORR
# 0.024 0.044 0.045  → all significant at alpha = 0.05

3. Hochberg Procedure (Step-Up)

More powerful than Holm when hypotheses are independent or positively correlated (common for PFS/OS in oncology).

Procedure:

  1. Order p-values from largest to smallest: p_(k) >= p_(k-1) >= ... >= p_(1)
  2. If p_(k) <= alpha, reject all k hypotheses and stop
  3. Otherwise, check p_(k-1) <= alpha/2; if so, reject H_(1) through H_(k-1) and stop
  4. Continue toward smaller p-values, comparing p_(j) against alpha/(k - j + 1);
     at the first j that passes, reject H_(1) through H_(j)

Example (PFS and OS, alpha = 0.05):

p_PFS = 0.03, p_OS = 0.04
Sorted (descending): p_(2) = 0.04, p_(1) = 0.03

Step 1: p_(2) = 0.04 <= 0.05?  YES -> Reject BOTH H_PFS and H_OS

Comparison:
  Bonferroni: p_PFS = 0.03 > 0.025 -> NEITHER significant
  Holm:       p_(1) = 0.03 > 0.025 -> stop; NEITHER significant
  Hochberg:   p_(2) = 0.04 <= 0.05 -> BOTH significant

Properties:

  • More powerful than Holm for positively correlated endpoints
  • Valid under independence or positive dependence (Simes' inequality condition)
  • Caution: Not valid under arbitrary negative correlation (rare in oncology)
  • Particularly useful for PFS + OS testing (strong positive correlation)

R implementation:

p_values <- c(PFS = 0.03, OS = 0.04)
p.adjust(p_values, method = "hochberg")
#  PFS   OS
# 0.04 0.04  → both significant at alpha = 0.05

4. Fixed-Sequence (Hierarchical) Testing

Test endpoints in a pre-specified clinically motivated order, each at the full alpha level. Proceed to the next hypothesis only if the current one is rejected.

Procedure:

H_1 -> H_2 -> H_3 -> ... (pre-specified order)
Test H_1 at alpha = 0.05
  If rejected: test H_2 at alpha = 0.05
    If rejected: test H_3 at alpha = 0.05
      ...
  If NOT rejected: STOP. No subsequent claims.
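The stopping rule above can be sketched in a few lines of base R (the helper name is illustrative, not a package function):

```r
# Fixed-sequence testing: each hypothesis at full alpha,
# stopping at the first non-rejection
fixed_sequence <- function(p, alpha = 0.05) {
  rejected <- logical(length(p))
  for (i in seq_along(p)) {
    if (p[i] <= alpha) rejected[i] <- TRUE else break  # chain broken here
  }
  setNames(rejected, names(p))
}

fixed_sequence(c(PFS = 0.01, OS = 0.03, ORR = 0.20))
# PFS and OS rejected; ORR not claimable (0.20 > 0.05 stops the chain)
```

Note the risk described below: a single early failure (e.g., p = 0.06 on the primary) blocks every downstream hypothesis regardless of how small its p-value is.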

Properties:

  • No alpha penalty — full alpha available at each step
  • Most power-efficient when the ordering matches expected significance
  • Risk: if primary fails unexpectedly, entire inference chain is lost
  • Strongly controls FWER regardless of correlation structure

Common oncology hierarchies:

Indication                Hierarchy
NSCLC targeted therapy    PFS -> OS -> ORR -> PRO
Colorectal / Pancreatic   OS -> PFS -> ORR
Breast adjuvant           iDFS -> OS -> PRO
Myeloma maintenance       PFS -> OS -> MRD negativity rate

5. Gatekeeping Strategies

Primary endpoint family "gates" the testing of secondary endpoints. Alpha from the primary family is recycled to secondary endpoints only after gating conditions are met.

Serial gatekeeping:

Family 1 (Primary): PFS [test at alpha = 0.025]
  IF significant -> Gate opens ->
Family 2 (Secondary): OS, ORR [test with recycled alpha]
  IF PFS not significant -> Gate closed; no secondary testing

Parallel gatekeeping (truncated Holm/Hochberg):

Family 1 (Primary): H_PFS_ITT, H_PFS_BIO [both tested; alpha recycled]
  IF at least one significant -> Gate partially opens ->
Family 2 (Secondary): OS [receives recycled alpha from rejected primaries]

Truncated Holm procedure for gatekeeping:

  • Reserve a fraction gamma of alpha from primary family for secondary family
  • Primary tested using Holm at (1 - gamma) * alpha
  • Secondary family receives gamma * alpha plus any alpha from rejected primary hypotheses
  • FDA recommendation: use truncated Holm or Hochberg to preserve some alpha for secondary endpoints even when not all primaries are rejected

R implementation:

library(graphicalMCP)

# Serial gatekeeper
# Primary: PFS (weight = 1.0)
# Secondary: OS (weight = 0.0, receives alpha only if PFS is rejected)
transitions <- rbind(c(0, 1),   # PFS -> OS: all alpha passes on rejection
                     c(0, 0))   # OS: terminal node, no onward transition
weights <- c(PFS = 1, OS = 0)
g <- graph_create(weights, transitions)
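The truncated Holm constants described above can also be sketched directly. One commonly cited form sets the critical constant for the i-th ordered p-value in the primary family to (gamma/(k-i+1) + (1-gamma)/k) * alpha; treat this as an assumption to verify against the procedure actually specified in the SAP:

```r
# Truncated Holm for the primary family: gamma = 1 recovers Holm,
# gamma = 0 recovers Bonferroni; intermediate gamma reserves alpha
# for the secondary family
truncated_holm <- function(p, alpha = 0.05, gamma = 0.5) {
  k <- length(p)
  o <- order(p)                      # test smallest p-value first
  rejected <- logical(k)
  for (i in seq_len(k)) {
    c_i <- (gamma / (k - i + 1) + (1 - gamma) / k) * alpha
    if (p[o[i]] <= c_i) rejected[o[i]] <- TRUE else break
  }
  rejected
}

p <- c(0.008, 0.022, 0.045)
truncated_holm(p, gamma = 1)   # full Holm: all three rejected
truncated_holm(p, gamma = 0)   # Bonferroni: only the smallest rejected
```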

6. Fallback Method

A hybrid of fixed-sequence and alpha splitting. Allows secondary endpoints to be tested at reduced alpha even if the primary fails.

Procedure:

Allocate alpha: alpha_1 (primary) + alpha_2 (secondary) = alpha_total

Scenario A: H_1 significant (p < alpha_1)
  -> Test H_2 at full alpha (alpha_1 + alpha_2 = alpha_total)

Scenario B: H_1 NOT significant (p >= alpha_1)
  -> Test H_2 at alpha_2 only (reduced power, but still testable)

Oncology example (PFS primary, OS secondary):

alpha_total = 0.05 (two-sided)
alpha_PFS = 0.04, alpha_OS = 0.01

If PFS p = 0.035 < 0.04:
  -> PFS significant; test OS at alpha = 0.05
  -> If OS p = 0.03 < 0.05: OS significant

If PFS p = 0.06 >= 0.04:
  -> PFS NOT significant; test OS at alpha = 0.01
  -> If OS p = 0.008 < 0.01: OS significant (at reduced alpha)
  -> Advantage: OS can still be claimed even though PFS failed

When to use: When the secondary endpoint (typically OS) may succeed independently and its claim is clinically important even without the primary. Common in IO combinations where PFS signals may be uncertain but OS benefit is expected.
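The two scenarios above can be wired up in a few lines (helper name illustrative, using the same alpha split and strict inequalities as the worked example):

```r
# Fallback: alpha_OS is reserved for OS even if PFS fails;
# if PFS succeeds, OS inherits the full alpha
fallback_test <- function(p_pfs, p_os, a_pfs = 0.04, a_os = 0.01) {
  pfs_sig <- p_pfs < a_pfs
  os_sig  <- if (pfs_sig) p_os < a_pfs + a_os else p_os < a_os
  c(PFS = pfs_sig, OS = os_sig)
}

fallback_test(0.035, 0.03)   # Scenario A: both significant
fallback_test(0.06, 0.008)   # Scenario B: OS alone, at reduced alpha
```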

7. Graphical Approach (Bretz et al.)

The graphical approach, developed by Bretz, Maurer, and Hommel (2009), generalizes all Bonferroni-based procedures into a single visual framework. It is FDA's recommended method for complex multiplicity structures.

Components:

  1. Nodes (vertices): Each node represents a hypothesis H_i with an initial weight w_i (fraction of alpha allocated). Weights must sum to 1: sum(w_i) = 1.

  2. Transition matrix G: A k x k matrix where g_ij is the fraction of alpha from H_i that transfers to H_j upon rejection of H_i. Row sums must equal 1 (or 0 if no transitions from that node). g_ii = 0 always.

  3. Weight vector w: Initial alpha allocation. Alpha for H_i = w_i * alpha_total.

Step-by-step algorithm:

INITIALIZATION:
  Set alpha_i = w_i * alpha_total for each hypothesis i = 1, ..., k
  Set transition matrix G = [g_ij]

ITERATION:
  1. Select any hypothesis H_j with p_j <= alpha_j
     (If none exists, STOP — no further rejections)

  2. Reject H_j. Remove node j from the graph.

  3. UPDATE remaining hypotheses:
     For each remaining H_i (i != j):
       alpha_i_new = alpha_i + alpha_j * g_ji
       (H_i receives the fraction g_ji of H_j's alpha)

  4. UPDATE transition matrix for remaining nodes:
     For each pair (i, l) where i != j and l != j:
       g_il_new = (g_il + g_ij * g_jl) / (1 - g_ij * g_ji)
       if denominator = 0, set g_il_new = 0

  5. Return to Step 1.

Example: Co-primary PFS + OS with secondary ORR (3-node graph):

Nodes and initial weights:
  H_PFS: w = 0.5  (alpha = 0.025)
  H_OS:  w = 0.5  (alpha = 0.025)
  H_ORR: w = 0.0  (alpha = 0.000)

Transition matrix G:
         To_PFS  To_OS  To_ORR
From_PFS [  0     0.5    0.5  ]
From_OS  [  0.5   0      0.5  ]
From_ORR [  0.5   0.5    0    ]

Interpretation:
  - If PFS rejected: half its alpha goes to OS, half to ORR
  - If OS rejected: half its alpha goes to PFS, half to ORR
  - ORR can only accumulate alpha from rejected primary endpoints

Scenario: p_PFS = 0.01, p_OS = 0.04, p_ORR = 0.02
  Step 1: alpha_PFS = 0.025; p_PFS = 0.01 < 0.025 -> Reject H_PFS
  Update: alpha_OS  = 0.025 + 0.025*0.5 = 0.0375
          alpha_ORR = 0.000 + 0.025*0.5 = 0.0125
  Step 2: alpha_OS = 0.0375; p_OS = 0.04 > 0.0375 -> Cannot reject H_OS
          alpha_ORR = 0.0125; p_ORR = 0.02 > 0.0125 -> Cannot reject H_ORR
  Result: Only PFS claimed
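The full iteration can be sketched in base R (a toy implementation of the Bretz et al. update rules, for illustration only; use gMCP/graphicalMCP in a regulatory setting):

```r
# Graphical procedure: w = initial weights, G = transition matrix
graph_test <- function(p, w, G, alpha = 0.05) {
  k <- length(p)
  active <- rep(TRUE, k)
  a <- w * alpha
  repeat {
    j <- which(active & p <= a)[1]      # any rejectable hypothesis
    if (is.na(j)) break                 # none left -> stop
    active[j] <- FALSE                  # reject H_j, remove node
    for (i in which(active)) a[i] <- a[i] + a[j] * G[j, i]  # recycle alpha
    Gn <- G
    for (i in which(active)) for (l in which(active)) {
      if (i == l) next
      d <- 1 - G[i, j] * G[j, i]
      Gn[i, l] <- if (d > 0) (G[i, l] + G[i, j] * G[j, l]) / d else 0
    }
    G <- Gn
    a[j] <- 0
  }
  !active                               # TRUE = rejected
}

G <- rbind(c(0,   0.5, 0.5),
           c(0.5, 0,   0.5),
           c(0.5, 0.5, 0))
graph_test(p = c(0.01, 0.04, 0.02), w = c(0.5, 0.5, 0), G = G)
# Only H_PFS is rejected, matching the worked scenario above
```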

Key properties:

  • Subsumes Bonferroni, Holm, fixed-sequence, fallback, and gatekeeping as special cases
  • Maintains strong FWER control under any dependence structure (Bonferroni-based)
  • Visually intuitive — facilitates communication with clinical teams, regulators, and DMCs
  • Iterative algorithm is straightforward to implement and verify
  • Can incorporate parametric extensions for additional power (weighted Simes, Dunnett)

Common Oncology Scenarios

Scenario 1: Co-Primary OS + PFS

Used when regulatory approval requires demonstration on both time-to-event endpoints.

Graph: 2-node loop
  H_OS  (w = 0.5) <---> H_PFS (w = 0.5)
  Transition: g_OS->PFS = 1.0, g_PFS->OS = 1.0

Testing:
  - Each starts at alpha = 0.025
  - If either rejected, full alpha (0.05) passes to the other
  - Equivalent to Holm procedure for 2 hypotheses

Power considerations:
  - Individual power 90% each, rho(PFS,OS) = 0.5
  - Joint power: ~83-85%
  - Sample size: ~15-20% inflation over single-primary design
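The joint-power figures above can be spot-checked with a Monte Carlo sketch (base R, bivariate-normal approximation for the two test statistics; numbers are simulation-based, not exact):

```r
# Joint power for two co-primary endpoints, each with marginal power 0.90
# at one-sided alpha = 0.025, with test statistics correlated at rho
joint_power <- function(power_each = 0.90, rho = 0.5,
                        alpha = 0.025, n_sim = 1e6) {
  z_a   <- qnorm(1 - alpha)
  delta <- qnorm(power_each) + z_a      # standardized effect per endpoint
  z1 <- rnorm(n_sim)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n_sim)
  mean(z1 + delta > z_a & z2 + delta > z_a)  # both boundaries crossed
}

set.seed(1)
joint_power(rho = 0.5)   # roughly 0.83 (0.81 under independence)
```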

Examples: IO combination trials (pembrolizumab + chemotherapy in NSCLC); some myeloma trials requiring both PFS and OS.

Scenario 2: Biomarker Subgroup Fallback

Testing treatment effect in both ITT and biomarker-positive subgroup, with potential for alpha recycling.

Graph: 3-node with fallback
  H_BIO (w = 0.5) <-----> H_ITT (w = 0.5)   (alpha recycles between the two PFS hypotheses)
  H_OS  (w = 0.0)  receives alpha from both H_BIO and H_ITT upon rejection

Transition matrix:
          To_BIO  To_ITT  To_OS
From_BIO [  0      0.5     0.5 ]
From_ITT [  0.5    0       0.5 ]
From_OS  [  0.5    0.5     0   ]

Strategy:
  1. Test PFS in biomarker+ (alpha = 0.025) and PFS in ITT (alpha = 0.025)
  2. If biomarker+ significant: recycle alpha to ITT and OS
  3. If ITT significant: recycle alpha to biomarker+ and OS
  4. OS testable only after at least one PFS hypothesis rejected

Examples: KEYNOTE-024/189 (pembrolizumab, PD-L1 stratified); CheckMate-227 (nivolumab + ipilimumab, TMB subgroup).

Scenario 3: Multiple Doses (Dunnett-Type Comparisons)

Testing multiple dose levels vs. control in dose-finding confirmatory trials.

Graph: 2-node (one comparison per dose vs. shared control)
  H_high (w = 0.5) <-----> H_low (w = 0.5)
  Each node: treatment vs. control comparison

Transition: g_high->low = 1.0, g_low->high = 1.0

Dunnett test alternative:
  Exploit the known correlation structure induced by the shared control arm:
  rho_ij = sqrt( n_i * n_j / ((n_i + n_0)(n_j + n_0)) ), where n_0 is the
  control-arm size; for a balanced design (n_i = n_j = n_0), rho = 0.5

  Dunnett-adjusted critical values are less conservative
  than Bonferroni because they account for the correlation
  from the shared control group.

R implementation:

library(multcomp)
# Dunnett's test for multiple doses vs. control
# (dose_group must be a factor with the control arm as its first level)
fit <- aov(outcome ~ dose_group, data = trial_data)
dunnett <- glht(fit, linfct = mcp(dose_group = "Dunnett"))
summary(dunnett)  # Adjusted p-values accounting for correlation
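The claim that Dunnett bounds are tighter than Bonferroni's can be checked by simulation for the balanced two-dose case (rho = 0.5). This base-R sketch simulates null z-statistics that share one control arm:

```r
# Two treatment-vs-control contrasts sharing a control arm are
# correlated at 0.5 under balanced allocation
set.seed(42)
n_sim <- 1e6
z0 <- rnorm(n_sim)                 # control arm
z1 <- rnorm(n_sim); z2 <- rnorm(n_sim)  # two dose arms
t1 <- (z1 - z0) / sqrt(2)
t2 <- (z2 - z0) / sqrt(2)

c_dunnett <- quantile(pmax(t1, t2), 1 - 0.025)  # simulated critical value
c_bonf    <- qnorm(1 - 0.025 / 2)               # Bonferroni: ~2.2414
# c_dunnett falls below c_bonf, so Dunnett is less conservative
```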

Examples: Dose-ranging Phase 2/3 trials in oncology; adaptive dose selection with multiplicity control.

Scenario 4: Hierarchical PFS -> OS with Interim Analyses

The most common oncology multiplicity structure: fixed-sequence primary PFS -> key secondary OS, with group sequential interim analyses.

Structure:
  PFS tested at interim(s) and final using alpha spending function
  OS tested only after PFS is significant (fixed-sequence gate)

Alpha management:
  Total alpha = 0.05 two-sided (equivalently, 0.025 one-sided)
  PFS alpha spending: O'Brien-Fleming via Lan-DeMets
    Interim 1 (50% info): boundary alpha ~ 0.003
    Interim 2 (75% info): boundary alpha ~ 0.018
    Final (100% info):    boundary alpha ~ 0.043 (adjusted)

  OS: Full alpha = 0.025 available only after PFS is rejected
  OS may also have its own interim analyses with separate spending

Interaction:
  - Alpha spent at PFS interims reduces PFS final boundary
  - But does NOT reduce OS alpha (OS uses full 0.025 after PFS gate opens)
  - OS interim analyses have their own spending function

R implementation:

library(gsDesign)
# PFS group sequential design
pfs_design <- gsDesign(
  k = 3,           # 2 interims + final
  test.type = 1,   # One-sided efficacy testing
  alpha = 0.025,   # One-sided Type I error
  beta = 0.10,     # 90% power
  sfu = sfLDOF     # Lan-DeMets O'Brien-Fleming spending
)
pfs_design$upper$bound  # Efficacy boundaries (z-scale)

Graphical Approach: Detailed Mechanics

Weight Matrix

The weight vector w = (w_1, w_2, ..., w_k) specifies the initial alpha allocation:

alpha_i = w_i * alpha_total

Constraints:
  - 0 <= w_i <= 1 for all i
  - sum(w_i) = 1
  - Nodes with w_i = 0 start with no alpha (must receive alpha from other rejections)

Choosing weights:

  • Equal weights (w_i = 1/k) for equally important hypotheses
  • Asymmetric weights when clinical priority differs: allocate more alpha to the hypothesis most likely to succeed or most clinically important
  • Zero-weight nodes for secondary/exploratory hypotheses that can only be tested after primary rejections

Transition Matrix

The transition matrix G is a k x k matrix governing alpha redistribution:

G = [g_ij]  where g_ij = fraction of alpha_i passed to H_j when H_i is rejected

Constraints:
  - 0 <= g_ij <= 1
  - g_ii = 0 (no self-loops)
  - sum_j(g_ij) <= 1 for each row (usually = 1; <1 means alpha is "lost")

Common patterns:

Pattern          Transition structure                      Use case
Fixed-sequence   g_12 = 1, all others 0                    PFS -> OS hierarchy
Full loop        g_12 = g_21 = 1                           Co-primary with recycling
Fallback         g_12 = 1, g_21 = epsilon                  Primary -> secondary with fallback
Star             g_1j = 1/(k-1) for all j != 1             Primary distributes equally
Gatekeeping      Block diagonal with cross-family edges    Primary family gates secondary family

Graph Update Algorithm (Formal)

When hypothesis H_j is rejected, the graph is updated as follows:

For all remaining hypotheses i (i != j):
  1. Update alpha:
     alpha_i := alpha_i + g_ji * alpha_j

  2. Update transitions (for all remaining l != j, l != i):
     g_il := (g_il + g_ij * g_jl) / (1 - g_ij * g_ji)

     Special case: if g_ij * g_ji = 1, set g_il := 0

  3. Remove node j and all edges to/from j
     (no renormalization step is needed; the update formula keeps
      row sums <= 1 automatically)

This update ensures that alpha flows through removed nodes are preserved. The denominator (1 - g_ij * g_ji) accounts for potential loops between nodes i and j.

R Packages

gMCP (Classical Implementation)

library(gMCP)

# Define a 3-hypothesis graph
hypotheses <- c("H_PFS" = 0.5, "H_OS" = 0.5, "H_ORR" = 0)
transitions <- rbind(
  H_PFS = c(0,   0.5, 0.5),
  H_OS  = c(0.5, 0,   0.5),
  H_ORR = c(0.5, 0.5, 0)
)
g <- matrix2graph(transitions, hypotheses)

# Test with observed p-values
pvalues <- c(H_PFS = 0.01, H_OS = 0.04, H_ORR = 0.02)
result <- gMCP(g, pvalues, alpha = 0.05)
result@rejected  # Which hypotheses are rejected

graphicalMCP (Modern Interface)

library(graphicalMCP)

# Create graph
g <- graph_create(
  hypotheses = c(0.5, 0.5, 0),
  transitions = rbind(
    c(0,   0.5, 0.5),
    c(0.5, 0,   0.5),
    c(0.5, 0.5, 0)
  )
)

# Sequential (shortcut) test
graph_test_shortcut(
  graph = g,
  p = c(0.01, 0.04, 0.02),
  alpha = 0.05
)

# Closure-based test (more powerful with parametric extensions)
graph_test_closure(
  graph = g,
  p = c(0.01, 0.04, 0.02),
  alpha = 0.05
)

# Power simulation
graph_calculate_power(
  graph = g,
  alpha = 0.05,
  sim_n = 1e5,
  power_marginal = c(0.9, 0.8, 0.7)
)

multcomp (Dunnett and General Contrasts)

library(multcomp)

# Dunnett's procedure: multiple doses vs. shared control
trial_data$arm <- factor(trial_data$arm)   # mcp() needs a named factor term
fit <- lm(response ~ arm, data = trial_data)
mc <- glht(fit, linfct = mcp(arm = "Dunnett"))
summary(mc)           # Adjusted p-values
confint(mc)           # Simultaneous confidence intervals

# Comparison of methods:
# Bonferroni:  p.adjust(p, method = "bonferroni")
# Holm:        p.adjust(p, method = "holm")
# Hochberg:    p.adjust(p, method = "hochberg")
# Hommel:      p.adjust(p, method = "hommel")
# BH (FDR):    p.adjust(p, method = "BH")  # exploratory only

SAP Template: Complete Graphical Multiplicity Procedure

9. MULTIPLICITY CONTROL

9.1 Overview

This trial tests [k] hypotheses across [describe: endpoints, populations, doses].
The overall Type I error rate is controlled at alpha = 0.05 (two-sided) using
the graphical approach of Bretz, Maurer, and Hommel (2009). Strong control of
the family-wise error rate (FWER) is maintained under the closed testing principle.

9.2 Hypotheses

The hypotheses to be tested are:
  H_1: No treatment effect on [endpoint 1] in [population 1]
  H_2: No treatment effect on [endpoint 2] in [population 1]
  H_3: No treatment effect on [endpoint 1] in [population 2]
  [...]

9.3 Initial Alpha Allocation (Weight Vector)

The initial weights and alpha allocation are:
  H_1: w_1 = [X], alpha_1 = [X] * 0.05 = [value]
  H_2: w_2 = [X], alpha_2 = [X] * 0.05 = [value]
  H_3: w_3 = [X], alpha_3 = [X] * 0.05 = [value]
  [...]
  Total: sum(w_i) = 1.0

Justification: [Clinical rationale for the allocation, e.g., "H_1 (PFS in
biomarker+ population) receives the largest weight because this population
has the strongest biologic rationale for treatment benefit."]

9.4 Transition Matrix

The transition matrix governing alpha redistribution upon hypothesis rejection is:

             H_1    H_2    H_3
  H_1    [   0     g_12   g_13  ]
  H_2    [  g_21    0     g_23  ]
  H_3    [  g_31   g_32    0    ]

where:
  g_12 = [value]: Upon rejection of H_1, [fraction] of its alpha passes to H_2
  g_13 = [value]: Upon rejection of H_1, [fraction] of its alpha passes to H_3
  [...]

The graph is depicted in Figure [X].

9.5 Testing Algorithm

The graphical testing procedure proceeds as follows:
  1. Test each hypothesis H_i at its currently allocated alpha_i.
  2. If any H_j has p_j <= alpha_j, reject H_j.
  3. Update the graph: redistribute alpha_j to remaining hypotheses
     according to the transition matrix, using the update formulas
     of Bretz et al. (2009).
  4. Repeat until no further hypotheses can be rejected.

The algorithm is implemented using the graphicalMCP R package (version [X])
and independently verified using gMCP.

9.6 Interaction with Interim Analyses

Interim analyses for [endpoint] use the [Lan-DeMets O'Brien-Fleming / Hwang-Shih-DeCani]
alpha spending function. Alpha spent at interim analyses is deducted from the
hypothesis-specific alpha within the graphical procedure.

The interim analysis schedule, information fractions, and spending function
are specified in Section [X] of this SAP.

9.7 Decision Rules

The trial will be declared positive for H_i if and only if H_i is rejected
by the graphical testing procedure at the pre-specified overall alpha = 0.05
(two-sided). Any hypothesis not rejected by the procedure will be reported
as "not statistically significant" regardless of the unadjusted p-value.

Adjusted p-values consistent with the graphical procedure will be computed
using the sequential rejection algorithm and reported in the CSR.

Limitations and Pitfalls

1. Fixed-sequence inflexibility: If the primary endpoint fails unexpectedly (e.g., PFS HR = 0.82, p = 0.08), no formal inference is possible for secondary endpoints — even if OS HR = 0.72, p = 0.005. The fallback or graphical approach mitigates this risk.

2. Post hoc reordering invalidates FWER control: Changing endpoint ordering, alpha allocation, or graph structure after unblinding is not acceptable. FDA: "presenting p-values from descriptive analyses is inappropriate because doing so would imply a statistically rigorous conclusion."

3. Hochberg invalid under negative correlation: The Hochberg step-up procedure requires independence or positive dependence. Negatively correlated endpoints (rare in oncology, but possible with competing risk endpoints) require Holm or Bonferroni instead.

4. Co-primary power loss underestimated: Joint power for two co-primary endpoints with individual power of 85% each is approximately 72% (assuming independence) to 78% (assuming rho = 0.5). Sample size must account for this reduction.

5. Graphical complexity without FDA pre-agreement: Graphs with 4+ nodes and complex transition rules should be discussed with FDA (Type C meeting) before SAP finalization. Complex graphs may be misinterpreted by reviewers without advance discussion.

6. Alpha spending interaction with multiplicity: Interim analyses consume alpha within each hypothesis. If PFS has 3 interim looks using O'Brien-Fleming spending, the cumulative alpha spent at interims must be tracked within the graphical procedure. The remaining PFS alpha at the final analysis is reduced, but OS alpha (gated behind PFS) remains at full 0.025.

7. Subgroup multiplicity: Testing more than 3 pre-specified subgroups without multiplicity adjustment inflates FWER. Only the primary biomarker-defined subgroup should receive formal alpha allocation; additional subgroups are exploratory.

8. FDR used inappropriately in confirmatory settings: FDR (Benjamini-Hochberg) controls the expected proportion of false discoveries, not the probability of any false discovery. This is inappropriate for registrational endpoints where each individual claim must be controlled. FDR is appropriate only for exploratory biomarker analyses.


Source: FDA Guidance for Industry — Multiple Endpoints in Clinical Trials (draft January 2017; finalized October 2022); Bretz, Maurer, and Hommel (2009), "A Graphical Approach to Sequentially Rejective Multiple Test Procedures," Statistics in Medicine; ICH E8(R1) General Considerations for Clinical Studies (October 2021, final guideline). Compiled from retrieved FDA guidance excerpts and the literature on graphical multiplicity methods.