Multiple Endpoints and Alpha Allocation

Definition

Clinical trials often test effects on more than one endpoint. When multiple hypothesis tests are conducted and the trial is declared positive if any one test is significant, the Type I error rate is inflated beyond the nominal 5% level — this is the multiplicity problem.
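
As a rough illustration of the inflation, assuming k independent tests each run at the nominal level, the chance of at least one false positive is 1 − (1 − α)^k. Real endpoints are usually correlated, which reduces the inflation but does not remove it. The function name `fwer_independent` is illustrative only.

```r
# Family-wise error rate for k independent tests, each run at level alpha.
# Independence is an assumption made purely for illustration.
fwer_independent <- function(k, alpha = 0.05) {
  1 - (1 - alpha)^k
}

k <- c(1, 2, 3, 5)
fwer <- fwer_independent(k)
print(round(fwer, 4))
```

With two independent tests at 0.05 the family-wise rate is already 1 − 0.95² = 0.0975, nearly double the nominal level.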

"Failure to account for multiplicity when there are several clinical endpoints evaluated in a study can lead to false conclusions regarding the effects of the drug. The regulatory concern regarding multiplicity arises principally in the evaluation of clinical trials intended to demonstrate effectiveness and support drug approval." — FDA Multiple Endpoints Guidance (Final, January 2017), §III

"An important principle for controlling multiplicity is to prospectively specify all planned endpoints, time points, analysis populations, and analyses." — FDA Multiple Endpoints Guidance, §III

Endpoint hierarchy (FDA framework):

Primary endpoints — essential to establish effectiveness for approval; must be pre-specified with multiplicity control
Secondary endpoints — support primary evidence and/or demonstrate additional effects; key secondaries require alpha allocation if inferential claims are made
Exploratory endpoints — hypothesis-generating; no formal Type I error control required; p-values are descriptive only

Post hoc adjustments are unacceptable: "Post hoc analyses of trials that fail on their prospectively specified endpoints may be useful for generating hypotheses for future testing, they do not yield definitive results... post hoc analyses by themselves cannot establish effectiveness." (FDA 2017)

Status: FDA Multiple Endpoints Guidance (01/2017) = Final

Regulatory Position

FDA requires prospective alpha control for:

All primary endpoint claims supporting efficacy
Any secondary endpoint claim intended to appear in FDA-approved labeling with inferential language
Co-primary endpoints (both must be significant for the trial to be declared positive) and multiple primary endpoints (success on any one suffices, so alpha is split)
Key secondary endpoints tested in hierarchical sequence
All pre-specified subgroup claims (biomarker-enriched populations, ITT + subgroup co-primary)

FDA does not require formal Type I error control for:

Exploratory endpoints
Descriptive analyses after the trial has "won" on pre-specified primary
Safety analyses (except pre-specified safety endpoints in dedicated safety studies)
Sensitivity analyses labeled as "supportive" (not confirmatory)

SAP requirement: The full testing strategy, including endpoint order, alpha allocation, and any interim analysis alpha spending, must be locked in the Statistical Analysis Plan before database lock/unblinding. Post-hoc changes are not permitted.

When to Use Each Strategy

Fixed-sequence (hierarchical testing): Most common in oncology Phase 3 trials.

Use when: primary endpoint is PFS and OS is key secondary; or OS is primary and PRO/ORR are hierarchical secondaries
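Advantage: no alpha penalty provided the testing order is pre-specified; the full alpha is available at each step
Disadvantage: if the primary fails, no formal inference can be made on any later endpoint
Power: alpha is never split, so multiplicity adds no sample-size inflation; power for each later endpoint is conditional on every earlier endpoint succeeding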

Co-primary strategy: Both endpoints must independently achieve significance (most conservative).

Use when: regulatory approval requires demonstration on two independent outcomes (e.g., PFS AND OS both required; or ITT AND biomarker-positive)
No alpha adjustment is needed: each endpoint is tested at the full alpha (0.05 two-sided), because requiring both to succeed cannot inflate the Type I error rate
Power cost: the trial must be powered for both endpoints jointly → larger total sample size needed
Example: each endpoint has 80% power → joint power is 64% if the endpoints are independent and rises toward 80% as their correlation increases

Prospective alpha allocation (Bonferroni or Holm-Hochberg): Alpha split across endpoints.

Use when: testing two endpoints in parallel where both contribute to the approval claim
Example: split alpha 0.04/0.01 (primary/secondary) or 0.025/0.025 (equal co-primary)
Holm is uniformly more powerful than a fixed Bonferroni split under any correlation; Hochberg is more powerful still but assumes independent or positively dependent test statistics

Gatekeeping (serial and parallel): Primary endpoints "gate" the testing of secondary endpoints.

Serial gatekeeping: Secondary tested only if ALL primaries significant
Parallel gatekeeping: Secondary tested if AT LEAST ONE primary significant; allows "alpha recycling" from successful tests
Use when: PFS is primary gate for OS; or two populations (biomarker+/all-comers) gate each other
Most flexible for complex trial designs; requires clear graph specification

Graphical approach: Visual framework for specifying hypothesis weights and transition matrices.

Use when: complex multiplicity structure with 3+ hypotheses, multiple populations, or conditional gating
R packages: gMCP, graphicalMCP
FDA expectation: full graph with node weights and transition matrices pre-specified in SAP

Design Considerations

1. Fixed-Sequence Method

Test endpoints in pre-specified order at the same alpha level (α = 0.05, two-sided). Proceed to next endpoint only if current endpoint is significant. Stop at first non-significant test — no subsequent test can be declared significant.

Primary:     PFS (α = 0.05) → if significant (p < 0.05) →
Secondary 1: OS  (α = 0.05) → if significant →
Secondary 2: ORR (α = 0.05) → if significant →
Secondary 3: PRO (α = 0.05)

If PFS: p = 0.045 → PASS (claim PFS)
        Then test OS: p = 0.03 → PASS (claim OS)
        Then test ORR: p = 0.10 → FAIL (no ORR claim)

If PFS: p = 0.08 → FAIL (no PFS claim; stop — cannot claim OS, ORR, PRO)
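
The walk-through above can be sketched in a few lines of R. This is a minimal illustration, not a validated analysis routine: the function name `fixed_sequence` is hypothetical, the strict `<` comparison follows the example, and the PRO p-values (and the OS/ORR values in the second call) are made up because the text does not give them.

```r
# Fixed-sequence test: walk the pre-specified order and stop at the first failure.
# p must already be sorted in the pre-specified testing order.
fixed_sequence <- function(p, alpha = 0.05) {
  passed <- p < alpha                  # strict "<", as in the example above
  first_fail <- match(FALSE, passed)   # NA when every test passes
  if (!is.na(first_fail)) {
    # Nothing at or after the first failure can be claimed
    passed[first_fail:length(p)] <- FALSE
  }
  passed
}

# PFS and OS pass, ORR fails, so PRO cannot be claimed despite its small p-value
res_pass <- fixed_sequence(c(PFS = 0.045, OS = 0.03, ORR = 0.10, PRO = 0.01))
# PFS fails, so the whole chain stops
res_fail <- fixed_sequence(c(PFS = 0.08, OS = 0.01, ORR = 0.02, PRO = 0.01))
print(res_pass)
print(res_fail)
```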

Key properties:

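No alpha inflation as long as the testing order is pre-specified before unblinding; clinical justification guides the choice of order but is not what controls the error rate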
Full alpha available at each step
Failure at any step terminates formal inference chain
Most power-efficient when primary is expected to succeed
Strongly controls family-wise error rate (FWER) at nominal α

Most common oncology application:

NSCLC/Myeloma pattern: PFS (primary) → OS (key secondary) → ORR → PRO
CRC/Pancreatic pattern: OS (primary) → PFS → ORR
Breast adjuvant pattern: iDFS (primary) → OS (key secondary) → PRO

2. Bonferroni Method

Divide alpha equally across k hypotheses: α_i = α/k.

Example: two primary endpoints, each tested at α = 0.025 (two-sided)

Reject H_i whenever its own p-value is ≤ 0.025; each endpoint is judged separately, so one can succeed while the other fails
Conservative (slightly under-powered vs. more sophisticated methods)
Valid under any correlation structure between endpoints
Most conservative when endpoints are strongly positively correlated (e.g., PFS and OS), where Holm or Hochberg recovers some power

R implementation:

# Simple Bonferroni alpha allocation
alpha_total <- 0.05
k_endpoints <- 2
alpha_per_endpoint <- alpha_total / k_endpoints  # 0.025 each

# Compare to Hochberg threshold (more powerful if positively correlated)
# See Section 4 below

3. Holm Procedure (Step-Down)

More powerful than Bonferroni. Order p-values from smallest to largest: p_(1) ≤ p_(2) ≤ ... ≤ p_(k).

Reject H_(1) if p_(1) ≤ α/k
Reject H_(2) if p_(2) ≤ α/(k-1)
Continue until first non-rejection; all subsequent hypotheses retained

Example (3 hypotheses, α = 0.05):

p-values: p_PFS = 0.015, p_OS = 0.025, p_ORR = 0.10
Sorted: 0.015 ≤ 0.025 ≤ 0.10

Step 1: Is p_(1) = 0.015 ≤ 0.05/3 = 0.0167? YES → Reject H_PFS (PFS significant)
Step 2: Is p_(2) = 0.025 ≤ 0.05/2 = 0.025? YES → Reject H_OS (OS significant)
Step 3: Is p_(3) = 0.10 ≤ 0.05/1 = 0.05? NO → Do not reject H_ORR (ORR not significant)

Conclusion: PFS and OS both claimed; ORR not claimed

Strongly controls family-wise error rate (FWER). Recommended by FDA as improvement over plain Bonferroni.
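
The three-hypothesis example can be reproduced with base R's `p.adjust`, which returns Holm-adjusted p-values; a hypothesis is rejected when its adjusted p-value is at or below α. This is a sketch using the p-values from the example above, with illustrative variable names.

```r
# Holm step-down via adjusted p-values (base R, stats::p.adjust)
p <- c(PFS = 0.015, OS = 0.025, ORR = 0.10)
alpha <- 0.05

p_holm <- p.adjust(p, method = "holm")
reject_holm <- p_holm <= alpha

print(p_holm)
print(reject_holm)
```

Comparing adjusted p-values with α is equivalent to comparing the ordered raw p-values with α/k, α/(k-1), and so on, and is usually easier to report in a table.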

4. Hochberg Procedure (Step-Up)

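Uniformly at least as powerful as Holm, but valid only when the test statistics are independent or positively dependent (common in oncology, where PFS and OS tend to move together). Order p-values from largest to smallest: p_(k) ≥ ... ≥ p_(1).

If p_(k) ≤ α, reject all k hypotheses
If not, retain H_(k) and reject all H_(i) for i ≤ k-1 if p_(k-1) ≤ α/2
Continue stepping up: compare the next-largest p-value with α/3, then α/4, and so on; at the first success, reject that hypothesis and every hypothesis with a smaller p-value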

Example (PFS and OS with positive correlation, α = 0.05):

p_PFS = 0.02, p_OS = 0.03
Sorted (descending): 0.03 ≥ 0.02

Step 1: Is p_(2) = 0.03 ≤ 0.05? YES → Reject both H_PFS and H_OS
Conclusion: Both PFS and OS claimed. Bonferroni at 0.025 each would reject only H_PFS, because p_OS = 0.03 > 0.025

Valid under independence or certain positive correlation structures. Requires care when endpoints have complex correlation structure. R packages: multcomp, stats::p.adjust(method="hochberg")
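
A short comparison on the two-endpoint example, using `stats::p.adjust` as named above. This is only a sketch; the variable names are illustrative.

```r
# Hochberg vs. Bonferroni on the PFS/OS example (base R)
p <- c(PFS = 0.02, OS = 0.03)
alpha <- 0.05

reject_bonferroni <- p.adjust(p, method = "bonferroni") <= alpha
reject_hochberg   <- p.adjust(p, method = "hochberg")   <= alpha

print(rbind(bonferroni = reject_bonferroni, hochberg = reject_hochberg))
```

Bonferroni rejects only H_PFS here, while Hochberg rejects both because the larger p-value (0.03) is already below the full α.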

5. Fallback Method

A sequential method for two endpoints in which the secondary keeps a reserved share of alpha if the primary fails and inherits the primary's alpha if the primary succeeds.

Test H_1 at α_1 (e.g., 0.04); if significant, test H_2 at full α (0.05)
If H_1 fails, test H_2 at α_2 = α − α_1 (e.g., 0.01)
H_2 can still be claimed at reduced alpha even if primary fails

Oncology use case: PFS is tested at 0.04; OS is tested at full 0.05 if PFS succeeds, or at 0.01 if PFS fails. This allows an OS claim even if PFS misses formal significance.

Decision tree:

If H_PFS p < 0.04:
  → Reject H_PFS (PFS significant)
  → Test H_OS at α = 0.05
    If p_OS < 0.05: Reject H_OS (OS significant)
    Else: Do not reject H_OS

If H_PFS p ≥ 0.04:
  → Do not reject H_PFS
  → Test H_OS at α = 0.01 (fallback)
    If p_OS < 0.01: Reject H_OS (OS significant at reduced alpha)
    Else: Do not reject H_OS
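
The decision tree translates directly into a small function. This is a minimal sketch: `fallback_test` is a hypothetical name, the 0.04/0.01 split is the one used in the text, and the p-values in the calls are made up for illustration.

```r
# Fallback method for two ordered hypotheses (H1 = PFS, H2 = OS)
fallback_test <- function(p1, p2, alpha = 0.05, alpha1 = 0.04) {
  reject1 <- p1 < alpha1
  # H2 gets the full alpha if H1 was rejected, otherwise only its reserved share
  alpha2 <- if (reject1) alpha else alpha - alpha1
  reject2 <- p2 < alpha2
  c(H_PFS = reject1, H_OS = reject2)
}

res_a <- fallback_test(p1 = 0.03, p2 = 0.04)    # PFS passes, OS tested at 0.05
res_b <- fallback_test(p1 = 0.06, p2 = 0.04)    # PFS fails, OS tested at 0.01
res_c <- fallback_test(p1 = 0.06, p2 = 0.005)   # PFS fails, OS still claimable
print(rbind(res_a, res_b, res_c))
```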

6. Graphical Approach (Advanced)

Pre-specify a directed, weighted graph (cycles are allowed, as in the biomarker example below) with:

Nodes: Each hypothesis (endpoint or population)
Weights: Initial share of alpha allocated to each node (weights sum to at most 1)
Transition matrix: Rules for alpha redistribution when a hypothesis is rejected

Biomarker enrichment example:

Graph with 2 nodes:

- H_ITT (all-comers): weight = 0.5, alpha allocated = 0.025
- H_BIO (biomarker+): weight = 0.5, alpha allocated = 0.025
- Transition rules:
  * If H_ITT rejected: pass 100% of its alpha to H_BIO
  * If H_BIO rejected: pass 100% of its alpha to H_ITT
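
To make the weight and transition mechanics concrete, here is a minimal base-R sketch of the sequentially rejective update rule described by Bretz et al. (2009), applied to the two-node graph above at overall α = 0.05. The function name `graphical_test` and the p-values are hypothetical, and this is not a validated tool; a dedicated package would be the safer choice for a real analysis.

```r
# Sequentially rejective graphical test (illustrative base R sketch).
# p: p-values; w: initial weights (sum <= 1); G: transition matrix (rows sum <= 1).
graphical_test <- function(p, w, G, alpha = 0.05) {
  active   <- rep(TRUE, length(p))
  rejected <- rep(FALSE, length(p))
  repeat {
    cand <- which(active & p <= w * alpha)
    if (length(cand) == 0) break           # no active hypothesis can be rejected
    j <- cand[1]
    rejected[j] <- TRUE
    active[j]   <- FALSE
    w_new <- w
    G_new <- G
    for (l in which(active)) {
      w_new[l] <- w[l] + w[j] * G[j, l]    # l inherits its share of j's weight
      denom <- 1 - G[l, j] * G[j, l]
      for (k in which(active)) {
        # Re-route edges that used to pass through the rejected node j
        G_new[l, k] <- if (l != k && denom > 0) {
          (G[l, k] + G[l, j] * G[j, k]) / denom
        } else {
          0
        }
      }
    }
    w_new[j] <- 0
    G_new[j, ] <- 0
    G_new[, j] <- 0
    w <- w_new
    G <- G_new
  }
  names(rejected) <- names(p)
  rejected
}

w <- c(0.5, 0.5)                  # H_ITT and H_BIO each start with 0.025
G <- rbind(c(0, 1), c(1, 0))      # each node passes all of its alpha to the other

# H_ITT is rejected at 0.025, then H_BIO is tested at the full 0.05
res_recycle <- graphical_test(c(H_ITT = 0.02, H_BIO = 0.04), w, G)
# Neither p-value clears 0.025, so nothing is rejected and nothing is recycled
res_none <- graphical_test(c(H_ITT = 0.03, H_BIO = 0.04), w, G)
print(res_recycle)
print(res_none)
```

For this symmetric two-node graph the procedure coincides with Holm: the second hypothesis is only ever tested at the full α after the first has been rejected at α/2.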

FDA requirements:

Full graph (nodes, weights, transitions) must be specified in SAP before unblinding
Post-hoc graph changes are not permitted
Justification for weight allocation and transition rules required
R packages: gMCP (Bretz et al.), graphicalMCP

SAP language template:

"The testing hierarchy will follow the graphical approach of Bretz et al. (2009).
Nodes, initial weights, and transition rules are specified in Table [X] and Figure [X].
Primary analysis will test H_ITT at initially allocated α = 0.025. Upon rejection of H_ITT,
alpha will be transferred to H_BIO as per the transition matrix in Figure [X].
All transitions follow the closed testing principle and maintain strong FWER control."

7. Co-Primary Endpoints

When success requires demonstration on BOTH endpoints simultaneously:

Each endpoint is tested at the full nominal α (0.05 two-sided); no alpha split is needed because both must be significant
Power drops: if power for each endpoint is 85%, joint power is about 72% under independence (0.85 × 0.85) and rises toward 85% as the positive correlation between endpoints increases
Requires larger sample size than single primary
FDA example: PFS AND OS both required for approval (some IO combinations in NSCLC); or ITT AND biomarker+ both required

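Alpha split (applies when success on either endpoint is sufficient, i.e., multiple primary rather than strict co-primary endpoints):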

Equal split: α_1 = α_2 = 0.025 (two-sided), appropriate when both endpoints equally important
Asymmetric split: α_1 = 0.04, α_2 = 0.01 — when one endpoint is substantially more likely to succeed (not recommended by FDA without strong justification)

Sample size inflation for co-primary:

Single primary: n per arm for 80% power at HR = 0.75
Co-primary (equal correlation ρ=0.5, equal power targets):
  Required n_per_arm ≈ 1.15–1.25 × single-primary n

Example: Single PFS primary needs n=180; co-primary PFS+OS needs n=210–225
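
The dependence of joint power on correlation can be checked numerically. The sketch below assumes two test statistics that are bivariate normal with correlation rho, equal marginal power, and one-sided rejection regions (the negligible wrong-direction tail of a two-sided test is ignored). `joint_power` is a hypothetical helper, not a package function, and it addresses power only, not the sample-size multipliers quoted above.

```r
# Joint power of two co-primary endpoints with equal marginal power.
# Assumes bivariate normal test statistics with correlation rho (|rho| < 1).
joint_power <- function(power, rho) {
  stopifnot(abs(rho) < 1, power > 0, power < 1)
  d <- qnorm(power)   # mean of the test statistic minus its critical value
  # P(both reject) = integral over z > -d of phi(z) * P(second rejects | z)
  integrand <- function(z) {
    dnorm(z) * pnorm((d + rho * z) / sqrt(1 - rho^2))
  }
  integrate(integrand, lower = -d, upper = Inf)$value
}

rho_grid <- c(0, 0.25, 0.5, 0.75)
jp_85 <- sapply(rho_grid, function(r) joint_power(0.85, r))
print(round(setNames(jp_85, paste0("rho=", rho_grid)), 3))
```

At rho = 0 the result equals the product of the marginal powers; as rho grows the joint power moves toward the marginal power, so the independence case is the lower bound for positively correlated endpoints.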

8. Gatekeeping Strategies

Serial gatekeeping: Primary family must be fully significant before secondary family is tested.

Family 1 (Primary): PFS [must be significant at α=0.025, one-sided] → Gate opens →
Family 2 (Secondary): OS, ORR [tested with remaining α if gate passes]

If PFS significant: Test OS at α=0.025 (full alpha available)
If PFS not significant: Gate fails; no testing of OS or ORR
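
A minimal sketch of this serial gate with a single primary hypothesis. The text does not say how OS and ORR share alpha inside Family 2, so Holm within the secondary family is assumed here for illustration; `serial_gate` and all p-values are hypothetical.

```r
# Serial gatekeeping: one primary gate, then Holm within the secondary family.
# Holm in Family 2 is an assumption; the source does not specify the procedure.
serial_gate <- function(p_gate, p_secondary, alpha = 0.025) {
  gate_open <- p_gate <= alpha
  secondary <- if (gate_open) {
    p.adjust(p_secondary, method = "holm") <= alpha
  } else {
    rep(FALSE, length(p_secondary))    # gate closed: no secondary inference
  }
  names(secondary) <- names(p_secondary)
  list(gate_open = gate_open, secondary = secondary)
}

res_open   <- serial_gate(p_gate = 0.010, p_secondary = c(OS = 0.020, ORR = 0.012))
res_closed <- serial_gate(p_gate = 0.040, p_secondary = c(OS = 0.001, ORR = 0.001))
print(res_open)
print(res_closed)
```

In the second call the secondary p-values are very small, yet neither can be claimed because the gate never opened.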

Parallel gatekeeping (truncated Holm/Hochberg): Alpha is "recycled" from hypotheses that have been rejected to those not yet rejected, within and across families.

Used in complex trials with multiple populations or multiple treatment arms
Example: test PFS in all-comers AND PFS in PD-L1+ subgroup; recycle alpha from successful test to the other
More powerful than serial gatekeeping when multiple primary hypotheses exist

Multi-branched gatekeeping: Multiple gates in parallel, then downstream tests opened by partial gate success.

Common in basket trials, platform trials, or trials with biomarker stratification
Example: Test treatment in 3 histologies (3 primary gates); if ≥1 succeeds, test OS in that histology

R implementation:

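# Example: graphicalMCP for a two-hypothesis graph with alpha recycling
# (call signatures as recalled from the package documentation; p-values are hypothetical)
library(graphicalMCP)

# Nodes: PFS_overall, PFS_biomarker+ with initial weights c(0.5, 0.5)
# Transitions: each rejected hypothesis passes all of its alpha to the other
g <- graph_create(
  hypotheses  = c(0.5, 0.5),
  transitions = rbind(c(0, 1), c(1, 0)),
  hyp_names   = c("PFS_overall", "PFS_biomarker+")
)

# Sequentially rejective test at overall two-sided alpha = 0.05
graph_test_shortcut(g, p = c(0.030, 0.012), alpha = 0.05)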

9. Subgroup Testing and Population Enrichment

Testing treatment effects in both overall (ITT) and biomarker-positive populations introduces multiplicity:

Hierarchical approach:

Primary population: Biomarker-positive subgroup (enriched, higher expected effect)
Secondary population: All-comers ITT (broader population)
Testing strategy: Test biomarker+ at α=0.025; if significant, test ITT at α=0.025 (fixed-sequence gating)

Parallel gatekeeping approach:

Test both ITT and biomarker+ as co-primary; apply alpha recycling
More flexible if ITT is expected to be successful

FDA expectation: When biomarker-stratified, clearly pre-specify:

Which population is primary (usually biomarker-enriched)
Testing order or gating structure
Alpha allocation
How results in each population will inform labeling

Intercurrent Events and Multiple Testing

IEs interact with multiple testing in two ways:

1. Different IE strategies for primary vs. sensitivity:

Only primary estimand needs formal alpha control; sensitivity analyses (hypothetical, while-on-treatment) are descriptive.

SAP language: "The primary analysis uses the treatment policy strategy for OS (all deaths are counted regardless of subsequent anticancer therapy; patients are not censored at the start of new therapy). All pre-specified sensitivity analyses (hypothetical scenario using RPSFT, while-on-treatment, composite strategy) are supportive and descriptive. No adjustment for multiplicity is applied to sensitivity analyses."

2. Multiple sensitivity analyses: If sensitivity analyses are pre-specified as "supportive" (not inferential), no multiplicity adjustment needed — but p-values should not be presented as confirmatory.

SAP language: "The OS analysis will be conducted under the primary treatment policy estimand. Sensitivity analyses addressing alternative estimands (RPSFT for crossover, hypothetical scenario excluding new therapy) are pre-specified and reported in the CSR as supportive evidence. These analyses do not contribute to the hypothesis testing family and no multiplicity adjustment is applied."

Regulatory Precedent

Common patterns in Phase 3 oncology trials:

Design Pattern	Method	Example Indication	Notes
PFS primary → OS key secondary	Fixed-sequence	NSCLC targeted therapy (osimertinib FLAURA, alectinib ALEX)	HR threshold: 0.50–0.60 PFS; OS supports
OS primary → PFS → ORR → PRO	Fixed-sequence	Colorectal (FOLFOX adjuvant), pancreatic, melanoma	Standard oncology hierarchy
PFS in biomarker+ AND all-comers co-primary	Parallel gatekeeping or co-primary	NSCLC PD-L1+ (pembrolizumab KEYNOTE-024/189)	FDA guidance: test both; alpha recycling if needed
OS AND ORR co-primary	Bonferroni (α=0.025 each)	Some myeloma trials, lymphoma	Both required for approval claim
PFS (α=0.04) + OS fallback (α=0.01)	Fallback method	IO combinations with uncertain PFS signal	Allows OS claim if PFS borderline
DFS primary → OS secondary (adjuvant)	Fixed-sequence	Breast cancer, colorectal adjuvant	Standard adjuvant hierarchy

Limitations and Pitfalls

1. Fixed-sequence inflexibility: If primary endpoint unexpectedly fails (e.g., PFS HR = 0.72, p = 0.08), no formal inference is possible for OS even if OS HR is 0.70, p = 0.01. Fallback or parallel gatekeeping methods allow some inference in this scenario at reduced alpha.

Mitigation: Plan fallback or graphical approach in SAP if secondary endpoint is critical to approval story

2. Post hoc reordering is not acceptable: Changing the endpoint testing order after unblinding or interim analysis without pre-specification invalidates the Type I error control. FDA will not accept post hoc hierarchical orderings.

Mitigation: SAP must be locked before database lock; any changes require protocol amendment with FDA agreement

3. Exploratory analyses with p-values (misleading inference): Reporting p-values for exploratory endpoints creates the false impression of inferential claims. FDA guidance is clear: "presenting p-values from descriptive analyses is inappropriate because doing so would imply a statistically rigorous conclusion." (FDA 2017)

Mitigation: Label exploratory endpoints as "Exploratory — not formally tested for Type I error control"; if p-values reported, do NOT use asterisks or bold formatting that implies significance

4. Subgroup multiplicity (testing too many subgroups): Testing 10+ pre-specified subgroups without multiplicity adjustment inflates FWER substantially. Only pre-specified primary subgroup analyses (e.g., biomarker+) should be treated as confirmatory.

Mitigation: Limit pre-specified subgroups to ≤3; apply gatekeeping or Bonferroni to subgroup tests; label additional subgroup analyses as exploratory

5. Interim analysis alpha spending not accounted for: Interim efficacy analyses in group sequential designs also consume alpha. An alpha spending function (e.g., a Lan-DeMets function approximating O'Brien-Fleming boundaries) must be specified. Alpha spent on the primary endpoint at interim analyses comes out of the same family budget that secondary endpoints draw on at the final analysis.

Mitigation: Specify interim analysis schedule and alpha spending function in SAP; document total alpha spent at all analyses (interims + final)
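
For documenting cumulative alpha, the Lan-DeMets O'Brien-Fleming-type spending function α(t) = 2 − 2Φ(z_{α/2}/√t) can be tabulated directly. The sketch below assumes interim looks at 50% and 75% information (illustrative values) and a total α of 0.05. `obf_spend` is a hypothetical helper, and the output is cumulative alpha spent, not the nominal p-value boundary at each look, which requires group sequential software.

```r
# Lan-DeMets O'Brien-Fleming-type alpha spending function (base R sketch).
# t: information fraction in (0, 1]; alpha: total alpha to be spent by t = 1.
obf_spend <- function(t, alpha = 0.05) {
  stopifnot(all(t > 0), all(t <= 1))
  2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(t))
}

info_frac   <- c(0.5, 0.75, 1.0)      # hypothetical analysis schedule
cum_alpha   <- obf_spend(info_frac)   # cumulative alpha spent at each look
incremental <- diff(c(0, cum_alpha))  # alpha newly spent at each look

print(data.frame(info_frac, cum_alpha, incremental))
```

Very little alpha is spent at early looks under this function, which is why it is often preferred when the final analysis should retain nearly the full α.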

6. Co-primary power loss underestimated: If planning co-primary endpoints with 85% power each, assuming independence gives joint power of about 72%. Positive correlation (common for PFS/OS) raises joint power above this floor, so the independence figure is conservative; the risk lies in assuming a correlation that is too high, which overstates joint power and leaves the trial under-sized.

Mitigation: Use realistic correlation estimates (ρ = 0.4–0.6 for PFS/OS); recalculate sample size with correlation accounted for

7. Graphical approach complexity without FDA pre-meeting: Complex graphical approaches (with 4+ nodes, complex transition rules) may be misunderstood by FDA reviewers if not discussed in advance. Post-hoc reinterpretation of graph rules is not acceptable.

Mitigation: Request a Type C meeting with FDA to obtain feedback on the graphical testing strategy if it is novel or complex; provide clear diagrams and transition rules in the meeting package

8. Alpha allocation not matching clinical priority: Allocating larger alpha to secondary endpoint than primary, or symmetric alpha to endpoints with asymmetric clinical importance, may not align with FDA expectations.

Mitigation: Justify alpha allocation based on clinical importance and statistical power; symmetric splits acceptable only for truly co-primary endpoints

Backlinks

Source: FDA Guidance for Industry: Multiple Endpoints in Clinical Trials (January 2017, Final); Bretz et al. (2009), Graphical Approach to Multiple Testing with Application to Clinical Trials
Status: Final guidance
Compiled from retrieved FDA chunks + ClinicalTrials.gov records + literature on graphical multiplicity methods