Biomarker-Enriched Trial Designs

Definition

Enrichment is the prospective use of any patient characteristic to select a study population in which detection of a drug effect — if one is in fact present — is more likely than it would be in an unselected population.

"Enrichment is the prospective use of any patient characteristic to select a study population in which detection of a drug effect (if one is in fact present) is more likely than it would be in an unselected population." — FDA Enrichment Strategies for Clinical Trials Guidance (March 2019, Final)

FDA's 2019 guidance describes three broad categories of enrichment:

Category	Mechanism	Effect on Trial
Variability-reducing	Narrow entry criteria, run-in periods, exclude poor compliers	Increases power by reducing noise
Prognostic enrichment	Select patients at higher baseline risk of events	Increases absolute effect size; relative effect unchanged
Predictive enrichment	Select patients more likely to respond to the specific drug	Increases both absolute and relative effect size

This article focuses on the oncology application of prognostic and predictive enrichment — the design frameworks, multiplicity control, companion diagnostic co-development, and worked examples across the major precision oncology domains.

1. Prognostic vs. Predictive Enrichment

1.1 Definitions

Prognostic biomarker: A factor that predicts disease outcome (e.g., event rate, progression speed, overall survival) regardless of treatment received. Prognostic enrichment selects patients at higher risk — those who will have more events in the control arm — without changing the relative treatment effect. The goal is statistical efficiency: more events in less time from a smaller cohort.

Predictive biomarker: A factor that predicts differential treatment response — i.e., the biomarker-positive subgroup benefits more (or less) from a specific treatment than the biomarker-negative subgroup. Predictive enrichment changes both the absolute and relative effect size, concentrating the trial in patients who are expected to benefit.

"These strategies would increase the absolute effect difference between groups but would not be expected to alter relative effect [prognostic]. … Predictive enrichment strategies include choosing patients who are more likely to respond to the drug treatment than other patients with the condition being treated. Such selection can lead to a larger effect size (both absolute and relative)." — FDA Enrichment Guidance 2019, Section II (Final)

1.2 How to Tell Them Apart

The key diagnostic question: Does the biomarker predict different treatment effects across subgroups, or does it merely predict worse baseline prognosis regardless of arm?

Feature	Prognostic Enrichment	Predictive Enrichment
Effect in biomarker-negative patients	Similar relative effect expected	Minimal or absent
Control arm event rate	Higher (by design)	Not necessarily changed
Relative risk reduction (HR)	Same across marker strata	Differs — biomarker+ benefits more
Treatment × biomarker interaction	Not expected	Expected (the scientific basis)
Regulatory design implication	Study all patients, stratify for efficiency	May study biomarker+ only
Example in oncology	High-risk adjuvant breast cancer (70-gene profile); high-risk prostate cancer (PSA velocity)	EGFR mutation → TKI in NSCLC; HER2 overexpression → trastuzumab

Formal test: A statistically significant treatment-by-biomarker interaction (p < 0.05 threshold for exploratory; pre-specified for confirmatory) provides evidence of prediction. Note: interaction tests are severely underpowered in typical Phase 3 trials; a non-significant interaction does not rule out a clinically meaningful differential.
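To put a rough number on "underpowered", the sketch below compares the power of a within-stratum treatment test with the power of the interaction test. It assumes the common approximation Var(log HR) ≈ 4/d for d events under 1:1 randomization. The event counts, hazard ratios, and the helper name `power_loghr` are illustrative assumptions, not values from any trial.

```r
# Approximate power of a two-sided Wald test for a log hazard ratio,
# using Var(log HR) ~ 4 / d for d events under 1:1 randomization.
power_loghr <- function(log_effect, se, alpha = 0.05) {
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(abs(log_effect) / se - z_crit)
}

# Illustrative assumptions: 200 events per biomarker stratum,
# HR = 0.60 in biomarker-positive, HR = 0.90 in biomarker-negative.
d_pos <- 200
d_neg <- 200
hr_pos <- 0.60
hr_neg <- 0.90

# Power to detect the treatment effect within the BM+ stratum
power_main <- power_loghr(log(hr_pos), sqrt(4 / d_pos))

# The interaction is the difference of two log HRs, so the variances add
se_int <- sqrt(4 / d_pos + 4 / d_neg)
power_int <- power_loghr(log(hr_pos) - log(hr_neg), se_int)

round(c(main_effect_BM_pos = power_main, interaction = power_int), 2)
```

Under these assumptions the within-stratum test has power of roughly 0.95, while the interaction test reaches only about 0.53, even though the two strata differ substantially.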

1.3 Implications for Trial Design

Prognostic enrichment:

Does not limit label to enriched population if the treatment effect is generalizable
FDA has accepted prognostically enriched approvals (e.g., tamoxifen for breast cancer risk reduction in women at high risk per the Gail model) and described them in labeling
Post-marketing studies in lower-risk populations are typically required or committed

Predictive enrichment:

If biomarker-negative patients are excluded, FDA expects justification that they will not respond
Label restricted to biomarker-positive population; companion diagnostic approval required for patient selection
A strong mechanistic rationale (e.g., targeted mutation driving tumor proliferation) can make study of the marker-negative population unnecessary
Without strong mechanistic data, FDA encourages inclusion of some marker-negative patients to characterize the treatment effect in the marker-negative population

"FDA encourages inclusion of some predictive marker-negative patients in most trials intended to provide primary effectiveness support, unless earlier studies have established that the marker-negative patients do not respond or a strong mechanistic rationale makes it clear that they will not respond." — FDA Enrichment Guidance 2019, Section VI.B (Final)

2. All-Comers with Stratification vs. Enriched Design

2.1 Design Options

Three primary architectures exist for biomarker-guided trials:

Option A — Enriched (biomarker-positive only):

Eligibility restricted to biomarker-positive patients. Primary analysis is the marker-positive population. No direct evidence in marker-negative patients from this trial.

Option B — All-comers with stratified randomization:

All patients enrolled; randomization stratified by biomarker status. Pre-specified primary analysis may be overall population (ITT), biomarker-positive subgroup, or hierarchical (biomarker+ first, then ITT). Provides data in both populations.

Option C — Marker-stratified with marker-negative sub-study:

Primary analysis in biomarker-positive patients; marker-negative patients randomized separately (smaller, exploratory substudy). Allows some characterization of marker-negative effect.

2.2 When Each Is Appropriate

Design	Best when	Caution
Enriched (positive only)	Strong mechanistic rationale; high biomarker prevalence (≥25%); significant drug toxicity argues against treating non-responders	Restricts label; requires approved CDx at launch; no data on marker-negative
All-comers + stratification	Biomarker evidence is promising but not definitive; regulatory label sought for broad population first; marker prevalence is high	Power diluted if effect concentrated in subgroup; risk of overall-negative trial with subgroup signal
Hybrid marker-stratified	Moderate biomarker confidence; ethical imperative to treat all patients (urgent disease); drug must be given before test result available	Complex multiplicity; marker-negative sample typically underpowered for definitive conclusions

2.3 Sample Size and Efficiency Trade-offs

FDA's Table 1 (Enrichment Guidance 2019) quantifies the efficiency gain of enrichment:

Marker Prevalence	Effect in Marker-Negative (% of Marker-Positive)	Sample Size Ratio (Unselected : Enriched)
50%	0%	4×
25%	0%	16×
50%	50%	1.8×
25%	50%	2.6×

"When the prevalence of marker-positive patients in a population is only 25% and no treatment effect is expected in the 75% of patients who are marker-negative, the required sample size in a study of an unselected population would be 16 times the sample size needed for a study that included only marker-positive patients." — FDA Enrichment Guidance 2019, Section V.A (Final); Simon and Maitournam 2004

Key insight: The efficiency gain from enrichment is maximized when marker prevalence is low AND the marker-negative population has no treatment effect. When the marker-negative population also benefits (even modestly), the advantage of enrichment narrows considerably.
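The four ratios in the table are consistent with a simple dilution argument: if the marker-negative effect is a fraction r of the marker-positive effect and prevalence is p, the average effect in an unselected population is p + (1 - p)·r times the marker-positive effect, and required sample size scales with the inverse square of the effect. The sketch below encodes that approximation. The function name `enrichment_ratio` is hypothetical, and the formula is a simplification (equal variances, no allowance for screening burden or assay error), not text from the guidance.

```r
# Ratio of randomized patients needed: unselected design vs. enriched design.
# prevalence:     fraction of marker-positive patients (0 to 1)
# rel_effect_neg: marker-negative effect as a fraction of the marker-positive effect
enrichment_ratio <- function(prevalence, rel_effect_neg) {
  diluted_effect <- prevalence + (1 - prevalence) * rel_effect_neg
  1 / diluted_effect^2
}

# The four scenarios shown in the table
scenarios <- expand.grid(prevalence = c(0.50, 0.25), rel_effect_neg = c(0, 0.5))
scenarios$ratio <- round(
  with(scenarios, enrichment_ratio(prevalence, rel_effect_neg)), 1
)
print(scenarios)
```

The ratio counts randomized patients only. An enriched trial still has to screen about 1/prevalence patients for each one enrolled, which offsets part of the gain when prevalence is low.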

2.4 Regulatory Considerations

Generalizability tension: An enriched approval carries a restricted label. If 75% of patients are marker-negative and there are no data in those patients, FDA may require post-marketing commitments to study the broader population.

Labeling implications:

Enriched design → indication language restricted to biomarker-positive patients (e.g., "for patients whose tumors have EGFR exon 19 deletions or exon 21 L858R substitutions")
All-comers design with subgroup → label may include overall population indication with subgroup guidance in labeling sections

Accelerated Approval: Predictive enrichment designs using response rate (ORR) as the primary endpoint in biomarker-selected populations have supported numerous accelerated approvals. The smaller, faster enriched trial is well-suited to the accelerated pathway, with OS/PFS confirmatory data required post-approval.

Regular Approval: Both enriched (powered for biomarker+ PFS or OS) and all-comers stratified designs have supported regular approval. Hierarchical testing (biomarker+ → ITT) is the most common statistical architecture for all-comers trials with subgroup interest.

3. Biomarker-Stratified Designs

3.1 Overview

The biomarker-stratified design (also called the marker-stratified or marker-by-treatment interaction design) enrolls all-comers but uses biomarker status as a stratification factor at randomization. This ensures balance between arms within each biomarker stratum and enables pre-specified subgroup analyses. It should not be confused with the biomarker-strategy design, which randomizes patients between biomarker-guided and non-guided treatment assignment.

Typical architecture:

All patients screened for Biomarker (BM+/BM-)
  |
  ├─ BM+ (e.g., 40% of screened)
  |     |── Randomized: Treatment vs. Control [stratum 1]
  |
  └─ BM- (e.g., 60% of screened)
        |── Randomized: Treatment vs. Control [stratum 2]

Primary Analysis:
  H1: Treatment effect in BM+ (powered for HR = 0.65, 80% power)
  H2: Treatment effect in ITT (all patients; powered separately or tested hierarchically)
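As a rough check on the "HR = 0.65, 80% power" target for H1, the Schoenfeld approximation gives the number of events a log-rank test needs. The sketch assumes 1:1 allocation and a one-sided alpha of 0.025. The function name `required_events` is hypothetical.

```r
# Schoenfeld approximation: events needed for a log-rank test with 1:1 allocation
required_events <- function(hr, alpha_one_sided = 0.025, power = 0.80) {
  z_alpha <- qnorm(1 - alpha_one_sided)
  z_beta  <- qnorm(power)
  ceiling(4 * (z_alpha + z_beta)^2 / log(hr)^2)
}

# H1: treatment effect in BM+ at the design hazard ratio
events_bm_pos <- required_events(hr = 0.65)
events_bm_pos

# Weaker assumed effects need many more events, which is what an ITT
# hypothesis faces when the BM- stratum dilutes the overall hazard ratio
sapply(c(0.65, 0.75, 0.80), required_events)
```

Under these assumptions H1 needs about 170 events in the biomarker-positive stratum. The total enrollment required to observe them depends on accrual, follow-up, and control-arm event rates, which this sketch does not model.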

3.2 Testing Hierarchies and Fallback

The most common statistical approach is a hierarchical (fixed-sequence) test with a biomarker-positive primary hypothesis and an ITT secondary hypothesis:

Step 1: Test H_BM+ at alpha = 0.025 (one-sided)
  If significant → Step 2: Test H_ITT at alpha = 0.025
  If not significant → Stop; H_ITT not formally tested

Advantage: No alpha penalty — full alpha available at each step.

Disadvantage: If the biomarker-positive subgroup fails (perhaps due to misclassification noise), the ITT cannot be claimed even if it was positive.

Fallback (Simon-style) procedure: Splits alpha between biomarker+ and ITT hypotheses, allowing the ITT to be tested at a reduced alpha even if the biomarker+ hypothesis fails:

Alpha allocation:
  H_BM+: alpha_1 = 0.02 (primary)
  H_ITT: alpha_2 = 0.005 (fallback)
  Total: 0.025 (one-sided)

Scenario A: H_BM+ significant (p < 0.02)
  → H_ITT tested at full alpha (0.02 + 0.005 = 0.025)
  → OS and further secondaries tested

Scenario B: H_BM+ NOT significant (p ≥ 0.02)
  → H_ITT still tested at alpha_2 = 0.005
  → If p < 0.005, ITT can be claimed independently

When to use fallback: When the biomarker-negative subgroup may have a smaller but non-trivial treatment benefit, and the sponsor wishes to preserve a regulatory path for the broader indication even if the biomarker-positive primary fails.
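The difference between the two procedures shows up most clearly when the biomarker-positive test narrowly misses. The sketch below encodes both decision rules using the alpha values from the example above. The function names and p-values are hypothetical.

```r
# Fixed-sequence: ITT is tested only if BM+ is rejected first
fixed_sequence <- function(p_bm, p_itt, alpha = 0.025) {
  reject_bm  <- p_bm < alpha
  reject_itt <- reject_bm && p_itt < alpha
  c(BM_pos = reject_bm, ITT = reject_itt)
}

# Fallback: ITT keeps its own alpha and inherits the BM+ alpha after a BM+ rejection
fallback <- function(p_bm, p_itt, alpha_bm = 0.02, alpha_itt = 0.005) {
  reject_bm <- p_bm < alpha_bm
  level_itt <- if (reject_bm) alpha_bm + alpha_itt else alpha_itt
  c(BM_pos = reject_bm, ITT = p_itt < level_itt)
}

# Hypothetical result: BM+ narrowly misses, ITT is strongly positive
fixed_sequence(p_bm = 0.030, p_itt = 0.003)
fallback(p_bm = 0.030, p_itt = 0.003)
```

With these inputs the fixed-sequence procedure rejects nothing, while the fallback procedure still rejects the ITT hypothesis at its reserved alpha of 0.005. The price is a stricter 0.02 threshold for the biomarker-positive test.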

3.3 Graphical Multiplicity Procedures

For more complex structures — biomarker+ PFS → biomarker+ OS → ITT PFS → ITT OS — the graphical approach of Bretz, Maurer, and Hommel (2009) provides a unifying framework. See Multiplicity Control in Oncology Trials for the full algorithm.

Representative 4-node graph for biomarker-stratified trial:

Hypotheses:
  H1: PFS in BM+       (w=0.5, alpha=0.0125)
  H2: PFS in ITT       (w=0.5, alpha=0.0125)
  H3: OS in BM+        (w=0, receives alpha only after PFS rejections)
  H4: OS in ITT        (w=0, receives alpha only after PFS rejections)

Transition matrix (illustrative):
        H1    H2    H3    H4
  H1  [  0   0.5   0.5    0  ]  ← BM+ PFS → splits to ITT PFS and BM+ OS
  H2  [ 0.5   0     0    0.5 ]  ← ITT PFS → splits to BM+ PFS and ITT OS
  H3  [  0   0.5    0   0.5  ]
  H4  [ 0.5   0    0.5   0   ]

R implementation:

library(graphicalMCP)

g <- graph_create(
  hypotheses = c(H1_BM_PFS = 0.5, H2_ITT_PFS = 0.5,
                 H3_BM_OS  = 0.0, H4_ITT_OS  = 0.0),
  transitions = rbind(
    c(0,   0.5, 0.5,  0  ),
    c(0.5,  0,   0,  0.5 ),
    c(0,   0.5,  0,  0.5 ),
    c(0.5,  0,  0.5,  0  )
  )
)

# Simulate power under target operating characteristics
graph_calculate_power(
  graph = g,
  alpha = 0.025,
  sim_n = 1e5,
  power_marginal = c(0.90, 0.75, 0.80, 0.65)  # marginal powers by hypothesis
)

# Test with observed p-values
graph_test_shortcut(
  graph = g,
  p = c(0.008, 0.031, 0.012, 0.045),
  alpha = 0.025
)

3.4 Biomarker-Stratified Designs in SAP Language

7.3 BIOMARKER-STRATIFIED HYPOTHESIS TESTING

7.3.1 Primary Hypothesis
The primary analysis will evaluate the treatment effect on [endpoint] in the biomarker-positive
population (defined as [biomarker threshold] by [assay name/CDx]).

7.3.2 Secondary Hypothesis
Following significant demonstration of the primary hypothesis, the treatment effect will be
evaluated in the Intent-to-Treat (ITT) population using a hierarchical gate. The ITT analysis
will be performed only if the biomarker-positive primary hypothesis achieves significance at
the pre-specified alpha = [0.025 one-sided].

7.3.3 Fallback Provision
[If applicable:] In the event the biomarker-positive hypothesis does not achieve significance
at alpha = [0.020], the ITT hypothesis will be tested at the fallback alpha = [0.005].
The combined alpha for both hypotheses does not exceed 0.025 (one-sided), maintaining
strong FWER control.

7.3.4 Interaction Test
A pre-specified test of treatment-by-biomarker interaction will be conducted using
a [Cox proportional hazards / logistic regression] model including treatment arm,
biomarker stratum, and their interaction term. This test is conducted at a nominal
significance level of 0.10 (two-sided) and is considered hypothesis-generating.
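A minimal sketch of the interaction model described in 7.3.4, fitted to simulated data with the survival package. The sample size, hazard ratios, censoring pattern, and variable names are illustrative assumptions, not a recommended analysis specification.

```r
library(survival)

set.seed(2024)
n <- 800
dat <- data.frame(
  arm       = rbinom(n, 1, 0.5),   # 1 = treatment, 0 = control
  biomarker = rbinom(n, 1, 0.4)    # 1 = BM+, 0 = BM-
)

# Simulated truth (assumption): HR 0.60 in BM+, HR 0.95 in BM-,
# control median of 12 months, uniform administrative censoring
log_hr      <- with(dat, arm * ifelse(biomarker == 1, log(0.60), log(0.95)))
event_time  <- rexp(n, rate = (log(2) / 12) * exp(log_hr))
censor_time <- runif(n, min = 12, max = 36)
dat$time    <- pmin(event_time, censor_time)
dat$status  <- as.integer(event_time <= censor_time)

# Treatment arm, biomarker stratum, and their interaction term
fit <- coxph(Surv(time, status) ~ arm * biomarker, data = dat)

# Interaction estimate, standard error, and two-sided p-value
interaction_row <- summary(fit)$coefficients["arm:biomarker", ]
interaction_row[c("coef", "se(coef)", "Pr(>|z|)")]
```

The p-value in the last line is the one that would be compared with the nominal two-sided 0.10 level. It remains hypothesis-generating regardless of the result.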

4. Adaptive Biomarker Threshold Designs

4.1 Scientific Rationale for Adaptations

Fixed biomarker thresholds (e.g., PD-L1 ≥ 50%) determined a priori carry three sources of uncertainty:

Threshold uncertainty: The optimal cut-point may not be known before the trial; biological gradients exist (PD-L1 1% vs. 10% vs. 50%)
Prevalence uncertainty: The fraction of patients who are biomarker-positive may be poorly estimated from early-phase data
Effect size uncertainty: The magnitude of the treatment effect in biomarker-positive vs. biomarker-negative patients may be unknown

Adaptive enrichment designs allow pre-planned modifications of the study population at an interim analysis, based on accumulating efficacy data, to improve trial efficiency while maintaining type I error control.

4.2 FDA's Framework for Adaptive Enrichment

FDA Enrichment Guidance 2019 (Section VI.D) describes three adaptive enrichment scenarios:

Scenario 1 — Stop marker-negative enrollment at interim:

If interim data show the marker-negative group has much lower response than marker-positive, planned stopping of marker-negative enrollment may be permitted without alpha adjustment, provided all randomized patients remain in the final analysis.

Scenario 2 — Narrow entry criteria at interim:

Entry criteria changed to enrich the higher-responding subgroup. Type I error adjustment required only if an unblinded interim efficacy analysis was used to trigger the change; no adjustment needed if based solely on blinded pooled results or biomarker prevalence alone.

Scenario 3 — Refine biomarker threshold at interim:

An early endpoint (e.g., ORR or a pharmacodynamic biomarker) is used at interim to test several candidate cut-off values, and the optimal cut-off is selected. Appropriate pre-specification and type I error control are required; discuss with FDA in advance.

"If the only change was increased sample size based on blinded, pooled results because the prevalence of the marker-defined subgroup was lower than expected, there would be no need for a type I error rate adjustment." — FDA Enrichment Guidance 2019, Section VI.D (Final)

4.3 Key Design Elements for Adaptive Enrichment

Pre-specification requirements:

Interim analysis timing (information fraction or calendar milestone)
Decision rules: what data triggers what adaptation (must be algorithmic, not ad hoc)
Populations to be included/excluded after the adaptation
Analysis populations for the final analysis (all randomized patients are typically included)
Statistical method for combining pre- and post-adaptation data (combination test approach or conditional error principle)

Type I error control methods:

Method	Description	When to use
Combination test (Bauer-Köhne)	Combines stage-wise p-values using pre-specified weights; type I error preserved by construction	Unblinded interim with population change
Conditional error principle	Adaptation must not change the conditional type I error given interim data	Flexible; can accommodate complex adaptations
Pre-specified alpha spending	Allocation of alpha to interim and final looks; population change does not consume alpha if based on blinded data	Simpler; blinded prevalence-based only

Example adaptive enrichment structure (2-stage):

Stage 1: N1 patients per arm (BM+ and BM-)
          Interim analysis at t1 (e.g., 50% information)

          Decision rule:
          - If BM+ interim HR ≤ 0.60 and BM- interim HR > 0.90:
              → Restrict enrollment to BM+ only for Stage 2
          - If both BM+ and BM- show HR ≤ 0.75:
              → Continue enrolling all-comers
          - If BM+ interim HR > 0.80:
              → Futility stop

Stage 2: N2 patients (population determined at interim)

Final: Combination test p-value = f(p1_stage1, p2_stage2)
       e.g., Fisher's combination: -2[ln(p1) + ln(p2)] ~ chi-sq(4)

R implementation:

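library(rpact)  # Group-sequential and adaptive (combination-test) designs

# Two-stage group-sequential design used here as a sizing backbone. It does not
# itself model the population adaptation; for the combination-test analysis,
# getDesignInverseNormal() takes the same boundary arguments.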
design <- getDesignGroupSequential(
  kMax = 2,
  alpha = 0.025,
  sided = 1,
  typeOfDesign = "asOF",        # O'Brien-Fleming alpha spending
  informationRates = c(0.5, 1)
)

# Sample size for BM+ primary hypothesis
sampleSize <- getSampleSizeSurvival(
  design = design,
  hazardRatio = 0.65,           # Expected HR in BM+
  lambda2 = log(2)/12,          # Control median OS = 12 months
  dropoutRate1 = 0.05,
  dropoutRate2 = 0.05,
  accrualTime = 24,
  followUpTime = 18
)
summary(sampleSize)
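The final-analysis combination rule in the two-stage structure above (Fisher's product test) can be sketched in base R, alongside the weighted inverse-normal combination that is a common alternative. Both functions assume independent stage-wise one-sided p-values and no early stopping. The function names and p-values are hypothetical. A real adaptive enrichment analysis would also need a closed testing step across the candidate populations, which is omitted here.

```r
# Fisher's product combination of two independent stage-wise p-values:
# -2 * (log(p1) + log(p2)) follows a chi-square distribution with 4 df under H0
fisher_combination <- function(p1, p2) {
  stat <- -2 * (log(p1) + log(p2))
  pchisq(stat, df = 4, lower.tail = FALSE)
}

# Weighted inverse-normal combination; weights must be fixed before the interim
inverse_normal <- function(p1, p2, w1 = sqrt(0.5), w2 = sqrt(0.5)) {
  stopifnot(abs(w1^2 + w2^2 - 1) < 1e-12)
  z <- w1 * qnorm(1 - p1) + w2 * qnorm(1 - p2)
  pnorm(z, lower.tail = FALSE)
}

# Hypothetical stage-wise one-sided p-values
p_stage1 <- 0.04
p_stage2 <- 0.03

combined <- c(
  fisher         = fisher_combination(p_stage1, p_stage2),
  inverse_normal = inverse_normal(p_stage1, p_stage2)
)
combined
combined < 0.025   # compare with the one-sided level
```

Neither stage is significant at 0.025 on its own in this example, but both combination rules reject, because each stage contributes independent evidence in the same direction.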

4.4 Companion Diagnostic Co-Development: FDA Requirements

When predictive enrichment uses a biomarker that will be used to select patients in clinical practice, an FDA-cleared or -approved companion diagnostic (CDx) is required:

"Because assessment of marker status is critically important to determining whether the drug will be effective in patients, the test to assess the enrichment marker that would be used after the drug's approval would be an established, FDA-cleared or -approved, laboratory test explicitly labeled for this purpose as a companion diagnostic, although exceptions can be considered for a major advance in treatment." — FDA Enrichment Guidance 2019, Section VI.B.1 (Final)

CDx co-development timeline:

Phase 1/2:    Exploratory biomarker analyses using research-use-only (RUO) assay
              → Identify candidate thresholds, begin analytical validation

Phase 2/3:    Analytical validation of CDx (reproducibility, sensitivity, specificity)
              → Use prospective-retrospective validation on archived Phase 2 samples
              → Begin parallel regulatory submission track (device pathway)

Pivotal trial: CDx (or substantially equivalent device) used to define eligibility
              → Locked threshold; concurrent drug + CDx development

Filing:        Simultaneous BLA/NDA + PMA (premarket approval) or 510(k) for CDx
              → FDA expects concurrent submissions; staggered approvals complicate label

FDA CDx pathways:

PMA (Premarket Approval): For novel CDx with no predicate device; higher regulatory bar
510(k): For CDx substantially equivalent to already-cleared device
In Vitro Diagnostic (IVD) Exemption: Rare; for LDT (lab-developed test) when CDx development would be infeasible

Analytical validation requirements (FDA):

Analytical sensitivity and specificity (true positive/negative rates vs. reference method)
Precision: intra-assay, inter-assay, inter-laboratory, inter-operator
Accuracy across sample types: fresh biopsy, FFPE archival tissue, cell-free DNA (liquid biopsy)
Concordance with comparator assay and predefined thresholds
Specimen handling requirements (turnaround time, temperature, storage)

Alternative specimen sources: FDA has shown flexibility for rare biomarkers or when tissue quantity is insufficient:

Archival tissue may be used when prospective biopsy is not feasible
Liquid biopsy (ctDNA/cfDNA) increasingly accepted as an alternative, particularly for KRAS G12C and EGFR (e.g., Guardant360 CDx for osimertinib)
When liquid biopsy is used, concordance with tissue-based assay must be established

When CDx clearance is NOT required at drug approval:

If the marker test result will not be known before drug administration and no patient management decisions (e.g., treatment discontinuation) will be made based on the result, contemporaneous CDx clearance is not required. Example: antibacterial drugs where the infecting organism is identified post-randomization from baseline sputum.

5. Multiplicity for Biomarker Subgroup Testing

5.1 The Core Problem

Any trial testing treatment effects in both an overall (ITT) population and a biomarker-positive subgroup is testing at least two hypotheses. Without pre-specified multiplicity control:

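The FWER is inflated (up to ~9.75% for two independent tests at alpha=0.05 each; the inflation is smaller, but still present, when the tests are positively correlated, as a biomarker subgroup and its parent ITT population are)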
Regulatory agencies may reject the trial or restrict the label to the hypothesis with a multiplicity-adjusted analysis

FDA requires strong FWER control for all confirmatory biomarker subgroup claims intended for labeling.
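The inflation figure quoted above follows from 1 - (1 - alpha)^k for k independent tests. The sketch below reproduces it and then simulates the more realistic nested case, where the ITT statistic shares data with the biomarker-positive statistic. The assumed share of events in the biomarker-positive stratum (`f_pos = 0.4`) and the function name are illustrative.

```r
# Probability of at least one false rejection among k independent tests
fwer_independent <- function(alpha, k) 1 - (1 - alpha)^k

fwer_two_tests <- fwer_independent(alpha = 0.05, k = 2)
fwer_two_tests

# Nested case under the global null: the ITT z-statistic is a weighted sum of
# the BM+ and BM- z-statistics, so the two tests are positively correlated
set.seed(1)
n_sim <- 2e5
f_pos <- 0.4                       # assumed fraction of events in BM+
z_pos <- rnorm(n_sim)
z_neg <- rnorm(n_sim)
z_itt <- sqrt(f_pos) * z_pos + sqrt(1 - f_pos) * z_neg

crit <- qnorm(1 - 0.025)           # each test at one-sided 0.025, unadjusted
fwer_nested <- mean(z_pos > crit | z_itt > crit)
fwer_nested
```

The simulated error rate falls between the nominal 0.025 and the independent-test bound of about 0.049: correlation reduces the inflation but does not remove it.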

5.2 Alpha Allocation Strategies

Strategy 1: Fixed-sequence (hierarchical gate)

Most common for single-biomarker designs where biomarker status is the primary scientific hypothesis.

H_BM+ (alpha = 0.025) → H_ITT (alpha = 0.025, available only after H_BM+ rejected)

Properties:
  - No alpha penalty
  - Full power on each test, conditional on the previous
  - Risk: If BM+ fails despite meaningful effect, ITT is untestable

Strategy 2: Split alpha with recycling (Bonferroni-based)

H_BM+: alpha_1 = 0.015
H_ITT: alpha_2 = 0.010
Total: 0.025

Recycling rules:
  If H_BM+ rejected: H_ITT tested at alpha_2 + alpha_1 = 0.025
  If H_ITT rejected first: H_BM+ tested at alpha_1 + alpha_2 = 0.025
  If neither hypothesis is rejected at its initial threshold, no alpha is recycled and testing stops

Strategy 3: Graphical approach (Bretz et al. 2009)

Generalizes all Bonferroni-based procedures. See Multiplicity Control in Oncology Trials for full technical description.

For a two-population (BM+ and ITT) design with PFS and OS:

library(graphicalMCP)

# 4-node graph: BM+ PFS, ITT PFS, BM+ OS, ITT OS
g <- graph_create(
  hypotheses = c(0.5, 0.5, 0, 0),  # Equal split between BM+ PFS and ITT PFS
  transitions = rbind(
    # BM+PFS → ITT PFS (0.5) and BM+ OS (0.5)
    c(0,   0.5, 0.5,  0  ),
    # ITT PFS → BM+ PFS (0.5) and ITT OS (0.5)
    c(0.5,  0,   0,  0.5 ),
    # BM+ OS → ITT OS (0.5) and ITT PFS (0.5)
    c(0,   0.5,  0,  0.5 ),
    # ITT OS → BM+ OS (0.5) and BM+ PFS (0.5)
    c(0.5,  0,  0.5,  0  )
  )
)

# Power simulation: marginal powers under design assumptions
graph_calculate_power(
  graph = g,
  alpha = 0.025,
  sim_n = 1e5,
  power_marginal = c(0.90, 0.70, 0.85, 0.65)
)

5.3 FWER Control: Key Rules

Rule 1: Pre-specify the multiplicity structure before unblinding.

Any post-hoc restructuring of the graph or alpha allocation is not acceptable for confirmatory claims. FDA: "An important principle for controlling multiplicity is to prospectively specify all planned endpoints, time points, analysis populations, and analyses."

Rule 2: Interaction tests are exploratory, not confirmatory.

A treatment-by-biomarker interaction test does not substitute for a multiplicity-adjusted subgroup analysis. Interaction tests are typically run at a two-sided 0.10 significance level for exploratory purposes.

Rule 3: Three or more subgroups require explicit multiplicity structure.

If three biomarker strata are tested (e.g., PD-L1 <1%, 1–49%, ≥50%), all three are in the family. Defining the ≥50% threshold as primary and declaring others exploratory must be done prospectively and justified clinically.

Rule 4: Interim analyses consume alpha within the graph.

Each interim analysis for each hypothesis reduces the remaining alpha available at the final look, using a spending function. The graphical procedure must account for this: at any look, the nominal significance level for a node is obtained by applying that node's spending function to its current alpha allocation (its initial weight plus any alpha recycled from rejected hypotheses).

5.4 SAP Template: Biomarker Multiplicity Section

9. MULTIPLICITY CONTROL — BIOMARKER SUBGROUP TESTING

9.1 Hypothesis Families

This trial tests [k] efficacy hypotheses across [populations] and [endpoints]:

  H1: Progression-free survival (PFS) in the biomarker-positive population
  H2: Overall survival (OS) in the biomarker-positive population
  H3: PFS in the Intent-to-Treat (ITT) population
  H4: OS in the ITT population

9.2 Alpha Allocation and Graph Structure

The overall one-sided Type I error rate is 0.025. The initial alpha allocation
(weight vector) is:
  H1: w = 0.5 → alpha = 0.0125
  H2: w = 0.0 → alpha = 0 (receives alpha only after H1 or H4 rejection)
  H3: w = 0.5 → alpha = 0.0125
  H4: w = 0.0 → alpha = 0 (receives alpha only after H2 or H3 rejection)

The transition matrix [g_ij] governing alpha redistribution upon rejection is:
  From H1 (BM+ PFS): 0.5 to H3 (ITT PFS), 0.5 to H2 (BM+ OS)
  From H3 (ITT PFS): 0.5 to H1 (BM+ PFS), 0.5 to H4 (ITT OS)
  From H2 (BM+ OS):  0.5 to H4 (ITT OS),  0.5 to H3 (ITT PFS)
  From H4 (ITT OS):  0.5 to H2 (BM+ OS),  0.5 to H1 (BM+ PFS)

9.3 Testing Algorithm

The graphical testing procedure (Bretz et al. 2009) will be applied:
  1. Test each hypothesis H_i at its current alpha_i.
  2. Reject any H_j where p_j ≤ alpha_j.
  3. Update the graph per the transition matrix.
  4. Repeat until no further rejections possible.
Implemented in graphicalMCP (R package version [X]).

9.4 Interaction Test (Exploratory)

A treatment-by-biomarker-status interaction will be evaluated using a Cox
proportional hazards model. This test is conducted at a nominal two-sided
significance level of 0.10 and is hypothesis-generating only. Results will not
be used to modify the confirmatory testing procedure.

9.5 Subgroup Analyses

Pre-specified subgroup analyses (by age, performance status, line of therapy,
histology) are exploratory and will be presented descriptively with forest plots.
No multiplicity adjustment will be applied to these subgroup analyses. P-values
from subgroup analyses will not be used to support regulatory claims.
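The four-step testing algorithm in the template above can be traced by hand with a short base-R sketch of the Bretz et al. (2009) update rule, using the weights and transition matrix from the template. The function name `graph_shortcut` and the p-values are hypothetical. The sketch ignores interim looks and spending functions and is meant for understanding the recursion, not as a substitute for a validated package.

```r
# Sequentially rejective graphical procedure (Bonferroni-based shortcut).
# p: p-values; w: initial weights (sum <= 1); G: transition matrix (rows sum <= 1)
graph_shortcut <- function(p, w, G, alpha = 0.025) {
  m <- length(p)
  active <- rep(TRUE, m)
  repeat {
    hit <- which(active & p <= w * alpha)
    if (length(hit) == 0) break
    j <- hit[1]                     # reject one hypothesis, then update the graph
    active[j] <- FALSE
    G_new <- matrix(0, m, m)
    for (l in which(active)) {
      for (k in which(active)) {
        if (l == k) next
        denom <- 1 - G[l, j] * G[j, l]
        if (denom > 0) G_new[l, k] <- (G[l, k] + G[l, j] * G[j, k]) / denom
      }
    }
    w[active] <- w[active] + w[j] * G[j, active]   # pass on alpha of rejected H_j
    w[j] <- 0
    G <- G_new
  }
  list(rejected = setNames(!active, names(p)), final_weights = w)
}

# Template ordering: H1 BM+ PFS, H2 BM+ OS, H3 ITT PFS, H4 ITT OS
w0 <- c(0.5, 0, 0.5, 0)
G0 <- rbind(
  c(0,   0.5, 0.5, 0  ),   # H1 -> H2, H3
  c(0,   0,   0.5, 0.5),   # H2 -> H3, H4
  c(0.5, 0,   0,   0.5),   # H3 -> H1, H4
  c(0.5, 0.5, 0,   0  )    # H4 -> H1, H2
)
p_obs <- c(H1 = 0.004, H2 = 0.005, H3 = 0.020, H4 = 0.030)  # hypothetical

res <- graph_shortcut(p_obs, w0, G0)
res$rejected
```

With these inputs H3 (p = 0.020) fails its initial threshold of 0.0125 but is rejected once alpha recycled from H1 and H2 raises its level to 0.021875. H4 ends with the full 0.025 and is not rejected at p = 0.030.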

6. Oncology Worked Examples

6.1 PD-L1 / NSCLC: Threshold Complexity and All-Comers vs. Enriched

The scientific question: PD-L1 (programmed death-ligand 1) expression, measured by immunohistochemistry (IHC), predicts response to anti-PD-1/PD-L1 immunotherapy, but the predictive relationship is continuous, not binary, and threshold choice is drug-assay-specific.

Design spectrum across the PD-1/PD-L1 program:

Trial	Drug	Design	Biomarker Threshold	Primary Endpoint	Key Result
KEYNOTE-024	Pembrolizumab	Enriched (PD-L1 ≥50% only)	TPS ≥ 50% (22C3 IHC)	PFS	PFS HR 0.50, OS HR 0.63; approved with a label restricted to TPS ≥50%
KEYNOTE-189	Pembro + chemo	All-comers + stratification	Stratified by PD-L1 TPS (<1% vs. ≥1%); subgroups reported for <1%, 1–49%, ≥50%	OS and PFS (co-primary, ITT)	OS and PFS benefit in ITT, with OS benefit across PD-L1 subgroups; broader label
KEYNOTE-042	Pembrolizumab	Hierarchical: TPS≥50%, then TPS≥20%, then TPS≥1%	TPS ≥1%, ≥20%, ≥50%	OS by descending threshold	OS HR 0.69 in ≥50%; HR 0.77 in ≥20%; HR 0.81 in ≥1%; approved for TPS≥1%
CheckMate-026	Nivolumab	Enriched (PD-L1 ≥5% primary)	PD-L1 ≥5% (28-8 IHC)	PFS	PFS failed; TMB-high post-hoc subgroup positive — not confirmatory
CheckMate-227	Nivo + ipi	Parallel: PD-L1-defined parts (1a: PD-L1 ≥1%; 1b: PD-L1 <1%) + TMB-high	PD-L1 ≥1%; TMB ≥10 mut/Mb	Co-primary: PFS in TMB-high; OS in PD-L1 ≥1% (Part 1a)	TMB-high PFS positive; OS positive in PD-L1 ≥1%; OS benefit did not differ by TMB status

Design lessons:

KEYNOTE-024 exemplifies the enriched design at its most successful: a strong mechanistic hypothesis (PD-L1 ≥50% as a predictive threshold), a pre-specified CDx (Dako 22C3), and a single primary hypothesis in the biomarker-positive population. Result: clear regulatory approval restricted to TPS ≥50% first-line NSCLC.
KEYNOTE-042 demonstrates hierarchical threshold testing: primary hypothesis at TPS ≥50%, then tested sequentially downward to ≥20% and ≥1%. This multiplicity approach allowed the label to ultimately include TPS ≥1% (with caveats), expanding the commercial and clinical scope.
CheckMate-026 illustrates the risk of a post-hoc biomarker subgroup: the TMB-high finding emerged after the pre-specified PD-L1 primary endpoint failed and could not support labeling.

Companion diagnostic considerations:

Pembrolizumab CDx: Dako 22C3 pharmDx (TPS scoring)
Nivolumab CDx: Dako 28-8 pharmDx (TC scoring)
Atezolizumab CDx: Ventana SP142 (TC + IC scoring)
Assay non-interchangeability is a major practical issue: PD-L1 ≥50% by 22C3 ≠ ≥50% by 28-8; regulatory label is drug-assay specific

SAP note on PD-L1 analyses: When multiple PD-L1 thresholds are tested in the same trial, the threshold defining the primary hypothesis must be pre-specified and clinically justified before unblinding. Secondary threshold analyses are exploratory unless formal multiplicity adjustment is applied to all tested thresholds.

6.2 HER2 / Breast Cancer: Paradigm for Biomarker-Enriched Drug Development

Trastuzumab: the founding model

Trastuzumab (anti-HER2 monoclonal antibody) was the first major oncology example of FDA-endorsed predictive enrichment using a proteomic biomarker.

"Trastuzumab was estimated to have increased median survival by about 5 months, about three to four times the effect that would have been expected in an unselected population... Enrichment thus allowed a modest-size study to show a striking effect and directed treatment to the population that could benefit." — FDA Enrichment Guidance 2019, Section V.C.2.c (Final)

HER2 assessment pathway:

IHC 3+ (strong complete membrane staining): HER2-positive
IHC 2+ (equivocal): reflex FISH/ISH testing; HER2/CEP17 ratio ≥ 2.0 = positive
IHC 0/1+: HER2-negative (majority of breast cancers)

Early studies showed ORR < 5% in IHC 1+ patients; efficacy trials limited to IHC 2+/3+ patients (≈25% of breast cancers). Despite significant cardiotoxicity, the 5-month OS benefit in HER2-overexpressing metastatic breast cancer supported approval — the enrichment design made the benefit-risk favorable by excluding the 75% of patients who would not benefit.

HER2-low: expansion of the biomarker paradigm

The DESTINY-Breast04 trial (trastuzumab deruxtecan, T-DXd) prospectively enrolled HER2-low patients (IHC 1+ or IHC 2+/ISH-negative) who had been historically considered "HER2-negative" and ineligible for HER2-directed therapy.

Design highlights:

Enriched design: HER2-low patients only (excluded IHC 0 and HER2-positive disease, i.e., IHC 3+ or IHC 2+/ISH-positive)
Hierarchical testing: HR+/HER2-low primary → ITT (HR+/HER2-low + HR-/HER2-low) secondary
Primary endpoint: PFS; key secondary: OS
Result: PFS HR 0.51, OS HR 0.64 in HR+ primary population; FDA approval 2022 for HER2-low metastatic breast cancer

Regulatory implication: This trial redefined HER2-low as a clinically actionable category, requiring expansion of the CDx definition and relabeling of HER2 IHC assays to formally characterize 1+ vs. 0 staining intensity.

Key CTG examples from HER2/breast:

NCT	Trial	Design Feature	Primary Endpoint
NCT01772472	T-DM1 vs. trastuzumab adjuvant	HER2+ enriched	IDFS at 3 years
NCT02448420	Palbociclib + trastuzumab	HER2+ metastatic; stratified by HR status	PFS
NCT06068985	HER2 dependence / neoadjuvant de-escalation	HER2+ with adaptive CDx classification	pCR

6.3 KRAS-G12C and BRAF V600E / Colorectal Cancer

BRAF V600E in CRC: enriched approval via targeted combination

Approximately 10–15% of CRC patients harbor BRAF V600E mutation. Unlike BRAF-mutated melanoma, single-agent BRAF inhibitors failed in CRC due to EGFR feedback activation. The combination of encorafenib + binimetinib + cetuximab, tested in BEACON CRC:

Enriched design: BRAF V600E–mutated mCRC only
Biomarker: BRAF V600E mutation; the FDA-approved CDx for this indication is the therascreen BRAF V600E RGQ PCR Kit
Primary endpoint: OS; key secondary: ORR, PFS
Result (triplet vs. control): OS 9.0 vs. 5.4 months (HR 0.52); ORR 26% vs. 2%
FDA approval 2020: encorafenib + cetuximab (the doublet regimen) for BRAF V600E mCRC

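Design lesson: Without enrichment, the 10–15% BRAF prevalence and the absence of effect in BRAF-WT patients would have diluted any ITT signal beyond detection. Under the Simon and Maitournam (2004) approximation, when marker-negative patients derive no benefit the required number of randomized patients scales as 1/prevalence², so an unselected CRC trial would have needed roughly 45–100× the sample size to detect the same benefit.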
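
The dilution arithmetic can be sketched with the normal approximation used by Simon and Maitournam (2004): if the all-comers effect is the prevalence-weighted average of the subgroup effects, the ratio of randomized patients (all-comers vs. enriched) is the squared ratio of the two effects. The function name and inputs below are illustrative assumptions; for time-to-event endpoints the effects would be on the log hazard ratio scale and the result is only approximate.

```python
def unselected_to_enriched_ratio(prevalence, effect_pos, effect_neg=0.0):
    """Approximate ratio of randomized patients: all-comers / enriched.

    Assumes a normally distributed effect estimate with equal variance in
    both designs, so required sample size scales with 1 / effect**2, and
    the all-comers effect is the prevalence-weighted subgroup average.
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    diluted = prevalence * effect_pos + (1.0 - prevalence) * effect_neg
    if diluted == 0.0:
        raise ValueError("diluted effect is zero; ratio is undefined")
    return (effect_pos / diluted) ** 2


# No effect in marker-negative patients, at the prevalence range quoted above.
for p in (0.10, 0.15):
    ratio = unselected_to_enriched_ratio(p, effect_pos=1.0, effect_neg=0.0)
    print(f"prevalence {p:.0%}: about {ratio:.0f}x as many randomized patients")
```

The ratio counts randomized patients only. An enriched trial still has to screen roughly n / prevalence patients to find its n marker-positive participants.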

KRAS G12C in CRC: platform trial approach

KRAS mutations (historically undruggable) occur in ~45% of CRC; KRAS G12C specifically accounts for ~3–5% of mCRC. Sotorasib and adagrasib received accelerated approval for KRAS G12C-mutated NSCLC (ORR ~37% and ~43%, respectively), with CRC data supporting label extensions:

CodeBreaK 100: Sotorasib single-arm Phase 2 in KRAS G12C NSCLC; fully enriched, ORR primary
KRYSTAL-1: Adagrasib Phase 1/2 basket trial, KRAS G12C enriched across tumor types
Very low prevalence in CRC (~3–5%) requires a basket/platform trial approach for efficient development
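
A back-of-envelope sketch of the screening burden behind that last point. The function and its parameters are hypothetical; it assumes every marker-positive patient with an evaluable test result enrolls, which is optimistic.

```python
def expected_screened(n_enrolled, prevalence, test_failure_rate=0.0):
    """Expected number screened to enroll n_enrolled marker-positive patients.

    Assumes every marker-positive patient with an evaluable test enrolls;
    real trials also lose patients to other eligibility criteria.
    """
    if not 0.0 < prevalence <= 1.0:
        raise ValueError("prevalence must be in (0, 1]")
    if not 0.0 <= test_failure_rate < 1.0:
        raise ValueError("test_failure_rate must be in [0, 1)")
    return n_enrolled / (prevalence * (1.0 - test_failure_rate))


# Hypothetical target of 100 enrolled KRAS G12C-positive patients.
for prevalence in (0.03, 0.05):
    screened = expected_screened(100, prevalence, test_failure_rate=0.10)
    print(f"prevalence {prevalence:.0%}: about {screened:,.0f} patients screened")
```

Sharing one screening platform across tumor types or across several targeted agents spreads this cost, which is the practical argument for basket and platform designs.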

CTG examples from KRAS/BRAF/CRC:

NCT	Brief Title	Design	Biomarker Gate
NCT05312398	CAPRI 2 GOIM (cetuximab + FOLFIRI/FOLFOX)	Single-arm Phase 2	RAS/BRAF WT only
NCT02934529	mCRC RAS-wildtype maintenance	Phase 3	KRAS WT (RAS WT)
NCT01276379	Biomarkers in CRC with KRAS testing	Biomarker-stratified	KRAS G12 codon
NCT01543698	LGX818 + MEK162 (BRAF V600)	Phase 1b/2	BRAF V600 enriched

6.4 MSI-H / dMMR: Pan-Tumor Tissue-Agnostic Approval

The tissue-agnostic paradigm

Pembrolizumab received FDA accelerated approval in May 2017 for patients with unresectable or metastatic MSI-H (microsatellite instability-high) or dMMR (mismatch repair-deficient) solid tumors that have progressed after prior treatment. This was the first tumor-agnostic approval based on a biomarker rather than tumor histology (Keytruda FDA approval, BLA 125514).

Design: The approval rested on pooled data from five multi-cohort, single-arm studies (KEYNOTE-016, -164, -012, -028, and -158) enrolling patients with confirmed MSI-H or dMMR status across ≥15 cancer types. The design was:

Enriched: MSI-H/dMMR biomarker required for entry
No control arm: ORR primary endpoint with pre-specified threshold for accelerated approval
Heterogeneous histology: by design, demonstrating biomarker effect independent of tumor type
CDx: multiple tests accepted (IHC for MMR protein loss; PCR/NGS for MSI status)

Key statistical issue: With multiple tumor types, a single pooled ORR analysis across all histologies was the primary inference. This required pre-specification that the biomarker (not histology) was the primary enrichment factor. No multiplicity adjustment for individual tumor types was required because those were descriptive subsets of the biomarker-defined primary population.
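
A minimal sketch of the pooled primary inference: an exact (Clopper-Pearson) confidence interval for the pooled ORR, compared against a pre-specified benchmark. The responder count, sample size, and benchmark below are hypothetical and are not the KEYNOTE results. The interval is found by bisection on the binomial tail so that only the standard library is needed.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(func, target, iters=60):
    """Bisection on [0, 1] for a function that increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if func(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(responders, n, conf=0.95):
    """Exact two-sided confidence interval for a binomial proportion."""
    a = (1.0 - conf) / 2.0
    x = responders
    # Lower limit: p at which P(X >= x) equals a.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), a)
    # Upper limit: p at which P(X <= x) equals a.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1.0 - binom_cdf(x, n, p), 1.0 - a)
    return lower, upper


# Hypothetical pooled result and benchmark, for illustration only.
responders, n_total, benchmark = 60, 150, 0.25
low, high = clopper_pearson(responders, n_total)
print(f"pooled ORR {responders / n_total:.1%}, 95% CI {low:.1%} to {high:.1%}")
print("lower bound excludes benchmark:", low > benchmark)
```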

dMMR IHC vs. MSI-H by PCR/NGS:

dMMR by IHC: Loss of MLH1, MSH2, MSH6, or PMS2 protein expression
MSI-H by PCR: ≥2 of 5 standard microsatellite loci show instability
Concordance is high (~95%) but not 100%; regulatory labels typically accept both tests
FDA CDx: Multiple FDA-approved tests accepted (Promega MSI Analysis System, FoundationOne CDx, others)

Subsequent regular approval: KEYNOTE-158 and additional data supported conversion of the tissue-agnostic MSI-H/dMMR indication to regular approval in 2023. Separately, pembrolizumab monotherapy was approved in 2020 for first-line MSI-H/dMMR mCRC on the basis of KEYNOTE-177. CheckMate-142 (nivolumab ± ipilimumab) supported approval specifically in MSI-H/dMMR mCRC.

Tissue-agnostic design implications for statisticians:

The ITT for a tissue-agnostic study is the biomarker-defined population across all tumor types, not a single histology
Individual tumor-type cohorts are descriptive subsets — no multiplicity adjustment required for cohort-specific ORR
Power calculation must account for the expected distribution of tumor types and heterogeneous response rates
Minimum evaluable patients per cohort (typically ≥10) must be pre-specified for descriptive subgroup analyses
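
One way to account for heterogeneous response rates in the power calculation is to compute the exact distribution of the pooled responder count as a convolution of cohort-level binomials, then compare it with the critical value of an exact one-sided test against a null ORR. This is a sketch under stated assumptions: fixed cohort sizes, independent patients, and hypothetical cohort sizes, response rates, and null ORR.

```python
from math import comb


def binom_pmf_vector(n, p):
    """Probabilities of 0..n responders for one cohort."""
    return [comb(n, k) * p**k * (1.0 - p) ** (n - k) for k in range(n + 1)]


def convolve(a, b):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def critical_value(n, p0, alpha):
    """Smallest r such that P(X >= r | p0) <= alpha."""
    pmf = binom_pmf_vector(n, p0)
    tail = 0.0
    for r in range(n, -1, -1):
        tail += pmf[r]
        if tail > alpha:
            return r + 1
    return 0


def pooled_power(cohorts, p0, alpha=0.025):
    """cohorts: list of (n_patients, true_orr). Returns (critical value, power)."""
    n_total = sum(n for n, _ in cohorts)
    r_crit = critical_value(n_total, p0, alpha)
    dist = [1.0]
    for n, p in cohorts:
        dist = convolve(dist, binom_pmf_vector(n, p))
    return r_crit, sum(dist[r_crit:])


# Hypothetical mix of tumor-type cohorts and a hypothetical null ORR of 20%.
cohorts = [(40, 0.45), (30, 0.35), (20, 0.20), (10, 0.10)]
r_crit, power = pooled_power(cohorts, p0=0.20)
print(f"reject if pooled responders >= {r_crit}; power = {power:.3f}")
```

Re-running the calculation under alternative enrollment mixes shows how sensitive the pooled power is to which histologies actually accrue.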

7. Design Choice Summary Table

Scenario	Recommendation
Strong mechanistic biomarker evidence	Enriched design (biomarker+ only); CDx required at approval
Emerging biomarker evidence	All-comers + stratification; hierarchical primary BM+ → ITT
Uncertain biomarker threshold	Adaptive enrichment; specify decision rules and type I error control prospectively
Low biomarker prevalence (<20%)	Enriched design more efficient; basket/platform trial for ultra-rare markers
Biomarker-negative patients have meaningful residual benefit	All-comers with fallback alpha allocation; characterize BM- effect
Tissue-agnostic biomarker	Single-arm basket, biomarker as primary enrichment; ORR endpoint
CDx assay not yet approved	Use RUO assay in development; plan concurrent CDx submission
Multiple thresholds to test	Pre-specify hierarchical order before unblinding; graphical MCP
Multiple populations + endpoints	Graphical multiplicity approach (Bretz et al.); simulate power
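
The graphical multiplicity approach in the last two rows can be sketched as follows. This is a compact, simplified rendering of the sequentially rejective algorithm described by Bretz et al. (2009): each hypothesis holds a share of alpha, and when a hypothesis is rejected its share is passed along the graph's edges. The function name, weights, transition matrix, and p-values are hypothetical.

```python
def graphical_procedure(p_values, weights, transitions, alpha=0.025):
    """Return indices of rejected hypotheses, in order of rejection.

    weights: initial alpha shares (sum <= 1).
    transitions: matrix g where g[i][j] is the fraction of hypothesis i's
    share passed to hypothesis j when i is rejected (zero diagonal).
    """
    w = list(weights)
    g = [list(row) for row in transitions]
    active = set(range(len(p_values)))
    rejected = []
    while True:
        candidates = [i for i in sorted(active)
                      if w[i] > 0 and p_values[i] <= w[i] * alpha]
        if not candidates:
            return rejected
        j = candidates[0]
        active.remove(j)
        rejected.append(j)
        new_g = [row[:] for row in g]
        for l in active:
            w[l] += w[j] * g[j][l]  # pass on the rejected hypothesis's share
            for k in active:
                if l == k:
                    continue
                denom = 1.0 - g[l][j] * g[j][l]
                new_g[l][k] = ((g[l][k] + g[l][j] * g[j][k]) / denom
                               if denom > 0 else 0.0)
        w[j] = 0.0
        g = new_g


# Hypothetical two-population example: H0 = biomarker-positive, H1 = ITT.
# 80% of alpha starts on BM+, 20% on ITT; each passes its share to the other.
rejected = graphical_procedure(
    p_values=[0.015, 0.020],
    weights=[0.8, 0.2],
    transitions=[[0.0, 1.0], [1.0, 0.0]],
)
print("rejected hypotheses (by index):", rejected)
```

Equal weights with equal transitions reproduce the Holm procedure, which is a convenient sanity check before simulating power for a more elaborate graph.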

8. Limitations and Pitfalls

1. Post-hoc biomarker threshold selection: Selecting or refining the biomarker threshold after unblinding — even if the original threshold was pre-specified — is not acceptable for confirmatory claims. FDA: "Any such approach would need scrupulous attention to maintaining the blind, perhaps by using an independent group to do the biomarker analysis."

2. Assay non-interchangeability: Different CDx assays measuring the "same" biomarker (e.g., PD-L1 22C3 vs. 28-8) are not interchangeable. Clinical decisions and trial eligibility must reference the specific assay-threshold combination approved as CDx for that drug. Using an unapproved assay to select patients in a registrational trial creates a regulatory risk.

3. Interaction tests are underpowered: A Phase 3 trial sized for the overall treatment effect typically has only about 20–30% power to detect a treatment-by-biomarker interaction of similar magnitude, because the interaction estimate has at least four times the variance of the overall effect estimate. A non-significant interaction p-value therefore does not establish that the biomarker is non-predictive. Conversely, a significant interaction in a large trial may be statistically detectable but clinically trivial.
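
A rough normal-approximation sketch of why interaction power is low. It assumes the trial is powered for an overall effect, the subgroup effect estimates have variances inversely proportional to subgroup size, and the interaction (difference between subgroup effects) is `ratio` times the overall effect. The function and parameter names are illustrative, and the lower rejection tail is ignored.

```python
from statistics import NormalDist


def interaction_power(power_main=0.80, alpha=0.05, prevalence=0.5, ratio=1.0):
    """Approximate power of a two-sided treatment-by-biomarker interaction test.

    prevalence: fraction of patients who are biomarker-positive.
    ratio: interaction size divided by the overall effect the trial targets.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1.0 - alpha / 2.0)
    # Standardized overall effect implied by the design power.
    ncp_main = z_alpha + z.inv_cdf(power_main)
    # Var(interaction) = Var(overall) / (prevalence * (1 - prevalence)).
    ncp_int = ratio * ncp_main * (prevalence * (1.0 - prevalence)) ** 0.5
    return z.cdf(ncp_int - z_alpha)


for q in (0.5, 0.2):
    print(f"prevalence {q:.0%}: interaction power = {interaction_power(prevalence=q):.2f}")
```

With equal subgroups the interaction estimate has four times the variance of the overall estimate, so detecting an interaction as large as the overall effect with the original power would take about four times as many patients.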

4. Prognostic-predictive conflation: Selecting patients with poor prognosis (prognostic enrichment) and concluding that the biomarker is predictive is a common error. The relative treatment effect in a prognostically enriched population should be similar to that in lower-risk patients if the marker is purely prognostic. Formal interaction testing with a marker-negative comparator arm is needed to establish predictiveness.

5. Label scope vs. trial scope: An enriched trial produces evidence only in the biomarker-positive population. Claiming benefit in the broader unselected population from an enriched trial alone is not valid. Conversely, an all-comers trial failing the ITT primary does not prove absence of benefit in the biomarker-positive subgroup if that subgroup was underpowered.

6. CDx assay not locked at randomization: If the CDx assay is modified (sensitivity threshold, scoring algorithm, reagent change) during the trial, patients enrolled under the old assay definition may be misclassified under the new definition. FDA expects the CDx to be analytically validated and locked before pivotal enrollment begins.

7. Adaptive enrichment without pre-specification: Adaptive enrichment decisions made without complete pre-specification of decision rules and type I error control are not acceptable. FDA expects the full adaptive plan in the initial protocol and SAP, including all possible adaptation scenarios and their statistical consequences.

8. Basket trial heterogeneity: In tissue-agnostic basket trials, a positive pooled ORR may mask heterogeneous effects across tumor types. Regulatory reviewers will examine per-histology results; a biomarker-driven approval can be undermined if the evidence is driven by one dominant histology with known sensitivity.
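
A simple sensitivity check for that concern is a leave-one-histology-out analysis: recompute the pooled ORR with each tumor type removed and see whether the result depends on a single cohort. The cohort names and counts below are hypothetical, and this descriptive check does not replace a formal heterogeneity model.

```python
def leave_one_out_orr(cohorts):
    """cohorts: dict mapping histology -> (responders, n_patients).

    Returns the pooled ORR and, per histology, the pooled ORR with that
    histology removed.
    """
    total_r = sum(r for r, _ in cohorts.values())
    total_n = sum(n for _, n in cohorts.values())
    pooled = total_r / total_n
    loo = {}
    for name, (r, n) in cohorts.items():
        remaining_n = total_n - n
        loo[name] = ((total_r - r) / remaining_n
                     if remaining_n > 0 else float("nan"))
    return pooled, loo


# Hypothetical basket with one dominant histology.
cohorts = {
    "colorectal": (36, 90),
    "endometrial": (8, 20),
    "gastric": (3, 15),
    "other": (5, 25),
}
pooled, loo = leave_one_out_orr(cohorts)
print(f"pooled ORR: {pooled:.1%}")
for name, orr in loo.items():
    print(f"without {name}: {orr:.1%}")
```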

Sources:

FDA Guidance for Industry: Enrichment Strategies for Clinical Trials to Support Determination of Effectiveness of Human Drugs and Biological Products (March 2019, Final guidance). CDER/CBER.

FDA Guidance for Industry: Multiple Endpoints in Clinical Trials (draft January 2017; final guidance October 2022). CDER.

FDA Guidance for Industry and FDA Staff: In Vitro Companion Diagnostic Devices (August 2014, Final guidance). CDER/CDRH.

Bretz F, Maurer W, Hommel G (2009). A graphical approach to sequentially rejective multiple test procedures. Statistics in Medicine, 28(4):586–604.

Simon R, Maitournam A (2004). Evaluating the efficiency of targeted designs for randomized clinical trials. Clinical Cancer Research, 10(20):6759–6763.

ICH E9(R1) Addendum on Estimands and Sensitivity Analysis in Clinical Trials (November 2019, Final guideline).

ICH E20 Adaptive Clinical Trials (under development; draft 2023).

Polley M-YC et al. (2013). Statistical and practical considerations for clinical evaluation of predictive biomarkers. Journal of the National Cancer Institute, 105(22):1677–1683.

Status: Primary source is Final guidance (FDA 2019). All regulatory quotes are non-binding recommendations. Compiled from retrieved FDA guidance text, literature synthesis, and ClinicalTrials.gov records.