
Statistical Analysis Methods in Oncology Trials

Purpose: This article provides a comprehensive reference for biostatisticians on the statistical methods used in confirmatory oncology trials. It follows the organizing principle set out in ICH E9(R1): endpoint → estimand → estimator → assumptions → sensitivity analysis, and maps each method to the oncology-specific scenarios that drive method choice.


1. Organizing Principle: Endpoint → Estimand → Estimator → Assumptions → Sensitivity Analysis

The ICH E9(R1) Causal Chain

Every statistical analysis in a confirmatory oncology trial must follow a structured alignment:

Trial Objective
  └─ Estimand (5 attributes)
       ├─ Population
       ├─ Variable (endpoint)
       ├─ Treatment conditions
       ├─ Summary measure (HR, risk difference, mean difference, etc.)
        └─ Intercurrent event (IE) strategies (treatment policy, hypothetical, composite, principal stratum, while-on-treatment)
            └─ Main Estimator (statistical method aligned to estimand)
                 └─ Assumptions (documented, testable where possible)
                      └─ Sensitivity Analysis (same estimand, relaxed assumptions)

"The main estimator will be underpinned by certain assumptions. To explore the robustness of inferences from the main estimator to deviations from its underlying assumptions, a sensitivity analysis should be conducted." — ICH E9(R1) Addendum, §A.5 (Final, 2019)

Key regulatory requirement: The estimand, main estimator, assumptions, and sensitivity analyses must all be pre-specified in the Statistical Analysis Plan (SAP) before unblinding.

How Oncology Is Different

Oncology trials introduce specific methodological challenges that alter standard statistical method choices:

Challenge | Impact on Method Choice | Typical Setting
Non-proportional hazards (NPH) | Log-rank loses power; MaxCombo or RMST needed | IO trials (delayed separation); targeted therapy (early separation then convergence)
Treatment crossover | ITT OS estimate diluted; RPSFT/IPCW needed as sensitivity | Open-label trials with post-progression switch
Informative censoring | Kaplan-Meier biased; IPCW or sensitivity needed | PFS with differential dropout; PRO data
Assessment-driven bias | PFS timing depends on imaging schedule; IRC needed | Open-label PFS trials; unscheduled imaging
Multiple post-progression therapies | OS confounded by 4–6+ lines; dilutes signal | Advanced NSCLC, breast cancer, myeloma
Competing risks | Standard KM overestimates cumulative incidence | Adjuvant DFS (non-cancer death); AML (transplant)
Missing tumor assessments | PFS censoring rules change results | ~10–30% of assessments missed in Phase 3
Small populations | Exact methods needed; asymptotic methods unreliable | Rare molecular subtypes (NTRK, RET, BRAF V600E)

2. Time-to-Event Methods

2.1 Kaplan-Meier Estimation

What it does: Non-parametric estimation of the survival function S(t) = P(T > t), accounting for right censoring.

Assumptions:

  1. Independent censoring: Censoring is non-informative (censored patients have the same prognosis as those who remain at risk)
  2. No left truncation: All patients enter observation at time 0 (randomization)
  3. Event times are exact (not interval-censored)

Oncology use: Primary descriptive method for OS, PFS, DFS, EFS. Always reported alongside formal tests.

Key quantities reported:

  • Median survival time (with 95% CI via Brookmeyer-Crowley)
  • Milestone survival rates: S(12 mo), S(24 mo), S(36 mo) with 95% CI (Greenwood's formula)
  • Kaplan-Meier curves by treatment arm

R packages: survival::survfit(), ggsurvfit (for publication-quality plots)
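A minimal sketch of these quantities using survival::survfit() on simulated data (all arm names, rates, and follow-up assumptions here are hypothetical, not from any trial):

```r
library(survival)

set.seed(42)
n   <- 200
trt <- rep(0:1, each = n / 2)
# exponential event times: median ~12 mo control, ~18 mo experimental (hypothetical)
t_ev  <- rexp(n, rate = log(2) / ifelse(trt == 1, 18, 12))
t_cns <- runif(n, 6, 48)                  # administrative censoring
event <- as.integer(t_ev <= t_cns)
obs   <- pmin(t_ev, t_cns)

fit <- survfit(Surv(obs, event) ~ trt, conf.type = "log-log")
print(fit)                                # medians with 95% CIs by arm
summary(fit, times = c(12, 24))           # milestone S(12), S(24) with CIs
```

Plotting with ggsurvfit::ggsurvfit(fit) then layers risk tables and CIs in publication style.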

When KM fails in oncology:

  • Informative censoring (patients drop out because of disease progression) → IPCW adjustment
  • Competing risks (non-cancer death competes with cancer event) → Cumulative Incidence Function (CIF) via Aalen-Johansen instead

2.2 Log-Rank Test

What it does: Non-parametric hypothesis test comparing two survival distributions. Tests H₀: S₁(t) = S₂(t) for all t.

Formula:

χ² = [Σ (O₁ⱼ - E₁ⱼ)]² / Σ V₁ⱼ

where O₁ⱼ = observed events in group 1 at time j, E₁ⱼ = expected events under null, V₁ⱼ = hypergeometric variance.

Assumptions:

  1. Proportional hazards (PH): The hazard ratio is constant over time. The log-rank test has optimal power under PH.
  2. Independent censoring
  3. No tied event times (or handled via Breslow/Efron approximation)

When to use: Primary test for OS, PFS, DFS in most oncology Phase 3 trials under PH assumption.

Stratified log-rank test: When randomization is stratified by prognostic factors (e.g., ECOG, PD-L1 status, geographic region), the stratified log-rank test is the primary analysis:

χ²_stratified = [Σ_k Σ_j (O₁ⱼₖ - E₁ⱼₖ)]² / Σ_k Σ_j V₁ⱼₖ

where k indexes strata. This controls for stratum-level imbalances.

FDA position on stratification:

"Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol. The analysis plan should specify the primary analysis and one or more sensitivity analyses." — FDA Cancer Endpoints 2018 (Final)

Regulatory practice: The stratified log-rank test using the randomization stratification factors is the standard primary analysis. Using different strata in the analysis vs. randomization requires justification and is typically a sensitivity analysis.

R packages: survival::survdiff() (unstratified), nph::logrank.test() (weighted variants)
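Both the unstratified and stratified tests can be run with survival::survdiff(); the sketch below uses simulated data (names and rates hypothetical):

```r
library(survival)

set.seed(1)
n <- 300
d <- data.frame(trt = rep(0:1, each = n / 2))
d$ecog  <- rbinom(n, 1, 0.4)
d$time  <- rexp(n, rate = log(2) / ifelse(d$trt == 1, 16, 11))
d$event <- as.integer(d$time <= 36)       # administrative cutoff at 36 months
d$time  <- pmin(d$time, 36)

sd_unstrat <- survdiff(Surv(time, event) ~ trt, data = d)                 # unstratified
sd_strat   <- survdiff(Surv(time, event) ~ trt + strata(ecog), data = d)  # stratified
sd_strat
```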

When log-rank fails in oncology:

  • Delayed treatment effects (IO): Curves overlap for 2–4 months, then separate. Log-rank under-weights the late separation → loss of power.
  • Crossing hazards: Treatment helps early but harms late (or vice versa). Log-rank averages across the cross → near-zero test statistic.
  • Early separation then convergence: Targeted therapy effective initially, then resistance develops. Log-rank may still detect but with reduced power.

Solution: MaxCombo test or weighted log-rank (see Section 2.5).


2.3 Cox Proportional Hazards Regression

What it does: Semi-parametric model estimating the hazard ratio (HR) and its 95% CI, adjusting for covariates.

Model:

h(t | X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)

where h₀(t) is the unspecified baseline hazard, and exp(β₁) = HR for covariate X₁.

Assumptions:

  1. Proportional hazards: HR is constant over time for each covariate
  2. Log-linearity: Covariates act multiplicatively on the hazard
  3. Independent censoring
  4. No model misspecification (correct functional form for continuous covariates)

PH diagnostics:

  • Schoenfeld residuals test: cox.zph() in R; p < 0.05 suggests PH violation
  • Log-log survival plot: Parallel curves indicate PH; crossing/divergence indicates NPH
  • Time-dependent covariates: Include interaction with log(time) to test for time-varying HR

Stratified Cox model: When PH holds within strata but not overall, use:

h_k(t | X) = h₀ₖ(t) × exp(β₁X₁ + ... + βₚXₚ)

where each stratum k has its own baseline hazard h₀ₖ(t) but shares the same regression coefficients. This is the standard approach when stratification factors are used in randomization.

Oncology-specific covariates:

Covariate | Typical Use | Notes
Treatment arm | Primary | Estimates HR
ECOG PS (0 vs. 1) | Stratification factor | Always included if used in randomization
PD-L1 status (≥50% vs. <50%) | Stratification or subgroup | IO trials
Histology (squamous vs. nonsquamous) | Stratification factor | NSCLC trials
Geographic region | Stratification factor | Global trials
Biomarker status (EGFR, ALK, BRCA) | Enrichment/subgroup | Targeted trials
Prior lines of therapy | Stratification factor | 2L+ trials

R packages: survival::coxph() (standard), coxphf (Firth penalized for small samples)
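A stratified Cox fit with the PH diagnostics described above, on simulated data (treatment effect and prognostic rates are hypothetical):

```r
library(survival)

set.seed(7)
n <- 300
d <- data.frame(trt = rep(0:1, each = n / 2))
d$ecog <- rbinom(n, 1, 0.4)
# hazard: treatment HR ~0.7; ECOG 1 carries worse prognosis
rate    <- (log(2) / 12) * ifelse(d$trt == 1, 0.7, 1) * ifelse(d$ecog == 1, 1.5, 1)
d$time  <- rexp(n, rate)
d$event <- as.integer(d$time <= 36)
d$time  <- pmin(d$time, 36)

fit <- coxph(Surv(time, event) ~ trt + strata(ecog), data = d)
exp(coef(fit))                 # hazard ratio for treatment
exp(confint(fit))              # 95% CI on the HR scale
cox.zph(fit)                   # Schoenfeld residuals test of PH
```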

When Cox fails in oncology:

  • NPH (see MaxCombo, Section 2.5)
  • Very small samples (<30 events): Firth penalized Cox or exact conditional inference
  • Time-varying treatment effects: Extended Cox with time-dependent coefficients

2.4 Stratified Cox and Stratified Log-Rank: Handling Stratification Factors

Regulatory requirement: When randomization uses stratification factors, the primary analysis must use the same factors in the stratified analysis.

Common problem: Randomization stratification factors vs. analysis stratification factors may differ due to:

  • IRT/IXRS system stratification errors (e.g., patient misclassified by ECOG)
  • Small strata collapsed at analysis (e.g., geographic region strata with <10 patients)
  • Subgroup-specific analyses using different factor levels

FDA position:

"The SAP should specify the primary analysis and one or more sensitivity analyses... Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol." — FDA Cancer Endpoints 2018 (Final)

Best practice:

  1. Primary analysis: Stratified log-rank and stratified Cox using randomization strata (as recorded in IRT/IXRS system)
  2. Sensitivity 1: Stratified analysis using clinical strata (actual patient characteristics, correcting any IRT errors)
  3. Sensitivity 2: Unstratified log-rank and Cox (to confirm stratification does not create artificial effects)

Maximum number of strata: Aim for ≤8–12 total strata in primary analysis. Too many strata with small counts lead to:

  • Sparse stratum bias (HR estimates become unstable)
  • Information loss (patients in empty strata contribute nothing)
  • Rule of thumb: Each stratum should contain ≥10 events

R implementation:

# Stratified Cox (primary)
coxph(Surv(time, event) ~ treatment + strata(ecog, pdl1, region), data = trial)

# Stratified log-rank (primary)
survdiff(Surv(time, event) ~ treatment + strata(ecog, pdl1, region), data = trial)

2.5 Non-Proportional Hazards: MaxCombo Test and RMST

MaxCombo Test

When to use: Pre-specified primary or co-primary test when delayed treatment effects (IO) or crossing/converging hazards (targeted therapy) are anticipated.

Composition: The MaxCombo test takes the maximum of a panel of Fleming-Harrington FH(ρ, γ) weighted log-rank statistics:

Component | Weight | Optimal For | Clinical Scenario
FH(0,0) | 1 (equal weight) | Proportional hazards | Standard chemotherapy vs. chemotherapy
FH(1,0) | S(t) (early events weighted more) | Early separation, later convergence | Targeted therapy with resistance development
FH(0,1) | 1 − S(t) (late events weighted more) | Delayed treatment effects | IO trials (2–4 month lag before KM separation)
FH(1,1) | S(t) × [1 − S(t)] (middle events weighted) | Crossing or crossing-delayed patterns | Mixed IO + targeted combinations

Test statistic: MaxCombo = max(Z₀₀, Z₁₀, Z₀₁, Z₁₁), where each Zᵢⱼ is the standardized FH(i,j) statistic. The joint null distribution is estimated via multivariate normal theory (correlation structure from the shared at-risk set) or permutation.

Performance: A 2022 JAMA Oncology meta-analysis (Mukhopadhyay et al.) of 63 IO studies (35,902 patients) confirmed MaxCombo achieved significance in all 15 trials where log-rank failed.

Alpha control: MaxCombo is a single test whose critical value accounts for the correlation among the four component statistics, so no separate Bonferroni-type penalty across components is required.

Sample size: Use simtrial (Merck) or nph packages for simulation-based power calculation under assumed NPH pattern.

Worked example (Phase 3 IO, NSCLC 1L):

Assumption: Delayed separation — HR = 1.0 for months 0–3, HR = 0.65 for months 3+
Standard log-rank: 70% power with 350 events
MaxCombo FH(0,0)+FH(0,1)+FH(1,0)+FH(1,1): 85% power with 350 events
Gain: 15 percentage points of power recovered under NPH

R packages: simtrial (Merck), nph, nphsim
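The component statistics can be computed from first principles. The sketch below (simulated data, illustrative only) builds the FH(ρ, γ) weighted log-rank Z statistic using the pooled left-continuous KM estimate as the weight, then takes the maximum over the four components; a genuine MaxCombo p-value additionally requires the joint multivariate normal or permutation distribution, which the nph and simtrial packages provide.

```r
# From-scratch FH(rho, gamma) weighted log-rank Z statistic (illustrative sketch)
fh_logrank_z <- function(time, event, group, rho = 0, gamma = 0) {
  ut <- sort(unique(time[event == 1]))   # distinct event times
  s_pool <- 1                            # pooled KM S(t-), updated after each time
  num <- 0; v <- 0
  for (t in ut) {
    at_risk <- time >= t
    n  <- sum(at_risk)
    n1 <- sum(at_risk & group == 1)
    d  <- sum(event == 1 & time == t)
    d1 <- sum(event == 1 & time == t & group == 1)
    w  <- s_pool^rho * (1 - s_pool)^gamma          # FH weight at S(t-)
    num <- num + w * (d1 - d * n1 / n)             # weighted observed - expected
    v   <- v + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / max(n - 1, 1)
    s_pool <- s_pool * (1 - d / n)                 # update pooled KM past t
  }
  num / sqrt(v)
}

set.seed(3)
n   <- 400
grp <- rep(0:1, each = n / 2)
tt  <- rexp(n, rate = log(2) / ifelse(grp == 1, 15, 11))
ev  <- as.integer(tt <= 30)
tt  <- pmin(tt, 30)
z <- sapply(list(c(0, 0), c(1, 0), c(0, 1), c(1, 1)),
            function(p) fh_logrank_z(tt, ev, grp, p[1], p[2]))
max(abs(z))   # MaxCombo statistic (p-value needs the joint MVN or permutation)
```

With ρ = γ = 0 this reduces to the ordinary log-rank statistic, so z[1]² matches the survdiff chi-square when there are no tied event times.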

Restricted Mean Survival Time (RMST)

Definition: Area under the Kaplan-Meier curve from time 0 to pre-specified horizon t*:

RMST(t*) = ∫₀^{t*} S(t) dt

Interpretable as "average months alive (or event-free) through t* months."

Advantages over HR:

  • No PH assumption: Model-free, non-parametric
  • Absolute measure (expressed in time units): Clinically meaningful to clinicians and patients
  • Stable under NPH: When curves converge or cross, RMST difference remains interpretable; HR becomes misleading

Selecting t*: Pre-specify based on:

  • Clinical rationale (e.g., 24-month landmark for adjuvant DFS)
  • The minimum of the two arms' maximum follow-up times (ensures the KM estimate is defined at t* in both arms)
  • Regulatory interest (e.g., 12-month PFS rate for targeted therapy)

Regulatory status:

  • Accepted as secondary/supplementary endpoint by FDA and EMA
  • Encouraged alongside HR for transparency
  • Not yet accepted as sole primary analysis, though 2017–2018 JAMA Oncology papers (Pak et al.; Liang et al.) have influenced consideration

Treatment effect measures:

  • RMST difference: ΔRMST = RMST₁ − RMST₀ (absolute benefit in months)
  • RMST ratio: RMST₁ / RMST₀ (relative benefit)

R packages: survRM2 (standard for regulatory submissions), survival::survfit() + integration
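Because the KM curve is a step function, RMST(t*) is just the sum of rectangle areas under it. A base-R sketch on simulated data (rates hypothetical); survRM2::rmst2(time, status, arm, tau) is the standard tool for submissions:

```r
library(survival)

# RMST(t*) as the area under the KM step function up to t*
rmst_km <- function(time, event, tau) {
  fit  <- survfit(Surv(time, event) ~ 1)
  keep <- fit$time <= tau
  t <- c(0, fit$time[keep], tau)       # partition of [0, t*]
  s <- c(1, fit$surv[keep])            # S(t) is constant on each sub-interval
  sum(diff(t) * s)                     # step-function integral
}

set.seed(2)
x <- rexp(5000, rate = log(2) / 12)    # exponential, median 12 months, no censoring
rmst_km(x, rep(1, 5000), tau = 24)     # approx (1 - exp(-24 * lambda)) / lambda
```

For exponential survival the theoretical RMST(24) is (1 − e^(−24λ))/λ ≈ 13.0 months at λ = log(2)/12, which the estimate approaches for large n.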


3. Binary Endpoint Methods

3.1 Cochran-Mantel-Haenszel (CMH) Test

What it does: Tests association between treatment and binary response (e.g., ORR, pCR), adjusting for stratification factors via stratum-specific 2×2 tables.

When to use: Primary analysis for binary endpoints (ORR, pCR) in stratified randomized trials.

Formula (common odds ratio estimator):

OR_CMH = Σ_k (a_k × d_k / n_k) / Σ_k (b_k × c_k / n_k)

where a, b, c, d are cells in each 2×2 table k.

Assumptions:

  1. Homogeneous odds ratio across strata (no qualitative interaction)
  2. Sufficient stratum sizes (expected cell counts ≥5 per cell)
  3. Independent observations

Oncology use: ORR in metastatic trials (CMH test stratified by prior lines, ECOG, PD-L1).

R packages: stats::mantelhaen.test(), DescTools::CochranMantelHaenszel()
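A sketch with base stats::mantelhaen.test() on hypothetical response counts in two ECOG strata (all counts invented for illustration):

```r
# dims: response (yes/no) x arm (exp/ctrl) x stratum
tab <- array(c(30, 20, 18, 32,     # stratum 1 (ECOG 0): exp 30/50, ctrl 18/50
               22, 28, 12, 38),    # stratum 2 (ECOG 1): exp 22/50, ctrl 12/50
             dim = c(2, 2, 2),
             dimnames = list(response = c("yes", "no"),
                             arm      = c("exp", "ctrl"),
                             stratum  = c("ecog0", "ecog1")))
res <- mantelhaen.test(tab, correct = FALSE)   # CMH test + MH common odds ratio
res
```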


3.2 Logistic Regression

What it does: Models log-odds of binary response as a function of treatment and covariates.

logit(P(Y=1)) = β₀ + β₁ × Treatment + β₂ × X₂ + ... + βₚ × Xₚ

When to use:

  • Adjusted analysis of ORR, pCR with multiple covariates
  • Subgroup analyses (treatment × biomarker interaction)
  • Multivariable models for exploratory prognostic/predictive analyses

Assumptions:

  1. Correct functional form (linearity in log-odds for continuous covariates)
  2. Independence of observations
  3. No perfect separation (all 1s or all 0s in a covariate level)

Oncology-specific: Firth penalized logistic regression (logistf package) recommended when:

  • Events are rare (pCR rate <15% or >85%)
  • Small strata with zero events
  • Separation or near-separation (perfect prediction by a covariate)

R packages: stats::glm() (standard), logistf (Firth penalized)
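An adjusted ORR comparison with stats::glm() on simulated data (covariates, effect sizes, and response model are hypothetical):

```r
set.seed(11)
n <- 240
d <- data.frame(trt = rep(0:1, each = n / 2))
d$ecog <- rbinom(n, 1, 0.4)
p      <- plogis(-1 + 0.9 * d$trt - 0.6 * d$ecog)   # true ORs: ~2.5 (trt), ~0.55 (ECOG 1)
d$resp <- rbinom(n, 1, p)

fit <- glm(resp ~ trt + ecog, family = binomial, data = d)
exp(cbind(OR = coef(fit), confint.default(fit)))    # ORs with Wald 95% CIs
```

With rare events or separation, logistf::logistf() takes the same formula interface.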


3.3 Exact Methods for Sparse Settings

When to use: Single-arm Phase 2 trials, rare molecular subtypes (NTRK, BRAF V600E, RET, MET exon 14) with <50 evaluable patients.

Methods:

  • Clopper-Pearson exact CI for single-arm ORR: 95% CI for proportion based on binomial distribution
  • Fisher's exact test for 2×2 tables: When expected cell counts <5, CMH unreliable
  • Barnard's unconditional exact test: More powerful than Fisher's for small samples

Hypothesis test (single-arm ORR):

H₀: p ≤ p₀ (null response rate, e.g., 15%)
H₁: p ≥ p₁ (alternative response rate, e.g., 40%)
Test: Exact binomial test (one-sided α = 0.025)
Sample size: Simon's two-stage optimal or minimax design

Worked example (rare subtype, NTRK fusion):

Historical ORR = 15% (p₀); Target ORR = 45% (p₁)
α = 0.025 one-sided; β = 0.10 (90% power)
Simon's optimal two-stage: Stage 1: n₁ = 12, r₁ = 2 (stop if ≤2 responses)
                           Stage 2: n = 37, r = 12 (reject H₀ if ≥13 responses)
Expected sample size under H₀: 18.6

R packages: clinfun::ph2simon() (Simon's two-stage), exact2x2 (Fisher/Barnard)
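The final decision rule of the worked example maps directly onto base stats::binom.test(): observing the rejection boundary of 13 responses in 37 patients against p₀ = 0.15 gives the exact one-sided p-value, and the same function returns the Clopper-Pearson CI:

```r
# Exact binomial test at the Stage 2 boundary of the worked NTRK example
res <- binom.test(13, 37, p = 0.15, alternative = "greater")
res$p.value                      # one-sided exact p-value vs. p0 = 0.15
binom.test(13, 37)$conf.int      # two-sided 95% Clopper-Pearson CI for ORR
```

clinfun::ph2simon(pu = 0.15, pa = 0.45, ep1 = 0.025, ep2 = 0.10) reproduces the two-stage design search itself.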


4. Continuous and Longitudinal Endpoint Methods

4.1 ANCOVA (Analysis of Covariance)

What it does: Compares treatment groups on a continuous endpoint (e.g., change in tumor size, QoL score) adjusting for baseline value and covariates.

Model:

Y_post = β₀ + β₁ × Treatment + β₂ × Y_baseline + β₃ × Covariates + ε

When to use: Single post-baseline assessment (e.g., change from baseline in tumor burden at Week 12, QoL at a single timepoint).

Assumptions:

  1. Linear relationship between baseline and post-baseline
  2. Homogeneity of regression slopes (same baseline-outcome relationship in both arms)
  3. Normality of residuals (for inference; robust to violations with large N)
  4. Complete data at the analysis timepoint

Advantages over raw change: Adjusting for baseline increases precision (reduces residual variance) and controls for baseline imbalances.

Oncology use: Tumor burden change (% change from baseline in SLD at a fixed timepoint); single-timepoint QoL comparisons.

Limitation: Discards intermediate measurements and patients with missing post-baseline data. For repeated measurements, use MMRM.

R packages: stats::lm(), emmeans (for adjusted means and contrasts)
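A minimal ANCOVA sketch with stats::lm() on simulated Week-12 tumor burden data (baseline distribution and treatment effect are hypothetical):

```r
set.seed(5)
n <- 160
d <- data.frame(trt = rep(0:1, each = n / 2))
d$base <- rnorm(n, 70, 20)                          # baseline SLD (mm)
d$wk12 <- 0.8 * d$base - 10 * d$trt + rnorm(n, 0, 12)  # true effect: -10 mm

fit <- lm(wk12 ~ trt + base, data = d)
coef(summary(fit))["trt", ]      # adjusted mean difference, SE, t, p
confint(fit)["trt", ]            # 95% CI for the treatment effect
```

emmeans::emmeans(fit, ~ trt) would return the baseline-adjusted means themselves.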


4.2 MMRM (Mixed Model Repeated Measures)

What it does: Models correlated repeated measurements within subjects over time, using all available data under MAR (Missing At Random) assumption.

Model:

Y_it = μ + τ_treatment + γ_visit + (τ × γ)_treatment×visit + β × Y_baseline + ε_it

where ε ~ N(0, Σ) with Σ capturing within-subject correlation.

Key advantages over ANCOVA/LOCF:

  1. Uses all available data: No deletion of subjects with partial data
  2. Valid under MAR: Missingness can depend on observed data (not unobserved future values)
  3. Models correlation structure: Leverages within-subject correlation for efficiency
  4. Regulatory preference: FDA and EMA strongly prefer MMRM over ANCOVA/LOCF for longitudinal endpoints

Covariance structure selection:

Structure | Parameters | Best For | Assumption
Unstructured (UN) | p(p+1)/2 | Irregular visit intervals; ≤8 visits | None (most flexible)
AR(1) | 2 | Equally-spaced visits; correlation decays with time | Corr(i, j) = ρ^|i−j|
Compound Symmetry (CS) | 2 | All correlations equal (rarely realistic) | Constant correlation
Toeplitz | p | Equally-spaced visits; correlation depends on lag | Band structure

Selection: Use AIC/BIC to compare structures. For ≤6 visits, UN is safe default. For >6 visits, AR(1) or Toeplitz more parsimonious.

The MAR assumption in oncology:

  • When MAR holds: Patient drops out because QoL score at last visit was poor (observed) → dropout depends on observed data → MAR
  • When MAR fails (MNAR): Patient drops out because their unobserved next QoL score would be terrible → depends on unobserved data → MNAR → MMRM is biased
  • Practical guidance: In oncology, dropout often correlates with observed disease status (progression documented on imaging), making MAR reasonable for on-treatment data. Post-progression dropout is more likely MNAR.

Sensitivity to MAR violation: Use reference-based imputation (J2R, CIR) or tipping-point analysis (see Section 6).

R packages: mmrm (CRAN, purpose-built for confirmatory trials; recommended), nlme::lme(), lme4::lmer()
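As a sketch of the model structure, an MMRM with unstructured covariance can be emulated in nlme::gls() by combining an unstructured within-subject correlation with visit-specific variances (simulated data; names and effect sizes hypothetical). For confirmatory work the mmrm package fits the same model with Kenward-Roger degrees of freedom:

```r
library(nlme)

set.seed(9)
n <- 120; visits <- 3
d <- expand.grid(id = 1:n, visit = factor(1:visits))
d <- d[order(d$id), ]
d$trt <- rep(rep(0:1, each = n / 2), each = visits)
b     <- rep(rnorm(n, 0, 3), each = visits)          # subject-level shift -> correlation
d$chg <- b + 1.5 * d$trt * as.numeric(d$visit) + rnorm(nrow(d), 0, 2)

fit <- gls(chg ~ trt * visit,
           correlation = corSymm(form = ~ 1 | id),   # unstructured correlation
           weights     = varIdent(form = ~ 1 | visit),  # visit-specific variances
           data = d, na.action = na.omit)
summary(fit)$tTable                                  # treatment-by-visit estimates
```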


4.3 PRO (Patient-Reported Outcomes) in Oncology: Informative Dropout

Unique challenge: PRO data suffer from systematic informative dropout — sicker patients stop completing surveys, inflating apparent QoL in remaining responders.

Consequences:

  1. Standard MMRM treating dropout as MAR underestimates disease burden
  2. If one arm has more toxicity-driven dropouts, remaining population appears healthier (masking harm)
  3. Only 7.4% of 215 oncology RCTs adequately reported missing QoL data (Olivier et al., 2021)

Regulatory requirement: FDA and EMA now require explicit PRO missingness handling via IE strategies:

  • Treatment policy estimand: Analyze all data regardless of dropout (MMRM under MAR)
  • Hypothetical estimand: "What would QoL be if patient had not progressed?" — principal stratification or reference-based imputation
  • While-on-treatment estimand: Analyze only data collected while patient remains on study drug

SAP language template (PRO):

The primary PRO analysis uses MMRM with unstructured covariance modeling change from 
baseline in [PRO instrument] at each scheduled assessment visit, including treatment, 
visit, treatment-by-visit interaction, baseline score, and stratification factors as 
fixed effects. The analysis is conducted in the ITT population under the MAR assumption.

Sensitivity Analysis 1 (Reference-Based Imputation): Jump-to-Reference (J2R) imputation 
where post-dropout outcomes for the experimental arm are imputed using the control arm's 
observed trajectory, reflecting the treatment policy estimand under MNAR.

Sensitivity Analysis 2 (Tipping Point): Systematic perturbation of imputed values for 
dropout patients (δ = 0 to 1.0 in increments of 0.1) to quantify the departure from MAR 
required to reverse the primary conclusion.

5. Handling Stratification Factors in Primary Analysis

Regulatory Framework

FDA Cancer Endpoints 2018 (Final): Primary analysis should use the stratification factors from randomization. Discrepancies between IRT-recorded strata and actual patient characteristics should be addressed via sensitivity analysis.

Decision Framework

Scenario | Primary Analysis | Sensitivity Analysis
Strata match between IRT and clinical data | Stratified log-rank/Cox using IRT strata | Unstratified analysis
IRT strata differ from actual characteristics | Stratified analysis using IRT strata (primary) | Stratified using actual clinical strata
Some strata have <10 events | Collapse small strata or drop factor | Full stratification as sensitivity
>3 stratification factors (>12 combinations) | Use most prognostic 2–3 factors | Full set as sensitivity
Subgroup analysis (e.g., PD-L1 ≥50%) | Stratified within subgroup | Unstratified within subgroup

Common Stratification Errors

  1. Over-stratification: Too many factors × levels create empty or near-empty strata → unstable HR estimates. Rule of thumb: Total strata ≤ min(12, total events / 10).
  2. Mismatched strata: IRT/IXRS records differ from CRF (e.g., ECOG classified as 0 in IRT but 1 in CRF due to clerical error). Primary analysis uses IRT; sensitivity uses CRF.
  3. Post-hoc stratification: Adding stratification factors not used at randomization introduces bias. Report as exploratory only.

6. Oncology-Specific Issues That Change Method Choice

6.1 Delayed Treatment Effects (Immunotherapy)

Problem: IO agents require 2–4 months to prime T-cell response. Kaplan-Meier curves overlap initially, then separate. Log-rank weights all timepoints equally → loss of power.

Method adaptation:

  • Primary: MaxCombo test (pre-specified) OR weighted log-rank FH(0,1)
  • Supplementary: RMST difference at clinically relevant t* (e.g., 24 months)
  • Effect measure: Report both HR (Cox) and RMST difference for transparency
  • Sample size: Simulation-based under piecewise hazard assumption (HR₁ for months 0–3, HR₂ for months 3+)

SAP language:

The primary analysis of OS uses the MaxCombo test combining FH(0,0), FH(1,0), FH(0,1), 
and FH(1,1) weighted log-rank statistics, with the maximum standardized statistic as the 
test statistic and critical value determined via multivariate normal theory at one-sided 
α = 0.025. The hazard ratio from a stratified Cox model is reported as the primary 
treatment effect measure. RMST difference at 24 months is reported as a supplementary 
measure.

6.2 Treatment Crossover: RPSFT, 2SRST, and IPCW

Problem: When control-arm patients cross over to experimental therapy post-progression, ITT OS analysis underestimates the true treatment effect.

Methods (sensitivity analyses only — not primary):

Method | Mechanism | Key Assumption | When to Prefer
RPSFT | Estimates shrinkage factor ψ applied to post-crossover time | Accelerated failure time (constant treatment effect); rank preservation | Well-established drug efficacy; single switch direction
2SRST | Treats crossover as "second randomization"; re-censoring or IPW for post-switch phase | Two-stage independence; no unmeasured confounders | Late crossover; non-exponential survival
IPCW | Weights observations by inverse probability of remaining uncensored | Exchangeability (no unmeasured confounders); positivity; correct model specification | Multiple switches; complex crossover patterns

Regulatory status:

  • EMA: Accepts RPSFT and IPCW as supporting sensitivity analyses; prefers pre-specification
  • FDA: Increasingly skeptical of RPSFT without strong justification; views adjusted results as "what-if" scenarios
  • Both agencies: Crossover adjustment methods are secondary analyses only — not sole basis for approval
  • Only 19% of RPSFT applications in 65 oncology trials were methodologically appropriate (Prasad et al., 2023)

When RPSFT assumptions fail:

  • Crossover represents inappropriate/inadequate treatment
  • Multiple sequential therapies make disentangling effects impossible
  • The delay before switching outweighs initial therapy benefit

IPCW positivity failure: When P(switch | history) → 0 or 1 for subgroups, extreme weights arise. Truncation at 1st/99th percentile or max weight = 10 reduces variability but introduces bias.

R packages: rpsftm (RPSFT via g-estimation), ipw (IPCW), custom code for 2SRST
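As an illustration of the IPCW weighting mechanics, the base-R sketch below computes stabilized weights at a single time point from logistic models for the switching/censoring process. Real IPCW analyses use time-varying models over a fine time grid, and all covariates, coefficients, and the truncation rule here are hypothetical:

```r
set.seed(21)
n <- 300
d <- data.frame(ecog = rbinom(n, 1, 0.4), progressed = rbinom(n, 1, 0.5))
# probability of artificial censoring (e.g., switching therapy) depends on history
p_cens     <- plogis(-2 + 1.2 * d$progressed + 0.5 * d$ecog)
d$censored <- rbinom(n, 1, p_cens)

num <- glm(censored ~ 1, family = binomial, data = d)                  # marginal model
den <- glm(censored ~ progressed + ecog, family = binomial, data = d)  # conditional model
# stabilized weight for the uncensored: P(uncensored) / P(uncensored | history)
d$sw <- ifelse(d$censored == 0,
               (1 - fitted(num)) / (1 - fitted(den)), 0)
# truncate extreme weights at the 99th percentile, per the positivity note above
q99  <- quantile(d$sw[d$censored == 0], 0.99)
d$sw <- pmin(d$sw, q99)
summary(d$sw[d$censored == 0])
```

The weighted sample then feeds a weighted KM or Cox fit (e.g., coxph with the weights argument).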


6.3 Informative Censoring

Problem: In PFS analysis, patients may be censored due to events related to their prognosis (e.g., clinical deterioration triggers early discontinuation before next scheduled scan documents progression).

Detection: Compare baseline characteristics and post-baseline trajectories of censored vs. uncensored patients. Formal test: permutation test on censoring indicators.

Method adaptation:

  • Primary: Standard KM with pre-specified censoring rules (FDA Appendix C/D tables)
  • Sensitivity 1: Count early dropouts (without documented progression) as events (worst-case)
  • Sensitivity 2: IPCW-adjusted KM if strong evidence of informative censoring

FDA PFS censoring scheme (Cancer Endpoints 2018, Final):

Situation | Date | Outcome
Incomplete/no baseline tumor assessments | Randomization | Censored
Progression documented between scheduled visits | Earliest date of progression | Progressed
Progression documented after ≥2 missed visits | Last adequate assessment before the missed visits | Censored
No progression; still on treatment | Last adequate assessment date | Censored
Death without prior documented progression | Date of death | Progressed (for PFS)
New anti-cancer therapy before progression | Last adequate assessment before new therapy | Censored
Lost to follow-up / withdrawal | Last adequate assessment date | Censored

Sensitivity analyses (FDA recommends ≥2):

  1. PFS-1 (uniform dates): Assign progression to scheduled visit midpoints (corrects for differential assessment timing)
  2. PFS-2 (conservative): Treat all unassessed/missed visits as events
  3. PFS-3 (investigator assessment): Use investigator PFS if IRC PFS is primary (for open-label trials)

6.4 Missing Tumor Assessments

Problem: 10–30% of scheduled tumor assessments are missed in Phase 3 oncology trials, creating gaps in PFS determination.

"Substantial numbers of missing tumor assessments can potentially overestimate or underestimate treatment differences." — FDA Cancer Endpoints 2018 (Final)

Analysis strategies:

Missing Pattern | Primary Handling | Sensitivity
Single missed visit, then progression at next visit | Use next-visit progression date (event) | Impute progression at midpoint of missed interval
Multiple missed visits, then progression | Use progression date (event); note gap | Count earliest missed visit as event date (conservative)
Missing baseline assessments | Censor at randomization | Exclude from primary; include in sensitivity
Missing final assessment (dropout without progression) | Censor at last adequate assessment | Count as event (worst-case)

6.5 Competing Risks

Problem: In adjuvant settings (DFS), non-cancer death is a competing event that prevents observation of cancer recurrence. Standard KM overestimates cumulative incidence of cancer events.

Methods:

  • Primary: KM with all-cause DFS (non-cancer deaths counted as events) — FDA-preferred
  • Sensitivity: Cumulative Incidence Function (CIF) via Aalen-Johansen estimator; Fine-Gray subdistribution hazard model

Fine-Gray model:

h_subdist(t | X) = h₀_subdist(t) × exp(β₁X₁ + ... + βₚXₚ)

Estimates the subdistribution hazard ratio (sdHR) — the effect of treatment on the cumulative incidence of the event of interest, accounting for competing risks.

When to use: Adjuvant breast cancer (non-cancer death competes with recurrence), AML (transplant-related mortality competes with relapse), elderly populations (comorbidity-driven mortality).

R packages: cmprsk::cuminc() (CIF), cmprsk::crr() (Fine-Gray), tidycmprsk (tidy interface)
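In recent versions of the survival package, coding the endpoint as a factor whose first level is censoring makes survfit() return the Aalen-Johansen CIF directly. A sketch on simulated adjuvant data (all rates hypothetical); cmprsk::cuminc() gives the same estimate:

```r
library(survival)

set.seed(13)
n <- 500
t_rec <- rexp(n, 0.06)     # time to recurrence
t_dth <- rexp(n, 0.02)     # time to non-cancer death (competing)
t_cns <- runif(n, 0, 60)   # administrative censoring
obs <- pmin(t_rec, t_dth, t_cns)
st  <- ifelse(obs == t_rec, "recurrence", ifelse(obs == t_dth, "death", "censor"))
st  <- factor(st, levels = c("censor", "recurrence", "death"))

fit <- survfit(Surv(obs, st) ~ 1)      # multi-state fit -> Aalen-Johansen CIF
summary(fit, times = c(12, 24, 36))    # cumulative incidence per cause
```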


6.6 Reference-Based Imputation for Missing Data

When to use: Treatment policy estimand where MAR is implausible — patients who discontinue the experimental drug likely have outcomes resembling control-arm patients.

Methods:

Method | Post-Dropout Assumption | When to Use
Jump to Reference (J2R) | Mean trajectory "jumps" to control arm at dropout | Abrupt effect loss (e.g., stopping active drug)
Copy Increments in Reference (CIR) | Rate of change copies control arm's increments after dropout | Gradual effect waning
Copy Reference (CR) | Entire trajectory copies control arm | Extreme conservative bound

Tipping point analysis: Systematically varies the departure from MAR (δ = 0 to 1.0) to find the value δ* at which significance reverses. If δ* > 0.8, the result is robust; if δ* < 0.2, it is fragile.

R packages: rbmi (CRAN, modern standard for regulatory submissions), mice (general MI)
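A base-R sketch of the tipping-point mechanics (simulated data; the dropout mechanism, single-imputation shortcut, and δ grid are all hypothetical — a real analysis would use proper multiple imputation via rbmi):

```r
set.seed(17)
n <- 200
trt     <- rep(0:1, each = n / 2)
y       <- rnorm(n, mean = 5 + 2.2 * trt, sd = 3)   # observed change scores
dropped <- rbinom(n, 1, 0.2)                        # ~20% dropout (MCAR for simplicity)
y_mar   <- rnorm(n, mean = 5 + 2.2 * trt, sd = 3)   # MAR-style imputation draws

p_at_delta <- function(delta) {
  # penalize imputed values in the experimental arm by delta; controls keep MAR draws
  y2 <- ifelse(dropped == 1, ifelse(trt == 1, y_mar - delta, y_mar), y)
  t.test(y2[trt == 1], y2[trt == 0])$p.value
}

deltas <- seq(0, 10, by = 0.5)
pvals  <- sapply(deltas, p_at_delta)
deltas[which(pvals > 0.05)[1]]   # smallest delta that overturns significance (NA if none)
```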


7. Decision Table: Endpoint/Scenario → Primary Method → Sensitivity Methods

Endpoint | Scenario | Primary Method | Effect Measure | Sensitivity Methods | R Package
OS | PH assumed | Stratified log-rank | HR (Cox) | Unstratified log-rank; RMST | survival
OS | NPH (IO, delayed effect) | MaxCombo | HR + RMST | Standard log-rank; piecewise Cox | simtrial, nph
OS | Crossover present | Stratified log-rank (ITT) | HR (Cox) | RPSFT, IPCW, 2SRST (sensitivity) | rpsftm, ipw
PFS | Open-label, PH | Stratified log-rank (IRC) | HR (Cox) | Investigator PFS; PFS-1/PFS-2 censoring | survival
PFS | Open-label, NPH (IO) | MaxCombo (IRC) | HR + RMST | Standard log-rank; milestone PFS rates | simtrial, survRM2
PFS | Informative censoring suspected | Stratified log-rank | HR (Cox) | IPCW-KM; worst-case (dropouts = events) | ipw, survival
DFS (adjuvant) | All-cause, competing risks | Stratified log-rank (all-cause DFS) | HR (Cox) | Fine-Gray (cancer-specific CIF) | survival, cmprsk
EFS (neoadjuvant) | Composite endpoint | Stratified log-rank | HR (Cox) | DFS as sensitivity; pCR subgroup analysis | survival
ORR (randomized) | Stratified comparison | CMH test | Risk difference, OR | Logistic regression; unstratified Fisher | stats
ORR (single-arm) | Rare molecular subtype | Exact binomial test | Proportion + Clopper-Pearson CI | Simon's two-stage design bounds | clinfun
pCR (neoadjuvant) | Stratified comparison | CMH test | Risk difference | Logistic regression; subgroup by biomarker | stats
QoL / PRO | Repeated measures, dropout | MMRM (UN covariance) | LS mean difference | J2R imputation; tipping point; while-on-treatment | mmrm, rbmi
Continuous (single timepoint) | Baseline-adjusted | ANCOVA | Adjusted mean difference | Rank-based (Wilcoxon) if non-normal | stats, emmeans
Continuous (longitudinal) | Repeated measures | MMRM | LS mean difference at each visit | ANCOVA at final visit; LOCF (historical comparison) | mmrm

8. SAP Language Templates

Template 1: Primary TTE Analysis (OS or PFS, Proportional Hazards)

PRIMARY ANALYSIS
================
The primary analysis of [OS / PFS by IRC] will be performed in the Intent-to-Treat 
(ITT) population, defined as all randomized patients analyzed according to their 
randomized treatment assignment.

The primary test is the stratified log-rank test, stratified by [list randomization 
stratification factors: e.g., ECOG performance status (0 vs. 1), PD-L1 status 
(≥50% vs. <50%), geographic region (North America vs. Europe vs. Asia-Pacific)].

The primary treatment effect measure is the hazard ratio (HR) estimated from a 
stratified Cox proportional hazards model using the same stratification factors, 
with the 95% confidence interval and two-sided p-value from the Wald test.

Kaplan-Meier estimates of the survival function will be provided for each treatment 
arm, including median survival time (with 95% CI via Brookmeyer-Crowley method) 
and milestone survival rates at [12, 24, 36] months (with 95% CI via Greenwood's 
formula).
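As an illustration, the KM quantities above can be computed from first principles. The following is a minimal numpy sketch on hypothetical data; it reports the median and a milestone estimate with a plain-scale Greenwood CI (production analyses typically apply a log or log(-log) transform, and the Brookmeyer-Crowley median CI is omitted here):

```python
import numpy as np

# Hypothetical illustrative data: times in months, event = 1 (death) / 0 (censored)
time  = np.array([3, 5, 6, 6, 8, 10, 12, 12, 15, 18, 20, 24, 24, 30, 36])
event = np.array([1, 1, 1, 0, 1,  1,  1,  0,  1,  1,  0,  1,  0,  1,  0])

# Kaplan-Meier product-limit estimate with Greenwood variance
order = np.argsort(time)
t, d = time[order], event[order]
surv, var_sum = 1.0, 0.0
km_t, km_s, km_se = [], [], []
for u in np.unique(t[d == 1]):          # distinct event times
    n_risk = np.sum(t >= u)             # number at risk just before u
    n_ev = np.sum((t == u) & (d == 1))  # deaths at u
    surv *= 1.0 - n_ev / n_risk
    var_sum += n_ev / (n_risk * (n_risk - n_ev))
    km_t.append(u); km_s.append(surv)
    km_se.append(surv * np.sqrt(var_sum))   # Greenwood's formula
km_t, km_s, km_se = map(np.array, (km_t, km_s, km_se))

def milestone(t_star):
    """S(t*) with a plain-scale 95% CI from Greenwood's formula."""
    idx = np.searchsorted(km_t, t_star, side="right") - 1
    if idx < 0:
        return 1.0, (1.0, 1.0)
    s, se = km_s[idx], km_se[idx]
    return s, (max(0.0, s - 1.96 * se), min(1.0, s + 1.96 * se))

# Median: first event time at which S(t) drops to <= 0.5
median = km_t[np.argmax(km_s <= 0.5)] if np.any(km_s <= 0.5) else np.inf
s12, ci12 = milestone(12)   # 12-month milestone survival
```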

SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Unstratified): Unstratified log-rank test and unstratified 
Cox model to assess the impact of stratification on the primary result.

Sensitivity Analysis 2 (Clinical Strata): Stratified analysis using actual patient 
characteristics from CRF (rather than IRT-recorded strata) to assess impact of 
stratification errors.

Sensitivity Analysis 3 (PFS Censoring — conservative): [For PFS only] Patients who 
discontinue study treatment or initiate subsequent therapy without documented 
progression are counted as PFS events at the date of discontinuation/new therapy start.

Sensitivity Analysis 4 (RMST): Restricted mean survival time difference at [24] months 
reported as supplementary measure. RMST does not require the proportional hazards 
assumption.

PROPORTIONAL HAZARDS ASSESSMENT
================================
The proportional hazards assumption will be assessed via Schoenfeld residuals test 
(cox.zph) and visual inspection of log-log survival plots. If evidence of NPH is 
detected (p < 0.10 for Schoenfeld test), results of the MaxCombo test and RMST 
analysis will be reported alongside the primary log-rank result.

Template 2: Primary TTE Analysis Under Non-Proportional Hazards (IO Trial)

PRIMARY ANALYSIS
================
The primary analysis of [OS / PFS] will be performed in the ITT population using 
the MaxCombo test, which combines four Fleming-Harrington weighted log-rank 
statistics: FH(0,0), FH(1,0), FH(0,1), and FH(1,1). The test statistic is the 
maximum of the four standardized statistics, with the critical value determined 
from the asymptotic multivariate normal distribution at one-sided α = [0.025].
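The MaxCombo construction can be sketched by computing the four standardized Fleming-Harrington statistics directly. The data below are hypothetical, and the final calibration step (the critical value from the asymptotic joint multivariate normal null) is omitted; only the component statistics and their maximum are shown:

```python
import numpy as np

def fh_logrank_z(time, event, arm, rho, gamma):
    """Standardized Fleming-Harrington FH(rho, gamma) weighted log-rank statistic."""
    time, event, arm = map(np.asarray, (time, event, arm))
    s_pool = 1.0                       # left-continuous pooled KM estimate, S(t-)
    num = den = 0.0
    for u in np.unique(time[event == 1]):
        at_risk = time >= u
        n, n1 = at_risk.sum(), (at_risk & (arm == 1)).sum()
        d = ((time == u) & (event == 1)).sum()
        d1 = ((time == u) & (event == 1) & (arm == 1)).sum()
        w = (s_pool ** rho) * ((1.0 - s_pool) ** gamma)
        num += w * (d1 - d * n1 / n)   # weighted observed - expected
        if n > 1:
            den += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_pool *= 1.0 - d / n          # update pooled KM after using S(t-)
    return num / np.sqrt(den)

# Hypothetical delayed-separation data: control deaths early, experimental later
time  = np.array([2, 4, 5, 7, 8, 10, 11, 13, 3, 6, 9, 12, 14, 16, 18, 20], float)
event = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
arm   = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1])

z = {(r, g): fh_logrank_z(time, event, arm, r, g)
     for (r, g) in [(0, 0), (1, 0), (0, 1), (1, 1)]}
z_max = max(abs(v) for v in z.values())   # MaxCombo statistic (calibration omitted)
```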

The HR from a stratified Cox proportional hazards model (with 95% CI) is reported as 
the primary treatment effect measure, stratified by [stratification factors]; under 
NPH this HR is interpreted as an average effect over the follow-up period.

SUPPLEMENTARY ANALYSES
======================
RMST Difference: The restricted mean survival time difference at [t* = 24 months] 
is reported with 95% CI as a model-free absolute treatment effect measure.
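Since RMST is simply the area under the KM curve up to t*, it can be computed directly. A minimal numpy sketch on hypothetical data (CI estimation omitted):

```python
import numpy as np

def rmst(time, event, t_star):
    """Restricted mean survival time: area under the KM curve up to t_star."""
    order = np.argsort(time)
    t, d = np.asarray(time)[order], np.asarray(event)[order]
    uniq = np.unique(t[d == 1])
    uniq = uniq[uniq <= t_star]
    area, prev_t, surv = 0.0, 0.0, 1.0
    for u in uniq:
        area += surv * (u - prev_t)        # rectangle under current KM step
        n = np.sum(t >= u)
        dd = np.sum((t == u) & (d == 1))
        surv *= 1.0 - dd / n
        prev_t = u
    area += surv * (t_star - prev_t)       # final piece out to t_star
    return area

# Hypothetical data (months); experimental arm has longer survival
arm_t = np.array([3, 6, 9, 12, 14, 16, 18, 20])
arm_e = np.array([1, 1, 1, 1, 1, 0, 0, 0])
ctl_t = np.array([2, 4, 5, 7, 8, 10, 11, 13])
ctl_e = np.ones(8, dtype=int)

# RMST difference at t* = 12 months (months of survival gained, on average)
delta = rmst(arm_t, arm_e, 12) - rmst(ctl_t, ctl_e, 12)
```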

Piecewise Cox Model: A piecewise Cox model with a change-point at [3 months] is 
fitted to estimate early-phase HR (months 0–3) and late-phase HR (months 3+), 
characterizing the delayed treatment effect pattern.
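The early/late split can be illustrated with a piecewise-exponential approximation, estimating each period's HR as a ratio of event rates (events over person-time) rather than fitting a piecewise Cox model. The data and the 3-month change-point below are hypothetical:

```python
import numpy as np

def exposure_events(time, event, lo, hi):
    """Person-time contributed to, and events occurring in, the interval [lo, hi)."""
    t = np.asarray(time, dtype=float)
    pt = np.clip(np.minimum(t, hi) - lo, 0.0, None).sum()
    ev = np.sum((np.asarray(event) == 1) & (t >= lo) & (t < hi))
    return pt, ev

def piecewise_hr(t1, e1, t0, e0, cut):
    """Event-rate ratio (experimental / control) before and after the change-point."""
    out = {}
    for label, lo, hi in [("early", 0.0, cut), ("late", cut, np.inf)]:
        pt1, d1 = exposure_events(t1, e1, lo, hi)
        pt0, d0 = exposure_events(t0, e0, lo, hi)
        out[label] = (d1 / pt1) / (d0 / pt0)
    return out

# Hypothetical delayed-effect data (months)
exp_t = np.array([1.0, 2.5, 7, 9, 12, 14, 16, 20])
exp_e = np.array([1, 1, 1, 1, 1, 1, 0, 0])
ctl_t = np.array([0.5, 1.5, 2, 4, 5, 6, 8, 10])
ctl_e = np.ones(8, dtype=int)

hr = piecewise_hr(exp_t, exp_e, ctl_t, ctl_e, cut=3.0)
# Delayed-effect pattern: late-period HR well below the early-period HR
```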

Standard Log-Rank: The standard (unweighted) log-rank test is reported as a 
sensitivity analysis to assess whether conclusions differ under PH assumption.

Template 3: Crossover-Adjusted OS (Sensitivity Analysis)

CROSSOVER ADJUSTMENT (SENSITIVITY ANALYSIS)
============================================
To estimate the treatment effect on OS in the absence of crossover from [control arm] 
to [experimental therapy], the following pre-specified sensitivity analyses are 
conducted:

1. RPSFT (Rank-Preserving Structural Failure Time): Estimates the counterfactual 
   survival time had control-arm patients not received [experimental therapy] after 
   progression. The causal parameter ψ (a log acceleration factor) is estimated via 
   g-estimation. The accelerated failure time assumption is documented and its 
   plausibility assessed.

2. IPCW (Inverse Probability of Censoring Weighting): Weights observations by the 
   inverse probability of remaining uncensored (not having crossed over), estimated 
   from a logistic model including [baseline covariates: ECOG, stage, biomarker status, 
   prior lines]. Weight truncation at the 1st and 99th percentiles is applied to 
   stabilize estimates.

Both adjusted HR estimates are reported alongside the primary ITT analysis. These 
analyses are exploratory and not the basis for efficacy claims. The ITT analysis 
remains the primary analysis for regulatory decision-making.
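The RPSFT g-estimation step can be sketched as a grid search for the ψ at which the counterfactual survival times U(ψ) look exchangeable between randomized arms (standardized log-rank statistic near zero). This toy example omits recensoring and CI construction, and assumes every control patient crossed over at month 2; all data are hypothetical:

```python
import numpy as np

def logrank_z(time, event, arm):
    """Standardized (unweighted) log-rank statistic; arm = 1 is experimental."""
    time, event, arm = map(np.asarray, (time, event, arm))
    num = den = 0.0
    for u in np.unique(time[event == 1]):
        at_risk = time >= u
        n, n1 = at_risk.sum(), (at_risk & (arm == 1)).sum()
        d = ((time == u) & (event == 1)).sum()
        d1 = ((time == u) & (event == 1) & (arm == 1)).sum()
        num += d1 - d * n1 / n
        if n > 1:
            den += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num / np.sqrt(den)

# Hypothetical data (months): every control patient crossed over at month 2
t_exp = np.array([8.0, 12, 16, 20, 24, 28]); e_exp = np.ones(6, dtype=int)
t_ctl = np.array([4.0, 8, 12, 16, 20, 24]);  e_ctl = np.ones(6, dtype=int)
t_switch = np.full(6, 2.0)                   # crossover times, control arm

def counterfactual_z(psi):
    # U(psi) = T_off + exp(psi) * T_on, applied to both arms; recensoring omitted
    u_exp = np.exp(psi) * t_exp              # on experimental drug throughout
    t_on = np.clip(t_ctl - t_switch, 0.0, None)
    u_ctl = np.minimum(t_ctl, t_switch) + np.exp(psi) * t_on
    time = np.concatenate([u_exp, u_ctl])
    event = np.concatenate([e_exp, e_ctl])
    armv = np.concatenate([np.ones(6), np.zeros(6)])
    return logrank_z(time, event, armv)

# g-estimation: pick psi where the counterfactual groups look exchangeable (Z ~ 0)
grid = np.linspace(-1.5, 0.5, 41)
zvals = np.array([counterfactual_z(p) for p in grid])
psi_hat = grid[np.argmin(np.abs(zvals))]
```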

Template 4: Binary Endpoint (ORR, pCR)

PRIMARY ANALYSIS
================
The primary analysis of [ORR / pCR] will be performed in the [ITT / evaluable] 
population using the stratified Cochran-Mantel-Haenszel (CMH) test, stratified by 
[randomization stratification factors].

The primary treatment effect measure is the risk difference (experimental − control) 
with 95% CI estimated via the Newcombe-Wilson method. The Mantel-Haenszel common 
odds ratio is also reported with 95% CI.
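The 1-df CMH statistic and the Mantel-Haenszel common odds ratio can be computed by hand; a sketch on hypothetical stratified tables (the Newcombe-Wilson risk-difference CI is omitted):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical stratified 2x2 tables: rows = (experimental, control),
# columns = (responder, non-responder); one table per randomization stratum
tables = [
    np.array([[30, 20], [20, 30]]),   # stratum 1 (e.g., ECOG 0)
    np.array([[15, 25], [10, 30]]),   # stratum 2 (e.g., ECOG 1)
]

num = var = or_num = or_den = 0.0
for tab in tables:
    a, b = tab[0]
    c, d = tab[1]
    n = a + b + c + d
    num += a - (a + b) * (a + c) / n    # observed - expected responders, exp arm
    var += (a + b) * (c + d) * (a + c) * (b + d) / (n**2 * (n - 1))
    or_num += a * d / n                 # Mantel-Haenszel common OR components
    or_den += b * c / n

cmh_chi2 = num**2 / var                 # 1-df CMH statistic (no continuity correction)
p_value = chi2.sf(cmh_chi2, df=1)
or_mh = or_num / or_den
```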

SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Logistic Regression): Multivariable logistic regression 
adjusting for [treatment, stratification factors, baseline tumor burden] to estimate 
the adjusted odds ratio.

Sensitivity Analysis 2 (Unstratified): Fisher's exact test (for sparse strata) or 
chi-square test (for adequate cell counts) without stratification.

[For single-arm trials]:
The primary analysis of ORR uses the exact binomial test at one-sided α = 0.025, 
testing H₀: p ≤ [p₀] vs. H₁: p > [p₀]. The 95% CI for the observed response rate 
is calculated via the Clopper-Pearson exact method.

Template 5: Longitudinal PRO (MMRM with Reference-Based Sensitivity)

PRIMARY ANALYSIS
================
The primary PRO analysis uses a Mixed Model for Repeated Measures (MMRM) with 
unstructured covariance, modeling change from baseline in [PRO instrument score] 
at each scheduled assessment visit ([Weeks 6, 12, 18, 24]). Fixed effects include 
treatment, visit, treatment-by-visit interaction, baseline score, and [stratification 
factors]. The analysis is conducted in the ITT population. The primary timepoint 
for inference is [Week 24].

Missing data are handled implicitly under the MAR assumption. If the MAR assumption 
is judged implausible based on pattern-mixture analysis of dropout mechanisms, 
sensitivity analyses below apply.

SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Jump-to-Reference, J2R): Post-dropout outcomes for the 
experimental arm are imputed using the control arm's observed trajectory from the 
dropout timepoint forward. Multiple imputation (M = 100 datasets) with Rubin's 
rules pooling.

Sensitivity Analysis 2 (Copy Increments in Reference, CIR): Post-dropout rate of 
change imputed from the control arm's observed increments, preserving the individual's 
level at dropout.

Sensitivity Analysis 3 (Tipping Point): Systematic perturbation of J2R imputation 
(δ = 0.0 to 1.0 in 0.1 increments) to identify the departure from MAR required to 
reverse the primary conclusion. Results reported as tipping-point plot.
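A tipping-point search can be sketched by penalizing reference-based imputed values and re-testing at each δ. This toy version uses a Welch t-test in place of MMRM with multiple imputation, and a δ grid in raw score points rather than the protocol's standardized increments; all data are hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

# Hypothetical week-24 change-from-baseline scores (higher = better)
exp_obs = np.array([8.0, 7, 9, 6, 8, 7, 9, 8])         # experimental completers
ctl_obs = np.array([2.0, 3, 1, 4, 2, 3, 2, 3, 1, 4])   # control completers
n_dropout = 4                                          # experimental dropouts

grid = np.arange(0.0, 10.5, 1.0)   # delta penalty in raw score points (toy grid)
pvals = []
for delta in grid:
    # J2R-style imputation: dropouts set to the control mean, then penalized by delta
    imputed = np.full(n_dropout, ctl_obs.mean() - delta)
    exp_full = np.concatenate([exp_obs, imputed])
    pvals.append(ttest_ind(exp_full, ctl_obs, equal_var=False).pvalue)

# Tipping point: smallest delta at which significance is lost
tip_delta = next((d for d, p in zip(grid, pvals) if p >= 0.05), None)
```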

9. Limitations and Pitfalls

1. Log-rank under NPH gives misleading p-values:

The log-rank p-value can be non-significant even with clinically meaningful late benefit (IO). Always pre-specify NPH-aware tests if delayed effects are anticipated.

2. Cox HR is not meaningful under NPH:

When hazards cross, the "average" HR is uninterpretable. Report piecewise HR or RMST difference instead.

3. RPSFT is frequently misapplied:

Only 19% of RPSFT applications in oncology trials are methodologically appropriate (Prasad et al., 2023). Do not use RPSFT as primary or when multiple sequential therapies make causal inference impossible.

4. Over-stratification destroys power:

Too many strata with few events per stratum leads to unstable HR estimates and potential bias. Limit to ≤12 total strata with ≥10 events each.

5. LOCF is obsolete:

LOCF (Last Observation Carried Forward) for longitudinal endpoints is biased and inefficient under any realistic dropout mechanism. Use MMRM as primary. Retain LOCF, if at all, only for comparability with historical trials; it is not an informative sensitivity analysis.

6. Exact methods required for small samples:

Asymptotic tests (CMH, chi-square, Wald) are unreliable with <5 expected events per cell. Use Fisher's exact, Barnard's test, or Firth penalized regression.

7. Tipping point analysis is mandatory for PRO claims:

Any PRO-based labeling claim requires demonstration of robustness to MNAR via tipping-point or reference-based sensitivity analysis.

8. Missing tumor assessments are non-ignorable:

10–30% missing assessment rate in Phase 3 trials can bias PFS in either direction. Multiple pre-specified PFS censoring schemes (FDA Appendix C/D) are mandatory.



Sources:

- FDA Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics (2018, Final)
- FDA Clinical Trial Endpoints for the Approval of NSCLC Drugs and Biologics (2015/2020, Final)
- ICH E9(R1) Addendum on Estimands and Sensitivity Analysis (2019, Final)
- ICH E8(R1) General Considerations for Clinical Studies (2021, Final)
- Estimand framework in oncology: ICH E9(R1) and intercurrent events (literature synthesis)
- OS crossover adjustment: RPSFT, 2SRST, IPCW methods (literature synthesis)
- Reference-based imputation: J2R, CIR, CR, tipping point (literature synthesis)
- MMRM and longitudinal analysis in oncology (literature synthesis)
- Non-proportional hazards: MaxCombo and RMST (literature synthesis)

Last Updated: 2026-04-11