Statistical Analysis Methods in Oncology Trials
Purpose: This article provides a comprehensive reference for biostatisticians on the statistical methods used in confirmatory oncology trials. It follows the organizing principle mandated by ICH E9(R1): endpoint → estimand → estimator → assumptions → sensitivity analysis, and maps each method to the oncology-specific scenarios that drive method choice.
1. Organizing Principle: Endpoint → Estimand → Estimator → Assumptions → Sensitivity Analysis
The ICH E9(R1) Causal Chain
Every statistical analysis in a confirmatory oncology trial must follow a structured alignment:
Trial Objective
├─ Estimand (5 attributes)
│  ├─ Population
│  ├─ Variable (endpoint)
│  ├─ Treatment conditions
│  ├─ Summary measure (HR, risk difference, mean difference, etc.)
│  └─ Intercurrent event (IE) strategies (treatment policy, hypothetical, composite, principal stratum, while-on-treatment)
└─ Main Estimator (statistical method aligned to estimand)
   └─ Assumptions (documented, testable where possible)
      └─ Sensitivity Analysis (same estimand, relaxed assumptions)
"The main estimator will be underpinned by certain assumptions. To explore the robustness of inferences from the main estimator to deviations from its underlying assumptions, a sensitivity analysis should be conducted." — ICH E9(R1) Addendum, §A.5 (Final, 2019)
Key regulatory requirement: The estimand, main estimator, assumptions, and sensitivity analyses must all be pre-specified in the Statistical Analysis Plan (SAP) before unblinding.
How Oncology Is Different
Oncology trials introduce specific methodological challenges that alter standard statistical method choices:
| Challenge | Impact on Method Choice | Typical Setting |
|---|---|---|
| Non-proportional hazards (NPH) | Log-rank loses power; MaxCombo or RMST needed | IO trials (delayed separation), targeted therapy (early separation then convergence) |
| Treatment crossover | ITT OS estimate diluted; RPSFT/IPCW needed as sensitivity | Open-label trials with post-progression switch |
| Informative censoring | Kaplan-Meier biased; IPCW or sensitivity needed | PFS with differential dropout; PRO data |
| Assessment-driven bias | PFS timing depends on imaging schedule; IRC needed | Open-label PFS trials; unscheduled imaging |
| Multiple post-progression therapies | OS confounded by 4–6+ lines; dilutes signal | Advanced NSCLC, breast cancer, myeloma |
| Competing risks | 1 − KM with competing events censored overestimates cumulative incidence | Adjuvant DFS (non-cancer death); AML (transplant) |
| Missing tumor assessments | PFS censoring rules change results | ~10–30% of assessments missed in Phase 3 |
| Small populations | Exact methods needed; asymptotic methods unreliable | Rare molecular subtypes (NTRK, RET, BRAF V600E) |
2. Time-to-Event Methods
2.1 Kaplan-Meier Estimation
What it does: Non-parametric estimation of the survival function S(t) = P(T > t), accounting for right censoring.
Assumptions:
- Independent censoring: Censoring is non-informative (censored patients have the same prognosis as those who remain at risk)
- No left truncation: All patients enter observation at time 0 (randomization)
- Event times are exact (not interval-censored)
Oncology use: Primary descriptive method for OS, PFS, DFS, EFS. Always reported alongside formal tests.
Key quantities reported:
- Median survival time (with 95% CI via Brookmeyer-Crowley)
- Milestone survival rates: S(12 mo), S(24 mo), S(36 mo) with 95% CI (Greenwood's formula)
- Kaplan-Meier curves by treatment arm
R packages: survival::survfit(), ggsurvfit (for publication-quality plots)
When KM fails in oncology:
- Informative censoring (patients drop out because of disease progression) → IPCW adjustment
- Competing risks (non-cancer death competes with cancer event) → Cumulative Incidence Function (CIF) via Aalen-Johansen instead
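As a minimal sketch of the product-limit calculation, the example below fits `survival::survfit()` to six hypothetical patients (toy values, not trial data) and rebuilds the same estimate by hand from the at-risk and event counts.

```r
library(survival)

# Hypothetical toy data: time in months, event = 1 (event) or 0 (censored)
toy <- data.frame(
  time  = c(2, 3, 3, 5, 7, 8),
  event = c(1, 1, 0, 1, 0, 1)
)

fit <- survfit(Surv(time, event) ~ 1, data = toy)

# Product-limit by hand: S(t) = product over times of (1 - d_j / n_j)
km_by_hand <- cumprod(1 - fit$n.event / fit$n.risk)

data.frame(time = fit$time, n_risk = fit$n.risk, n_event = fit$n.event,
           surv_survfit = fit$surv, surv_by_hand = km_by_hand)
```

Censored patients leave the risk set without changing the estimate, which is why the curve stays flat at month 7 in this toy example.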
2.2 Log-Rank Test
What it does: Non-parametric hypothesis test comparing two survival distributions. Tests H₀: S₁(t) = S₂(t) for all t.
Formula:
χ² = [Σ (O₁ⱼ - E₁ⱼ)]² / Σ V₁ⱼ
where O₁ⱼ = observed events in group 1 at time j, E₁ⱼ = expected events under null, V₁ⱼ = hypergeometric variance.
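The formula can be checked numerically. The sketch below accumulates (O₁ⱼ − E₁ⱼ) and V₁ⱼ over the distinct event times of a hypothetical two-arm toy dataset and compares the result with `survival::survdiff()`; the data and variable names are illustrative assumptions.

```r
library(survival)

# Hypothetical toy data: arm 1 = experimental, arm 0 = control
toy <- data.frame(
  time  = c(3, 5, 7, 9, 12, 14, 2, 4, 4, 6, 8, 10),
  event = c(1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1),
  arm   = rep(c(1, 0), each = 6)
)

o_minus_e <- 0   # running sum of (O_1j - E_1j)
v_sum     <- 0   # running sum of hypergeometric variances V_1j
for (t in sort(unique(toy$time[toy$event == 1]))) {
  n  <- sum(toy$time >= t)                    # at risk, both arms
  n1 <- sum(toy$time >= t & toy$arm == 1)     # at risk, arm 1
  d  <- sum(toy$time == t & toy$event == 1)   # events, both arms
  d1 <- sum(toy$time == t & toy$event == 1 & toy$arm == 1)
  o_minus_e <- o_minus_e + (d1 - d * n1 / n)
  if (n > 1) {
    v_sum <- v_sum + d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
  }
}
chisq_by_hand  <- o_minus_e^2 / v_sum
chisq_survdiff <- survdiff(Surv(time, event) ~ arm, data = toy)$chisq

c(by_hand = chisq_by_hand, survdiff = chisq_survdiff)
```

The tied events at month 4 are absorbed by the (n − d)/(n − 1) factor in the variance term.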
Assumptions:
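- Proportional hazards (PH): Not required for validity (type I error is controlled under H₀ regardless), but the log-rank test has optimal power only when the hazard ratio is constant over time.
- Independent censoring
- Tied event times are handled through the hypergeometric variance term V₁ⱼ (the Breslow/Efron approximations apply to the Cox model, not to the log-rank test)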
When to use: Primary test for OS, PFS, DFS in most oncology Phase 3 trials under PH assumption.
Stratified log-rank test: When randomization is stratified by prognostic factors (e.g., ECOG, PD-L1 status, geographic region), the stratified log-rank test is the primary analysis:
χ²_stratified = [Σ_k Σ_j (O₁ⱼₖ - E₁ⱼₖ)]² / Σ_k Σ_j V₁ⱼₖ
where k indexes strata. This controls for stratum-level imbalances.
FDA position on pre-specifying the analysis:
"Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol. The analysis plan should specify the primary analysis and one or more sensitivity analyses." — FDA Cancer Endpoints 2018 (Final)
Regulatory practice: The stratified log-rank test using the randomization stratification factors is the standard primary analysis. Using different strata in the analysis vs. randomization requires justification and is typically a sensitivity analysis.
R packages: survival::survdiff() (unstratified, or stratified by adding strata() to the formula), nph::logrank.test() (weighted variants)
When log-rank fails in oncology:
- Delayed treatment effects (IO): Curves overlap for 2–4 months, then separate. Log-rank weights the early, no-effect period equally with the late separation → loss of power.
- Crossing hazards: Treatment helps early but harms late (or vice versa). Log-rank averages across the cross → near-zero test statistic.
- Early separation then convergence: Targeted therapy effective initially, then resistance develops. Log-rank may still detect but with reduced power.
Solution: MaxCombo test or weighted log-rank (see Section 2.5).
2.3 Cox Proportional Hazards Regression
What it does: Semi-parametric model estimating the hazard ratio (HR) and its 95% CI, adjusting for covariates.
Model:
h(t | X) = h₀(t) × exp(β₁X₁ + β₂X₂ + ... + βₚXₚ)
where h₀(t) is the unspecified baseline hazard, and exp(β₁) = HR for covariate X₁.
Assumptions:
- Proportional hazards: HR is constant over time for each covariate
- Log-linearity: Covariates act multiplicatively on the hazard
- Independent censoring
- No model misspecification (correct functional form for continuous covariates)
PH diagnostics:
- Schoenfeld residuals test: cox.zph() in R; p < 0.05 suggests PH violation
- Log-log survival plot: Parallel curves indicate PH; crossing/divergence indicates NPH
- Time-dependent covariates: Include interaction with log(time) to test for time-varying HR
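A short sketch of fitting the model and running the Schoenfeld diagnostic follows. It uses `survival::lung`, an observational dataset bundled with the package, purely as a stand-in. The covariates chosen are an assumption for illustration, not a recommended trial model.

```r
library(survival)

# survival::lung is used only as a convenient stand-in dataset
fit <- coxph(Surv(time, status) ~ sex + ph.ecog, data = lung)

# Hazard ratios with 95% Wald confidence intervals
hr_table <- cbind(HR = exp(coef(fit)), exp(confint(fit)))
hr_table

# Schoenfeld-residual test of PH: one row per covariate plus a GLOBAL row
zph <- cox.zph(fit)
zph$table
```

Calling `plot(zph)` draws the smoothed scaled residuals against time; a roughly flat curve is consistent with PH for that covariate.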
Stratified Cox model: When PH holds within strata but not overall, use:
h_k(t | X) = h₀ₖ(t) × exp(β₁X₁ + ... + βₚXₚ)
where each stratum k has its own baseline hazard h₀ₖ(t) but shares the same regression coefficients. This is the standard approach when stratification factors are used in randomization.
Oncology-specific covariates:
| Covariate | Typical Use | Notes |
|---|---|---|
| Treatment arm | Primary | Estimates HR |
| ECOG PS (0 vs. 1) | Stratification factor | Always included if used in randomization |
| PD-L1 status (≥50% vs. <50%) | Stratification or subgroup | IO trials |
| Histology (squamous vs. nonsquamous) | Stratification factor | NSCLC trials |
| Geographic region | Stratification factor | Global trials |
| Biomarker status (EGFR, ALK, BRCA) | Enrichment/subgroup | Targeted trials |
| Prior lines of therapy | Stratification factor | 2L+ trials |
R packages: survival::coxph() (standard), coxphf (Firth penalized for small samples)
When Cox fails in oncology:
- NPH (see MaxCombo, Section 2.5)
- Very small samples (<30 events): Firth penalized Cox or exact conditional inference
- Time-varying treatment effects: Extended Cox with time-dependent coefficients
2.4 Stratified Cox and Stratified Log-Rank: Handling Stratification Factors
Regulatory requirement: When randomization uses stratification factors, the primary analysis must use the same factors in the stratified analysis.
Common problem: Randomization stratification factors vs. analysis stratification factors may differ due to:
- IRT/IXRS system stratification errors (e.g., patient misclassified by ECOG)
- Small strata collapsed at analysis (e.g., geographic region strata with <10 patients)
- Subgroup-specific analyses using different factor levels
FDA position:
"The SAP should specify the primary analysis and one or more sensitivity analyses... Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol." — FDA Cancer Endpoints 2018 (Final)
Best practice:
- Primary analysis: Stratified log-rank and stratified Cox using randomization strata (as recorded in IRT/IXRS system)
- Sensitivity 1: Stratified analysis using clinical strata (actual patient characteristics, correcting any IRT errors)
- Sensitivity 2: Unstratified log-rank and Cox (to confirm stratification does not create artificial effects)
Maximum number of strata: Aim for ≤8–12 total strata in primary analysis. Too many strata with small counts lead to:
- Sparse stratum bias (HR estimates become unstable)
- Information loss (strata with no events, or with patients from only one arm, contribute nothing to the stratified comparison)
- Rule of thumb: Each stratum should contain ≥10 events
R implementation:
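library(survival)
# `trial` is assumed to hold one row per patient: time, event (1 = event), treatment, ecog, pdl1, region
# Stratified Cox (primary): separate baseline hazard per stratum, common treatment effect
coxph(Surv(time, event) ~ treatment + strata(ecog, pdl1, region), data = trial)
# Stratified log-rank (primary)
survdiff(Surv(time, event) ~ treatment + strata(ecog, pdl1, region), data = trial)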
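Before fitting a stratified model, the events-per-stratum rule of thumb can be checked with a simple tabulation. The sketch below uses a tiny hypothetical data frame; the column names and the threshold of 10 events are taken from the guidance above, and everything else is illustrative.

```r
# Hypothetical trial data: one row per patient
trial_toy <- data.frame(
  event = c(1, 0, 1, 1, 0, 1, 1, 0),
  ecog  = c(0, 0, 1, 1, 0, 1, 0, 1),
  pdl1  = c("high", "high", "low", "low", "low", "low", "high", "high")
)

# Events per cross-classified stratum (ecog x pdl1).
# Strata containing no patients at all do not appear in this table.
events_per_stratum <- aggregate(event ~ ecog + pdl1, data = trial_toy, FUN = sum)

# Flag strata below the rule-of-thumb threshold of 10 events
events_per_stratum$sparse <- events_per_stratum$event < 10
events_per_stratum
```

Strata flagged as sparse are candidates for the pre-specified collapsing rules described in the SAP.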
2.5 Non-Proportional Hazards: MaxCombo Test and RMST
MaxCombo Test
When to use: Pre-specified primary or co-primary test when delayed treatment effects (IO) or crossing/converging hazards (targeted therapy) are anticipated.
Composition: The MaxCombo test takes the maximum of a panel of Fleming-Harrington FH(ρ, γ) weighted log-rank statistics:
| Component | Weight | Optimal For | Clinical Scenario |
|---|---|---|---|
| FH(0,0) | 1 (equal weight) | Proportional hazards | Standard chemotherapy vs. chemotherapy |
| FH(1,0) | S(t) (early events weighted more) | Early separation, later convergence | Targeted therapy with resistance development |
| FH(0,1) | 1 − S(t) (late events weighted more) | Delayed treatment effects | IO trials (2–4 month lag before KM separation) |
| FH(1,1) | S(t) × [1 − S(t)] (middle events weighted) | Crossing or crossing-delayed patterns | Mixed IO + targeted combinations |
Test statistic: MaxCombo = max(Z₀₀, Z₁₀, Z₀₁, Z₁₁), where each Zᵢⱼ is the standardized FH(i,j) statistic. The joint null distribution is estimated via multivariate normal theory (correlation structure from the shared at-risk set) or permutation.
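To make the weighting concrete, the sketch below computes the four standardized FH(ρ, γ) statistics by hand on hypothetical toy data, using the pooled Kaplan-Meier estimate just before each event time as S(t). The helper `fh_z()` and its sign convention (positive values favor arm 1) are assumptions of this example. It does not compute the joint multivariate normal critical value, so the maximum should not be compared with 1.96.

```r
library(survival)

# Standardized FH(rho, gamma) weighted log-rank statistic (illustrative helper)
fh_z <- function(time, event, arm, rho, gamma) {
  s_left <- 1   # pooled KM estimate just before the current event time
  u <- 0
  v <- 0
  for (t in sort(unique(time[event == 1]))) {
    n  <- sum(time >= t)
    n1 <- sum(time >= t & arm == 1)
    d  <- sum(time == t & event == 1)
    d1 <- sum(time == t & event == 1 & arm == 1)
    w  <- s_left^rho * (1 - s_left)^gamma
    u  <- u + w * (d1 - d * n1 / n)
    if (n > 1) v <- v + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    s_left <- s_left * (1 - d / n)
  }
  -u / sqrt(v)   # fewer events than expected in arm 1 gives a positive Z
}

# Hypothetical toy data: arm 1 = experimental, arm 0 = control
toy <- data.frame(
  time  = c(3, 5, 7, 9, 12, 14, 2, 4, 4, 6, 8, 10),
  event = c(1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1),
  arm   = rep(c(1, 0), each = 6)
)

params <- list(c(0, 0), c(1, 0), c(0, 1), c(1, 1))
z <- sapply(params, function(p) fh_z(toy$time, toy$event, toy$arm, p[1], p[2]))
names(z) <- c("FH(0,0)", "FH(1,0)", "FH(0,1)", "FH(1,1)")
maxcombo <- max(z)

# FH(0,0) reproduces the ordinary log-rank chi-square
logrank_chisq <- survdiff(Surv(time, event) ~ arm, data = toy)$chisq
list(z = z, maxcombo = maxcombo, logrank_chisq = logrank_chisq)
```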
Performance: A 2022 JAMA Oncology meta-analysis (Mukhopadhyay et al.) of 63 IO studies (35,902 patients) confirmed MaxCombo achieved significance in all 15 trials where log-rank failed.
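Alpha control: MaxCombo is a single test whose critical value is derived from the joint distribution of the correlated component statistics. The overall type I error is therefore controlled at the nominal α without a Bonferroni-style split across components, although the critical value is larger than for any single component test.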
Sample size: Use simtrial (Merck) or nph packages for simulation-based power calculation under assumed NPH pattern.
Worked example (Phase 3 IO, NSCLC 1L):
Assumption: Delayed separation — HR = 1.0 for months 0–3, HR = 0.65 for months 3+
Standard log-rank: 70% power with 350 events
MaxCombo FH(0,0)+FH(0,1)+FH(1,0)+FH(1,1): 85% power with 350 events
Gain: 15 percentage points of power recovered under NPH
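Simulation-based power calculations of this kind start from event times drawn under the assumed piecewise hazard. The following sketch draws such times by inverting the cumulative hazard. The helper `r_delayed()`, the 12-month control median and the sample size are assumptions for illustration, and the block does not reproduce the power figures above.

```r
# Inverse-CDF draw from a piecewise-exponential distribution:
# hazard = lambda before `delay`, hazard = lambda * hr afterwards
r_delayed <- function(n, lambda, hr, delay) {
  h <- rexp(n)                    # target cumulative hazard for each patient
  early <- h <= lambda * delay    # event occurs before the effect starts
  ifelse(early, h / lambda, delay + (h - lambda * delay) / (lambda * hr))
}

set.seed(1)
lambda_ctrl <- log(2) / 12        # hypothetical control median of 12 months
t_ctrl <- r_delayed(5000, lambda_ctrl, hr = 1.00, delay = 3)
t_exp  <- r_delayed(5000, lambda_ctrl, hr = 0.65, delay = 3)

# Before month 3 the arms share one hazard; afterwards they diverge
c(p_event_by_3_ctrl = mean(t_ctrl <= 3), p_event_by_3_exp = mean(t_exp <= 3),
  median_ctrl = median(t_ctrl), median_exp = median(t_exp))
```

Repeating such draws many times, applying censoring, and counting rejections of each candidate test gives the simulated power.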
R packages: simtrial (Merck), nph, nphsim
Restricted Mean Survival Time (RMST)
Definition: Area under the Kaplan-Meier curve from time 0 to pre-specified horizon t*:
RMST(t*) = ∫₀^{t*} S(t) dt
Interpretable as "average months alive (or event-free) through t* months."
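Because the Kaplan-Meier curve is a step function, the integral reduces to a sum of rectangle areas. The sketch below implements that sum for one arm; the helper `rmst_km()` and the toy data are assumptions for illustration, and `survRM2::rmst2()` remains the packaged route for two-arm comparisons with standard errors.

```r
library(survival)

# RMST up to tau as the area under the KM step function (illustrative helper)
rmst_km <- function(time, event, tau) {
  fit  <- survfit(Surv(time, event) ~ 1, data = data.frame(time, event))
  keep <- fit$time <= tau
  t_grid <- c(0, fit$time[keep], tau)   # interval boundaries up to tau
  s_left <- c(1, fit$surv[keep])        # S(t) at the start of each interval
  sum(diff(t_grid) * s_left)
}

# Hypothetical toy data: time in months, event = 1 (event) or 0 (censored)
time  <- c(2, 3, 3, 5, 7, 8)
event <- c(1, 1, 0, 1, 0, 1)

rmst_6 <- rmst_km(time, event, tau = 6)
rmst_6
```

For these toy data the curve takes the values 1, 5/6, 2/3 and 4/9 over the intervals [0, 2), [2, 3), [3, 5) and [5, 6], so the area is 2 + 5/6 + 4/3 + 4/9 months.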
Advantages over HR:
- No PH assumption: Model-free, non-parametric
- Absolute measure (expressed in time units): Clinically meaningful to clinicians and patients
- Stable under NPH: When curves converge or cross, RMST difference remains interpretable; HR becomes misleading
Selecting t*: Pre-specify based on:
- Clinical rationale (e.g., 24-month landmark for adjuvant DFS)
- Minimum of two groups' maximum follow-up (ensures stable KM)
- Regulatory interest (e.g., 12-month PFS rate for targeted therapy)
Regulatory status:
- Accepted as secondary/supplementary endpoint by FDA and EMA
- Encouraged alongside HR for transparency
- Not yet accepted as sole primary analysis, though 2017–2018 JAMA Oncology papers (Pak et al.; Liang et al.) have influenced consideration
Treatment effect measures:
- RMST difference: ΔRMST = RMST₁ − RMST₀ (absolute benefit in months)
- RMST ratio: RMST₁ / RMST₀ (relative benefit)
R packages: survRM2 (standard for regulatory submissions), survival::survfit() + integration
3. Binary Endpoint Methods
3.1 Cochran-Mantel-Haenszel (CMH) Test
What it does: Tests association between treatment and binary response (e.g., ORR, pCR), adjusting for stratification factors via stratum-specific 2×2 tables.
When to use: Primary analysis for binary endpoints (ORR, pCR) in stratified randomized trials.
Formula (common odds ratio estimator):
OR_CMH = Σ_k (a_k × d_k / n_k) / Σ_k (b_k × c_k / n_k)
where a, b, c, d are cells in each 2×2 table k.
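The estimator can be verified on a small example. The sketch below builds a hypothetical 2×2×2 array (two strata), computes the common odds ratio from the formula, and compares it with the estimate returned by `stats::mantelhaen.test()`. The cell counts are invented for illustration.

```r
# Hypothetical 2x2xK array: rows = arm (exp, ctrl), cols = response (yes, no)
tab <- array(
  c(10, 6, 5, 9,      # stratum 1: a = 10, c = 6, b = 5,  d = 9
    8, 4, 12, 16),    # stratum 2: a = 8,  c = 4, b = 12, d = 16
  dim = c(2, 2, 2),
  dimnames = list(arm = c("exp", "ctrl"), response = c("yes", "no"),
                  stratum = c("s1", "s2"))
)

n_k <- apply(tab, 3, sum)   # stratum sizes
or_by_hand <- sum(tab[1, 1, ] * tab[2, 2, ] / n_k) /
  sum(tab[1, 2, ] * tab[2, 1, ] / n_k)

cmh <- mantelhaen.test(tab)   # continuity correction is applied by default
c(by_hand = or_by_hand, mantelhaen = unname(cmh$estimate))
```

Here the numerator is 90/30 + 128/40 = 6.2 and the denominator is 30/30 + 48/40 = 2.2, giving a common odds ratio of about 2.82.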
Assumptions:
- Homogeneous odds ratio across strata (no qualitative interaction)
- Sufficient stratum sizes (expected cell counts ≥5 per cell)
- Independent observations
Oncology use: ORR in metastatic trials (CMH test stratified by prior lines, ECOG, PD-L1).
R packages: stats::mantelhaen.test(), DescTools::CochranMantelHaenszel()
3.2 Logistic Regression
What it does: Models log-odds of binary response as a function of treatment and covariates.
logit(P(Y=1)) = β₀ + β₁ × Treatment + β₂ × X₂ + ... + βₚ × Xₚ
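As a minimal illustration of how exp(β₁) relates to the familiar 2×2 odds ratio, the sketch below fits a treatment-only logistic model to hypothetical aggregated response counts. With no other covariates, the fitted odds ratio should match the cross-product ratio of the table.

```r
# Hypothetical aggregated data: responders / non-responders by arm
orr <- data.frame(
  treatment = c(1, 0),   # 1 = experimental, 0 = control
  resp      = c(12, 6),
  nonresp   = c(8, 14)
)

fit <- glm(cbind(resp, nonresp) ~ treatment, family = binomial, data = orr)

# exp(beta_1) is the odds ratio for treatment
or_glm   <- unname(exp(coef(fit)["treatment"]))
or_cross <- (12 * 14) / (8 * 6)   # cross-product ratio from the 2x2 table
c(glm = or_glm, cross_product = or_cross)
```

Adding stratification factors as covariates turns this into the adjusted analysis described in this section, at which point the two quantities no longer coincide.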
When to use:
- Adjusted analysis of ORR, pCR with multiple covariates
- Subgroup analyses (treatment × biomarker interaction)
- Multivariable models for exploratory prognostic/predictive analyses
Assumptions:
- Correct functional form (linearity in log-odds for continuous covariates)
- Independence of observations
- No perfect separation (all 1s or all 0s in a covariate level)
Oncology-specific: Firth penalized logistic regression (logistf package) recommended when:
- Events are rare (pCR rate <15% or >85%)
- Small strata with zero events
- Separation or near-separation (perfect prediction by a covariate)
R packages: stats::glm() (standard), logistf (Firth penalized)
3.3 Exact Methods for Sparse Settings
When to use: Single-arm Phase 2 trials, rare molecular subtypes (NTRK, BRAF V600E, RET, MET exon 14) with <50 evaluable patients.
Methods:
- Clopper-Pearson exact CI for single-arm ORR: 95% CI for proportion based on binomial distribution
- Fisher's exact test for 2×2 tables: When expected cell counts <5, CMH unreliable
- Barnard's unconditional exact test: More powerful than Fisher's for small samples
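The Clopper-Pearson limits are quantiles of beta distributions, which base R exposes directly. The sketch below computes them for a hypothetical result of 13 responders in 37 patients and checks them against `binom.test()`, then runs the one-sided exact test against an assumed null rate of 15%.

```r
x <- 13; n <- 37; alpha <- 0.05   # hypothetical: 13 responders of 37

# Clopper-Pearson limits from beta quantiles
cp_lower <- if (x == 0) 0 else qbeta(alpha / 2, x, n - x + 1)
cp_upper <- if (x == n) 1 else qbeta(1 - alpha / 2, x + 1, n - x)

# binom.test() reports the same exact interval
ci_binom <- as.numeric(binom.test(x, n)$conf.int)
rbind(beta_quantiles = c(cp_lower, cp_upper), binom_test = ci_binom)

# One-sided exact test of H0: p <= 0.15
p_one_sided <- binom.test(x, n, p = 0.15, alternative = "greater")$p.value
p_one_sided
```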
Hypothesis test (single-arm ORR):
H₀: p ≤ p₀ (null response rate, e.g., 15%)
H₁: p ≥ p₁ (alternative response rate, e.g., 40%)
Test: Exact binomial test (one-sided α = 0.025)
Sample size: Simon's two-stage optimal or minimax design
Worked example (rare subtype, NTRK fusion):
Historical ORR = 15% (p₀); Target ORR = 45% (p₁)
α = 0.025 one-sided; β = 0.10 (90% power)
Simon's optimal two-stage: Stage 1: n₁ = 12, r₁ = 2 (stop if ≤2 responses)
Stage 2: n = 37, r = 12 (reject H₀ if ≥13 responses)
Expected sample size under H₀: 18.6
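The operating characteristics of any two-stage boundary set can be computed from binomial probabilities. The sketch below does so for the boundaries stated in this example. The helper `simon_oc()` is an assumption of this illustration, and the check covers only the arithmetic of the stated boundaries. Whether they are the optimal design for these inputs is better confirmed with `clinfun::ph2simon()`.

```r
# Operating characteristics of a two-stage design (illustrative helper):
# stop after stage 1 if X1 <= r1; reject H0 at the end if X1 + X2 > r
simon_oc <- function(p, n1, r1, n, r) {
  pet <- pbinom(r1, n1, p)        # probability of early termination
  x1  <- (r1 + 1):n1              # stage 1 counts that continue to stage 2
  p_reject <- sum(dbinom(x1, n1, p) *
                    pbinom(r - x1, n - n1, p, lower.tail = FALSE))
  c(pet = pet, expected_n = n1 + (1 - pet) * (n - n1), p_reject = p_reject)
}

under_h0 <- simon_oc(p = 0.15, n1 = 12, r1 = 2, n = 37, r = 12)
under_h1 <- simon_oc(p = 0.45, n1 = 12, r1 = 2, n = 37, r = 12)
rbind(under_h0, under_h1)
```

Under H₀ the probability of stopping after stage 1 is about 0.736, so the expected sample size is 12 + 0.264 × 25 ≈ 18.6, matching the figure above. The `p_reject` column gives the attained type I error (under H₀) and power (under H₁).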
R packages: clinfun::ph2simon() (Simon's two-stage), exact2x2 (Fisher/Barnard)
4. Continuous and Longitudinal Endpoint Methods
4.1 ANCOVA (Analysis of Covariance)
What it does: Compares treatment groups on a continuous endpoint (e.g., change in tumor size, QoL score) adjusting for baseline value and covariates.
Model:
Y_post = β₀ + β₁ × Treatment + β₂ × Y_baseline + β₃ × Covariates + ε
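A small simulation illustrates the model and the precision gained from baseline adjustment. All values below (sample size, effect of −5 units, variances) are invented for the sketch and carry no clinical meaning.

```r
set.seed(42)
n <- 2000                                  # hypothetical simulated trial
baseline  <- rnorm(n, mean = 60, sd = 15)
treatment <- rep(c(0, 1), each = n / 2)
# Simulated truth: treatment lowers the post-baseline value by 5 units
post <- 10 + 0.8 * baseline - 5 * treatment + rnorm(n, sd = 10)

fit <- lm(post ~ treatment + baseline)
est <- coef(summary(fit))["treatment", c("Estimate", "Std. Error")]
est

# Baseline adjustment shrinks the standard error relative to the unadjusted fit
se_unadjusted <- coef(summary(lm(post ~ treatment)))["treatment", "Std. Error"]
c(se_adjusted = est[["Std. Error"]], se_unadjusted = se_unadjusted)
```

`emmeans::emmeans(fit, ~ treatment)` would return the corresponding adjusted arm means if that package is available.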
When to use: Single post-baseline assessment (e.g., change from baseline in tumor burden at Week 12, QoL at a single timepoint).
Assumptions:
- Linear relationship between baseline and post-baseline
- Homogeneity of regression slopes (same baseline-outcome relationship in both arms)
- Normality of residuals (for inference; robust to violations with large N)
- Complete data at the analysis timepoint
Advantages over raw change: Adjusting for baseline increases precision (reduces residual variance) and controls for baseline imbalances.
Oncology use: Tumor burden change (% change from baseline in SLD at a fixed timepoint); single-timepoint QoL comparisons.
Limitation: Discards intermediate measurements and patients with missing post-baseline data. For repeated measurements, use MMRM.
R packages: stats::lm(), emmeans (for adjusted means and contrasts)
4.2 MMRM (Mixed Model Repeated Measures)
What it does: Models correlated repeated measurements within subjects over time, using all available data under MAR (Missing At Random) assumption.
Model:
Y_it = μ + τ_treatment + γ_visit + (τ × γ)_treatment×visit + β × Y_baseline + ε_it
where ε ~ N(0, Σ) with Σ capturing within-subject correlation.
Key advantages over ANCOVA/LOCF:
- Uses all available data: No deletion of subjects with partial data
- Valid under MAR: Missingness can depend on observed data (not unobserved future values)
- Models correlation structure: Leverages within-subject correlation for efficiency
- Regulatory preference: FDA and EMA strongly prefer MMRM over ANCOVA/LOCF for longitudinal endpoints
Covariance structure selection:
| Structure | Parameters | Best For | Assumption |
|---|---|---|---|
| Unstructured (UN) | p(p+1)/2 | Irregular visit intervals; ≤8 visits | None (most flexible) |
| AR(1) | 2 | Equally-spaced visits; correlation decays with time | Corr = ρ^d for visits d steps apart |
| Compound Symmetry (CS) | 2 | All correlations equal (rarely realistic) | Constant correlation |
| Toeplitz | p | Equally-spaced visits; correlation depends on lag | Band structure |
Selection: Use AIC/BIC to compare structures. For ≤6 visits, UN is safe default. For >6 visits, AR(1) or Toeplitz more parsimonious.
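The structures in the table differ only in how they fill the within-subject correlation matrix. As a small sketch, with p = 4 visits and ρ = 0.6 chosen arbitrarily, the AR(1) and compound-symmetry matrices and the parameter counts can be written out directly:

```r
p   <- 4     # hypothetical number of post-baseline visits
rho <- 0.6   # hypothetical lag-1 correlation

# AR(1): correlation decays as rho^|i - j|
ar1 <- rho^abs(outer(1:p, 1:p, "-"))

# Compound symmetry: one common correlation off the diagonal
cs <- matrix(rho, p, p)
diag(cs) <- 1

# Number of covariance parameters per structure for p visits
n_params <- c(UN = p * (p + 1) / 2, AR1 = 2, CS = 2, Toeplitz = p)

ar1
n_params
```

With four visits the unstructured matrix already needs 10 parameters, which is why it becomes costly as the number of visits grows.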
The MAR assumption in oncology:
- When MAR holds: Patient drops out because QoL score at last visit was poor (observed) → dropout depends on observed data → MAR
- When MAR fails (MNAR): Patient drops out because their unobserved next QoL score would be terrible → depends on unobserved data → MNAR → MMRM is biased
- Practical guidance: In oncology, dropout often correlates with observed disease status (progression documented on imaging), making MAR reasonable for on-treatment data. Post-progression dropout is more likely MNAR.
Sensitivity to MAR violation: Use reference-based imputation (J2R, CIR) or tipping-point analysis (see Section 6.6).
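R packages: mmrm (CRAN, purpose-built for confirmatory trials; recommended), nlme::gls() (marginal model with residual correlation structures), nlme::lme() and lme4::lmer() (random-effects formulations; lme4 does not fit structured residual covariance matrices)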
4.3 PRO (Patient-Reported Outcomes) in Oncology: Informative Dropout
Unique challenge: PRO data suffer from systematic informative dropout — sicker patients stop completing surveys, inflating apparent QoL in remaining responders.
Consequences:
- Standard MMRM treating dropout as MAR underestimates disease burden
- If one arm has more toxicity-driven dropouts, remaining population appears healthier (masking harm)
- Only 7.4% of 215 oncology RCTs adequately reported missing QoL data (Olivier et al., 2021)
Regulatory requirement: FDA and EMA now require explicit PRO missingness handling via IE strategies:
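- Treatment policy estimand: Analyze all collected data regardless of the intercurrent event (e.g., progression or treatment discontinuation), which requires continued PRO collection afterwards; remaining missing data are handled by MMRM under MAR
- Hypothetical estimand: "What would QoL be if the patient had not progressed?" Data after the intercurrent event are set aside and modeled under MAR (MMRM or multiple imputation). Principal stratification is a separate IE strategy, not an estimator for this estimand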
- While-on-treatment estimand: Analyze only data collected while patient remains on study drug
SAP language template (PRO):
The primary PRO analysis uses MMRM with unstructured covariance modeling change from
baseline in [PRO instrument] at each scheduled assessment visit, including treatment,
visit, treatment-by-visit interaction, baseline score, and stratification factors as
fixed effects. The analysis is conducted in the ITT population under the MAR assumption.
Sensitivity Analysis 1 (Reference-Based Imputation): Jump-to-Reference (J2R) imputation
where post-dropout outcomes for the experimental arm are imputed using the control arm's
observed trajectory, reflecting the treatment policy estimand under MNAR.
Sensitivity Analysis 2 (Tipping Point): Systematic perturbation of imputed values for
dropout patients (δ = 0 to 1.0 in increments of 0.1) to quantify the departure from MAR
required to reverse the primary conclusion.
5. Handling Stratification Factors in Primary Analysis
Regulatory Framework
FDA Cancer Endpoints 2018 (Final): Primary analysis should use the stratification factors from randomization. Discrepancies between IRT-recorded strata and actual patient characteristics should be addressed via sensitivity analysis.
Decision Framework
| Scenario | Primary Analysis | Sensitivity Analysis |
|---|---|---|
| Strata match between IRT and clinical data | Stratified log-rank/Cox using IRT strata | Unstratified analysis |
| IRT strata differ from actual characteristics | Stratified analysis using IRT strata (primary) | Stratified using actual clinical strata |
| Some strata have <10 events | Collapse small strata or drop factor | Full stratification as sensitivity |
| >3 stratification factors (>12 combinations) | Use most prognostic 2–3 factors | Full set as sensitivity |
| Subgroup analysis (e.g., PD-L1 ≥50%) | Stratified within subgroup | Unstratified within subgroup |
Common Stratification Errors
- Over-stratification: Too many factors × levels create empty or near-empty strata → unstable HR estimates. Rule of thumb: Total strata ≤ min(12, total events / 10).
- Mismatched strata: IRT/IXRS records differ from CRF (e.g., ECOG classified as 0 in IRT but 1 in CRF due to clerical error). Primary analysis uses IRT; sensitivity uses CRF.
- Post-hoc stratification: Adding stratification factors not used at randomization introduces bias. Report as exploratory only.
6. Oncology-Specific Issues That Change Method Choice
6.1 Delayed Treatment Effects (Immunotherapy)
Problem: IO agents require 2–4 months to prime T-cell response. Kaplan-Meier curves overlap initially, then separate. Log-rank weights all timepoints equally → loss of power.
Method adaptation:
- Primary: MaxCombo test (pre-specified) OR weighted log-rank FH(0,1)
- Supplementary: RMST difference at clinically relevant t* (e.g., 24 months)
- Effect measure: Report both HR (Cox) and RMST difference for transparency
- Sample size: Simulation-based under piecewise hazard assumption (HR₁ for months 0–3, HR₂ for months 3+)
SAP language:
The primary analysis of OS uses the MaxCombo test combining FH(0,0), FH(1,0), FH(0,1),
and FH(1,1) weighted log-rank statistics, with the maximum standardized statistic as the
test statistic and critical value determined via multivariate normal theory at one-sided
α = 0.025. The hazard ratio from a stratified Cox model is reported as the primary
treatment effect measure. RMST difference at 24 months is reported as a supplementary
measure.
6.2 Treatment Crossover: RPSFT, 2SRST, and IPCW
Problem: When control-arm patients cross over to experimental therapy post-progression, ITT OS analysis underestimates the true treatment effect.
Methods (sensitivity analyses only — not primary):
| Method | Mechanism | Key Assumption | When to Prefer |
|---|---|---|---|
| RPSFT | Estimates shrinkage factor ψ applied to post-crossover time | Accelerated failure time (constant treatment effect); rank preservation | Well-established drug efficacy; single switch direction |
| 2SRST | Treats crossover as "second randomization"; re-censoring or IPW for post-switch phase | Two-stage independence; no unmeasured confounders | Late crossover; non-exponential survival |
| IPCW | Weights observations by inverse probability of remaining uncensored | Exchangeability (no unmeasured confounders); positivity; correct model specification | Multiple switches; complex crossover patterns |
Regulatory status:
- EMA: Accepts RPSFT and IPCW as supporting sensitivity analyses; prefers pre-specification
- FDA: Increasingly skeptical of RPSFT without strong justification; views adjusted results as "what-if" scenarios
- Both agencies: Crossover adjustment methods are secondary analyses only — not sole basis for approval
- Only 19% of RPSFT applications in 65 oncology trials were methodologically appropriate (Prasad et al., 2023)
When RPSFT assumptions fail:
- Crossover represents inappropriate/inadequate treatment
- Multiple sequential therapies make disentangling effects impossible
- The delay before switching outweighs initial therapy benefit
IPCW positivity failure: When P(switch | history) → 1 for some subgroups (so the probability of remaining unswitched → 0), extreme weights arise. Truncation at 1st/99th percentile or max weight = 10 reduces variability but introduces bias.
R packages: rpsftm (RPSFT via g-estimation), ipw (IPCW), custom code for 2SRST
6.3 Informative Censoring
Problem: In PFS analysis, patients may be censored due to events related to their prognosis (e.g., clinical deterioration triggers early discontinuation before next scheduled scan documents progression).
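Detection: Compare baseline characteristics and post-baseline trajectories of censored vs. uncensored patients, for example by modeling the censoring indicator on arm and covariates. A permutation test on censoring indicators can formalize the comparison, but no test can rule out dependence on unobserved prognosis.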
Method adaptation:
- Primary: Standard KM with pre-specified censoring rules (FDA Appendix C/D tables)
- Sensitivity 1: Count early dropouts (without documented progression) as events (worst-case)
- Sensitivity 2: IPCW-adjusted KM if strong evidence of informative censoring
FDA PFS censoring scheme (Cancer Endpoints 2018, Final):
| Situation | Date | Outcome |
|---|---|---|
| Incomplete/no baseline tumor assessments | Randomization | Censored |
| Progression documented between scheduled visits | Earliest date of progression | Progressed |
| Death or progression after ≥2 consecutive missed visits | Last adequate assessment before the missed visits | Censored |
| No progression; still on treatment | Last adequate assessment date | Censored |
| Death without prior documented progression | Date of death | Progressed (for PFS) |
| New anti-cancer therapy before progression | Last adequate assessment before new therapy | Censored |
| Lost to follow-up / withdrawal | Last adequate assessment date | Censored |
Sensitivity analyses (FDA recommends ≥2):
- PFS-1 (uniform dates): Assign progression to scheduled visit midpoints (corrects for differential assessment timing)
- PFS-2 (conservative): Treat all unassessed/missed visits as events
- PFS-3 (investigator assessment): Use investigator PFS if IRC PFS is primary (for open-label trials)
6.4 Missing Tumor Assessments
Problem: 10–30% of scheduled tumor assessments are missed in Phase 3 oncology trials, creating gaps in PFS determination.
"Substantial numbers of missing tumor assessments can potentially overestimate or underestimate treatment differences." — FDA Cancer Endpoints 2018 (Final)
Analysis strategies:
| Missing Pattern | Primary Handling | Sensitivity |
|---|---|---|
| Single missed visit, then progression at next visit | Use next-visit progression date (event) | Impute progression at midpoint of missed interval |
| Multiple missed visits, then progression | Censor at last adequate assessment before the gap (FDA primary scheme) | Count as event at the documented progression date |
| Missing baseline assessments | Censor at randomization | Exclude from primary; include in sensitivity |
| Missing final assessment (dropout without progression) | Censor at last adequate assessment | Count as event (worst-case) |
6.5 Competing Risks
Problem: In adjuvant settings (DFS), non-cancer death is a competing event that prevents observation of cancer recurrence. Censoring those deaths and reporting 1 − KM overestimates the cumulative incidence of cancer events.
Methods:
- Primary: KM with all-cause DFS (non-cancer deaths counted as events) — FDA-preferred
- Sensitivity: Cumulative Incidence Function (CIF) via Aalen-Johansen estimator; Fine-Gray subdistribution hazard model
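The gap between the two estimators is easy to see on a toy example. The sketch below computes the Aalen-Johansen cumulative incidence of recurrence by hand and contrasts it with 1 − KM when non-cancer deaths are treated as censored. The six observations are hypothetical, and `cmprsk::cuminc()` would be the packaged route.

```r
# Hypothetical toy data: status 1 = recurrence, 2 = non-cancer death, 0 = censored
time   <- c(1, 2, 3, 4, 5, 6)
status <- c(1, 2, 1, 0, 2, 1)

s_all    <- 1   # all-cause event-free survival just before the current time
cif_rec  <- 0   # Aalen-Johansen cumulative incidence of recurrence
km_naive <- 1   # KM that treats non-cancer death as censoring
for (t in sort(unique(time[status != 0]))) {
  n     <- sum(time >= t)                  # patients still at risk
  d_rec <- sum(time == t & status == 1)    # recurrences at t
  d_any <- sum(time == t & status != 0)    # events of any type at t
  cif_rec  <- cif_rec + s_all * d_rec / n
  km_naive <- km_naive * (1 - d_rec / n)
  s_all    <- s_all * (1 - d_any / n)
}

c(cif_recurrence = cif_rec, one_minus_km = 1 - km_naive)
```

In this example the cumulative incidence of recurrence is 7/12, whereas 1 − KM reaches 1 because it implicitly assumes that patients who died of other causes could still have recurred.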
Fine-Gray model:
h_subdist(t | X) = h₀_subdist(t) × exp(β₁X₁ + ... + βₚXₚ)
Estimates the subdistribution hazard ratio (sdHR) — the effect of treatment on the cumulative incidence of the event of interest, accounting for competing risks.
When to use: Adjuvant breast cancer (non-cancer death competes with recurrence), AML (transplant-related mortality competes with relapse), elderly populations (comorbidity-driven mortality).
R packages: cmprsk::cuminc() (CIF), cmprsk::crr() (Fine-Gray), tidycmprsk (tidy interface)
6.6 Reference-Based Imputation for Missing Data
When to use: Treatment policy estimand where MAR is implausible — patients who discontinue the experimental drug likely have outcomes resembling control-arm patients.
Methods:
| Method | Post-Dropout Assumption | When to Use |
|---|---|---|
| Jump to Reference (J2R) | Mean trajectory "jumps" to control arm at dropout | Abrupt effect loss (e.g., stopping active drug) |
| Copy Increments in Reference (CIR) | Rate of change copies control arm's increments after dropout | Gradual effect waning |
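| Copy Reference (CR) | Entire trajectory (before and after dropout) is modeled as if the patient were in the control arm | Patient assumed never to have benefited; typically less conservative than J2R because pre-dropout observations carry forward |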
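The three assumptions can be written as the mean vector used when imputing an experimental-arm patient. The notation is introduced here for illustration and is not from the source: μᴱⱼ and μᴿⱼ are the mean outcomes at visit j in the experimental and reference arms, k is the patient's last observed visit, and μ̃ⱼ is the mean used in the imputation model.

```latex
\[
\text{J2R:}\quad
\tilde{\mu}_j =
\begin{cases}
\mu^{E}_j, & j \le k \\
\mu^{R}_j, & j > k
\end{cases}
\qquad
\text{CIR:}\quad
\tilde{\mu}_j =
\begin{cases}
\mu^{E}_j, & j \le k \\
\mu^{E}_k + \left(\mu^{R}_j - \mu^{R}_k\right), & j > k
\end{cases}
\qquad
\text{CR:}\quad
\tilde{\mu}_j = \mu^{R}_j \ \text{ for all } j
\]
```

In each case the missing visits are then drawn from the conditional distribution given the patient's observed values, so the choice of mean vector determines how much of the pre-dropout benefit is retained.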
Tipping point analysis: Systematically varies the departure from MAR (δ = 0 to 1.0, on a scale pre-specified in the SAP) to find the tipping point δ* at which significance reverses. If δ* > 0.8: robust. If δ* < 0.2: fragile.
R packages: rbmi (CRAN, modern standard for regulatory submissions), mice (general MI)
7. Decision Table: Endpoint/Scenario → Primary Method → Sensitivity Methods
| Endpoint | Scenario | Primary Method | Effect Measure | Sensitivity Methods | R Package |
|---|---|---|---|---|---|
| OS | PH assumed | Stratified log-rank | HR (Cox) | Unstratified log-rank; RMST | survival |
| OS | NPH (IO, delayed effect) | MaxCombo | HR + RMST | Standard log-rank; piecewise Cox | simtrial, nph |
| OS | Crossover present | Stratified log-rank (ITT) | HR (Cox) | RPSFT, IPCW, 2SRST (sensitivity) | rpsftm, ipw |
| PFS | Open-label, PH | Stratified log-rank (IRC) | HR (Cox) | Investigator PFS; PFS-1/PFS-2 censoring | survival |
| PFS | Open-label, NPH (IO) | MaxCombo (IRC) | HR + RMST | Standard log-rank; milestone PFS rates | simtrial, survRM2 |
| PFS | Informative censoring suspected | Stratified log-rank | HR (Cox) | IPCW-KM; worst-case (dropouts = events) | ipw, survival |
| DFS (adjuvant) | All-cause, competing risks | Stratified log-rank (all-cause DFS) | HR (Cox) | Fine-Gray (cancer-specific CIF) | survival, cmprsk |
| EFS (neoadjuvant) | Composite endpoint | Stratified log-rank | HR (Cox) | DFS as sensitivity; pCR subgroup analysis | survival |
| ORR (randomized) | Stratified comparison | CMH test | Risk difference, OR | Logistic regression; unstratified Fisher | stats |
| ORR (single-arm) | Rare molecular subtype | Exact binomial test | Proportion + Clopper-Pearson CI | Simon's two-stage design bounds | clinfun |
| pCR (neoadjuvant) | Stratified comparison | CMH test | Risk difference | Logistic regression; subgroup by biomarker | stats |
| QoL / PRO | Repeated measures, dropout | MMRM (UN covariance) | LS mean difference | J2R imputation; tipping point; while-on-treatment | mmrm, rbmi |
| Continuous (single timepoint) | Baseline-adjusted | ANCOVA | Adjusted mean difference | Rank-based (Wilcoxon) if non-normal | stats, emmeans |
| Continuous (longitudinal) | Repeated measures | MMRM | LS mean difference at each visit | ANCOVA at final visit; LOCF (historical comparison) | mmrm |
8. SAP Language Templates
Template 1: Primary TTE Analysis (OS or PFS, Proportional Hazards)
PRIMARY ANALYSIS
================
The primary analysis of [OS / PFS by IRC] will be performed in the Intent-to-Treat
(ITT) population, defined as all randomized patients analyzed according to their
randomized treatment assignment.
The primary test is the stratified log-rank test, stratified by [list randomization
stratification factors: e.g., ECOG performance status (0 vs. 1), PD-L1 status
(≥50% vs. <50%), geographic region (North America vs. Europe vs. Asia-Pacific)].
The primary treatment effect measure is the hazard ratio (HR) estimated from a
stratified Cox proportional hazards model using the same stratification factors,
with the 95% Wald confidence interval. The two-sided p-value for the primary
comparison is taken from the stratified log-rank test.
Kaplan-Meier estimates of the survival function will be provided for each treatment
arm, including median survival time (with 95% CI via Brookmeyer-Crowley method)
and milestone survival rates at [12, 24, 36] months (with 95% CI via Greenwood's
formula).
SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Unstratified): Unstratified log-rank test and unstratified
Cox model to assess the impact of stratification on the primary result.
Sensitivity Analysis 2 (Clinical Strata): Stratified analysis using actual patient
characteristics from CRF (rather than IRT-recorded strata) to assess impact of
stratification errors.
Sensitivity Analysis 3 (PFS Censoring — conservative): [For PFS only] Patients who
discontinue study treatment or initiate subsequent therapy without documented
progression are counted as PFS events at the date of discontinuation/new therapy start.
Sensitivity Analysis 4 (RMST): The restricted mean survival time difference at [24]
months is reported as a supplementary measure. RMST does not require the proportional
hazards assumption.
PROPORTIONAL HAZARDS ASSESSMENT
================================
The proportional hazards assumption will be assessed via Schoenfeld residuals test
(cox.zph) and visual inspection of log-log survival plots. If evidence of NPH is
detected (p < 0.10 for Schoenfeld test), results of the MaxCombo test and RMST
analysis will be reported alongside the primary log-rank result.
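The stratified log-rank test named in Template 1 can be illustrated with a minimal sketch. It assumes right-censored data supplied as parallel lists and a two-arm comparison; the function names `logrank_components` and `stratified_logrank` are hypothetical and are not taken from any package cited in this article. Production analyses would normally rely on validated software rather than this hand-rolled version.

```python
import math
from collections import defaultdict

def logrank_components(times, events, arms):
    """Return (O - E, V) for arm 1 within a single stratum.

    times: follow-up times; events: 1 = event, 0 = censored; arms: 0 or 1.
    """
    o_minus_e = 0.0
    var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for x in times if x >= t)                          # at risk, both arms
        n1 = sum(1 for x, a in zip(times, arms) if x >= t and a == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, a in zip(times, events, arms)
                 if x == t and e == 1 and a == 1)
        o_minus_e += d1 - d * n1 / n                                 # observed minus expected
        if n > 1:                                                    # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return o_minus_e, var

def stratified_logrank(times, events, arms, strata):
    """Sum O - E and V over strata; return (chi-square, two-sided p-value)."""
    groups = defaultdict(lambda: ([], [], []))
    for t, e, a, s in zip(times, events, arms, strata):
        groups[s][0].append(t)
        groups[s][1].append(e)
        groups[s][2].append(a)
    total_oe = 0.0
    total_v = 0.0
    for t_s, e_s, a_s in groups.values():
        oe, v = logrank_components(t_s, e_s, a_s)
        total_oe += oe
        total_v += v
    chi2 = total_oe ** 2 / total_v
    p_value = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi-square with 1 df
    return chi2, p_value
```

Summing the stratum-level numerators and variances before forming the statistic is what distinguishes the stratified test from running a separate test in each stratum.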
Template 2: Primary TTE Analysis Under Non-Proportional Hazards (IO Trial)
PRIMARY ANALYSIS
================
The primary analysis of [OS / PFS] will be performed in the ITT population using
the MaxCombo test, which combines four Fleming-Harrington weighted log-rank
statistics: FH(0,0), FH(1,0), FH(0,1), and FH(1,1). The test statistic is the
maximum of the four standardized statistics, with the critical value determined
from the asymptotic multivariate normal distribution at one-sided α = [0.025].
The stratified Cox proportional hazards model HR (with 95% CI) is reported as the
primary treatment effect measure, stratified by [stratification factors]. Under NPH
it is interpreted as an average effect over the observed follow-up period.
SUPPLEMENTARY ANALYSES
======================
RMST Difference: The restricted mean survival time difference at [t* = 24 months]
is reported with 95% CI as a model-free absolute treatment effect measure.
Piecewise Cox Model: A piecewise Cox model with a change-point at [3 months] is
fitted to estimate early-phase HR (months 0–3) and late-phase HR (months 3+),
characterizing the delayed treatment effect pattern.
Standard Log-Rank: The standard (unweighted) log-rank test is reported as a
sensitivity analysis to assess whether conclusions differ under the PH assumption.
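The four Fleming-Harrington components that Template 2 combines can be sketched as follows. This is an illustration only: it computes the standardized FH(ρ, γ) statistics and their maximum, but it does not compute the MaxCombo p-value, which requires the joint multivariate normal distribution of the four components. The function names and the sign convention (positive values mean fewer events than expected in arm 1, taken here as the experimental arm) are assumptions made for this example.

```python
import math

def fh_z(times, events, arms, rho, gamma):
    """Standardized FH(rho, gamma) weighted log-rank statistic for arm 1.

    The weight at each event time is S(t-)^rho * (1 - S(t-))^gamma, where S is
    the pooled Kaplan-Meier estimate just before t. Sign convention (an
    assumption of this sketch): positive Z = fewer events than expected in arm 1.
    """
    s_prev = 1.0   # pooled KM estimate just before the current event time
    num = 0.0
    var = 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, a in zip(times, arms) if x >= t and a == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, a in zip(times, events, arms)
                 if x == t and e == 1 and a == 1)
        w = (s_prev ** rho) * ((1.0 - s_prev) ** gamma)
        num += w * (d * n1 / n - d1)            # expected minus observed in arm 1
        if n > 1:
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_prev *= 1.0 - d / n                   # update pooled KM after time t
    return num / math.sqrt(var)

def maxcombo_statistic(times, events, arms):
    """Return the four component Z statistics and their maximum."""
    pairs = [(0, 0), (1, 0), (0, 1), (1, 1)]
    z = {pair: fh_z(times, events, arms, *pair) for pair in pairs}
    return z, max(z.values())
```

FH(0,0) reproduces the standard log-rank statistic, FH(0,1) up-weights late differences (the delayed-effect pattern), FH(1,0) up-weights early differences, and FH(1,1) emphasizes the middle of follow-up.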
Template 3: Crossover-Adjusted OS (Sensitivity Analysis)
CROSSOVER ADJUSTMENT (SENSITIVITY ANALYSIS)
============================================
To estimate the treatment effect on OS in the absence of crossover from [control arm]
to [experimental therapy], the following pre-specified sensitivity analyses are
conducted:
1. RPSFT (Rank-Preserving Structural Failure Time): Estimates the counterfactual
survival time had control-arm patients not received [experimental therapy] after
progression. The acceleration factor parameter ψ is estimated via g-estimation. The
accelerated failure time and common treatment effect assumptions are documented and
their plausibility assessed.
2. IPCW (Inverse Probability of Censoring Weighting): Control-arm patients are
censored at crossover, and the remaining observations are weighted by the inverse
probability of remaining uncensored (not having crossed over), estimated from a
logistic model including [baseline covariates: ECOG, stage, biomarker status,
prior lines] and time-varying prognostic factors. Weight truncation at the 1st and
99th percentiles is applied to stabilize estimates.
Both adjusted HR estimates are reported alongside the primary ITT analysis. These
analyses are exploratory and not the basis for efficacy claims. The ITT analysis
remains the primary analysis for regulatory decision-making.
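The RPSFT step in Template 3 can be made concrete with a simplified sketch. It assumes the usual structural model U = T_off + exp(ψ)·T_on, where ψ < 0 corresponds to a beneficial treatment, and it assumes every death time is observed, so the re-censoring step that a real RPSFT analysis needs is omitted. The function names and the grid-search form of g-estimation are illustrative choices, not the interface of any specific package.

```python
import math

def counterfactual_time(t_off, t_on, psi):
    """Untreated counterfactual survival time U = T_off + exp(psi) * T_on."""
    return t_off + math.exp(psi) * t_on

def logrank_z(times, arms):
    """Log-rank Z for arm 1, assuming every time is an observed event."""
    oe = 0.0
    var = 0.0
    for t in sorted(set(times)):
        n = sum(1 for x in times if x >= t)
        n1 = sum(1 for x, a in zip(times, arms) if x >= t and a == 1)
        d = sum(1 for x in times if x == t)
        d1 = sum(1 for x, a in zip(times, arms) if x == t and a == 1)
        oe += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return oe / math.sqrt(var)

def g_estimate(t_off, t_on, arms, grid):
    """Return the grid value of psi whose counterfactual times are best
    balanced between the randomized arms (log-rank Z closest to zero)."""
    def abs_z(psi):
        u = [counterfactual_time(a, b, psi) for a, b in zip(t_off, t_on)]
        return abs(logrank_z(u, arms))
    return min(grid, key=abs_z)
```

The logic rests on randomization: if ψ is correct, the untreated counterfactual times should have the same distribution in both arms, so the test statistic comparing them should be close to zero. Here `t_off` and `t_on` are each patient's time off and on the experimental therapy; a control-arm patient who never crosses over has `t_on = 0`.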
Template 4: Binary Endpoint (ORR, pCR)
PRIMARY ANALYSIS
================
The primary analysis of [ORR / pCR] will be performed in the [ITT / evaluable]
population using the stratified Cochran-Mantel-Haenszel (CMH) test, stratified by
[randomization stratification factors].
The primary treatment effect measure is the risk difference (experimental − control)
with 95% CI estimated via the Newcombe-Wilson method. The Mantel-Haenszel common
odds ratio is also reported with 95% CI.
SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Logistic Regression): Multivariable logistic regression
adjusting for [treatment, stratification factors, baseline tumor burden] to estimate
the adjusted odds ratio.
Sensitivity Analysis 2 (Unstratified): Fisher's exact test (for sparse cells) or
chi-square test (for adequate cell counts) without stratification.
[For single-arm trials]:
The primary analysis of ORR uses the exact binomial test at one-sided α = 0.025,
testing H₀: p ≤ [p₀] vs. H₁: p > [p₀]. The 95% CI for the observed response rate
is calculated via the Clopper-Pearson exact method.
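For the single-arm case in Template 4, the exact binomial p-value and the Clopper-Pearson interval can be computed directly from the binomial distribution. The sketch below uses only the standard library and finds the interval limits by bisection rather than beta quantiles; the function names are hypothetical.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def exact_binomial_p_upper(x, n, p0):
    """One-sided exact p-value for H0: p <= p0 vs H1: p > p0, i.e. P(X >= x | p0)."""
    return 1.0 if x == 0 else 1.0 - binom_cdf(x - 1, n, p0)

def _bisect(f, lo=0.0, hi=1.0, tol=1e-10):
    """Root of a monotone function f on [lo, hi] with a sign change."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == (f_lo > 0):
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, conf=0.95):
    """Exact two-sided confidence interval for a binomial proportion."""
    alpha = 1 - conf
    # Lower limit solves P(X >= x | p) = alpha / 2.
    lower = 0.0 if x == 0 else _bisect(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else _bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2)
    return lower, upper
```

A call such as `clopper_pearson(12, 40)` returns the two-sided 95% limits for 12 responders out of 40 patients, and `exact_binomial_p_upper(12, 40, 0.15)` gives the one-sided p-value against a hypothetical null rate of 15%.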
Template 5: Longitudinal PRO (MMRM with Reference-Based Sensitivity)
PRIMARY ANALYSIS
================
The primary PRO analysis uses a Mixed Model for Repeated Measures (MMRM) with
unstructured covariance, modeling change from baseline in [PRO instrument score]
at each scheduled assessment visit ([Weeks 6, 12, 18, 24]). Fixed effects include
treatment, visit, treatment-by-visit interaction, baseline score, and [stratification
factors]. The analysis is conducted in the ITT population. The primary timepoint
for inference is [Week 24].
Missing data are handled implicitly under the MAR assumption. Because MAR cannot be
verified from the observed data, the pre-specified sensitivity analyses below are
conducted regardless of the observed dropout pattern.
SENSITIVITY ANALYSES
====================
Sensitivity Analysis 1 (Jump-to-Reference, J2R): Post-dropout outcomes for the
experimental arm are imputed using the control arm's observed trajectory from the
dropout timepoint forward. Multiple imputation (M = 100 datasets) with Rubin's
rules pooling.
Sensitivity Analysis 2 (Copy Increments in Reference, CIR): Post-dropout rate of
change imputed from the control arm's observed increments, preserving the individual's
level at dropout.
Sensitivity Analysis 3 (Tipping Point): Systematic δ-shift applied to the MAR-based
imputed post-dropout values in the experimental arm (δ = 0.0 to 1.0 in 0.1 increments)
to identify the departure from MAR required to reverse the primary conclusion.
Results are reported as a tipping-point plot.
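The Rubin's rules pooling step referenced in Sensitivity Analysis 1 combines the M per-imputation estimates into one estimate and standard error. A minimal sketch, assuming each imputed dataset has already been analyzed to give a treatment-difference estimate and its variance (the function name `rubin_pool` is hypothetical):

```python
import math
from statistics import mean, variance

def rubin_pool(estimates, variances):
    """Pool per-imputation estimates and variances with Rubin's rules.

    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    q_bar = mean(estimates)            # pooled point estimate
    w = mean(variances)                # within-imputation variance
    b = variance(estimates)            # between-imputation variance (divisor m - 1)
    t = w + (1 + 1 / m) * b            # total variance
    if b > 0:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    else:
        df = math.inf                  # no between-imputation variability
    return q_bar, math.sqrt(t), df
```

The pooled estimate is then compared against a t distribution with the returned degrees of freedom. In a tipping-point analysis, the same pooling is repeated at each value of δ.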
9. Limitations and Pitfalls
1. Log-rank under NPH gives misleading p-values:
The log-rank p-value can be non-significant even with clinically meaningful late benefit (IO). Always pre-specify NPH-aware tests if delayed effects are anticipated.
2. Cox HR is not meaningful under NPH:
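When hazards cross, a single Cox HR is a weighted average of time-varying effects that depends on follow-up duration and the censoring pattern, so it has no clear interpretation. Report piecewise HR or RMST difference instead.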
3. RPSFT is frequently misapplied:
Only 19% of RPSFT applications in oncology trials are methodologically appropriate (Prasad et al., 2023). Do not use RPSFT as primary or when multiple sequential therapies make causal inference impossible.
4. Over-stratification destroys power:
Too many strata with few events per stratum lead to unstable HR estimates and potential bias. As a rule of thumb, limit to ≤12 total strata with ≥10 events each.
5. LOCF is obsolete:
LOCF (Last Observation Carried Forward) for longitudinal endpoints is biased and inefficient under any realistic dropout mechanism. Use MMRM as primary. Do not rely on LOCF as a sensitivity analysis; if it is reported at all, label it as a historical comparison only.
6. Exact methods required for small samples:
Asymptotic tests (CMH, chi-square, Wald) are unreliable when expected counts fall below 5 in any cell. Use Fisher's exact test, Barnard's test, or Firth penalized regression.
7. Tipping point analysis is mandatory for PRO claims:
Any PRO-based labeling claim requires demonstration of robustness to MNAR via tipping-point or reference-based sensitivity analysis.
8. Missing tumor assessments are non-ignorable:
Missing-assessment rates of 10–30% in Phase 3 trials can bias PFS in either direction. Multiple pre-specified PFS censoring schemes (FDA Appendix C/D) are mandatory.
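Pitfall 2 above recommends the RMST difference as an alternative to the Cox HR under NPH. The quantity is the area under the Kaplan-Meier curve up to a truncation time τ, computed per arm and then differenced. A minimal sketch with hypothetical function names, assuming τ does not exceed the largest follow-up time in either arm:

```python
def km_curve(times, events):
    """Kaplan-Meier estimate as a list of (event_time, survival_after_time)."""
    s = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        n = sum(1 for x in times if x >= t)                        # at risk at t
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        s *= 1.0 - d / n
        curve.append((t, s))
    return curve

def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area = 0.0
    prev_t, prev_s = 0.0, 1.0
    for t, s in km_curve(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)      # rectangle up to the next drop
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)  # final rectangle up to tau

def rmst_difference(times_1, events_1, times_0, events_0, tau):
    """RMST in arm 1 minus RMST in arm 0, both truncated at tau."""
    return rmst(times_1, events_1, tau) - rmst(times_0, events_0, tau)
```

The result is in the same time units as the input (for example, months of survival gained over the first τ months), which is why it remains interpretable when hazards are not proportional. Variance estimation and confidence intervals are not covered by this sketch.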
10. Backlinks & Related Articles
- Oncology Endpoint Overview
- Overall Survival (OS)
- Progression-Free Survival (PFS)
- Response-Based Endpoints (ORR, CR, DOR)
- DFS and EFS Endpoints
- Multiple Endpoints and Alpha Allocation
- Emerging Endpoints in Oncology Trials
- ICH E9(R1) Estimand Framework
- NSCLC Indication Guide: FDA Regulatory Endpoints & Trial Design Patterns
- Breast Cancer Trial Design Patterns: Indication-Specific Statistical Framework
Sources:
- FDA Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics (2018, Final)
- FDA Clinical Trial Endpoints for the Approval of NSCLC Drugs and Biologics (2015/2020, Final)
- ICH E9(R1) Addendum on Estimands and Sensitivity Analysis (2019, Final)
- ICH E8(R1) General Considerations for Clinical Studies (2021, Final)
- Estimand framework in oncology: ICH E9(R1) and intercurrent events (literature synthesis)
- OS crossover adjustment: RPSFT, 2SRST, IPCW methods (literature synthesis)
- Reference-based imputation: J2R, CIR, CR, tipping point (literature synthesis)
- MMRM and longitudinal analysis in oncology (literature synthesis)
- Non-proportional hazards: MaxCombo and RMST (literature synthesis)
Last Updated: 2026-04-11