Time-to-Event Assumptions and Nonproportional Hazards
Purpose: This article details the statistical assumptions underpinning time-to-event (TTE) methods in oncology, with emphasis on when those assumptions fail and what alternatives exist. It provides biostatisticians with practical guidance for designing and analyzing trials where nonproportional hazards (NPH) are expected — particularly immunotherapy, targeted therapy, and adjuvant settings — including simulation recommendations and SAP language templates.
Regulatory context: ICH E9(R1) (Final, 2019) requires that assumptions underpinning the main estimator be documented and tested via sensitivity analysis. ICH E20 (Draft, Step 2b, June 2025) addresses simulation requirements for adaptive and complex designs under NPH.
1. Core Assumptions Behind KM, Log-Rank, and Cox PH
1.1 Kaplan-Meier Estimator
The Kaplan-Meier (KM) estimator provides a non-parametric estimate of the survival function S(t) = P(T > t).
Assumptions:
| # | Assumption | Formal Statement | Consequence If Violated |
|---|---|---|---|
| 1 | Non-informative (independent) censoring | P(T > t | C = c) = P(T > t) for all c; i.e., censoring carries no information about event time | KM overestimates or underestimates S(t); bias direction depends on whether sicker or healthier patients are censored |
| 2 | No left truncation | All patients enter observation at time 0 (randomization) | Conditional survival estimates needed; standard KM biased |
| 3 | Exact event times | Events are not interval-censored | If events are detected only at scheduled visits (common for PFS), KM estimates have interval-censoring bias; typically minor with frequent assessments |
| 4 | Homogeneous population | All patients within an arm share the same survival distribution | If substantial heterogeneity exists (e.g., mixture of biomarker+ and biomarker−), overall KM curve is a mixture that may not represent any actual subgroup |
What KM does NOT assume: KM makes no assumption about the shape of the hazard function — it is fully non-parametric. It does NOT require proportional hazards.
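As a minimal sketch of the product-limit calculation, the estimate can be computed by hand in base R. The five-patient data set and the variable names below are made up for illustration only.

```r
# Toy data (hypothetical): follow-up in months; status 1 = event, 0 = censored
time   <- c(2, 4, 4, 7, 10)
status <- c(1, 1, 0, 1, 0)

# Product-limit estimate: S(t) = product over event times t_i <= t of (1 - d_i / n_i)
event_times <- sort(unique(time[status == 1]))
n_at_risk <- sapply(event_times, function(t) sum(time >= t))
n_events  <- sapply(event_times, function(t) sum(time == t & status == 1))

km <- data.frame(time      = event_times,
                 n_at_risk = n_at_risk,
                 n_events  = n_events,
                 surv      = cumprod(1 - n_events / n_at_risk))
print(km)
```

The patient censored at 4 months stays in the risk set at 4 months, following the usual convention that censoring occurs just after any event at the same time. No assumption about the hazard shape enters the calculation.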
1.2 Log-Rank Test
The log-rank test compares two survival distributions under the null hypothesis H₀: S₁(t) = S₂(t) for all t.
Assumptions:
| # | Assumption | What It Means | Diagnostic |
|---|---|---|---|
| 1 | Independent censoring | Same as KM: censoring carries no information about event time | Compare baseline characteristics of censored vs. uncensored patients; formal test on censoring indicators |
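| 2 | Proportional hazards (for optimal power) | The hazard ratio λ₁(t)/λ₂(t) is constant over time. The test still controls type I error without PH, but it is most powerful only when PH holds | Schoenfeld residuals test (cox.zph()); log-log survival plot; visual inspection of KM curves |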
| 3 | No tied event times (or properly handled) | At most one event at each unique time | Breslow or Efron approximation handles ties (standard in software) |
Why PH matters for log-rank: The log-rank test assigns equal weight to all time points. Under PH, this is optimal (achieves maximum power). Under NPH, it is suboptimal because:
- Delayed effect (IO): Early timepoints (where curves overlap) contribute noise, not signal → diluted test statistic
- Crossing hazards: Positive and negative contributions cancel → near-zero test statistic despite real differences
- Early effect then convergence: Late timepoints (where curves re-converge) add noise → power loss
Quantified power loss:
- Under delayed IO effect (HR = 1.0 for months 0–3, HR = 0.65 for months 3+): Log-rank power ~70% vs. MaxCombo ~85% with 350 events (15 percentage points lost)
- Under crossing hazards: Log-rank power can drop to <30% even when true long-term benefit exists
1.3 Cox Proportional Hazards Model
Model: h(t | X) = h₀(t) × exp(β₁X₁ + ... + βₚXₚ)
Assumptions:
| # | Assumption | Formal Statement | Consequence If Violated |
|---|---|---|---|
| 1 | Proportional hazards | exp(β) is constant over time for each covariate | HR estimate is a weighted average of the time-varying HR — may not represent any actual effect at any timepoint |
| 2 | Log-linearity | Covariates act multiplicatively on the hazard (additive on log-hazard scale) | Misspecified dose-response or covariate effects; remedy: categorize continuous covariates or use splines |
| 3 | Independent censoring | Same as KM/log-rank | Biased HR if informative censoring present |
| 4 | Correct model specification | No omitted confounders; correct functional form | Omitted-variable bias; remedy: include stratification factors, pre-specified covariates |
PH Diagnostics — Practical Guide:
# Load the survival package (provides Surv, coxph, cox.zph, survfit)
library(survival)
# Fit Cox model, stratified by the randomization factors
fit <- coxph(Surv(time, event) ~ treatment + strata(ecog, pdl1), data = trial)
# Schoenfeld residuals test
test_ph <- cox.zph(fit)
print(test_ph) # Per-covariate and global test p-values
plot(test_ph) # Visual: flat line = PH holds; trend = NPH
# Log-log survival plot
plot(survfit(Surv(time, event) ~ treatment, data = trial),
     fun = "cloglog", xlab = "Time (log scale)", ylab = "log(-log(S(t)))")
# Parallel lines = PH holds; non-parallel = NPH
Interpretation when PH is violated: The Cox HR becomes a weighted average of the instantaneous hazard ratio over time, where the weights depend on the event distribution. This average HR:
- Under delayed effect: Overestimates early treatment effect (which is null) and underestimates late effect (which is substantial)
- Under crossing hazards: May be close to 1.0 even when the treatment provides significant long-term benefit
- Under diminishing effect: Overestimates the long-term treatment effect
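The averaging behaviour described above can be illustrated with a small simulation. This is a sketch under assumed parameters (control median of 12 months, a 3-month delay, late HR of 0.65, administrative censoring at 12 months); all object names are hypothetical and the numbers are not taken from any trial.

```r
library(survival)
set.seed(2024)

# Assumed scenario: control hazard log(2)/12 per month;
# HR = 1.0 for months 0-3 and HR = 0.65 afterwards
lambda        <- log(2) / 12
delay         <- 3
hr_late       <- 0.65
n_per_arm     <- 3000
max_follow_up <- 12   # administrative censoring (months)

# Draw event times by inverting the piecewise-linear cumulative hazard
sim_times <- function(n, late_hr) {
  e <- rexp(n)  # cumulative hazard at the event time is unit-exponential
  ifelse(e <= lambda * delay,
         e / lambda,
         delay + (e - lambda * delay) / (lambda * late_hr))
}

arm     <- rep(c(0, 1), each = n_per_arm)
t_event <- c(sim_times(n_per_arm, 1), sim_times(n_per_arm, hr_late))
dat <- data.frame(arm   = arm,
                  time  = pmin(t_event, max_follow_up),
                  event = as.integer(t_event <= max_follow_up))

# A single Cox HR summarizes a treatment effect that is 1.0 early and 0.65 late
fit <- coxph(Surv(time, event) ~ arm, data = dat)
hr_overall <- unname(exp(coef(fit)))
hr_overall
```

The fitted value is expected to fall between the early HR (1.0) and the late HR (0.65). Where it lands depends on how many events occur in each period, so it shifts with follow-up duration and the censoring pattern.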
"The hazard ratio, as a summary measure, may not adequately characterize the treatment effect when hazards are not proportional." — FDA-recognized limitation in IO trial evaluations
2. Independent / Non-Informative Censoring: How It Fails in Oncology
Definition
Independent censoring means that censoring carries no information about the time to event: knowing a patient was censored at time c tells you nothing about when their event would have occurred. Formally: T ⊥ C (event time independent of censoring time), conditional on covariates.
How Independent Censoring Fails in Oncology
| Failure Mode | Mechanism | Direction of Bias | Setting |
|---|---|---|---|
| Toxicity-driven dropout | Patients on experimental arm discontinue due to adverse events; sicker patients leave study | KM overestimates survival in experimental arm (healthier patients remain) | IO combinations (colitis, pneumonitis); targeted therapy (hepatotoxicity) |
| Clinical deterioration without documented progression | Patient too ill to attend scan; withdrawn before RECIST progression documented | PFS overestimated if deterioration → censoring (events missed) | Open-label trials; 2L+ settings with rapid decline |
| Differential imaging frequency | Open-label trials: physicians order more frequent scans for patients on experimental arm (safety concern) → earlier detection of progression | PFS underestimated in experimental arm (earlier detection bias) | Open-label PFS trials |
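| Protocol-mandated crossover | Control patients cross to experimental arm at progression; some analyses censor OS at crossover time | Censoring at crossover is informative because switching follows progression; without censoring, control-arm OS is improved by post-crossover therapy and the ITT effect is diluted | Trials with built-in crossover |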
| Subsequent therapy initiation | Patients initiating next-line therapy may stop PFS follow-up; censored at new therapy start | PFS overestimated if new therapy delays true progression | Multi-line treatment settings |
| Withdrawal of consent | Patients who withdraw may be those experiencing either benefit (no motivation to continue) or harm (too ill) | Bias direction uncertain; depends on reason for withdrawal | Long-duration adjuvant trials |
Detection Strategies
- Compare baseline characteristics: Censored vs. uncensored patients. If censored patients have worse ECOG, higher tumor burden → informative censoring likely.
- Time-to-censoring analysis: Plot KM curve of censoring times by arm. If one arm has systematically earlier censoring → differential censoring.
- Sensitivity analyses:
  - Worst-case: All censored patients treated as having events at censoring date
  - Best-case: All censored patients assumed event-free to end of study
  - IPCW: Weight remaining patients by inverse probability of not being censored
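Two of the checks above, the time-to-censoring curve and the worst-case analysis, can be sketched as follows. The data frame is a made-up toy example and the column names are assumptions.

```r
library(survival)

# Hypothetical toy data: time in months; event 1 = progression/death, 0 = censored
trial <- data.frame(
  time      = c(3, 5, 6, 8, 12, 2, 4, 7, 9, 12),
  event     = c(1, 0, 1, 0, 0, 1, 1, 0, 1, 0),
  treatment = rep(c("experimental", "control"), each = 5)
)

# Time-to-censoring: flip the indicator so that censoring becomes the "event"
cens_fit <- survfit(Surv(time, 1 - event) ~ treatment, data = trial)

# Worst-case sensitivity: every censored patient counted as an event at the censoring date
trial$event_worst <- 1L
worst_fit <- survfit(Surv(time, event_worst) ~ treatment, data = trial)

summary(cens_fit)
summary(worst_fit)
```

In a real analysis, administrative censoring at the data cutoff would normally be excluded from the worst-case recoding, since it is unrelated to prognosis. This sketch recodes all censoring for simplicity.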
Regulatory Expectation
"Missing data can complicate endpoint analysis. For endpoints based on tumor assessments, the protocol should define an adequate assessment visit for each patient... Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol." — FDA Cancer Endpoints 2018 (Final)
FDA requires ≥2 pre-specified PFS censoring schemes (Appendix C/D) to test sensitivity to censoring assumptions.
3. Proportional Hazards: Diagnosis and Interpretation Limits Under NPH
When PH Holds vs. Fails
| Scenario | PH Status | Typical HR Pattern | Common Drug Classes |
|---|---|---|---|
| Chemotherapy vs. chemotherapy | Usually holds | Constant HR ~0.70–0.85 | Platinum doublets, taxanes |
| Targeted therapy vs. chemotherapy (1L) | Often holds | Constant HR 0.30–0.60 | EGFR TKI, ALK TKI, BRAF/MEK |
| IO monotherapy vs. chemotherapy | Fails (delayed effect) | HR ~1.0 (months 0–3) → 0.60 (months 3+) | Anti-PD-1/PD-L1 monotherapy |
| IO + chemo vs. chemotherapy | May hold or mild NPH | HR ~0.75–0.85, modest early lag | IO + platinum doublet |
| IO + IO vs. chemotherapy | Fails (delayed effect) | HR ~1.0–1.1 (months 0–4) → 0.55 (months 4+) | Nivo + ipi |
| Targeted therapy (2L+ post-resistance) | Often holds initially, then converges | HR 0.40 early → 0.80 late (resistance) | Osimertinib, ceritinib |
| Adjuvant (long follow-up) | May fail (cure fraction) | HR ~0.50 early → approaches 0 late as the experimental arm plateaus | Adjuvant IO, TKI |
How to Diagnose NPH at the Design Stage
Before unblinding, NPH should be anticipated based on:
- Mechanism of action: IO agents require T-cell priming (2–4 months) → expect delayed separation
- Historical data: Phase 2 KM curves showing lag phase
- Class precedent: Prior IO trials in same indication (e.g., KEYNOTE-024 showed 3-month lag)
- Biological rationale: Cure-fraction models in adjuvant settings
Pre-specified NPH assessment plan (at design stage):
At the design stage, the proportional hazards assumption is assessed based on:
(a) mechanism of action of [study drug] (immune checkpoint inhibitor with expected
2–4 month delayed onset of action),
(b) Phase 2 data showing Kaplan-Meier separation at approximately [3] months,
(c) precedent from [KEYNOTE-024/CheckMate 227] trials in [NSCLC].
Based on this assessment, the trial is designed with both the stratified log-rank
test and the MaxCombo test as co-primary analyses, with alpha allocated equally
(α/2 each) or with MaxCombo as primary and log-rank as sensitivity.
Interpretation Limits of HR Under NPH
| NPH Pattern | What HR Reports | What HR Misses | Better Measure |
|---|---|---|---|
| Delayed effect | Weighted average ~0.75 (looks modest) | Late HR may be 0.50 (substantial benefit in responders) | RMST difference; piecewise HR; milestone rates |
| Crossing hazards | HR ~1.0 (looks null) | Treatment provides long-term benefit after crossover point | RMST difference at long horizon; milestone OS rates |
| Cure fraction | HR ~0.60 (looks moderate) | 20–30% of patients are cured (infinite survival); HR doesn't capture cure | Cure-rate model; long-term milestone rates |
| Diminishing effect | HR ~0.55 (looks strong) | Late HR may be 0.85 (resistance developing); early benefit overstated | Piecewise HR; landmark analyses |
4. Typical Oncology NPH Scenarios
Scenario 1: Delayed Immunotherapy Effect
Pattern: KM curves overlap for 2–4 months (T-cell priming phase), then separate durably.
Hazard ratio profile:
- Months 0–3: HR ≈ 1.0 (no treatment effect; T-cell response developing)
- Months 3–24: HR ≈ 0.65 (substantial benefit in responders)
- Months 24+: HR ≈ 0.65 (sustained benefit; may plateau)
Clinical examples: KEYNOTE-024 (pembrolizumab 1L NSCLC), CheckMate 227 (nivo + ipi NSCLC), IMpower150 (atezolizumab + bev + chemo NSCLC)
Power impact:
- Log-rank: ~70% power with 350 events
- MaxCombo: ~85% power with 350 events
- FH(0,1) alone: ~83% power (targets delayed effects but is less robust to misspecification)
Design recommendation: MaxCombo as primary; RMST at 24 months as supplementary; piecewise Cox as descriptive
Scenario 2: Early Separation Then Convergence (Targeted Therapy Resistance)
Pattern: KM curves separate rapidly (months 0–6), then reconverge as resistance develops (months 12+).
Hazard ratio profile:
- Months 0–6: HR ≈ 0.45 (strong early benefit from targeted inhibition)
- Months 6–12: HR ≈ 0.65 (diminishing as resistance develops)
- Months 12+: HR ≈ 0.85 (resistance; curves converging)
Clinical examples: EGFR TKI vs. chemo after 18+ months; BRAF inhibitors in melanoma
Power impact:
- Log-rank: Power maintained (~80%), but the overall HR estimate overstates the benefit relative to the late-phase HR
- FH(1,0): Highest power (~85%, weights early events)
- RMST at early t: Captures early benefit; at late t, benefit may disappear
Design recommendation: Standard log-rank acceptable (PH approximately holds for primary analysis window); report piecewise HR and RMST at multiple horizons
Scenario 3: Crossing Hazards
Pattern: Experimental arm may be worse early (toxicity, initial tumor flare), then better later. Kaplan-Meier curves cross.
Hazard ratio profile:
- Months 0–3: HR ≈ 1.2 (experimental arm slightly worse — toxicity, initial worsening)
- Months 3–6: HR ≈ 1.0 (equilibrium)
- Months 6+: HR ≈ 0.60 (substantial late benefit)
Clinical examples: Rare; may occur with IO combinations causing early immune-related adverse events before benefit; hormone therapy vs. chemotherapy switchover
Power impact:
- Log-rank: Catastrophic power loss — positive and negative contributions cancel
- MaxCombo: Moderate power recovery (~65%)
- RMST at long horizon: Captures net benefit if sufficient follow-up
- Risk: If trial is powered for late benefit, early harm is hidden in overall HR
Design recommendation: RMST at clinically meaningful horizon as primary; MaxCombo as sensitivity; accept that proof of benefit requires extended follow-up beyond crossover point
Scenario 4: Cure Fraction (Long-Term Plateau)
Pattern: KM curve for experimental arm plateaus (20–30% of patients achieve durable remission / functional cure), while control arm continues to decline.
Hazard ratio profile:
- Months 0–12: HR ≈ 0.70 (standard treatment effect)
- Months 12–36: HR ≈ 0.50 (diverging further as cured patients survive)
- Months 36+: HR → 0 as the experimental-arm hazard approaches zero (almost no events in the cured fraction; a constant-HR summary breaks down)
Clinical examples: Adjuvant IO (pembrolizumab, nivolumab); CAR-T therapy in lymphoma; ipilimumab in melanoma (10-year data showing 20% plateau)
Power impact:
- Log-rank power increases dramatically because hazard separation is permanent
- PH assumption violated, as is any Weibull-type parametric fit without a cure component; KM estimates in the tail are imprecise when few patients remain at risk
- Cure-rate models (mixture cure model) provide better long-term estimates
Design recommendation: Standard log-rank adequate for primary; supplement with milestone survival rates (e.g., 5-year OS rate) and mixture cure model; ensure follow-up ≥ 3× median to capture plateau
5. Alternatives and Complements to Standard TTE Methods
5.1 Restricted Mean Survival Time (RMST)
Definition: RMST(t*) = ∫₀^{t*} S(u) du, the average event-free time through horizon t*.
Advantages:
- No PH assumption (model-free, non-parametric)
- Expressed in time units (clinically interpretable: "X additional months alive")
- Stable under NPH; meaningful when HR is misleading
Treatment effect measures:
- RMST difference: ΔRMST = RMST₁ − RMST₀ (absolute benefit in months)
- RMST ratio: RMST₁ / RMST₀
Selecting t*:
- Pre-specify based on clinical rationale (not data-driven)
- Choose t* no later than the smaller of the two arms' maximum follow-up times anticipated under the design (keeps the KM tail stable)
- Common choices: 12, 24, 36, 60 months depending on setting
Regulatory status:
- Accepted as secondary/supplementary by FDA and EMA
- Encouraged alongside HR for transparency
- Not yet sole primary; regulatory evolution ongoing
R packages: survRM2 (standard for submissions), survival::survfit() + integration
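To make the definition concrete, RMST can be computed as the area under the Kaplan-Meier step function up to the horizon. The helper `rmst_km()` below is a hypothetical name for illustration, not a function from survRM2, and the data are made up.

```r
library(survival)

# RMST as the area under the Kaplan-Meier step function from 0 to tau
rmst_km <- function(time, status, tau) {
  fit  <- survfit(Surv(time, status) ~ 1)
  keep <- fit$time < tau
  knots   <- c(0, fit$time[keep], tau)  # interval boundaries
  heights <- c(1, fit$surv[keep])       # value of S(t) on each interval
  sum(diff(knots) * heights)
}

# Toy data (hypothetical): months of follow-up; 1 = death, 0 = censored
time   <- c(2, 4, 4, 7, 10)
status <- c(1, 1, 0, 1, 0)
rmst_km(time, status, tau = 8)
```

For these data the KM estimate is 1.0 on [0, 2), 0.8 on [2, 4), 0.6 on [4, 7) and 0.3 on [7, 8), so the area is 2 + 1.6 + 1.8 + 0.3 = 5.7 months. A two-arm analysis would take the difference of the arm-specific areas, with a confidence interval from a dedicated package.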
5.2 Milestone Survival Rates
Definition: S(t*) estimated from KM curve at pre-specified landmarks (e.g., 12-month OS rate, 24-month PFS rate).
When to use:
- NPH patterns where the treatment effect at a specific timepoint is more clinically meaningful than overall HR
- Cure-fraction scenarios: 5-year OS rate captures plateau
- Adjuvant settings: 2-year DFS rate
Statistical test: Z-test comparing two proportions with Greenwood SE; or landmark analysis at t*.
Limitation: Single-timepoint summary; loses information about entire curve shape.
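The Z-test with Greenwood standard errors mentioned above can be written out as follows. The notation is introduced here for illustration: arm k ∈ {0, 1}, with d_{ki} events among n_{ki} patients at risk at event time t_i.

```latex
Z = \frac{\hat S_1(t^*) - \hat S_0(t^*)}
         {\sqrt{\widehat{\mathrm{Var}}\bigl[\hat S_1(t^*)\bigr] + \widehat{\mathrm{Var}}\bigl[\hat S_0(t^*)\bigr]}},
\qquad
\widehat{\mathrm{Var}}\bigl[\hat S_k(t^*)\bigr]
  = \hat S_k(t^*)^2 \sum_{t_i \le t^*} \frac{d_{ki}}{n_{ki}\,(n_{ki} - d_{ki})}
```

Under the null hypothesis of equal milestone survival, Z is approximately standard normal in large samples.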
5.3 Piecewise Hazard Models
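Model: Cox model with time-varying coefficients, with the HR allowed to change at pre-specified changepoints:
# Piecewise Cox with changepoint at 3 months
# Split each patient's follow-up at 3 months; classifying patients by their
# final time instead would condition on the outcome and bias both HRs
trial_split <- survSplit(Surv(time, event) ~ ., data = trial, cut = 3, episode = "period")
trial_split$late <- as.integer(trial_split$period == 2)
coxph(Surv(tstart, time, event) ~ treatment + treatment:late + strata(ecog), data = trial_split)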
When to use: Descriptive characterization of NPH pattern; reports HR separately for each time interval.
Changepoint selection: Pre-specify based on mechanism of action (e.g., 3 months for IO T-cell priming; 12 months for targeted therapy resistance).
5.4 Weighted Log-Rank Tests (Fleming-Harrington Family)
FH(ρ, γ) weighting: Weight at each event time proportional to [S(t)]^ρ × [1 − S(t)]^γ.
| Test | Weight | Optimal For |
|---|---|---|
| FH(0,0) | 1 (standard log-rank) | Proportional hazards |
| FH(1,0) | S(t) — early events weighted | Early separation, convergence |
| FH(0,1) | 1−S(t) — late events weighted | Delayed effects (IO) |
| FH(1,1) | S(t)×[1−S(t)] — middle events | Crossing/crossing-delayed |
Risk of pre-specifying a single weighted test: If the assumed NPH pattern is wrong, the chosen weight may perform worse than standard log-rank.
Solution: MaxCombo (takes max of all four → robust across NPH patterns).
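The weighting scheme can be made concrete with a small base-R implementation of the standardized FH(ρ, γ) statistic. This is a sketch for illustration rather than a validated routine: `fh_z()` is a hypothetical helper, the data are made up, and it uses the pooled KM estimate just before each event time as S(t).

```r
library(survival)

# Standardized Fleming-Harrington FH(rho, gamma) weighted log-rank statistic.
# arm: 0 = control, 1 = experimental; negative Z = fewer events than expected in arm 1.
fh_z <- function(time, status, arm, rho = 0, gamma = 0) {
  event_times <- sort(unique(time[status == 1]))
  s_minus <- 1   # pooled KM estimate just before the current event time
  num <- 0       # weighted sum of (observed - expected) events in arm 1
  v   <- 0       # variance of the weighted sum
  for (t in event_times) {
    n  <- sum(time >= t)
    n1 <- sum(time >= t & arm == 1)
    d  <- sum(time == t & status == 1)
    d1 <- sum(time == t & status == 1 & arm == 1)
    w  <- s_minus^rho * (1 - s_minus)^gamma
    num <- num + w * (d1 - d * n1 / n)
    if (n > 1) {
      v <- v + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
    s_minus <- s_minus * (1 - d / n)
  }
  num / sqrt(v)
}

# Hypothetical toy data
time   <- c(1, 3, 4, 6, 9, 11, 2, 5, 7, 8, 12, 14)
status <- c(1, 1, 1, 0, 1, 1,  1, 0, 1, 1, 0, 1)
arm    <- rep(c(0, 1), each = 6)

z_logrank <- fh_z(time, status, arm, rho = 0, gamma = 0)  # FH(0,0): standard log-rank
z_late    <- fh_z(time, status, arm, rho = 0, gamma = 1)  # FH(0,1): up-weights late events
c(z_logrank = z_logrank, z_late = z_late)
```

With ρ = γ = 0 every weight equals 1, so the squared statistic should match the chi-square from `survdiff()`, which is a convenient sanity check.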
5.5 MaxCombo Test
Composition: MaxCombo = max(Z₀₀, Z₁₀, Z₀₁, Z₁₁), where each Zᵢⱼ is the standardized FH(i,j) statistic.
Properties:
- Robust across NPH patterns: Performs well regardless of which pattern occurs
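- Alpha-controlled: Treated as a single test; the critical value is derived from the joint distribution of the four correlated components, so no separate multiplicity adjustment across components is needed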
- Accepted by FDA: Regulatory precedent in IO trials (supplementary analysis)
Evidence: 2022 JAMA Oncology meta-analysis (Mukhopadhyay et al.): MaxCombo achieved significance in all 15 trials where log-rank failed, across 63 IO studies and 35,902 patients.
Critical value: Derived from asymptotic multivariate normal distribution (correlation from shared at-risk sets) or permutation. Use simtrial for simulation-based critical values.
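The correlation structure behind the multivariate normal critical value follows from the way the FH weights multiply. As a sketch in notation introduced here, let v_i be the hypergeometric variance of the arm-1 event count at event time t_i, and let Ŝ(t⁻) be the pooled KM estimate just before t:

```latex
\begin{aligned}
U_{\rho,\gamma} &= \sum_i w_{\rho,\gamma}(t_i)\left(d_{1i} - d_i\,\frac{n_{1i}}{n_i}\right),
\qquad
w_{\rho,\gamma}(t) = \hat S(t^-)^{\rho}\,\bigl[1 - \hat S(t^-)\bigr]^{\gamma} \\[4pt]
\operatorname{Cov}\bigl(U_{\rho_1,\gamma_1},\, U_{\rho_2,\gamma_2}\bigr)
  &= \sum_i w_{\rho_1,\gamma_1}(t_i)\, w_{\rho_2,\gamma_2}(t_i)\, v_i
   = \operatorname{Var}\bigl(U_{\bar\rho,\bar\gamma}\bigr),
\qquad
\bar\rho = \tfrac{\rho_1 + \rho_2}{2},\quad \bar\gamma = \tfrac{\gamma_1 + \gamma_2}{2} \\[4pt]
\operatorname{Corr}\bigl(Z_{\rho_1,\gamma_1},\, Z_{\rho_2,\gamma_2}\bigr)
  &= \frac{\operatorname{Var}\bigl(U_{\bar\rho,\bar\gamma}\bigr)}
          {\sqrt{\operatorname{Var}\bigl(U_{\rho_1,\gamma_1}\bigr)\,\operatorname{Var}\bigl(U_{\rho_2,\gamma_2}\bigr)}}
\end{aligned}
```

Each pairwise correlation can therefore be estimated from the variance of one additional FH statistic. The MaxCombo critical value is the quantile of the maximum of a four-dimensional normal vector with that correlation matrix.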
R packages: simtrial (Merck), nph, nphsim
6. Crossover and Post-Progression Treatment Switching
Why Crossover Creates a Methodological Problem
When control-arm patients cross over to experimental therapy at progression, ITT OS analysis underestimates the true treatment effect because control-arm survival is artificially improved.
Adjustment Methods (Sensitivity Analyses Only)
| Method | Mechanism | Key Assumption | When Appropriate | When Inappropriate |
|---|---|---|---|---|
| RPSFT | Estimates shrinkage factor ψ applied to post-crossover time intervals; preserves rank order | Accelerated failure time: treatment acts multiplicatively on survival time (not on the hazard); common treatment effect whether therapy starts at randomization or at switch | Well-established drug efficacy; single switch direction; crossover is medically appropriate | Multiple sequential therapies; drug efficacy not established; switching is inappropriate/delayed |
| 2SRST | Treats crossover as "second randomization"; re-censoring or IPW for post-switch phase | Two-stage independence; no unmeasured confounders | Late crossover; non-exponential survival; sparse post-switch events | If crossover timing is deterministic (violates positivity) |
| IPCW | Weights observations by 1/P(not censored at crossover) | Exchangeability (no unmeasured confounders); positivity; correct weight model | Multiple switches; complex patterns; bidirectional switching | When P(switch) → 0 or 1 for subgroups (extreme weights); unmeasured confounders |
RPSFT Detailed Mechanics
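Counterfactual (treatment-free) event time for a patient who crosses over =
(time on original arm) + ψ × (time after switch to experimental therapy)
Where ψ ∈ [0, 1] is the time-scale factor (often written as exp(ψ) in the literature):
ψ = 1: no adjustment (crossover had no effect on survival time)
ψ → 0: post-switch survival time is attributed almost entirely to the experimental therapy
ψ is estimated via g-estimation: the value at which counterfactual event times are balanced between the randomized arms (the test statistic for an arm difference equals zero)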
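The adjustment formula above can be sketched as a one-line function. The function name and the example patient are hypothetical, and ψ is treated as a known input here rather than estimated.

```r
# Adjusted time for a control-arm patient who switches, using the
# multiplicative factor psi in [0, 1] as parameterized in the text
counterfactual_time <- function(t_before_switch, t_after_switch, psi) {
  stopifnot(psi >= 0, psi <= 1)
  t_before_switch + psi * t_after_switch
}

# Hypothetical patient: 10 months on control, then 8 months on experimental therapy
counterfactual_time(10, 8, psi = 1.0)  # 18.0 months: no adjustment
counterfactual_time(10, 8, psi = 0.6)  # 14.8 months: post-switch time shrunk by 40%
```

In practice ψ is found by a grid search or root-finding over candidate values, re-running the arm comparison on the adjusted times at each candidate.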
When RPSFT assumptions fail:
- Multiple sequential therapies make disentangling effects impossible
- Harm from delay in switching outweighs initial therapy benefit
- Only 19% of RPSFT applications in 65 oncology trials were methodologically appropriate (Prasad et al., 2023)
IPCW Positivity Failure
IPCW weights become extreme when P(switch | history) → 0 or → 1:
- Rapid progression nearly deterministically predicts switching
- Patient characteristics perfectly predict crossover
Stabilization: Truncate weights at 1st/99th percentile or cap at max weight = 10. Report both truncated and untruncated results.
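Percentile truncation of the weights can be sketched in base R as follows. The helper name and the simulated weights are illustrative assumptions.

```r
# Truncate IPCW weights at the 1st and 99th percentiles
truncate_weights <- function(w, lower = 0.01, upper = 0.99) {
  bounds <- quantile(w, probs = c(lower, upper), names = FALSE)
  pmin(pmax(w, bounds[1]), bounds[2])
}

# Hypothetical weights with one extreme value
set.seed(1)
w <- c(runif(199, min = 0.8, max = 3), 250)
w_trunc <- truncate_weights(w)
c(max_raw = max(w), max_truncated = max(w_trunc))
```

Truncation trades a small amount of bias for a large reduction in variance, which is why reporting both truncated and untruncated results is recommended.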
Regulatory Status
| Agency | Position |
|---|---|
| EMA | Accepts RPSFT and IPCW as supporting sensitivity analyses; prefers pre-specification; requires transparency |
| FDA | Increasingly skeptical of RPSFT without strong justification; prefers designs that avoid crossover (early access programs) |
| Both | Crossover-adjusted analyses are secondary only — never sole basis for approval; ITT remains primary |
R packages: rpsftm (RPSFT via g-estimation), ipw (IPCW), custom code for 2SRST
7. Primary vs. Sensitivity Analysis Strategy When PH Is Doubtful
Decision Framework
| Design-Stage PH Assessment | Primary Analysis | Sensitivity Analyses | Effect Measure |
|---|---|---|---|
| PH likely (chemo vs. chemo; targeted vs. chemo) | Stratified log-rank | Unstratified log-rank; RMST at t* | HR (Cox) with 95% CI |
| NPH possible (IO + chemo vs. chemo) | Stratified log-rank | MaxCombo; RMST; piecewise Cox | HR + RMST difference |
| NPH expected (IO mono vs. chemo; IO + IO vs. chemo) | MaxCombo (pre-specified) | Standard log-rank; RMST; piecewise Cox | HR + RMST + milestone rates |
| NPH expected + crossover likely | MaxCombo (ITT) | Standard log-rank; RPSFT/IPCW for OS sensitivity | HR + RMST + adjusted HR |
| Cure fraction expected (adjuvant IO, CAR-T) | Stratified log-rank | Milestone rates (3-yr, 5-yr); mixture cure model; RMST at long t* | HR + milestone rates |
Alpha Allocation Options
When using both log-rank and MaxCombo:
| Strategy | Allocation | When to Use |
|---|---|---|
| MaxCombo as sole primary | Full α to MaxCombo; log-rank as sensitivity | Strong NPH expectation based on mechanism + Phase 2 data |
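| Sequential: log-rank first, MaxCombo second | Test log-rank at α₁ < α; if non-significant, test MaxCombo at α₂, with α₁ + α₂ ≤ α (or with critical values derived from the joint distribution of the two correlated statistics). Re-testing at the full α after a non-significant first test inflates type I error, even when the tests are positively correlated | Mild NPH expectation; want to preserve power under PH |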
| Split α equally | α/2 to each | When uncertain about PH vs. NPH; conservative |
| Hierarchical | Full α to log-rank; MaxCombo as pre-specified secondary | Regulatory preference in some settings; MaxCombo supports but doesn't determine |
8. Simulation Recommendations for NPH Scenarios
When Simulation Is Required (vs. Analytical Sample Size)
Analytical (Schoenfeld) formula works when:
- Proportional hazards assumed
- Single primary endpoint, single final analysis (no interims)
- Simple 1:1 randomization
- No complex censoring patterns
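For reference, the Schoenfeld approximation for the required number of events can be sketched as below. The function name and defaults are assumptions chosen for illustration; `alpha` is one-sided and `alloc` is the proportion randomized to the experimental arm.

```r
# Required number of events for a two-arm log-rank test under PH
# (Schoenfeld approximation): d = (z_{1-alpha} + z_{power})^2 / (p (1 - p) log(HR)^2)
schoenfeld_events <- function(hr, alpha = 0.025, power = 0.80, alloc = 0.5) {
  z_sum <- qnorm(1 - alpha) + qnorm(power)
  ceiling(z_sum^2 / (alloc * (1 - alloc) * log(hr)^2))
}

schoenfeld_events(hr = 0.70)  # 247 events for 80% power at one-sided alpha = 0.025
```

The formula depends on the treatment effect only through a single constant HR, which is why it stops being applicable once the hazard ratio varies over time.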
Simulation required when:
- NPH expected (piecewise hazard models)
- Group-sequential designs with interim analyses under NPH
- Multiple endpoints (PFS + OS co-primary) with correlated stopping
- MaxCombo or weighted log-rank as primary test
- Complex enrollment patterns (staggered entry, variable accrual)
- Crossover expected (need to simulate adjusted OS)
ICH E20 (Draft, Step 2b, June 2025): "For adaptive designs... the operating characteristics should be evaluated through simulation studies. The simulation should cover a range of plausible scenarios."
Simulation Parameters to Specify
1. Patient Population and Enrollment
# Accrual rate: 20 patients/month for 24 months
# Total enrollment: 480 patients
enrollRates <- data.frame(
  Stratum = "All",
  duration = 24,
  rate = 20
)
2. Piecewise Failure Rates (Control Arm)
# Control arm: median OS = 12 months → λ = log(2)/12 ≈ 0.0578 per month
# Experimental-arm hazard in each period = failRate × hr (delayed effect)
# Column names here are camelCase; confirm the names expected by the installed simtrial version
failRates <- data.frame(
  Stratum = "All",
  period = 1:2,
  duration = c(3, Inf),         # First 3 months, then beyond
  failRate = c(0.058, 0.058),   # Control event rate per month (constant)
  hr = c(1.0, 0.65),            # HR = 1.0 early, 0.65 late
  dropoutRate = c(0.01, 0.01)   # 1% monthly dropout
)
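As a quick arithmetic check on these inputs, the implied median in the experimental arm can be derived from the piecewise cumulative hazard. The function below is a hypothetical helper, not part of simtrial.

```r
# Median survival under a two-piece hazard: HR = 1 before `delay`, hr_late afterwards.
# Solves lambda * delay + lambda * hr_late * (t - delay) = log(2) for t.
median_delayed <- function(control_median, delay, hr_late) {
  lambda <- log(2) / control_median   # control hazard per month
  if (lambda * delay >= log(2)) {
    return(control_median)            # median reached before the delay ends
  }
  delay + (log(2) - lambda * delay) / (lambda * hr_late)
}

median_delayed(control_median = 12, delay = 3, hr_late = 0.65)  # about 16.8 months
```

With a 12-month control median, the experimental median is 3 + 9 / 0.65 ≈ 16.8 months. The median ratio is about 1.40, smaller than the 1 / 0.65 ≈ 1.54 implied by a constant HR of 0.65.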
3. Treatment Effect (Hazard Ratio Pattern)
| Scenario | Period 1 (months 0–3) | Period 2 (months 3–12) | Period 3 (months 12+) |
|---|---|---|---|
| Delayed IO effect | HR = 1.0 | HR = 0.65 | HR = 0.65 |
| Convergence (TKI) | HR = 0.45 | HR = 0.65 | HR = 0.85 |
| Crossing hazards | HR = 1.2 | HR = 1.0 | HR = 0.60 |
| Cure fraction | HR = 0.70 | HR = 0.50 | HR → 0 (plateau) |
4. Dropout and Censoring
# Monthly dropout rate: 1–3% (setting-dependent)
# Minimum follow-up after last patient enrolled: 12 months
# Maximum study duration: 48 months
5. Interim Analysis Timing
# Interim at 60% information fraction (~210 events of 350 target)
# Final at 100% information fraction (350 events)
# O'Brien-Fleming spending function for efficacy
# Futility: conditional power < 20% at interim
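To show how little alpha an O'Brien-Fleming-type rule spends at the interim, the Lan-DeMets spending function can be evaluated directly. This sketch assumes the Lan-DeMets approximation to the O'Brien-Fleming boundary with a one-sided total alpha of 0.025; the function name is hypothetical.

```r
# Lan-DeMets O'Brien-Fleming-type alpha-spending function (alpha = one-sided total)
obf_alpha_spent <- function(info_frac, alpha = 0.025) {
  2 - 2 * pnorm(qnorm(1 - alpha / 2) / sqrt(info_frac))
}

obf_alpha_spent(0.6)  # cumulative alpha available at the 60% interim (210 of 350 events)
obf_alpha_spent(1.0)  # equals the full alpha at the final analysis
```

Only a small fraction of the 0.025 is spent at the 60% interim, so the final analysis keeps nearly the full significance level. Under NPH the information fraction based on event counts may not track statistical information for weighted tests, which is one reason boundaries are checked by simulation.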
Operating Characteristics to Report
After running ≥50,000 replicate trials:
| Output | Question Answered | Reporting |
|---|---|---|
| Type I error | Does FWER stay ≤ α under null? | Report by test (log-rank, MaxCombo, RMST) |
| Power | Under alternative, what is P(reject H₀)? | Report by test and NPH scenario |
| Expected sample size | How many events/patients needed on average? | Report median and 95% interval |
| Conditional power at interim | If interim data look like alternative, what is P(success at final)? | Report by information fraction |
| Bias in HR estimate | Is the average estimated HR close to the true average HR? | Report mean estimated HR vs. true piecewise HR |
| Coverage of CI | Does the 95% CI contain the true effect 95% of the time? | Report coverage probability |
Operating Characteristics Across NPH Scenarios: Summary
| Scenario | Log-Rank Power | FH(0,1) Power | MaxCombo Power | RMST Power | Optimal Test | Min Follow-up |
|---|---|---|---|---|---|---|
| PH (HR = 0.70) | 85% | 78% | 83% | 80% | Log-rank | 24 mo |
| Delayed IO (3-mo lag, HR = 0.65) | 70% | 83% | 85% | 78% | MaxCombo | 30 mo |
| Convergence (TKI resistance) | 80% | 70% | 78% | 75% | Log-rank | 18 mo |
| Crossing hazards | <30% | 45% | 65% | 60% | RMST | 36 mo |
| Cure fraction (20% cured) | 90% | 88% | 92% | 85% | MaxCombo | 48 mo |
R Packages for Simulation
| Package | Capabilities | Best For |
|---|---|---|
| simtrial (Merck) | Piecewise hazard simulation; MaxCombo power; group-sequential under NPH | Primary simulation tool for NPH oncology trials |
| gsDesign2 (Merck) | Group-sequential design with NPH support; integrates with simtrial | Sample size and boundary calculation under NPH |
| nph | FH weighted log-rank tests; NPH-aware sample size | Quick power calculations; analytical bounds |
| rpact | General-purpose confirmatory trial simulation; GSD + adaptive | Standard PH designs; adaptive extensions |
| survRM2 | RMST computation and testing | RMST-based analyses |
Common Simulation Pitfalls
| Pitfall | Consequence | Fix |
|---|---|---|
| Assuming PH for IO trial | Log-rank power overestimated (~15 percentage points too optimistic) | Simulate piecewise hazard with delay period |
| Ignoring dropout | Sample size underestimated | Model realistic dropout; sensitivity ±50% dropout rate |
| Single-point parameters only | Overconfident design; FDA considers inadequate | Sweep across HR range (0.55–0.75), dropout (1–5%), accrual (±20%) |
| Too few replications | Unstable power estimates (SE > 0.5%) | ≥50,000 replications (SE ≈ 0.2% at 80% power) |
| Not simulating MaxCombo critical value | Incorrect alpha control | Use permutation or multivariate normal to derive critical value |
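The replication guidance in the table follows from the binomial standard error of a simulated proportion. A minimal sketch (the function name is illustrative):

```r
# Monte Carlo standard error of a simulated power (or type I error) estimate
mc_se <- function(p, n_sim) sqrt(p * (1 - p) / n_sim)

# SE in percentage points at 80% power for increasing numbers of replicates
round(100 * mc_se(0.80, c(1000, 10000, 50000)), 2)  # 1.26, 0.40, 0.18
```

For type I error near 2.5%, the same formula with 50,000 replicates gives an SE of about 0.07 percentage points. This sets the smallest alpha inflation a simulation of that size can reliably detect.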
9. SAP Template: TTE Endpoint with PH-Risk Mitigation
Template: Primary OS/PFS Analysis with NPH Awareness
1. ESTIMAND
===========
Population: All randomized patients (ITT population)
Variable: Overall survival, defined as time from randomization to death from any cause
Treatment conditions: [Experimental drug] vs. [Control/SOC]
Summary measure: Hazard ratio (stratified Cox), RMST difference at [t* = 24] months
IE strategy: Treatment policy — OS analyzed regardless of subsequent anti-cancer
therapy, treatment discontinuation, or crossover. All deaths included regardless
of cause or post-randomization events.
2. PRIMARY ANALYSIS
====================
The primary test of the null hypothesis H₀: no difference in OS between treatment
arms is the [MaxCombo test / stratified log-rank test], stratified by [ECOG
performance status (0 vs. 1), PD-L1 status (≥50% vs. <50%), geographic region].
The MaxCombo test combines four Fleming-Harrington weighted log-rank statistics
[FH(0,0), FH(1,0), FH(0,1), FH(1,1)], with the maximum standardized statistic
as the test statistic and critical value determined from the asymptotic multivariate
normal distribution at one-sided α = 0.025.
[Alternative if log-rank is primary]:
The primary test is the stratified log-rank test at one-sided α = 0.025.
The primary treatment effect measure is the hazard ratio estimated from a stratified
Cox proportional hazards model, with 95% CI and two-sided p-value (Wald test).
Kaplan-Meier survival curves are estimated for each arm, with median OS (95% CI
via Brookmeyer-Crowley) and milestone survival rates at [12, 24, 36] months
(95% CI via Greenwood's formula).
3. PROPORTIONAL HAZARDS ASSESSMENT
====================================
The PH assumption is assessed post-hoc via:
(a) Schoenfeld residuals global test (cox.zph); p < 0.10 flags PH concern
(b) Visual inspection of log-log survival plots and scaled Schoenfeld residuals
over time
(c) Comparison of HR estimates from piecewise Cox model (change-point at [3] months)
If PH is rejected (p < 0.10 or visual evidence of non-proportionality), the
MaxCombo result and RMST difference are given interpretive priority over the
standard log-rank and Cox HR.
4. SENSITIVITY ANALYSES
========================
Sensitivity Analysis 1 ([MaxCombo / Standard log-rank]):
[If primary is log-rank]: MaxCombo test as sensitivity to assess robustness
under delayed treatment effect.
[If primary is MaxCombo]: Standard stratified log-rank as sensitivity to assess
consistency under PH assumption.
Sensitivity Analysis 2 (RMST):
Restricted mean survival time difference at t* = [24] months, with 95% CI.
RMST is reported as a model-free, clinically interpretable absolute treatment
effect measure that does not require the PH assumption.
Sensitivity Analysis 3 (Piecewise Cox):
Stratified Cox model with a pre-specified change-point at [3] months, reporting
separate HR estimates for months 0–[3] (early phase) and months [3]+ (late phase),
characterizing the time-varying treatment effect.
Sensitivity Analysis 4 (Unstratified):
Unstratified log-rank and Cox to confirm stratification does not create
artificial effects.
Sensitivity Analysis 5 (PFS Censoring — if PFS endpoint):
Alternative censoring schemes per FDA Appendix C/D:
(a) PFS-1: Uniform progression/assessment dates
(b) PFS-2: Any dropout/change treated as event (conservative)
(c) PFS-3: Investigator assessment (if IRC is primary)
5. CROSSOVER ADJUSTMENT (if applicable)
=========================================
If ≥15% of control-arm patients receive [experimental therapy] post-progression:
(a) RPSFT: Rank-Preserving Structural Failure Time model estimating
counterfactual OS under no-crossover assumption. The accelerated failure
time assumption is documented.
(b) IPCW: Inverse probability of censoring weighting, with weights estimated
from logistic model including [ECOG, stage, biomarker, prior lines].
Weights truncated at 1st/99th percentiles.
Both analyses are secondary/exploratory. ITT remains the primary analysis.
6. SIMULATION JUSTIFICATION
=============================
Sample size and power were determined via simulation (≥50,000 replicates) under
the following scenarios:
Primary scenario: Delayed treatment effect
- Control median OS: [12] months
- HR months 0–[3]: 1.0 (no effect during immune priming)
- HR months [3]+: 0.65 (sustained benefit)
- Accrual: [20] patients/month over [24] months
- Monthly dropout: [2]%
- Target events: [350]
Sensitivity scenarios:
- Optimistic: HR 0.55 (months 3+), dropout 1%
- Pessimistic: HR 0.75 (months 3+), dropout 4%
- PH scenario: HR 0.70 constant (benchmark)
- Extended delay: HR 1.0 for months 0–6, then 0.60
Under primary scenario:
- MaxCombo power: [85]% at one-sided α = 0.025
- Log-rank power: [70]% (reference)
- RMST difference at 24 months: [2.8] months (95% CI [1.1, 4.5])
- Type I error (MaxCombo): [2.48]% (confirmed ≤ 2.5%)
R packages used: simtrial (v0.4.1), gsDesign2 (v1.1.2), survival (v3.5-8)
10. Limitations and Pitfalls
1. Pre-specifying the wrong NPH pattern is risky:
If you design for delayed effect (FH(0,1) primary) but the true pattern is early separation, FH(0,1) loses power vs. standard log-rank. MaxCombo mitigates this but is slightly less powerful than the optimal single test for any given pattern.
2. RMST is sensitive to t* selection:
RMST difference at t* = 12 months may show no benefit (curves have not yet separated); at t* = 36 months it may show a large benefit. If t* is not pre-specified, results can be cherry-picked.
3. HR remains the primary regulatory effect measure:
Despite its limitations under NPH, FDA still expects HR reported. RMST and MaxCombo are supplementary, not replacements. Design accordingly.
4. PH tests have low power:
Schoenfeld residuals test has limited power to detect NPH with <200 events. A non-significant PH test does NOT prove PH holds — it means you can't detect NPH. Design-stage assessment (mechanism, Phase 2 data) is more informative.
5. RPSFT is frequently misapplied:
Only 19% of 65 oncology RPSFT applications were appropriate (Prasad et al., 2023). Never use RPSFT when multiple sequential therapies make causal inference impossible or when the drug's own efficacy is not yet established.
6. Simulation assumptions drive results:
Simulation is only as good as the assumed hazard model. If the piecewise hazard specification is wrong (e.g., assumed 3-month delay but true delay is 6 months), power estimates will be misleading. Always sweep across parameter ranges.
7. Cure-fraction models require long follow-up:
Mixture cure models need ≥3× the median survival to reliably estimate the cured fraction. Short trials cannot distinguish between cure and delayed events.
8. MaxCombo is not universally accepted as primary:
Some regulatory agencies prefer log-rank as primary with MaxCombo as sensitivity. Discuss with FDA/EMA at End-of-Phase 2 or scientific advice meetings before locking the primary analysis.
11. Backlinks & Related Articles
- Overall Survival (OS)
- Progression-Free Survival (PFS)
- Multiple Endpoints and Alpha Allocation
- NSCLC Indication Guide: FDA Regulatory Endpoints & Trial Design Patterns
- ICH E9(R1) Estimand Framework
- Statistical Analysis Methods in Oncology Trials
Sources:
- FDA Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics (2018, Final)
- FDA Clinical Trial Endpoints for the Approval of NSCLC Drugs and Biologics (2015/2020, Final)
- ICH E9(R1) Addendum on Estimands and Sensitivity Analysis (2019, Final)
- ICH E20 Adaptive Clinical Trials (June 2025, Draft, Step 2b)
- Non-proportional hazards: MaxCombo and RMST methods (literature synthesis)
- OS crossover adjustment: RPSFT, 2SRST, IPCW (literature synthesis)
- Simulation studies in clinical trial design (literature synthesis)
Last Updated: 2026-04-11