Time-to-Event Assumptions and Nonproportional Hazards

Purpose: This article details the statistical assumptions underpinning time-to-event (TTE) methods in oncology, with emphasis on when those assumptions fail and what alternatives exist. It provides biostatisticians with practical guidance for designing and analyzing trials where nonproportional hazards (NPH) are expected — particularly immunotherapy, targeted therapy, and adjuvant settings — including simulation recommendations and SAP language templates.

Regulatory context: ICH E9(R1) (Final, 2019) requires that assumptions underpinning the main estimator be documented and tested via sensitivity analysis. ICH E20 (Draft, Step 2b, June 2025) addresses simulation requirements for adaptive and complex designs under NPH.

1. Core Assumptions Behind KM, Log-Rank, and Cox PH

1.1 Kaplan-Meier Estimator

The Kaplan-Meier (KM) estimator provides a non-parametric estimate of the survival function S(t) = P(T > t).

Assumptions:

#	Assumption	Formal Statement	Consequence If Violated
1	Non-informative (independent) censoring	P(T > t | C = c) = P(T > t) for all c; i.e., censoring carries no information about event time	KM overestimates or underestimates S(t); bias direction depends on whether sicker or healthier patients are censored
2	No left truncation	All patients enter observation at time 0 (randomization)	Conditional survival estimates needed; standard KM biased
3	Exact event times	Events are not interval-censored	If events are detected only at scheduled visits (common for PFS), KM estimates have interval-censoring bias; typically minor with frequent assessments
4	Homogeneous population	All patients within an arm share the same survival distribution	If substantial heterogeneity exists (e.g., mixture of biomarker+ and biomarker−), overall KM curve is a mixture that may not represent any actual subgroup

What KM does NOT assume: KM makes no assumption about the shape of the hazard function — it is fully non-parametric. It does NOT require proportional hazards.
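
As a minimal sketch of the product-limit calculation behind the KM estimator, the R code below computes S(t) by hand on a small made-up data set and compares it with `survival::survfit()`. The data values and the helper name `km_by_hand` are illustrative assumptions, not taken from any trial.

```r
library(survival)

# Hypothetical toy data: follow-up in months; event = 1 (death) or 0 (censored)
time  <- c(2, 3, 3, 5, 6, 8, 9, 12)
event <- c(1, 1, 0, 1, 0, 1, 1, 0)

# Product-limit estimator: S(t) = product over event times t_j <= t of (1 - d_j / n_j)
km_by_hand <- function(time, event) {
  event_times <- sort(unique(time[event == 1]))
  factors <- sapply(event_times, function(tj) {
    n_j <- sum(time >= tj)                # number at risk just before t_j
    d_j <- sum(time == tj & event == 1)   # number of events at t_j
    1 - d_j / n_j
  })
  data.frame(time = event_times, surv = cumprod(factors))
}

km_manual <- km_by_hand(time, event)

# Reference values from the survival package at the same event times
km_fit <- survfit(Surv(time, event) ~ 1)
km_ref <- summary(km_fit, times = km_manual$time)$surv

print(cbind(km_manual, survfit = km_ref))
```

Censored patients leave the risk set without reducing S(t), which is exactly where the non-informative censoring assumption enters: the patients remaining at risk are taken to represent those who were censored.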

1.2 Log-Rank Test

The log-rank test compares two survival distributions under the null hypothesis H₀: S₁(t) = S₂(t) for all t.

Assumptions:

#	Assumption	What It Means	Diagnostic
1	Independent censoring	Same as KM: censoring carries no information about event time	Compare baseline characteristics of censored vs. uncensored patients; formal test on censoring indicators
2	Proportional hazards (for optimal power)	The hazard ratio λ₁(t)/λ₂(t) is constant over time; the test remains a valid test of H₀ under NPH but loses power	Schoenfeld residuals test (`cox.zph()`); log-log survival plot; visual inspection of KM curves
3	No tied event times (or properly handled)	At most one event at each unique time	Breslow or Efron approximation handles ties (standard in software)

Why PH matters for log-rank: The log-rank test assigns equal weight to all time points. Under PH, this is optimal (achieves maximum power). Under NPH, it is suboptimal because:

Delayed effect (IO): Early timepoints (where curves overlap) contribute noise, not signal → diluted test statistic
Crossing hazards: Positive and negative contributions cancel → near-zero test statistic despite real differences
Early effect then convergence: Late timepoints (where curves re-converge) add noise → power loss

Quantified power loss:

Under delayed IO effect (HR = 1.0 for months 0–3, HR = 0.65 for months 3+): Log-rank power ~70% vs. MaxCombo ~85% with 350 events (15 percentage points lost)
Under crossing hazards: Log-rank power can drop to <30% even when true long-term benefit exists

1.3 Cox Proportional Hazards Model

Model: h(t | X) = h₀(t) × exp(β₁X₁ + ... + βₚXₚ)

Assumptions:

#	Assumption	Formal Statement	Consequence If Violated
1	Proportional hazards	exp(β) is constant over time for each covariate	HR estimate is a weighted average of the time-varying HR — may not represent any actual effect at any timepoint
2	Log-linearity	Covariates act multiplicatively on the hazard (additive on log-hazard scale)	Misspecified dose-response or covariate effects; remedy: categorize continuous covariates or use splines
3	Independent censoring	Same as KM/log-rank	Biased HR if informative censoring present
4	Correct model specification	No omitted confounders; correct functional form	Omitted-variable bias; remedy: include stratification factors, pre-specified covariates

PH Diagnostics — Practical Guide:

# Load the survival package and fit a stratified Cox model
library(survival)
fit <- coxph(Surv(time, event) ~ treatment + strata(ecog, pdl1), data = trial)

# Schoenfeld residuals test
test_ph <- cox.zph(fit)
print(test_ph)           # Global test p-value
plot(test_ph)            # Visual: flat line = PH holds; trend = NPH

# Log-log survival plot
plot(survfit(Surv(time, event) ~ treatment, data = trial),
     fun = "cloglog", xlab = "log(time)", ylab = "log(-log(S(t)))")
# Parallel lines = PH holds; non-parallel = NPH

Interpretation when PH is violated: The Cox HR becomes a weighted average of the instantaneous hazard ratio over time, where the weights depend on the event distribution. This average HR:

Under delayed effect: Overestimates early treatment effect (which is null) and underestimates late effect (which is substantial)
Under crossing hazards: May be close to 1.0 even when the treatment provides significant long-term benefit
Under diminishing effect: Overestimates the long-term treatment effect
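
The averaging behavior described above can be illustrated with a small simulation. This is a sketch under assumed parameters (control median 12 months, HR = 1.0 for 3 months, HR = 0.65 afterwards, administrative censoring at 36 months, an unrealistically large sample for stability); none of these values come from a specific trial.

```r
library(survival)
set.seed(2024)

n       <- 4000            # large n so the estimate is stable (illustration only)
lambda  <- log(2) / 12     # control hazard: median 12 months
delay   <- 3               # months with HR = 1.0
hr_late <- 0.65            # HR after the delay

arm <- rep(0:1, each = n / 2)
e   <- rexp(n)             # unit-exponential draws; invert the cumulative hazard

# Control: constant hazard. Experimental: same hazard until `delay`, then lambda * hr_late
t_ctrl  <- e / lambda
t_exp   <- ifelse(e < lambda * delay,
                  e / lambda,
                  delay + (e - lambda * delay) / (lambda * hr_late))
t_event <- ifelse(arm == 1, t_exp, t_ctrl)

# Administrative censoring at 36 months
time  <- pmin(t_event, 36)
event <- as.integer(t_event <= 36)

fit        <- coxph(Surv(time, event) ~ arm)
hr_overall <- unname(exp(coef(fit)))
print(hr_overall)          # a single number summarizing two different true HRs
```

The fitted HR falls between the two true values (1.0 and 0.65); where it lands depends on how many events occur in each period, so changing the censoring time or the accrual pattern changes the "average" even though the underlying treatment effect is unchanged.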

"The hazard ratio, as a summary measure, may not adequately characterize the treatment effect when hazards are not proportional." — FDA-recognized limitation in IO trial evaluations

2. Independent / Non-Informative Censoring: How It Fails in Oncology

Definition

Independent censoring means that censoring carries no information about the time to event: knowing a patient was censored at time c tells you nothing about when their event would have occurred. Formally: T ⊥ C (event time independent of censoring time), conditional on covariates.

How Independent Censoring Fails in Oncology

Failure Mode	Mechanism	Direction of Bias	Setting
Toxicity-driven dropout	Patients on experimental arm discontinue due to adverse events; sicker patients leave study	KM overestimates survival in experimental arm (healthier patients remain)	IO combinations (colitis, pneumonitis); targeted therapy (hepatotoxicity)
Clinical deterioration without documented progression	Patient too ill to attend scan; withdrawn before RECIST progression documented	PFS overestimated if deterioration → censoring (events missed)	Open-label trials; 2L+ settings with rapid decline
Differential imaging frequency	Open-label trials: physicians order more frequent scans for patients on experimental arm (safety concern) → earlier detection of progression	PFS underestimated in experimental arm (earlier detection bias)	Open-label PFS trials
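Protocol-mandated crossover	Control patients cross to experimental arm at progression; some analyses censor them for OS at crossover time	Censoring at crossover is informative (only progressing patients switch), so control-arm OS is biased; without censoring, post-crossover benefit improves control-arm OS and dilutes the ITT effect	Trials with built-in crossover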
Subsequent therapy initiation	Patients initiating next-line therapy may stop PFS follow-up; censored at new therapy start	PFS overestimated if new therapy delays true progression	Multi-line treatment settings
Withdrawal of consent	Patients who withdraw may be those experiencing either benefit (no motivation to continue) or harm (too ill)	Bias direction uncertain; depends on reason for withdrawal	Long-duration adjuvant trials

Detection Strategies

Compare baseline characteristics: Censored vs. uncensored patients. If censored patients have worse ECOG, higher tumor burden → informative censoring likely.
Time-to-censoring analysis: Plot KM curve of censoring times by arm. If one arm has systematically earlier censoring → differential censoring.
Sensitivity analyses:
- Worst-case: All censored patients treated as having events at censoring date
- Best-case: All censored patients assumed event-free to end of study
- IPCW: Weight remaining patients by inverse probability of not being censored
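
Two of the checks listed above (time-to-censoring by arm and the worst-case sensitivity analysis) can be sketched in a few lines of R. The simulated data, dropout rates, and the 12-month readout are assumptions chosen only to make the example run; IPCW is omitted because it requires a fitted censoring model.

```r
library(survival)
set.seed(11)

n   <- 300
arm <- rep(0:1, each = n / 2)
t_event <- rexp(n, rate = log(2) / 12 * ifelse(arm == 1, 0.7, 1))

# Assumed dropout: faster in the experimental arm (e.g., toxicity-driven),
# with administrative censoring at 36 months
t_censor <- pmin(rexp(n, rate = ifelse(arm == 1, 0.03, 0.01)), 36)
time  <- pmin(t_event, t_censor)
event <- as.integer(t_event <= t_censor)

# Time-to-censoring analysis ("reverse KM"): censoring is treated as the event
censor_fit <- survfit(Surv(time, 1 - event) ~ arm)

# Primary KM vs. worst-case sensitivity analysis
# (every censored patient is counted as an event at the censoring date)
km_primary <- survfit(Surv(time, event) ~ arm)
km_worst   <- survfit(Surv(time, rep(1, n)) ~ arm)

s_primary <- summary(km_primary, times = 12, extend = TRUE)$surv
s_worst   <- summary(km_worst,   times = 12, extend = TRUE)$surv

print(rbind(primary = s_primary, worst_case = s_worst))
```

Plotting `censor_fit` shows whether one arm is censored systematically earlier. The gap between the primary and worst-case rows gives a rough bound on how much the estimate could move if all censoring were maximally informative.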

Regulatory Expectation

"Missing data can complicate endpoint analysis. For endpoints based on tumor assessments, the protocol should define an adequate assessment visit for each patient... Methodology for analyzing incomplete and/or missing follow-up visits and censoring methods should be specified in the protocol." — FDA Cancer Endpoints 2018 (Final)

FDA requires ≥2 pre-specified PFS censoring schemes (Appendix C/D) to test sensitivity to censoring assumptions.

3. Proportional Hazards: Diagnosis and Interpretation Limits Under NPH

When PH Holds vs. Fails

Scenario	PH Status	Typical HR Pattern	Common Drug Classes
Chemotherapy vs. chemotherapy	Usually holds	Constant HR ~0.70–0.85	Platinum doublets, taxanes
Targeted therapy vs. chemotherapy (1L)	Often holds	Constant HR 0.30–0.60	EGFR TKI, ALK TKI, BRAF/MEK
IO monotherapy vs. chemotherapy	Fails (delayed effect)	HR ~1.0 (months 0–3) → 0.60 (months 3+)	Anti-PD-1/PD-L1 monotherapy
IO + chemo vs. chemotherapy	May hold or mild NPH	HR ~0.75–0.85, modest early lag	IO + platinum doublet
IO + IO vs. chemotherapy	Fails (delayed effect)	HR ~1.0–1.1 (months 0–4) → 0.55 (months 4+)	Nivo + ipi
Targeted therapy (2L+ post-resistance)	Often holds initially, then converges	HR 0.40 early → 0.80 late (resistance)	Osimertinib, ceritinib
Adjuvant (long follow-up)	May fail (cure fraction)	HR ~0.50 early → undefined late (plateau)	Adjuvant IO, TKI

How to Diagnose NPH at the Design Stage

Before unblinding, NPH should be anticipated based on:

Mechanism of action: IO agents require T-cell priming (2–4 months) → expect delayed separation
Historical data: Phase 2 KM curves showing lag phase
Class precedent: Prior IO trials in same indication (e.g., KEYNOTE-024 showed 3-month lag)
Biological rationale: Cure-fraction models in adjuvant settings

Pre-specified NPH assessment plan (at design stage):

At the design stage, the proportional hazards assumption is assessed based on:
(a) mechanism of action of [study drug] (immune checkpoint inhibitor with expected 
    2–4 month delayed onset of action),
(b) Phase 2 data showing Kaplan-Meier separation at approximately [3] months,
(c) precedent from [KEYNOTE-024/CheckMate 227] trials in [NSCLC].

Based on this assessment, the trial is designed with both the stratified log-rank 
test and the MaxCombo test as co-primary analyses, with alpha allocated equally 
(α/2 each) or with MaxCombo as primary and log-rank as sensitivity.

Interpretation Limits of HR Under NPH

NPH Pattern	What HR Reports	What HR Misses	Better Measure
Delayed effect	Weighted average ~0.75 (looks modest)	Late HR may be 0.50 (substantial benefit in responders)	RMST difference; piecewise HR; milestone rates
Crossing hazards	HR ~1.0 (looks null)	Treatment provides long-term benefit after crossover point	RMST difference at long horizon; milestone OS rates
Cure fraction	HR ~0.60 (looks moderate)	20–30% of patients are cured (infinite survival); HR doesn't capture cure	Cure-rate model; long-term milestone rates
Diminishing effect	HR ~0.55 (looks strong)	Late HR may be 0.85 (resistance developing); early benefit overstated	Piecewise HR; landmark analyses

4. Typical Oncology NPH Scenarios

Scenario 1: Delayed Immunotherapy Effect

Pattern: KM curves overlap for 2–4 months (T-cell priming phase), then separate durably.

Hazard ratio profile:

Months 0–3: HR ≈ 1.0 (no treatment effect; T-cell response developing)
Months 3–24: HR ≈ 0.65 (substantial benefit in responders)
Months 24+: HR ≈ 0.65 (sustained benefit; may plateau)

Clinical examples: KEYNOTE-024 (pembrolizumab 1L NSCLC), CheckMate 227 (nivo + ipi NSCLC), IMpower150 (atezolizumab + bev + chemo NSCLC)

Power impact:

Log-rank: ~70% power with 350 events
MaxCombo: ~85% power with 350 events
FH(0,1) alone: ~83% power (targets delayed effects but is less robust to misspecification)

Design recommendation: MaxCombo as primary; RMST at 24 months as supplementary; piecewise Cox as descriptive

Scenario 2: Early Separation Then Convergence (Targeted Therapy Resistance)

Pattern: KM curves separate rapidly (months 0–6), then reconverge as resistance develops (months 12+).

Hazard ratio profile:

Months 0–6: HR ≈ 0.45 (strong early benefit from targeted inhibition)
Months 6–12: HR ≈ 0.65 (diminishing as resistance develops)
Months 12+: HR ≈ 0.85 (resistance; curves converging)

Clinical examples: EGFR TKI vs. chemo after 18+ months; BRAF inhibitors in melanoma

Power impact:

Log-rank: Power maintained (~80%) but HR estimate inflated relative to late-phase truth
FH(1,0): Highest power (~85%, weights early events)
RMST at early t*: Captures early benefit; at late t*, benefit may disappear

Design recommendation: Standard log-rank acceptable (PH approximately holds for primary analysis window); report piecewise HR and RMST at multiple horizons

Scenario 3: Crossing Hazards

Pattern: Experimental arm may be worse early (toxicity, initial tumor flare), then better later. Kaplan-Meier curves cross.

Hazard ratio profile:

Months 0–3: HR ≈ 1.2 (experimental arm slightly worse — toxicity, initial worsening)
Months 3–6: HR ≈ 1.0 (equilibrium)
Months 6+: HR ≈ 0.60 (substantial late benefit)

Clinical examples: Rare; may occur with IO combinations causing early immune-related adverse events before benefit; hormone therapy vs. chemotherapy switchover

Power impact:

Log-rank: Catastrophic power loss — positive and negative contributions cancel
MaxCombo: Moderate power recovery (~65%)
RMST at long horizon: Captures net benefit if sufficient follow-up
Risk: If trial is powered for late benefit, early harm is hidden in overall HR

Design recommendation: RMST at clinically meaningful horizon as primary; MaxCombo as sensitivity; accept that proof of benefit requires extended follow-up beyond crossover point

Scenario 4: Cure Fraction (Long-Term Plateau)

Pattern: KM curve for experimental arm plateaus (20–30% of patients achieve durable remission / functional cure), while control arm continues to decline.

Hazard ratio profile:

Months 0–12: HR ≈ 0.70 (standard treatment effect)
Months 12–36: HR ≈ 0.50 (diverging further as cured patients survive)
Months 36+: HR → undefined (no events in cured fraction; standard PH model breaks down)

Clinical examples: Adjuvant IO (pembrolizumab, nivolumab); CAR-T therapy in lymphoma; ipilimumab in melanoma (10-year data showing 20% plateau)

Power impact:

Log-rank power increases dramatically because hazard separation is permanent
Weibull/PH assumption violated; KM estimates in the tail rest on few patients at risk, so long-term confidence intervals are wide and unstable
Cure-rate models (mixture cure model) provide better long-term estimates

Design recommendation: Standard log-rank adequate for primary; supplement with milestone survival rates (e.g., 5-year OS rate) and mixture cure model; ensure follow-up ≥ 3× median to capture plateau

5. Alternatives and Complements to Standard TTE Methods

5.1 Restricted Mean Survival Time (RMST)

Definition: RMST(t*) = ∫₀^{t*} S(t) dt, the average event-free time through horizon t*.

Advantages:

No PH assumption (model-free, non-parametric)
Expressed in time units (clinically interpretable: "X additional months alive")
Stable under NPH; meaningful when HR is misleading

Treatment effect measures:

RMST difference: ΔRMST = RMST₁ − RMST₀ (absolute benefit in months)
RMST ratio: RMST₁ / RMST₀

Selecting t*:

Pre-specify based on clinical rationale (not data-driven)
Keep t* no later than the minimum of the two arms' maximum follow-up (ensures stable KM tail)
Common choices: 12, 24, 36, 60 months depending on setting
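
The RMST definition can be made concrete by integrating the KM step function directly. The sketch below assumes only the `survival` package; the helper name `rmst_km`, the toy data, and the simulated trial parameters are illustrative assumptions.

```r
library(survival)

# RMST(t*) = area under the KM step function from 0 to t* (tau)
rmst_km <- function(time, event, tau) {
  fit    <- survfit(Surv(time, event) ~ 1)
  keep   <- fit$time < tau
  t_grid <- c(0, fit$time[keep], tau)   # interval boundaries
  s_left <- c(1, fit$surv[keep])        # KM value on each interval
  sum(diff(t_grid) * s_left)
}

# Hypothetical toy data without censoring:
# S(t) = 1, 0.75, 0.5 on [0,1), [1,2), [2,3), so RMST(3) = 1 + 0.75 + 0.5
rmst_toy <- rmst_km(time = c(1, 2, 3, 4), event = c(1, 1, 1, 1), tau = 3)

# RMST difference between arms on simulated data (illustrative parameters)
set.seed(7)
arm   <- rep(0:1, each = 150)
t_ev  <- rexp(300, rate = log(2) / 12 * ifelse(arm == 1, 0.65, 1))
time  <- pmin(t_ev, 30)
event <- as.integer(t_ev <= 30)

delta_rmst <- rmst_km(time[arm == 1], event[arm == 1], tau = 24) -
              rmst_km(time[arm == 0], event[arm == 0], tau = 24)
print(c(rmst_toy = rmst_toy, delta_rmst = delta_rmst))
```

This hand calculation gives point estimates only. For confidence intervals and tests, a packaged implementation such as `survRM2` is the more appropriate route.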

Regulatory status:

Accepted as secondary/supplementary by FDA and EMA
Encouraged alongside HR for transparency
Not yet sole primary; regulatory evolution ongoing

R packages: survRM2 (standard for submissions), survival::survfit() + integration

5.2 Milestone Survival Rates

Definition: S(t*) estimated from KM curve at pre-specified landmarks (e.g., 12-month OS rate, 24-month PFS rate).

When to use:

NPH patterns where the treatment effect at a specific timepoint is more clinically meaningful than overall HR
Cure-fraction scenarios: 5-year OS rate captures plateau
Adjuvant settings: 2-year DFS rate

Statistical test: Z-test comparing two proportions with Greenwood SE; or landmark analysis at t*.

Limitation: Single-timepoint summary; loses information about entire curve shape.
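
A minimal sketch of the milestone comparison, assuming simulated data and a pre-specified t* of 24 months: the KM estimate and its Greenwood standard error are read from `summary.survfit()` for each arm and combined into a Z statistic. All parameter values are assumptions for illustration.

```r
library(survival)
set.seed(42)

arm   <- rep(0:1, each = 200)
t_ev  <- rexp(400, rate = log(2) / 12 * ifelse(arm == 1, 0.65, 1))
t_c   <- runif(400, 18, 48)          # assumed staggered administrative censoring
time  <- pmin(t_ev, t_c)
event <- as.integer(t_ev <= t_c)

t_star <- 24                          # pre-specified milestone (months)
fit <- survfit(Surv(time, event) ~ arm)
ms  <- summary(fit, times = t_star, extend = TRUE)

# ms$surv and ms$std.err hold one value per arm (control first, then experimental);
# std.err is the Greenwood standard error of S(t*)
diff_surv   <- ms$surv[2] - ms$surv[1]
se_diff     <- sqrt(sum(ms$std.err^2))
z           <- diff_surv / se_diff
p_two_sided <- 2 * pnorm(-abs(z))

print(c(diff = diff_surv, se = se_diff, z = z, p = p_two_sided))
```

The test uses the plain (untransformed) scale for simplicity; a complementary log-log transformation is a common alternative when S(t*) is close to 0 or 1.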

5.3 Piecewise Hazard Models

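Model: Cox model with time-varying coefficients, with the HR allowed to change at pre-specified changepoints:

# Piecewise Cox with changepoint at 3 months (assumes treatment is coded 0/1)
library(survival)
# Split each patient's follow-up at 3 months; classifying patients by their own
# event time (time <= 3) would condition on the outcome and bias the HRs
trial_split <- survSplit(Surv(time, event) ~ ., data = trial, cut = 3, episode = "period")
coxph(Surv(tstart, time, event) ~ treatment:strata(period) + strata(ecog), data = trial_split)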

When to use: Descriptive characterization of NPH pattern; reports HR separately for each time interval.

Changepoint selection: Pre-specify based on mechanism of action (e.g., 3 months for IO T-cell priming; 12 months for targeted therapy resistance).

5.4 Weighted Log-Rank Tests (Fleming-Harrington Family)

FH(ρ, γ) weighting: Weight at each event time proportional to [S(t)]^ρ × [1 − S(t)]^γ.

Test	Weight	Optimal For
FH(0,0)	1 (standard log-rank)	Proportional hazards
FH(1,0)	S(t) — early events weighted	Early separation, convergence
FH(0,1)	1−S(t) — late events weighted	Delayed effects (IO)
FH(1,1)	S(t)×[1−S(t)] — middle events	Crossing/crossing-delayed

Risk of pre-specifying a single weighted test: If the assumed NPH pattern is wrong, the chosen weight may perform worse than standard log-rank.

Solution: MaxCombo (takes max of all four → robust across NPH patterns).
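
To show how the FH(ρ, γ) weights enter the test statistic, here is a hand-rolled sketch of the weighted log-rank Z statistic, checked against `survdiff()` for the unweighted case. The function name `fh_logrank`, the use of the pooled KM estimate just before each event time as S(t), and the simulated delayed-effect data are assumptions of this sketch; it is not the implementation used by `simtrial` or `nph`.

```r
library(survival)

# Standardized FH(rho, gamma) statistic; negative values favor arm == 1
fh_logrank <- function(time, event, arm, rho = 0, gamma = 0) {
  s_prev <- 1                          # pooled KM just before the current event time
  u <- 0
  v <- 0
  for (tj in sort(unique(time[event == 1]))) {
    at_risk <- time >= tj
    n  <- sum(at_risk)
    n1 <- sum(at_risk & arm == 1)
    d  <- sum(time == tj & event == 1)
    d1 <- sum(time == tj & event == 1 & arm == 1)
    w  <- s_prev^rho * (1 - s_prev)^gamma
    u  <- u + w * (d1 - d * n1 / n)    # weighted observed minus expected events
    if (n > 1) {
      v <- v + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
    s_prev <- s_prev * (1 - d / n)
  }
  u / sqrt(v)
}

# Simulated delayed effect: HR = 1.0 for 3 months, then 0.65 (assumed values)
set.seed(314)
n      <- 400
arm    <- rep(0:1, each = n / 2)
lambda <- log(2) / 12
e      <- rexp(n)
t_event <- ifelse(arm == 1 & e >= lambda * 3,
                  3 + (e - lambda * 3) / (lambda * 0.65),
                  e / lambda)
time  <- pmin(t_event, 36)
event <- as.integer(t_event <= 36)

z00 <- fh_logrank(time, event, arm, rho = 0, gamma = 0)   # standard log-rank
z01 <- fh_logrank(time, event, arm, rho = 0, gamma = 1)   # late events up-weighted
chisq_ref <- survdiff(Surv(time, event) ~ arm)$chisq

print(c(z00 = z00, z01 = z01, z00_squared = z00^2, survdiff_chisq = chisq_ref))
```

With ρ = γ = 0 every weight equals 1 and the squared statistic reproduces the `survdiff()` chi-square. With γ = 1 the earliest events receive weights near zero, which is why FH(0,1) suits delayed effects.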

5.5 MaxCombo Test

Composition: MaxCombo = max(Z₀₀, Z₁₀, Z₀₁, Z₁₁), where each Zᵢⱼ is the standardized FH(i,j) statistic.

Properties:

Robust across NPH patterns: Performs well regardless of which pattern occurs
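Alpha-controlled: The critical value accounts for the correlation among the four components, so the combined test holds the nominal level without a separate multiplicity adjustment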
Accepted by FDA: Regulatory precedent in IO trials (supplementary analysis)

Evidence: 2022 JAMA Oncology meta-analysis (Mukhopadhyay et al.): MaxCombo achieved significance in all 15 trials where log-rank failed, across 63 IO studies and 35,902 patients.

Critical value: Derived from asymptotic multivariate normal distribution (correlation from shared at-risk sets) or permutation. Use simtrial for simulation-based critical values.

R packages: simtrial (Merck), nph, nphsim
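
The permutation route to a MaxCombo p-value can be sketched in base R. This is an illustration under stated assumptions, not the algorithm used by `simtrial`, `nph`, or `nphsim`: the helper `fh_four` is hypothetical, the statistics are signed so that positive values favor the experimental arm, the data are simulated with a delayed effect, and permuting arm labels assumes the censoring distribution is the same in both arms.

```r
# Four standardized FH statistics in one pass: FH(0,0), FH(1,0), FH(0,1), FH(1,1).
# Sign convention (assumption): positive values favor arm == 1 (fewer events than expected).
fh_four <- function(time, event, arm) {
  rho    <- c(0, 1, 0, 1)
  gamma  <- c(0, 0, 1, 1)
  s_prev <- 1                          # pooled KM just before the current event time
  u <- numeric(4)
  v <- numeric(4)
  for (tj in sort(unique(time[event == 1]))) {
    at_risk <- time >= tj
    n     <- sum(at_risk)
    n1    <- sum(at_risk & arm == 1)
    is_ev <- time == tj & event == 1
    d     <- sum(is_ev)
    d1    <- sum(is_ev & arm == 1)
    w     <- s_prev^rho * (1 - s_prev)^gamma
    u     <- u + w * (d1 - d * n1 / n)
    if (n > 1) {
      v <- v + w^2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    }
    s_prev <- s_prev * (1 - d / n)
  }
  -u / sqrt(v)
}

# Simulated delayed effect (assumed): HR = 1.0 for 3 months, then 0.65
set.seed(99)
n      <- 200
arm    <- rep(0:1, each = n / 2)
lambda <- log(2) / 12
e      <- rexp(n)
t_event <- ifelse(arm == 1 & e >= lambda * 3,
                  3 + (e - lambda * 3) / (lambda * 0.65),
                  e / lambda)
time  <- pmin(t_event, 36)
event <- as.integer(t_event <= 36)

z_obs        <- fh_four(time, event, arm)
maxcombo_obs <- max(z_obs)

# Permutation null distribution of the maximum (one-sided)
n_perm        <- 500
maxcombo_perm <- replicate(n_perm, max(fh_four(time, event, sample(arm))))
p_perm        <- (1 + sum(maxcombo_perm >= maxcombo_obs)) / (n_perm + 1)

print(round(c(z_obs, maxcombo = maxcombo_obs, p = p_perm), 4))
```

Because the maximum is referred to the null distribution of the maximum itself, the correlation among the four components is handled automatically. A confirmatory analysis would normally use far more permutations or the multivariate normal approach described above.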

6. Crossover and Post-Progression Treatment Switching

Why Crossover Creates a Methodological Problem

When control-arm patients cross over to experimental therapy at progression, ITT OS analysis underestimates the true treatment effect because control-arm survival is artificially improved.

Adjustment Methods (Sensitivity Analyses Only)

Method	Mechanism	Key Assumption	When Appropriate	When Inappropriate
RPSFT	Estimates acceleration factor ψ applied to post-crossover time intervals; preserves rank order	Accelerated failure time: treatment effect operates multiplicatively on survival time (not on the hazard); common treatment effect whether the drug is received from randomization or after switch	Well-established drug efficacy; single switch direction; crossover is medically appropriate	Multiple sequential therapies; drug efficacy not established; switching is inappropriate/delayed
2SRST	Treats crossover as "second randomization"; re-censoring or IPW for post-switch phase	Two-stage independence; no unmeasured confounders	Late crossover; non-exponential survival; sparse post-switch events	If crossover timing is deterministic (violates positivity)
IPCW	Weights observations by 1/P(not censored at crossover)	Exchangeability (no unmeasured confounders); positivity; correct weight model	Multiple switches; complex patterns; bidirectional switching	When P(switch) → 0 or 1 for subgroups (extreme weights); unmeasured confounders

RPSFT Detailed Mechanics

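Counterfactual (no-switch) event time for a crossing-over patient = 
    (time on original arm) + ψ × (time after switch to experimental therapy)

Where ψ ∈ [0, 1] is a time-scaling factor (the quantity usually written exp(ψ) in RPSFT models):
  ψ = 1: no adjustment (crossover had no effect)
  ψ = 0: limiting case in which all post-switch survival time is attributed to the experimental therapy
  ψ is estimated via g-estimation: the value at which counterfactual event times are balanced between the randomized arms (test statistic closest to zero)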
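
The time-scaling relation above amounts to simple arithmetic, sketched here for a single hypothetical patient. The function name `counterfactual_time` and the patient values are assumptions for illustration; estimation of ψ and re-censoring are not shown.

```r
# Counterfactual event time under the simplified relation above.
# psi_factor is the time-scaling factor in [0, 1]; 1 means the switch had no effect.
counterfactual_time <- function(time_off, time_on, psi_factor) {
  stopifnot(all(psi_factor >= 0), all(psi_factor <= 1))
  time_off + psi_factor * time_on
}

# Hypothetical control-arm patient: 6 months before switching, 10 months after
u_half <- counterfactual_time(time_off = 6, time_on = 10, psi_factor = 0.5)  # 6 + 0.5 * 10 = 11
u_none <- counterfactual_time(time_off = 6, time_on = 10, psi_factor = 1)    # 6 + 1 * 10 = 16

print(c(u_half = u_half, u_none = u_none))
```

In a full analysis the adjusted times for all switchers are recomputed for each candidate ψ, and the randomized arms are then compared on the counterfactual scale.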

When RPSFT assumptions fail:

Multiple sequential therapies make disentangling effects impossible
Harm from delay in switching outweighs initial therapy benefit
Only 19% of RPSFT applications in 65 oncology trials were methodologically appropriate (Prasad et al., 2023)

IPCW Positivity Failure

IPCW weights become extreme when P(switch | history) → 0 or → 1:

Rapid progression nearly deterministically predicts switching
Patient characteristics perfectly predict crossover

Stabilization: Truncate weights at 1st/99th percentile or cap at max weight = 10. Report both truncated and untruncated results.
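
The two stabilization rules can be sketched as follows. The weights are generated from assumed probabilities rather than a fitted model, and the helper name `truncate_weights` is hypothetical.

```r
set.seed(5)

# Hypothetical fitted P(not censored | history); small values create extreme weights
p_uncensored <- runif(1000, 0.02, 1)
w_raw <- 1 / p_uncensored

# Percentile truncation: weights outside [1st, 99th] percentile are set to the bound
truncate_weights <- function(w, lower = 0.01, upper = 0.99) {
  bounds <- quantile(w, probs = c(lower, upper), names = FALSE)
  pmin(pmax(w, bounds[1]), bounds[2])
}

w_trunc  <- truncate_weights(w_raw)
w_capped <- pmin(w_raw, 10)            # alternative rule: hard cap at 10

# Range of weights under each rule, for side-by-side reporting
weight_ranges <- sapply(list(raw = w_raw, truncated = w_trunc, capped = w_capped), range)
print(weight_ranges)
```

Truncation trades variance for bias: the more weights are pulled in, the less the analysis corrects for informative switching, which is why both versions are reported.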

Regulatory Status

Agency	Position
EMA	Accepts RPSFT and IPCW as supporting sensitivity analyses; prefers pre-specification; requires transparency
FDA	Increasingly skeptical of RPSFT without strong justification; prefers designs that avoid crossover (early access programs)
Both	Crossover-adjusted analyses are secondary only — never sole basis for approval; ITT remains primary

R packages: rpsftm (RPSFT via g-estimation), ipw (IPCW), custom code for 2SRST

7. Primary vs. Sensitivity Analysis Strategy When PH Is Doubtful

Decision Framework

Design-Stage PH Assessment	Primary Analysis	Sensitivity Analyses	Effect Measure
PH likely (chemo vs. chemo; targeted vs. chemo)	Stratified log-rank	Unstratified log-rank; RMST at t*	HR (Cox) with 95% CI
NPH possible (IO + chemo vs. chemo)	Stratified log-rank	MaxCombo; RMST; piecewise Cox	HR + RMST difference
NPH expected (IO mono vs. chemo; IO + IO vs. chemo)	MaxCombo (pre-specified)	Standard log-rank; RMST; piecewise Cox	HR + RMST + milestone rates
NPH expected + crossover likely	MaxCombo (ITT)	Standard log-rank; RPSFT/IPCW for OS sensitivity	HR + RMST + adjusted HR
Cure fraction expected (adjuvant IO, CAR-T)	Stratified log-rank	Milestone rates (3-yr, 5-yr); mixture cure model; RMST at long t*	HR + milestone rates

Alpha Allocation Options

When using both log-rank and MaxCombo:

Strategy	Allocation	When to Use
MaxCombo as sole primary	Full α to MaxCombo; log-rank as sensitivity	Strong NPH expectation based on mechanism + Phase 2 data
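Sequential: log-rank first, MaxCombo second	Log-rank tested first; MaxCombo tested only if log-rank is non-significant. Testing both at the full α inflates type I error even when the tests are positively correlated, so the critical values must be adjusted for their joint distribution (analytically or by simulation)	Mild NPH expectation; want to preserve power under PH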
Split α equally	α/2 to each	When uncertain about PH vs. NPH; conservative
Hierarchical	Full α to log-rank; MaxCombo as pre-specified secondary	Regulatory preference in some settings; MaxCombo supports but doesn't determine

8. Simulation Recommendations for NPH Scenarios

When Simulation Is Required (vs. Analytical Sample Size)

Analytical (Schoenfeld) formula works when:

Proportional hazards assumed
Single primary endpoint, single final analysis (no interims)
Simple 1:1 randomization
No complex censoring patterns
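
For reference, the Schoenfeld approximation for the required number of events under these conditions is d = (z₁₋α + z_power)² / (p(1 − p) × (log HR)²), where p is the allocation fraction. A minimal sketch, with the function name `schoenfeld_events` and the default values chosen for illustration:

```r
# Schoenfeld approximation for the required number of events,
# assuming PH, allocation fraction `alloc` (0.5 = 1:1), and a one-sided alpha
schoenfeld_events <- function(hr, alpha = 0.025, power = 0.85, alloc = 0.5) {
  (qnorm(1 - alpha) + qnorm(power))^2 / (alloc * (1 - alloc) * log(hr)^2)
}

# Events needed to detect a constant HR of 0.70 with 85% power
d_ph <- schoenfeld_events(hr = 0.70)
print(ceiling(d_ph))

# Sensitivity of the event target to the assumed HR
print(ceiling(sapply(c(0.65, 0.70, 0.75), schoenfeld_events)))
```

The formula depends on the effect only through a single constant log HR, so it has no way to represent a delay or a changing HR.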

Simulation required when:

NPH expected (piecewise hazard models)
Group-sequential designs with interim analyses under NPH
Multiple endpoints (PFS + OS co-primary) with correlated stopping
MaxCombo or weighted log-rank as primary test
Complex enrollment patterns (staggered entry, variable accrual)
Crossover expected (need to simulate adjusted OS)

ICH E20 (Draft, Step 2b, June 2025): "For adaptive designs... the operating characteristics should be evaluated through simulation studies. The simulation should cover a range of plausible scenarios."

Simulation Parameters to Specify

1. Patient Population and Enrollment

# Accrual rate: 20 patients/month for 24 months
# Total enrollment: 480 patients
enrollRates <- data.frame(
  Stratum = "All",
  duration = 24,
  rate = 20
)

2. Piecewise Failure Rates (Control Arm)

# Control arm: median OS = 12 months → λ = log(2)/12 = 0.0578
# Control failure rate by period, with the piecewise HR for the experimental arm under delayed effect:
failRates <- data.frame(
  Stratum = "All",
  period = 1:2,
  duration = c(3, Inf),           # First 3 months, then beyond
  failRate = c(0.058, 0.058),     # Control event rate (constant)
  hr = c(1.0, 0.65),             # HR = 1.0 early, 0.65 late
  dropoutRate = c(0.01, 0.01)    # 1% monthly dropout
)

3. Treatment Effect (Hazard Ratio Pattern)

Scenario	Period 1 (months 0–3)	Period 2 (months 3–12)	Period 3 (months 12+)
Delayed IO effect	HR = 1.0	HR = 0.65	HR = 0.65
Convergence (TKI)	HR = 0.45	HR = 0.65	HR = 0.85
Crossing hazards	HR = 1.2	HR = 1.0	HR = 0.60
Cure fraction	HR = 0.70	HR = 0.50	HR → 0 (plateau)

4. Dropout and Censoring

# Monthly dropout rate: 1–3% (setting-dependent)
# Minimum follow-up after last patient enrolled: 12 months
# Maximum study duration: 48 months
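
The enrollment, failure-rate, and dropout parameters above can be combined into a small simulation. The sketch below uses only base R and `survival` rather than `simtrial`, so the function `simulate_logrank_z` and its arguments are hypothetical; it assumes uniform accrual, exponential dropout, an event-driven data cutoff, and an unstratified log-rank test, and it runs far fewer replicates than a real design study.

```r
library(survival)

# One simulated trial under a delayed-effect piecewise exponential model.
# Returns the log-rank Z statistic (negative values favor the experimental arm).
simulate_logrank_z <- function(n = 480, accrual = 24, lambda = log(2) / 12,
                               delay = 3, hr_late = 0.65, dropout = 0.01,
                               target_events = 350) {
  arm   <- rep(0:1, length.out = n)        # 1:1 allocation
  entry <- runif(n, 0, accrual)            # uniform accrual (calendar months)
  e     <- rexp(n)                         # invert the cumulative hazard
  t_ctrl <- e / lambda
  t_exp  <- ifelse(e < lambda * delay,
                   e / lambda,
                   delay + (e - lambda * delay) / (lambda * hr_late))
  t_event <- ifelse(arm == 1, t_exp, t_ctrl)
  t_drop  <- rexp(n, rate = dropout)       # exponential dropout, rate per month
  obs     <- pmin(t_event, t_drop)
  is_ev   <- t_event <= t_drop

  # Data cutoff: calendar time of the target_events-th event (if it is reached)
  cal_ev <- sort((entry + obs)[is_ev])
  cutoff <- if (length(cal_ev) >= target_events) cal_ev[target_events] else Inf

  enrolled <- entry < cutoff
  fu     <- pmin(obs, cutoff - entry)[enrolled]
  ev     <- as.integer(is_ev & (entry + obs <= cutoff))[enrolled]
  arm_in <- arm[enrolled]

  lr <- survdiff(Surv(fu, ev) ~ arm_in)
  sign(lr$obs[2] - lr$exp[2]) * sqrt(lr$chisq)
}

set.seed(123)
n_sim <- 200                               # kept small for speed (illustration only)
z <- replicate(n_sim, simulate_logrank_z())
power_logrank <- mean(z < qnorm(0.025))    # one-sided alpha = 0.025
print(power_logrank)
```

Setting `hr_late = 1` gives the null scenario for a type I error check, and changing `delay` or `hr_late` provides the sensitivity sweeps described in this section.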

5. Interim Analysis Timing

# Interim at 60% information fraction (~210 events of 350 target)
# Final at 100% information fraction (350 events)
# O'Brien-Fleming spending function for efficacy
# Futility: conditional power < 20% at interim
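
As a sketch of how much alpha an O'Brien-Fleming-type rule spends at the interim, the code below evaluates the Lan-DeMets O'Brien-Fleming-type spending function at the two planned information fractions. The function name `obf_spend` is an assumption; exact efficacy boundaries require a group-sequential package, and under NPH the event-based information fraction is itself only an approximation.

```r
# Lan-DeMets O'Brien-Fleming-type spending function for a one-sided overall alpha
obf_spend <- function(info_frac, alpha = 0.025) {
  2 * (1 - pnorm(qnorm(1 - alpha / 2) / sqrt(info_frac)))
}

info_frac <- c(210, 350) / 350        # interim and final information fractions
cum_alpha <- obf_spend(info_frac)     # cumulative alpha spent by each look
inc_alpha <- diff(c(0, cum_alpha))    # alpha newly spent at each look

print(data.frame(info_frac, cum_alpha, inc_alpha))
```

Very little alpha is spent at 60% information, so almost the full 0.025 remains for the final analysis. This matters under a delayed effect because early looks are dominated by events from the period in which the curves overlap.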

Operating Characteristics to Report

After running ≥50,000 replicate trials:

Output	Question Answered	Reporting
Type I error	Does FWER stay ≤ α under null?	Report by test (log-rank, MaxCombo, RMST)
Power	Under alternative, what is P(reject H₀)?	Report by test and NPH scenario
Expected sample size	How many events/patients needed on average?	Report median and 95% interval
Conditional power at interim	If interim data look like alternative, what is P(success at final)?	Report by information fraction
Bias in HR estimate	Is the average estimated HR close to the true average HR?	Report mean estimated HR vs. true piecewise HR
Coverage of CI	Does the 95% CI contain the true effect 95% of the time?	Report coverage probability

Operating Characteristics Across NPH Scenarios: Summary

Scenario	Log-Rank Power	FH(0,1) Power	MaxCombo Power	RMST Power	Optimal Test	Min Follow-up
PH (HR = 0.70)	85%	78%	83%	80%	Log-rank	24 mo
Delayed IO (3-mo lag, HR = 0.65)	70%	83%	85%	78%	MaxCombo	30 mo
Convergence (TKI resistance)	80%	70%	78%	75%	Log-rank	18 mo
Crossing hazards	<30%	45%	65%	60%	RMST	36 mo
Cure fraction (20% cured)	90%	88%	92%	85%	MaxCombo	48 mo

R Packages for Simulation

Package	Capabilities	Best For
`simtrial` (Merck)	Piecewise hazard simulation; MaxCombo power; group-sequential under NPH	Primary simulation tool for NPH oncology trials
`gsDesign2` (Merck)	Group-sequential design with NPH support; integrates with `simtrial`	Sample size and boundary calculation under NPH
`nph`	FH weighted log-rank tests; NPH-aware sample size	Quick power calculations; analytical bounds
`rpact`	General-purpose confirmatory trial simulation; GSD + adaptive	Standard PH designs; adaptive extensions
`survRM2`	RMST computation and testing	RMST-based analyses

Common Simulation Pitfalls

Pitfall	Consequence	Fix
Assuming PH for IO trial	Log-rank power overestimated (~15 percentage points too optimistic)	Simulate piecewise hazard with delay period
Ignoring dropout	Sample size underestimated	Model realistic dropout; sensitivity ±50% dropout rate
Single-point parameters only	Overconfident design; FDA considers inadequate	Sweep across HR range (0.55–0.75), dropout (1–5%), accrual (±20%)
Too few replications	Unstable power estimates (SE > 0.5%)	≥50,000 replications (SE ≈ 0.2% at 80% power)
Not simulating MaxCombo critical value	Incorrect alpha control	Use permutation or multivariate normal to derive critical value
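
The replication guidance in the table follows from the binomial standard error of a simulated proportion, SE = √(p(1 − p)/N). A short sketch, with the helper names `mc_se` and `reps_needed` chosen for illustration:

```r
# Monte Carlo standard error of a simulated power (or type I error) estimate
mc_se <- function(p, n_rep) sqrt(p * (1 - p) / n_rep)

se_power_50k <- mc_se(0.80, 50000)    # power near 80%, 50,000 replicates
se_alpha_50k <- mc_se(0.025, 50000)   # type I error near 2.5%, 50,000 replicates

# Replicates needed to reach a target standard error
reps_needed <- function(p, target_se) ceiling(p * (1 - p) / target_se^2)

print(c(se_power = se_power_50k, se_alpha = se_alpha_50k))
print(reps_needed(0.80, target_se = 0.002))
```

The same calculation applies to type I error: confirming that a simulated rate such as 2.48% is below 2.5% needs a standard error well under the 0.02-point gap, which can call for more replicates than the power estimate does.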

9. SAP Template: TTE Endpoint with PH-Risk Mitigation

Template: Primary OS/PFS Analysis with NPH Awareness

1. ESTIMAND
===========
Population: All randomized patients (ITT population)
Variable: Overall survival, defined as time from randomization to death from any cause
Treatment conditions: [Experimental drug] vs. [Control/SOC]
Summary measure: Hazard ratio (stratified Cox), RMST difference at [t* = 24] months
IE strategy: Treatment policy — OS analyzed regardless of subsequent anti-cancer 
therapy, treatment discontinuation, or crossover. All deaths included regardless 
of cause or post-randomization events.

2. PRIMARY ANALYSIS
====================
The primary test of the null hypothesis H₀: no difference in OS between treatment 
arms is the [MaxCombo test / stratified log-rank test], stratified by [ECOG 
performance status (0 vs. 1), PD-L1 status (≥50% vs. <50%), geographic region].

The MaxCombo test combines four Fleming-Harrington weighted log-rank statistics 
[FH(0,0), FH(1,0), FH(0,1), FH(1,1)], with the maximum standardized statistic 
as the test statistic and critical value determined from the asymptotic multivariate 
normal distribution at one-sided α = 0.025.

[Alternative if log-rank is primary]:
The primary test is the stratified log-rank test at one-sided α = 0.025.

The primary treatment effect measure is the hazard ratio estimated from a stratified 
Cox proportional hazards model, with 95% CI and two-sided p-value (Wald test).

Kaplan-Meier survival curves are estimated for each arm, with median OS (95% CI 
via Brookmeyer-Crowley) and milestone survival rates at [12, 24, 36] months 
(95% CI via Greenwood's formula).

3. PROPORTIONAL HAZARDS ASSESSMENT
====================================
The PH assumption is assessed post-hoc via:
(a) Schoenfeld residuals global test (cox.zph); p < 0.10 flags PH concern
(b) Visual inspection of log-log survival plots and scaled Schoenfeld residuals 
    over time
(c) Comparison of HR estimates from piecewise Cox model (change-point at [3] months)

If PH is rejected (p < 0.10 or visual evidence of non-proportionality), the 
MaxCombo result and RMST difference are given interpretive priority over the 
standard log-rank and Cox HR.

4. SENSITIVITY ANALYSES
========================
Sensitivity Analysis 1 ([MaxCombo / Standard log-rank]):
[If primary is log-rank]: MaxCombo test as sensitivity to assess robustness 
under delayed treatment effect.
[If primary is MaxCombo]: Standard stratified log-rank as sensitivity to assess 
consistency under PH assumption.

Sensitivity Analysis 2 (RMST):
Restricted mean survival time difference at t* = [24] months, with 95% CI. 
RMST is reported as a model-free, clinically interpretable absolute treatment 
effect measure that does not require the PH assumption.

Sensitivity Analysis 3 (Piecewise Cox):
Stratified Cox model with a pre-specified change-point at [3] months, reporting 
separate HR estimates for months 0–[3] (early phase) and months [3]+ (late phase), 
characterizing the time-varying treatment effect.

Sensitivity Analysis 4 (Unstratified):
Unstratified log-rank and Cox to confirm stratification does not create 
artificial effects.

Sensitivity Analysis 5 (PFS Censoring — if PFS endpoint):
Alternative censoring schemes per FDA Appendix C/D:
(a) PFS-1: Uniform progression/assessment dates
(b) PFS-2: Any dropout/change treated as event (conservative)
(c) PFS-3: Investigator assessment (if IRC is primary)

5. CROSSOVER ADJUSTMENT (if applicable)
=========================================
If ≥15% of control-arm patients receive [experimental therapy] post-progression:

(a) RPSFT: Rank-Preserving Structural Failure Time model estimating 
    counterfactual OS under no-crossover assumption. The accelerated failure 
    time assumption is documented.
(b) IPCW: Inverse probability of censoring weighting, with weights estimated 
    from logistic model including [ECOG, stage, biomarker, prior lines]. 
    Weights truncated at 1st/99th percentiles.

Both analyses are secondary/exploratory. ITT remains the primary analysis.

6. SIMULATION JUSTIFICATION
=============================
Sample size and power were determined via simulation (≥50,000 replicates) under 
the following scenarios:

Primary scenario: Delayed treatment effect
  - Control median OS: [12] months
  - HR months 0–[3]: 1.0 (no effect during immune priming)
  - HR months [3]+: 0.65 (sustained benefit)
  - Accrual: [20] patients/month over [24] months
  - Monthly dropout: [2]%
  - Target events: [350]

Sensitivity scenarios:
  - Optimistic: HR 0.55 (months 3+), dropout 1%
  - Pessimistic: HR 0.75 (months 3+), dropout 4%
  - PH scenario: HR 0.70 constant (benchmark)
  - Extended delay: HR 1.0 for months 0–6, then 0.60

Under primary scenario:
  - MaxCombo power: [85]% at one-sided α = 0.025
  - Log-rank power: [70]% (reference)
  - RMST difference at 24 months: [2.8] months (95% CI [1.1, 4.5])
  - Type I error (MaxCombo): [2.48]% (confirmed ≤ 2.5%)

R packages used: simtrial (v0.4.1), gsDesign2 (v1.1.2), survival (v3.5-8)

10. Limitations and Pitfalls

1. Pre-specifying the wrong NPH pattern is risky:

If you design for a delayed effect (FH(0,1) primary) but the true pattern is early separation, FH(0,1) loses power relative to the standard log-rank test because it down-weights the early events where the effect actually occurs. MaxCombo mitigates this but is slightly less powerful than the optimal single test for any given pattern.
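The trade-off can be illustrated with a minimal Python sketch. It assumes a control median of 12 months, a 6-month change-point, and an idealised sample with no censoring in which event times sit at evenly spaced quantiles rather than being drawn at random. The function names are hypothetical, and the FH(rho, gamma) statistic is coded directly instead of being taken from a package; FH(0,0) is the standard log-rank test.

```python
import math

def piecewise_quantiles(n, base_hazard, hr_early, hr_late, change_point):
    """Idealised sample: event times at the (i + 0.5)/n quantiles of a
    piecewise-exponential distribution (assumption: no sampling noise)."""
    h_change = base_hazard * hr_early * change_point  # cumulative hazard at change-point
    times = []
    for i in range(n):
        h = -math.log(1.0 - (i + 0.5) / n)  # target cumulative hazard
        if h <= h_change:
            times.append(h / (base_hazard * hr_early))
        else:
            times.append(change_point + (h - h_change) / (base_hazard * hr_late))
    return times

def fh_weighted_logrank_z(times, events, arms, rho=0.0, gamma=0.0):
    """Fleming-Harrington FH(rho, gamma) Z statistic; negative favours arm 1."""
    data = sorted(zip(times, events, arms))
    n_risk, n1_risk = len(data), sum(arms)
    surv = 1.0  # pooled Kaplan-Meier estimate just before the current time, S(t-)
    num = var = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d = d1 = out = out1 = 0
        while i < len(data) and data[i][0] == t:  # handle tied times together
            _, ev, arm = data[i]
            d += ev
            d1 += ev * arm
            out += 1
            out1 += arm
            i += 1
        if d > 0:
            w = surv ** rho * (1.0 - surv) ** gamma
            p1 = n1_risk / n_risk
            num += w * (d1 - d * p1)  # weighted observed minus expected
            if n_risk > 1:
                var += w * w * d * p1 * (1 - p1) * (n_risk - d) / (n_risk - 1)
            surv *= 1.0 - d / n_risk
        n_risk -= out
        n1_risk -= out1
    return num / math.sqrt(var) if var > 0 else 0.0

def z_pair(hr_early, hr_late, change_point, n=200):
    """Return (log-rank Z, FH(0,1) Z) for one assumed hazard-ratio pattern."""
    base = math.log(2) / 12.0  # assumed control median of 12 months
    control = piecewise_quantiles(n, base, 1.0, 1.0, change_point)
    treated = piecewise_quantiles(n, base, hr_early, hr_late, change_point)
    times = control + treated
    events = [1] * (2 * n)  # no censoring in this idealised sketch
    arms = [0] * n + [1] * n
    return (fh_weighted_logrank_z(times, events, arms, 0, 0),
            fh_weighted_logrank_z(times, events, arms, 0, 1))

delayed = z_pair(1.0, 0.65, 6.0)  # no effect for 6 months, then HR 0.65
early = z_pair(0.65, 1.0, 6.0)    # HR 0.65 for 6 months, then no effect
print("delayed effect: log-rank Z = %.2f, FH(0,1) Z = %.2f" % delayed)
print("early effect:   log-rank Z = %.2f, FH(0,1) Z = %.2f" % early)
```

Under these assumptions FH(0,1) yields the larger |Z| when the effect is delayed and the smaller |Z| when the effect is early. The size of the gap depends on the assumed change-point and would shift once censoring and staggered accrual are included.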

2. RMST is sensitive to t* selection:

The RMST difference at t* = 12 months may show little or no benefit (the curves have barely separated), whereas at t* = 36 months it may show a large benefit. If t* is not pre-specified, results can be cherry-picked.
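A short worked example shows how strongly the RMST difference depends on t*. It is a sketch only, assuming an exponential control arm with a 12-month median and a treatment arm with HR 1.0 for the first 3 months and HR 0.65 afterwards (the delayed-effect pattern used elsewhere in this article). The function names are hypothetical, and the RMST is computed in closed form as the area under the assumed survival curve up to t*.

```python
import math

def rmst_exponential(tau, hazard):
    """RMST up to tau for a constant hazard: integral of exp(-hazard * t) on [0, tau]."""
    return (1.0 - math.exp(-hazard * tau)) / hazard

def rmst_delayed(tau, base_hazard, hr_late, delay):
    """RMST up to tau when the hazard is base_hazard before `delay`
    and base_hazard * hr_late afterwards."""
    if tau <= delay:
        return rmst_exponential(tau, base_hazard)
    early = rmst_exponential(delay, base_hazard)
    surv_at_delay = math.exp(-base_hazard * delay)
    # After the delay the curve is exponential again, rescaled by S(delay)
    late = surv_at_delay * rmst_exponential(tau - delay, base_hazard * hr_late)
    return early + late

def rmst_difference(tau, control_median=12.0, hr_late=0.65, delay=3.0):
    """Treatment minus control RMST at truncation time tau (months)."""
    base = math.log(2) / control_median
    return rmst_delayed(tau, base, hr_late, delay) - rmst_exponential(tau, base)

for tau in (3, 12, 24, 36):
    print(f"t* = {tau:>2} months: RMST difference = {rmst_difference(tau):.2f} months")
```

Under these assumed hazards the difference is zero at t* = 3, roughly half a month at t* = 12, and about 3.5 months at t* = 36, even though the underlying treatment effect is the same in every case.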

3. HR remains the primary regulatory effect measure:

Despite its limitations under NPH, FDA still expects the HR to be reported. RMST and MaxCombo are supplementary, not replacements. Design accordingly.

4. PH tests have low power:

The Schoenfeld residual test has limited power to detect NPH with fewer than 200 events. A non-significant PH test does NOT prove that PH holds; it means only that NPH was not detected. Design-stage assessment (mechanism, Phase 2 data) is more informative.

5. RPSFT is frequently misapplied:

Only 19% of 65 oncology RPSFT applications were appropriate (Prasad et al., 2023). Never use RPSFT when multiple sequential therapies make causal inference impossible or when the drug's own efficacy is not yet established.

6. Simulation assumptions drive results:

Simulation is only as good as the assumed hazard model. If the piecewise hazard specification is wrong (e.g., assumed 3-month delay but true delay is 6 months), power estimates will be misleading. Always sweep across parameter ranges.
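A parameter sweep of this kind can be sketched in a few lines of Python. The example below is deliberately simplified and rests on several assumptions that are not part of the SAP template: 150 patients per arm, uniform accrual over 24 months, a fixed analysis at calendar month 48, no dropout, a standard log-rank test at one-sided α = 0.025, and only 200 replicates per scenario. All function names are hypothetical.

```python
import math
import random

def draw_event_time(rng, base_hazard, hr_late, delay):
    """Piecewise-exponential draw: hazard base_hazard before `delay`,
    base_hazard * hr_late afterwards."""
    t = rng.expovariate(base_hazard)
    if t <= delay:
        return t
    # Memoryless property: restart the clock at `delay` with the late hazard
    return delay + rng.expovariate(base_hazard * hr_late)

def logrank_z(times, events, arms):
    """Standard log-rank Z statistic; negative values favour arm 1."""
    data = sorted(zip(times, events, arms))
    n_risk, n1_risk = len(data), sum(arms)
    o_minus_e = var = 0.0
    i = 0
    while i < len(data):
        t = data[i][0]
        d = d1 = out = out1 = 0
        while i < len(data) and data[i][0] == t:  # handle tied times together
            _, ev, arm = data[i]
            d += ev
            d1 += ev * arm
            out += 1
            out1 += arm
            i += 1
        if d > 0 and n_risk > 1:
            p1 = n1_risk / n_risk
            o_minus_e += d1 - d * p1
            var += d * p1 * (1 - p1) * (n_risk - d) / (n_risk - 1)
        n_risk -= out
        n1_risk -= out1
    return o_minus_e / math.sqrt(var) if var > 0 else 0.0

def simulate_power(delay, n_per_arm=150, reps=200, seed=1):
    """Share of simulated trials in which the one-sided log-rank test rejects."""
    rng = random.Random(seed)
    base = math.log(2) / 12.0  # assumed control median of 12 months
    rejections = 0
    for _rep in range(reps):
        times, events, arms = [], [], []
        for arm in (0, 1):
            for _patient in range(n_per_arm):
                entry = rng.uniform(0, 24)   # uniform accrual over 24 months
                follow_up = 48.0 - entry      # administrative cut at month 48
                t = draw_event_time(rng, base, 0.65 if arm else 1.0, delay)
                times.append(min(t, follow_up))
                events.append(1 if t <= follow_up else 0)
                arms.append(arm)
        if logrank_z(times, events, arms) < -1.959964:  # one-sided alpha = 0.025
            rejections += 1
    return rejections / reps

for delay in (0.0, 3.0, 6.0):
    print(f"delay = {delay:.0f} months: log-rank power ~ {simulate_power(delay):.2f}")
```

The replicate count is kept small so that the sketch runs quickly. A design-stage simulation would use far more replicates, event-driven analysis timing, dropout, and the planned primary test, and would sweep the late-phase HR and control median as well as the delay.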

7. Cure-fraction models require long follow-up:

Mixture cure models need follow-up of at least 3× the median survival to estimate the cured fraction reliably. Short trials cannot distinguish between cure and delayed events.

8. MaxCombo is not universally accepted as primary:

Some regulatory agencies prefer log-rank as primary with MaxCombo as sensitivity. Discuss with FDA/EMA (e.g., at an End-of-Phase 2 meeting or in scientific advice) before locking the primary analysis.

Sources:
- FDA Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics (2018, Final)
- FDA Clinical Trial Endpoints for the Approval of NSCLC Drugs and Biologics (2015/2020, Final)
- ICH E9(R1) Addendum on Estimands and Sensitivity Analysis (2019, Final)
- ICH E20 Adaptive Clinical Trials (June 2025, Draft, Step 2b)
- Non-proportional hazards: MaxCombo and RMST methods (literature synthesis)
- OS crossover adjustment: RPSFT, 2SRST, IPCW (literature synthesis)
- Simulation studies in clinical trial design (literature synthesis)

Last Updated: 2026-04-11
