
Sensitivity Analysis Playbook for Oncology Trials

Scope: Phase 2/3 oncology trials. Audience: biostatisticians writing SAPs and defending analyses to FDA/EMA. Anchored in ICH E9(R1) Section A.5 and endpoint-specific FDA guidance (Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics, 2018; NSCLC endpoints, 2020).


1. The Core Rule: Same Estimand, Different Assumptions

ICH E9(R1) A.5.2 is explicit: sensitivity analyses target the same estimand as the primary analysis. What changes is the set of assumptions underpinning the estimator — not the clinical question.

Quoting the addendum:

"The main estimator will be underpinned by certain assumptions. To explore the robustness of inferences from the main estimator to deviations from its underlying assumptions, a sensitivity analysis … should be pre-planned."

Three consequences flow from this:

  1. Sensitivity analyses are not a buffet of alternative estimands. An OS analysis that re-censors at crossover and an OS analysis that applies RPSFT target different estimands (treatment policy vs. hypothetical "no crossover"); they are not sensitivity analyses of each other.
  2. Report the targeted estimand up front. Each sensitivity analysis entry in the SAP should restate the estimand (population, variable, ICE strategies, summary measure) and identify which assumption is being stressed.
  3. Structured, one-assumption-at-a-time perturbation is preferred to varying many assumptions simultaneously, because only then can a shift in the result be attributed to a specific assumption (A.5.2.2).

2. Sensitivity vs. Supplementary vs. Exploratory Analysis

E9(R1) separates three related but distinct categories. Confusing them is one of the most common SAP defects flagged by regulators.

| Category | Estimand | Purpose | Labeling in SAP |
| --- | --- | --- | --- |
| Sensitivity analysis | Same as primary | Stress-test assumptions underpinning the primary estimator (missingness model, censoring rules, tie-handling, distributional form). | "Sensitivity analysis for [primary estimand]." Result interpreted as robustness evidence. |
| Supplementary analysis | May differ from primary (often a different ICE strategy) | Provide additional understanding of the data, e.g., a hypothetical-strategy analysis alongside a treatment-policy primary. | "Supplementary analysis targeting [alternative estimand]." Pre-specified, but not a robustness check. |
| Exploratory analysis | Usually different or undefined | Hypothesis-generating, subgroup, post-hoc. Not confirmatory, not a robustness claim. | "Exploratory." No inferential weight. |

Rule of thumb: If the clinical question or the estimand attributes change, you have a supplementary analysis, not a sensitivity analysis. If only the estimator's assumptions change, it is a sensitivity analysis.


3. A Framework: Which Assumptions Matter Most, by Endpoint

Before listing techniques, identify the dominant failure mode of the primary estimator. For each endpoint class, three questions drive the sensitivity hierarchy:

  1. What ICEs are frequent and how are they handled? (E9(R1) A.3.)
  2. What missing/censoring mechanism does the primary estimator assume? (MCAR? MAR? Non-informative censoring? Proportional hazards?)
  3. What measurement assumptions are load-bearing? (Central vs. investigator assessment; scan window; progression definition.)

The table below is the diagnostic:

| Endpoint class | Load-bearing assumption of primary estimator | Dominant ICE risk | First-line sensitivity dimension |
| --- | --- | --- | --- |
| OS (unadjusted) | Non-informative censoring; proportional hazards (if HR is the summary) | Crossover, subsequent therapy, long-follow-up dropout | NPH methods (RMST, weighted log-rank); crossover adjustment as supplementary |
| PFS / EFS | Progression-assessment schedule; evaluable-at-scheduled-visit censoring rules | Missed scans, early dropout, new anticancer therapy before progression | Alternate censoring rules (FDA 2018 sensitivity set); central vs. investigator review |
| ORR / DCR | Responder/non-responder at fixed window; central confirmation | Early dropout before the assessment window; window shifts | Best-of-both-readers; non-responder imputation; window widening/narrowing |
| PRO / longitudinal | MAR; specified mean structure | Informative dropout; death as an ICE | Delta-adjusted MI; reference-based MI; tipping point |
| Composite (e.g., death-or-progression) | Event-definition ordering; handling of competing death | Death before progression | Cause-specific vs. subdistribution hazards; alternate event-priority rules |

4. Time-to-Event Sensitivity Analyses

4.1 Alternate censoring rules (PFS/EFS)

FDA's 2018 Clinical Trial Endpoints guidance and the 2020 NSCLC endpoints guidance recommend pre-specifying a primary censoring convention and at least one sensitivity convention. Typical conventions:

  • Primary (conservative, event-favoring): event = documented progression or death from any cause; censor at last adequate tumor assessment prior to missed visits or new anti-cancer therapy.
  • Sensitivity A — "no censoring for new therapy": count event at progression regardless of subsequent therapy (treatment-policy-style).
  • Sensitivity B — "all missed-visit progressions imputed at next visit": stresses the assumption that missed scans are non-informative.
  • Sensitivity C — "date-of-next-assessment imputation": events timed at the scheduled assessment rather than observed date, removing assessment-schedule bias.

Pre-specify the full grid and interpret convergence (or divergence) of HRs and medians as robustness evidence.
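The contrast between the primary convention and Sensitivity A can be sketched in a few lines. The records, field names, and rules below are illustrative only (the missed-visit rule is omitted for brevity); a real derivation runs off ADaM datasets under a validated program:

```python
# Toy derivation of (time, event) for PFS under two pre-specified censoring
# conventions. Hypothetical field names; missed-visit rule omitted.

def derive_pfs(rec, convention):
    """Return (days, event_flag) for one patient under the named convention."""
    candidates = [d for d in (rec["progression_day"], rec["death_day"]) if d is not None]
    event_day = min(candidates) if candidates else None
    new_tx = rec["new_therapy_day"]
    last_ok = rec["last_adequate_assessment_day"]

    if convention == "no_censor_new_therapy":
        # Sensitivity A: count progression/death regardless of subsequent therapy
        return (event_day, 1) if event_day is not None else (last_ok, 0)

    # primary: events after new anti-cancer therapy are not counted; censor at
    # the last adequate tumor assessment instead
    if event_day is not None and (new_tx is None or event_day <= new_tx):
        return (event_day, 1)
    return (last_ok, 0)

patients = [
    {"last_adequate_assessment_day": 120, "progression_day": 130,
     "new_therapy_day": 125, "death_day": None},   # progressed after switching therapy
    {"last_adequate_assessment_day": 180, "progression_day": 90,
     "new_therapy_day": None, "death_day": None},  # clean documented progression
    {"last_adequate_assessment_day": 200, "progression_day": None,
     "new_therapy_day": None, "death_day": None},  # still event-free
]

for conv in ("primary", "no_censor_new_therapy"):
    print(conv, [derive_pfs(p, conv) for p in patients])
```

The first patient illustrates the divergence: an event at day 130 under Sensitivity A, but a censoring at day 120 under the primary convention.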

4.2 Informative censoring: IPCW

When censoring is plausibly related to prognosis (e.g., patients discontinue for toxicity correlated with worse outcome), inverse probability of censoring weighting (IPCW) can recover a marginal estimand under the assumption of no unmeasured confounders of censoring. Diagnostics:

  • Stabilized weights; truncate/trim at 1st–99th percentile if extreme.
  • Report weight distribution and sensitivity to trimming threshold.
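A minimal sketch of stabilized-weight construction and the trimming diagnostic, assuming the interval-specific probabilities of remaining uncensored have already been fitted (the toy arrays below stand in for pooled-logistic model predictions):

```python
# IPCW stabilized weights from pre-fitted uncensoring probabilities (toy values).
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 6                          # patients, time intervals

# fitted P(remain uncensored in interval): numerator uses baseline covariates
# only, denominator adds time-varying covariates (stand-ins for model output)
p_num = np.clip(rng.normal(0.95, 0.02, size=(n, k)), 0.5, 0.999)
p_den = np.clip(p_num + rng.normal(0.0, 0.05, size=(n, k)), 0.5, 0.999)

# stabilized weight through interval j: cumulative numerator / denominator
sw = np.cumprod(p_num, axis=1) / np.cumprod(p_den, axis=1)

# trim at the 1st-99th percentiles, as in the text
lo, hi = np.percentile(sw, [1, 99])
sw_trimmed = np.clip(sw, lo, hi)

# diagnostics to report: mean near 1, spread, extremes, trimming impact
print("mean stabilized weight:", round(sw_trimmed.mean(), 3))
print("untrimmed range:", round(sw.min(), 3), "to", round(sw.max(), 3))
```

A mean weight far from 1 or a heavy tail before trimming signals model misspecification; both belong in the report, alongside the estimate's sensitivity to the trimming threshold.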

4.3 Crossover adjustment (OS)

If the primary OS estimand uses treatment policy (standard ITT), crossover adjustment is a supplementary, not a sensitivity, analysis because it targets a hypothetical estimand. Within that hypothetical estimand, however, multiple estimators provide mutual sensitivity:

  • RPSFT (Rank-Preserving Structural Failure Time): common AFT assumption (treatment acts as a constant time-multiplier). Sensitivity: vary the assumed AFT φ; check re-censoring vs. not.
  • 2SRST: treats the switch as a second randomization; less reliant on the AFT assumption, but requires exchangeability (no unmeasured confounding) at the time of switch.
  • IPCW on OS: weights by probability of not switching; requires no unmeasured confounding of switch.

Triangulation across RPSFT, 2SRST, and IPCW is the canonical oncology sensitivity hierarchy for the hypothetical "no-switch" OS estimand (e.g., RECORD-1, sunitinib-GIST precedents).
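A toy illustration of the RPSFT grid search, with heavy simplifications: no censoring or re-censoring, switching confined to full-time treatment in the experimental arm, and a Wilcoxon rank-sum statistic standing in for the randomization-based log-rank test. It shows only the g-estimation idea of finding the φ that balances the counterfactual times U(φ) = T_off + e^φ·T_on across arms:

```python
# Toy RPSFT g-estimation: grid-search phi so that counterfactual untreated
# times are balanced between arms. Not a real analysis (no re-censoring).
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(1)
n = 300
u0_exp = rng.exponential(12.0, n)    # counterfactual untreated times
u0_ctl = rng.exponential(12.0, n)

accel = 1.5                          # treatment multiplies survival time by 1.5
t_exp = accel * u0_exp               # experimental arm: all time on treatment
t_ctl = u0_ctl                       # control arm: no switching in this toy

def z_at(phi):
    # U(phi) = T_off + exp(phi) * T_on; T_off = 0 in the experimental arm here
    u_exp = np.exp(phi) * t_exp
    return ranksums(u_exp, t_ctl).statistic

grid = np.linspace(-1.0, 0.2, 121)
phi_hat = grid[np.argmin([abs(z_at(p)) for p in grid])]
print("phi_hat:", round(phi_hat, 2), "(truth: -ln 1.5 =", round(-np.log(accel), 2), ")")
```

With a benefit factor of 1.5, the balancing φ is near −ln 1.5 ≈ −0.41; a negative φ̂ indicates treatment extends survival.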

4.4 Non-proportional hazards: RMST and MaxCombo

If the primary estimator is a log-rank test + Cox HR, the load-bearing assumption is proportional hazards. Under delayed separation (immunotherapy), crossing curves, or cure-fraction patterns, sensitivity analyses include:

  • RMST difference at τ (e.g., τ = the smaller of the two arms' maximum follow-up times). Reports a mean-survival-time difference that is meaningful regardless of PH.
  • Weighted log-rank tests: FH(0,1) up-weights late events (delayed effect); FH(1,0) up-weights early events; FH(1,1) up-weights mid events / crossing patterns.
  • MaxCombo: maximum across a pre-specified weight set; controls Type I error via multivariate-normal reference. Best deployed as pre-specified primary under suspected NPH, not as post-hoc rescue.
  • Piecewise HR or landmark analyses: descriptive sensitivity to time-varying effect.
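The RMST piece of this toolkit can be hand-coded from the Kaplan–Meier curve. The numpy-only sketch below uses simulated, uncensored data and ignores tie-handling subtleties; in practice use a validated package:

```python
# RMST = area under the Kaplan-Meier curve up to tau (minimal sketch).
import numpy as np

def rmst(time, event, tau):
    """Restricted mean survival time for one arm, truncated at tau."""
    order = np.argsort(time)
    t = np.asarray(time, dtype=float)[order]
    e = np.asarray(event)[order]
    area, prev_t, surv = 0.0, 0.0, 1.0
    at_risk = len(t)
    for ti, ei in zip(t, e):          # subjects processed in time order
        ti_c = min(ti, tau)
        area += surv * (ti_c - prev_t)  # accumulate area of the current step
        prev_t = ti_c
        if ti >= tau:
            break
        if ei:
            surv *= 1.0 - 1.0 / at_risk  # KM step at an event
        at_risk -= 1
    area += surv * max(0.0, tau - prev_t)  # extend last step to tau
    return area

rng = np.random.default_rng(2)
tau = 12.0
arm_a = rng.exponential(10.0, 500)    # toy uncensored survival times
arm_b = rng.exponential(7.0, 500)
ev = np.ones(500)
diff = rmst(arm_a, ev, tau) - rmst(arm_b, ev, tau)
print("RMST difference at tau = 12:", round(diff, 2), "months")
```

The difference reads directly as "extra mean survival time through month 12," with no proportional-hazards assumption attached.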

The SAP should state the assumed hazard pattern at the design stage and pre-specify the NPH sensitivity set before unblinding.


5. Longitudinal / Continuous Endpoint Sensitivity Analyses

The load-bearing assumption for MMRM and standard MI is missing at random (MAR). Sensitivity analyses perturb MAR toward MNAR (missing not at random) in interpretable ways.

5.1 Delta-adjusted multiple imputation

After imputing under MAR, add a clinically meaningful delta (penalty) to imputed values in the active arm (and optionally control). Vary delta across a pre-specified grid (e.g., 0, 0.25σ, 0.5σ, 1σ of baseline SD).
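The mechanics can be sketched as follows, with a single stochastic draw from the completers' distribution standing in for a proper MI model (toy data; in practice each of M imputed datasets is adjusted and the results combined by Rubin's rules):

```python
# Delta-adjusted imputation sketch: penalize imputed active-arm values by
# mult * sigma across the pre-specified grid and track the effect estimate.
import numpy as np

rng = np.random.default_rng(3)
n = 120
active = rng.normal(-4.0, 6.0, n)      # change from baseline, toy data
control = rng.normal(-1.0, 6.0, n)
dropout = rng.random(n) < 0.25         # ~25% missing in the active arm
sigma = 6.0                            # baseline SD anchoring the delta grid

# one draw from the active completers' distribution (crude MAR stand-in)
draws = rng.normal(active[~dropout].mean(), active[~dropout].std(), dropout.sum())

effects = {}
for mult in (0.0, 0.25, 0.5, 1.0):
    imp = active.copy()
    imp[dropout] = draws + mult * sigma   # penalize only the imputed values
    effects[mult] = imp.mean() - control.mean()

for mult, eff in effects.items():
    print(f"delta = {mult}*sigma -> estimated effect {eff:+.2f}")
```

The estimate degrades linearly in δ at a rate proportional to the dropout fraction, which is exactly what the grid is meant to expose.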

5.2 Tipping-point analysis

Extension of delta adjustment: identify the delta at which statistical significance is lost, and ask whether that delta is clinically plausible. If the tipping point is implausibly large, the conclusion is robust.
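A tipping-point search can be sketched as a loop over the delta grid; here a single toy imputation and a Welch t-test stand in for the full MI-plus-Rubin's-rules pipeline:

```python
# Tipping point: smallest delta at which the delta-adjusted comparison
# loses statistical significance (toy single imputation, Welch t-test).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(4)
n = 150
active = rng.normal(-4.0, 6.0, n)
control = rng.normal(-1.5, 6.0, n)
dropout = rng.random(n) < 0.2
base_imp = rng.normal(active[~dropout].mean(), active[~dropout].std(),
                      dropout.sum())

tipping_delta = None
for delta in np.arange(0.0, 20.25, 0.25):
    adj = active.copy()
    adj[dropout] = base_imp + delta          # shift only the imputed values
    if ttest_ind(adj, control, equal_var=False).pvalue >= 0.05:
        tipping_delta = delta
        break

print("tipping delta:", tipping_delta)
```

The output is then judged clinically: if the tipping δ exceeds any plausible difference between dropouts and completers, the conclusion is robust.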

5.3 Reference-based imputation

Imputes missing active-arm values as if the subject behaved like the reference (control) arm post-dropout. Variants:

  • Jump-to-Reference (J2R): post-dropout mean equals reference mean.
  • Copy Increments in Reference (CIR): post-dropout increments equal reference increments; preserves pre-dropout trajectory.
  • Copy Reference (CR): entire profile copied from reference, typically more conservative.

J2R/CIR are principled conservative sensitivity analyses for a treatment-policy estimand where dropout is plausibly linked to loss of benefit. Pitfalls: information-anchored variance (Cro et al.); ensure the variance estimator matches the chosen inferential framework.
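The contrast between MAR and jump-to-reference imputation can be sketched at a single post-dropout visit; toy data, with marginal draws standing in for the joint multivariate model used in a real reference-based analysis:

```python
# MAR vs. jump-to-reference (J2R) imputation at one visit (toy sketch).
import numpy as np

rng = np.random.default_rng(5)
n = 100
active = rng.normal(-5.0, 5.0, n)     # week-24 change from baseline, toy data
control = rng.normal(-1.0, 5.0, n)
dropout = rng.random(n) < 0.3

mar_imp = active.copy()               # MAR: impute from the active completers
mar_imp[dropout] = rng.normal(active[~dropout].mean(), active[~dropout].std(),
                              dropout.sum())

j2r_imp = active.copy()               # J2R: impute from the reference arm
j2r_imp[dropout] = rng.normal(control.mean(), control.std(), dropout.sum())

mar_eff = mar_imp.mean() - control.mean()
j2r_eff = j2r_imp.mean() - control.mean()
print(f"effect under MAR imputation: {mar_eff:+.2f}")
print(f"effect under J2R imputation: {j2r_eff:+.2f}")
```

J2R pulls the imputed active values toward the control mean, attenuating the estimated effect by roughly the dropout fraction times the between-arm difference; that conservatism is the point of the sensitivity analysis.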

5.4 Pattern-mixture thinking

When dropout patterns are heterogeneous (e.g., dropout-due-to-AE vs. dropout-due-to-progression), a pattern-mixture model stratifies by dropout reason and imposes different MNAR mechanisms per stratum. Useful when the ICE taxonomy is rich.


6. Binary Endpoint Sensitivity Analyses (ORR, DCR, responder)

6.1 Responder / non-responder imputation

  • Primary (conservative): non-evaluable = non-responder.
  • Sensitivity A: non-evaluable excluded (complete-case).
  • Sensitivity B: tipping-point — what fraction of missing subjects would need to be responders to overturn the conclusion?
  • Sensitivity C: multiple imputation under MAR using baseline covariates.
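The responder-imputation grid can be sketched with a hand-rolled two-proportion z-test (normal approximation, dependency-free; all counts are toy):

```python
# ORR sensitivity sketch: non-responder imputation vs. complete-case vs. a
# worst-case binary tipping point, using a two-proportion z-test.
from math import erf, sqrt

def two_prop_p(x1, n1, x2, n2):
    """Two-sided two-proportion z-test p-value (pooled normal approximation)."""
    p1, p2 = x1 / n1, x2 / n2
    pool = (x1 + x2) / (n1 + n2)
    se = sqrt(pool * (1 - pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

n_act, resp_act, miss_act = 100, 42, 10   # randomized, responders, missing
n_ctl, resp_ctl, miss_ctl = 100, 22, 8

# Primary: non-evaluable = non-responder (missing stay in the denominator)
p_primary = two_prop_p(resp_act, n_act, resp_ctl, n_ctl)
# Sensitivity A: complete-case (missing removed from the denominator)
p_cc = two_prop_p(resp_act, n_act - miss_act, resp_ctl, n_ctl - miss_ctl)
# Tipping direction: all control missing counted as responders, all active
# missing as non-responders (worst case for the active arm)
p_worst = two_prop_p(resp_act, n_act, resp_ctl + miss_ctl, n_ctl)

print(f"NR imputation p = {p_primary:.4f}")
print(f"complete-case p = {p_cc:.4f}")
print(f"worst-case    p = {p_worst:.4f}")
```

In this toy configuration the worst-case assignment erodes significance, which is precisely the question a binary tipping point asks: how extreme must the missing outcomes be to overturn the conclusion?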

6.2 Window shifts

Response windows (e.g., confirmation at ≥ 4 weeks, assessment at 6/12 weeks ± 7 days) are assumption-laden. Sensitivity: widen/narrow the window, or require confirmation at two consecutive assessments vs. one.

6.3 Central vs. investigator review

If primary is BICR (blinded independent central review), sensitivity is investigator-assessed, and vice versa. Concordance (κ), early-discrepancy (EDR), and late-discrepancy (LDR) rates should be reported per FDA 2018. In oncology, BICR-primary with investigator sensitivity is standard for single-arm accelerated-approval trials.
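Reader agreement can be summarized with Cohen's kappa, hand-rolled here on a toy 2×2 table of per-patient progression calls; EDR/LDR additionally require event timing and are not shown:

```python
# Cohen's kappa for BICR vs. investigator progression determinations
# (toy counts; a stand-in for the concordance reporting described above).

def cohens_kappa(table):
    """table[i][j] = count with investigator call i and BICR call j (0 = no PD, 1 = PD)."""
    n = sum(sum(row) for row in table)
    po = (table[0][0] + table[1][1]) / n         # observed agreement
    inv1 = (table[1][0] + table[1][1]) / n       # marginal: investigator PD rate
    bicr1 = (table[0][1] + table[1][1]) / n      # marginal: BICR PD rate
    pe = inv1 * bicr1 + (1 - inv1) * (1 - bicr1)  # chance agreement
    return (po - pe) / (1 - pe)

# rows: investigator (no-PD, PD); columns: BICR (no-PD, PD)
table = [[70, 6],
         [10, 64]]
print("kappa:", round(cohens_kappa(table), 3))
```

Kappa corrects raw agreement for chance; values in the high 0.7s, as here, indicate substantial but imperfect concordance, and the off-diagonal counts feed the discrepancy-rate reporting.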


7. Why One-at-a-Time Perturbation Is Preferred

E9(R1) A.5.2.2 is explicit: "a structured approach is recommended specifying changes in assumptions rather than comparing multiple sets simultaneously." Reasons:

  1. Attribution. If two assumptions change together and the estimate shifts, you cannot identify which assumption drove the shift.
  2. Interpretability. A one-dimensional tipping-point plot (δ vs. p-value) is diagnostic; a multi-dimensional surface is not.
  3. Pre-specification. It is tractable to pre-specify a univariate grid of δs, κs, or φs; joint grids inflate the analysis plan without adding inferential content.

Exception: when assumptions are known to be linked (e.g., an MNAR model that jointly specifies dropout mechanism and measurement error), a joint perturbation may be unavoidable — but it should still be presented as a single, named model rather than an ad-hoc combination.


8. Master Table — Endpoint → Main Estimator → Key Assumptions → Sensitivity Analyses

| Endpoint | Main estimator | Key assumptions | Recommended sensitivity analyses |
| --- | --- | --- | --- |
| OS (treatment policy) | Log-rank + Cox HR; KM medians | Non-informative censoring; proportional hazards | RMST at τ; weighted log-rank (FH(0,1) for delayed effect); piecewise HR; follow-up sensitivity (cutoff ±X months). Supplementary (hypothetical estimand): RPSFT, 2SRST, IPCW for crossover. |
| PFS / EFS (treatment policy or hypothetical) | Log-rank + Cox HR; KM medians | Non-informative interval censoring; assessment schedule independent of treatment arm | Alternate censoring rules (new therapy as event vs. censor; missed-visit imputation); BICR vs. investigator; window-adherence subset; RMST under NPH; date-of-next-assessment imputation. |
| ORR / DCR / CR rate | Cochran–Mantel–Haenszel or exact CI on proportion | Non-evaluable = non-responder; BICR definition | Complete-case; responder tipping point; investigator review; widened confirmation window; MI under MAR using baseline covariates. |
| DoR | KM median among responders | Conditioning on response is non-informative; censoring non-informative | Competing-risk analysis (death before progression as competing event); subgroup DoR by response depth; landmark analyses. |
| Symptom PRO / HRQoL (longitudinal) | MMRM; LS-mean difference at week k | MAR; correct mean/covariance structure | Delta-adjusted MI; tipping-point δ; reference-based MI (J2R, CIR, CR); pattern-mixture by dropout reason; alternate covariance structure. |
| Composite (death-or-progression as event) | Log-rank on composite | Components interchangeable; competing death handled by the composite | Cause-specific HR for each component; subdistribution (Fine–Gray) for progression with death competing; win-ratio re-analysis if components are ordered. |
| Time-to-next-treatment (TTNT) | Log-rank + Cox HR | Next-line choice independent of arm; non-informative censoring | Restriction to patients with progression; stratification by pre-specified next line; RMST. |

9. SAP Template Language

9.1 Sensitivity analysis hierarchy (generic)

"The primary estimand targets [population; variable; ICE-handling strategy; summary measure]. The primary estimator is [estimator] under the following assumptions: [list]. To assess robustness of inferences to deviations from these assumptions, the following pre-specified sensitivity analyses will be performed, each targeting the same estimand and varying one assumption at a time:

  1. [Sensitivity 1 name] — perturbs the assumption of [X]. Method: [method]. Pre-specified grid: [values]. Robustness will be judged by consistency of the point estimate and preservation of the direction and clinical relevance of the treatment effect; the primary analysis is not considered contradicted unless the sensitivity result reverses direction or renders the effect clinically negligible across the pre-specified range.
  2. [Sensitivity 2 name] — perturbs the assumption of [Y]. …

Analyses that target a different estimand (e.g., hypothetical estimand removing the effect of switching) are labelled supplementary and reported separately. Post-hoc analyses are labelled exploratory and carry no confirmatory weight."

9.2 Time-to-event example (PFS)

"The primary PFS analysis censors at the last adequate tumor assessment prior to initiation of new anti-cancer therapy or two or more missed assessments (a hypothetical-strategy handling of new anti-cancer therapy, with the missed-assessment rule reflecting assessment-schedule non-informativeness). Sensitivity analyses: (i) count PFS events regardless of subsequent therapy (treatment-policy-style); (ii) date-of-next-assessment imputation for missed-visit events; (iii) BICR vs. investigator assessment concordance, including early- and late-discrepancy rates per FDA 2018 guidance; (iv) RMST difference at τ = [X] months to assess robustness under potential non-proportional hazards. One-at-a-time perturbation will be used; the joint analysis is not pre-specified."

9.3 Longitudinal PRO example

"The primary analysis is an MMRM on change from baseline with unstructured covariance, treatment, visit, and treatment-by-visit, under a Missing-At-Random assumption. Sensitivity analyses stress this MAR assumption: (i) delta-adjusted multiple imputation with δ ∈ {0, 0.25σ, 0.5σ, 1σ} applied to the active arm post-dropout; (ii) tipping-point analysis identifying the minimal δ at which statistical significance is lost, interpreted in light of clinically meaningful change; (iii) reference-based imputation (jump-to-reference) as a conservative treatment-policy sensitivity. Rubin's rules variance will be reported with information-anchored variance as a supplementary check."

9.4 Crossover (OS) example

"The primary OS analysis is a stratified log-rank test under a treatment-policy strategy for subsequent therapy, including post-progression crossover. Because crossover is expected to attenuate the ITT OS effect, a supplementary hypothetical-strategy estimand will also be reported, estimated by RPSFT assuming a common AFT effect of experimental treatment, with sensitivity to (i) the re-censoring rule, (ii) 2SRST as an alternative estimator, and (iii) IPCW weighting for switch, each under its own identifying assumptions. These supplementary analyses do not supersede the primary analysis and are reported for regulatory context."

9.5 Interpretation language (to place in the Results discussion section)

"The primary analysis is interpreted as robust if the direction and clinical magnitude of the treatment effect are preserved across the pre-specified sensitivity grid. Divergence of a sensitivity analysis from the primary is not a failure of the trial but an indication that the corresponding assumption materially influences the inference; interpretation then requires clinical judgment about the plausibility of that assumption. Sensitivity analyses are not ranked by p-value."




Sources: ICH E9(R1) Addendum (2019), Sections A.5.1–A.5.2; FDA Clinical Trial Endpoints for the Approval of Cancer Drugs and Biologics (2018); FDA Clinical Trial Endpoints for the Approval of Non-Small Cell Lung Cancer Drugs and Biologics (2020); oncology literature on RPSFT/2SRST/IPCW, MaxCombo/RMST under NPH, and reference-based imputation. Status: Final (ICH E9(R1)); FDA guidances current as of draft date.