Sensitivity Analyses for Estimands
Definition
"Inferences based on a particular estimand should be robust to limitations in the data and deviations from the assumptions used in the statistical model for the main estimator. This robustness is evaluated through a sensitivity analysis." — ICH E9(R1) Addendum, §A.5.2.1 (Final, November 2019)
Sensitivity analysis in the estimand framework serves a specific purpose: to verify that the primary estimator's conclusions are not artefacts of its underlying assumptions. This is distinct from supplementary analyses, which explore additional insights without the goal of verifying robustness.
Three-tier analysis hierarchy (ICH E9(R1) §A.5):
- Main estimator — the pre-specified primary analysis for the primary estimand
- Sensitivity analysis — one or more analyses targeting the same estimand, varying the assumptions of the main estimator to test robustness
- Supplementary analysis — analyses that provide additional context or explore secondary estimands; generally given lower weight in regulatory assessment
Key distinction: A sensitivity analysis uses the same estimand with different analytic assumptions. An analysis targeting a different estimand (e.g., hypothetical sensitivity when treatment policy is primary) is technically a supplementary analysis exploring an alternative estimand — not a sensitivity analysis in the strict ICH E9(R1) sense. In oncology practice, the terms are used loosely, but both types are pre-specified and expected by regulators.
ICH E9(R1) on tipping point:
"This might be characterised as the extent of departures from assumptions that change the interpretation of the results in terms of their statistical or clinical significance (e.g. tipping point analysis)." (ICH E9(R1), §A.5.2.1)
Regulatory Position
FDA (post-E9(R1) adoption, 2021): Sensitivity analyses for all primary estimands must be:
- Pre-specified in the SAP (not post-hoc)
- Focused on the same estimand (same estimand, different assumptions) OR pre-specified as alternative estimand analyses
- Structured: "altering multiple aspects of the main analysis simultaneously can make it challenging to identify which assumptions, if any, are responsible for any potential differences seen" — therefore, sensitivity analyses should vary one assumption at a time (ICH E9(R1), §A.5.2.2)
FDA OS 2025 draft (implicit): Sensitivity analyses for OS as a safety endpoint require:
- RMST comparison (for non-proportional hazards)
- Landmark analysis at pre-specified time points
- Subgroup consistency evaluation
Common FDA feedback patterns:
- Missing treatment policy sensitivity analysis when hypothetical is primary (PFS)
- Missing RPSFT/IPCW sensitivity when crossover occurred
- No tipping point analysis when informative censoring suspected
- Inadequate specification of missing data handling assumptions
Status: ICH E9(R1) = Final (November 2019); FDA adoption = May 2021
Structured Approach to Sensitivity Analysis (ICH E9(R1) §A.5.2.2)
Key Principle — Vary One Assumption at a Time:
ICH E9(R1) emphasizes that sensitivity analyses should adopt a "structured approach, specifying the changes in assumptions that underlie the alternative analyses, rather than simply comparing the results of different analyses." This means:
- Identify key assumptions of the main estimator
- For each assumption, pre-specify one or more alternative analyses that test robustness to plausible departures
- Document explicitly which assumption is being varied in each sensitivity analysis
- Order by importance (not by expected favorable result)
Example: OS Treatment Policy Main Estimator
- Assumption 1 (Censoring): Administrative censoring at LKDA is non-informative
  - Sensitivity: Alternative censoring rule (censor at last contact)
- Assumption 2 (Proportional hazards): Cox model PH assumption holds
  - Sensitivity: RMST at 24 months (does not assume PH)
- Assumption 3 (Differential follow-up): Follow-up is similar between arms
  - Sensitivity: Landmark analysis at 12, 24, 36 months
SAP Language:
"Pre-specified sensitivity analyses for the primary OS estimand are structured to test robustness to departures from main estimator assumptions: (1) Censoring rule sensitivity — alternative censoring rule (last contact vs. LKDA); (2) Non-proportional hazards sensitivity — RMST at 24 months; (3) Differential follow-up sensitivity — landmark analysis at 12, 24, 36 months. Each sensitivity analysis independently varies a single assumption of the main Cox model."
Sensitivity Analysis by Estimand Strategy
Treatment Policy Primary Estimand
What needs testing: The main assumption of treatment policy is that the ITT analysis correctly captures the treatment assignment effect. Threats to this include:
- Informative censoring (patients censored for reasons correlated with prognosis)
- Non-proportional hazards (Cox model assumption)
- Differential follow-up between arms
- Major protocol deviations
Standard sensitivity hierarchy for treatment policy primary (OS or PFS):
| Analysis | What it tests | Implementation |
|---|---|---|
| Alternative censoring rule | Whether censoring decisions affect conclusions | Censor at last contact vs. last adequate tumor assessment; per-protocol vs. administrative censoring date |
| RMST analysis | Non-proportional hazards robustness | Restricted mean survival time at pre-specified time horizon (e.g., 24 months) instead of HR |
| Landmark analysis | Survival difference at fixed time points | OS/PFS rates at 12, 24, 36 months |
| Per-protocol set analysis | Sensitivity to major protocol deviations | Exclude patients with major eligibility violations; should give consistent results if ITT is valid |
| Subgroup consistency | Whether treatment effect is consistent | Check consistency across key prognostic subgroups (ECOG PS, prior treatment lines, biomarker status) |
SAP language:
"Pre-specified sensitivity analyses for the primary treatment policy OS estimand include: (1) Alternative censoring rule analysis censoring patients at administrative data cutoff rather than last known survival contact; (2) RMST analysis at 24 months to assess robustness under non-proportional hazards; (3) Landmark OS rates at 12, 24, and 36 months; (4) Subgroup consistency analysis by baseline ECOG performance status and prior treatment lines."
Hypothetical Primary Estimand (PFS with censoring at new therapy)
What needs testing: The main assumption of the hypothetical strategy (censor at new therapy) is that censoring at new therapy initiation is non-informative — that patients who initiate new therapy have the same prognosis (conditional on covariates) as those who do not. This is often violated if:
- Patients with shorter PFS (poorer prognosis) are more likely to receive new therapy earlier
- New therapy is often initiated in response to clinical deterioration that has not yet met documented progression criteria, so censored patients carry an elevated risk of imminent progression, making the censoring informative
Standard sensitivity hierarchy for hypothetical PFS primary:
| Analysis | What it tests | Implementation |
|---|---|---|
| Treatment policy sensitivity | Whether censoring at new therapy materially biases the estimate | Remove censoring at new therapy; allow progression/death regardless of subsequent therapy |
| Alternative censoring window | Whether the timing of the censoring rule matters | Compare: censor at new therapy start vs. censor at last tumor assessment before new therapy start vs. censor 30 days before new therapy |
| Tipping point analysis | How many additional events in control arm would negate significance | Vary the assumed non-informative censoring assumption; determine how many censored patients would need to have had events to change the conclusion |
| Missing tumor assessment sensitivity | Robustness to imputed progression dates | Analyze per alternative rules for handling missing assessment windows (e.g., treat all missing assessments as progression) |
SAP language:
"The primary PFS analysis (hypothetical strategy) will be supported by the following pre-specified sensitivity analyses: (1) Treatment policy sensitivity: PFS will be re-analyzed without censoring at initiation of subsequent anti-cancer therapy; all patients analyzed to first documented progression or death. (2) Alternative censoring window: patients initiating new therapy will be censored at [date of new therapy start] rather than last adequate tumor assessment; (3) Tipping point analysis: the minimum number of events in the control arm among censored patients that would render the primary PFS analysis non-significant at the two-sided 0.05 level will be calculated."
Composite Variable Primary Estimand (PFS with death as event)
What needs testing: The composite strategy's main assumption is that death is informative about tumor progression — i.e., patients who die without documented progression would likely have progressed imminently had they not died. If this is not true (e.g., patients die of unrelated causes in a trial with elderly population), the composite overestimates the progression event rate.
Standard sensitivity hierarchy for composite PFS:
| Analysis | What it tests | Implementation |
|---|---|---|
| Competing risk analysis | Whether death "competing" with progression affects interpretation | Fine-Gray subdistribution hazard model; cumulative incidence function instead of 1-KM |
| PFS analysis censoring deaths | Whether deaths are driving the PFS result | Censor at death date (time-to-progression only); note: this analysis should be clearly labeled as a sensitivity, not primary |
| Cause-specific hazard | Whether the treatment effect on progression component differs from the death component | Separate cause-specific hazard models for progression and death |
| Death classification sensitivity | Robustness to classification of on-study vs. off-study deaths | Re-run analysis with all deaths (including post-trial) vs. on-study only |
SAP language:
"A pre-specified sensitivity analysis will apply a competing risk approach to PFS. The cumulative incidence function for disease progression will be estimated using the Fine-Gray subdistribution hazard model, treating death as a competing risk. This analysis addresses whether the composite PFS effect is primarily driven by the progression or survival component."
DFS Primary Estimand (Adjuvant Settings)
Standard sensitivity hierarchy for composite DFS:
| Analysis | What it tests | Implementation |
|---|---|---|
| Competing risk (Fine-Gray) | Non-cancer deaths as competing risk for recurrence | Fine-Gray model treating non-cancer death as competing event |
| Cancer-related deaths only | Whether all-cause death assumption drives result | Restrict to cancer deaths + recurrences as events; censor non-cancer deaths (note: FDA prefers all-cause, but this is a sensitivity) |
| Alternative event list | Whether secondary events (second primary cancer, contralateral breast) drive result | Sensitivity excluding specific event types |
| Per-protocol sensitivity | Protocol deviation impact | Exclude patients with major pre-recurrence protocol violations |
Sensitivity Analysis for Missing Data
Separate from IE strategy sensitivity: Even after pre-specifying IE strategies, missing data in the collected observations requires separate sensitivity analysis. The distinction:
- IE strategy handles events that are conceptually resolved (e.g., treatment discontinuation — handled as treatment policy, composite, or hypothetical)
- Missing data handles observations that should have been collected but were not (e.g., patient missed a tumor assessment, patient withdrew from study without an event)
ICH E9(R1) §A.5.1: "Even after defining estimands that address intercurrent events in an appropriate manner and making efforts to collect the data required for estimation, some data may still be missing."
Missing Data Sensitivity Approaches by Assumption
| Assumption | Description | Sensitivity Label | Oncology Context |
|---|---|---|---|
| Missing at random (MAR) | Probability of missingness depends only on observed data | Main analysis assumption in most mixed models | Used for missing tumor assessments when dropout is explained by observed covariates |
| Missing not at random — control-based imputation (CBI) | Missing observations assumed to follow control arm trajectory | Conservative sensitivity for active treatment arm | Assumes missing patients in treatment arm had control-like outcomes |
| Missing not at random — reference-based imputation (RBI) | Missing observations follow reference group increment pattern | Tipping point sensitivity | Assumes missing treatment patients had control-like changes |
| Missing not at random — worst-case imputation | All missing values assumed to be worst possible outcome | Extreme sensitivity / tipping point | All missing progressions are "events"; all missing survivals are "deaths" |
| Pattern mixture model | Sensitivity to different missing data patterns (e.g., patients who discontinue vs. who remain on treatment) | ICH E9(R1)-recommended structured approach | Separate imputations for early-discontinuation vs. late-discontinuation cohorts |
| Tipping point (delta adjustment) | Magnitude of departure from MAR needed to change conclusion | "How much MNAR would negate significance?" | Estimates treatment effect under varying degrees of informative missingness |
Reference-Based Imputation Methods for MNAR Sensitivity
When missingness is suspected to be not at random (MNAR) — e.g., sicker patients more likely to drop out and have missing efficacy measurements — reference-based imputation (RBI) offers a structured approach to sensitivity analysis.
When Reference-Based Imputation Is Appropriate
Reference-based imputation is most appropriate when two conditions hold:
- Outcome data is continuous (e.g., biomarker change, symptom score, lung function) — RBI is less straightforward for event-based endpoints (OS, PFS) where complete-case analysis or event-based imputation is more common
- Dropout creates missing data that should be imputed under MNAR assumption, with the reference group defining the missing data distribution
Reference-Based Imputation Methods
Copy Increments in Reference (CIR):
- Mechanics: Assumes the individual's increment profile (change per visit) after dropout equals the reference group's observed increment profile. This preserves the trend observed in patients who remained on control therapy.
- Example: If a patient drops out at Week 8, their Week 12 and 16 values are imputed as: imputed value = baseline + (patient's change from baseline to Week 8) + (average change in reference group from Week 8 to Week 12).
- Assumption: Patients with missing data have the same change trajectory as the reference (control) group
- Use case: Conservative sensitivity when treatment patients drop out early
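The CIR rule above can be sketched numerically. A minimal example with invented visit values (a hypothetical patient who drops out after Week 8); algebraically this is identical to the baseline-plus-increments formula given earlier:

```python
# Reference (control) arm mean values by visit week (invented numbers)
ref_mean = {0: 50.0, 4: 48.0, 8: 46.0, 12: 44.5, 16: 43.5}

# Hypothetical treatment-arm patient observed until Week 8, then dropout
patient = {0: 50.0, 4: 47.0, 8: 44.0}
last_week = 8

def cir_impute(week):
    """CIR: patient's last observed value + reference-arm increment
    from the dropout visit to the target visit."""
    return patient[last_week] + (ref_mean[week] - ref_mean[last_week])

week12 = cir_impute(12)   # 44.0 + (44.5 - 46.0) = 42.5
week16 = cir_impute(16)   # 44.0 + (43.5 - 46.0) = 41.5
```

The patient keeps the 2-unit benefit accrued relative to the control mean at Week 8, but progresses at the control arm's rate thereafter.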
Copy Reference (CR):
- Mechanics: Models the patient's entire profile (observed and missing visits) as if drawn from the reference group's distribution; missing visits are imputed from that distribution conditional on the patient's observed values.
- Assumption: Missing treatment patients had control-like outcomes throughout (strongly conservative)
- Use case: Extreme MNAR sensitivity
Jump-to-Reference (J2R):
- Mechanics: Imputes missing values as if the patient's mean profile jumps to the reference (control) group's mean profile immediately after dropout; imputation is drawn from the reference-arm distribution conditional on the patient's own observed values.
- Assumption: Treatment benefit is lost at discontinuation; missing patients behave like control patients thereafter
- Use case: Conservative MNAR sensitivity, widely used for treatments whose effect depends on continued dosing
Tipping Point Framework for Reference-Based Imputation
Methodology:
- Baseline: Conduct primary analysis under MAR (e.g., standard multiple imputation or mixed model for repeated measures, MMRM)
- Parameterize departure: Define a sensitivity parameter δ (delta) representing the magnitude of departure from MAR
  - δ = 0: No MNAR — same as MAR (primary analysis)
  - δ = −1, −2, −3, ...: Increasing MNAR — missing patients have progressively worse outcomes
- Conduct sensitivity analyses: For each δ value, re-impute missing data under the MNAR assumption and re-run the primary analysis
- Identify tipping point: Report the minimum δ at which the primary conclusion changes (p-value crosses 0.05)
Interpretation guide:
- If tipping point requires δ = −1 to negate significance → primary result is fragile; modest MNAR negates result
- If tipping point requires δ = −3 or worse → primary result is robust; implausible level of missingness needed to negate
- If tipping point is clinically implausible (requires very severe assumptions) → confidence in primary result increases
SAP Language Template (Reference-Based Imputation Sensitivity):
"For the primary continuous efficacy endpoint, the main analysis will use mixed model for repeated measures (MMRM) under the missing at random (MAR) assumption. As a pre-specified sensitivity analysis addressing the robustness to missing not at random (MNAR) assumptions, a tipping point analysis will be conducted using reference-based imputation methods:
(1) Copy Increments in Reference (CIR): Missing values for treatment-arm patients will be imputed using the average increment (change from visit to visit) observed in the control-arm patients.
(2) Jump-to-Reference (J2R): Missing treatment-arm values will be imputed from the control-arm distribution, conditional on each patient's observed values, such that the post-discontinuation mean profile follows the control group's mean profile.
(3) Tipping Point Delta Parameter: For each method, a sensitivity parameter δ will be varied (δ = 0, −1, −2, −3) to represent departures from MAR. The tipping point — the minimum δ at which the primary efficacy conclusion reverses — will be reported. If the tipping point δ is clinically implausible (e.g., δ < −3), the primary result is considered robust to reasonable departures from MAR."
Stress Tests for Estimand Robustness
Stress tests extend sensitivity analysis beyond single-assumption variation to examine estimand robustness under multiple simultaneous plausible departures. While ICH E9(R1) emphasizes varying one assumption at a time, stress tests provide a complementary assessment of estimand resilience.
When to Use Stress Tests
Stress tests are particularly useful for:
- High-risk estimands: Hypothetical strategies with strong structural assumptions (RPSFT, principal stratum)
- Complex trial settings: Multiple IEs, high missing data, substantial post-baseline imbalances
- Regulatory concerns: Anticipated questions about robustness of critical efficacy claims
- Non-proportional hazards: When multiple assumptions (PH, censoring, baseline imbalances) may simultaneously be violated
Stress Test Scenarios for OS Estimand
Scenario 1: Informative Censoring + Protocol Deviations
- Assumption 1 departure: Censoring is informative (20% of censored patients would have died in year 2)
- Assumption 2 departure: 10% of control-arm patients violated major protocol criteria
- Analysis: RMST + per-protocol sensitivity + tipping point on censoring
Scenario 2: Non-Proportional Hazards + Crossover
- Assumption 1 departure: Cox model PH assumption violated (late emerging treatment effect)
- Assumption 2 departure: 50% crossover in control arm (RPSFT adjustment applies)
- Analysis: RMST (instead of HR) + RPSFT-adjusted HR + Fleming-Harrington weighted log-rank
Scenario 3: Treatment Discontinuation (AE) + Differential Follow-up
- Assumption 1 departure: Treatment-arm patients discontinue earlier due to AEs (treatment policy is primary, but may miss delayed benefit)
- Assumption 2 departure: Control-arm patients have better follow-up adherence (differential follow-up bias)
- Analysis: Landmark at multiple timepoints (12, 24, 36 months) + RMST
SAP Language Template (Stress Test):
"To evaluate robustness of the primary OS estimate under multiple plausible simultaneous departures from statistical assumptions, a pre-specified stress test will be conducted:
Stress Test Scenario: Combination of (1) informative censoring sensitivity (assume 20% of censored control-arm patients would have died in year 2) AND (2) per-protocol sensitivity (exclude major protocol violators, n=[X]). Under this stress test scenario:
- Primary analysis: RMST at 24 months (instead of Cox HR, to address potential non-PH)
- Supplementary: Stratified log-rank by baseline ECOG (to address differential follow-up)
- Tipping point: Estimate minimum proportion of censored patients that must have died to reverse OS significance
Interpretation: If the primary OS conclusion holds under this stress test (p < 0.05), robustness is demonstrated across multiple simultaneous assumption departures."
Crossover Adjustment Sensitivity Analyses
When formal crossover occurred (control arm patients received investigational drug post-progression), OS sensitivity analyses are expected:
RPSFT (Rank Preserving Structural Failure Time Model)
Purpose: Estimate OS under the hypothetical scenario where crossover did not occur
Assumptions: Common treatment effect — the treatment effect is the same for initial and subsequent use
Implementation:
- Define treatment-free intervals (before experimental drug exposure) and treatment intervals
- Apply the accelerated failure time shrinkage factor to on-treatment intervals
- Re-censor to remove the informative censoring induced by the RPSFT adjustment
SAP language: "A rank-preserving structural failure time model will be applied to the control arm to estimate overall survival under the hypothetical scenario in which no patients received [drug] after disease progression. The null hypothesis of no treatment effect will be tested using the g-estimation procedure. Bootstrapped 95% confidence intervals (1,000 iterations) will be reported. RPSFT-adjusted OS HR is presented as a pre-specified supplementary analysis; the primary OS analysis remains the ITT treatment policy estimate."
Two-Stage Estimation (TSE)
Purpose: Estimate the counterfactual OS from a secondary baseline (the progression date)
Assumptions: No unmeasured confounders at the secondary baseline; separable switcher/non-switcher populations
Best for: Settings with >20% non-switchers in the control arm
IPCW (Inverse Probability of Censoring Weighting)
Purpose: Re-weight patients to account for covariate-dependent censoring at crossover
Assumptions: No unmeasured time-varying confounders; overlap in covariate distributions between arms
Best for: 40–85% switching rates; when comprehensive covariate data are collected
SAP language: "As a sensitivity analysis, IPCW will be applied to the primary OS analysis. Weights will be derived from a time-varying logistic regression model predicting the probability of not having crossed over at each event time, using pre-specified time-varying covariates [list]. The weighted log-rank test and IPCW-adjusted Cox model will be presented."
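The IPCW mechanics can be illustrated with a deliberately simplified single-interval toy: in a real analysis the weights come from a time-varying model as the SAP language describes, but the sketch below (invented counts, one discrete prognostic stratum) shows how inverse weighting reconstructs the pseudo-population of crossover-free patients:

```python
# stratum -> (n control-arm patients, n who crossed over); invented counts
strata = {"good_prognosis": (40, 10), "poor_prognosis": (40, 30)}

weights = {}
for stratum, (n, crossed) in strata.items():
    p_stay = (n - crossed) / n         # P(remain crossover-free | stratum)
    weights[stratum] = 1.0 / p_stay    # weight for each non-crossed patient

# good_prognosis: 1 / (30/40) = 4/3 ; poor_prognosis: 1 / (10/40) = 4.0
# Each non-crossed patient "stands in" for similar patients censored at
# crossover: the weighted pseudo-population recovers the stratum sizes.
for stratum, (n, crossed) in strata.items():
    assert abs((n - crossed) * weights[stratum] - n) < 1e-9
```

Poor-prognosis patients who stayed crossover-free are up-weighted most, counteracting the selective depletion that makes naive censoring at crossover informative.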
Agency acceptance of crossover adjustments:
- FDA: Supportive only — ITT OS remains primary; RPSFT/IPCW presented as supplementary
- EMA/NICE: Accepted for HTA submissions — RPSFT or IPCW may be the primary basis for comparative effectiveness claim
- IQWIG (Germany): Does not accept any switching adjustment; ITT-only accepted
Non-Proportional Hazards Sensitivity
When PH assumption is suspected violated (delayed IO effect, crossing survival curves, biomarker-selected populations with early responder depletion):
Pre-specified NP hazards analyses:
- RMST (Restricted Mean Survival Time): Compare mean survival time within a pre-specified window (e.g., 24 or 36 months). Robust to PH violation; directly interpretable as "mean additional months of survival."
  - SAP language: "RMST will be calculated at 24 months for both OS and PFS as a pre-specified sensitivity analysis to assess robustness of the primary Cox model results under potential non-proportional hazards."
- Weighted log-rank test (Fleming-Harrington): Use ρ=0, γ=1 weighting to emphasize late events; compare to standard log-rank to identify timing of treatment effect
- Landmark analysis: Pre-specify OS/PFS rates at 12, 24, 36 months; if HR changes substantially between periods, NP hazards are present
- Schoenfeld residuals test: Formal test of PH assumption; report p-value and plot scaled residuals vs. time
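As a minimal illustration of the RMST computation itself (the area under the Kaplan-Meier curve up to a pre-specified horizon τ), a stdlib-only sketch with invented, tie-free data:

```python
def kaplan_meier(times, events):
    """Kaplan-Meier step points (assumes tie-free event times)."""
    order = sorted(zip(times, events))
    n = len(order)
    s, steps = 1.0, []
    for t, e in order:
        if e:                         # event: survival steps down
            s *= 1.0 - 1.0 / n
            steps.append((t, s))
        n -= 1                        # censored patients just leave risk set
    return steps

def rmst(times, events, tau):
    """Restricted mean survival time: area under the KM curve on [0, tau]."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t > tau:
            break
        area += prev_s * (t - prev_t)
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Invented single-arm example (months), horizon tau = 24:
rmst_24 = rmst([6, 12, 18, 30], [1, 1, 1, 1], 24)   # ~15.0 months
```

In a two-arm trial the sensitivity analysis reports the between-arm difference in RMST at τ, which stays interpretable when hazards cross or separate late.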
FDA expectation: For IO trials (immunotherapy), non-proportional hazards are expected and pre-specification of RMST or weighted log-rank is recommended in the pre-IND or Type B meeting agreement.
Tipping Point Analysis Framework
For PFS Censoring (Hypothetical Strategy)
When PFS uses hypothetical strategy (censoring at new therapy), a tipping point analysis quantifies how many censored patients in the control arm would need to have events to negate significance:
Methodology:
- Start from the primary PFS analysis (log-rank test statistic, HR)
- Sequentially convert censored control arm patients to events (starting from last censored)
- Track when the primary p-value crosses 0.05
- Report: "The PFS analysis remains significant (p < 0.05) even if [X] of [Y] censored control arm patients ([Z]%) had experienced events at their censoring date."
Interpretation guide:
- If tipping point requires <10% of censored patients to have events → primary result is fragile; sensitivity matters
- If tipping point requires 10–30% → primary result is moderately robust
- If tipping point requires >30% → primary result is robust to most plausible MNAR departures
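The censored-to-event conversion loop described above can be sketched end-to-end. A stdlib-only toy (invented, tie-free data; group 1 = treatment, group 0 = control); a real analysis would use a validated log-rank implementation:

```python
import math

def logrank_p(times, events, groups):
    """Two-sided log-rank p-value (assumes tie-free event times)."""
    data = sorted(zip(times, events, groups))
    n = len(data)
    obs_minus_exp, var = 0.0, 0.0
    for i, (_, e, g) in enumerate(data):
        at_risk = n - i
        at_risk_trt = sum(1 for _, _, gg in data[i:] if gg == 1)
        if e:                                   # one event at this time
            p_trt = at_risk_trt / at_risk
            obs_minus_exp += g - p_trt
            var += p_trt * (1.0 - p_trt)
    chi_sq = obs_minus_exp ** 2 / var
    return math.erfc(math.sqrt(chi_sq / 2.0))   # chi-square, 1 df

def tipping_point(times, events, groups, alpha=0.05):
    """Convert censored control-arm (group 0) patients to events, latest
    censoring time first; return the number of conversions at which the
    log-rank test loses significance, or None if it never does."""
    ev = list(events)
    censored = sorted(
        (t, i) for i, (t, e, g) in enumerate(zip(times, ev, groups))
        if g == 0 and e == 0)
    for flipped, (_, idx) in enumerate(reversed(censored)):
        if logrank_p(times, ev, groups) >= alpha:
            return flipped
        ev[idx] = 1
    return None if logrank_p(times, ev, groups) < alpha else len(censored)

# Invented example: control arm (0) progresses at months 1-10,
# treatment arm (1) at months 11-20, all events observed.
times = list(range(1, 21))
events = [1] * 20
groups = [0] * 10 + [1] * 10
p = logrank_p(times, events, groups)         # strongly significant
tip = tipping_point(times, events, groups)   # None: no censored controls
```

Here `tip` is `None` because no control patients are censored; with censored controls present, the returned count divided by the number censored gives the fragility percentage used in the interpretation guide above.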
For Missing Data (Delta Parameter)
When continuous endpoints have missing data, a delta-adjusted tipping point analysis tests robustness:
Methodology:
- Fit primary MMRM or MI model under MAR assumption
- Define δ as the shift applied to imputed values for dropouts, representing how much worse missing outcomes are than the MAR model predicts
- Re-fit for δ = 0 (MAR), δ = −1, δ = −2, δ = −3
- Identify tipping point δ where primary conclusion reverses
Interpretation:
- δ = 0: Primary MAR analysis
- δ = −1: Modest MNAR (missing patients have outcomes ~1 unit worse)
- δ = −2, −3: Strong MNAR (clinically implausible)
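The δ loop above can be sketched with a two-sample z-test standing in for the MMRM re-fit (a simplification; all numbers invented):

```python
import math

def z_test_p(a, b):
    """Two-sided two-sample z-test p-value (normal approximation)."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    z = (ma - mb) / math.sqrt(va / len(a) + vb / len(b))
    return math.erfc(abs(z) / math.sqrt(2.0))

# Invented change-from-baseline data (higher = better response)
control = [1.0, 1.5, 0.5, 1.2, 0.8, 1.1, 0.9, 1.4]
treat_observed = [2.0, 2.4, 1.6, 2.2]          # completers
treat_mar_imputed = [2.1, 1.9, 2.0, 2.2]       # MAR imputations for dropouts

p_by_delta = {}
for delta in (0, -1, -2, -3):
    # shift only the imputed values: "missing patients did delta worse"
    treated = treat_observed + [v + delta for v in treat_mar_imputed]
    p_by_delta[delta] = z_test_p(treated, control)

# Tipping point: first delta at which significance is lost
tipping = next(d for d in (0, -1, -2, -3) if p_by_delta[d] >= 0.05)
# With these invented numbers, tipping == -2
```

Whether δ = −2 is clinically plausible for the endpoint's scale is then the substantive question the sensitivity analysis puts to the clinical team.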
SAP Language:
"A tipping point analysis for missing data will be conducted by varying the delta parameter (δ) representing departure from the missing at random assumption. For each δ value (0, −1, −2, −3), missing treatment-arm observations will be imputed under the specified MNAR assumption, and the primary analysis re-run. The tipping point is defined as the minimum δ at which the primary efficacy conclusion (p = 0.05 significance threshold) is negated."
Sensitivity Analysis Documentation Requirements
SAP requirements (pre-specified before unblinding):
For each primary estimand, the SAP must include:
- Full list of pre-specified sensitivity analyses, each labeled as:
  - Same estimand, different estimator (pure sensitivity) — e.g., alternative censoring rule
  - Alternative estimand (supplementary, different clinical question) — e.g., hypothetical when treatment policy is primary
- For each sensitivity: description of which assumption is varied and why
- Ordered from most important to least important (not by expected favorable result)
- Statement that sensitivity analyses do not contribute to Type I error control (no alpha adjustment required for sensitivity)
SAP multiplicity statement (required):
"The primary analysis uses the treatment policy strategy for [endpoint]. All pre-specified sensitivity analyses are supportive and do not contribute to the confirmatory testing hierarchy. P-values from sensitivity analyses are descriptive only. No adjustment for multiplicity is applied to sensitivity analyses."
Backlinks
- ICH E9(R1) Estimand Framework
- Intercurrent Events in Oncology Trials
- Overall Survival (OS)
- Progression-Free Survival (PFS)
- DFS and EFS Endpoints
Source: ICH Harmonised Guideline E9(R1) — Addendum on Estimands and Sensitivity Analysis in Clinical Trials (Final, November 2019), §A.5.1–A.5.3
Status: Final (ICH E9(R1) Step 4, adopted November 2019; FDA adoption May 2021)
Compiled from ICH E9(R1) §A.5.1–A.5.3 + reference-based imputation best practices
Last Updated: April 2026
Knowledge Base: oncology_kb
Section: Clinical Trial Design — Sensitivity Analyses for Estimands