Abstract
The time-to-event relationship is a central consideration when designing a survival study in clinical trials. However, because time-to-event data are mostly not normally distributed, survival analysis relies on non-parametric and semi-parametric methods, mainly Kaplan–Meier (KM) estimation and Cox proportional hazards (CPH) regression models, while the log-rank test can be applied to compare curves from different groups. However, resorting to conventional survival analysis when fundamental assumptions, such as the Cox PH assumption, are not met can seriously affect the results, rendering them flawed. Consequently, it is necessary to examine and report more sophisticated statistical methods for processing survival data that are, at the same time, able to respond adequately to contemporary real-world problems in clinical applications. On the other hand, the frequent misinterpretation of survival analysis methodology, combined with the fact that it is a complex statistical tool for clinicians, necessitates a better understanding of the basic principles underlying this analysis in order to interpret medical studies effectively when making treatment decisions. In this review, we first consider the basic models and mechanisms behind survival analysis. Then, because common errors arise from the inappropriate application of conventional models, we review more demanding statistical extensions of survival models that address these data-handling issues and help avoid erroneous results. By providing a structured review of the most representative statistical methods and tests covering contemporary survival analysis, we hope this review will assist in solving problems that arise in clinical applications.
- Survival analysis
- (weighted) Kaplan–Meier tests
- (weighted) log-rank tests
- hazard ratios
- (extended) Cox model
- (non)-proportional hazards
- restricted mean survival time
- review
Survival analysis methodology is commonly used to investigate the future behavior of data collected over time, such as in a prospective cohort study or a clinical trial. If the outcome of a study is the time until an event occurs, we cannot necessarily wait until the event in question has occurred in all subjects. In addition, some of the initial participants may drop out of the study unexpectedly, for any reason, before it is completed. Such situations lead to censored observations, since there is no complete information for each subject. Therefore, the data set will be a combined set consisting of times to the event of interest and times after which no additional individual information is available. Survival analysis methodology and its tools remain, to date, a unique technique in which censored observations can be handled without being omitted as missing data.
General descriptive methods for exploring survival times in a sample include life tables and fitting techniques for visualizing Kaplan–Meier (KM) curves. Furthermore, various statistical tests are available for comparing survival in two or more groups. The log-rank test is widely used, and the limitations on its correct application have led to various alternatives. However, the need to assess the effect of more than one prognostic factor on survival motivates the construction and study of regression models. The CPH model is the most popular of the regression methods; it allows modeling of the hazard function in terms of a set of explanatory variables without restrictions on the number or type of covariates, whether qualitative or quantitative. However, because problems arise when the basic assumptions underlying this model do not hold and are not amenable to correction, new innovative statistical tests can provide the solution even in very special cases, such as long-term survivors or delayed-effect scenarios.
In this review, we discuss issues closely related to survival analysis, giving in parallel some examples that correspond to modern analysis problems in clinical oncology. We first give the basic definitions and rules that govern the survival analysis of censored data, which remains one of the main challenges to date. Then, we state the basic functions and estimators for the classical KM models, as well as the assumptions required for the proper functioning of the statistical tests (log-rank) used to compare survival curves. To also include and assess the effect of multiple covariates in the survival model, we analyze the function and assumptions of the Cox regression model, while presenting both the corresponding methods for investigating violations of its assumptions and the main causes of such violations. Although the log-rank test is still widely used in many studies today, it can become insensitive when the proportional hazards (PH) assumption is violated; therefore, a variety of statistical methods and tests that are free from this limitation -each with its own field of application- are used to identify and resolve discrepancies between survival curves or hazard functions. Finally, complementary or alternative survival models are proposed for special situations, such as individuals who belong to close kinship circles or communities, who started follow-up later than others, or who may experience some other unexpected event besides the event of interest, as well as studies that depend on some pre-existing condition, thus covering a wider range of applications.
Time-to-event Censoring Data
The characteristics of time-to-event data present unique peculiarities. This is because, in a clinical study, the event of interest (death or improvement through treatment) may not have occurred during the follow-up of a sample up to the end of the study, giving rise to so-called censored observations (1, 2). Something similar could also occur when a subject drops out of the study for reasons other than the event, such as encountering a ‘competing’ event that makes further follow-up impossible (3). The main advantage of survival analysis is that it provides a method of including patients who fail to complete the trial or do not reach the endpoint of the study (the censored data) by comparing the number of survivors in each group at different time points (4). Such a patient's observation is censored because we only know that the individual survived until lost to follow-up, but we know nothing about their survival after that time. This is the most common form of censoring, called right censoring (2, 5). Left censoring occurs when the initial diagnosis or time of exposure is unknown. Time-to-event data can also be interval censored (6), in the sense that individuals move in and out of observation, or when it is certain that the event occurred for a patient between two time points, but the exact time point is not known. Most survival data include right-censored observations, while methods for interval- and left-censored data are available but such data occur less often (7, 8).
As a direct application, a simple illustration of time-to-event data with right-censored observations, produced with the Python package lifelines (9-11), is shown in Figure 1 for a typical timeline of seven subjects with exit points in a hypothetical study with death events. The figure is idealized in the sense that it shows survival time starting from a common time zero (i.e., all study participants have been rearranged to start at zero simultaneously); usually, participants enter the study at different times.
Timelines for survival analysis in a simulated dataset. Each original record ends at different time points, where the event in question (death) is indicated by red circles, and x marks indicate censoring.
In survival analysis in general, the censoring is required to be ‘non-informative’, meaning that if participants drop out of the study, it is solely for reasons unrelated to the study, essentially assuming that participants whose data are censored would have the same distribution of failure times (or times to event) had they been observed (12). This hypothesis presupposes that censoring occurs independently of the outcome of interest and that subjects censored at a given time have the same risk of experiencing the event as those who remain under follow-up at the same time, even though it is not possible to observe them after the moment they are censored (13). Otherwise, the analysis of survival curves may be biased, thus leading otherwise excellent studies to erroneous results (14). Informative censoring can occur, e.g., when patients drop out of a clinical trial due to drug toxicity or because their clinical picture worsens (1). However, it is not always possible to know whether censoring in a study is truly non-informative. Time-to-event data may also be truncated, which occurs when patients are excluded from the sample because their event times are shorter or longer than certain values. The most common form in biomedical research is left truncation, i.e., not observing individuals with short event times. This truncation occurs when we begin observation after some, or all, individuals have already been at risk for the event of interest (5, 15).
Although censoring is a problem in survival analysis, we cannot exclude these subjects from the study as missing data, as this would affect the research and the results we would obtain. This is because most of these individuals are ‘survivors’, so their exclusion would misrepresent the success of treatment. Consequently, any omission of these censored observations excludes valuable survival information, leading to over- or underestimation of the overall survival probability (16). Therefore, to estimate the probability of survival at a given time, we use the risk set to include the information we have about a censored individual up to the censoring date (17). Nevertheless, estimating patient survival in the presence of censored data remains a fundamental goal in clinical applications of time-to-event data, because although there is a huge variety of relevant methods, there is no universal method that applies in every case of missing study data (18, 19).
Survival Functions
In longitudinal studies, we cannot assume that rates are constant when we follow samples/subjects from a specific initial time point (which we consider to be the baseline) until the outcome event is observed. For example, the risk of death after surgery decreases as the patient recovers but may increase again as the patient ages. Time-to-event (survival) analysis allows the study of event rates over time, without the assumption that rates are constant, by modeling time-to-event or even ascertaining the form of the correlation of time with the event with the corresponding quantitative variables. In probabilistic terms, the survival function is the probability that an individual will survive until a certain time, i.e., the probability that the event has not yet occurred. Suppose T is a random variable representing the survival (event) time. In that case, the survival function is the probability that an individual survives more than time t, i.e.,
S(t) = P(T > t)   (1)
with the expected survival time being E[T] = ∫_0^∞ S(t) dt, i.e., the area under the survival curve.
Simply speaking, the probability of survival at any given point in time is the fraction of survivors divided by the number of participants at risk. The corresponding probability density function f(t) is defined as the frequency of events per unit time (5) and is obtained as the negative derivative of the survival function with respect to time, i.e., f(t) = −S′(t).
The representation of the survival function is the KM curve, which is essentially a non-increasing step function that depicts the cumulative probability of survival over time. This curve is horizontal in time intervals where no event occurs, while thereafter it falls vertically whenever an event occurs, corresponding to a change in the survival function. KM plots can approximate measures such as median survival times (20, 21). The sampling distribution of survival times is often highly skewed (asymmetric), therefore, the median is generally a better measure of centrality than the mean (22).
KM Estimator for Survival Function
The KM estimator (23) of the survival function S(t), as a non-parametric statistic, aims to estimate the function from specific lifetime data. It is given by:
Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i / n_i)   (2)
with t_i being a time at which at least one event occurred, d_i the number of events (deaths) that occurred at time t_i, and n_i the number of individuals known to be still at risk (not yet having had an event and not yet censored) just before time t_i. Finding standard errors and estimating confidence intervals -two-sided approximate 95% confidence intervals for S(t) at a fixed time t or, more generally, (1−α)×100% confidence intervals for S(t), where α is the significance level- is an essential complement to the calculated survival probabilities. The most popular way to estimate the standard error of survival estimates is the Greenwood formula (5, 24), which approximates the variance as
Var[Ŝ(t)] ≈ Ŝ(t)² ∑_{t_i ≤ t} d_i / [n_i (n_i − d_i)]   (3)
The confidence intervals for the survival function are then calculated as
Ŝ(t) ± z_{1−α/2} √(Var[Ŝ(t)])   (4)
where the square root represents the standard error of the KM estimate.
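As a minimal illustration of Equations 2-4, the following base-R sketch (using a small hypothetical set of follow-up times, not data from any real study) computes the KM estimate and its Greenwood-based confidence limits by hand; in practice, the survfit() function of the survival package performs these calculations:

# Hypothetical toy data: follow-up times and event indicators (1 = event, 0 = censored)
time   <- c(2, 3, 3, 5, 7, 8, 8, 9)
status <- c(1, 1, 0, 1, 1, 0, 1, 0)

event_times <- sort(unique(time[status == 1]))
surv <- 1; gw_sum <- 0
for (t in event_times) {
  n_i <- sum(time >= t)                        # number at risk just before t
  d_i <- sum(time == t & status == 1)          # events occurring at t
  surv   <- surv * (1 - d_i / n_i)             # KM estimate, Equation 2
  gw_sum <- gw_sum + d_i / (n_i * (n_i - d_i)) # Greenwood sum, Equation 3
  se <- surv * sqrt(gw_sum)
  cat(sprintf("t=%d  S(t)=%.3f  95%% CI: %.3f to %.3f\n",
              t, surv, surv - 1.96 * se, surv + 1.96 * se))  # Equation 4
}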
Hazard Functions
The instantaneous hazard function (or hazard rate) gives the probability that the event of interest occurs in the next infinitesimal time interval Δt, conditional on the subject having survived up to time t. In other words, the hazard rate is essentially the probability of the event per unit of time Δt, given survival up to any earlier time before t (5, 25):
h(t) = lim_{Δt→0} P(t ≤ T < t + Δt | T ≥ t) / Δt   (5)
which is generally difficult to estimate; thus, the cumulative hazard H(t) is used, defined as the integral of the hazard, i.e., the area under the hazard function up to time t, H(t) = ∫_0^t h(u) du, for which a non-parametric estimator is the Nelson-Aalen estimator (8). The cumulative hazard function is related to the survival function through the relation S(t) = e^{−H(t)}, and consequently, knowing one quantity, it is easy to determine the other as well. A greater hazard implies a lower survival probability and vice versa (26). Also, while the survival function is related to the absence of an event, the hazard function is related to the event itself: the hazard describes its instantaneous rate of occurrence, whereas survival refers to its cumulative absence.
KM Curve Graph
To plot KM curves, three assumptions must be considered in the analysis using the KM estimator: first, that at any time, patients who are censored have the same survival prospects as those who continue to be followed; second, that survival probabilities are the same for subjects recruited early and late in the study; and third, that the event occurs at the time specified (12, 27, 28). The estimate of the KM survival function is essentially the fraction of patients who live for a certain period after treatment (or exposure to the disease). Given a data set of failure and censoring times, the KM estimate of the survival function can be easily obtained by dividing the entire time interval into subintervals with cut-off points at the failure times and then calculating the (conditional) survival probabilities for each interval. It should also be noted that when there is no truncation or censoring, the KM curve is the complement of the empirical distribution function.
Typically, survival plots are presented as cumulative survival distributions, with the percentage of event-free patients decreasing over time. In this way, important summaries of the data, such as the median survival time of the participants, are obtained. As we can see in the KM curve of Figure 2, each event time is represented as a ‘jump’, whereas the small red vertical marks represent individual patients whose survival times have been right-censored. The survival plot was created using base R (29), and more specifically the ggsurv function (30) and the KMsurv (31), ggplot2 (32), survival (33, 34), plotly (35), and survminer (36) packages, with 95% confidence limits for the survivor function. In the underlying data there is one entry for each subject, followed by a (+) if the subject was censored, whereas at the time points at which the curve shows a step, at least one event has occurred. Here, it should be noted that neglecting the censoring leads to an overestimation of the overall probability of survival and therefore to completely incorrect conclusions. The study refers to a dataset of 2,982 patients with primary breast cancer whose records were located in the Rotterdam tumor bank. Of these patients, 1,272 had died, and their overall median survival was 11.04 years (95%CI=10.65-11.81), with the survival time recorded in days (37, 38). The five-year survival probability is 74.4% (95%CI=72.8-76.0), as can be seen from the graph.
The KM curve for overall survival (OS) of patients with primary breast cancer. Taking censored observations into account, the median survival time was 11.04 years (95%CI=10.65-11.81), i.e., 4,033 days in the graph.
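A curve like the one in Figure 2 can be reproduced with the following hedged sketch, assuming the Rotterdam breast-cancer data distributed with the R survival package (with dtime holding the days to death or last follow-up and death the event indicator); the exact preprocessing behind the published figure may differ:

library(survival)
library(survminer)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25                      # convert days to years

fit <- survfit(Surv(years, death) ~ 1, data = rot)
print(fit)                                           # median survival with 95% CI
summary(fit, times = 5)                              # five-year survival probability

ggsurvplot(fit, conf.int = TRUE, risk.table = TRUE,
           xlab = "Years", ylab = "Overall survival probability")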
Comparing KM Curves
Another key role of survival analysis, particularly important in clinical trials, is comparing survival in different groups. Beyond visualizing such a difference, its statistical significance should also be quantified. For example, statistically significant differences in the probability of survival have been found between groups of patients with various types of cancer and matched samples that received a placebo, such as in breast cancer (39-41). Equally critical is the assessment of statistical significance between two different treatments, so that the most effective one can be identified in an RCT. The most common test for comparing two survival curves is the non-parametric Mantel-Haenszel or log-rank (LR) test (42, 43), which compares the observed number of events with the expected number in two independent groups. The LR test takes as its null hypothesis the assumption that there is no difference between the populations studied in the probability of an event at any point in time, and it relies on the same assumptions as the KM method (27, 44). The test statistic approximately follows the χ2 (chi-squared) distribution; it is computed by ranking all (complete) failure times of both groups in ascending order and computing the expected number of events for each group separately, which are then summed to yield the final χ2 so that the full curves of each group can be compared (5, 45).
A typical example of the application of the LR test is shown in Figure 3, which provides a comparison of the survival curves between subjects who did or did not receive hormone therapy in the previously mentioned data set of patients with primary breast cancer. The corresponding OS curves for the subgroups defined by hormonal status were plotted, with the numbers of events represented below each step of the curves over time. This comparison was performed using the survminer and survival packages in R; the result supports the alternative hypothesis, i.e., a statistically significant survival difference between the two groups of patients was observed (χ2-statistic=23.7, p-value<0.05) (38).
Application of the log-rank test. Comparison of overall survival curves between patients receiving hormone therapy and those who are not receiving hormone therapy, where the dichotomous variable “hormone” indicates whether each patient received hormone therapy or not (0=No, 1=Yes).
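The comparison of Figure 3 can be sketched as follows, again assuming the rotterdam data from the survival package with its hormon indicator (1 = hormone therapy, 0 = none); survminer::ggsurvplot() with pval = TRUE can display the group-specific curves together with the log-rank p-value:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

survdiff(Surv(years, death) ~ hormon, data = rot)       # log-rank chi-square and p-value

fit <- survfit(Surv(years, death) ~ hormon, data = rot) # KM curves by hormone status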
There are specific criteria that must be satisfied for the correct application of the LR test, the first being that the censoring must be non-informative. Another important condition -which we will discuss in more detail below- is that the basic assumption of PH must be met (26, 46, 47). Regarding the risk of the event occurring at different time points during the observation period, we do not assume that it is constant, but rather that the hazard ratio between groups is constant over time. Otherwise, one group could have better survival than the other early on and then worse survival at later time points. Such a trend is reflected in KM plots that initially diverge and then intersect, a case in which the basic PH assumption breaks down (48-50). In that case, the LR test should not be used due to an increased chance of a false-negative result, and thus low statistical power (12). Since there is no universal rule as to which test should be used in each situation, in the case of crossing curves a re-evaluation should be performed using weighted (51, 52) or modified (53, 54) LR tests, which we discuss in more detail below for settings in which the PH assumption does not hold. Finally, since the LR test is a test of statistical significance, it cannot by itself estimate the size of the difference between groups or its confidence interval (1, 44).
Cox Proportional Hazards (CPH) Models
The aforementioned methods -i.e., both KM curves and LR tests- are examples of univariate statistical analysis. Specifically, they analyze survival according to one factor under investigation but ignore the influence of any other, and therefore they cannot be used for multivariate analysis (44). In addition, KM curves and LR tests are useful only when the predictor variable is categorical (e.g., treatment, sex), while they do not work well for estimating the effect of quantitative variables (e.g., age). An alternative way to investigate differences in survival probabilities between treatment groups, while accounting for several covariates, is to fit regression models. The CPH model, as a semi-parametric model (55), is the most common approach for assessing the effects of multiple risk factors; it can handle both categorical and continuous variables by modeling the effect of multiple variables simultaneously. It also has the advantage of using all available information, including patients unable or failing to complete the experimental trial, thus assessing the relative efficacy of the experimental treatment compared to the control treatment during the trial. In a CPH regression model, the outcome of interest is a ratio, known as the hazard ratio (HR), which represents the ratio of hazards between two groups at any specific time. This is achieved by modeling the relationship between the predictor variables and the time to event through the hazard function, under the assumption that the predictors have a multiplicative effect on that hazard and that this effect is constant over time (56). The mathematical formula of the CPH regression model is as follows:
h(t|X) = h_0(t) exp(b_1 X_1 + b_2 X_2 + … + b_p X_p)   (6)
where h_0(t) is the baseline hazard, representing the hazard when all the predictors (or independent variables) X_1, X_2, …, X_p are equal to zero, and it can take any functional form. The coefficients b_1, b_2, …, b_p are the model parameters describing the effect of the prognostic factors on the overall hazard. The log of the ratio of the hazard at time t to the baseline hazard (the relative hazard) is a linear function of the predictors:
log[h(t|X)/h_0(t)] = b_1 X_1 + b_2 X_2 + … + b_p X_p   (7)
where the connection to a linear regression model is obvious. A closer look at the previously mentioned PH principle (the most basic operating assumption of the Cox PH model) shows that, on the logarithmic scale, any change in a risk factor leads to a corresponding proportional change in the hazard. The Cox model output aims to estimate the HR applied to time-to-event outcomes as the causal effect of a binary (0-1) variable. Essentially, HRs are measures of association used for comparisons of hazard rates between exposed and unexposed groups. Thus, for an adverse outcome (e.g., death), a hazard ratio below 1 indicates that treatment may be favorable (a reduced risk of death), with lower values corresponding to more successful treatment. The corresponding confidence interval should not overlap unity (HR=1) if the effect between the two treatments is statistically significant at the p<0.05 level. In the special case of univariate analysis (a single covariate) with regression parameter b_1, the HR=e^{b_1}. The interpretation of the parameter is that for each unit increase in the covariate X, the hazard is multiplied by e^{b_1}. In particular, when X takes the values 1 or 0, e.g., for groups 1 and 2 respectively, group 1 has e^{b_1} times the hazard of group 2 (5, 57). Direct application of the CPH model to the data set of breast cancer patients already used in the review of classical survival analysis methods gives an HR of 1.511 for patients who received hormone therapy relative to those who did not. This means that there is a 51.1% higher risk of death over time in patients who received the treatment compared to those who did not, as can be seen from the forest plot in Figure 4, with the corresponding 95% confidence interval being (1.3-1.8); the result is statistically significant.
The forest plot for the Cox PH model: graphical representation of the HR for hormone therapy in the primary breast cancer dataset.
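A univariate Cox model corresponding to Figure 4 can be sketched as follows, under the same assumption about the rotterdam data; exp(coef) in the summary is the HR discussed in the text, and survminer::ggforest() draws a forest plot of the fitted model:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

cox_fit <- coxph(Surv(years, death) ~ hormon, data = rot)
summary(cox_fit)          # HR = exp(coef), its 95% CI, and the p-value

# survminer::ggforest(cox_fit, data = rot)   # forest plot similar to Figure 4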
The HR as an effect measure differs from the corresponding measures of relative risk (RR) and odds ratio (OR) in that the latter are cumulative over a study, whereas HRs represent instantaneous risk during the study period or some subset thereof (58). If the PH assumption is valid, the HR is independent of time, and the next stage is the construction of CIs, a step necessary to measure the precision of the study estimate, with the sample size of the study being the main factor affecting that precision (12, 45, 59). Here too, the censoring should be non-informative for the model to apply. However, the interpretation of the HR can hide pitfalls, on the one hand because the HR can change over time, and on the other because it can incorporate selection bias (60).
Testing the PH Assumption
The application of the Cox PH model assumes that the survival curves of the different strata have proportional hazard functions over time t, as well as that the relationship between the log hazard and each covariate is linear. A deviation from the PH assumption exists when the regression coefficients are time-dependent, i.e., time interacts with one or more covariates. The most common approach to testing the proportionality assumption is to use graphical diagnostics and goodness-of-fit (GOF) tests (61) based on the scaled Schoenfeld residuals (8, 62). Graphically, the survival functions in log-minus-log (LML) form, i.e., log(−log(S(t))), which transforms them into a linear functional form, should be approximately parallel over the observation period without intersecting or meeting. This property implies a constant hazard ratio and thus supports proportionality (63, 64). It should also be noted that a continuous explanatory variable must be split into categories so that parallel survival curves corresponding to the different levels of the covariate can be inspected in an LML plot (12). On the other hand, the Schoenfeld residuals, as the basis of GOF tests for each covariate, determine whether the residuals remain constant (independent of time) or change (increase or decrease over time, in which case the assumption of proportionality breaks down) (12, 65). A Schoenfeld residual for an explanatory variable at each failure time point is the difference between its observed value at that time and its expected value. A plot showing a non-random pattern over time (tested by a significant regression slope) is evidence of a violation of the PH assumption, while if the Schoenfeld residuals show random scatter, the PH assumption is supported (12). However, some early crossings of the curves may be a product of noise in the survival estimates and not necessarily a violation of the PH assumption (66). Schoenfeld residuals are calculated for all explanatory variables included in the model. If the PH assumption does not hold for a specific covariate, we obtain a mean HR, derived from averaging over the event times, which is not necessarily an erroneous estimate. Due to the complexity of the problems arising from the investigation of the PH assumption, it is advisable to apply both of the methods examined above in parallel. However, in the case of a general violation indicated by these checks, it is necessary to apply alternative methods that allow the analysis of time-dependent effects, such as the stratified Cox model or time-dependent Cox regression analysis (non-PH models).
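A minimal sketch of these checks, using the survival package and the same assumed rotterdam-based data as above (the covariates chosen here are purely illustrative):

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

cox_fit <- coxph(Surv(years, death) ~ hormon + age, data = rot)
zph <- cox.zph(cox_fit)   # score test for each covariate plus a global test
print(zph)                # small p-values suggest a violation of PH
plot(zph)                 # scaled Schoenfeld residuals vs. time; a non-flat trend indicates non-PH

# An LML plot for a categorical covariate can be obtained with
# plot(survfit(Surv(years, death) ~ hormon, data = rot), fun = "cloglog")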
Causes for Violation of the PH Case
As mentioned, using KM curves and Cox regression techniques to analyze survival data essentially relies on the PH assumption. This assumption states that the HR, i.e., the ratio of the hazard function to the baseline hazard, is constant over time. However, the credibility of this assumption is not self-evident and should be checked using the methods described in the previous paragraphs. Realistically, treatment effects in long-term trials, especially those currently conducted in late-stage oncology, can change over time, making the PH assumption inapplicable throughout the study (67). The most important possible reasons for the occurrence of non-proportional hazards (NPH) in survival studies are the following (68-70):
• survival is influenced by the progression of the disease occurring at different rates in the treatment and control groups,
• the effect of the treatment appears with a delay or is reduced,
• a significant number of long-term survivors appear in the treatment subgroup,
• treatment can switch or cross-over to the other arm after progressing on the original arm,
• the existence of subgroups with differential treatment outcomes, and
• treatment shows negative effects in a subgroup but is encouraging in its complement, which may also lead to a crossover of the survival curves.
In this way, there may be different types of NPH, such as delayed treatment effects, crossing hazards, and reduced treatment effects over time, as well as the appearance of long-term survivors (69, 70), as shown in Figure 5, where the geometry of two survival curves is illustrated for the most common types of NPH frequently encountered in clinical data (70). In all these cases, i.e., in the presence of NPH, the LR test and the Cox PH model, which are optimal in the setting of PH, may lack power (71), and the interpretation of the HR for the treatment effect from the Cox model is challenged (72).
Non-proportional hazard types where the proportional hazard assumption is violated. (A) Delayed treatment effect. (B) Crossing hazards. (C) Diminishing effect. (D) Long-term survivors.
NPH Cox Models
In many cases, some of the covariates under consideration may change their values over time. When covariates or coefficients in the Cox PH model depend on time, it is no longer a ‘proportional hazards’ model, since the relative hazard is also time-varying. Therefore, when the PH assumption is violated, the simple Cox regression model is not valid, and more sophisticated techniques have been developed to deal with the non-proportionality of hazards (73-75). Mathematically, extending the simple PH model to a more general NPH model is straightforward. In particular, a general expression for the hazard function, given covariate X, is:
h(t|X) = h_0(t) exp(b(t) X(t))   (8)
where b(t) is a time-varying regression effect coefficient and X is allowed to depend upon time. Consequently, PH models can be viewed as a special case of NPH models (76). Even though there is no consensus on globally optimal practices under NPH -since these depend on the specific NPH scenario- several methods exist to confront the NPH problem: for example, stratifying the analysis on the covariate with a time-dependent effect (subgroup analyses), including the time-dependent covariate in the model, or using a time-varying coefficient for that covariate. These approaches can be implemented with the stratified Cox regression model and the extended Cox regression model (77).
The stratified Cox regression model. This model is a modification of the Cox regression model in which the covariates that do not meet the PH assumption are used for stratification (78, 79). Considering a dataset that has some covariates that do not satisfy PH and others that do (e.g., X_j, j=1,…,p), the hazard function for the failure time of an individual in stratum i takes the form:
h_i(t|X) = h_{0i}(t) exp(b_1 X_1 + b_2 X_2 + … + b_p X_p)   (9)
where i=1,…,k (76). Therefore, the predictors that are assumed to satisfy the PH assumption are included in the model, whereas the stratified predictor is not (77, 80). This means that covariates with non-proportional effects can be incorporated into the model only as stratification factors and not as predictors. However, because stratification only works for categorical variables, quantitative variables must first be discretized.
The stratified Cox model allows the different groups of the population under study to have different baseline hazard functions h_{0i}(t) and thus different survival curves S_{0i}(t) (81). However, the regression coefficients b_1, b_2, …, b_p are the same for each stratum, meaning that the HRs are also the same for each stratum (the no-interaction assumption) (77). The coefficients are estimated by maximizing a likelihood function obtained by multiplying together the likelihood functions of the individual strata.
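A stratified Cox fit (Equation 9) can be sketched with coxph() and strata(); in this sketch tumor grade is used, purely for illustration, as the covariate assumed to violate PH in the rotterdam data referenced earlier:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

strat_fit <- coxph(Surv(years, death) ~ hormon + age + strata(grade), data = rot)
summary(strat_fit)   # each grade stratum gets its own baseline hazard;
                     # no coefficient is reported for the stratification factor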
The extended Cox regression model. Considering several time-independent covariates (e.g., X_i, i=1,…,p1) in combination with time-dependent ones (e.g., X_j(t), j=1,…,p2), the hazard function takes the form:
h(t|X(t)) = h_0(t) exp( ∑_{i=1}^{p1} b_i X_i + ∑_{j=1}^{p2} d_j X_j r(t) )   (10)
where b_i and d_j are the coefficients of the covariates and r(t) is a function of time that can take various forms, such as t, t², or log(t), as well as the step function of the form:
r(t) = 1 if t ≥ t_c, and r(t) = 0 if t < t_c   (11)
Time t_c represents the cutoff time point for the subject-specific data in each study. A model of this form no longer satisfies the PH assumption, as the HR is time-dependent (82): the baseline hazard is common, but the covariate effects differ over time. In addition, the regression coefficients themselves can be allowed to change over time, rather than being fixed or constant (83). For example, a regression coefficient could be a linear, a fractional polynomial, or a restricted cubic spline function of time (84). These time-varying covariate-effect methods allow a more thorough study of the effect of a covariate on the process over time, rather than just on the outcome.
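One way to sketch the extended Cox model of Equations 10-11 is through the tt() mechanism of coxph(), which creates a time-varying effect for a chosen covariate; the choice r(t)=log(t) below, and the use of the assumed rotterdam-based variables, are illustrative only:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

ext_fit <- coxph(Surv(years, death) ~ age + hormon + tt(hormon),
                 data = rot,
                 tt = function(x, t, ...) x * log(t))   # time-varying term with r(t) = log(t)
summary(ext_fit)   # tt(hormon) estimates how the log-HR for hormon changes with time

# For the step function of Equation 11, one could instead use
# tt = function(x, t, ...) x * (t >= t_c) with a pre-specified cutoff t_c.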
Yang-Prentice model. This semiparametric model can accommodate both the Cox PH model, as a special case, and NPH settings (51, 85). In this model, the HR function depends on the survival function of the control group and two parameters θ1 and θ2 (70). The hazards of the two groups are related by:
λ_T(t) = [θ1 θ2 / (θ2 S_P(t) + θ1 (1 − S_P(t)))] λ_P(t)   (12)
where θ1 and θ2 are two positive parameters, S_P is the survival function of the first (control) group, and λ_P and λ_T are the hazard functions of the first and second groups, respectively (76). The parameter θ1 corresponds to the short-term relative risk and θ2 to the long-term one. If θ1=θ2, the PH model is recovered; if θ1>θ2, the HR function decreases over time; and if θ1<θ2, the HR function increases. Therefore, this model accounts for the cases of PH, crossing hazards, no initial treatment effect, and a vanishing treatment effect (86), by providing short-term and long-term HRs as summary measures of treatment effect, the corresponding CIs for these HRs, and a p-value from an adaptively weighted log-rank test. However, the Yang-Prentice model has been shown to adequately handle dichotomous rather than continuous covariates (85).
Other NPH Models
Cure rate models. These models can be seen as a special case of survival models, in which a fraction of subjects never experiences the event of interest (70, 87-93). Consequently, cure rate models are suitable for use when a new treatment is expected to induce long-term survivors. This will result in a ‘plateau’ at the tail of the survival curve for patients in the new treatment group. Such models can be distinguished into two common types: mixture and no-mixture cure rate models. The former has received far more attention than the latter. However, both cure rate models can describe short and long-term effects as well as those covariates associated with each of these effects.
Mixture model. In the mixture cure model, a certain fraction of the population is cured, and the rest of the population is not. Assuming that the population is divided into two groups of subjects, namely a group of cured subjects with probability p and a group of susceptible subjects (who remain at risk of the event) with baseline survival function S_0(t), the survival function is given by the formula:
S(t) = p + (1 − p) S_0(t)   (13)
where p ∈ (0,1) (87). Common choices for S_0(t) are the Gompertz, logistic, exponential, and Weibull distributions. The corresponding probability density function of the lifetime T is f(t) = (1 − p) f_0(t), where f_0(t) is the baseline probability density function for susceptible subjects and F(t) = 1 − S(t). If we consider a random sample (t_i, δ_i) of size n, with i=1,…,n, then the contribution of the i-th individual to the likelihood function is given by:
L_i = [(1 − p) f_0(t_i)]^{δ_i} [p + (1 − p) S_0(t_i)]^{1−δ_i}   (14)
where δ_i is the censoring indicator variable (δ_i=1 if the event was observed and δ_i=0 if the observation was censored).
Non-mixture model. In this case, the survival function is given by the formula S(t) = p^{F_0(t)} = e^{ln(p) F_0(t)}, where F_0(t) = 1 − S_0(t). The likelihood contribution of the i-th individual is then given by:
L_i = [−ln(p) f_0(t_i)]^{δ_i} p^{F_0(t_i)}   (15)
while, under certain assumptions, models of this type can be considered PH models in which a certain fraction of the population is cured. With cure rate models, covariates connected with short- and long-term effects can be examined, even though their interpretation differs between mixture and non-mixture models (89). Also, a basic condition for the use of these models is adequate follow-up.
Conditional survival (CS). This type of analysis has been widely used to assess the long-term effect of a treatment, providing new insight into the differences and effects of early and delayed treatment in the case of NPH (94). The rationale is that an aggressive surgical treatment could have high early mortality but lead to high survival or even cure rates after the treatment period, in contrast to a chemotherapy that has no severe treatment-related mortality but corresponds to only a moderate therapeutic effect. Such an analysis would be particularly useful for comparing this survival benefit between treatments. It has been shown that CS provides further information about the prognosis of cancer patients, particularly regarding how their prognosis evolves (95). For patients who have already survived for several years, CS provides the relevant information because the prognosis is adjusted for the time the patient has already survived. In mathematical terms, CS is defined as the conditional probability of a patient surviving a further t years, given that they have already survived s years after the diagnosis of the disease (96):
CS(t|s) = S(s + t) / S(s)   (16)
where for prediction time s=0, CS(t│0) coincides with the survival function S(t) since S(0)=1. Given a clinical cohort of patients with a particular type of cancer, absolute CS can be estimated by conditional KM estimates in strata defined, for example, by age and disease stage or by a conditional version of the Cox and other regression models for time-to-event data. CS provides valuable and relevant information on how prognosis develops over time. It also serves as a starting point for identifying factors related to long-term survival.
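Equation 16 can be evaluated directly from a fitted KM curve; the sketch below, which again assumes the rotterdam-based fit used earlier, gives the probability of surviving five more years conditional on having already survived three:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

fit <- survfit(Surv(years, death) ~ 1, data = rot)
s <- 3; t <- 5
S_hat <- summary(fit, times = c(s, s + t))$surv
cs <- S_hat[2] / S_hat[1]                     # CS(t | s) = S(s + t) / S(s)
cat(sprintf("P(survive %d more years | survived %d years) = %.3f\n", t, s, cs))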
NPH Tests
To test differences in treatment effect when the PH assumption is not met, alternative methods are needed that can deal with different patterns of NPH. Some of these commonly used methods are weighted log-rank (WLR) tests, weighted Kaplan–Meier (WKM) statistics, the restricted mean survival time (RMST), and combination tests such as the Max-Combo test. Essentially, some of these methods are modifications of the LR test, and others are derived from KM curves. In the following, a brief description of these methods is provided.
The Weighted Log-rank (WLR) test. This type of test can be used when a delayed treatment effect is expected and/or observed, resulting in delayed separation of survival curves. The WLR test is a generalization of the log-rank test, which assigns different weights to time points and in this way emphasizes a certain portion of the survival curves. The WLR test has been utilized in studies with NPH to increase power (50, 97-101). Particularly, by adding a weight function, a more powerful test is obtained for other types of hazard alternatives, e.g., early, late, or even crossing hazard differences. A weighted version of the LR test statistic is given by:
Z = ∑_j w(t_j)(O_j − E_j) / √(∑_j w(t_j)² V_j)   (17)
Here, O_j and E_j denote the observed and expected (under the null hypothesis) numbers of events in the treatment arm at an event or censoring time t_j (t_1≤t_2≤…≤t_J), V_j is the variance of E_j, and w is a weight function of time with non-negative values. If w(t_j)≡1, the well-known LR statistic is obtained. In addition, various options can be used as weight functions, such as the Prentice-Wilcoxon test (102), the Peto-Peto test (103), or the Gehan-Breslow-Wilcoxon method, which gives more weight to failures in the initial time intervals but performs poorly when the results of a large percentage of patients are censored over time (26). A G^{ρ,γ} family (99) has also been proposed that can represent a variety of weight-function shapes based on the observed survival function S(t), as w(t) = S(t)^ρ (1 − S(t))^γ, where ρ and γ are parameters specifying the shape of the weight function. If ρ=γ=0, the standard LR test -equally weighted throughout the whole period- is recovered, while if ρ=1, γ=0, the Prentice-Wilcoxon test is obtained, which tests for differences in the time-to-first-event profiles of two comparable treatments. When ρ=0, γ=1, more weight is given to later time points, while when ρ=1, γ=1, more weight is allocated to the middle time points.
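In R, part of the G^{ρ,γ} family is available through the rho argument of survdiff(), which applies weights S(t)^ρ (i.e., γ=0); dedicated packages are needed for the full two-parameter family. A minimal sketch, again on the assumed rotterdam data:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

survdiff(Surv(years, death) ~ hormon, data = rot, rho = 0)  # standard log-rank
survdiff(Surv(years, death) ~ hormon, data = rot, rho = 1)  # Peto-Peto/Prentice-Wilcoxon type,
                                                            # more weight on early differences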
The Max-Combo test. In the absence of prior knowledge regarding the PH or NPH pattern, the Max-Combo test can be used to test for a treatment difference between two treatment arms (104). This composite test is a combination of Fleming-Harrington G^{ρ,γ} WLR tests. Thus, the test statistic can be given as Z_max = max_{ρ,γ} (Z_FH(ρ,γ)), where Z_FH(ρ,γ) are the standardized Fleming-Harrington WLR test statistics (69, 70). This test offers an approach that combines multiple weights, addressing different patterns of NPH. Moreover, the Max-Combo test, compared to the LR test, is advantageous in terms of power for a delayed treatment effect and crossing hazards, and has an acceptable loss of power for a diminishing treatment effect (69, 70). However, when using the Max-Combo test, it is difficult to interpret the resulting weighted HR used to describe the treatment effect (70, 72, 105).
Weighted Kaplan–Meier (WKM) test statistic. Tests other than LR-type tests can also be used to compare survival functions. One example is based on the integrated weighted difference between the KM curves (76). In these tests, the statistic can also be taken as the maximum over several statistics from the class, with various weights. In particular, the WKM statistic over the restricted range after t_0 is defined as:
WKM = √(n_1 n_2 / (n_1 + n_2)) ∫_{t_0}^{τ} ŵ(t) [Ŝ_1(t) − Ŝ_2(t)] dt   (18)

with

ŵ(t) = Ĉ_1(t⁻) Ĉ_2(t⁻) / [p_1 Ĉ_1(t⁻) + p_2 Ĉ_2(t⁻)],

where the random weight function ŵ(·) estimates a deterministic function w(·) that is used to stabilize the test statistic and τ denotes the end of the restricted follow-up range. The quantities n_1 and n_2 are the sample sizes of the treatment and control groups, respectively, p_i = n_i/(n_1+n_2), and Ĉ_i(t⁻) denotes the left-continuous version of the KM estimator of the censoring random variable of group i (106-108). WKM tests have the advantage that, as location tests, they do not rely solely on the ranks of the data, in contrast to all the other procedures mentioned above, which are rank-based and thus do not inherently incorporate information about the magnitude of the survival difference relative to the time scale.
The Restricted Mean Survival Time Method (RMST). As a special case of WKM, RMST is a metric that has attracted intense interest in recent years due to its simple and intuitive interpretation in measuring clinical treatment effectiveness (69, 109-113). It is a summary measure that estimates average survival time and is independent of the PH assumption, so it can be considered for designing trials in the presence of NPH (114). In addition, it is less sensitive to the distribution of survival times and is less affected by extreme values than the mean survival time. Geometrically, RMST is determined as the area under the survival curve S(t) from 0 up to a pre-specified time point t′, that can be interpreted as the average event-free survival time up to the pre-specified time t′ (109) and is given by:
RMST(t′) = ∫_0^{t′} S(t) dt   (19)
Furthermore, the difference Δ in RMST between two arms - control and treatment- with KM curves S0(t) and S1(t) respectively, is given by:
Δ = ∫_0^{t′} S_1(t) dt − ∫_0^{t′} S_0(t) dt = ∫_0^{t′} [S_1(t) − S_0(t)] dt   (20)
denoting that Δ is the area between the survival curves. Thus, the larger the area, the greater the treatment effect. Here, it should be noted that the difference between two RMSTs can be seen as a WKM test (107) in which the weight is equal to 1 (110, 115). Along with the difference in RMST, the ratio of restricted mean time lost (RMTL), which is the area above the survival curve, can also be a useful summary measure. Indeed, in some application studies they had similar properties (116). The RMTL ratio can be approximately the same as the hazard ratio when the event rate is low. The difference in RMST can provide an absolute effect size, which is not available with hazard ratios, but in any case the RMST value still depends on the choice of t′. The follow-up boundary time t′ is crucial, since it can affect the RMST difference along with its CI and any conclusions drawn from it, so t′ needs to be pre-specified to avoid bias.
A typical example of the application of the RMST method, implemented in the survRM2 R package (117-119), is illustrated in Figure 6. From a total of 418 records in a study of primary biliary cirrhosis conducted by the Mayo Clinic, a subset of 312 cases participating in a randomized clinical trial is selected: 158 cases belong to the D-penicillamine group and 154 cases to the placebo group. Time is measured in years from initial subject enrollment to death or last known contact, status is the event indicator (1: death, 0: censored), and arm is the treatment assignment indicator (1: D-penicillamine, 0: placebo). The areas highlighted in pink and orange are the RMST and RMTL estimates, respectively, for the D-penicillamine and placebo groups, when follow-up is restricted to 10 years. The RMST was 7.146 years in the D-penicillamine group and 7.283 years in the placebo group; correspondingly, patients treated with D-penicillamine lost an average of 2.85 years compared to 2.7 years in the placebo group (RMTL). The estimate shows that patients who received this treatment survived on average 0.137 years less than those in the placebo group over 10 years of follow-up (RMST difference), suggesting that the treatment was not effective, although the difference was not statistically significant (p=0.738).
An example of the RMST and RMTL implementation. The areas highlighted in pink and orange are the RMST and RMTL estimates, respectively, in a study of primary biliary cirrhosis, showing the restricted mean survival time (or time lost) at 10 years of follow-up for both arms (arm1: D-penicillamine, arm0: placebo).
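The analysis of Figure 6 can be sketched with the survRM2 package; the code below assumes the pbc data shipped with the survival package, in which the 312 randomized patients have a non-missing trt value, and treats transplantation as censoring:

library(survival)
library(survRM2)

d <- pbc[!is.na(pbc$trt), ]                 # the 312 randomized cases
time   <- d$time / 365.25                   # days to years
status <- as.numeric(d$status == 2)         # 1 = death, 0 = censored or transplanted
arm    <- as.numeric(d$trt == 1)            # 1 = D-penicillamine, 0 = placebo

rm_fit <- rmst2(time, status, arm, tau = 10)  # restrict follow-up at 10 years
print(rm_fit)                                 # RMST and RMTL per arm, differences, p-values
plot(rm_fit)                                  # shaded RMST/RMTL areas as in Figure 6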
There are many advantages to using RMST (111). For example, RMST can represent the average duration of a specific disease state. In addition, in contrast to the HR, RMST can provide a simple quantitative measure of the improvement in survival time over the whole period of follow-up. Finally, overall survival can be divided into periods, namely one corresponding to the time to disease progression and one corresponding to the time from progression to the event of interest. Then, by estimating the RMST in these two periods for the two trial arms and comparing them, additional valuable information can be extracted concerning the differences and similarities between the two trial arms in these two periods (111). Different patterns of NPH can be handled by the RMST method, such as a delayed treatment effect, a diminishing treatment effect over time, and long-term survivors. However, a loss of power has been observed for the NPH pattern of crossing hazards (69).
Other Survival Models
Parametric models. An alternative way of modeling in the presence of NPH is the use of parametric distributions. The basic assumption here is that, in censored regression models, the survival times (or the logarithms of the survival times) of all cases in the data follow a certain theoretical distribution (120). Common non-negative parametric distributions that may be suitable for time-to-event modeling are the Weibull, Gompertz, gamma, log-normal, log-logistic, and piecewise exponential distributions (56, 70). The shape of the HR function depends on the pre-selected parametric distribution; in this way, different parametric distributions allow the modeling of different NPH patterns. The choice of which parametric distribution to use is made by comparing the model fit under a variety of different distributions. This is achieved with a penalized metric provided by model selection indices, such as the Akaike information criterion (AIC) (121) or the Bayesian information criterion (BIC) (122), with the lowest criterion value indicating the best choice among the candidates. A strong advantage of parametric models is that they provide higher performance, especially for smaller sample sizes (123).
The main classes of these models are the parametric PH model (supporting parametric form on the baseline hazard), the cumulative hazards model (where predictors affect the hazard function in an additive manner), and the Accelerated Failure Time (AFT) model which assumes that the relationship of the log of survival time with the covariates is linear. For this reason, the last model is one of the most popular because its interpretation is more understandable. An AFT model is of the form:
ln(T) = b_0 + b_1 X_1 + … + b_p X_p + ε   (21)

or

T = exp(b_0 + b_1 X_1 + … + b_p X_p) · e^{ε}   (22)
where T is the failure time and ε is the error term, which is assumed to follow a certain parametric distribution that is no longer necessarily normal. It is obvious that the AFT model is additive with respect to ln(T) and multiplicative with respect to T (56).
The AFT model assumes that the disease either accelerates or decelerates the rate of decline of the survival function, so if S1(t) and S2(t) denote the survival functions in the presence and absence of the disease, respectively, then S1(t)=S2(gt), where g is the acceleration factor for comparing the survival times of the two groups. This means that if the median time to the event is of interest, the AFT model suggests that the median time for those with the disease is g times the median time for those without the disease (120). In contrast to HRs, the effects of individual predictors in the AFT model are expressed as time ratios (TR), where the ratio denotes the acceleration factor. If TR>1, an event is less likely to occur, i.e., we must wait longer for the event to occur, while if TR<1, the event is more likely to occur (56). It is also noteworthy that the AFT model is identical to the PH model if the survival distribution is Weibull, and vice versa (124).
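A parametric AFT fit can be sketched with survreg() from the survival package, again using the assumed rotterdam-based variables purely for illustration; AIC can then guide the choice among candidate distributions:

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

aft_wei   <- survreg(Surv(years, death) ~ hormon + age, data = rot, dist = "weibull")
aft_lnorm <- survreg(Surv(years, death) ~ hormon + age, data = rot, dist = "lognormal")
AIC(aft_wei, aft_lnorm)     # lower AIC indicates the better-fitting distribution

exp(coef(aft_wei))          # time ratios (TR): TR > 1 means a longer time to the event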
Landmark analysis. An alternative approach for analyzing time-dependent covariates in time-to-event data is landmark analysis (125), where the estimated effect of the time-dependent covariate is based on its value at a landmark time point, after which the covariate may change its value. Landmark analysis refers to the practice of defining such a time point during the follow-up period and analyzing only those subjects who have survived until the landmark time (126), taking as the null hypothesis that survival from the landmark time point onwards is not affected by the response status at the landmark (127). This type of analysis is mainly used when the covariate of interest becomes known within a relatively short period after the start of follow-up, in contrast to the methods already discussed, which rely on measuring the covariate at baseline.
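A simple landmark analysis can be sketched as follows, using an arbitrary two-year landmark on the assumed rotterdam-based data; only subjects still event-free at the landmark are kept, and the time scale is reset to start at the landmark (in a real application, the covariate of interest would be its value at the landmark, which is not available in this illustration):

library(survival)

rot <- survival::rotterdam
rot$years <- rot$dtime / 365.25

landmark <- 2
lm_data  <- subset(rot, years > landmark)        # survivors at the landmark
lm_data$years_lm <- lm_data$years - landmark     # time measured from the landmark onwards

coxph(Surv(years_lm, death) ~ hormon, data = lm_data)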
Competing risks. In survival analysis, the presence of competing risks is important because, for a subject who may experience multiple types of events, the occurrence of one specific event can preclude the others, thus modifying the risk of the primary event. For example, when the primary event is death from cancer, death from any other cause, such as cardiovascular disease (CVD), is considered a competing risk against the main event, which it essentially excludes (an absorbing event). Therefore, the participant must be censored at the time of death from CVD, not only because they have not experienced the event of interest, but also because they are no longer at risk for it (46). This procedure, however, needs cautious handling, as censoring at the time of competing events may invalidate the general assumption of non-informative censoring (5, 128, 129). The key concerns in competing-risk settings are both the correct prediction of events and the discovery of the risk factors that influence event times. However, since the occurrence of competing events requires accounting for the regression covariates that lead to the specific types of failure, as well as for the correlations between these types, the application of conventional survival methods has no logical interpretation. This is because in the case of competing events, even if they are independent of each other, we are led to biased results and, more specifically, to an overestimation of the frequency of the outcome of interest, and thus of the beneficial effect of an experimental treatment, especially when the risk of the competing event is greater (12, 130).
In the case of competing risks, the hazard is not directly related to changes in the cumulative incidence function (CIF), as it is in the classical methodology with survival functions, where F(t) = 1 − S(t) = 1 − e^{−H(t)} (128, 131), except in situations where the event of interest is relatively rare (132). Therefore, alternative modeling methods are required, which contribute significantly to solving this problem and arise from a synthesis of multiple results (131, 133). There are two statistical methodologies used to model studies involving competing risks: the cause-specific hazard model and the subdistribution hazard model (13, 127, 134). The first model treats competing events other than the outcome of interest as censored; Cox regression analysis is then applied for the specific outcome, and the HR reflects the (cumulative) risk in the study sample. This method is used to examine the etiology of a disease in conjunction with the corresponding therapeutic effects. On the other hand, the subdistribution hazard model essentially considers subjects who experienced competing events as remaining in the risk set rather than being censored. This model examines the prognosis or prediction of a subject's risk (a prognostic perspective). In the case of comparing two groups, Gray's test (135) is used as an adaptation of the LR test developed for competing risks data. This test is based on the CIF, which is the probability of the event of interest occurring by a given time while accounting for the competing risks (136). In the multivariate setting, the Fine-Gray model (132, 137) is used to estimate subdistribution hazard ratios by modeling the effects of covariates on the CIF. Here, the subdistribution hazard ratios cannot be interpreted as standard hazard ratios, because subjects who experienced competing events remain included in the number at risk of having the outcome of interest.
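A hedged sketch of both quantities with the cmprsk package, on simulated data (the event coding 0 = censored, 1 = event of interest, 2 = competing event, and the variable names, are assumptions of this example):

library(cmprsk)

set.seed(1)
n <- 200
grp     <- rbinom(n, 1, 0.5)                       # hypothetical treatment indicator
ftime   <- rexp(n, rate = 0.10 + 0.05 * grp)       # simulated follow-up times
fstatus <- sample(0:2, n, replace = TRUE,          # 0 = censored, 1 = event of interest,
                  prob = c(0.3, 0.4, 0.3))         # 2 = competing event

ci <- cuminc(ftime, fstatus, group = grp)          # CIFs per group; Gray's test in ci$Tests
fg <- crr(ftime, fstatus, cov1 = cbind(grp),       # Fine-Gray model for the event of interest
          failcode = 1, cencode = 0)
summary(fg)                                        # subdistribution HRs via exp(coef)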
Frailty models. These are random-effects models for time-to-event data, in which the random effect (the frailty) has a multiplicative effect on the hazard, and they are used to incorporate heterogeneity between individuals (5, 138). This approach is particularly relevant both for the survival times of related individuals, such as twins, family members, or more generally clustered individuals, and for repeated events in the same individual (123). The frailty distribution can be constant (corresponding to no heterogeneity), inverse Gaussian, a member of an exponential family of power variance functions, or a gamma distribution, the latter being suitable when the dependence is more important for late events (139).
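A shared gamma frailty can be added to a Cox model through the frailty() term of the survival package; the sketch below uses the kidney infection data shipped with that package, in which recurrence times are clustered within patients (id):

library(survival)

frail_fit <- coxph(Surv(time, status) ~ age + sex + frailty(id), data = kidney)
summary(frail_fit)   # covariate effects plus the estimated frailty variance

# A mixed-effects alternative: coxme::coxme(Surv(time, status) ~ age + sex + (1 | id), data = kidney)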
Conclusion
In this review, basic and advanced concepts concerning survival analysis were presented. In particular, the most popular basic methods for estimating and comparing survival curves in the presence of censoring, such as KM curves, LR tests, and the Cox PH model, were described along with indicative clinical examples. However, in many real-world applications some of the key assumptions of these methods, such as the PH assumption, are not met, and their use becomes problematic. As presented in this review, on some occasions the PH assumption is doubtful, and its plausibility should be thoroughly examined. For example, different NPH patterns can occur, such as delayed treatment effects, crossing hazards, reduced treatment effects over time, as well as the appearance of long-term survivors. In this direction, graphical diagnostics and GOF tests based on the scaled Schoenfeld residuals can provide evidence for or against the proportionality assumption. When the PH assumption is not satisfied, the use of advanced methods, which are extensions, variations, and/or alternatives of the basic methods mentioned earlier, becomes necessary. These methods include stratified and extended Cox models, the Yang-Prentice model, cure-rate models, conditional survival models, parametric models, frailty models, competing-risks models, and NPH tests, such as the WLR test, Max-Combo test, WKM test, and RMST. However, each of these models/tests has its advantages and drawbacks. For example, stratified Cox models cannot provide statistical information concerning the stratified variables, whereas in extended Cox models the choice of a proper time function depends strongly on the information available. In addition, the Yang-Prentice model has been shown to often exhibit inflated type I error, while cure rate models depend strongly on adequate follow-up. On the other hand, in some models/tests, specific quantities must be pre-specified because they can greatly affect performance, such as the follow-up boundary time in the RMST method, the weights in the Max-Combo test, and the distribution of survival times in the parametric models.
In summary, it is important to consider the initial assumptions of each model and to recognize that, if the assumptions are not met, the results may be misleading. The choice of method should depend on the desired outcome and should match the data and the corresponding type of NPH, rather than simply selecting the test that yields a statistically significant result for the prediction of interest. A proper understanding and management of both the available survival methods and the data sets is imperative for the correct application and interpretation of results, leading to optimized decision-making for the selection of the most appropriate treatment.
Footnotes
Authors’ Contributions
Conceptualization, G.B., and A.I.; formal analysis, G.B., and A.I.; investigation, G.B., and A.I.; writing – original draft, G.B. and A.I.; writing – review & editing, G.B., and A.I.; Supervision, I.P.
Conflicts of Interest
The Authors declare no conflicts of interest in relation to this study.
Funding
This research received no external funding.
- Received December 21, 2023.
- Revision received January 12, 2024.
- Accepted January 15, 2024.
- Copyright © 2024 International Institute of Anticancer Research (Dr. George J. Delinasios), All rights reserved.
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY-NC-ND) 4.0 international license (https://creativecommons.org/licenses/by-nc-nd/4.0).