Test for the significance of difference in lifespan
In a lifespan study, comparing survival functions between experimental and control groups is essential for determining the efficacy of experimental treatments such as genetic manipulation, dietary intervention, or drug treatment. To compare survival functions systematically, we need to examine several statistics in the survival datasets, because different conditions may increase or decrease lifespan in different ways. For example, some conditions increase only the average lifespan, whereas others increase both the average and the maximum lifespan. Therefore, statistics for the overall lifespan are compared using the log-rank test, whereas those at a specific time point are compared using Fisher's exact test. By comparing these statistics, we can infer whether a condition reduces mortality caused by mid-life diseases or slows down fundamental processes of ageing.
1. Log-rank test
The Mantel-Cox test, also known as the log-rank test, is a nonparametric test that is frequently used to compare two survival functions over the entire lifespan data. The log-rank statistic for two groups, such as experiment and control, is calculated as follows.
$$\chi^2 = \frac{\left[\sum_i (d_{1i} - e_{1i})\right]^2}{\sum_i v_{1i}}$$
where $d_{1i}$ is the number of deaths in group 1 during the ith interval, and $e_{1i}$ (estimated as $d_i n_{1i}/n_i$, with $d_i$ the total number of deaths in both groups) is the number of expected deaths in group 1. $n_{1i}$ is the size of the population of group 1 at risk during the ith interval, $n_i$ is the total size of the population at risk during the ith interval, and $v_{1i}$ is the variance of $d_{1i}$ under the null hypothesis.
A P-value of 0.00E+00 is reported when $P < 1.0 \times 10^{-10}$.
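As a rough illustration, the log-rank statistic described above can be sketched in plain Python. This is a minimal sketch and not the OASIS 2 implementation: the function name `logrank_test`, the `(time, event)` input format, and the use of the hypergeometric variance in the denominator are assumptions made for this example. The P-value is taken from the chi-square distribution with one degree of freedom.

```python
import math


def logrank_test(group1, group2):
    """Two-sample log-rank test (illustrative sketch).

    Each group is a list of (time, event) pairs, where event is 1 for a
    death and 0 for a censored observation (assumed input format).
    Returns (chi_square, p_value).
    """
    pooled = group1 + group2
    death_times = sorted({t for t, event in pooled if event == 1})
    obs_minus_exp = 0.0  # running sum of observed minus expected deaths in group 1
    variance = 0.0       # running sum of hypergeometric variances
    for t in death_times:
        n1 = sum(1 for time, _ in group1 if time >= t)  # group 1 at risk
        n = sum(1 for time, _ in pooled if time >= t)   # total at risk
        d1 = sum(1 for time, event in group1 if time == t and event == 1)
        d = sum(1 for time, event in pooled if time == t and event == 1)
        obs_minus_exp += d1 - d * n1 / n                # expected deaths = d * n1 / n
        if n > 1:
            variance += n1 * (n - n1) * d * (n - d) / (n * n * (n - 1))
    chi_square = obs_minus_exp ** 2 / variance
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value


# Hypothetical toy data: every animal dies, no censoring.
control = [(1, 1), (2, 1)]
treated = [(3, 1), (4, 1)]
chi_square, p_value = logrank_test(control, treated)
print(f"chi-square = {chi_square:.3f}, P = {p_value:.3f}")
```

The sketch recounts the risk set at every death time, which keeps the code close to the formula but is slow for large datasets.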
2. Weighted log-rank test
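One may want to place more emphasis on earlier deaths than on later ones, or vice versa. To generalize the log-rank test for these needs, Fleming and Harrington developed the G(rho, gamma)-weighted log-rank test. The weighted test statistic is calculated by the following equation.
$$\chi^2_w = \frac{\left[\sum_i w_i (d_{1i} - e_{1i})\right]^2}{\sum_i w_i^2 v_{1i}}$$
where $d_{1i}$ and $e_{1i}$ are the observed and expected numbers of deaths in group 1 during the ith interval, $v_{1i}$ is the variance of $d_{1i}$ under the null hypothesis, and the weight $w_i$ varies according to the type of test.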
- Wilcoxon-Breslow-Gehan Test
The Wilcoxon test statistic is constructed by weighting the contribution of each failure time to the overall test statistic by the number of subjects at risk. Thus, it gives heavier weights to earlier failure times, when the number at risk is higher. As a result, this test is susceptible to differences in the censoring patterns of the groups.
- Tarone-Ware Test
This test was suggested by Tarone and Ware (1977), with weights equal to the square root of the number of subjects in the risk pool at time $t_i$.
Like Wilcoxon’s test, this test is appropriate when hazard functions are thought to vary in ways other than proportionally and when censoring patterns are similar across groups. The test statistic is constructed by weighting the contribution of each failure time to the overall test statistic by the square root of the number of subjects at risk. Thus, like the Wilcoxon test, it gives heavier weights, although not as large, to earlier failure times. Although less susceptible to the failure and censoring pattern in the data than Wilcoxon’s test, this could remain a problem if large differences in these patterns exist between groups.
- Peto-Peto-Prentice Test
The test uses as the weight function an estimate of the overall survivor function, which is similar to that obtained using the Kaplan–Meier estimator. This test is appropriate when hazard functions are thought to vary in ways other than proportionally, but unlike the Wilcoxon–Breslow–Gehan test, it is not affected by differences in censoring patterns across groups.
- Fleming-Harrington Test
The G(rho, gamma) weight is defined as $S(t)^{\rho}(1-S(t))^{\gamma}$. Generally, if rho > 0 and gamma = 0, the test is sensitive to early differences, whereas if rho = 0 and gamma > 0, the test is sensitive to later differences.
A P-value of 0.00E+00 is reported when $P < 1.0 \times 10^{-10}$.
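The weighting schemes listed above differ only in the weight applied at each death time, which the following sketch tries to make explicit. It is an illustration under stated assumptions, not the OASIS 2 code: the function name `weighted_logrank`, the `kind` labels, and the `(time, event)` input format are hypothetical. The Peto-Peto weight is taken as the modified survivor estimate, the running product of 1 - d/(n + 1), and the Fleming-Harrington weight uses the pooled Kaplan-Meier estimate just before each death time. Both are common conventions but remain assumptions here.

```python
import math


def weighted_logrank(group1, group2, kind="logrank", rho=0.0, gamma=0.0):
    """Weighted log-rank test (illustrative sketch).

    Each group is a list of (time, event) pairs with event 1 = death and
    0 = censored (assumed input format). Returns (chi_square, p_value).
    """
    pooled = group1 + group2
    death_times = sorted({t for t, event in pooled if event == 1})
    km_before = 1.0  # pooled Kaplan-Meier estimate just before the current time
    peto = 1.0       # Peto-Peto modified survivor estimate
    numerator = 0.0
    variance = 0.0
    for t in death_times:
        n1 = sum(1 for time, _ in group1 if time >= t)
        n = sum(1 for time, _ in pooled if time >= t)
        d1 = sum(1 for time, event in group1 if time == t and event == 1)
        d = sum(1 for time, event in pooled if time == t and event == 1)
        peto *= 1.0 - d / (n + 1.0)
        if kind == "logrank":
            w = 1.0
        elif kind == "wilcoxon":            # Wilcoxon-Breslow-Gehan
            w = float(n)
        elif kind == "tarone-ware":
            w = math.sqrt(n)
        elif kind == "peto":                # Peto-Peto-Prentice
            w = peto
        elif kind == "fleming-harrington":  # S^rho * (1 - S)^gamma
            w = km_before ** rho * (1.0 - km_before) ** gamma
        else:
            raise ValueError(f"unknown weight type: {kind}")
        numerator += w * (d1 - d * n1 / n)
        if n > 1:
            variance += w * w * n1 * (n - n1) * d * (n - d) / (n * n * (n - 1))
        km_before *= 1.0 - d / n
    chi_square = numerator ** 2 / variance
    p_value = math.erfc(math.sqrt(chi_square / 2.0))  # chi-square, 1 df
    return chi_square, p_value


# Hypothetical toy data: every animal dies, no censoring.
control = [(1, 1), (2, 1)]
treated = [(3, 1), (4, 1)]
for kind in ("logrank", "wilcoxon", "tarone-ware", "peto"):
    print(kind, weighted_logrank(control, treated, kind=kind))
```

With `kind="fleming-harrington"` and `rho = gamma = 0`, every weight equals one and the sketch reduces to the unweighted log-rank test.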
Maximal Lifespan comparison
OASIS 2 has a new feature for properly quantifying differences in maximal lifespan between datasets. Maximal lifespan is characterized by the upper percentiles of the lifespan distribution, in contrast with mean lifespan. Maximal lifespan could be determined by the "fundamental process of aging", whereas mean lifespan changes with various conditions such as diseases. This is of interest because an increase in maximal lifespan may indicate that an intervention is slowing the general processes of aging and not merely retarding the development of specific diseases. Thus, it is useful to detect differences in maximal lifespan, as opposed to simple “curve squaring”, which can be induced by increasing mean or median lifespan without increasing maximal lifespan.
3. Boschloo's test
For the comparison of maximal lifespan, Boschloo's test is used to compare the fractions of longevity outliers. The null hypothesis of Boschloo's test (H0,A) is that the fraction of outliers, that is, individuals that live longer than a specific time point, is the same in populations A and B. The equation of H0,A is:
$$H_{0,A}: P(L(x) > \tau \mid x \in A) = P(L(x) > \tau \mid x \in B)$$
where x is an observation from a population, L(x) is the survival time of x, and τ denotes a threshold chosen by the investigator that serves as the criterion for a specific time point. OASIS 2 provides the 25th, 50th, 75th, and 90th percentiles as thresholds τ.
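The comparison of outlier fractions can be sketched with SciPy's `boschloo_exact` (available in SciPy 1.7 and later). This is a minimal sketch under assumptions that are not stated in the text: lifespans are uncensored, τ is taken as a percentile of the pooled data, and the helper name `boschloo_outlier_test` is hypothetical.

```python
import numpy as np
from scipy.stats import boschloo_exact


def boschloo_outlier_test(lifespans_a, lifespans_b, percentile=90):
    """Compare the fractions of individuals living longer than tau.

    Assumes uncensored lifespans; tau is the given percentile of the
    pooled data (an assumption made for this sketch).
    Returns (tau, table, p_value).
    """
    a = np.asarray(lifespans_a, dtype=float)
    b = np.asarray(lifespans_b, dtype=float)
    tau = float(np.percentile(np.concatenate([a, b]), percentile))
    above_a = int(np.sum(a > tau))
    above_b = int(np.sum(b > tau))
    # Columns are the two populations; rows are "above tau" and "not above tau".
    table = [[above_a, above_b],
             [int(a.size) - above_a, int(b.size) - above_b]]
    result = boschloo_exact(table, alternative="two-sided")
    return tau, table, float(result.pvalue)


# Hypothetical lifespans in days.
population_a = list(range(10, 30))
population_b = list(range(20, 40))
tau, table, p_value = boschloo_outlier_test(population_a, population_b, percentile=75)
print(tau, table, p_value)
```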
4. Mann-Whitney U test
The modified Mann-Whitney U test can detect differences in the distribution tails of survival data that affect maximal lifespan, as well as differences in the proportion of longevity outliers. The null hypothesis of the test (H0,AB) is a compound of H0,A and another null hypothesis (H0,B). H0,B is that the outliers have the same average survival time in the two populations. The equation of the compound null hypothesis (H0,AB) is:
$$H_{0,AB}: \mu_{Z,A} = \mu_{Z,B}, \quad Z = I(L(x) > \tau)\,L(x)$$
where I(i) is the indicator function, which takes the value one if i is true and zero otherwise. The Mann-Whitney U test is applied to compare the average (μ) of the Z variables between the two populations.
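A minimal sketch of this construction follows: each lifespan is kept if it exceeds τ and replaced by zero otherwise, and the resulting Z values are compared with SciPy's `mannwhitneyu`. The helper name `tail_mann_whitney`, the use of a pooled percentile for τ, and the restriction to uncensored lifespans are assumptions of this example.

```python
import numpy as np
from scipy.stats import mannwhitneyu


def tail_mann_whitney(lifespans_a, lifespans_b, percentile=90):
    """Mann-Whitney U test on Z = I(L > tau) * L (illustrative sketch).

    Assumes uncensored lifespans; tau is the given percentile of the
    pooled data (an assumption made for this sketch).
    Returns (tau, z_a, z_b, p_value).
    """
    a = np.asarray(lifespans_a, dtype=float)
    b = np.asarray(lifespans_b, dtype=float)
    tau = float(np.percentile(np.concatenate([a, b]), percentile))
    # Keep the lifespan of each outlier, set everything else to zero.
    z_a = np.where(a > tau, a, 0.0)
    z_b = np.where(b > tau, b, 0.0)
    _, p_value = mannwhitneyu(z_a, z_b, alternative="two-sided")
    return tau, z_a, z_b, float(p_value)


# Hypothetical lifespans in days.
population_a = list(range(10, 30))
population_b = list(range(20, 40))
tau, z_a, z_b, p_value = tail_mann_whitney(population_a, population_b, percentile=75)
print(tau, p_value)
```

Because most Z values are zero, the data contain many ties, so the P-value comes from the tie-corrected normal approximation rather than an exact distribution.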
5. Fisher's exact test
Fisher's exact test is frequently used in survival analysis. To test for differences between survival functions at a specific time point instead of over the overall lifespan, the program calculates the probability of the observed data with Fisher's exact test at different time points using the following formula.
$$p_t = \frac{(a+b)!\,(c+d)!\,(a+c)!\,(b+d)!}{n!\,a!\,b!\,c!\,d!}$$
where a and b are the numbers of living subjects in group 1 and group 2, respectively, c and d are the numbers of dead subjects in group 1 and group 2, respectively, at a specific time t, and n = a + b + c + d. The P-value of Fisher's exact test is calculated as the sum of the probabilities, over all possible combinations, that are less than or equal to $p_t$. Generally, 90% mortality is used as the time point for Fisher's exact test. However, in some cases, comparisons between two datasets show no statistically significant difference for several reasons, including drastic death at old age. This suggests that one may want to place more emphasis on earlier deaths than on later ones, because later deaths might result from causes unrelated to normal ageing.
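The calculation can be sketched with the standard library alone. The probability of a single table is the hypergeometric probability, and the two-sided P-value sums the probabilities of all tables with the same margins that are no more likely than the observed one. The function names are hypothetical, and the small relative tolerance in the comparison is an assumption added to guard against floating-point rounding.

```python
from math import comb


def table_probability(a, b, c, d):
    """Probability of one 2x2 table with fixed margins.

    a, b: living subjects in groups 1 and 2; c, d: dead subjects in
    groups 1 and 2. Equal to (a+b)!(c+d)!(a+c)!(b+d)! / (n! a! b! c! d!).
    """
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided P-value: sum of table probabilities <= the observed one."""
    p_observed = table_probability(a, b, c, d)
    living, dead, group1 = a + b, c + d, a + c
    p_value = 0.0
    # x is the number of living subjects in group 1; all margins stay fixed.
    for x in range(max(0, group1 - dead), min(living, group1) + 1):
        p = table_probability(x, living - x, group1 - x, dead - (group1 - x))
        if p <= p_observed * (1 + 1e-9):
            p_value += p
    return min(p_value, 1.0)


# Hypothetical counts at one time point: 3 alive and 1 dead in group 1,
# 1 alive and 3 dead in group 2.
print(fisher_exact_two_sided(3, 1, 1, 3))
```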
6. Kolmogorov-Smirnov test
While the log-rank test is commonly used for comparing survival data between samples, it is optimized for a special assumption on the underlying distributions, namely that the hazard ratio (relative risk) $h_1(t)/h_2(t)$ is constant in time t. In that case, the log-rank test generally gives optimal results. For the general situation, however, a test that does not depend on a special underlying distribution is needed. The Kolmogorov-Smirnov test is an appropriate solution for this purpose because it works robustly even when the hazard functions $h_1(t)$ and $h_2(t)$ cross over time t. The Kolmogorov-Smirnov test is based on the following equation.
$$D = \sup_t \left| S_1(t) - S_2(t) \right|$$
where sup represents the supremum of a set, that is, the smallest real number that is greater than or equal to every number in the set, $S_1(t)$ and $S_2(t)$ are the survival functions of the two samples, and D represents the largest absolute vertical deviation between them. OASIS 2 adopted the surv2.ks function implemented in R packages (R Development Core Team and contributors worldwide, 2008) to provide the Kolmogorov-Smirnov test. We note that the Kolmogorov-Smirnov test in OASIS 2 is not applicable to survival data that contain tied observations [e.g. multiple events (death or failure) during an observed time interval]. OASIS 2 provides a warning message if there is any tied observation in the survival data within or between samples.
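To make the largest vertical deviation D concrete, the following sketch computes it for two samples of uncensored lifespans. It is not a reimplementation of surv2.ks, which handles censored data. The function name `ks_statistic` is hypothetical, and the empirical survival fraction is used in place of a censoring-aware estimate.

```python
def ks_statistic(lifespans1, lifespans2):
    """Largest absolute gap between two empirical survival functions.

    Assumes uncensored lifespans without ties (sketch only).
    """
    n1, n2 = len(lifespans1), len(lifespans2)
    largest_gap = 0.0
    # Both step functions change only at observed times, so checking the
    # gap at each observed time covers every possible value of the gap.
    for t in sorted(set(lifespans1) | set(lifespans2)):
        s1 = sum(1 for x in lifespans1 if x > t) / n1  # fraction alive after t
        s2 = sum(1 for x in lifespans2 if x > t) / n2
        largest_gap = max(largest_gap, abs(s1 - s2))
    return largest_gap


# Hypothetical lifespans in days.
print(ks_statistic([1, 3, 5, 7], [2, 4, 6, 8]))
print(ks_statistic([1, 2, 3, 4], [5, 6, 7, 8]))
```

The first pair of samples is interleaved, so the curves stay close, whereas the second pair does not overlap at all and reaches the maximum possible deviation.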
7. Neyman's smooth test
Another test capable of detecting a wide spectrum of alternatives is Neyman's smooth test. It was developed to test the homogeneity of two survival datasets by comparing a null model, S1(t) = S2(t), with various alternative models. The alternative models embed the null model using Legendre polynomials, based on Neyman's goodness-of-fit idea, as in the following equation.
$$h_2(t) = h_1(t)\,\exp\{\theta^{T}\psi(t)\}$$
where $h_1(t)$ and $h_2(t)$ are the hazard functions of the two samples, and $\theta$ is the parameter vector for a set of bounded functions $\psi(t) = (\psi_1(t), \ldots, \psi_d(t))$ that model the possible difference between S1(t) and S2(t). Therefore, if $\theta = 0$, the null hypothesis is accepted. Because Neyman's smooth test selects the optimal smooth model in Legendre polynomials with Schwarz's selection rule, it differs from the Kolmogorov-Smirnov test in that it gives an indication of the type of difference between two survival datasets. The selected dimension represents the type of difference between S1(t) and S2(t). If the selected dimension (d) is 1, S1(t) and S2(t) are likely to differ by a constant hazard ratio. If the selected d is 2, the relationship between the two samples is likely to be monotonic. If the selected d is 3, the relationship between the two samples is likely to have a convex or concave form. OASIS 2 adopted the surv2.neyman function implemented in R packages to provide Neyman's smooth test. Like the Kolmogorov-Smirnov test, Neyman's smooth test is currently not applicable when there are tied observations in the survival data.
8. Chow test
The Chow test, a variant of the F-test, was invented by the economist Gregory Chow to test whether the coefficients of two linear regressions on different datasets are the same. This test is generally used for detecting a structural break, that is, an unexpected shift in time series data. In OASIS 2, we use this analysis to detect structural differences between two log cumulative hazard functions by using the following equation.
$$F = \frac{\left[RSS_p - (RSS_1 + RSS_2)\right]/k}{(RSS_1 + RSS_2)/(N_1 + N_2 - 2k)}$$
where $RSS_p$ represents the sum of squared residuals from the pooled log cumulative hazard data, and $RSS_1$ and $RSS_2$ represent the sums of squared residuals from the two log cumulative hazard datasets, respectively. $N_1$ and $N_2$ are the numbers of observations in each dataset, and k is 3, which is the total number of parameters of the linear regression model. The test statistic follows the F distribution with k and $N_1 + N_2 - 2k$ degrees of freedom.
OASIS 2 adopted the chow.py function implemented by Dr. Ernesto P. Adorio (http://adorio-research.org/wordpress/?p=1789).
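The Chow statistic itself is a short calculation once the residual sums of squares are known. The sketch below is not the chow.py code referenced above. The helper names `line_rss` and `chow_f` are hypothetical, the regression is assumed to be an ordinary least-squares straight line, and the number of coefficients k is passed in explicitly so that the value stated in the text can be supplied as is. Only the F statistic is computed; the P-value would come from the F distribution with k and N1 + N2 - 2k degrees of freedom.

```python
def line_rss(xs, ys):
    """Residual sum of squares of an ordinary least-squares straight line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    return sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))


def chow_f(rss_pooled, rss_1, rss_2, n1, n2, k):
    """Chow F statistic from the pooled and per-group residual sums."""
    numerator = (rss_pooled - (rss_1 + rss_2)) / k
    denominator = (rss_1 + rss_2) / (n1 + n2 - 2 * k)
    return numerator / denominator


# Hypothetical time points and log cumulative hazard values for two groups.
x1, y1 = [1, 2, 3, 4, 5], [-3.1, -2.4, -2.0, -1.4, -1.1]
x2, y2 = [1, 2, 3, 4, 5], [-3.0, -2.0, -1.2, -0.1, 0.9]
k = 3  # value stated in the text; adjust to the coefficients actually fitted
f_statistic = chow_f(line_rss(x1 + x2, y1 + y2), line_rss(x1, y1),
                     line_rss(x2, y2), len(x1), len(x2), k)
print(f_statistic)
```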
9. Survival time F-test
We provide the survival time F-test, which examines whether two normal populations have the same variance. Because censored data are common in survival analysis, one can estimate the number of dead animals using the survival function S(t) and then perform the F-test to compare the variances of two survival datasets.
The F-test is valid under the condition that individual survival times follow a normal distribution. As a normality check for a given dataset, we provide the Shapiro-Wilk test on the OASIS 2 website. If the P-value from the Shapiro-Wilk test is smaller than 0.01, the hypothesis that the survival data follow a normal distribution is rejected at the 0.01 significance level, and the results of the F-test are therefore not applicable. OASIS 2 provides a warning message in this case.
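The two steps, a normality check followed by the variance-ratio F-test, can be sketched with SciPy. This is an illustration under assumptions: the helper name `variance_f_test` is hypothetical, the inputs are individual uncensored survival times (the estimation of deaths from S(t) mentioned above is not reproduced), and the two-sided P-value is taken as twice the smaller tail probability.

```python
import numpy as np
from scipy import stats


def variance_f_test(times1, times2, normality_alpha=0.01):
    """Variance-ratio F-test with a Shapiro-Wilk normality check (sketch).

    Returns (f_statistic, p_value, normality_ok). normality_ok is False
    when either sample fails the Shapiro-Wilk test at normality_alpha,
    in which case the F-test result should not be relied upon.
    """
    x = np.asarray(times1, dtype=float)
    y = np.asarray(times2, dtype=float)
    normality_ok = True
    for sample in (x, y):
        _, p_normal = stats.shapiro(sample)
        if p_normal < normality_alpha:
            normality_ok = False
    f_statistic = float(np.var(x, ddof=1) / np.var(y, ddof=1))
    df1, df2 = x.size - 1, y.size - 1
    # Two-sided P-value: twice the smaller tail of the F distribution.
    tail = min(stats.f.cdf(f_statistic, df1, df2), stats.f.sf(f_statistic, df1, df2))
    p_value = float(min(2.0 * tail, 1.0))
    return f_statistic, p_value, normality_ok


# Hypothetical survival times in days.
f_statistic, p_value, normality_ok = variance_f_test([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
if not normality_ok:
    print("Warning: normality rejected; the F-test result is not applicable.")
print(f_statistic, p_value)
```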
10. Partial slopes Ranksum test
We devised another statistical test for comparing the slopes of two log cumulative hazard plots. We calculate the partial slopes of each log cumulative hazard plot and, under the null hypothesis that the two plots have the same slope, conduct a rank-sum test on the sets of partial slopes, defined as follows.
$$D_g = \left\{ \frac{\ln H_g(t_{i+1}) - \ln H_g(t_i)}{t_{i+1} - t_i} \right\}, \quad g = 1, 2$$
where $H_g(t)$ is the cumulative hazard of group g, $t_i$ and $t_{i+1}$ are neighbouring observed time points, and $D_1$ and $D_2$ are the sets of partial slopes of each group. These sets are compared with the rank-sum test.
The partial slopes rank-sum test is based on non-parametric statistics and requires a sufficient number of samples (in this case, partial slopes) for reliable analysis. Because a partial slope is defined as the change in log cumulative hazard divided by the corresponding change in survival time between two neighbouring time points, the number of observed time points, rather than the total sample size, is important for the non-parametric analysis. To obtain statistically meaningful results, at least six observed time points are needed.
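A minimal sketch of the partial slopes rank-sum test follows, using SciPy's `ranksums`. The helper names are hypothetical, and the cumulative hazard is assumed to be obtained from the survival fraction as H(t) = -ln S(t), which requires every S(t) to lie strictly between 0 and 1. The check for at least six observed time points mirrors the requirement stated in the text.

```python
import math
from scipy.stats import ranksums


def partial_slopes(times, survival):
    """Slopes of the log cumulative hazard between neighbouring time points.

    Assumes H(t) = -ln S(t) with 0 < S(t) < 1 at every time point.
    """
    if len(times) < 6:
        raise ValueError("at least six observed time points are needed")
    log_hazard = [math.log(-math.log(s)) for s in survival]
    return [(log_hazard[i + 1] - log_hazard[i]) / (times[i + 1] - times[i])
            for i in range(len(times) - 1)]


def partial_slopes_ranksum(times1, survival1, times2, survival2):
    """Rank-sum test on the two sets of partial slopes; returns the P-value."""
    slopes1 = partial_slopes(times1, survival1)
    slopes2 = partial_slopes(times2, survival2)
    _, p_value = ranksums(slopes1, slopes2)
    return slopes1, slopes2, float(p_value)


# Hypothetical survival curves whose log cumulative hazards are straight
# lines with slopes 0.5 and 0.8.
times = [1, 2, 3, 4, 5, 6, 7]
survival1 = [math.exp(-math.exp(0.5 * t - 3)) for t in times]
survival2 = [math.exp(-math.exp(0.8 * t - 3)) for t in times]
slopes1, slopes2, p_value = partial_slopes_ranksum(times, survival1, times, survival2)
print(p_value)
```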
11. Normalized Chow test
The Chow test is used for testing whether the coefficients of two linear regressions on different datasets are the same. However, researchers who perform survival analysis tend to be more interested in differences in slope than in differences in y-intercept. For this purpose, before conducting the Chow test, we normalize the log cumulative hazard data to have a mean of zero. In this case, the linear regression of each dataset has a zero y-intercept. Thus, one can test the difference between the slope of each dataset and that of the pooled data.
We verify differences in lifespan variation through the normalized Chow test, a statistical test that examines whether the coefficients of two linear regressions on different normalized datasets are equal. As with the log-rank test, applying the normalized Chow test assumes that the survival rate is constant over time.