Cox proportional hazards with correlated variables

survival-analysis

Published

January 22, 2024

1 Introduction

Cox proportional hazards (CoxPH) model is a common approach to study the occurrence of an event as a factor of time. In medical studies, CoxPH is used to model patient survival based on disease type, gene expression, or treatment with a new drug. But CoxPH can be also used to analyze real-world business data. CoxPH analysis enables you to answer questions like:

Scientific questions:
Do female patients live longer than males?
Does drug A increase the overall patient survival time?
Does the high levels of Gene A significantly correlate with the survival outcome?
Business questions:
Do women customers spend more time in my store than men?
Which of my employees retain clients for the longest time?
Does the age of my clients correlate with their long term brand loyalty?
What is the risk of failure in the production line with the new machine compared to the old one?

One of the most powerful features of CoxPH analysis is that it allows including multiple predictors in the model, and these predictors (variables) can have different types. For instance, you can analyze continuous variables (age, gene expression, etc) along with categorical variables (male-vs-female, drug-vs-control, etc). You can also model interactions between two predictors to study complex relationships between predictors (this could be useful if you are trying to understand whether or not a drug can work function synergistically in males, for instance).

I’m in the middle of analyzing a skin cancer data set right now, and I’m trying to understand which genes correlate with the survival outcome. There are several other variables that correlate with my genes of interest (also known as problem of co-linearity), and these tend to have a bias for long-vs-short term survival. Let’s simulate a situation like this to understand how to interpret CoxPH results when there is co-linear variables.

2 Simulate data

We will create a dataset with 500 observations which have survival times (time variable) between 50 and 1000 days. The death variable indicates whether or not the patient is dead at the time of measurement. 1 indicates the patient has passed, and the 0 indicates patient is still alive at that time point, but we don’t have any further information. This is called censoring, and it can happen if the patient visited the hospital at some point (ie. alive) but didn’t came back, for instance.

gene variable in the data frame shows the expression levels of an imaginary gene that is associated with increased death.

Code

# Set seed for reproducibility
set.seed(123)

# Number of observations
n <- 500

# Max time for survival measurement
maxtime <- 1000


# Simulate data frame. 
# Sort the death and gene variables to create easy correlation
# ie higher gene expressors are dying more 
df <- data.frame(death = sort(rbinom(n, 1, 0.75)), # 0(25%)-1(75%) distribution
                 time = runif(n=n, min = 50, max=maxtime),
                 gene = sort(rnorm(n, mean=4, sd=1))+rnorm(n, mean=0, sd=1))

Let’s add some positively and negatively correlated genes now. I will do that by adding some noise to the original gene variable. I created 4 different correlated variables:

gene_high_cor: High positive correlation
gene_low_cor: Low positive correlation
gene_inv_high_cor: High negative (inverse) correlation
gene_inv_high_cor: Low negative correlation

Code

# Create high/low correlated genes
df$gene_high_cor <- df$gene + rnorm(n, mean=0, sd=1)
df$gene_low_cor <- df$gene + rnorm(n, mean=0, sd=3)

# Create genes with high/low inverse correlation 
df$gene_inv_high_cor <- - 1* (df$gene + rnorm(n, mean=0, sd=1))
df$gene_inv_low_cor <- -1 * (df$gene + rnorm(n, mean=0, sd=3))


cor(df[, grepl("gene", colnames(df))])

                        gene gene_high_cor gene_low_cor gene_inv_high_cor
gene               1.0000000     0.7871349    0.4808360        -0.8273054
gene_high_cor      0.7871349     1.0000000    0.3757030        -0.6695418
gene_low_cor       0.4808360     0.3757030    1.0000000        -0.3654133
gene_inv_high_cor -0.8273054    -0.6695418   -0.3654133         1.0000000
gene_inv_low_cor  -0.4356404    -0.3487226   -0.2231452         0.3734353
                  gene_inv_low_cor
gene                    -0.4356404
gene_high_cor           -0.3487226
gene_low_cor            -0.2231452
gene_inv_high_cor        0.3734353
gene_inv_low_cor         1.0000000

Code

head(df, 3)

  death     time      gene gene_high_cor gene_low_cor gene_inv_high_cor
1     0 385.9258 0.1944266   -0.62656010   -1.3403846         0.4843810
2     0 398.1194 0.3583453    0.05108805    1.0691589        -0.9326580
3     0 322.7451 1.4741020    0.57200395   -0.1506656        -0.7695874
  gene_inv_low_cor
1        0.2564958
2        0.6249261
3        2.8703939

Code

tail(df, 3)

    death     time     gene gene_high_cor gene_low_cor gene_inv_high_cor
498     1 421.9238 7.594208      7.871655     6.963697         -7.165028
499     1 724.1009 7.773971      8.310827    12.127814         -8.962983
500     1 153.3829 6.528585      6.068099     8.453239         -7.362879
    gene_inv_low_cor
498        -5.396508
499        -3.626073
500       -12.044975

Code

plot(df)

3 Univariable CoxPH analysis

Next step is using CoxPH. Details about the fit is given under specific columns:

coef: Regression coefficients in the log odds scale. Positive values indicate increased risk of death.
exp(coef): Hazard ratio (HR) which is equal to the exponentiated coefficient (\(e^{coef}\)). Above 1 indicates increased risk (HR=1.2 means 20% higher risk; whereas HR=0.8 means 20% reduced risk).
se(coef): Standard error of the coefficient estimate
z: Wald-statistic which is equal to coef/se(coef)
p: P-value of the estimates
Likelihood ratio test: Indicates the global p-values for rejecting the null hypothesis (ie. all of the coefficients in the model is zero) . You can see other statistics about the model by calling summary() function on the CoxPH results.

Code

library(survival)

coxph(Surv(time,death) ~ gene, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene, data = df)

        coef exp(coef) se(coef)     z        p
gene 0.21486   1.23969  0.03554 6.045 1.49e-09

Likelihood ratio test=35.84  on 1 df, p=2.138e-09
n= 500, number of events= 381

Code

coxph(Surv(time,death) ~ gene_high_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_high_cor, data = df)

                 coef exp(coef) se(coef)     z        p
gene_high_cor 0.15365   1.16609  0.03209 4.788 1.68e-06

Likelihood ratio test=22.76  on 1 df, p=1.839e-06
n= 500, number of events= 381

Code

coxph(Surv(time,death) ~ gene_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_low_cor, data = df)

                coef exp(coef) se(coef)     z        p
gene_low_cor 0.05600   1.05760  0.01629 3.437 0.000589

Likelihood ratio test=11.72  on 1 df, p=0.0006199
n= 500, number of events= 381

Code

coxph(Surv(time,death) ~ gene_inv_high_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_inv_high_cor, data = df)

                      coef exp(coef) se(coef)      z       p
gene_inv_high_cor -0.12229   0.88489  0.02815 -4.344 1.4e-05

Likelihood ratio test=18.66  on 1 df, p=1.56e-05
n= 500, number of events= 381

Code

coxph(Surv(time,death) ~ gene_inv_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_inv_low_cor, data = df)

                     coef exp(coef) se(coef)      z        p
gene_inv_low_cor -0.06808   0.93418  0.01635 -4.165 3.12e-05

Likelihood ratio test=17.45  on 1 df, p=2.946e-05
n= 500, number of events= 381

4 Multivariable CoxPH analysis

After seeing how the result looks like above, next I’d like to try survival modeling using other correlated predictors alone or in combination with each other.

4.1 gene + gene_high_cor

Code

# covariable in combo
coxph(Surv(time,death) ~ gene + gene_high_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene + gene_high_cor, data = df)

                 coef exp(coef) se(coef)     z       p
gene          0.20482   1.22731  0.05652 3.624 0.00029
gene_high_cor 0.01150   1.01157  0.05038 0.228 0.81939

Likelihood ratio test=35.9  on 2 df, p=1.604e-08
n= 500, number of events= 381

4.2 gene + gene_low_cor

Code

# covariable in combo
coxph(Surv(time,death) ~ gene + gene_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene + gene_low_cor, data = df)

                coef exp(coef) se(coef)     z        p
gene         0.20134   1.22304  0.04078 4.938 7.91e-07
gene_low_cor 0.01230   1.01237  0.01840 0.668    0.504

Likelihood ratio test=36.29  on 2 df, p=1.317e-08
n= 500, number of events= 381

4.3 gene_high_cor + gene_low_cor

Code

# covariables in combo
coxph(Surv(time,death) ~ gene_high_cor + gene_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_high_cor + gene_low_cor, 
    data = df)

                 coef exp(coef) se(coef)     z        p
gene_high_cor 0.13049   1.13939  0.03431 3.803 0.000143
gene_low_cor  0.03223   1.03276  0.01738 1.855 0.063579

Likelihood ratio test=26.19  on 2 df, p=2.056e-06
n= 500, number of events= 381

4.4 gene + gene_inv_high_cor

Code

# covariable in combo
coxph(Surv(time,death) ~ gene + gene_inv_high_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene + gene_inv_high_cor, 
    data = df)

                     coef exp(coef) se(coef)     z        p
gene              0.27951   1.32248  0.06439 4.341 1.42e-05
gene_inv_high_cor 0.06172   1.06366  0.05146 1.199     0.23

Likelihood ratio test=37.29  on 2 df, p=7.998e-09
n= 500, number of events= 381

4.5 gene + gene_inv_low_cor

Code

# covariable in combo
coxph(Surv(time,death) ~ gene + gene_inv_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene + gene_inv_low_cor, 
    data = df)

                     coef exp(coef) se(coef)      z        p
gene              0.18481   1.20299  0.04044  4.570 4.88e-06
gene_inv_low_cor -0.02845   0.97195  0.01842 -1.544    0.123

Likelihood ratio test=38.23  on 2 df, p=4.983e-09
n= 500, number of events= 381

4.6 gene_inv_high_cor + gene_inv_low_cor

Code

# covariables in combo
coxph(Surv(time,death) ~ gene_inv_high_cor + gene_inv_low_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_inv_high_cor + gene_inv_low_cor, 
    data = df)

                      coef exp(coef) se(coef)      z       p
gene_inv_high_cor -0.09271   0.91146  0.03033 -3.057 0.00224
gene_inv_low_cor  -0.04936   0.95184  0.01745 -2.829 0.00467

Likelihood ratio test=26.68  on 2 df, p=1.606e-06
n= 500, number of events= 381

4.7 gene_high_cor + gene_inv_high_cor

Code

# covariables in combo
coxph(Surv(time,death) ~ gene_high_cor + gene_inv_high_cor, data = df)

Call:
coxph(formula = Surv(time, death) ~ gene_high_cor + gene_inv_high_cor, 
    data = df)

                      coef exp(coef) se(coef)      z      p
gene_high_cor      0.10934   1.11554  0.04239  2.579 0.0099
gene_inv_high_cor -0.05931   0.94241  0.03714 -1.597 0.1103

Likelihood ratio test=25.31  on 2 df, p=3.194e-06
n= 500, number of events= 381

4.8 All covariables together

Code

formula_pred <- formula(
  paste("Surv(time,death) ~",
        paste(
          grep("gene", colnames(df), value  =T),
          collapse = " + ")
        )
  )

# covariables in combo
coxph(formula_pred, data = df)

Call:
coxph(formula = formula_pred, data = df)

                       coef exp(coef)  se(coef)      z      p
gene               0.217201  1.242594  0.085836  2.530 0.0114
gene_high_cor      0.015962  1.016091  0.050472  0.316 0.7518
gene_low_cor       0.009508  1.009553  0.018484  0.514 0.6070
gene_inv_high_cor  0.053296  1.054742  0.052402  1.017 0.3091
gene_inv_low_cor  -0.027322  0.973048  0.018442 -1.481 0.1385

Likelihood ratio test=39.8  on 5 df, p=1.635e-07
n= 500, number of events= 381

4.9 All covariables made random and modelled together

Note

You will see why I’m doing this in the next section.

Code

# make all covariables random
df[, 4:ncol(df)] <- rnorm(n*(ncol(df)-3), mean = 20, sd = 10)

# covariables in combo
coxph(formula_pred, data = df)

Call:
coxph(formula = formula_pred, data = df)

                        coef  exp(coef)   se(coef)      z        p
gene               2.172e-01  1.243e+00  3.580e-02  6.069 1.29e-09
gene_high_cor     -5.234e-03  9.948e-01  5.211e-03 -1.004    0.315
gene_low_cor      -9.997e-05  9.999e-01  5.430e-03 -0.018    0.985
gene_inv_high_cor -6.686e-03  9.933e-01  4.841e-03 -1.381    0.167
gene_inv_low_cor   3.418e-03  1.003e+00  5.076e-03  0.673    0.501

Likelihood ratio test=39.02  on 5 df, p=2.35e-07
n= 500, number of events= 381

5 Take home point

Not unexpectedly, when there are correlated variables in the survival model, the coefficients of individual predictors can change depending on what enters the model. Let’s try to digest this a bit. In our simulated example, variable gene is closely associated with the survival outcome (ie. patients with high gene are dying more ). The other covariables are generated so that they are correlated with the variable gene, but there is no mathematical relationship between these covariables and the survival outcome. In other words, higher levels of gene go in parallel with higher chances for observing death = 1. After capturing this relationship, the addition of random noise to gene to obtain gene_high_cor (and other covariables) does not explain the changes vital status any further. You might have two correlated variables gene and gene_high_cor for example, but only one of them may be associated with the survival outcome whereas the other one is just a derivative of the first one.

This is why you see a “significant” HR value of 1.16609 in the univariable model involving gene_high_cor in Section 3 (keep in mind that this covariable is positively correlated with the variable of interest, gene); but this significance goes away when you model gene_high_cor together with gene in Section 4.1. The same can be said for multivariable analyses involving other covariables.

Finally, when you model all predictors together in Section 4.8, you see that the HR values of all the covariables converge to 1 (ie. no effect on survival) whereas HR value of gene is 1.242594. However, when you look at the p-value, albeit it is still significant (as defined by < 0.05), you’ll see that it is higher now compared to univariable analysis. This may be happening because our covariables could still explain vital status changes at least somewhat. To support this notion, when we made all covariables random and performed survival modeling in Section 4.9, the coefficient and the p-value of gene looked quite similar to the univariable model in Section 3. I hope this post helped you wraped your mind around how CoxPH works when there are correlated variables involved.