Case-control studies

library(survival)
library(glmnet)
library(tibble)
library(dplyr)

In this exercise we will use data from a cohort of \(1720\) female patients who were discharged from two tuberculosis sanatoria in Massachusetts between 1930 and 1956 to investigate breast cancer risk of radiation exposure due to fluoroscopy. Radiation doses have been estimated for \(1022\) women who received radiation exposure to the chest from X-ray fluoroscopy lung examinations. Some women had multiple exposures to radiation. The remaining \(698\) women in the cohort received treatments that did not require fluoroscopic monitoring and were radiation unexposed. The patients had been followed up until the end of 1980, by which time \(75\) breast cancer cases were observed. (Source: Hrubec et al., Cancer Research, 1989)

For this cohort radiation data have been collected for all \(1720\) women. But the workload of exposure data collection would have been reduced if the investigators had used a cohort sampling design. In this exercise we will look at estimation based on the full cohort and for nested case-control and case-cohort samples selected from the full cohort.

We first look at the cohort data. You may read these data into R by the command

cohort=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.cohort.txt", header=T) %>% as_tibble()
cohort

# A tibble: 1,720 × 6
      id   dose number ageentry ageexit status
   <int>  <dbl>  <int>    <dbl>   <dbl>  <int>
 1     1 0           0     14.6    47.4      0
 2     2 0           0     14.2    52.6      0
 3     3 0           0     17.5    69.8      0
 4     4 0           0     14.0    46.3      0
 5     5 0           0     16.5    61.7      0
 6     6 0.0660      4     16.0    58.6      0
 7     7 1.25      108     13.0    62.3      0
 8     8 0           0     13.2    53.5      0
 9     9 0           0     17.3    42.5      1
10    10 0.141      32     19.0    22.6      0
# ℹ 1,710 more rows

The data are organized with one line for each of the 1720 women, and with the following variables in the six columns:

id: subject id
dose: radiation dose (Gy)
number: number of fluoroscopic examinations
ageentry: age at entry into the study (in years)
ageexit: age at breast cancer or censoring (in years)
status: breast cancer status (0: censored; 1: breast cancer)

To model the effect of radiation dose on breast cancer risk, we will consider a Cox regression model with \(log2(dose + 1)\), with dose measured in grays (Gy), as covariate. You may fit this model for cohort data by the commands:

fit.cohort=coxph(Surv(ageentry,ageexit,status)~log2(dose+1), data=cohort)
summary(fit.cohort)

Call:
coxph(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 
    1), data = cohort)

  n= 1720, number of events= 75 

                 coef exp(coef) se(coef)     z Pr(>|z|)   
log2(dose + 1) 0.4907    1.6335   0.1617 3.035  0.00241 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

               exp(coef) exp(-coef) lower .95 upper .95
log2(dose + 1)     1.634     0.6122      1.19     2.243

Concordance= 0.586  (se = 0.034 )
Likelihood ratio test= 8.64  on 1 df,   p=0.003
Wald test            = 9.21  on 1 df,   p=0.002
Score (logrank) test = 9.44  on 1 df,   p=0.002

We then consider nested case-control data with two controls per case \((m=2)\). You may read these data into R by the command

ncc=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.ncc.txt", header=T) %>% as_tibble()
ncc

# A tibble: 225 × 8
      id  dose number ageentry ageexit status setno  case
   <int> <dbl>  <int>    <dbl>   <dbl>  <int> <int> <int>
 1     9 0          0    17.3     42.5      1     1     1
 2  2717 0          0    31.2     83.2      0     1     0
 3   311 0          0    14.4     61.7      0     1     0
 4    22 0.617     39    17.2     25.4      1     2     1
 5    66 0.259     27    11.5     44.8      0     2     0
 6   401 1.83     111     5.25    51.4      0     2     0
 7    29 0.594     36    17.7     48.5      1     3     1
 8  3027 2.63     267    21.5     73.8      0     3     0
 9  2349 0.807     82    23.4     69.7      0     3     0
10    31 0          0    12.5     36.6      1     4     1
# ℹ 215 more rows

The data are organized with three lines for each of the 75 sampled risk sets (one line for the case, and one line for each of the two controls, randomly selected, but falling within the risk set, e.g. the is an overlap between the age interval - think!), and with eight columns. The first six columns are as for the cohort data, while the variables in the two last columns are:

case: case-control status within the sampled risk set (0=control, 1=case)
setno: label of sampled risk set

You may fit the Cox regression model for the nested case-control data by the commands: Note we use case instead of status and strata(setno) so that only the 3 patients per setno are used for the calculation of the risk set:

fit.ncc = coxph(Surv(ageentry,ageexit,case) ~ log2(dose+1)+strata(setno),data=ncc)
summary(fit.ncc)

Call:
coxph(formula = Surv(ageentry, ageexit, case) ~ log2(dose + 1) + 
    strata(setno), data = ncc)

  n= 225, number of events= 75 

                 coef exp(coef) se(coef)     z Pr(>|z|)  
log2(dose + 1) 0.5404    1.7167   0.2366 2.284   0.0224 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

               exp(coef) exp(-coef) lower .95 upper .95
log2(dose + 1)     1.717     0.5825      1.08      2.73

Concordance= 0.563  (se = 0.058 )
Likelihood ratio test= 5.39  on 1 df,   p=0.02
Wald test            = 5.22  on 1 df,   p=0.02
Score (logrank) test = 5.39  on 1 df,   p=0.02

Explain why these commands maximize the partial likelihood for nested case-control data. Perform the commands and compare the result with those you obtained for the cohort data.

We then consider case-cohort data with a subcohort of \(150\) women. You may read these data into R by the command

caseco=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.caseco.txt", header=T) %>% as_tibble()
caseco

# A tibble: 218 × 7
      id  dose number ageentry ageexit status  subc
   <int> <dbl>  <int>    <dbl>   <dbl>  <int> <int>
 1  2133 0          0     33.3    68.6      0     1
 2  2608 0.826     84     25.8    67.6      0     1
 3   497 0          0     16.2    60.6      0     1
 4  2658 1.62     193     24.4    63.7      0     1
 5  2611 0          0     24.2    68.5      0     1
 6  3152 0          0     36.5    75.7      0     1
 7   118 0.974     59     16.3    34.2      0     1
 8  3352 0          0     27.9    39.3      1     1
 9  2210 1.51     181     20.8    54.0      0     1
10   165 0.248     15     10.2    19.1      0     1
# ℹ 208 more rows

The data are organized with one line for each woman who is in the subcohort or is a cancer case outside the subcohort (no censored patients outside the subcohort!). The \(150\) women were randomly selected to be in the subcohort. The rest to reach \(218\) were the cases added (\(status=1\)) that are not in the subcohort. All the cases (\(75\)) are included in the data (as was the case with ncc)! The data file has seven columns. The first six columns are as for the cohort data, while the last column is:

subc: subcohort status (0: not in subcohort; 1: member of subcohort)

For case cohort-data there are two possible methods of estimation. You may fit the Cox model for the case-cohort data using Prentice’s method by the commands:

fit.prentice = cch(Surv(ageentry,ageexit,status)~ log2(dose+1), data=caseco, subcoh=~subc, id=~id, method="Prentice", cohort.size=1720)

print(fit.prentice)

Case-cohort analysis,x$method, Prentice 
 with subcohort of 150 from cohort of 1720 

Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1), 
    data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720, 
    method = "Prentice")

Coefficients:
         Value        SE        Z          p
[1,] 0.5194255 0.2152382 2.413259 0.01581057

summary(fit.prentice)

Case-cohort analysis,x$method, Prentice 
 with subcohort of 150 from cohort of 1720 

Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1), 
    data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720, 
    method = "Prentice")

Coefficients:
       Coef    HR  (95%   CI)     p
Value 0.519 1.681 1.102 2.563 0.016

Perform the commands and compare the result with those you obtained for cohort data and nested case-control data. Also fit the model using inverse probability weighting (which is achieved by using method=“LinYing”), and compare with the results for Prentice’s method.

fit.ipw = survival::cch(Surv(ageentry,ageexit,status)~ log2(dose+1), data=caseco, subcoh=~subc, id=~id, method="LinYing", cohort.size=1720)

summary(fit.ipw)

Case-cohort analysis,x$method, LinYing 
 with subcohort of 150 from cohort of 1720 

Call: survival::cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 
    1), data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720, 
    method = "LinYing")

Coefficients:
       Coef    HR  (95%   CI)     p
Value 0.524 1.688 1.116 2.555 0.013

Finally we consider stratified case-cohort data where the sub-cohort consists of 50 women selected from each of the three strata:

women with no fluoroscopic examinations (698 women)
women with 1-149 fluoroscopic examinations (765 women)
women with 150 or more fluoroscopic examinations (257 women)

You may read the stratified case-cohort data into R by the command:

stratcaseco=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.stratcaseco.txt", header=T) %>% as_tibble()
stratcaseco

# A tibble: 221 × 8
      id  dose number ageentry ageexit status  subc stratum
   <int> <dbl>  <int>    <dbl>   <dbl>  <int> <int>   <int>
 1  2894     0      0     29.3    39.8      0     1       1
 2  2885     0      0     40.6    76.4      0     1       1
 3  2490     0      0     18.3    40.9      0     1       1
 4  2717     0      0     31.2    83.2      0     1       1
 5  3015     0      0     29.7    60.4      0     1       1
 6    73     0      0     16.3    64.6      0     1       1
 7    44     0      0     17.9    19.5      0     1       1
 8  3277     0      0     26.1    70.4      0     1       1
 9  2873     0      0     25.6    62.9      0     1       1
10  2423     0      0     24.8    62.1      0     1       1
# ℹ 211 more rows

The data are organized with one line for each woman who is in the subcohort or is a cancer case outside the subcohort. The data file has eight columns. The first six columns are as for the cohort data, while the two last column are

subc: subcohort status (0: not in subcohrt; 1: member of subcohort)
stratum: sampling stratum (1: no fluoroscopic examinations; 2: 1-149 examinations; 3: 150 examinations or more)

You may fit the Cox model for the stratified case-cohort data using inverse probability weighting by the commands:

stratsize=c(698,765,257)
names(stratsize)=c(1,2,3)
fit.ipw2=cch(Surv(ageentry,ageexit,status)~log2(dose+1), data=stratcaseco, subcoh=~subc,id=~id, stratum=~stratum, method="II.Borgan", cohort.size=stratsize)
# Borgan II method is a generalization of the Lin-Ying estimator for stratified case-control data
summary(fit.ipw2)

Exposure-stratified case-cohort analysis, II.Borgan method.
            1   2   3
subcohort  71  87  63
cohort    698 765 257
Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1), 
    data = stratcaseco, subcoh = ~subc, id = ~id, stratum = ~stratum, 
    cohort.size = stratsize, method = "II.Borgan")

Coefficients:
       Coef    HR  (95%   CI)     p
Value 0.506 1.658 1.157 2.377 0.006

Perform the commands and compare the result with those you obtained for case-cohort data with no stratification.