library(survival)
library(glmnet)
library(tibble)
library(dplyr)Case-control studies
In this exercise we will use data from a cohort of \(1720\) female patients who were discharged from two tuberculosis sanatoria in Massachusetts between 1930 and 1956 to investigate breast cancer risk of radiation exposure due to fluoroscopy. Radiation doses have been estimated for \(1022\) women who received radiation exposure to the chest from X-ray fluoroscopy lung examinations. Some women had multiple exposures to radiation. The remaining \(698\) women in the cohort received treatments that did not require fluoroscopic monitoring and were radiation unexposed. The patients had been followed up until the end of 1980, by which time \(75\) breast cancer cases were observed. (Source: Hrubec et al., Cancer Research, 1989)
For this cohort radiation data have been collected for all \(1720\) women. But the workload of exposure data collection would have been reduced if the investigators had used a cohort sampling design. In this exercise we will look at estimation based on the full cohort and for nested case-control and case-cohort samples selected from the full cohort.
We first look at the cohort data. You may read these data into R by the command
cohort=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.cohort.txt", header=T) %>% as_tibble()
cohort# A tibble: 1,720 × 6
id dose number ageentry ageexit status
<int> <dbl> <int> <dbl> <dbl> <int>
1 1 0 0 14.6 47.4 0
2 2 0 0 14.2 52.6 0
3 3 0 0 17.5 69.8 0
4 4 0 0 14.0 46.3 0
5 5 0 0 16.5 61.7 0
6 6 0.0660 4 16.0 58.6 0
7 7 1.25 108 13.0 62.3 0
8 8 0 0 13.2 53.5 0
9 9 0 0 17.3 42.5 1
10 10 0.141 32 19.0 22.6 0
# ℹ 1,710 more rows
The data are organized with one line for each of the 1720 women, and with the following variables in the six columns:
- id: subject id
- dose: radiation dose (Gy)
- number: number of fluoroscopic examinations
- ageentry: age at entry into the study (in years)
- ageexit: age at breast cancer or censoring (in years)
- status: breast cancer status (0: censored; 1: breast cancer)
To model the effect of radiation dose on breast cancer risk, we will consider a Cox regression model with \(log2(dose + 1)\), with dose measured in grays (Gy), as covariate. You may fit this model for cohort data by the commands:
fit.cohort=coxph(Surv(ageentry,ageexit,status)~log2(dose+1), data=cohort)
summary(fit.cohort)Call:
coxph(formula = Surv(ageentry, ageexit, status) ~ log2(dose +
1), data = cohort)
n= 1720, number of events= 75
coef exp(coef) se(coef) z Pr(>|z|)
log2(dose + 1) 0.4907 1.6335 0.1617 3.035 0.00241 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
log2(dose + 1) 1.634 0.6122 1.19 2.243
Concordance= 0.586 (se = 0.034 )
Likelihood ratio test= 8.64 on 1 df, p=0.003
Wald test = 9.21 on 1 df, p=0.002
Score (logrank) test = 9.44 on 1 df, p=0.002
We then consider nested case-control data with two controls per case \((m=2)\). You may read these data into R by the command
ncc=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.ncc.txt", header=T) %>% as_tibble()
ncc# A tibble: 225 × 8
id dose number ageentry ageexit status setno case
<int> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 9 0 0 17.3 42.5 1 1 1
2 2717 0 0 31.2 83.2 0 1 0
3 311 0 0 14.4 61.7 0 1 0
4 22 0.617 39 17.2 25.4 1 2 1
5 66 0.259 27 11.5 44.8 0 2 0
6 401 1.83 111 5.25 51.4 0 2 0
7 29 0.594 36 17.7 48.5 1 3 1
8 3027 2.63 267 21.5 73.8 0 3 0
9 2349 0.807 82 23.4 69.7 0 3 0
10 31 0 0 12.5 36.6 1 4 1
# ℹ 215 more rows
The data are organized with three lines for each of the 75 sampled risk sets (one line for the case, and one line for each of the two controls, randomly selected, but falling within the risk set, e.g. the is an overlap between the age interval - think!), and with eight columns. The first six columns are as for the cohort data, while the variables in the two last columns are:
- case: case-control status within the sampled risk set (0=control, 1=case)
- setno: label of sampled risk set
You may fit the Cox regression model for the nested case-control data by the commands: Note we use case instead of status and strata(setno) so that only the 3 patients per setno are used for the calculation of the risk set:
fit.ncc = coxph(Surv(ageentry,ageexit,case) ~ log2(dose+1)+strata(setno),data=ncc)
summary(fit.ncc)Call:
coxph(formula = Surv(ageentry, ageexit, case) ~ log2(dose + 1) +
strata(setno), data = ncc)
n= 225, number of events= 75
coef exp(coef) se(coef) z Pr(>|z|)
log2(dose + 1) 0.5404 1.7167 0.2366 2.284 0.0224 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
log2(dose + 1) 1.717 0.5825 1.08 2.73
Concordance= 0.563 (se = 0.058 )
Likelihood ratio test= 5.39 on 1 df, p=0.02
Wald test = 5.22 on 1 df, p=0.02
Score (logrank) test = 5.39 on 1 df, p=0.02
- Explain why these commands maximize the partial likelihood for nested case-control data. Perform the commands and compare the result with those you obtained for the cohort data.
We then consider case-cohort data with a subcohort of \(150\) women. You may read these data into R by the command
caseco=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.caseco.txt", header=T) %>% as_tibble()
caseco# A tibble: 218 × 7
id dose number ageentry ageexit status subc
<int> <dbl> <int> <dbl> <dbl> <int> <int>
1 2133 0 0 33.3 68.6 0 1
2 2608 0.826 84 25.8 67.6 0 1
3 497 0 0 16.2 60.6 0 1
4 2658 1.62 193 24.4 63.7 0 1
5 2611 0 0 24.2 68.5 0 1
6 3152 0 0 36.5 75.7 0 1
7 118 0.974 59 16.3 34.2 0 1
8 3352 0 0 27.9 39.3 1 1
9 2210 1.51 181 20.8 54.0 0 1
10 165 0.248 15 10.2 19.1 0 1
# ℹ 208 more rows
The data are organized with one line for each woman who is in the subcohort or is a cancer case outside the subcohort (no censored patients outside the subcohort!). The \(150\) women were randomly selected to be in the subcohort. The rest to reach \(218\) were the cases added (\(status=1\)) that are not in the subcohort. All the cases (\(75\)) are included in the data (as was the case with ncc)! The data file has seven columns. The first six columns are as for the cohort data, while the last column is:
- subc: subcohort status (0: not in subcohort; 1: member of subcohort)
For case cohort-data there are two possible methods of estimation. You may fit the Cox model for the case-cohort data using Prentice’s method by the commands:
fit.prentice = cch(Surv(ageentry,ageexit,status)~ log2(dose+1), data=caseco, subcoh=~subc, id=~id, method="Prentice", cohort.size=1720)
print(fit.prentice)Case-cohort analysis,x$method, Prentice
with subcohort of 150 from cohort of 1720
Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1),
data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720,
method = "Prentice")
Coefficients:
Value SE Z p
[1,] 0.5194255 0.2152382 2.413259 0.01581057
summary(fit.prentice)Case-cohort analysis,x$method, Prentice
with subcohort of 150 from cohort of 1720
Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1),
data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720,
method = "Prentice")
Coefficients:
Coef HR (95% CI) p
Value 0.519 1.681 1.102 2.563 0.016
- Perform the commands and compare the result with those you obtained for cohort data and nested case-control data. Also fit the model using inverse probability weighting (which is achieved by using method=“LinYing”), and compare with the results for Prentice’s method.
fit.ipw = survival::cch(Surv(ageentry,ageexit,status)~ log2(dose+1), data=caseco, subcoh=~subc, id=~id, method="LinYing", cohort.size=1720)
summary(fit.ipw)Case-cohort analysis,x$method, LinYing
with subcohort of 150 from cohort of 1720
Call: survival::cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose +
1), data = caseco, subcoh = ~subc, id = ~id, cohort.size = 1720,
method = "LinYing")
Coefficients:
Coef HR (95% CI) p
Value 0.524 1.688 1.116 2.555 0.013
Finally we consider stratified case-cohort data where the sub-cohort consists of 50 women selected from each of the three strata:
women with no fluoroscopic examinations (698 women)
women with 1-149 fluoroscopic examinations (765 women)
women with 150 or more fluoroscopic examinations (257 women)
You may read the stratified case-cohort data into R by the command:
stratcaseco=read.table("https://www.med.uio.no/imb/english/research/centres/ocbe/courses/imb9335/radiationbreast.stratcaseco.txt", header=T) %>% as_tibble()
stratcaseco# A tibble: 221 × 8
id dose number ageentry ageexit status subc stratum
<int> <dbl> <int> <dbl> <dbl> <int> <int> <int>
1 2894 0 0 29.3 39.8 0 1 1
2 2885 0 0 40.6 76.4 0 1 1
3 2490 0 0 18.3 40.9 0 1 1
4 2717 0 0 31.2 83.2 0 1 1
5 3015 0 0 29.7 60.4 0 1 1
6 73 0 0 16.3 64.6 0 1 1
7 44 0 0 17.9 19.5 0 1 1
8 3277 0 0 26.1 70.4 0 1 1
9 2873 0 0 25.6 62.9 0 1 1
10 2423 0 0 24.8 62.1 0 1 1
# ℹ 211 more rows
The data are organized with one line for each woman who is in the subcohort or is a cancer case outside the subcohort. The data file has eight columns. The first six columns are as for the cohort data, while the two last column are
- subc: subcohort status (0: not in subcohrt; 1: member of subcohort)
- stratum: sampling stratum (1: no fluoroscopic examinations; 2: 1-149 examinations; 3: 150 examinations or more)
You may fit the Cox model for the stratified case-cohort data using inverse probability weighting by the commands:
stratsize=c(698,765,257)
names(stratsize)=c(1,2,3)
fit.ipw2=cch(Surv(ageentry,ageexit,status)~log2(dose+1), data=stratcaseco, subcoh=~subc,id=~id, stratum=~stratum, method="II.Borgan", cohort.size=stratsize)
# Borgan II method is a generalization of the Lin-Ying estimator for stratified case-control data
summary(fit.ipw2)Exposure-stratified case-cohort analysis, II.Borgan method.
1 2 3
subcohort 71 87 63
cohort 698 765 257
Call: cch(formula = Surv(ageentry, ageexit, status) ~ log2(dose + 1),
data = stratcaseco, subcoh = ~subc, id = ~id, stratum = ~stratum,
cohort.size = stratsize, method = "II.Borgan")
Coefficients:
Coef HR (95% CI) p
Value 0.506 1.658 1.157 2.377 0.006
- Perform the commands and compare the result with those you obtained for case-cohort data with no stratification.