In: Statistics and Probability
What are the different commands in R software to analyze the Mayo clinic's "pbc" dataset?
Importing package
library(survival)
Loading data set
data(pbc)
Checking the head of data
head(pbc)
Structure of PBC data
str(pbc)
Summary of PBC data
summary(pbc)
The data set pbc (primary biliary cirrhosis) contains the main variable time, which is the observed, potentially censored survival time and the status, which indicates if the observation has been censored. In addition there are 17 observed covariates, where we show above five that in previous analyzes have been found to be important.
plot of the survival times
nr <- 50
plot(c(0, pbc$time[1]), c(1, 1), type = "l", ylim = c(0, nr + 1), xlim = c(0,
max(pbc$time[1:nr]) + 10), ylab = "nr", xlab = "Survival time")
for (i in 2:nr) lines(c(0, pbc$time[i]), c(i, i))
for (i in 1:nr) {
if (pbc$status[i] == 0)
points(pbc$time[i], i, col = "red", pch = 20) #censored
if (pbc$status[i] == 1)
points(pbc$time[i], i) }
In the complete dataset there are 418 patients, and the status variable shows that 161 have died, 232 have been censored and then 25 have got a transplant. We exclude those 25 observations from the further analysis. Furthermore, we change the status variable to a logical for subsequent use.
pbc <- subset(pbc, status != 1)
pbc <- transform(pbc, status = as.logical(status))
-> Estimation of the survival function using the Kaplan-Meier estimator can be done using the survfit function.
pbcSurv <- survfit(Surv(time, status) ~ 1, data = pbc)
plot(pbcSurv, mark.time = FALSE)
The result from survfit is an R object of type survfit. Calling
summary with a survfit object as argument provides the
non-parametric estimate of the survival function including
additional information as a long list. The plot with the added,
pointwise confidence bands
is perhaps more informative. Setting mark.time = FALSE prevents all
the censored observations to be marked on the plot by a “+”.
The function Surv applied to the time and status variables for the PBC data is a function that creates a survival object.
-> We can compute the Kaplan-Meier and the Nelson-Aalen estimators directly. First we form the individuals at risk process for the subsequent computations
risk <- with(pbc, data.frame(Y = nrow(pbc):1, time =
sort(time)))
surv <- with(pbc, risk[status[order(time)], ])
plot(Y ~ time, data = risk, type = "l")
points(Y ~ time, data = surv, col = "red") ## uncensored
survivals
-> Then we compute the two non-parametric estimators of the survival function.
### Kaplan-Meier and Nelson-Aalen estimates
surv <- transform(surv, KM = cumprod(1 - 1/Y), NAa =
exp(-cumsum(1/Y)))
plot(KM ~ time, data = surv, type = "s", ylim = c(0, 1), xlim =
c(0, 4200))
lines(NAa ~ time, data = surv, type = "s", col =
"blue")
-> It is difficult to see any differences. We can also compute and plot the estimated cumulative hazard functions. Then it is possible to observe small differences in the tail.
plot(-log(KM) ~ time, data = surv, type = "s", xlim =
c(0, 4200))
lines(-log(NAa) ~ time, data = surv, type = "s", col =
"blue")