Title: | Assessment of Risk Prediction Models |
---|---|
Description: | We included functions to assess the performance of risk models. The package contains functions for the various measures that are used in empirical studies, including univariate and multivariate odds ratios (OR) of the predictors, the c-statistic (or area under the receiver operating characteristic (ROC) curve (AUC)), Hosmer-Lemeshow goodness of fit test, reclassification table, net reclassification improvement (NRI) and integrated discrimination improvement (IDI). Also included are functions to create plots, such as risk distributions, ROC curves, calibration plot, discrimination box plot and predictiveness curves. In addition to functions to assess the performance of risk models, the package includes functions to obtain weighted and unweighted risk scores as well as predicted risks using logistic regression analysis. These logistic regression functions are specifically written for models that include genetic variables, but they can also be applied to models that are based on non-genetic risk factors only. Finally, the package includes function to construct a simulated dataset with genotypes, genetic risks, and disease status for a hypothetical population, which is used for the evaluation of genetic risk models. |
Authors: | Suman Kundu, Yurii S. Aulchenko, A. Cecile J.W. Janssens |
Maintainer: | Suman Kundu <[email protected]> |
License: | GPL (>= 2) |
Version: | 1.2-4 |
Built: | 2025-01-30 07:00:37 UTC |
Source: | CRAN |
An R package for the analysis of (genetic) risk prediction studies.
Fueled by the substantial gene discoveries from genome-wide association studies, there is increasing interest in investigating the predictive ability of genetic risk models. To assess the performance of genetic risk models, PredictABEL includes functions for the various measures and plots that have been used in empirical studies, including univariate and multivariate odds ratios (ORs) of the predictors, the c-statistic (or AUC), Hosmer-Lemeshow goodness of fit test, reclassification table, net reclassification improvement (NRI) and integrated discrimination improvement (IDI). The plots included are the ROC plot, calibration plot, discrimination box plot, predictiveness curve, and several risk distributions.
These functions can be applied to predicted risks that are obtained using logistic regression analysis, to weighted or unweighted risk scores, for which the functions are included in this package. The functions can also be used to assess risks or risk scores that are constructed using other methods, e.g., Cox Proportional Hazards regression analysis, which are not included in the current version. Risks obtained from other methods can be imported into R for assessment of the predictive performance.
The functions to construct the risk models using logistic regression analyses
are specifically written for models that include genetic variables,
eventually in addition to non-genetic factors, but they can also be applied
to construct models that are based on non-genetic risk factors only.
Before using the functions fitLogRegModel
for constructing
a risk model or riskScore
for computing risk
scores, the following checks on the dataset are advisable to be done:
(1) Missing values: The logistic regression analyses and computation of
the risk score are done only for subjects that have no missing data. In case
of missing values, individuals with missing data can be removed from the
dataset or imputation strategies can be used to fill in missing data.
Subjects with missing data can be removed with the R function na.omit
(available in stats
package).
Example: DataFileNew <- na.omit(DataFile)
will make a new dataset (DataFileNew
) with no missing values;
(2) Multicollinearity: When there is strong correlation between the predictor variables, regression coefficients may be estimated imprecisely and risks scores may be biased because the assumption of independent effects is violated. In genetic risk prediction studies, problems with multicollinearity should be expected when single nucleotide polymorphisms (SNPs) located in the same gene are in strong linkage disequilibrium (LD). For SNPs in LD it is common to select the variant with the lowest p-value in the model;
(3) Outliers: When the data contain significant outliers, either clinical variables with extreme values of the outcomes or extreme values resulting from errors in the data entry, these may impact the construction of the risk models and computation of the risks scores. Data should be carefully checked and outliers need to be removed or replaced, if justified;
(4) Recoding of data: In the computation of unweighted risk scores, it is assumed
that the genetic variants are coded 0,1,2
representing the number of alleles carried. When variants
are coded 0,1
representing a dominant or recessive effect of the alleles,
the variables need to be recoded before unweighted risk scores can be computed.
To import data into R several alternative strategies can be used. Use the
Hmisc
package for importing SPSS and SAS data into R.
Use "ExampleData <- read.table("DataName.txt", header=T, sep="\t")
" for text
files where variable names are included as column headers and data are
separated by tabs.
Use "ExampleData <- read.table("Name.csv", sep=",", header=T)
"
for comma-separated files with variable names as column headers.
Use "setwd(dir)"
to set the working directory to "dir". The datafile
needs to be present in the working directory.
To export datafiles from R tables to a tab-delimited textfile with the first row as
the name of the variables,
use "write.table(R_Table, file="Name.txt", row.names=FALSE, sep="\t")
" and
when a comma-separated textfile is requested and variable names are provided in the first row,
use "write.table(R_Table, file="Name.csv", row.names=FALSE, sep=",")
".
When the directory is not specified, the file will be
saved in the working directory. For exporting R data into SPSS, SAS and
Stata data, use functions in the the foreign
package.
Several functions in this package depend on other R packages:
(1) Hmisc
, is used to compute NRI and IDI;
(2) ROCR
, is used to produce ROC plots;
(3) epitools
, is used to compute univariate odds ratios;
(4) PBSmodelling
, is used to produce predictiveness curve.
The authors would like to acknowledge Lennart Karssen, Maksim Struchalin and Linda Broer from the Department of Epidemiology, Erasmus Medical Center, Rotterdam for their valuable comments and suggestions to make this package.
The current version of the package includes the basic measures and plots that are used in the assessment of (genetic) risk prediction models and the function to construct a simulated dataset that contains individual genotype data, estimated genetic risk and disease status, used for the evaluation of genetic risk models (see Janssens et al, Genet Med 2006). Planned extensions of the package include functions to construct risk models using Cox Proportional Hazards analysis for prospective data and assess the performance of risk models for time-to-event data.
Suman Kundu
Yurii S. Aulchenko
A. Cecile J.W. Janssens
S Kundu, YS Aulchenko, CM van Duijn, ACJW Janssens. PredictABEL:
an R package for the assessment of risk prediction models.
Eur J Epidemiol. 2011;26:261-4.
ACJW Janssens, JPA Ioannidis, CM van Duijn, J Little, MJ Khoury.
Strengthening the Reporting of Genetic Risk Prediction Studies: The GRIPS
Statement Proposal. Eur J Epidemiol. 2011;26:255-9.
ACJW Janssens, JPA Ioannidis, S Bedrosian, P Boffetta, SM Dolan, N Dowling,
I Fortier, AN. Freedman, JM Grimshaw, J Gulcher, M Gwinn, MA Hlatky, H Janes,
P Kraft, S Melillo, CJ O'Donnell, MJ Pencina, D Ransohoff, SD Schully,
D Seminara, DM Winn, CF Wright, CM van Duijn, J Little, MJ Khoury.
Strengthening the reporting of genetic risk prediction studies
(GRIPS)-Elaboration and explanation. Eur J Epidemiol. 2011;26:313-37.
Aulchenko YS, Ripke S, Isaacs A, van Duijn CM. GenABEL: an R package for genome-wide association analysis. Bioinformatics 2007;23(10):1294-6.
ExampleData is a hypothetical dataset constructed to demonstrate all functions in the package. ExampleData is a data frame containing a binary outcome variable (e.g., disease present/absent) and genetic and non-genetic predictor variables for 10,000 persons. In this dataset, column 1 is the ID number, column 2 is the outcome variable, columns 3-10 are non-genetic variables and columns 11-16 are genetic variables.
data(ExampleData)
data(ExampleData)
data(ExampleData) # show first 5 records (rows) of the dataset head(ExampleData,5)
data(ExampleData) # show first 5 records (rows) of the dataset head(ExampleData,5)
ExampleModels
constructs two risk models using logistic regression analysis.
Most of the functions in this package require a logistic regression model as an input and
estimate predicted risks from this fitted model.
To illustrate these functions without repeating the construction of a
logistic regression model, this example code has been created.
The function returns two different risk models, riskModel1 which is based
on non-genetic predictors and riskModel2 which includes genetic and non-genetic predictors.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred1 <- c(3:10) cNonGenPred2 <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat1 <- c(6:8) cNonGenPredCat2 <- c(6:8) # specify column numbers of genetic predictors cGenPred1 <- c(0) cGenPred2 <- c(11:16) # specify column numbers of genetic predictors that are categorical cGenPredsCat1 <- c(0) cGenPredsCat2 <- c(0) # fit logistic regression models riskmodel1 <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred1, cNonGenPredsCat=cNonGenPredCat1, cGenPreds=cGenPred1, cGenPredsCat=cGenPredsCat1) riskmodel2 <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred2, cNonGenPredsCat=cNonGenPredCat2, cGenPreds=cGenPred2, cGenPredsCat=cGenPredsCat2) # combine output in a list ExampleModels <- list(riskModel1=riskmodel1, riskModel2=riskmodel2)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred1 <- c(3:10) cNonGenPred2 <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat1 <- c(6:8) cNonGenPredCat2 <- c(6:8) # specify column numbers of genetic predictors cGenPred1 <- c(0) cGenPred2 <- c(11:16) # specify column numbers of genetic predictors that are categorical cGenPredsCat1 <- c(0) cGenPredsCat2 <- c(0) # fit logistic regression models riskmodel1 <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred1, cNonGenPredsCat=cNonGenPredCat1, cGenPreds=cGenPred1, cGenPredsCat=cGenPredsCat1) riskmodel2 <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred2, cNonGenPredsCat=cNonGenPredCat2, cGenPreds=cGenPred2, cGenPredsCat=cGenPredsCat2) # combine output in a list ExampleModels <- list(riskModel1=riskmodel1, riskModel2=riskmodel2)
The function fits a standard GLM function for the logistic regression model.
fitLogRegModel(data, cOutcome, cNonGenPreds, cNonGenPredsCat, cGenPreds, cGenPredsCat)
fitLogRegModel(data, cOutcome, cNonGenPreds, cNonGenPredsCat, cGenPreds, cGenPredsCat)
data |
Data frame or matrix that includes the outcome and predictor variables. |
cOutcome |
Column number of the outcome variable. |
cNonGenPreds |
Column numbers of the non-genetic predictors that are
included in the model. An example to denote column numbers is
|
cNonGenPredsCat |
Column numbers of the non-genetic predictors that
are entered as categorical variables in the model. When non-genetic
predictors are not specified as being categorical they are treated as
continuous variables in the model. If no non-genetic predictors are
categorical, denote |
cGenPreds |
Column numbers of the genetic predictors or genetic risk score.
Denote |
cGenPredsCat |
Column numbers of the genetic predictors that are entered as categorical variables in the model. When SNPs are considered as categorical, the model will estimate effects per genotype. Otherwise, SNPs are considered as continuous variables for which the model will estimate an allelic effect. Choose c(0) when no genetic predictors are considered as categorical or when genetic predictors are entered as a risk score into the model. |
The function fits a standard GLM function for the logistic regression model.
This function can be used to construct a logistic regression model based on genetic and non-genetic
predictors. The function also allows to enter the genetic predictors
as a single risk score. For that purpose, the function requires that
the dataset additionally includes the risk score.
A new dataset can be constructed using
"NewExampleData <- cbind(ExampleData,riskScore)
".
The genetic risk scores can be obtained
using the function riskScore
in this package or be
imported from other methods.
No value returned.
predRisk
, ORmultivariate
, riskScore
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # show summary details for the fitted risk model summary(riskmodel)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # show summary details for the fitted risk model summary(riskmodel)
The function estimates multivariate (adjusted) odds ratios (ORs) with 95% confidence intervals (CIs) for all the genetic and non-genetic variables in the risk model.
ORmultivariate(riskModel, filename)
ORmultivariate(riskModel, filename)
riskModel |
Name of logistic regression model that can be fitted using
the function |
filename |
Name of the output file in which the multivariate
ORs will be saved. If no directory is specified, the file is
saved in the working directory as a txt file.
When |
The function requires that first a logistic regression
model is fitted either by using GLM
function or the function
fitLogRegModel
. In addition to the multivariate ORs,
the function returns summary statistics of model performance, namely the Brier
score and the Nagelkerke's value.
The Brier score quantifies the accuracy of risk predictions by comparing
predicted risks with observed outcomes at individual level (where outcome
values are either 0 or 1). The Nagelkerke's
value indicates the percentage of variation
of the outcome explained by the predictors in the model.
The function returns:
Predictors Summary |
OR with 95% CI and corresponding p-values for each predictor in the model |
Brier Score |
Brier score |
Nagelkerke Index |
Nagelkerke's |
Brier GW. Verification of forecasts expressed in terms of probability. Monthly weather review 1950;78:1-3.
Nagelkerke NJ. A note on a general definition of the coefficient of determination. Biometrika 1991;78:691-692.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # obtain multivariate OR(95% CI) for all predictors of the fitted model ORmultivariate(riskModel=riskmodel)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # obtain multivariate OR(95% CI) for all predictors of the fitted model ORmultivariate(riskModel=riskmodel)
The function produces a calibration plot and provides Hosmer-Lemeshow goodness of fit test statistics.
plotCalibration(data, cOutcome, predRisk, groups, rangeaxis, plottitle, xlabel, ylabel, filename, fileplot, plottype)
plotCalibration(data, cOutcome, predRisk, groups, rangeaxis, plottitle, xlabel, ylabel, filename, fileplot, plottype)
data |
Data frame or numeric matrix that includes the outcome and predictor variables. |
cOutcome |
Column number of the outcome variable. |
predRisk |
Vector of predicted risks of all individuals in the dataset. |
groups |
Number of groups considered in
Hosmer-Lemeshow test. Specification of |
rangeaxis |
Range of x-axis and y-axis. Specification of |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis Default. Specification of |
ylabel |
Label of y-axis. Specification of |
filename |
Name of the output file in which the calibration table is saved.
The file is saved as a txt file in the working directory. When no
|
fileplot |
Name of the file that contains the calibation plot.
The file is saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. Foe example, |
Hosmer-Lemeshow test statistic is a measure of the fit of the model, comparing observed and predicted risks across subgroups of the population. The default number of groups is 10.
The function requires the outcome of interest and predicted risks of
all individuals. Predicted risks can be obtained from the
functions fitLogRegModel
and predRisk
or
be imported from other packages or methods.
The function creates a calibration plot and returns the following measures:
Chi_square |
Chi square value of Hosmer-Lemeshow test |
df |
Degrees of freedom, which is |
p_value |
p-value of Hosmer-Lemeshow test for goodness of fit |
Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S. A comparison of goodness-of-fit tests for the logistic regression model. Stat Med 1997; 16:965-980.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify range of x-axis and y-axis rangeaxis <- c(0,1) # specify number of groups for Hosmer-Lemeshow test groups <- 10 # compute calibration measures and produce calibration plot plotCalibration(data=ExampleData, cOutcome=cOutcome, predRisk=predRisk, groups=groups, rangeaxis=rangeaxis)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify range of x-axis and y-axis rangeaxis <- c(0,1) # specify number of groups for Hosmer-Lemeshow test groups <- 10 # compute calibration measures and produce calibration plot plotCalibration(data=ExampleData, cOutcome=cOutcome, predRisk=predRisk, groups=groups, rangeaxis=rangeaxis)
The function produces box plots of predicted risks for individuals with and without the outcome of interest and calculates the discrimination slope.
plotDiscriminationBox(data, cOutcome, predrisk, labels, plottitle, ylabel, fileplot, plottype)
plotDiscriminationBox(data, cOutcome, predrisk, labels, plottitle, ylabel, fileplot, plottype)
data |
Data frame or matrix that includes the outcome and predictors variables. |
cOutcome |
Column number of the outcome variable. |
predrisk |
Vector of predicted risks. |
labels |
Labels given to the groups of individuals without and with
the outcome of interest. Specification of |
plottitle |
Title of the plot. Specification of |
ylabel |
Label of y-axis. Specification of |
fileplot |
Name of the file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. Foe example, |
The discrimination slope is the difference between the mean predicted risks
of individuals with and without the outcome of interest. Predicted risks
can be obtained using the
fitLogRegModel
and predRisk
or be
imported from other programs. The difference between discrimination
slopes of two separate risk models is equivalent
to (IDI
) which is discussed
in details in the reclassification
function.
The function creates a box plots of predicted risks for individuals with and without the outcome of interest and returns the discrimination slope.
Yates JF. External correspondence: decomposition of the mean probability score. Organizational Behavior and Human Performance 1982;30:132-156.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify labels for the groups without and with the outcome of interest labels <- c("Without disease", "With disease") # produce discrimination box plot plotDiscriminationBox(data=ExampleData, cOutcome=cOutcome, predrisk=predRisk, labels=labels)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify labels for the groups without and with the outcome of interest labels <- c("Without disease", "With disease") # produce discrimination box plot plotDiscriminationBox(data=ExampleData, cOutcome=cOutcome, predrisk=predRisk, labels=labels)
The function creates a plot of cumulative percentage of individuals to the predicted risks.
plotPredictivenessCurve(predrisk, rangeyaxis, labels, plottitle, xlabel, ylabel, fileplot, plottype)
plotPredictivenessCurve(predrisk, rangeyaxis, labels, plottitle, xlabel, ylabel, fileplot, plottype)
predrisk |
Vector of predicted risk. When multiple curves need to
be presented in one plot, specify multiple vectors of predicted
risks as |
rangeyaxis |
Range of the y axis. Default |
labels |
Label(s) given to the predictiveness curve(s). Specification of |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis. Specification of |
ylabel |
Label of y-axis. Specification of |
fileplot |
Name of the output file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. For example, |
The Predictiveness curve is a plot of cumulative percentage
of individuals to the predicted risks. Cumulative percentage indicates
the percentage of individual that has a predicted risk equal or lower
than the risk value.
Predicted risks can be obtained using the functions
fitLogRegModel
and predRisk
or be imported from other methods or packages.
The function creates a predictiveness curve.
Pepe MS, Feng Z, Huang Y, et al. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 2008;167:362-368.
# specify dataset with outcome and predictor variables data(ExampleData) # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify range of y-axis rangeyaxis <- c(0,1) # specify labels of the predictiveness curves labels <- c("without genetic factors", "with genetic factors") # produce predictiveness curves plotPredictivenessCurve(predrisk=cbind(predRisk1,predRisk2), rangeyaxis=rangeyaxis, labels=labels)
# specify dataset with outcome and predictor variables data(ExampleData) # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify range of y-axis rangeyaxis <- c(0,1) # specify labels of the predictiveness curves labels <- c("without genetic factors", "with genetic factors") # produce predictiveness curves plotPredictivenessCurve(predrisk=cbind(predRisk1,predRisk2), rangeyaxis=rangeyaxis, labels=labels)
Function to plot posterior risks against prior risks.
plotPriorPosteriorRisk(data, priorrisk, posteriorrisk, cOutcome, plottitle, xlabel, ylabel, rangeaxis, plotAll=TRUE, labels, filename, fileplot, plottype)
plotPriorPosteriorRisk(data, priorrisk, posteriorrisk, cOutcome, plottitle, xlabel, ylabel, rangeaxis, plotAll=TRUE, labels, filename, fileplot, plottype)
data |
Data frame or matrix that includes the outcome and predictors variables. |
priorrisk |
Vector of predicted risks based on initial model. |
posteriorrisk |
Vector of predicted risks based on updated model. |
cOutcome |
Column number of the outcome variable. |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis. Specification of |
ylabel |
Label of y-axis. Specification of |
rangeaxis |
Range of x-axis and y-axis. Specification of |
plotAll |
|
labels |
Labels given to the groups of individuals without and with
the outcome of interest. Default |
filename |
Name of the output file in which prior and posterior
risks for each individual with the outcome will be saved. If no directory is
specified, the file is saved in the working directory as a txt file.
When no |
fileplot |
Name of the output file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. For example, |
The function creates a plot of posterior risks (predicted risks using
the updated model) against prior risks (predicted risks using the initial
model). Predicted risks can be obtained using the functions
fitLogRegModel
and predRisk
or be
imported from other packages or methods.
The function creates a plot of posterior risks against prior risks.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify label of x-axis xlabel <- "Prior risk" # specify label of y-axis ylabel <- "Posterior risk" # specify title for the plot titleplot <- "Prior versus posterior risk" # specify range of the x-axis and y-axis rangeaxis <- c(0,1) # labels given to the groups without and with the outcome of interest labels<- c("without outcome", "with outcome") # produce prior risks and posterior risks plot plotPriorPosteriorRisk(data=ExampleData, priorrisk=predRisk1, posteriorrisk=predRisk2, cOutcome=cOutcome, xlabel=xlabel, ylabel=ylabel, rangeaxis=rangeaxis, plotAll=TRUE, plottitle=titleplot, labels=labels)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify label of x-axis xlabel <- "Prior risk" # specify label of y-axis ylabel <- "Posterior risk" # specify title for the plot titleplot <- "Prior versus posterior risk" # specify range of the x-axis and y-axis rangeaxis <- c(0,1) # labels given to the groups without and with the outcome of interest labels<- c("without outcome", "with outcome") # produce prior risks and posterior risks plot plotPriorPosteriorRisk(data=ExampleData, priorrisk=predRisk1, posteriorrisk=predRisk2, cOutcome=cOutcome, xlabel=xlabel, ylabel=ylabel, rangeaxis=rangeaxis, plotAll=TRUE, plottitle=titleplot, labels=labels)
Function to plot histogram of risks separated for individuals with and without the outcome of interest.
plotRiskDistribution(data, cOutcome, risks, interval, rangexaxis, rangeyaxis, plottitle, xlabel, ylabel, labels, fileplot, plottype)
plotRiskDistribution(data, cOutcome, risks, interval, rangexaxis, rangeyaxis, plottitle, xlabel, ylabel, labels, fileplot, plottype)
data |
Data frame or numeric matrix that includes the outcome and predictor variables. |
cOutcome |
Column number of the outcome variable. |
risks |
Risk of each individual. It is specified by either a vector of risk scores or a vector of predicted risks. |
interval |
Size of the risk intervals. For example, |
rangexaxis |
Range of the x-axis. Specification of |
rangeyaxis |
Range of the y-axis. |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis. Specification of |
ylabel |
Label of y-axis. Specification of |
labels |
Labels given to the groups of individuals without and
with the outcome of interest. Specification of |
fileplot |
Name of the output file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. For example, |
The function creates the histogram of risks separated for individuals with and without the outcome of interest.
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify the size of each interval interval <- .05 # specify label of x-axis xlabel <- "Predicted risk" # specify label of y-axis ylabel <- "Percentage" # specify range of x-axis xrange <- c(0,1) # specify range of y-axis yrange <- c(0,40) # specify title for the plot maintitle <- "Distribution of predicted risks" # specify labels labels <- c("Without outcome", "With outcome") # produce risk distribution plot plotRiskDistribution(data=ExampleData, cOutcome=cOutcome, risks=predRisk, interval=interval, plottitle=maintitle, rangexaxis=xrange, rangeyaxis=yrange, xlabel=xlabel, ylabel=ylabel, labels=labels)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify the size of each interval interval <- .05 # specify label of x-axis xlabel <- "Predicted risk" # specify label of y-axis ylabel <- "Percentage" # specify range of x-axis xrange <- c(0,1) # specify range of y-axis yrange <- c(0,40) # specify title for the plot maintitle <- "Distribution of predicted risks" # specify labels labels <- c("Without outcome", "With outcome") # produce risk distribution plot plotRiskDistribution(data=ExampleData, cOutcome=cOutcome, risks=predRisk, interval=interval, plottitle=maintitle, rangexaxis=xrange, rangeyaxis=yrange, xlabel=xlabel, ylabel=ylabel, labels=labels)
This function is used to make a plot of predicted risks against risk scores.
plotRiskscorePredrisk(data, riskScore, predRisk, plottitle, xlabel, ylabel, rangexaxis, rangeyaxis, filename, fileplot, plottype)
plotRiskscorePredrisk(data, riskScore, predRisk, plottitle, xlabel, ylabel, rangexaxis, rangeyaxis, filename, fileplot, plottype)
data |
Data frame or matrix that includes the outcome and predictors variables. |
riskScore |
Vector of (weighted or unweighted) genetic risk scores. |
predRisk |
Vector of predicted risks. |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis. Specification of |
ylabel |
Label of y-axis. Specification of |
rangexaxis |
Range of the x axis. Specification of |
rangeyaxis |
Range of the y axis. Specification of |
filename |
Name of the output file in which risk scores and
predicted risks for each individual will be saved. If no directory is
specified, the file is saved in the working directory as a txt file.
When no |
fileplot |
Name of the output file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. For example, |
The function creates a plot of predicted risks against risk scores.
Predicted risks can be obtained using the functions
fitLogRegModel
and predRisk
or be imported from other methods or packages.
The function riskScore
can be
used to compute unweighted or weighted risk scores.
The function creates a plot of predicted risks against risk scores.
# specify dataset with outcome and predictor variables data(ExampleData) # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify column numbers of genetic predictors cGenPred <- c(11:16) # function to compute unweighted genetic risk scores riskScore <- riskScore(weights=riskmodel, data=ExampleData, cGenPreds=cGenPred, Type="unweighted") # specify range of x-axis rangexaxis <- c(0,12) # specify range of y-axis rangeyaxis <- c(0,1) # specify label of x-axis xlabel <- "Risk score" # specify label of y-axis ylabel <- "Predicted risk" # specify title for the plot plottitle <- "Risk score versus predicted risk" # produce risk score-predicted risk plot plotRiskscorePredrisk(data=ExampleData, riskScore=riskScore, predRisk=predRisk, plottitle=plottitle, xlabel=xlabel, ylabel=ylabel, rangexaxis=rangexaxis, rangeyaxis=rangeyaxis)
# specify dataset with outcome and predictor variables data(ExampleData) # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # obtain predicted risks predRisk <- predRisk(riskmodel) # specify column numbers of genetic predictors cGenPred <- c(11:16) # function to compute unweighted genetic risk scores riskScore <- riskScore(weights=riskmodel, data=ExampleData, cGenPreds=cGenPred, Type="unweighted") # specify range of x-axis rangexaxis <- c(0,12) # specify range of y-axis rangeyaxis <- c(0,1) # specify label of x-axis xlabel <- "Risk score" # specify label of y-axis ylabel <- "Predicted risk" # specify title for the plot plottitle <- "Risk score versus predicted risk" # produce risk score-predicted risk plot plotRiskscorePredrisk(data=ExampleData, riskScore=riskScore, predRisk=predRisk, plottitle=plottitle, xlabel=xlabel, ylabel=ylabel, rangexaxis=rangexaxis, rangeyaxis=rangeyaxis)
The function produces ROC curve and corresponding AUC value with 95% CI. The function can plot one or multiple ROC curves in a single plot.
plotROC(data, cOutcome, predrisk, labels, plottitle, xlabel, ylabel, fileplot, plottype)
plotROC(data, cOutcome, predrisk, labels, plottitle, xlabel, ylabel, fileplot, plottype)
data |
Data frame or matrix that includes the outcome and predictors variables. |
cOutcome |
Column number of the outcome variable. |
predrisk |
Vector of predicted risk. When multiple curves need to
be presented in one plot, specify multiple vectors of predicted
risks as |
labels |
Label(s) given to the ROC curve(s). Specification of |
plottitle |
Title of the plot. Specification of |
xlabel |
Label of x-axis. Specification of |
ylabel |
Label of y-axis. Specification of |
fileplot |
Name of the output file that contains the plot. The file is
saved in the working directory in the format specified under |
plottype |
The format in which the plot is saved. Available formats are
wmf, emf, png, jpg, jpeg, bmp, tif, tiff, ps,
eps or pdf. For example, |
The function requirs predicted risks or risk scores and the outcome of
interest for all individuals.
Predicted risks can be obtained using the functions
fitLogRegModel
and predRisk
or be imported from other methods or packages.
The function creates ROC plot and returns AUC value with 95% CI.
Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
Tobias Sing, Oliver Sander, Niko Beerenwinkel, Thomas Lengauer. ROCR: visualizing classifier performance in R. Bioinformatics 2005;21(20):3940-3941.
predRisk
, plotRiskDistribution
# specify the arguments in the function to produce ROC plot # specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify label of the ROC curve labels <- c("without genetic factors", "with genetic factors") # produce ROC curve plotROC(data=ExampleData, cOutcome=cOutcome, predrisk=cbind(predRisk1,predRisk2), labels=labels)
# specify the arguments in the function to produce ROC plot # specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify label of the ROC curve labels <- c("without genetic factors", "with genetic factors") # produce ROC curve plotROC(data=ExampleData, cOutcome=cOutcome, predrisk=cbind(predRisk1,predRisk2), labels=labels)
Function to compute predicted risks for all individuals in the (new)dataset.
predRisk(riskModel, data, cID, filename)
predRisk(riskModel, data, cID, filename)
riskModel |
Name of logistic regression model that can be fitted using
the function |
data |
Data frame or matrix that includes the ID number and predictor variables. |
cID |
Column number of ID variable. The ID number and predicted risks
will be saved under |
filename |
Name of the output file in which the ID number and
estimated predicted risks will be saved. The file is saved in the working
directory as a txt file. Example: filename="name.txt". When no |
The function computes predicted risks from a specified logistic regression model.
The function fitLogRegModel
can be used to construct such a model.
The function returns a vector of predicted risks.
fitLogRegModel
, plotCalibration
,
plotROC
, plotPriorPosteriorRisk
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # specify column number of ID variable cID <- 1 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # obtain predicted risks predRisk <- predRisk(riskModel=riskmodel)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # specify column number of ID variable cID <- 1 # specify column numbers of non-genetic predictors cNonGenPred <- c(3:10) # specify column numbers of non-genetic predictors that are categorical cNonGenPredCat <- c(6:8) # specify column numbers of genetic predictors cGenPred <- c(11,13:16) # specify column numbers of genetic predictors that are categorical cGenPredCat <- c(0) # fit logistic regression model riskmodel <- fitLogRegModel(data=ExampleData, cOutcome=cOutcome, cNonGenPreds=cNonGenPred, cNonGenPredsCat=cNonGenPredCat, cGenPreds=cGenPred, cGenPredsCat=cGenPredCat) # obtain predicted risks predRisk <- predRisk(riskModel=riskmodel)
The function creates a reclassification table and provides statistics.
reclassification(data, cOutcome, predrisk1, predrisk2, cutoff)
reclassification(data, cOutcome, predrisk1, predrisk2, cutoff)
data |
Data frame or matrix that includes the outcome and predictors variables. |
cOutcome |
Column number of the outcome variable. |
predrisk1 |
Vector of predicted risks of all individuals using initial model. |
predrisk2 |
Vector of predicted risks of all individuals using updated model. |
cutoff |
Cutoff values for risk categories.
Define the cut-off values as |
The function creates a reclassification table and computes the
categorical and continuous net reclassification improvement (NRI
) and
integrated discrimination improvement (IDI
). A reclassification table
indicates the number of individuals who move to another risk category or remain
in the same risk category as a result of updating the risk model. Categorical NRI
equal to
x%
means that compared with individuals without outcome,
individuals with outcome were almost x%
more likely to move up a category than down.
The function also computes continuous NRI
, which does not require any discrete
risk categories and relies on the proportions of individuals with outcome
correctly assigned a higher probability and individuals without outcome
correctly assigned a lower probability by an updated model compared with the
initial model.
IDI
equal to x%
means that the difference in average
predicted risks between the individuals with and without the outcome
increased by x%
in the updated model.
The function requires predicted risks estimated by using two separate risk
models. Predicted risks can be obtained using the functions
fitLogRegModel
and predRisk
or be imported from other methods or packages. p-values in NRI and IDI were rounded upto five decimal places.
The function returns the reclassification table, separately for individuals with and without the outcome of interest and the following measures:
NRI (Categorical) |
Categorical Net Reclassification Improvement with 95% CI and |
NRI (Continuous) |
Continuous Net Reclassification Improvement with 95% CI and |
IDI |
Integrated Discrimination Improvement with 95% CI and |
Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007;115(7):928-935.
Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 2008;27(2):157-172; discussion 207-212.
plotDiscriminationBox
, predRisk
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify cutoff values for risk categories cutoff <- c(0,.10,.30,1) # compute reclassification measures reclassification(data=ExampleData, cOutcome=cOutcome, predrisk1=predRisk1, predrisk2=predRisk2, cutoff)
# specify dataset with outcome and predictor variables data(ExampleData) # specify column number of the outcome variable cOutcome <- 2 # fit logistic regression models # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel1 <- ExampleModels()$riskModel1 riskmodel2 <- ExampleModels()$riskModel2 # obtain predicted risks predRisk1 <- predRisk(riskmodel1) predRisk2 <- predRisk(riskmodel2) # specify cutoff values for risk categories cutoff <- c(0,.10,.30,1) # compute reclassification measures reclassification(data=ExampleData, cOutcome=cOutcome, predrisk1=predRisk1, predrisk2=predRisk2, cutoff)
The function computes unweighted or weighted genetic risk scores. The relative effects (or weights) of genetic variants can either come from beta coefficients of a risk model or from a vector of beta coefficients imported into R, e.g., when beta cofficients are obtained from meta-analysis.
riskScore(weights, data, cGenPreds, Type)
riskScore(weights, data, cGenPreds, Type)
weights |
The vector that includes the weights given to the genetic variants. See details for more informations. |
data |
Data frame or matrix that includes the outcome and predictors variables. |
cGenPreds |
Column numbers of the genetic variables on the basis of which the risk score is computed. |
Type |
Specification of the type of risk scores that will be computed.
Type can be weighted ( |
The function calculates unweighted or weighted genetic risk scores. The unweighted genetic risk score is a simple risk allele count assuming that all alleles have the same effect. For this calculation, it is required that the genetic variables are coded as the number of risk alleles. Beta coefficients are used to determine which allele is the risk allele. When the sign of the beta coefficient is negative, the allele coding is reversed. The weighted risk score is a sum of the number of risk alleles multiplied by their beta coefficients.
The beta coefficients can come from two different sources, either beta coefficients of a risk model
or a vector of beta coefficients imported into R, e.g., when beta cofficients are obtained from meta-analysis.
This vector of beta coefficients
should be a named vector containing the same names as mentioned in genetic variants.
A logistic regression model can be constructed using fitLogRegModel
from this package.
The function returns a vector of risk scores.
When a vector of beta coefficients is imported, it should be checked
whether the DNA strands and the coding of the risk alleles are the same
as in the study data. The functions are available in the package GenABEL
to accurately compute risk scores when the DNA strands are different or the risk
alleles are coded differently in the study data and the data used in meta-analysis.
plotRiskDistribution
, plotRiskscorePredrisk
# specify dataset with outcome and predictor variables data(ExampleData) # specify column numbers of genetic predictors cGenPred <- c(11:16) # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # compute unweighted risk scores riskScore <- riskScore(weights=riskmodel, data=ExampleData, cGenPreds=cGenPred, Type="unweighted")
# specify dataset with outcome and predictor variables data(ExampleData) # specify column numbers of genetic predictors cGenPred <- c(11:16) # fit a logistic regression model # all steps needed to construct a logistic regression model are written in a function # called 'ExampleModels', which is described on page 4-5 riskmodel <- ExampleModels()$riskModel2 # compute unweighted risk scores riskScore <- riskScore(weights=riskmodel, data=ExampleData, cGenPreds=cGenPred, Type="unweighted")
Construct a dataset that contains individual genotype data, genetic risk, and disease status for a hypothetical population. The dataset is constructed using simulation in such a way that the frequencies and odds ratios (ORs) of the genetic variants and the population disease risk computed from this dataset are the same as specified by the input parameters.
simulatedDataset(ORfreq, poprisk, popsize, filename)
simulatedDataset(ORfreq, poprisk, popsize, filename)
ORfreq |
Matrix with ORs and frequencies of the genetic variants. The matrix contains four columns in which the first two describe ORs and the last two describe the corresponding frequencies. The number of rows in this matrix is same as the number of genetic variants included. Genetic variants can be specified as per genotype, per allele, or as dominant/ recessive effect of the risk allele. When per genotype data are used, OR of the heterozygous and homozygous risk genotypes are mentioned in the first two columns and the corresponding genotype frequencies are mentioned in the last two columns. When per allele data are used, the OR and frequency of the risk allele are specified in the first and third column and the remaining two cells are coded as '1'. Similarly, when dominant/ recessive effects of the risk alleles are used, the OR and frequency of the dominant/ recessive variant are specified in the first and third column, and the remaining two cells are coded as '0'. |
poprisk |
Population disease risk (expressed in proportion). |
popsize |
Total number of individuals included in the dataset. |
filename |
Name of the file in which the dataset will be saved. The file is saved in the working directory as a txt file. When no filename is specified, the output is not saved. |
The function will execute when the matrix with odds ratios and frequencies,
population disease risk and the number of individuals are specified.
The simulation method is described in detail in the references.
The method assumes that (i) the combined effect of the genetic variants on disease risk follows a multiplicative (log additive) risk model; (ii) genetic variants inherit independently, that is no linkage disequilibrium between the variants; (iii) genetic variants have independent effects on the disease risk, which indicates no interaction among variants; and (iv) all genotypes and allele proportions are in Hardy-Weinberg equilibrium. Assumption (ii) and (iv) are used to generate the genotype data, and assumption (ii) and (iii) are used to calculate disease risk.
Simulating the dataset involves three steps: (1) modelling genotype data, (2) modelling disease risks, and (3) modelling disease status. Brief descriptions of these steps are as follows:
(1) Modelling genotype data: For each variant the genotype frequencies are either specified or calculated from the allele frequencies using Hardy-Weinberg equilibrium. Then, the genotypes for each genetic variant are randomly distributed without replacement over all individuals.
(2) Modelling disease risks: For the calculation of the individual disease risk, Bayes' theorem is used, which states that the posterior odds of disease are obtained by multiplying the prior odds by the likelihood ratio (LR) of the individual genotype data. The prior odds are calculated from the population disease risk or disease prevalence (prior odds= prior risk/ (1- prior risk)) and the posterior odds are converted back into disease risk (disease risk= posterior odds/ (1+ posterior odds)). Under the no linkage disequilibrium (LD) assumption, the LR of a genetic profile is obtained by multiplying the LRs of the single genotypes that are included in the risk model. The LR of a single genotype is calculated using frequencies and ORs of genetic variants and population disease risk. See references for more details.
(3) Modelling disease status: To model disease status, we used a procedure that compares the estimated disease risk of each subject to a randomly drawn value between 0 and 1 from a uniform distribution. A subject is assigned to the group who will develop the disease when the disease risk is higher than the random value and to the group who will not develop the disease when the risk is lower than the random value.
This procedure ensures that for each genomic profile, the percentage of people who will develop the disease equals the population disease risk associated with that profile, when the subgroup of individuals with that profile is sufficiently large.
The function returns:
Dataset |
A data frame or matrix that includes genotype data, genetic risk and disease status for a hypothetical population. The dataset contains (4 + number of genetic variants included) columns, in which the first column is the un-weighted risk score, which is the sum of the number of risk alleles for each individual, the third column is the estimated genetic risk, the forth column is the individual disease status expressed as '0' or '1', indicating without or with the outcome of interest, and the fifth until the end column are genotype data for the variants expressed as '0', '1' or '2', which indicate the number of risk alleles present in each individual for the genetic variants. |
Janssens AC, Aulchenko YS, Elefante S, Borsboom GJ, Steyerberg EW, van Duijn CM. Predictive testing for complex diseases using multiple genes: fact or fiction? Genet Med. 2006;8:395-400.
Kundu S, Karssen LC, Janssens AC: Analytical and simulation methods for estimating the potential predictive ability of genetic profiling: a comparison of methods and results. Eur J Hum Genet. 2012 May 30.
van Zitteren M, van der Net JB, Kundu S, Freedman AN, van Duijn CM, Janssens AC. Genome-based prediction of breast cancer risk in the general population: a modeling study based on meta-analyses of genetic associations. Cancer Epidemiol Biomarkers Prev. 2011;20:9-22.
van der Net JB, Janssens AC, Sijbrands EJ, Steyerberg EW. Value of genetic profiling for the prediction of coronary heart disease. Am Heart J. 2009;158:105-10.
Janssens AC, Moonesinghe R, Yang Q, Steyerberg EW, van Duijn CM, Khoury MJ. The impact of genotype frequencies on the clinical validity of genomic profiling for predicting common chronic diseases. Genet Med. 2007;9:528-35.
# specify the matrix containing the ORs and frequencies of genetic variants # In this example we used per allele effects of the risk variants ORfreq<-cbind(c(1.35,1.20,1.24,1.16), rep(1,4), c(.41,.29,.28,.51),rep(1,4)) # specify the population disease risk popRisk <- 0.3 # specify size of hypothetical population popSize <- 10000 # Obtain the simulated dataset Data <- simulatedDataset(ORfreq=ORfreq, poprisk=popRisk, popsize=popSize) # Obtain the AUC and produce ROC curve plotROC(data=Data, cOutcome=4, predrisk=Data[,3])
# specify the matrix containing the ORs and frequencies of genetic variants # In this example we used per allele effects of the risk variants ORfreq<-cbind(c(1.35,1.20,1.24,1.16), rep(1,4), c(.41,.29,.28,.51),rep(1,4)) # specify the population disease risk popRisk <- 0.3 # specify size of hypothetical population popSize <- 10000 # Obtain the simulated dataset Data <- simulatedDataset(ORfreq=ORfreq, poprisk=popRisk, popsize=popSize) # Obtain the AUC and produce ROC curve plotROC(data=Data, cOutcome=4, predrisk=Data[,3])