Title: | Analysis of High-Dimensional Categorical Data Such as SNP Data |
---|---|
Description: | Tools for the analysis of high-dimensional data developed/implemented at the group "Statistical Complexity Reduction In Molecular Epidemiology" (SCRIME). Main focus is on SNP data. But most of the functions can also be applied to other types of categorical data. |
Authors: | Holger Schwender, with a contribution of Arno Fritsch |
Maintainer: | Holger Schwender <[email protected]> |
License: | GPL-2 |
Version: | 1.3.5 |
Built: | 2024-10-31 20:30:18 UTC |
Source: | CRAN |
Computes the Approximate Bayes Factor proposed by Wakefield (2009) for test statistics theta / sqrt(V)
that under the null hypothesis are assumed to follow an asymptotic standard normal distribution.
abf(theta, V, W, numerator = 0, pi1 = NA)
abf(theta, V, W, numerator = 0, pi1 = NA)
theta |
a vector of numeric values, e.g., the maximum likelihood estimates for the parameter of
a logistic regression model computed by separately applying this simple logistic regression to several SNPs.
It is thus assumed that under the null hypothesis |
V |
a vector of the same length as |
W |
the prior variance. Must be either a positive value or a vector of the same length as |
numerator |
either 0 or 1, specifying whether the numerator of the approximate Bayes factor comprises the probability for the null hypothesis or the probability for the alternative hypothesis. |
pi1 |
either a numeric value between 0 and 1 specifying the prior probability of association or a vector of the
same length as |
If pi1 = NA
, a vector of the same length as theta
containing the values of the approximate Bayes factor.
If pi1
is specified, a list consisting of
ABF |
a numeric vector containing the values of the approximate Bayes factors, |
priorOdds |
either a numeric value or a numeric vector comprising the prior odds of association (if |
postOdds |
a numeric vector containing the posterior odds of association (if |
and either
BFDP |
a numeric vector containing the Bayesian False Discovery Probabilities for the SNPs (if |
or
PPA |
a numeric vector comprising the posterior probabilities of association (if |
Holger Schwender, [email protected]
Wakefield, J. (2007). A Bayesian Measure of Probability of False Discovery in Genetic Epidemiology Studies. American Journal of Human Genetics, 81, 208-227.
For an MCMC sample of Bayesian logic regression models obtained with fblr
the distribution of the model size and the most common logic predictors with up
to three binaries are reported.
analyse.models(file, size.freq = TRUE, moco = c(20, 10), int.freq = TRUE, kmax = 10, int.level = 2, bin.names = NULL)
analyse.models(file, size.freq = TRUE, moco = c(20, 10), int.freq = TRUE, kmax = 10, int.level = 2, bin.names = NULL)
file |
character string naming file where MCMC output of |
size.freq |
determines whether distribution of model size is reported as frequencies (default) or proportions. |
moco |
a vector of length 2 or 3 that determines how many of the most common main effects, two-factor interactions and (possibly) three-factor interactions are reported. |
int.freq |
determines whether the number (default) or the proportion of models containing a specific interaction is reported. |
kmax |
the maximum number of allowed logic predictors used in |
int.level |
the maximum number of allowed binaries in a logic predictor
used in |
bin.names |
character vector of names for the binary variables. If no names are supplied, binaries are referred to with their indices. |
The logic regression models visited during the MCMC run are stored by
fblr
in the rows of a matrix in the following fashion:
Position 1 contains the number of logic predictors in the model. The next
kmax * (int.level + 1)
positions contain the predictors, each predictor being
coded as c(number of binaries in predictor, indices of binaries), where negative
indices denote the complement of a variable. It follow the log-likelihood of
the model, the value of the precision of the regression parameters and the
kmax+1
regression parameters. Zeros indicate empty entries.
analyse.models
extracts some of the most interesting information, namely
which logic predictors occur most often in the visited models, from the sample.
The complement of a binary is indicated with a minus sign preceding its name.
size |
table of model sizes. |
ones |
table of the |
twos |
table of the |
threes |
table of the |
Arno Fritsch, [email protected]
## Not run: # Use fblr on some simulated SNP data snp <- matrix(rbinom(500*20,2,0.3),ncol=20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) fblr(y, bin, niter=1000, nburn=0) analyse.models("fblr_mcmc.txt") # with SNP names name.snp <- LETTERS[1:20] name.bin <- paste(rep(name.snp,each=2), c("_d","_r"),sep="") analyse.models("fblr_mcmc.txt", bin.names=name.bin) ## End(Not run)
## Not run: # Use fblr on some simulated SNP data snp <- matrix(rbinom(500*20,2,0.3),ncol=20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) fblr(y, bin, niter=1000, nburn=0) analyse.models("fblr_mcmc.txt") # with SNP names name.snp <- LETTERS[1:20] name.bin <- paste(rep(name.snp,each=2), c("_d","_r"),sep="") analyse.models("fblr_mcmc.txt", bin.names=name.bin) ## End(Not run)
Constructs a data frame from a metadata package containing annotations for the SNPs from the corresponding Affymetrix SNP Chip.
buildSNPannotation(pkg, rs = TRUE, allele = TRUE, gene = TRUE, chromosome = FALSE, position = FALSE, strand = FALSE, cytoband = FALSE, max.genes = 0, lib.loc = NULL, others = NULL, subset = NULL, pattern = NULL, na.rm = TRUE)
buildSNPannotation(pkg, rs = TRUE, allele = TRUE, gene = TRUE, chromosome = FALSE, position = FALSE, strand = FALSE, cytoband = FALSE, max.genes = 0, lib.loc = NULL, others = NULL, subset = NULL, pattern = NULL, na.rm = TRUE)
pkg |
the name of the metadata package from which the data frame containing the annotations of the SNPs should be generated. |
rs |
should the RefSNP-ID of the SNPs be added to the data frame? |
allele |
should the two alleles of each SNP be added to the data frame? |
gene |
should the genes associated with the SNPs be added to the data frame? |
chromosome |
should the chromosome to which the respective SNP belongs be added to the data frame? |
position |
should the physical positions of the SNPs be added to the data frame? |
strand |
should the strands be added to the data frame? |
cytoband |
logical indicating whether the cytoband of each SNP is added to the data frame. |
max.genes |
integer specifying the maximum number of genes associated with the respective
SNP that should be stored in the data frame. By default, all entries are considered. The corresponding column
of the data frame can also be shortened afterwards using |
lib.loc |
the directory in which the metadata package is stored. Needs only to be specified if it is not stored in the usual directory of the packages. |
others |
character string or vector naming other entries of the object |
subset |
character string consisting of the probe set IDs of the SNPs for which the data
frame should be generated. The data frame will contain all SNPs if |
pattern |
character string specifying the pattern of the probe set IDs of the SNPs for which
the data frame should be generated. For example, |
na.rm |
should the rows of the data frame corresponding to SNPs specified by |
A data frame composed of annotations for the SNPs for which information is available in the specified metadata package.
Holger Schwender, [email protected]
Performs the likelihood ratio test for epistatic interactions proposed by Cordell (2002) for case-control data, where
colEpistatic
assumes that each column represents a SNP, and rowEpistatic
assumes that each row represents a SNP.
colEpistatic(mat.snp, cl, genes = NULL, warnError = TRUE) rowEpistatic(mat.snp, cl, genes = NULL, warnError = TRUE) ## S3 method for class 'colEpi' print(x, top = 5, digits = 4, ...)
colEpistatic(mat.snp, cl, genes = NULL, warnError = TRUE) rowEpistatic(mat.snp, cl, genes = NULL, warnError = TRUE) ## S3 method for class 'colEpi' print(x, top = 5, digits = 4, ...)
mat.snp |
a matrix containing genotype data, where the genotypes of each SNPs need to be coded by the number of minor alleles,
i.e. 0, 1, and 2. Missing values are allowed. For |
cl |
a numeric vector of ones and zeros specifying which of the subjects in |
genes |
a character vector containing the names of the genes (or, e.g., LD-blocks or pathways) to which the SNPs belong. If specified,
only the two-way interactions between SNPs from different genetic sets (e.g., genes, LD-blocks, or pathways) are tested.
If |
warnError |
logical indicating whether the statistics for the gTDT for pairs of SNPs should be returned as |
x |
an object of class |
top |
number of interactions that should be printed. If |
digits |
number of digits that should be printed. |
... |
ignored. |
An object of class colEpi
consisting of
ll.main |
a numeric vector containing the values of the maximized loglikelihoods of the logistic regression models considering only main effects, |
ll.full |
a numeric vector containing the values the maximized loglikelihoods of the logistic regression models additionally containing interaction terms, |
stat |
a vector comprising the values of the test statistic, |
pval |
a vector comprising the corresponding p-values, |
genes |
if |
vec.error |
if |
Holger Schwender, [email protected]
Cordell, H. J. (2002). Epistasis: What it Means, what it Doesn't mean, and Statistical Methods to Detect it in Humans. Human Molecular Genetics, 11, 2463-2468.
Computes a contingency table for each pair of rows of a matrix, and stores all contigency table in a matrix.
computeContCells(data, computeExp = TRUE, justDiag = FALSE, check = TRUE, n.cat = NULL)
computeContCells(data, computeExp = TRUE, justDiag = FALSE, check = TRUE, n.cat = NULL)
data |
a numeric matrix consisting of integers between 1 and |
computeExp |
should the numbers of observations expected under the null hypothesis that
the respective two variables are independent also be computed? Required
when |
justDiag |
should only the diagonal elements of the contingency tables,
i.e.\ |
check |
should |
n.cat |
integer specifying the maximum number of levels a variable can take. If |
A list consisting of two matrices each consisting of rows and
n.cat
columns, where
is the number of rows of
data
.
One of these matrices
called mat.obs
contains in each row the values of the contingency table for
a particular pair of rows of data
, where the contigency table of the variables represented
by the ith and jth row of data
is shown in the
row of
mat.obs
.
The other matrix called mat.exp
consists of
the corresponding numbers of observations expected under the null hypothesis that
the respective two variables are independent.
Holger Schwender, [email protected]
Schwender, H.\ (2007). A Note on the Simultaneous Computation of Thousands of
Pearson's -Statistics. Technical Report, SFB 475,
Deparment of Statistics, University of Dortmund.
computeContClass
, rowChisqStats
## Not run: # Generate an example data set consisting of 5 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3. mat <- matrix(sample(3, 1000, TRUE), 5) # Generate the matrix containing the contingency tables for each # pair of rows of mat. out <- computeContCells(mat) # out contains both the observed numbers of observations # summarized by contingency tables out$mat.obs # and the number of observations expected under the null hypothesis # of independence. out$mat.exp # If, e.g., only the observed number of observations having the same # value is of interest, call computeContCells(mat, computeExp = FALSE, justDiag = TRUE) ## End(Not run)
## Not run: # Generate an example data set consisting of 5 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3. mat <- matrix(sample(3, 1000, TRUE), 5) # Generate the matrix containing the contingency tables for each # pair of rows of mat. out <- computeContCells(mat) # out contains both the observed numbers of observations # summarized by contingency tables out$mat.obs # and the number of observations expected under the null hypothesis # of independence. out$mat.exp # If, e.g., only the observed number of observations having the same # value is of interest, call computeContCells(mat, computeExp = FALSE, justDiag = TRUE) ## End(Not run)
Generates a matrix containing a contingency table for each row of a matrix and a vector of class labels.
computeContClass(data, cl, n.cat)
computeContClass(data, cl, n.cat)
data |
a numeric matrix consisting of integers between 1 and |
cl |
a numeric vector of length |
n.cat |
an integer giving the number of levels the variables can take. If not
specified, |
A list composed of the following two matrices:
mat.obs |
a matrix consisting of |
mat.exp |
a matrix of the same size as |
Holger Schwender, [email protected]
Schwender, H.\ (2007). A Note on the Simultaneous Computation of Thousands of
Pearson's -Statistics. Technical Report, SFB 475,
Deparment of Statistics, University of Dortmund.
computeContCells
, rowChisqStats
## Not run: # Generate an example data set consisting of 10 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3, and a vector of class labels of length 200 # indicating that the first 100 observation belong to class 1 # and the other 100 to class 2. mat <- matrix(sample(3, 2000, TRUE), 10) cl <- rep(1:2, e = 100) # Applying computeContClass to this data set out <- computeContClass(mat, cl) # generates the observed numbers of observations out$mat.obs # and the corresponding expected numbers of observations. out$mat.exp ## End(Not run)
## Not run: # Generate an example data set consisting of 10 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3, and a vector of class labels of length 200 # indicating that the first 100 observation belong to class 1 # and the other 100 to class 2. mat <- matrix(sample(3, 2000, TRUE), 10) cl <- rep(1:2, e = 100) # Applying computeContClass to this data set out <- computeContClass(mat, cl) # generates the observed numbers of observations out$mat.obs # and the corresponding expected numbers of observations. out$mat.exp ## End(Not run)
Performs full Bayesian logic regression for Single Nucleotide Polymorphism (SNP) data as described in Fritsch and Ickstadt (2007).
fblr.weight
allows to incorporate prior pathway information by restricting
search for interactions to specific groups of SNPs and/or giving them different
weights. fblr.weight
is only implemented for an interaction level of 2.
fblr(y, bin, niter, thin = 5, nburn = 10000, int.level = 2, kmax = 10, geo = 1, delta1 = 0.001, delta2 = 0.1, predict = FALSE, file = "fblr_mcmc.txt") fblr.weight(y, bin, niter, thin = 5, nburn = 10000, kmax = 10, geo = 1, delta1 = 0.001, delta2 = 0.1, predict = FALSE, group = NULL, weight = NULL, file = "fblr_mcmc.txt")
fblr(y, bin, niter, thin = 5, nburn = 10000, int.level = 2, kmax = 10, geo = 1, delta1 = 0.001, delta2 = 0.1, predict = FALSE, file = "fblr_mcmc.txt") fblr.weight(y, bin, niter, thin = 5, nburn = 10000, kmax = 10, geo = 1, delta1 = 0.001, delta2 = 0.1, predict = FALSE, group = NULL, weight = NULL, file = "fblr_mcmc.txt")
y |
binary vector indicating case-control status. |
bin |
binary matrix with number of rows equal to |
niter |
number of MCMC iterations after burn-in. |
thin |
after burn-in only every |
nburn |
number of burn-in iterations. |
int.level |
maximum number of binaries allowed in a logic predictor.
Is fixed to 2 for |
kmax |
maximum number of logic predictors allowed in the model. |
geo |
geometric penalty parameter for the number of binaries in a predictor.
Value between 0 and 1. Default is |
delta1 |
shape parameter for hierarchical gamma prior on precision of regression parameters. |
delta2 |
rate parameter for hierarchical gamma prior on precision of regression parameters. |
predict |
should predicted case probabilities be returned? |
file |
character string naming a file to write the MCMC output to. If
|
group |
list containing vectors of indices of binaries that are allowed to interact. Groups may be overlapping, but every binary has to be in at least one group. Groups have to contain at least two binaries. Defaults to NULL, meaning that all interactions are allowed. |
weight |
vector of length |
The MCMC output in file
can be analysed using the function
analyse.models
. In the help of this function it is also described how
the models are stored in file
.
accept |
acceptance rate of MCMC algorithm. |
pred |
vector of predicted case probabilities. Only given if
|
Arno Fritsch, [email protected]
Fritsch, A. and Ickstadt, K.\ (2007). Comparing logic regression based methods for identifying SNP interactions. In Bioinformatics in Research and Development, Hochreiter, S.\ and Wagner, R.\ (Eds.), Springer, Berlin.
## Not run: # SNP dataset with 500 persons and 20 SNPs each, # a two-SNP interaction influences the case probability snp <- matrix(rbinom(500*20,2,0.3),ncol=20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) # normally more iterations should be used fblr(y, bin, niter=1000, nburn=0) analyse.models("fblr_mcmc.txt") # Prior information: SNPs 1-10 belong to genes in one pathway, # SNPs 8-20 to another. Only interactions within a pathway are # considered and the first pathway is deemed to be twice as # important than the second. fblr.weight(y,bin,niter=1000, nburn=0, group=list(1:20, 15:40), weight=c(rep(2,20),rep(1,20))) analyse.models("fblr_mcmc.txt") ## End(Not run)
## Not run: # SNP dataset with 500 persons and 20 SNPs each, # a two-SNP interaction influences the case probability snp <- matrix(rbinom(500*20,2,0.3),ncol=20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) # normally more iterations should be used fblr(y, bin, niter=1000, nburn=0) analyse.models("fblr_mcmc.txt") # Prior information: SNPs 1-10 belong to genes in one pathway, # SNPs 8-20 to another. Only interactions within a pathway are # considered and the first pathway is deemed to be twice as # important than the second. fblr.weight(y,bin,niter=1000, nburn=0, group=list(1:20, 15:40), weight=c(rep(2,20),rep(1,20))) analyse.models("fblr_mcmc.txt") ## End(Not run)
Predicts the classes of new observations with Nearest Neighbors
based on an user-specified distance measure.
gknn(data, cl, newdata, nn = 5, distance = NULL, use.weights = FALSE, ...)
gknn(data, cl, newdata, nn = 5, distance = NULL, use.weights = FALSE, ...)
data |
a numeric matrix in which each row represents an observation and each column
a variable. If |
cl |
a numeric vector of length |
newdata |
a numeric matrix in which each row represents a new observation for
which the class label should be predicted and each column consists of the same
variable as the corresponding column of |
nn |
an integer specifying the number of nearest neighbors used to classify the new observations. |
distance |
character vector naming the distance measure used to identify the
|
use.weights |
should the votes of the nearest neighbors be weighted by the reciprocal of the distances to the new observation when the class of a new observation should be predicted? |
... |
further arguments for the distance measure. If, e.g.,
|
The predicted classes of the new observations.
Holger Schwender, [email protected]
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
## Not run: # Using the example from the function knn. library(class) data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- c(rep(2, 25), rep(1, 25), rep(1, 25)) knn.out <- knn(train, test, as.factor(cl), k = 3, use.all = FALSE) gknn.out <- gknn(train, cl, test, nn = 3) # Both applications lead to the same predictions. knn.out == gknn.out # But gknn allows to use other distance measures than the Euclidean # distance. E.g., the Manhattan distance. gknn(train, cl, test, nn = 3, distance = "manhattan") ## End(Not run)
## Not run: # Using the example from the function knn. library(class) data(iris3) train <- rbind(iris3[1:25,,1], iris3[1:25,,2], iris3[1:25,,3]) test <- rbind(iris3[26:50,,1], iris3[26:50,,2], iris3[26:50,,3]) cl <- c(rep(2, 25), rep(1, 25), rep(1, 25)) knn.out <- knn(train, test, as.factor(cl), k = 3, use.all = FALSE) gknn.out <- gknn(train, cl, test, nn = 3) # Both applications lead to the same predictions. knn.out == gknn.out # But gknn allows to use other distance measures than the Euclidean # distance. E.g., the Manhattan distance. gknn(train, cl, test, nn = 3, distance = "manhattan") ## End(Not run)
Identifies the rows of a matrix that only show one level.
identifyMonomorphism(x)
identifyMonomorphism(x)
x |
a matrix consisting of integers between 1 and |
Numeric vector containing the rows of x
showing only one level.
Holger Schwender, [email protected]
## Not run: # Generate a data set consisting of 10 rows and 15 columns, # where the values are randomly drawn from the integers 1, 2, and 3, # and row 3 and 7 consist only of one level. mat <- matrix(sample(3, 2000, TRUE), 10) mat[3, ] <- 1 mat[7, ] <- 2 identifyMonomorphism(mat) ## End(Not run)
## Not run: # Generate a data set consisting of 10 rows and 15 columns, # where the values are randomly drawn from the integers 1, 2, and 3, # and row 3 and 7 consist only of one level. mat <- matrix(sample(3, 2000, TRUE), 10) mat[3, ] <- 1 mat[7, ] <- 2 identifyMonomorphism(mat) ## End(Not run)
Imputes missing values in a matrix composed of categorical variables
using Nearest Neighbors.
knncatimpute(x, dist = NULL, nn = 3, weights = TRUE)
knncatimpute(x, dist = NULL, nn = 3, weights = TRUE)
x |
a numeric matrix containing missing values. All non-missing values
must be integers between 1 and |
dist |
either a character string naming the distance measure or a distance matrix.
If the former, |
nn |
an integer specifying |
weights |
should weighted |
A matrix of the same size as x
in which all the missing values have been imputed.
Holger Schwender, [email protected]
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
knncatimputeLarge
, gknn
, smc
, pcc
## Not run: # Generate a data set consisting of 200 rows and 50 columns # in which the values are integers between 1 and 3. # Afterwards, remove 20 of the values randomly. mat <- matrix(sample(3, 10000, TRUE), 200) mat[sample(10000, 20)] <- NA # Replace the missing values. mat2 <- knncatimpute(mat) # Replace the missing values using the 5 nearest neighbors # and Cohen's Kappa. mat3 <- knncatimpute(mat, nn = 5, dist = "cohen") ## End(Not run)
## Not run: # Generate a data set consisting of 200 rows and 50 columns # in which the values are integers between 1 and 3. # Afterwards, remove 20 of the values randomly. mat <- matrix(sample(3, 10000, TRUE), 200) mat[sample(10000, 20)] <- NA # Replace the missing values. mat2 <- knncatimpute(mat) # Replace the missing values using the 5 nearest neighbors # and Cohen's Kappa. mat3 <- knncatimpute(mat, nn = 5, dist = "cohen") ## End(Not run)
Imputes missing values in a high-dimensional matrix composed of categorical variables
using Nearest Neighbors.
knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL, nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), n.num = 100, use.weights = TRUE, verbose = FALSE)
knncatimputeLarge(data, mat.na = NULL, fac = NULL, fac.na = NULL, nn = 3, distance = c("smc", "cohen", "snp1norm", "pcc"), n.num = 100, use.weights = TRUE, verbose = FALSE)
data |
a numeric matrix consisting of integers between 1 and Each row of |
mat.na |
a numeric matrix containing missing values. Must have the same number of
columns as |
fac |
a numeric or character vector of length |
fac.na |
a numeric or character vector of length |
nn |
an integer specifying |
distance |
character string naming the distance measure used in |
n.num |
an integer giving the number of rows of |
use.weights |
should weighted |
verbose |
should more information about the progress of the imputation be printed? |
If mat.na = NULL
, then a matrix of the same size as data
in which the missing
values have been replaced. If mat.na
has been specified, then a matrix of the same size as
mat.na
in which the missing values have been replaced.
While in knncatimpute
all variable/rows are considered when replacing
missing values, knncatimputeLarge
only considers the rows with no missing values
when searching for the nearest neighbors.
Holger Schwender, [email protected]
Schwender, H. and Ickstadt, K.\ (2008). Imputing Missing Genotypes with Nearest Neighbors.
Technical Report, SFB 475, Department of Statistics, University of Dortmund. Appears soon.
knncatimpute
, gknn
, smc
, pcc
## Not run: # Generate a data set consisting of 100 columns and 2000 rows (actually, # knncatimputeLarge is made for much larger data sets), where the values # are randomly drawn from the integers 1, 2, and 3. # Afterwards, remove 200 of the observations randomly. mat <- matrix(sample(3, 200000, TRUE), 2000) mat[sample(200000, 20)] <- NA # Apply knncatimputeLarge to mat to remove the missing values. mat2 <- knncatimputeLarge(mat) sum(is.na(mat)) sum(is.na(mat2)) # Now assume that the first 100 rows belong to SNPs from chromosome 1, # the second 100 rows to SNPs from chromosome 2, and so on. chromosome <- rep(1:20, e = 100) # Apply knncatimputeLarge to mat chromosomewise, i.e. only consider # the SNPs that belong to the same chromosome when replacing missing # genotypes. mat4 <- knncatimputeLarge(mat, fac = chromosome) ## End(Not run)
## Not run: # Generate a data set consisting of 100 columns and 2000 rows (actually, # knncatimputeLarge is made for much larger data sets), where the values # are randomly drawn from the integers 1, 2, and 3. # Afterwards, remove 200 of the observations randomly. mat <- matrix(sample(3, 200000, TRUE), 2000) mat[sample(200000, 20)] <- NA # Apply knncatimputeLarge to mat to remove the missing values. mat2 <- knncatimputeLarge(mat) sum(is.na(mat)) sum(is.na(mat2)) # Now assume that the first 100 rows belong to SNPs from chromosome 1, # the second 100 rows to SNPs from chromosome 2, and so on. chromosome <- rep(1:20, e = 100) # Apply knncatimputeLarge to mat chromosomewise, i.e. only consider # the SNPs that belong to the same chromosome when replacing missing # genotypes. mat4 <- knncatimputeLarge(mat, fac = chromosome) ## End(Not run)
Performs a Prediction Analysis of Categorical Data.
pamCat(data, cl, theta = NULL, n.theta = 10, newdata = NULL, newcl = NULL)
pamCat(data, cl, theta = NULL, n.theta = 10, newdata = NULL, newcl = NULL)
data |
a numeric matrix composed of the integers between 1 and |
cl |
a numeric vector of length |
theta |
a numeric vector consisting of the strictly positive values of the shrinkage parameter used
in the Prediction Analysis. If |
n.theta |
an integer specifying the number of values for the shrinkage parameter of the
Prediction Analysis. Ignored if |
newdata |
a numeric matrix composed of the integers between 1 and |
newcl |
a numeric vector of length |
An object of class pamCat
composed of
mat.chisq |
a matrix with |
mat.obs |
a matrix with |
mat.exp |
a matrix of the same size as |
mat.theta |
a data frame consisting of the numbers of variables used in the classification
of the observations in |
tab.cl |
a table summarizing the values of the response, i.e.\ the class labels. |
n.cat |
|
Holger Schwender, [email protected]
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
## Not run: # Generate a data set consisting of 2000 rows (variables) and 50 columns. # Assume that the first 25 observations belong to class 1, and the other # 50 observations to class 2. mat <- matrix(sample(3, 100000, TRUE), 2000) rownames(mat) <- paste("SNP", 1:2000, sep = "") cl <- rep(1:2, e = 25) # Apply PAM for categorical data to this matrix, and compute the # misclassification rate on the training set, i.e. on mat. pam.out <- pamCat(mat, cl) pam.out # Now generate a new data set consisting of 20 observations, # and predict the classes of these observations using the # value of theta that has led to the smallest misclassification # rate in pam.out. mat2 <- matrix(sample(3, 40000, TRUE), 2000) rownames(mat2) <- paste("SNP", 1:2000, sep = "") predict(pam.out, mat2) # Let's assume that the predicted classes are the real classes # of the observations. Then, mat2 can also be used in pamCat # to compute the misclassification rate. cl2 <- predict(pam.out, mat2) pamCat(mat, cl, newdata = mat2, newcl = cl2) ## End(Not run)
## Not run: # Generate a data set consisting of 2000 rows (variables) and 50 columns. # Assume that the first 25 observations belong to class 1, and the other # 50 observations to class 2. mat <- matrix(sample(3, 100000, TRUE), 2000) rownames(mat) <- paste("SNP", 1:2000, sep = "") cl <- rep(1:2, e = 25) # Apply PAM for categorical data to this matrix, and compute the # misclassification rate on the training set, i.e. on mat. pam.out <- pamCat(mat, cl) pam.out # Now generate a new data set consisting of 20 observations, # and predict the classes of these observations using the # value of theta that has led to the smallest misclassification # rate in pam.out. mat2 <- matrix(sample(3, 40000, TRUE), 2000) rownames(mat2) <- paste("SNP", 1:2000, sep = "") predict(pam.out, mat2) # Let's assume that the predicted classes are the real classes # of the observations. Then, mat2 can also be used in pamCat # to compute the misclassification rate. cl2 <- predict(pam.out, mat2) pamCat(mat, cl, newdata = mat2, newcl = cl2) ## End(Not run)
Computes the values of (the corrected) Pearson's contingency coefficient for all pairs of rows of a matrix.
pcc(x, dist = FALSE, corrected = TRUE, version = 1)
pcc(x, dist = FALSE, corrected = TRUE, version = 1)
x |
a numeric matrix consisting of integers between 1 and |
dist |
should the distance based on Pearson's contingency coefficient be computed?
For how this distance is computed, see |
corrected |
should Pearson's contingency coefficient be corrected such that it can
take values between 0 and 1? If not corrected, it takes values between and 0
and |
version |
a numeric value – either 1, 2, or 3 – specifying how the distance is computed.
Ignored if |
A matrix with nrow(x)
columns and rows containing the values of (or distances based on)
the (corrected) Pearson's contigency coefficient for all pairs of rows of x
.
Holger Schwender, [email protected]
## Not run: # Generate a data set consisting of 10 rows and 200 columns, # where the values are randomly drawn from the integers 1, 2, and 3. mat <- matrix(sample(3, 2000, TRUE), 10) # For each pair of rows of mat, the value of the corrected Pearson's # contingency coefficient is then obtained by out1 <- pcc(mat) out1 # and the distances based on this coefficient by out2 <- pcc(mat, dist = TRUE) out2 # Note that if version is set to 1 (default) in pcc, then all.equal(sqrt(1 - out1^2), out2) ## End(Not run)
## Not run: # Generate a data set consisting of 10 rows and 200 columns, # where the values are randomly drawn from the integers 1, 2, and 3. mat <- matrix(sample(3, 2000, TRUE), 10) # For each pair of rows of mat, the value of the corrected Pearson's # contingency coefficient is then obtained by out1 <- pcc(mat) out1 # and the distances based on this coefficient by out2 <- pcc(mat, dist = TRUE) out2 # Note that if version is set to 1 (default) in pcc, then all.equal(sqrt(1 - out1^2), out2) ## End(Not run)
Predicts the classes of new observations based on a Prediction Analysis of Categorical Data.
## S3 method for class 'pamCat' predict(object, newdata, theta = NULL, add.nvar = FALSE, type = c("class", "prob"), ...)
## S3 method for class 'pamCat' predict(object, newdata, theta = NULL, add.nvar = FALSE, type = c("class", "prob"), ...)
object |
an object of class |
newdata |
a numeric matrix consisting of the integers between 1 and |
theta |
a strictly positive numeric value specifying the value of the shrinkage parameter
of the Prediction Analysis that should be used in the class prediction.
If |
add.nvar |
should the number of variables used in the class prediction be added to the output? |
type |
either |
... |
Ignored. |
If add.nvar = FALSE
, the predicted classes or the class probabilities (depending on type
).
Otherwise, a list consisting of
pred |
a vector or matrix containing the predicted classes or the class probabilities, respectively. |
n.var |
the number of variables used in the prediction. |
Holger Schwender, [email protected]
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
## Not run: # Generate a data set consisting of 2000 rows (variables) and 50 columns. # Assume that the first 25 observations belong to class 1, and the other # 50 observations to class 2. mat <- matrix(sample(3, 100000, TRUE), 2000) rownames(mat) <- paste("SNP", 1:2000, sep = "") cl <- rep(1:2, e = 25) # Apply PAM for categorical data to this matrix, and compute the # misclassification rate on the training set, i.e. on mat. pam.out <- pamCat(mat, cl) pam.out # Now generate a new data set consisting of 20 observations, # and predict the classes of these observations using the # value of theta that has led to the smallest misclassification # rate in pam.out. mat2 <- matrix(sample(3, 40000, TRUE), 2000) rownames(mat2) <- paste("SNP", 1:2000, sep = "") predict(pam.out, mat2) # Another theta, say theta = 4, can also be specified. predict(pam.out, mat2, theta = 4) # The class probabilities for each observation can be obtained by predict(pam.out, mat2, theta = 4, type = "prob") ## End(Not run)
## Not run: # Generate a data set consisting of 2000 rows (variables) and 50 columns. # Assume that the first 25 observations belong to class 1, and the other # 50 observations to class 2. mat <- matrix(sample(3, 100000, TRUE), 2000) rownames(mat) <- paste("SNP", 1:2000, sep = "") cl <- rep(1:2, e = 25) # Apply PAM for categorical data to this matrix, and compute the # misclassification rate on the training set, i.e. on mat. pam.out <- pamCat(mat, cl) pam.out # Now generate a new data set consisting of 20 observations, # and predict the classes of these observations using the # value of theta that has led to the smallest misclassification # rate in pam.out. mat2 <- matrix(sample(3, 40000, TRUE), 2000) rownames(mat2) <- paste("SNP", 1:2000, sep = "") predict(pam.out, mat2) # Another theta, say theta = 4, can also be specified. predict(pam.out, mat2, theta = 4) # The class probabilities for each observation can be obtained by predict(pam.out, mat2, theta = 4, type = "prob") ## End(Not run)
Predicts case probabilities for binary data (usually SNP data dichotomized with
snp2bin
) based on an MCMC sample of Bayesian logic regression models
obtained with fblr
.
predictFBLR(file, bin, kmax = 10, int.level = 2)
predictFBLR(file, bin, kmax = 10, int.level = 2)
file |
character string naming file where MCMC sample is stored. |
bin |
matrix of binary variables to make predictions for. One row is one
observation. The number of binary variables has to be the same as used in |
kmax |
the maximum number of allowed logic predictors used in |
int.level |
the maximum number of allowed binaries in a logic predictor
used in |
Vector of length nrow(bin)
with predicted case probabilities.
Arno Fritsch, [email protected]
## Not run: # Use fblr on some simulated SNP data snp <- matrix(rbinom(500 * 20, 2, 0.3), ncol = 20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) fblr(y, bin, niter=1000, nburn=0) # Prediction for some new observations newbin <- snp2bin(matrix(rbinom(100 * 20, 2, 0.3), ncol = 20)) predictFBLR("fblr_mcmc.txt",newbin) ## End(Not run)
## Not run: # Use fblr on some simulated SNP data snp <- matrix(rbinom(500 * 20, 2, 0.3), ncol = 20) bin <- snp2bin(snp) int <- apply(bin,1,function(x) (x[1] == 1 & x[3] == 0)*1) case.prob <- exp(-0.5+log(5)*int)/(1+exp(-0.5+log(5)*int)) y <- rbinom(nrow(snp),1,prob=case.prob) fblr(y, bin, niter=1000, nburn=0) # Prediction for some new observations newbin <- snp2bin(matrix(rbinom(100 * 20, 2, 0.3), ncol = 20)) predictFBLR("fblr_mcmc.txt",newbin) ## End(Not run)
Recodes the values used on Affymetrix SNP chips to code the genotypes to other values – required, e.g., by other functions of this package.
recodeAffySNP(mat, refAA = FALSE, geno = 1:3)
recodeAffySNP(mat, refAA = FALSE, geno = 1:3)
mat |
a matrix or data frame consisting of the character strings |
refAA |
codes |
geno |
a numeric or character vector of length 3 giving the three values that
should be used to recode the genotypes. By default, |
A matrix of the same size as mat
containing the recoded genotypes. (Missing values are
coded by NA
.)
## Not run: # Generate a sample data set consisting of 10 rows and 12 columns, # and randomly replace 5 of the values by "NN". mat <- matrix("", 10, 12) mat[1:5,] <- sample(c("AA", "AB", "BB"), 60, TRUE, prob = c(0.49, 0.42, 0.09)) mat[6:10,] <- sample(c("AA", "AB", "BB"), 60, TRUE, prob = c(0.09, 0.42, 0.49)) mat[sample(120, 5)] <- "NN" mat # Recode the SNPs. recodeAffySNP(mat) # Recode the SNPs assuming that "A" is always the major allele. recodeAffySNP(mat, refAA = TRUE) ## End(Not run)
## Not run: # Generate a sample data set consisting of 10 rows and 12 columns, # and randomly replace 5 of the values by "NN". mat <- matrix("", 10, 12) mat[1:5,] <- sample(c("AA", "AB", "BB"), 60, TRUE, prob = c(0.49, 0.42, 0.09)) mat[6:10,] <- sample(c("AA", "AB", "BB"), 60, TRUE, prob = c(0.09, 0.42, 0.49)) mat[sample(120, 5)] <- "NN" mat # Recode the SNPs. recodeAffySNP(mat) # Recode the SNPs assuming that "A" is always the major allele. recodeAffySNP(mat, refAA = TRUE) ## End(Not run)
Recodes the values used to specify the genotypes of the SNPs to other values. Such a recoding might be required to use other functions contained in this package.
recodeSNPs(mat, first.ref = FALSE, geno = 1:3, snp.in.col = FALSE)
recodeSNPs(mat, first.ref = FALSE, geno = 1:3, snp.in.col = FALSE)
mat |
a matrix or data frame consisting of character strings of length 2 that
specify the genotypes of the SNPs. Each of these character strings
must be a combination of the letters A, T, C, and G. Missing values can
be specified by |
first.ref |
does the first letter in the string coding the heterozygous
genotype always stands for the more frequent allele? E.g., codes |
geno |
a numeric or character vector of length 3 giving the three values that
should be used to recode the genotypes. By default, |
snp.in.col |
does each column of |
A matrix of the same size as mat
containing the recoded genotypes. (Missing values are
coded by NA
).
## Not run: # Generate an example data set consisting of 5 rows and 12 columns, # where it is assumed that each row corresponds to a SNP. mat <- matrix("", 10, 12) mat[c(1, 4, 6),] <- sample(c("AA", "AT", "TT"), 18, TRUE) mat[c(2, 3, 10),] <- sample(c("CC", "CG", "GG"), 18, TRUE) mat[c(5, 8),] <- sample(c("GG", "GT", "TT"), 12, TRUE) mat[c(7, 9),] <- sample(c("AA", "AC", "CC"), 12, TRUE) mat # Recode the SNPs recodeSNPs(mat) # Recode the SNPs by assuming that the first letter in # the heterogyzous genotype refers to the major allele. recodeSNPs(mat, first.ref = TRUE) ## End(Not run)
## Not run: # Generate an example data set consisting of 5 rows and 12 columns, # where it is assumed that each row corresponds to a SNP. mat <- matrix("", 10, 12) mat[c(1, 4, 6),] <- sample(c("AA", "AT", "TT"), 18, TRUE) mat[c(2, 3, 10),] <- sample(c("CC", "CG", "GG"), 18, TRUE) mat[c(5, 8),] <- sample(c("GG", "GT", "TT"), 12, TRUE) mat[c(7, 9),] <- sample(c("AA", "AC", "CC"), 12, TRUE) mat # Recode the SNPs recodeSNPs(mat) # Recode the SNPs by assuming that the first letter in # the heterogyzous genotype refers to the major allele. recodeSNPs(mat, first.ref = TRUE) ## End(Not run)
Given two matrices, each representing one group of subjects (e.g., cases and controls in a case-control study), that summarize the numbers of subjects showing the different (ordered) levels of the ordinal variables represented by the rows of the matrices, the value of the Cochran-Armitage Trend Test statistic is computed for each variable.
Using this function instead of rowTrendStats
is in particular recommended
when the total number of observations is very large.
rowCATTs(cases, controls, scores = NULL, add.pval = TRUE)
rowCATTs(cases, controls, scores = NULL, add.pval = TRUE)
cases |
a numeric matrix in which each row represents one ordinal
variable and each column one of the ordered levels that the variables exhibit. The
entries of this matrix are the numbers of observations from one group (e.g.,
the cases in a case-control study) showing a particular
level at the different variables. Such a matrix can, e.g., be generated
by |
controls |
a numeric matrix of the same dimensions as |
scores |
a numeric vector of length |
add.pval |
should p-values be added to the output? If |
Either a vector containing the rowwise values of the Cochran-Armitage trend test statistic
(if add.pval = FALSE
), or a list containing these values (stats
),
and the (raw) p-values (rawp
) not adjusted for multiple comparisons (if add.pval = TRUE
).
The usual contingency table for a variable can be obtained from the matrices by forming a variable-specific matrix in which each row consists of the row of one of these matrices.
Holger Schwender, [email protected]
Agresti, A.\ (2002). Categorical Data Analysis. Wiley, Hoboken, NJ. 2nd Edition.
Armitage, P.\ (1955). Tests for Linear Trends in Proportions and Frequencies. Biometrics, 11, 375-386.
Cochran, W.~G.\ (1954). Some Methods for Strengthening the Common ChiSquare Tests. Biometrics, 10, 417-451.
rowTrendStats
, rowMsquares
, rowChisq2Class
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # The values of the rowwise Cochran-Armitage trend test # are computed by rowCATTs(cases, controls) # which leads to the same results as rowTrendStats(mat, cl) # or as out <- rowMsquares(cases, controls) n <- ncol(mat) out$stats * n / (n-1) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # The values of the rowwise Cochran-Armitage trend test # are computed by rowCATTs(cases, controls) # which leads to the same results as rowTrendStats(mat, cl) # or as out <- rowMsquares(cases, controls) n <- ncol(mat) out$stats * n / (n-1) ## End(Not run)
Given a set of matrices, each of which represents one group of subjects, and summarizes rowwise the numbers of these observations showing the levels of the categorical variables represented by the rows of the matrices, the value of Pearson's ChiSquare statistic for testing whether the distribution of the variable differs between the different groups is computed for each variable.
Using this function instead of rowChisqStats
is recommended
when the total number of observations is very large.
rowChisq2Class(cases, controls, add.pval = TRUE, sameNull = FALSE) rowChisqMultiClass(..., listTables = NULL, add.pval = TRUE, sameNull = FALSE)
rowChisq2Class(cases, controls, add.pval = TRUE, sameNull = FALSE) rowChisqMultiClass(..., listTables = NULL, add.pval = TRUE, sameNull = FALSE)
cases |
a numeric matrix in which each row represents one categorical
variable and each column one of the levels that the variables exhibit. The
entries of this matrix are the numbers of observations from one group (e.g.,
the cases in a case-control study) showing a particular
level at the different variables. Such a matrix can, e.g., be generated
by |
controls |
a numeric matrix of the same dimensions as |
... |
numeric matrices (such as |
listTables |
instead of inputting the matrices directly into |
add.pval |
should p-values be added to the output? If |
sameNull |
should all variables follow the same null distribution? If |
Either a vector containing the rowwise values of Pearson's ChiSquare statistic
(if add.pval = FALSE
) or a list containing these values (stats
),
the degrees of freedom (df
), and the p-values (rawp
) not adjusted
for multiple comparisons (if add.pval = TRUE
).
In the case of a 2 x 2 table, no continuity correction is applied. In this
case, the results of rowChisq2Class
and rowChisMultiClass
are
only equal to the results of chisq.test
if in the latter correct = FALSE
is used.
The usual contingency table for a variable can be obtained from the matrices by forming a variable-specific matrix in which each row consists of the row of one of these matrices.
Holger Schwender, [email protected]
rowChisqStats
, rowTables
, rowCATTs
,
rowMsquares
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # To obtain the ChiSquare values call rowChisq2Class(cases, controls) # This leads to the same results as rowChisqStats(mat, cl) # or rowChisqMultiClass(cases, controls) # or listTab <- list(cases, controls) rowChisqMultiClass(listTables = listTab) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # To obtain the ChiSquare values call rowChisq2Class(cases, controls) # This leads to the same results as rowChisqStats(mat, cl) # or rowChisqMultiClass(cases, controls) # or listTab <- list(cases, controls) rowChisqMultiClass(listTables = listTab) ## End(Not run)
Computes for each row of a matrix the value of Pearson's ChiSquare statistic for testing if the corresponding categorical variable is associated with a (categorical) response, or determines for each pair of rows of a matrix the value of Pearson's ChiSquare statistic for testing if the two corresponding variables are independent.
rowChisqStats(data, cl, compPval = TRUE, asMatrix = TRUE)
rowChisqStats(data, cl, compPval = TRUE, asMatrix = TRUE)
data |
a numeric matrix consisting of the integers between 1 and |
cl |
a numeric vector of length |
compPval |
should also the p-value (based on the approximation to a
|
asMatrix |
should the pairwise test scores be returned as matrix? Ignored
if |
If compPval = FALSE
, a vector (or matrix if cl
is not specified and
as.matrix = TRUE
) composed of the values of Pearson's -statistic.
Otherwise, a list consisting of
stats |
a vector (or matrix) containing the values of Pearson's |
df |
a vector (or matrix) comprising the degrees of freedom of the asymptotic
|
rawp |
a vector (or matrix) containing the (unadjusted) p-values. |
Contrary to chisq.test
, currently no continuity correction is done
for 2 x 2 tables.
Holger Schwender, [email protected]
Schwender, H.\ (2007). A Note on the Simultaneous Computation of Thousands of
Pearson's -Statistics. Technical Report, SFB 475,
Deparment of Statistics, University of Dortmund.
computeContCells
, computeContClass
## Not run: # Generate an example data set consisting of 5 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3. mat <- matrix(sample(3, 1000, TRUE), 5) rownames(mat) <- paste("SNP", 1:5, sep = "") # For each pair of rows of mat, test if they are independent. r1 <- rowChisqStats(mat) # The values of Pearson's ChiSquare statistic as matrix. r1$stats # And the corresponding (unadjusted) p-values. r1$rawp # Obtain only the values of the test statistic as vector rowChisqStats(mat, compPval = FALSE, asMatrix =FALSE) # Generate an example data set consisting of 10 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3, and a vector of class labels of length 200 # indicating that the first 100 observation belong to class 1 # and the other 100 to class 2. mat2 <- matrix(sample(3, 2000, TRUE), 10) cl <- rep(1:2, e = 100) # For each row of mat2, test if they are associated with cl. r2 <- rowChisqStats(mat2, cl) r2$stats # And the results are identical to the one of chisq.test pv <- stat <- numeric(10) for(i in 1:10){ tmp <- chisq.test(mat2[i,], cl) pv[i] <- tmp$p.value stat[i] <- tmp$stat } all.equal(r2$stats, stat) all.equal(r2$rawp, pv) ## End(Not run)
## Not run: # Generate an example data set consisting of 5 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3. mat <- matrix(sample(3, 1000, TRUE), 5) rownames(mat) <- paste("SNP", 1:5, sep = "") # For each pair of rows of mat, test if they are independent. r1 <- rowChisqStats(mat) # The values of Pearson's ChiSquare statistic as matrix. r1$stats # And the corresponding (unadjusted) p-values. r1$rawp # Obtain only the values of the test statistic as vector rowChisqStats(mat, compPval = FALSE, asMatrix =FALSE) # Generate an example data set consisting of 10 rows (variables) # and 200 columns (observations) by randomly drawing integers # between 1 and 3, and a vector of class labels of length 200 # indicating that the first 100 observation belong to class 1 # and the other 100 to class 2. mat2 <- matrix(sample(3, 2000, TRUE), 10) cl <- rep(1:2, e = 100) # For each row of mat2, test if they are associated with cl. r2 <- rowChisqStats(mat2, cl) r2$stats # And the results are identical to the one of chisq.test pv <- stat <- numeric(10) for(i in 1:10){ tmp <- chisq.test(mat2[i,], cl) pv[i] <- tmp$p.value stat[i] <- tmp$stat } all.equal(r2$stats, stat) all.equal(r2$rawp, pv) ## End(Not run)
Computes Pearson's correlation coefficient of a vector with each row of a matrix.
rowCors(X, y, trendStat = FALSE, use.n = NULL)
rowCors(X, y, trendStat = FALSE, use.n = NULL)
X |
a numeric matrix in which each row represents a variable and each column an observation. |
y |
a numeric vector of length |
trendStat |
instead of the correlation coefficients should the values of
the statistic for a test of linear trend based on this coefficient be returned?
If |
use.n |
should the squared values of the correlation coefficient be multiplied
by |
A vector containing the rowwise values of Pearson's correlation coefficient (if
trendStat = FALSE
or the rowwise values of the trend statistics (if
trendStat = TRUE
.
Holger Schwender, [email protected]
Agresti, A.\ (2002). Categorical Data Analysis. Wiley, Hoboken, NJ. 2nd Edition.
rowTrendStats
, rowCATTs
, rowMsquares
## Not run: # Generate a random matrix containing 10 continuous variables # and a vector representing a continuous variable. mat <- matrix(runif(200, 0, 20), 10) y <- sample(runif(20, 0, 20)) # The correlations between y and each of row of mat are # computed by rowCors(mat, y) # Generate a random binary vector and a matrix consisting # of 10 ordinal variables with levels 0, 1, 2, where these # values can be interpreted as scores for the differ # categories. mat <- matrix(sample(0:2, 500, TRUE), 10) y <- sample(0:1, 50, TRUE) # The values of the Cochran-Armitage trend statistic are # computed by rowCors(mat, y, trendStat = TRUE) # If the values of the general test of linear trend described # on page 87 of Agresti (2002) should be computed, then call rowCors(mat, y, trendStat = TRUE, use.n = FALSE) ## End(Not run)
## Not run: # Generate a random matrix containing 10 continuous variables # and a vector representing a continuous variable. mat <- matrix(runif(200, 0, 20), 10) y <- sample(runif(20, 0, 20)) # The correlations between y and each of row of mat are # computed by rowCors(mat, y) # Generate a random binary vector and a matrix consisting # of 10 ordinal variables with levels 0, 1, 2, where these # values can be interpreted as scores for the differ # categories. mat <- matrix(sample(0:2, 500, TRUE), 10) y <- sample(0:1, 50, TRUE) # The values of the Cochran-Armitage trend statistic are # computed by rowCors(mat, y, trendStat = TRUE) # If the values of the general test of linear trend described # on page 87 of Agresti (2002) should be computed, then call rowCors(mat, y, trendStat = TRUE, use.n = FALSE) ## End(Not run)
Computes the frequencies of the levels that the categorical variables in a matrix show.
rowFreqs(x, levels = 1:3, divide.by.n = FALSE, affy = FALSE, includeNA = FALSE, useNN = c("not", "only", "also"), check = TRUE)
rowFreqs(x, levels = 1:3, divide.by.n = FALSE, affy = FALSE, includeNA = FALSE, useNN = c("not", "only", "also"), check = TRUE)
x |
a matrix in which each row represents a categorical variable (e.g., a SNP)
and each column an observation, where the variables are assumed to show the
levels specified by |
levels |
vector specifying the levels that the categorical variables in |
divide.by.n |
should the numbers of observations showing the respective levels
be divided by the total number of observations, i.e.\ by |
affy |
logical specifying whether the SNPs in |
includeNA |
should a column be added to the output matrix containing the number of missing values for each variable? |
useNN |
character specifying whether missing values can also be coded by |
check |
should it be checked whether some of the variables show other levels than the one
specified by |
A matrix with the same number of rows as x
containing for each variable the numbers
of observations showing the levels specified by levels
.
Holger Schwender, [email protected]
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowFreqs(mat) # leads to the same results as rowTables(mat) / ncol(mat) # If mat contains missing values mat[sample(500, 20)] <- NA # then rowFreqs(mat) # leads to the same result as rowTables(mat) / rowSums(!is.na(mat)) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowFreqs(mat) # leads to the same results as rowTables(mat) / ncol(mat) # If mat contains missing values mat[sample(500, 20)] <- NA # then rowFreqs(mat) # leads to the same result as rowTables(mat) / rowSums(!is.na(mat)) ## End(Not run)
Tests for each row of a matrix whether the Hardy-Weinberg Equilibrium holds for the SNP represented by the row.
rowHWEs(x, levels = 1:3, affy = FALSE, check = TRUE)
rowHWEs(x, levels = 1:3, affy = FALSE, check = TRUE)
x |
a matrix in which each row represents a SNP and each column a subject,
where the SNPs can take the values specified by |
levels |
a vector of length three specifying the values with which the three
genotypes of each SNP are represented. It is assumed that the second element of
|
affy |
logical specifying whether the SNPs in |
check |
should some checks be done if, e.g., other than the specified |
A list containing the values of the ChiSquare statistic for testing for deviation from HWE
(stats
) and the raw p-values (rawp
) computed by employing the ChiSquare distribution
with 1 degree of freedom.
Holger Schwender, [email protected]
Computes for each SNP represented by a row of a matrix the frequency of the minor allele.
rowMAFs(x, check = TRUE)
rowMAFs(x, check = TRUE)
x |
a matrix in which each row represents a SNP and each column a subject, where the genotypes of each SNP are coded by 1 (for the homozygous reference genotype), 2 (heterozygous), and 3 (homozygous variant). NAs are also allowed. |
check |
should it be checked if the matrix contains values differing from 1, 2, and 3?
It is highly recommended to leave |
a vector containing the minor allele frequency of the SNPs represented by x
.
Holger Schwender, [email protected]
Given a set of matrices, each representing one group of subjects (e.g., cases and controls in a case-control study), that summarize the numbers of subjects showing the different levels of the categorical variables represented by the rows of the matrices, the value of the linear trend statistic based on Pearson's correlation coefficient and described on page 87 of Agresti (2002) is computed for each variable.
Using this function instead of rowTrendStats
is in particular recommended
when the total number of observations is very large.
rowMsquares(..., listTables = NULL, clScores = NULL, levScores = NULL, add.pval = TRUE)
rowMsquares(..., listTables = NULL, clScores = NULL, levScores = NULL, add.pval = TRUE)
... |
numeric matrices in each of which each row corresponds to a ordinal variable
and each column to one of the ordered levels of these variables. Each of these matrices
represents one of the groups of interest and comprises the numbers of observations showing
the respective levels at the different variables. These matrices can, e.g., generated by
employing |
listTables |
instead of inputting the matrices directly,
a list consisting of these matrices can be generated and then be used in |
clScores |
a numeric vector with one entry for each matrix specifying the score that should
be assigned to the corresponding group. If |
levScores |
a numeric vector with one score for each level of the variables.If not specified,
i.e.\ |
add.pval |
should p-values be added to the output? If |
This is an extension of the Cochran-Armitage trend test from two to several classes. The
statistic of the Cochran-Armitage trend test can be obtained by multiplying the statistic
of this general linear trend test with , where
is the number
of observations.
Either a vector containing the rowwise values of the linear trend test statistic
(if add.pval = FALSE
), or a list containing these values (stats
),
and the (raw) p-values (rawp
) not adjusted for multiple comparisons (if add.pval = TRUE
).
The usual contingency table for a variable can be obtained from the matrices by forming a variable-specific matrix in which each row consists of the row of one of these matrices.
Holger Schwender, [email protected]
Agresti, A.\ (2002). Categorical Data Analysis. Wiley, Hoboken, NJ. 2nd Edition.
Mantel, N.\ (1963). Chi-Square Test with one Degree of Freedom: Extensions of the Mantel-Haenszel Procedure. Journal of the American Statistical Association, 58, 690-700.
rowTrendStats
, rowCATTs
, rowChisqMultiClass
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that we consider a case-control study, # i.e. two groups, and that the first 25 columns # of mat correspond to cases and the remaining 25 # columns to cases. Then a vector containing the # class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # The values of the rowwise liner trend test are # computed by rowMsquares(cases, controls) # which leads to the same results as listTab <- list(cases, controls) rowMsquares(listTables = listTab) # or as rowTrendStats(mat, cl, use.n = FALSE) # or as out <- rowCATTs(cases, controls) n <- ncol(mat) out$stats * (n - 1) / n ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that we consider a case-control study, # i.e. two groups, and that the first 25 columns # of mat correspond to cases and the remaining 25 # columns to cases. Then a vector containing the # class labels is given by cl <- rep(1:2, e=25) # and the matrices summarizing the numbers of subjects # showing the respective levels at the different variables # are computed by cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==2]) # The values of the rowwise liner trend test are # computed by rowMsquares(cases, controls) # which leads to the same results as listTab <- list(cases, controls) rowMsquares(listTables = listTab) # or as rowTrendStats(mat, cl, use.n = FALSE) # or as out <- rowCATTs(cases, controls) n <- ncol(mat) out$stats * (n - 1) / n ## End(Not run)
Scales each row of a matrix such that the values in this row have zero mean and a standard deviation of 1.
rowScales(X, add.stats = FALSE)
rowScales(X, add.stats = FALSE)
X |
a numeric matrix whose rows should be scaled. Missing values are allowed. |
add.stats |
should the rowwise means and standard deviations of |
If add.stats = FALSE
, a matrix of the same dimensions as X
with a rowwise mean of zero and a rowwise standard deviation of 1.
If add.stats = TRUE
, a list containing this matrix and the
rowwise means and standard deviations of the input matrix.
Holger Schwender, [email protected]
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowScales(mat) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowScales(mat) ## End(Not run)
Computes a one-dimensional table for each row of a matrix that summarizes the values of the categorical variables represented by the rows of the matrix.
rowTables(x, levels = 1:3, affy = FALSE, includeNA = FALSE, useNN = c("not", "only", "also"), check = TRUE)
rowTables(x, levels = 1:3, affy = FALSE, includeNA = FALSE, useNN = c("not", "only", "also"), check = TRUE)
x |
a matrix in which each row represents a categorical variable (e.g., a SNP)
and each column an observation, where the variables are assumed to show the
levels specified by |
levels |
vector specifying the levels that the categorical variables in |
affy |
logical specifying whether the SNPs in |
includeNA |
should a column be added to the output matrix containing the number of missing values for each variable? |
useNN |
character specifying whether missing values can also be coded by |
check |
should it be checked whether some of the variables show other levels than the one
specified by |
A matrix with the same number of rows as x
containing for each variable the numbers
of observations showing the levels specified by levels
.
Holger Schwender, [email protected]
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowTables(mat) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) rowTables(mat) ## End(Not run)
rowTrendFuzzy
performs the trend test proposed by Louis et al. (2010) based on fuzzy genotype calls, i.e. the weighted sums
over the confidences for the three genotypes as they are determined by preprocessing algorithms (e.g., CRLMM)
or imputation procedures.
Given the confidences and scores for all three genotypes, getMatFuzzy
constructs a matrix containing the fuzzy genotype calls.
rowTrendFuzzy(score, probs, y, mat.fuzzy = NULL, alternative = c("two.sided", "less", "greater"), check = TRUE) getMatFuzzy(score, probs, check = TRUE)
rowTrendFuzzy(score, probs, y, mat.fuzzy = NULL, alternative = c("two.sided", "less", "greater"), check = TRUE) getMatFuzzy(score, probs, check = TRUE)
score |
either a numeric vector of length 2 or 3, or a character string. If the latter, If |
probs |
a list of length 2 or 3 consisting of matrices of the same size. Each matrix must contain the confidences
for one of the three genotypes, where each row in the matrix represents a SNP and each column a sample (which must be in the same
order in all matrices). The matrices in |
y |
a vector of zeros and ones specifying which of the samples in the matrices in |
mat.fuzzy |
a matrix containing the fuzzy genotype calls. If specified, |
alternative |
a character string specifying the alternative hypothesis. Must be one of |
check |
logical specifying whether the specified objects should be extensively checked. If |
For getMatFuzzy
, a matrix containing the fuzzy genotype calls. For rowTrendFuzzy
, a list consisting of
stat |
a vector containing the values of the trend test statistic for all SNPs comprised by |
rawp |
a vector containing the unadjusted p-values computed for the values in |
theta |
a vector containing estimates for the log odds ratios for risk corresponding to |
varTheta |
a vector containing the variance estimates for |
Louis, T.A., Carvalho, B.S., Fallin, M.D., Irizarry, R.A., Li, Q., and Ruczinski, I. (2010). Association Tests that Accommodate Genotyping Errors. In Bernardo, J.M., Bayarri, M.J., Berger, J.O., Dawid, A.P., Heckerman, D., Smith, A.F.M., and West, M. (eds.), Bayesian Statistics 9, 393-420. Oxford University Press, Oxford, UK. With Discussion.
Computes for each row of a matrix the value of the statistic of a linear trend test for testing whether the ordinal variable corresponding to the row of the matrix is associated with an ordinal response.
In the two-class case, the statistic of the Cochran-Armitage trend test is computed by default.
rowTrendStats(X, y, use.n = NULL, add.pval = TRUE)
rowTrendStats(X, y, use.n = NULL, add.pval = TRUE)
X |
a numeric matrix in which each row represents an ordinal variable and each column corresponds to an observation. The entries of this matrix are interpreted as scores for the different (ordered) levels of the respective variables. |
y |
a numeric vector of length |
use.n |
should the squared values of Pearson's correlation coefficient be multiplied
by |
add.pval |
should p-values be added to the output? If |
Holger Schwender, [email protected]
Agresti, A.\ (2002). Categorical Data Analysis. Wiley, Hoboken, NJ. 2nd Edition.
Armitage, P.\ (1955). Tests for Linear Trends in Proportions and Frequencies. Biometrics, 11, 375-386.
Cochran, W.~G.\ (1954). Some Methods for Strengthening the Common ChiSquare Tests. Biometrics, 10, 417-451.
Mantel, N.\ (1963). Chi-Square Test with one Degree of Freedom: Extensions of the Mantel-Haenszel Procedure. Journal of the American Statistical Association, 58, 690-700.
rowMsquares
, rowCATTs
, rowChisqMultiClass
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(0:1, e=25) # The values of the Cochran-Armitage trend test can # then be computed by rowTrendStats(mat, cl) # This leads to the same results as cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==0]) rowCATTs(cases, controls) # or as out <- rowMsquares(cases, controls) n <- ncol(mat) out$stats * n / (n - 1) ## End(Not run)
## Not run: # Generate a matrix containing data for 10 categorical # variables with levels 1, 2, 3. mat <- matrix(sample(3, 500, TRUE), 10) # Now assume that the first 25 columns correspond to # cases and the remaining 25 columns to cases. Then # a vector containing the class labels is given by cl <- rep(0:1, e=25) # The values of the Cochran-Armitage trend test can # then be computed by rowTrendStats(mat, cl) # This leads to the same results as cases <- rowTables(mat[, cl==1]) controls <- rowTables(mat[,cl==0]) rowCATTs(cases, controls) # or as out <- rowMsquares(cases, controls) n <- ncol(mat) out$stats * n / (n - 1) ## End(Not run)
Shortens the entries of the column of a data frame containing the genes
associated with the SNPs for which the data frame comprises annotations.
Typically used in combination with, i.e.\ either within or after an application of,
buildSNPannotation.
shortenGeneDescription(dat, colname = "Gene", max.length = 2, sep = "///", add.ldots = TRUE)
shortenGeneDescription(dat, colname = "Gene", max.length = 2, sep = "///", add.ldots = TRUE)
dat |
a data frame. Typically, the output of |
colname |
character string comprising the name of the column of |
max.length |
integer specifying the maximum number of genes associated with the respective
SNP that should be stored in the data frame. By default, the first two genes are retained.
Shortened entries are marked by |
sep |
character string specifying the separation symbol between the different genes. |
add.ldots |
should |
The same data frame as dat
with shortened entries in the column of dat
named colname
.
Holger Schwender, [email protected]
Shows the content of the file CHANGES stored in a package.
showChanges(pkg = "scrime")
showChanges(pkg = "scrime")
pkg |
character string naming the package. |
Holger Schwender, [email protected]
Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify a categorical response by level-wise or multinomial logistic regression.
simulateSNPcatResponse(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, withRef = FALSE, beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, rand = NA) ## S3 method for class 'simSNPcatResponse' print(x, justify = c("left", "right"), spaces = 2, ...)
simulateSNPcatResponse(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, withRef = FALSE, beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, rand = NA) ## S3 method for class 'simSNPcatResponse' print(x, justify = c("left", "right"), spaces = 2, ...)
n.obs |
number of observations that should be generated. |
n.snp |
number of SNPs that should be generated. |
list.ia |
a list consisting of If, e.g., one of the
vectors is given by
For more details, see Details. Must be specified if |
list.snp |
a list consisting of numeric vectors (if one interaction should be explanatory
for a level of the response) or lists of numeric vectors (if there should be more than one
explanatory interaction) specifying the SNPs that compose
the interactions. |
withRef |
should there be an additional reference group (i.e.\ a control group) denoted by a
zero? If |
beta0 |
a numeric value or vector of |
beta |
either a non-negative numeric value or a list of non-negative numeric values specifying
the parameters in the logistic regression models. If a numeric value, all parameters (except for
the intercept) in all logistic regression models will be equal to this value. If a list, then
this list must have the same length as |
maf |
either an integer, or a vector of length 2 or |
sample.y |
should the values of the response be randomly drawn using the probabilities
determined by the logistic regression models? If |
rand |
a numeric value for setting the random number generator in a reproducible state. |
x |
the output of |
justify |
a character string specifying whether the column of the summarizing table that
names the explanatory interactions should be |
spaces |
integer specifying the distance from the left end of the column mentioned in |
... |
ignored. |
simulateSNPcatResponse
first simulates a matrix consisting of n.obs
observations and n.snp
SNPs, where the minor allele frequencies of these SNPs are given by maf
.
Note that all SNPs are currently simulated independently of each other such that they are unlinked. Moreover, an observation is currently not allowed to have genotypes/interactions that are explanatory for more than one of the levels of the response. If, e.g., the response has three categories, then an observation can either exhibit one (or more) of the genotypes explaining the first level, or one (or more) of the genotypes explanatory for the second level, or one (or more) of the genotypes explaining the third level, or none of these genotypes.
Afterwards, the response is generated by employing the specifications of list.ia
,
list.snp
, beta0
and beta
.
By default, i.e.\ if both list.ia
and list.snp
are NULL
, list.ia
is set
to
list(c(-1, 1), c(1, 1, 1), list(c(-1, 1), c(1, 1, 1)))
,
and list.snp
is set to
list(c(6, 7), c(3, 9, 10), list(c(2, 5), c(1, 4, 8)))
such that the interaction
(SNP6 != 1) & (SNP7 == 1)
is assumed to be explanatory for the first level of the three-categorical response, the interaction
(SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1)
is assumed to be explanatory for the second level, and the interactions
(SNP2 != 1) & (SNP5 == 1)
\ \ \ and
(SNP1 == 1) & (SNP4 == 1) & (SNP8 == 1)
,
are assumed to be explanatory for the third level.
If withRef = FALSE
, then for each of the levels,
the probability of having this level given that an observation exhibits one, two, ...
of the interactions intended to be explanatory for that level is determined using the corresponding
logistic regression model. Afterwards, the value of the response for an observation showing one, two, ...
of the interactions explanatory for a particular level is randomly drawn using the above probability
for the particular level and
as probabilities for the other
levels. If an observation exhibits none of the explanatory interactions,
its response value is randomly drawn using the probabilities
.
If withRef = TRUE
, a multinomial logistic regression is used to specify the class labels. In this
case the probabilities ,
, are given by
, where
are the probabilities on the
logit-scale (i.e.\ the probabilities on the scale of the linear predictors) and
is the reciprocal
of the probability for the control/reference group.
An object of class simSNPcatResponse
consisting of
x |
a matrix with |
y |
a vector of length |
models |
a character vector naming the level-wise logistic regression models. |
maf |
a vector of length |
tab.explain |
a data frame summarizing the results of the simulation. |
Holger Schwender, [email protected]
## Not run: # The simulated data set described in Details. sim1 <- simulateSNPcatResponse() sim1 # Specifying the values of the response by the levels with # the largest probability. sim2 <- simulateSNPcatResponse(sample.y = FALSE) sim2 # If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3), and # ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions # (or variables) that are explanatory for the three levels # of the response, list.ia and list.snp are specified as follows. list.ia <- list(c(-2, 1), 3, c(-1,3)) list.snp <- list(c(4, 3), 5, c(12,9)) # The categorical response and a data set consisting of # 800 observations and 25 SNPs, where the minor allele # frequency of each SNP is randomly drawn from a # uniform distribution with minimum 0.1 and maximum 0.4, # is then generated by sim3 <- simulateSNPcatResponse(n.obs = 800, n.snp = 25, list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4)) sim3 ## End(Not run)
## Not run: # The simulated data set described in Details. sim1 <- simulateSNPcatResponse() sim1 # Specifying the values of the response by the levels with # the largest probability. sim2 <- simulateSNPcatResponse(sample.y = FALSE) sim2 # If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3), and # ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions # (or variables) that are explanatory for the three levels # of the response, list.ia and list.snp are specified as follows. list.ia <- list(c(-2, 1), 3, c(-1,3)) list.snp <- list(c(4, 3), 5, c(12,9)) # The categorical response and a data set consisting of # 800 observations and 25 SNPs, where the minor allele # frequency of each SNP is randomly drawn from a # uniform distribution with minimum 0.1 and maximum 0.4, # is then generated by sim3 <- simulateSNPcatResponse(n.obs = 800, n.snp = 25, list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4)) sim3 ## End(Not run)
Simulates SNP data. Interactions of some of the simulated SNPs are then used to specify either a binary or a quantitative response by a logistic or linear regression model, respectively.
simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5, err.fun = NULL, rand = NA, ...)
simulateSNPglm(n.obs = 1000, n.snp = 50, list.ia = NULL, list.snp = NULL, beta0 = -0.5, beta = 1.5, maf = 0.25, sample.y = TRUE, p.cutoff = 0.5, err.fun = NULL, rand = NA, ...)
n.obs |
number of observations that should be generated. |
n.snp |
number of SNPs that should be generated. |
list.ia |
a list consisting of numeric vectors (or values) specifying the genotypes
of the SNPs that should be explanatory for the response. Each of these vectors
must be composed of some of the numbers -3, -2, -1, 1, 2, 3, where 1 denotes
the homozygous reference genotype, 2 the heterozygous genotype, and 3 the
homozygous variant genotype, and a minus before these numbers means that
the corresponding SNP should be not of this genotype. If, e.g., one of the
vectors is given by
For more details, see Details. Must be specified if |
list.snp |
a list consisting of numeric vectors specifying the SNPs that compose
the interactions used in the regression model. Each of these vectors must have the
same length as the corresponding vector in |
beta0 |
a numeric value specifying the intercept of the regression model. |
beta |
a non-negative numeric value or vector of the same length as |
maf |
either an integer, or a vector of length 2 or |
sample.y |
should the values of the response in the logistic regression model be randomly drawn
using the probabilities of the respective observations for being a case? If |
p.cutoff |
a probability, i.e.\ a numeric value between 0 and 1, naming the cutoff for an
observation to be called a case if |
err.fun |
a function for generating the error that is added to the linear model to determine
the value of the (quantitative) response. If |
rand |
a numeric value for setting the random number generator in a reproducible state. |
... |
further arguments of the function specified by |
simulateSNPglm
first simulates a matrix consisting of n.obs
observations and n.snp
SNPs, where the minor allele frequencies of these SNPs are given by maf
.
Note that all SNPs are currently simulated independently of each other such that they are unlinked.
Afterwards, the response is determined by a regression model using the specifications of list.ia
,
list.snp
, beta0
and beta
. Depending on whether err.fun
is specified or not,
a linear or a logistic regression model is used, respectively, i.e.\ the response is continuous or binary.
By default, a logistic regression model
logit(Prob()) =
beta0
+ beta[1]
* +
beta[2]
* + ...
is fitted, since err.fun = NULL
.
If both list.ia
and list.snp
are NULL
, then interactions similar to the one
considered, e.g., in Nunkesser et al.\ (2007) or Schwender et al.\ (2007) are used, i.e.\
=
(SNP6 != 1) & (SNP7 == 1)
\ \ \ and
=
(SNP3 == 1) & (SNP9 == 1) & (SNP10 == 1)
,
by setting list.ia = list(c(-1, 1), c(1, 1, 1))
and
list.snp = list(c(6, 7), c(3, 9, 10))
.
Using the above model Prob() is computed for each observation, and its value of the response is
determined either by a random draw from a Bernoulli distribution using this probability (if
sample.y = TRUE
),
or by evaluating if Prob()
p.cutoff
(if sample.y = FALSE
).
If err.fun
is specified, then the linear model
=
beta0
+ beta[1]
* +
beta[2]
* + ... + error
is used to determine the values of the response , where the values for error are given
by the output of a call of
err.fun
.
An object of class simSNPglm
consisting of
x |
a matrix with |
y |
a vector of length |
beta0 |
the value of the intercept. |
beta |
the vector of parameters. |
ia |
a character vector naming the explanatory interactions. |
maf |
a vector of length |
prob |
a vector of length |
err |
a vector of length |
p.cutoff |
the value of |
err.call |
a character string naming the call of the error function (if |
Holger Schwender, [email protected]
Nunkesser, R., Bernholt, T., Schwender, H., Ickstadt, K. and Wegener, I.\ (2007). Detecting High-Order Interactions of Single Nucleotide Polymorphisms Using Genetic Programming. Bioinformatics, 23, 3280-3288.
Schwender, H.\ (2007). Statistical Analysis of Genotype and Gene Expression Data. Dissertation, Department of Statistics, University of Dortmund.
simulateSNPs
, summary.simSNPglm
, simulateSNPcatResponse
## Not run: # The simulated data set described in Details. sim1 <- simulateSNPglm() sim1 # A bit more information: Table of probabilities of being a case # vs. numbers of cases and controls. summary(sim1) # Calling an observation a case if its probability of being # a case is larger than 0.5 (the default for p.cutoff). sim2 <- simulateSNPglm(sample.y = FALSE) summary(sim2) # If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and # ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions # (or variables) that are explanatory for the response, # list.ia and list.snp are specified as follows. list.ia <- list(c(-2, 1), 3, c(-1,3)) list.snp <- list(c(4, 3), 5, c(12,9)) # The binary response and the data set consisting of # 600 observations and 25 SNPs, where the minor allele # frequency of each SNP is randomly drawn from a # uniform distribution with minimum 0.1 and maximum 0.4, # is then generated by sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25, list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4)) sim3 summary(sim3) # If the response should be quantitative, err.fun has # to be specified. To use a normal distribution with mean 0 # (default in rnorm) and a standard deviation of 2 # as the distribution of the error, call simulateSNPglm(err.fun = rnorm, sd = 2) ## End(Not run)
## Not run: # The simulated data set described in Details. sim1 <- simulateSNPglm() sim1 # A bit more information: Table of probabilities of being a case # vs. numbers of cases and controls. summary(sim1) # Calling an observation a case if its probability of being # a case is larger than 0.5 (the default for p.cutoff). sim2 <- simulateSNPglm(sample.y = FALSE) summary(sim2) # If ((SNP4 != 2) & (SNP3 == 1)), (SNP5 ==3) and # ((SNP12 !=1) & (SNP9 == 3)) should be the three interactions # (or variables) that are explanatory for the response, # list.ia and list.snp are specified as follows. list.ia <- list(c(-2, 1), 3, c(-1,3)) list.snp <- list(c(4, 3), 5, c(12,9)) # The binary response and the data set consisting of # 600 observations and 25 SNPs, where the minor allele # frequency of each SNP is randomly drawn from a # uniform distribution with minimum 0.1 and maximum 0.4, # is then generated by sim3 <- simulateSNPglm(n.obs = 600, n.snp = 25, list.ia = list.ia, list.snp = list.snp, maf = c(0.1, 0.4)) sim3 summary(sim3) # If the response should be quantitative, err.fun has # to be specified. To use a normal distribution with mean 0 # (default in rnorm) and a standard deviation of 2 # as the distribution of the error, call simulateSNPglm(err.fun = rnorm, sd = 2) ## End(Not run)
Simulates SNP data, where a specified proportion of cases and controls is explained by specified set of SNP interactions. Can also be used to simulate a data set with a multi-categorical response, i.e.\ a data set in which the cases are divided into several classes (e.g., different diseases or subtypes of a disease).
simulateSNPs(n.obs, n.snp, vec.ia, prop.explain = 1, list.ia.val = NULL, vec.ia.num = NULL, vec.cat = NULL, maf = c(0.1, 0.4), prob.val = rep(1/3, 3), list.equal = NULL, prob.equal = 0.8, rm.redundancy = TRUE, shuffle = FALSE, shuffle.obs = FALSE, rand = NA)
simulateSNPs(n.obs, n.snp, vec.ia, prop.explain = 1, list.ia.val = NULL, vec.ia.num = NULL, vec.cat = NULL, maf = c(0.1, 0.4), prob.val = rep(1/3, 3), list.equal = NULL, prob.equal = 0.8, rm.redundancy = TRUE, shuffle = FALSE, shuffle.obs = FALSE, rand = NA)
n.obs |
either an integer specifying the total number of
observations, or a vector of length 2 specifying the number
of cases and the number of controls. If |
n.snp |
integer specifying the number of SNPs. |
vec.ia |
a vector of integers specifying the orders of the interactions
that explain the cases. |
prop.explain |
either an integer or a vector of |
list.ia.val |
a list of |
vec.ia.num |
a vector of |
vec.cat |
a vector of the same length of |
maf |
either an integer, or a vector of length 2 or |
prob.val |
a vector consisting of the probabilities for drawing a 0, 1, or 2,
if |
list.equal |
list of same structure as |
prob.equal |
a numeric value specifying the probability that a 1 is drawn when generating
|
rm.redundancy |
should redundant SNPs be removed from the explaining interactions?
It is possible that one specify an explaining |
shuffle |
logical. By default, the first |
shuffle.obs |
should the observations be shuffled? |
rand |
integer. Sets the random number generator in a reproducible state. |
An object of class simulatedSNPs
composed of
data |
a matrix with |
cl |
a vector of length |
tab.explain |
a table naming the explanatory interactions and the numbers of cases and controls explained by them. |
ia |
character vector naming the interactions. |
maf |
vector of length |
Currently, the genotypes of all SNPs are simulated independently from each other (except for the SNPs that belong to the same explanatory interaction).
Holger Schwender [email protected]
simulateSNPglm
, simulateSNPcatResponse
## Not run: # Simulate a data set containing 2000 observations (1000 cases # and 1000 controls) and 50 SNPs, where one three-way and two # two-way interactions are chosen randomly to be explanatory # for the case-control status. sim1 <- simulateSNPs(2000, 50, c(3, 2, 2)) sim1 # Simulate data of 1200 cases and 800 controls for 50 SNPs, # where 90% of the observations showing a randomly chosen # three-way interaction are cases, and 95% of the observations # showing a randomly chosen two-way interactions are cases. sim2 <- simulateSNPs(c(1200, 800), 50, c(3, 2), prop.explain = c(0.9, 0.95)) sim2 # Simulate a data set consisting of 1000 observations and 50 SNPs, # where the minor allele frequency of each SNP is 0.25, and # the interactions # ((SNP1 == 2) & (SNP2 != 0) & (SNP3 == 1)) and # ((SNP4 == 0) & (SNP5 != 2)) # are explanatory for 200 and 250 of the 500 cases, respectively, # and for none of the 500 controls. list1 <- list(c(2, 0, 1), c(0, 2)) list2 <- list(c(1, 0, 1), c(1, 0)) sim3 <- simulateSNPs(1000, 50, c(3, 2), list.ia.val = list1, list.equal = list2, vec.ia.num = c(200, 250), maf = 0.25) ## End(Not run)
## Not run: # Simulate a data set containing 2000 observations (1000 cases # and 1000 controls) and 50 SNPs, where one three-way and two # two-way interactions are chosen randomly to be explanatory # for the case-control status. sim1 <- simulateSNPs(2000, 50, c(3, 2, 2)) sim1 # Simulate data of 1200 cases and 800 controls for 50 SNPs, # where 90% of the observations showing a randomly chosen # three-way interaction are cases, and 95% of the observations # showing a randomly chosen two-way interactions are cases. sim2 <- simulateSNPs(c(1200, 800), 50, c(3, 2), prop.explain = c(0.9, 0.95)) sim2 # Simulate a data set consisting of 1000 observations and 50 SNPs, # where the minor allele frequency of each SNP is 0.25, and # the interactions # ((SNP1 == 2) & (SNP2 != 0) & (SNP3 == 1)) and # ((SNP4 == 0) & (SNP5 != 2)) # are explanatory for 200 and 250 of the 500 cases, respectively, # and for none of the 500 controls. list1 <- list(c(2, 0, 1), c(0, 2)) list2 <- list(c(1, 0, 1), c(1, 0)) sim3 <- simulateSNPs(1000, 50, c(3, 2), list.ia.val = list1, list.equal = list2, vec.ia.num = c(200, 250), maf = 0.25) ## End(Not run)
Computes the values of (or the distance based on) the simple matching coefficient or Cohen's Kappa, respectively, for each pair of rows of a matrix.
smc(x, dist = FALSE) cohen(x, dist = FALSE)
smc(x, dist = FALSE) cohen(x, dist = FALSE)
x |
a matrix consisting of integers between 1 and |
dist |
should the distance based on the simple matching coefficient or Cohen's Kappa, respectively,
be computed? Note that, e.g., |
A matrix with nrow(x)
columns and rows containing the distances or similarities.
Holger Schwender, [email protected]
## Not run: # Generate a data set consisting of 10 rows and 200 columns, # where the values are randomly drawn from the integers 1, 2, and 3. mat <- matrix(sample(3, 2000, TRUE), 10) # For each pair of row, the value of the simple matching coefficient # can be obtained by smc(mat) # and the distance based on the SMC by smc(mat, dist = TRUE) ## End(Not run)
## Not run: # Generate a data set consisting of 10 rows and 200 columns, # where the values are randomly drawn from the integers 1, 2, and 3. mat <- matrix(sample(3, 2000, TRUE), 10) # For each pair of row, the value of the simple matching coefficient # can be obtained by smc(mat) # and the distance based on the SMC by smc(mat, dist = TRUE) ## End(Not run)
Transforms SNPs to binary variables.
snp2bin(mat, domrec = TRUE, refAA = FALSE, snp.in.col = TRUE, monomorph = 0)
snp2bin(mat, domrec = TRUE, refAA = FALSE, snp.in.col = TRUE, monomorph = 0)
mat |
a matrix or data frame in which the genotypes of all SNPs are coded
either by |
domrec |
should each SNP be coded by two dummy variables from which one codes for
a recessive, and the other for a dominant effect? If |
refAA |
codes |
snp.in.col |
does each column of |
monomorph |
a non-negative number. If a dummy variable contains |
A matrix containing the binary dummy variables.
Holger Schwender, [email protected]
## Not run: # Generate an example data set consisting of 10 rows (observations) # and 5 columns (SNPs). mat <- matrix(sample(3, 50, TRUE), 10) colnames(mat) <- paste("SNP", 1:5, sep = "") # Transform each SNP into two dummy variables, one that codes for # a recessive effect and one that codes for a dominant effect. snp2bin(mat) # Transform each SNP into three dummy variables, where each of # these variables codes for one of the three genotypes. snp2bin(mat, domrec = FALSE) ## End(Not run)
## Not run: # Generate an example data set consisting of 10 rows (observations) # and 5 columns (SNPs). mat <- matrix(sample(3, 50, TRUE), 10) colnames(mat) <- paste("SNP", 1:5, sep = "") # Transform each SNP into two dummy variables, one that codes for # a recessive effect and one that codes for a dominant effect. snp2bin(mat) # Transform each SNP into three dummy variables, where each of # these variables codes for one of the three genotypes. snp2bin(mat, domrec = FALSE) ## End(Not run)
Summarizes an object of class simSNPglm
.
## S3 method for class 'simSNPglm' summary(object, digits = 3, ...)
## S3 method for class 'simSNPglm' summary(object, digits = 3, ...)
object |
an object of class |
digits |
number of digits used in the output. |
... |
Ignored. |
Shows the model used in simulateSNPglm
to generate the values of the response.
If the response is binary, then it additionally shows and returns a contingency table of the
numbers of cases and controls and the probabilities for being a case.
Holger Schwender, [email protected]
## Not run: # The default simulated data set. sim1 <- simulateSNPglm() sim1 # A bit more information: Table of probability of being a case # vs. number of cases and controls. summary(sim1) ## End(Not run)
## Not run: # The default simulated data set. sim1 <- simulateSNPglm() sim1 # A bit more information: Table of probability of being a case # vs. number of cases and controls. summary(sim1) ## End(Not run)