| Title: | Analysis of Copy Number Signatures |
|---|---|
| Description: | A workflow to generate and analyze signatures based on copy number data using non-negative matrix factorization (NMF), in an approach similar to that used for mutational signatures. It can be used to extract features from copy number segment data and use them to find a set of copy number signatures, which can then be correlated with other relevant data. For more on 'NMF' see Gaujoux (2013) <doi:10.1186/1471-2105-11-367>. |
| Authors: | David Tallman [aut], Shawn Striker [cre, ctb], Daniel Stover [cph] |
| Maintainer: | Shawn Striker <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-08 08:05:08 UTC |
| Source: | https://github.com/cran/CNSigs |
This function is used to append your ploidy data onto the sample component matrix. The function is called in runPipeline if you specify ploidy as a desired feature and give a vector of ploidy values.
addPloidyData(scm, ploidyData)
scm |
The scm to append the ploidy to |
ploidyData |
Vector of ploidy data to append to the scm |
Returns scm including the ploidy data
These generated components were derived from all of the copy number data available from TCGA. They are meant to be used to look for signatures in cancer data, so that you do not have to model new data and can easily compare signatures.
cancerComps
A list with 6 flexmix objects:
Mixture of normal distributions
Mixture of normal distributions
Mixture of poisson distributions
Mixture of normal distributions
Mixture of normal distributions
Mixture of poisson distributions
These signatures were derived from all of the copy number data available from TCGA. They are meant to be used so that you can compare newly found signatures to these described signatures. There are a total of 25 signatures. The components used to derive these signatures can be found in the cancerComps variable.
cancerSigs
An object of class matrix (inherits from array) with 28 rows and 25 columns.
This function checks for overlapping components and removes them. The mixture modeling can sometimes produce multiple, essentially identical components; this function attempts to find and remove these duplicates.
checkCompOverlap(comps, pois = FALSE)
comps |
Component parameters |
pois |
Whether the components are Poisson (TRUE) or normal (FALSE) distributions |
Returns the components with overlapping duplicates removed
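As a rough illustration of what removing "essentially identical" components can look like, here is a minimal base R sketch. The parameter layout (one column per normal component, rows for mean and sd), the helper name dropNearDuplicates, and the tolerance are all assumptions made for this example and are not taken from the package source.

```r
# Hypothetical layout: one column per normal component (row 1 = mean, row 2 = sd).
comps <- rbind(mean = c(1.0, 1.0001, 5.0), sd = c(0.5, 0.5, 1.2))

# Keep a component only if no earlier kept component matches it within `tol`
# on every parameter. `tol` is an arbitrary illustrative threshold.
dropNearDuplicates <- function(params, tol = 1e-3) {
  keep <- rep(TRUE, ncol(params))
  for (j in seq_len(ncol(params))[-1]) {
    earlier <- which(keep[seq_len(j - 1)])
    diffs <- abs(params[, earlier, drop = FALSE] - params[, j])
    if (any(colSums(diffs < tol) == nrow(params))) keep[j] <- FALSE
  }
  params[, keep, drop = FALSE]
}

deduped <- dropNearDuplicates(comps)
```

Here the second component is dropped because it differs from the first by less than the tolerance in both parameters.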
These signatures were derived from all of the copy number data available from TCGA. These pan-cancer signatures were collapsed by similarity to a total of 13 signatures and were built using ploidy data. They are meant to be used so that you can compare newly found signatures to these described signatures.
collapsedSigs
An object of class data.frame with 29 rows and 13 columns.
This function is used to compare two signature sets and show how the sample exposures differ across the two runs.
compareExposures(reference, toCompare)
reference |
Results from your reference analysis |
toCompare |
Results from run that you want to compare to reference |
Prints out the difference in signature exposures
compareExposures(referenceExp, referenceExp)
The generated components for the segDataExp dataset. Generated using the function fitModels(featsExp). Each data frame has a value and an ID column. The ID tells you which sample the observed value is from.
compsExp
A list with 6 flexmix objects:
Mixture of normal distributions
Mixture of normal distributions
Mixture of poisson distributions
Mixture of normal distributions
Mixture of normal distributions
Mixture of poisson distributions
This function creates the final signatures and returns the resulting NMF object, from which you can extract the feature contribution to each signature using NMF::basis() and the signature contribution of each sample using NMF::scoef().
createSigs(scm, nsig, cores = 1, runName = "", saveRes = FALSE, saveDir = NULL)
scm |
The sample_by_component matrix to run NMF on |
nsig |
Number of signatures for the NMF to create |
cores |
Number of cores to use in parallel process |
runName |
Name of the run used in file names, Default is "" |
saveRes |
Whether or not to save the results, Default is FALSE |
saveDir |
Where to save the results, must be provided if using saveRes |
Returns the resulting NMF object
createSigs(scmExp, 5) #Generates 5 signatures from the SCM
These are the default features used in the package. Use this to get the list of feature names; you can remove values from it and pass the result to the package through the featsToUse parameter found in multiple functions.
defaultFeats
An object of class character of length 6.
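A minimal sketch of dropping one feature before passing the vector on. It assumes defaultFeats holds the six feature names listed in the fitModels description; the vector is written out by hand so the example runs without the package, and in practice you would start from CNSigs::defaultFeats.

```r
# Assumed contents of defaultFeats, copied from the feature order given for fitModels.
defaultFeatsAssumed <- c("segsize", "bp10MB", "osCN",
                         "changepoint", "copynumber", "bpchrarm")

# Remove the oscillation feature and keep the remaining five.
featsToUse <- setdiff(defaultFeatsAssumed, "osCN")
```

The resulting featsToUse vector is the kind of value the featsToUse parameter of runPipeline and detSigNumPipeline is described as accepting.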
This function uses the extracted features and modelled components, and it performs NMF on these ranging from the minimum number of signatures to the max. It repeats this on randomized data and computes various measures to help inform the user on how many signatures to proceed with. This function may take a while to run since it repeats the NMF process many times. It is suggested to give it multiple cores to allow for parallel processing.
determineNumSigs(scm, rmin = 3, rmax = 12, cores = 1, nrun = 250, saveRes = FALSE, saveDir = NULL, runName = "")
scm |
Sample by component matrix used to find signatures |
rmin |
The lower bound of signature numbers to check. Default is 3. |
rmax |
The upper bound of signature numbers to check. Default is 12. |
cores |
The number of cores to use for parallel analysis. Default is 1. |
nrun |
Number of runs for NMF. Default is 250. |
saveRes |
Whether or not to save the plot. Default is FALSE. |
saveDir |
Directory to save plot in, must be provided if using saveRes |
runName |
Used to title plots and files when saving results |
Creates a series of plots to help the user decide how many signatures to use
determineNumSigs(generateSCM(featsExp, compsExp))
This function allows you to run the copy number signature pipeline up until the determineNumSigs call. This is useful if you want to repeatedly check the optimal number of signatures for different sample sets. It may take a while, especially if not given multiple cores.
detSigNumPipeline(segData, cores = 1, components = NULL, saveRes = FALSE, runName = "Run", rmin = 3, rmax = 12, max_comps = NULL, min_comps = NULL, saveDir = NULL, smooth = FALSE, colMap = NULL, pR = FALSE, gbuild = "hg19", featsToUse = NULL, ploidyData = NULL)
segData |
The data to be analyzed. If a path name, readSegs is used to make the list. Otherwise the list must be formatted correctly. Refer to ?readSegs for format information. |
cores |
The number of computer cores to be used for parallel processing |
components |
Can be used when fixing components. Default is NULL. |
saveRes |
Whether or not to save the resulting tables and plots. Default is FALSE |
runName |
Used to title plots and files when saving results |
rmin |
Minimum number of signatures to look for. Default is 3. |
rmax |
Maximum number of signatures to look for. Default is 12. |
max_comps |
vector of length 6 specifying the max number of components for each feature. Passed to fitModels. Default is 10 for all features |
min_comps |
vector of length 6 specifying the min number of components for each feature. Passed to fitModels. Default is 2 for all features |
saveDir |
Used to specify where to save the results, must be provided if using saveRes |
smooth |
Whether or not to smooth the input data. Default is FALSE. |
colMap |
Mapping of column names when reading from text file. Default column names are ID, chromosome, start, end, segVal. |
pR |
Peak Reduction reduces peaks in modeling to make modeling easier. Default is FALSE. |
gbuild |
The reference genome build. Default is hg19. Also supports hg18 and hg38. |
featsToUse |
Vector of feature names that you wish to use |
ploidyData |
The ploidy data to use as a feature |
Returns a list with all of the results from the pipeline
#Runs the pipeline up to the signature number determination on the example
#data, giving it 6 cores and a run name of "TCGA Test"
detSigNumPipeline(segDataExp, cores = 6, saveRes = FALSE, runName = "TCGA Test")
This function is used to determine the similarity between two signatures that have different underlying components. It uses a KS-statistic-based measure to estimate similarity for normal-distribution-based components and a correlation measure when comparing Poisson-distribution-based components.
diffCompSigSim(refComps, refWeights, valComps, valWeights)
refComps |
Reference component parameters |
refWeights |
Reference component weights |
valComps |
Component parameters to compare against |
valWeights |
Component weights to compare against |
Returns a correlation value
This function is used to extract the six copy number features that are eventually used in order to make the signatures. It does this using six sub functions to extract each feature. Before extracting the features, the segments are passed through a validation function to make sure the data is formatted correctly and there are no invalid segments. Can be done in parallel using the cores parameter.
extractCNFeats(segData, gbuild = "hg19", cores = 1, featsToExtract = CNSigs::defaultFeats)
segData |
The copy number segment data |
gbuild |
The reference genome build. Default is hg19. Also supports hg18 and hg38. |
cores |
The number of cores to use for parallel processing. Default 1. |
featsToExtract |
The names of the features to extract. |
Returns a list of data frames containing the results for the six copy number features
extractCNFeats(segDataExp)
This group of functions returns vectors of the corresponding features for the samples passed in. Some of the functions use datasets internal to the CNSigs package that specify the chromosome lengths and the centromere positions.
extractSegsize(segData)
extractBP10MB(segData, chrlen)
extractOscillations(segData, chrlen)
extractBPChrArm(segData, centromeres, chrlen)
extractChangepoints(segData, centromeres, chrlen)
extractCN(segData)
segData |
The samples to extract data from |
chrlen |
The lengths of the chromosomes from reference genome |
centromeres |
The positions of the centromeres in reference genome |
extractSegsize returns a vector of the segment sizes for all of the samples.
extractBP10MB returns a vector of the average number of breakpoints per 10MB for each chromosome.
extractOscillations returns a vector of the number of oscillation events found on each chromosome.
extractBPChrArm returns a vector of the total number of breakpoints per chromosome arm.
extractChangepoints returns a vector of the average size of changepoints per chromosome.
extractCN returns a vector of the average copy number per chromosome.
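To make two of these features concrete, here is a small base R sketch on a toy segment table. The column names follow the default colMap (ID, chromosome, start, end, segVal). Treating segment size as end minus start, and counting breakpoints as the number of segments minus one per chromosome, are assumptions for illustration and may differ from the package's exact definitions.

```r
# Toy segment table for one sample, using the default column names.
segs <- data.frame(
  ID = "S1",
  chromosome = c("1", "1", "1", "2"),
  start = c(1, 5000001, 20000001, 1),
  end = c(5000000, 20000000, 30000000, 40000000),
  segVal = c(2, 3, 2, 1)
)

# Assumed definition: size of each segment in base pairs.
segsize <- segs$end - segs$start

# Assumed definition: breakpoints per chromosome = segments on it minus one.
bpPerChr <- table(segs$chromosome) - 1
```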
The extracted features from the segDataExp dataset. Generated using the function extractCNFeats(segDataExp). Each data frame has a value and an ID column. The ID tells you which sample the observed value is from.
featsExp
A list with 6 data frames:
Size of every segment
Average # of breakpoints per 10MB per chromosome
Number of oscillation events per chromosome
Average changepoint per chromosome
Average copy number per chromosome
Number of breakpoints per chromosome arm
This function is used to find the signature exposures of a set of samples using fixed signatures found earlier. It does this using the least squares optimization method with constraints to keep the output as non-negative using the lsei function from the package limSolve.
findExposures(scm, fixedSigs, runName = "", saveRes = FALSE, saveDir = NULL)
scm |
Sample by component matrix |
fixedSigs |
The fixed signatures |
runName |
Name of the run used in file names, Default is "" |
saveRes |
Whether or not to save the results, Default is FALSE |
saveDir |
Where to save the results, must be provided if using saveRes |
Returns the resulting matrix of exposures
findExposures(t(scmExp), sigsExp)
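The idea of fitting non-negative exposures to fixed signatures can be sketched in base R. The package is described as using limSolve::lsei; the sketch below swaps in a simple multiplicative-update rule instead, so that it runs without extra packages. The matrix orientation (components in rows, signatures in columns), the toy values, and the helper name solveExposures are assumptions for illustration.

```r
# Hypothetical fixed signatures: rows are components, columns are signatures.
fixedSigs <- cbind(Sig1 = c(0.7, 0.2, 0.1, 0.0),
                   Sig2 = c(0.0, 0.1, 0.3, 0.6))

# One sample's component vector, built here from known exposures of 2 and 3.
sampleComps <- as.vector(fixedSigs %*% c(2, 3))

# Multiplicative updates keep every exposure non-negative at each step.
# This stands in for limSolve::lsei and is not the package's own solver.
solveExposures <- function(sigs, comps, iters = 500) {
  expo <- rep(1, ncol(sigs))
  target <- as.vector(crossprod(sigs, comps))
  gram <- crossprod(sigs)
  for (i in seq_len(iters)) {
    expo <- expo * target / as.vector(gram %*% expo)
  }
  setNames(expo, colnames(sigs))
}

exposures <- solveExposures(fixedSigs, sampleComps)
```

Because the toy sample was generated from the signatures without noise, the recovered exposures should come out close to the values used to build it.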
This function is used to fit a mixture model of either normal or Poisson distributions to the input data. It is mainly used by the fitModels function to create the components from the extracted features.
fitComponent(toFit, min_prior = 0.001, min_comp = 2, max_comp = 10, dist = "norm", pR = FALSE, seed = 77777, model_sel = "BIC", niter = 10000, nrep = 1)
toFit |
Extracted features to fit models to. |
min_prior |
Minimum prior probability of a cluster. Default is 0.001. |
min_comp |
Minimum number of models to fit. Default is 2. |
max_comp |
Maximum number of models to fit. Default is 10. |
dist |
Type of distribution to fit. Either "norm" or "pois". Default "norm" |
pR |
Peak Reduction reduces peaks in modeling to make modeling easier. Default is FALSE. |
seed |
Seed to be used for modeling. Default is 77777 |
model_sel |
Type of model_selection method to be used. Default "BIC". See flexmix package for more options. |
niter |
Max number of iterations for modeling. Default is 10000. |
nrep |
Number of repetitions for modeling attempts. Default is 1. |
Returns the flexmix object for the fit model.
fitComponent(featsExp$bp10MB[,2]) #Fits 2-10 normal distributions
#Tries to fit exactly 4 poisson distributions
fitComponent(featsExp$osCN[,2], dist = "pois", min_comp = 4, max_comp = 4)
This function takes all of the extracted copy number features and attempts to fit a mixture of poisson and normal distributions to the data, and returns a mixture of components that can be used to build the signatures. The order of features is "segsize","bp10MB","osCN","changepoint","copynumber","bpchrarm". Therefore if you only want to change the maximum number of components for osCN to 5 then you would use max_comps = c(10,10,5,10,10,10).
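Building the max_comps vector by name rather than by position can help avoid off-by-one mistakes. The following base R sketch reproduces the c(10,10,5,10,10,10) example from the feature order stated above; the variable name featOrder is introduced here for illustration only.

```r
# Feature order as documented for fitModels.
featOrder <- c("segsize", "bp10MB", "osCN", "changepoint", "copynumber", "bpchrarm")

# Start from the documented default of 10 for every feature, then override osCN.
max_comps <- setNames(rep(10, length(featOrder)), featOrder)
max_comps["osCN"] <- 5

# Drop the names to get the plain length-6 vector described above.
max_comps <- unname(max_comps)
```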
fitModels(CN_features, max_comps = NULL, min_comps = NULL, cores = 1, pR = FALSE, min_prior = NULL, featsToModel = CNSigs::defaultFeats)
CN_features |
List of features received from extractCNFeats |
max_comps |
vector of length 6 specifying the max number of components for each feature. default is 10 for all features |
min_comps |
vector of length 6 specifying the min number of components for each feature. default is 2 for all features |
cores |
Number of parallel cores to use. Default is 1. |
pR |
Peak Reduction reduces peaks in modeling to make modeling easier. Default is FALSE. |
min_prior |
Used to override the minimum prior probability of a cluster |
featsToModel |
The names of the features to model. |
Returns a list of the different components that contain flexmix objects for each feature
fitModels(featsExp)
#Models an exact number of components, useful when comparing two different
#datasets
min_comps = c(7, 3, 3, 2, 2, 3)
max_comps = c(7, 3, 3, 2, 10, 3)
fitModels(featsExp, max_comps, min_comps)
This function takes in an extracted set of features and a defined set of components, and calculates the sum of the posterior probabilities for each feature. This sum represents how much each component contributes to a sample and corresponds to one column in the matrix.
generateSCM(feats, comps, runName = "", saveRes = FALSE, saveDir = NULL)
feats |
List of features received from extractCNFeats |
comps |
List of components modelled using fitModels |
runName |
Name of the run used in file names, Default is "" |
saveRes |
Whether or not to save the results, Default is FALSE |
saveDir |
Where to save the results, Default is getwd() |
Creates a sample by component matrix
generateSCM(featsExp, compsExp)
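The "sum of posterior probabilities" step can be illustrated with a short base R sketch for a single feature modelled by a two-component normal mixture. The mixture parameters and the toy observations are hypothetical, and the package's own computation (which works on flexmix objects) may differ in detail.

```r
# Hypothetical two-component normal mixture for one feature.
means <- c(1, 5)
sds <- c(1, 1)
weights <- c(0.5, 0.5)

# Toy feature table with the value and ID columns described for featsExp.
feat <- data.frame(ID = c("S1", "S1", "S2"), value = c(0.8, 1.2, 5.1))

# Weighted density of every observation under every component.
dens <- sapply(seq_along(means), function(k) {
  weights[k] * dnorm(feat$value, mean = means[k], sd = sds[k])
})

# Posterior probability of each component for each observation (rows sum to 1).
post <- dens / rowSums(dens)

# Sum the posteriors per sample: one row per sample, one column per component.
scm <- rowsum(post, group = feat$ID)
```

Each row of scm then sums to the number of observations that sample contributed for this feature.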
This function is used to find a mapping between two similar sets of signatures. It compares the signature values to see how similar the proposed signatures are and shows you the best matches. It uses cosine similarity to compare signatures. The two signature sets must have the same underlying components to be matched.
matchSigs(referenceSigs, toCompareSigs)
referenceSigs |
Signature matrix from your reference analysis |
toCompareSigs |
Signature matrix from run that you want to compare to reference |
Prints out the signature mapping and returns the average similarity
matchSigs(referenceExp$sigs, referenceExp$sigs)
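Cosine-similarity matching can be sketched in a few lines of base R. The toy matrices below assume signatures are stored in columns over a shared set of three components; that layout, the values, and the helper name cosineSim are assumptions for illustration rather than the package's internal representation.

```r
# Cosine similarity between two numeric vectors.
cosineSim <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Hypothetical signature matrices sharing the same three components (rows).
ref <- cbind(Sig1 = c(0.7, 0.2, 0.1), Sig2 = c(0.1, 0.1, 0.8))
new <- cbind(A = c(0.1, 0.2, 0.7), B = c(0.6, 0.3, 0.1))

# Similarity of every reference signature (rows) to every new signature (columns).
simMat <- outer(seq_len(ncol(ref)), seq_len(ncol(new)),
                Vectorize(function(i, j) cosineSim(ref[, i], new[, j])))
dimnames(simMat) <- list(colnames(ref), colnames(new))

# Best-matching new signature for each reference signature.
bestMatch <- colnames(new)[apply(simMat, 1, which.max)]
```

In this toy case Sig1 pairs with B and Sig2 pairs with A.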
This function plots the specified mixed model so that it can be visualized. It utilizes the gamma function to allow approximations of the poisson distributions, allowing for a smooth plot.
plotComp(comps, compName, saveRes = FALSE, saveDir = NULL, runName = "")
comps |
List of components to be plotted. Output from fitModels. |
compName |
Name of the component to plot |
saveRes |
Whether or not to save results. Default is FALSE. |
saveDir |
Where to save plots, must be provided if using saveRes |
runName |
Used to add a runName to the file output. Default is "". |
Plots the components to allow visualization
plotComp(compsExp, compName = "segsize")
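The gamma function extends the factorial to non-integer values, which is what makes a smooth Poisson curve possible. A minimal base R sketch of that idea follows; the helper name smoothPois and the rate of 3 are illustrative assumptions, not the package's plotting code.

```r
# Poisson probability mass with x! replaced by gamma(x + 1),
# so it can be evaluated at non-integer x for a smooth curve.
smoothPois <- function(x, lambda) exp(-lambda) * lambda^x / gamma(x + 1)

# Evaluate on a fine grid for plotting.
xs <- seq(0, 10, by = 0.1)
curveVals <- smoothPois(xs, lambda = 3)
```

At integer values the curve agrees with dpois(), so it passes through the true Poisson probabilities.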
This function plots all of the mixture models so that they can be visualized. It utilizes the gamma function to allow approximations of the Poisson distributions, allowing for a smooth plot.
plotComps(comps, saveRes = FALSE, saveDir = NULL, runName = "")
comps |
List of components to be plotted. Output from fitModels. |
saveRes |
Whether or not to save results. Default is FALSE. |
saveDir |
Where to save plots. Default is getwd() |
runName |
Used to add a runName to the file output. Default is "". |
Plots all the components to allow visualization
plotComps(compsExp)
This function is used to generate the sample by component matrix plot.
plotScm(scm, runName = "", saveRes = FALSE, saveDir = NULL, rowOrder = FALSE, colOrder = TRUE)
scm |
Sample by component matrix |
runName |
Name of the run used in plot titles, Default is "" |
saveRes |
Whether or not to save the plots, Default is FALSE |
saveDir |
Where to save the plots, must be provided if using saveRes |
rowOrder |
Ordering specification for the rows of the heatmap. Three possible options: * TRUE: Uses hierarchical clustering to determine row order. * FALSE: (default) Leaves rows in the order they were given. * A numeric vector the same length as the number of rows specifying the indices of the input matrix |
colOrder |
Ordering specification for the columns of the heatmap. See above for options. Default value is TRUE. |
pheatmap figure of component result by sample
plotScm(scmExp)
plotScm(scmExp, rowOrder = FALSE, colOrder = FALSE)
newOrder = sample(1:ncol(scmExp), ncol(scmExp))
plotScm(scmExp, colOrder = newOrder)
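The two non-default ordering options can be previewed on a plain matrix before plotting. This base R sketch assumes the TRUE option behaves like hclust() on Euclidean distances, which may not match the clustering settings the package actually passes to pheatmap; the toy matrix is hypothetical.

```r
# Toy matrix with three rows; rows "a" and "c" are nearly identical.
m <- rbind(a = c(1, 0, 0), b = c(0, 1, 0), c = c(0.9, 0.1, 0))

# Option 1 (assumed equivalent of TRUE): order rows by hierarchical clustering.
clusterOrder <- hclust(dist(m))$order
reordered <- m[clusterOrder, ]

# Option 2: an explicit numeric vector of row indices.
manual <- m[c(3, 1, 2), ]
```

With clustering, the two similar rows end up next to each other; with a numeric vector the rows appear exactly in the order given.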
This function is used to plot a sample's segments. The input can be either a single data.frame with the segments of one sample or a list of data.frames for multiple samples.
plotSegs(samples, name = "", chrom = -1, gbuild = "hg19", sep = FALSE, alpha = 1)
samples |
The samples to plot. If a list it plots both on the same plot |
name |
The name of the sample. Used for plot title |
chrom |
Which chromosome to plot. Default plots all of them. |
gbuild |
The reference genome build. Default is hg19. Also supports hg18 and hg38. |
sep |
Whether to place different members of the list on separate axes (TRUE) or on the same axis (FALSE). Default is FALSE. |
alpha |
Allows you to adjust the transparency of the lines. 0-1 |
displays a plot of the segments
plotSegs(segDataExp[[1]]) #Plots all of the first sample's segments
plotSegs(segDataExp[[1]], chrom = 1) #Only plots the first chromosome segments
plotSegs(segDataExp[1:2]) #Plots first two samples on same axis
plotSegs(segDataExp[1:2], sep = TRUE) #Plots first two samples separately
This function is used to create a plot for the specified signature to look at the contribution of each of the components to the signature.
plotSig(sigs, sigNum)
sigs |
The dataset of component contribution to each signature |
sigNum |
The signature number to plot |
displays a plot of the signature
plotSig(referenceExp$sigs, 1) #Plots first signature
This function plots the signature exposure for all of the samples as a stacked bar plot. There are a number of different options for how to sort the resulting plot.
plotSigExposure(sigExposure, saveRes = FALSE, saveDir = NULL, runName = "", trackData = NULL, sort = FALSE, sortOrder = "m", method = NULL, colors = NULL)
sigExposure |
Signature exposure matrix to be plotted |
saveRes |
Whether or not to save results. Default is FALSE. |
saveDir |
Where to save plots, must be provided if using saveRes |
runName |
Used to add a runName to the file output. Default is "". |
trackData |
Data used to plot tracks |
sort |
Whether or not to sort the plot |
sortOrder |
The order in which to sort the plot |
method |
The method by which to sort the main plot |
colors |
Colors used in plotting |
Adding data tracks to the plot: One of the major features of this function is that it allows the user to add in some additional data for the samples to be plotted as a track alongside the main signature exposure stacked bar plot. These additional data points can be passed in as a vector of corresponding values in the same order. If you want to plot multiple tracks you can pass in a list of vectors using the trackData parameter.
Specifying how to sort the plot: When you give the function a set of trackData, you can specify the sortOrder, which lets you sort the main plot in a different order. "m" represents the main plot, and "t" followed by the number of the track (e.g. "t1", "t2", ...) represents the tracks. By chaining the values together you can specify a variety of ways to sort the final plot. As an example, the sortOrder of "mt1t2" specifies that the plot should be sorted by the signature exposures first, followed by the first track and finally the second track. In another example, the sortOrder of "t2mt1" specifies that the plot should be sorted by track number 2 first, followed by the signature exposures and lastly by track number 1.
Sorting method: The two methods of sorting the signature exposure are either "hclust" or "group". The hclust uses the ward.D method to cluster the exposures and then cuts the tree to split the data. The group method splits the samples into groups based on which signatures they had the highest exposure to.
Plots the signature exposure to allow visualization
plotSigExposure(sigExposExp)
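The sortOrder string and the "group" method can be illustrated with a short base R sketch. The helper name parseSortOrder, the toy exposure matrix (samples in rows), and the track values are all hypothetical; the package's internal parsing and sorting may work differently.

```r
# Split a sortOrder string such as "t2mt1" into its sort keys.
parseSortOrder <- function(sortOrder) {
  regmatches(sortOrder, gregexpr("m|t[0-9]+", sortOrder))[[1]]
}
keys <- parseSortOrder("t2mt1")

# Hypothetical exposures: one row per sample, one column per signature.
expos <- rbind(S1 = c(0.7, 0.3), S2 = c(0.2, 0.8), S3 = c(0.6, 0.4))

# "group"-style key: the signature each sample is most exposed to.
topSig <- apply(expos, 1, which.max)

# Hypothetical track; sorting as "t1m" means track first, then the main key.
track1 <- c(S1 = "B", S2 = "A", S3 = "A")
ord <- order(track1, topSig)
```

Here the two samples with track value "A" come first, ordered among themselves by their dominant signature.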
This function is used to generate the signature exposure matrix heatmap plot.
plotSigExposureMat(sigExposure, runName = "", saveRes = FALSE, saveDir = NULL, rowOrder = FALSE, colOrder = TRUE)
sigExposure |
Sample by signature matrix |
runName |
Name of the run used in plot titles, Default is "" |
saveRes |
Whether or not to save the plots, Default is FALSE |
saveDir |
Where to save the plots, must be provided if using saveRes |
rowOrder |
Ordering specification for the rows of the heatmap. Three possible options: * TRUE: Uses hierarchical clustering to determine row order. * FALSE: (default) Leaves rows in the order they were given. * A numeric vector the same length as the number of rows specifying the indices of the input matrix |
colOrder |
Ordering specification for the columns of the heatmap. See above for options. Default value is TRUE. |
pheatmap figure of signature exposure by patient
plotSigExposureMat(sigExposExp)
plotSigExposureMat(sigExposExp, rowOrder = FALSE, colOrder = FALSE)
newOrder = sample(1:ncol(sigExposExp), ncol(sigExposExp))
plotSigExposureMat(sigExposExp, colOrder = newOrder)
This function is used to generate the signature by component matrix plot.
plotSigMat(sigs, runName = "", saveRes = FALSE, saveDir = NULL)
sigs |
Signature by component matrix |
runName |
Name of the run used in plot titles, Default is "" |
saveRes |
Whether or not to save the plots, Default is FALSE |
saveDir |
Where to save the plots, must be provided if using saveRes |
pheatmap figure of component weights by signature
plotSigMat(sigsExp)
This function plots all of the signatures so that they can be visualized. It does this by looping through the signatures and calling the plotSig function.
plotSigs(sigs, saveRes = FALSE, saveDir = NULL, runName = "")
sigs |
The dataset of component contribution to each signature |
saveRes |
Whether or not to save results. Default is FALSE. |
saveDir |
Where to save plots, must be provided if using saveRes |
runName |
Used to add a runName to the file output. Default is "". |
Plots all the signatures to allow visualization
plotSigs(referenceExp$sigs)
This function calculates the probability that each new data point falls into each of the distributions defined by the parameters. It is used when calculating the sample by component matrix.
postProb(params, newData)
params |
A vector of the distribution parameters |
newData |
The new data to calculate the probabilities for |
Returns the probability that the newData is in the distributions
This function is used to read in the segments. It can either take a file path to a csv to read in the data, or it can take in a long data frame and convert it to the format needed for the pipeline. The variable colMap is used to map your column names to what the pipeline expects. For instance, if the column that holds the chromosome numbers is titled "chrom" instead of the expected "chromosome", you would specify the colMap as c("ID","chrom","start","end","segVal"). If your data is separated into major and minor allele copy numbers, the segVal part of the colMap should be formatted as "nMajor+nMinor" to let the function know to add them together.
readSegs(path, colMap = NULL, readPloidy = FALSE)
path |
The path to the .txt file with the data in it, or a folder containing the .txt files |
colMap |
The mapping of column names. The default is c("ID","chromosome","start","end","segVal"). If your column names vary from this please pass a vector similar to the above with the changes. |
readPloidy |
Whether or not the input file has ploidy and should be read |
Returns the segments in a list formatted to be run through the pipeline
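The column mapping and the "nMajor+nMinor" convention can be pictured with a small base R sketch that works on an in-memory data frame. The input column names ("sample", "chrom") and the toy values are hypothetical, and the code only mimics the described behavior; it is not readSegs itself.

```r
# Toy long-format input with non-default column names and split allele counts.
raw <- data.frame(sample = c("S1", "S1", "S2"), chrom = c("1", "2", "1"),
                  start = c(1, 1, 1), end = c(100, 200, 300),
                  nMajor = c(1, 2, 1), nMinor = c(1, 0, 1))

# Mapping onto the expected names; the last entry asks for a sum of two columns.
colMap <- c("sample", "chrom", "start", "end", "nMajor+nMinor")
expected <- c("ID", "chromosome", "start", "end", "segVal")

# Split "nMajor+nMinor" into its column names and add them row by row.
alleleCols <- strsplit(colMap[5], "+", fixed = TRUE)[[1]]
long <- data.frame(raw[, colMap[1:4]],
                   segVal = rowSums(raw[, alleleCols, drop = FALSE]))
names(long) <- expected

# One data frame per sample, which is the list shape the pipeline is described as using.
segList <- split(long, long$ID)
```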
This function is used to reduce the peaks within a feature distribution so that the models can be fitted properly. The flexmix package can struggle to converge on a solution if there are large spikes in the distribution.
reducePeaks(toReduce)
toReduce |
Input feature distribution |
Returns the input feature with the peaks reduced
The generated result object from the entire pipeline using segDataExp. Function used to create: referenceExp = runPipeline(segDataExp, nsigs = 5)
referenceExp
A list with 7 elements:
The function call used in the pipeline run
The data the features were extracted from
The extracted features
The fitted component models
The sample by component matrix
The results of the NMF run
The signature by component matrix.
This function is used to remap the results from a runPipeline run to a different signature order or to different signature names. To reorder, give the function a new mapping of the signatures, which is simply the order in which you want the signatures to appear. For instance, to swap the first and second signatures when you have a total of 4 signatures, you would pass in c(2,1,3,4) for the sigMap parameter. To rename the signatures, pass a vector of names that has the same length as the number of signatures.
remapResults(path, sigMap = NULL, sigNames = NULL, saveRes = FALSE)
path |
The path to the results folder to remap |
sigMap |
The new order for the signatures |
sigNames |
New signature names |
saveRes |
Whether or not to save results. Default is FALSE. |
Overall, this function will create a duplicate results folder in the same directory and regenerate all of the plots and result files into the new order or with the new names. This means that you don't have to regenerate all of the plots manually.
No return value
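The effect of sigMap and sigNames on a signature matrix can be sketched in base R. The toy matrix (components in rows, signatures in columns) and the new names are hypothetical, and the real function works on a saved results folder rather than an in-memory matrix.

```r
# Hypothetical signature matrix: 2 components (rows) by 4 signatures (columns).
sigs <- matrix(1:8, nrow = 2,
               dimnames = list(c("comp1", "comp2"), paste0("Sig", 1:4)))

# Reordering: swap the first and second signatures.
sigMap <- c(2, 1, 3, 4)
remapped <- sigs[, sigMap]

# Renaming: one name per signature, in the current column order.
renamed <- sigs
colnames(renamed) <- c("Early", "Late", "Other1", "Other2")
```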
This function allows you to run the entire Copy number signature pipeline in one go. May take a while, especially if not given multiple cores. For more information on what actually happens in the pipeline, refer to the CNSigs vignette.
runPipeline(segData, cores = 1, nsigs = 0, saveRes = FALSE, runName = "Run", rmin = 3, rmax = 12, components = NULL, max_comps = NULL, min_comps = NULL, fixedSigs = NULL, saveDir = NULL, smooth = FALSE, colMap = NULL, pR = FALSE, gbuild = "hg19", featsToUse = NULL, ploidyData = NULL, plot = TRUE)
segData |
The data to be analyzed. If a path name, readSegs is used to make the list. Otherwise the list must be formatted correctly. Refer to ?readSegs for format information. |
cores |
The number of computer cores to be used for parallel processing |
nsigs |
The number of signatures to look for. Value of 0 runs the determineSigNum function to look for optimal number. Default is 0. |
saveRes |
Whether or not to save the resulting tables and plots. Default is FALSE |
runName |
Used to title plots and files when saving results |
rmin |
Minimum number of signatures to look for. Default is 3. |
rmax |
Maximum number of signatures to look for. Default is 12. |
components |
Can be used when fixing components. Default is NULL. |
max_comps |
Vector of length 6 specifying the maximum number of components for each feature. Passed to fitModels. Default is 10 for all features. |
min_comps |
Vector of length 6 specifying the minimum number of components for each feature. Passed to fitModels. Default is 2 for all features. |
fixedSigs |
Signature x Component matrix. Used when fixing signatures. Default is NULL |
saveDir |
Used to specify where to save the results. Must be provided if using saveRes. |
smooth |
Whether or not to smooth the input data. Default is FALSE. |
colMap |
Mapping of column names when reading from text file. Default column names are ID, chromosome, start, end, segVal. |
pR |
Peak reduction. Default is FALSE. |
gbuild |
The reference genome build. Default is hg19. Also supports hg18 and hg38. |
featsToUse |
Vector of feature names that you wish to use |
ploidyData |
The ploidy data to use as a feature |
plot |
Whether or not to generate the plots. Default is TRUE. |
Returns a list with all of the results from the pipeline
# Runs the entire pipeline on the example data giving it 6 cores and specifying
# 5 signatures with a name of "TCGA Test"
runPipeline(segDataExp, cores = 6, nsigs = 5, saveRes = FALSE, "TCGA Test")
The generated sample component matrix (scm) for the segDataExp dataset, created with generateSCM(featsExp,compsExp). It is a matrix showing how much each extracted component contributes to each sample. This matrix is the input to NMF and is used to create the signatures.
scmExp
An object of class matrix (inherits from array) with 20 rows and 20 columns.
A small example subset of BRCA samples from TCGA in list format. Each item contains the segmentation data for that sample.
segDataExp
A list with 20 elements and 5 variables:
ID: ID number for the sample
chromosome: chromosome the segment is found on
start: starting position of the segment
end: end position of the segment
segVal: copy number value for the segment
https://portal.gdc.cancer.gov/
This function allows you to get an overview of some of the features of your samples. It outputs a summary of stats for the segments, including size, number per sample, along with various other measures.
segStats(segTabs)
segTabs |
The list of copy number segments for each sample |
Outputs a summary of the statistics
segStats(segDataExp)
The generated signature exposure matrix for the segDataExp dataset. Extracted from the referenceExp data object using referenceExp$sigExposure. This matrix shows how much each signature contributes to the patient samples.
sigExposExp
An object of class data.frame with 5 rows and 20 columns.
The generated signature by component matrix for the segDataExp dataset. Extracted from the referenceExp data object using referenceExp$sigs. This matrix shows how much each component contributes to the signatures.
sigsExp
An object of class matrix (inherits from array) with 20 rows and 5 columns.
This function compares two sets of signatures by computing the similarity matrix across both signature sets. If the signatures have the same underlying components, similarity is calculated using the cosine similarity. If the signatures have different underlying components, the similarity is estimated using a measure based on the KS statistic. See the package vignette for more information.
sigSim(reference, toCompare, plot = TRUE, text = TRUE)
reference |
Results from your reference analysis |
toCompare |
Results from run that you want to compare to reference |
plot |
If TRUE, displays the heatmap plot |
text |
If TRUE, displays the similarity value on the plot |
Plots signature similarity and returns the average similarity
sigSim(referenceExp, referenceExp)
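For the case where both signature sets share the same components, the cosine similarity can be sketched in base R as follows. The helper `cosineSim` and the toy matrices `refSigs` and `newSigs` (columns are signatures, values are made up) are hypothetical names introduced here for illustration; they are not part of the package, and sigSim itself may organise the computation differently.

```r
# Cosine similarity between two numeric vectors (hypothetical helper)
cosineSim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy signatures defined over the same three components; columns are signatures
refSigs <- cbind(R1 = c(1, 0, 0), R2 = c(0, 1, 1))
newSigs <- cbind(N1 = c(2, 0, 0), N2 = c(0, 1, 0))

# Similarity matrix: rows are reference signatures, columns are compared signatures
simMat <- sapply(colnames(newSigs), function(j) {
  sapply(colnames(refSigs), function(i) cosineSim(refSigs[, i], newSigs[, j]))
})

# One possible summary: the mean of all pairwise similarities
avgSim <- mean(simMat)
```

Cosine similarity ignores the overall scale of each signature, so R1 and N1 above are treated as identical even though their magnitudes differ.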
This function attempts to smooth the input copy number segments in order to reduce the biasing effect of the technology and copy number caller used to make the segments. It does this by trying to join together close segments and removing small aberrant segments.
smoothSegs(segData, cores = 1)
segData |
The segData to be smoothed |
cores |
Number of cores to be used for parallel smoothing. Default is 1. |
Returns the smoothed segments
smoothSegs(segDataExp)
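The idea of joining segments can be illustrated with a small base R sketch that merges consecutive segments on the same chromosome when they carry the same copy number. This is only an assumed, simplified criterion chosen for illustration: the rules smoothSegs actually applies for closeness and for removing small segments are not described here, and the table `segs` is made-up data.

```r
# Hypothetical segments for one sample, already sorted by position
segs <- data.frame(
  chromosome = c("1", "1", "1", "2"),
  start = c(1, 101, 201, 1),
  end = c(100, 200, 300, 150),
  segVal = c(2, 2, 3, 2)
)

# Start a new run whenever the chromosome or the copy number changes
newRun <- c(TRUE,
            segs$chromosome[-1] != head(segs$chromosome, -1) |
              segs$segVal[-1] != head(segs$segVal, -1))
runId <- cumsum(newRun)

# Collapse each run into a single segment spanning its first start to its last end
merged <- do.call(rbind, lapply(split(segs, runId), function(run) {
  data.frame(chromosome = run$chromosome[1], start = min(run$start),
             end = max(run$end), segVal = run$segVal[1])
}))
```

Here the first two segments on chromosome 1 collapse into one segment covering positions 1 to 200, while the other two segments are left unchanged.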
This function is used to calculate the sum of posteriors for a given feature. It returns a vector of posterior probabilities which describe how much each component contributes to the distribution of features passed in.
sumOfPosteriors(feat, comps, name)
feat |
Feature to calculate the sum from. |
comps |
Component parameters for that feature |
name |
Name of feature to sum across |
Returns the sum of the posteriors for the specified feature.
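As a rough sketch of the quantity involved, the base R snippet below computes posterior component probabilities for a mixture of two normal distributions and sums them over the observations. The parameter vectors `means`, `sds` and `weights` and the data `x` are hypothetical stand-ins; the package stores components as flexmix objects, so its own computation will differ in detail.

```r
# Hypothetical mixture parameters for one feature (two normal components)
means   <- c(0, 10)
sds     <- c(1, 1)
weights <- c(0.5, 0.5)

# Hypothetical feature values for one sample
x <- c(-0.2, 0.1, 9.8, 10.3, 0.4)

# Weighted density of each observation under each component (rows: observations)
dens <- sapply(seq_along(means), function(k) weights[k] * dnorm(x, means[k], sds[k]))

# Posterior probability of each component for each observation (each row sums to 1)
post <- dens / rowSums(dens)

# Summing over observations gives one value per component
posteriorSums <- colSums(post)
```

Because each row of `post` sums to 1, the entries of `posteriorSums` add up to the number of observations, so each entry can be read as the share of the data attributed to that component.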
This function is used to validate and clean up all of the input data. It converts all the columns to numeric that need to be, and filters out any invalid segments, like ones with 0 or negative length. It also converts the chromosome tags to the proper format for feature extraction.
validateSegData(segData, cores = 1)
segData |
The copy number segment data |
cores |
The number of cores to use for parallel processing. Default 1. |
Returns a list of data frames containing the converted segment data
validateSegData(segDataExp)
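The kind of clean-up described above can be sketched in base R on a single made-up table. The data frame `rawSegs` and the choice to strip a "chr" prefix are assumptions for illustration only; the chromosome format the package actually expects is not stated here, and validateSegData operates on a list of such tables.

```r
# Hypothetical raw segment table with character columns and two invalid rows
rawSegs <- data.frame(
  ID = "S1",
  chromosome = c("chr1", "chr1", "chrX", "chr2"),
  start = c("100", "500", "300", "900"),
  end = c("400", "500", "200", "1500"),
  segVal = c("2", "3", "1", "4"),
  stringsAsFactors = FALSE
)

# Convert the columns that should be numeric
numCols <- c("start", "end", "segVal")
rawSegs[numCols] <- lapply(rawSegs[numCols], as.numeric)

# Drop segments with zero or negative length
cleanSegs <- rawSegs[rawSegs$end - rawSegs$start > 0, ]

# Normalise chromosome labels (assumed target format: no "chr" prefix)
cleanSegs$chromosome <- sub("^chr", "", cleanSegs$chromosome)
```

Only the first and last rows survive the filter, since the second segment has zero length and the third ends before it starts.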