Title: | A 'Sparklyr' Extension for 'VariantSpark' |
---|---|
Description: | This is a 'sparklyr' extension integrating 'VariantSpark' and R. 'VariantSpark' is a framework based on 'scala' and 'spark' to analyze genome datasets, see <https://bioinformatics.csiro.au/>. It was tested on datasets with 3000 samples, each containing 80 million features, in both unsupervised clustering approaches and supervised applications such as classification and regression. Genome datasets are usually written in VCF, a specific text file format used in bioinformatics for storing gene sequence variations. 'VariantSpark' is therefore well suited to genome research, because it can read VCF files, run analyses and return the output in a 'spark' data frame. |
Authors: | Samuel Macêdo [aut, cre], Javier Luraschi [aut] |
Maintainer: | Samuel Macêdo <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 0.1.1 |
Built: | 2024-12-14 06:24:24 UTC |
Source: | CRAN |
This function extracts the importance data frame from the Importance Analysis jobj.
importance_tbl(importance, name = "importance_tbl")
importance |
A jobj from the class |
name |
The name to assign to the copied table in Spark. |
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_vcf <- vs_read_vcf(vsc, system.file("extdata/hipster.vcf.bz2", package = "variantspark"))
labels <- vs_read_labels(vsc, system.file("extdata/hipster_labels.txt", package = "variantspark"))
importance <- vs_importance_analysis(vsc, hipster_vcf, labels, 10)
importance_tbl(importance)
## End(Not run)
This function displays the first N sample names.
sample_names(vcf_source, n_samples = NULL)
vcf_source |
An object with |
n_samples |
The number of samples to display. |
spark_jobj, shell_jobj
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_vcf <- vs_read_vcf(vsc, system.file("extdata/hipster.vcf.bz2", package = "variantspark"))
sample_names(hipster_vcf, 3)
## End(Not run)
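To make the notion of "sample names" concrete without a Spark session, the following base-R sketch pulls them from a VCF header line. It assumes the standard VCF layout, in which the sample IDs are the columns that follow the eight fixed columns and FORMAT. The header string, the sample IDs S1 to S3, and the helper header_sample_names() are made up for illustration and are not part of 'variantspark'.

```r
# Hypothetical VCF header line with three made-up sample IDs.
header <- paste(
  c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
    "S1", "S2", "S3"),
  collapse = "\t"
)

# Hypothetical helper (not a variantspark function): sample IDs are the
# columns after the eight fixed VCF columns plus FORMAT, i.e. after column 9.
header_sample_names <- function(header_line, n_samples = NULL) {
  fields <- strsplit(header_line, "\t", fixed = TRUE)[[1]]
  samples <- fields[-seq_len(9)]
  if (is.null(n_samples)) samples else head(samples, n_samples)
}

first_two <- header_sample_names(header, 2)
all_names <- header_sample_names(header)
print(first_two)
```

Like sample_names(), the helper takes an optional count; here a NULL count is taken to mean "return every sample", which is an assumption about the intended behaviour rather than documented fact.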
You need to create a variantspark connection to use this extension. To do this, pass as the argument a Spark connection, which you can create using sparklyr::spark_connect().
vs_connect(sc)
sc |
A spark connection. |
A variantspark connection
library(sparklyr)
sc <- spark_connect(master = "spark://HOST:PORT")
connection_is_open(sc)
vsc <- vs_connect(sc)
spark_disconnect(sc)
This function performs an Importance Analysis using the random forest algorithm. For more details, see the 'VariantSpark' documentation.
vs_importance_analysis(vsc, vcf_source, labels, n_trees)
vsc |
A variantspark connection. |
vcf_source |
An object with |
labels |
An object with |
n_trees |
The number of trees used in the random forest. |
spark_jobj, shell_jobj
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_vcf <- vs_read_vcf(vsc, system.file("extdata/hipster.vcf.bz2", package = "variantspark"))
labels <- vs_read_labels(vsc, system.file("extdata/hipster_labels.txt", package = "variantspark"))
vs_importance_analysis(vsc, hipster_vcf, labels, 10)
## End(Not run)
The vs_read_csv() function reads a file in CSV format and returns a jobj object from the CsvFeatureSource Scala class.
vs_read_csv(vsc, path)
vsc |
A variantspark connection. |
path |
The file's path. |
spark_jobj, shell_jobj
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_labels <- vs_read_csv(vsc, system.file("extdata/hipster_labels.txt", package = "variantspark"))
hipster_labels
## End(Not run)
This function reads only the label column of a CSV file and returns a jobj object from the CsvLabelSource Scala class.
vs_read_labels(vsc, path, label = "label")
vsc |
A variantspark connection. |
path |
The file's path. |
label |
A string with the label column name. |
spark_jobj, shell_jobj
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
labels <- vs_read_labels(vsc, system.file("extdata/hipster_labels.txt", package = "variantspark"))
labels
## End(Not run)
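As a rough illustration of what "only the label column" means, the base-R sketch below writes a small labels file and extracts one named column from it. The file contents and the column names (samples, label, score) are invented for this example and may not match the layout of the bundled hipster_labels.txt. The vs_read_labels() call is left commented out because it needs an open variantspark connection.

```r
# Write a small, made-up labels file; the column names are assumptions.
labels_path <- tempfile(fileext = ".csv")
writeLines(c(
  "samples,label,score",
  "S1,1,0.7",
  "S2,0,0.2",
  "S3,1,0.9"
), labels_path)

# Base-R analogue of reading only the label column: load the file,
# then keep the single column whose name matches `label`.
labels_df <- read.csv(labels_path, stringsAsFactors = FALSE)
label_values <- labels_df[["label"]]
print(label_values)

# With an open variantspark connection `vsc`, the corresponding call would be:
# labels <- vs_read_labels(vsc, labels_path, label = "label")
```

If the label column in your own file has a different name, pass that name through the label argument instead of renaming the column.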
The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project. The vs_read_vcf() function reads this format and returns a jobj object from the VCFFeatureSource Scala class.
vs_read_vcf(vsc, path)
vsc |
A variantspark connection. |
path |
The file's path. |
spark_jobj, shell_jobj
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
vsc <- vs_connect(sc)
hipster_vcf <- vs_read_vcf(vsc, system.file("extdata/hipster.vcf.bz2", package = "variantspark"))
hipster_vcf
## End(Not run)
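For readers who have not seen a VCF file, here is a minimal sketch of its three parts: meta-information lines starting with "##", one header line starting with "#CHROM", and one tab-separated record per variant. The two variants and the samples S1 and S2 are invented, and the parsing uses base R only. It illustrates the file format, not how 'VariantSpark' reads it.

```r
# Write a tiny, made-up VCF file with two variants and two samples.
tab <- function(...) paste(c(...), collapse = "\t")
vcf_path <- tempfile(fileext = ".vcf")
writeLines(c(
  "##fileformat=VCFv4.2",
  "##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">",
  tab("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2"),
  tab("1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1"),
  tab("1", "200", ".", "C", "T", ".", "PASS", ".", "GT", "1/1", "0/1")
), vcf_path)

# Split the file into meta lines, the header line, and variant records.
vcf_lines <- readLines(vcf_path)
meta <- vcf_lines[startsWith(vcf_lines, "##")]
header <- vcf_lines[startsWith(vcf_lines, "#") & !startsWith(vcf_lines, "##")]
records <- vcf_lines[!startsWith(vcf_lines, "#")]

# Parse the records into a data frame, naming columns from the header line.
col_names <- sub("^#", "", strsplit(header, "\t", fixed = TRUE)[[1]])
variants <- read.table(text = records, sep = "\t", col.names = col_names,
                       colClasses = "character", stringsAsFactors = FALSE)
print(variants[, c("CHROM", "POS", "REF", "ALT", "S1", "S2")])

# With an open variantspark connection `vsc`, the same file could be passed on:
# tiny_vcf <- vs_read_vcf(vsc, vcf_path)
```

Each genotype such as 0/1 lists the alleles a sample carries at that position, where 0 is the reference allele and 1 the first alternate allele.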