Title: A 'Sparklyr' Extension for 'Hail'
Description: 'Hail' is an open-source, general-purpose, 'python'-based data analysis tool with additional data types and methods for working with genomic data, see <https://hail.is/>. 'Hail' is built to scale and has first-class support for multi-dimensional structured data, like the genomic data in a genome-wide association study (GWAS). 'Hail' is exposed as a 'python' library, using primitives for distributed queries and linear algebra implemented in 'scala', 'spark', and increasingly 'C++'. 'sparkhail' is an R extension built on the 'sparklyr' package. The idea is to help R users use 'hail' functionality with the well-known 'tidyverse' syntax, see <https://www.tidyverse.org/>.
Authors: Samuel Macêdo [aut, cre], Javier Luraschi [aut], Michael Lawrence [ctb]
Maintainer: Samuel Macêdo <[email protected]>
License: Apache License 2.0 | file LICENSE
Version: 0.1.1
Built: 2024-12-11 07:28:56 UTC
Source: CRAN
Set the configuration for Hail using spark_config().
hail_config(config = sparklyr::spark_config())
config | A spark configuration.
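Example (a minimal sketch following the pattern of the other examples in this documentation; the master and Spark version below are placeholders, not values required by the package):

## Not run:
library(sparklyr)
library(sparkhail)

# Add the hail jars and settings to a sparklyr configuration,
# then pass the result to spark_connect().
config <- hail_config(sparklyr::spark_config())
sc <- spark_connect(master = "local", version = "2.4", config = config)
spark_disconnect(sc)
## End(Not run)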
Import and initialize Hail using a spark connection.
hail_context(sc)
sc | A Spark connection.
A hailContext object.
library(sparklyr)
sc <- spark_connect(master = "spark://HOST:PORT", config = hail_config())
connection_is_open(sc)
hail_context(sc)
spark_disconnect(sc)
This function converts a hail MatrixTable into a spark dataframe.
hail_dataframe(x)
x | A hail MatrixTable.
A spark dataframe
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
df <- hail_dataframe(mt)
df
## End(Not run)
hail_describe
Prints a hail MatrixTable structure. You can access parts of the structure using mt_globals_fields, mt_str_rows, mt_row_fields, mt_col_fields, mt_entry_fields, mt_row_key, and mt_col_key.
hail_describe(mt)
mt_globals_fields(mt)
mt_str_rows(mt)
mt_row_fields(mt)
mt_col_fields(mt)
mt_entry_fields(mt)
mt_row_key(mt)
mt_col_key(mt)
mt | A MatrixTable object.
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
hail_describe(mt)
## End(Not run)
This function retrieves the entry fields from a hail dataframe and explodes the columns call, dp, and gq.
hail_entries(df)
df | A hail dataframe.
A spark dataframe.
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hail_context(sc) %>%
  hail_read_matrix(system.file("extdata/1kg.mt", package = "sparkhail")) %>%
  hail_dataframe() %>%
  hail_entries()
## End(Not run)
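Because hail_entries() returns a spark dataframe, the usual 'dplyr' verbs can be applied to the result. A minimal sketch, assuming only the exploded columns call, dp, and gq described above; the aggregation itself is purely illustrative:

## Not run:
library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
entries <- hail_context(sc) %>%
  hail_read_matrix(system.file("extdata/1kg.mt", package = "sparkhail")) %>%
  hail_dataframe() %>%
  hail_entries()

# Average read depth (dp) and genotype quality (gq) by genotype call.
entries %>%
  group_by(call) %>%
  summarise(mean_dp = mean(dp), mean_gq = mean(gq))
## End(Not run)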
This function creates an extdata folder and downloads the datasets necessary to run the examples: 1kg MatrixTable folder and annotations.txt.
hail_get_1kg(path = NULL)
path | The folder where the data will be downloaded. If path is NULL, the data will be downloaded to a temporary folder.
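Example (a minimal sketch; the destination folder below is only an illustration):

## Not run:
# Download the 1kg MatrixTable folder and annotations.txt to a temporary folder.
hail_get_1kg()

# Or download them into a folder of your choice.
hail_get_1kg(path = "~/sparkhail_data")
## End(Not run)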
Get the ids from the 's' col key in a MatrixTable.
hail_ids(mt)
mt | A MatrixTable object.
A spark dataframe
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))
hail_ids(mt)
## End(Not run)
Install hail dependencies and the datasets needed to run the examples in the documentation. To remove hail, use hail_uninstall.
hail_install(datasets_examples = TRUE, hail_path = "java_folder")
hail_uninstall()
datasets_examples | If TRUE, hail will be downloaded along with the datasets to run the examples. Use FALSE if you just want to install hail.
hail_path | A string with the path of the jar. Sparklyr extensions normally install the jars in the java folder, but you can select a different one.
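Example (a minimal sketch based on the defaults shown in the usage above):

## Not run:
# Install hail together with the example datasets (the default).
hail_install()

# Install only hail, into the package's java folder.
hail_install(datasets_examples = FALSE)

# Remove hail when it is no longer needed.
hail_uninstall()
## End(Not run)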
Read and create a MatrixTable object. To convert the data into a dataframe, use hail_dataframe.
hail_read_matrix(hl, path)
hl | A hail context object. Create one using hail_context().
path | A string with the path to the MatrixTable folder.
A hail MatrixTable is a standard data structure in the hail framework. A MatrixTable consists of four components:
a two-dimensional matrix of entry fields where each entry is indexed by row key(s) and column key(s)
a corresponding rows table that stores all of the row fields that are constant for every column in the dataset
a corresponding columns table that stores all of the column fields that are constant for every row in the dataset
a set of global fields that are constant for every entry in the dataset
You can see the MatrixTable structure using hail_describe.
hail_matrix_table
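Example (a minimal sketch that inspects the four components with the accessor functions listed under hail_describe; it assumes the 1kg example data has been downloaded, e.g. with hail_get_1kg()):

## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.4", config = hail_config())
hl <- hail_context(sc)
mt <- hail_read_matrix(hl, system.file("extdata/1kg.mt", package = "sparkhail"))

mt_entry_fields(mt)   # entry fields of the two-dimensional matrix
mt_row_fields(mt)     # row fields, constant for every column
mt_col_fields(mt)     # column fields, constant for every row
mt_globals_fields(mt) # global fields, constant for every entry
mt_row_key(mt)        # row key(s)
mt_col_key(mt)        # column key(s)
## End(Not run)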