Title: | PLINK 2 Binary (.pgen) Reader |
---|---|
Description: | A thin wrapper over PLINK 2's core libraries which provides an R interface for reading .pgen files. A minimal .pvar loader is also included. Chang et al. (2015) \doi{10.1186/s13742-015-0047-8}. |
Authors: | Christopher Chang [aut, cre], Eric Biggers [ctb, cph] (Author of included libdeflate library), Yann Collet [ctb] (Author of included Zstd library), Meta Platforms, Inc. [cph] (Zstd library), Evan Nemerson [ctb, cph] (Author of included SIMDe library), Przemyslaw Skibinski [ctb] (Author of included Zstd library), Nick Terrell [ctb] (Author of included Zstd library) |
Maintainer: | Christopher Chang <[email protected]> |
License: | LGPL (>= 3) |
Version: | 0.3.7 |
Built: | 2024-11-02 06:27:54 UTC |
Source: | CRAN |
A thin wrapper over PLINK 2's core libraries which provides an R interface for reading .pgen files. A minimal .pvar loader is also included.
NewPvar() and NewPgen() initialize the respective readers. Then, you can either iterate through one variant at a time (Read(), ReadAlleles()) or perform a multi-variant matrix load (ReadIntList(), ReadList()). When you're done, ClosePgen() and ClosePvar() free resources.
Christopher Chang [email protected]
Chang, C.C., Chow, C.C., Tellier, L.C.A.M., Vattikuti, S., Purcell, S.M. and Lee, J.J. (2015) Second-generation PLINK: rising to the challenge of larger and richer datasets. GigaScience 4:7. doi:10.1186/s13742-015-0047-8.
# This is modified from https://yosuketanigawa.com/posts/2020/09/PLINK2 .
library(pgenlibr)
# These files are subsetted from downloads available at
# https://www.cog-genomics.org/plink/2.0/resources#phase3_1kg .
# Note that, after downloading the original files, the .pgen file must be
# decompressed before use; but both pgenlibr and the PLINK 2 program can
# handle compressed .pvar files.
pvar_path <- system.file("extdata", "chr21_phase3_start.pvar.zst", package="pgenlibr")
pgen_path <- system.file("extdata", "chr21_phase3_start.pgen", package="pgenlibr")
pvar <- pgenlibr::NewPvar(pvar_path)
pgen <- pgenlibr::NewPgen(pgen_path, pvar=pvar)
# Check the number of variants and samples.
pgenlibr::GetVariantCt(pgen)
pgenlibr::GetRawSampleCt(pgen)
# Get the ID of the first variant.
GetVariantId(pvar, 1)
# Read the 14th variant.
buf <- pgenlibr::Buf(pgen)
pgenlibr::Read(pgen, buf, 14)
# Get the index of the variant with ID "rs569225703".
var_id <- pgenlibr::GetVariantsById(pvar, "rs569225703")
# Get allele count.
pgenlibr::GetAlleleCt(pvar, var_id)
# It has three alleles, i.e. two ALT alleles.
# Read first-ALT-allele dosages for that variant.
pgenlibr::Read(pgen, buf, var_id)
# Read second-ALT-allele dosages.
pgenlibr::Read(pgen, buf, var_id, allele_num=3)
# Read a matrix with both variants. Note that, for the multiallelic variant,
# the dosages of both ALT alleles are summed here.
geno_mat <- pgenlibr::ReadList(pgen, c(14, var_id))
pgenlibr::ClosePgen(pgen)
pgenlibr::ClosePvar(pvar)
Returns an empty two-row numeric matrix that ReadAlleles() can load to.
AlleleCodeBuf(pgen)
pgen | Object returned by NewPgen(). |
Numeric matrix with two rows, and appropriate number of columns for ReadAlleles().
Returns a bool buffer that ReadAlleles() can load phasing information to.
BoolBuf(pgen)
pgen | Object returned by NewPgen(). |
Logical vector with appropriate length for ReadAlleles().
Returns a numeric buffer that Read() or ReadHardcalls() can load to.
Buf(pgen)
pgen | Object returned by NewPgen(). |
Numeric vector with appropriate length for Read() and ReadHardcalls().
Closes a pgen object, releasing resources.
ClosePgen(pgen)
pgen | Object returned by NewPgen(). |
No return value, called for side-effect.
Closes a pvar object, releasing memory.
ClosePvar(pvar)
pvar | Object returned by NewPvar(). |
No return value, called for side-effect.
Look up an allele code.
GetAlleleCode(pvar, variant_num, allele_num)
pvar | Object returned by NewPvar(). |
variant_num | Variant index (1-based). |
allele_num | Allele index (1-based). |
The allele_numth allele code for the variant_numth variant. allele_num=1 corresponds to the REF allele, allele_num=2 to the first ALT allele, allele_num=3 to the second ALT allele, and so on. Requesting an allele that does not exist for the variant raises an error.
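As an illustration, the allele codes of one variant can be collected by looping allele_num from 1 to the value returned by GetAlleleCt(). This is a minimal sketch, assuming pgenlibr is installed and reusing the bundled example .pvar file from the package example; the variable names are illustrative.

```r
if (requireNamespace("pgenlibr", quietly = TRUE)) {
  pvar_path <- system.file("extdata", "chr21_phase3_start.pvar.zst",
                           package = "pgenlibr")
  pvar <- pgenlibr::NewPvar(pvar_path)

  variant_num <- 1L
  # GetAlleleCt() returns at least 2, so REF and the first ALT are always valid.
  allele_ct <- pgenlibr::GetAlleleCt(pvar, variant_num)
  allele_codes <- vapply(
    seq_len(allele_ct),
    function(allele_num) pgenlibr::GetAlleleCode(pvar, variant_num, allele_num),
    character(1)
  )
  print(allele_codes)

  pgenlibr::ClosePvar(pvar)
}
```

Bounding the loop by GetAlleleCt() avoids the error raised for an out-of-range allele_num.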
Returns the effective number of alleles for a variant. Note that if no pvar was provided to the NewPgen() call, this function may return 2 even at multiallelic variants, since the .pgen may not store allele-count information.
GetAlleleCt(pvar_or_pgen, variant_num)
pvar_or_pgen | Object returned by NewPvar() or NewPgen(). |
variant_num | Variant index (1-based). |
max(2, <number of alleles the variant_numth variant is known to have>).
Returns the maximum GetAlleleCt() value across all variants in the file.
GetMaxAlleleCt(pvar_or_pgen)
pvar_or_pgen | Object returned by NewPvar() or NewPgen(). |
Maximum GetAlleleCt() value across all variants.
Returns the number of samples in the file.
GetRawSampleCt(pgen)
pgen | Object returned by NewPgen(). |
Number of samples.
Returns the number of variants in the file.
GetVariantCt(pvar_or_pgen)
pvar_or_pgen | Object returned by NewPvar() or NewPgen(). |
Number of variants.
Convert variant index to variant ID string.
GetVariantId(pvar, variant_num)
pvar | Object returned by NewPvar(). |
variant_num | Variant index (1-based). |
The variant_numth variant ID string.
Convert variant ID string to variant index(es).
GetVariantsById(pvar, id)
pvar | Object returned by NewPvar(). |
id | Variant ID to look up. |
A list of all (1-based) variant indices with the given variant ID.
Returns whether explicitly phased hardcalls are present.
HardcallPhasePresent(pgen)
pgen | Object returned by NewPgen(). |
TRUE if the file contains at least one phased heterozygous hardcall, FALSE otherwise.
Returns an empty two-row integer matrix that ReadAlleles() can load to.
IntAlleleCodeBuf(pgen)
pgen | Object returned by NewPgen(). |
Integer matrix with two rows, and appropriate number of columns for ReadAlleles().
Returns an integer buffer that ReadHardcalls() can load to.
IntBuf(pgen)
pgen | Object returned by NewPgen(). |
Integer vector with appropriate length for ReadHardcalls().
Opens a .pgen or PLINK 1 .bed file.
NewPgen(filename, pvar = NULL, raw_sample_ct = NULL, sample_subset = NULL)
filename | .pgen/.bed file path. |
pvar | Object (see NewPvar()) corresponding to the .pgen's companion .pvar; technically optional, but necessary for some functionality. In particular, at multiallelic variants, all ALT alleles may be collapsed together when .pvar information is not available. |
raw_sample_ct | Number of samples in file; required if it's a PLINK 1 .bed file, otherwise optional. |
sample_subset | List of 1-based positions of samples to load; optional, all samples are loaded if this is not specified. |
A pgen object, which can be queried for genotype/dosage data.
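The sample_subset argument can be sketched as follows. This example assumes pgenlibr is installed and that the bundled example file contains at least three samples; it reuses the file names from the package example, and the choice of samples and variants is arbitrary.

```r
if (requireNamespace("pgenlibr", quietly = TRUE)) {
  pvar_path <- system.file("extdata", "chr21_phase3_start.pvar.zst",
                           package = "pgenlibr")
  pgen_path <- system.file("extdata", "chr21_phase3_start.pgen",
                           package = "pgenlibr")
  pvar <- pgenlibr::NewPvar(pvar_path)

  # Restrict all subsequent reads to the first three samples (1-based positions).
  pgen <- pgenlibr::NewPgen(pgen_path, pvar = pvar, sample_subset = 1:3)

  # Load the first two variants for the selected samples only.
  geno_mat <- pgenlibr::ReadList(pgen, 1:2)
  print(dim(geno_mat))

  pgenlibr::ClosePgen(pgen)
  pgenlibr::ClosePvar(pvar)
}
```

GetRawSampleCt() reports the number of samples in the file, so it can be used to validate subset positions before opening the reader with a subset.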
Loads variant IDs and allele codes from a .pvar or .bim file (which can be compressed with gzip or Zstd).
NewPvar(filename)
filename | .pvar/.bim file path. |
A pvar object, which can be queried for variant IDs and allele codes.
This function treats the data as diploid; divide by 2 to obtain haploid dosages.
Read(pgen, buf, variant_num, allele_num = 2L)
pgen | Object returned by NewPgen(). |
buf | Buffer returned by Buf(). |
variant_num | Variant index (1-based). |
allele_num | Allele index; 1 corresponds to REF, 2 to the first ALT allele, 3 to the second ALT allele if it exists, etc. Optional; defaults to 2. |
No return value, called for buf-filling side-effect.
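A minimal sketch of the one-variant-at-a-time pattern: the buffer from Buf() is allocated once and refilled by each Read() call. It assumes pgenlibr is installed and uses the bundled example files from the package example; restricting the loop to the first five variants and summarizing each by its mean dosage are illustrative choices only.

```r
if (requireNamespace("pgenlibr", quietly = TRUE)) {
  pvar_path <- system.file("extdata", "chr21_phase3_start.pvar.zst",
                           package = "pgenlibr")
  pgen_path <- system.file("extdata", "chr21_phase3_start.pgen",
                           package = "pgenlibr")
  pvar <- pgenlibr::NewPvar(pvar_path)
  pgen <- pgenlibr::NewPgen(pgen_path, pvar = pvar)

  # Allocate the buffer once; each Read() call overwrites its contents.
  buf <- pgenlibr::Buf(pgen)
  n_variants <- min(5L, pgenlibr::GetVariantCt(pgen))
  mean_dosage <- numeric(n_variants)
  for (variant_num in seq_len(n_variants)) {
    pgenlibr::Read(pgen, buf, variant_num)
    # Diploid first-ALT-allele dosages; missing values are skipped.
    mean_dosage[variant_num] <- mean(buf, na.rm = TRUE)
  }
  print(mean_dosage)

  pgenlibr::ClosePgen(pgen)
  pgenlibr::ClosePvar(pvar)
}
```

Reusing one buffer avoids allocating a new vector per variant, which is the reason Read() fills buf as a side effect rather than returning a value.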
This function treats the data as diploid. If it's really haploid, you may want to compare the two rows, and then treat samples where the allele codes differ as missing values.
ReadAlleles(pgen, acbuf, variant_num, phasepresent_buf = NULL)
pgen | Object returned by NewPgen(). |
acbuf | Buffer returned by AlleleCodeBuf() or IntAlleleCodeBuf(). |
variant_num | Variant index (1-based). |
phasepresent_buf | Buffer returned by BoolBuf(). Optional; if provided, elements are set to true when the sample has known phase. Most of these values will be TRUE even when the raw data is unphased, because homozygous genotypes always have known phase. (Missing genotypes are considered to have unknown phase.) |
No return value, called for acbuf-filling side-effect.
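The haploid post-processing suggested above (compare the two rows, and treat samples whose allele codes differ as missing) can be sketched in base R. The matrix below is a hand-made stand-in for a buffer filled by ReadAlleles(); its values, including the use of NA for missing genotypes, are assumptions for illustration and are not read from a real file.

```r
# Toy two-row allele-code matrix: one column per sample.
# Columns: (0, 0), (1, 1), (0, 1), (NA, NA).
acbuf <- matrix(c(0, 0,
                  1, 1,
                  0, 1,
                  NA, NA), nrow = 2)

# Keep the allele code where both rows agree; otherwise mark as missing.
# A comparison involving NA yields NA, so missing genotypes stay missing.
hap_codes <- ifelse(acbuf[1, ] == acbuf[2, ], acbuf[1, ], NA)
print(hap_codes)
```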
This function treats the data as diploid; you can divide by 2, and then treat 0.5 as NA, if it's actually haploid.
ReadHardcalls(pgen, buf, variant_num, allele_num = 2L)
pgen | Object returned by NewPgen(). |
buf | Buffer returned by Buf() or IntBuf(). |
variant_num | Variant index (1-based). |
allele_num | Allele index; 1 corresponds to REF, 2 to the first ALT allele, 3 to the second ALT allele if it exists, etc. Optional; defaults to 2. |
No return value, called for buf-filling side-effect.
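The haploid conversion described above (divide by 2, then treat 0.5 as NA) can be sketched in base R. The vector is a hand-made stand-in for a buffer filled by ReadHardcalls(); its values are illustrative assumptions.

```r
# Toy diploid hardcall counts for five samples; NA marks a missing call.
buf <- c(0, 2, 1, NA, 2)

# Halve the diploid counts, then treat heterozygous calls (0.5) as missing,
# since they are not meaningful for haploid data.
hap <- buf / 2
hap[!is.na(hap) & hap == 0.5] <- NA
print(hap)
```

The explicit !is.na() guard keeps the logical index free of NA values, which R does not allow in subscripted assignments of this kind when the replacement is longer than one element and is safest to avoid in general.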
This function treats the data as diploid; you can divide by 2, and then treat 0.5 as NA, if it's actually haploid.
ReadIntList(pgen, variant_subset)
pgen | Object returned by NewPgen(). |
variant_subset | Integer vector containing 1-based indexes of variants to load. |
Integer matrix, where rows correspond to samples, columns correspond to variant_subset, and values are in {0, 1, 2, NA} indicating the number of hardcall ALT allele copies. For multiallelic variants, all ALT alleles are combined.
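As an illustration of working with a matrix of this shape, the following base R sketch computes per-variant ALT allele frequencies and call rates. The matrix is a hand-made stand-in for ReadIntList() output (samples in rows, variants in columns, values in {0, 1, 2, NA}), and the diploid assumption from the note above is kept.

```r
# Toy hardcall matrix: 4 samples (rows) x 2 variants (columns).
geno_int <- matrix(c(0L, 1L, 2L, NA,
                     2L, 2L, NA, 1L), ncol = 2)

# ALT allele frequency: mean ALT count over non-missing samples, divided by 2
# because each diploid sample carries two allele copies.
alt_freq <- colMeans(geno_int, na.rm = TRUE) / 2

# Call rate: fraction of samples with a non-missing hardcall.
call_rate <- colMeans(!is.na(geno_int))

print(alt_freq)
print(call_rate)
```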
This function treats the data as diploid; divide by 2 to obtain haploid dosages.
ReadList(pgen, variant_subset, meanimpute = FALSE)
pgen | Object returned by NewPgen(). |
variant_subset | Integer vector containing 1-based indexes of variants to load. |
meanimpute | Optional; if true, missing values are mean-imputed instead of being represented by NA. |
Numeric matrix, where rows correspond to samples, and columns correspond to variant_subset. Values are in [0, 2] indicating ALT allele dosages, or NA for missing dosages. For multiallelic variants, all ALT alleles are combined.
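To make the meanimpute option concrete, the base R sketch below applies per-variant mean imputation to a hand-made dosage matrix. It assumes that each missing value is replaced by the mean of the non-missing dosages for the same variant; the source does not spell out the exact computation, so this is an illustration of the idea rather than a description of pgenlibr's internals.

```r
# Toy dosage matrix: 4 samples (rows) x 2 variants (columns), with NAs.
dosage_mat <- matrix(c(0.0, 1.0, NA, 2.0,
                       0.5, NA, 1.5, 1.0), ncol = 2)

# Replace each missing dosage with the mean of the observed dosages
# in the same column (variant).
imputed <- dosage_mat
for (j in seq_len(ncol(imputed))) {
  missing <- is.na(imputed[, j])
  imputed[missing, j] <- mean(imputed[!missing, j])
}
print(imputed)
```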
This function treats the data as diploid; divide by 2 to obtain scores based on a haploid dosage matrix.
VariantScores(pgen, weights, variant_subset = NULL)
pgen | Object returned by NewPgen(). |
weights | Sample weights. |
variant_subset | Integer vector containing 1-based indexes of variants to include in the dosage matrix. Optional; by default, all variants are included. |
Numeric vector, containing product of sample-weight vector and the specified subset of the dosage matrix.
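The product described here can be written out in base R on a small example. The dosage matrix and weights below are made up, and the matrix has no missing values, since the handling of missing dosages is not stated above; the sketch only shows the shape of the computation (one score per variant).

```r
# Toy dosage matrix: 3 samples (rows) x 2 variants (columns).
dosage_mat <- matrix(c(0, 1, 2,
                       2, 0, 1), ncol = 2)

# One weight per sample.
weights <- c(0.5, 1.0, -1.0)

# Row vector of weights times the dosage matrix: one score per variant.
scores <- as.vector(weights %*% dosage_mat)
print(scores)
```

The same result can be obtained from a matrix returned by ReadList(), at the cost of materializing the full dosage matrix in memory first.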