July 8th, 2023
This update fixes a bug in the DistCalc() function.
June 30th, 2023
This update fixes a bug in the DistCalc() function and introduces a few minor edits in the reference manual.
March 22nd, 2023
This update fixes a bug in the ReplMatch() and PapaDiv() functions, which caused an error when matching the occurrence of sequences between samples. Sequences are now named using index numbers of the sequences with zeroes padded in front, e.g. "Sequence_001" to "Sequence_999". This prevents RegEx pattern matching between e.g. "Sequence_1" and "Sequence_1X". This naming concept has been implemented throughout MHCtools, for the sake of consistency. The updated functions are: CreateFas(), CreateSamplesFas(), DistCalc(), HpltFind(), ReplMatch(), and PapaDiv().
In addition, a bug was fixed in the HpltMatch() function, which caused an error when running the function with the setting threshold=NULL.
October 19th, 2022
This update introduces three new functions to MHCtools, which are all related to haplotype inference:
The CreateHpltOccTable() function creates a haplotype-sequence occurrence matrix from the output of HpltFind(), for easy overview of which sequences are present in which haplotypes.
The HpltMatch() function compares haplotypes to help identify overlapping and potentially identical types.
The NestTablesXL() function translates the output from HpltFind() to an Excel workbook, that provides a convenient overview for evaluation and curation of the inferred putative haplotypes.
More details on these new functions are provided in the reference manual.
Adjustment to DistCalc()
The DistCalc() function has been updated so that it can handle data sets where some samples have 0 or 1 sequence(s) in a sequence occurrence table. DistCalc() will assign NA to the mean distance value for such samples in the output table.
Adjustment to ReplMatch()
The ReplMatch() function has been updated to throw a warning if all samples in a replicate set have 0 sequences. ReplMatch() will report which replicate set is problematic and suggest to remove it from the replicates table.
Finally, a few minor adjustments have been made to improve the efficiency of the code in most functions, with no change in functionality.
August 15th, 2022
This update fixes a bug in the HpltFind function, which caused an error when no alleles are shared between a chick and one of its parents (this may e.g. occur in nests with extra-pair fertilization).
Furthermore, the updated version of HpltFind introduces the alpha-parameter, which may be used to fine-tune the haplotype analysis (details provided in the reference manual).
May 23rd, 2022
This version provides the following citation details following presentation of MHCtools in Molecular Ecology Resources:
Roved, J., Hansson, B., Stervander, M., Hasselquist, D., & Westerdahl, H. (2022). MHCtools - an R package for MHC high-throughput sequencing data: genotyping, haplotype and supertype inference, and downstream genetic analyses in non-model organisms. Molecular Ecology Resources. https://doi.org/10.1111/1755-0998.13645
October 11th, 2021
This update fixes a bug in the DistCalc() function, which caused an error when attempting to calculate p-distances from a fasta file of amino acid sequences.
September 13th, 2021
The BootKmeans() and ClusterMatch() functions
In this update, two new functions BootKmeans() and ClusterMatch() have been added. In MHC data analysis, it is often desirable to group alleles by their physico-chemical properties, as MHC receptors with similar properties share the repertoire of peptides they can bind. From a functional immunological perspective, alleles with similar properties may therefore be regarded as belonging to the same supertypes, which in many cases can simplify statistical analyses by reducing the number of independent variables and increase statistical power as more samples will share supertypes compared to alleles. Inference of MHC supertypes has traditionally been carried out by k-means clustering analysis on a set of z-descriptors of the physico-chemical properties of the amino acid sequences. However, the inference of relevant clusters is not always straightforward, since MHC data sets do not always produce clear inflection points (e.g. the elbow in an elbow plot). The BootKmeans() function is a wrapper for the kmeans() function of the stats package, which allows for bootstrapping of a k-means clustering analysis. BootKmeans() performs multiple runs of kmeans() and estimates optimal k-values based on a user-defined threshold of BIC reduction. The method may be seen as an automated and bootstrapped version of visually inspecting elbow plots of BIC- vs. k-values. To evaluate which of the bootstrapped k-means models is most accurate and/or informative, the ClusterMatch() function offers a tool for evaluating whether different k-means clustering models identify similar clusters, and summarize bootstrap model stats as means for different estimated values of k. ClusterMatch() is designed to take files produced by the BootKmeans() function as input, but other data can be analysed if the descriptions of the required data formats are observed carefully.
September 15th, 2020
The DistCalc() function
In this update, the new function DistCalc() replaces CalcPdist(). The DistCalc() function calculates Grantham, Sandberg, or p-distances and produces a matrix with distances from pairwise comparisons of all sequences in an input file. Input files may be either a fasta file (in the format rendered by the read.fasta() function from the package seqinr) or a dada2-style sequence table. If a dada2-style sequence table is used as input, the function produces a table with mean distances for each sample in the data set in addition to the distance matrix. Amino acid distances may be calculated from input files with nucleotide sequences by automatic translation using the standard genetic code. The function furthermore includes an option for the user to specify which codons to compare, which is useful e.g. if conducting the analysis only on codon positions involved in specific functions such as the peptide-binding of an MHC molecule.
If calculation of Sandberg distances is specified when using DistCalc(), the function additionally outputs five tables with physico-chemical z-descriptor values for each amino acid position in all sequences in the input file. These tables may be useful for further downstream analyses, such as estimation of MHC supertypes. The z-descriptor values are derived from Sandberg et al. 1998, JMed Chem. 41(14):2481-2491.
This update furthermore removes the dependency on the package rlist.
August 8th, 2019
This update provides a minor modification of the CalcPdist() function, so that when a dada2 sequence table is used as input, sequences are now named by an index number (corresponding to their column number in the input table) in the p-distance matrix.
August 8th, 2019
This update replaces the MeanPdist() function with the new function CalcPdist(), which has more universal applications. The CalcPdist function produces a matrix with p-distances from pairwise comparisons of all sequences in an input file. If a dada2 sequence table is used as input, the function furthermore produces a table with mean p-distances for each sample in the data set. Optionally, amino acid p-distances may be calculated, based on the standard genetic code. The function furthermore includes an option for the user to specify which codons to compare, which is useful e.g. if conducting the analysis only on codon positions involved in specific functions such as the peptide-binding of an MHC molecule. Input files may be either a fasta file (fasta files are accepted in the format rendered by e.g. the read.fasta() function from the package seqinr) or a dada2 sequence table.
February 4th, 2019
This update fixes bugs in the HpltFind and PapaDiv functions. Furthermore, a minor modification of the functions GetHpltTable and GetReplTable ensures that results are presented according to nest or replicate number, respectively (previous versions presented them in alphanumeric order).
October 23rd, 2017
This update adds the new function MeanPdist() to the R package MHCtools. The MeanPdist() function calculates the mean p-distance from pairwise comparisons of the sequences in each sample in a data set. The function includes an option for the user to specify which codons to compare, which is useful e.g. if conducting the analysis only on codon positions involved in specific functions such as the peptide-binding of an MHC molecule.
September 29th, 2017
This new R package contains nine useful functions for analysis of major histocompatibility complex (MHC) data in non-model species. The functions are tailored for amplicon data sets that have been filtered using the dada2 pipeline (for more information visit https://benjjneb.github.io/dada2/), but even other data sets can be analyzed, if the data tables are formatted according to the description in each function.
The ReplMatch() function matches replicates in data sets in order to evaluate genotyping success.
The GetReplTable() and GetReplStats() functions perform such an evaluation.
The HpltFind() function infers putative haplotypes from families in the data set.
The GetHpltTable() and GetHpltStats() functions evaluate the accuracy of the haplotype inference.
The PapaDiv() function compares parent pairs in the data set and calculate their joint MHC diversity, taking into account sequence variants that occur in both parents.
The CreateFas() function creates a fasta file with all the sequences in the data set.
The CreateSamplesFas() function creates a fasta file for each sample in the data set.