Title: | Library Search Against Electron Ionization Mass Spectral Databases |
---|---|
Description: | Perform library searches against electron ionization mass spectral databases using either the API provided by 'MS Search' software (<https://chemdata.nist.gov/dokuwiki/doku.php?id=chemdata:nistlibs>) or custom implementations of the Identity and Similarity algorithms. |
Authors: | Andrey Samokhin [aut, cre, cph] |
Maintainer: | Andrey Samokhin <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.2.0 |
Built: | 2024-12-09 14:46:47 UTC |
Source: | CRAN |
Perform library search using a custom implementation of the Identity (EI Normal) or Similarity (EI Simple) algorithm. Pairwise comparison of two mass spectra is implemented in C.
LibrarySearch(
  msp_objs_u,
  msp_objs_l,
  algorithm = c("identity_normal", "similarity_simple"),
  search_type = c("standard", "reverse"),
  n_hits = 100L,
  hitlist_columns = c("formula", "mw", "smiles"),
  mz_min = NULL,
  mz_max = NULL,
  comments = NULL
)
msp_objs_u , msp_objs_l
|
A list of nested lists. Each nested list is a mass spectrum. Each nested
list must contain at least three elements: (1) |
algorithm |
A string. Library search algorithm. Either the Identity EI Normal
("identity_normal") or the Similarity EI Simple ("similarity_simple") algorithm. |
search_type |
A string. Library search type: standard search ("standard") or reverse search ("reverse"). |
n_hits |
An integer value. The maximum number of hits (i.e., candidates) to display. |
hitlist_columns |
A character vector. Three columns are always present in the returned
hitlist: name, mf (or rmf), and idx. Additional columns are requested with this argument. |
mz_min , mz_max
|
Integer values. Boundaries of the m/z range; m/z values outside this range are ignored when the match factor is calculated. |
comments |
Any R object. Some additional information. It is saved as the 'comments' attribute of the returned list. |
Return a list of data frames. Each data frame is a hitlist (i.e., list of possible candidates). Each hitlist always contains three columns: name, mf or rmf (i.e., the match factor or the reverse match factor), and idx (i.e., the index of the respective library mass spectrum in the msp_objs_l list). Additional columns can be extracted using the hitlist_columns argument. Library search options are saved as the library_search_options attribute.
# Reading the 'alkanes.msp' file
msp_file <- system.file("extdata", "alkanes.msp", package = "mssearchr")

# Pre-processing
msp_objs_u <- PreprocessMassSpectra(ReadMsp(msp_file)) # unknown mass spectra
msp_objs_l <- PreprocessMassSpectra(massbank_alkanes)  # library mass spectra

# Searching using the Identity algorithm
hitlists <- LibrarySearch(msp_objs_u, msp_objs_l,
                          algorithm = "identity_normal", n_hits = 10L,
                          hitlist_columns = c("formula", "smiles", "db_no"))

# Printing a hitlist for the first compound from the 'alkanes.msp' file
print(hitlists[[1]][1:5, ])
#>        name       mf idx formula        smiles                db_no
#> 1  UNDECANE 950.5551  11  C11H24   CCCCCCCCCCC MSBNK-{...}-JP006877
#> 2  UNDECANE 928.4884  72  C11H24   CCCCCCCCCCC MSBNK-{...}-JP005760
#> 3  DODECANE 905.7546  74  C12H26  CCCCCCCCCCCC MSBNK-{...}-JP006878
#> 4 TRIDECANE 891.7862  41  C13H28 CCCCCCCCCCCCC MSBNK-{...}-JP006879
#> 5  DODECANE 885.6247  42  C12H26  CCCCCCCCCCCC MSBNK-{...}-JP005756
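The remaining arguments can be combined in the same way. The sketch below assumes that the mssearchr package is installed; it runs the Similarity algorithm within an m/z window and attaches a free-form comment. The window limits (40 and 200), the number of hits, and the comment text are illustrative values only, not recommendations.

```r
library(mssearchr)

# Unknown and library spectra, pre-processed as in the example above
msp_file <- system.file("extdata", "alkanes.msp", package = "mssearchr")
msp_objs_u <- PreprocessMassSpectra(ReadMsp(msp_file))
msp_objs_l <- PreprocessMassSpectra(massbank_alkanes)

# Similarity search; peaks outside m/z 40-200 are ignored (illustrative limits)
hitlists_sim <- LibrarySearch(
  msp_objs_u, msp_objs_l,
  algorithm = "similarity_simple",
  n_hits = 5L,
  mz_min = 40L,
  mz_max = 200L,
  comments = "similarity search, m/z 40-200"
)

# Columns of the first hitlist and the stored comment
names(hitlists_sim[[1]])
attr(hitlists_sim, "comments")
```

Because hitlist_columns is left at its default, the hitlists should carry the formula, mw, and smiles columns in addition to the three mandatory ones.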
Perform a library search using the API of the MS Search software (NIST). The search is performed by calling the nistms$.exe file. The API is described in the NIST Mass Spectral Search Program manual. Library search options are set within the MS Search (NIST) software. To perform an automatic library search, two settings are required: (1) the 'Automatic Search On' box must be checked; (2) the 'Number of Hits to Print' field must contain a reasonable number of candidates (e.g., 100).
LibrarySearchUsingNistApi(
  msp_objs,
  mssearch_dir = NULL,
  temp_msp_file_dir = NULL,
  overwrite_spec_list = FALSE,
  comments = NULL
)
msp_objs |
A list of nested lists. Each nested list is a mass spectrum. Each nested
list must contain at least three elements: (1) |
mssearch_dir |
A string. Full path to the MSSEARCH/ directory (e.g.
C:/NIST20/MSSEARCH/). This directory must contain the
nistms$.exe file. If |
temp_msp_file_dir |
A string. Path to a directory where a temporary msp-file is created. If
|
overwrite_spec_list |
A logical value. If |
comments |
Any R object. Some additional information (e.g., library search options, the list of used libraries, etc.). It is saved as the 'comments' attribute of the returned list. |
The function was tested using the MS Search (NIST) software (version 2.4) and the NIST20 mass spectral database. So far, only two algorithms have been tested: 'Identity EI Normal' and 'Similarity EI Simple'.
A few temporary files are created in the MSSEARCH/ directory according to the description provided in the NIST Mass Spectral Search Program manual.
Library search options are set within the MS Search (NIST) software. To set them, perform the following steps.
Open the MS Search (NIST) software.
Press the 'Library Search Options' button.
Select the required algorithm on the 'Search' tab (e.g., 'Identity, EI Normal').
Select the required set of libraries on the 'Libraries' tab.
Ensure that the 'Automatic Search On' box is checked ('Automation' tab).
Set the 'Number of Hits to Print' field to a reasonable value (e.g., 100) on the 'Automation' tab.
Change other settings according to the goal (e.g., 'Presearch', 'Limits', 'Constraints', etc.).
Return a list of data frames. Each data frame is a hitlist. The name of the unknown compound and the InLib factor (i.e., the 'compound in library' factor) are saved as the unknown_name and inlib attributes of the respective data frame. Data frames contain the following elements:
name
A character vector. Compound name.
mf
An integer vector. Match factor.
rmf
An integer vector. Reverse match factor.
prob
A numeric vector. Probability.
lib
A character vector. Library.
cas
A character vector. CAS number.
formula
A character vector. Chemical formula.
mw
An integer vector. Molecular weight.
id
An integer vector. ID in the database.
ri
A numeric vector. Retention index.
## Not run:
# To run this example, ensure that MS Search (NIST) software is installed.

# Reading the 'alkanes.msp' file
msp_file <- system.file("extdata", "alkanes.msp", package = "mssearchr")
msp_objs <- ReadMsp(msp_file)

# Searching using the MS Search (NIST) API
hitlists <- LibrarySearchUsingNistApi(msp_objs)
print(hitlists[[1]][1:5, ])
#>        name  mf rmf  prob              lib cas formula  mw id ri
#> 1  UNDECANE 951 960 55.70 massbank_alkanes   0  C11H24 156 11  0
#> 2  UNDECANE 928 928 20.34 massbank_alkanes   0  C11H24 156 72  0
#> 3  DODECANE 906 929  8.04 massbank_alkanes   0  C12H26 170 74  0
#> 4 TRIDECANE 892 907  5.03 massbank_alkanes   0  C13H28 184 41  0
#> 5  DODECANE 886 900  3.95 massbank_alkanes   0  C12H26 170 42  0

## End(Not run)
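When the MSSEARCH/ directory is known in advance, it can be passed explicitly. The following is a cautious sketch rather than a tested recipe: it assumes that mssearchr is installed and that MS Search lives in C:/NIST20/MSSEARCH/ (the example path from the argument description), and it skips the search if nistms$.exe is not found there. The comment text is made up for illustration.

```r
library(mssearchr)

# Assumed installation path; adjust to the local setup
mssearch_dir <- "C:/NIST20/MSSEARCH/"
nist_available <- file.exists(paste0(mssearch_dir, "nistms$.exe"))

if (nist_available) {
  msp_file <- system.file("extdata", "alkanes.msp", package = "mssearchr")
  msp_objs <- ReadMsp(msp_file)

  # Record the search settings by hand, since they are chosen in the GUI
  hitlists <- LibrarySearchUsingNistApi(
    msp_objs,
    mssearch_dir = mssearch_dir,
    comments = "Identity, EI Normal; libraries: massbank_alkanes"
  )

  # Per-spectrum metadata stored as attributes of each hitlist
  attr(hitlists[[1]], "unknown_name")
  attr(hitlists[[1]], "inlib")
} else {
  message("nistms$.exe not found in ", mssearch_dir, "; search skipped.")
}
```

Storing the settings in comments is a convenience only: the function cannot read them back from MS Search, so the note is as accurate as whoever wrote it.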
Electron ionization mass spectra of alkanes from the MassBank database (version 2023.11).
massbank_alkanes
A list of nested lists. Each nested list is a mass spectrum. Each nested list contains the following elements (a more detailed description can be found in the official documentation of MassBank):
name
A string. Name of the chemical compound analyzed.
synon
A character vector. Alternative chemical names. The element may be absent for certain mass spectra.
db_no
A string. Identifier of the MassBank record.
inchikey
A string. InChIKey.
inchi
A string. IUPAC International Chemical Identifier (InChI Code).
smiles
A string. SMILES string.
spectrum_type
A string. MSn type of data.
instrument_type
A string. Type of instrument.
instrument
A string. Commercial name and manufacturer of instrument.
ion_mode
A string. Polarity of ion detection.
formula
A string. Chemical formula.
mw
A string. Nominal mass.
exactmass
A string. Exact mass.
comments
A string. Comments.
splash
A string. Hashed identifier of mass spectra.
library
A string. The name and version of the database.
mz
A numeric vector. Mass values of mass spectral peaks.
intst
A numeric vector. Intensities of mass spectral peaks.
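A quick way to get a feel for this structure is to inspect the data set directly. The sketch below assumes that mssearchr is installed and relies only on the element names listed above; it also assumes that the formula element is present in every spectrum, which the description suggests but does not state outright.

```r
library(mssearchr)

# Number of spectra in the data set
length(massbank_alkanes)

# A few metadata elements of the first spectrum
str(massbank_alkanes[[1]][c("name", "formula", "mw", "db_no")])

# How many spectra are available for each chemical formula
formulas <- vapply(massbank_alkanes, function(x) x$formula, character(1))
table(formulas)

# Peaks are stored as two parallel vectors of equal length
first <- massbank_alkanes[[1]]
head(data.frame(mz = first$mz, intst = first$intst))
```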
Pre-process mass spectra. Pre-processing includes rounding/binning, sorting, and normalization.
PreprocessMassSpectra(
  msp_objs,
  bin_boundary = 0.649,
  remove_zeros = TRUE,
  max_intst = 999
)
msp_objs |
A list of nested lists. Each nested list is a mass spectrum. Each nested
list must contain at least three elements: (1) the |
bin_boundary |
A numeric value. The position of a bin boundary (it can be considered as a
'rounding point'). The |
remove_zeros |
A logical value. If TRUE, all zero-intensity peaks are removed from the mass spectrum. |
max_intst |
A numeric value. The maximum intensity (i.e., intensity of the base peak) after normalization. The default value is 999 because it is used in some electron ionization mass spectral databases including NIST. |
Pre-processing includes the following steps:
Calculating a nominal mass spectrum. All floating point m/z values are rounded to the nearest integer using the value of the bin_boundary argument. Intensities of peaks with identical m/z values are summed.
Intensities of mass spectral peaks are normalized to max_intst.
Intensities of mass spectral peaks are rounded to the nearest integer.
If the remove_zeros argument is TRUE, all zero-intensity peaks are removed from the mass spectrum.
The preprocessed attribute is added and set to TRUE for the respective mass spectrum.
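The steps above can be sketched in base R for a single spectrum. This is not the package's implementation: preprocess_one is a hypothetical helper, and the way bin_boundary is applied (fractional parts below the boundary round down, the rest round up) is an assumption about what 'rounding point' means, including how a value exactly on the boundary is treated.

```r
# Hypothetical helper illustrating the documented steps for one spectrum
preprocess_one <- function(mz, intst, bin_boundary = 0.649,
                           remove_zeros = TRUE, max_intst = 999) {
  # Step 1: nominal m/z values (assumed rule: round up at 'bin_boundary')
  mz_nominal <- floor(mz + (1 - bin_boundary))
  # Sum intensities of peaks sharing a nominal m/z; result is sorted by m/z
  summed <- tapply(intst, mz_nominal, sum)
  mz_out <- as.numeric(names(summed))
  intst_out <- as.numeric(summed)

  # Step 2: normalize so that the base peak equals 'max_intst'
  intst_out <- intst_out / max(intst_out) * max_intst

  # Step 3: round intensities to the nearest integer
  intst_out <- round(intst_out)

  # Step 4: optionally drop zero-intensity peaks
  if (remove_zeros) {
    keep <- intst_out > 0
    mz_out <- mz_out[keep]
    intst_out <- intst_out[keep]
  }

  # Step 5: flag the spectrum as pre-processed
  out <- list(mz = mz_out, intst = intst_out)
  attr(out, "preprocessed") <- TRUE
  out
}

# Methane spectrum from the example in this section
methane <- preprocess_one(
  mz = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
  intst = c(0, 0, 25, 75, 155, 830, 999, 10, 0, 0)
)
methane$mz     # 12 13 14 15 16 17
methane$intst  # 25 75 155 830 999 10

# Two peaks that fall into the same bin are merged
merged <- preprocess_one(mz = c(43.0, 43.1, 57.7), intst = c(50, 50, 100))
merged$mz      # 43 58
merged$intst   # 999 999
```

The default boundary of 0.649 sends an m/z of 57.7 to 58, while plain rounding of 43.1 and this rule agree on 43, so the two 43-ish peaks are summed before normalization.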
A list of nested lists. Each nested list is a mass spectrum. Only the mz and intst elements of each nested list are modified during the pre-processing step.
# Original mass spectra of chlorine and methane
msp_objs <- list(
  list(name = "Chlorine",
       mz = c(34.96885, 36.96590, 69.93771, 71.93476, 73.93181),
       intst = c(0.83 * c(100, 32), c(100, 63.99, 10.24))),
  list(name = "Methane",
       mz = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19),
       intst = c(0, 0, 25, 75, 155, 830, 999, 10, 0, 0))
)
matrix(c(msp_objs[[1]]$mz, msp_objs[[1]]$intst), ncol = 2) # Chlorine
matrix(c(msp_objs[[2]]$mz, msp_objs[[2]]$intst), ncol = 2) # Methane

# Pre-processed mass spectra of chlorine and methane
pp_msp_objs <- PreprocessMassSpectra(msp_objs, remove_zeros = TRUE)
matrix(c(pp_msp_objs[[1]]$mz, pp_msp_objs[[1]]$intst), ncol = 2) # Chlorine
matrix(c(pp_msp_objs[[2]]$mz, pp_msp_objs[[2]]$intst), ncol = 2) # Methane
Read an msp-file containing mass spectra in the NIST format. The complete description of the format can be found in the NIST Mass Spectral Search Program manual. A summary is presented below in the "Description of the NIST format" section.
ReadMsp(input_file)
input_file |
A string. The name of a file. |
Data from an msp-file are read without any modification (e.g., the order of mass values is not changed, zero-intensity peaks are preserved, etc.).
Return a list of nested lists. Each nested list is a mass spectrum. Almost all metadata fields (e.g., "Name", "CAS#", "Formula", "MW", etc.) are represented as strings. All "Synon" fields are merged into a single character vector. Mass values and intensities are represented as numeric vectors (mz and intst). Names of fields are slightly modified:
names are converted to lowercase;
hash symbols are replaced with _no;
any other special character is replaced with an underscore character.
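The three renaming rules can be mimicked with a few lines of base R. normalize_field_name is a hypothetical helper written for illustration; the package may implement the conversion differently, and the treatment of digits and underscores as 'non-special' characters is an assumption.

```r
# Hypothetical helper reproducing the documented renaming rules
normalize_field_name <- function(x) {
  x <- tolower(x)                        # rule 1: lowercase
  x <- gsub("#", "_no", x, fixed = TRUE) # rule 2: '#' becomes '_no'
  gsub("[^a-z0-9_]", "_", x)             # rule 3: other special characters
}

normalize_field_name(c("Name", "CAS#", "NIST#", "DB#", "Num Peaks"))
# "name" "cas_no" "nist_no" "db_no" "num_peaks"
```

This is why the list elements are called cas_no, nist_no, db_no, and num_peaks rather than carrying the original field names.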
The summary was prepared using the NIST Mass Spectral Search Program manual v.2.4 (2020).
An msp-file can contain any number of spectra.
Each spectrum must start with the "Name" field. This field must not be empty.
The "Num Peaks" field is also required. It must contain the number of mass/intensity pairs.
Optional fields (e.g., "Comments", "Formula", "MW") can appear between the "Name" and "Num Peaks" fields.
When a spectrum is exported from the NIST library it also contains the "NIST#" and "DB#" fields. The "NIST#" field is on the same line as the "CAS#" field and separated by a semicolon.
Each field should be on a separate line (the "NIST#" field is an exception to this rule).
The mass/intensity list begins on the line following the "Num Peaks" field. The peaks need not be normalized, and the masses need not be ordered. The exact spacing and delimiters used for the mass/intensity pairs are unimportant. The following characters are accepted as delimiters: 'space', 'tab', ',', ';', ':'. Parentheses, square brackets and curly braces ('(', ')', '[', ']', '{', and '}') are also allowed.
The "Name" field can be up to 511 characters.
The "Comments" field can be up to 1023 characters.
The "Formula" field can be up to 23 characters.
The "Synon" field may be repeated.
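To make the format concrete, the base R sketch below writes a minimal msp-file that follows the rules above and then pulls the peak list back out. It is a toy parser for a single spectrum with non-negative, plain decimal numbers, not a substitute for ReadMsp; the file content reuses the methane intensities from the pre-processing example.

```r
# A minimal single-spectrum msp-file: "Name" first, "Num Peaks" last,
# optional fields in between, peaks separated by spaces and semicolons
msp_lines <- c(
  "Name: Methane",
  "Formula: CH4",
  "MW: 16",
  "Num Peaks: 6",
  "12 25; 13 75; 14 155;",
  "15 830; 16 999; 17 10;"
)
msp_path <- tempfile(fileext = ".msp")
writeLines(msp_lines, msp_path)

# Locate the "Num Peaks" field and read the declared number of peaks
lines <- readLines(msp_path)
num_peaks_line <- grep("^num peaks:", lines, ignore.case = TRUE)
n_peaks <- as.integer(sub("^[^:]*:\\s*", "", lines[num_peaks_line]))

# Everything after that line is the mass/intensity list; any run of
# characters that cannot be part of a plain number acts as a delimiter
peak_text <- paste(lines[(num_peaks_line + 1):length(lines)], collapse = " ")
values <- scan(text = gsub("[^0-9.]+", " ", peak_text), quiet = TRUE)
peaks <- matrix(values, ncol = 2, byrow = TRUE,
                dimnames = list(NULL, c("mz", "intst")))

# The declared and the parsed number of peaks should agree
stopifnot(nrow(peaks) == n_peaks)
peaks
unlink(msp_path)
```

Treating every non-numeric run as a delimiter is what makes the exact spacing and the choice of separator irrelevant in this sketch.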
# Reading the 'alkanes.msp' file
msp_file <- system.file("extdata", "alkanes.msp", package = "mssearchr")
msp_objs <- ReadMsp(msp_file)

# Plotting the first mass spectrum from the 'msp_objs' list
par_old <- par(yaxs = "i")
plot(msp_objs[[1]]$mz, msp_objs[[1]]$intst,
     ylim = c(0, 1000), main = msp_objs[[1]]$name,
     type = "h", xlab = "m/z", ylab = "Intensity", bty = "l")
par(par_old)
Write mass spectra to an msp-file (NIST format).
WriteMsp(msp_objs, output_file, fields = NULL)
msp_objs |
A list of nested lists. Each nested list is a mass spectrum. Each nested
list must contain at least three elements: (1) |
output_file |
A string. The name of a file. |
fields |
A character vector. Names of elements in an R list (not the original field
names from an msp-file) to be exported. For example, if only CAS number is
needed to be exported, the |
Names of all fields are exported in lowercase. This does not cause problems with the MS Search (NIST) software; correct operation with other software products has not been tested. Hash symbols and spaces are restored only in the following cases:
the cas_no element is exported as the 'cas#' field;
the nist_no element is exported as the 'nist#' field;
the num_peaks element is exported as the 'num peaks' field.
NULL is returned.
# Exporting mass spectra
# Only 'Name', 'SMILES', 'Formula', and 'Num Peaks' fields are exported.
WriteMsp(massbank_alkanes[1:3], "test.msp", fields = c("smiles", "formula"))
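A round trip through a temporary file is a simple way to check what was actually exported. The sketch assumes that mssearchr is installed and that ReadMsp accepts the lowercase field names written by WriteMsp, which seems likely given that it lowercases names anyway but is not stated in this documentation.

```r
library(mssearchr)

# Export three spectra with a restricted set of metadata fields
out_file <- tempfile(fileext = ".msp")
WriteMsp(massbank_alkanes[1:3], out_file, fields = c("smiles", "formula"))

# Peek at the raw text, then read the file back
head(readLines(out_file), 6)
msp_back <- ReadMsp(out_file)

length(msp_back)       # number of spectra read back
names(msp_back[[1]])   # elements available after the round trip

unlink(out_file)
```

Writing to tempfile() instead of the working directory avoids leaving a stray test.msp behind.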