Title: A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs
Description: The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning-based approach that automates the identification of IDs on a large scale. The 'MantaID' model achieved a prediction accuracy of 99% and correctly predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from large numbers of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed to improve the applicability of 'MantaID'. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore serve as a starting point for the complex assimilation and aggregation of biological data across diverse databases.
Authors: Zhengpeng Zeng [aut, cre, ctb], Longfei Mao [aut, cph], Feng Yu [aut], Jiamin Hu [ctb], Xiting Wang [ctb]
Maintainer: Zhengpeng Zeng <[email protected]>
License: GPL (>= 3)
Version: 1.0.4
Built: 2025-01-08 07:01:49 UTC
Source: CRAN
ID example dataset.
Example
A tibble with 5000 rows and 2 variables.
An identifier character string.
The database the ID belongs to.
A wrapper function that executes the MantaID workflow.
mi(
  cores = NULL,
  levels = c("*", 0:9, letters, LETTERS, "_", ".", "-", " ", "/", "\\", ":"),
  ratio = 0.3,
  para_blc = FALSE,
  model_path = NULL,
  batch_size = 128,
  epochs = 64,
  validation_split = 0.3,
  graph_path = NULL
)
cores: The number of cores used when balancing data.
levels: A vector of all the single characters that occur in IDs.
ratio: The proportion of the data used as the test set.
para_blc: Logical. Whether to use parallel computing when balancing data.
model_path: The path to save models.
batch_size: The batch size for fitting the deep learning model.
epochs: The number of epochs for fitting the deep learning model.
validation_split: The fraction of the training data used for validation when fitting the deep learning model.
graph_path: The path to save graphs.
The list of models and graphs.
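As a minimal sketch (assuming the package is installed; the output folder names below are hypothetical), the whole workflow can be run in one call. The example is guarded so it is a no-op when MantaID is unavailable, and it only runs interactively because fitting the models is slow:

```r
# Sketch: run the full MantaID workflow with mi().
# "models" and "graphs" are hypothetical output folders.
ran_mi <- requireNamespace("MantaID", quietly = TRUE)
if (ran_mi && interactive()) {
  res <- MantaID::mi(
    cores      = 2,        # cores used for data balancing
    ratio      = 0.3,      # 30% of the data held out for testing
    model_path = "models", # where fitted models are saved
    graph_path = "graphs"  # where plots are saved
  )
  # res is a list of fitted models and diagnostic graphs
}
```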
Balance the data. Majority classes are randomly undersampled, while minority classes are oversampled with the SMOTE method, to obtain a relatively balanced dataset.
mi_balance_data(data, ratio = 0.3, parallel = FALSE)
data: A data frame. Except for the class column, all columns are numeric.
ratio: Numeric between 0 and 1. The proportion of the data split off as the test set.
parallel: Logical. Whether to balance the data in parallel.
A list containing a training set and a test set.
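A minimal sketch of balancing the processed ID data before training, assuming the package's bundled mi_data_procID dataset and its mi_to_numer encoding step; guarded so it is a no-op when MantaID is not installed:

```r
# Sketch: numerically encode the processed IDs, then balance them.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid) {
  data(mi_data_procID, package = "MantaID")
  data_num <- MantaID::mi_to_numer(mi_data_procID)
  sets <- MantaID::mi_balance_data(data_num, ratio = 0.3, parallel = FALSE)
  train <- sets[[1]]  # balanced training set
  test  <- sets[[2]]  # held-out test set
}
```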
Reshape data and delete meaningless rows.
mi_clean_data(data, cols = everything(), placeholder = c("-"))
data: A data frame, tibble, data.table, or matrix. Column names are regarded as the classes of the IDs contained in each column.
cols: Character vector. The columns of data to be cleaned.
placeholder: Character vector. IDs equal to a placeholder value are treated as meaningless and removed.
A tibble with two columns ("ID" and "class").
data <- tibble::tibble(
  "class1" = c("A", "B", "C", "D"),
  "class2" = c("E", "F", "G", "H"),
  "class3" = c("L", "M", "-", "O")
)
mi_clean_data(data)
ID-related datasets in BioMart.
mi_data_attributes
A data frame with 65 rows and 3 variables.
The name of the dataset.
A description of the dataset.
The collection of attributes.
Processed ID data.
mi_data_procID
A tibble with 5000 rows and 21 variables.
The split ID characters.
The database the ID belongs to.
ID dataset for testing.
mi_data_rawID
A tibble with 5000 rows and 2 variables.
An identifier character string.
The database the ID belongs to.
Perform feature selection automatically, based on correlation and feature importance.
mi_filter_feat(data, cor_thresh = 0.7, imp_thresh = 0.99, union = FALSE)
data: The data frame returned by mi_to_numer.
cor_thresh: The threshold for the Pearson correlation. If the correlation between two features exceeds this threshold, they are viewed as redundant and one of them is removed.
imp_thresh: The threshold for feature importance. The features with the lowest importance are removed as long as the cumulative importance of the remaining features stays above imp_thresh.
union: How to combine the decisions of the correlation and importance methods. If TRUE, the union of the two feature sets is removed; otherwise, their intersection.
The names of the features that should be removed.
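A minimal sketch of pruning redundant position features, assuming the bundled mi_data_procID dataset; since mi_filter_feat returns the names of features to drop, the actual removal is done with base R subsetting:

```r
# Sketch: identify and drop redundant/uninformative position features.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid) {
  data(mi_data_procID, package = "MantaID")
  data_num <- MantaID::mi_to_numer(mi_data_procID)
  drop_cols <- MantaID::mi_filter_feat(data_num, cor_thresh = 0.7,
                                       imp_thresh = 0.99, union = FALSE)
  # keep only the columns that were not flagged for removal
  data_kept <- data_num[, setdiff(names(data_num), drop_cols)]
}
```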
Compute the confusion matrix for the predicted result.
mi_get_confusion(result_list, ifnet = FALSE)
result_list: A list returned from the model training functions.
ifnet: Logical. Whether the result was obtained from a deep learning model.
A confusionMatrix object.
Get ID data from the BioMart database using attributes.
mi_get_ID(
  attributes,
  biomart = "genes",
  dataset = "hsapiens_gene_ensembl",
  mirror = "asia"
)
attributes: A data frame describing the information to retrieve. Use mi_get_ID_attr to obtain the available attributes.
biomart: The name of the BioMart database to connect to.
dataset: The dataset of the selected BioMart database.
mirror: The Ensembl mirror to connect to.
A tibble data frame.
Get ID attributes from the BioMart database.
mi_get_ID_attr(
  biomart = "genes",
  dataset = "hsapiens_gene_ensembl",
  mirror = "asia"
)
biomart: The name of the BioMart database to connect to.
dataset: The dataset of the selected BioMart database.
mirror: The Ensembl mirror to connect to.
A dataframe.
Plot a bar plot of feature importance.
mi_get_importance(data)
data: A table.
A bar plot.
Observe the distribution of incorrect predictions on the test set.
mi_get_miss(predict)
predict: An R6 class prediction object.
A tibble that records the number of wrong predictions for each ID category.
Get the maximum ID length in the data.
mi_get_padlen(data)
data: A data frame.
An integer.
data(mi_data_rawID)
mi_get_padlen(mi_data_rawID)
Plot correlation heatmap.
mi_plot_cor(data, cls = "class")
data: A data frame including the IDs' position features.
cls: The name of the class column.
A heatmap.
data(mi_data_procID)
data_num <- mi_to_numer(mi_data_procID)
mi_plot_cor(data_num)
Plot a heatmap of the result confusion matrix.
mi_plot_heatmap(table, name = NULL, filepath = NULL)
table: A table.
name: Model names.
filepath: The file path to save the plot to. Defaults to NULL.
A ggplot object.
Predict new data with a trained learner.
mi_predict_new(data, result, ifnet = F)
data: A data frame.
result: The result object from a previous training run.
ifnet: Logical. Whether a neural network is used for prediction.
A data frame that contains the features and a 'predict' column holding the predicted class.
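A minimal sketch of scoring unseen IDs with a previously trained learner. Here result_rg stands for the output of an earlier mi_train_rg() call and new_ids for a data frame of split, numerically encoded IDs; both are hypothetical, so the example only runs when they exist:

```r
# Sketch: predict the source database for new, encoded IDs.
# result_rg and new_ids are hypothetical objects from earlier steps.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid && exists("result_rg") && exists("new_ids")) {
  pred <- MantaID::mi_predict_new(new_ids, result_rg, ifnet = FALSE)
  head(pred$predict)  # the predicted class for each ID
}
```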
Compare classification models with small samples.
mi_run_bmr(data, row_num = 1000, resamplings = rsmps("cv", folds = 10))
data: A tibble. All columns are numeric except the first, which is a factor.
row_num: The number of samples used.
resamplings: R6/Resampling. The resampling method.
A list of R6 benchmark results and the scores on the test set.
data(mi_data_procID)
mi_run_bmr(mi_data_procID)
Split the strings in the ID column character by character into multiple columns.
mi_split_col(data, cores = NULL, pad_len = 10)
data: The data frame (tibble) to be split.
cores: Integer. The number of cores to allocate for computing.
pad_len: The length of the longest ID, i.e., the maximum length.
A tibble with pad_len + 1 columns.
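A minimal sketch chaining mi_get_padlen with mi_split_col on the bundled raw ID dataset, guarded so it is a no-op when MantaID is not installed:

```r
# Sketch: derive the padding length, then split IDs into position columns.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid) {
  data(mi_data_rawID, package = "MantaID")
  pad_len <- MantaID::mi_get_padlen(mi_data_rawID)  # longest ID length
  split_ids <- MantaID::mi_split_col(mi_data_rawID, cores = 2,
                                     pad_len = pad_len)
  # split_ids has pad_len + 1 columns: one per character position,
  # plus the class column
}
```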
Split a string into individual characters and pad the character vector to the maximum length.
mi_split_str(str, pad_len)
str: The string to be split.
pad_len: The length of the longest ID, i.e., the maximum length.
The split character vector.
string_test <- "Good Job"
length <- 15
mi_split_str(string_test, length)
Convert the data to numeric; the ID columns are converted using a fixed set of factor levels.
mi_to_numer(
  data,
  levels = c("*", 0:9, letters, LETTERS, "_", ".", "-", " ", "/", "\\", ":")
)
data: A tibble with n position columns (pos1, pos2, ...) and a class column.
levels: The characters that can occur in IDs.
A data frame with numeric or factor columns.
data(mi_data_procID)
mi_to_numer(mi_data_procID)
Train a three-layer neural network model.
mi_train_BP(
  train,
  test,
  cls = "class",
  path2save = NULL,
  batch_size = 128,
  epochs = 64,
  validation_split = 0.3,
  verbose = 0
)
train: A data frame containing the label column cls.
test: A data frame containing the label column cls.
cls: A character string. The name of the label column.
path2save: The folder path in which to store the model and training history.
batch_size: Integer or NULL. The number of samples per gradient update.
epochs: The number of epochs to train the model.
validation_split: Float between 0 and 1. The fraction of the training data to be used as validation data.
verbose: The verbosity mode.
A list containing the prediction confusion matrix, the model object, and the mapping of predicted numbers to classes.
Random Forest Model Training.
mi_train_rg(train, test, measure = msr("classif.acc"), instance = NULL)
train: A data frame.
test: A data frame.
measure: The model evaluation measure; defaults to classification accuracy.
instance: A tuner.
A list containing the trained learner and its predictions on the test set.
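A minimal sketch combining hyperband tuning with random forest training, assuming balanced train/test sets from mi_balance_data(); the assumption that the first element of the tuning result is the tuning instance is noted in the comments. It runs only interactively because tuning is slow:

```r
# Sketch: tune the random forest with hyperband, then train it.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid && interactive()) {
  data(mi_data_procID, package = "MantaID")
  data_num <- MantaID::mi_to_numer(mi_data_procID)
  sets <- MantaID::mi_balance_data(data_num)
  tuned <- MantaID::mi_tune_rg(data_num, eta = 3)
  # tuned[[1]] is assumed to be the tuning instance,
  # per the documented return value (tuning instance + stage plot)
  fit <- MantaID::mi_train_rg(sets[[1]], sets[[2]],
                              instance = tuned[[1]])
}
```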
Classification tree model training.
mi_train_rp(train, test, measure = msr("classif.acc"), instance = NULL)
train: A data frame.
test: A data frame.
measure: The model evaluation measure; defaults to classification accuracy.
instance: A tuner.
A list containing the trained learner and its predictions on the test set.
XGBoost model training.
mi_train_xgb(train, test, measure = msr("classif.acc"), instance = NULL)
train: A data frame.
test: A data frame.
measure: The model evaluation measure; defaults to classification accuracy.
instance: A tuner.
A list containing the trained learner and its predictions on the test set.
Tune the Random Forest model by hyperband.
mi_tune_rg(
  data,
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.acc"),
  eta = 3
)
data: A tibble. All columns are numeric except the first, which is a factor.
resampling: R6/Resampling. The resampling method.
measure: The model evaluation measure; defaults to classification accuracy.
eta: Controls the proportion of parameter configurations discarded in each hyperband stage (only 1/eta survive).
A list of the tuning instance and the stage plot.
Tune the Decision Tree model by hyperband.
mi_tune_rp(
  data,
  resampling = rsmp("bootstrap", ratio = 0.8, repeats = 5),
  measure = msr("classif.acc"),
  eta = 3
)
data: A tibble. All columns are numeric except the first, which is a factor.
resampling: R6/Resampling. The resampling method.
measure: The model evaluation measure; defaults to classification accuracy.
eta: Controls the proportion of parameter configurations discarded in each hyperband stage (only 1/eta survive).
A list of the tuning instance and the stage plot.
Tune the XGBoost model by hyperband.
mi_tune_xgb(
  data,
  resampling = rsmp("cv", folds = 5),
  measure = msr("classif.acc"),
  eta = 3
)
data: A tibble. All columns are numeric except the first, which is a factor.
resampling: R6/Resampling. The resampling method.
measure: The model evaluation measure; defaults to classification accuracy.
eta: Controls the proportion of parameter configurations discarded in each hyperband stage (only 1/eta survive).
A list of the tuning instance and the stage plot.
Predict with the four models and unify the results using each sub-model's specificity score for the possible classes.
mi_unify_mod(
  data,
  col_id,
  result_rg,
  result_rp,
  result_xgb,
  result_BP,
  c_value = 0.75,
  pad_len = 30
)
data: A data frame containing the ID column.
col_id: The name of the ID column.
result_rg: The result from the random forest model.
result_rp: The result from the decision tree model.
result_xgb: The result from the XGBoost model.
result_BP: The result from the backpropagation neural network model.
c_value: A numeric value used in the final prediction calculation.
pad_len: The length to pad the ID characters to.
A dataframe.
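A minimal sketch of the final ensemble step. Here result_rg, result_rp, result_xgb, and result_BP stand for the outputs of the four training functions, and new_data for a data frame with an "ID" column; all are hypothetical, so the example only runs when they exist:

```r
# Sketch: unify the four sub-models' predictions into one answer per ID.
# The result_* objects and new_data are hypothetical inputs from earlier steps.
has_mantaid <- requireNamespace("MantaID", quietly = TRUE)
if (has_mantaid && exists("result_rg") && exists("new_data")) {
  final <- MantaID::mi_unify_mod(
    new_data, "ID",
    result_rg, result_rp, result_xgb, result_BP,
    c_value = 0.75, pad_len = 30
  )
  # final is a data frame with one unified prediction per ID
}
```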