Package: MantaID 1.0.4

Zhengpeng Zeng

MantaID: A Machine-Learning Based Tool to Automate the Identification of Biological Database IDs

The number of biological databases is growing rapidly, but different databases use different IDs to refer to the same biological entity. This inconsistency in IDs impedes the integration of various types of biological data. To resolve the problem, we developed 'MantaID', a data-driven, machine-learning based approach that automates the identification of IDs on a large scale. The 'MantaID' model's prediction accuracy was proven to be 99%, and it correctly and effectively predicted 100,000 ID entries within two minutes. 'MantaID' supports the discovery and exploitation of ID patterns from large quantities of databases (e.g., up to 542 biological databases). An easy-to-use, freely available, open-source R package, a user-friendly web application, and an API were also developed for 'MantaID' to improve applicability. To our knowledge, 'MantaID' is the first tool that enables automatic, quick, accurate, and comprehensive identification of large quantities of IDs, and it can therefore be used as a starting point to facilitate the complex assimilation and aggregation of biological data across diverse databases.

Authors: Zhengpeng Zeng [aut, cre, ctb], Longfei Mao [aut, cph], Feng Yu [aut], Jiamin Hu [ctb], Xiting Wang [ctb]

MantaID_1.0.4.tar.gz
MantaID_1.0.4.tar.gz (r-4.5-noble)
MantaID_1.0.4.tar.gz (r-4.4-noble)
MantaID_1.0.4.tgz (r-4.4-emscripten)
MantaID.pdf | MantaID.html
MantaID/json (API)

# Install 'MantaID' in R:
install.packages('MantaID', repos = c('https://cran.r-universe.dev', 'https://cloud.r-project.org'))
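
After running the install command above, it can be useful to confirm that the package is visible to R before going further. The following is a minimal sketch using only base R and `utils` functions; the variable name `mantaid_available` is an illustrative choice, and the snippet is written so that it also runs (and simply reports the absence) when the package is not installed.

```r
# Check whether 'MantaID' can be loaded, without attaching it.
pkg <- "MantaID"
mantaid_available <- requireNamespace(pkg, quietly = TRUE)

if (mantaid_available) {
  # Report the installed version and the exported function names.
  print(utils::packageVersion(pkg))
  print(sort(getNamespaceExports(pkg)))
} else {
  message("Package '", pkg, "' is not installed or could not be loaded.")
}
```

If the package loads, the printed exports should correspond to the functions listed under "Exports" on this page.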

Bug tracker: https://github.com/molaison/mantaid/issues

Pkgdown site: https://molaison.github.io

1.30 score, 2 scripts, 115 downloads, 25 exports, 151 dependencies

Last updated 4 months ago from: 41b745f561. Checks: 2 OK. Indexed: no.

Target            Result   Latest binary
Doc / Vignettes   OK       Jan 08 2025
R-4.5-linux       OK       Jan 08 2025

Exports: mi, mi_balance_data, mi_clean_data, mi_filter_feat, mi_get_confusion, mi_get_ID, mi_get_ID_attr, mi_get_importance, mi_get_miss, mi_get_padlen, mi_plot_cor, mi_plot_heatmap, mi_predict_new, mi_run_bmr, mi_split_col, mi_split_str, mi_to_numer, mi_train_BP, mi_train_rg, mi_train_rp, mi_train_xgb, mi_tune_rg, mi_tune_rp, mi_tune_xgb, mi_unify_mod

Dependencies: AnnotationDbi, askpass, backports, base64enc, bbotk, Biobase, BiocFileCache, BiocGenerics, biomaRt, Biostrings, bit, bit64, blob, cachem, caret, checkmate, class, cli, clock, codetools, colorspace, config, cpp11, crayon, curl, data.table, DBI, dbplyr, dbscan, diagram, digest, dplyr, e1071, evaluate, fansi, farver, fastmap, filelock, FNN, foreach, future, future.apply, generics, GenomeInfoDb, GenomeInfoDbData, ggcorrplot, ggplot2, globals, glue, gower, gtable, hardhat, here, hms, httr, httr2, igraph, ipred, IRanges, isoband, iterators, jsonlite, KEGGREST, keras, KernSmooth, labeling, lattice, lava, lgr, lifecycle, listenv, lubridate, magrittr, MASS, Matrix, mclust, memoise, mgcv, mime, mlbench, mlr3, mlr3measures, mlr3misc, mlr3tuning, ModelMetrics, munsell, nlme, nnet, numDeriv, openssl, palmerpenguins, paradox, parallelly, pillar, pkgconfig, plogr, plyr, png, prettyunits, pROC, processx, prodlim, progress, progressr, proxy, PRROC, ps, purrr, R6, rappdirs, RColorBrewer, Rcpp, RcppTOML, recipes, reshape2, reticulate, rlang, rpart, rprojroot, RSQLite, rstudioapi, S4Vectors, scales, scutr, shape, smotefamily, SQUAREM, stringi, stringr, survival, sys, tensorflow, tfautograph, tfruns, tibble, tidyr, tidyselect, timechange, timeDate, tzdb, UCSC.utils, utf8, uuid, vctrs, viridisLite, whisker, withr, xml2, XVector, yaml, zeallot

Readme and manuals

Help Manual

Help page | Topics
ID example dataset. | Example
A wrapper function that executes MantaID workflow. | mi
Data balancing. Most classes are randomly undersampled, while a few classes are oversampled with the SMOTE method to obtain relatively balanced data. | mi_balance_data
Reshape data and delete meaningless rows. | mi_clean_data
ID-related datasets in biomart. | mi_data_attributes
Processed ID data. | mi_data_procID
ID dataset for testing. | mi_data_rawID
Performing feature selection in an automatic way based on correlation and feature importance. | mi_filter_feat
Compute the confusion matrix for the predicted result. | mi_get_confusion
Get ID data from the 'Biomart' database using 'attributes'. | mi_get_ID
Get ID attributes from the 'Biomart' database. | mi_get_ID_attr
Plot the bar plot for feature importance. | mi_get_importance
Observe the distribution of the false response of the test set. | mi_get_miss
Get max length of ID data. | mi_get_padlen
Plot correlation heatmap. | mi_plot_cor
Plot heatmap for result confusion matrix. | mi_plot_heatmap
Predict new data with a trained learner. | mi_predict_new
Compare classification models with small samples. | mi_run_bmr
Cut the string of ID column character by character and divide it into multiple columns. | mi_split_col
Split the string into individual characters and complete the character vector to the maximum length. | mi_split_str
Convert data to numeric, and for the ID column convert with fixed levels. | mi_to_numer
Train a three-layer neural network model. | mi_train_BP
Random Forest Model Training. | mi_train_rg
Classification tree model training. | mi_train_rp
Xgboost model training. | mi_train_xgb
Tune the Random Forest model by hyperband. | mi_tune_rg
Tune the Decision Tree model by hyperband. | mi_tune_rp
Tune the Xgboost model by hyperband. | mi_tune_xgb
Predict with four models and unify results by the sub-model's specificity score to the four possible classes. | mi_unify_mod
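
Several help pages above describe string preprocessing: finding the maximum ID length, splitting each ID character by character, padding to that length, and converting characters to numbers with a fixed set of levels. The base-R sketch below illustrates that general idea only. It is not the package's implementation and does not call any 'MantaID' function: the helper names `pad_length`, `split_id`, and `encode_ids`, the `"*"` padding character, and the particular level set are all assumptions made for illustration.

```r
# Illustrative sketch in base R; NOT the MantaID implementation.
# All helper names, the "*" pad character, and the level set are assumptions.

# Longest ID length, used as the common padding length.
pad_length <- function(ids) max(nchar(ids))

# Split one ID into single characters and right-pad with a placeholder.
split_id <- function(id, pad_len, pad = "*") {
  chars <- strsplit(id, "")[[1]]
  chars <- chars[seq_len(min(length(chars), pad_len))]  # truncate if too long
  c(chars, rep(pad, pad_len - length(chars)))
}

# Encode IDs as an integer matrix: one row per ID, one column per position.
# Using one fixed level vector keeps the codes identical across datasets.
# Characters outside `levels` become NA.
encode_ids <- function(ids,
                       levels = c("*", 0:9, letters, LETTERS, "_", ".", "-")) {
  pad_len <- pad_length(ids)
  chars <- do.call(rbind, lapply(ids, split_id, pad_len = pad_len))
  codes <- matrix(match(chars, levels), nrow = nrow(chars))
  colnames(codes) <- paste0("pos", seq_len(pad_len))
  codes
}

# Example strings in the style of Ensembl, UniProt, and RefSeq identifiers.
ids <- c("ENSG00000139618", "P04637", "NM_000546")
features <- encode_ids(ids)
print(features)
```

Each row of `features` has the same width regardless of the original ID length, which is the property a tabular classifier needs. Fixing the level vector up front, rather than deriving it from the data, means a model trained on one dataset sees the same code for the same character in new data.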