The BASiNET package aims to classify messenger RNA and long non-coding RNA, optionally also a third class such as small non-coding RNA may be included. The classification is made from measurements drawn from complex networks, for each RNA sequence a complex network is created. The networks are formed of vertices and edges, the vertices will be formed by words that can have their size defined by the parameter ‘word’. It is adopted a methodology of Thresholds in the networks so that each extraction of measures is made a cut in the network for a new extraction of measures. Finally, all measurements taken from the networks are used for classification using the algorithms J48 or Random Forest. There are four data present in the ‘BASiNET’ package, “sequences”, “sequences2”, “sequences-predict” and “sequences2-predict” with 11, 10, 11 and 11 sequences respectively. These sequences were taken from the data set used in the article (LI, Aimin; ZHANG, Junying; ZHOU, Zhongyin, Plek: a tool for predicting long non-coding messages and based on an improved k-mer scheme BMC bioinformatics, BioMed Central, 2014). These sequences are used to run examples. The BASiNET was published (ITO, Eric; KATAHIRA, Isaque; VICENTE, Fábio; PEREIRA, Felipe; LOPES, Fabrício, BASiNET—BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification, Nucleic Acids Research, 2018).
To install BASiNET correctly it is necessary to install dependencies: RWeka, igraph, rJava, randomForest, Biostrings, rmcfs. The Biostrings package is in the BioConductor repository, the other packages are available in CRAN. The following commands must be executed in the R for the deployments to be installed.
install.packages(“RWeka”)
install.packages(“rJava”)
install.packages(“igraph”)
install.packages(“randomForest”)
install.packages(“rmcfs”)
source(“https://bioconductor.org/biocLite.R”)
biocLite(“Biostrings”)
In order for the rJava package to work properly, you must have installed JDK java(https://www.oracle.com/java/technologies/downloads/) and JRE java(https://www.java.com/pt-BR/download/manual.jsp).
The function classification” applies an RNA classification methodology, at the end of the execution of the function is exposed the result for two classification algorithms: J48 and Random Forest.
word - Define the number of nitrogenous bases that formed a word. By default the word parameter is set to 3.
step - Defines the distance that will be traversed in the sequence for the formation of a new connection. By default the step parameter is set to 1
mRNA - Directory of an FASTA file containing mRNA sequences.
lncRNA - Directory of an FASTA file containing lncRNA sequences.
sncRNA - Directory of an FASTA file containing lncRNA sequences, this parameter is optional.
graphic - If TRUE is used to generate two-dimensional graphs between Thresholds x Measure. By default it is considered FALSE.
classifier - Character Parameter. By default the classifier is J48, but the user can choose to use randomForest by configuring as classifier = “RF”. The prediction with a model passed by the param load only works with the classifier J48.
load - Name of the .dat file that will be loaded as a template for the prediction of new RNA sequences. By default is NULL.
save - Name of the .dat file in which the measurement results will be saved. The generated file can be used in the “load” parameter for the prediction of new data. By default is NULL.
Within the BASiNET package there are two sample files, one for mRNA sequence and one for lncRNA sequences. For the example below you will use these two files.
Defining parameters:
mRNA <- system.file("extdata", "sequences2.fasta", package = "BASiNET")
lncRNA <- system.file("extdata", "sequences.fasta", package = "BASiNET")
library(BASiNET)
classification(mRNA,lncRNA, save="example")
## Analyzing mRNA from number:
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## Analyzing lncRNA from number:
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## Rescaling values
## Creating data frame
## Result.arff file generated in the current R directory
## Sorting the data with the J48
## J48 pruned tree
## ------------------
##
## MAX.6 <= 0: lncRNA (11.0)
## MAX.6 > 0: mRNA (10.0)
##
## Number of Leaves : 2
##
## Size of the tree : 3
##
## === 10 Fold Cross Validation ===
##
## === Summary ===
##
## Correctly Classified Instances 20 95.2381 %
## Incorrectly Classified Instances 1 4.7619 %
## Kappa statistic 0.9041
## K&B Relative Info Score 90.474 %
## K&B Information Score 19.0262 bits 0.906 bits/instance
## Class complexity | order 0 21.0295 bits 1.0014 bits/instance
## Class complexity | scheme 1074 bits 51.1429 bits/instance
## Complexity improvement (Sf) -1052.9705 bits -50.1415 bits/instance
## Mean absolute error 0.0476
## Root mean squared error 0.2182
## Relative absolute error 9.5238 %
## Root relative squared error 43.6012 %
## Total Number of Instances 21
##
## === Detailed Accuracy By Class ===
##
## TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class
## 0.900 0.000 1.000 0.900 0.947 0.908 0.950 0.948 mRNA
## 1.000 0.100 0.917 1.000 0.957 0.908 0.950 0.917 lncRNA
## Weighted Avg. 0.952 0.052 0.956 0.952 0.952 0.908 0.950 0.931
##
## === Confusion Matrix ===
##
## a b <-- classified as
## 9 1 | a = mRNA
## 0 11 | b = lncRNA
After the completion of the function the results for J48 and Random Forest will be shown. For example data the results are J48 = 95.2381% hit, Random Forest = 4.76% error.
It will also generate 10 two-dimensional graphs, one for each measurement. The blue lines represent the mRNA sequences, red lines are the lncRNA and when you have a third class will be represented by green lines.
Example of generated graph:
Bidimensional graph for the measurement Average Minimum Path
To predict a set of data, two parameters need to be set up, the first one is called “predicting” and the second is the “load”. In the “predicting” the directory of the file is set where the sequences to be predicted are found. The “load” parameter defines the model that will be used to predict the sequences.
Defining parameters:
mRNApredict <- system.file("extdata", "sequences2-predict.fasta", package = "BASiNET")
lncRNApredict <- system.file("extdata", "sequences-predict.fasta", package = "BASiNET")
modelPredict <- system.file("extdata", "modelPredict.dat", package = "BASiNET")
library(BASiNET)
classification(mRNApredict,lncRNApredict,load=modelPredict)
## Analyzing
## Rescaling values
## Creating data frame
## Results
## [1] mRNA mRNA mRNA mRNA mRNA lncRNA mRNA mRNA mRNA mRNA
## [11] mRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA
## [21] lncRNA lncRNA
## Levels: mRNA lncRNA