BASiNET - Classification of RNA sequences using complex network theory

Introduction

The BASiNET package classifies messenger RNA (mRNA) and long non-coding RNA (lncRNA) sequences; a third class, such as small non-coding RNA (sncRNA), may optionally be included. Classification is based on measurements extracted from complex networks: one network is created for each RNA sequence. Each network consists of vertices and edges, where the vertices are words whose size is set by the parameter ‘word’. A threshold methodology is applied to the networks: after each extraction of measurements, the network is cut and the measurements are extracted again. Finally, all measurements taken from the networks are used for classification with the J48 or Random Forest algorithm.

The ‘BASiNET’ package includes four data sets, “sequences”, “sequences2”, “sequences-predict” and “sequences2-predict”, with 11, 10, 11 and 11 sequences respectively. These sequences are used to run the examples and were taken from the data set used in the article (LI, Aimin; ZHANG, Junying; ZHOU, Zhongyin, PLEK: a tool for predicting long non-coding RNAs and messenger RNAs based on an improved k-mer scheme, BMC Bioinformatics, BioMed Central, 2014).

BASiNET was published in (ITO, Eric; KATAHIRA, Isaque; VICENTE, Fábio; PEREIRA, Felipe; LOPES, Fabrício, BASiNET—BiologicAl Sequences NETwork: a case study on coding and non-coding RNAs identification, Nucleic Acids Research, 2018).
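The threshold idea from the introduction can be illustrated with a small base R sketch. It is not the package's implementation. The edge list, the helper `measures_at`, the cut rule (keep edges whose weight exceeds the threshold) and the three measures are all assumptions chosen for illustration.

```r
# Hypothetical weighted edge list: weight = how often two words were linked.
edges <- data.frame(
  from   = c("AUG", "UGG", "GGC", "AUG"),
  to     = c("GCU", "CUA", "UAA", "GGC"),
  weight = c(3, 1, 2, 1),
  stringsAsFactors = FALSE
)

# Assumed cut rule: at a given cutoff, keep only edges with weight > cutoff,
# then extract a few simple measures from what remains.
measures_at <- function(edges, cutoff) {
  kept <- edges[edges$weight > cutoff, ]
  degree <- table(c(kept$from, kept$to))
  c(threshold  = cutoff,
    n_edges    = nrow(kept),
    n_vertices = length(degree),
    max_degree = if (length(degree) > 0) max(degree) else 0)
}

# One row of measures per threshold; all rows together would form the
# feature vector of this sequence.
features <- t(sapply(0:2, function(cutoff) measures_at(edges, cutoff)))
print(features)
```

Each cut yields a new set of measures from the same network, so every sequence contributes one value per measure and per threshold.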

Installation

To install BASiNET correctly, the following dependencies must be installed: RWeka, igraph, rJava, randomForest, Biostrings and rmcfs. The Biostrings package is in the Bioconductor repository; the other packages are available on CRAN. Run the following commands in R to install the dependencies.

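# CRAN dependencies
install.packages("RWeka")
install.packages("rJava")
install.packages("igraph")
install.packages("randomForest")
install.packages("rmcfs")

# Biostrings comes from Bioconductor; BiocManager replaces the older biocLite script
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("Biostrings")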

For the rJava package to work properly, both the Java JDK (https://www.oracle.com/java/technologies/downloads/) and the Java JRE (https://www.java.com/pt-BR/download/manual.jsp) must be installed.
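Before loading BASiNET, it can help to confirm that every dependency is available. The sketch below uses only base R; the variable names are illustrative. A `FALSE` for rJava may mean either that the package is not installed or that it could not load, for example because Java was not found.

```r
# Dependencies listed in the installation instructions.
deps <- c("RWeka", "igraph", "rJava", "randomForest", "Biostrings", "rmcfs")

# requireNamespace() returns TRUE only if the package is installed and loads.
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
print(installed)

not_installed <- deps[!installed]
if (length(not_installed) > 0) {
  message("Not available: ", paste(not_installed, collapse = ", "))
} else {
  message("All dependencies are available.")
}
```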

Classification

The function “classification” applies the RNA classification methodology. When it finishes, it reports the result of the selected classification algorithm, J48 or Random Forest.

Parameters:

word - Defines the number of nitrogenous bases that form a word. By default the word parameter is set to 3.

step - Defines the distance traversed in the sequence before a new connection is formed. By default the step parameter is set to 1.

mRNA - Path to a FASTA file containing mRNA sequences.

lncRNA - Path to a FASTA file containing lncRNA sequences.

sncRNA - Path to a FASTA file containing sncRNA sequences. This parameter is optional.

graphic - If TRUE, two-dimensional graphs of Thresholds x Measure are generated. By default it is FALSE.

classifier - Character parameter. The default classifier is J48, but Random Forest can be selected by setting classifier = "RF". Prediction with a model passed through the “load” parameter only works with the J48 classifier.

load - Name of the .dat file that will be loaded as the model for the prediction of new RNA sequences. By default it is NULL.

save - Name of the .dat file in which the measurement results will be saved. The generated file can be used in the “load” parameter for the prediction of new data. By default it is NULL.
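To make the “word” and “step” parameters more concrete, the base R sketch below derives word pairs from a single sequence. The helper `word_edges` is hypothetical and not part of BASiNET. It assumes that an edge links the word starting at position i to the word that immediately follows it, and that i advances by “step”; the package's internal rule may differ.

```r
# Hypothetical helper (not a BASiNET function): list the word pairs that
# would become edges under the assumptions stated above.
word_edges <- function(sequence, word = 3, step = 1) {
  n <- nchar(sequence)
  # Last position where two consecutive words still fit in the sequence.
  last_start <- n - 2 * word + 1
  if (last_start < 1) {
    return(data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE))
  }
  starts <- seq(1, last_start, by = step)
  data.frame(
    from = substring(sequence, starts, starts + word - 1),
    to   = substring(sequence, starts + word, starts + 2 * word - 1),
    stringsAsFactors = FALSE
  )
}

edges <- word_edges("AUGGCUAAC", word = 3, step = 1)
print(edges)

# A larger step skips positions, so fewer connections are formed.
print(word_edges("AUGGCUAAC", word = 3, step = 3))
```

With word = 3 and step = 1 the nine-base example produces four connections; with step = 3 it produces two.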

The BASiNET package contains two sample files, one with mRNA sequences and one with lncRNA sequences. The example below uses these two files.

Defining parameters:

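library(BASiNET)
# Paths to the example FASTA files shipped with the package
mRNA <- system.file("extdata", "sequences2.fasta", package = "BASiNET")
lncRNA <- system.file("extdata", "sequences.fasta", package = "BASiNET")
# Classify with the default J48 classifier and save the measurements as "example"
classification(mRNA, lncRNA, save = "example")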

## Analyzing mRNA from number:

## 1

## 2

## 3

## 4

## 5

## 6

## 7

## 8

## 9

## 10

## Analyzing lncRNA from number:

## 1

## 2

## 3

## 4

## 5

## 6

## 7

## 8

## 9

## 10

## 11

## Rescaling values

## Creating data frame

## Result.arff file generated in the current R directory

## Sorting the data with the J48

## J48 pruned tree
## ------------------
## 
## MAX.6 <= 0: lncRNA (11.0)
## MAX.6 > 0: mRNA (10.0)
## 
## Number of Leaves  :  2
## 
## Size of the tree :   3
## 
## === 10 Fold Cross Validation ===
## 
## === Summary ===
## 
## Correctly Classified Instances          20               95.2381 %
## Incorrectly Classified Instances         1                4.7619 %
## Kappa statistic                          0.9041
## K&B Relative Info Score                 90.474  %
## K&B Information Score                   19.0262 bits      0.906  bits/instance
## Class complexity | order 0              21.0295 bits      1.0014 bits/instance
## Class complexity | scheme             1074      bits     51.1429 bits/instance
## Complexity improvement     (Sf)      -1052.9705 bits    -50.1415 bits/instance
## Mean absolute error                      0.0476
## Root mean squared error                  0.2182
## Relative absolute error                  9.5238 %
## Root relative squared error             43.6012 %
## Total Number of Instances               21     
## 
## === Detailed Accuracy By Class ===
## 
##                  TP Rate  FP Rate  Precision  Recall   F-Measure  MCC      ROC Area  PRC Area  Class
##                  0.900    0.000    1.000      0.900    0.947      0.908    0.950     0.948     mRNA
##                  1.000    0.100    0.917      1.000    0.957      0.908    0.950     0.917     lncRNA
## Weighted Avg.    0.952    0.052    0.956      0.952    0.952      0.908    0.950     0.931     
## 
## === Confusion Matrix ===
## 
##   a  b   <-- classified as
##   9  1 |  a = mRNA
##   0 11 |  b = lncRNA

When the function completes, the results for J48 and Random Forest are shown. For the example data, J48 classifies 95.2381% of the instances correctly and Random Forest has an error of 4.76%.
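The J48 summary figures can be recomputed from the confusion matrix printed above. The following base R sketch uses only the four counts from that matrix; the variable names are illustrative.

```r
# Confusion matrix from the J48 output (rows = actual, columns = predicted).
confusion <- matrix(c(9,  1,
                      0, 11),
                    nrow = 2, byrow = TRUE,
                    dimnames = list(actual    = c("mRNA", "lncRNA"),
                                    predicted = c("mRNA", "lncRNA")))

n <- sum(confusion)
accuracy <- sum(diag(confusion)) / n

# Agreement expected by chance, used by the kappa statistic.
chance <- sum(rowSums(confusion) * colSums(confusion)) / n^2
kappa_stat <- (accuracy - chance) / (1 - chance)

recall    <- diag(confusion) / rowSums(confusion)  # TP rate per class
precision <- diag(confusion) / colSums(confusion)

print(round(100 * accuracy, 4))  # 95.2381
print(round(kappa_stat, 4))      # 0.9041
print(round(recall, 3))          # mRNA 0.900, lncRNA 1.000
print(round(precision, 3))       # mRNA 1.000, lncRNA 0.917
```

These values match the "Correctly Classified Instances", "Kappa statistic", "Recall" and "Precision" entries of the report.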

When graphic = TRUE, the function also generates 10 two-dimensional graphs, one for each measurement. Blue lines represent the mRNA sequences and red lines the lncRNA sequences; when a third class is present, it is represented by green lines.

Example of generated graph:

Bidimensional graph for the measurement Average Minimum Path

knitr::include_graphics("2d.png")

Predict

To predict a set of data, two parameters need to be set: “predicting” and “load”. The “predicting” parameter gives the path of the file containing the sequences to be predicted. The “load” parameter defines the model that will be used to predict the sequences.

Defining parameters:

library(BASiNET)
# FASTA files with the sequences to be predicted
mRNApredict <- system.file("extdata", "sequences2-predict.fasta", package = "BASiNET")
lncRNApredict <- system.file("extdata", "sequences-predict.fasta", package = "BASiNET")
# Previously saved model shipped with the package
modelPredict <- system.file("extdata", "modelPredict.dat", package = "BASiNET")
# Predict the class of each sequence with the loaded model
classification(mRNApredict, lncRNApredict, load = modelPredict)

## Analyzing

## Rescaling values

## Creating data frame

## Results

##  [1] mRNA   mRNA   mRNA   mRNA   mRNA   lncRNA mRNA   mRNA   mRNA   mRNA  
## [11] mRNA   lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA lncRNA
## [21] lncRNA lncRNA
## Levels: mRNA lncRNA
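The predicted labels can be summarised with base R. The sketch below retypes the 22 labels from the output above. It assumes that the predictions are listed in input order (the 11 sequences of the first file, then the 11 of the second) and that each file contains the class its variable name suggests; neither assumption is confirmed by the output itself.

```r
classes <- c("mRNA", "lncRNA")

# Labels copied from the prediction output, in printed order.
predicted <- factor(c(rep("mRNA", 5), "lncRNA", rep("mRNA", 5),
                      rep("lncRNA", 11)),
                    levels = classes)

# Assumed source of each sequence: first file mRNA, second file lncRNA.
assumed <- factor(rep(classes, each = 11), levels = classes)

print(table(predicted))           # counts per predicted class
print(table(assumed, predicted))  # cross-tabulation under the assumption

agreement <- mean(predicted == assumed)
print(round(100 * agreement, 2))
```

Under these assumptions, 21 of the 22 predictions agree with the file each sequence came from.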