Title: | Functions to Operate on Binary Fingerprint Data |
---|---|
Description: | Functions to manipulate binary fingerprints of arbitrary length. A fingerprint is represented by an object of S4 class 'fingerprint', which is internally represented as a vector of integers, such that each element represents the position in the fingerprint that is set to 1. The bitwise logical functions in R are overridden so that they can be used directly with 'fingerprint' objects. A number of distance metrics are also available (many contributed by Michael Fadock). Fingerprints can be converted to Euclidean vectors (i.e., points on the unit hypersphere) and can also be folded using OR. Arbitrary fingerprint formats can be handled via line handlers. Currently handlers are provided for CDK, MOE and BCI fingerprint data. |
Authors: | Rajarshi Guha <[email protected]> |
Maintainer: | Rajarshi Guha <[email protected]> |
License: | GPL |
Version: | 3.5.7 |
Built: | 2024-11-18 06:49:24 UTC |
Source: | CRAN |
The function returns a string of 1's and 0's or a character vector of features depending on the nature of the fingerprint supplied.
## S4 method for signature 'fingerprint'
as.character(x)
## S4 method for signature 'featvec'
as.character(x)
## S4 method for signature 'feature'
as.character(x)
x |
An object of class |
A string of 1's and 0's or else a character vector of features (with their counts)
Rajarshi Guha [email protected]
# make a fingerprint vector
fp <- new("fingerprint", nbit=32, bits=sample(1:32, 20))
# print out the string representation
as.character(fp)
It has been noted that the bit density in a fingerprint can affect its ability to retrieve similar compounds from a database, primarily due to complexity effects. One approach to alleviating these effects is to generate fingerprints that have a bit density of 50%. This function uses the balanced code approach described by Nisius and Bajorath to convert an ordinary binary fingerprint (whose bit density is not 50%) into one with a bit density of 50% (resulting in a fingerprint twice the size of the original).
balance(fplist)
fplist |
A single fingerprint or a list of fingerprints |
A single fingerprint object or a list of fingerprint objects that are "balanced", in that they have a bit density of 50%.
Rajarshi Guha [email protected]
Nisius, B.; Bajorath, J.; ChemMedChem, 2010, 5, 859-868.
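The doubling in size can be illustrated with a minimal base R sketch. It assumes the balanced code is formed by appending the complement of the fingerprint to the fingerprint itself, and it models a fingerprint as a plain vector of 'on' positions. The helper `balance_bits` is hypothetical and is not the package's `balance` function.

```r
# Sketch only: a fingerprint is modelled as a vector of 'on' bit positions
# plus its total length nbit (hypothetical helper, not the package code).
balance_bits <- function(bits, nbit) {
  complement <- setdiff(seq_len(nbit), bits)  # positions that are off
  # keep the original bits, then append the complement shifted by nbit
  sort(c(bits, complement + nbit))
}

bits <- c(1, 2, 5)                      # 3 of 8 bits on: density 37.5%
balanced <- balance_bits(bits, nbit = 8)
density <- length(balanced) / (2 * 8)   # density of the 16-bit result
density                                 # 0.5
```

Under this assumption every bit contributes exactly one 'on' position to the result, either in the first half or in the second, so the density is 50% regardless of the input.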
This method evaluates the Kullback-Leibler (KL) divergence to rank the individual bits in a binary fingerprint in their ability to discriminate between database and active compounds. This method is implemented based on Nisius and Bajorath and includes an m-estimate correction.
bit.importance(actives, background)
actives |
A list of fingerprints for the actives |
background |
A list of fingerprints representing the background collection |
A numeric vector of length equal to the size of the fingerprints. Each element
of the vector is the KL divergence for the corresponding bit. If a bit position
is never set to 1 in any of the compounds from the actives and the background, then
the KL divergence for that position is undefined and NA
is returned.
Rajarshi Guha [email protected]
Nisius, B.; Bajorath, J.; ChemMedChem, 2010, 5, 859-868.
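A rough sketch of a per-bit KL divergence is given below. It treats each bit as a Bernoulli variable and compares its frequency in the actives with its frequency in the background. The helper `bit_kl`, the parameter `m`, and the particular form of the m-estimate (shrinking the active frequency toward the background frequency) are assumptions made for illustration; the package may differ in these details.

```r
# Sketch only: per-bit KL divergence between active and background bit
# frequencies. p_act and p_bg are vectors of relative frequencies (one
# entry per bit) and n_act is the number of active compounds.
bit_kl <- function(p_act, p_bg, n_act, m = 1) {
  # assumed m-estimate: pull the active frequency toward the background
  p_adj <- (n_act * p_act + m * p_bg) / (n_act + m)
  kl <- p_adj * log(p_adj / p_bg) +
    (1 - p_adj) * log((1 - p_adj) / (1 - p_bg))
  kl[is.nan(kl)] <- NA  # undefined positions are reported as NA
  kl
}

p_act <- c(0.5, 0.9, 0)  # bit frequencies among the actives
p_bg  <- c(0.5, 0.1, 0)  # bit frequencies in the background
kl <- bit_kl(p_act, p_bg, n_act = 10)
```

In this example the first bit is equally frequent in both sets and scores 0, the second bit is strongly enriched in the actives and scores above 0, and the third bit is never set in either set, so its divergence is undefined and comes back as NA.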
The idea of comparing datasets using fingerprints was described in Guha & Schurer (2008). The idea is that one can summarize the dataset by counting the frequency of occurrence of each bit position. The frequency is normalized by the number of fingerprints considered. Thus a collection of N fingerprints can be converted to a single vector of numbers highlighting the most frequent bits with respect to a given dataset. A plot of this vector looks like a traditional spectrum, hence the name.
The bit spectra for two datasets (assuming that the same types of fingerprints have been used) allow one to compare the similarity of the datasets without having to do a full pairwise similarity calculation. The difference between the structural features of the datasets can be quantified by evaluating the distance between the two bit spectra.
bit.spectrum(fplist)
fplist |
A list structure with each element being an object of class
All fingerprints in the list should be of the same length. |
A numeric vector of length equal to the size of the fingerprints.
Rajarshi Guha [email protected]
Guha, R.; Schurer, S.; J. Comp. Aid. Molec. Des., 2008, 22, 367-384.
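The normalized bit frequency described above can be sketched in base R as follows. Fingerprints are modelled as plain vectors of 'on' positions, and `bit_spectrum` is a hypothetical helper, not the package's `bit.spectrum`.

```r
# Sketch only: count how often each bit position is set, then divide by
# the number of fingerprints.
bit_spectrum <- function(bit_list, nbit) {
  counts <- tabulate(unlist(bit_list), nbins = nbit)
  counts / length(bit_list)
}

fps <- list(c(1, 2, 5, 6), c(1, 4, 5, 6), c(2, 3, 4, 5, 6))
spec <- bit_spectrum(fps, nbit = 6)
spec  # bits 5 and 6 are set in every fingerprint, so their frequency is 1
```

Two datasets encoded with the same fingerprint type could then be compared by taking, for example, the Euclidean distance between their spectra.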
Combine multiple feature objects to give a list of feature objects.
## S4 method for signature 'feature'
c(x, ..., recursive = FALSE)
x |
An object of class |
... |
One or more |
recursive |
Ignored |
Rajarshi Guha [email protected]
These functions take a single line and parse it to produce
a vector of integers which represents the positions of the 'on' bits in
a fingerprint. This allows the user to use fp.read
with arbitrary fingerprint
files. A new file format can be handled by defining a new line parser function.
Currently the first three functions process fingerprint files obtained from the
CDK (http://cdk.sourceforge.net), MOE (http://chemcomp.com), BCI
(http://www.digitalchemistry.co.uk/) and the FPS format
(http://code.google.com/p/chem-fingerprints/wiki/FPS). The last function can be used
for any fingerprint that generates hashed features (such as ECFPs or other
circular fingerprints). For these cases, it is assumed that features are unsigned
integers, so string features are not handled.
Note that when the fps.lf
function is specified, items such as the number of bits
or the header flag do not need to be specified, as the format requires a header block
containing some of these items.
cdk.lf(line)
moe.lf(line)
bci.lf(line)
ecfp.lf(line)
fps.lf(line)
jchem.binary.lf(line)
line |
The line to parse |
A list with three components: the name associated with the fingerprint (if available); a vector of integers representing bits set to 1 (for the case of the first three methods) or a vector of characters representing hashed features (characteristic of circular fingerprints) or, more generally, any string feature; and a (possibly empty) list, which contains the remaining components of a line when the format allows items other than a title and the fingerprint (such as the FPS format). The content of the third component depends on the line function being used.
Rajarshi Guha [email protected]
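As an illustration of what a user-defined line parser might look like, the sketch below handles a made-up format in which each line holds a name, a tab, and then space-separated 'on' positions. The function name `simple.lf` and the file format are hypothetical. The return value follows the three-component layout described above, but whether the components need to be named is not stated here, so they are left unnamed.

```r
# Sketch only: parser for a hypothetical format
#   <name><TAB><space-separated on-bit positions>
simple.lf <- function(line) {
  fields <- strsplit(line, "\t", fixed = TRUE)[[1]]
  name <- fields[1]
  bits <- as.numeric(strsplit(trimws(fields[2]), "[[:space:]]+")[[1]])
  # name, bit positions, and an empty list for any extra fields
  list(name, bits, list())
}

parsed <- simple.lf("mol1\t1 5 12 40")
parsed[[1]]  # "mol1"
parsed[[2]]  # 1 5 12 40
```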
Get or set the count of occurrence associated with a feature-class object. The default value for the getter (as defined in the prototype) is 1.
## S4 method for signature 'feature'
count(object)
## S4 replacement method for signature 'feature,numeric'
count(x) <- value
object |
An object of class |
x |
An object of class |
value |
A numeric (which will be coerced to |
An integer representing the count of occurrence of the feature
signature(object = "feature")
Return the count associated with the feature object
signature(x = "feature", value = "numeric")
Set the count associated with the feature object
Rajarshi Guha [email protected]
A number of distance metrics can be calculated for binary fingerprints. Some of these are actually similarity metrics and thus represent the reverse of a distance metric.
The following are distance (dissimilarity) metrics
Hamming
Mean Hamming
Soergel
Pattern Difference
Variance
Size
Shape
The following metrics are similarity metrics, so the distance can be obtained by subtracting the value from 1.0
Tanimoto
Dice
Modified Tanimoto
Simple
Jaccard
Russel-Rao
Rodgers Tanimoto
Cosine
Achiai
Carbo
Baroniurbanibuser
Kulczynski2
Robust
Finally the method also provides a set of composite and asymmetric distance metrics
Hamann
Yule
Pearson
Dispersion
McConnaughey
Stiles
Simpson
Petke
Tversky
The default metric is the Tanimoto coefficient.
distance(fp1, fp2, method, a, b)
fp1 |
An object of class |
fp2 |
An object of class |
a |
Parameter for the Tversky index |
b |
Parameter for the Tversky index |
method |
The type of distance metric desired. Partial matching is
supported and the default is
If the two fingerprints are of class |
Numeric value representing the distance in the specified metric between the supplied fingerprint objects
signature(fp1 = "featvec", fp2 = "featvec", method = "character", a = "missing", b = "missing")
Similarity method for feature vector type fingerprints, supporting tanimoto, robust and dice metrics.
signature(fp1 = "featvec", fp2 = "featvec", method = "missing", a = "missing", b = "missing")
Evaluate Tanimoto similarity between two feature vector fingerprints
signature(fp1 = "fingerprint", fp2 = "fingerprint", method = "character", a = "missing", b = "missing")
Evaluate similarity (or dissimilarity) between two binary fingerprints. See below for a list of possible similarity (or dissimilarity) metrics
signature(fp1 = "fingerprint", fp2 = "fingerprint", method = "character", a = "numeric", b = "numeric")
Evaluate Tversky similarity between two binary fingerprints.
signature(fp1 = "fingerprint", fp2 = "fingerprint", method = "missing", a = "missing", b = "missing")
Evaluate Tanimoto similarity between two binary fingerprints
Rajarshi Guha [email protected]
Fligner, M.A.; Verducci, J.S.; Blower, P.E.; A Modification of the Jaccard-Tanimoto Similarity Index for Diverse Selection of Chemical Compounds Using Binary Strings, Technometrics, 2002, 44(2), 110-119
Monev, V.; Introduction to Similarity Searching in Chemistry, MATCH - Comm. Math. Comp. Chem., 2004, 51, 7-38
# make 2 fingerprint vectors
fp1 <- new("fingerprint", nbit=6, bits=c(1,2,5,6))
fp2 <- new("fingerprint", nbit=6, bits=c(1,2,5,6))
# calculate the tanimoto coefficient
distance(fp1,fp2) # should be 1
# Invert the second fingerprint
fp3 <- !fp2
distance(fp1,fp3) # should be 0
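To show how three of the listed coefficients relate to one another, the sketch below computes Tanimoto, Dice and Tversky values in base R from vectors of 'on' positions. The helper names are hypothetical and the formulas are the standard textbook definitions, which may differ in detail from what `distance` computes internally.

```r
# Sketch only: similarity coefficients on sets of 'on' bit positions.
tanimoto_sim <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
dice_sim <- function(a, b) {
  2 * length(intersect(a, b)) / (length(a) + length(b))
}
tversky_sim <- function(a, b, alpha, beta) {
  common <- length(intersect(a, b))
  only_a <- length(setdiff(a, b))  # bits set only in a
  only_b <- length(setdiff(b, a))  # bits set only in b
  common / (common + alpha * only_a + beta * only_b)
}

fp1 <- c(1, 2, 5, 6)
fp2 <- c(1, 4, 5, 6)
tanimoto_sim(fp1, fp2)           # 3 shared of 5 distinct bits: 0.6
dice_sim(fp1, fp2)               # 2 * 3 / (4 + 4): 0.75
tversky_sim(fp1, fp2, 1, 1)      # equals the Tanimoto value
tversky_sim(fp1, fp2, 0.5, 0.5)  # equals the Dice value
```

With these definitions the Tversky index reduces to Tanimoto when both weights are 1 and to Dice when both are 0.5, which is why it takes the two extra parameters.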
Ordinarily, a binary fingerprint can be considered to represent a corner of an nD hypercube. However, in many cases using such a representation can lead to a very sparse space. Consequently, one approach is to convert the fingerprint so that it represents a point on an nD unit hypersphere.
The resultant fingerprint is then an nD coordinate.
euc.vector(fp)
fp |
An object of class |
A numeric vector of length equal to the bit length of the fingerprint. The result corresponds to a unit vector for a point on the nD hypersphere.
Rajarshi Guha [email protected]
# make a fingerprint vector
fp <- new("fingerprint", nbit=8, bits=c(1,3,4,5,7))
vec <- euc.vector(fp)
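One natural way to place a binary fingerprint on the unit hypersphere is to divide its 0/1 vector by its Euclidean norm. The sketch below does exactly that in base R. The helper `unit_vector` is hypothetical, and the scaling actually used by `euc.vector` is not stated here, so this should be read as an assumption.

```r
# Sketch only: scale the 0/1 bit vector to unit Euclidean length.
# An all-zero fingerprint has no direction, so it is rejected.
unit_vector <- function(bits, nbit) {
  stopifnot(length(bits) > 0)
  v <- numeric(nbit)
  v[bits] <- 1
  v / sqrt(sum(v^2))
}

vec <- unit_vector(c(1, 3, 4, 5, 7), nbit = 8)
sum(vec^2)  # squared length, equal to 1 up to floating point error
```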
This class represents features - arbitrary alphanumeric sequences that are used to characterize molecular substructures (though there is no real restriction to molecules). A feature is associated with an integer count, indicating the number of occurrences of that feature in a molecule. The default value is 1.
Objects can be created by calls of the form new("feature", ...).
feature
:Object of class "character"
~ The string representation of a feature
count
:Object of class "integer"
~ The occurrence count of the feature. Default is 1
.Data
:???
signature(object = "feature")
: Return the count associated with the
feature
Rajarshi Guha [email protected]
## create a new feature
f <- new("feature", feature='ABCD', count=as.integer(1))
## modify the feature string and the count
feature(f) <- 'UXYZ'
count(f) <- 10
Get or set the character string representing a feature of a feature-class object. The default value for the getter (as defined in the prototype) is the empty string.
## S4 method for signature 'feature'
feature(object)
## S4 replacement method for signature 'feature,character'
feature(x) <- value
object |
An object of class |
x |
An object of class |
value |
The character string to replace the current feature string with |
A character string representing the feature
signature(object = "feature")
Return the feature associated with the feature object
signature(x = "feature", value = "character")
Set the feature associated with the feature object
Rajarshi Guha [email protected]
This class represents feature vector style fingerprints, where, rather than a bit string, the fingerprint is represented as a sequence of (signed) integers or strings. Each element of the collection is a representation of a structural feature. For cases where the features are integers, this usually corresponds to a hash of the original feature string.
Objects can be created by calls of the form new("featvec", ...).
In contrast to traditional binary fingerprints, operations on feature vectors
are slightly different and essentially correspond to operations on sets. Thus
the logical and (&) would correspond to the intersection of the two feature vectors.
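The set view can be illustrated with plain character vectors in base R. The feature strings below are made up, and reading AND as set intersection and OR as set union is the usual set-theoretic convention, assumed here rather than taken from the package code.

```r
# Sketch only: feature vectors modelled as character vectors.
fv1 <- c("C=O", "c1ccccc1", "N-H")
fv2 <- c("c1ccccc1", "N-H", "Cl")

shared   <- intersect(fv1, fv2)  # features present in both
combined <- union(fv1, fv2)      # features present in either

# set-based Tanimoto similarity: 2 shared of 4 distinct features
length(shared) / length(combined)  # 0.5
```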
features
:Object of class "character"
~~ A vector
containing the numeric or character features. Numeric features are treated
as character strings
provider
:Object of class "character"
~~
Indicates the source of the fingerprint. Can be useful to keep
track of what software generated the fingerprint.
name
:Object of class "character"
~~
The name associated with the fingerprint. If no name is available
this gets set to an empty string
misc
:A list to hold arbitrary items associated with a fingerprint (such as extra fields from a fingerprint file)
signature(fp1 = "featvec", fp2 = "featvec", method = "missing")
: ...
signature(fp1 = "featvec", fp2 = "featvec", method = "character")
: ...
signature(fp = "featvec")
: ...
signature(fp = "featvec")
: ...
signature(fp = "featvec")
: ...
Rajarshi Guha [email protected]
fp.read, fp.read.to.matrix, fp.sim.matrix, fp.to.matrix, fp.factor.matrix, random.fingerprint
This class represents binary fingerprints, usually generated by a variety of cheminformatics software, but not restricted to such.
Objects can be created by calls of the form new("fingerprint", ...).
Fingerprints can traditionally be thought of as a vector of 1's and
0's. However for large fingerprints this is inefficient and
instead we simply store the positions of the bits that are
on. Certain operations also need to know the length of the
original bit string and this length is stored in the object at
construction. Even though we store extra information along with
the bit positions, conceptually we still consider the objects as
simple bit strings. Thus the usual bitwise logical operations
(&, |, !, xor) can be applied to objects of this class.
bits
:Object of class "numeric"
~~ A vector
indicating the bit positions that are on.
nbit
:Object of class "numeric"
~~ Indicates the length of the original bit string.
folded
:Object of class "logical"
~~ Indicates
whether the fingerprint has been folded.
provider
:Object of class "character"
~~
Indicates the source of the fingerprint. Can be useful to keep
track of what software generated the fingerprint.
name
:Object of class "character"
~~
The name associated with the fingerprint. If no name is available
this gets set to an empty string
misc
:Object of class "list"
~~
A holder for arbitrary items that may have been stored along with the fingerprint. Only
certain formats allow extra items to be stored with the fingerprint, so in many cases
this field is just an empty list
signature(fp1 = "fingerprint", fp2 = "fingerprint", method = "missing", a = "missing", b = "missing")
: ...
signature(fp1 = "fingerprint", fp2 = "fingerprint", method = "character", a = "missing", b = "missing")
: ...
signature(fp = "fingerprint")
: ...
signature(fp = "fingerprint")
: ...
signature(nbit = "numeric", on = "numeric")
: ...
Rajarshi Guha [email protected]
fp.read, fp.read.to.matrix, fp.sim.matrix, fp.to.matrix, fp.factor.matrix, random.fingerprint
## make fingerprints
x <- new("fingerprint", nbit=128, bits=sample(1:128, 100))
y <- x
distance(x,y) # should be 1
x <- new("fingerprint", nbit=128, bits=sample(1:128, 100))
distance(x,y)
folded <- fold(x)
## binary operations on fingerprints
x <- new("fingerprint", nbit=8, bits=c(1,2,3,6,8))
y <- new("fingerprint", nbit=8, bits=c(1,2,4,5,7,8))
x & y
x | y
!x
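Because only the 'on' positions and the total length are stored, the bitwise operations can be pictured as set operations on those positions. The base R sketch below works on plain numeric vectors rather than `fingerprint` objects. The helper `to_bitstring` is hypothetical and the class methods may be implemented differently.

```r
# Sketch only: bitwise logic expressed on vectors of 'on' positions.
nbit <- 8
x <- c(1, 2, 3, 6, 8)
y <- c(1, 2, 4, 5, 7, 8)

and_bits <- sort(intersect(x, y))                        # 1 2 8
or_bits  <- sort(union(x, y))                            # 1 2 3 4 5 6 7 8
not_x    <- setdiff(seq_len(nbit), x)                    # 4 5 7
xor_bits <- sort(setdiff(union(x, y), intersect(x, y)))  # 3 4 5 6 7

# expand positions back into a string of 1's and 0's
to_bitstring <- function(bits, nbit) {
  v <- integer(nbit)
  v[bits] <- 1L
  paste(v, collapse = "")
}
to_bitstring(x, nbit)  # "11100101"
```

Note that NOT needs the total length `nbit`, which is why the class stores it alongside the bit positions.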
In many situations a fingerprint is generated using a large length (such as 1024 bits or more). As a result, the fingerprints for a dataset can be very sparse. One approach to increasing the bit density of such fingerprints is to fold them. This is done by dividing the original fingerprint bitstring into two substrings of equal length and then performing an OR on the two substrings.
It should be noted that many fingerprint generating routines will perform this internally.
fold(fp)
fp |
The fingerprint to fold. Should be of class |
An object of class fingerprint
representing the folded fingerprint.
Rajarshi Guha [email protected]
# make a fingerprint vector
fp <- new("fingerprint", nbit=64, bits=sample(1:64, 30))
fold(fp)
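On the position-based representation, folding amounts to mapping every position in the upper half onto the matching position in the lower half and dropping duplicates. The sketch below shows this in base R. The helper `fold_bits` is hypothetical, assumes an even bit length, and is not the package's `fold`.

```r
# Sketch only: fold a fingerprint given as 'on' positions and length nbit.
fold_bits <- function(bits, nbit) {
  stopifnot(nbit %% 2 == 0)
  half <- nbit / 2
  # upper-half positions wrap onto the lower half; unique() plays the
  # role of OR, since a bit set in either half ends up set once
  folded <- ifelse(bits > half, bits - half, bits)
  list(bits = sort(unique(folded)), nbit = half)
}

folded <- fold_bits(c(1, 2, 6, 7), nbit = 8)
folded$bits  # 1 2 3
folded$nbit  # 4
```

Here bits 2 and 6 collapse onto the same position, so four 'on' bits in 8 positions become three 'on' bits in 4 positions.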
This function will convert a list of fingerprint objects to a data.frame of factors with levels 1 and 0.
fp.factor.matrix(fplist)
fplist |
A list structure with each element being an object of class
|
A data.frame with dimensions equal to (length(fplist), bit length), where bit length is a property of the fingerprint objects in the list.
Rajarshi Guha [email protected]
# make fingerprint objects
fp1 <- new("fingerprint", nbit=6, bits=c(1,2,5,6))
fp2 <- new("fingerprint", nbit=6, bits=c(1,4,5,6))
fp3 <- new("fingerprint", nbit=6, bits=c(2,3,4,5,6))
fp.factor.matrix( list(fp1,fp2,fp3) )
fp.read reads in a set of fingerprints from a file. Fingerprint output from the CDK, MOE and BCI can be handled. Each fingerprint is represented as a fingerprint object.
fp.read returns a list structure, each element being a fingerprint or featvec object, depending on the value of the binary argument.
fp.read.to.matrix
is a utility function that reads the fingerprints directly to
matrix form (columns are the bit positions and the rows are the objects whose fingerprints
have been evaluated). Note that this method does not currently work with feature vector
fingerprints.
fp.read(f='fingerprint.txt', size=1024, lf=cdk.lf, header=FALSE, binary=TRUE)
fp.read.to.matrix(f='fingerprint.txt', size=1024, lf=cdk.lf, header=FALSE)
f |
File containing the fingerprints |
size |
The bit length of the fingerprints being considered |
lf |
A line reading function that parses a single line from a fingerprint file. A number of functions are provided that parse the fingerprints from the output of the CDK, MOE and the BCI toolkit. In addition, support is now available for the FPS format from the chemfp project (http://code.google.com/p/chem-fingerprints). |
header |
Indicates whether the first line of the fingerprint file is a header line |
binary |
If |
A list or matrix of fingerprints
Rajarshi Guha [email protected]
cdk.lf, moe.lf, bci.lf, ecfp.lf, fps.lf
Given a set of fingerprints, a pairwise similarity can be calculated using the various distance metrics defined for binary strings. This function calculates the pairwise similarity matrix for a set of fingerprint or featvec objects supplied in a list structure. Any of the distance metrics provided by distance can be used and the default is the Tanimoto metric.
Note that if the Euclidean distance is specified then the resultant matrix is a distance matrix and not a similarity matrix.
fp.sim.matrix(fplist, fplist2=NULL, method='tanimoto')
fplist |
A list structure with each element being an object of class
|
fplist2 |
A list structure with each element being an object of class
|
method |
The type of distance metric to use. The default is |
A matrix with dimensions equal to (length(fplist), length(fplist)) if fplist2 is NULL, otherwise (length(fplist), length(fplist2))
Rajarshi Guha [email protected]
# make fingerprint objects
fp1 <- new("fingerprint", nbit=6, bits=c(1,2,5,6))
fp2 <- new("fingerprint", nbit=6, bits=c(1,4,5,6))
fp3 <- new("fingerprint", nbit=6, bits=c(2,3,4,5,6))
fp.sim.matrix( list(fp1,fp2,fp3) )
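The shape of the result can be seen in a small base R sketch that builds a pairwise Tanimoto matrix from vectors of 'on' positions. The helpers `tanimoto_sim` and `sim_matrix` are hypothetical and are written for clarity rather than speed; the package function is likely implemented differently.

```r
# Sketch only: pairwise Tanimoto similarity on lists of 'on' positions.
tanimoto_sim <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}

sim_matrix <- function(fps, fps2 = NULL) {
  if (is.null(fps2)) fps2 <- fps  # compare the list against itself
  outer(seq_along(fps), seq_along(fps2),
        Vectorize(function(i, j) tanimoto_sim(fps[[i]], fps2[[j]])))
}

fps <- list(c(1, 2, 5, 6), c(1, 4, 5, 6), c(2, 3, 4, 5, 6))
sim <- sim_matrix(fps)
sim[1, 2]  # 0.6
sim[1, 3]  # 0.5
```

With a single list the matrix is square and symmetric with ones on the diagonal. Passing a second list gives one row per element of the first list and one column per element of the second.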
In general, fingerprint data is read from a file or obtained via calls to an external generator, and the return value is a list of fingerprints. This function takes the list and returns a matrix with the number of rows equal to the number of fingerprints and the number of columns equal to the length of the fingerprint. Each element is 1 or 0 (1's being specified by the positions in each fingerprint vector).
fp.to.matrix(fplist)
fplist |
A list structure with each element being an object of class
|
A matrix with dimensions equal to (length(fplist), bit length),
where bit length is a property of the fingerprint objects in the list.
Rajarshi Guha [email protected]
# make fingerprint objects
fp1 <- new("fingerprint", nbit=6, bits=c(1,2,5,6))
fp2 <- new("fingerprint", nbit=6, bits=c(1,4,5,6))
fp3 <- new("fingerprint", nbit=6, bits=c(2,3,4,5,6))
fp.to.matrix( list(fp1,fp2,fp3) )
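The conversion from position vectors to a 0/1 matrix can be sketched in base R as below. The helper `to_matrix` is hypothetical and works on plain vectors of 'on' positions instead of `fingerprint` objects.

```r
# Sketch only: one row per fingerprint, one column per bit position.
to_matrix <- function(bit_list, nbit) {
  m <- matrix(0L, nrow = length(bit_list), ncol = nbit)
  for (i in seq_along(bit_list)) {
    m[i, bit_list[[i]]] <- 1L  # switch on the listed positions
  }
  m
}

fps <- list(c(1, 2, 5, 6), c(1, 4, 5, 6), c(2, 3, 4, 5, 6))
m <- to_matrix(fps, nbit = 6)
dim(m)  # 3 6
m[1, ]  # 1 1 0 0 1 1
```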
These functions perform logical operations (AND, OR, NOT, XOR) on the supplied binary fingerprints. Thus for two fingerprints A and B we have
&
Logical AND
|
Logical OR
xor
Logical XOR
!
Logical NOT (negation)
e1 |
An object of class |
e2 |
An object of class |
A fingerprint object
Rajarshi Guha [email protected]
Returns the length of the fingerprint. That is, this is the length of the entire bit string and not simply the number of bits that are on.
## S4 method for signature 'fingerprint'
length(x)
x |
An object of class |
The length of the bit string
Rajarshi Guha [email protected]
A utility function that can be used to generate binary fingerprints of a specified length with a specified number of bit positions (selected randomly) set to 1. Currently bit positions are selected uniformly.
random.fingerprint(nbit,on)
nbit |
The length of the fingerprint, that is, the total number of bits. Must be a positive integer. |
on |
How many positions should be set to 1 |
An object of class fingerprint
Rajarshi Guha [email protected]
# make a fingerprint vector
fp <- random.fingerprint(32, 16)
as.character(fp)
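Uniform selection of bit positions can be sketched in base R by sampling without replacement. The helper `random_bits` is hypothetical and returns only the 'on' positions, not a `fingerprint` object.

```r
# Sketch only: choose 'on' distinct positions uniformly from 1..nbit.
random_bits <- function(nbit, on) {
  stopifnot(nbit >= 1, on >= 0, on <= nbit)
  sort(sample.int(nbit, on))  # sampling without replacement
}

set.seed(42)  # only to make the example repeatable
bits <- random_bits(32, 16)
length(bits)  # 16
```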
This method evaluates the Shannon entropy for a set of fingerprints and utilizes the bit.spectrum method to obtain the relative frequencies of individual bits.
shannon(fplist)
fplist |
A list structure with each element being an object of class
All fingerprints in the list should be of the same length. |
The Shannon entropy for the set of fingerprints
Rajarshi Guha [email protected]
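A common definition of Shannon entropy over a vector of relative bit frequencies is sketched below in base R. The helper `shannon_entropy`, the use of base 2 logarithms, and the convention of skipping zero frequencies are assumptions; the exact form used by `shannon` is not stated here.

```r
# Sketch only: entropy of a bit spectrum (relative frequency per bit).
shannon_entropy <- function(p) {
  p <- p[p > 0]  # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

spectrum <- c(0.5, 0.5, 0, 1)
h <- shannon_entropy(spectrum)
h  # 1
```

Bits that are always on or always off contribute nothing, while bits set in half of the fingerprints contribute the most.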
Simply summarize the fingerprint or feature
## S4 method for signature 'fingerprint'
show(object)
## S4 method for signature 'featvec'
show(object)
## S4 method for signature 'feature'
show(object)
object |
An object of class |
Rajarshi Guha [email protected]