Title: | Protocol Inspection and State Machine Analysis |
---|---|
Description: | Loads and processes huge text corpora processed with the sally toolbox (<http://www.mlsec.org/sally/>). sally acts as a very fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some replicate-aware, highly tuned non-negative matrix factorization and principal component analysis implementation which allows the processing of very big data sets even on desktop machines. |
Authors: | Tammo Krueger, Nicole Kraemer |
Maintainer: | Tammo Krueger <[email protected]> |
License: | GPL (>= 2.0) |
Version: | 0.2-7 |
Built: | 2024-12-19 06:29:39 UTC |
Source: | CRAN |
Package: | PRISMA |
Type: | Package |
Title: | Protocol Inspection and State Machine Analysis |
Version: | 0.2-7 |
Date: | 2018-05-26 |
Depends: | R (>= 2.10), Matrix, gplots, methods, ggplot2 |
Suggests: | tm (>= 0.6) |
Author: | Tammo Krueger, Nicole Kraemer |
Maintainer: | Tammo Krueger <[email protected]> |
Description: | Loads and processes huge text corpora processed with the sally toolbox (<http://www.mlsec.org/sally/>). sally acts as a very fast preprocessor which splits the text files into tokens or n-grams. These output files can then be read with the PRISMA package which applies testing-based token selection and has some replicate-aware, highly tuned non-negative matrix factorization and principal component analysis implementation which allows the processing of very big data sets even on desktop machines. |
License: | GPL (>= 2.0) |
NeedsCompilation: | no |
Packaged: | 2018-05-26 15:51:57 UTC; tammok |
Repository: | CRAN |
Date/Publication: | 2018-05-26 22:01:47 UTC |
Index of help topics:
PRISMA-package | Protocol Inspection and State Machine Analysis |
asap | The ASAP Data Set |
corpusToPrisma | Convert tm corpus to PRISMA |
estimateDimension | Estimate Inner Dimension |
getDuplicateData | Restores Data with Duplicates |
getMatrixFactorizationLabels | Convert Coordinates of Matrix Factorization to Labels |
loadPrismaData | Load PRISMA Data Files |
plot.prisma | Generics For PRISMA Objects |
plot.prismaDimension | Generics For PRISMA Objects |
plot.prismaMF | Generics For PRISMA Objects |
prismaDuplicatePCA | Matrix Factorization Based on Replicate-Aware PCA |
prismaHclust | Matrix Factorization Based on Hierarchical Clustering |
prismaNMF | Matrix Factorization Based on Replicate-Aware NMF |
thesis | The Thesis Data Set |
Further information is available in the following vignettes:
PRISMA |
Quick introduction (source, pdf) |
Tammo Krueger, Nicole Kraemer
Maintainer: Tammo Krueger <[email protected]>
Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012) Learning Stateful Models for Network Honeypots. 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted.
Krueger, T., Kraemer, N., Rieck, K. (2011) ASAP: Automatic Semantics-Aware Analysis of Network Payloads. Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer, 50-63.
# please see the vignette for examples
Toy data set to show the capabilities of the PRISMA package.
asap
A prisma object.
Tammo Krueger <[email protected]>
Krueger, T., Kraemer, N., Rieck, K. (2011) ASAP: Automatic Semantics-Aware Analysis of Network Payloads. Privacy and Security Issues in Data Mining and Machine Learning - International ECML/PKDD Workshop. Lecture Notes in Computer Science 6549, Springer, 50-63.
Converts a tm corpus object to a PRISMA object.
corpusToPrisma(corpus, alpha = 0.05, skipFeatureCorrelation = FALSE)
corpus |
a tm corpus |
alpha |
significance level for the feature tests. If NULL, all features are kept. |
skipFeatureCorrelation |
should the grouping of features based on correlation analysis be skipped. |
prismaData |
data object representing the tokenized documents as features x samples matrix. |
Tammo Krueger <[email protected]>
if (require("tm") && packageVersion("tm") >= '0.6') {
  data(thesis)
  thesis
  thesis = corpusToPrisma(thesis, NULL, TRUE)
  thesis
}
Matrix factorization methods compress the original data matrix $A \in \mathbb{R}^{f \times N}$ with $f$ features and $N$ samples into two parts, namely $A = B C$ with $B \in \mathbb{R}^{f \times k}$ and $C \in \mathbb{R}^{k \times N}$. The function estimateDimension estimates $k$ based on a noise model estimated from a scrambled version of the original data matrix.
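The scrambling idea can be illustrated with a minimal base R sketch. It is not the package's implementation: the synthetic data, the row-wise permutation, and the decision rule (count the singular values that exceed the largest singular value of the scrambled matrix) are all assumptions made for illustration, and the confidence intervals controlled by alpha are not reproduced.

```r
# Hypothetical sketch: estimate the inner dimension by comparing the data
# with a scrambled copy. All names here are illustrative.
set.seed(1)
f <- 30; N <- 200; k_true <- 3
B <- diag(k_true)[rep(1:k_true, each = 10), ]   # 30 x 3 block indicator bases
C <- matrix(rnorm(k_true * N), k_true, N)       # random coordinates
A <- B %*% C + matrix(rnorm(f * N, sd = 0.01), f, N)

# Permuting every row independently destroys the structure shared between
# features but keeps each feature's own distribution.
scrambled <- t(apply(A, 1, sample))

center <- function(M) M - rowMeans(M)           # remove per-feature means
sv_data  <- svd(center(A))$d
sv_noise <- svd(center(scrambled))$d

# Assumed decision rule: keep the components that rise above the noise level.
k_est <- sum(sv_data > max(sv_noise))
```

In this toy setting the three planted components dominate the scrambled spectrum, so `k_est` recovers `k_true`.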
estimateDimension(prismaData, alpha = 0.05, nScrambleSamples = NULL)
prismaData |
A prismaData object loaded via loadPrismaData |
alpha |
Error probability for confidence intervals |
nScrambleSamples |
The number of scrambled samples that should be used to estimate the noise model. NULL means to use the complete data set. |
estDim |
prismaDimension object that can be printed and plotted. |
Tammo Krueger <[email protected]>
R. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 34(3):276 – 280, 1986.
# please see the vignette for examples
The loadPrismaData function triggers feature selection and data combination methods, which subsequently remove duplicate entries for an efficient representation of the data. The getDuplicateData function rebuilds the data matrix with an explicit representation of all duplicate entries.
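A minimal base R sketch of the general idea follows. The variable names (`uniq`, `remap`, `counts`) are hypothetical and do not describe the fields of a prisma object; the sketch only shows how a matrix with duplicate columns can be stored compactly and expanded again.

```r
# Hypothetical sketch of duplicate-aware storage (illustrative names only).
full <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 0, 1), c(1, 0, 1), c(0, 1, 1))

# Key every column by its content so that identical columns share a key.
keys   <- apply(full, 2, paste, collapse = "")
first  <- !duplicated(keys)
uniq   <- full[, first, drop = FALSE]   # compact matrix of unique columns
remap  <- match(keys, keys[first])      # original column -> unique column
counts <- tabulate(remap)               # multiplicity of each unique column

# Rebuilding the explicit matrix is a single indexing step.
rebuilt <- uniq[, remap, drop = FALSE]
stopifnot(all(rebuilt == full))
```

Algorithms that only need sums over samples can work on `uniq` and `counts`, which is where the savings come from when many documents are identical.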
getDuplicateData(prismaData)
prismaData |
prisma data loaded via loadPrismaData |
dataWithDuplicates |
Data matrix containing explicit copies of all duplicates. |
Tammo Krueger <[email protected]>
data(asap)
dataWithDuplicates = getDuplicateData(asap)
Given a matrix factorization object, this function returns for each document the index of the inner dimension which has the maximal coordinate. Thus, it converts the fuzzy clustering found in the columns of the $C$ matrix into a hard clustering by returning the position with the maximal coordinate value.
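The conversion itself can be sketched in base R. The coordinate matrix below is made up for illustration (one row per inner dimension, one column per document); how the package stores the coordinates inside a prismaMF object is not shown here.

```r
# Hypothetical coordinate matrix: 3 inner dimensions x 4 documents,
# filled column by column.
coords <- matrix(c(0.9, 0.1, 0.0,
                   0.2, 0.7, 0.1,
                   0.1, 0.3, 0.6,
                   0.5, 0.4, 0.1), nrow = 3)

# Hard label per document: row index of the largest coordinate in its column.
labels <- apply(coords, 2, which.max)
labels   # 1 2 3 1
```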
getMatrixFactorizationLabels(prismaMF)
prismaMF |
a matrix factorization object. |
labels |
vector containing the label assignment for each document. |
Tammo Krueger <[email protected]>
Loads files generated by the sally tool (see http://www.mlsec.org/sally/) and represents the data as a binary token/n-gram x documents matrix. After loading, statistical tests are applied to find features which are neither volatile nor constant. Co-occurring features are grouped to further compact the data. See system.file("extdata","sallyPreprocessing.py", package="PRISMA") for a Python script which generates the corresponding .fsally file from a .sally file, which reduces the loading time via loadPrismaData considerably.
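The shape of the resulting data can be illustrated with a small base R sketch. It does not read sally files and does not reproduce the package's statistical tests or correlation analysis: the toy documents are invented, constant features are dropped with a simple exact rule, and only features with identical occurrence patterns are grouped.

```r
# Hypothetical tokenized documents (invented for illustration).
docs <- list(c("GET", "index", "HTTP"),
             c("GET", "about", "HTTP"),
             c("POST", "login", "HTTP"),
             c("GET", "index", "HTTP"))
tokens <- sort(unique(unlist(docs)))

# Binary token x documents matrix: 1 if the token occurs in the document.
X <- sapply(docs, function(d) as.integer(tokens %in% d))
rownames(X) <- tokens

# Simplified stand-in for the feature tests: drop features that occur in
# every document or in none ("HTTP" is constant here).
X <- X[rowSums(X) > 0 & rowSums(X) < ncol(X), , drop = FALSE]

# Simplified stand-in for the grouping step: features with identical
# occurrence patterns end up in the same group.
pattern <- apply(X, 1, paste, collapse = "")
groups  <- split(rownames(X), pattern)
```

Here "POST" and "login" always occur together, so they form one group and could be represented by a single row.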
loadPrismaData(path, maxLines = -1, fastSally = TRUE, alpha = 0.05, skipFeatureCorrelation=FALSE)
path |
path of the data file without the .sally extension. loadPrismaData loads path.sally or path.fsally depending on the fastSally switch. |
maxLines |
maximal number of lines to read from the data file. -1 means to read all lines. |
fastSally |
should the fsally file be used, which drastically decreases loading time. |
alpha |
significance level for the feature tests. If NULL, all features are kept. |
skipFeatureCorrelation |
should the grouping of features based on correlation analysis be skipped. |
prismaData |
data object representing the tokenized documents as features x samples matrix. |
Tammo Krueger <[email protected]>
See http://www.mlsec.org/sally/ for the sally utility.
# please see the vignette for examples
# please see system.file("extdata","asap.tar.gz", package="PRISMA") for
# an example sally output
Print and plot generic for the PRISMA objects.
## S3 method for class 'prisma'
print(x, ...)
## S3 method for class 'prisma'
plot(x, ...)
x |
PRISMA data loaded via loadPrismaData |
... |
not used |
Tammo Krueger <[email protected]>
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
data(asap)
print(asap)
plot(asap)
Print and plot generic for the PRISMA dimension objects.
## S3 method for class 'prismaDimension'
print(x, ...)
## S3 method for class 'prismaDimension'
plot(x, ...)
x |
PRISMA dimension object generated via estimateDimension |
... |
not used |
Tammo Krueger <[email protected]>
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
# please see the vignette for examples
Print and plot generic for the PRISMA matrix factorization objects.
## S3 method for class 'prismaMF'
plot(x, nLines = NULL, baseIndex = NULL, sampleIndex = NULL, minValue = NULL,
     noRowClustering = FALSE, noColClustering = FALSE,
     type = c("base", "coordinates"), ...)
x |
PRISMA matrix factorization object |
nLines |
number of lines that should be plotted |
baseIndex |
which bases should be plotted |
sampleIndex |
which samples should be plotted |
minValue |
cut-off value, i.e., every value smaller than |
noRowClustering |
don't cluster the rows |
noColClustering |
don't cluster the columns |
type |
show the base ( |
... |
not used |
Tammo Krueger <[email protected]>
estimateDimension, prismaHclust, prismaDuplicatePCA, prismaNMF
# please see the vignette for examples
Efficient implementation of a replicate-aware principal component analysis (PCA).
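Why replicate information helps can be shown with a short base R sketch. It is an illustration under assumed names (`U` for the unique columns, `w` for their multiplicities), not the package's code: the covariance of the full matrix is obtained from the unique columns alone by weighting each one with its count.

```r
# Hypothetical unique columns U (features x unique samples) with counts w.
U <- cbind(c(1, 0, 1), c(0, 1, 1), c(1, 1, 0))
w <- c(3, 2, 1)
full <- U[, rep(seq_along(w), w)]      # explicit matrix with duplicates

# Weighted mean and covariance computed from the unique columns only.
mu    <- as.vector(U %*% w) / sum(w)
Uc    <- U - mu                        # subtract the per-feature mean
cov_w <- Uc %*% (w * t(Uc)) / sum(w)   # Uc diag(w) Uc^T / N

# The same covariance computed from the explicit matrix.
Fc       <- full - rowMeans(full)
cov_full <- tcrossprod(Fc) / ncol(full)
stopifnot(isTRUE(all.equal(cov_w, cov_full)))

# Principal directions follow from the eigendecomposition.
pcs <- eigen(cov_w, symmetric = TRUE)$vectors
```

The weighted version touches three columns instead of six; with many duplicated documents the difference grows accordingly.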
prismaDuplicatePCA(prismaData)
prismaData |
PRISMA data for which a PCA should be calculated |
prismaPCA |
Matrix factorization object $A = B C$, in which the factors are calculated by a replicate-aware PCA |
Tammo Krueger <[email protected]>
# please see the vignette for examples
A matrix factorization based on the results of hclust is constructed, which holds the mean feature values for each cluster in the matrix $B$ and the indication of the cluster in the matrix $C$ for each data point (i.e., each data point is represented by its assigned cluster center).
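The construction can be sketched with base R's hclust and cutree. This is an illustration on synthetic data rather than the package's code: the names `centers` and `indicator` stand for the two factors (cluster means and cluster membership), and the Euclidean distance is an assumption.

```r
# Synthetic features x samples matrix with two well separated groups.
set.seed(42)
A <- cbind(matrix(rnorm(4 * 5, mean = 0), 4),
           matrix(rnorm(4 * 5, mean = 10), 4))
ncomp <- 2

# dist() works on rows, so the samples have to be the rows.
tree <- hclust(dist(t(A)), method = "single")
cl   <- cutree(tree, k = ncomp)

# First factor: mean feature vector of every cluster (features x ncomp).
centers <- sapply(seq_len(ncomp),
                  function(j) rowMeans(A[, cl == j, drop = FALSE]))
# Second factor: 0/1 cluster indicator (ncomp x samples).
indicator <- outer(seq_len(ncomp), cl, "==") * 1

# Every sample is replaced by the center of its cluster.
approx <- centers %*% indicator
```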
prismaHclust(prismaData, ncomp, method = "single")
prismaData |
PRISMA data for which a clustering should be calculated. |
ncomp |
the number of components that should be extracted. |
method |
the method used for clustering. |
prismaHclust |
Matrix factorization object containing |
Tammo Krueger <[email protected]>
# please see the vignette for examples
Matrix factorization $A = B C$ with strictly positive matrices $B$ and $C$ which minimize the reconstruction error $\|A - B C\|$. This replicate-aware version of the non-negative matrix factorization (NMF) is based on the alternating least squares approach and exploits the replicate information to speed up the calculation.
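A plain alternating least squares iteration can be sketched in base R as follows. It is a simplified illustration, not the package's algorithm: it ignores replicate information, uses a fixed iteration count instead of a time budget, starts from a random base instead of a PCA initialization, and adds a tiny ridge term (an assumption) to keep the linear systems solvable.

```r
# Synthetic non-negative data of rank 2 (features x samples).
set.seed(7)
f <- 8; N <- 30; ncomp <- 2
A <- matrix(runif(f * ncomp), f) %*% matrix(runif(ncomp * N), ncomp)

ridge <- 1e-9 * diag(ncomp)          # numerical safeguard, an assumption
B <- matrix(runif(f * ncomp), f)     # random non-negative start
for (iter in 1:100) {
  # Least-squares update of the coordinates for fixed B, negatives set to 0.
  C <- pmax(solve(crossprod(B) + ridge, crossprod(B, A)), 0)
  # Least-squares update of the base for fixed C, negatives set to 0.
  B <- pmax(t(solve(tcrossprod(C) + ridge, tcrossprod(C, A))), 0)
}

err <- norm(A - B %*% C, "F")        # Frobenius reconstruction error
```

Clipping negative entries after each unconstrained solve is the simplest way to keep both factors non-negative; it trades exactness of each sub-step for speed.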
prismaNMF(prismaData, ncomp, time = 60, pca.init = TRUE, doNorm = TRUE, oldResult = NULL)
prismaData |
PRISMA data for which a NMF should be calculated. |
ncomp |
either an |
time |
seconds after which the calculation should end. |
pca.init |
should the |
doNorm |
should the |
oldResult |
re-use results of a previous run, i.e. |
prismaNMF |
Matrix factorization object containing the |
Tammo Krueger <[email protected]>
Krueger, T., Gascon, H., Kraemer, N., Rieck, K. (2012) Learning Stateful Models for Network Honeypots. 5th ACM Workshop on Artificial Intelligence and Security (AISEC 2012), accepted.
R. Albright, J. Cox, D. Duling, A. Langville, and C. Meyer. (2006) Algorithms, initializations, and convergence for the nonnegative matrix factorization. Technical Report 81706, North Carolina State University
# please see the vignette for examples
The 15 sections of a thesis (see references) as a tm-corpus.
thesis
A tm-corpus.
Tammo Krueger <[email protected]>
Tammo Krueger. Probabilistic Methods for Network Security. From Analysis to Response. PhD thesis, TU Berlin, 2013. http://opus.kobv.de/tuberlin/volltexte/2013/3881/