Title: | IBM in-Database Analytics for R |
---|---|
Description: | Functionality required to efficiently use R with IBM(R) Db2(R) Warehouse offerings (formerly IBM dashDB(R)) and IBM Db2 for z/OS(R) in conjunction with IBM Db2 Analytics Accelerator for z/OS. Many basic and complex R operations are pushed down into the database, which removes the main memory boundary of R and allows to make full use of parallel processing in the underlying database. For executing R-functions in a multi-node environment in parallel the idaTApply() function requires the 'SparkR' package (<https://spark.apache.org/docs/latest/sparkr.html>). The optional 'ggplot2' package is needed for the plot.idaLm() function only. |
Authors: | Shaikh Quader [aut, cre], Toni Bollinger [aut], Ming-pang Wei [aut], Alexander Eckert [aut], Michael Wurst [aut], Craig Blaha [ctb] (documentation), IBM Corporation [cph] |
Maintainer: | Shaikh Quader <[email protected]> |
License: | GPL-3 |
Version: | 1.51.0 |
Built: | 2024-12-18 06:54:39 UTC |
Source: | CRAN |
In-database analytics functions operate directly on data in a database, rather than requiring that the data first be extracted to working memory. This lets you analyze large amounts of data that would be impractical or impossible to extract. It also avoids security issues associated with extracting data, and ensures that the data being analyzed is as current as possible. Some functions additionally use lazy loading to load only those parts of the data that are actually required, to further increase efficiency.
This package also contains a data structure called a ida.list
, which you
can use to store R objects in the database. This simplifies the sharing
of R objects among users. Each user is assigned two tables for R object
storage: a private table, to which only that user has access, and a
public table, which can be read by other users. Use a IDA list to generate
a pointer to either of these tables, and use the pointer to list, store,
or retrieve R objects.
This function creates an IDA data frame ida.data.frame
from a local R data.frame
by creating a table in the database.
as.ida.data.frame(x, table=NULL, clear.existing=FALSE, case.sensitive=TRUE, rownames=NULL, dbname=NULL, asAOT=FALSE)
as.ida.data.frame(x, table=NULL, clear.existing=FALSE, case.sensitive=TRUE, rownames=NULL, dbname=NULL, asAOT=FALSE)
x |
The name of the input object that is to be converted to a IDA data frame. |
table |
The name of the database table that is to be created to hold the contents of the IDA data frame. The specified name is folded to uppercase. If this parameter is not specified, a name is generated automatically. |
clear.existing |
If the |
case.sensitive |
If the |
rownames |
The name of the column for the unique row id. If the value of this parameter is NULL, this column is not added to the output table. |
dbname |
DB2 for z/OS only parameter: the name of the database where the table should be created in. |
asAOT |
DB2 for z/OS only parameter: the table should be created as an "accelerator only table". |
A IDA data frame that points to the newly created table.
## Not run: #Add an ID column to iris iris2 <- iris iris2$ID <- 1:150 #Upload it and overwrite if already exists idf <- as.ida.data.frame(iris2,"IRIS",clear.existing=T) ## End(Not run)
## Not run: #Add an ID column to iris iris2 <- iris iris2$ID <- 1:150 #Upload it and overwrite if already exists idf <- as.ida.data.frame(iris2,"IRIS",clear.existing=T) ## End(Not run)
ida.col.def
objects are used to define new columns of a ida.data.frame
based
on existing ones.
For details see the documentation of ida.data.frame
.
ida.data.frame
objects provide many methods that will behave exactly like or very
similar to methods defined on a regular data.frame
. The following is a list of
currently supported methods: as.data.frame, sd, max, mean, min, length, print, names, colnames,
summary, NROW, NCOL, dim, var, head, hist, cor, cov. Furthermore, the $ and [] operators allow you
to select columns and rows and the $<- operator will allow you to add columns. For details see
the documentation of ida.data.frame
.
This function creates an IDA data frame (that is, an object of the class ida.data.frame
).
It does not store any data in local memory, but aggregates metadata used to determine the
exact table subset (columns - SELECT
clause; and/or rows -
WHERE
clause) and creates a pointer to a table located in the database.
ida.data.frame(table) is.ida.data.frame(x)
ida.data.frame(table) is.ida.data.frame(x)
table |
Name of a table or view in the current database. |
x |
An |
The argument table must be a valid table or view name and the table/view must exist.
If schema or table are set in quotes, they will be treated case sensitive otherwise they are automatically converted to the default schema of the database. Columns are always treated case sensitive.
A subset of columns and/or rows may be specified
using the indexing operator []
(which is translated to the
SELECT
clause for columns and/or the WHERE
clause for
rows). Note that columns are treated case sensitive.
One limitation is that rows cannot be selected using their numbers.
Instead, you must specify value-based conditions, for example d[d$ID > 10,]
which means “all rows where the value of the first column is
greater than 10”. The $
operator may be also used to select
an ida.data.frame
column.
You can also add and alter columns in an ida.data.frame. Currently, a limited set of functions and operators is supported to define columns based on other columns. The following is supported:
Arithmetic operators are +,-,/,*,^
Mathematical functions are abs, sqrt, log, log10, exp, floor, round, ceiling
Casting functions: as.numeric, as.integer, as.character
Comparison and logical operators: <,<=,>,>=,!=,==,!,&,|
Conditional functions: ifelse
Special functions: is.db.null (checks whether column value is NULL in the table)
There are several rules for adding columns:
You can not combine columns from different tables or from ida.data.frames that have different WHERE conditions.
You cannot add a column to an ida.data.frame that was defined on columns from another ida.data.frame
You can only add columns that evaluate to non-logical, atomic values
The package does basic type checking to enforce these rules, however, it is still possible that the database will refuse a query that was not properly defined.
is.ida.data.frame
checks if the given object's class is ida.data.frame
.
ida.data.frame
returns an IDA data frame.
is.ida.data.frame
returns a logical value that indicates
whether the specified object is an IDA data frame.
## Not run: idf <- ida.data.frame('IRIS') is.ida.data.frame(idf) #Select only certain rows or columns #The following creates an ida.data.frame that only selects rows with #Species=='setosa' and the first three columns of the table idf2 <- idf[idf$Species=='setosa',1:3] #Define new columns based on existing ones idf$SepalLengthX <- idf$SepalLength+1 idf$SepalLengthY <- ifelse(idf$SepalLengthX>4.5,idf$ID,10) #Take a look at the newly defined columns head(idf) ## End(Not run)
## Not run: idf <- ida.data.frame('IRIS') is.ida.data.frame(idf) #Select only certain rows or columns #The following creates an ida.data.frame that only selects rows with #Species=='setosa' and the first three columns of the table idf2 <- idf[idf$Species=='setosa',1:3] #Define new columns based on existing ones idf$SepalLengthX <- idf$SepalLength+1 idf$SepalLengthY <- ifelse(idf$SepalLengthX>4.5,idf$ID,10) #Take a look at the newly defined columns head(idf) ## End(Not run)
A user can elect to store R objects in a database table rather than storing them in a workstation file system. This makes it easier for users to share objects, and simplifies backup tasks.
Each user has two R object storage tables:
A private table, for objects that other users are not to be able to access
A public table, for objects that other users are to be able to read
Use the ida.list
function to create a pointer to either of your own R object storage tables,
or to the public R object storage table of another user. You can then use the pointer to store
objects in or retrieve objects from the corresponding table. (If the table belongs to another user
you can only retrieve objects from it, not store objects in it.)
Please note that whether public tables might not have effect on databases that do not allow to set permissions accordingly, for instance, in multi-tenant environments. To enable the sharing of objects in DB2, an administrator needs to first create a role names R_USERS_PUBLIC and assign it to all users who should be allowed to share objects. For Db2, roles will be setup automatically if in the scope of the plan.
ida.list(type='public',user=NULL)
ida.list(type='public',user=NULL)
type |
The type (private or public) of the table. You can specify 'private' only if |
user |
The user ID of the owner of the R object storage table. If set to NULL, the user ID is that of the current user. The user ID is treated case-sensitive. |
A pointer to an R object storage table.
## Not run: # Create a pointer to the private R object storage table of the current user. myPrivateObjects <- ida.list(type='private') # Use the pointer created in the previous example to store a series of numbers in an object with # the name 'series100' in the private R object storage table of the current user. myPrivateObjects['series100'] <- 1:100 # Retrieve the object with the name 'series100' from the # private R object storage table of the current user. x <- myPrivateObjects['series100'] # Delete the object with name 'series100' from the # private R object storage table of the current user. myPrivateObjects['series100'] <- NULL # List all objects in the private R object storage table of the current user. names(myPrivateObjects) # Return the number of objects in the private R object storage table of the current user. length(myPrivateObjects) # Create a pointer to the public R object storage table of the current user. myPublicObjects <- ida.list(type="public") ## End(Not run)
## Not run: # Create a pointer to the private R object storage table of the current user. myPrivateObjects <- ida.list(type='private') # Use the pointer created in the previous example to store a series of numbers in an object with # the name 'series100' in the private R object storage table of the current user. myPrivateObjects['series100'] <- 1:100 # Retrieve the object with the name 'series100' from the # private R object storage table of the current user. x <- myPrivateObjects['series100'] # Delete the object with name 'series100' from the # private R object storage table of the current user. myPrivateObjects['series100'] <- NULL # List all objects in the private R object storage table of the current user. names(myPrivateObjects) # Return the number of objects in the private R object storage table of the current user. length(myPrivateObjects) # Create a pointer to the public R object storage table of the current user. myPublicObjects <- ida.list(type="public") ## End(Not run)
ida.list
objects provide methods that will behave exactly like or very
similar to methods defined on a regular list
. The following methods are currently supported: length, names, print.
For details see the documentation of ida.list
.
This function calculates association rules on a database table.
idaArule( data, tid, item, maxlen=5, maxheadlen=1, minsupport=NULL, minconf=0.5, nametable=NULL, namecol=NULL, modelname=NULL ) idaApplyRules(modelname, newdata, tid, item, nametable=NULL, namecol=NULL, ...)
idaArule( data, tid, item, maxlen=5, maxheadlen=1, minsupport=NULL, minconf=0.5, nametable=NULL, namecol=NULL, modelname=NULL ) idaApplyRules(modelname, newdata, tid, item, nametable=NULL, namecol=NULL, ...)
data |
An |
tid |
Input table column that identifies the transactions by an id. |
item |
Input table column that identifies items in transactions. |
maxlen |
The maximum length of a rule. Must be two or larger. |
maxheadlen |
The maximum length of the rule head. |
minsupport |
The minimal support of a rule to be considered. |
minconf |
The minimal confidence of a rule to be considered. |
nametable |
A database table containing a mapping between the items in the input table and their name. The table must contain at least two columns, the first column is named as the column indicated in the item parameter and the second column is named as indicated in parameter namecol. |
namecol |
The name of the column containing the item name in case |
modelname |
The name of the model in-database. If NULL, it is automatically generated. |
newdata |
A table to which to apply the rules. |
... |
Additional stored procedure parameters. |
idaArule
finds association rules in transactional data. The input data must be in transactional format, thus each
row of the table contains exactly one item and an identifier of which transaction this item is assigned to. These two
columns need to be specified using the tid
and item
parameters. If the items are referred to with numeric IDs in the
transaction table, it is often useful to add a name mapping to produce rules that contain names instead of item IDs. This can be
achieved by setting the parameters nametable
and namecol
.
Models are stored persistently in database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
idaApplyRules
applies a rule model stored in the database to a table with transactions.
idaArule
returns an object of class rules
compatible with the packages arules
and arulesViz
idaApplyRules
returns an object of class ida.data.frame
, pointing to a table that contains a mapping between transaction IDs and matched
rules.
## Not run: idf <- ida.data.frame("GOSALES.ORDER_DETAILS") r <- idaArule(idf,tid="ORDER_NUMBER",item="PRODUCT_NUMBER",minsupport=0.01) inspect(r) applyResult <- idaApplyRules(idaGetModelname(r),idf,"ORDER_NUMBER","PRODUCT_NUMBER") ## End(Not run)
## Not run: idf <- ida.data.frame("GOSALES.ORDER_DETAILS") r <- idaArule(idf,tid="ORDER_NUMBER",item="PRODUCT_NUMBER",minsupport=0.01) inspect(r) applyResult <- idaApplyRules(idaGetModelname(r),idf,"ORDER_NUMBER","PRODUCT_NUMBER") ## End(Not run)
These functions are used to open or close an existing IDA database connection.
idaConnect(dsn, uid = "", pwd = "", conType = "odbc", dsnLookup = c("auto", "default", "store"), ...) idaClose(idaConn, conType = "odbc")
idaConnect(dsn, uid = "", pwd = "", conType = "odbc", dsnLookup = c("auto", "default", "store"), ...) idaClose(idaConn, conType = "odbc")
dsn |
The DSN of the data base. |
uid |
The user name. |
pwd |
The password. |
conType |
The connection type. |
dsnLookup |
This parameter only is used when ibmdbR is loaded in an RStudio instance of IBM Data Science Experience. Per default ("auto") ibmdbR automatically detects if the provided dsn value is an usual DSN string or the name of a connection in the local connection store. The lookup method can also be manually set to "default", if the dsn parameter should be treated as an usual DSN string. If set to "store" the connection store of RStudio on DSX will be used. |
... |
Additional arguments for DSN lookup. |
idaConn |
The connection object. |
Opens or closes a connection to a database. Currently, RODBC is used as underlying library, this might change, however, in the future.
## Not run: #Connect locally con <- idaConnect('BLUDB','','') #Close the connection idaClose(con) ## End(Not run)
## Not run: #Connect locally con <- idaConnect('BLUDB','','') #Close the connection idaClose(con) ## End(Not run)
Use these functions to create or drop a view that is based on a ida.data.frame
.
idaCreateView(x, newColumn = NULL) idaDropView(v)
idaCreateView(x, newColumn = NULL) idaDropView(v)
x |
|
newColumn |
The expression specifying the column to be added. |
v |
Name of the view to be dropped. |
The idaCreateView
function creates a view from the specified IDA data frame.
The idaDropView
function drops the specified view.
The idaCreateView
function returns the view name.
The idaDropView
function does not return a value.
## Not run: idf <- ida.data.frame('IRIS') #Create a view based on the IDA data frame vname <- idaCreateView(idf) #Drop the view idaDropView(vname) ## End(Not run)
## Not run: idf <- ida.data.frame('IRIS') #Create a view based on the IDA data frame vname <- idaCreateView(idf) #Drop the view idaDropView(vname) ## End(Not run)
These functions simplify working with database tables.
idaAppend(df, table) idaDeleteTable(table) idaExistTable(tableName) idaIsView(tableName) idaGetValidTableName(prefix="DATA_FRAME_")
idaAppend(df, table) idaDeleteTable(table) idaExistTable(tableName) idaIsView(tableName) idaGetValidTableName(prefix="DATA_FRAME_")
df |
A |
table |
The name of a database table or an |
tableName |
The name of a database table. |
prefix |
Keyword used to specify the prefix of a table name. |
Use the idaAppend
function to append a data.frame
to the specified table.
Use the idaDeleteTable
function to drop the specified table.
The specified table must exist in the current database.
Use the idaExistTable
function to determine whether the specified table exists in the database.
Use the idaGetValidTableName
function to obtain a table name that is not yet in use. This name will be the specified or default prefix followed by a number, for example, data_frame_7.
The idaDeleteTable
function does not return a value.
The idaExistTable
function returns a logical value (TRUE or FALSE) that indicates
whether the specified table exists in the database.
The idaGetValidTableName
function returns a string representing a table name.
## Not run: #Check whether a table with a given name exists idaExistTable('IRIS') #Create a pointer to the table idf <- ida.data.frame('IRIS') #Obtain a unique table name for a copy copyTableName <- idaGetValidTableName(prefix = "COPY_") #Create a copy of the original table idfCopy <- as.ida.data.frame(as.data.frame(idf),copyTableName) #Delete the copy again idaDeleteTable(copyTableName) ## End(Not run)
## Not run: #Check whether a table with a given name exists idaExistTable('IRIS') #Create a pointer to the table idf <- ida.data.frame('IRIS') #Obtain a unique table name for a copy copyTableName <- idaGetValidTableName(prefix = "COPY_") #Create a copy of the original table idfCopy <- as.ida.data.frame(as.data.frame(idf),copyTableName) #Delete the copy again idaDeleteTable(copyTableName) ## End(Not run)
These functions allow to query, store and update data in the database.
Usually, it is easier to use idaQuery
,ida.data.frame
and as.ida.data.frame
instead of these methods.
They can be useful, however, if an explicit connection object is needed, e.g. if there are several connections to different databases.
idadf(idaConn, query) idaSave(idaConn, dfrm, tblName = "", rowName = "", conType = "odbc") idaUpdate(db2Conn, updf, dfrm, idaIndex = "", conType = "odbc")
idadf(idaConn, query) idaSave(idaConn, dfrm, tblName = "", rowName = "", conType = "odbc") idaUpdate(db2Conn, updf, dfrm, idaIndex = "", conType = "odbc")
idaConn |
The IDA connection object. |
db2Conn |
The IDA connection object. |
query |
A query. |
dfrm |
A data.frame to store. |
tblName |
Name of the table to which to store the data. |
rowName |
Name of the row name column. |
updf |
Name of the table to update. |
idaIndex |
Name of the index column. |
conType |
Type of the connection. |
idadf
,idaSave
and idaUpdate
are simple wrappers around the RODBC functions sqlQuery
,
sqlSave
and sqlUpdate
.
Usually, it is easier to use idaQuery
,ida.data.frame
and as.ida.data.frame
instead of these methods.
idaQuery
,ida.data.frame
,as.ida.data.frame
,sqlQuery
,sqlSave
,sqlUpdate
## Not run: # create connection to DB con <- idaConnect("BLUDB", "", "") # create data.frame from table df <- idadf(con, "SELECT * FROM IRIS") # close the connection again idaClose(con) ## End(Not run)
## Not run: # create connection to DB con <- idaConnect("BLUDB", "", "") # create data.frame from table df <- idadf(con, "SELECT * FROM IRIS") # close the connection again idaClose(con) ## End(Not run)
This function generates a hierarchical (divisive) clustering model
based on the contents of an IDA data frame (ida.data.frame
) by applying recursively the K-means algorithm.
idaDivCluster( data, id, distance="euclidean", maxiter=5, minsplit=5, maxdepth=3, randseed=12345, outtable=NULL, modelname=NULL ) ## S3 method for class 'idaDivCluster' print(x,...) ## S3 method for class 'idaDivCluster' predict(object, newdata, id,...)
idaDivCluster( data, id, distance="euclidean", maxiter=5, minsplit=5, maxdepth=3, randseed=12345, outtable=NULL, modelname=NULL ) ## S3 method for class 'idaDivCluster' print(x,...) ## S3 method for class 'idaDivCluster' predict(object, newdata, id,...)
data |
An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row. |
id |
The name of the column that contains a unique ID for each row of the input data. |
distance |
The distance function that is to be used. This can be set to |
maxiter |
The maximum number of iterations to perform in the base K-means Clustering algorithm |
minsplit |
The minimum number of instances per cluster that can be split. |
maxdepth |
The maximum number of cluster levels (including leaves). |
randseed |
The seed for the random number generator. |
outtable |
The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically. |
modelname |
The name under which the model is stored in the database.
This is the name that is specified when using functions such as |
object |
An object of the class |
x |
An object of the class |
newdata |
An IDA data frame that contains the data to which to apply the model. |
... |
Additional parameters to pass to the print or predict method. |
The idaDivCluster clustering function builds a hierarchical clustering model by applying the K-means algorithm recursively in a top-down fashion. The hierarchy of clusters is represented in a binary tree structure (each parent node has exactly 2 child nodes). The leafs of the cluster tree are identified by negative numbers.
Models are stored persistently in the database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
The output of the print function for a idaDivCluster object is:
A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaDivCluster object.
The idaDivCluster function returns an object of class idaDivCluster
.
idaRetrieveModel
, idaDropModel
, idaListModels
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a DivCluster model stored in the database as DivClusterMODEL dcm <- idaDivCluster(idf, id="ID",modelname="DivClusterMODEL") #Print the model print(dcm) #Predict the model pred <- predict(dcm,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a DivCluster model stored in the database as DivClusterMODEL dcm <- idaDivCluster(idf, id="ID",modelname="DivClusterMODEL") #Print the model print(dcm) #Predict the model pred <- predict(dcm,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
Use this function to drop from the database a model that was created by using a function like idaNaiveBayes
, idaLm
, idaTree
, idaArule
or idaKMeans
.
idaDropModel(modelname)
idaDropModel(modelname)
modelname |
The name of the predictive model to be dropped. |
## Not run: #Drop the model with the name KMEANSMODEL idaDropModel("KMEANSMODEL"); ## End(Not run)
## Not run: #Drop the model with the name KMEANSMODEL idaDropModel("KMEANSMODEL"); ## End(Not run)
Use this function to get the name under which a model is stored in-database. This function can be applied to objects returned by
functions like idaNaiveBayes
, idaKMeans
or idaArule
.
idaGetModelname(object)
idaGetModelname(object)
object |
The object representing the model. |
The fully qualified name of the model, as stored in-database. This name is used, e.g. in conjunction with the idaRetrieveModel
or with the idaDropModel
function.
## Not run: #Get the name of a model stored in variable km modelname <- idaGetModelname(km) ## End(Not run)
## Not run: #Get the name of a model stored in variable km modelname <- idaGetModelname(km) ## End(Not run)
This function computes generalized linear models on the contents of an ida.data.frame
.
idaGlm( form, data, id = "id", intercept = T, family = "bernoulli", family_param = -1, link = "logit", link_param = 1, maxit = 20, eps = 1e-3, tol = 1e-7, method = "irls", trials = NULL, incolumn = "", interaction = "", modelname = NULL, format = "glm", raw.resid = F, dropAfter = F, ...) ## S3 method for class 'idaGlm' print(x, ...) ## S3 method for class 'idaGlm' predict(object, newdata, id, outtable = NULL, ...)
idaGlm( form, data, id = "id", intercept = T, family = "bernoulli", family_param = -1, link = "logit", link_param = 1, maxit = 20, eps = 1e-3, tol = 1e-7, method = "irls", trials = NULL, incolumn = "", interaction = "", modelname = NULL, format = "glm", raw.resid = F, dropAfter = F, ...) ## S3 method for class 'idaGlm' print(x, ...) ## S3 method for class 'idaGlm' predict(object, newdata, id, outtable = NULL, ...)
form |
A |
data |
An |
id |
The ID column name. |
intercept |
The intercept. |
family |
The type of error distribution. It can have one of the follwing values: "bernoulli", "gaussian", "poisson", "binomial", "negativebinomial", "wald", "gamma" |
family_param |
A family-specific parameter. |
link |
Type of the link function. It can have one of the follwoing values: |
"clog", "cloglog", "gaussit", "identity", "log", "logit", "oddspower", "power", "probit", and "sqrt". For Db2 for z/OS it can have the following values as well: "canbinom", "cangeom", "cannegbinom", "cauchit", "inverse", "invnegative", "invsquare", "loglog" .
link_param |
Link parameter, 1 by default. |
maxit |
Maximum number of iterations. 20 by default. |
eps |
Maximum (relative) error used as a stopping criterion. This should be sufficiently small for the algorithm to work. |
tol |
The tolerance for the linear equation solver to consider a value equal to be zero. This should be sufficiently small for the algorithm to work. |
method |
Computing algorithm: either |
trials |
The input table column containing the number of trials for the binominal distribution. Ignored unless family is 'binomial'. |
incolumn |
Overwrite automatic creation of incolumn parameter and specify your own incolumn here. |
interaction |
Overwrite automatic creation of interaction parameter and specify your own interaction here. |
modelname |
Name of the model that will be created in the database. Will be created automatically if not specified. |
format |
Specify output format. Either |
raw.resid |
If |
dropAfter |
Whether to drop the results after downloading them as specified in |
x |
An |
object |
An |
newdata |
New data used for prediction as |
outtable |
The name of the table the results will be written in. |
... |
Additional parameters. |
For more details on the GLM algorithm and requirements to the data
, please refer to
the documentation of the nza..GLM
stored procedure in the Netezza
In-Database Analytics Reference Guide or Netezza In-Database Analytics Developers Guide.
The function idaGlm
returns the generalized linear regression model of classes
glm
and idaGlm
if format
equals "glm"
or a list
of
data.frame
s if format
equals "raw"
.
The functions print
and summary
have no return values.
The function predict
returns an ida.data.frame
that contains the predicted values.
## Not run: #Add isSetosa column to iris data frame iris2 <- iris iris2$isSetosa <- ifelse(iris2$Species=="setosa", 1, 0) #Store the iris2 data frame in the IRIS2 table idf <-as.ida.data.frame(iris2, table="IRIS2", clear.existing=T, rownames="ID") #Calculate GLM model in-db glm <- idaGlm(isSetosa~PetalLength+SepalLength*SepalWidth+PetalWidth, idf, id="ID") #Print the model print(glm) #Apply the model to data idf2 <- predict(glm, idf, "ID") #Inspect the results head(idf2) ## End(Not run)
## Not run: #Add isSetosa column to iris data frame iris2 <- iris iris2$isSetosa <- ifelse(iris2$Species=="setosa", 1, 0) #Store the iris2 data frame in the IRIS2 table idf <-as.ida.data.frame(iris2, table="IRIS2", clear.existing=T, rownames="ID") #Calculate GLM model in-db glm <- idaGlm(isSetosa~PetalLength+SepalLength*SepalWidth+PetalWidth, idf, id="ID") #Print the model print(glm) #Apply the model to data idf2 <- predict(glm, idf, "ID") #Inspect the results head(idf2) ## End(Not run)
This function initializes the In-Database Analytics functions.
idaInit(con,jobDescription=NULL)
idaInit(con,jobDescription=NULL)
con |
An open RODBC connection. |
jobDescription |
Optional argument that allows to assign a description to the jobs submitted from the R session. |
Use an existing RODBC connection to initialize the IDA in-database analytics functions. All commands are sent through this connection.
No value is returned.
## Not run: #Initialize the IDA Analytics functions con <- idaConnect('BLUDB','','') #Initialize the in-database functionality idaInit(con) ## End(Not run)
## Not run: #Initialize the IDA Analytics functions con <- idaConnect('BLUDB','','') #Initialize the in-database functionality idaInit(con) ## End(Not run)
This function generates a k-means clustering model based on the contents of a IDA data frame (ida.data.frame
).
idaKMeans( data, id, k=3, maxiter=5, distance="euclidean", outtable=NULL, randseed=12345, statistics=NULL, modelname=NULL ) ## S3 method for class 'idaKMeans' print(x,...) ## S3 method for class 'idaKMeans' predict(object, newdata, id,...)
idaKMeans( data, id, k=3, maxiter=5, distance="euclidean", outtable=NULL, randseed=12345, statistics=NULL, modelname=NULL ) ## S3 method for class 'idaKMeans' print(x,...) ## S3 method for class 'idaKMeans' predict(object, newdata, id,...)
data |
An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row. |
id |
The name of the column that contains a unique ID for each row of the input data. |
k |
The number of clusters to be calculated. |
maxiter |
The maximum number of iterations to be used to calculate the k-means clusters. A larger number of iterations increases both the precision of the results and the amount of time required to calculate them. |
distance |
The distance function that is to be used. This can be set to |
outtable |
The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically. |
randseed |
The seed for the random number generator. |
statistics |
Denotes which statistics to calculate. Allowed values are |
modelname |
The name under which the model is stored in the database.
This is the name that is specified when using functions such as |
object |
An object of the class |
x |
An object of the class |
newdata |
A IDA data frame that contains the data to which to apply the model. |
... |
Additional parameters to pass to the print or predict method. |
The idaKMeans function calculates the squared Euclidean distance between rows, and groups them into clusters. Initial clusters are chosen randomly using a random seed, and the results are adjusted iteratively until either the maximum number of iterations is reached or until two iterations return identical results. Variables with missing values are set zero for distance calculation.
Models are stored persistently in database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
The output of the print function for a idaKMeans object is:
A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaKMeans object
The idaKMeans function returns an object of class idaKMeans
and kmeans
.
idaRetrieveModel
, idaDropModel
, idaListModels
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a kmeans model stored in the database as KMEANSMODEL km <- idaKMeans(idf, id="ID",modelname="KMEANSMODEL") #Print the model print(km) #Predict the model pred <- predict(km,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a kmeans model stored in the database as KMEANSMODEL km <- idaKMeans(idf, id="ID",modelname="KMEANSMODEL") #Print the model print(km) #Predict the model pred <- predict(km,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
Use these functions for DB2 for z/OS connections to retrieve the list of available accelerators and to set and get the current accelerator settings.
idaListAccelerators() idaSetAccelerator(acceleratorName, queryAcceleration="ENABLE") idaGetAccelerator() idaGetAcceleratorDetails()
idaListAccelerators() idaSetAccelerator(acceleratorName, queryAcceleration="ENABLE") idaGetAccelerator() idaGetAcceleratorDetails()
acceleratorName |
The name of the accelerator where the analytics functions (like idaKMeans or idaTree) are executed. |
queryAcceleration |
The value which the DB2 for z/OS register CURRENT QUERY ACCELERATION is set to. Possible values are "NONE", ENABLE", "ENABLE WITH FALLBACK", "ELIGIBLE" and "ALL" |
idaListAccelerators
returns a data frame that contains a list of the accelerators
available for the current DB2 for z/OS connection.
idaSetAccelerator
sets the accelerator to be used for the subsequent calls of the analytics functions.
idaGetAccelerator
retrieves the name of the accelerator.
idaGetAcceleratorDetails
retrieves the name of accelerator together with the encoding (like UNICODE or EBCDIC) of its data
and the value for query acceleration (like ENABLE or ELIGIBLE) and returns these values in a list object with elements "Accelerator", "Encoding"
and "QueryAcceleration".
## Not run: #Get a list of all accelerators q <- idaListAccelerators(); #Set accelerator to "MYACCEL" idaSetAccelerator("MYACCEL"); #Get name of current accelerator" idaGetAccelerator(); #Get name of current accelerator together with its encoding and query acceleration" idaGetAcceleratorDetails(); #Get encoding of current acccelerator idaGetAcceleratorDetails()$Encoding ## End(Not run)
## Not run: #Get a list of all accelerators q <- idaListAccelerators(); #Set accelerator to "MYACCEL" idaSetAccelerator("MYACCEL"); #Get name of current accelerator" idaGetAccelerator(); #Get name of current accelerator together with its encoding and query acceleration" idaGetAcceleratorDetails(); #Get encoding of current acccelerator idaGetAcceleratorDetails()$Encoding ## End(Not run)
Use these function to list all models in the schema of the current user that were created using the functions like idaNaiveBayes
or idaKMeans
or check whether a model with a specific name exists.
idaListModels() idaModelExists(modelname)
idaListModels() idaModelExists(modelname)
modelname |
The name of a predictive model. |
idaListModels
returns a data frame that contains a list of the predictive models that are stored
in the current schema and information about each one.
idaModelExists
returns a boolean value depending on whether the model exists or not.
## Not run: #Get a list of all models q <- idaListModels(); ## End(Not run)
## Not run: #Get a list of all models q <- idaListModels(); ## End(Not run)
This function performs linear regression on the contents of an ida.data.frame
.
idaLm(form, idadf, id = "id", modelname = NULL, dropModel = TRUE, limit = 25) ## S3 method for class 'idaLm' print(x, ...) ## S3 method for class 'idaLm' predict(object, newdata, id, outtable = NULL, ...) ## S3 method for class 'idaLm' plot(x, names = TRUE, max_forw = 50, max_plot = 15, order = NULL, lmgON = FALSE, backwardON = FALSE, ...)
idaLm(form, idadf, id = "id", modelname = NULL, dropModel = TRUE, limit = 25) ## S3 method for class 'idaLm' print(x, ...) ## S3 method for class 'idaLm' predict(object, newdata, id, outtable = NULL, ...) ## S3 method for class 'idaLm' plot(x, names = TRUE, max_forw = 50, max_plot = 15, order = NULL, lmgON = FALSE, backwardON = FALSE, ...)
form |
A |
idadf |
An ida.data.frame that contains the input data for the function. |
id |
The name of the column that contains a unique ID for each row of the input data. An id column needs to be specified, if a model contains categorical values, more than 41 columns or when dropModel is set to FALSE. If no valid id column was specified, a temporary id column will be used (not for DB2 for z/OS). |
modelname |
Name of the model that will be created in the database. |
dropModel |
logical: If TRUE the in database model will be dropped after the calculation. |
limit |
The maximum number of levels for a categorical column. Its default value is 25. This parameter only exists for consistency with older version of idaLm. |
x |
An object of the class |
object |
An object of the class |
newdata |
An ida.data.frame that contains data that will be predicted. |
outtable |
The name of the table where the results will be written in. |
names |
|
max_forw |
|
max_plot |
|
order |
Vector of attribute names. The method will calculate the value of the models with the attributes in the order of the vector and plot the value for each of it. |
lmgON |
|
backwardON |
|
... |
Additional parameters. |
The idaLm
function computes a linear regression model by extracting a covariance matrix and
computing its inverse. This implementation is optimized for problems that involve a large number of
samples and a relatively small number of predictors. The maximum number of columns is 78.
Missing values in the input table are ignored when calculating the covariance matrix. If this leads to undefined entries in the covariance matrix, the function fails. If the inverse of the covariance matrix cannot be computed (for example, due to correlated predictors), the Moore-Penrose generalized inverse is used instead.
The output of the idaLm function has the following attributes:
$coefficients is a vector with two values. The first value is the slope of the line that best fits the input data; the second value is its y-intercept.
$RSS is the root sum square (that is, the square root of the sum of the squares).
$effects is not used and can be ignored.
$rank is the rank.
$df.residuals is the number of degrees of freedom associated with the residuals.
$coefftab is a is a vector with four values:
The slope and y-intercept of the line that best fits the input data
The standard error
The t-value
The p-value
$Loglike is the log likelihood ratio.
$AIC is the Akaike information criterion. This is a measure of the relative quality of the model.
$BIC is the Bayesian information criterion. This is used for model selection.
$CovMat the Matrix used in the calculation ("Covariance Matrix"). This matrix is necessary for the Calculation in plot.idaLm and the statistics.
$card the number of dummy variables created for categorical columns and 1 for numericals.
$model the in database modelname of the idaLm object.
$numrow the number of rows of the input table that do not contain NAs.
$sigma the residual standard error.
The plot.idaLm
function uses as a measure of quality of a linear model.
compares the variance of the predicted values and the variance of the actual values
of the target variable.
$First: Returns the value of the linear model for each attribute alone.
$Usefulness: Returns the value reduction of the linear model with all
attributes to the linear model with one attribute taken away.
$Forward_Values: Is only calculated if backwardON=FALSE. This is a heuristic that adds in
each step the attribute which has the most increase.
$LMG: Is only calculated if lmgON=TRUE. It returns the increase of of
each attribute averaged over every possible permutation. By grouping some
of the permutations we only need to average over every possible subset.
For n attributes there are
subsets. So LMG is an algorithm with
exponential runningtime and is not recommended for more than 15
attributes.
$Backward_Values: Is only calculated if backwardON=TRUE. Similar to the forward heuristic.
This time we choose in each step of the algorithm that has minimal
reduction when taking it out of the model, starting with all
attributes.
$Model_Values: Is only calculated if order is a vector of attributes. In this case the
function calculates the value for the models that we get when
we add one attribute of order in each step.
RelImpPlot.png: If lmgON=FALSE. This plot shows a stackplot of the values Usefulness,First and the Model_Value of the heuristic. Note that usually Usefulness<First<Model_Value and that the bars overlap each other. If lmgON=TRUE. This plot shows the LMG values of the attributes in the order of the heuristic forward, backward or order.
The procedure returns a linear regression model of class idaLm
.
## Not run: #Create a pointer to table IRIS idf <- ida.data.frame("IRIS") #Calculate linear model in-db lm1 <- idaLm(SepalLength~., idf) library(ggplot2) plot(lm1) #Calculating linear models with categorical values requires an id column lm1 <- idaLm(SepalLength~., idf, id="ID") ## End(Not run)
## Not run: #Create a pointer to table IRIS idf <- ida.data.frame("IRIS") #Calculate linear model in-db lm1 <- idaLm(SepalLength~., idf) library(ggplot2) plot(lm1) #Calculating linear models with categorical values requires an id column lm1 <- idaLm(SepalLength~., idf, id="ID") ## End(Not run)
This function merges two IDA data frames(that is, two objects of the class ida.data.frame
).
idaMerge(x, y, by=intersect(x@cols, y@cols), by.x=by, by.y=by, all=FALSE, all.x=all, all.y=all, sort=TRUE, suffixes=c("_x", "_y"), table=NULL)
idaMerge(x, y, by=intersect(x@cols, y@cols), by.x=by, by.y=by, all=FALSE, all.x=all, all.y=all, sort=TRUE, suffixes=c("_x", "_y"), table=NULL)
x |
The first |
y |
The second |
by |
Specification of the common columns; see the Details section. |
by.x |
Specification of the common columns; see the Details section. |
by.y |
Specification of the common columns; see the Details section. |
all |
Whether non-matching columns of |
all.x |
If columns from only one of the IDA data frames being merged are to be included in the output,
set its corresponding parameter to TRUE and the other parameter to FALSE.
In SQL database terminology, specifying If If |
all.y |
Analogous to |
sort |
This parameter is ignored. The output is never sorted regardless of the setting of this parameter. |
suffixes |
Two 2-character strings, each of which specifies a suffix that is
used when generating column names.
By specifying different suffixes, you can ensure that each column
can be uniquely attributed to either |
table |
Name of the output IDA data frame. |
This function merges two IDA data frames on the columns that they have in common. The rows in the two data frames that match on the specified columns are extracted, and joined together. If there is more than one match, all possible matches contribute one row each. For the precise meaning of ‘match’.
If by
or both by.x
and by.y
are of length 0 (a
length zero vector or NULL
), the result, r
, is the
Cartesian product of x
and y
, that is, a cross join.
If non-merged columns of the data frames have identical names and are to be included in the output, suffixes are appended to the names of the corresponding columns in the output to make their names unique.
Note that this function creates, in the current database, a view that corresponds to the output object. Within the current session, this view can be accessed using the same IDA data frame object. However, it is persistent and, after it is no longer needed, it must be dropped manually.
A ida.data.frame
object.
## Not run: idf <- ida.data.frame('IRIS') #Perform a self-join idf2 <- idaMerge(idf,idf,by="ID") ## End(Not run)
## Not run: idf <- ida.data.frame('IRIS') #Perform a self-join idf2 <- idaMerge(idf,idf,by="ID") ## End(Not run)
This function generates a Naive Bayes classification model based on the contents of an IDA data frame (ida.data.frame
).
idaNaiveBayes(form,data,id="id",modelname=NULL) ## S3 method for class 'idaNaiveBayes' predict(object,newdata,id, withProbabilities=FALSE,...) ## S3 method for class 'idaNaiveBayes' print(x,...)
idaNaiveBayes(form,data,id="id",modelname=NULL) ## S3 method for class 'idaNaiveBayes' predict(object,newdata,id, withProbabilities=FALSE,...) ## S3 method for class 'idaNaiveBayes' print(x,...)
form |
A |
data |
An |
id |
The name of the column that contains unique IDs. |
modelname |
Name for the model. Will be created automatically unless specified otherwise. |
object |
An object of the class |
newdata |
An IDA data frame that contains the data to which to apply the model. |
withProbabilities |
A boolean value indicating if the probabilities for each class value are included in the result of the predict function. |
x |
An object of the class |
... |
Additional parameters to pass to the print and predict method. |
idaNaiveBayes
builds a Naive Bayes classification model, thus a model that assumes independence
of input variables with respect to the target variable.
Continuous input variables are discretized using equal width discretization. Missing values are ignored on a record and attribute level when calculating the conditional probabilities.
Models are stored persistently in the database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
The function idaNaiveBayes
returns an object of class "idaNaiveBayes"
and "naiveBayes"
compatible
with Naive Bayes objects produced by the e1071 package.
The predict.idaNaiveBayes
method applies the model to the data in a table and returns an IDA data frame that contains
a list of tuples, each of which comprises one row ID and one prediction.
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a naive bayes model nb <- idaNaiveBayes(Species~SepalLength,idf,"ID") #Print the model print(nb) #Apply the model to data idf2 <- predict(nb,idf,"ID") #Inspect the results head(idf2) ## End(Not run)
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a naive bayes model nb <- idaNaiveBayes(Species~SepalLength,idf,"ID") #Print the model print(nb) #Apply the model to data idf2 <- predict(nb,idf,"ID") #Inspect the results head(idf2) ## End(Not run)
Use these functions to run any SQL query on the database and put the
results into a data.frame
.
idaQuery(..., as.is=TRUE, na.strings = "NA") idaScalarQuery(..., as.is=TRUE)
idaQuery(..., as.is=TRUE, na.strings = "NA") idaScalarQuery(..., as.is=TRUE)
... |
Any number of query parts which are passed to |
as.is |
Specifies whether the result columns are to be converted using RODBC type conversions (as.is=FALSE) or left unconverted (as.is=TRUE).
For more information about RODBC type conversions, see the descriptions of the functions |
na.strings |
character vector of strings to be mapped to NA when reading character data. |
All parts of the input query are concatenated with paste(..., sep="")
and the result is passed to the database.
The idaQuery
function returns a data frame that contains the
result of the specified query.
The idaScalarQuery
function returns the result of the specified query coerced
to a single scalar value.
## Not run: #idaScalarQuery returns a single value v <- idaScalarQuery("SELECT COUNT(*) FROM IRIS") #idaQuery returns a data.frame df <- idaQuery("SELECT * FROM IRIS") #idaQuery and idaScalarQuery automatically paste all arguments into a single query #This is convenient if you use variables tableName <- "IRIS" df <- idaScalarQuery("SELECT COUNT(*) FROM ",tableName) ## End(Not run)
## Not run: #idaScalarQuery returns a single value v <- idaScalarQuery("SELECT COUNT(*) FROM IRIS") #idaQuery returns a data.frame df <- idaQuery("SELECT * FROM IRIS") #idaQuery and idaScalarQuery automatically paste all arguments into a single query #This is convenient if you use variables tableName <- "IRIS" df <- idaScalarQuery("SELECT COUNT(*) FROM ",tableName) ## End(Not run)
Use this function to retrieve from the database a model that was created using a function like idaNaiveBayes
or idaKMeans
.
idaRetrieveModel(modelname)
idaRetrieveModel(modelname)
modelname |
The name of the predictive model to be retrieved. |
This function returns an R object that contains a representation of the retrieved model. The class of the returned object depends on the function that was used to create the model.
## Not run: #Retrieve the model with name "MYKMEANSMODEL" from the database trCopy <- idaRetrieveModel("KMEANSMODEL"); ## End(Not run)
## Not run: #Retrieve the model with name "MYKMEANSMODEL" from the database trCopy <- idaRetrieveModel("KMEANSMODEL"); ## End(Not run)
This function draws a random sample from a IDA data frame (that is, an object of the class ida.data.frame
).
idaSample(bdf, n, stratCol=NULL,stratVals=NULL,stratProbs=NULL, dbPreSamplePercentage=100,fetchFirst=F);
idaSample(bdf, n, stratCol=NULL,stratVals=NULL,stratProbs=NULL, dbPreSamplePercentage=100,fetchFirst=F);
bdf |
The IDA data frame from which the sample is to be drawn. |
n |
The number of rows of sample data to be retrieved. |
stratCol |
For stratified sampling, the column that determines the strata. |
stratVals |
For stratified sampling, a vector of values that determine the subset of strata from which samples are to be drawn. |
stratProbs |
For stratified sampling, a vector of explicit sampling probabilities.
Each value corresponds to a value of the vector specified for |
dbPreSamplePercentage |
The percentage of the IDA data frame from which the sample is to be drawn (see details). |
fetchFirst |
Fetch first rows instead of using random sample. |
If stratCol
is specified, a stratified sample based on the contents of the specified column is taken.
Unless stratVals
is also specified, each unique value in the column results in one stratum.
If stratVals
is also specified, only the values it specifies result in strata, and only rows that contain one of those values are included in the
sample; other rows are ignored.
Unless stratProbs
is also specified, the number of rows retrieved for each stratum is proportional
to the size of that stratum relative to the overall sample.
To undersample or oversample data, use stratProbs
to specify, for each
value of stratVals
, the fraction of the rows of the corresponding stratum that are to be included in the sample.
For each stratum, the calculated number of rows is rounded up to the next highest integer. This ensures that there
is at least one sample for each stratum. Consequently, the number of samples that is returned might
be higher than the value specified for n
.
The value of dbPreSamplePercentage
is a numeric value in the range 0-100 that represents the percentage of the
IDA data frame that is to serve as the source of the sample data.
When working with an especially large IDA data frame, specifying a value smaller than 100 improves performance, because less data must be processed.
However, the proportionality of the pre-sampled data might vary from that of the complete data, and this would result in
a biased sample. It can even happen that entire strata are excluded from the final sample.
When fetchFirst
is set to TRUE, the sample values of each stratum are taken in the order in which they are
returned from the database rather than randomly. This is usually much faster than random sampling, but can introduce bias.
An object of class data.frame
that contains the sample.
## Not run: idf<-ida.data.frame('IRIS') #Simple random sampling df <- idaSample(idf,10) #Stratified sample df <- idaSample(idf,10,'Species') ## End(Not run)
## Not run: idf<-ida.data.frame('IRIS') #Simple random sampling df <- idaSample(idf,10) #Stratified sample df <- idaSample(idf,10,'Species') ## End(Not run)
Returns a data frame that contains the names of the tables contained in the current database.
idaShowTables(showAll=FALSE, matchStr=NULL, schema=NULL, accelerated=FALSE)
idaShowTables(showAll=FALSE, matchStr=NULL, schema=NULL, accelerated=FALSE)
showAll |
List all tables that are listed in the catalog of the current database (TRUE) or only those tables that are in the current schema (FALSE). |
matchStr |
If not NULL, only tables that contain the character string in this argument will be returned. |
schema |
If not NULL, only tables with this schema will be returned. This parameter is ignored if showAll=FALSE. |
accelerated |
Valid for DB2 for z/OS connections only. If TRUE, only accelerated tables will be returned. |
A data frame with the columns Schema, Name, Owner, and Type. For DB2 for z/OS connections the columns Acceleratorname and Enable are included as well.
## Not run: #Get a list of all tables in the current schema tabs <- idaShowTables() ## End(Not run)
## Not run: #Get a list of all tables in the current schema tabs <- idaShowTables() ## End(Not run)
Function used to build a contingency table of the counts at each combination of factor levels based on the contents of a IDA data frame (ida.data.frame
).
idaTable(idadf,max.entries=1000)
idaTable(idadf,max.entries=1000)
idadf |
A IDA data frame that contains the input data for the function. |
max.entries |
The maximum number of entries. If the cross product of all columns exceeds this number, an error will be thrown. |
idaTable
uses the cross-classifying factors to build a contingency table of the counts at each combination of categorical
values in all categorical columns of the ida.data.frame
passed as input.
The idaTable
function returns a contingency table, an object of class "table".
## Not run: #Create a pointer to the table IRIS idf<-ida.data.frame('IRIS') #Add a column idf$SepalLengthg4 <- ifelse(idf$SepalLength>4,'t','f') #Calculate the cross-table between Species and SepalLengthg4 idaTable(idf[,c('Species','SepalLengthg4')]) ## End(Not run)
## Not run: #Create a pointer to the table IRIS idf<-ida.data.frame('IRIS') #Add a column idf$SepalLengthg4 <- ifelse(idf$SepalLength>4,'t','f') #Calculate the cross-table between Species and SepalLengthg4 idaTable(idf[,c('Species','SepalLengthg4')]) ## End(Not run)
This function applies a R function to each subset (group of rows) of
a given IDA data frame (ida.data.frame
).
idaTApply(X, INDEX, FUN = NULL, output.name=NULL, output.signature=NULL, clear.existing=FALSE, debugger.mode=FALSE, num.tasks = 0, working.dir=NULL, apply.function="default", ...)
idaTApply(X, INDEX, FUN = NULL, output.name=NULL, output.signature=NULL, clear.existing=FALSE, debugger.mode=FALSE, num.tasks = 0, working.dir=NULL, apply.function="default", ...)
X |
A IDA data frame that contains the input data for the function. |
INDEX |
The name or the position of the column of the input IDA data frame X used to partition the input data into subsets. |
FUN |
The R function to be applied to the subsets of the input data. |
output.name |
The name of the output table where the results are written to. |
output.signature |
The Db2 data types of the output table. It is a named list with the column names as the names and the data types as the values. Supported data types are CHAR, VARCHAR, SMALLINT, INTEGER, BIGINT, FLOAT, REAL, DOUBLE, DECFLOAT, DECIMAL, NUMERIC, DATE |
clear.existing |
If TRUE the ouput table is dropped before recreating it. |
debugger.mode |
If TRUE intermediate results written into the working directory will not be removed. |
num.tasks |
The number of parallel tasks, i.e. R processes, which execute the R function on the subsets of the input data. If not specified or if the value is less than 1 it is calculated based on the number of available CPUs. |
working.dir |
The name of the directory where the directory is created into which intermediate results are written to. This directory is removed if debugger.mode is FALSE. The default value for working.directory is the value of the extbl_location Db2 database configuration variable or, if this variable has not been set, the home directory. |
apply.function |
The name of the R function to be used for parallelizing the execution of the calls of the function FUN. Possible values are "default", "spark.lapply" and "mclapply". If the value is "default" "spark-lapply" is used in a multi-node and "mclapply" in a single node environment. Please note that using the "spark.lapply" function requires Db2 Warehouse with integrated Spark. |
... |
Additional parameters that can be passed to the function FUN to be called by idaTApply. |
idaTApply
applies a user-provided R function to each subset (group of rows) of a given ida.data.frame.
The subsets are determined by a specified index column. The results of applying the function are written into a Db2 table
which is referenced by the returned ida.data.frame.
The idaTApply
function returns a ida.data.frame
.
## Not run: #Create an ida data frame from the iris data idf <- as.ida.data.frame(iris) #Define a function that computes the mean value for every column of a data frame x #except the index column. #It returns a data frame with the value of the index column and the mean values. columnMeans<- function(x, index) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index],2,mean))))} #Apply the columnMeans function to the subsets of the iris data identified by the Species column resSig <- list(Species="VARCHAR(12)", MSepalLength="DOUBLE", MSepalWidth="DOUBLE", MPetalLength="DOUBLE", MPetalWidth="DOUBLE") resDf <- idaTApply(idf, "Species", FUN=columnMeans, output.name="IRIS_MEANS", output.signature=resSig) #It is possible as well to apply an anonymous function. #The value "5" of the second parameter designates the position of the "Species" column #in the idf ida.data.frame. #The output table of the previous call is recreated because of the "clear.existing=T" parameter. resDf <- idaTApply(idf, 5, FUN=function(x, index) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index],2,mean))))}, output.name="IRIS_MEANS", output.signature=resSig, clear.existing=T) #Apply the columnMeans2 function which has an additional parameter "columns" #to specify the columns for which the mean values are computed columnMeans2 <- function(x, index, columns) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index & names(x) %in% columns],2,mean))))} petalColumns <- c("PetalLength", "PetalWidth") resSig2 <- list(Species="VARCHAR(12)", MPetalLength="DOUBLE", MPetalWidth="DOUBLE") resDf2 <- idaTApply(idf, "Species", FUN=columnMeans2, output.name="IRIS_MEANS2", output.signature=resSig2, clear.existing=T, columns=petalColumns) ## End(Not run)
## Not run: #Create an ida data frame from the iris data idf <- as.ida.data.frame(iris) #Define a function that computes the mean value for every column of a data frame x #except the index column. #It returns a data frame with the value of the index column and the mean values. columnMeans<- function(x, index) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index],2,mean))))} #Apply the columnMeans function to the subsets of the iris data identified by the Species column resSig <- list(Species="VARCHAR(12)", MSepalLength="DOUBLE", MSepalWidth="DOUBLE", MPetalLength="DOUBLE", MPetalWidth="DOUBLE") resDf <- idaTApply(idf, "Species", FUN=columnMeans, output.name="IRIS_MEANS", output.signature=resSig) #It is possible as well to apply an anonymous function. #The value "5" of the second parameter designates the position of the "Species" column #in the idf ida.data.frame. #The output table of the previous call is recreated because of the "clear.existing=T" parameter. resDf <- idaTApply(idf, 5, FUN=function(x, index) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index],2,mean))))}, output.name="IRIS_MEANS", output.signature=resSig, clear.existing=T) #Apply the columnMeans2 function which has an additional parameter "columns" #to specify the columns for which the mean values are computed columnMeans2 <- function(x, index, columns) { cbind(index=x[1,match(index, names(x))], as.data.frame(as.list(apply(x[,names(x) != index & names(x) %in% columns],2,mean))))} petalColumns <- c("PetalLength", "PetalWidth") resSig2 <- list(Species="VARCHAR(12)", MPetalLength="DOUBLE", MPetalWidth="DOUBLE") resDf2 <- idaTApply(idf, "Species", FUN=columnMeans2, output.name="IRIS_MEANS2", output.signature=resSig2, clear.existing=T, columns=petalColumns) ## End(Not run)
This function generates a tree model based on the contents of an IDA data frame (ida.data.frame
).
idaTree( form, data, id, minsplit=50, maxdepth=10, qmeasure=NULL, minimprove=0.01, eval=NULL, valtable=NULL, modelname=NULL) ## S3 method for class 'idaTree' plot(x,...) ## S3 method for class 'idaTree' predict(object, newdata, id, ...)
idaTree( form, data, id, minsplit=50, maxdepth=10, qmeasure=NULL, minimprove=0.01, eval=NULL, valtable=NULL, modelname=NULL) ## S3 method for class 'idaTree' plot(x,...) ## S3 method for class 'idaTree' predict(object, newdata, id, ...)
form |
A |
data |
An IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row. |
id |
The name of the column that contains a unique ID for each row of the input data. |
minsplit |
The minimum number of rows a node must contain to be split further. |
maxdepth |
The maximum depth (that is, the number of hierarchical levels) of the generated tree. |
qmeasure |
The measure that is to be used to prune the tree.
For a decision tree, allowed values are |
minimprove |
The minimum improvement. A node is not split further unless the split improves the class impurity by at least the amount specified for this parameter. |
eval |
The criterion that is to be used to calculate each split.
For a decision tree, allowed values are |
valtable |
When the output tree is to be pruned using external data, use this parameter to specify the fully-qualified name of the table that contains that data. Otherwise, specify NULL. |
modelname |
The name under which the model is stored in the database.
This is the name that is specified when using functions such as |
object |
An object of the class |
x |
An object of the class |
newdata |
A IDA data frame that contains the data to which to apply the model. |
... |
additional arguments to be passed to plot or predict. |
The idaTree function uses a top-down, iterative procedure to generate a decision-tree or regression-tree model, depending on the type of the target variable. The resulting model comprises a network of nodes and connectors, and each subnode is the endpoint of a binary split.
A node is not split further when any of the following are true:
The node has a uniform class (and therefore cannot be split further).
Additional splits do not improve the class impurity by at least the amount specified by minimprove
.
The number of rows contained by the node is less than the value specified by minsplit
.
The tree depth reaches the value specified by maxdepth
.
If variable that is used to determine a split does not have a value, the corresponding row remains in the node that is being split.
The output of the print function for a idaTree object is a textual description of the corresponding model.
The output of the plot function for a idaTree object is a graphical representation of the corresponding model.
Models are stored persistently in the database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
The predict.idaTree
method applies the model to the data in a table and returns a IDA data frame that contains
a list of tuples, each of which comprises one row ID and one prediction.
The idaTree function returns an object of classes idaTree
and rpart
.
idaRetrieveModel
, idaDropModel
, idaListModels
## Not run: #Create a pointer to the table IRIS idf <- ida.data.frame('IRIS') #Create a tree model tr <- idaTree(Species~.,idf,"ID",modelname="MYTREEMODEL") #Print the model print(tr) #Plot the model plot(tr) #Apply the model to data pred <- predict(tr,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
## Not run: #Create a pointer to the table IRIS idf <- ida.data.frame('IRIS') #Create a tree model tr <- idaTree(Species~.,idf,"ID",modelname="MYTREEMODEL") #Print the model print(tr) #Plot the model plot(tr) #Apply the model to data pred <- predict(tr,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
This function generates a two step clustering model based on the contents of an IDA data frame (ida.data.frame
).
idaTwoStep( data, id, k = 3, maxleaves = 1000, distance = "euclidean", outtable = NULL, randseed = 12345, statistics = NULL, maxk = 20, nodecapacity = 6, leafcapacity = 8, outlierfraction = 0, modelname = NULL) ## S3 method for class 'idaTwoStep' print(x,...) ## S3 method for class 'idaTwoStep' predict(object, newdata, id,...)
idaTwoStep( data, id, k = 3, maxleaves = 1000, distance = "euclidean", outtable = NULL, randseed = 12345, statistics = NULL, maxk = 20, nodecapacity = 6, leafcapacity = 8, outlierfraction = 0, modelname = NULL) ## S3 method for class 'idaTwoStep' print(x,...) ## S3 method for class 'idaTwoStep' predict(object, newdata, id,...)
data |
A IDA data frame that contains the input data for the function. The input IDA data frame must include a column that contains a unique ID for each row. |
id |
The name of the column that contains a unique ID for each row of the input data. |
k |
The number of clusters to be calculated. |
maxleaves |
The maximum number of leaf nodes in the initial clustering tree. When the tree contains maxleaves leaf nodes, the following data records are aggregated into clusters associated with the existing leaf nodes. This parameter is available for Db2 for z/OS only and ignored for Db2 Warehouse with integrated Spark. |
maxk |
The maximum number of clusters that can be determined automatically. |
nodecapacity |
The branching factor of the internal tree that is used in pass 1. Each node can have up to <nodecapacity> subnodes. |
This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS.
leafcapacity |
The number of clusters per leaf node in the internal tree that is used in pass 1. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS. |
outlierfraction |
The fraction of the records that is to be considered as outlier in the internal tree that is used in pass 1. Clusters that contain less than <outlierfraction> times the mean number of data records per cluster are removed. This parameter is available for Db2 Warehouse with integrated Spark only and ignored for Db2 for z/OS. |
distance |
The distance function that is to be used. This can be set to |
outtable |
The name of the output table that is to contain the results of the operation. When NULL is specified, a table name is generated automatically. |
randseed |
The seed for the random number generator. |
statistics |
Denotes which statistics to calculate. Allowed values are |
modelname |
The name under which the model is stored in the database.
This is the name that is specified when using functions such as |
object |
An object of the class |
x |
An object of the class |
newdata |
A IDA data frame that contains the data to which to apply the model. |
... |
Additional parameters to pass to the print or predict method. |
The idaTwoStep clustering function distributes first the input data into a hierarchical tree structure according to the distance between the data records where each leaf node corresponds to a (small) cluster. Then idaTwoStep reduces the tree by aggregating the leaf nodes according to the distance function until k clusters remain.
Models are stored persistently in database under the name modelname
. Model names cannot have more than 64 characters and
cannot contain white spaces. They need to be quoted like table names, otherwise they will be treated upper case by default. Only one
model with a given name is allowed in the database at a time. If a model with modelname
already exists, you need to drop it with idaDropModel
first before you can create another one with the same name. The model name can be used to retrieve the model later (idaRetrieveModel
).
The output of the print function for a idaTwoStep object is:
A vector containing a list of centers
A vector containing a list of cluster sizes
A vector containing a list of the number of elements in each cluster
A data frame or the name of the table containing the calculated cluster assignments
The within-cluster sum of squares (which indicates cluster density)
The names of the slots that are available in the idaTwoStep object
The idaTwoStep function returns an object of class idaTwoStep
and TwoStep
.
idaRetrieveModel
, idaDropModel
, idaListModels
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a TwoStep model stored in the database as TwoStepMODEL tsm <- idaTwoStep(idf, id="ID",modelname="TwoStepMODEL") #Print the model print(tsm) #Predict the model pred <- predict(tsm,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)
## Not run: #Create ida data frame idf <- ida.data.frame("IRIS") #Create a TwoStep model stored in the database as TwoStepMODEL tsm <- idaTwoStep(idf, id="ID",modelname="TwoStepMODEL") #Print the model print(tsm) #Predict the model pred <- predict(tsm,idf,id="ID") #Inspect the predictions head(pred) ## End(Not run)