NOTE: The pmml package referenced in this vignette is assumed to be version 1.5.7. Starting with pmml 2.0.0, functions from pmmlTransformations have been merged into pmml. The examples have (commented-out) calls to functions from pmml; if using pmmlTransformations, use pmml 1.5.7 or older.
For an updated version of this vignette, see the latest pmml package.
This vignette provides examples of how to use the FunctionXform transformation to create new data features for PMML models.
Given a WrapData object and a transformation expression, FunctionXform calculates data for a new feature and creates a new WrapData object. When PMML is produced with pmml::pmml(), the transformation is inserted into the LocalTransformations node as a DerivedField. FunctionXform makes it possible to use multiple data fields and functions to produce a new feature.
While FunctionXform is part of the pmmlTransformations package, the code to produce PMML from R is in the pmml package. The following examples assume that both of these packages are installed and loaded. The kable function is part of knitr and is used to make tables more readable.
Using the iris dataset as an example, let’s construct a new feature by transforming one variable. Load the dataset and show the first few lines:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
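The table above can be reproduced roughly as follows. This is a minimal sketch that uses only base R; the kable call is left commented out because it assumes knitr is installed.

```r
# iris ships with base R (datasets package), so no extra packages are needed.
data(iris)

# First three rows, matching the table above.
first_rows <- head(iris, 3)
print(first_rows)

# With knitr loaded, the same rows can be rendered as a table:
# knitr::kable(first_rows)
```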
Create the irisBox object with WrapData:
irisBox contains the data and transform information that will be used to produce PMML later. The original data is in irisBox$data. Any new features created with a transformation are added as columns to this data frame.
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
Transform and field information is in irisBox$fieldData. The fieldData data frame contains information on every field in the dataset, as well as every transform used. The functionXform column contains expressions used in the FunctionXform transform.
field | type | dataType | origFieldName | sampleMin | sampleMax | xformedMin | xformedMax | centers | scales | fieldsMap | transform | default | missingValue | functionXform |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sepal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Sepal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Petal.Length | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Petal.Width | original | numeric | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Species | original | factor | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
Now add a new feature, Sepal.Length.Sqrt, using FunctionXform:
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length",
newFieldName="Sepal.Length.Sqrt",
formulaText="sqrt(Sepal.Length)")
The new feature is calculated and added as a column to the irisBox$data data frame:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.Sqrt |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 2.258318 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 2.213594 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 2.167948 |
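To make the row-by-row behaviour concrete, here is a rough base-R sketch that evaluates an expression string once per row. The helper apply_formula_rowwise is hypothetical and is not part of pmml or pmmlTransformations; it only illustrates the idea and is not the package's implementation.

```r
# Hypothetical helper (an assumption for illustration, not a package function):
# parse an expression string once, then evaluate it against each row in turn.
apply_formula_rowwise <- function(df, formula_text) {
  expr <- parse(text = formula_text)[[1]]
  vapply(
    seq_len(nrow(df)),
    # A one-row data frame serves as the lookup environment for field names.
    function(i) eval(expr, envir = df[i, , drop = FALSE]),
    numeric(1)
  )
}

sepal_sqrt <- apply_formula_rowwise(head(iris, 3), "sqrt(Sepal.Length)")
print(round(sepal_sqrt, 6))
```

The printed values should agree with the Sepal.Length.Sqrt column shown in the table above.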
irisBox$fieldData now contains a new row with the transformation expression:
field | type | dataType | origFieldName | functionXform |
---|---|---|---|---|
Sepal.Length.Sqrt | derived | numeric | Sepal.Length | sqrt(Sepal.Length) |
Construct a linear model for Petal.Width using this new feature:
fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=irisBox$data)
# Convert to PMML:
# fit_pmml <- pmml(fit, transform=irisBox)
Since the model predicts Petal.Width using a variable based on Sepal.Length, the PMML will contain these two fields in the DataDictionary and MiningSchema:
The LocalTransformations node contains Sepal.Length.Sqrt as a derived field:
FunctionXform can also operate on categorical data. In this example, let’s create a boolean feature that equals 1 only when Species is setosa:
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
newFieldName="Species.Setosa",
formulaText="if (Species == 'setosa') {1} else {0}")
kable(head(irisBox$data,3))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
Create a linear model and check the LocalTransformations node:
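A minimal sketch of this step follows. To stay runnable without pmmlTransformations, it rebuilds the Species.Setosa column with base R's ifelse (an assumption standing in for irisBox$data); the commented-out lines follow the pattern used elsewhere in this vignette and assume pmml 1.5.7 and the irisBox object created above.

```r
# Base-R stand-in for the derived column (assumption: mirrors irisBox$data).
iris_demo <- iris
iris_demo$Species.Setosa <- ifelse(iris_demo$Species == "setosa", 1, 0)

# Fit a linear model on the derived feature.
fit <- lm(Petal.Width ~ Species.Setosa, data = iris_demo)
print(coef(fit))

# With pmml 1.5.7 and irisBox available, the conversion would look like:
# fit <- lm(Petal.Width ~ Species.Setosa, data=irisBox$data)
# fit_pmml <- pmml(fit, transform=irisBox)
# fit_pmml[[3]][[3]]  # same indexing as used later in this vignette
```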
It is possible to create new features by combining several fields. Let’s create a new field from the ratio of sepal and petal lengths:
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
newFieldName="Length.Ratio",
formulaText="Sepal.Length / Petal.Length")
As before, the new field is added as a column to the irisBox$data data frame:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Ratio |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 3.642857 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 3.500000 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 3.615385 |
Fit a linear model using this new feature:
fit <- lm(Petal.Width ~ Length.Ratio, data=irisBox$data)
# Convert to pmml:
# fit_pmml <- pmml(fit, transform=irisBox)
The PMML will contain Sepal.Length and Petal.Length in the DataDictionary and MiningSchema, since these were used in FunctionXform:
The LocalTransformations node contains Length.Ratio as a derived field:
It is possible to pass a feature derived with FunctionXform to another FunctionXform call. To do this, the second call to FunctionXform must use the original data field names (instead of the derived field) in the origFieldName argument.
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length",
newFieldName="Length.Ratio",
formulaText="Sepal.Length / Petal.Length")
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
newFieldName="Length.R.Times.S.Width",
formulaText="Length.Ratio * Sepal.Width")
#> Warning in `[<-.factor`(`*tmp*`, ri, value = structure(c("original",
#> "original", : invalid factor level, NA generated
kable(irisBox$fieldData[6:7,c(1:3,14)])
field | type | dataType | origFieldName | functionXform |
---|---|---|---|---|
Length.Ratio | derived | numeric | Sepal.Length,Petal.Length | Sepal.Length / Petal.Length |
Length.R.Times.S.Width | derived | numeric | Sepal.Length,Petal.Length,Sepal.Width | Length.Ratio * Sepal.Width |
fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=irisBox$data)
# Convert to pmml:
# fit_pmml <- pmml(fit, transform=irisBox)
The PMML will contain Sepal.Length, Petal.Length, and Sepal.Width in the DataDictionary and MiningSchema, since these were used in FunctionXform:
The LocalTransformations node contains Length.Ratio and Length.R.Times.S.Width as derived fields:
The following R functions and operators are directly supported by FunctionXform. Their PMML equivalents are listed on the second line:
+ | - | / | * | ^ | < | <= | > | >= | && | & | \| | \|\| | == | != | ! | ceiling | prod | log |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
+ | - | / | * | pow | lessThan | lessOrEqual | greaterThan | greaterOrEqual | and | and | or | or | equal | notEqual | not | ceil | product | ln |
For these functions, no extra code is required for translation.
The R function prod can be used as long as only numeric arguments are specified. That is, prod can take an na.rm argument in R, but specifying it in FunctionXform will not produce PMML equivalent to the R expression.
Similarly, the R function log can be used directly as long as the second argument (the base) is not specified.
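The reason for the restriction on log can be seen in plain R: with one argument, log is the natural logarithm, which lines up with PMML's ln in the table above, while a second argument changes the base. A small illustration, using only base R:

```r
x <- 100

# One argument: natural logarithm (corresponds to PMML's ln).
natural_log <- log(x)

# Two arguments: logarithm in the given base, a different quantity.
log_base_10 <- log(x, 10)

print(natural_log)
print(log_base_10)
```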
Some functions built into PMML are not among those that FunctionXform translates directly, as described above.
In that case, R throws an error when it tries to calculate the new feature, because the function passed to FunctionXform is not defined in the environment.
It is still possible to make FunctionXform work, but the PMML function must be defined in the R environment first.
Let’s use isIn, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. Detailed specification for this function is available on this DMG page.
One way to implement this in R is by using %in%, with the list of values being represented by ...:
isIn <- function(x, ...) {
  dots <- c(...)
  if (x %in% dots) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}
isIn(1,2,1,4)
#> [1] TRUE
This function can now be passed to FunctionXform. The following code creates a feature that indicates whether Species is either setosa or versicolor:
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Species",
newFieldName="Species.Setosa.or.Versicolor",
formulaText="isIn(Species,'setosa','versicolor')")
The irisBox$data data frame now contains the new feature:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Species.Setosa.or.Versicolor |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | TRUE |
4.9 | 3.0 | 1.4 | 0.2 | setosa | TRUE |
4.7 | 3.2 | 1.3 | 0.2 | setosa | TRUE |
Create a linear model and view the corresponding PMML for the function:
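A sketch of this step is given below. It assumes a base-R reconstruction of the new column (applying an isIn function one value at a time) in place of irisBox$data, so that it runs without pmmlTransformations. The commented-out lines assume pmml 1.5.7 and the irisBox object from above.

```r
# Same idea as the isIn defined earlier, written compactly.
isIn <- function(x, ...) {
  x %in% c(...)
}

# Base-R stand-in for the derived column (assumption: mirrors irisBox$data).
iris_demo <- iris
iris_demo$Species.Setosa.or.Versicolor <- vapply(
  as.character(iris_demo$Species),
  function(s) isIn(s, "setosa", "versicolor"),
  logical(1),
  USE.NAMES = FALSE
)

fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data = iris_demo)
print(coef(fit))

# With pmml 1.5.7 and irisBox available:
# fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data=irisBox$data)
# fit_pmml <- pmml(fit, transform=irisBox)
# fit_pmml[[3]][[3]]
```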
As another example, let’s use R’s mean function to create a new feature. PMML has a built-in avg, so we will define an R function with this name.
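One possible definition is sketched here. It is an assumption about how such a wrapper could look: it collects its arguments into a vector and passes them to mean.

```r
# avg mirrors the PMML function name; the body simply delegates to mean().
avg <- function(...) {
  args <- c(...)
  mean(args)
}

# Quick check using the first iris row:
# Sepal.Length = 5.1, Petal.Length = 1.4, Sepal.Width = 3.5.
ratio <- avg(5.1, 1.4) / 3.5
print(ratio)
```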
Now use this function to take an average of several other features and combine with another field:
irisBox <- WrapData(iris)
irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width",
newFieldName="Length.Average.Ratio",
formulaText="avg(Sepal.Length,Petal.Length)/Sepal.Width")
The irisBox$data data frame now contains the new feature:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | Length.Average.Ratio |
---|---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa | 0.9285714 |
4.9 | 3.0 | 1.4 | 0.2 | setosa | 1.0500000 |
4.7 | 3.2 | 1.3 | 0.2 | setosa | 0.9375000 |
Create a simple linear model and view the corresponding PMML for the function:
fit <- lm(Petal.Width ~ Length.Average.Ratio, data=irisBox$data)
# fit_pmml <- pmml(fit, transform=irisBox)
# fit_pmml[[3]][[3]]
In the PMML, avg will be recognized as a valid function.
The function functionToPMML (part of the pmml package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values.
As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression passed to FunctionXform are always assumed to be field names, and not substituted. That is, even if x has a value in the R environment, the resulting expression will still use x.
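The same behaviour can be observed with R's own parser, which is a reasonable mental model here, although this sketch does not call functionToPMML and makes no claim about how the package is implemented. Parsing does not evaluate anything, so undefined functions are accepted and variables remain symbols.

```r
x <- 3  # x has a value, but parsing never looks it up

# foo and bar are not defined anywhere; parsing still succeeds.
expr <- parse(text = "foo(bar(x * y))")[[1]]

print(all.names(expr))  # function and variable names found in the expression
print(all.vars(expr))   # variables only; x is still a symbol, not 3

# An invalid expression (unbalanced parentheses) fails at parse time.
parse_result <- tryCatch(
  parse(text = "foo(bar(x * y)"),
  error = function(e) "parse error"
)
print(parse_result)
```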
There are several limitations to parsing expressions in FunctionXform.
Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in FunctionXform.
An expression such as foo(x) is treated as a function foo with argument x. Consequently, passing in an R vector c(1,2,3) will produce PMML where c is a function and 1,2,3 are the arguments:
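R's own parser views c(1,2,3) the same way, as a call to c with three arguments. The following base-R sketch shows that structure; it illustrates the parse tree only and does not produce PMML.

```r
expr <- quote(c(1, 2, 3))

# The first element of a call is the function being called.
fn_name <- as.character(expr[[1]])

# The remaining elements are its arguments.
fn_args <- as.list(expr)[-1]

print(fn_name)
print(length(fn_args))
```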
We can also see what happens when passing an na.rm argument to prod, as mentioned above:
# functionToPMML("prod(1,2,na.rm=FALSE)") #produces incorrect PMML
# functionToPMML("prod(1,2)") #produces correct PMML
Additionally, passing in a vector to prod produces incorrect PMML:
The following are additional examples of PMML produced from R expressions.
Extra parentheses:
If-else expressions: