--- title: "Introduction to FunctionXform in PmmlTransformations" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Introduction to FunctionXform} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` **NOTE**: The `pmml` package referenced in this vignette is assumed to be version `1.5.7`. Starting with `pmml 2.0.0`, functions from `pmmlTransformations` have been merged into `pmml`. The examples have (commented-out) calls to functions from `pmml`; if using `pmmlTransformations`, use `pmml 1.5.7` or older. For an updated version of this vignette, see the latest `pmml` package. ## Introduction This vignette provides examples of how to use the `FunctionXform` transformation to create new data features for PMML models. Given a `WrapData` object and a transformation expression, `FunctionXform` calculates data for a new feature and creates a new `WrapData` object. When PMML is produced with `pmml::pmml()`, the transformation is inserted into the `LocalTransformations` node as a `DerivedField`. `FunctionXform` makes it possible to use multiple data fields and functions to produce a new feature. While `FunctionXform` is part of the `pmmlTransformations` package, the code to produce pmml from R is in the `pmml` package. The following examples assume that both these packages are installed and loaded. The `kable` function is part of `knitr`, and is used to make tables more readable. ```{r, echo=FALSE,warning=FALSE,message=FALSE,results="hide"} library(pmml) library(pmmlTransformations) library(knitr) ``` ## Single numeric field Using the `iris` dataset as an example, let's construct a new feature by transforming one variable. Load the dataset and show the first few lines: ```{r} data(iris) kable(head(iris,3)) ``` Create the `irisBox` object with `WrapData`: ```{r} irisBox <- WrapData(iris) ``` `irisBox` contains the data and transform information that will be used to produce PMML later. The original data is in `irisBox$data`. Any new features created with a transformation are added as columns to this data frame. ```{r} kable(head(irisBox$data,3)) ``` Transform and field information is in `irisBox$fieldData`. The fieldData data frame contains information on every field in the dataset, as well as every transform used. The `functionXform` column contains expressions used in the `FunctionXform` transform. ```{r} kable(irisBox$fieldData) ``` Now add a new feature, `Sepal.Length.Sqrt`, using `FunctionXform`: ```{r} irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length", newFieldName="Sepal.Length.Sqrt", formulaText="sqrt(Sepal.Length)") ``` The new feature is calculated and added as a column to the `irisBox$data` data frame: ```{r} kable(head(irisBox$data,3)) ``` `irisBox$fieldData` now contains a new row with the transformation expression: ```{r} kable(irisBox$fieldData[6,c(1:3,14)]) ``` Construct a linear model for `Petal.Width` using this new feature: ```{r} fit <- lm(Petal.Width ~ Sepal.Length.Sqrt, data=irisBox$data) # Convert to PMML: # fit_pmml <- pmml(fit, transform=irisBox) ``` Since the model predicts `Petal.Width` using a variable based on `Sepal.Length`, the PMML will contain these two fields in the `DataDictionary` and `MiningSchema`: ```{r} # fit_pmml[[2]] #Data Dictionary node # fit_pmml[[3]][[1]] #Mining Schema node ``` The `LocalTransformations` node contains `Sepal.Length.Sqrt` as a derived field: ```{r} # fit_pmml[[3]][[3]] ``` ## Single categorical field `FunctionXform` can also operate on categorical data. In this example, let's create a boolean feature that equals 1 only when `Species` is `setosa`: ```{r} irisBox <- WrapData(iris) irisBox <- FunctionXform(irisBox,origFieldName="Species", newFieldName="Species.Setosa", formulaText="if (Species == 'setosa') {1} else {0}") kable(head(irisBox$data,3)) ``` Create a linear model and check the `LocalTransformations` node: ```{r} fit <- lm(Petal.Width ~ Species.Setosa, data=irisBox$data) # fit_pmml <- pmml(fit, transform=irisBox) # fit_pmml[[3]][[3]] ``` ## Multiple input fields It is possible to create new features by combining several fields. Let's create a new field from the ratio of sepal and petal lengths: ```{r} irisBox <- WrapData(iris) irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length", newFieldName="Length.Ratio", formulaText="Sepal.Length / Petal.Length") ``` As before, the new field is added as a column to the `irisBox$data` data frame: ```{r} kable(head(irisBox$data,3)) ``` Fit a linear model using this new feature: ```{r} fit <- lm(Petal.Width ~ Length.Ratio, data=irisBox$data) # Convert to pmml: # fit_pmml <- pmml(fit, transform=irisBox) ``` The pmml will contain `Sepal.Length` and `Petal.Length` in the `DataDictionary` and `MiningSchema`, since these were used in `FormulaXform`: ```{r} # fit_pmml[[2]] #Data Dictionary node # fit_pmml[[3]][[1]] #Mining Schema node ``` The `Local.Transformations` node contains `Length.Ratio` as a derived field: ```{r} # fit_pmml[[3]][[3]] ``` ## Using a previously derived feature It is possible to pass a feature derived with `FunctionXform` to another `FunctionXform` call. To do this, the second call to `FunctionXform` must use the original data field names (instead of the derived field) in the `origFieldName` argument. ```{r} irisBox <- WrapData(iris) irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length", newFieldName="Length.Ratio", formulaText="Sepal.Length / Petal.Length") irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width", newFieldName="Length.R.Times.S.Width", formulaText="Length.Ratio * Sepal.Width") kable(irisBox$fieldData[6:7,c(1:3,14)]) ``` ```{r} fit <- lm(Petal.Width ~ Length.R.Times.S.Width, data=irisBox$data) # Convert to pmml: # fit_pmml <- pmml(fit, transform=irisBox) ``` The pmml will contain `Sepal.Length`, `Petal.Length`, and `Sepal.Width` in the `DataDictionary` and `MiningSchema`, since these were used in `FormulaXform`: ```{r} # fit_pmml[[2]] #Data Dictionary node # fit_pmml[[3]][[1]] #Mining Schema node ``` The `Local.Transformations` node contains `Length.Ratio` and `Length.R.Times.S.Width` as derived fields: ```{r} # fit_pmml[[3]][[3]] ``` ## PMML functions supported by `FunctionXform` The following R functions and operators are directly supported by `FunctionXform`. Their PMML equivalents are listed on the second line: ```{r,echo=FALSE} funcs <- rbind(c("+","-","/","*","^","<","<=",">",">=","&&","&","|","||","==","!=","!","ceiling","prod","log"), c("+","-","/","*","pow","lessThan","lessOrEqual","greaterThan","greaterOrEqual","and","and","or","or","equal","notEqual","not","ceil","product","ln")) colnames(funcs) <- funcs[1,] kable(funcs,col.names=colnames(funcs)) ``` For these functions, no extra code is required for translation. The R function `prod` can be used as long as only numeric arguments are specified. That is, `prod` can take an `na.rm` argument, but specifying this in `FunctionXform` directly will not produce PMML equivalent to the R expression. Similarly, the R function `log` can be used directly as long as the second argument (the base) is not specified. ## PMML functions not supported by `FunctionXform` There are built-in functions defined in PMML that cannot be directly translated to PMML using `FunctionXform` as described above. In this case, an error will be thrown when R tries to calculate a new feature using the function passed to `FunctionXform`, but does not see that function in the environment. It is still possible to make `FunctionXform` work, but the PMML function must be defined in the R environment first. Let's use `isIn`, a PMML function, as an example. The function returns a boolean indicating whether the first argument is contained in a list of values. Detailed specification for this function is available on [this DMG page](http://dmg.org/pmml/v4-3/BuiltinFunctions.html#boolean5). One way to implement this in R is by using `%in%`, with the list of values being represented by `...`: ```{r} isIn <- function(x, ...) { dots <- c(...) if (x %in% dots) { return(TRUE) } else { return(FALSE) } } isIn(1,2,1,4) ``` This function can now be passed to `FunctionXform`. The following code creates a feature that indicates whether `Species` is either `setosa` or `versicolor`: ```{r} irisBox <- WrapData(iris) irisBox <- FunctionXform(irisBox,origFieldName="Species", newFieldName="Species.Setosa.or.Versicolor", formulaText="isIn(Species,'setosa','versicolor')") ``` The `data` data frame now contains the new feature: ```{r} kable(head(irisBox$data,3)) ``` Create a linear model and view the corresponding PMML for the function: ```{r} fit <- lm(Petal.Width ~ Species.Setosa.or.Versicolor, data=irisBox$data) # fit_pmml <- pmml(fit, transform=irisBox) # fit_pmml[[3]][[3]] ``` ## PMML Function not supported by `FunctionXform` - another example As another example, let's use R's `mean` function to create a new feature. PMML has a built-in `avg`, so we will define an R function with this name. ```{r} avg <- function(...) { dots <- c(...) return(mean(dots)) } ``` Now use this function to take an average of several other features and combine with another field: ```{r} irisBox <- WrapData(iris) irisBox <- FunctionXform(irisBox,origFieldName="Sepal.Length,Petal.Length,Sepal.Width", newFieldName="Length.Average.Ratio", formulaText="avg(Sepal.Length,Petal.Length)/Sepal.Width") ``` The `data` data frame now contains the new feature: ```{r} kable(head(irisBox$data,3)) ``` Create a simple linear model and view the corresponding PMML for the function: ```{r} fit <- lm(Petal.Width ~ Length.Average.Ratio, data=irisBox$data) # fit_pmml <- pmml(fit, transform=irisBox) # fit_pmml[[3]][[3]] ``` In the PMML, `avg` will be recognized as a valid function. ## PMML for arbitrary functions The function `functionToPMML` (part of the `pmml` package) makes it possible to convert an R expression into PMML directly, without creating a model or calculating values. As long as the expression passed to the function is a valid R expression (e.g., no unbalanced parentheses), it can contain arbitrary function names not defined in R. Variables in the expression passed to `FunctionXform` are always assumed to be field names, and not substituted. That is, even if `x` has a value in the R environment, the resulting expression will still use `x`. ```{r} # functionToPMML("1 + 2") # x <- 3 # functionToPMML("foo(bar(x * y))") ``` ## More notes on functions There are several limitations to parsing expressions in `FunctionXform`. Each transformation operates on one data row at a time. For example, it is not possible to compute the mean of an entire feature column in `FunctionXform`. An expression such as `foo(x)` is treated as a function `foo` with argument `x`. Consequently, passing in an R vector `c(1,2,3)` will produce PMML where `c` is a function and `1,2,3` are the arguments: ```{r} # functionToPMML("c(1,2,3)") ``` We can also see what happens when passing an `na.rm` argument to `prod`, as mentioned in an above example: ```{r} # functionToPMML("prod(1,2,na.rm=FALSE)") #produces incorrect PMML # functionToPMML("prod(1,2)") #produces correct PMML ``` Additionally, passing in a vector to `prod` produces incorrect PMML: ```{r} # prod(c(1,2,3)) # functionToPMML("prod(c(1,2,3))") ``` ## More examples of functions The following are additional examples of pmml produced from R expressions. Extra parentheses: ```{r} # functionToPMML("pmmlT(((1+2))*(x))") ``` If-else expressions: ```{r} # functionToPMML("if(a<2) {x+3} else if (a>4) {4} else {5}") ``` ## References - [PMML 4.3 official specification](http://dmg.org/pmml/v4-3/GeneralStructure.html)