--- title: "The XiMpLe Package" author: "m.eik michalke" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: true abstract: > This package provides basic tools for parsing and generating XML into and from R. It is not as feature-rich as alternative packages, but it's small and keeps dependencies to a minimum. vignette: > %\VignetteIndexEntry{The XiMpLe Package} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8x]{inputenc} \usepackage{apacite} --- ```{r, include=FALSE, cache=FALSE} library(XiMpLe) ``` ```{r, set-options, echo=FALSE, cache=FALSE} options(width=85) ``` # Previously on `XiMpLe` Before I even begin, I would like to stress that `XiMpLe` can *not* replace the `XML` package, and it is not supposed to. It has only a hand full of functions, therefore it can only do so much. Probably the most noteworthy missing feature in this package is any real DTD support. If you need that, you can stop reading here. Another problem is speed -- `XiMpLe` is written in pure R, and it's painfully slow with large XML trees. You won't notice this if you're only dealing with portions of some kilobytes, but if you need to parse really huge documents, it can take ages to finish. Historically, this package was written for exactly one purpose: I wanted to be able to read and write the XML documents of [RKWard](https://rkward.kde.org), because I was about to write an R package for scripting plugins for this R GUI. I actually had started another project shortly before, using the `XML` package as a dependency, but soon got complaints from Windows users. As it turned out, that package was not available for Windows, because somehow it couldn't be built automatically. When I realised that I only needed a small subset of its features anyway, I figured it might be easiest to quickly implement those features myself. Instead of hiding them in the internals of what eventually became the [rkwarddev](https://files.kde.org/rkward/R/pckg/rkwarddev/) package, I then started working on this package first. And well, "quickly" was rather optimistic... but since I'm happily using `XiMpLe` in other packages as well (like [roxyPackage](https://reaktanz.de/?c=hacking\&s=roxyPackage)), I'm satisfied it was worth it. So now you know. If you need a full-featured package to parse or generate XML in R, try the `XML` package. Otherwise, keep on reading. # And now the continuation Basically, `XiMpLe` can do these things for you: * parse XML from files into an R object, using the `parseXMLTree()` function * generate XML R objects, by calling `XMLNode()` and `XMLTree()` * extract nodes from XML R objects, or change their content, using the `node()` method * write back XML files from R objects, using the `pasteXML()` method That about covers it. XML nodes can of course be nested to construct complex trees. Finally, there's also some shortcuts to get the job done very efficient: * generate simple wrapper functions around `XMLNode()` to quickly script XML trees, using the `gen_tag_functions()` function * turn an XML object into script code using those wrapper functions, via `pasteXML(..., as_script=TRUE)` Let's look at some examples. # Naming conventions Let's quickly explain what we'll be talking about here. If you're parsing an XML document, it will contain an **XML tree**. This tree is made up of **XML nodes**. A node is indicated by arrow brackets, *must* have a **name**, *can* have **attributes**, and is either **empty** or not. Nodes can be nested, where nodes inside a node are its **child nodes**. ``` this text is the child of the "other" node. it can have multiple entries. ``` # Generate XML trees Now let's see how these nodes can be generated using the `XiMpLe` package. Single nodes are the domain of the `XMLNode()` function, and to get an empty node you just give it the name of that node: ```{r} XMLNode("useless") ``` As you see, you will be shown XML code on the console. But what this function returns is actually an R object of class `XiMpLe_node`, so what you see is an *interpretation* of that object, made by the `show()` method for objects of this type (see section [Writing XML files] on how to export XML to files). The second node in the initial example has an attribute. Attributes can comfortably be specified as named character strings via the `...` argument: ```{r} XMLNode("other", foo="bar") ``` Alternatively, you can provide attributes as a named list via the `attrs` argument. You will need to use this if any attribute you would like to have in an XML node collides with the argument names of `XMLNode()`: ```{r} XMLNode("other", attrs=list(foo="bar")) ``` By default, as long as our node doesn't have any children, it's assumed to be an empty node. To force it into a non-empty node (i.e., opening and closing tag) even without content, we'd have to provide an empty character string as its child. Like attributes, child nodes can also be provided in two ways -- either one by one via the `...` argument: ```{r} XMLNode("other", "", foo="bar") ``` Or as one list via the `.children` argument. However, due to the implementation invoking the `.children` argument currently has the side effect of *replacing* the `...` argument, which will then ignore any names attribute definitions you might have given there. Therefore, you would then also have to put attributes in the `attrs` argument as well: ```{r} XMLNode("other", attrs=list(foo="bar"), .children=list("")) ``` Anyway, this is also the place to provide our node with the text value: ```{r} XMLNode( "other", "this text is the child of the \"other\" node.", "it can have multiple entries.", foo="bar" ) ``` How about the comments? Well, `XiMpLe` does detect some special node names, one being `"!--"` to indicate a comment: ```{r} XMLNode("!--", "following is an empty node named \"useless\"") ``` So far we genrated single nodes. In most cases, you want to have nested nodes which combine into an XML tree. You can achieve this by simply using nodes as child nodes. As a practical example, this is how you could generate an XHTML document: ```{r} sample_XML_a <- XMLNode("a", "klick here!", href="http://example.com", target="_blank") sample_XML_body <- XMLNode("body", sample_XML_a) sample_XML_html <- XMLNode("html", XMLNode("head", ""), sample_XML_body) (sample_XML_tree <- XMLTree(sample_XML_html, xml=list(version="1.0", encoding="UTF-8"), dtd=list(doctype="html", decl="PUBLIC", id="-//W3C//DTD XHTML 1.0 Transitional//EN", refer="http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"))) ``` It should be noted, however, that `XiMpLe` doesn't perform even the slightest checks on what you provide as `DOCTYPE` or `xml` attributes. # Writing XML files We've learned earlier that `XiMpLe` objects do not contain actual XML code. If you would like to write the XML code to a file, you should use `pasteXML()`, which will translate the R objects into a character string: ```{r} useless_node <- XMLNode("useless") pasteXML(useless_node) ``` Now let's write the XHTML code we created in the previous section to a file called `example.html`: ```{r, eval=FALSE} cat(pasteXML(sample_XML_tree), file="example.html") ``` And that's it. The method `pasteXML()` has some arguments to configure the output, like `shine`, which sets the level of code formatting. If you set `shine=0`, no formatting is done, not even newlines, `shine=1` (default) uses individual lines for nodes and adds indentation for better readability, and `shine=2` uses indented lines for each attribute as well: ```{r, eval=FALSE} cat(pasteXML(sample_XML_tree, shine=2)) ``` # Reading XML files We've also just created an example file we can read back in, to see how XML parsing looks like with `XiMpLe`: ```{r, eval=FALSE} sample_XML_parsed <- parseXMLTree("example.html") ``` `parseXMLTree()` can also digest XML directly if it comes in single character strings or vectors. You only need to tell it that you're not providing a file name this time, using the `object` argument: ```{r} my.XML.stuff <- c("here it begins","") parseXMLTree(my.XML.stuff, object=TRUE) ``` # Mining nodes Reading and writing XML files is neat, but what if you need to aquire only certain parts of, say, a parsed XML file? For example, what if we only needed the URL of the `href` attribute in our XHTML example? That's a job for `node()`. This method can be used to extract parts from XML trees, once they are `XiMpLe` objects. The branch you'd like to get can be defined by a list of node names, and `node()` will follow down this hierarchy and then return what nodes are to be found below that. You can also specify that you don't want the whole node(s), but only the attributes: ```{r} node(sample_XML_tree, node=list("html","body","a"), what="attributes") ``` ```{r} node(sample_XML_tree, node=list("html","body","a"), what="value") ``` This way it's easy to get the value of all attributes or the link text. You can also change values with `node()`. Let's change the URL and remove the `target` attribute completely: ```{r} node(sample_XML_tree, node=list("html","body","a"), what="attributes", element="href") <- "http://example.com/foobar" node(sample_XML_tree, node=list("html","body","a"), what="attributes", element="target") <- NULL sample_XML_tree ``` # Less typing with wrappers There's a fancy way of generating XML nodes and trees using wrapper functions around `XMLNode()`. The function `gen_tag_functions()` takes a character vector of tag names and generates those wrappers for you in a defined environment, which by default is the current global environment. This means you can start using these wrappers right after they were created, as if you had loaded a package. For safety reasons, i.e. to not collide with existing objects, these functions by default are named with a trailing underscore and are not created if an object by that name already exists. ```{r} gen_tag_functions(c("html", "head", "body", "a")) # see them in action (sample_XML_tree2 <- html_( head_(), body_( a_(href="http://example.com", target="_blank", "klick here!") ) )) ``` If you don't want these functions filling up your `.GlobalEnv`, you can also put them in an attached environment. Let's call it `XiMpLe_wrappers`: ```{r} attach(list(), name="XiMpLe_wrappers") gen_tag_functions(tags=c("p", "div"), envir=as.environment("XiMpLe_wrappers")) ``` Since the evironment is in the search path, you should be able to call the functions directly as well. # Turning XML into script code The already introduced method `pasteXML()` usually shows you the nested R objects as XML code. But it is also capable of turning them into script code that uses the syntax of the wrapper functions we have just created. To get this, simply use `as_script=TRUE`: ```{r} cat(pasteXML(sample_XML_tree2, as_script=TRUE)) ``` A use case for this might be existing XML documents that you would like to maintain with R scripts. You can first import the documents with `parseXMLTree()` and then have `pasteXML()` turn it into scripts that would generate the original XML code in return. Be aware that the code generated might not always run out of the box. Obviously, you'll have to create the needed wrapper functions. Some XML nodes might be skipped if the parser doesn't fully understand them, and the code will be as nested as the input. However, it it's probably a good start saving you a lot of time anyway. If you find that useful, you might also want to check out the function `provide_file()`. It can be used as a wrapper around a file name that a node is referencing (like the value of `src` of an `` tag) and make an actual copy of that file to a defined path.