Title: | Parse XML |
---|---|
Description: | Work with XML files using a simple, consistent interface. Built on top of the 'libxml2' C library. |
Authors: | Hadley Wickham [aut, cre], Jim Hester [aut], Jeroen Ooms [aut], Posit Software, PBC [cph, fnd], R Foundation [ctb] (Copy of R-project homepage cached as example) |
Maintainer: | Hadley Wickham <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.3.6 |
Built: | 2024-11-25 06:51:52 UTC |
Source: | CRAN |
This turns an XML document (or node or nodeset) into the equivalent R list. Note that this is as_list(), not as.list(): lapply() automatically calls as.list() on its inputs, so we can't override the default.
as_list(x, ns = character(), ...)
x |
A document, node, or node set. |
ns |
Optionally, a named vector giving prefix-url pairs, as produced
by |
... |
Needed for compatibility with generic. Unused. |
as_list currently only handles the most common types of children that an element might have:
Other elements, converted to lists.
Attributes, stored as R attributes. Attributes that have special meanings in R (class(), comment(), dim(), dimnames(), names(), row.names() and tsp()) are escaped with '.'.
Text, stored as a character vector.
as_list(read_xml("<foo> a <b /><c><![CDATA[<d></d>]]></c></foo>"))
as_list(read_xml("<foo> <bar><baz /></bar> </foo>"))
as_list(read_xml("<foo id = 'a'></foo>"))
as_list(read_xml("<foo><bar id='a'/><bar id='b'/></foo>"))
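The mapping described above can be inspected piece by piece. The following is a minimal sketch, assuming a small hypothetical document with one attribute, one text node and one child element; it is meant as an illustration rather than a full description of the conversion.

```r
library(xml2)

# Hypothetical sample document: one attribute, one text node, one child element.
doc <- read_xml("<foo id='a'>hello<bar/></foo>")
lst <- as_list(doc)

# The converted document is keyed by the root element name.
root <- lst$foo

# XML attributes are carried as R attributes on the list.
root_id <- attr(root, "id")

# Text children become character vectors; element children become nested lists.
first_child <- root[[1]]
bar_child <- root$bar
```

Use str(lst) to see the whole structure at once.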
This turns an R list into the equivalent XML document. Not all R lists will produce valid XML; in particular, there can only be one root node, and all child nodes need to be named (or empty) lists. R attributes become XML attributes and R names become XML node names.
as_xml_document(x, ...)
x |
A document, node, or node set. |
... |
Needed for compatibility with generic. Unused. |
as_xml_document(list(x = list()))

# Nesting multiple nodes
as_xml_document(list(foo = list(bar = list(baz = list()))))

# attributes are stored as R attributes
as_xml_document(list(foo = structure(list(), id = "a")))
as_xml_document(list(foo = list(
  bar = structure(list(), id = "a"),
  bar = structure(list(), id = "b")
)))
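To check how names and attributes end up in the resulting document, the result can be queried with the usual accessors. This is a small sketch using a hypothetical list with a single root, one attribute and one child.

```r
library(xml2)

# Hypothetical nested list: a single root "foo" with an id attribute and one child.
x <- list(foo = structure(list(bar = list()), id = "a"))
doc <- as_xml_document(x)

root_name <- xml_name(doc)                 # R name -> XML node name
root_id <- xml_attr(doc, "id")             # R attribute -> XML attribute
child_names <- xml_name(xml_children(doc))
```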
Libcurl implementation of C_download (the "internal" download method) with added support for https, ftps, gzip, etc. Default behavior is identical to download.file(), but request can be fully configured by passing a custom curl::handle().
download_xml(
  url,
  file = basename(url),
  quiet = TRUE,
  mode = "wb",
  handle = curl::new_handle()
)

download_html(
  url,
  file = basename(url),
  quiet = TRUE,
  mode = "wb",
  handle = curl::new_handle()
)
url |
A character string naming the URL of a resource to be downloaded. |
file |
A character string with the name where the downloaded file is saved. |
quiet |
If |
mode |
A character string specifying the mode with which to write the file.
Useful values are |
handle |
a curl handle object |
The main difference between curl_download and curl_fetch_disk is that curl_download checks the http status code before starting the download, and raises an error when status is non-successful. The behavior of curl_fetch_disk on the other hand is to proceed as normal and write the error page to disk in case of a non success response.
For a more advanced download interface which supports concurrent requests and resuming large files, have a look at the multi_download function.
Path of downloaded file (invisibly).
## Not run:
download_html("http://tidyverse.org/index.html")
## End(Not run)
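A configured handle might be passed along as in the sketch below. The helper name fetch_page, the user agent string and the timeout value are all hypothetical; the download itself is left commented out because it needs network access.

```r
library(xml2)

# Hypothetical helper: download a page with a custom user agent and timeout,
# then parse the saved file.
fetch_page <- function(url, dest = tempfile(fileext = ".html")) {
  h <- curl::new_handle()
  curl::handle_setopt(
    h,
    useragent = "my-scraper/0.1 (hypothetical contact details)",
    timeout = 30
  )
  # download_html() returns the path of the downloaded file (invisibly).
  path <- download_html(url, file = dest, quiet = TRUE, handle = h)
  read_html(path)
}

# Not run: requires network access.
# page <- fetch_page("http://tidyverse.org/index.html")
```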
Read HTML or XML.
read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")

read_html(x, encoding = "", ..., options = c("RECOVER", "NOERROR", "NOBLANKS"))

## S3 method for class 'character'
read_xml(x, encoding = "", ..., as_html = FALSE, options = "NOBLANKS")

## S3 method for class 'raw'
read_xml(
  x,
  encoding = "",
  base_url = "",
  ...,
  as_html = FALSE,
  options = "NOBLANKS"
)

## S3 method for class 'connection'
read_xml(
  x,
  encoding = "",
  n = 64 * 1024,
  verbose = FALSE,
  ...,
  base_url = "",
  as_html = FALSE,
  options = "NOBLANKS"
)
x |
A string, a connection, or a raw vector. A string can be either a path, a url or literal xml. Urls will
be converted into connections either using If a connection, the complete connection is read into a raw vector before being parsed. |
encoding |
Specify a default encoding for the document. Unless otherwise specified XML documents are assumed to be in UTF-8 or UTF-16. If the document is not UTF-8/16, and lacks an explicit encoding directive, this allows you to supply a default. |
... |
Additional arguments passed on to methods. |
as_html |
Optionally parse an xml file as if it's html. |
options |
Set parsing options for the libxml2 parser. Zero or more of
|
base_url |
When loading from a connection, raw vector or literal html/xml, this allows you to specify a base url for the document. Base urls are used to turn relative urls into absolute urls. |
n |
If |
verbose |
When reading from a slow connection, this prints some output on every iteration so you know it's working. |
An XML document. HTML is normalised to valid XML - this may not be exactly the same transformation performed by the browser, but it's a reasonable approximation.
When performing web scraping tasks it is both good practice, and often required, to set the user agent request header to a specific value. Sometimes this value is assigned to emulate a browser in order to have content render in a certain way (e.g. Mozilla/5.0 (Windows NT 5.1; rv:52.0) Gecko/20100101 Firefox/52.0 to emulate more recent Windows browsers). Most often, this value should be set to provide the web resource owner information on who you are and the intent of your actions like this Google scraping bot user agent identifier: Googlebot/2.1 (+http://www.google.com/bot.html).
You can set the HTTP user agent for URL-based requests using httr::set_config() and httr::user_agent():
httr::set_config(httr::user_agent("[email protected]; +https://example.com/info.html"))
httr::set_config() changes the configuration globally; httr::with_config() can be used to change configuration temporarily.
# Literal xml/html is useful for small examples
read_xml("<foo><bar /></foo>")
read_html("<html><title>Hi<title></html>")
read_html("<html><title>Hi")

# From a local path
read_html(system.file("extdata", "r-project.html", package = "xml2"))

## Not run:
# From a url
cd <- read_xml(xml2_example("cd_catalog.xml"))
me <- read_html("http://had.co.nz")
## End(Not run)
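Two of the behaviours described above, reading from a raw vector and the normalisation of HTML fragments, can be illustrated with the following sketch. The input strings are made up for the example.

```r
library(xml2)

# A raw vector is parsed in the same way as a literal string.
doc <- read_xml(charToRaw("<foo><bar>1</bar></foo>"))
doc_root <- xml_name(doc)

# read_html() normalises a fragment into a complete document,
# so the root element is <html> even though the input had none.
page <- read_html("<p>Hi")
page_root <- xml_name(page)
para_text <- xml_text(xml_find_first(page, ".//p"))
```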
Convert between relative and absolute urls.
url_absolute(x, base)

url_relative(x, base)
x |
A character vector of urls relative to that base |
base |
A string giving a base url. |
A character vector of urls
xml_url
to retrieve the URL associated with a document
url_absolute(c(".", "..", "/", "/x"), "http://hadley.nz/a/b/c/d")

url_relative("http://hadley.nz/a/c", "http://hadley.nz")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b")
url_relative("http://hadley.nz/a/c", "http://hadley.nz/a/b/")
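A typical use is resolving links scraped from a page against the page's own URL. The sketch below uses hypothetical link values and assumes standard relative-reference resolution.

```r
library(xml2)

base <- "http://hadley.nz/a/b/c/d"

# Hypothetical hrefs as they might appear in a scraped page.
links <- c("x.html", "/x", "../y")
resolved <- url_absolute(links, base)
```

Here "x.html" replaces the last path segment of the base, "/x" is resolved against the server root, and "../y" moves up one directory first.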
Escape and unescape urls.
url_escape(x, reserved = "")

url_unescape(x)
x |
A character vector of urls. |
reserved |
A string containing additional characters to avoid escaping. |
url_escape("a b c")
url_escape("a b c", "")

url_unescape("a%20b%2fc")
url_unescape("%C2%B5")
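The effect of the reserved argument is easiest to see on a string that contains both a space and a slash. This is a small sketch with a made-up input string.

```r
library(xml2)

s <- "a b/c"

# Characters listed in `reserved` are left unescaped.
keep_slash <- url_escape(s, reserved = "/")

# Escaping followed by unescaping returns the original string.
round_trip <- url_unescape(url_escape(s))

spaces_only <- url_escape("a b c")
```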
Parse a url into its component pieces.
url_parse(x)
x |
A character vector of urls. |
A data frame with one row for each element of x and columns: scheme, server, port, user, path, query, fragment.
url_parse("http://had.co.nz/")
url_parse("http://had.co.nz:1234/")
url_parse("http://had.co.nz:1234/?a=1&b=2")
url_parse("http://had.co.nz:1234/?a=1&b=2#def")
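Because the result is a data frame, individual components can be pulled out by column. The sketch below parses two URLs at once; the second URL is a hypothetical example.

```r
library(xml2)

parts <- url_parse(c(
  "http://had.co.nz:1234/?a=1&b=2#def",
  "https://example.com/docs/index.html"  # hypothetical URL
))

servers <- parts$server
first_query <- parts$query[1]
first_fragment <- parts$fragment[1]
first_port <- parts$port[1]
second_path <- parts$path[2]
```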
This writes out both XML and normalised HTML. The default behavior will output the same format which was read. To force a particular output format, pass options = "as_xml" or options = "as_html" respectively.
write_xml(x, file, ...)

## S3 method for class 'xml_document'
write_xml(x, file, ..., options = "format", encoding = "UTF-8")

write_html(x, file, ...)

## S3 method for class 'xml_document'
write_html(x, file, ..., options = "format", encoding = "UTF-8")
x |
A document or node to write to disk. It's not possible to save nodesets containing more than one node. |
file |
Path to file or connection to write to. |
... |
additional arguments passed to methods. |
options |
default: ‘format’. Zero or more of
|
encoding |
The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding. |
h <- read_html("<p>Hi!</p>")

tmp <- tempfile(fileext = ".xml")
write_xml(h, tmp, options = "format")
readLines(tmp)

# write formatted HTML output
write_html(h, tmp, options = "format")
readLines(tmp)
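A simple way to confirm that a document survives being written to disk is to read it back. This is a minimal round-trip sketch using a hypothetical document and a temporary file.

```r
library(xml2)

doc <- read_xml("<a><b>1</b><b>2</b></a>")

# Write to a temporary file, then parse that file again.
tmp <- tempfile(fileext = ".xml")
write_xml(doc, tmp)
back <- read_xml(tmp)

values <- xml_text(xml_find_all(back, ".//b"))
unlink(tmp)
```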
xml_attrs() retrieves all attribute values as a named character vector; xml_attrs() <- or xml_set_attrs() sets all attribute values. xml_attr() retrieves the value of a single attribute, and xml_attr() <- or xml_set_attr() modifies its value. If the attribute doesn't exist, xml_attr() returns default, which defaults to NA. xml_has_attr() tests if an attribute is present.
xml_attr(x, attr, ns = character(), default = NA_character_)

xml_has_attr(x, attr, ns = character())

xml_attrs(x, ns = character())

xml_attr(x, attr, ns = character()) <- value

xml_set_attr(x, attr, value, ns = character())

xml_attrs(x, ns = character()) <- value

xml_set_attrs(x, value, ns = character())
x |
A document, node, or node set. |
attr |
Name of attribute to extract. |
ns |
Optionally, a named vector giving prefix-url pairs, as produced
by |
default |
Default value to use when attribute is not present. |
value |
character vector of new value. |
xml_attr() returns a character vector. NA is used to represent attributes that aren't defined.
xml_has_attr() returns a logical vector.
xml_attrs() returns a named character vector if x is a single node, or a list of character vectors if given a nodeset.
x <- read_xml("<root id='1'><child id ='a' /><child id='b' d='b'/></root>")
xml_attr(x, "id")
xml_attr(x, "apple")
xml_attrs(x)

kids <- xml_children(x)
kids
xml_attr(kids, "id")
xml_has_attr(kids, "id")
xml_attrs(kids)

# Missing attributes give missing values
xml_attr(xml_children(x), "d")
xml_has_attr(xml_children(x), "d")

# If the document has a namespace, use the ns argument and
# qualified attribute names
x <- read_xml('
 <root xmlns:b="http://bar.com" xmlns:f="http://foo.com">
   <doc b:id="b" f:id="f" id="" />
 </root>
')
doc <- xml_children(x)[[1]]
ns <- xml_ns(x)

xml_attrs(doc)
xml_attrs(doc, ns)

# If you don't supply a ns spec, you get the first matching attribute
xml_attr(doc, "id")
xml_attr(doc, "b:id", ns)
xml_attr(doc, "id", ns)

# Can set a single attribute with `xml_attr() <-` or `xml_set_attr()`
xml_attr(doc, "id") <- "one"
xml_set_attr(doc, "id", "two")

# Or set multiple attributes with `xml_attrs()` or `xml_set_attrs()`
xml_attrs(doc) <- c("b:id" = "one", "f:id" = "two", "id" = "three")
xml_set_attrs(doc, c("b:id" = "one", "f:id" = "two", "id" = "three"))
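The default argument is handy when a missing attribute should map to a sentinel value rather than NA. The sketch below uses a hypothetical document in which only one child carries an id.

```r
library(xml2)

x <- read_xml("<root><child id='a'/><child/></root>")
kids <- xml_children(x)

# Supply a fallback for nodes that lack the attribute.
ids_default <- xml_attr(kids, "id", default = "none")
has_id <- xml_has_attr(kids, "id")

# Setting an attribute modifies the document in place.
xml_set_attr(kids[[2]], "id", "b")
ids_after <- xml_attr(kids, "id")
```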
Construct a cdata node
xml_cdata(content)
content |
The CDATA content, does not include |
x <- xml_new_root("root")
xml_add_child(x, xml_cdata("<d/>"))
as.character(x)
xml_children returns only elements, xml_contents returns all nodes. xml_length returns the number of children. xml_parent returns the parent node, xml_parents returns all parents up to the root. xml_siblings returns all nodes at the same level. xml_child makes it easy to specify a specific child to return.
xml_children(x)

xml_child(x, search = 1, ns = xml_ns(x))

xml_contents(x)

xml_parents(x)

xml_siblings(x)

xml_parent(x)

xml_length(x, only_elements = TRUE)

xml_root(x)
x |
A document, node, or node set. |
search |
For |
ns |
Optionally, a named vector giving prefix-url pairs, as produced
by |
only_elements |
For |
A node or nodeset (possibly empty). Results are always de-duplicated.
x <- read_xml("<foo> <bar><boo /></bar> <baz/> </foo>")
xml_children(x)
xml_children(xml_children(x))
xml_siblings(xml_children(x)[[1]])

# Note that each unique node only appears once in the output
xml_parent(xml_children(x))

# Mixed content
x <- read_xml("<foo> a <b/> c <d>e</d> f</foo>")
# Children gets the elements, contents gets all node types
xml_children(x)
xml_contents(x)

xml_length(x)
xml_length(x, only_elements = FALSE)

# xml_child makes it easier to select specific children
xml_child(x)
xml_child(x, 2)
xml_child(x, "baz")
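The difference between elements and all nodes can be made concrete by counting them and looking at their types. This sketch reuses the mixed-content document from the examples and selects a child by name.

```r
library(xml2)

x <- read_xml("<foo> a <b/> c <d>e</d> f</foo>")

n_elements <- xml_length(x)                          # elements only
n_nodes <- xml_length(x, only_elements = FALSE)      # elements and text nodes
node_types <- xml_type(xml_contents(x))

# Select a child by name instead of by position.
d_text <- xml_text(xml_child(x, "d"))
```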
Construct a comment node
xml_comment(content)
content |
The comment content |
x <- xml_new_document()
r <- xml_add_child(x, "root")
xml_add_child(r, xml_comment("Hello!"))
as.character(x)
This is used to create simple document type definitions. If you need to create a more complicated definition with internal subsets it is recommended to parse a string directly with read_xml().
xml_dtd(name = "", external_id = "", system_id = "")
name |
The name of the declaration |
external_id |
The external ID of the declaration |
system_id |
The system ID of the declaration |
r <- xml_new_root(
  xml_dtd(
    "html",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"
  )
)

# Use read_xml directly for more complicated DTD
d <- read_xml(
  '<!DOCTYPE doc [
<!ELEMENT doc (#PCDATA)>
<!ENTITY foo " test ">
]>
<doc>This is a valid document &foo; !</doc>'
)
XPath is like regular expressions for trees - it's worth learning if you're trying to extract nodes from arbitrary locations in a document. Use xml_find_all to find all matches - if there's no match you'll get an empty result. Use xml_find_first to find a specific match - if there's no match you'll get an xml_missing node.
xml_find_all(x, xpath, ns = xml_ns(x), ...)

## S3 method for class 'xml_nodeset'
xml_find_all(x, xpath, ns = xml_ns(x), flatten = TRUE, ...)

xml_find_first(x, xpath, ns = xml_ns(x))

xml_find_num(x, xpath, ns = xml_ns(x))

xml_find_int(x, xpath, ns = xml_ns(x))

xml_find_chr(x, xpath, ns = xml_ns(x))

xml_find_lgl(x, xpath, ns = xml_ns(x))
x |
A document, node, or node set. |
xpath |
A string containing an xpath (1.0) expression. |
ns |
Optionally, a named vector giving prefix-url pairs, as produced
by |
... |
Further arguments passed to or from other methods. |
flatten |
A logical indicating whether to return a single, flattened nodeset or a list of nodesets. |
xml_find_all returns a nodeset if applied to a node, and a nodeset or a list of nodesets if applied to a nodeset. If there are no matches, the nodeset(s) will be empty. Within each nodeset, the result will always be unique; repeated nodes are automatically de-duplicated.
xml_find_first returns a node if applied to a node, and a nodeset if applied to a nodeset. The output is always the same size as the input. If there are no matches, xml_find_first will return a missing node; if there are multiple matches, it will return the first only.
xml_find_num, xml_find_chr, xml_find_lgl return numeric, character and logical results respectively.
xml_find_one() has been deprecated. Instead use xml_find_first().
xml_ns_strip() to remove the default namespaces
x <- read_xml("<foo><bar><baz/></bar><baz/></foo>")
xml_find_all(x, ".//baz")
xml_path(xml_find_all(x, ".//baz"))

# Note the difference between .// and //
# //  finds anywhere in the document (ignoring the current node)
# .// finds anywhere beneath the current node
(bar <- xml_find_all(x, ".//bar"))
xml_find_all(bar, ".//baz")
xml_find_all(bar, "//baz")

# Find all vs find one -----------------------------------------------------
x <- read_xml("<body>
  <p>Some <b>text</b>.</p>
  <p>Some <b>other</b> <b>text</b>.</p>
  <p>No bold here!</p>
</body>")
para <- xml_find_all(x, ".//p")

# By default, if you apply xml_find_all to a nodeset, it finds all matches,
# de-duplicates them, and returns as a single nodeset. This means you
# never know how many results you'll get
xml_find_all(para, ".//b")

# If you set flatten to FALSE, though, xml_find_all will return a list of
# nodesets, where each nodeset contains the matches for the corresponding
# node in the original nodeset.
xml_find_all(para, ".//b", flatten = FALSE)

# xml_find_first only returns the first match per input node. If there are 0
# matches it will return a missing node
xml_find_first(para, ".//b")
xml_text(xml_find_first(para, ".//b"))

# Namespaces ---------------------------------------------------------------
# If the document uses namespaces, you'll need to use xml_ns to form
# a unique mapping between full namespace url and a short prefix
x <- read_xml('
 <root xmlns:f = "http://foo.com" xmlns:g = "http://bar.com">
   <f:doc><g:baz /></f:doc>
   <f:doc><g:baz /></f:doc>
 </root>
')
xml_find_all(x, ".//f:doc")
xml_find_all(x, ".//f:doc", xml_ns(x))
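The scalar variants are useful with XPath functions such as count(), string() and boolean(). The following sketch applies them to the paragraph document from the examples; the specific XPath expressions are illustrative choices.

```r
library(xml2)

x <- read_xml("<body>
  <p>Some <b>text</b>.</p>
  <p>Some <b>other</b> <b>text</b>.</p>
  <p>No bold here!</p>
</body>")

# XPath functions returning a number, a string and a boolean.
n_bold <- xml_find_num(x, "count(.//b)")
first_bold <- xml_find_chr(x, "string(.//b)")
has_italic <- xml_find_lgl(x, "boolean(.//i)")

# Per-paragraph results: a list of nodesets, and missing nodes for no match.
para <- xml_find_all(x, ".//p")
bold_per_para <- vapply(
  xml_find_all(para, ".//b", flatten = FALSE),
  length,
  integer(1)
)
no_bold <- is.na(xml_text(xml_find_first(para, ".//b")))
```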
The (tag) name of an xml element.
Modify the (tag) name of an element
xml_name(x, ns = character())

xml_name(x, ns = character()) <- value

xml_set_name(x, value, ns = character())
x |
A document, node, or node set. |
ns |
Optionally, a named vector giving prefix-url pairs, as produced
by |
value |
a character vector with replacement name. |
A character vector.
x <- read_xml("<bar>123</bar>")
xml_name(x)

y <- read_xml("<bar><baz>1</baz>abc<foo /></bar>")
z <- xml_children(y)
xml_name(xml_children(y))
xml_new_document creates only a new document without a root node. In most cases you should instead use xml_new_root, which creates a new document and assigns the root node in one step.
xml_new_document(version = "1.0", encoding = "UTF-8")

xml_new_root(
  .value,
  ...,
  .copy = inherits(.value, "xml_node"),
  .version = "1.0",
  .encoding = "UTF-8"
)
version |
The version number of the document. |
encoding |
The character encoding to use in the document. The default encoding is ‘UTF-8’. Available encodings are specified at http://xmlsoft.org/html/libxml-encoding.html#xmlCharEncoding. |
.value |
node to insert. |
... |
If named attributes or namespaces to set on the node, if unnamed text to assign to the node. |
.copy |
whether to copy the |
.version |
The version number of the document, passed to |
.encoding |
The encoding of the document, passed to |
An xml_document object.
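Building a small document from scratch might look like the sketch below. The element names, attribute and text are hypothetical; named arguments become attributes and unnamed ones become text, as described for the ... argument.

```r
library(xml2)

# Create the document and its root node in one step.
doc <- xml_new_root("catalog")

# xml_add_child() returns the new node, so children can be nested.
cd <- xml_add_child(doc, "cd", id = "1")
xml_add_child(cd, "title", "Empire Burlesque")

root_name <- xml_name(doc)
cd_id <- xml_attr(cd, "id")
title <- xml_text(xml_find_first(doc, ".//title"))
```

as.character(doc) shows the serialized result.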
xml_ns extracts all namespaces from a document, matching each unique namespace url with the prefix it was first associated with. Default namespaces are named d1, d2 etc. Use xml_ns_rename to change the prefixes. Once you have a namespace object, you can pass it to other functions to work with fully qualified names instead of local names.
xml_ns(x)

xml_ns_rename(old, ...)
x |
A document, node, or node set. |
old, ... |
An existing xml_namespace object followed by name-value (old prefix-new prefix) pairs to replace. |
A character vector with class xml_namespace so the default display is a little nicer.
x <- read_xml('
 <root>
   <doc1 xmlns = "http://foo.com"><baz /></doc1>
   <doc2 xmlns = "http://bar.com"><baz /></doc2>
 </root>
')
xml_ns(x)

# When there are default namespaces, it's a good idea to rename
# them to give informative names:
ns <- xml_ns_rename(xml_ns(x), d1 = "foo", d2 = "bar")
ns

# Now we can pass ns to other xml functions to use fully qualified names
baz <- xml_children(xml_children(x))
xml_name(baz)
xml_name(baz, ns)

xml_find_all(x, "//baz")
xml_find_all(x, "//foo:baz", ns)

str(as_list(x))
str(as_list(x, ns))
Strip the default namespaces from a document
xml_ns_strip(x)
x |
A document, node, or node set. |
x <- read_xml(
  "<foo xmlns = 'http://foo.com'>
   <baz/>
   <bar xmlns = 'http://bar.com'>
     <baz/>
   </bar>
  </foo>"
)
# Need to specify the default namespaces to find the baz nodes
xml_find_all(x, "//d1:baz")
xml_find_all(x, "//d2:baz")

# After stripping the default namespaces you can find both baz nodes directly
xml_ns_strip(x)
xml_find_all(x, "//baz")
This is useful when you want to figure out where nodes matching an xpath expression live in a document.
xml_path(x)
x |
A document, node, or node set. |
A character vector.
x <- read_xml("<foo><bar><baz /></bar><baz /></foo>")
xml_path(xml_find_all(x, ".//baz"))
xml_add_sibling() and xml_add_child() are used to insert a node as a sibling or a child. xml_add_parent() adds a new parent in between the input node and the current parent. xml_replace() replaces an existing node with a new node. xml_remove() removes a node from the tree.
xml_replace(.x, .value, ..., .copy = TRUE)

xml_add_sibling(.x, .value, ..., .where = c("after", "before"), .copy = TRUE)

xml_add_child(.x, .value, ..., .where = length(xml_children(.x)), .copy = TRUE)

xml_add_parent(.x, .value, ...)

xml_remove(.x, free = FALSE)
.x |
a document, node or nodeset. |
.value |
node to insert. |
... |
If named attributes or namespaces to set on the node, if unnamed text to assign to the node. |
.copy |
whether to copy the |
.where |
to add the new node, for |
free |
When removing the node also free the memory used for that node. Note if you use this option you cannot use any existing objects pointing to the node or its children, it is likely to crash R or return garbage. |
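A short sequence of tree edits might look like the sketch below. The document and item values are hypothetical, and the use of .where = 0 to insert a first child is an assumption about how positions are counted.

```r
library(xml2)

doc <- read_xml("<list><item>a</item><item>c</item></list>")
items <- xml_children(doc)

# Insert a new sibling after the first item (the default position).
xml_add_sibling(items[[1]], "item", "b")
after_add <- xml_text(xml_children(doc))

# Remove the last item from the tree.
xml_remove(xml_children(doc)[[3]])
after_remove <- xml_text(xml_children(doc))

# Assumption: .where = 0 places the new node before all existing children.
xml_add_child(doc, "item", "z", .where = 0)
after_prepend <- xml_text(xml_children(doc))
```

Note that these functions modify the document in place, so xml_children(doc) is re-queried after each step.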
Care needs to be taken when using xml_remove()
,
Serializing XML objects to connections.
xml_serialize(object, connection, ...)

xml_unserialize(connection, ...)
object |
R object to serialize. |
connection |
an open connection or (for |
... |
Additional arguments passed to |
For serialize, NULL unless connection = NULL, when the result is returned in a raw vector.
For unserialize an R object.
library(xml2)
x <- read_xml("<a>
  <b><c>123</c></b>
  <b><c>456</c></b>
</a>")

b <- xml_find_all(x, "//b")
out <- xml_serialize(b, NULL)
xml_unserialize(out)
The namespace to be set must be already defined in one of the node's ancestors.
xml_set_namespace(.x, prefix = "", uri = "")
.x |
a node |
prefix |
The namespace prefix to use |
uri |
The namespace URI to use |
the node (invisibly)
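As a rough illustration, the sketch below declares a prefix on the root and then assigns it to a child. The document and the prefix f are hypothetical, and the expectation that the qualified name then carries the prefix is an assumption.

```r
library(xml2)

# The prefix "f" is declared on the root, an ancestor of <doc/>.
x <- read_xml("<root xmlns:f='http://foo.com'><doc/></root>")
ns <- xml_ns(x)
node <- xml_child(x, 1)

name_before <- xml_name(node, ns)

# Assign the already-declared namespace to the node by prefix.
xml_set_namespace(node, prefix = "f")
name_after <- xml_name(node, ns)
```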
Show the structure of an html/xml document without displaying any of the values. This is useful if you want to get a high level view of the way a document is organised. Compared to xml_structure, html_structure prints the id and class attributes.
xml_structure(x, indent = 2, file = "")

html_structure(x, indent = 2, file = "")
x |
HTML/XML document (or part thereof) |
indent |
Number of spaces to indent |
file |
A connection, or a character string naming the file
to print to. If |
xml_structure(read_xml("<a><b><c/><c/></b><d/></a>"))

rproj <- read_html(system.file("extdata", "r-project.html", package = "xml2"))
xml_structure(rproj)
xml_structure(xml_find_all(rproj, ".//p"))

h <- read_html("<body><p id = 'a'></p><p class = 'c d'></p></body>")
html_structure(h)
xml_text returns a character vector, xml_double returns a numeric vector, xml_integer returns an integer vector.
xml_text(x, trim = FALSE)

xml_text(x) <- value

xml_set_text(x, value)

xml_double(x)

xml_integer(x)
x |
A document, node, or node set. |
trim |
If |
value |
character vector with replacement text. |
A character vector, the same length as x.
x <- read_xml("<p>This is some text. This is <b>bold!</b></p>")
xml_text(x)
xml_text(xml_children(x))

x <- read_xml("<x>This is some text. <x>This is some nested text.</x></x>")
xml_text(x)
xml_text(xml_find_all(x, "//x"))

x <- read_xml("<p>   Some text    </p>")
xml_text(x, trim = TRUE)

# xml_double() and xml_integer() are useful for extracting numeric attributes
x <- read_xml("<plot><point x='1' y='2' /><point x='2' y='1' /></plot>")
xml_integer(xml_find_all(x, "//@x"))
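The numeric accessors also work on element text, and the setter form can be combined with them. The following is a small sketch on a hypothetical document of numeric values.

```r
library(xml2)

x <- read_xml("<m><v>1.5</v><v>2</v></m>")
vals <- xml_find_all(x, ".//v")

as_numbers <- xml_double(vals)

# Replace the text of the second node, then read it back as an integer.
xml_set_text(vals[[2]], "3")
second_int <- xml_integer(vals[[2]])

# trim = TRUE strips leading and trailing whitespace.
trimmed <- xml_text(read_xml("<p>   Some text    </p>"), trim = TRUE)
```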
Determine the type of a node.
xml_type(x)
x |
A document, node, or node set. |
x <- read_xml("<foo> a <b /> <![CDATA[ blah]]></foo>")
xml_type(x)
xml_type(xml_contents(x))
This is useful for interpreting relative urls with url_relative().
xml_url(x)
x |
A node or document. |
A character vector of length 1. Returns NA if the name is not set.
catalog <- read_xml(xml2_example("cd_catalog.xml"))
xml_url(catalog)

x <- read_xml("<foo/>")
xml_url(x)
Validate an XML document against an XML 1.0 schema.
xml_validate(x, schema)
x |
A document, node, or node set. |
schema |
an XML document containing the schema |
TRUE or FALSE
# Example from https://msdn.microsoft.com/en-us/library/ms256129(v=vs.110).aspx
doc <- read_xml(system.file("extdata/order-doc.xml", package = "xml2"))
schema <- read_xml(system.file("extdata/order-schema.xml", package = "xml2"))
xml_validate(doc, schema)
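Validation can also be tried with a schema defined inline. The sketch below uses a deliberately tiny, hypothetical schema that accepts a single note element, and assumes that validation messages are attached to the result as an "errors" attribute.

```r
library(xml2)

# Hypothetical minimal schema: the document must be a single <note> string.
schema <- read_xml('
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="note" type="xs:string"/>
  </xs:schema>
')

ok <- xml_validate(read_xml("<note>hi</note>"), schema)
bad <- xml_validate(read_xml("<memo>hi</memo>"), schema)

# Assumption: messages for a failed validation are stored in attr(, "errors").
bad_messages <- attr(bad, "errors")
```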
xml2 comes bundled with a number of sample files in its ‘inst/extdata’ directory. This function makes them easy to access.
xml2_example(path = NULL)
path |
Name of file. If |
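A possible way to explore the bundled files is sketched below. It assumes that calling xml2_example() without a path returns the names of the available files, and it uses cd_catalog.xml, the file referenced elsewhere in these examples.

```r
library(xml2)

# Assumption: with no path, the names of the bundled sample files are returned.
available <- xml2_example()

# Resolve one sample file to its full path and parse it.
path <- xml2_example("cd_catalog.xml")
catalog <- read_xml(path)
```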