Select nodes from an HTML document

Extract pieces of an HTML document more easily using XPath and CSS selectors. CSS selectors are particularly useful in conjunction with http://selectorgadget.com/, which makes it easy to find exactly which selector you should be using. If you haven't used CSS selectors before, work your way through the fun tutorial at http://flukeout.github.io/.

html_nodes(x, css, xpath)

html_node(x, css, xpath)

Arguments

x	Either a document, a node set or a single node.
css, xpath	Nodes to select. Supply one of `css` or `xpath` depending on whether you want to use a CSS or XPath 1.0 selector.
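
As a minimal sketch of the choice between `css` and `xpath`, the snippet below selects the same nodes both ways. The inline HTML string and the variable names are made up for illustration and are not part of the package examples.

```r
library(rvest)

# Hypothetical document, parsed from a literal string (no network access needed)
doc <- read_html("<html><body>
  <div class='movie'><p>First</p><p>Second</p></div>
  <div class='other'><p>Third</p></div>
</body></html>")

# Supply exactly one of css or xpath
by_css   <- html_nodes(doc, css = "div.movie p")
by_xpath <- html_nodes(doc, xpath = "//div[@class='movie']//p")

html_text(by_css)
html_text(by_xpath)
```

Both calls should pick out the two paragraphs inside the `movie` div, so which form to use is mostly a matter of which selector language is easier to write for the page at hand.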

`html_node` vs `html_nodes`

html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, whereas the result of html_nodes might be longer or shorter.
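
The difference in length can be sketched on a small, made-up document (the HTML and variable names below are illustrative assumptions, not taken from the package):

```r
library(rvest)

# Hypothetical list: two <b> in the first item, none in the second, two in the third
doc <- read_html("<html><body><ul>
  <li><b>A</b><b>B</b></li>
  <li>no bold here</li>
  <li><b>C</b><b>D</b></li>
</ul></body></html>")

items <- html_nodes(doc, "li")

# html_node(): one result per input node, with a missing node where nothing matches
first_b <- html_node(items, "b")
# html_nodes(): every match from every input node, collapsed into one node set
all_b <- html_nodes(items, "b")

length(items)
length(first_b)
length(all_b)
html_text(all_b)
```

Here `first_b` should have the same length as `items`, while `all_b` has one entry per matching `<b>`, which happens to be longer in this case and would be shorter if fewer items contained a match.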

CSS selector support

CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.

It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:

Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wildcard element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
It supports :contains(text).
You can use !=: [foo!=bar] is the same as :not([foo=bar]).
:not() accepts a sequence of simple selectors, not just a single simple selector.
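
The non-standard selectors listed above can be tried on a small example. This is a sketch only: the document and class names are invented for illustration, and the exact matching rules come from selectr.

```r
library(rvest)

# Hypothetical document: two classed paragraphs and one without a class
doc <- read_html("<html><body>
  <p class='a'>apple pie</p>
  <p class='b'>banana split</p>
  <p>cherry tart</p>
</body></html>")

# :contains(text) matches on the text content of a node
with_apple <- html_text(html_nodes(doc, "p:contains('apple')"))

# [foo!=bar] and :not([foo=bar]) are described as equivalent
not_a_neq <- html_text(html_nodes(doc, "p[class!='a']"))
not_a_not <- html_text(html_nodes(doc, "p:not([class='a'])"))

with_apple
not_a_neq
not_a_not
```

Note that, in this sketch, the paragraph with no `class` attribute is expected to be matched by both negated forms, since it does not have `class='a'`.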

Examples

# CSS selectors ----------------------------------------------
url <- paste0(
  "https://web.archive.org/web/20190202054736/",
  "https://www.boxofficemojo.com/movies/?id=ateam.htm"
)
ateam <- read_html(url)
html_nodes(ateam, "center")
#> {xml_nodeset (1)}
#> [1] <center><table border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcd ...
html_nodes(ateam, "center font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
html_nodes(ateam, "center font b")
#> {xml_nodeset (1)}
#> [1] <b>$77,222,099</b>

# html_node() and html_nodes() are best used in conjunction with %>% from magrittr.
# You can chain subsetting:
ateam %>% html_nodes("center") %>% html_nodes("td")
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
ateam %>% html_nodes("center") %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

td <- ateam %>% html_nodes("center") %>% html_nodes("td")
td
#> {xml_nodeset (7)}
#> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
#> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
#> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
#> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
#> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
#> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
#> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
# When applied to a list of nodes, html_nodes() returns all nodes,
# collapsing results into a new nodelist.
td %>% html_nodes("font")
#> {xml_nodeset (1)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
# html_node() returns the first matching node. If there are no matching
# nodes, it returns a "missing" node
if (utils::packageVersion("xml2") > "0.1.2") {
  td %>% html_node("font")
}
#> {xml_nodeset (7)}
#> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
#> [2] <NA>
#> [3] <NA>
#> [4] <NA>
#> [5] <NA>
#> [6] <NA>
#> [7] <NA>

# To pick out an element at a specified position, use magrittr::extract2,
# which is an alias for [[
library(magrittr)
ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...

# Find all images contained in the first two tables
ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
ateam %>% html_nodes("table") %>% extract(1:2) %>% html_nodes("img")
#> {xml_nodeset (6)}
#> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
#> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
#> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
#> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
#> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...

# XPath selectors ---------------------------------------------
# Chaining with XPath is a little trickier: you may need to vary
# the prefix you're using. // always selects from the root node,
# regardless of where you currently are in the doc.
ateam %>%
  html_nodes(xpath = "//center//font//b") %>%
  html_nodes(xpath = "//b")
#> {xml_nodeset (21)}
#>  [1] <b>Adjuster:</b>
#>  [2] <b>The A-Team</b>
#>  [3] <b>$77,222,099</b>
#>  [4] <b><a href="/web/20190202054736/https://www.boxofficemojo.com/studio/cha ...
#>  [5] <b><nobr><a href="/web/20190202054736/https://www.boxofficemojo.com/sche ...
#>  [6] <b>Action</b>
#>  [7] <b>1 hrs. 57 min.</b>
#>  [8] <b>PG-13</b>
#>  [9] <b>$110 million</b>
#> [10] <b>Domestic:</b>
#> [11] <b>$77,222,099</b>
#> [12] <b>43.6%</b>
#> [13] <b>Worldwide:</b>
#> [14] <b>$177,238,796</b>
#> [15] <b>&gt; View All 14 Weekends</b>
#> [16] <b>Showdown: 'Men-on-a-Mission'</b>
#> [17] <b>4</b>
#> [18] <b>Chart</b>
#> [19] <b>Rank</b>
#> [20] <b>Charts (Premier Pass Users Only)</b>
#> ...
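
The effect of the XPath prefix when chaining can be seen more directly on a small, made-up document. The sketch below assumes the standard XPath 1.0 meaning of a leading `.` (the current node); the HTML and variable names are illustrative only.

```r
library(rvest)

# Hypothetical document: one <b> inside <center><font>, one outside
doc <- read_html("<html><body>
  <center><font><b>inside</b></font></center>
  <p><b>outside</b></p>
</body></html>")

start <- html_nodes(doc, xpath = "//center//font")

# "//b" restarts from the document root, so it also finds the <b> outside <center>
from_root <- html_nodes(start, xpath = "//b")
# ".//b" searches only the descendants of the current nodes
from_here <- html_nodes(start, xpath = ".//b")

html_text(from_root)
html_text(from_here)
```

Prefixing the chained expression with `.` is one way to keep the search relative to the nodes already selected, which is usually what is intended when chaining.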
