More easily extract pieces out of HTML documents using XPath and CSS selectors. CSS selectors are particularly useful in conjunction with http://selectorgadget.com/, which makes it easy to find exactly which selector you should be using. If you haven't used CSS selectors before, work your way through the fun tutorial at http://flukeout.github.io/.

    html_nodes(x, css, xpath)
    html_node(x, css, xpath)

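| Argument | Description |
|---|---|
| x | Either a document, a node set or a single node. |
| css, xpath | Nodes to select. Supply one of css or xpath, depending on whether you want to use a CSS or XPath 1.0 selector. |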
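
As a minimal sketch of the two arguments, the snippet below selects the same nodes once through css and once through xpath. The inline HTML string is a made-up document used only for illustration; it is not part of the package examples.

```r
library(rvest)

# Hypothetical inline document, used only for illustration
doc <- read_html(
  "<html><body><div id='a'><p>one</p><p>two</p></div><p>three</p></body></html>"
)

# Select the paragraphs inside the div with a CSS selector
by_css <- html_nodes(doc, css = "div p")

# Select the same paragraphs with an XPath 1.0 expression
by_xpath <- html_nodes(doc, xpath = "//div//p")

html_text(by_css)
html_text(by_xpath)
```

Both calls should return the text of the two paragraphs nested in the div; which form to use is mostly a matter of which selector language is more convenient for the query.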
html_node vs html_nodes

html_node is like [[: it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length; the result of html_nodes might be longer or shorter.
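
The difference in length can be seen on a small example. This is a sketch on a made-up list (the HTML is an assumption for illustration), where one item has no match and the others have two each.

```r
library(rvest)

# Hypothetical document: the second <li> contains no <b> element
doc <- read_html("<ul>
  <li><b>a</b><b>b</b></li>
  <li>no bold here</li>
  <li><b>c</b><b>d</b></li>
</ul>")

items <- html_nodes(doc, "li")
length(items)

# html_nodes() collects every match across all inputs,
# so the result can be longer or shorter than the input
all_b <- html_nodes(items, "b")
length(all_b)

# html_node() returns one result per input node: the first match,
# or a missing node where nothing matches
first_b <- html_node(items, "b")
length(first_b)
html_text(first_b)
```

Because html_node keeps one entry per input, its result lines up with the input nodes, which is convenient when each input corresponds to one row of a table you are building.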
CSS selectors are translated to XPath selectors by the selectr package, which is a port of the Python cssselect library, https://pythonhosted.org/cssselect/.
It implements the majority of CSS3 selectors, as described in http://www.w3.org/TR/2011/REC-css3-selectors-20110929/. The exceptions are listed below:
Pseudo selectors that require interactivity are ignored: :hover, :active, :focus, :target, :visited.
The following pseudo classes don't work with the wild card element, *: *:first-of-type, *:last-of-type, *:nth-of-type, *:nth-last-of-type, *:only-of-type.
It supports :contains(text).
You can use !=: [foo!=bar] is the same as :not([foo=bar]).
:not() accepts a sequence of simple selectors, not just a single simple selector.
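
A short sketch of the :contains() and != extensions, run against a made-up fragment (the HTML and class names are assumptions for illustration). The last call uses selectr::css_to_xpath() to show the XPath that a selector is translated to; its exact output may vary between selectr versions, so it is not reproduced here.

```r
library(rvest)

# Hypothetical fragment used only for illustration
doc <- read_html("<div>
  <p class='x'>alpha</p>
  <p class='y'>beta</p>
  <p>gamma</p>
  <span class='x'>alpha span</span>
</div>")

# :contains(text) keeps elements whose text contains the given string
html_text(html_nodes(doc, "p:contains('alpha')"))

# [class!=x] and :not([class=x]) should select the same nodes
neq <- html_nodes(doc, "p[class!=x]")
not_eq <- html_nodes(doc, "p:not([class=x])")
identical(html_text(neq), html_text(not_eq))

# Inspect the XPath expression generated for a CSS selector
selectr::css_to_xpath("p:not([class=x])")
```

Note that both negated forms also match the paragraph that has no class attribute at all.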

    # CSS selectors ----------------------------------------------
    url <- paste0(
      "https://web.archive.org/web/20190202054736/",
      "https://www.boxofficemojo.com/movies/?id=ateam.htm"
    )
    ateam <- read_html(url)
    html_nodes(ateam, "center")
    #> {xml_nodeset (1)}
    #> [1] <center><table border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcd ...
    html_nodes(ateam, "center font")
    #> {xml_nodeset (1)}
    #> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
    html_nodes(ateam, "center font b")
    #> {xml_nodeset (1)}
    #> [1] <b>$77,222,099</b>

    # But html_node is best used in conjunction with %>% from magrittr
    # You can chain subsetting:
    ateam %>% html_nodes("center") %>% html_nodes("td")
    #> {xml_nodeset (7)}
    #> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
    #> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
    #> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
    #> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
    #> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
    #> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
    #> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>
    ateam %>% html_nodes("center") %>% html_nodes("font")
    #> {xml_nodeset (1)}
    #> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

    td <- ateam %>% html_nodes("center") %>% html_nodes("td")
    td
    #> {xml_nodeset (7)}
    #> [1] <td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$7 ...
    #> [2] <td valign="top">Distributor: <b><a href="/web/20190202054736/https://www ...
    #> [3] <td valign="top">Release Date: <b><nobr><a href="/web/20190202054736/http ...
    #> [4] <td valign="top">Genre: <b>Action</b>\n</td>\n
    #> [5] <td valign="top">Runtime: <b>1 hrs. 57 min.</b>\n</td>
    #> [6] <td valign="top">MPAA Rating: <b>PG-13</b>\n</td>\n
    #> [7] <td valign="top">Production Budget: <b>$110 million</b>\n</td>

    # When applied to a list of nodes, html_nodes() returns all nodes,
    # collapsing results into a new nodelist.
    td %>% html_nodes("font")
    #> {xml_nodeset (1)}
    #> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>

    # html_node() returns the first matching node. If there are no matching
    # nodes, it returns a "missing" node
    if (utils::packageVersion("xml2") > "0.1.2") {
      td %>% html_node("font")
    }
    #> {xml_nodeset (7)}
    #> [1] <font size="4">Domestic Total Gross: <b>$77,222,099</b></font>
    #> [2] <NA>
    #> [3] <NA>
    #> [4] <NA>
    #> [5] <NA>
    #> [6] <NA>
    #> [7] <NA>

    # To pick out an element at specified position, use magrittr::extract2
    # which is an alias for [[
    library(magrittr)
    ateam %>% html_nodes("table") %>% extract2(1) %>% html_nodes("img")
    #> {xml_nodeset (6)}
    #> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
    #> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
    #> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
    #> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...
    ateam %>% html_nodes("table") %>% `[[`(1) %>% html_nodes("img")
    #> {xml_nodeset (6)}
    #> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
    #> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
    #> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
    #> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...

    # Find all images contained in the first two tables
    ateam %>% html_nodes("table") %>% `[`(1:2) %>% html_nodes("img")
    #> {xml_nodeset (6)}
    #> [1] <img src="https://web.archive.org/web/20190202054736im_/https://m.media-a ...
    #> [2] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [3] <img src="//web.archive.org/web/20190202054736im_/https://www.assoc-amazo ...
    #> [4] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/b ...
    #> [5] <img src="/web/20190202054736im_/https://www.boxofficemojo.com/img/misc/I ...
    #> [6] <img src="https://web.archive.org/web/20190202054736im_/http://b.scorecar ...

    # XPath selectors ---------------------------------------------
    # chaining with XPath is a little trickier - you may need to vary
    # the prefix you're using - // always selects from the root node
    # regardless of where you currently are in the doc
    ateam %>%
      html_nodes(xpath = "//center//font//b") %>%
      html_nodes(xpath = "//b")
    #> {xml_nodeset (21)}
    #>  [1] <b>Adjuster:</b>
    #>  [2] <b>The A-Team</b>
    #>  [3] <b>$77,222,099</b>
    #>  [4] <b><a href="/web/20190202054736/https://www.boxofficemojo.com/studio/cha ...
    #>  [5] <b><nobr><a href="/web/20190202054736/https://www.boxofficemojo.com/sche ...
    #>  [6] <b>Action</b>
    #>  [7] <b>1 hrs. 57 min.</b>
    #>  [8] <b>PG-13</b>
    #>  [9] <b>$110 million</b>
    #> [10] <b>Domestic:</b>
    #> [11] <b>$77,222,099</b>
    #> [12] <b>43.6%</b>
    #> [13] <b>Worldwide:</b>
    #> [14] <b>$177,238,796</b>
    #> [15] <b>> View All 14 Weekends</b>
    #> [16] <b>Showdown: 'Men-on-a-Mission'</b>
    #> [17] <b>4</b>
    #> [18] <b>Chart</b>
    #> [19] <b>Rank</b>
    #> [20] <b>Charts (Premier Pass Users Only)</b>
    #> ...
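
The prefix issue in the XPath example can be reproduced on a small made-up document (the HTML below is an assumption for illustration). The sketch contrasts the // prefix, which restarts from the document root, with the .// prefix, which searches only below the current node.

```r
library(rvest)

# Hypothetical document: one <b> inside the div, one outside it
doc <- read_html("<div id='main'><b>inside</b></div><b>outside</b>")

main <- html_nodes(doc, xpath = "//div[@id='main']")

# '//b' selects from the root, so it also finds the <b> outside the div
from_root <- html_nodes(main, xpath = "//b")
html_text(from_root)

# './/b' is relative to the current node, so only the nested <b> is found
relative <- html_nodes(main, xpath = ".//b")
html_text(relative)
```

When chaining XPath calls, the relative .// form is usually the one that matches the behaviour of chained CSS selectors.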