html_encoding_guess() helps you handle web pages that declare an incorrect encoding. Use html_encoding_guess() to generate a list of possible encodings, then try each one out by using the encoding argument of read_html().
html_encoding_guess() replaces the deprecated guess_encoding().
html_encoding_guess(x)
| Argument | Description |
|---|---|
| x | A character vector. |
# A file with bad encoding included in the package
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")
x <- read_html(path)
x %>% html_elements("p") %>% html_text()
#> [1] "\xc9migré cause célèbre déjà vu."

html_encoding_guess(x)
#>     encoding language confidence
#> 1 ISO-8859-1       fr       0.31
#> 2 ISO-8859-2       ro       0.22
#> 3   UTF-16BE                0.10
#> 4   UTF-16LE                0.10
#> 5    GB18030       zh       0.10
#> 6       Big5       zh       0.10
#> 7 ISO-8859-9       tr       0.06
#> 8 IBM424_rtl       he       0.01
#> 9 IBM424_ltr       he       0.01

# Two valid encodings, only one of which is correct
read_html(path, encoding = "ISO-8859-1") %>% html_elements("p") %>% html_text()
#> [1] "Émigré cause célèbre déjà vu."
read_html(path, encoding = "ISO-8859-2") %>% html_elements("p") %>% html_text()
#> [1] "Émigré cause célčbre déjŕ vu."
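
The "guess, then try each" workflow shown above can also be run as a loop, so the candidate decodings can be compared side by side. The following is only a sketch, not part of the package's own examples: the names `candidates` and `decoded` are illustrative, the choice to try only the top two guesses is arbitrary, and the code assumes html_encoding_guess() returns a data frame with an `encoding` column, as in the printed output above.

```r
library(rvest)

# Same sample file as in the example above
path <- system.file("html-ex", "bad-encoding.html", package = "rvest")

# Candidate encodings, in the order html_encoding_guess() returns them;
# keeping only the first two is an arbitrary cut-off for this sketch
guesses <- html_encoding_guess(read_html(path))
candidates <- head(guesses$encoding, 2)

# Re-read the file under each candidate encoding and collect the paragraph
# text; a candidate that fails to parse yields NA instead of stopping the loop
decoded <- vapply(candidates, function(enc) {
  tryCatch(
    read_html(path, encoding = enc) %>%
      html_elements("p") %>%
      html_text() %>%
      paste(collapse = " "),
    error = function(e) NA_character_
  )
}, character(1))

# Named character vector: one decoded string per candidate encoding
decoded
```

The result still needs to be checked by eye: as the ISO-8859-1 and ISO-8859-2 results above show, more than one encoding can decode without error while only one produces the intended text.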