readxl Workflows

Ideas for using readxl to increase reproducibiliy and reduce tedium.

Reproducibility is much easier in theory than in reality. Here are some special dilemmas we face with spreadsheets:

We say: Keep raw data raw! But we also say: Store data in a tool-agnostic, future-proof format! If data comes in the form of an .xls[x] file, we’re in a pickle. The .xls[x] file should obviously be preserved, and probably write-protected. But a faithful copy as CSV is a wonderful complement, as long as you can ensure the two are the same.
We say: Don’t do analysis “by hand” or “by mouse”! But then we break this rule by manually exporting spreadsheets to file with File > Save As … > (save to .csv). readxl helps you get data directly out of a spreadsheet and into R, where you can record every step of your analysis as code. Below we show how to cache a CSV snapshot as part of this process.

The examples below also demonstrate the use of functional programming or “apply” techniques to iterate over the worksheets in a workbook.

Load packages

We load the tidyverse metapackage here because the workflows below show readxl working with readr, purrr, etc. See the last section for solutions using base R only (other than readxl).

We must load readxl explicitly because it is not part of the core tidyverse.

library(tidyverse)
#> ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
#> ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
#> ✓ tibble  2.1.3     ✓ dplyr   0.8.4
#> ✓ tidyr   1.0.2     ✓ stringr 1.4.0
#> ✓ readr   1.3.1     ✓ forcats 0.4.0
#> ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag()    masks stats::lag()
library(readxl)

Cache a CSV snapshot

Break analyses into logical steps, via a series of scripts that relate to one theme, such as “clean the data” or “make exploratory and diagnostic plots”.

This forces you to transmit info from step i to step i + 1 via a set of output files. The cumulative outputs of steps 1, 2, …, i are available as inputs for steps i + 1 and beyond.

These outputs constitute an API for your analysis, i.e. they provide clean entry points that can be used (and understood) in isolation, possibly using an entirely different toolkit. Contrast this with the alternative of writing one monolithic script or transmitting entire workspaces via save(), load(), and R-specific .rds files.

If raw data is stored only as an Excel spreadsheet, this limits your ability to inspect it when solving the little puzzles that crop up in dowstream work. You’ll need to fire up Excel (or similar) and get busy with your mouse. You certainly can’t poke around it or view diffs on GitHub.

Solution: cache a CSV snapshot of your raw data tables at the time of export. Even if you use read_excel() for end-to-end reproducibility, this complementary CSV leaves your analysis in a more accessible state.

Pipe the output of read_excel() directly into readr::write_csv() like so:

iris_xl <- readxl_example("datasets.xlsx") %>% 
  read_excel(sheet = "iris") %>% 
  write_csv("iris-raw.csv")

Why does this work? readr::write_csv() is a well-mannered “write” function: it does its main job and returns its input invisibly. The above command reads the iris sheet from readxl’s datasets.xlsx example workbook and caches a CSV version of the resulting data frame to file.

Let’s check. Did we still import the data? Did we write the CSV file?

iris_xl
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> # … with 147 more rows
dir(pattern = "iris")
#> [1] "iris-raw.csv"

Yes! Is the data written to CSV an exact copy of what we imported from Excel?

iris_alt <- read_csv("iris-raw.csv")
#> Parsed with column specification:
#> cols(
#>   Sepal.Length = col_double(),
#>   Sepal.Width = col_double(),
#>   Petal.Length = col_double(),
#>   Petal.Width = col_double(),
#>   Species = col_character()
#> )
## readr leaves a note-to-self in `spec` that records its column guessing,
## so we remove that attribute before the check
attr(iris_alt, "spec") <- NULL
identical(iris_xl, iris_alt)
#> [1] FALSE

Yes! If we needed to restart or troubleshoot this fictional analysis, iris-raw.csv is available as a second, highly accessible alternative to datasets.xlsx.

Iterate over multiple worksheets in a workbook

Some Excel workbooks contain only data and you are tempted to ask “Why, God, why is this data stored in Excel? Why not store this as a series of CSV files?” One possible answer is this: because the workbook structure keeps them all together.

Let’s accept that this happens and that it is not entirely crazy. How can you efficiently load all of that into R?

Here’s how to load all the sheets in a workbook into a list of data frames:

Get worksheet names as a self-named character vector (these names propagate nicely).
Use purrr::map() to iterate sheet reading.

path <- readxl_example("datasets.xlsx")
path %>% 
  excel_sheets() %>% 
  set_names() %>% 
  map(read_excel, path = path)
#> $iris
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> # … with 147 more rows
#> 
#> $mtcars
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
#> 
#> $chickwts
#> # A tibble: 71 x 2
#>   weight feed     
#>    <dbl> <chr>    
#> 1    179 horsebean
#> 2    160 horsebean
#> 3    136 horsebean
#> # … with 68 more rows
#> 
#> $quakes
#> # A tibble: 1,000 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> # … with 997 more rows

CSV caching and iterating over sheets

What if we want to read all the sheets in at once and simultaneously cache to CSV? We define read_then_csv() as read_excel(...) %>% write_csv() and use purrr::map() again.

read_then_csv <- function(sheet, path) {
  pathbase <- path %>%
    basename() %>%
    tools::file_path_sans_ext()
  path %>%
    read_excel(sheet = sheet) %>% 
    write_csv(paste0(pathbase, "-", sheet, ".csv"))
}

We could even define this on-the-fly as an anonymous function inside map(), but I think this is more readable.

path <- readxl_example("datasets.xlsx")
path %>%
  excel_sheets() %>%
  set_names() %>% 
  map(read_then_csv, path = path)
#> $iris
#> # A tibble: 150 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <chr>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa 
#> # … with 147 more rows
#> 
#> $mtcars
#> # A tibble: 32 x 11
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3  22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
#> 
#> $chickwts
#> # A tibble: 71 x 2
#>   weight feed     
#>    <dbl> <chr>    
#> 1    179 horsebean
#> 2    160 horsebean
#> 3    136 horsebean
#> # … with 68 more rows
#> 
#> $quakes
#> # A tibble: 1,000 x 5
#>     lat  long depth   mag stations
#>   <dbl> <dbl> <dbl> <dbl>    <dbl>
#> 1 -20.4  182.   562   4.8       41
#> 2 -20.6  181.   650   4.2       15
#> 3 -26    184.    42   5.4       43
#> # … with 997 more rows
dir(pattern = "^datasets.*\\.csv$")
#> [1] "datasets-chickwts.csv" "datasets-iris.csv"     "datasets-mtcars.csv"  
#> [4] "datasets-quakes.csv"

In a real analysis, starting with workbook "foo.xlsx", you might want to create the directory foo and place the CSVs inside that.

Concatenate worksheets into one data frame

What if the datasets found on different sheets have the same variables? Then you’ll want to row-bind them, after import, to form one big, beautiful data frame.

readxl ships with an example sheet deaths.xlsx, containing data on famous people who died in 2016 or 2017. It has two worksheets named “arts” and “other”, but the spreadsheet layout is the same in each and the data tables have the same variables, e.g., name and date of death.

The map_df() function from purrr makes it easy to iterate over worksheets and glue together the resulting data frames, all at once.

Store a self-named vector of worksheet names (critical for the ID variable below).
Use purrr::map_df() to import the data, create an ID variable for the source worksheet, and row bind.

path <- readxl_example("deaths.xlsx")
deaths <- path %>%
  excel_sheets() %>%
  set_names() %>% 
  map_df(~ read_excel(path = path, sheet = .x, range = "A5:F15"), .id = "sheet")
print(deaths, n = Inf)
#> # A tibble: 20 x 7
#>    sheet Name  Profession   Age `Has kids` `Date of birth`    
#>    <chr> <chr> <chr>      <dbl> <lgl>      <dttm>             
#>  1 arts  Davi… musician      69 TRUE       1947-01-08 00:00:00
#>  2 arts  Carr… actor         60 TRUE       1956-10-21 00:00:00
#>  3 arts  Chuc… musician      90 TRUE       1926-10-18 00:00:00
#>  4 arts  Bill… actor         61 TRUE       1955-05-17 00:00:00
#>  5 arts  Prin… musician      57 TRUE       1958-06-07 00:00:00
#>  6 arts  Alan… actor         69 FALSE      1946-02-21 00:00:00
#>  7 arts  Flor… actor         82 TRUE       1934-02-14 00:00:00
#>  8 arts  Harp… author        89 FALSE      1926-04-28 00:00:00
#>  9 arts  Zsa … actor         99 TRUE       1917-02-06 00:00:00
#> 10 arts  Geor… musician      53 FALSE      1963-06-25 00:00:00
#> 11 other Vera… scientist     88 TRUE       1928-07-23 00:00:00
#> 12 other Moha… athlete       74 TRUE       1942-01-17 00:00:00
#> 13 other Morl… journalist    84 TRUE       1931-11-08 00:00:00
#> 14 other Fide… politician    90 TRUE       1926-08-13 00:00:00
#> 15 other Anto… lawyer        79 TRUE       1936-03-11 00:00:00
#> 16 other Jo C… politician    41 TRUE       1974-06-22 00:00:00
#> 17 other Jane… lawyer        78 FALSE      1938-07-21 00:00:00
#> 18 other Gwen… journalist    61 FALSE      1955-09-29 00:00:00
#> 19 other John… astronaut     95 TRUE       1921-07-28 00:00:00
#> 20 other Pat … coach         64 TRUE       1952-06-14 00:00:00
#> # … with 1 more variable: `Date of death` <dttm>

Note the use of range = "A5:E15" here. deaths.xlsx is a typical spreadsheet and includes a few non-data lines at the top and bottom and this argument specifies where the data rectangle lives.

Putting it all together

All at once now:

Multiple worksheets feeding into one data frame
Sheet-specific target rectangles
Cache to CSV upon import

Even though the worksheets in deaths.xlsx have the same layout, we’ll pretend they don’t and specify the target rectangle in two different ways here. This shows how this can work if each worksheet has it’s own peculiar geometry. Here’s the workflow:

Store a self-named vector of worksheet names.
Store a vector of cell range specifications.
Use purrr::map2_df() to iterate over those two vectors in parallel, importing the data, row binding, and creating an ID variable for the source worksheet.
Cache the unified data to CSV.

path <- readxl_example("deaths.xlsx")
sheets <- path %>%
  excel_sheets() %>% 
  set_names()
ranges <- list("A5:F15", cell_rows(5:15))
deaths <- map2_df(
  sheets,
  ranges,
  ~ read_excel(path, sheet = .x, range = .y),
  .id = "sheet"
) %>%
  write_csv("deaths.csv")
print(deaths, n = Inf)
#> # A tibble: 20 x 7
#>    sheet Name      Profession   Age `Has kids` `Date of birth`  `Date of death` 
#>    <chr> <chr>     <chr>      <dbl> <lgl>      <chr>            <chr>           
#>  1 arts  David Bo… musician      69 TRUE       1947-01-08T00:0… 2016-01-10T00:0…
#>  2 arts  Carrie F… actor         60 TRUE       1956-10-21T00:0… 2016-12-27T00:0…
#>  3 arts  Chuck Be… musician      90 TRUE       1926-10-18T00:0… 2017-03-18T00:0…
#>  4 arts  Bill Pax… actor         61 TRUE       1955-05-17T00:0… 2017-02-25T00:0…
#>  5 arts  Prince    musician      57 TRUE       1958-06-07T00:0… 2016-04-21T00:0…
#>  6 arts  Alan Ric… actor         69 FALSE      1946-02-21T00:0… 2016-01-14T00:0…
#>  7 arts  Florence… actor         82 TRUE       1934-02-14T00:0… 2016-11-24T00:0…
#>  8 arts  Harper L… author        89 FALSE      1926-04-28T00:0… 2016-02-19T00:0…
#>  9 arts  Zsa Zsa … actor         99 TRUE       1917-02-06T00:0… 2016-12-18T00:0…
#> 10 arts  George M… musician      53 FALSE      1963-06-25T00:0… 2016-12-25T00:0…
#> 11 other Vera Rub… scientist     88 TRUE       1928-07-23T00:0… 2016-12-25T00:0…
#> 12 other Mohamed … athlete       74 TRUE       1942-01-17T00:0… 2016-06-03T00:0…
#> 13 other Morley S… journalist    84 TRUE       1931-11-08T00:0… 2016-05-19T00:0…
#> 14 other Fidel Ca… politician    90 TRUE       1926-08-13T00:0… 2016-11-25T00:0…
#> 15 other Antonin … lawyer        79 TRUE       1936-03-11T00:0… 2016-02-13T00:0…
#> 16 other Jo Cox    politician    41 TRUE       1974-06-22T00:0… 2016-06-16T00:0…
#> 17 other Janet Re… lawyer        78 FALSE      1938-07-21T00:0… 2016-11-07T00:0…
#> 18 other Gwen Ifi… journalist    61 FALSE      1955-09-29T00:0… 2016-11-14T00:0…
#> 19 other John Gle… astronaut     95 TRUE       1921-07-28T00:0… 2016-12-08T00:0…
#> 20 other Pat Summ… coach         64 TRUE       1952-06-14T00:0… 2016-06-28T00:0…

Base version

Rework examples from above but using base R only, other than readxl.

Cache a CSV snapshot

iris_xl <- read_excel(readxl_example("datasets.xlsx"), sheet = "iris")
write.csv(iris_xl, "iris-raw.csv", row.names = FALSE, quote = FALSE)
iris_alt <- read.csv("iris-raw.csv", stringsAsFactors = FALSE)
## coerce iris_xl back to a data.frame
identical(as.data.frame(iris_xl), iris_alt)

Iterate over multiple worksheets in a workbook

path <- readxl_example("datasets.xls")
sheets <- excel_sheets(path)
xl_list <- lapply(excel_sheets(path), read_excel, path = path) 
names(xl_list) <- sheets

CSV caching and iterating over sheets

read_then_csv <- function(sheet, path) {
  pathbase <- tools::file_path_sans_ext(basename(path))
  df <- read_excel(path = path, sheet = sheet)
  write.csv(df, paste0(pathbase, "-", sheet, ".csv"),
            quote = FALSE, row.names = FALSE)
  df
}
path <- readxl_example("datasets.xlsx")
sheets <- excel_sheets(path)
xl_list <- lapply(excel_sheets(path), read_then_csv, path = path)
names(xl_list) <- sheets

Concatenate worksheets into one data frame

path <- readxl_example("deaths.xlsx")
sheets <- excel_sheets(path)
xl_list <-
  lapply(excel_sheets(path), read_excel, path = path, range = "A5:F15")
xl_list <- lapply(seq_along(sheets), function(i) {
  data.frame(sheet = I(sheets[i]), xl_list[[i]])
})
xl_list <- do.call(rbind, xl_list)

Putting it all together

path <- readxl_example("deaths.xlsx")
sheets <- excel_sheets(path)
ranges <- list("A5:F15", cell_rows(5:15))
xl_list <- mapply(function(x, y) {
  read_excel(path = path, sheet = x, range = y)
}, sheets, ranges, SIMPLIFY = FALSE)
xl_list <- lapply(seq_along(sheets), function(i) {
  data.frame(sheet = I(sheets[i]), xl_list[[i]])
})
xl_list <- do.call(rbind, xl_list)
write.csv(xl_list, "deaths.csv", row.names = FALSE, quote = FALSE)

Load packages

Cache a CSV snapshot

Iterate over multiple worksheets in a workbook

CSV caching and iterating over sheets

Concatenate worksheets into one data frame

Putting it all together

Base version

Cache a CSV snapshot

Iterate over multiple worksheets in a workbook

CSV caching and iterating over sheets

Concatenate worksheets into one data frame

Putting it all together

Contents