There are many, many ways to subset data frames and tibbles.
This vignette attempts to provide a comprehensive overview of the behavior of the subsetting operators $, [[ and [, highlighting where the tibble implementation differs from the data frame implementation.
Results of the same code for data frames and tibbles are presented side by side:
new_df()
#>   a b         cd
#> 1 1 e          9
#> 2 2 f     10, 11
#> 3 3 g 12, 13, 14
#> 4 4 h       text
new_tbl()
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
In the following, if the results are identical (after converting to a data frame if necessary), only the tibble result is shown, as in the example below. This makes differences easier to spot.
new_tbl()
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
Subsetting operations are read-only. The same objects are reused in all examples:
df <- new_df()
tbl <- new_tbl()
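The helpers new_df() and new_tbl() are not defined in this section. The following is a minimal sketch that would reproduce the objects printed above, assuming the cd column is a list column built by hand; these definitions are an assumption for illustration and not necessarily the ones used to generate the output shown.

```r
library(tibble)

# Hypothetical definitions, chosen only to match the printed output.
new_df <- function() {
  df <- data.frame(a = 1:4, b = letters[5:8], stringsAsFactors = FALSE)
  # A list column: each cell can hold a vector of any type and length.
  df$cd <- list(9, 10:11, 12:14, "text")
  df
}

new_tbl <- function() {
  as_tibble(new_df())
}

df <- new_df()
tbl <- new_tbl()
```

Building the tibble from the data frame keeps both objects identical in content, so any difference in the examples comes from the subsetting methods alone.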
$
With $ subsetting, accessing a missing column of a tibble gives a warning. Inexact (partial) matching of column names is not supported for tibbles:
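A short sketch of this difference, assuming df and tbl hold the columns a, b and cd shown earlier (they are rebuilt here so the snippet runs on its own). The tibble warning is captured with tryCatch() so that the script keeps running.

```r
library(tibble)

# Assumed reconstruction of the example objects.
df <- data.frame(a = 1:4, b = letters[5:8])
df$cd <- list(9, 10:11, 12:14, "text")
tbl <- as_tibble(df)

# Data frame: `$` matches partially, so `c` resolves to the column `cd`.
df_partial <- df$c

# Tibble: no partial matching. The result is NULL and a warning is raised;
# the handler returns the warning message instead.
tbl_warning <- tryCatch(tbl$c, warning = function(w) conditionMessage(w))

# Without the handler, the value itself is NULL.
tbl_value <- suppressWarnings(tbl$c)
```

Exact column names, such as df$cd and tbl$cd, work with both classes.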
#> Error: Can't use NA as column index with
#> `[` at position 1.
df[, NA_integer_]
#> Error in `[.data.frame`(df, ,
#> NA_integer_): undefined columns selected
tbl[, NA_integer_]
#> Error: Can't use NA as column index with
#> `[` at position 1.
Multiple columns can be queried by passing a vector of column indexes (names, positions, or even a logical vector). With a logical vector, tibbles are stricter about its length:
tbl[c("a", "b")]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
tbl[1:2]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
tbl[1:3]
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
df[1:4]
#> Error in `[.data.frame`(df, 1:4):
#> undefined columns selected
tbl[1:4]
#> Error: Can't subset columns that don't
#> exist.
#> x Location 4 doesn't exist.
#> ℹ There are only 3 columns.
tbl[0:2]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
df[-1:2]
#> Error in `[.default`(df, -1:2): only 0's
#> may be mixed with negative subscripts
tbl[-1:2]
#> Error: Must subset columns with a valid
#> subscript vector.
#> x Negative and positive locations can't
#> be mixed.
#> ℹ Subscript `-1:2` has 2 positive values
#> at locations 3 and 4.
tbl[-1]
#> # A tibble: 4 × 2
#>   b     cd
#>   <chr> <list>
#> 1 e     <dbl [1]>
#> 2 f     <int [2]>
#> 3 g     <int [3]>
#> 4 h     <chr [1]>
tbl[c(FALSE, TRUE)]
#> Error: Must subset columns with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 3 but subscript
#> `c(FALSE, TRUE)` has size 2.
tbl[c(FALSE, TRUE, FALSE, TRUE)]
#> Error: Must subset columns with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 3 but subscript
#> `c(FALSE, TRUE, FALSE, TRUE)` has size
#> 4.
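The stricter handling of logical column indexes can be checked directly. This is a sketch under the assumption that df and tbl have the three columns a, b and cd; the objects are rebuilt here so the snippet is self-contained, and the tibble error is caught with tryCatch().

```r
library(tibble)

# Assumed reconstruction of the example objects.
df <- data.frame(a = 1:4, b = letters[5:8])
df$cd <- list(9, 10:11, 12:14, "text")
tbl <- as_tibble(df)

# Data frame: the short logical vector is recycled to FALSE, TRUE, FALSE.
df_cols <- names(df[c(FALSE, TRUE)])

# Tibble: a length-2 logical index for 3 columns is an error.
tbl_short <- tryCatch(tbl[c(FALSE, TRUE)], error = function(e) "error")

# A logical index of length one or of full length is accepted.
tbl_all <- names(tbl[TRUE])
tbl_full <- names(tbl[c(FALSE, TRUE, FALSE)])
```

Spelling out the full-length vector gives the same selection for both classes and avoids relying on recycling.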
The same examples are repeated for two-dimensional indexing when omitting the row index:
tbl[, c("a", "b")]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
tbl[, 1:2]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
tbl[, 1:3]
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
df[, 1:4]
#> Error in `[.data.frame`(df, , 1:4):
#> undefined columns selected
tbl[, 1:4]
#> Error: Can't subset columns that don't
#> exist.
#> x Location 4 doesn't exist.
#> ℹ There are only 3 columns.
tbl[, 0:2]
#> # A tibble: 4 × 2
#>       a b
#>   <int> <chr>
#> 1     1 e
#> 2     2 f
#> 3     3 g
#> 4     4 h
df[, -1:2]
#> Error in .subset(x, j): only 0's may be
#> mixed with negative subscripts
tbl[, -1:2]
#> Error: Must subset columns with a valid
#> subscript vector.
#> x Negative and positive locations can't
#> be mixed.
#> ℹ Subscript `-1:2` has 2 positive values
#> at locations 3 and 4.
tbl[, -1]
#> # A tibble: 4 × 2
#>   b     cd
#>   <chr> <list>
#> 1 e     <dbl [1]>
#> 2 f     <int [2]>
#> 3 g     <int [3]>
#> 4 h     <chr [1]>
tbl[, c(FALSE, TRUE)]
#> Error: Must subset columns with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 3 but subscript
#> `c(FALSE, TRUE)` has size 2.
tbl[, c(FALSE, TRUE, FALSE, TRUE)]
#> Error: Must subset columns with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 3 but subscript
#> `c(FALSE, TRUE, FALSE, TRUE)` has size
#> 4.
Row subsetting with integer indexes works almost identically. Out-of-bounds subsetting is not recommended and may lead to an error in future versions. Another special case is subsetting with [1, , drop = TRUE], where the data frame implementation returns a list.
tbl[1, ]
#> # A tibble: 1 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
tbl[1, , drop = TRUE]
#> # A tibble: 1 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
tbl[1:2, ]
#> # A tibble: 2 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
tbl[0, ]
#> # A tibble: 0 × 3
#> # … with 3 variables: a <int>, b <chr>,
#> #   cd <list>
tbl[integer(), ]
#> # A tibble: 0 × 3
#> # … with 3 variables: a <int>, b <chr>,
#> #   cd <list>
tbl[5, ]
#> # A tibble: 1 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
tbl[4:5, ]
#> # A tibble: 2 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     4 h     <chr [1]>
#> 2    NA <NA>  <NULL>
tbl[-1, ]
#> # A tibble: 3 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     2 f     <int [2]>
#> 2     3 g     <int [3]>
#> 3     4 h     <chr [1]>
df[-1:2, ]
#> Error in xj[i]: only 0's may be mixed
#> with negative subscripts
tbl[-1:2, ]
#> Error: Must subset rows with a valid
#> subscript vector.
#> x Negative and positive locations can't
#> be mixed.
#> ℹ Subscript `-1:2` has 2 positive values
#> at locations 3 and 4.
tbl[NA, ]
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
#> 2    NA <NA>  <NULL>
#> 3    NA <NA>  <NULL>
#> 4    NA <NA>  <NULL>
tbl[NA_integer_, ]
#> # A tibble: 1 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
tbl[c(NA, 1), ]
#> # A tibble: 2 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
#> 2     1 e     <dbl [1]>
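The special case of selecting a single row with drop = TRUE can be inspected by looking at the class of the result. This is a sketch that assumes df and tbl are built as below; only the classes are checked, not the printed form.

```r
library(tibble)

# Assumed reconstruction of the example objects.
df <- data.frame(a = 1:4, b = letters[5:8])
df$cd <- list(9, 10:11, 12:14, "text")
tbl <- as_tibble(df)

# Data frame: a single row with drop = TRUE collapses to a plain list.
df_row <- df[1, , drop = TRUE]

# Tibble: the result is still a one-row tibble.
tbl_row <- tbl[1, , drop = TRUE]

class(df_row)
class(tbl_row)
```

Code that needs a row as a list for both classes can call as.list() on the one-row result instead of relying on drop.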
Row subsetting with logical indexes also works almost identically. With tibbles, the index vector must have length one or the number of rows.
tbl[TRUE, ]
#> # A tibble: 4 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     2 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
tbl[FALSE, ]
#> # A tibble: 0 × 3
#> # … with 3 variables: a <int>, b <chr>,
#> #   cd <list>
df[c(TRUE, FALSE), ]
#>   a b         cd
#> 1 1 e          9
#> 3 3 g 12, 13, 14
tbl[c(TRUE, FALSE), ]
#> Error: Must subset rows with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 4 but subscript
#> `c(TRUE, FALSE)` has size 2.
df[c(TRUE, FALSE, TRUE), ]
#>   a b         cd
#> 1 1 e          9
#> 3 3 g 12, 13, 14
#> 4 4 h       text
tbl[c(TRUE, FALSE, TRUE), ]
#> Error: Must subset rows with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 4 but subscript
#> `c(TRUE, FALSE, TRUE)` has size 3.
tbl[c(TRUE, FALSE, TRUE, FALSE), ]
#> # A tibble: 2 × 3
#>       a b     cd
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     3 g     <int [3]>
df[c(TRUE, FALSE, TRUE, FALSE, TRUE), ]
#>     a    b         cd
#> 1   1    e          9
#> 3   3    g 12, 13, 14
#> NA NA <NA>       NULL
tbl[c(TRUE, FALSE, TRUE, FALSE, TRUE), ]
#> Error: Must subset rows with a valid
#> subscript vector.
#> ℹ Logical subscripts must match the size
#> of the indexed input.
#> x Input has size 4 but subscript
#> `c(TRUE, FALSE, TRUE, FALSE, TRUE)` has
#> size 5.
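The length rule for logical row indexes can be verified in a few lines. The sketch assumes the four-row example objects, rebuilt here so that it runs on its own, and catches the tibble error with tryCatch().

```r
library(tibble)

# Assumed reconstruction of the example objects.
df <- data.frame(a = 1:4, b = letters[5:8])
df$cd <- list(9, 10:11, 12:14, "text")
tbl <- as_tibble(df)

# Data frame: c(TRUE, FALSE) is recycled and selects rows 1 and 3.
df_rows <- rownames(df[c(TRUE, FALSE), ])

# Tibble: a length-2 logical index for 4 rows is an error.
tbl_short <- tryCatch(tbl[c(TRUE, FALSE), ], error = function(e) "error")

# A full-length index selects the same rows without recycling.
tbl_full <- tbl[c(TRUE, FALSE, TRUE, FALSE), ]
```

If every other row is wanted, an explicit index such as rep(c(TRUE, FALSE), length.out = nrow(tbl)) works with both classes.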
Indexing both rows and columns works more or less the same, except for the default of drop, which is FALSE for tibbles:
df[1, "a"]
#> [1] 1
tbl[1, "a"]
#> # A tibble: 1 × 1
#>       a
#>   <int>
#> 1     1
tbl[1, "a", drop = FALSE]
#> # A tibble: 1 × 1
#>       a
#>   <int>
#> 1     1
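To get the same result from both classes, drop can be passed explicitly. This is a small sketch assuming the example objects from above, rebuilt here for self-containment.

```r
library(tibble)

# Assumed reconstruction of the example objects.
df <- data.frame(a = 1:4, b = letters[5:8])
df$cd <- list(9, 10:11, 12:14, "text")
tbl <- as_tibble(df)

# Default behavior differs: a vector for the data frame, a tibble for tbl.
df_cell <- df[1, "a"]
tbl_cell <- tbl[1, "a"]

# Passing drop explicitly aligns the two.
tbl_value <- tbl[1, "a", drop = TRUE]   # bare integer, like df[1, "a"]
df_frame <- df[1, "a", drop = FALSE]    # 1 x 1 data frame, like tbl[1, "a"]
```

Setting drop explicitly in code that must accept both data frames and tibbles avoids depending on either default.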