distinct: Subset distinct/unique rows (dplyr)

Select only unique/distinct rows from a data frame. This is similar to unique.data.frame() but considerably faster.

distinct(.data, ..., .keep_all = FALSE)
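
The comparison with unique.data.frame() in the description can be checked on a small data frame. This is a minimal sketch: `df0` is a hypothetical toy data frame, and the comparison is limited to column values, because other attributes such as row names may differ between the two results.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 2 are identical
df0 <- data.frame(x = c(1, 1, 2), y = c("a", "a", "b"))

base_result  <- unique(df0)    # base R, dispatches to unique.data.frame()
dplyr_result <- distinct(df0)  # dplyr equivalent

# Both keep one copy of the duplicated row, so the column values agree
identical(base_result$x, dplyr_result$x)
identical(base_result$y, dplyr_result$y)
```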

Arguments

.data	A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.
...	<`data-masking`> Optional variables to use when determining uniqueness. If there are multiple rows for a given combination of inputs, only the first row will be preserved. If omitted, all variables are used.
.keep_all	If `TRUE`, keep all variables in `.data`. If a combination of `...` is not distinct, this keeps the first row of values.
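
As a small sketch of how `...` and `.keep_all` interact, the following uses a hypothetical tibble `orders` (not part of dplyr) in which two rows share the same `id`. It illustrates the "first row is kept" rule described above.

```r
library(dplyr)

# Hypothetical toy data: two rows share id = 1
orders <- tibble(
  id   = c(1, 1, 2),
  item = c("apple", "pear", "plum")
)

# Only `id` is used for uniqueness, and only `id` is returned
ids_only <- distinct(orders, id)

# With .keep_all = TRUE the other columns are kept, taking their values
# from the first row of each `id` (so "apple" rather than "pear")
kept <- distinct(orders, id, .keep_all = TRUE)
kept$item
#> [1] "apple" "plum"
```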

Value

An object of the same type as .data. The output has the following properties:

Rows are a subset of the input but appear in the same order.
Columns are not modified if ... is empty or .keep_all is TRUE. Otherwise, distinct() first calls mutate() to create new columns.
Groups are not modified.
Data frame attributes are preserved.
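
The properties listed above can be seen on a small example. This is a sketch using a hypothetical tibble `scores`; the column names are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data: rows 1 and 3 are identical
scores <- tibble(
  team = c("b", "a", "b", "a"),
  pts  = c(2, 1, 2, 3)
)

# Rows keep their input order: the first "b" row stays ahead of the "a" rows
out <- distinct(scores)
out$team
#> [1] "b" "a" "a"

# An expression in `...` becomes a new column, as mutate() would create it
flagged <- distinct(scores, high = pts > 1)
names(flagged)
#> [1] "high"

# Grouping is carried through to the result
grouped <- scores %>% group_by(team) %>% distinct(pts)
group_vars(grouped)
#> [1] "team"
```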

Methods

This function is a generic, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

The following methods are currently available in loaded packages: dbplyr (tbl_lazy), dplyr (data.frame).
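
To see which methods are registered in a given session, base R's S3 tooling can be used. This is a minimal sketch; the list printed by methods() depends on which packages are loaded, so no output is shown.

```r
library(dplyr)

# List the distinct() methods registered in the current session
utils::methods("distinct")

# Retrieve the data frame method that dplyr provides
df_method <- utils::getS3method("distinct", "data.frame")
is.function(df_method)
```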

Examples

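library(dplyr)

# No seed is set, so the sampled values (and the output below) vary between runs
df <- tibble(
  x = sample(10, 100, replace = TRUE),
  y = sample(10, 100, replace = TRUE)
)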
nrow(df)
#> [1] 100
nrow(distinct(df))
#> [1] 69
nrow(distinct(df, x, y))
#> [1] 69

distinct(df, x)
#> # A tibble: 10 x 1
#>        x
#>    <int>
#>  1     2
#>  2    10
#>  3     7
#>  4     4
#>  5     9
#>  6     6
#>  7     1
#>  8     3
#>  9     5
#> 10     8
distinct(df, y)
#> # A tibble: 10 x 1
#>        y
#>    <int>
#>  1     3
#>  2     1
#>  3     9
#>  4    10
#>  5     2
#>  6     8
#>  7     4
#>  8     7
#>  9     5
#> 10     6

# You can choose to keep all other variables as well
distinct(df, x, .keep_all = TRUE)
#> # A tibble: 10 x 2
#>        x     y
#>    <int> <int>
#>  1     2     3
#>  2    10     1
#>  3     7     9
#>  4     4    10
#>  5     9    10
#>  6     6     4
#>  7     1     2
#>  8     3     2
#>  9     5    10
#> 10     8     7
distinct(df, y, .keep_all = TRUE)
#> # A tibble: 10 x 2
#>        x     y
#>    <int> <int>
#>  1     2     3
#>  2    10     1
#>  3     7     9
#>  4     4    10
#>  5     7     2
#>  6    10     8
#>  7     6     4
#>  8     9     7
#>  9     2     5
#> 10     6     6

# You can also use distinct on computed variables
distinct(df, diff = abs(x - y))
#> # A tibble: 10 x 1
#>     diff
#>    <int>
#>  1     1
#>  2     9
#>  3     2
#>  4     6
#>  5     5
#>  6     8
#>  7     0
#>  8     7
#>  9     3
#> 10     4

# Use across() to access select()-style semantics
distinct(starwars, across(contains("color")))
#> # A tibble: 67 x 3
#>    hair_color    skin_color  eye_color
#>    <chr>         <chr>       <chr>    
#>  1 blond         fair        blue     
#>  2 NA            gold        yellow   
#>  3 NA            white, blue red      
#>  4 none          white       yellow   
#>  5 brown         light       brown    
#>  6 brown, grey   light       blue     
#>  7 brown         light       blue     
#>  8 NA            white, red  red      
#>  9 black         light       brown    
#> 10 auburn, white fair        blue-gray
#> # … with 57 more rows

# Grouping -------------------------------------------------
# The same behaviour applies for grouped data frames,
# except that the grouping variables are always included
df <- tibble(
  g = c(1, 1, 2, 2),
  x = c(1, 1, 2, 1)
) %>% group_by(g)
df %>% distinct(x)
#> # A tibble: 3 x 2
#> # Groups:   g [2]
#>       g     x
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     2
#> 3     2     1