Tuesday, February 13, 2018

Two ways to dedup in R

To remove duplicate rows from raw data with dplyr:

# simple, but drops every column other than the keys
dfRaw %>%
  distinct(`PK1`, `PK2`, `PK3`) ->
  dfWork


# two more lines, but keeps the other columns, e.g. RID
dfRaw %>%
  group_by(`PK1`, `PK2`, `PK3`) %>%
  mutate(gid = row_number()) %>%  # position of each row within its key group
  filter(gid < 2) ->              # keep only the first row per key
  dfWork
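
A possible third variant, sketched here on made-up data: `distinct()` takes a `.keep_all = TRUE` argument that keeps the first row per key together with its other columns. The toy `dfRaw` below and its values are assumptions for illustration only, and `dfCheck` is a hypothetical name used to compare against the `group_by()` approach.

```r
library(dplyr)

# Hypothetical toy data: PK1..PK3 form the key, RID is an extra column.
dfRaw <- tibble(
  PK1 = c("a", "a", "b"),
  PK2 = c(1, 1, 2),
  PK3 = c("x", "x", "y"),
  RID = c(101, 102, 103)
)

# Keep the first row per key, along with all other columns.
dfRaw %>%
  distinct(PK1, PK2, PK3, .keep_all = TRUE) ->
  dfWork

# The group_by() variant, for comparison (dfCheck is a hypothetical name).
dfRaw %>%
  group_by(PK1, PK2, PK3) %>%
  filter(row_number() == 1) %>%
  ungroup() ->
  dfCheck
```

On this toy input both results should keep one row per key and still carry the RID column. Which row survives depends on the row order of dfRaw, so sort first if a particular duplicate should win.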
