Comparing data frames

Problem

You want to do compare two or more data frames and find rows that appear in more than one data frame, or rows that appear only in one data frame.

Solution

An example

Suppose you have the following three data frames, and you want to know whether each row from each data frame appears in at least one of the other data frames.

  1. dfA <- data.frame(Subject=c(1,1,2,2), Response=c("X","X","X","X"))
  2. dfA
  3. #> Subject Response
  4. #> 1 1 X
  5. #> 2 1 X
  6. #> 3 2 X
  7. #> 4 2 X
  8. dfB <- data.frame(Subject=c(1,2,3), Response=c("X","Y","X"))
  9. dfB
  10. #> Subject Response
  11. #> 1 1 X
  12. #> 2 2 Y
  13. #> 3 3 X
  14. dfC <- data.frame(Subject=c(1,2,3), Response=c("Z","Y","Z"))
  15. dfC
  16. #> Subject Response
  17. #> 1 1 Z
  18. #> 2 2 Y
  19. #> 3 3 Z

In dfA, the rows containing (1,X) also appear in dfB, but the rows containing (2,X) do not appear in any of the other data frames. Similarly, dfB contains (1,X) which appears in dfA, and (2,Y), which appears in dfC, but (3,X) does not appear in any other data frame.

You might wish to mark the rows which are duplicated in another data frame, or which are unique to each data frame.

Joining the data frames

To proceed, first join the three data frames with a column identifying which source each row came from. It’s called Coder here because this could be data coded by three different people. In this case, you might wish to find where the coders agreed, or where they disagreed.

  1. dfA$Coder <- "A"
  2. dfB$Coder <- "B"
  3. dfC$Coder <- "C"
  4. df <- rbind(dfA, dfB, dfC) # Stick them together
  5. df <- df[,c("Coder", "Subject", "Response")] # Reorder the columns to look nice
  6. df
  7. #> Coder Subject Response
  8. #> 1 A 1 X
  9. #> 2 A 1 X
  10. #> 3 A 2 X
  11. #> 4 A 2 X
  12. #> 5 B 1 X
  13. #> 6 B 2 Y
  14. #> 7 B 3 X
  15. #> 8 C 1 Z
  16. #> 9 C 2 Y
  17. #> 10 C 3 Z

If your data starts out in this format, then there’s obviously no need to join it together.

Finding duplicated rows

Using the function dupsBetweenGroups (defined below), we can find which rows are duplicated between different groups:

  1. # Find the rows which have duplicates in a different group.
  2. dupRows <- dupsBetweenGroups(df, "Coder")
  3. # Print it alongside the data frame
  4. cbind(df, dup=dupRows)
  5. #> Coder Subject Response dup
  6. #> 1 A 1 X TRUE
  7. #> 2 A 1 X TRUE
  8. #> 3 A 2 X FALSE
  9. #> 4 A 2 X FALSE
  10. #> 5 B 1 X TRUE
  11. #> 6 B 2 Y TRUE
  12. #> 7 B 3 X FALSE
  13. #> 8 C 1 Z FALSE
  14. #> 9 C 2 Y TRUE
  15. #> 10 C 3 Z FALSE

Note that this does not mark duplicated rows within a group. With Coder=A, there are two rows with Subject=1 and Response=X, but these are not marked as duplicates.

Finding unique rows

It’s also possible to find the rows that are unique within each group:

  1. cbind(df, unique=!dupRows)
  2. #> Coder Subject Response unique
  3. #> 1 A 1 X FALSE
  4. #> 2 A 1 X FALSE
  5. #> 3 A 2 X TRUE
  6. #> 4 A 2 X TRUE
  7. #> 5 B 1 X FALSE
  8. #> 6 B 2 Y FALSE
  9. #> 7 B 3 X TRUE
  10. #> 8 C 1 Z TRUE
  11. #> 9 C 2 Y FALSE
  12. #> 10 C 3 Z TRUE

Splitting apart the data frame

If you wish to split the joined data frame into the three original data frames

  1. # Store the results in df
  2. dfDup <- cbind(df, dup=dupRows)
  3. dfA <- subset(dfDup, Coder=="A", select=-Coder)
  4. dfA
  5. #> Subject Response dup
  6. #> 1 1 X TRUE
  7. #> 2 1 X TRUE
  8. #> 3 2 X FALSE
  9. #> 4 2 X FALSE
  10. dfB <- subset(dfDup, Coder=="B", select=-Coder)
  11. dfB
  12. #> Subject Response dup
  13. #> 5 1 X TRUE
  14. #> 6 2 Y TRUE
  15. #> 7 3 X FALSE
  16. dfC <- subset(dfDup, Coder=="C", select=-Coder)
  17. dfC
  18. #> Subject Response dup
  19. #> 8 1 Z FALSE
  20. #> 9 2 Y TRUE
  21. #> 10 3 Z FALSE

Ignoring columns

It is also possible to ignore one or more columns, by removing that column from the data frame that is passed to the function. The results can be joined to the original complete data frame if desired.

  1. # Ignore the Subject column -- only use Response
  2. dfNoSub <- subset(df, select=-Subject)
  3. dfNoSub
  4. #> Coder Response
  5. #> 1 A X
  6. #> 2 A X
  7. #> 3 A X
  8. #> 4 A X
  9. #> 5 B X
  10. #> 6 B Y
  11. #> 7 B X
  12. #> 8 C Z
  13. #> 9 C Y
  14. #> 10 C Z
  15. # Check for duplicates
  16. dupRows <- dupsBetweenGroups(dfNoSub, "Coder")
  17. # Join the result to the original data frame
  18. cbind(df, dup=dupRows)
  19. #> Coder Subject Response dup
  20. #> 1 A 1 X TRUE
  21. #> 2 A 1 X TRUE
  22. #> 3 A 2 X TRUE
  23. #> 4 A 2 X TRUE
  24. #> 5 B 1 X TRUE
  25. #> 6 B 2 Y TRUE
  26. #> 7 B 3 X TRUE
  27. #> 8 C 1 Z FALSE
  28. #> 9 C 2 Y TRUE
  29. #> 10 C 3 Z FALSE

dupsBetweenGroups function

This is the function that does all the work:

  1. dupsBetweenGroups <- function (df, idcol) {
  2. # df: the data frame
  3. # idcol: the column which identifies the group each row belongs to
  4. # Get the data columns to use for finding matches
  5. datacols <- setdiff(names(df), idcol)
  6. # Sort by idcol, then datacols. Save order so we can undo the sorting later.
  7. sortorder <- do.call(order, df)
  8. df <- df[sortorder,]
  9. # Find duplicates within each id group (first copy not marked)
  10. dupWithin <- duplicated(df)
  11. # With duplicates within each group filtered out, find duplicates between groups.
  12. # Need to scan up and down with duplicated() because first copy is not marked.
  13. dupBetween = rep(NA, nrow(df))
  14. dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols])
  15. dupBetween[!dupWithin] <- duplicated(df[!dupWithin,datacols], fromLast=TRUE) | dupBetween[!dupWithin]
  16. # ============= Replace NA's with previous non-NA value ==============
  17. # This is why we sorted earlier - it was necessary to do this part efficiently
  18. # Get indexes of non-NA's
  19. goodIdx <- !is.na(dupBetween)
  20. # These are the non-NA values from x only
  21. # Add a leading NA for later use when we index into this vector
  22. goodVals <- c(NA, dupBetween[goodIdx])
  23. # Fill the indices of the output vector with the indices pulled from
  24. # these offsets of goodVals. Add 1 to avoid indexing to zero.
  25. fillIdx <- cumsum(goodIdx)+1
  26. # The original vector, now with gaps filled
  27. dupBetween <- goodVals[fillIdx]
  28. # Undo the original sort
  29. dupBetween[sortorder] <- dupBetween
  30. # Return the vector of which entries are duplicated across groups
  31. return(dupBetween)
  32. }

Notes

To find exact duplicate rows within a single dataframe, see ../Finding and removing duplicate records.