Finding sequences of identical values

Problem

You want to find sequences of identical values in a vector or factor.

Solution

It is possible to find sequences of identical values by looping over a vector element by element, but explicit loops like this are very slow in R. A much faster way is to use the rle() function, which computes the run-length encoding of a vector: the length and the value of each run.

    # Example data
    v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
    v
    #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
    vr <- rle(v)
    vr
    #> Run Length Encoding
    #> lengths: int [1:7] 3 4 1 1 2 1 3
    #> values : chr [1:7] "A" "B" NA NA "C" "B" "C"
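The run lengths can also be used to locate each run in the original vector. The following is a minimal sketch, not part of the original recipe; the names run_end, run_start, runs and longest are chosen here purely for illustration.

```r
v <- c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C")
vr <- rle(v)

# The end position of each run is the cumulative sum of the run lengths
run_end <- cumsum(vr$lengths)
# The start position is the end position minus the run length, plus one
run_start <- run_end - vr$lengths + 1

# One row per run (illustrative layout)
runs <- data.frame(value = vr$values, start = run_start, end = run_end,
                   length = vr$lengths, stringsAsFactors = FALSE)

# The longest run; which.max() returns the first one if there are ties
longest <- runs[which.max(runs$length), ]
```

Here longest describes the run of four "B" values at positions 4 through 7. Note that the two NAs still show up as two separate runs of length 1.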

The RLE-coded data can be converted back to a vector with inverse.rle().

    inverse.rle(vr)
    #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"

One potential problem is that each NA is treated as a run of length 1, even when the NAs are next to each other. It is possible to work around this by temporarily replacing the NAs with a designated placeholder value. For numeric vectors, Inf or some other number can be used; for character vectors, any string may be used. Of course, the placeholder must not otherwise appear in the vector.

    w <- v
    w[is.na(w)] <- "ZZZ"
    w
    #> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C"
    #> [15] "C"
    wr <- rle(w)
    wr
    #> Run Length Encoding
    #> lengths: int [1:6] 3 4 2 2 1 3
    #> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"
    # Replace the ZZZ's with NA in the RLE-coded data
    wr$values[ wr$values=="ZZZ" ] <- NA
    wr
    #> Run Length Encoding
    #> lengths: int [1:6] 3 4 2 2 1 3
    #> values : chr [1:6] "A" "B" NA "C" "B" "C"
    w2 <- inverse.rle(wr)
    w2
    #> [1] "A" "A" "A" "B" "B" "B" "B" NA NA "C" "C" "B" "C" "C" "C"
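The same workaround applies to numeric vectors. The sketch below uses Inf as the placeholder and first checks that it does not already occur in the data; the vector x and the name sentinel are made up for illustration.

```r
# Hypothetical numeric data with a run of three NAs
x <- c(1, 1, NA, NA, NA, 2, 2, 1)

# Placeholder value; stop early if it already occurs in the data
sentinel <- Inf
stopifnot(!any(x == sentinel, na.rm = TRUE))

# Replace the NAs with the placeholder, then encode
y <- x
y[is.na(y)] <- sentinel
xr <- rle(y)

# Put the NAs back in the RLE-coded data
xr$values[xr$values == sentinel] <- NA

xr$lengths   # the three NAs now form a single run
xr$values

# Decoding gives back the original vector
x2 <- inverse.rle(xr)
```

If Inf can legitimately appear in the data, another number that is known to be absent would have to be chosen instead.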

Working with factors

Even though factors are basically just integer vectors with some information about levels attached, the rle() function doesn’t work with factors. The solution is to manually convert the factor to an integer vector or a character vector. Using an integer vector is fast and memory-efficient, which may matter for large data sets, but it is difficult to interpret. Using a character vector is slower and requires more memory, but can be much easier to interpret.

    # Suppose this is the factor we have to work with
    f <- factor(v)
    f
    #> [1] A A A B B B B <NA> <NA> C C B C C C
    #> Levels: A B C
    # Save the levels of the factor.
    # This isn't strictly necessary, but it is useful for preserving order of levels
    f_levels <- levels(f)
    f_levels
    #> [1] "A" "B" "C"
    fc <- as.character(f)
    fc[ is.na(fc) ] <- "ZZZ"
    fc
    #> [1] "A" "A" "A" "B" "B" "B" "B" "ZZZ" "ZZZ" "C" "C" "B" "C" "C"
    #> [15] "C"
    fr <- rle(fc)
    fr
    #> Run Length Encoding
    #> lengths: int [1:6] 3 4 2 2 1 3
    #> values : chr [1:6] "A" "B" "ZZZ" "C" "B" "C"
    # Replace the ZZZ's with NA in the RLE-coded data
    fr$values[ fr$values=="ZZZ" ] <- NA
    fr
    #> Run Length Encoding
    #> lengths: int [1:6] 3 4 2 2 1 3
    #> values : chr [1:6] "A" "B" NA "C" "B" "C"
    # Invert RLE coding and convert back to a factor
    f2 <- inverse.rle(fr)
    f2 <- factor(f2, levels=f_levels)
    f2
    #> [1] A A A B B B B <NA> <NA> C C B C C C
    #> Levels: A B C
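The integer-vector route mentioned above can be sketched in the same way. This version assumes that 0 is a safe placeholder, since the integer codes of a factor start at 1; the names fi, fir and f3 are illustrative only.

```r
f <- factor(c("A","A","A", "B","B","B","B", NA,NA, "C","C", "B", "C","C","C"))

# Work with the underlying integer codes instead of the labels
fi <- as.integer(f)
# Level codes start at 1, so 0 cannot clash with a real level
fi[is.na(fi)] <- 0L

fir <- rle(fi)
# Put the NAs back in the RLE-coded data
fir$values[fir$values == 0L] <- NA

# Decode, map the codes back to their labels, and rebuild the factor
f3 <- factor(levels(f)[inverse.rle(fir)], levels = levels(f))
```

The values in fir are level codes rather than labels, which is the loss of readability noted above; levels(f)[fir$values] recovers the label of each run.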