Inter-rater reliability

Problem

You want to calculate inter-rater reliability.

Solution

The method for calculating inter-rater reliability will depend on the type of data (categorical, ordinal, or continuous) and the number of coders.

Categorical data

Suppose this is your data set. It consists of 30 cases, rated by three coders. It is a subset of the diagnoses data set in the irr package.

  library(irr)
  #> Loading required package: lpSolve
  data(diagnoses)
  dat <- diagnoses[,1:3]
  # rater1                    rater2                    rater3
  # 4. Neurosis               4. Neurosis               4. Neurosis
  # 2. Personality Disorder   2. Personality Disorder   2. Personality Disorder
  # 2. Personality Disorder   3. Schizophrenia          3. Schizophrenia
  # 5. Other                  5. Other                  5. Other
  # 2. Personality Disorder   2. Personality Disorder   2. Personality Disorder
  # 1. Depression             1. Depression             3. Schizophrenia
  # 3. Schizophrenia          3. Schizophrenia          3. Schizophrenia
  # 1. Depression             1. Depression             3. Schizophrenia
  # 1. Depression             1. Depression             4. Neurosis
  # 5. Other                  5. Other                  5. Other
  # 1. Depression             4. Neurosis               4. Neurosis
  # 1. Depression             2. Personality Disorder   4. Neurosis
  # 2. Personality Disorder   2. Personality Disorder   2. Personality Disorder
  # 1. Depression             4. Neurosis               4. Neurosis
  # 2. Personality Disorder   2. Personality Disorder   4. Neurosis
  # 3. Schizophrenia          3. Schizophrenia          3. Schizophrenia
  # 1. Depression             1. Depression             1. Depression
  # 1. Depression             1. Depression             1. Depression
  # 2. Personality Disorder   2. Personality Disorder   4. Neurosis
  # 1. Depression             3. Schizophrenia          3. Schizophrenia
  # 5. Other                  5. Other                  5. Other
  # 2. Personality Disorder   4. Neurosis               4. Neurosis
  # 2. Personality Disorder   2. Personality Disorder   4. Neurosis
  # 1. Depression             1. Depression             4. Neurosis
  # 1. Depression             4. Neurosis               4. Neurosis
  # 2. Personality Disorder   2. Personality Disorder   2. Personality Disorder
  # 1. Depression             1. Depression             1. Depression
  # 2. Personality Disorder   2. Personality Disorder   4. Neurosis
  # 1. Depression             3. Schizophrenia          3. Schizophrenia
  # 5. Other                  5. Other                  5. Other
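
Before computing any of the reliability statistics below, it can be useful to look at simple descriptive agreement. A minimal sketch (this is just a descriptive check, not one of the reliability measures):

  # Proportion of the 30 cases on which all three raters gave the same diagnosis
  all_agree <- apply(dat, 1, function(x) length(unique(x)) == 1)
  mean(all_agree)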

Two raters: Cohen’s Kappa

This will calculate Cohen's Kappa for two coders, in this case raters 1 and 2.

  kappa2(dat[,c(1,2)], "unweighted")
  #> Cohen's Kappa for 2 Raters (Weights: unweighted)
  #>
  #> Subjects = 30
  #> Raters = 2
  #> Kappa = 0.651
  #>
  #> z = 7
  #> p-value = 2.63e-12
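
Cohen's Kappa compares the observed agreement between the two raters with the agreement expected by chance. As an informal cross-check of kappa2(), here is a minimal sketch that computes the same unweighted statistic by hand from the raters' cross-tabulation:

  # Unweighted Cohen's Kappa computed directly from the agreement table
  tab <- table(dat$rater1, dat$rater2)                  # 5x5 cross-tabulation
  po  <- sum(diag(tab)) / sum(tab)                      # observed agreement
  pe  <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2  # chance agreement
  (po - pe) / (1 - pe)                                  # should match Kappa = 0.651 above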

N raters: Fleiss’s Kappa, Conger’s Kappa

If there are more than two raters, use Fleiss’s Kappa.

  kappam.fleiss(dat)
  #> Fleiss' Kappa for m Raters
  #>
  #> Subjects = 30
  #> Raters = 3
  #> Kappa = 0.534
  #>
  #> z = 9.89
  #> p-value = 0
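
Fleiss' Kappa yields a single pooled value across all raters. If you also want to see where the disagreement comes from, one informal complement (not part of the Fleiss statistic itself) is to compute Cohen's Kappa separately for each pair of raters:

  # Pairwise Cohen's Kappas, one per pair of raters
  kappa2(dat[,c(1,2)], "unweighted")   # raters 1 and 2
  kappa2(dat[,c(1,3)], "unweighted")   # raters 1 and 3
  kappa2(dat[,c(2,3)], "unweighted")   # raters 2 and 3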

It is also possible to use Conger’s (1980) exact Kappa. (Note that it is not clear to me when it is better or worse to use the exact method.)

  kappam.fleiss(dat, exact=TRUE)
  #> Fleiss' Kappa for m Raters (exact value)
  #>
  #> Subjects = 30
  #> Raters = 3
  #> Kappa = 0.55

Ordinal data: weighted Kappa

If the data is ordinal, it may be appropriate to use a weighted Kappa, which gives partial credit for near-agreement. For example, if the possible values are low, medium, and high, then a case rated medium by one coder and high by the other reflects better agreement than one rated low and high.
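
The partial credit comes from a weight assigned to each possible pair of ratings. The sketch below shows the usual linear and quadratic agreement weights for a three-level scale, which (as far as I know) is what the "equal" and "squared" options of kappa2() correspond to:

  # Agreement weights for a 3-level ordinal scale (e.g. low, medium, high).
  # A weight of 1 is full agreement, 0 is maximal disagreement.
  k <- 3
  d <- abs(outer(1:k, 1:k, "-"))    # distance between rating categories
  1 - d / (k - 1)                   # linear weights:  medium/high pair scores 0.5
  1 - (d / (k - 1))^2               # squared weights: medium/high pair scores 0.75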

We will use a subset of the anxiety data set from the irr package.

  library(irr)
  data(anxiety)
  dfa <- anxiety[,c(1,2)]
  dfa
  #>    rater1 rater2
  #> 1       3      3
  #> 2       3      6
  #> 3       3      4
  #> 4       4      6
  #> 5       5      2
  #> 6       5      4
  #> 7       2      2
  #> 8       3      4
  #> 9       5      3
  #> 10      2      3
  #> 11      2      2
  #> 12      6      3
  #> 13      1      3
  #> 14      5      3
  #> 15      2      2
  #> 16      2      2
  #> 17      1      1
  #> 18      2      3
  #> 19      4      3
  #> 20      3      4

The weighted Kappa calculation can only be done with two raters, and it can use either linear ("equal") or squared weights for the differences between ratings.

  # Compare raters 1 and 2 with squared weights
  kappa2(dfa, "squared")
  #> Cohen's Kappa for 2 Raters (Weights: squared)
  #>
  #> Subjects = 20
  #> Raters = 2
  #> Kappa = 0.297
  #>
  #> z = 1.34
  #> p-value = 0.18

  # Use linear weights
  kappa2(dfa, "equal")
  #> Cohen's Kappa for 2 Raters (Weights: equal)
  #>
  #> Subjects = 20
  #> Raters = 2
  #> Kappa = 0.189
  #>
  #> z = 1.42
  #> p-value = 0.157

Compare the results above with the unweighted calculation (used for the categorical data above), which treats all disagreements the same:

  kappa2(dfa, "unweighted")
  #> Cohen's Kappa for 2 Raters (Weights: unweighted)
  #>
  #> Subjects = 20
  #> Raters = 2
  #> Kappa = 0.119
  #>
  #> z = 1.16
  #> p-value = 0.245

Weighted Kappa with factors

The data above is numeric, but a weighted Kappa can also be calculated for factors. Note that the factor levels must be in the correct order, or results will be wrong.

  # Make a factor-ized version of the data
  dfa2 <- dfa
  dfa2$rater1 <- factor(dfa2$rater1, levels=1:6, labels=LETTERS[1:6])
  dfa2$rater2 <- factor(dfa2$rater2, levels=1:6, labels=LETTERS[1:6])
  dfa2
  #>    rater1 rater2
  #> 1       C      C
  #> 2       C      F
  #> 3       C      D
  #> 4       D      F
  #> 5       E      B
  #> 6       E      D
  #> 7       B      B
  #> 8       C      D
  #> 9       E      C
  #> 10      B      C
  #> 11      B      B
  #> 12      F      C
  #> 13      A      C
  #> 14      E      C
  #> 15      B      B
  #> 16      B      B
  #> 17      A      A
  #> 18      B      C
  #> 19      D      C
  #> 20      C      D

  # The factor levels must be in the correct order:
  levels(dfa2$rater1)
  #> [1] "A" "B" "C" "D" "E" "F"
  levels(dfa2$rater2)
  #> [1] "A" "B" "C" "D" "E" "F"

  # The results are the same as with the numeric data, above
  kappa2(dfa2, "squared")
  #> Cohen's Kappa for 2 Raters (Weights: squared)
  #>
  #> Subjects = 20
  #> Raters = 2
  #> Kappa = 0.297
  #>
  #> z = 1.34
  #> p-value = 0.18

  # Use linear weights
  kappa2(dfa2, "equal")
  #> Cohen's Kappa for 2 Raters (Weights: equal)
  #>
  #> Subjects = 20
  #> Raters = 2
  #> Kappa = 0.189
  #>
  #> z = 1.42
  #> p-value = 0.157

Continuous data: Intraclass correlation coefficient

When the variable is continuous, the intraclass correlation coefficient should be computed. From the documentation for icc:

When considering which form of ICC is appropriate for an actual set of data, one has to take several decisions (Shrout & Fleiss, 1979):

  • Should only the subjects be considered as random effects ("oneway" model, default) or are subjects and raters randomly chosen from a bigger pool of persons ("twoway" model).
  • If differences in judges’ mean ratings are of interest, interrater "agreement" instead of "consistency" (default) should be computed.
  • If the unit of analysis is a mean of several ratings, unit should be changed to "average". In most cases, however, single values (unit="single", default) are regarded.

We will use the anxiety data set from the irr package.
  library(irr)
  data(anxiety)
  anxiety
  #>    rater1 rater2 rater3
  #> 1       3      3      2
  #> 2       3      6      1
  #> 3       3      4      4
  #> 4       4      6      4
  #> 5       5      2      3
  #> 6       5      4      2
  #> 7       2      2      1
  #> 8       3      4      6
  #> 9       5      3      1
  #> 10      2      3      1
  #> 11      2      2      1
  #> 12      6      3      2
  #> 13      1      3      3
  #> 14      5      3      3
  #> 15      2      2      1
  #> 16      2      2      1
  #> 17      1      1      3
  #> 18      2      3      3
  #> 19      4      3      2
  #> 20      3      4      2

  # Just one of the many possible ICC coefficients
  icc(anxiety, model="twoway", type="agreement")
  #> Single Score Intraclass Correlation
  #>
  #> Model: twoway
  #> Type : agreement
  #>
  #> Subjects = 20
  #> Raters = 3
  #> ICC(A,1) = 0.198
  #>
  #> F-Test, H0: r0 = 0 ; H1: r0 > 0
  #> F(19,39.7) = 1.83 , p = 0.0543
  #>
  #> 95%-Confidence Interval for ICC Population Values:
  #> -0.039 < ICC < 0.494
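
The call above shows only one of the combinations described in the list before the example. The other decisions are selected the same way; a sketch of a few alternatives (output omitted):

  # Two-way model, consistency rather than agreement
  icc(anxiety, model="twoway", type="consistency", unit="single")
  # Treat the unit of analysis as the mean of the three ratings
  icc(anxiety, model="twoway", type="agreement", unit="average")
  # One-way model (the icc() default, per the documentation quoted above)
  icc(anxiety, model="oneway", unit="single")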