In this paper, we develop a model for evaluating an ordinal rating system in which the true underlying disease state is assumed to be continuous in nature. Our approach is motivated by a dataset of 35 microscopic slides with 35 representative duct lesions of the pancreas. Each slide was evaluated by eight raters using two novel rating systems (PanIN illustrations and PanIN nomenclature), where each rater used each system to rate each slide, with slide identity masked between evaluations. We find that the two methods perform equally well but that differentiation of higher grade lesions is more consistent across raters than differentiation of lower grade lesions.

A proportional odds model is assumed, which allows us to estimate rater-specific thresholds for comparing agreement. Because we have two methods of rating, we can determine whether the two methods have the same thresholds and whether raters perform equivalently across methods. Unlike some other model-based approaches for measuring agreement, we focus on the interpretation of the model parameters and their scientific relevance. We compare posterior estimates of rater-specific parameters across raters to see whether they are implementing the intended rating system in the same manner. Estimated distributions of the standard deviations are used to make inferences as to whether raters are consistent and whether rating behaviors differ between the two rating systems under comparison.
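
As a rough illustration of how rater-specific thresholds act on a continuous latent disease state, the sketch below computes rating-category probabilities under a generic cumulative logit (proportional odds) form, P(rating <= k) = logistic(threshold_k - latent). The function name `category_probs`, the threshold values, and the latent score are hypothetical and chosen only for illustration; the exact parameterization and priors used in the paper are not reproduced here.

```python
import math

def category_probs(latent, thresholds):
    """Return P(rating = k) for k = 0..K under a cumulative logit model.

    Assumes P(rating <= k) = logistic(threshold_k - latent), a generic
    proportional odds form (an assumption, not the paper's exact model).
    """
    cum = [1.0 / (1.0 + math.exp(-(t - latent))) for t in sorted(thresholds)]
    cum = [0.0] + cum + [1.0]  # pad with P(rating <= -1) = 0 and P(rating <= K) = 1
    return [cum[k + 1] - cum[k] for k in range(len(cum) - 1)]

# Hypothetical thresholds for two raters (illustrative values, not from the data).
rater_a = [-1.0, 0.0, 1.5]
rater_b = [-0.5, 0.5, 1.5]

latent = 0.2  # hypothetical latent disease severity of one slide
probs_a = category_probs(latent, rater_a)
probs_b = category_probs(latent, rater_b)

for name, probs in (("rater A", probs_a), ("rater B", probs_b)):
    print(name, [round(p, 3) for p in probs])
```

In this sketch the two raters share the highest threshold but differ on the lower ones, so they assign the lower categories with different probabilities for the same slide while treating the top category identically.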


Categorical Data Analysis | Statistical Models