Aim?
Determine agreement/reliability of a measurement
Observational/Intervention?
Observational Research
Causal/Prediction?
NA
Cross-sectional/Follow-up?
Cross-sectional
Research Question Example:
What is the inter-rater reliability of radiologists reading pulmonary CTA in patients suspected of acute pulmonary embolism?
Typical Study Design:
Cohort Study
Alternative Study Design:
(Nested) Case-control Study
Data collection:
Prospective or Retrospective
Type of Outcomes:
Inter- or intra-rater agreement (proportions of agreement; standard error of measurement; Bland-Altman bias and 95% limits of agreement) and reliability measures (kappa statistic, ICC)
Data Analysis
Descriptive statistics; Bland-Altman plot; ANOVA
Epidemiological Statement
GRRAS STATEMENT LINKS:
- Website
- General publication
Follow the outlined steps and start writing down your methodology. When you are finished, you have the basis for your study protocol. Furthermore, you will be able to claim that you have followed the Guidelines for Reporting Reliability and Agreement Studies (GRRAS).
Clinical measurements and assessments are very important tools in medicine, especially in the diagnostic process. Before we can confidently use a quantitative measurement (for example, a blood analysis) or a qualitative assessment (for example, a radiologist reading a scan or a pathologist examining tissue) in clinical practice, we need to make sure that the test actually measures what we are interested in (validity and diagnostic accuracy) and does so with good reproducibility (low measurement error relative to subject/object variability). Sometimes our purpose is to compare a new method's reproducibility to that of an established method, because the new method has a certain advantage (cheaper, less time-intensive, safer).

Reproducibility can refer to inter-rater measures (different raters, using the same scale, classification, instrument, or procedure, assess the same subjects or objects) or intra-rater measures (the same rater, using the same scale, classification, instrument, or procedure, assesses the same subjects or objects at different time points). Inter-rater/intra-rater reproducibility is sometimes called inter-observer/intra-observer reproducibility.
These reproducibility measures can be further divided into:
Agreement: evaluation of the measurement error. How close are scores for repeated measurements?
Reliability: discriminatory ability. How well can patients be distinguished from each other despite measurement error?
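The distinction can be made concrete with a small simulation (all numbers hypothetical, standard library only): the same measurement error, i.e. the same agreement, produces very different reliability depending on how much the subjects differ from each other.

```python
# Hypothetical simulation: two samples with identical measurement error
# (error SD = 5 units) but different between-subject spread.
import random
import statistics

random.seed(1)

def simulated_icc(subject_sd, error_sd, n=2000):
    """Estimate a one-way ICC from two simulated measurements per subject."""
    pairs = []
    for _ in range(n):
        true_value = random.gauss(0, subject_sd)
        pairs.append((true_value + random.gauss(0, error_sd),
                      true_value + random.gauss(0, error_sd)))
    diffs = [a - b for a, b in pairs]
    error_var = statistics.pvariance(diffs) / 2   # Var(m1 - m2) = 2 * error_var
    means = [(a + b) / 2 for a, b in pairs]
    # Var(mean of two measurements) = subject_var + error_var / 2
    subject_var = max(statistics.pvariance(means) - error_var / 2, 0.0)
    return subject_var / (subject_var + error_var)  # ICC = between / total

icc_similar_subjects = simulated_icc(subject_sd=5, error_sd=5)    # ~0.5
icc_diverse_subjects = simulated_icc(subject_sd=20, error_sd=5)   # ~0.94
```

Agreement (the error SD) is identical in both runs, yet reliability is far higher in the heterogeneous sample, because the same error matters less when subjects are easier to tell apart.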
Mistaking association/correlation for reproducibility.
Unfortunately, some researchers try to prove that two measurements are similar by testing for an association (for continuous outcomes, typically a linear regression with a correlation coefficient).
Two measures can, however, be perfectly correlated yet systematically different. In reproducibility research you want to quantify that difference (the measurement error).
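A minimal sketch of this pitfall (hypothetical readings): a constant offset of 10 units leaves the correlation perfect while the two methods never actually agree.

```python
import statistics

# Hypothetical readings: method B reads exactly 10 units above method A.
method_a = [80.0, 95.0, 110.0, 125.0, 140.0]
method_b = [a + 10.0 for a in method_a]

def pearson(x, y):
    """Pearson correlation, computed by hand to stay dependency-free."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(y))

r = pearson(method_a, method_b)    # r is (numerically) 1: "perfect" association
bias = statistics.fmean(b - a for a, b in zip(method_a, method_b))  # 10 units
```

The correlation reports a flawless relationship, but the mean difference (bias) of 10 units is exactly the disagreement a reproducibility analysis is meant to expose.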
Confusing agreement and reliability.
Agreement parameters quantify the measurement error itself, whereas reliability parameters relate that error to the variability between the subjects being measured; the same measurement error therefore yields higher reliability in a more heterogeneous sample.
Figuring out what analysis to use.
Hopefully, you have already defined your research question. You know your domain, determinant(s) and outcome of interest.
Now, write down the background of the clinical problem, findings of previous studies and rationale of your study.
The next step is to meticulously define your study methodology.
Your methodology must be so clear ahead of time that other researchers could easily replicate the study.
We can divide study design into two parts:
Use a cross-sectional study design in which the measurements are performed in a prospective, standardised fashion. It usually pays off to perform multiple, repeated measurements (test-retest method).
Describe:
- The study setting (primary care, secondary/tertiary hospital, ICU, ED etc.)
- The dates and period of recruitment, exposure time, and time of follow-up
- Eligibility criteria (inclusion and exclusion criteria).
Describe:
- Variables of interest
- The sources of data
- Detailed methods of measurement/assessment (one/multiple raters; level of training, sequence of testing; method of blinding etc.)
- Handling of undetermined test outcomes (if possible, avoid consensus reading)
Describe:
- Primary and secondary outcomes, along with their exact definition
- The sources of data
- Detailed methods of measurement/assessment (one/multiple raters; level of training, sequence of testing; method of blinding etc.)
Describe how you prevent potential sources of bias (for example selection bias, information bias and confounding)
Describe:
- How summary data are presented (mean with standard deviation; median with range), supplemented with 95% confidence intervals
- Sample size determination based on primary endpoint (use the Sample Size Calculator)
- Which analyses will be performed (Use our Test Wizard)
- Transformation of data (if applicable)
- Statistical ways of handling missing data: Complete case analysis (not recommended) | Multiple imputation | Reclassification => best or worst-case scenario
- Statistical package used for analyses
- The assumed level of statistical significance (alpha)
Here's some advice on which outcome measures to use (depending on the nature of the measurement):
Continuous data
- Agreement: Bland-Altman plot with bias between methods and 95% limits of agreement; Standard errors of measurement
- Reliability: Intraclass correlation coefficient (specify the subtype of ICC)
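As a sketch of the continuous-data agreement measures (hypothetical paired readings), the Bland-Altman bias, 95% limits of agreement, and standard error of measurement all follow from the paired differences:

```python
import statistics

# Hypothetical paired readings: two raters measuring the same eight subjects.
rater1 = [12.1, 14.3, 9.8, 16.0, 11.5, 13.2, 15.1, 10.4]
rater2 = [12.9, 13.8, 10.5, 16.6, 11.1, 14.0, 15.9, 10.0]

diffs = [b - a for a, b in zip(rater1, rater2)]
bias = statistics.fmean(diffs)       # mean difference between raters
sd_diff = statistics.stdev(diffs)    # SD of the differences

# Bland-Altman 95% limits of agreement
loa_lower = bias - 1.96 * sd_diff
loa_upper = bias + 1.96 * sd_diff

# Standard error of measurement from paired differences: SEM = SD(diff) / sqrt(2)
sem = sd_diff / 2 ** 0.5
```

In a full Bland-Altman analysis you would also plot each difference against the pair's mean to check that the error does not depend on the magnitude of the measurement.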
Categorical data
- Agreement: Proportions of overall agreement and proportions of specific agreement
- Reliability: Kappa statistics (use matrix of kappa coefficients or weighted kappa for ordinal data)
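For binary ratings, a sketch with hypothetical 2x2 counts: overall agreement, chance-corrected kappa, and the proportions of specific agreement follow directly from the four cell counts.

```python
# Hypothetical 2x2 table: two radiologists classify 100 CTA scans as
# "PE present" or "PE absent".
both_pos, rater1_only, rater2_only, both_neg = 20, 5, 10, 65

n = both_pos + rater1_only + rater2_only + both_neg
p_observed = (both_pos + both_neg) / n             # overall agreement

rater1_pos = both_pos + rater1_only
rater2_pos = both_pos + rater2_only
# Agreement expected by chance alone, from the marginal totals
p_chance = (rater1_pos * rater2_pos
            + (n - rater1_pos) * (n - rater2_pos)) / n ** 2

kappa = (p_observed - p_chance) / (1 - p_chance)   # Cohen's kappa

# Proportions of specific agreement (positive and negative)
pos_agreement = 2 * both_pos / (rater1_pos + rater2_pos)
neg_agreement = 2 * both_neg / ((n - rater1_pos) + (n - rater2_pos))
```

Note how specific agreement separates the (usually lower) agreement on positive findings from agreement on negatives, which overall agreement alone hides.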
Go to the GCR Statistics Academy