When teachers, departments, and schools use writing from sources as formative assessments in History, certain protocols should precede evaluation. Many departments or schools grade these assessments collaboratively during common planning time or teacher professional development. This requires training evaluators to use validated rubrics before they analyze student work. Teachers then work together to identify exemplars that strongly correlate with the spectrum of work defined by the rubric.
Holistic scoring involves assigning a single score that indicates the overall quality of a text (Bang, 2012). Raters give one summary score based on their impression of a text without trying to evaluate a specific set of skills. Analytic scoring examines multiple aspects of writing (e.g., content, structure, mechanics) and assigns a score for each. This type of evaluation generates several scores useful for guiding instruction.
Broadly defined, reliability is the consistency with which an instrument/method produces measurements, while validity is the extent to which an instrument/method actually measures what it is meant to measure, or its accuracy. In testing writing rubrics, agreement rates are used to determine inter-rater reliability, where agreement is further defined as exact or adjacent scores. Exact agreement consensus rates need to be 70% or greater to be considered reliable (Stemler, 2004). Adjacent agreements within one score point should exceed 90% to indicate a good level of consistency (Jonsson & Svingby, 2007).
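To make these two agreement rates concrete, here is a minimal sketch in Python that computes them for two raters. The score lists are hypothetical (ten essays scored on a 1–4 rubric), not real classroom data; the thresholds are the ones cited above.

```python
# Sketch: exact- and adjacent-agreement rates between two raters.
# Benchmarks from the text: >= 70% exact agreement (Stemler, 2004)
# and > 90% adjacent agreement (Jonsson & Svingby, 2007).

def agreement_rates(rater_a, rater_b):
    """Return (exact, adjacent) agreement as fractions of all paired scores."""
    pairs = list(zip(rater_a, rater_b))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
    return exact, adjacent

# Hypothetical scores on a 1-4 rubric for ten essays
rater_a = [3, 2, 4, 3, 1, 2, 3, 4, 2, 3]
rater_b = [3, 2, 3, 3, 2, 2, 3, 4, 1, 3]

exact, adjacent = agreement_rates(rater_a, rater_b)
print(f"Exact: {exact:.0%}, Adjacent: {adjacent:.0%}")
# -> Exact: 70%, Adjacent: 100%
```

In this example the raters match exactly on 7 of 10 essays (70%, just at the Stemler threshold) and fall within one score point on all 10, so the pair would be judged acceptably consistent under both criteria.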
I used Google Forms to have my students validate the above rubric from the Literacy Design Collaborative. I found the LDC rubric to be more student friendly than the rubric my District adapted from the Smarter Balanced consortium.
Jonsson & Svingby (2007) analyzed 75 rubric validation studies and found (a) benchmarks are most likely to increase agreement, but they should be chosen with care since the scoring depends heavily on the benchmarks chosen to define the rubric; (b) agreement is improved by training, but training will probably never totally eliminate differences; (c) topic-specific rubrics are likely to produce more generalizable and dependable scores than generic rubrics; and (d) augmentation of the rating scale (for example, so the raters can expand the number of levels using + or − signs) seems to improve certain aspects of inter-rater reliability, although not consensus agreements.
Validating a rubric with your class gives your students additional time to consider their historical writing. When they have to review more than one student’s writing, they establish a context for evaluating their own writing. Class discussions should identify exemplars of strong historical writing. Direct instruction should focus on improving examples of weak writing. Rubric validation is a much-needed historical thinking exercise. Otherwise, your students may develop what educational psychologists call the Dunning-Kruger Effect.
The Dunning-Kruger Effect describes a cognitive bias in which people perform poorly on a task but lack the metacognitive capacity to properly evaluate their performance. As a result, such people remain unaware of their incompetence and accordingly fail to take any self-improvement measures that might rid them of their incompetence.
Bang, H. J. (2012). Reliability of National Writing Project’s Analytic Writing Continuum Assessment System.
Jonsson, A., & Svingby, G. (2007). The use of scoring rubrics: Reliability, validity and educational consequences. Educational Research Review, 2(2), 130-144.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one’s own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77(6), 1121-1134.
Stemler, S. E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4).