
#SSchat on RoboReaders

Rise of The Robo-Readers

July 13 @ 4pm PST/7pm EST #sschat
co-mods: @scottmpetri & @DavidSalmanson

A primer on auto essay scoring


Q1 What is your definition of AES, robo-reading, or robo-grading? #sschat

Q2 What is greatest hope and/or your worst fear about technology-assisted grading? #sschat

Q3 When is it ok for a computer to assign grades on student work? #sschat

Q4 How can classroom teachers test & evaluate a robograder without disrupting learning? #sschat

Q5 What would parents think if Ts required Ss to use robo-graders before submitting work? #sschat

Q6 What would school admins say if you used a robograder in your classes? #sschat

Q7 How would you use a robograder in your History-Social Science class? #sschat

Q8 How could robo-readers help teachers gamify the art and process of writing? #sschat

Shameless plug: https://www.canvas.net/browse/ncss/courses/improving-historical-writing has a module on writing feedback & AES. The course is free and open until Sept. 22. #sschat

Teaser Tweets (to promote the chat after Monday – 7/6).

Are robo-graders the future of assessment or worse than useless? http://wp.me/4SfVS #sschat

Robo-readers are called Automated Essay Scorers (AES) in education research. http://wp.me/4SfVS  #sschat

In one study, Ss using a robo-reader wrote 3X as many words as Ss not using the RR. http://wp.me/4SfVS #sschat

Robo-readers produce a change in Ss behavior from never revising to 100% revising. http://wp.me/4SfVS #sschat

Criticism from a human instructor has a negative effect on students’ attitudes about revisions. http://wp.me/4SfVS #sschat

Comments from the robo-reader produced overwhelmingly positive feelings for student writers. http://wp.me/4SfVS #sschat

Computer feedback stimulates reflectiveness in students, something instructors don’t always do. http://wp.me/4SfVS #sschat

Robo-graders are able to match human scores simply by over-valuing length compared to human readers. http://wp.me/4SfVS #sschat

None of the major testing companies allow open-ended demonstrations of their robo-graders. http://wp.me/4SfVS #sschat

Toasters sold at Walmart have more gov. oversight than robo-readers grading high stakes tests. http://wp.me/4SfVS #sschat

What is the difference between a robo-reader & a robo-grader? http://wp.me/4SfVS #sschat

To join the video chat, follow @ImpHRW and sign into www.Nurph.com. Enter the #ImpHRW channel. Note that you will still need to add #sschat to your tweets.









Promo Video for a forthcoming Turnitin.com product


A longer paper by Shermis & Hamner

Perelman’s full-length critique of Shermis & Hamner


If you are really a hard-core stats & edu-research nerd




National Council of Teachers of English Statement


For Further Research

Williamson, D. M., Xi, X., & Breyer, F. J. (2012). A framework for evaluation and use of automated scoring. Educational Measurement: Issues and Practice, 31(1), 2-13.

Role of Robo-Readers


I have increased the amount of writing in my high school World History classes over the last five years. At first, I required two DBQs per semester; then I increased that to four DBQs per semester. Next, I added a five-page research paper at the end of the school year. Now, I assign a research paper each semester. If I were to allot ten minutes of reading/grading time for each DBQ, that would be 80 minutes of grading per student, multiplied by last year’s total student load of 197 for a total of 263 hours of reading and grading. Assuming I spent 30 minutes correcting each research paper, an additional 197 hours of grading would be added to my workload. Where do I find those extra 460 hours per year? Do I neglect my family and grade non-stop every weekend? No. I use a combination of robo-readers, or automated essay scoring (AES) tools, and structured peer-review protocols to help my students improve their writing.
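The workload arithmetic above can be sketched as a quick back-of-the-envelope calculation. The numbers come straight from the paragraph (197 students, eight DBQs per year at ten minutes each, two research papers per year at 30 minutes each); the script itself is only illustrative:

```python
# Back-of-the-envelope grading workload, using the figures from the paragraph above.
STUDENTS = 197
DBQS_PER_YEAR = 8        # four DBQs per semester
MIN_PER_DBQ = 10
PAPERS_PER_YEAR = 2      # one research paper per semester
MIN_PER_PAPER = 30

dbq_hours = STUDENTS * DBQS_PER_YEAR * MIN_PER_DBQ / 60
paper_hours = STUDENTS * PAPERS_PER_YEAR * MIN_PER_PAPER / 60
total_hours = dbq_hours + paper_hours

print(round(dbq_hours))    # 263 hours of DBQ grading
print(round(paper_hours))  # 197 hours of research-paper grading
print(round(total_hours))  # 460 hours per year
```

The totals match the figures in the text: roughly 263 DBQ hours plus 197 research-paper hours, or about 460 hours of grading per year.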

Hemingway App

As AES has matured, a myriad of programs that are free to educators has proliferated. Grammarly claims to find and correct ten times more mistakes than a word processor. The Hemingway App makes writing bold and clear. PaperRater offers feedback by comparing a writer’s work to that of others at the same grade level. It ranks each paper on a percentile scale, examining originality, grammar, spelling, phrasing, transitions, academic vocabulary, voice, and style. Then it provides students with an overall grade. My students use this trio of tools to improve their writing before I ever look at it.


David Salmanson, a fellow history teacher and scholar, questioned my reliance on technology. The purpose of these back-and-forth posts is to elaborate on the continuum of use that robo-readers may develop in the K-12 ecosystem. Murphy-Paul argues that a non-judgmental computer may motivate students to try, to fail, and to improve more than almost any human. Research on a program called e-rater confirmed this, finding that students who used it wrote almost three times as many words as peers who did not. Perelman rebuts this by pointing out that robo-graders do not score by understanding meaning but by gross measures, especially length and pretentious language. He feels students should not be graded by machines making faulty assumptions with proprietary algorithms.

Both of these writers make excellent points; however, classroom teachers, especially those of us in low-SES public schools, are going to have a difficult time improving our discipline-specific writing instruction, increasing the amount of writing assigned, and providing feedback that motivates students to revise their work prior to a final evaluation. We will need to find an appropriate balance between computerized and human feedback for our students.

Mayfield maintains that automated assessment changes the locus of control, making students enlist the teacher as an ally to help them address the feedback from the computer. I have found that students in my class reluctantly revise their writing per the advice of a robo-reader, but real growth happens when students discuss in small groups what works and what doesn’t. Asking students to write a revision memo detailing the changes they have made in each new draft helps them see writing as an iterative process instead of a one-and-done assignment.

Read David’s post and participate in our #sschat on this topic on July 13th at 7pm EST/4pm PST.


Student Perceptions of Writing Feedback

What Students Say About Instructor Feedback was a 2013 study that examined student perceptions of instructor feedback using the Turnitin.com platform. Students wanted timely feedback, but rarely received it: 28 percent of respondents reported that their instructors took 13+ days to provide feedback on their papers. Students preferred feedback about grammar, mechanics, composition, and structure. Students found feedback on thesis development valuable. Despite high rates of electronic submission, students did not report receiving electronic feedback at nearly the same rate.

QuickMark Categories

From The Margins analyzed nearly 30 million marks left on student papers submitted to Turnitin.com’s service between January 2010 and May 2012. QuickMark comments are a preloaded set of 76 comments covering four categories that instructors can drag and drop onto students’ papers within the Turnitin online grading platform.

This study looked specifically at the frequency and kinds of margin comments teachers provided. The top 15 are listed below.

Top 10 QuickMarks

This 2014 follow-up study found that students rated face-to-face feedback highly: 77 percent of students viewed face-to-face comments as “very” or “extremely” effective, but only 30 percent received face-to-face feedback “very” or “extremely” often.

Students perceived general comments on their writing to be “very” or “extremely” effective; a smaller percentage of educators felt the same. Even though 68 percent of students reported receiving general comments “very” or “extremely” often, and 67 percent of students said this feedback type was “very” or “extremely” effective, only one-third of educators viewed general comments that way.

Students preferred “suggestions for improvement” over “praise or discouragement.” The greatest percentage of students found suggestions for improvement “very” or “extremely” effective, while the smallest percentage said the same for “praise or discouragement.”

Students and educators differed on what constituted effective feedback. The gap between educators and students was greater than 15 percent in the majority of areas examined. The biggest difference between educator and student responses occurred with “general, overall comments about the paper” and “specific notes written in the margins.”


Comments recorded as voice or video may be a time-saving substitute for face-to-face feedback. Only five percent of student respondents reported receiving voice or video comments “very” or “extremely” often, compared with the 30 percent who reported receiving face-to-face feedback that frequently. As a way to negotiate time pressures and still provide more personalized feedback, educators might consider recorded voice or video comments on student work. Many grading platforms and learning management systems (LMS) offer this feature as part of their services.

This study identified a clear relationship between exposure to feedback and perceived effectiveness of feedback: the more of a given type of feedback students receive, the more valuable they judge it to be. Thus, it is imperative to provide students with different types of feedback and to evaluate what is helpful for them. Teachers should discuss the types of feedback they typically provide to their classes, then ask students to share which types they have found most helpful.

The definition of “effective feedback” will differ in your own course. Poll your class to find out what types of feedback students think would improve their writing.





History Assessments of Thinking

Joel Breakstone wrote that two of the most readily available test item types, multiple-choice questions and document-based questions (DBQs), are poorly suited for formative assessment. Breakstone and his colleagues at SHEG have designed History Assessments of Thinking (HATs) that measure both content knowledge and historical thinking skills. HATs measure disciplinary skills through engagement with primary sources. Teachers using HATs must interpret student responses and enact curricular revisions using their pedagogical content knowledge, something that may prove difficult for new or poorly trained teachers.


To use HATs, teachers must understand the question, be familiar with the historical content, evaluate student responses, diagnose student mistakes, develop remediation, and implement the intervention. Teachers must possess an understanding of what makes learning easy or difficult and ways of formulating the subject that make it comprehensible to others. In designing HATs, Breakstone sought to collect data on cognitive validity, or the relationship between the constructs targeted by the assessments and the cognitive processes students use to answer them. This would help teachers interpret student responses and use that information to make curricular changes. Formative assessments in history depend on teachers being able to quickly diagnose student understanding. Assessments based on historical thinking represent a huge shift from the norm in history classrooms. For formative assessment to become routine, teachers will need extensive professional development and numerous other supports.

Sipress & Voelker (2011) write eloquently about the rise and fall of the coverage model in history instruction. This tension has been revitalized as educators eagerly anticipate which testing methodologies will be used for the “fewer, deeper” Common Core assessments and what I call the “Marv Alkin overkill method” of using at least four items to assess each content standard, which results in end-of-year history assessments of 80 questions or more. Breadth-versus-depth arguments have existed forever in education. Jay Mathews illustrates this by asking whether teachers should focus on a few topics, so students have time to absorb and comprehend the inner workings of the subject, or cover every topic, so students get a sense of the whole and can later pursue the parts that interest them most.

Something that may settle this debate is one of the more interesting developments in ed tech. The nexus of machine learning and student writing is a controversial and competitive market. Turnitin recently demonstrated that it is looking to move beyond plagiarism detection and into the automated writing feedback market with a recent acquisition. If my wife allowed me to gamble, I would bet that one of the testing consortiums, either Smarter Balanced or PARCC, will soon strike a deal with one of the eight automated essay grading vendors to grade open-ended questions on their standardized tests. Lightside Labs will pilot test their product with the Gates Foundation in 2015 and get it to market in 2016, just a little too late to be included in the first wave of Common Core assessments. I wonder if HAT assessments would be able to incorporate some automated scoring technology and settle the depth versus breadth debate in assessing history?