
How to train AI to mark a piece of writing part 2: 'Science-ing' it

Writer: Hannah Gillott

Updated: Feb 24


A robot peering at a blank sheet of paper through a magnifying glass


The missing word


In June 2024, I wrote about how we’re training AI to mark writing. Reading it now in February 2025, the most important word — the word that underpins everything we’ve done since then — is completely missing.


Before I tell you what that word is, I’d like to share a post I saw on LinkedIn which sums up what’s happened at stylus between June and February:



A screenshot of a LinkedIn post. The post reads: Most teams building AI products haven't got past the experimental, building with 'vibes' stage yet.

We've had to build a bunch of tooling at incident.io to get us beyond the early MVP stages of complex AI products, where it's hard to predict if a change you make to the system will improve things, or if you've made some other edge case much worse.

What I've seen from speaking with other people building AI products is a repeatable pattern of:

- Quick and easy path to a promising but deceptive MVP
- Iterating on the MVP with unpredictable ups-and-downs, until...
- You accept defeat, take the challenge seriously, and start to 'science' it
- Then finally, you're ready to build properly.

Most teams are still in the MVP stage, and have an emotional rollercoaster where progress is non-linear and unable to quantify. It's a brutal ride and the faster you can wise-up and start doing things properly the better experience you'll have, but there's very few people talking about what 'properly' actually means for companies outside of FANG.

For that reason, I've written a post about these stages and share a bit about what we've found was necessary to build to get us to where we are today.


Lawrence Jones, an engineer at incident.io, describes four repeatable stages to building an AI product:


  • Quick and easy path to a promising but deceptive MVP

  • Iterating on the MVP with unpredictable ups-and-downs until…

  • You accept defeat, take the challenge seriously, and start to ‘science’ it

  • Then finally, you’re ready to build properly


In June, we were on step two. Now we’re in step three: ‘science-ing’ it. And the word we realised we were missing, the new Queen around these parts, is the word ‘policy’. Here’s why.



Doing it well once v doing it well at scale


If you copy and paste a section of the Year 6 Teacher Assessment Framework, or an extract of a GCSE Literature mark scheme, turn it into a well-written prompt in ChatGPT or similar, and ask for feedback on an extract of student work, you’ll probably get a decent output at least once.


By ‘decent output’, I mean something that indicates AI could be useful when it comes to giving students feedback.
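

To make that concrete, here’s roughly what the experiment looks like as code: a deliberately minimal sketch using the OpenAI Python client, where the model name, prompt wording and sample texts are all placeholders for illustration rather than anything from our production setup.

```python
# A minimal sketch of the "paste a mark scheme into a prompt" experiment.
# Model name, prompt wording and sample texts are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable

CRITERION = (
    "Use capital letters, full stops, question marks, commas for lists "
    "and apostrophes for contraction mostly correctly."
)

STUDENT_WORK = "dear diary. today me and sam went to the zoo..."  # placeholder text

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[
        {
            "role": "system",
            "content": (
                "You are an experienced Year 6 teacher. Assess the pupil's "
                "writing against this criterion and give brief, specific "
                "feedback:\n" + CRITERION
            ),
        },
        {"role": "user", "content": STUDENT_WORK},
    ],
)

print(response.choices[0].message.content)
```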


What you won’t get is something that mimics every stage of a teacher’s thinking process. English teachers, unlike Science or Maths teachers, use many small judgements to generate one overall mark or judgement. A Literature essay might have a score out of 30 written at the bottom, but even broken down into its component Assessment Objectives, that score won’t tell you much about what the student actually did wrong. This means Literature teachers need to read all the essays to know what to teach next.


Year 6 teachers have a little more granularity to work with, but not much. A single statement in the Teacher Assessment Framework asks if students “Use capital letters, full stops, question marks, commas for lists and apostrophes for contraction mostly correctly.” I count five separate judgements in that statement — five things AI needs to look for, five opportunities for errors. That’s without even defining what ‘mostly’ means. 


We knew from the start that if we wanted to remove work from teachers, and increase insight, we needed to mimic their thought process. When we wrote our writing curriculum, ‘Use capital letters’ therefore became ‘Use capital letters to begin sentences’ and ‘Use capital letters for proper nouns’. ‘Use full stops’ became ‘Uses the appropriate stop marks to distinguish between questions and statements’ and ‘Uses both capital letters and full stops for sentence demarcation’. 


This way, we could help teachers pinpoint exactly where the error fell in a student’s work, in order to be able to plan an appropriate lesson to correct it.
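

To give a sense of the shape this takes, here’s a sketch of how that single compound statement unpacks into separately markable criteria. The identifiers below are hypothetical, though the descriptions echo the curriculum examples above.

```python
# Hypothetical identifiers; the descriptions echo the granular criteria
# described above, each one a single judgement the AI has to make.
GRANULAR_CRITERIA = {
    "capitals_to_begin_sentences": "Uses capital letters to begin sentences",
    "capitals_for_proper_nouns": "Uses capital letters for proper nouns",
    "question_vs_statement_stops": (
        "Uses the appropriate stop marks to distinguish between "
        "questions and statements"
    ),
    "sentence_demarcation": (
        "Uses both capital letters and full stops for sentence demarcation"
    ),
    "commas_for_lists": "Uses commas to separate items in a list",
    "apostrophes_for_contraction": "Uses apostrophes for contraction",
}

# One judgement per entry means one place to look when a mark is wrong.
```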


We duly built an AI-powered, human-moderated process for marking work, tracking carefully which criteria had high volatility — in other words, which criteria were often being flagged by moderators — and amending the relevant prompting accordingly. 
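

That tracking is simple enough to sketch: a per-criterion flag rate across the moderation records. The record format below is a hypothetical shape, not our actual data model.

```python
# A sketch of per-criterion volatility: the share of AI judgements that
# moderators changed. The record format here is hypothetical.
from collections import defaultdict

def criterion_flag_rates(moderation_records):
    """moderation_records: iterable of (criterion_id, ai_mark, moderator_mark) tuples."""
    flagged = defaultdict(int)
    seen = defaultdict(int)
    for criterion_id, ai_mark, moderator_mark in moderation_records:
        seen[criterion_id] += 1
        if moderator_mark != ai_mark:
            flagged[criterion_id] += 1
    return {c: flagged[c] / seen[c] for c in seen}

# Criteria with the highest flag rates are the ones whose prompting gets revisited first.
```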


Narrowing our focus like this meant we saw our marking accuracy soar, with most criteria performing at around 80-95% accuracy. But there was a problem — we were still seeing previously ‘reliable’ criteria suddenly tank during moderation, and we were fielding continual questions from moderators around how they should handle individual marking judgements. 


Lawrence Jones called it: unpredictable ups-and-downs despite a promising start. Here’s where policy became the star of the show, and where things suddenly got really difficult. 


Sidenote: I suspect this is where many companies and individuals exploring AI assessment stop. There’s a lot of commentary around AI feedback being ‘good to a degree’, ‘consistent, but not consistent enough’, ‘specific, but not as specific as a teacher, and not every time’. If you stop here, you’re still going to be generating useful and usually reliable feedback — but you’ll be admiring the view from a false summit. 



Why policy is the most important word in any AI product or service


To show you why policy is so important, I’m going to use two examples — both lifted from the Year 6 Teacher Assessment Framework, and both based on real conversations we’ve been having behind the scenes. 


Example one: defining “Mostly”


The word “mostly” appears three times in the Year 6 Teacher Assessment Framework. You probably have a good idea of what the word means — if I asked you whether you could think of a student who “mostly” uses capital letters correctly, I expect a few faces would pop into your mind. 


Allow me to pose some scenarios:


  • If a student only ever demonstrates a skill once in each piece of writing, but they are correct each time, are they “mostly” correct?

  • What if they demonstrate it twice in each piece, and one of those is wrong every time — are they “mostly” correct?

  • What if they demonstrate it correctly six times in one piece, but then never again — are they “mostly” correct?

  • What if they demonstrate it correctly six times in one piece, but then incorrectly in two subsequent pieces — are they “mostly” correct?

  • Can you put a number to the proportion of correct to incorrect examples allowed in order to credit “mostly”?

  • Can you put a number to the minimum number of correct examples you need to see before “mostly” can be applied?


You’ll have answers in mind to all those scenarios — but here’s the really crucial follow-up question: would your colleagues give the same answers? Not just colleagues in your school, but colleagues across the teaching profession. 


The answer is almost certainly ‘no’, and that’s one reason why AI marking can be inconsistent — because teachers can be, too.


Developing a policy around our use of the word “mostly” means putting numbers and boundaries to the questions above. 
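

To show what putting numbers to it could look like, here’s a sketch of a “mostly” rule. The thresholds are invented purely for illustration; they are not our actual policy.

```python
# Illustration only: the thresholds below are invented, not the real policy.
MIN_CORRECT_EXAMPLES = 3       # hypothetical: minimum evidence before "mostly" can apply
MIN_CORRECT_PROPORTION = 0.75  # hypothetical: required share of correct examples

def is_mostly_correct(correct: int, incorrect: int) -> bool:
    """Apply the same numeric rule to every piece of work, every time."""
    if correct < MIN_CORRECT_EXAMPLES:
        return False  # too little evidence, however good the ratio
    return correct / (correct + incorrect) >= MIN_CORRECT_PROPORTION
```

The exact numbers matter less than the fact that they exist: once they are written down, two markers (human or AI) can no longer read “mostly” in two different ways.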


Then it means doing the same to all the other ambiguous terms English teachers encounter every day: “consistently”, “relevant”, “effective”, “beginning to”. 


You might not agree with our definition of “mostly”, and that’s okay — but we do want you to agree that we apply it consistently to all work, and to be able to make your own judgements accordingly. That’s step one to generating truly reliable, accurate AI assessment.



Example two: defining ‘Correct use of capital letters for proper nouns’


When we speak to teachers about AI marking of student work in English, they often jump to the trickiest, most subjective criteria, like the ability to evoke an emotional response or effectively describe a character. 


‘Correct use of capital letters for proper nouns’ seemed like a nice, innocuous criterion to mark: it refers to a clearly defined rule, so it should be easy to achieve consistency. And it was — until it wasn’t.


When our moderators started to flag increasing numbers of errors in the AI-awarded mark for this criterion, we were confused. As it turned out, the moderators were too — because almost all the errors they were flagging related to our policy. 


The policy stated that errors would include the use of lower case letters for proper nouns (e.g. ‘england’ instead of ‘England’) as well as the random inclusion of capitalised words in the middle of a sentence for no reason. 


These are some of the examples our moderators queried:

  • Use of a capital ‘D’ in ‘Dear Diary’ (is this correct, or an error? If it’s correct, is lower case ‘Dear diary’ therefore an error?)

  • Forgetting to use a capital letter partway through a name e.g. ‘Mcdonald’, ‘O’grady’ 

  • Using lower case ‘w’ to refer to ‘the warden’  — the team knew the character in question was, in the book that inspired the piece of work, called The Warden. This was only an error if you knew the book students were referencing — so was it objectively an error?

  • Use of capital letters in titles — is ‘Lord Of The Rings’ correct or an error? What about ‘Return of the King’? What about ‘Talking To Strangers’?


We had an answer to each of these queries — but multiply them by 40 criteria, across 30 pieces of work, across two or three classes, and many more schools, and you can see how vital it is to have a clear policy to be able to mark accurately every time. 
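

To give a flavour of what recording those answers looks like, here’s an illustrative structure, with each query written down and answered exactly once. The two ‘error’ rulings come from the policy described above; the rest exist in our written policy but are deliberately left out here.

```python
# Illustrative structure only: the two 'error' rulings come from the policy
# described above; the remaining rulings are omitted here on purpose.
PROPER_NOUN_CAPITALS_POLICY = {
    "lowercase_for_proper_noun": "error",     # e.g. 'england' instead of 'England'
    "random_mid_sentence_capital": "error",   # a capitalised word for no reason
    "capital_missing_within_name": None,      # e.g. 'Mcdonald', 'O'grady'
    "salutation_capitals": None,              # 'Dear Diary' vs 'Dear diary'
    "character_name_from_source_text": None,  # 'the warden' vs 'The Warden'
    "capitals_in_titles": None,               # 'Lord Of The Rings', 'Return of the King'
}

def ruling_for(case: str):
    """Look up the recorded ruling, so the same query never gets two different answers."""
    return PROPER_NOUN_CAPITALS_POLICY.get(case)
```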


We’re not the only AI company facing a policy challenge: every company in every industry using this technology is having to find ways to manage policy so that data is labelled correctly. 


This in turn creates a new problem: the longer the policy, the harder it is for a human to remember it all. There are 40-50 different criteria that need judging every time we mark a piece of writing. No human can hold that volume of associated policies in their head at once. In a beautiful, full-circle moment, we think that AI can — but that’s the subject of a future article.



So… how do you train AI to mark a piece of writing?


We started in the right place: with a granular curriculum, developed by expert teachers, designed to mimic the thinking behind teacher judgements. 


Then we tested it by comparing AI judgements to human judgements, over as much student work as we could — thanks to our brilliant team of discovery schools who made this possible.


Now we’re doing the really, really hard work: interrogating every single inconsistency, every query, every moment of hesitation, and using it to write a policy that will generate the same judgement every time. We’re Science-ing it.


I still feel my former-English-teacher self recoil instinctively at the thought of bringing objectivity into my beautifully messy and unpredictable subject. She is soothed by the thought that objectivity means fairness, sure, but it also means no more staring at the same essay for ten minutes trying to determine if it’s ‘Clear’ or ‘Developed’. She sees the time she could have saved, and the weekday evenings she could gain back. If she looks far enough down an alternative timeline, maybe she even sees herself staying in teaching.



This article is written with thanks to our discovery schools, whose time and feedback have made the insights above possible — you can read more about two of them, Alec Hunter Academy and the INMAT Primary MAT, on our case study page.


We’re changing the way schools approach marking and assessment, in order to give teachers back time — if you want to join the MATs, Headteachers, Subject Leads and English teachers already on the journey with us, you can register interest here.


Part of our work is being funded by the Department for Education — read the press release here.

 
 
 
