An unexpected number of Wilcox seniors taking AP Literature were accused of AI use when they submitted a personal statement essay assignment to Turnitin.com last October.
According to AP Literature teacher Mr. Jackson, about 60% of essays were flagged by Turnitin as potentially containing AI-generated text. “An exorbitant amount of students compared to prior years and prior assignments were flagged for AI, with many flagged at really, really high percentages like 75%, 90%, or 100% AI-generated,” Mr. Jackson says. It is unclear whether this abrupt uptick occurred due to a sudden change in the behavior of the software or a sudden change in the behavior of students. The incident raises questions about the reliability of AI detection in high-school writing.
“I think there are definitely students using AI in their assignments, but when nearly the entire class gets flagged, it doesn’t make sense,” Wilcox senior Jocelyn Wang says. She says that several seniors asked teachers to run essays written pre-ChatGPT through Turnitin “and these were flagged for AI despite being written before AI became mainstream.”
Rohan Sainani, one of a minority of AP Literature students whose essay was not flagged, believes that the problem is with the Turnitin software. “I didn’t get flagged because my spelling and grammar are just so atrocious,” he says. He thinks that many students may have been flagged because they were extra careful with writing mechanics for this particular assignment, since college essays are high stakes. Turnitin is suspicious of overly polished writing, the same way teachers are now suspicious of professional writing signatures like the em-dash. This is the kind of catch-22 that makes AI detection tricky: too-perfect grammar can be a hallmark of AI-written text, but it can also be a hallmark of conscientious students. “Many of the most honest, hardworking people in the class got flagged,” Sainani says.
Mr. Jackson agrees that some factors may have led to innocent essays being flagged. “College essays are a different style of writing,” he says. “It’s more of a narrative style, it’s more creative, so maybe because it’s not your typical academic voice, this caused it to flag more than usual.” But he also believes that some students relied on AI tools during the editing process. “Maybe students are using Grammarly AI or paper.co to help edit their essays, or copying and pasting into Gemini or ChatGPT and having it polish their essays,” he says. Another theory, he says, involves outside college tutors. “A lot of the students that were flagged said that they’d shared their essays with college tutors outside of class,” he says. “The tutors could be using AI and then sharing it back with the student.”
Because of the large number of students affected, AP Literature teachers Mr. Jackson and Ms. Nuti gave students an amnesty period to revise and resubmit their essays. Only roughly 17% of resubmissions were flagged. “This goes against the argument that these were all false flags on Turnitin,” Mr. Jackson says. “Because what were you doing between submission one and submission two that caused your second submission not to get flagged?”
For the essays that remained flagged, Mr. Jackson and Ms. Nuti did not just rely on Turnitin reports, but also reviewed document histories using a program called Brisk, and met with students individually. When students were found to have used AI, they were given the opportunity to make up the assignment in a timed write. “I think Ms. Nuti and I were extremely generous and absolutely did our due diligence,” Mr. Jackson says. Still, he acknowledges that detecting AI use is not straightforward. “AI checkers are moving at a different rate to the actual AI,” Mr. Jackson says. “There’s a constant race where the AI checkers are playing catch-up to the AI, and you never know whether the flags are authentic or not.”
Indeed, false positives are a widely acknowledged issue with AI detection software (see Scribe Spring 2025). Academic studies have reported false positive rates for AI detection tools as high as 25%. Writing on Medium.com in 2023, journalist Michelle Harwood noted that the AI detector ZeroGPT reported 94% certainty that AI wrote the US Constitution.
Turnitin.com claims a false positive rate of 1%, though this has been disputed by outside experts. Even if that figure is accurate, it means that for every 10,000 student-written essays submitted, about 100 will be falsely flagged as AI-written, potentially harming 100 students. Some colleges, such as Vanderbilt University, have disabled Turnitin’s AI detection feature over concerns about false positives. On its website, Turnitin also admits that its false positive rate is higher for middle and high school students than for post-secondary students.
Unlike with plagiarism detection, where teachers can see the original text that was copied, teachers have no means of fact-checking AI detection software. Turnitin is a “black box,” Wang says. Teachers cannot reliably verify from the software alone whether a particular essay was correctly flagged or a false positive, and sorting out the truth is stressful for both teachers and students, as the Wilcox incident shows. “It was long and it was painstaking, really labor intensive, and not something I want to have to go through again,” Mr. Jackson says.
Wang adds: “Turnitin is not fine-tuned enough. This experience definitely opened my eyes to the flaws in their system.”
