
AI-generated exam answers go undetected in real-world test

Researchers from the University of Reading in the UK conducted a blind study to see whether human educators could detect AI-generated content. The results don’t bode well for teachers.

The move toward distance learning means many student assessments now take place outside the traditional setting of a written exam with an invigilator watching for cheating.

Ready access to advanced AI models has made it easy for students to use tools like ChatGPT to write their coursework assignments for them or to help during online exams.

Would a smart university professor be able to tell if a student was using AI to answer online exam questions?

Associate Professor Peter Scarfe, one of the paper’s lead authors, said, “Many institutions have moved away from traditional exams to make assessment more inclusive. Our research shows it is of international importance to understand how AI will affect the integrity of educational assessments.”

In the “largest and most robust blind study of its kind,” the researchers submitted 100% AI-written submissions into the examinations system across five undergraduate modules of a BSc degree in Psychology.

The markers of the exams were completely unaware of the study. This made the experiment a form of real-world Turing test, where a marker who didn’t call ‘Cheat!’ on a paper believed a human wrote it.

Here’s how it worked:

  • The submissions were created using GPT-4.
  • They submitted a total of 33 AI-generated exam entries across five modules.
  • The study included both short answer questions (SAQs) and longer essay-based questions.
  • For SAQ exams, submissions consisted of four answers chosen from six questions, each with a 200-word limit.
  • For essay-based exams, submissions consisted of a single 1,500-word essay (students submitted one answer from a choice of either three or four questions, depending on the module).

The researchers used the following prompts with ChatGPT for the SAQs and essays, respectively:

  • “Including references to academic literature but not a separate reference section, answer the following question in 160 words: XXX”
  • “Including references to academic literature but not a separate reference section, write a 2000-word essay answering the following question: XXX” (they chose 2,000 words because ChatGPT often underdelivers on word counts)
  • In each prompt, XXX was replaced by the exam question.
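The paper doesn’t describe the submission pipeline, but the templated prompts above are easy to picture in code. Below is a minimal sketch of how they could be filled in and sent to a GPT-4-class model via the OpenAI Python SDK (v1.x); this is not the researchers’ actual tooling, and the function name and example question are hypothetical.

```python
# A minimal sketch of the prompt templates above driven programmatically.
# NOT the researchers' actual pipeline (the paper doesn't describe one);
# the model name, helper function, and example question are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SAQ_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, answer the following question in 160 words: {q}"
)
ESSAY_TEMPLATE = (
    "Including references to academic literature but not a separate "
    "reference section, write a 2000-word essay answering the following "
    "question: {q}"
)

def generate_answer(question: str, essay: bool = False) -> str:
    """Fill the matching template and return the model's answer text."""
    prompt = (ESSAY_TEMPLATE if essay else SAQ_TEMPLATE).format(q=question)
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Hypothetical exam question, standing in for the XXX placeholder:
print(generate_answer("Describe the role of working memory in problem solving."))
```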

When the results were tallied, 94% of the AI submissions went unflagged by the markers. What kind of grades did the AI papers achieve?

The researchers said, “The grades awarded to our AI submissions were on average half a grade boundary higher than that achieved by real students. Across modules there was an 83.4% chance that the AI submissions on a module would outperform a random selection of the same number of real student submissions.”
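One way to read that 83.4% figure is as a resampling comparison: repeatedly draw a random set of real-student grades the same size as the AI set and check how often the AI set scores higher. Here is a rough sketch of that calculation; the grade arrays are hypothetical placeholders, not the study’s data, and comparing mean grades is an assumption about how “outperform” was measured.

```python
# Rough illustration of the 83.4% statistic via resampling. The grades are
# hypothetical placeholders (NOT the study's data), and comparing mean grades
# is an assumption about what "outperform" means here.
import random

ai_grades = [68, 65, 70, 66, 71, 64]                       # hypothetical
student_grades = [62, 58, 70, 55, 66, 61, 59, 73, 60, 64]  # hypothetical

def ai_win_rate(ai, students, trials=10_000):
    """Share of trials where the AI set's mean beats a random student sample."""
    ai_mean = sum(ai) / len(ai)
    wins = 0
    for _ in range(trials):
        sample = random.sample(students, len(ai))
        if ai_mean > sum(sample) / len(sample):
            wins += 1
    return wins / trials

print(f"AI outperformed a random student selection in "
      f"{ai_win_rate(ai_grades, student_grades):.1%} of trials")
```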

Researchers further noted that their approach likely overestimates the detectability of AI use in real-world scenarios. As Dr. Scarfe explained, “If real students were cheating in an exam, they’d be unlikely to take such a naively obvious approach as we did.” 

In practice, students might use AI as a starting point, refining and personalizing the output, making detection even harder.

And if that wasn’t enough, then beyond the researchers’ AI submissions, other students likely used ChatGPT for their answers. This means the real detection rate could be even lower than the recorded results.

No easy solutions

Couldn’t tutors simply have used AI detection software? Maybe, but not with much confidence, says the study.

AI detectors, like the one offered by the popular academic plagiarism platform Turnitin, have been proven inaccurate.

Plus, AI detectors risk falsely accusing non-native English speakers, who are less likely to use the vocabulary, idioms, and phrasing that detectors treat as signals of human writing.

With no reliable means of detecting AI-generated content, education leaders are left scratching their heads. Should AI use be penalized, or should it simply form part of the syllabus? Should using AI be normalized like the calculator?

Overall, there’s some consensus that integrating AI into education is not without risks. At worst, it threatens to erode critical thinking and stunt the creation of authentic new knowledge.

Professor Karen Yeung cautioned against the potential “deskilling” of students, telling The Guardian, “There is a real danger that the coming generation will end up effectively tethered to these machines, unable to engage in serious thinking, analysis or writing without their assistance.”

To combat AI misuse, the Reading researchers recommend potentially moving away from unsupervised, take-home exams to more controlled environments. This could involve a return to traditional in-person exams or the development of new, AI-resistant assessment formats.

Another possibility – and a model some universities are already following – is developing coursework that teaches students how to use AI critically and ethically.

We also need to confront the evident lack of AI literacy among tutors that this study exposed. It seems pretty woeful.

ChatGPT often resorts to certain ‘tropes’ or sentence patterns that become quite obvious once you’ve been exposed to them frequently.

It would be interesting to see how a tutor ‘trained’ to recognize AI writing would perform under the same conditions.

ChatGPT’s exam record is mixed

The Reading University study is not the first to test AI’s capabilities in academic settings. Various studies have examined AI performance across different fields and levels of education:

  • Medical exams: A group of pediatric doctors tested ChatGPT (GPT-3.5) on the neonatal-perinatal board exam. The AI scored only 46% correct answers, performing best on basic recall and clinical reasoning questions but struggling with multi-logic reasoning. Interestingly, it scored highest (78.5%) in the ethics section.
  • Financial exams: JPMorgan Chase & Co. researchers tested GPT-4 on the Chartered Financial Analyst (CFA) exam. While ChatGPT was unlikely to pass Levels I and II, GPT-4 showed “a decent chance” if prompted appropriately. The AI models performed well in the derivatives, alternative investments, and ethics sections but struggled with portfolio management and economics.
  • Law exams: ChatGPT has been tested on the bar exam, often scoring very highly.
  • Standardized tests: The AI has performed well on Graduate Record Examinations (GRE), SAT Reading and Writing, and Advanced Placement exams.
  • University courses: Another study pitted ChatGPT (model not given) against students across 32 degree-level topics, finding that it matched or exceeded students on only 9 of the 32 exams.

So, while AI excels in some areas, performance is highly variable depending on the subject and the type of test in question.

The conclusion is that if you’re a student who doesn’t mind cheating, you can use ChatGPT to improve your grades with only a 6% chance of getting caught. You’ve got to like those odds.
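Worth noting, though, is how those odds compound. Under the simplifying assumption that each submission is flagged independently with probability 0.06, the chance of being caught at least once grows quickly with repeated use:

```python
# Chance of at least one flag over n submissions, assuming each submission is
# flagged independently with probability 0.06 (a simplifying assumption).
for n in (1, 5, 10, 20):
    print(f"{n:>2} submissions: {1 - 0.94**n:.1%} chance of at least one flag")
# -> 6.0%, 26.6%, 46.1%, 71.0%
```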

As the researchers noted, student assessment methods will have to change to maintain academic integrity, especially as AI-generated content becomes harder to detect.

The researchers added a humorous conclusion to their paper.

“If we were to say that GPT-4 had designed part of this study, did part of the analysis and helped write the manuscript, apart from those sections where we have directly quoted GPT-4, which parts of the manuscript would you identify as written by GPT-4 rather than the authors listed?”

If the researchers “cheated” by using AI to write the study, how would you prove it?
