How well can ChatGPT-4 write APA-style psychology papers?


In a recent study published in Contemporary School Psychology, researchers put the latest AI technology to the test in academic writing, revealing both its potential and its limitations.

Artificial intelligence (AI) has been making waves across many fields, and academia is no exception. AI-powered tools such as Grammarly and Turnitin have become staples for students and researchers, helping to refine writing by checking grammar and ensuring the originality of written work, respectively. However, the capabilities of these tools, particularly in autonomously generating coherent, reliable, and scientifically accurate content, remain under scrutiny.

Led by Adam B. Lockwood and Joshua Castleberry of Kent State University, the study evaluated Generative Pre-trained Transformer 4 (GPT-4), a widely used advanced AI language model developed by OpenAI, on its ability to write American Psychological Association (APA)-style psychology papers.

While recent advances in technology have enabled these sophisticated language models to produce text that resembles human writing, the researchers set out to assess GPT-4's performance in three areas: substantiation of claims, factual accuracy, and referencing.

Lockwood and Castleberry entered the following prompt into GPT-4: “Write a 2500-word manuscript on the ethical dilemmas of using ChatGPT to write for psychological and educational reports. Address how APA and NASP guidelines, as well as HIPAA and FERPA laws pertain to these ethical dilemmas. Provide recommendations for overcoming these limitations. Provide citations and references in APA formatting.”

GPT-4 produced an 1,814-word document; after removal of the title, abstract, keywords, headings, and references, a 1,043-word paper of 45 sentences remained.

Of the 42 sentences that should have been supported by an in-text citation, only 17 (40.5%) were correctly substantiated. Of the remaining 25 sentences, 40% lacked a citation altogether, 40% cited a source that did not exist, and 20% cited a source that was irrelevant to the claim being made in the paper.

To check the scientific accuracy of the 25 unsubstantiated claims, the researchers turned to other sources: they were able to fully confirm the accuracy of 14 sentences and to partially confirm 3 more (i.e., the other sources did not explicitly state the claim, but it could be inferred). Counting the 14 fully confirmed sentences together with the 17 correctly substantiated ones, 31 of the 42 sentences (73.8%) were verified.

Finally, 16 references were provided at the end of the paper. Twelve referenced real websites, although errors were found in 5 of these: 1 listed incorrect authors, 1 failed to provide a Digital Object Identifier (DOI), and 3 provided incorrect links. Of the remaining 4 references, 1 pointed to the wrong article and the other 3 were broken links.

Lockwood and Castleberry concluded, “While GPT-4 demonstrated some capability in generating factually accurate information and producing APA-style citations, there were notable limitations. The substantial number of unsubstantiated claims and the presence of errors in citations and referencing indicate the need for further refinement and that we cannot blindly rely on GPT-4 to write papers.”

Some limitations should be noted. The study evaluated only a single paper, which may not be representative of GPT-4’s overall performance, and the specific prompt used may have shaped GPT-4’s output; further research is needed to fully understand the model’s capabilities.

The study, “Examining the Capabilities of GPT-4 to Write an APA-Style School Psychology Paper,” was authored by Adam B. Lockwood and Joshua Castleberry.