ChatGPT-created letters of recommendation are nearly indistinguishable from human-authored letters, study finds

In a new study published in the journal AEM Education and Training, researchers found that academic physicians could distinguish between recommendation letters written by humans and those generated by artificial intelligence (AI) only slightly better than chance. The study raises critical questions about the future role of AI in academic assessments, the need for ethical guardrails around its use, and a potential reevaluation of current recommendation letter practices.

Letters of recommendation are a staple of the academic world, particularly in medicine. They play a critical role in decisions ranging from student admissions to faculty promotions. Yet writing these letters is often a burdensome task for busy academics. With the rise of AI technologies like ChatGPT, a tool adept at generating human-like text, a question emerged: Could AI assist with this labor-intensive process?

“This topic interested us as we recognized the essential yet time-consuming role of letters of recommendation (LORs) in academic medicine,” explained study author Carl Preiksaitis, a clinical instructor at the Department of Emergency Medicine at Stanford University School of Medicine. “These letters are written for a variety of different scenarios, from application to medical school and residency to faculty promotion. We had heard anecdotal evidence that generative AI models, such as ChatGPT, were being used to aid in authoring LORs and we wanted to explore this possibility in a more rigorous way.”

To conduct the study, the researchers selected four hypothetical candidates for academic promotion. They prepared detailed profiles for these candidates, covering their educational background, employment history, and accolades, but without any gender identification to avoid bias.

Next, the team crafted letters of recommendation. Two experienced team members wrote letters as they usually would, serving as the ‘human’ authors. Meanwhile, two junior team members, with no prior experience writing such letters, used ChatGPT to create the AI-authored letters from prompts derived from the candidates’ achievements. To maintain consistency, all letters were formatted similarly, so that only their content differed.

The researchers then designed a survey, which was administered to 32 participants, primarily full professors in the fields of emergency medicine, internal medicine, and family medicine. These participants were randomly given eight out of 16 letters (half AI-authored, half human-authored) to review. They were asked to guess the authorship of each letter, rate its quality, and assess its persuasiveness regarding the candidate’s promotion.
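The article does not spell out the mechanics of this assignment. As a minimal sketch of how a balanced, randomly ordered packet of eight letters might be drawn (the letter labels and fixed seed below are illustrative assumptions, not details from the study):

```python
import random

# Pools of 16 letters, half AI-authored and half human-authored,
# matching the split described in the article. The labels are
# hypothetical placeholders, not the study's actual identifiers.
ai_letters = [f"AI-{i}" for i in range(1, 9)]
human_letters = [f"Human-{i}" for i in range(1, 9)]

def assign_packet(rng: random.Random) -> list[str]:
    """Draw four letters from each pool, then shuffle so that
    ordering gives the reviewer no cue to authorship."""
    packet = rng.sample(ai_letters, 4) + rng.sample(human_letters, 4)
    rng.shuffle(packet)
    return packet

rng = random.Random(0)  # fixed seed only to make the demo reproducible
for reviewer in range(1, 4):
    print(f"Reviewer {reviewer}: {assign_packet(rng)}")
```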

On average, participants correctly identified the authorship only 59.4% of the time, barely above the 50% expected from random guessing. Interestingly, even those with extensive experience reviewing such letters did not fare much better. Perceived quality and persuasiveness revealed a bias: reviewers rated letters they believed were human-written higher than those they thought were AI-generated. When the letters were grouped by their actual authorship, however, this difference disappeared.

“One surprising element was the overall difficulty participants had in distinguishing between human- and AI-authored LORs, with accuracy only slightly better than chance,” Preiksaitis said. “Additionally, the study revealed a discrepancy in the perceived quality and persuasiveness of LORs based on the suspected authorship, with human-suspected LORs rated more favorably, despite the actual authorship.”

The study also examined gender bias in the letters. Human-written letters contained more female-associated words, while AI-generated letters tended to contain more male-associated words. Additionally, AI detection tools such as GPTZero and OpenAI’s Text Classifier proved unreliable, each correctly identifying the authorship of the letters only about half of the time, roughly the rate expected from random guessing.
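The article does not identify the word lists behind this gendered-language analysis. A minimal sketch of the general counting approach, using tiny illustrative lexicons (the specific words, function name, and sample text below are assumptions, not the study's instrument):

```python
import re
from collections import Counter

# Tiny illustrative lexicons of gender-associated terms, in the spirit
# of published gendered-wording lists; the study's actual lists are
# not given in the article.
FEMALE_ASSOCIATED = {"compassionate", "nurturing", "supportive", "warm", "collaborative"}
MALE_ASSOCIATED = {"leader", "confident", "assertive", "ambitious", "independent"}

def gendered_word_counts(letter: str) -> Counter:
    """Tally gender-associated words in a single letter's text."""
    tokens = re.findall(r"[a-z']+", letter.lower())
    counts = Counter()
    for token in tokens:
        if token in FEMALE_ASSOCIATED:
            counts["female_associated"] += 1
        elif token in MALE_ASSOCIATED:
            counts["male_associated"] += 1
    return counts

sample = "She is a compassionate, collaborative leader and a confident researcher."
print(gendered_word_counts(sample))
# e.g. Counter({'female_associated': 2, 'male_associated': 2})
```

Validated research instruments use far larger lexicons, but the mechanics (tokenize each letter, then tally hits against each word list) are the same.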

The findings are in line with a previous study published in Research Methods in Applied Linguistics. In that study, 72 linguistics experts were tested to see if they could differentiate between research abstracts written by AI and humans. Despite the experts’ efforts to use linguistic and stylistic analyses, their success rate was only 38.9%, indicating a significant challenge in distinguishing AI writing from human writing.

“The average person should understand that AI technologies like ChatGPT have reached a level of sophistication where they can generate text, such as LORs, that is nearly indistinguishable from human-authored content,” Preiksaitis told PsyPost. “This suggests that AI might be a viable tool to reduce the administrative workload in academic settings. However, it also raises questions about the integrity and personalization of such important documents. The study highlights the potential for AI to assist in academic writing while also signaling the need for careful consideration of its implications.”

Despite these intriguing results, the study has limitations. The standardized candidate profiles used to generate the letters might not reflect the more personalized and nuanced letters written in real-world scenarios. The recruitment strategy may also have biased the results, with men and emergency medicine physicians overrepresented among participants. Moreover, the study did not examine why or how reviewers decided whether a letter was human- or AI-authored.

Future research could explore these areas further, perhaps focusing on how to enhance AI’s ability to write more personalized and unbiased letters. Additionally, as AI continues to advance, it’s essential to consider the ethical implications and the need for transparency in its usage, especially in critical areas like academic evaluations.

“A key caveat is the standardized approach used to generate the LORs, which might not reflect the personalized and nuanced understanding a human writer has of the candidate,” Preiksaitis noted. “The overrepresentation of certain demographics in the participant pool and the potential bias in their responses also could limit the generalizability of our findings. Future research should explore how AI-generated LORs might be optimized for authenticity and how biases, both human and AI, can be mitigated. Additionally, the ethical implications of AI assistance in such tasks need thorough exploration.”

“Perhaps most provocatively, this research and the increasing ability of generative AI cause us to question the utility of practices from a pre-AI era, like LORs,” the researcher added. “Perhaps we can use this crossroads as an opportunity to develop a different way of recommending candidates that is more equitable and transparent.”

The study, “Brain versus bot: Distinguishing letters of recommendation authored by humans compared with artificial intelligence,” was authored by Carl Preiksaitis, Christopher Nash, Michael Gottlieb, Teresa M. Chan, Al’ai Alvarez, and Adaira Landry.

© PsyPost