Is ChatGPT really more creative than humans? New research provides an intriguing test


While some high-profile research suggests that artificial intelligence (AI) has surpassed human creativity, a recent study in the International Journal of Human–Computer Interaction offers a more nuanced view. Scientists found that although AI excels at generating diverse interpretations of ambiguous images, human judges still perceive human responses as more creative. These results indicate that human creativity still holds an edge over AI in certain domains.

The rapid development and implementation of AI systems, such as ChatGPT by OpenAI, Llama from Meta, and Google Bard, have sparked widespread discussion about their impact on society and the concept of creativity. As AI continues to advance and perform tasks traditionally reserved for humans, there are growing concerns about the displacement of human labor, implications for education, and the ethical dimensions of AI-generated content.

“We were interested in exploring this topic because of the rapid advancement of AI language models like ChatGPT that can generate remarkably human-like text outputs. We are interested in understanding the current capabilities of these models compared to humans’ performance. More generally, we are interested in understanding how these AI will change the way humans perceive creativity and what type of social impact these AI models may have,” explained study author Simone Grassini, an associate professor at the University of Bergen.

“There has been considerable excitement alongside concerns about whether AI will eventually surpass human capabilities for creative tasks,” Grassini continued. “In this study, we aimed to empirically investigate the current state of AI’s creative abilities in comparison to humans.”

To explore the comparative creative abilities of AI and humans, the researchers used a newly developed task known as the Figural Interpretation Quest (FIQ). The task involves interpreting ambiguous, abstract figures and is designed to evaluate divergent thinking, a cognitive process that involves generating multiple, unique solutions or ideas in response to an open-ended problem. It contrasts with convergent thinking, which focuses on finding a single, correct solution.

Grassini noted that the FIQ “is multimodal in nature – requiring both the perception of an ambiguous visual image and the generation of a creative textual interpretation of that image. Crucially, the FIQ is a recently developed test, meaning the AI system we tested (ChatGPT-4) could not have encountered optimal responses in its training data up to 2021. This allowed us to evaluate the AI’s ability to ‘come up’ with original creative solutions, rather than simply retrieving responses from its training memory (an important limitation shared by all the previously published studies on the topic).”

The study included 279 native English speakers recruited through an online platform, of whom 256 completed the study. These participants were asked to provide two different interpretations for each of four ambiguous figures. The goal was to produce interpretations that were as semantically different from each other as possible. Participants were given 30 seconds to enter their interpretations for each figure, ensuring they responded under a time constraint that mirrored real-world creative tasks.

In parallel, ChatGPT-4, an advanced AI chatbot, was also tested using the same four figures. The AI was prompted to generate responses of varying lengths (one, two, or three words) to simulate the distribution of response lengths observed in human data. This was done to ensure a fair comparison between human and AI responses. The AI sessions were conducted multiple times to gather a robust dataset of AI-generated interpretations.
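
The paper does not detail how these repeated AI sessions were scripted. As a rough illustration only, the sketch below shows how such prompting could be automated against a vision-capable chat model via the OpenAI Python SDK; the model name, figure URL, prompt wording, and session count are assumptions, not the authors’ actual setup.

```python
# Hypothetical sketch: collecting repeated one-, two-, or three-word
# interpretations of an ambiguous figure from a chat model. All specifics
# (model, figure_url, N_SESSIONS, prompt wording) are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
N_SESSIONS = 20
figure_url = "https://example.com/fiq_figure_1.png"  # placeholder image URL

interpretations = []
for length in (1, 2, 3):  # mirror the human response-length distribution
    for _ in range(N_SESSIONS):
        prompt = (
            f"Give two different interpretations of this ambiguous figure, "
            f"each exactly {length} word(s) long and as semantically "
            f"different from each other as possible."
        )
        resp = client.chat.completions.create(
            model="gpt-4o",  # assumption; the study used ChatGPT-4
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": figure_url}},
                ],
            }],
        )
        interpretations.append(resp.choices[0].message.content)
```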

The researchers employed two primary measures to assess creativity: flexibility and perceived creativity.

Flexibility was evaluated using the semantic distance between the two interpretations provided for each figure. The larger the semantic distance, the more flexible the responses were considered to be. This measure was calculated using the SemDis platform, which analyzes the degree of difference between the meanings of two responses.
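
As a rough illustration of the idea behind the flexibility measure (not the SemDis pipeline itself), the sketch below approximates semantic distance with off-the-shelf sentence embeddings and cosine distance; the embedding model and example responses are assumptions.

```python
# Approximate "flexibility" as the cosine distance between embeddings of a
# participant's two interpretations of the same figure. This is a generic
# stand-in for semantic distance, not the SemDis platform used in the study.
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer("all-MiniLM-L6-v2")

def flexibility(interpretation_a: str, interpretation_b: str) -> float:
    """Larger values = more semantically different = more flexible."""
    emb_a, emb_b = model.encode([interpretation_a, interpretation_b])
    return float(cosine(emb_a, emb_b))

# Example: two readings of the same ambiguous figure
print(flexibility("a sleeping cat", "a crashing ocean wave"))  # relatively large
print(flexibility("a sleeping cat", "a napping kitten"))       # relatively small
```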

To assess perceived creativity, a panel of eleven human judges, who were blind to whether the responses were from humans or the AI, rated each interpretation on a scale from 0 to 2. A score of 0 indicated a basic or non-creative response, 1 indicated a creative response, and 2 indicated an exceptionally creative interpretation. The judges were psychology students who rated the responses as part of their research training or in exchange for university credits.
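
For illustration only, here is one way such judge ratings might be aggregated into per-response and per-group perceived-creativity scores; the column names and simple averaging are assumptions rather than the authors’ exact analysis.

```python
# Hypothetical aggregation of 0-2 judge ratings into perceived-creativity scores.
import pandas as pd

# One row per (judge, response) rating; values are made up for illustration.
ratings = pd.DataFrame({
    "response_id": [1, 1, 1, 2, 2, 2],
    "source":      ["human", "human", "human", "ai", "ai", "ai"],
    "judge":       ["J01", "J02", "J03", "J01", "J02", "J03"],
    "rating":      [2, 1, 2, 1, 1, 0],  # 0 = basic, 1 = creative, 2 = exceptional
})

# Average across judges for each response, then compare human vs. AI means.
per_response = (
    ratings.groupby(["response_id", "source"])["rating"].mean().reset_index()
)
print(per_response.groupby("source")["rating"].mean())
```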

The researchers found that ChatGPT-4 demonstrated a higher average level of flexibility in generating interpretations compared to humans. This indicates that AI was adept at producing semantically diverse ideas, showcasing its ability to think divergently across different contexts.

“One surprising aspect was just how well the AI language model performed in terms of generating semantically diverse interpretations of the ambiguous visual images,” Grassini told PsyPost. “Its flexibility scores based on semantic distance analysis were on average higher than the average human participant. This highlights the remarkable natural language abilities of large language models.”

However, despite AI’s superior flexibility, human judges perceived human-generated responses as more creative overall. This suggests that there are aspects of creativity that go beyond mere semantic diversity, likely linked to the richness of human experiences and the nuanced ways humans interpret stimuli.

In addition, the highest-scoring human responses outperformed the best AI responses in both perceived creativity and flexibility. While AI produced a narrower range of responses, human participants exhibited a broader variety of ideas, which contributed to their higher creativity scores.

“AI excelled at generating semantically diverse interpretations of ambiguous visual images,” Grassini explained. “However, when human judges subjectively rated the creativity of the AI’s outputs versus real humans, they tended to perceive the human responses as more creative overall.

“This suggests that while AI has advanced abilities in producing diverse creative ideas and is very good in following the instructions we provide to it, human creativity still appears to have an edge, at least in how it is subjectively perceived. The study highlights both the remarkable progress of AI but also some of its current limitations compared to human cognition when it comes to creativity tasks requiring multimodal processing and open-ended imagination.”

The findings also suggest that “there may be important aspects of creativity that go beyond just semantic diversity when it comes to how open-ended creative works are perceived by others, and that AI and humans may actually excel in different types of creativity, and that whether or not AI has reached human (or super-human) levels of creativity may actually depend on the type of creativity we are testing,” Grassini added.

But the study, like all research, has limitations, including its focus on only one type of creativity task involving ambiguous visual images, which may not generalize to other creative domains. Additionally, the AI’s training data cutoff in September 2021 means it might not have been fully optimized for the FIQ task, which was developed later. Future research could explore a broader range of creative tasks to better understand AI’s strengths and limitations.

“A key caveat is that our study focused on just one particular type of creativity test involving interpretation of ambiguous visual images,” Grassini explained. “While this allowed us to evaluate multimodal processing, the results may not generalize to other types of creative domains or tasks. Additionally, we tested a single AI model (ChatGPT-4), which is already dated compared to the most cutting-edge models being developed, such as the newly presented ChatGPT-4o, that may have better overall capability when it comes to multimodality.”

Previous studies have presented varying results on the creativity of AI compared to humans, largely influenced by how creativity was measured in each case. In one study, AI outperformed the average human on the Torrance Tests of Creative Thinking, which assess creativity based on factors like fluency, flexibility, originality, and elaboration in generating responses to open-ended prompts. The AI’s superior performance on this test suggests that it can excel in tasks requiring a broad and rapid production of ideas, benefiting from its vast training data and processing speed.

Another study found that AI scored in the top percentile of creative thinking, using a divergent thinking task that measured the number and uniqueness of ideas generated in response to prompts. This type of task emphasizes the ability to produce a wide variety of ideas, a domain where AI’s capacity to draw from extensive datasets and its algorithmic efficiency give it an edge over the average human.

On the other hand, a study examining creativity in children used a different approach, focusing on simple tasks where participants had to generate novel uses for everyday objects. The findings showed that children outperformed AI, highlighting that human creativity, particularly in younger individuals, can involve intuitive leaps and personal experiences that AI struggles to replicate.

“Our long-term goal is to better understand the cognitive mechanisms underlying human creativity and how they compare to the operationalization of creativity in artificial intelligence systems,” Grassini told PsyPost. “This can shed light on the unique strengths of human cognition versus AI approaches.”

“We also hope to explore how AI language models could potentially augment or support human creativity rather than just trying to replicate or replace it. Ultimately, increasing our understanding of machine creativity capabilities compared to human cognition can help guide the responsible development of AI technologies.”

“I think it’s important to emphasize that our study should not be interpreted as trying to ‘rank’ human vs AI creativity overall,” Grassini added. “Both involve remarkably complex cognitive capabilities that are difficult to reduce to a single score. Additionally, the study highlights some key differences in the creative process between biological human cognition and current language AI – with the former perhaps better attuned to multimodal processing and subjective creative expression.”

“As AI systems become more multimodal and acquire richer world experiences, that could further advance their creative abilities. This is still an early stage of research into these questions. We like to imagine that in the future AI could be used by humans to improve their creative potential, leveraging the type of creativity where AI excels.”

The study, “Artificial Creativity? Evaluating AI Against Human Performance in Creative Interpretation of Visual Stimuli,” was authored by Simone Grassini and Mika Koivisto.

© PsyPost