Measuring Creativity with Deep Learning Techniques: A Scoping Review of Automated Evaluation Methods

1. Introduction

Creativity is increasingly recognized as a crucial 21st-century skill, becoming a core component of educational policies and curricula worldwide (Plucker et al., 2023). This multifaceted concept has seen significant research advancements in understanding its various elements, including idea generation in collaborative creative processes (Sawyer, 2011, 2022). Furthermore, the critical role of creativity evaluation has emerged as a key area of study (Guo et al., 2023). Creativity evaluation, defined as the capacity to accurately identify creative ideas, solutions, or individual traits, is essential for understanding creative strengths and potential (Kim et al., 2019). In education, this evaluation is particularly vital for teachers and students, facilitating the monitoring, refinement, and implementation of innovative ideas, ultimately enhancing students’ creative performance throughout the creative process (Rominger et al., 2022).

However, measuring creativity presents a complex challenge in research. Creativity evaluation traditionally focuses on four core dimensions: fluency (the number of meaningful ideas), flexibility (the variety of idea categories), elaboration (the depth of detail in ideas), and novelty (the uniqueness of ideas) (Bozkurt Altan and Tan, 2021). Manual creativity evaluations, including paper-based tests and psychological assessments, have been widely employed (Rafner et al., 2022). Examples of these include the Torrance Tests of Creative Thinking (Torrance, 2008), the Creativity Assessment Packet (CAP) (Williams, 1980), and Divergent Production abilities (DP) tests (Guilford, 1967). Other manual methods encompass rating scales (Gong and Zhang, 2017; Birkey and Hausserman, 2019), surveys and questionnaires (De Stobbeleir et al., 2011; Gong et al., 2019), grading rubrics (Vo and Asojo, 2018), and subjective scoring of creativity dimensions (George and Wiley, 2020). Despite their prevalence, these manual methods are prone to errors due to subjectivity in expert ratings and are notably time-consuming (Said-Metwaly et al., 2017; Doboli et al., 2020).

To overcome these limitations, automated creativity evaluation leveraging Artificial Intelligence (AI) techniques offers a promising alternative. AI-driven approaches can also enrich co-creation processes by providing real-time feedback, guiding students toward the development of novel solutions (George and Wiley, 2020; Kenworthy et al., 2023). AI, in essence, empowers machines to execute tasks that typically require human intelligence. Within AI, Machine Learning (ML) algorithms enable systems to learn from data and make predictions. Specifically, computer vision is used for analyzing visual data, while Natural Language Processing (NLP) is employed for textual data analysis. Given the focus on textual ideas in creativity, NLP becomes instrumental in enabling machines to understand, interpret, analyze, and generate human language (Braun et al., 2017).

NLP encompasses a diverse array of approaches and techniques, including text similarity, text classification, topic modeling, information extraction, and text generation. These techniques utilize computational methods ranging from statistical analyses to sophisticated predictive and deep learning models. NLP provides various avenues for computing variables relevant to creativity dimensions. Within the vector space created by NLP, five key variables can be derived: (1) Contextual and semantic similarity, used to gauge idea uniqueness and originality (Hass, 2017; Doboli et al., 2020); (2) text clustering, capable of identifying different categories within text; (3) text classification, employed to compute novelty (Simpson et al., 2019); (4) keyword searching, primarily used for measuring elaboration (Dumas et al., 2021); and (5) information retrieval, applicable for scoring the level of idea elaboration (Vartanian et al., 2020). These applications of NLP in co-creative processes can automate creativity evaluation and enhance co-creation by providing valuable feedback (Bae et al., 2020; Kang et al., 2021; Kovalkov et al., 2021).

Current research is increasingly focused on exploring how various computational techniques, particularly deep learning, can be effectively used for measuring creativity dimensions (Doboli et al., 2020). This area of research has been highly productive, leading to the development of diverse computational techniques. For instance, (1) novelty is assessed using keyword similarity (Prasch et al., 2020), part-of-speech tagging (Karampiperis et al., 2014; Camburn et al., 2019), and various ML classifiers like Bayesian classifiers, random trees, and Support Vector Machines (SVM) (Manske and Hoppe, 2014; Simpson et al., 2019; Doboli et al., 2020); (2) originality is measured using Latent Semantic Analysis (LSA) (Dunbar and Forster, 2009), Global Vectors for Word Representation (GloVe) (Dumas et al., 2021), and part-of-speech tagging (Georgiev and Casakin, 2019); (3) fluency is evaluated with LSA (Dumas and Dunbar, 2014; LaVoie et al., 2020); (4) elaboration is measured via part-of-speech tagging (Dumas et al., 2021); and (5) the level of detail is assessed using text-mining methods (Camburn et al., 2019).

This study addresses four main challenges in the current research landscape of computational creativity assessment: (1) the wide range of computational techniques applied to evaluate diverse creativity dimensions; (2) the lack of consensus on specific techniques for measuring particular creativity dimensions; (3) the often-unexplained rationale behind using certain techniques for specific dimensions, such as using LSA for category switch evaluation (Dunbar and Forster, 2009); and (4) the necessity to consider the inherent limitations of computational techniques that may affect the accuracy of creativity dimension evaluation (Olivares-Rodríguez et al., 2017; Doboli et al., 2020). To the best of our knowledge, no existing literature review comprehensively addresses these challenges. This exploration leads us to two key research questions: (1) What NLP approaches and techniques are currently employed to automatically measure creativity? and (2) Which creativity dimensions are being automatically computed, and how? Answering these questions allows us to tackle the aforementioned challenges in automatic creativity evaluation, fostering a deeper understanding of NLP approaches and creativity dimensions and their applications in evaluating creativity, identifying research gaps and limitations, and proposing alternative solutions to advance the evaluation and promotion of creativity. Therefore, we adopted a scoping review methodology, which is effective for understanding key concepts and identifying knowledge gaps (Munn et al., 2018), ultimately aiming to inspire innovation and improve education through advanced technologies.

2. Research Objectives

This scoping review is structured around two primary objectives:

  1. To identify and categorize the various ML approaches utilized in automatic creativity evaluation. This includes highlighting their application scenarios and discussing the inherent limitations of different computational approaches and techniques. This categorization aims to provide a more profound understanding of the contributions of various ML approaches to the field of automated creativity assessment.

  2. To analyze the definitions and computational methods used for different creativity dimensions in automatic creativity evaluation research. This analysis seeks to foster a more unified understanding of creativity dimensions and their computation, paving the way for future advancements in automatic creativity evaluation methodologies and ensuring more robust measures of creativity.

3. Method

This section outlines the sampling method used to gather and synthesize state-of-the-art approaches in automatic creativity evaluation. Our methodological framework follows the PRISMA technique (Dickson and Yeung, 2022), employing a scoping review to identify relevant and significant research papers based on four core concepts:

  1. Creativity: Articles must be directly related to creativity, particularly the creative process (Sawyer, 2011).

  2. Measurement/Evaluation/Assessment of Creativity Dimensions: The studies must focus on methods for measuring, evaluating, or assessing creativity dimensions.

  3. Technology: We selected studies that utilize technology for assistance or evaluation. This concept aims to review technological support in creativity evaluation and explore future research in the creative process involving technology.

  4. Domain: We focused on creativity processes applicable within the educational sector, aiming to enhance student creativity. Fields like medicine, finance, and business were excluded from the search.

Considering these core concepts, we included peer-reviewed journal articles and conference papers in this mapping study, searching the Scopus database for publications between 2005 and 2021. Interestingly, despite the time span, the earliest study meeting our inclusion criteria dates back to 2009, with the majority published in recent years. This indicates that automatic creativity evaluation is a relatively recent area of research that is rapidly gaining attention and remains an active and evolving field.

We excluded articles focusing solely on individual or organizational creativity evaluation, domains outside of education (e.g., medicine and finance), articles not in English, those published before 2005, and studies lacking a technological component in creativity evaluation.

For this scoping review, we extracted articles from Scopus using the search query: [(creativ* OR “Creative Process” OR “Novelty” OR “Flexibility” OR “Fluency” OR “Elaboration” OR “Originality”) AND (Measur* OR Evaluat* OR Asses* OR Calcul* OR Analys* OR Scor* OR Qunat*) AND (Automat* OR Comput* OR Machin* OR Natural* OR Artificial* OR Deep learning OR Mathemat* OR Mining) AND (E-learning OR educa* OR Learn* OR School OR students*)].

This search yielded 364 research articles. Applying inclusion and exclusion criteria through title, abstract, keyword, and conclusion review narrowed the selection to 65 articles. Subsequently, the authors thoroughly read, checked, and discussed these selected articles, conducting all screening stages to address the two research questions. Discrepancies were resolved through consensus among the authors, a member-checking process to ensure “trustworthiness” in qualitative research (Toma, 2011). After this rigorous process, 26 articles were ultimately included in this scoping review. The complete article selection procedure, following the PRISMA technique, is illustrated in Figure 1.

FIGURE 1

Figure 1. Screening procedure of the articles using the PRISMA technique.

4. Results

4.1. Approaches and Techniques Used in Automatic Creativity Evaluation (RQ1)

The compilation of computational approaches and techniques employed in automatic creativity evaluation research, aimed at answering the first research question, revealed three key findings:

Firstly, research in creativity evaluation is distributed across three primary NLP approaches: (1) Text Similarity, which measures the relatedness and closeness of words, sentences, or paragraphs within a numerical space; (2) Text Classification, a supervised learning approach (requiring data training) that utilizes ML algorithms (such as K-Nearest Neighbor (KNN) and Random Forest) to automatically analyze text and assign predefined tags or categories; and (3) Text Mining, which uses NLP to examine and transform large volumes of unstructured text data to uncover new information and patterns. These three NLP approaches and their associated computational techniques, as identified in the reviewed studies, are depicted in Figure 2.

FIGURE 2

Figure 2. Different NLP approaches in creativity evaluation.

Secondly, the scoping review indicated that text similarity is the most frequently used approach (in 69% of the reviewed studies), followed by text classification (27%), with text mining being the least common (only 4% of studies), as illustrated in Figure 2.

Thirdly, our review identified and categorized the specific computational techniques used within these three NLP approaches (text similarity, text classification, and text mining), and the creativity dimensions they were used to automatically evaluate. The following sections present the mapping constructed from a detailed analysis of all studies included in this scoping review, providing a comprehensive overview of techniques for measuring creativity with deep learning techniques and other NLP methods.

Within the text similarity approach, NLP transforms textual ideas into a numerical vector space. This conversion in the reviewed studies employed a wide range of techniques, categorized into three main types: string-based similarity, corpus-based similarity, and knowledge-based similarity. These categories and their computational techniques are shown in Figure 3, with Table 1 mapping the automatic creativity evaluation studies to these categories and specific techniques.

FIGURE 3

Figure 3. Text similarity approaches, categories, sub-categories, and their computational techniques.

TABLE 1

Table 1. Categorization of the reviewed studies by text similarity category, with the percentage of reviewed studies using each approach.

In the first category, string-based similarity (6% of text similarity approaches in reviewed studies) focuses on matching exact keywords or alphabet strings, using techniques like Longest Common Substring (LCS) or N-gram (a subsequence of n items from a given text sequence). Keyword matching is used to compute the string similarity of ideas with existing ideas in a database (Prasch et al., 2020).
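As a simple illustration of string-based matching (not drawn from any reviewed study), the following sketch computes two such measures, a longest-common-substring ratio and a character N-gram overlap, for a pair of hypothetical ideas; the function names are our own.

```python
# Minimal sketch of string-based similarity (hypothetical helper names).
from difflib import SequenceMatcher

def lcs_ratio(idea_a: str, idea_b: str) -> float:
    """Length of the longest common substring, normalized by the shorter idea."""
    matcher = SequenceMatcher(None, idea_a.lower(), idea_b.lower())
    match = matcher.find_longest_match(0, len(idea_a), 0, len(idea_b))
    return match.size / max(1, min(len(idea_a), len(idea_b)))

def ngram_overlap(idea_a: str, idea_b: str, n: int = 3) -> float:
    """Jaccard overlap of character n-grams (a simple N-gram similarity)."""
    def grams(s):
        return {s[i:i + n] for i in range(max(0, len(s) - n + 1))}
    a, b = grams(idea_a.lower()), grams(idea_b.lower())
    return len(a & b) / max(1, len(a | b))

print(lcs_ratio("solar panel roof", "solar panel car"))   # exact-substring overlap
print(ngram_overlap("solar panel roof", "solar panel car"))
```

As the output suggests, two ideas score as highly similar whenever they share surface strings, regardless of whether their meanings differ.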

The second category, corpus-based similarity, is the most prevalent (72% of textual similarity approaches), as detailed in Table 1. This category is further divided into two sub-categories: statistical-based models and deep learning-based models. Statistical-based models, such as LSA, represent a corpus in a word-document matrix with words as row vectors and documents as column vectors, applying weighting and dimension reduction schemes before calculating cosine similarity between word vectors (Martin and Berry, 2007; Wagire et al., 2020). Deep learning-based models (both word and sentence embeddings) use supervised, semi-supervised, or unsupervised (requiring no labeled data) methods and are trained on large corpora such as Wikipedia and the Common Crawl dataset. Deep learning models such as Word2Vec (Mikolov et al., 2013) or GloVe (Pennington et al., 2014) leverage knowledge from extensive datasets, encode data, and identify word or sentence similarities. The GloVe model has demonstrated reliable results, particularly for single-word creativity tasks, showing comparability with expert scores (Beaty and Johnson, 2021; Johnson and Hass, 2022). This highlights the increasing role of deep learning techniques in measuring creativity.
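To make the statistical, corpus-based workflow concrete, the following is a minimal LSA-style sketch, assuming Python with scikit-learn; the idea texts are illustrative placeholders rather than data from the reviewed studies.

```python
# Minimal sketch of statistical corpus-based similarity (LSA-style), assuming scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "use old tires as garden planters",
    "recycle tires into playground flooring",
    "a solar powered water heater for homes",
]

# 1) Weighted word-document matrix (TF-IDF weighting instead of raw counts).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(ideas)

# 2) Dimension reduction, the step that gives LSA its latent semantic space.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

# 3) Cosine similarity between idea vectors; low similarity suggests originality.
print(cosine_similarity(lsa))
```

In this toy example, the two tire-related ideas end up closer to each other in the reduced space than either is to the solar-heater idea.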

The third category is knowledge-based similarity (22% of text similarity approaches in reviewed studies, Table 1), which uses ontologies to represent textual data as semantic network graphs of nodes (semantic memory) and lines. Ontologies are extensive dictionaries of millions of lexically associated words, such as WordNet, Wikipedia, and DBpedia.

Text classification, the second NLP approach, was used in 27% of the reviewed studies for automatic creativity evaluation (Figure 2). Classification is an ML technique categorizing text into predefined categories, involving four main steps: (1) data collection, pre-processing (data acquisition, cleaning, and labeling), and data representation (feature selection, training/testing dataset division); (2) application of classifier models; (3) classifier evaluation; and (4) prediction (testing data output). These steps are crucial when using text classification for measuring creativity. Table 2 provides an overview of the classification approach, datasets, classifiers, evaluations, and creativity dimensions in creativity evaluation research.

TABLE 2

Table 2. Text classification-based creativity evaluation studies.
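The four classification steps outlined above can be sketched as follows, assuming scikit-learn and a small, hypothetical labeled set of ideas; the studies in Table 2 used far larger datasets and various classifiers.

```python
# Hedged sketch of the four text classification steps, with placeholder ideas and tags.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import classification_report

# (1) Data collection and labeling: hypothetical ideas with hypothetical tags.
texts = [
    "use a brick as a doorstop", "use a brick to build a wall",
    "grind a brick into pigment for painting", "use a brick as a paperweight",
    "carve a brick into a chess piece", "use a brick to hold a door open",
]
labels = ["common", "common", "novel", "common", "novel", "common"]

# Data representation: TF-IDF features and a train/test split (here 70/30).
X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=0, stratify=labels)

# (2) Apply a classifier model (SVM here; KNN or random forest are alternatives).
clf = LinearSVC().fit(X_train, y_train)

# (3) Evaluate the classifier and (4) predict categories for unseen ideas.
print(classification_report(y_test, clf.predict(X_test), zero_division=0))
```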

Text mining, the third approach in automatic creativity evaluation, involves analyzing large textual data collections to identify key concepts, trends, patterns, and hidden relationships. In this review, text mining was used in studies like Dumas et al. (2021), employing techniques such as counting all words, applying a stop list (removing non-meaningful terms), part-of-speech counting, and inverse document frequency weighting (giving greater weight to rare, informative words).

4.2. Creativity Dimensions Computed Automatically (RQ2)

Across the studies included in this review of automatic creativity evaluation, we identified 25 distinct creativity dimensions. These are listed in the second column (Manifestation) of Table 3. Analyzing the conceptual definitions and computational approaches used in studies assessing different creativity dimensions allowed us to categorize these 25 manifestations into seven core creativity dimensions: novelty, value, flexibility, elaboration, fluency, feasibility, and others related to playful creativity aspects like humor. These core dimensions are presented in the first column of Table 3 (Core Dimension).

TABLE 3

Table 3. Characterization of the 25 manifested creativity dimensions (second column) into seven core creativity dimensions (first column), based on similarities in their definitions (third column) and computation (fourth column).

Furthermore, the results answering research question two are visualized in Figure 4, showing the percentage distribution of the seven core creativity dimensions identified. Novelty emerges as the most frequently evaluated dimension in the reviewed studies.

FIGURE 4

Figure 4. Percentage distribution of each core creativity dimension in the reviewed studies.

5. Discussion

5.1. Approaches and Techniques Used in Automatic Creativity Evaluation

This scoping review identified three primary NLP approaches used in automatic creativity evaluation: (1) text similarity, (2) text classification, and (3) text mining. The following sections discuss each approach’s contribution, applications, limitations, research gaps, and recommendations for future automatic creativity evaluations, particularly focusing on the role of deep learning techniques.

The prevalence of the text similarity approach, used in 69% of studies, highlights its importance in understanding creative thinking (Li et al., 2023). Its widespread use in automatic creativity evaluation stems from the focus on evaluating originality, novelty, similarity, and diversity – dimensions that inherently involve assessing the similarity of ideas to existing ones. The text similarity approach offers a variety of computational techniques for this, as shown in Figure 3.

Comparing the three categories of text similarity – string-based similarity, corpus-based similarity, and knowledge-based similarity – as detailed in Table 1, reveals differences in their similarity computation processes and applicability. String-based and knowledge-based similarities have limited application in automatic creativity evaluation. String-based similarity is restricted by its focus on syntactic similarity rather than semantic meaning, while knowledge-based similarity often extracts specific entities rather than capturing the nuanced technical or scientific jargon used in complex ideas (Camburn et al., 2019). For example, in brainstorming renewable energy solutions, a knowledge-based approach might miss specific terms like “photovoltaics” or “wind turbines.” Corpus-based techniques are more widely used, and we will elaborate on these further, particularly focusing on deep learning.

Corpus-based similarity is commonly used in automatic evaluation due to its range of techniques, from statistical to deep learning models (Figure 3). Statistical models like LSA, applied to examine semantic similarity, memory, and creativity (Beaty and Johnson, 2021), have shown reliable originality scoring in divergent thinking tasks, sometimes outperforming human raters (Dunbar and Forster, 2009; Dumas and Dunbar, 2014; LaVoie et al., 2020; see Table 1). However, LSA and related statistical techniques, such as Probabilistic Latent Semantic Analysis (Hofmann, 1999), Latent Dirichlet Allocation (Blei et al., 2003), and Non-Negative Matrix Factorization (Lee and Seung, 1999), are limited because they primarily capture word statistics (e.g., word co-occurrence) rather than contextual and semantic meaning. Deep learning models address these limitations.

Recent advancements in NLP, particularly with deep learning models based on deep neural architectures, have revolutionized text modeling, allowing for greater nuance and complexity. This began with word embedding models like GloVe and Word2Vec, pre-trained on extensive datasets like Wikipedia and news articles. These predictive models utilize neural networks with hidden layers to learn word vector representations. GloVe has shown results comparable to human expert scores in single-word creativity tasks (Beaty and Johnson, 2021; Olson et al., 2021), demonstrating the potential of deep learning for measuring creativity. However, word embedding models don’t differentiate between keyword lists and meaningful sentences, limiting their ability to capture semantic and contextual sentence meaning.
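As an illustration of how such word embeddings can score a single-word response, the following sketch assumes the gensim package and its downloadable “glove-wiki-gigaword-50” vectors; the prompt-response pair is hypothetical and both words must be in the model’s vocabulary.

```python
# Sketch of scoring a single-word response with pre-trained GloVe word vectors,
# assuming gensim's downloader; the model name and word pair are illustrative.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # any pre-trained word embedding model

prompt, response = "brick", "painting"

# Semantic distance (1 - cosine similarity) between prompt and response words;
# larger distances are commonly interpreted as more original responses.
print(1 - glove.similarity(prompt, response))
```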

Vectorizing entire sentences marks a significant innovation in text modeling. Transformer architectures, leveraging the concept of attention (Vaswani et al., 2017), generally outperform word embedding models by significant margins in standard tasks (Wang et al., 2018, 2019). Attention makes it computationally feasible for transformer models to process long text sequences by focusing on the most important parts. This has led to two main categories: pre-trained sentence embedding models and text generation models.

Sentence embedding models vectorize entire sentences, preserving semantic and contextual meaning. Unsupervised techniques like Unsupervised Smooth Inverse Frequency (uSIF) (Ethayarajh, 2018) and Geometric Sentence Embedding (GEM) (Yang et al., 2018) require no task-specific training data. Pre-trained models such as BERT (Devlin et al., 2018), Sentence Transformer (Reimers and Gurevych, 2019), MPNet (Song et al., 2020), Skip-Thought (ST) (Kiros et al., 2015), InferSent (Conneau et al., 2017), and the Universal Sentence Encoder (USE) (Cer et al., 2018) can be fine-tuned or trained on specific datasets for improved performance. The USE model has been used in creativity research to evaluate idea novelty (Kenworthy et al., 2023), suggesting that further exploration of various sentence embedding models, or combinations of them, for evaluating creative ideas in open-ended co-creation is warranted.
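A minimal sketch of sentence-level similarity with a pre-trained sentence embedding model is shown below, assuming the sentence-transformers package; the model choice and example ideas are illustrative, not the specific setups used in the reviewed studies.

```python
# Sketch of sentence-level similarity with a pre-trained sentence embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # any pre-trained sentence embedding model

ideas = [
    "cover bus stops with moss walls that filter the air",
    "plant moss panels at bus shelters to clean the air",
    "give every student a reusable water bottle",
]

# Whole sentences are encoded, so word order and context shape the vectors.
embeddings = model.encode(ideas, convert_to_tensor=True)

# Pairwise cosine similarities; the first two ideas should score much closer
# to each other than either does to the third.
print(util.cos_sim(embeddings, embeddings))
```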

Text generation models, such as the Generative Pre-trained Transformer (GPT-3) (Brown et al., 2020), the Text-to-Text Transfer Transformer (T5) (Raffel et al., 2020), and Long Short-Term Memory (LSTM) networks (Huang et al., 2022), generate new text similar to a given prompt. In creativity research, Generative Adversarial Networks (GANs) (Aggarwal et al., 2021), a generative modeling approach, were used by Franceschelli and Musolesi (2022) to evaluate novelty, surprise, and relevance, demonstrating the application of deep learning techniques in measuring creativity. However, there are criticisms regarding text generation models for evaluating open-ended ideas. Firstly, text generation is optimized for producing text from given prompts, which is useful for dialog generation, machine translation, chatbots, and prompt-based learning (Liu et al., 2023). Secondly, as models improve at text generation, they may become more likely to replicate input data rather than produce novel or creative outputs. Despite these concerns, text generation models have not been extensively tested in creativity research, suggesting that future investigations are needed to understand their limitations and potential in measuring creativity.

In conclusion, for single-word tasks in creativity research, word embedding models, particularly GloVe, are effective. For open-ended co-creation with sentence-structured ideas, sentence embedding models are more suitable because they (a) represent entire sentences in vector space, capturing semantic and contextual meaning; (b) outperform word embedding models in textual similarity tasks; and (c) can be applied to small datasets and open-ended problems due to pre-training on large corpora. We recommend validating sentence embedding models and further exploring text generation models within broader co-creation contexts to fully understand the potential of deep learning techniques for measuring creativity.

Sentence embedding models offer a powerful tool that can be used alongside statistical (Acar et al., 2021), word embedding models (Organisciak et al., 2023), and standard subjective scoring methods for evaluating the creative process and its outputs (Kenett, 2019).

The text classification approach automates the categorization of textual data into predefined classes using machine learning classifiers. This approach relies heavily on a large dataset, typically split into training (70%) and testing (30%) sets. The ML classifier learns from the training set and then categorizes the testing set. Integrating text classification into automatic creativity evaluation is contingent on four key factors: dataset quality and size, ML classifier selection, classifier accuracy, and the specific creativity dimensions being evaluated. These factors are highlighted in Table 2.

Dataset considerations are critical for text classification. Firstly, datasets require pre-processing and labeling, including noise removal and assigning class labels to each idea. Secondly, large datasets are essential for training ML classifiers effectively; classifier prediction capability improves with larger training datasets. Most studies reviewed in Table 2, except Stella and Kenett (2019), used over a thousand ideas for classification. Smaller datasets may require different or more balanced approaches. Thirdly, classifiers trained on one data type are not transferable to other data types. For example, classifiers trained on linguistic data cannot be directly applied to scientific data.

Classifier selection and accuracy are also crucial. Different ML classifiers operate differently and are suited to different dataset characteristics. For example, SVM performs well in multiclass classification, random forests excel with numerical and categorical features, logistic regression suits linearly separable problems, and K-Nearest Neighbor works well for text; Bayesian approaches are simple and fast algorithms. The reviewed studies often lacked justification for specific classifier choices. Classifier accuracy is also a concern, with potential for low accuracy. Model accuracy is evaluated using metrics like confusion matrices, entropy, and sensitivity (Table 2). It is advisable to test multiple classifiers and select the one with the highest accuracy for prediction within a similar domain, optimizing the measurement of creativity.
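The recommendation to compare several classifiers can be sketched as follows, assuming scikit-learn; the synthetic dataset stands in for any vectorized, labeled set of ideas.

```python
# Sketch of comparing several classifiers and keeping the most accurate one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC

# Placeholder features and labels; in practice these would be vectorized ideas and tags.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = {
    "SVM": LinearSVC(),
    "Random forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": GaussianNB(),
}

# Cross-validated accuracy for each candidate; keep the best performer for prediction.
scores = {name: cross_val_score(clf, X, y, cv=5).mean() for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```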

While text classification can evaluate various creativity dimensions, its reliance on large, labeled datasets limits its application in creativity research. Dataset preparation and labeling can be expensive, potentially negating the advantages of automatic evaluation over manual methods in terms of accuracy, cost, and time. Furthermore, text classification problems are domain-dependent. While public datasets exist for tasks like object use and alternate use tasks, these may not be suitable for small, open-ended creative tasks that are domain-independent and lack sufficient data for classifier training. In summary, the extensive data preparation, labeling, and domain dependence can make text classification less reliable and more expensive than manual creativity evaluation for certain types of creativity assessment.

Text mining uses NLP statistical computations to discover new information and patterns, employing statistical indicators like word frequency, patterns, and correlations. Dumas et al. (2021) used text mining techniques to measure elaboration scores in Alternate Use Tasks (AUT), using methods like unweighted word count, stop list inclusion, part-of-speech counting, and inverse document frequency.

These text mining techniques represent basic statistical NLP operations. Text mining has the potential to process massive datasets to uncover new information, patterns, trends, and relationships relevant to creativity research. Applications include search engines, product suggestion analysis, social media analytics, and trend analysis, suggesting broader applications in measuring creativity.

5.2. Automatically Computed Creativity Dimensions

This scoping review identified 25 automatically computed creativity dimensions. However, our analysis indicates that these dimensions are not always well-grounded in prior creativity research and theory. This leads to theoretical and methodological inconsistencies that future research needs to address. Firstly, some dimensions are defined and computed based on specific challenges or creativity tasks designed for experiments, rather than on a robust theoretical framework. For example, “category switch” is defined as the similarity difference between successive responses in object use tasks (Dunbar and Forster, 2009). Similarly, “quality” (reusability) and “usefulness” (degree of completion) are defined within the context of programming problems (Manske and Hoppe, 2014). Secondly, inconsistency arises from variations in manifestations across studies. Dimensions like novelty (Prasch et al., 2020), similarity (LaVoie et al., 2020), and originality (Beaty and Johnson, 2021) are often similarly defined, focusing on idea or solution similarity, and measured using semantic textual similarity, albeit with different computational techniques.

To address these shortcomings and improve the measurement of creativity, this review analyzed the conceptual and computational frameworks of each study, contributing to the identification of seven core creativity dimensions that can be automatically evaluated more consistently: novelty, elaboration, flexibility, value, feasibility, fluency, and playful aspects (humor, recreational effort). We discuss each core dimension, highlighting conceptual definitions and computational approaches.

Novelty, the most evaluated core dimension (59% of reviewed studies), shows significant diversity in definitions and measures. Studies use different terms for novelty, including: (1) uniqueness (concept distinctiveness (Camburn et al., 2019)); (2) originality (difference from standard solutions, semantic distance between ideas (Georgiev and Casakin, 2019; Beaty and Johnson, 2021)); (3) similarity (meaning similarity between texts, distance between texts (LaVoie et al., 2020; Olson et al., 2021)); (4) diversity (user query diversity); (5) rarity (rare combinations, unique solutions (Karampiperis et al., 2014; Doboli et al., 2020)); (6) common use (difference between common and uncommon solutions); (7) surprise (artifact deviation from existing attributes (Shrivastava et al., 2017)); and (8) influence (artifact comparison with others (Shrivastava et al., 2017)).

Despite labeling diversity, six characteristics emerge as defining novelty and aiding automatic evaluation: (1) deviation from standard problem-solving (Manske and Hoppe, 2014); (2) semantic distance between ideas (Beaty and Johnson, 2021); (3) meaning similarity between texts (LaVoie et al., 2020); (4) semantic similarity of user queries to challenge concepts; (5) property combinations (Karampiperis et al., 2014); and (6) surprise and unexpected ideas (Shrivastava et al., 2017). These characteristics reflect the complexity of defining novelty and the challenges in developing automatic measures of creativity, especially novelty.

Despite these challenges, common computational approaches for measuring novelty as a core dimension include: (1) distance of new solution to existing solutions (Manske and Hoppe, 2014); (2) semantic distance between ideas (Beaty and Johnson, 2021; Olson et al., 2021); (3) semantic similarity of user queries to relevant Wikipedia concepts; (4) semantic distance between story clusters; and (5) semantic distance between consecutive story fragments (Karampiperis et al., 2014). Therefore, when automatically evaluating novelty, semantic distance from existing solutions is a key consideration for measuring creativity.
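A minimal, model-agnostic sketch of this shared computation is given below; the placeholder vectors stand in for embeddings produced by any of the techniques discussed above.

```python
# Sketch of novelty as semantic distance from existing solutions (placeholder vectors).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

existing = np.array([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]])  # vectors of already-known solutions
candidate = np.array([0.1, 0.2, 0.9])                     # vector of the new idea

# Novelty as 1 - similarity to the nearest existing solution (larger = more distant).
novelty = 1 - max(cosine(candidate, e) for e in existing)
print(round(novelty, 3))
```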

Value, the second core dimension, is related to concepts like overall value (societal perception (Georgiev and Casakin, 2019)); quality (reliability, maintainability, extensibility, adaptability of programming solutions (Manske and Hoppe, 2014)); usefulness (correctness); and adaptiveness (problem-solving effectiveness (Jimenez-Mavillard and Suarez, 2022)). These concepts share a common meaning of usefulness and quality, constituting the value dimension of creativity. In computer science, value, quality, usefulness, adaptiveness, and style are non-functional quality attributes. Value, quality, and usefulness computation varies by task, e.g., programming solution quality is reusability and scalability (Manske and Hoppe, 2014), and usefulness is task completion degree (Prasch et al., 2020). The value dimension requires clearer definitions and computational metrics for accurate measurement of creativity.

Flexibility, the third core dimension, is a key executive function in creative thinking (Boot et al., 2017), driving individuals to explore diverse directions and pathways, increasing the likelihood of highly creative ideas (Zhang et al., 2020; Acar et al., 2021). Flexibility is defined in two ways: category switching (transitioning between semantic concepts (Dunbar and Forster, 2009; Acar et al., 2019; Mastria et al., 2021)); and the number of semantic categories, varieties, or topics generated (Dunbar and Forster, 2009). Different computational approaches are used due to these varying definitions. Flexibility as category switching is measured using semantic similarity approaches like LSA (Dunbar and Forster, 2009), network graphs (Cosgrove et al., 2021), and sentence embedding models. Flexibility as semantic categories is evaluated using text clustering (Sung et al., 2022) or topic modeling techniques (e.g., LDA (Chauhan and Shah, 2021)) to categorize or extract topics from textual ideas, providing different methods for measuring creativity. Category switch flexibility is computationally simpler, requiring text similarity measures rather than complex category identification.
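Both flexibility measures can be sketched as follows, assuming scikit-learn and placeholder embeddings; the switch threshold and the number of clusters are arbitrary, illustrative choices rather than values taken from the reviewed studies.

```python
# Sketch of two flexibility measures: category switching and number of categories.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
response_vectors = rng.normal(size=(6, 50))  # placeholder embeddings, in response order

# (a) Category switching: low similarity between consecutive responses suggests a switch.
consecutive_sim = [
    cosine_similarity(response_vectors[i:i + 1], response_vectors[i + 1:i + 2])[0, 0]
    for i in range(len(response_vectors) - 1)
]
switches = sum(sim < 0.2 for sim in consecutive_sim)  # 0.2 is an arbitrary threshold

# (b) Number of semantic categories: cluster the responses and count the clusters used.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(response_vectors)
n_categories = len(set(labels))

print(switches, n_categories)
```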

Elaboration, another core dimension, is defined as the degree to which participants detail their responses (Camburn et al., 2019; Dumas et al., 2021), adding reasoning or cause to their ideas. Automatic evaluation measures elaboration by counting the words in an idea (Camburn et al., 2019). Four methods for evaluating elaboration include: (1) counting all words (unweighted); (2) counting words excluding stop words; (3) counting nouns, verbs, and adverbs; and (4) counting adjectives and high-weight uncommon words (inverse frequency weighting). More words indicate higher elaboration. However, this computation may miss conjunctions (Tuzcu, 2021) or reasoning words (Sedova et al., 2019; Hennessy et al., 2020) that add explanation. Semantic search for reasoning-related words (e.g., because, therefore, since) could improve the measurement of elaboration.
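Three of these counting methods can be sketched in a few lines of plain Python (part-of-speech counting would additionally require a tagger such as NLTK’s or spaCy’s); the stop list and ideas below are illustrative.

```python
# Sketch of elaboration counts: (1) all words, (2) stop list applied, (4) inverse-frequency weighted.
import math
from collections import Counter

STOP_WORDS = {"a", "an", "the", "to", "of", "and", "it", "as", "into", "for", "is"}

ideas = [
    "use a brick to build a small garden wall",
    "use it as a paperweight",
    "grind the brick into red pigment for painting because it is soft",
]

def tokenize(text):
    return text.lower().split()

# Document frequency of each word across all ideas, for inverse-frequency weights.
doc_freq = Counter(word for idea in ideas for word in set(tokenize(idea)))
n_docs = len(ideas)

for idea in ideas:
    tokens = tokenize(idea)
    unweighted = len(tokens)                                        # (1) all words
    content = [t for t in tokens if t not in STOP_WORDS]            # (2) stop list applied
    weighted = sum(math.log(n_docs / doc_freq[t]) for t in tokens)  # (4) rarer words weigh more
    print(unweighted, len(content), round(weighted, 2))
```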

Fluency is defined as the number of ideas generated. This dimension has a consensus on definition (number of ideas) and computation (counting ideas) (Dumas and Dunbar, 2014; Stella and Kenett, 2019). More ideas increase the chance of original outputs (Dumas and Dunbar, 2014). Fluency is easy to measure and independent of other dimensions, making it a straightforward measure of creativity.

Feasibility is defined as the extent to which a solution is achievable in practice (Georgiev and Casakin, 2019). Transcendence and realization are manifestations of feasibility (Jimenez-Mavillard and Suarez, 2022), referring to practical achievement. While important in creativity research, the automatic computation of feasibility, transcendence, and realization lacks a clear rationale grounded in creativity research. Feasibility is product-oriented and used in ideation, but transforming ideas into practice remains a challenge for automatic measurement of creativity. Further research is needed to automatically measure feasible, transcendent, and realistic ideas.

Other dimensions related to playful creativity aspects include humor (Simpson et al., 2019) and recreational effort (Karampiperis et al., 2014). Humor, the funniness of ideas, is measured by pairwise text comparison. Recreational effort, solution difficulty, is measured using clustering. These dimensions contribute to playful creativity and require clear definitions and computational approaches from both psychology and computer science perspectives for effective measurement of creativity.

6. Conclusion

This scoping review aimed to analyze automatic creativity evaluation from computer science and education perspectives. We addressed two research questions: identifying NLP approaches and techniques used, and analyzing which creativity dimensions are computed and how.

The first research question’s contributions include: (1) identifying ML approaches and techniques in automatic creativity evaluation; (2) categorizing approaches (text similarity, classification, mining), highlighting text similarity as most common; (3) classifying studies by techniques within these approaches (string, corpus, knowledge-based similarity), showing corpus-based methods as widely used; (4) identifying limitations and alternative techniques (e.g., statistical and word embedding limitations, sentence embedding potential); and (5) providing a broad overview of automatic creativity evaluation. We concluded that word embedding models (GloVe) are effective for single-word tasks, while sentence embedding models are promising for open-ended, sentence-structured ideas, especially when measuring creativity with deep learning techniques.

The second research question’s contributions include: examining automatically evaluated creativity dimensions; noting 25 dimensions in automatic creativity evaluation, compared to standardized tests’ four dimensions; analyzing dimension definitions and measures; identifying similarities in definitions and computations; and categorizing 25 dimensions into seven core dimensions (novelty, elaboration, flexibility, value, feasibility, fluency, playful aspects). This analysis provides a coherent framework for core creativity dimensions and their computation, improving the measurement of creativity.

This review bridges computer science and education. For computer scientists, it offers insights to refine NLP approaches and develop novel methods for evaluating and promoting creativity, particularly using deep learning techniques. Educators can use automatic evaluations as pedagogical tools in classrooms. Automatic creativity evaluation can assess and nurture creativity, aligning with educational policy initiatives. Ultimately, AI serves as a valuable tool for evaluating and enhancing creativity, equipping future citizens to innovate solutions for global challenges through improved measures of creativity.

6.1. Limitations and Future Work

This scoping review has two limitations: the search keyword strategy may have missed key articles, and the inclusion/exclusion criteria may have omitted relevant studies. We mitigated these risks by using an inclusive search string and explicit criteria agreed through co-author consensus.

Future work will experimentally evaluate the reliability of deep learning models like sentence embedding models for measuring novelty in open-ended co-creative processes. We also suggest using text generation models to recommend hints for divergent thinking. Addressing the research gap in fully automating core creativity dimensions, we plan to simultaneously measure different core dimensions using ML techniques. Developing reliable automatic evaluation of creativity dimensions can enable real-time recommendations during the creative process, fostering student creativity and improving methods for measuring creativity with deep learning techniques.

Data availability statement

The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.

Author contributions

IU contributed to the conceptualization of the paper, the methodology, and the investigation, and participated in writing, revising, and editing the original manuscript. MP is the principal investigator of the research project and designed the project; she also contributed to the conceptualization of the paper, the methodology, and the investigation, and participated in writing, revising, and editing the manuscript. Both authors contributed to the article and approved the submitted version.

Funding

This research has been funded by the Ministry of Science and Innovation of the Government of Spain under Grants EDU2019-107399RB-I00 and PID2022-139060OB-I00.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

References

Acar, S., Berthiaume, K., Grajzel, K., Dumas, D., Flemister, C., and Organisciak, P. (2021). Applying automated originality scoring to the verbal form of torrance tests of creative thinking. Gifted Child Quart. 67, 3–17. doi: 10.1177/00169862211061874

Acar, S., Runco, M. A., and Ogurlu, U. (2019). The moderating influence of idea sequence: A re-analysis of the relationship between category switch and latency. Person. Indiv. Differ. 142, 214–217. doi: 10.1016/j.paid.2018.06.013

Aggarwal, A., Mittal, M., and Battineni, G. (2021). Generative adversarial network: An overview of theory and applications. Int. J. Inform. Manage. Data Insights 1, 100004. doi: 10.1016/j.jjimei.2020.100004

Bae, S. S., Kwon, O.-H., Chandrasegaran, S., and Ma, K.-L. (2020). “Spinneret: aiding creative ideation through non-obvious concept associations,” in Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems 1–13. doi: 10.1145/3313831.3376746

Beaty, R. E., and Johnson, D. R. (2021). Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behav. Res. Methods 53, 757–780. doi: 10.3758/s13428-020-01453-w

Birkey, R., and Hausserman, C. (2019). “Inducing creativity in accountants’ task performance: The effects of background, environment, and feedback,” in Advances in Accounting Education: Teaching and Curriculum Innovations (Emerald Publishing Limited) 109–133. doi: 10.1108/S1085-462220190000022006

Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022. doi: 10.5555/944919.944937

Boot, N., Baas, M., Mühlfeld, E., de Dreu, C. K., and van Gaal, S. (2017). Widespread neural oscillations in the delta band dissociate rule convergence from rule divergence during creative idea generation. Neuropsychologia 104, 8–17. doi: 10.1016/j.neuropsychologia.2017.07.033

Bozkurt Altan, E., and Tan, S. (2021). Concepts of creativity in design-based learning in STEM education. Int. J. Technol. Design Educ. 31, 503–529. doi: 10.1007/s10798-020-09569-y

Braun, D., Hernandez Mendez, A., Matthes, F., and Langen, M. (2017). “Evaluating natural language understanding services for conversational question answering systems,” in Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue (Saarbrucken, Germany: Association for Computational Linguistics) 174–185. doi: 10.18653/v1/W17-5522

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., et al. (2020). Language models are few-shot learners. Adv. Neural Inf. Proc. Syst. 33, 1877–1901. doi: 10.48550/arXiv.2005.14165

Camburn, B., He, Y., Raviselvam, S., Luo, J., and Wood, K. (2019). “Evaluating crowdsourced design concepts with machine learning,” in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference 7. doi: 10.1115/DETC2019-97285

Cer, D., Yang, Y., Kong, S.-y., Hua, N., Limtiaco, N., John, R. S., et al. (2018). Universal sentence encoder. arXiv preprint arXiv:1803.11175.

Chauhan, U., and Shah, A. (2021). Topic modeling using latent dirichlet allocation: A survey. ACM Comput. Surv. 54, 1–35. doi: 10.1145/3462478

Conneau, A., Kiela, D., Schwenk, H., Barrault, L., and Bordes, A. (2017). Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364.

Cosgrove, A. L., Kenett, Y. N., Beaty, R. E., and Diaz, M. T. (2021). Quantifying flexibility in thought: The resiliency of semantic networks differs across the lifespan. Cognition 211, 104631. doi: 10.1016/j.cognition.2021.104631

De Stobbeleir, K. E., Ashford, S. J., and Buyens, D. (2011). Self-regulation of creativity at work: The role of feedback-seeking behavior in creative performance. Acad. Manage. J. 54, 811–831. doi: 10.5465/amj.2011.64870144

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Dickson, K., and Yeung, C. A. (2022). PRISMA 2020 updated guideline. Br. Dental J. 232, 760–761. doi: 10.1038/s41415-022-4359-7

Doboli, S., Kenworthy, J., Paulus, P., Minai, A., and Doboli, A. (2020). “A cognitive inspired method for assessing novelty of short-text ideas,” in 2020 International Joint Conference on Neural Networks (IJCNN) (IEEE), 1–8. doi: 10.1109/IJCNN48605.2020.9206788

Dumas, D., and Dunbar, K. N. (2014). Understanding fluency and originality: A latent variable perspective. Think. Skills Creat. 14, 56–67. doi: 10.1016/j.tsc.2014.09.003

Dumas, D., Organisciak, P., Maio, S., and Doherty, M. (2021). Four text-mining methods for measuring elaboration. J. Creat. Behav. 55, 517–531. doi: 10.1002/jocb.471

Dunbar, K., and Forster, E. (2009). “Creativity evaluation through latent semantic analysis,” in Proceedings of the Annual Meeting of the Cognitive Science Society, 31.

Ethayarajh, K. (2018). “Unsupervised random walk sentence embeddings: A strong but simple baseline,” in Proceedings of The Third Workshop on Representation Learning for NLP 91–100. doi: 10.18653/v1/W18-3012

Franceschelli, G., and Musolesi, M. (2022). Deepcreativity: Measuring Creativity With Deep Learning Techniques. Intell. Artif. 16, 151–163. doi: 10.3233/IA-220136

George, T., and Wiley, J. (2020). Need something different? Here’s what’s been done: Effects of examples and task instructions on creative idea generation. Memory Cogn. 48, 226–243. doi: 10.3758/s13421-019-01005-4

Georgiev, G. V., and Casakin, H. (2019). “Semantic measures for enhancing creativity in design education,” in Proceedings of the Design Society: International Conference on Engineering Design (Cambridge: Cambridge University Press), 369–378. doi: 10.1017/dsi.2019.40

Gong, Z., Shan, C., and Yu, H. (2019). The relationship between the feedback environment and creativity: a self-motives perspective. Psychol. Res Behav. Manag. 12, 825–837. doi: 10.2147/PRBM.S221670

Gong, Z., and Zhang, N. (2017). Using a feedback environment to improve creative performance: a dynamic affect perspective. Front. Psychol. 8, 1398. doi: 10.3389/fpsyg.2017.01398

Guilford, J. P. (1967). Creativity: Yesterday, today and tomorrow. J. Creat. Behav. 1, 3–14. doi: 10.1002/j.2162-6057.1967.tb00002.x

Guo, Y., Lin, S., Williams, Z. J., Zeng, Y., and Clark, L. Q. C. (2023). Evaluative skill in the creative process: A cross-cultural study. Think. Skills Creativ. 47, 101240. doi: 10.1016/j.tsc.2023.101240

Hass, R. W. (2017). Tracking the dynamics of divergent thinking via semantic distance: Analytic methods and theoretical implications. Memory Cogn. 45, 233–244. doi: 10.3758/s13421-016-0659-y

Hennessy, S., Howe, C., Mercer, N., and Vrikki, M. (2020). Coding classroom dialogue: Methodological considerations for researchers. Learning, Cult. Soc. Interact. 25, 100404. doi: 10.1016/j.lcsi.2020.100404

Hofmann, T. (1999). “Probabilistic latent semantic indexing,” in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval SIGIR’99 (New York, NY, USA: Association for Computing Machinery), 50–57. doi: 10.1145/312624.312649

Huang, R., Wei, C., Wang, B., Yang, J., Xu, X., Wu, S., et al. (2022). Well performance prediction based on long short-term memory (lstm) neural network. J. Petroleum Sci. Eng. 208, 109686. doi: 10.1016/j.petrol.2021.109686

Jimenez-Mavillard, A., and Suarez, J. L. (2022). A computational approach for creativity assessment of culinary products: the case of elbulli. AI Soc. 37, 331–353. doi: 10.1007/s00146-021-01183-3

Johnson, D. R., and Hass, R. W. (2022). Semantic context search in creative idea generation. J. Creat. Behav. 56, 362–381. doi: 10.1002/jocb.534

Kang, Y., Sun, Z., Wang, S., Huang, Z., Wu, Z., and Ma, X. (2021). “Metamap: Supporting visual metaphor ideation through multi-dimensional example-based exploration,” in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems 1–15. doi: 10.1145/3411764.3445325

Karampiperis, P., Koukourikos, A., and Koliopoulou, E. (2014). “Towards machines for measuring creativity: The use of computational tools in storytelling activities,” in 2014 IEEE 14th International Conference on Advanced Learning Technologies 508–512. doi: 10.1109/ICALT.2014.150

Kenett, Y. N. (2019). What can quantitative measures of semantic distance tell us about creativity? Curr. Opin. Behav. Sci. 27, 11–16. doi: 10.1016/j.cobeha.2018.08.010

Kenworthy, J. B., Doboli, S., Alsayed, O., Choudhary, R., Jaed, A., Minai, A. A., et al. (2023). Toward the development of a computer-assisted, real-time assessment of ideational dynamics in collaborative creative groups. Creativ. Res. J. 35, 396–411. doi: 10.1080/10400419.2022.2157589

Kim, S., Choe, I., and Kaufman, J. C. (2019). The development and evaluation of the effect of creative problem-solving program on young children’s creativity and character. Think. Skills Creativ. 33, 100590. doi: 10.1016/j.tsc.2019.100590

Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., et al. (2015). “Skip-thought vectors,” in Advances in Neural Information Processing Systems 28.

Kovalkov, A., Paaßen, B., Segal, A., Pinkwart, N., and Gal, K. (2021). Automatic creativity measurement in scratch programs across modalities. IEEE Trans. Learn. Technol. 14, 740–753. doi: 10.1109/TLT.2022.3144442

LaVoie, N., Parker, J., Legree, P. J., Ardison, S., and Kilcullen, R. N. (2020). Using latent semantic analysis to score short answer constructed responses: Automated scoring of the consequences test. Educ. Psychol. Measur. 80, 399–414. doi: 10.1177/0013164419860575

Lee, D. D., and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791. doi: 10.1038/44565

Li, Y., Du Ying, X. I. E., Liu, C., Yang, Y., Li, Y., and Qiu, J. (2023). A meta-analysis of the relationship between semantic distance and creative thinking. Adv. Psychol. Sci. 31, 519. doi: 10.3724/SP.J.1042.2023.00519

Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. (2023). Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv. 55, 1–35. doi: 10.1145/3560815

Manske, S., and Hoppe, H. U. (2014). “Automated indicators to assess the creativity of solutions to programming exercises,” in 2014 IEEE 14th International Conference on Advanced Learning Technologies 497–501. doi: 10.1109/ICALT.2014.147

Marrone, R., Cropley, D. H., and Wang, Z. (2022). Automatic assessment of mathematical creativity using natural language processing. Creat. Res. J. 2022, 1–16. doi: 10.1080/10400419.2022.2131209

Martin, D. I., and Berry, M. W. (2007). “Mathematical foundations behind latent semantic analysis,” in Handbook of Latent Semantic Analysis 35–56.

Mastria, S., Agnoli, S., Zanon, M., Acar, S., Runco, M. A., and Corazza, G. E. (2021). Clustering and switching in divergent thinking: Neurophysiological correlates underlying flexibility during idea generation. Neuropsychologia 158, 107890. doi: 10.1016/j.neuropsychologia.2021.107890

Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Munn, Z., Peters, M. D., Stern, C., Tufanaru, C., McArthur, A., and Aromataris, E. (2018). Systematic review or scoping review? Guidance for authors when choosing between a systematic or scoping review approach. BMC Med. Res. Methodol. 18, 1–7. doi: 10.1186/s12874-018-0611-x

Olivares-Rodríguez, C., Guenaga, M., and Garaizar, P. (2017). Automatic assessment of creativity in heuristic problem-solving based on query diversity. DYNA 92, 449–455. doi: 10.6036/8243

Olson, J. A., Nahas, J., Chmoulevitch, D., Cropper, S. J., and Webb, M. E. (2021). Naming unrelated words predicts creativity. Proc. Nat. Acad. Sci. 118, e2022340118. doi: 10.1073/pnas.2022340118

Organisciak, P., Newman, M., Eby, D., Acar, S., and Dumas, D. (2023). How do the kids speak? Improving educational use of text mining with child-directed language models. Inf. Learn. Sci. 124, 25–47. doi: 10.1108/ILS-06-2022-0082

Pennington, J., Socher, R., and Manning, C. D. (2014). “Glove: global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–1543. doi: 10.3115/v1/D14-1162

Plucker, J. A., Meyer, M. S., Karami, S., and Ghahremani, M. (2023). “Room to run: Using technology to move creativity into the classroom,” in Creative Provocations: Speculations on the Future of Creativity, Technology and Learning (Springer) 65–80. doi: 10.1007/978-3-031-14549-0_5

Prasch, L., Maruhn, P., Brünn, M., and Bengler, K. (2020). “Creativity assessment via novelty and usefulness (CANU) – approach to an easy to use objective test tool,” in Proceedings of the Sixth International Conference on Design Creativity (ICDC) 019–026. doi: 10.35199/ICDC.2020.03

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., et al. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res. 21, 5485–5551. doi: 10.48550/arXiv.1910.10683

Rafner, J., Biskjær, M. M., Zana, B., Langsford, S., Bergenholtz, C., Rahimi, S., et al. (2022). Digital games for creativity assessment: strengths, weaknesses and opportunities. Creat. Res. J. 34, 28–54. doi: 10.1080/10400419.2021.1971447

Reimers, N., and Gurevych, I. (2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Rominger, C., Benedek, M., Lebuda, I., Perchtold-Stefan, C. M., Schwerdtfeger, A. R., Papousek, I., et al. (2022). Functional brain activation patterns of creative metacognitive monitoring. Neuropsychologia 177, 108416. doi: 10.1016/j.neuropsychologia.2022.108416

Said-Metwaly, S., Van den Noortgate, W., and Kyndt, E. (2017). Approaches to measuring creativity: A systematic literature review. Creativity: Theories-Research-Applications 4, 238–275. doi: 10.1515/ctra-2017-0013

Sawyer, R. K. (2011). Explaining Creativity: The Science of Human Innovation. Oxford University Press.

Sawyer, R. K. (2021). The iterative and improvisational nature of the creative process. J. Creat. 31, 100002. doi: 10.1016/j.yjoc.2021.100002

Sawyer, R. K. (2022). The dialogue of creativity: Teaching the creative process by animating student work as a collaborating creative agent. Cogn. Instruct. 40, 459–487. doi: 10.1080/07370008.2021.1958219

Sedova, K., Sedlacek, M., Svaricek, R., Majcik, M., Navratilova, J., Drexlerova, A., et al. (2019). Do those who talk more learn more? The relationship between student classroom talk and student achievement. Learn. Instruct. 63, 101217. doi: 10.1016/j.learninstruc.2019.101217

Shrivastava, D., Ahmed, C. G. S., Laha, A., and Sankaranarayanan, K. (2017). A machine learning approach for evaluating creative artifacts. arXiv preprint arXiv:1707.05499.

Simpson, E., Do Dinh, E.-L., Miller, T., and Gurevych, I. (2019). “Predicting humorousness and metaphor novelty with gaussian process preference learning,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 5716–5728. doi: 10.18653/v1/P19-1572

Song, K., Tan, X., Qin, T., Lu, J., and Liu, T.-Y. (2020). MPNet: Masked and permuted pre-training for language understanding. Adv. Neural Inf. Proc. Syst. 33, 16857–16867. doi: 10.48550/arXiv.2004.09297

Stella, M., and Kenett, Y. N. (2019). Viability in multiplex lexical networks and machine learning characterizes human creativity. Big Data Cogn. Comput. 3, 45. doi: 10.3390/bdcc3030045

Sung, Y.-T., Cheng, H.-H., Tseng, H.-C., Chang, K.-E., and Lin, S.-Y. (2022). Construction and validation of a computerized creativity assessment tool with automated scoring based on deep-learning techniques. Psychol. Aesthet. Creat. Arts. doi: 10.1037/aca0000450

Toma, J. D. (2011). “Approaching rigor in applied qualitative research,” in The SAGE Handbook for Research in Education: Pursuing Ideas as the Keystone of Exemplary Inquiry 263–281. doi: 10.4135/9781483351377.n17

Torrance, E. P. (2008). The Torrance Tests of Creative Thinking: Norms-Technical Manual, Figural (Streamlined) Forms A and B. Bensenville, IL: Scholastic Testing Service.

Tuzcu, A. (2021). The impact of Google Translate on creativity in writing activities. Lang. Educ. Technol. 1, 40–52.

Vartanian, O., Smith, I., Lam, T. K., King, K., Lam, Q., and Beatty, E. L. (2020). The relationship between methods of scoring the alternate uses task and the neural correlates of divergent thinking: Evidence from voxel-based morphometry. NeuroImage 223, 117325. doi: 10.1016/j.neuroimage.2020.117325

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., et al. (2017). “Attention is all you need,” in Advances in Neural Information Processing Systems 30.

Vo, H., and Asojo, A. (2018). Feedback responsiveness and students’ creativity. Acad. Exch. Quart. 1, 53–57.

Wagire, A. A., Rathore, A., and Jain, R. (2020). Analysis and synthesis of Industry 4.0 research landscape: Using latent semantic analysis approach. J. Manuf. Technol. Manag. 31, 31–51. doi: 10.1108/JMTM-10-2018-0349

Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., et al. (2019). “SuperGLUE: A stickier benchmark for general-purpose language understanding systems,” in Advances in Neural Information Processing Systems 32.

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Williams, F. (1980). Creativity Assessment Packet (CAP). Buffalo, NY: D. O. K. Publishers Inc.

Yang, Z., Zhu, C., and Chen, W. (2018). Parameter-free sentence embedding via orthogonal basis. arXiv preprint arXiv:1810.00438.

Zhang, W., Sjoerds, Z., and Hommel, B. (2020). Metacontrol of human creativity: The neurocognitive mechanisms of convergent and divergent thinking. NeuroImage 210, 116572. doi: 10.1016/j.neuroimage.2020.116572

Zuñiga, D., Amido, T., and Camargo, J. (2017). “Communications in Computer and Information Science,” in Colombian Conference on Computing (Cham: Springer).

Keywords: review, creativity process, ideation, evaluation, artificial intelligence

Citation: Ul Haq I and Pifarré M (2023) Dynamics of automatized measures of creativity: mapping the landscape to quantify creative ideation. Front. Educ. 8:1240962. doi: 10.3389/feduc.2023.1240962

Received: 15 June 2023; Accepted: 18 September 2023; Published: 12 October 2023.

Edited by:

Mohammad Khalil, University of Bergen, Norway

Reviewed by:

Chen-Yao Kao, National University of Tainan, Taiwan

Faisal Saeed, Kyungpook National University, Republic of Korea

Copyright © 2023 Ul Haq and Pifarré. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Manoli Pifarré, manoli.pifarre@udl.cat
