Computer Science > Computer Vision and Pattern Recognition

Title: Understanding and Creating Art with AI: Review and Outlook

Abstract: Technologies related to artificial intelligence (AI) have a strong impact on research and creative practices in the visual arts. The growing number of research initiatives and creative applications emerging at the intersection of AI and art motivates us to examine and discuss the creative and explorative potential of AI technologies in the context of art. This paper provides an integrated review of two facets of AI and art: 1) AI used for art analysis, applied to digitized artwork collections; and 2) AI used for creative purposes, to generate novel artworks. In the context of AI-related research for art understanding, we present a comprehensive overview of artwork datasets and recent works that address a variety of tasks such as classification, object detection, similarity retrieval, multimodal representations, and computational aesthetics. In relation to the role of AI in creating art, we address various practical and theoretical aspects of AI Art and consolidate related works that deal with those topics in detail. Finally, we provide a concise outlook on the future progression and potential impact of AI technologies on our understanding and creation of art.

Generative artificial intelligence, human creativity, and art

Competing Interest: The authors declare no competing interest.

Eric Zhou, Dokyun Lee, Generative artificial intelligence, human creativity, and art, PNAS Nexus, Volume 3, Issue 3, March 2024, pgae052, https://doi.org/10.1093/pnasnexus/pgae052

Recent artificial intelligence (AI) tools have demonstrated the ability to produce outputs traditionally considered creative. One such system is text-to-image generative AI (e.g. Midjourney, Stable Diffusion, DALL-E), which automates humans’ artistic execution to generate digital artworks. Utilizing a dataset of over 4 million artworks from more than 50,000 unique users, our research shows that over time, text-to-image AI significantly enhances human creative productivity by 25% and increases the value as measured by the likelihood of receiving a favorite per view by 50%. While peak artwork Content Novelty, defined as focal subject matter and relations, increases over time, average Content Novelty declines, suggesting an expanding but inefficient idea space. Additionally, there is a consistent reduction in both peak and average Visual Novelty, captured by pixel-level stylistic elements. Importantly, AI-assisted artists who can successfully explore more novel ideas, regardless of their prior originality, may produce artworks that their peers evaluate more favorably. Lastly, AI adoption decreased value capture (favorites earned) concentration among adopters. The results suggest that ideation and filtering are likely necessary skills in the text-to-image process, thus giving rise to “generative synesthesia”—the harmonious blending of human exploration and AI exploitation to discover new creative workflows.

We investigate the implications of incorporating text-to-image generative artificial intelligence (AI) into the human creative workflow. We find that generative AI significantly boosts artists’ productivity and leads to more favorable evaluations from their peers. While average novelty in artwork content and visual elements declines, peak Content Novelty increases, indicating a propensity for idea exploration. The artists who successfully explore novel ideas and filter model outputs for coherence benefit the most from AI tools, underscoring the pivotal role of human ideation and artistic filtering in determining an artist’s success with generative AI tools.

Recently, artificial intelligence (AI) has demonstrated that it can produce outputs that society would traditionally judge as creative. Specifically, generative algorithms have been leveraged to automatically generate creative artifacts such as music (1), digital artworks (2, 3), and stories (4). Such generative models allow humans to engage directly in the creative process through text-to-image systems (e.g. Midjourney, Stable Diffusion, DALL-E) based on the latent diffusion model (5) or by participating in an open dialog with transformer-based language models (e.g. ChatGPT, Bard, Claude). Generative AI is projected to become more capable, automating even more creative tasks traditionally reserved for humans and generating significant economic value in the years to come (6).

Many such generative algorithms were released in the past year, and their diffusion into creative domains has concerned many artistic communities, which perceive generative AI as a threat that could substitute for the natural human ability to be creative. Text-to-image generative AI has emerged as a candidate system that automates elements of humans’ creative process in producing high-quality digital artworks. Remarkably, an artwork created with Midjourney bested human artists in an art competition,[a] while another artist refused to accept the top prize in a photo competition after winning, citing ethical concerns.[b] Artists have filed lawsuits against the founding companies of some of the most prominent text-to-image generators, arguing that generative AI steals from the works on which the models are trained and infringes on the copyrights of artists.[c] This has ignited a broader debate regarding the originality of AI-generated content and the extent to which it may replace human creativity, a faculty that many consider unique to humans. While generative AI has demonstrated the capability to automatically create new digital artifacts, there remains a significant knowledge gap regarding its impact on productivity in artistic endeavors that lack well-defined objectives, and the long-run implications for human creativity more broadly. In particular, if humans increasingly rely on generative AI for content creation, creative fields may become saturated with generic content, potentially stifling exploration of new creative frontiers. Given that generative algorithms will remain a mainstay in creative domains as they continue to mature, it is critical to understand how generative AI is affecting creative production, the evaluation of creative artifacts, and human creativity more broadly. To this end, our research questions are 3-fold:

How does the adoption of generative AI affect humans’ creative production?

Is generative AI enabling humans to produce more creative content?

When and for whom does the adoption of generative AI lead to more creative and valuable artifacts?

Our analyses of over 53,000 artists and 5,800 known AI adopters on one of the largest art-sharing platforms reveal that creative productivity and artwork value, measured as favorites per view, significantly increased with the adoption of text-to-image systems.

We then focus our analysis on creative novelty. A simplified view of human creative novelty with respect to art can be summarized via two main channels through which humans can inject creativity into an artifact: Contents and Visuals. These concepts are rooted in the classical philosophy of symbolism in art, which suggests that the contents of an artwork relate to its meaning or subject matter, whereas visuals are simply the physical elements used to convey the content (7). In our setting, Contents concern the focal object(s) and relations depicted in an artifact, whereas Visuals consider the pixel-level stylistic elements of an artifact. Thus, Content and Visual Novelty are measured as the pairwise cosine distance between artifacts in the feature space (see Materials and methods for details on feature extraction and how novelty is measured).
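
As one plausible formalization consistent with this description (our notation, not the authors'): for a new artifact with embedding $\mathbf{x}$ and a baseline set $B$ of earlier artifact embeddings, mean novelty can be written as

$$\mathrm{Novelty}(\mathbf{x}, B) = \frac{1}{|B|} \sum_{\mathbf{b} \in B} \left( 1 - \frac{\mathbf{x} \cdot \mathbf{b}}{\lVert \mathbf{x} \rVert \, \lVert \mathbf{b} \rVert} \right),$$

with the maximum (rather than the mean) over $\mathbf{b} \in B$ giving peak novelty.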

Our analyses reveal that over time, adopters’ artworks exhibit decreasing novelty, in terms of both Content and Visual features. However, maximum Content Novelty increases, suggesting an expanding yet inefficient idea space. At the individual level, artists who harness generative AI while successfully exploring more innovative ideas, irrespective of their prior originality, may earn more favorable evaluations from their peers. In addition, the adoption of generative AI leads to a less concentrated distribution of favorites earned among adopters.

We present results from three analyses. Using an event study difference-in-differences approach (8), we first estimate the causal impact of adopting generative AI on creative productivity, artwork value measured as favorites per view, and artifact novelty with respect to Content and Visual features. Then, using a two-way fixed effects model, we offer correlational evidence regarding how humans’ originality prior to adopting generative AI may influence postadoption gains in artwork value when artists successfully explore the creative space. Lastly, we show how adoption of generative AI may lead to a more dispersed distribution of favorites across users on the platform.
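
The estimator behind this design is the staggered difference-in-differences framework of Callaway and Sant'Anna (8). Purely to illustrate the event-study logic, the sketch below fits a naive two-way fixed-effects event study; the file and column names (artist_month_panel.csv, user_id, month, rel_month, log_posts) are hypothetical, and this is neither the authors' code nor the Callaway–Sant'Anna estimator.

```python
# Naive two-way fixed-effects event study: log monthly posts regressed on
# dummies for time relative to adoption, with user and calendar-month fixed
# effects and user-clustered standard errors. Illustration only; with tens of
# thousands of users one would absorb the fixed effects rather than use C().
import pandas as pd
import statsmodels.formula.api as smf

panel = pd.read_csv("artist_month_panel.csv")  # hypothetical artist-month panel

# rel_month = months since adoption (negative before adoption); never-treated
# users are folded into the omitted reference period of -1.
panel["rel_month"] = panel["rel_month"].fillna(-1).clip(-6, 6).astype(int)

model = smf.ols(
    "log_posts ~ C(rel_month, Treatment(reference=-1)) + C(user_id) + C(month)",
    data=panel,
).fit(cov_type="cluster", cov_kwds={"groups": panel["user_id"]})

# Coefficients on rel_month trace out the dynamic effect of adoption.
print(model.params.filter(like="rel_month"))
```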

Creative productivity

We define creative productivity as the log of the number of artifacts that a user posts in a month. Figure 1a reveals that upon adoption, artists experience a 50% increase in productivity on average, which then doubles in the subsequent month. For the average user, this translates to approximately 7 additional artifacts published in the adoption month and 15 artifacts in the following month. Beyond the adoption month, user productivity gradually stabilizes to a level that still exceeds preadoption volume. By automating the execution stage of the creative process, adopters can experience prolonged productivity gains compared to their nonadopter counterparts.

Fig. 1. Causal effect of adopting generative AI on a) creative productivity as the log of monthly posts; b) creative value as number of favorites per view; c) mean Content Novelty; d) maximum Content Novelty; e) mean Visual Novelty; f) maximum Visual Novelty. The error bars represent 95% CI.

Creative value

If users are becoming more productive, what of the quality of the artifacts they are producing? We next examine how adopters’ artifacts are evaluated by their peers over time. In the literature, creative Value is intended to measure some aspect of utility, performance, and/or attractiveness of an artifact, subject to temporal and cultural contexts (9). Given this subjectivity, we measure Value as the number of favorites an artwork receives per view after 2 weeks, reflecting its overall performance and contextual relevance within the community. This metric also hints at the artwork’s broader popularity within the cultural climate, suggesting a looser definition of Value based on cultural trends. Throughout the paper, the term “Value” will refer to these two notions.

Figure 1b reveals an initial nonsignificant upward trend in the Value of artworks produced by AI adopters. But after 3 months, AI adopters consistently produce artworks judged significantly more valuable than those of nonadopters. This translates to a 50% increase in artwork favorability by the sixth month, jumping from the preadoption average of 2% to a steady 3% rate of earning a favorite per view.

Content Novelty

Figure 1c shows that average Content Novelty decreases over time among adopters, meaning that the focal objects and themes within new artworks produced by AI adopters become progressively more alike over time when compared to control units. Intuitively, this is equivalent to adopters’ ideas becoming more similar over time. In practice, many publicly available fine-tuned checkpoints and adapters are refined to enable text-to-image models to produce specific contents with consistency. Figure 1d, however, reveals that maximum Content Novelty increases, and is marginally statistically significant, within the first several months after adoption. This suggests two possibilities: either a subset of adopters are exploring new ideas at the creative frontier, or the adopter population as a whole is driving the exploration and expansion of the universe of artifacts.

Visual Novelty

The result shown in Fig. 1e highlights that average Visual Novelty is decreasing over time among adopters when compared to nonadopters. The same result holds for the maximum Visual Novelty seen in Fig. 1f. This suggests that adopters may be gravitating toward a preferred visual style, with relatively minor deviations from it. This tendency could be influenced by the nature of text-to-image workflows, where prompt engineering tends to follow a formulaic approach to generate consistent, high-quality images with a specific style. As is the case with contents, publicly available fine-tuned checkpoints and adapters for these models may be designed to capture specific visual elements from which users can sample to maintain a particular and consistent visual style. In effect, AI may be pushing artists toward visual homogeneity.

Role of human creativity in AI-assisted value capture

Although aggregate trends suggest that the novelty of ideas and aesthetic features is sharply declining over time with generative AI, are there individual-level differences that enable certain artists to successfully produce more creative artworks? Specifically, how does humans’ baseline novelty, in the absence of AI tools, correlate with their ability to successfully explore novel ideas with generative AI to produce valuable artifacts? To delve into this heterogeneity, we categorize each user into quartiles based on their average Content and Visual Novelty without AI assistance to capture each user’s baseline novelty. We then employ a two-way fixed effects model to examine the interaction between adoption, pretreatment novelty quartiles, and posttreatment adjustments in novelty. Each point in Fig. 2a and b represents the estimated impact of increasing mean Content (left) or Visual (right) Novelty on Value based on artists’ prior novelty denoted along the horizontal axis. Intuitively, these estimates quantify the degree to which artists can successfully navigate the creative space, based on prior originality in both ideation and visuals, to earn more favorable evaluations from peers. Refer to SI Appendix, Section 2B for estimation details.

Fig. 2. Estimated effect of increases in mean Content and Visual Novelty on Value postadoption based on a) average Content Novelty quartiles prior to treatment; b) average Visual Novelty quartiles prior to treatment. Each point shows the estimated effect of postadoption novelty increases, given creativity levels prior to treatment, on Value. The error bars represent 95% CI.

Figure 2a presents correlational evidence that users, regardless of their proficiency in generating novel ideas, might be able to realize significant gains in Value if they can successfully produce more novel content with generative AI. The lowest quartile of content creators may also experience marginally significant gains. However, those same users who benefit from expressing more novel ideas may also face penalties for producing more divergent visuals.

Next, Fig. 2b suggests that users who were proficient in creating exceedingly novel visual features before adopting generative AI may garner the most Value gains from successfully introducing more novel ideas. While marginally significant, less proficient users can also experience weak Value gains. In general, more novel ideas are linked to improved Value capture. Conversely, users capable of producing the most novel visual features may face penalties for pushing the boundaries of pixel-level aesthetics with generative AI. This finding might be attributed to the contextual nature of Value, implying an “acceptable range” of novelty. Artists already skilled at producing highly novel pixel-level features may exceed the limit of what can be considered coherent.

Despite penalties for pushing visual boundaries, the gains from exploring creative ideas with AI outweigh the losses from visual divergence. Unique concepts take priority over novel aesthetics, as shown by the larger Value gains for artists who were already adept at Visual Novelty before using AI. This suggests users who naturally lean toward visual exploration may benefit more from generative AI tools to explore the idea space.

Lastly, we estimate Generalized Random Forests (10) configured to optimize the splitting criteria that maximize heterogeneity in Value gains among adopters for each postadoption period. With each trained model, we extract feature importance weights quantified by the SHAP (SHapley Additive exPlanations) method (11). This method utilizes ideas from cooperative game theory to approximate the predictive signal of covariates, accounting for linear and nonlinear interactions through the Markov chain Monte Carlo method. Intuitively, a feature of greater importance indicates potentially greater impacts on treatment effect heterogeneity among adopters.
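
The original analysis uses Generalized Random Forests (10) with SHAP (11); as a loose, non-authoritative analogue, the sketch below fits a causal forest with the econml package on synthetic placeholder data and ranks the novelty features with SHAP via a surrogate model. It is not the authors' implementation.

```python
# Illustrative sketch only (not the authors' pipeline): estimate heterogeneous
# adoption effects with a causal forest, then use SHAP on a surrogate model to
# rank which novelty features drive the estimated heterogeneity.
import numpy as np
import shap
from econml.dml import CausalForestDML
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))        # [mean Content Novelty, mean Visual Novelty]
W = rng.normal(size=(n, 3))        # other user-level controls (synthetic)
T = rng.binomial(1, 0.3, size=n)   # adoption indicator (synthetic)
Y = 0.02 + 0.01 * T * X[:, 0] + 0.005 * rng.normal(size=n)  # favorites per view

est = CausalForestDML(discrete_treatment=True, random_state=0)
est.fit(Y, T, X=X, W=W)
cate = est.effect(X)               # per-user estimated Value gain from adoption

# Surrogate regression on the estimated effects; mean |SHAP| per feature then
# serves as an importance score for treatment-effect heterogeneity.
surrogate = GradientBoostingRegressor(random_state=0).fit(X, cate)
shap_values = shap.TreeExplainer(surrogate).shap_values(X)
print(np.abs(shap_values).mean(axis=0))
```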

Figure 3 offers correlational evidence that Content Novelty significantly increases model performance within several months of adoption, whereas Visual Novelty remains marginally impactful until the last observation period. This suggests that Content Novelty plays a more significant role in predicting posttreatment variations in Value gains compared to Visual Novelty. In summary, these findings illustrate that content is king in the text-to-image creative paradigm.

Fig. 3. SHAP values measuring importance of mean Content and Visual Novelty on Value gains.

Platform-level value capture

One question remains: do individual-level differences within adopters result in greater concentrations of value among fewer users at the platform-level? Specifically, are more favorites being captured by fewer users, or is generative AI promoting less concentrated value capture? To address these questions, we calculate the Gini coefficients with respect to favorites received of never-treated units, not-yet-treated units, and treated units and conduct permutation tests with 10,000 iterations to evaluate if adoption of generative AI may lead to a less concentrated distribution of favorites among users. The Gini coefficient is a common measure of aggregate inequality where a coefficient of 0 indicates that all users make up an equal proportion of favorites earned, and a coefficient of 1 indicates that a single user captures all favorites. Thus, higher values of the Gini coefficient indicate a greater concentration of favorites captured by fewer users. Figure 4 depicts the differences in cumulative distributions as well as Gini coefficients of both control groups and the treated group with respect to a state of perfect equality.
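
A minimal sketch of both steps (our own helper functions, not the authors' code): the Gini coefficient of favorites earned within a group, and a permutation test for the difference between adopters and a control group.

```python
# Sketch: Gini coefficient of favorites earned, plus a permutation test for
# the difference in coefficients between treated and control users.
import numpy as np

def gini(favorites):
    """Gini coefficient of a non-negative array (0 = perfectly equal shares,
    1 = a single user captures all favorites)."""
    x = np.sort(np.asarray(favorites, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    # Standard Lorenz-curve-based formula.
    return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n

def permutation_test(treated, control, n_iter=10_000, seed=0):
    """Two-sided permutation test for the difference in Gini coefficients."""
    rng = np.random.default_rng(seed)
    observed = gini(control) - gini(treated)
    pooled = np.concatenate([treated, control])
    n_t = len(treated)
    diffs = np.empty(n_iter)
    for i in range(n_iter):
        rng.shuffle(pooled)                       # random relabeling of groups
        diffs[i] = gini(pooled[n_t:]) - gini(pooled[:n_t])
    p_value = np.mean(np.abs(diffs) >= abs(observed))
    return observed, p_value
```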

Fig. 4. Gini coefficients of treated units vs. never-treated and not-yet-treated units.

First, observe that platform-level favorites are predominantly captured by a small portion of users, reflecting an aggregate concentration of favorites. Second, this concentration is more pronounced among not-yet-treated units than among never-treated units. Third, despite the presence of aggregate concentration, favorites captured among AI adopters are more evenly distributed compared to both never-treated and not-yet-treated control units. The results from the permutation tests in Table 1, where column D shows the difference between the treated coefficient and the control group coefficients, show that the differences in coefficients between the never-treated and not-yet-treated groups vs. the treated group are statistically significant. This suggests that generative AI may lead to a broader allocation of favorites earned (value capture from peer feedback), particularly among control units who eventually become adopters.

Table 1. Permutation tests for statistical significance. Column D denotes the difference in Gini coefficients relative to the treated population.

To reinforce the validity of our causal estimates, we employ the generalized synthetic control method (12) (GSCM). GSCM allows us to relax the parallel trends assumption by creating synthetic control units that closely match the pretreatment characteristics of the treated units while also accounting for unobservable factors that may influence treatment outcomes. In addition, we conduct permutation tests to evaluate the robustness of our estimates to potential measurement errors in treatment time identification and control group contamination. Our results remain consistent even when utilizing GSCM and in the presence of substantial measurement error.

Because adopting generative AI is subject to selection issues, one emergent concern is the case where an artist who experiences renewed interest in creating artworks, and thus is more “inspired,” is also more likely to experiment with text-to-image AI tools and explore the creative space as they ramp up production. In this way, unobservable characteristics like a renewed interest in creating art or a “spark of inspiration” might correlate with adoption of AI tools while driving the main effects, rather than the AI tools themselves. Thus, we also provide evidence that unobservable characteristics that may correlate with users’ productivity or “interest” shocks and selection into treatment are not driving the estimated effects by performing a series of falsification tests. For a comprehensive overview of all robustness checks and sensitivity analyses, please refer to SI Appendix, Section 3.

The rapid adoption of generative AI technologies poses exceptional benefits as well as risks. Current research demonstrates that humans, when assisted by generative AI, can significantly increase productivity in coding (13), ideation (14), and written assignments (15), while raising concerns regarding potential disinformation (16) and stagnation of knowledge creation (17). Our research is focused on how generative AI is impacting and potentially coevolving with human creative workflows. In our setting, human creativity is embodied through prompts themselves, whereas in written assignments, generative AI is primarily used to source ideas that are subsequently evaluated by humans, representing a different paradigm shift in the creative process.

Within the first few months post-adoption, text-to-image generative AI can help individuals produce nearly double the volume of creative artifacts that are also evaluated 50% more favorably by their peers over time. Moreover, we observe that peak Content Novelty increases over time, while average Content and Visual Novelty diminish. This implies that the universe of creative possibilities is expanding but with some inefficiencies.

Our results hint that the widespread adoption of generative AI technologies in creative fields could lead to a long-run equilibrium where in aggregate, many artifacts converge to the same types of content or visual features. Creative domains may be inundated with generic content as exploration of the creative space diminishes. Without establishing new frontiers for creative exploration, AI systems trained on outdated knowledge banks run the risk of perpetuating the generation of generic content at a mass scale in a self-reinforcing cycle ( 17 ). Before we reach that point, technology firms and policy makers pioneering the future of generative AI must be sensitive to the potential consequences of such technologies in creative fields and society more broadly.

Encouragingly, humans assisted by generative AI who can successfully explore more novel ideas may be able to push the creative frontier, produce meaningful content, and be evaluated favorably by their peers. With respect to traditional theories of creativity, one particularly useful framework for understanding these results is the theory of blind variation and selective retention (BVSR), which posits that creativity is a process of generating new ideas (variation) and subsequently selecting the most promising ones (retention) (18). The blindness feature suggests that variation is not guided by any specific goal but can also involve evaluating outputs against selection criteria in a genetic algorithm framework (19).

Because we do not directly observe users’ process, this discussion is speculative, but it suggests that the text-to-image creative workflow resembles a BVSR genetic process. First, humans manipulate and mutate known creative elements in the form of prompt engineering, which requires the human to deconstruct an idea into atomic components, primarily in the form of distinct words and phrases, to compose abstract ideas or meanings. Then, visual realization of an idea is automated by the algorithm, allowing humans to rapidly sample ideas from their creative space and simply evaluate the output against selection criteria. These selection criteria vary based on humans’ ability to make sense of model outputs and curate those that most align with individual or peer preferences, thus having direct implications for their evaluation by peers. Satisfactory outputs contribute to the genetic evolution of future ideas, prompts, and image refinements.

Although we can only observe the published artworks, it is plausible that many more unobserved iterations of ideation, prompt engineering, filtering, and refinement have occurred. This is especially likely given the documented increase in creative productivity. Thus, it is possible that individuals with less refined artistic filters are also less discerning when filtering artworks for quality which could lead to a flood of less refined content on platforms. In contrast, artists who prioritize coherence and quality may only publish artworks that are likely to be evaluated favorably.

The results suggest some evidence in this direction, indicating that humans who excel at producing novel ideas before adopting generative AI are evaluated most favorably after adoption if they successfully explore the idea space, implying that the ability to manipulate novel concepts and curate artworks based on coherence are relevant skills when using text-to-image AI. This aligns with prior research which suggests that creative individuals are particularly adept at discerning which ideas are most meaningful (20), reflecting a refined sensitivity to the artistic coherence of artifacts (21). Furthermore, all artists, regardless of their ability to produce novel visual features without generative AI, appear to be evaluated more favorably if they can capably explore more novel ideas. This finding hints at the importance of humans’ baseline ideation and filtering abilities as focal expressions of creativity in a text-to-image paradigm. Finally, generative AI appears to promote a more even distribution of platform-level favorites among adopters, signaling a potential step toward an increasingly democratized, inclusive creative domain for artists empowered by AI tools.

In summary, our findings emphasize that humans’ ideation proficiency and a refined artistic filter, rather than pure mechanical skill, may become the focal skills required in a future human–AI cocreative process as generative AI becomes more mainstream in creative endeavors. This phenomenon, in which AI-assisted artistic creation is driven by ideas and filtering, is what we term “generative synesthesia”—the harmonization of human exploration and AI exploitation to discover new creative workflows. This paradigm shift may provide avenues for creatives to focus on what ideas they are representing rather than how they represent them, opening new opportunities for creative exploration. While concerns about automation loom, society must consider a future where generative AI is not the source of human stagnation, but rather of symphonic collaboration and human enrichment.

Identifying AI adopters

Platform-level policy commonly suggests that users disclose their use of AI assistance in the form of tags associated with their artworks. Thus, we employ a rule-based classification scheme. As a first pass, any artwork published before the original DALL-E release in January 2021 is automatically labeled as non-AI-generated. Then, for all artworks published after January 2021, we examine post-level titles and tags provided by the publishing user. We use simple keyword matching (AI-generated, Stable Diffusion, Midjourney, DALL-E, etc.) for each post to identify the artworks for which a user employs AI tools. As a second pass, we track artwork posts published in AI art communities, which may not include explicit tags denoting AI assistance. We compile all of these artworks and label them as AI-generated. Finally, we assign adoption timing based on the first known AI-generated post for each user (SI Appendix, Fig. S2).
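
A simplified sketch of this rule-based tagging is shown below; the keyword list, field names, and community names are illustrative assumptions rather than the authors' exact scheme.

```python
# Sketch of the two-pass, rule-based adopter identification described above.
from datetime import date

AI_KEYWORDS = {"ai-generated", "ai generated", "stable diffusion",
               "midjourney", "dall-e", "dalle"}
DALLE_RELEASE = date(2021, 1, 1)  # original DALL-E announcement

def is_ai_generated(post):
    """post: dict with 'published' (date), 'title', 'tags', 'community'."""
    if post["published"] < DALLE_RELEASE:
        return False  # first pass: predates text-to-image systems
    text = " ".join([post["title"], *post["tags"]]).lower()
    if any(kw in text for kw in AI_KEYWORDS):
        return True   # explicit disclosure via title or tags
    # second pass: posts in AI art communities (hypothetical community names)
    return post.get("community", "").lower() in {"ai-art", "midjourney-art"}

def adoption_month(posts):
    """Adoption timing = month of the user's first known AI-generated post."""
    ai_dates = sorted(p["published"] for p in posts if is_ai_generated(p))
    return ai_dates[0].replace(day=1) if ai_dates else None
```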

Measuring creative novelty

To measure the two types of novelty, we borrow the idea of conceptual spaces, which can be understood as geometric representations of entities that capture particular attributes of the artifacts along various dimensions (9, 22). This definition naturally aligns with the concept of embeddings, like word2vec (23), which capture the relative features of objects in a vector space. This concept can be applied to text passages and images such that measuring the distance between these vector representations captures whether an artifact deviates from or converges with a reference object in the space.

Using embeddings, we apply the following algorithm: take all artifacts published before April 1, 2022, as the baseline set of artworks. We use this cutoff because nearly all adoption occurs after May 2022, so all artifacts in future periods are compared to non-AI-generated works in the baseline period, and it provides an adequate number of pretreatment and posttreatment observations (on average 3 and 7, respectively) for the majority of our causal sample. Then, take all artifacts published in the following month and measure the pairwise cosine distance between those artifacts and the baseline set, recovering the mean, minimum, and maximum distances for each artifact. This month’s artifacts are then added to the baseline set such that all future artworks are compared to all prior artworks, effectively capturing the time-varying nature of novelty. Continue for all remaining months. We apply this approach to all adopters’ artworks and a random sample of 10,000 control users due to computational feasibility.
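
A minimal sketch of this rolling-baseline computation, assuming embeddings have already been extracted and grouped by publication month (not the authors' code):

```python
# Rolling-baseline novelty: each month's artifacts are compared against all
# earlier artifacts, then appended to the baseline for the next month.
import numpy as np
from sklearn.metrics.pairwise import cosine_distances

def monthly_novelty(embeddings_by_month):
    """embeddings_by_month: dict {month: (n_artifacts, dim) array}.
    Returns {month: (n_artifacts, 3) array of mean/min/max cosine distance}."""
    months = sorted(embeddings_by_month)
    baseline = embeddings_by_month[months[0]]     # artworks before the cutoff
    results = {}
    for month in months[1:]:
        current = embeddings_by_month[month]
        d = cosine_distances(current, baseline)   # pairwise distances to baseline
        results[month] = np.stack(
            [d.mean(axis=1), d.min(axis=1), d.max(axis=1)], axis=1
        )
        baseline = np.vstack([baseline, current]) # this month joins the baseline
    return results
```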

Content feature extraction

To describe the focal objects and object relationships in an artifact, we utilize the state-of-the-art multimodal model BLIP-2 (24), which takes an image as input and produces a text description of its content. A key feature of this approach is the availability of controlled text generation hyperparameters that allow us to generate more stable descriptions that are systematically similar in structure, the model having been trained on 129M images and human-annotated data. BLIP-2 can maintain consistent focus and regularity while avoiding the noise added by cross-individual differences.
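
As an illustration of how such descriptions can be produced with the publicly released BLIP-2 checkpoints through Hugging Face transformers; the checkpoint name and decoding settings below are assumptions, not the authors' configuration.

```python
# Caption an artwork with BLIP-2; deterministic beam search keeps the
# descriptions structurally consistent across images.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

image = Image.open("artwork.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, do_sample=False, num_beams=3, max_new_tokens=40)
caption = processor.decode(out[0], skip_special_tokens=True).strip()
print(caption)
```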

Given the generated descriptions, we then utilize a pretrained text embedding model based on BERT (25), which has demonstrated state-of-the-art performance on semantic similarity benchmarks while also being highly efficient, to compute high-dimensional vector representations for each description. Then, we apply the algorithm described above to measure Content Novelty.
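
For instance, with the sentence-transformers library (the specific checkpoint is an assumption; the paper only states that a BERT-based sentence embedding model (25) was used):

```python
# Embed the generated descriptions with a Sentence-BERT model.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
descriptions = ["a painting of a castle on a hill at sunset",
                "a digital portrait of a woman with neon hair"]
content_embeddings = encoder.encode(descriptions, normalize_embeddings=True)
print(content_embeddings.shape)  # (n_descriptions, 384)
```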

Visual feature extraction

To capture visual features of each artifact at the pixel level, we use a more flexible approach via the self-supervised visual representation learning algorithm DINOv2 (26), which overcomes the limitations of standard image-text pretraining approaches, where visual features may not be explicitly described in text. Because we are dealing with creative concepts, this approach is particularly suitable for robustly identifying object parts in an image and extracting low-level pixel features while still exhibiting excellent generalization performance. We compute vector representations of each image such that we can apply the algorithm described above to obtain measures of Visual Novelty.
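
One way to extract such features with the publicly released DINOv2 models via torch.hub is sketched below; the model variant and preprocessing are assumptions.

```python
# Extract a global visual embedding per artwork with DINOv2 (ViT-S/14).
import torch
from PIL import Image
from torchvision import transforms

dinov2 = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
dinov2.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("artwork.png").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    visual_embedding = dinov2(image)   # shape (1, 384) image-level feature
```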

[a] An AI-generated picture won an art prize. Artists are not happy.

[b] Artist wins photography contest after submitting AI-generated image, then forfeits prize.

[c] The current legal cases against generative AI are just the beginning.

The authors acknowledge the valuable contributions of their Business Insights through Text Lab (BITLAB) research assistants Animikh Aich, Aditya Bala, Amrutha Karthikeyan, Audrey Mao, and Esha Vaishnav in helping to prepare the data for analysis. Furthermore, the authors are grateful to Stefano Puntoni, Alex Burnap, Mi Zhou, Gregory Sun, our audiences at the Wharton Business & Generative AI Workshop (23/9/8), the INFORMS Workshop on Data Science (23/10/14), the INFORMS Annual Meeting (23/10/15), and seminar participants at the University of Wisconsin-Milwaukee (23/9/22), University of Texas-Dallas (23/10/6), and the MIT Initiative on the Digital Economy (23/11/29) for their insightful comments and feedback.

Supplementary material is available at PNAS Nexus online.

The authors declare no funding.

D.L. and E.Z. designed the research and wrote the paper. E.Z. analyzed data and performed research with guidance from D.L.

A preprint of this article is available at SSRN.

Replication archive with code is available at the Open Science Framework at https://osf.io/jfzyp/. Data have been anonymized for the privacy of the users.

References

1. Dong H-W, Hsiao W-Y, Yang L-C, Yang Y-H. 2018. MuseGAN: multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. p. 34–41.

2. Tan WR, Chan CS, Aguirre H, Tanaka K. 2017. ArtGAN: artwork synthesis with conditional categorical GANs. In: 2017 IEEE International Conference on Image Processing (ICIP). IEEE. p. 3760–3764.

3. Elgammal A, Liu B, Elhoseiny M, Mazzone M. 2017. CAN: creative adversarial networks, generating “art” by learning about styles and deviating from style norms. arXiv, arXiv:1706.07068, preprint: not peer reviewed.

4. Brown TB, et al. 2020. Language models are few-shot learners. Adv Neural Inf Process Syst. 33:1877–1901.

5. Rombach R, Blattmann A, Lorenz D, Esser P, Ommer B. 2022. High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. p. 10684–10695.

6. Huang S, Grady P, GPT-3. 2022. Generative AI: a creative new world. Sequoia Capital US/Europe. https://www.sequoiacap.com/article/generative-ai-a-creative-new-world/

7. Wollheim R. 1970. Nelson Goodman’s languages of art. J Philos. 67(16):531–539.

8. Callaway B, Sant’Anna PHC. 2021. Difference-in-differences with multiple time periods. J Econom. 225(2):200–230.

9. Boden MA. 1998. Creativity and artificial intelligence. Artif Intell. 103(1–2):347–356.

10. Athey S, Tibshirani J, Wager S. 2019. Generalized random forests. Ann Statist. 47(2):1148–1178.

11. Lundberg S, Lee S-I. 2017. A unified approach to interpreting model predictions. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. p. 4768–4777.

12. Xu Y. 2017. Generalized synthetic control method: causal inference with interactive fixed effects models. Polit Anal. 25(1):57–76.

13. Peng S, Kalliamvakou E, Cihon P, Demirer M. 2023. The impact of AI on developer productivity: evidence from GitHub Copilot. arXiv, arXiv:2302.06590, preprint: not peer reviewed.

14. Noy S, Zhang W. 2023. Experimental evidence on the productivity effects of generative artificial intelligence. Science. 381(6654):187–192.

15. Dell’Acqua F, et al. 2023. Navigating the jagged technological frontier: field experimental evidence of the effects of AI on knowledge worker productivity and quality. Harvard Business School Technology & Operations Mgt. Unit Working Paper (24-013).

16. Spitale G, Biller-Andorno N, Germani F. 2023. AI model GPT-3 (dis)informs us better than humans. Sci Adv. 9(26):eadh1850.

17. Burtch G, Lee D, Chen Z. 2023. The consequences of generative AI for UGC and online community engagement. Available at SSRN 4521754.

18. Campbell DT. 1960. Blind variation and selective retentions in creative thought as in other knowledge processes. Psychol Rev. 67(6):380–400.

19. Simonton DK. 1999. Creativity as blind variation and selective retention: is the creative process Darwinian? Psychol Inq. 10(4):309–328.

20. Silvia PJ. 2008. Discernment and creativity: how well can people identify their most creative ideas? Psychol Aesthet Creat Arts. 2(3):139–146.

21. Ivcevic Z, Mayer JD. 2009. Mapping dimensions of creativity in the life-space. Creat Res J. 21(2–3):152–165.

22. McGregor S, Wiggins G, Purver M. 2014. Computational creativity: a philosophical approach, and an approach to philosophy. In: International Conference on Innovative Computing and Cloud Computing. p. 254–262.

23. Mikolov T, Chen K, Corrado G, Dean J. 2013. Efficient estimation of word representations in vector space. arXiv, arXiv:1301.3781, preprint: not peer reviewed.

24. Li J, Li D, Savarese S, Hoi S. 2023. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv, arXiv:2301.12597, preprint: not peer reviewed.

25. Reimers N, Gurevych I. 2019. Sentence-BERT: sentence embeddings using Siamese BERT-networks. arXiv, arXiv:1908.10084, preprint: not peer reviewed.

26. Oquab M, et al. 2023. DINOv2: learning robust visual features without supervision. arXiv, arXiv:2304.07193, preprint: not peer reviewed.

Artificial intelligence in the creative industries: a review

  • Open access
  • Published: 02 July 2021
  • Volume 55, pages 589–656 (2022)

  • Nantheera Anantrasirichai (ORCID: orcid.org/0000-0002-2122-5781)
  • David Bull

This paper reviews the current state of the art in artificial intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically machine learning (ML) algorithms, is provided including convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs) and deep Reinforcement Learning (DRL). We categorize creative applications into five groups, related to how AI technologies are used: (i) content creation, (ii) information analysis, (iii) content enhancement and post production workflows, (iv) information extraction and enhancement, and (v) data compression. We critically examine the successes and limitations of this rapidly advancing technology in each of these areas. We further differentiate between the use of AI as a creative tool and its potential as a creator in its own right. We foresee that, in the near future, ML-based AI will be adopted widely as a tool or collaborative assistant for creativity. In contrast, we observe that the successes of ML in domains with fewer constraints, where AI is the ‘creator’, remain modest. The potential of AI (or its developers) to win awards for its original creations in competition with human creatives is also limited, based on contemporary technologies. We therefore conclude that, in the context of creative industries, maximum benefit from AI will be derived where its focus is human-centric—where it is designed to augment, rather than replace, human creativity.

1 Introduction

The aim of new technologies is normally to make a specific process easier, more accurate, faster or cheaper. In some cases they also enable us to perform tasks or create things that were previously impossible. Over recent years, one of the most rapidly advancing scientific techniques for practical purposes has been Artificial Intelligence (AI). AI techniques enable machines to perform tasks that typically require some degree of human-like intelligence. With recent developments in high-performance computing and increased data storage capacities, AI technologies have been empowered and are increasingly being adopted across numerous applications, ranging from simple daily tasks, intelligent assistants and finance to highly specific command and control operations and national security. AI can, for example, help smart devices or computers to understand text and read it out loud, hear voices and respond, view images and recognize objects in them, and even predict what may happen next after a series of events. At higher levels, AI has been used to analyze human and social activity by observing their conversations and actions. It has also been used to understand socially relevant problems such as homelessness and to predict natural events. AI has been recognized by governments across the world to have potential as a major driver of economic growth and social progress (Hall and Pesenti 2018; NSTC 2016). This potential, however, does not come without concerns over the wider social impact of AI technologies, which must be taken into account when designing and deploying these tools.

Processes associated with the creative sector demand significantly different levels of innovation and skill sets compared to routine behaviours. While AI accomplishments rely heavily on conformity of data, creativity often exploits the human imagination to drive original ideas which may not follow general rules. Basically, creatives have a lifetime of experiences to build on, enabling them to think ‘outside of the box’ and ask ‘What if’ questions that cannot readily be addressed by constrained learning systems.

There have been many studies over several decades into the possibility of applying AI in the creative sector. One of the limitations in the past was the readiness of the technology itself, and another was the belief that AI could attempt to replicate human creative behaviour (Rowe and Partridge 1993). A recent survey by Adobe revealed that three quarters of artists in the US, UK, Germany and Japan would consider using AI tools as assistants, in areas such as image search, editing, and other ‘non-creative’ tasks. This indicates a general acceptance of AI as a tool across the community and reflects a general awareness of the state of the art, since most AI technologies have been developed to operate in closed domains where they can assist and support humans rather than replace them. Better collaboration between humans and AI technologies can thus maximize the benefits of the synergy. All that said, the first painting created solely by AI was auctioned for $432,500 in 2018.

Applications of AI in the creative industries have dramatically increased in the last five years. Based on analysis of data from arXiv and Gateway to Research, Davies et al. (2020) revealed that the growth rate of research publications on AI (relevant to the creative industries) exceeds 500% in many countries (in Taiwan the growth rate is 1490%), and that most of these publications relate to image-based data. Analysis of company usage from the Crunchbase database indicates that AI is used more in games and for immersive applications, advertising and marketing, than in other creative applications. Caramiaux et al. (2019) recently reviewed AI in the current media and creative industries across three areas: creation, production and consumption. They provide details of AI/ML-based research and development, as well as emerging challenges and trends.

In this paper, we review how AI and its technologies are, or could be, used in applications relevant to the creative industries. We first provide an overview of AI and current technologies (Sect. 2), followed by a selection of creative domain applications (Sect. 3). We group these into subsections covering: (i) content creation, where AI is employed to generate original work; (ii) information analysis, where statistics of data are used to improve productivity; (iii) content enhancement and post-production workflows, used to improve the quality of creative work; (iv) information extraction and enhancement, where AI assists in interpretation, clarifies semantic meaning, and creates new ways to exhibit hidden information; and (v) data compression, where AI helps reduce the size of the data while preserving its quality. Finally, we discuss challenges and the future potential of AI associated with the creative industries in Sect. 4.

2 An introduction to artificial intelligence

Artificial intelligence (AI) embodies a set of codes, techniques, algorithms and data that enables a computer system to develop and emulate human-like behaviour and hence make decisions similar to (or, in some cases, better than) humans (Russell and Norvig 2020). When a machine exhibits full human intelligence, it is often referred to as ‘general AI’ or ‘strong AI’ (Bostrom 2014). However, currently reported technologies are normally restricted to operation in a limited domain to work on specific tasks; this is called ‘narrow AI’ or ‘weak AI’. In the past, most AI technologies were model-driven: the nature of the application is studied and a model is mathematically formed to describe it. Statistical learning is also data-dependent, but relies on rule-based programming (James et al. 2013). Previous generations of AI (mid-1950s until the late 1980s (Haugeland 1985)) were based on symbolic AI, following the assumption that humans use symbols to represent things and problems. Symbolic AI is intended to produce general, human-like intelligence in a machine (Honavar 1995), whereas most modern research is directed at specific sub-problems.

2.1 Machine learning, neurons and artificial neural networks

The main class of algorithms in use today is based on machine learning (ML), which is data-driven. ML employs computational methods to ‘learn’ information directly from large amounts of example data without relying on a predetermined equation or model (Mitchell 1997). These algorithms adaptively converge to an optimum solution and generally improve their performance as the number of samples available for learning increases. Several types of learning algorithms exist, including supervised learning, unsupervised learning and reinforcement learning. Supervised learning algorithms build a mathematical model from a set of data that contains both the inputs and the desired outputs (each output usually representing a classification of the associated input vector), while unsupervised learning algorithms model the problems on unlabeled data. Self-supervised learning is a form of unsupervised learning where the data provide the measurable structure to build a loss function. Semi-supervised learning employs a limited set of labeled data to label a (usually larger) amount of unlabeled data; both datasets are then combined to create a new model. Reinforcement learning methods learn from trial and error and are effectively self-supervised (Russell and Norvig 2020).

Modern ML methods have their roots in the early computational model of a neuron proposed by Warren McCulloch (neuroscientist) and Walter Pitts (logician) in 1943. This is shown in Fig. 1a. In their model, the artificial neuron receives one or more inputs, where each input is independently weighted. The neuron sums these weighted inputs and the result is passed through a non-linear function known as an activation function, representing the neuron’s action potential, which is then transmitted along its axon to other neurons. The multi-layer perceptron (MLP) is a basic form of artificial neural network (ANN) that gained popularity in the 1980s. It connects its neural units in a multi-layered (typically one input layer, one hidden layer and one output layer) architecture (Fig. 1b). These neural layers are generally fully connected to adjacent layers (i.e., each neuron in one layer is connected to all neurons in the next layer). The disadvantage of this approach is that the total number of parameters can be very large, which can make such networks prone to overfitting the data.

For training, the MLP (and most supervised ANNs) utilizes error backpropagation to compute the gradient of a loss function. This loss function maps the event values from multiple inputs into one real number to represent the cost of that event. The goal of the training process is therefore to minimize the loss function over multiple presentations of the input dataset. The backpropagation algorithm was originally introduced in the 1970s, but surged in popularity after Rumelhart et al. (1986) described several neural networks in which backpropagation worked far faster than earlier approaches, making ANNs applicable to practical problems.
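
To make these ideas concrete, the following is a minimal PyTorch sketch of an MLP trained by backpropagation, with arbitrary sizes and random data; it is illustrative only.

```python
# Minimal MLP (one hidden layer) trained by backpropagation.
import torch
import torch.nn as nn

model = nn.Sequential(            # input -> hidden -> output, fully connected
    nn.Linear(16, 32),            # each neuron: weighted sum of its inputs
    nn.ReLU(),                    # non-linear activation function
    nn.Linear(32, 3),             # output layer (e.g. 3 classes)
)
loss_fn = nn.CrossEntropyLoss()   # maps predictions + labels to one cost value
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(64, 16)           # a batch of 64 input vectors (random data)
y = torch.randint(0, 3, (64,))    # their class labels (random data)

for _ in range(100):              # minimize the loss over repeated presentations
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()               # backpropagation: gradient of loss w.r.t. weights
    optimizer.step()              # gradient step updates weights and biases
```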

Fig. 1: a Basic neural network unit by McCulloch and Pitts. b Basic multi-layer perceptron (MLP)

2.2 An introduction to deep neural networks

Deep learning is a subset of ML that employs deep artificial neural networks (DNNs). The word ‘deep’ means that there are multiple hidden layers of neuron collections that have learnable weights and biases. When the data being processed occupies multiple dimensions (images for example), convolutional neural networks (CNNs) are often employed. CNNs are (loosely) a biologically-inspired architecture and their results are tiled so that they overlap to obtain a better representation of the original inputs.

The first CNN was designed by Fukushima (1980) as a tool for visual pattern recognition (Fig. 2a). This so-called Neocognitron was a hierarchical architecture with multiple convolutional and pooling layers. LeCun et al. (1989) applied the standard backpropagation algorithm to a deep neural network with the purpose of recognizing handwritten ZIP codes. At that time, it took 3 days to train the network. LeCun et al. (1998) proposed LeNet5 (Fig. 2b), one of the earliest CNNs which could outperform other models for handwritten character recognition. The deep learning breakthrough occurred in the 2000s, driven by the availability of graphics processing units (GPUs) that could dramatically accelerate training. Since around 2012, CNNs have represented the state of the art for complex problems such as image classification and recognition, having won several major international competitions.

Fig. 2: a Neocognitron (Fukushima 1980), where U_s and U_c learn simple and complex features, respectively. b LeNet5 (LeCun et al. 1998), consisting of two sets of convolutional and average pooling layers, followed by a flattening convolutional layer, then two fully-connected layers and finally a softmax classifier

A CNN creates its filters’ values based on the task at hand. Generally, the CNN learns to detect edges from the raw pixels in the first layer, then uses those edges to detect simple shapes in the next layer, and so on building complexity through subsequent layers. The higher layers produce high-level features with more semantically relevant meaning. This means that the algorithms can exploit both low-level features and a higher-level understanding of what the data represent. Deep learning has therefore emerged as a powerful tool to find patterns, analyze information, and to predict future events. The number of layers in a deep network is unlimited but most current networks contain between 10 and 100 layers.

Goodfellow et al. ( 2014 ) proposed an alternative form of architecture referred to as a Generative Adversarial Network (GAN). GANs consist of 2 AI competing modules where the first creates images (the generator) and the second (the discriminator) checks whether the received image is real or created from the first module. This competition results in the final picture being very similar to the real image. Because of their performance in reducing deceptive results, GAN technologies have become very popular and have been applied to numerous applications, including those related to creative practice.

While many types of machine learning algorithms exist, because of their prominence and performance, in this paper we place emphasis on deep learning methods. We will describe various applications relevant to the creative industries and critically review the methodologies that achieve, or have the potential to achieve, good performance.

2.3 Current AI technologies

This section presents state-of-the-art AI methods relevant to the creative industries. For those readers who prefer to focus on the applications, please refer to Sect.  3 .

2.3.1 AI and the need for data

An AI system effectively combines a computational architecture and a learning strategy with a data environment in which it learns. Training databases are thus a critical component in optimizing the performance of ML processes and hence a significant proportion of the value of an AI system resides in them. A well-designed training database with appropriate size and coverage can help significantly with model generalization and avoiding problems of overfitting.

In order to learn without being explicitly programmed, ML systems must be trained using data having statistics and characteristics typical of the particular application domain under consideration. This is true regardless of training methods (see Sect.  2.1 ). Good datasets typically contain large numbers of examples with a statistical distribution matched to this domain. This is crucial because it enables the network to estimate gradients in the data (error) domain and hence converge to an optimal solution, forming robust decision boundaries between its classes. After training, the network will then be able to reliably match new unseen information to the right answer when deployed.

The reliability of training dataset labels is key to achieving high-performance supervised deep learning. These datasets must comprise: i) data that are statistically similar to the inputs the models will see in real situations, and ii) ground truth annotations that tell the machine what the desired outputs are. For example, in segmentation applications, the dataset would comprise the images and the corresponding segmentation maps indicating homogeneous, or semantically meaningful, regions in each image. Similarly, for object recognition, the dataset would include the original images while the ground truth would be the object categories, e.g., car, house, human, types of animal, etc.

Some labeled datasets are freely available for public use, Footnote 7 but these are limited, especially in certain applications where data are difficult to collect and label. One of the largest, ImageNet, contains over 14 million images labeled into 22,000 classes. Care must be taken when collecting or using data to avoid imbalance and bias—skewed class distributions where the majority of data instances belong to a small number of classes with other classes being sparsely populated. For instance, in colorization, blue may appear more often as it is a color of sky, while pink flowers are much rarer. This imbalance causes ML algorithms to develop a bias towards classes with a greater number of instances; hence they preferentially predict majority class data. Features of minority classes are treated as noise and are often ignored.

Numerous approaches have been introduced to create balanced distributions and these can be divided into two major groups: modification of the learning algorithm, and data manipulation techniques (He and Garcia 2009 ). Zhang et al. ( 2016 ) solve the class-imbalance problem by re-weighting the loss of each pixel at train time based on the pixel color rarity. Recently, Lehtinen et al. ( 2018 ) have introduced an innovative approach to learning via their Noise2Noise network which demonstrates that it is possible to train a network without clean data if the corrupted data complies with certain statistical assumptions. However, this technique needs further testing and refinement to cope with real-world noisy data. Typical data manipulation techniques include downsampling majority classes, oversampling minority classes, or both. Two primary techniques are used to expand, adjust and rebalance the number of samples in the dataset and, in turn, to improve ML performance and model generalization: data augmentation and data synthesis. These are discussed further below.
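One common form of the loss re-weighting idea mentioned above is to weight the classification loss inversely to class frequency, so that errors on rare classes are penalized more heavily. The sketch below (PyTorch, with made-up class counts) is only illustrative of this general strategy, not of any specific method cited here:

```python
import torch
import torch.nn as nn

# Hypothetical class counts for an imbalanced three-class problem.
class_counts = torch.tensor([9000.0, 800.0, 200.0])

# Weight each class inversely to its frequency (normalized so the weights average ~1).
weights = class_counts.sum() / (len(class_counts) * class_counts)
loss_fn = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(16, 3)              # model outputs for a mini-batch
labels = torch.randint(0, 3, (16,))      # corresponding labels
loss = loss_fn(logits, labels)           # errors on the rare class now cost more
```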

2.3.1.1 Data augmentation

Data augmentation techniques are frequently used to increase the volume and diversity of a training dataset without the need to collect new data. Instead, existing data are used to generate more samples, through transformations such as cropping, flipping, translating, rotating and scaling (Anantrasirichai et al. 2018 ; Krizhevsky et al. 2012 ). This can assist by increasing the representation of minority classes and also help to avoid overfitting, which occurs when a model memorizes the full dataset instead of only learning the main concepts which underlie the problem. GANs (see Sect.  2.3.3 ) have recently been employed with success to enlarge training sets, with the most popular network currently being CycleGAN (Zhu et al. 2017 ). The original CycleGAN mapped one input to only one output, causing inefficiencies when dataset diversity is required. Huang et al. ( 2018 ) improved CycleGAN with a structure-aware network to augment training data for vehicle detection. A slightly modified CycleGAN architecture has also been trained to transform contrast CT images (computed tomography scans) into non-contrast images (Sandfort et al. 2019 ). A CycleGAN-based technique has also been used for emotion classification, to amplify cases of extremely rare emotions such as disgust (Zhu et al. 2018 ). IBM Research introduced a Balancing GAN (Mariani et al. 2018 ), where the model learns useful features from majority classes and uses these to generate images for minority classes that avoid features close to those of majority cases. An extensive survey of data augmentation techniques can be found in Shorten and Khoshgoftaar ( 2019 ).
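The simpler geometric and photometric augmentations listed above are typically applied on the fly during training; a minimal sketch using torchvision (parameter values are arbitrary examples) might look like this:

```python
import torchvision.transforms as T

# Each time an image is drawn for training it receives a random combination of
# these transformations, increasing dataset diversity without collecting new data.
augment = T.Compose([
    T.RandomResizedCrop(224),        # random cropping and rescaling
    T.RandomHorizontalFlip(),        # random flipping
    T.RandomRotation(degrees=10),    # small random rotations
    T.ColorJitter(brightness=0.2),   # mild photometric variation
    T.ToTensor(),
])

# augmented = augment(pil_image)     # applied to a PIL image inside the data loader
```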

2.3.1.2 Data synthesis

Scientific or parametric models can be exploited to generate synthetic data in those applications where it is difficult to collect real data, and where data augmentation techniques cannot increase variety in the dataset. Examples include signs of disease (Alsaih et al. 2017 ) and geological events that rarely happen (Anantrasirichai et al. 2019 ). In the case of creative processes, problems are often ill-posed as ground truth data or ideal outputs are not available. Examples include post-production operations such as deblurring, denoising and contrast enhancement. Synthetic data are often created by degrading the clean data. Su et al. ( 2017 ) applied synthetic motion blur to sharp video frames to train a deblurring model. LLNet (Lore et al. 2017 ) enhances low-light images and is trained using a dataset generated with synthetic noise and intensity adjustment, while LLCNN (Tao et al. 2017 ) employs a gamma adjustment technique.
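A generic illustration of this degrade-the-clean-data idea (not the specific procedure of any of the methods cited above) is sketched below; the gamma value and noise level are arbitrary assumptions:

```python
import numpy as np

def synthesize_low_light(clean, gamma=3.0, noise_sigma=0.02):
    """Degrade a clean image (float array scaled to [0, 1]) to mimic a low-light capture:
    a gamma adjustment darkens it and additive Gaussian noise imitates sensor noise."""
    dark = np.power(clean, gamma)
    noisy = dark + np.random.normal(0.0, noise_sigma, clean.shape)
    return np.clip(noisy, 0.0, 1.0)

# The resulting (degraded, clean) pairs can then be used to train an enhancement
# network when real low-light images with ground truth are unavailable.
```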

figure 3

CNN architectures for a object recognition adapted from \(^{8}\) , b semantic segmentation \(^{9}\)

2.3.2 Convolutional neural networks (CNNs)

2.3.2.1 Basic CNNs

Convolutional neural networks (CNNs) are a class of deep feed-forward ANNs. They comprise a series of convolutional layers that are designed to take advantage of 2D structures, such as those found in images. These employ locally connected layers that apply convolution operations between a predefined-size kernel and an internal signal; the output of each convolutional layer is the input signal modified by a convolution filter. The weights of the filter are adjusted according to a loss function that assesses the mismatch (during training) between the network output and the ground truth values or labels. Commonly used loss functions include \(\ell _1\) , \(\ell _2\) , SSIM (Tao et al. 2017 ) and perceptual loss (Johnson et al. 2016 ). These errors are then backpropagated through multiple forward and backward iterations and the filter weights adjusted based on estimated gradients of the local error surface. This in turn drives what features are detected, associating them with the characteristics of the training data. The early layers in a CNN extract low-level features conceptually similar to visual basis functions found in the primary visual cortex (Matsugu et al. 2003 ).

The most common CNN architecture (Fig.  3 a Footnote 8 ) has the outputs from its convolution layers connected to a pooling layer, which combines the outputs of neuron clusters into a single neuron. Subsequently, activation functions such as tanh (the hyperbolic tangent) or ReLU (Rectified Linear Unit) are applied to introduce non-linearity into the network (Agostinelli et al. 2015 ). This structure is repeated with similar or different kernel sizes. As a result, the CNN learns to detect edges from the raw pixels in the first layer, then combines these to detect simple shapes in the next layer. The higher layers produce higher-level features, which have more semantic meaning. The last few layers represent the classification part of the network. These consist of fully connected layers (i.e. being connected to all the activation outputs in the previous layer) and a softmax layer, where the output class is modelled as a probability distribution - exponentially scaling the output between 0 and 1 (this is also referred to as a normalised exponential function).
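A minimal CNN classifier in the spirit of Fig. 3a might look like the following sketch (PyTorch; channel sizes and the assumed 224 x 224 input are arbitrary choices for illustration). Note that during training the softmax is usually folded into the cross-entropy loss.

```python
import torch.nn as nn

class BasicCNN(nn.Module):
    """Convolution + ReLU + pooling blocks followed by a fully connected classifier."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, num_classes),   # assumes 224x224 RGB inputs
        )

    def forward(self, x):
        return self.classifier(self.features(x))   # logits; softmax applied in the loss
```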

VGG (Simonyan and Zisserman 2015 ) is one of the most common backbone networks, offering two depths: VGG-16 and VGG-19 with 16 and 19 layers respectively. The networks incorporate a series of convolution blocks (comprising convolutional layers, ReLU activations and a max-pooling layer), and the last three layers are fully connected, with ReLU activations. VGG employs very small receptive fields (3 \(\times\) 3 with a stride of 1), allowing deeper architectures than older networks. DeepArt (Gatys et al. 2016 ) employs a VGG network without fully connected layers. It demonstrates that the higher layers in the VGG network can represent the content of an artwork. The pre-trained VGG network is widely used to provide a measure of perceptual loss (and style loss) during the training process of other networks (Johnson et al. 2016 ).
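As an example of how a pre-trained VGG can serve as a fixed feature extractor for a perceptual loss, the sketch below compares intermediate VGG-16 feature maps of a generated image and its target (PyTorch/torchvision; the layer cut-off and the weights argument are illustrative assumptions and depend on the torchvision version installed):

```python
import torch.nn as nn
import torchvision.models as models

# Use the first few convolutional blocks of a pre-trained VGG-16 as a fixed feature extractor.
vgg_features = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad = False            # VGG is not trained; it only measures the loss

def perceptual_loss(generated, target):
    # Distance between feature maps rather than between raw pixels.
    return nn.functional.mse_loss(vgg_features(generated), vgg_features(target))
```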

2.3.2.2 CNNs with reconstruction

The basic structure of CNNs described in the previous section is sometimes called an ‘encoder’. This is because the network learns a representation of a set of data, which often has fewer parameters than the input. In other words, it compresses the input to produce a code or a latent-space representation. In contrast, some architectures omit pooling layers in order to create dense features in an output with the same size as the input.

Alternatively, the size of the feature map can be enlarged to that of the input via deconvolutional layers or transposed convolution layers (Fig.  3 b Footnote 9 ). This structure is often referred to as a ‘decoder’ as it generates the output using the code produced by the encoder. Encoder-decoder architectures combine an encoder and a decoder. Autoencoders are a special case of encoder-decoder models, where the input and output are the same size. Encoder-decoder models are suitable for creative applications, such as style transfer (Zhang et al. 2016 ), image restoration (Nah et al. 2017 ; Yang and Sun 2018 ; Zhang et al. 2017 ), contrast enhancement (Lore et al. 2017 ; Tao et al. 2017 ), colorization (Zhang et al. 2016 ) and super-resolution (Shi et al. 2016 ).

Some architectures also add skip connections or a bridge section (Long et al. 2015 ) so that local and global features, as well as semantics, are connected and captured, providing improved pixel-wise accuracy. These techniques are widely used in object detection (Anantrasirichai and Bull 2019 ) and object tracking (Redmon and Farhadi 2018 ). U-Net (Ronneberger et al. 2015 ) is perhaps the most popular network of this kind, even though it was originally developed for biomedical image segmentation. The network consists of a contracting path (encoder) and an expansive path (decoder), giving it its characteristic U-shaped architecture. The contracting path consists of the repeated application of two 3 \(\times\) 3 convolutions, followed by ReLU and a max-pooling layer. Each step in the expansive path consists of a transposed convolution layer for upsampling, followed by two sets of convolutional and ReLU layers, and concatenation with features of corresponding resolution from the contracting path.
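A toy encoder-decoder with a single skip connection illustrates the pattern. This is a heavily simplified sketch rather than the full U-Net; channel counts are arbitrary and inputs are assumed to have even spatial dimensions.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """One-level encoder-decoder: downsample, process, upsample, and concatenate
    the encoder features with the decoder features (the skip connection)."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)   # upsampling
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, kernel_size=1))

    def forward(self, x):
        e = self.enc(x)                              # contracting-path features
        b = self.bottleneck(self.down(e))
        u = self.up(b)                               # expansive path
        return self.dec(torch.cat([u, e], dim=1))    # skip connection by concatenation
```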

2.3.2.3 Advanced CNNs

Some architectures introduce modified convolution operations for specific applications. For example, dilated convolution (Yu and Koltun 2016 ), also called atrous convolution, enlarges the receptive field to support feature extraction both locally and globally. The dilated convolution is applied to the input with a defined spacing between the values in a kernel. For example, a 3 \(\times\) 3 kernel with a dilation rate of 2 has the same receptive field as a 5 \(\times\) 5 kernel, but uses only 9 parameters. This has been used for colorization by Zhang et al. ( 2016 ) in the creative sector. ResNet is an architecture developed for residual learning, comprising several residual blocks (He et al. 2016 ). A single residual block has two convolution layers and a skip connection between the input and the output of the last convolution layer. This avoids the problem of vanishing gradients, enabling very deep CNN architectures. Residual learning has become an important part of the state of the art in many applications, such as contrast enhancement (Tao et al. 2017 ), colorization (Huang et al. 2017 ), SR (Dai et al. 2019 ; Zhang et al. 2018a ), object recognition (He et al. 2016 ), and denoising (Zhang et al. 2017 ).
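Both ideas are easy to express directly; the sketch below shows a dilated 3 x 3 convolution and a basic residual block (a simplified illustration that omits the batch normalization most ResNet variants include):

```python
import torch.nn as nn

# A 3x3 kernel with dilation 2 covers a 5x5 receptive field with only 9 weights.
dilated = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)

class ResidualBlock(nn.Module):
    """Two convolutions plus a skip connection from input to output, which
    keeps gradients flowing in very deep networks."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)      # the skip (identity) connection
```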

Traditional convolution operations are performed on a regular grid, which is limiting for applications where the object and its location do not align with that grid. Deformable convolution (Dai et al. 2017 ) has therefore been proposed to allow the region of support of the convolution to take on any shape, instead of just the traditional square. This has been used in object detection and SR (Wang et al. 2019a ). 3D deformable kernels have also been proposed for denoising video content, as they can better cope with large motions, producing cleaner and sharper sequences (Xiangyu Xu 2019 ).

Capsule networks were developed to address some of the deficiencies with traditional CNNs (Sabour et al. 2017 ). They are able to better model hierarchical relationships, where each neuron (referred to as a capsule) expresses the likelihood and properties of its features, e.g., orientation or size. This improves object recognition performance. Capsule networks have been extended to other applications that deal with complex data, including multi-label text classification (Zhao et al. 2019 ), slot filling and intent detection (Zhang et al. 2019a ), polyphonic sound event detection (Vesperini et al. 2019 ) and sign language recognition (Jalal et al. 2018 ).

2.3.3 Generative adversarial networks (GANs)

The generative adversarial network (GAN) is a recent algorithmic innovation that employs two neural networks: generative and discriminative. The GAN pits one against the other in order to generate new, synthetic instances of data that can pass for real data. The general GAN architecture is shown in Fig.  4 a. It can be observed that the generative network generates new candidates to increase the error rate of the discriminative network until the discriminative network cannot tell whether these candidates are real or synthesized. The generator is typically a deconvolutional neural network, and the discriminator is a CNN. Recent successful applications of GANs include SR (Ledig et al. 2017 ), inpainting (Yu et al. 2019 ), contrast enhancement (Kuang et al. 2019 ) and compression (Ma et al. 2019a ).

GANs have a reputation for being difficult to train, since the two models are trained simultaneously to find a Nash equilibrium but with each model updating its cost (or error) independently. Failures often occur when the discriminator cannot feed back information that is good enough for the generator to make progress, leading to vanishing gradients. Wasserstein loss is designed to prevent this (Arjovsky et al. 2017 ; Frogner et al. 2015 ). A specific condition or characteristic, such as a label associated with an image, rather than a generic sample from an unknown noise distribution, can be included in the generative model, creating what is referred to as a conditional GAN (cGAN) (Mirza and Osindero 2014 ). This improved GAN has been used in several applications, including pix2pix (Isola et al. 2017 ) and deblurring (Kupyn et al. 2018 ).
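The adversarial training described above alternates between updating the discriminator and the generator; a minimal sketch of one such step is shown below (PyTorch, assuming a hypothetical generator G, discriminator D ending in a sigmoid, and their optimizers are defined elsewhere):

```python
import torch
import torch.nn as nn

def gan_training_step(G, D, real_images, opt_g, opt_d, latent_dim=100):
    bce = nn.BCELoss()
    batch = real_images.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1) Update the discriminator: separate real images from generated ones.
    opt_d.zero_grad()
    fake_images = G(torch.randn(batch, latent_dim))
    d_loss = bce(D(real_images), real_labels) + bce(D(fake_images.detach()), fake_labels)
    d_loss.backward()
    opt_d.step()

    # 2) Update the generator: try to make the discriminator label fakes as real.
    opt_g.zero_grad()
    g_loss = bce(D(fake_images), real_labels)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```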

Theoretically, the generator in a GAN will not learn to create new content, but it will just try to make its output look like the real data. Therefore, to produce creative works of art, the Creative Adversarial Network (CAN) has been proposed by Elgammal et al. ( 2017 ). This works by including an additional signal in the generator to prevent it from generating content that is too similar to existing examples. Similar to traditional CNNs, a perceptual loss based on VGG16 (Johnson et al. 2016 ) has become common in applications where new images are generated that have the same semantics as the input (Antic 2020 ; Ledig et al. 2017 ).

Most GAN-based methods are currently limited to the generation of relatively small square images, e.g., 256 \(\times\) 256 pixels (Zhang et al. 2017 ). The best resolution created up to the time of this review is 1024 \(\times\) 1024-pixels, achieved by NVIDIA research. The team introduced the progressive growing of GANs (Karras et al. 2018 ) and showed that their method can generate near-realistic 1024 \(\times\) 1024-pixel portrait images (trained for 14 days). However the problem of obvious artefacts at transition areas between foreground and background persists.

Another form of deep generative model is the Variational Autoencoder (VAE). A VAE is an autoencoder where the encoding distribution is regularised to ensure the latent space has good properties to support the generative process. The decoder then samples from this distribution to generate new data. Comparing VAEs to GANs, VAEs are more stable during training, while GANs are better at producing realistic images. Recently, DeepMind (Google) has included vector quantization (VQ) within a VAE to learn a discrete latent representation (Razavi et al. 2019 ). Its performance for image generation is competitive with their BigGAN (Brock et al. 2019 ), but with greater capacity for generating a diverse range of images. There have also been many attempts to merge GANs and VAEs so that the end-to-end network benefits from both good samples and good representation, for example using a VAE as the generator for a GAN (Bhattacharyya et al. 2019 ; Wan et al. 2017 ). However, the results have not yet demonstrated significant improvements in overall performance (Rosca et al. 2019 ), and this remains an ongoing research topic.
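The sketch below shows the core of a VAE in code form: the encoder predicts a mean and log-variance, a latent vector is sampled via the reparameterization trick, and the loss combines reconstruction error with a KL term that regularises the latent space. This is a minimal illustration; layer sizes are arbitrary and inputs are assumed to be flattened and scaled to [0, 1].

```python
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, in_dim=784, latent_dim=20):
        super().__init__()
        self.enc = nn.Linear(in_dim, 128)
        self.mu = nn.Linear(128, latent_dim)         # mean of the latent distribution
        self.logvar = nn.Linear(128, latent_dim)     # log-variance of the latent distribution
        self.dec = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                 nn.Linear(128, in_dim), nn.Sigmoid())

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterized sample
        return self.dec(z), mu, logvar

def vae_loss(recon, x, mu, logvar):
    recon_loss = nn.functional.binary_cross_entropy(recon, x, reduction='sum')
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())  # regularises the latent space
    return recon_loss + kl
```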

A review of recent state-of-the-art GAN models and applications can be found in Foster ( 2019 ).

figure 4

Architectures of a GAN, b RNN for drawing sketches (Ha and Eck 2018 )

2.3.4 Recurrent neural networks (RNNs)

Recurrent neural networks (RNNs) have been widely employed to perform sequential recognition; they offer benefits in this respect by incorporating at least one feedback connection. The most commonly used type of RNN is the Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber 1997 ), as this solves problems associated with vanishing gradients, observed in traditional RNNs. It does this by memorizing sufficient context information in time series data via its memory cell. Deep RNNs use their internal state to process variable length sequences of inputs, combining across multiple levels of representation. This makes them amenable to tasks such as speech recognition (Graves et al. 2013 ), handwriting recognition (Doetsch et al. 2014 ), and music generation (Briot et al. 2020 ). RNNs are also employed in image and video processing applications, where recurrency is applied to convolutional encoders for tasks such as drawing sketches (Ha and Eck 2018 ) and deblurring videos (Zhang et al. 2018 ). VINet (Kim et al. 2019 ) employs an encoder-decoder model using an RNN to estimate optical flow, processing multiple input frames concatenated with the previous inpainting results. An example network using an RNN is illustrated in Fig.  4 b.
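As a concrete illustration of applying an LSTM to sequential data, the minimal sketch below uses arbitrary dimensions; the final hidden state would typically feed a task-specific classifier or decoder.

```python
import torch
import torch.nn as nn

# An LSTM processes a variable-length sequence; its memory cell carries context
# forward through time.
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)

sequence = torch.randn(8, 50, 64)       # batch of 8 sequences, 50 time steps, 64 features each
outputs, (h_n, c_n) = lstm(sequence)    # outputs: per-step features; h_n, c_n: final states
print(outputs.shape)                    # torch.Size([8, 50, 128])
```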

CNNs extract spatial features from their input images using convolutional filters, while RNNs extract sequential features from time-series data using memory cells. Extending these ideas, 3D CNNs, CNN-LSTMs and ConvLSTMs have been designed to extract spatial-temporal features from video sequences. The 3D activation maps produced in 3D CNNs are able to analyze temporal or volumetric context, which is important in applications such as medical imaging (Lundervold and Lundervold 2019 ) and action recognition (Ji et al. 2013 ). The CNN-LSTM simply concatenates a CNN and an LSTM (the 1D output of the CNN is the input to the LSTM) to process time-series data. In contrast, ConvLSTM is an LSTM variant in which the internal matrix multiplications are replaced with convolution operations at each gate of the LSTM cell, so that the LSTM input can be multi-dimensional data (Shi et al. 2015 ).

2.3.5 Deep reinforcement learning (DRL)

Reinforcement learning (RL) is an ML approach in which an agent is trained to make a sequence of decisions. Deep reinforcement learning (DRL) combines ANNs with an RL architecture that enables RL agents to learn the best actions in a virtual environment to achieve their goals. RL agents comprise a policy, which maps an input state to an output action, and an algorithm responsible for updating this policy. The update leverages a system of rewards and punishments to acquire useful behaviour, effectively a trial-and-error process. The framework trains using a simulation model, so it does not require a predefined training dataset, either labeled or unlabeled.

However, pure RL requires an excessive number of trials to learn fully, something that may be impractical in many (especially real-time) applications if training from scratch (Hessel et al. 2018 ). AlphaGo, a computer program developed by DeepMind Technologies that can beat a human professional Go player, employs RL on top of a pre-trained model to improve its play strategy against a particular player. Footnote 10 RL could be useful in creative applications, where there may not be a predefined way to perform a given task, but where there are rules that the model has to follow to perform its duties correctly. Current applications include end-to-end RL combined with CNNs for gaming (Mnih et al. 2013 ), and RL combined with GANs to optimize painting strokes in stroke-based rendering (Huang et al. 2019 ). Recently, RL methods have been developed using a graph neural network (GNN) to play Diplomacy, a highly complex 7-player (large-scale) board game (Anthony et al. 2020 ).

Temporal difference (TD) learning (Gregor et al. 2019 ; Chen et al. 2018 ; Nguyen et al. 2020 ) has recently attracted renewed attention; it is a model-free reinforcement learning method that learns to predict a quantity that depends on future values of a given signal. That is, the model learns from an environment through episodes with no prior knowledge of the environment. This may well have application in the creative sector for storytelling, caption-from-image generation and gaming.
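The core of TD learning is a simple value update: the estimate for the current state is nudged towards the observed reward plus the discounted estimate for the next state. A tabular TD(0) sketch, with arbitrary step-size and discount values, is given below:

```python
import numpy as np

def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """TD(0): move V[state] towards the TD target r + gamma * V[next_state]."""
    td_target = reward + gamma * V[next_state]
    V[state] += alpha * (td_target - V[state])
    return V

V = np.zeros(10)                                     # value estimates for a toy 10-state environment
V = td_update(V, state=3, reward=1.0, next_state=4)
```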

3 AI for the creative industries

AI has increasingly (and often mistakenly) been associated with human creativity and artistic practice. As it has demonstrated abilities to ‘see’, ‘hear’, ‘speak’, ‘move’, and ‘write’, it has been applied in domains and applications including: audio, image and video analysis, gaming, journalism, script writing, filmmaking, social media analysis and marketing. One of the earliest AI technologies, available for more than two decades, is Autotune, which automatically fixes vocal intonation errors (Hildebrand 1999 ). An early attempt to exploit AI for creating art occurred in 2016, when a three-dimensional (3D) printed painting, the Next Rembrandt, Footnote 11 was produced solely based on training data from Rembrandt’s portfolio. It was created using deep learning algorithms and facial recognition techniques.

Creativity is defined in the Cambridge Dictionary as ‘the ability to produce original and unusual ideas, or to make something new or imaginative’. Creative tasks generally require some degree of original thinking, extensive experience and an understanding of the audience, while production tasks are, in general, more repetitive or predictable, making them more amenable to being performed by machines. To date, AI technologies have produced mixed results when used for generating original creative works. For example, GumGum Footnote 12 creates a new piece of art following the input of a brief idea from the user. The model is trained by recording the preferred tools and processes that the artist uses to create a painting. A Turing test revealed that it is difficult to distinguish these AI generated products from those painted by humans. AI methods often produce unusual results when employed to create new narratives for books or movie scripts. Botnik Footnote 13 employs an AI algorithm to automatically remix texts of existing books to create a new chapter. In one experiment, the team fed the seven Harry Potter novels through their predictive text algorithm, and the ‘bot’ created rather strange but amusing sentences, such as “ Ron was standing there and doing a kind of frenzied tap dance. He saw Harry and immediately began to eat Hermione’s family ” (Sautoy 2019 ). However, when AI is used to create less structured content (e.g., some forms of ‘musical’ experience), it can demonstrate pleasurable difference (Briot et al. 2020 ).

In the production domain, Twitter has applied automatic cropping to create image thumbnails that show the most salient part of an image (Theis et al. 2018 ). The BBC has created a proof-of-concept system for automated coverage of live events. In this work, the AI-based system performs shot framing (wide, mid and close-up shots), sequencing, and shot selection automatically (Wright et al. 2020 ). However, the initial results show that the algorithm needs some improvement if it is to replace human operators. Nippon Hoso Kyokai (NHK, Japan’s Broadcasting Corporation), has developed a new AI-driven broadcasting technology called “Smart Production”. This approach extracts events and incidents from diverse sources such as social media feeds (e.g., Twitter), local government data and interviews, and integrates these into a human-friendly accessible format (Kaneko et al. 2020 ).

In this review, we divide creative applications into five major categories: content creation, information analysis, content enhancement and post production workflows, information extraction and enhancement, and data compression. However, it should be noted that many applications exploit several categories in combination. For instance, post-production tools (discussed in Sects.  3.3 and 3.4 ) frequently combine information extraction and content enhancement techniques. These combinations can together be used to create new experiences, enhance existing material or to re-purpose archives (e.g., ‘Venice Through a VR Lens, 1898’ directed by BDH Immersive and Academy 7 Production Footnote 14 ). These workflows may employ AI-enabled super-resolution, colorization, 3D reconstruction and frame rate interpolation methods. Gaming is another important example that has been key for the development of AI. It could be considered as an ‘all-in-one’ AI platform, since it combines rendering, prediction and learning.

We categorize the applications and the corresponding AI-based solutions as shown in Table  1 . For those interested, a more detailed overview of contemporary Deep Learning systems is provided in Sect.  2.3 .

3.1 Content creation

Content creation is a fundamental activity of artists and designers. This section discusses how AI technologies have been employed both to support the creative process and as a creator in its own right.

3.1.1 Script and movie generation

The narrative or story underpins all forms of creativity across art, fiction, journalism, gaming, and other forms of entertainment. AI has been used both to create stories and to optimize the use of supporting data, for example organizing and searching through huge archives for documentaries. The script of a fictional short film, Sunspring (2016), Footnote 15 was entirely written by an AI machine, known as Benjamin, created by New York University. The model, based on a recurrent neural network (RNN) architecture, was trained using science fiction screenplays as input, and the script was generated with random seeds from a sci-fi filmmaking contest. Sunspring has some unnatural story lines. In the sequel, It’s No Game (2017), Benjamin was then used only in selected areas and in collaboration with humans, producing a more fluid and natural plot. This reinforces the notion that the current AI technology can work more efficiently in conjunction with humans rather than being left to its own devices. In 2016, IBM Watson, an AI-based computer system, composed the 6-min movie trailer of a horror film, called Morgan. Footnote 16 The model was trained with more than 100 trailers of horror films enabling it to learn the normative structure and pattern. Later in 2018, Benjamin was used to generate a new film ‘Zone Out’ (produced within 48 h). The project also experimented further by using face-swapping, based on a GAN and voice-generating technologies. This film was entirely directed by AI, but includes many artefacts and unnatural scenes as shown in Fig.  5 a. Footnote 17 Recently, ScriptBook Footnote 18 introduced a story-awareness concept for AI-based storytelling. The generative models focus on three aspects: awareness of characters and their traits, awareness of a script’s style and theme, and awareness of a script’s structure, so the resulting script is more natural.

In gaming, AI has been used to support design, decision-making and interactivity (Justesen et al. 2020 ). Interactive narrative, where users create a storyline through actions, has been developed using AI methods over the past decade (Riedl and Bulitko 2012 ). For example, MADE (Massive Artificial Drama Engine for non-player characters) generates procedural content in games (Héctor 2014 ), and deep reinforcement learning has been employed for personalization (Wang et al. 2017 ). AI Dungeon Footnote 19 is a web-based game that is capable of generating a storyline in real time, interacting with player input. The underlying algorithm requires more than 10,000 label contributions for training to ensure that the model produces smooth interaction with the players. Procedural generation has been used to automatically randomize content so that a game does not present content in the same order every time (Short and Adams 2017 ). Modern games often integrate 3D visualization, augmented reality (AR) and virtual reality (VR) techniques, with the aim of making play more realistic and immersive. Examples include Vid2Vid (Wang et al. 2018 ) which uses a deep neural network, trained on real videos of cityscapes, to generate a synthetic 3D gaming environment. Recently, NVIDIA Research has used a generative model [GameGAN by Kim et al. ( 2020b )], trained on 50,000 PAC-MAN episodes, to create new content, which can be used by game developers to automatically generate layouts for new game levels in the future.

figure 5

a A screenshot from ‘Zone Out’, where the face of the woman was replaced with a man’s mouth \(^{17}\) . b Music transcription generated by AI algorithm \(^{31}\)

3.1.2 Journalism and text generation

Natural language processing (NLP) refers to the broad class of computational techniques for analyzing and generating speech and text. It analyzes natural language data and trains machines to perceive and to generate human language directly. NLP algorithms frequently involve speech recognition (Sect.  3.4 ), natural language understanding [e.g., BERT by Google AI (Devlin et al. 2019 )], and natural language generation (Leppänen et al. 2017 ). Automated journalism, also known as robot journalism, describes automated tools that can generate news articles from structured data. The process scans large amounts of assorted data, orders key points, and inserts details such as names, places, statistics, and some figures (Cohen 2015 ). This can be achieved through NLP and text mining techniques (Dörr 2016 ).

AI can help to break down barriers between different languages with machine translation (Dzmitry Bahdanau 2015 ). A conditioned GAN with an RNN architecture has been proposed for language translation by Subramanian et al. ( 2018 ). It was used for the difficult task of generating English sentences from Chinese poems; it creates understandable text but sometimes with grammatical errors. CNN and RNN architectures are employed to translate video into natural language sentences (Venugopalan et al. 2015 ). AI can also be used to rewrite one article to suit several different channels or audience tastes. Footnote 20 A survey of recent deep learning methods for text generation by Iqbal and Qureshi ( 2020 ) concludes that text generated from images could be most amenable to GAN processing while topic-to-text translation is likely to be dominated by variational autoencoders (VAE).

Automated journalism is now quite widely used. For example, the BBC reported on the UK general election in 2019 using such tools. Footnote 21 Forbes uses an AI-based content management system, called Bertie, to assist in providing reporters with first drafts and templates for news stories. Footnote 22 The Washington Post also has a robot reporting program called Heliograf. Footnote 23 Microsoft announced in 2020 that it uses automated systems to select news stories to appear on the MSN website. Footnote 24 This application of AI demonstrates that current AI technology can be effective in supporting human journalists in constrained cases, increasing production efficiency.

3.1.3 Music generation

There are many different areas where sound design is used in professional practice, including television, film, music production, sound art, video games and theatre. Applications of AI in this domain include searching through large databases to find the most appropriate match for such applications (see Sect.  3.2.3 ), and assisting sound design. Currently, several AI assisted music composition systems support music creation. The process generally involves using ML algorithms to analyze data to find musical patterns, e.g., chords, tempo, and length from various instruments, synthesizers and drums. The system then suggests new composed melodies that may inspire the artist. Example software includes Flow Machines by Sony, Footnote 25 Jukebox by OpenAI Footnote 26 and NSynth by Google AI. Footnote 27 In 2016, Flow Machines launched a song in the style of The Beatles, and in 2018 the team released the first AI album, ‘Hello World’, composed by an artist, SKYGGE (Benoit Carré), using an AI-based tool. Footnote 28 Coconet uses a CNN to infill missing pieces of music. Footnote 29 Modelling music creativity is often achieved using Long Short-Term Memory (LSTM), a special type of RNN architecture (Sturm et al. 2016 ) (an example of the output of this model is shown in Fig.  5 b Footnote 30 and the reader can experience AI-based music at Ars Electronica Voyages Channel Footnote 31 ). The model takes a transcribed musical idea and transforms it in meaningful ways. For example, DeepJ composes music conditioned on a specific mixture of composer styles using a Biaxial LSTM architecture (Mao et al. 2018 ). More recently, generative models have been configured based on an LSTM neural network to generate music (Li et al. 2019b ).

Alongside these methods of musical-notation-based audio synthesis, there also exists a range of direct waveform synthesis techniques that learn and/or act directly on the waveform of the audio itself [for example, Donahue et al. ( 2019 ); Engel et al. ( 2019 )]. A more detailed overview of Deep Learning techniques for music generation can be found in Briot et al. ( 2020 ).

figure 6

Example applications of pix2pix framework (Isola et al. 2017 )

3.1.4 Image generation

AI can be used to create new digital imagery or art-forms automatically, based on selected training datasets, e.g., new examples of bedrooms (Radford et al. 2016 ), cartoon characters (Jin et al. 2017 ), celebrity headshots (Karras et al. 2018 ). Some applications produce a new image conditioned to the input image, referred to as image-to-image translation, or ‘ style transfer ’. It is called translation or transfer, because the image output has a different appearance to the input but with similar semantic content. That is, the algorithms learn the mapping between an input image and an output image. For example, grayscale tones can be converted into natural colors (Zhang et al. 2016 ), using eight simple convolution layers to capture localized semantic meaning and to generate a and b color channels of the CIELAB color space. This involves mapping class probabilities to point estimates in ab space. DeepArt (Gatys et al. 2016 ) transforms the input image into the style of the selected artist by combining feature maps from different convolutional layers. A stroke-based drawing method trains machines to draw and generalise abstract concepts in a manner similar to humans using RNNs (Ha and Eck 2018 ).

A Berkeley AI Research team has successfully used GANs to convert between two image types (Isola et al. 2017 ), e.g., from a Google map to an aerial photo, a segmentation map to a real scene, or a sketch to a colored object (Fig.  6 ). They have published their pix2pix codebase Footnote 32 and invited the online community to experiment with it in different application domains, including depth map to street view, background removal and pose transfer. For example pix2pix has been used Footnote 33 to create a Renaissance portrait from a real portrait photo. Following pix2pix, a large number of research works have improved the performance of style transfer. Cycle-consistent adversarial networks (CycleGAN) (Zhu et al. 2017 ) and DualGAN (Yi et al. 2017 ) have been proposed for unsupervised learning. Both algorithms are based on similar concepts—the images of both groups are translated twice (e.g., from group A to group B, then translated back to the original group A) and the loss function compares the input image and its reconstruction, computing what is referred to as cycle-consistency loss. Samsung AI has shown, using GANs, that it is possible to turn a portrait image, such as the Mona Lisa, into a video where the portrait’s face speaks in the style of a guide (Zakharov et al. 2019 ). Conditional GANs can be trained to transform a human face into one of a different age (Song et al. 2018b ), and to change facial attributes, such as the presence of a beard, skin condition, hair style and color (He et al. 2019 ).
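The cycle-consistency idea described above can be written down compactly: translate an image to the other domain and back, then penalize the difference between the original and its reconstruction. The sketch below is only an illustration of that loss term; the generator names G_ab and G_ba and the weighting factor are assumptions, not details taken from the cited works.

```python
import torch.nn as nn

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, weight=10.0):
    """G_ab translates domain A -> B and G_ba translates B -> A (both assumed defined elsewhere)."""
    l1 = nn.L1Loss()
    rec_a = G_ba(G_ab(real_a))     # A -> B -> A
    rec_b = G_ab(G_ba(real_b))     # B -> A -> B
    return weight * (l1(rec_a, real_a) + l1(rec_b, real_b))
```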

Several creative tools have employed ML-AI methods to create new unique artworks. For example, Picbreeder Footnote 34 and EndlessForms Footnote 35 employ Hypercube-based NeuroEvolution of Augmenting Topologies (Stanley et al. 2009 ) as a generative encoder that exploits geometric regularities. Artbreeder Footnote 36 and GANVAS Studio Footnote 37 employ BigGAN (Brock et al. 2019 ) to generate high-resolution class-conditional images and also to mix two images together to create new interesting work.

figure 7

a Real-time pose animator \(^{38}\) . b Deepfake applied to replace Alden Ehrenreich with a young Harrison Ford in Solo: A Star Wars Story, by derpfakes \(^{50}\)

3.1.5 Animation

Animation is the process of using drawings and models to create moving images. Traditionally this was done by hand-drawing each frame in the sequence and rendering these at an appropriate rate to give the appearance of continuous motion. In recent years, AI methods have been employed to automate the animation process making it easier, faster and more realistic than in the past. A single animation project can involve several shot types, ranging from simple camera pans on a static scene, to more challenging dynamic movements of multiple interacting characters [e.g basketball players (Starke et al. 2020 )]. ML-based AI is particularly well suited to learning models of motion from captured real motion sequences. These motion characteristics can be learnt using deep learning-based approaches, such as autoencoders (Holden et al. 2015 ), LSTMs (Lee et al. 2018 ), and motion prediction networks (Starke et al. 2019 ). Then, the inference applies these characteristics from the trained model to animate characters and dynamic movements. In simple animation, the motion can be estimated using a single low-cost camera. For example, Google research has created software for pose animation that turns a human pose into a cartoon animation in real time Footnote 38 . This is based on PoseNet (estimating pose position Footnote 39 ) and FaceMesh (capturing face movement (Kartynnik et al. 2019 )) as shown in Fig.  7 a. Adobe has also created Character Animator software Footnote 40 offering lip synchronisation, eye tracking and gesture control through webcam and microphone inputs in real-time. This has been adopted by Hollywood studios and other online content creators.

AI has also been employed for rendering objects and scenes. This includes the synthesis of 3D views from motion capture or from monocular cameras (see Sect.  3.4.6 ), shading (Nalbach et al. 2017 ) and dynamic texture synthesis (Tesfaldet et al. 2018 ). Creating realistic lighting in animation and visual effects has also benefited by combining traditional geometrical computer vision with enhanced ML approaches and multiple depth sensors (Guo et al. 2019 ). Animation is not only important within the film industry; it also plays an important role in the games industry, responsible for the portrayal of movement and behaviour. Animating characters, including their faces and postures, is a key component in a game engine. AI-based technologies have enabled digital characters and audiences to co-exist and interact. Footnote 41 Avatar creation has also been employed to enhance virtual assistants, Footnote 42 e.g., using proprietary photoreal AI face synthesis technology (Nagano et al. 2018 ). Facebook Reality Labs have employed ML-AI techniques to animate realistic digital humans, called Codec Avatars, in real time using GAN-based style transfer and using a VAE to extract avatar parameters (Wei et al. 2019 ). AI is also employed to up-sample frame rate in animation (Siyao et al. 2021 ).

3.1.6 Augmented, virtual and mixed reality (VR, AR, MR)

AR and VR use computer technologies to create a fully simulated environment or one that is real but augmented with virtual entities. AR expands the physical world with digital layers via mobile phones, tablets or head mounted displays, while VR takes the user into immersive experiences via a headset with a 3D display that isolates the viewer (at least in an audio-visual sense) from the physical world (Milgram et al. 1995 ).

Significant predictions have been made about the growth of AR and VR markets in recent years, but these have not yet been realised. Footnote 43 This is due to many factors, including equipment cost, available content and the physiological effects of ‘immersion’ (particularly over extended time periods) caused by conflicting sensory interactions (Ng et al. 2020 ). VR can be used to simulate a real workspace for training workers for the sake of safety and to prevent the real-world consequences of failure (Laver et al. 2017 ). In the healthcare industry, VR is being increasingly used in various sectors, ranging from surgical simulation to physical therapy (Keswani et al. 2020 ).

Gaming is often cited as a major market for VR, along with related areas such as pre-visualisation of designs or creative productions (e.g., in building, architecture and filmmaking). Good lists of VR games can be found in many articles. Footnote 44 Deep learning technologies have been exploited in many aspects of gaming, for example in VR/AR game design (Zhang 2020 ) and emotion detection while using VR to improve the user’s immersive experience (Quesnel et al. 2018 ). More recently, AI gaming methods have been extended into the area of virtual production, where the tools are scaled to produce dynamic virtual environments for filmmaking.

AR perhaps has more early potential for growth than VR and uses have been developed in education and to create shared information, work or design spaces, where it can provide added 3D realism for the users interacting in the space (Palmarini et al. 2018 ). AR has also gained interest in augmenting experiences in movie and theatre settings. Footnote 45 A review of current and future trends of AR and VR systems can be found in Bastug et al. ( 2017 ).

MR combines the real world with digital elements (or the virtual world) (Milgram and Kishino 1994 ). It allows us to interact with objects and environments in both the real and virtual world by using touch technology and other sensory interfaces, to merge reality and imagination and to provide more engaging experiences. Examples of MR applications include the ‘MR Sales Gallery’ used by large real estate developers. Footnote 46 It is a virtual sample room that simulates the environment for customers to experience the atmosphere of an interactive residential project. The growth of VR, AR and MR technologies is described by Immerse UK in their recent report on the immersive economy in the UK 2019. Footnote 47 Extended reality (XR) is a newer technology that combines VR, AR and MR with internet connectivity, which opens further opportunities across industry, education, defence, health, tourism and entertainment (Chuah 2018 ).

An immersive experience with VR or MR requires good quality, high-resolution, animated worlds or 360-degree video content (Ozcinar and Smolic 2018 ). This poses new problems for data compression and visual quality assessment, which are currently the subject of increased research activity (Xu et al. 2020 ). AI technologies have been employed to make AR/VR/MR/XR content more exciting and realistic, and to robustly track and localize objects and users in the environment; examples include automatic map reading using image-based localization (Panphattarasap and Calway 2018 ) and gaze estimation (Anantrasirichai et al. 2016 ; Soccini 2017 ). Oculus Insight, by Facebook, uses visual-inertial SLAM (simultaneous localization and mapping) to generate real-time maps and position tracking. Footnote 48 More sophisticated approaches, such as Neural Topological SLAM, leverage semantic and geometric information to improve long-horizon navigation (Chaplot et al. 2020 ). Combining audio and visual sensors can further improve navigation from egocentric observations in complex 3D environments, which can be achieved through a deep reinforcement learning approach (Chen et al. 2020 ).

3.1.7 Deepfakes

Manipulations of visual and auditory media, either for amusement or malicious intent, are not new. However, advances in AI and ML methods have taken this to another level, improving their realism and providing automated processes that make them easier to render. Text generator tools, such as those by OpenAI, can generate coherent paragraphs of text with basic comprehension, translation and summarization, but have also been used to create fake news or abusive spam on social media. Footnote 49 Deepfake technologies can also create realistic fake videos by replacing some parts of the media with synthetic content, for example substituting someone’s face while hair, body and action remain the same (Fig.  7 b Footnote 50 ). Early research created mouth movement synthesis tools capable of making the subject appear to say something different from the actual narrative, e.g., President Barack Obama is lip-synchronized to a new audio track in Suwajanakorn et al. ( 2017 ). More recently, DeepFaceLab (Perov et al. 2020 ) provided a state-of-the-art tool for face replacement; however, manual editing is still required in order to create the most natural appearance. Whole body movements have been generated via learning from a source video to synthesize the positions of arms, legs and body of the target (Chan et al. 2019 ).

Deep learning approaches to Deepfake generation primarily employ generative neural network architectures, e.g., VAEs (Kietzmann et al. 2020 ) and GANs (Zakharov et al. 2019 ). Despite rapid progress in this area, the creation of perfectly natural figures remains challenging; for example deepfake faces often do not blink naturally. Deepfake techniques have been widely used to create pornographic images of celebrities, to cause political distress or social unrest, for purposes of blackmail and to announce fake terrorism events or other disasters. This has resulted in several countries banning non-consensual deepfake content. To counter these often malicious attacks, a number of approaches have been reported and introduced to detect fake digital content (Güera and Delp 2018 ; Hasan and Salah 2019 ; Li and Lyu 2019 ).

3.1.8 Content and captions

There are many approaches that attempt to interpret an image or video and then automatically generate captions based on its content (Pu et al. 2016 ; Xia and Wang 2005 ; Xu et al. 2017b ). This can successfully be achieved through object recognition (see Sect.  3.4 ); YouTube has provided this function for both video-on-demand and livestream videos. Footnote 51

Conversely, AI can also help to generate a new image from text. However, this problem is far more complicated; attempts so far have been based on GANs. Early work by Mansimov et al. ( 2016 ) was capable of generating background image content with relevant colors but with blurred foreground details. A conditioning augmentation technique was proposed to stabilize the training process of the conditional GAN, and also to improve the diversity of the generated samples (Zhang et al. 2017 ). Recent methods with significantly increased complexity are capable of learning to generate an image in an object-wise fashion, leading to more natural-looking results (Li et al. 2019c ). However, limitations remain: for example, artefacts often appear around object boundaries, and inappropriate backgrounds can be produced if the words of the caption are not given in the correct order.

3.2 Information analysis

AI has proven capability to process and adapt to large amounts of training data. It can learn and analyze the characteristics of these data, making it possible to classify content and predict outcomes with high levels of confidence. Example applications include advertising and film analysis, as well as image or video retrieval, for example enabling producers to acquire information, analysts to better market products or journalists to retrieve content relevant to an investigation.

3.2.1 Text categorization

Text categorization is a core application of NLP. This generic text processing task is useful in indexing documents for subsequent retrieval and content analysis (e.g., spam detection, sentiment classification, and topic classification). It can be thought of as the generation of summarised texts from full texts. Traditional techniques for both multi-class and multi-label classifications include decision trees, support vector machines (Kowsari et al. 2019 ), term frequency–inverse document frequency (Azam and Yao 2012 ), and extreme learning machine (Rezaei-Ravari et al. 2021 ). Unsupervised learning with self-organizing maps has also been investigated (Pawar and Gawande 2012 ). Modern NLP techniques are based on deep learning, where generally the first layer is an embedding layer that converts words to vector representations. Additional CNN layers are then added to extract text features and learn word positions (Johnson and Zhang 2015 ). RNNs (mostly based on LSTM architectures) have also been concatenated to learn sentences and give prediction outputs (Chen et al. 2017 ; Gunasekara and Nejadgholi 2018 ). A category sentence generative adversarial network has also been proposed that combines GAN, RNN and reinforcement learning to enlarge training datasets, which improves performance for sentiment classification (Li et al. 2018b ). Recently, an attention layer has been integrated into the network to provide semantic representations in aspect-based sentiment analysis (Truşcă et al. 2020 ). The artist, Vibeke Sorensen, has applied AI techniques to categorize texts from global social networks such as Twitter into six live emotions and display the ‘Mood of the Planet’ artistically using six different colors. Footnote 52
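A minimal sketch of the embedding-plus-CNN style of text classifier described above is given below (PyTorch; the vocabulary size, number of classes and layer sizes are placeholder assumptions):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    """An embedding layer converts token ids to vectors; a 1D convolution extracts
    n-gram-like features; a linear layer predicts the class."""
    def __init__(self, vocab_size=10000, embed_dim=128, num_classes=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, 64, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveMaxPool1d(1)
        self.fc = nn.Linear(64, num_classes)

    def forward(self, tokens):                      # tokens: (batch, seq_len) integer ids
        x = self.embed(tokens).transpose(1, 2)      # -> (batch, embed_dim, seq_len)
        x = self.pool(torch.relu(self.conv(x))).squeeze(-1)
        return self.fc(x)                           # class logits
```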

3.2.2 Advertisements and film analysis

AI can assist creators in matching content more effectively to their audiences, for example recommending music and movies in a streaming service like Spotify or Netflix. Learning systems have also been used to characterize and target individual viewers, optimizing the time they spend on advertising (Lacerda et al. 2006 ). This approach assesses what users look at and how long they spend browsing adverts and participating on social media platforms. In addition, AI can be used to inform how adverts should be presented to help boost their effectiveness, for example by identifying suitable customers and showing the ad at the right time. This normally involves gathering and analysing personal data in order to predict preferences (Golbeck et al. 2011 ).

Contextualizing social-media conversations can also help advertisers understand how consumers feel about products and detect fraudulent ad impressions (Ghani et al. 2019 ). This can be achieved using NLP methods (Young et al. 2018 ). Recently, an AI-based data analysis tool has been introduced to help filmmaking companies develop strategies for how, when and where prospective films should be released (Dodds 2020 ). The tool employs ML approaches to model patterns in historical data about film performance, associating these with each film’s content and themes. This approach is also used in the gaming industry, where the behaviour of each player is analyzed so that the company can better understand their style of play and decide when best to approach them to make money. Footnote 53

3.2.3 Content retrieval

Data retrieval is an important component in many creative processes, since producing a new piece of work generally requires undertaking a significant amount of research at the start. Traditional retrieval technologies rely on metadata or annotation text (e.g., titles, captions, tags, keywords and descriptions) attached to the source content (Jeon et al. 2003 ). The manual annotation process needed to create this metadata is, however, very time-consuming. AI methods have enabled automatic annotation by supporting the analysis of media based on audio and object recognition and scene understanding (Amato et al. 2017 ; Wu et al. 2015 ).

In contrast to traditional concept-based approaches, content-based image retrieval (or query by image content (QBIC)) analyzes the content of an image rather than its metadata. A reverse image search technique (one of the techniques Google Images uses Footnote 54 ) extracts low-level features from an input image, such as points, lines, shapes, colors and textures. The query system then searches for related images by matching these features within the search space. Modern image retrieval methods often employ deep learning techniques, enabling image to image searching by extracting low-level features and then combining these to form semantic representations of the reference image that can be used as the basis of a search (Wan et al. 2014 ). For example, when a user uploads an image of a dog to Google Images, the search engine will return the dog breed, show similar websites by searching with this keyword, and also show selected images that are visually similar to that dog, e.g., with similar colors and background. These techniques have been further improved by exploiting features at local, regional and global image levels (Gordo et al. 2016 ). GAN approaches are also popular, associated with learning-based hashing which was proposed for scalable image retrieval (Song et al. 2018a ). Video retrieval can be more challenging due to the requirement for understanding activities, interactions between objects and unknown context; RNNs have provided a natural extension that supports the extraction of sequential behaviour in this case (Jabeen et al. 2018 ).

Music information retrieval extracts features of sound, and then converts these to a meaningful representation suitable for a query engine. Several methods for this have been reported, including automatic tagging, query by humming, search by sound and acoustic fingerprinting (Kaminskas and Ricci 2012 ).

3.2.4 Recommendation services

A recommendation engine is a system that suggests products, services or information to users based on analysis of data. For example, a music curator creates a soundtrack or a playlist of songs with a similar mood and tone, bringing related content to the user. Curation tools, capable of searching large databases and creating recommendation shortlists, have become popular because they can save time, elevate brand visibility and increase connection to the audience. The techniques used in recommendation systems generally fall into three categories: (i) content-based filtering, which uses a single user’s data, (ii) collaborative filtering, the most prominent approach, which derives suggestions from many other users, and (iii) knowledge-based systems, based on specific queries made by the user, which are generally employed in complex domains where the first two cannot be applied. These approaches can also be hybridized; for instance, content-based filtering can exploit individual metadata while collaborative filtering finds overlaps between user playlists. Such systems build a profile of what the users listen to or watch, and then look at what other people with similar profiles listen to or watch. ESPN and Netflix have partnered with Spotify to curate playlists from the documentary ‘The Last Dance’. Spotify has created music and podcast playlists that viewers can check out after watching the show. Footnote 55
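
The following sketch illustrates the core of user-based collaborative filtering as described above: recommendations for one user are derived from other users with similar consumption profiles. The ratings matrix and the weighting scheme are purely illustrative, not those of any deployed service.

```python
# A minimal sketch of user-based collaborative filtering. Rows are users,
# columns are items (songs/films); 0 means the item has not been consumed.
import numpy as np

ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [1, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
], dtype=float)

def recommend(user_idx, ratings, top_k=2):
    norms = np.linalg.norm(ratings, axis=1, keepdims=True)
    sims = (ratings @ ratings[user_idx]) / (norms.squeeze() * norms[user_idx] + 1e-9)
    sims[user_idx] = 0.0                                 # exclude the user themselves
    predicted = sims @ ratings / (sims.sum() + 1e-9)     # similarity-weighted item scores
    predicted[ratings[user_idx] > 0] = -np.inf           # hide items already consumed
    return np.argsort(-predicted)[:top_k]

print(recommend(0, ratings))   # items suggested to user 0
```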

Content summarization is a fundamental tool that can support recommendation services. Text categorization approaches extract important content from a document into key indices (see Sect.  3.2.1 ). RNN-based models incorporating attention mechanisms have been employed to successfully generate a summary in the form of an abstract (Rush et al. 2015 ), a short paragraph (See et al. 2017 ) or a personalized sentence (Li et al. 2019a ). The gaze behavior of an individual viewer has also been included for personalised text summarization (Yi et al. 2020 ). The personalized identification of key frames and start points in a video has also been framed as an optimization problem in Chen et al. ( 2014 ). ML approaches have been developed to perform content-based recommendations. Multimodal features of text, audio, image, and video content are extracted and used to seek similar content in Deldjoo et al. ( 2018 ). This task is relevant to content retrieval, as discussed in Sect.  3.2.3 . A detailed review of deep learning for recommendation systems can be found in Batmaz et al. ( 2019 ).

3.2.5 Intelligent assistants

Intelligent Assistants employ a combination of AI tools, including many of those mentioned above, in the form of a software agent that can perform tasks or services for an individual. These virtual agents can access information via digital channels to answer questions relating to, for example, weather forecasts, news items or encyclopaedic enquiries. They can recommend songs, movies and places, as well as suggest routes. They can also manage personal schedules, emails, and reminders. The communication can be in the form of text or voice. The AI technologies behind the intelligent assistants are based on sophisticated ML and NLP methods. Examples of current intelligent assistants include Google Assistant, Footnote 56 Siri, Footnote 57 Amazon Alexa and Nina by Nuance. Footnote 58 Similarly, chatbots and other types of virtual assistants are used for marketing, customer service, finding specific content and information gathering (Xu et al. 2017a ).

3.3 Content enhancement and post production workflows

It is often the case that original content (whether images, videos, audio or documents) is not fit for the purpose of its target audience. This could be due to noise caused by sensor limitations, the conditions prevailing during acquisition, or degradation over time. AI offers the potential to create assistive intelligent tools that improve both quality and management, particularly for mass-produced content.

3.3.1 Contrast enhancement

The human visual system employs many opponent processes, both in the retina and visual cortex, that rely heavily on differences in color, luminance or motion to trigger salient reactions (Bull and Zhang 2021 ). Contrast is the difference in luminance and/or color that makes an object distinguishable, and it is an important factor in any subjective evaluation of image quality. Low contrast images exhibit a narrow range of tones and can therefore appear flat or dull. Non-parametric methods for contrast enhancement involve histogram equalisation, which stretches the intensity of an image across its bit depth, from 0 to a maximum value (e.g., 255 for 8 bits/pixel). Contrast-limited adaptive histogram equalisation (CLAHE) is one example that is commonly used to adjust a histogram and reduce noise amplification (Pizer et al. 1987 ). Modern methods have further extended performance by exploiting CNNs and autoencoders (Lore et al. 2017 ), inception modules and residual learning (Tao et al. 2017 ). Image Enhancement Conditional Generative Adversarial Networks (IE-CGANs) designed to process both visible and infrared images have been proposed by Kuang et al. ( 2019 ). Contrast enhancement, along with other methods to be discussed later, suffers from a fundamental lack of data for supervised training because real image pairs with low and high contrast are unavailable (Jiang et al. 2021 ). Most of these methods therefore train their networks with synthetic data (see Sect.  2.3.1 ).
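
As a concrete illustration of the non-parametric approach, the sketch below applies CLAHE using OpenCV. The file name is a placeholder and the clip limit and tile size are illustrative values only.

```python
# A minimal sketch of contrast-limited adaptive histogram equalisation (CLAHE)
# with OpenCV, assuming an 8-bit greyscale input image at the placeholder path.
import cv2

gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))  # clip limit controls noise amplification
enhanced = clahe.apply(gray)
cv2.imwrite("enhanced.png", enhanced)
```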

3.3.2 Colorization

Colorization is the process that adds or restores color in visual media. This can be useful in coloring archive black and white content, enhancing infrared imagery (e.g., in low-light natural history filming) and also in restoring the color of aged film. A good example is the recent film “They Shall Not Grow Old” (2018) by Peter Jackson, which colorized 90 minutes of footage from World War One (as well as correcting for speed and jerkiness, adding sound and converting to 3D). The workflow was based on extensive studies of WW1 equipment and uniforms as a reference point and involved time-consuming use of post production tools.

The first AI-based techniques for colorization used a CNN with only three convolutional layers to convert a grayscale image into chrominance values and refined them with bilateral filters to generate a natural color image (Cheng et al. 2015 ). A deeper network, but still only with eight dilated convolutional layers, was proposed a year later (Zhang et al. 2016 ). This network captured better semantics, resulting in an improvement on images with distinct foreground objects. Encoder-decoder networks are employed in Xu et al. ( 2020 ).

Colorization remains a challenging problem for AI, as recognized in the recent Challenge in Computer Vision and Pattern Recognition Workshops (CVPRW) in 2019 (Nah et al. 2019 ). Six teams competed and all of them employed deep learning methods. Most of the methods adopted an encoder-decoder or a structure based on U-Net (Ronneberger et al. 2015 ). The deep residual net (ResNet) architecture (He et al. 2016 ) and the dense net (DenseNet) architecture (Huang et al. 2017 ) have both demonstrated effective conversion of grayscale to natural-looking color images. More complex architectures have been developed based on GAN structures (Zhang et al. 2019 ), for example DeOldify and NoGAN (Antic 2020 ). The latter model was shown to reduce temporal color flickering in video sequences, a common problem when colors are enhanced on an individual frame-by-frame basis. Infrared images have also been converted to natural color images using CNNs (e.g., Limmer and Lensch 2016 ) (Fig.  8 a) and GANs (e.g., Kuang et al. 2020 ; Suarez et al. 2017 ).

Fig. 8: Image enhancement. a Colorization for infrared image (Limmer and Lensch 2016 ). b Super-resolution (Ledig et al. 2017 )

3.3.3 Upscaling imagery: super-resolution methods

Super-resolution (SR) approaches have gained popularity in recent years, enabling the upsampling of images and video spatially or temporally. This is useful for up-converting legacy content for compatibility with modern formats and displays. SR methods increase the resolution (or sample rate) of a low-resolution (LR) image (Fig.  8 b) or video. In the case of video sequences, successive frames can, for example, be employed to construct a single high-resolution (HR) frame. Although the basic concept of SR is quite simple, there are many problems related to perceptual quality and the restricted availability of data. For example, the LR video may be aliased and exhibit sub-pixel shifts between frames, and hence some points in the HR frame do not correspond to any information in the LR frames.

With deep learning-based technologies, LR and HR images are matched and used for training architectures such as CNNs, to provide high quality upscaling potentially using only a single LR image (Dong et al. 2014 ). Sub-pixel convolution layers can be introduced to improve fine details in the image, as reported by Shi et al. Residual learning and generative models are also employed (e.g., Kim et al. 2016 ; Tai et al. 2017 ). A generative model with a VGG-based Footnote 59 perceptual loss function has been shown to significantly improve quality and sharpness when used in the SRGAN of Ledig et al. ( 2017 ). Wang et al. ( 2018 ) proposed a progressive multi-scale GAN for perceptual enhancement, where pyramidal decomposition is combined with a DenseNet architecture (Huang et al. 2017 ). The above techniques seek to learn the implicit redundancy that is present in natural data to recover missing HR information from a single LR instance. For single image SR, the review by Yang et al. ( 2019 ) suggests that methods such as EnhanceNet (Sajjadi et al. 2017 ) and SRGAN (Ledig et al. 2017 ), that achieve high subjective quality with good sharpness and textural detail, cannot simultaneously achieve low distortion loss (e.g., mean absolute error (MAE) or peak signal-to-noise-ratio (PSNR)). A comprehensive survey of image SR is provided by Wang et al. ( 2020b ). This observes that more complex networks generally produce better PSNR results and that most state-of-the-art methods are based on residual learning and use \(\ell _1\) as one of their training losses (e.g., Dai et al. 2019 ; Zhang et al. 2018a ).
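
The sketch below illustrates the sub-pixel convolution idea in PyTorch: convolutional feature extraction followed by a pixel-shuffle layer that rearranges channels into spatial resolution. Layer sizes are illustrative and do not correspond to any published network.

```python
# A minimal PyTorch sketch of single-image super-resolution in the spirit of
# sub-pixel convolution networks: features are extracted at LR resolution and
# then rearranged into an HR image by nn.PixelShuffle.
import torch
import torch.nn as nn

class TinySR(nn.Module):
    def __init__(self, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3 * scale ** 2, kernel_size=3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)   # channels -> spatial resolution

    def forward(self, x):
        return self.upsample(self.body(x))

lr = torch.randn(1, 3, 64, 64)      # a low-resolution RGB patch
print(TinySR(scale=2)(lr).shape)    # torch.Size([1, 3, 128, 128])
```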

When applied to video sequences, super-resolution methods can exploit temporal correlations across frames as well as local spatial correlations within them. Early contributions applying deep learning to achieve video SR gathered multiple frames into a 3D volume which formed the input to a CNN (Kappeler et al. 2016 ). Later work exploited temporal correlation via a motion compensation process before concatenating multiple warped frames into a 3D volume (Caballero et al. 2017 ) using a recurrent architecture (Huang et al. 2015 ). The framework proposed by Liu et al. ( 2018a ) upscales each frame before applying another network for motion compensation. The original target frame is fed, along with its neighbouring frames, into intermediate layers of the CNN to perform inter-frame motion compensation during feature extraction (Haris et al. 2019 ). EDVR (Wang et al. 2019 ), the winner of the NTIRE19 video restoration and enhancement challenges in 2019, Footnote 60 employs a deformable convolutional network (Dai et al. 2017 ) to align two successive frames. Deformable convolution is also employed in DNLN (Deformable Non-Local Network) (Wang et al. 2019a ). At the time of writing, EDVR (Wang et al. 2019 ) and DNLN (Wang et al. 2019a ) are reported to outperform other methods for video SR, followed by the method of Haris et al. ( 2019 ). This suggests that deformable convolution plays an important role in overcoming inter-frame misalignment, producing sharp textural details.

3.3.4 Restoration

The quality of a signal can often be reduced due to distortion or damage. This could be due to environmental conditions during acquisition (low light, atmospheric distortions or high motion), sensor characteristics (quantization due to limited resolution or bit-depth, or electronic noise in the sensor itself) or ageing of the original medium, such as tape or film. The general degradation model can be written as \(I_{obs} = h * I_{ideal} + n\) , where \(I_{obs}\) is an observed (distorted) version of the ideal signal \(I_{ideal}\) , h is the degradation operator, \(*\) represents convolution, and n is noise. The restoration process tries to reconstruct \(I_{ideal}\) from \(I_{obs}\) . h and n are values or functions that depend on the application. Signal restoration can be addressed as an inverse problem and deep learning techniques have been employed to solve it. Below we divide restoration into four classes that relate to work in the creative industries, with examples illustrated in Fig.  9 . Further details of deep learning for inverse problem solving can be found in Lucas et al. ( 2018 ).
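
The degradation model above can be simulated directly, which is also how synthetic training pairs for restoration networks are commonly produced. The sketch below blurs an image with a point spread function h and adds Gaussian noise n; the image, PSF and noise level are placeholders.

```python
# A minimal sketch of the degradation model I_obs = h * I_ideal + n:
# convolution with a PSF followed by additive Gaussian noise.
import numpy as np
from scipy.signal import convolve2d

ideal = np.random.rand(128, 128)            # stand-in for the ideal image
h = np.ones((5, 5)) / 25.0                  # a simple uniform-blur PSF
noise = 0.01 * np.random.randn(128, 128)    # additive Gaussian noise n

observed = convolve2d(ideal, h, mode="same", boundary="symm") + noise
print(observed.shape)
```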

3.3.4.1 Deblurring

Images can be distorted by blur, due to poor camera focus or camera or subject motion. Blur-removal is an ill-posed problem represented by a point spread function (PSF) h , which is generally unknown. Deblurring methods sharpen an image to increase subjective quality, and also to assist subsequent operations such as optical character recognition (OCR) (Hradis et al. 2015 ) and object detection (Kupyn et al. 2018 ). Early work in this area analyzed the statistics of the image and attempted to model physical image and camera properties (Biemond et al. 1990 ). More sophisticated algorithms such as blind deconvolution (BD), attempt to restore the image and the PSF simultaneously (Jia 2007 ; Krishnan et al. 2011 ). These methods however assume a space-invariant PSF and the process generally involves several iterations.

As described by the image degradation model, the PSF ( h ) is related to the target image via a convolution operation. CNNs are therefore inherently applicable for solving blur problems (Schuler et al. 2016 ). Deblurring techniques based on CNNs (Nah et al. 2017 ) and GANs (Kupyn et al. 2018 ) usually employ residual blocks, where skip connections are inserted every two convolution layers (He et al. 2016 ). Deblurring an image from coarse-to-fine scales is proposed in Tao et al. ( 2018 ), where the outputs are upscaled and are fed back to the encoder-decoder structure. The high-level features of each iteration are linked in a recurrent manner, leading to a recursive process of learning sharp images from blurred ones. Nested skip connections were introduced by Gao et al. ( 2019 ), where feature maps from multiple convolution layers are merged before applying them to the next convolution layer (in contrast to the residual block approach where one feature map is merged at the next input). This more complicated architecture improves information flow and results in sharper images with fewer ghosting artefacts compared to previous methods.

In the case of video sequences, deblurring can benefit from the abundant information present across neighbouring frames. The DeBlurNet model (Su et al. 2017 ) takes a stack of nearby frames as input and uses synthetic motion blur to generate a training dataset. A spatio-temporal recurrent network exploiting a dynamic temporal blending network is proposed by Hyun Kim et al. ( 2017 ). Zhang et al. ( 2018 ) have concatenated an encoder, recurrent network and decoder to mitigate motion blur. Recently a recurrent network with iterative updating of the hidden state, denoted IFI-RNN, was trained using a regularization process to create sharp images with fewer ringing artefacts (Nah et al. 2019 ). A Spatio-Temporal Filter Adaptive Network (STFAN) has been proposed (Zhou et al. 2019 ), where the convolutional kernel is acquired from the feature values in a spatially varying manner. IFI-RNN and STFAN produce comparable results and currently achieve the best performance in terms of both subjective and objective quality measures [the average PSNRs of both methods are higher than that of Hyun Kim et al. ( 2017 ) by up to 3 dB].

Fig. 9: Restoration for a deblurring (Zhang et al. 2018 ), b denoising with DnCNN (Zhang et al. 2017 ), and c turbulence mitigation (Anantrasirichai et al. 2013 ). Left and right are the original degraded images and the restored images respectively

3.3.4.2 Denoising

Noise can be introduced from various sources during signal acquisition, recording and processing, and is normally attributed to sensor limitations when operating under extreme conditions. It is generally characterized in terms of whether it is additive, multiplicative, impulsive or signal dependent, and in terms of its statistical properties. Noise is not only visually distracting, it can also affect the performance of detection, classification and tracking tools. Denoising nodes are therefore commonplace in post production workflows, especially for challenging low light natural history content (Anantrasirichai et al. 2020a ). In addition, noise can reduce the efficiency of video compression algorithms, since the encoder allocates wasted bits to represent noise rather than signal, especially at low compression levels. This is the reason that film-grain noise suppression tools are employed in certain modern video codecs (such as AV1) prior to encoding by streaming and broadcasting organisations.

The simplest noise reduction technique is weighted averaging, performed spatially and/or temporally as a sliding window, also known as a moving average filter (Yahya et al. 2016 ). More sophisticated methods however perform significantly better and are able to adapt to changing noise statistics. These include adaptive spatio-temporal smoothing through anisotropic filtering (Malm et al. 2007 ), nonlocal transform-domain group filtering (Maggioni et al. 2012 ), the Kalman-bilateral mixture model (Zuo et al. 2013 ), and spatio-temporal patch-based filtering (Buades and Duran 2019 ). Prior to the introduction of deep neural network denoising, methods such as BM3D (block matching 3-D) (Dabov et al. 2007 ) represented the state of the art in denoising performance.
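
The simplest of these techniques, temporal weighted averaging, can be sketched in a few lines. The example below averages each frame with its neighbours inside a sliding temporal window; it assumes the frames are already registered and uses random data as a placeholder.

```python
# A minimal sketch of temporal moving-average denoising over a stack of
# registered greyscale video frames (frames, height, width).
import numpy as np

def temporal_moving_average(frames, window=5):
    """Average each frame with its neighbours inside a sliding temporal window."""
    half = window // 2
    padded = np.pad(frames, ((half, half), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[i:i + window].mean(axis=0) for i in range(frames.shape[0])])

frames = np.random.rand(30, 120, 160)             # 30 noisy frames
denoised = temporal_moving_average(frames, window=5)
print(denoised.shape)                             # (30, 120, 160)
```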

Recent advances in denoising have almost entirely been based on deep learning approaches and these now represent the state of the art. RNNs have been employed successfully to remove noise in audio (Maas et al. 2012 ; Zhang et al. 2018b ). A residual noise map is estimated in the Denoising Convolutional Neural Network (DnCNN) method (Zhang et al. 2017 ) for image denoising; for video denoising, spatial and temporal networks are concatenated (Claus and van Gemert 2019 ), where the latter handles brightness changes and temporal inconsistencies. FFDNet is a modified form of DnCNN that works on reversibly downsampled subimages (Zhang et al. 2018 ). Liu et al. ( 2018b ) developed MWCNN, a similar system that integrates multiscale wavelet transforms within the network to replace max pooling layers in order to better retain visual information; this integrated wavelet/CNN system currently provides state-of-the-art performance for additive white Gaussian noise (AWGN). VNLnet combines a non-local patch search module with DnCNN. The first part extracts features, while the latter mitigates the remaining noise (Davy et al. 2019 ). Zhao et al. ( 2019a ) proposed a simple and shallow network, SDNet, which uses six convolution layers with skip connections to create a hierarchy of residual blocks. TOFlow (Xue et al. 2019 ) offers an end-to-end trainable convolutional network that performs motion analysis and video processing simultaneously. GANs have been employed to estimate a noise distribution which is subsequently used to augment clean data for training CNN-based denoising networks (such as DnCNN) (Chen et al. 2018b ). GANs for denoising data have been proposed for medical imaging (Yang et al. 2018 ), but they are not popular in the natural image domain due to the limited data resolution of current GANs. However, CycleGAN has recently been modified to attempt denoising and enhancing low-light ultra-high-definition (UHD) videos using a patch-based strategy (Anantrasirichai and Bull 2021 ).
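
The residual-learning idea behind DnCNN-style denoisers can be sketched compactly: the network predicts the noise map, which is subtracted from the noisy input. The depth and channel counts below are illustrative only and much smaller than in the published models.

```python
# A minimal PyTorch sketch of residual-learning denoising in the style of DnCNN.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self, channels=1, features=64, depth=6):
        super().__init__()
        layers = [nn.Conv2d(channels, features, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(features, features, 3, padding=1),
                       nn.BatchNorm2d(features), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(features, channels, 3, padding=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):
        return noisy - self.net(noisy)   # residual learning: estimate and remove the noise

noisy = torch.randn(1, 1, 64, 64)
print(TinyDenoiser()(noisy).shape)       # torch.Size([1, 1, 64, 64])
```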

Recently, the Noise2Noise algorithm has shown that it is possible to train a denoising network without clean data, under the assumption that the data is corrupted by zero-mean noise (Lehtinen et al. 2018 ). The input and output images of each training pair are both noisy, and the network learns to minimize the loss function by solving the point estimation problem separately for each input sample. However, this algorithm is sensitive to the loss function used, which can significantly influence the performance of the model. Another algorithm, Noise2Void (Krull et al. 2019 ), employs a novel blind-spot network that does not include the current pixel in the convolution. The network is trained using noisy patches as both input and target, drawn from within the same noisy image. It achieves comparable performance to Noise2Noise but allows the network to learn noise characteristics from a single image.

NTIRE 2020 held a denoising grand challenge within the IEEE CVPR conference that compared many contemporary high performing ML denoising methods on real images (Abdelhamed et al. 2020 ). The best competing teams employed a variety of techniques using variants on CNN architectures such as U-Net (Ronneberger et al. 2015 ), ResNet (He et al. 2016 ) and DenseNet (Huang et al. 2017 ), together with \(\ell _1\) loss functions and ensemble processing including flips and rotations. The survey by Tian et al. ( 2020 ) states that SDNet (Zhao et al. 2019a ) achieves the best results on ISO noise, and FFDNet (Zhang et al. 2018 ) offers the best denoising performance overall, including Gaussian noise and spatially variant noise (non-uniform noise levels).

Neural networks have also been used for other aspects of image denoising: Chen et al. ( 2018a ) have developed specific low light denoising methods using CNN-based methods; Lempitsky et al. ( 2018 ) have developed a deep learning prior that can be used to denoise images without access to training data; and Brooks et al. ( 2019 ) have developed specific neural networks to denoise real images through ‘unprocessing’, i.e. they re-generate raw captured images by inverting the processing stages in a camera to form a supervised training system for raw images.

3.3.4.3 Dehazing

In certain situations, fog, haze, smoke and mist can create mood in an image or video. In other cases, they are considered distortions that reduce contrast, increase brightness and lower color fidelity. Further problems can be caused by condensation forming on the camera lens. The degradation model can be represented as \(I_{obs} = I_{ideal} t + A (1-t)\) , where A is atmospheric light and t is medium transmission. The transmission t is estimated using a dark channel prior, based on the observation that the lowest value of each color channel of haze-free images is close to zero (He et al. 2011 ). In Berman et al. ( 2016 ), the true colors are recovered based on the assumption that an image can be faithfully represented with just a few hundred distinct colors. The authors showed that tight color clusters change because of haze and form lines in RGB space, enabling them to be readjusted. The scene radiance ( \(I_{ideal}\) ) is attenuated exponentially with depth, so some work has included an estimate of the depth map corresponding to each pixel in the image (Kopf et al. 2008 ). CNNs are employed to estimate the transmission t and the dark channel by Yang and Sun ( 2018 ). Cycle-Dehazing (Engin et al. 2018 ) builds on the GAN architecture of CycleGAN (Zhu et al. 2017 ). This formulation combines cycle-consistency loss (see Sect.  3.1.4 ) and perceptual loss (see Sect.  2.3.2 ) in order to improve the quality of textural information recovery and generate visually better haze-free images (Engin et al. 2018 ). A comprehensive study and an evaluation of existing single-image dehazing CNN-based algorithms are reported by Li et al. ( 2019 ). It concludes that DehazeNet (Cai et al. 2016 ) performs best in terms of perceptual loss, MSCNN (Tang et al. 2019 ) offers the best subjective quality and superior detection performance on real hazy images, and AOD-Net (Li et al. 2017 ) is the most efficient.
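
The dark channel prior pipeline described above can be sketched directly from the degradation model: estimate the dark channel, the atmospheric light A and the transmission t, then invert the model. The patch size, omega and the input image below are illustrative placeholders, and practical implementations add refinement steps (e.g., guided filtering of t) that are omitted here.

```python
# A minimal sketch of dark channel prior dehazing in the spirit of He et al. (2011),
# without the transmission-map refinement used in full implementations.
import numpy as np
from scipy.ndimage import minimum_filter

def dehaze(img, patch=15, omega=0.95, t_min=0.1):
    dark = minimum_filter(img.min(axis=2), size=patch)          # dark channel
    flat = dark.ravel()
    bright_idx = np.argsort(flat)[-max(1, flat.size // 1000):]  # brightest 0.1% of dark channel
    A = img.reshape(-1, 3)[bright_idx].max(axis=0)              # atmospheric light estimate
    t = 1.0 - omega * minimum_filter((img / A).min(axis=2), size=patch)
    t = np.clip(t, t_min, 1.0)[..., None]
    return np.clip((img - A) / t + A, 0.0, 1.0)                 # recovered scene radiance

hazy = np.random.rand(120, 160, 3)   # stand-in for a hazy RGB image in [0, 1]
print(dehaze(hazy).shape)
```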

A related application is underwater photography (Li et al. 2016 ), commonly used in natural history filmmaking. CNNs are employed to estimate the corresponding transmission map or ambient light of an underwater hazy image in Shin et al. ( 2016 ). More complicated structures, merging U-Net with multi-scale estimation and incorporating cross-layer connections, produce even better results (Hu et al. 2018 ).

3.3.4.4 Mitigating atmospheric turbulence

When the temperature difference between the ground and the air increases, the air layers move upwards rapidly, leading to a change in the interference pattern of the refracted light. This is generally observed as a combination of blur, ripple and intensity fluctuations in the scene. Restoring a scene distorted by atmospheric turbulence is a challenging problem. The effect, which is caused by random, spatially varying perturbations, makes a model-based solution difficult and, in most cases, impractical. Traditional methods have involved frame selection, image registration, image fusion, phase alignment and image deblurring (Anantrasirichai et al. 2013 ; Xie et al. 2016 ; Zhu and Milanfar 2013 ). Removing the turbulence distortion from a video containing moving objects is very challenging, as multiple frames are generally used and they need to be aligned. Temporal filtering with local weights determined from optical flow is employed to address this by Anantrasirichai et al. ( 2018 ). However, artefacts in the transition areas between foreground and background regions can remain. Removing atmospheric turbulence from a single image using ML is proposed by Gao et al. ( 2019 ). Deep learning techniques to solve this problem are still in their early stages. However, one reported method employs a CNN to support deblurring (Nieuwenhuizen and Schutte 2019 ) and another employs multiple frames using a GAN architecture (Chak et al. 2018 ). The latter however appears only to work well for static scenes.

3.3.5 Inpainting

Inpainting is the process of estimating lost or damaged parts of an image or a video. Example applications for this approach include the repair of damage caused by cracks, scratches, dust or spots on film or chemical damage resulting in image degradation. Similar problems arise due to data loss during transmission across packet networks. Related applications include the removal of unwanted foreground objects or regions of an image and video; in this case the occluded background that is revealed must be estimated. An example of inpainting is shown in Fig.  10 . In digital photography and video editing, perhaps the most widely used tool is Adobe Photoshop, Footnote 61 where inpainting is achieved using content-aware interpolation by analysing the entire image to find the best detail to intelligently replace the damaged area.
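
Before turning to learning-based approaches, the sketch below shows a classical (non-learned) inpainting baseline using OpenCV's diffusion-based method, which fills masked pixels from their surroundings. The file names are placeholders; the mask is non-zero over the damaged pixels.

```python
# A minimal sketch of classical inpainting with OpenCV, useful as a baseline
# for comparison with the learning-based methods discussed below.
import cv2

image = cv2.imread("damaged.png")                     # placeholder damaged image
mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)   # placeholder mask of damaged pixels
restored = cv2.inpaint(image, mask, 3, cv2.INPAINT_TELEA)
cv2.imwrite("restored.png", restored)
```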

Recently, AI technologies have been reported that model the missing parts of an image using content in proximity to the damage, as well as global information to assist in extracting semantic meaning. Xie et al. ( 2012 ) combine sparse coding with deep neural networks pre-trained with denoising auto-encoders. Dilated convolutions are employed in two concatenated networks for spatial reconstruction of the coarse and fine details (Yu et al. 2018 ). Some methods allow users to interact with the process, for example inputting information such as strong edges to guide the solution and produce better results. An example of such image inpainting with user-guided free-form masks is given by Yu et al. ( 2019 ). Gated convolution is used to learn the soft mask automatically from the data and the content is then generated using both low-level features and extracted semantic meaning. Chang et al. ( 2019 ) extend the work of Yu et al. ( 2019 ) to video sequences using a GAN architecture. Video Inpainting, VINet, as reported by Kim et al. ( 2019 ), offers the ability to remove moving objects and replace them with content aggregated from both spatial and temporal information using CNNs and recurrent feedback. Black et al. ( 2020 ) evaluated state-of-the-art methods by comparing performance based on the classification and retrieval of the repaired images. They reported that DFNet (Hong et al. 2019 ), based on U-Net (Ronneberger et al. 2015 ) with fusion blocks added in the decoding layers, outperformed other methods over a wide range of missing pixels.

Fig. 10: Example of inpainting, (left-right) original image, masking and inpainted image

3.3.6 Visual special effects (VFX)

Closely related to animation, the use of ML-based AI in VFX has increased rapidly in recent years. Examples include the BBC’s His Dark Materials and Avengers: Endgame (Marvel). Footnote 62 These both use a combination of physics models with data-driven results from AI algorithms to create high fidelity and photorealistic 3D animations, simulations and renderings. ML-based tools transform the actor’s face into the film’s character using head-mounted cameras and facial tracking markers. With ML-based AI, a single image can be turned into a photorealistic and fully-clothed production-level 3D avatar in real-time (Hu et al. 2017 ). Other techniques related to VFX can be found in Sect.  3.1 (e.g., style transfer and deepfakes), Sect.  3.3 (e.g., colorization and super-resolution) and Sect.  3.4 (e.g., tracking and 3D rendering). AI techniques Footnote 63 are increasingly being employed to reduce the human resources needed for certain labour-intensive or repetitive tasks such as match-move, tracking, rotoscoping, compositing and animation (Barber et al. 2016 ; Torrejon et al. 2020 ).

3.4 Information extraction and enhancement

AI methods based on deep learning have demonstrated significant success in recognizing and extracting information from data. They are well suited to this task since successive convolutional layers efficiently perform statistical analysis from low to high level, progressively abstracting meaningful and representative features. Once information is extracted from a signal, it is frequently desirable to enhance it or transform it in some way. This may, for example, make an image more readily interpretable through modality fusion, or translate actions from a real animal to an animation. This section investigates how AI methods can exploit the explicit information extracted from images and videos, and reuse it in new directions or new forms.

3.4.1 Segmentation

Segmentation methods are widely employed to partition a signal (typically an image or video) into a form that is semantically more meaningful and easier to analyze or track. The resulting segmentation map indicates the locations and boundaries of semantic objects or regions with parametric homogeneity in an image. Pixels within a region could therefore represent an identifiable object and/or have shared characteristics, such as color, intensity, and texture. Segmentation boundaries indicate the shape of objects and this, together with other parameters, can be used to identify what the object is. Segmentation can be used as a tool in the creative process, for example assisting with rotoscoping, masking, cropping and for merging objects from different sources into a new picture. Segmentation, in the case of video content, also enables the user to change the object or region’s characteristics over time, for example through blurring, color grading or replacement. Footnote 64

Classification systems can be built on top of segmentation in order to detect or identify objects in a scene (Fig.  11 a). This can be compared with the way that humans view a photograph or video, to spot people or other objects, to interpret visual details or to interpret the scene. Since different objects or regions will differ to some degree in terms of the parameters that characterize them, we can train a machine to perform a similar process, providing an understanding of what the image or video contains and the activities in the scene. This can in turn support classification, cataloguing and data retrieval. Semantic segmentation classifies all pixels in an image into predefined categories, implying that it performs segmentation and classification simultaneously. The first deep learning approach to semantic segmentation employed a fully convolutional network (Long et al. 2015 ). In the same year, the encoder-decoder model in Noh et al. ( 2015 ) and the U-Net architecture (Ronneberger et al. 2015 ) were introduced. Following these, a number of modified networks based on these architectures have been reported (Asgari Taghanaki et al. 2021 ). GANs have also been employed for the purpose of image translation, in this case to translate a natural image into a segmentation map (Isola et al. 2017 ). The semantic segmentation approach has also been applied to point cloud data to classify and segment 3D scenes, e.g., Fig. 11 b (Qi et al. 2017 ).

3.4.2 Recognition

Object recognition has been one of the most common targets for AI in recent years, driven by the complexity of the task but also by the huge amount of labeled imagery available for training deep networks. The performance in terms of mean Average Precision (mAP) for detecting 200 classes has increased more than 300% over the last 5 years (Liu et al. 2020 ). The Mask R-CNN approach (He et al. 2017 ) has gained popularity due to its ability to separate different objects in an image or a video giving their bounding boxes, classes and pixel-level masks, as demonstrated by Ren et al. ( 2017 ). Feature Pyramid Network (FPN) is also a popular backbone for object detection (Lin et al. 2017 ). An in-depth review of object recognition using deep learning can be found in Zhao et al. ( 2019b ) and Liu et al. ( 2020 ).

YOLO (You Only Look Once) and its variants represent the current state of the art in real-time object detection and tracking (Redmon et al. 2016 ). YOLO works on a frame-by-frame basis and is fast enough to process at typical video rates (currently reported up to 55 fps). It divides an image into regions and predicts bounding boxes using a multi-scale approach, giving probabilities for each region. The latest model, YOLOv4 (Bochkovskiy et al. 2020 ), concatenates YOLOv3 (Redmon and Farhadi 2018 ) with a CNN that is 53 layers deep, with SPP-blocks (He et al. 2015 ) or SAM-blocks (Woo et al. 2018 ) and a multi-scale CNN backbone. YOLOv4 offers real-time computation and high precision [up to 66 mAP on Microsoft’s COCO object dataset (Lin et al. 2014 )].
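
Detectors of this kind produce many overlapping candidate boxes per object, which are pruned by non-maximum suppression (NMS). The generic sketch below implements this standard post-processing step; the boxes and scores are illustrative, not the output of any particular detector.

```python
# A minimal sketch of non-maximum suppression (NMS) over (x1, y1, x2, y2) boxes.
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter + 1e-9)
        order = order[1:][iou < iou_threshold]   # drop boxes overlapping the kept one
    return keep

boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # -> [0, 2]
```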

On the PASCAL visual object classes (VOC) Challenge datasets (Everingham et al. 2012 ), YOLOv3 is the leader for object detection on the VOC2010 dataset with a mAP of 80.8% (YOLOv4 performance on this dataset had not been reported at the time of writing) and NAS-Yolo is the best for the VOC2012 dataset with a mAP of 86.5% Footnote 65 (the VOC2012 dataset has a larger number of segmentations than VOC2010). NAS-Yolo (Yang et al. 2020b ) employs Neural Architecture Search (NAS) and reinforcement learning to find the best augmentation policies for the target. In the PASCAL VOC Challenge for semantic segmentation, FlatteNet (Cai and Pu 2019 ) and FDNet (Zhen et al. 2019 ) lead the field, achieving mAPs of 84.3% and 84.0% on VOC2012 data, respectively. FlatteNet combines a fully convolutional network with pixel-wise visual descriptors converted from feature maps. FDNet links all feature maps from the encoder to each input of the decoder, leading to a very dense network and precise segmentation. On the Microsoft COCO object dataset, MegDetV2 (Li et al. 2019d ) ranks first on both the detection and the semantic segmentation leaderboards. MegDetV2 combines ResNet with FPN and uses deformable convolution to train the end-to-end network with large mini-batches.

Fig. 11: Segmentation and recognition. a Object recognition (Kim et al. 2020a ). b 3D semantic segmentation (Qi et al. 2017 )

Recognition of speech and music has also been successfully achieved using deep learning methods. Mobile phone apps that capture a few seconds of sound or music, such as Shazam, Footnote 66 characterize songs based on an audio fingerprint using a spectrogram (a time-frequency graph) that is used to search for a matching fingerprint in a database. Houndify by SoundHound Footnote 67 exploits speech recognition and searches content across the internet. This technology also provides voice interaction for in-car systems. Google proposed a full visual-speech recognition system that maps videos of lips to sequences of words using spatiotemporal CNNs and LSTMs (Shillingford et al. 2019 ).

Emotion recognition has also been studied for over a decade. AI methods have been used to learn, interpret and respond to human emotion, via speech (e.g., tone, loudness, and tempo) (Kwon et al. 2003 ), face detection (e.g., eyebrows, the tip of nose, the corners of mouth) (Ko 2018 ), and both audio and video (Hossain and Muhammad 2019 ). Such systems have also been used in security systems and for fraud detection.

A further task, relevant to video content, is action recognition. This involves capturing spatio-temporal context across frames, for example: jumping into a pool, swimming, getting out of the pool. Deep learning has again been extensively exploited in this area, with the first report based on a 3D CNN (Ji et al. 2013 ). An excellent state-of-the-art review on action recognition can be found in Yao et al. ( 2019 ). More recent advances include temporal segment networks (Wang et al. 2016 ) and temporal binding networks, where the fusion of audio and visual information is employed (Kazakos et al. 2019 ). EPIC-KITCHENS is a large dataset focused on egocentric vision that provides audio-visual, non-scripted recordings in native environments (Damen et al. 2018 ); it has been extensively used to train action recognition systems. Research on sign language recognition is also related to creative applications, since it studies body posture, hand gesture and facial expression, and hence involves segmentation, detection, classification and 3D reconstruction (Jalal et al. 2018 ; Kratimenos et al. 2020 ; Adithya and Rajesh 2020 ). Moreover, visual and linguistic modelling has been combined to enable translation between spoken/written language and continuous sign language videos (Bragg et al. 2019 ).

3.4.3 Salient object detection

Salient object detection (SOD) is a task based on visual attention mechanisms, in which algorithms aim to identify objects or regions that are likely to be the focus of attention. SOD methods can benefit the creative industries in applications such as image editing (Cheng et al. 2010 ; Mejjati et al. 2020 ), content interpretation (Rutishauser et al. 2004 ), egocentric vision (Anantrasirichai et al. 2018 ), VR (Ozcinar and Smolic 2018 ), and compression (Gupta et al. 2013 ). The purpose of SOD differs from fixation detection, which predicts where humans look, but there is a strong correlation between the two (Borji et al. 2019 ). In general, the SOD process involves two tasks: saliency prediction and segmentation. Recent supervised learning technologies have significantly improved the performance of SOD. Hou et al. ( 2019 ) merge multi-level features of a VGG network with fusion and cross-entropy losses. A survey by Wang et al. ( 2021 ) reveals that most SOD models employ VGG and ResNet as backbone architectures and train the model with the standard binary cross-entropy loss. More recent work has developed end-to-end frameworks with GANs (Wang et al. 2020a ), and some works include depth information from RGB-D cameras (Jiang et al. 2020 ). More details on recent SOD methods for RGB-D data can be found in Zhou et al. ( 2021 ). When detecting salient objects in video, an LSTM module is used to learn saliency shifts (Fan et al. 2019 ). The SOD approach has also been extended to co-salient object detection (CoSOD), aiming to detect the co-occurring salient objects in multiple images (Fan et al. 2020 ).

3.4.4 Tracking

Object tracking is the temporal process of locating objects in consecutive video frames. It takes an initial set of object detections (see Sect.  3.4 ), creates a unique ID for each of these initial detections, and then tracks each of the objects, via their properties, over time. Similar to segmentation, object tracking can support the creative process, particularly in editing. For example, a user can identify and edit a particular area or object in one frame and, by tracking the region, these adjusted parameters can be applied to the rest of the sequence regardless of object motion. Semi-supervised learning is also employed in SiamMask (Wang et al. 2019b ) offering the user an interface to define the object of interest and to track it over time.

Similar to object recognition, deep learning has become an effective tool for object tracking, particularly when tracking multiple objects in a video (Liu et al. 2020 ). Recurrent networks have been integrated with object recognition methods to track the detected objects over time (e.g., Fang 2016 ; Gordon et al. 2018 ; Milan et al. 2017 ). VOT benchmarks (Kristan et al. 2016 ) have been reported for real-time visual object tracking challenges run at both the ICCV and ECCV conferences, and the performance of tracking has been observed to improve year on year. The best performing methods include Re \(^3\) (Gordon et al. 2018 ) and Siamese-RPN (Li et al. 2018 ), achieving 150 and 160 fps at an expected overlap of 0.2, respectively. MOTChallenge Footnote 68 and KITTI Footnote 69 are the most commonly used datasets for training and testing multiple object tracking (MOT). At the time of writing, ReMOTS (Yang et al. 2020a ) is the best performer with a mask-based MOT accuracy of 83.9%. ReMOTS fuses the segmentation results of Mask R-CNN (He et al. 2017 ) and a ResNet-101 (He et al. 2016 ) backbone extended with FPN.
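
The tracking-by-detection principle described above can be illustrated with a simple greedy association step: detections in consecutive frames are linked by intersection-over-union (IoU) and each track keeps a unique ID. This is a toy sketch, far simpler than the learned trackers cited here; the boxes and threshold are illustrative.

```python
# A minimal sketch of IoU-based tracking-by-detection with unique track IDs.
def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
    return inter / (union + 1e-9)

def update_tracks(tracks, detections, next_id, threshold=0.3):
    """Greedily match existing tracks to new detections; unmatched detections start new tracks."""
    updated, unmatched = {}, list(range(len(detections)))
    for tid, box in tracks.items():
        if not unmatched:
            break
        j = max(unmatched, key=lambda k: iou(box, detections[k]))
        if iou(box, detections[j]) >= threshold:
            updated[tid] = detections[j]
            unmatched.remove(j)
    for j in unmatched:                      # new objects entering the scene
        updated[next_id] = detections[j]
        next_id += 1
    return updated, next_id

tracks, next_id = {}, 0
frame1 = [(10, 10, 50, 50), (200, 80, 240, 120)]
frame2 = [(14, 12, 54, 52), (205, 82, 245, 122)]
tracks, next_id = update_tracks(tracks, frame1, next_id)
tracks, next_id = update_tracks(tracks, frame2, next_id)
print(tracks)   # the two objects keep IDs 0 and 1 across frames
```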

3.4.5 Image fusion

Image fusion provides a mechanism to combine multiple images (or regions therein, or their associated information) into a single representation that has the potential to aid human visual perception and/or subsequent image processing tasks. A fused image (e.g., a combination of IR and visible images) aims to express the salient information from each source image without introducing artefacts or inconsistencies. A number of applications have exploited image fusion to combine complementary information into a single image, where the capability of a single sensor is limited by design or observational constraints. Existing pixel-level fusion schemes range from simple averaging of the pixel values of registered (aligned) images to more complex multiresolution pyramids, sparse methods (Anantrasirichai et al. 2020b ) and methods based on complex wavelets (Lewis et al. 2007 ). Deep learning techniques have been successfully employed in many image fusion applications. An all-in-focus image is created using multiple images of the same scene taken with different focal settings (Liu et al. 2017 ) (Fig.  12 a). Multi-exposure deep fusion is used to create high-dynamic range images by Prabhakar et al. ( 2017 ). A review of deep learning for pixel-level image fusion can be found in Liu et al. ( 2018 ). Recently, GANs have also been developed for this application (e.g., Ma et al. 2019c ), with an example of image blending using a guided mask (e.g., Wu et al. 2019 ).
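
A simple pixel-level example of the multi-focus case is sketched below: for each pixel, the source image with the higher local sharpness (Laplacian energy) is selected. This is a basic hand-crafted alternative to the learned fusion methods cited above, with illustrative window size and random placeholder images.

```python
# A minimal sketch of multi-focus image fusion by per-pixel sharpness selection.
import numpy as np
from scipy.ndimage import laplace, uniform_filter

def fuse_multifocus(img_a, img_b, window=9):
    sharp_a = uniform_filter(laplace(img_a) ** 2, size=window)   # local focus measure
    sharp_b = uniform_filter(laplace(img_b) ** 2, size=window)
    return np.where(sharp_a >= sharp_b, img_a, img_b)

a = np.random.rand(120, 160)   # stand-ins for two registered greyscale images
b = np.random.rand(120, 160)   # taken with different focal settings
print(fuse_multifocus(a, b).shape)
```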

The performance of a fusion algorithm is difficult to assess quantitatively as no ground truth exists in the fused domain. Ma et al. ( 2019b ) show that a guided filtering-based fusion (Li et al. 2013 ) achieves the best results based on the visual information fidelity (VIF) metric, but also observe that fused images with very low correlation coefficients, measuring the degree of linear correlation between the fused image and its source images, can still perform well under subjective assessment.

Fig. 12: Information Enhancement. a Multifocal image fusion. b 2D to 3D face conversion generated using the algorithm proposed by Jackson et al. ( 2017 )

3.4.6 3D reconstruction and rendering

In the human visual system, a stereopsis process (together with many other visual cues and priors (Bull and Zhang 2021 )) creates a perception of three-dimensional (3D) depth from the combination of two spatially separated signals received by the visual cortex from our retinas. The fusion of these two slightly different pictures gives the sensation of strong three-dimensionality by matching similarities. To provide stereopsis in machine vision applications, images are captured simultaneously from two cameras with parallel camera geometry, and an implicit geometric process is used to extract 3D information from these images. This process can be extended using multiple cameras in an array to create a full volumetric representation of an object. This approach is becoming increasingly popular in the creative industries, especially for special effects that create digital humans Footnote 70 in high end movies or live performance.

To convert 2D to 3D representations (including 2D+t to 3D), the first step is normally depth estimation, which is performed using stereo or multi-view RGB camera arrays. Consumer RGB-D sensors can also be used for this purpose (Maier et al. 2017 ). Depth estimation based on disparity can also be assisted by motion parallax (using a single moving camera), focus, and perspective. For example, motion parallax is learned using a chain of encoder-decoder networks by Ummenhofer et al. ( 2017 ). Google Earth has computed topographical information from images captured using aircraft and added texture to create a 3D mesh. As the demands for higher depth accuracy have increased and real-time computation has become feasible, deep learning methods (particularly CNNs) have gained more attention. A number of network architectures have been proposed for stereo configurations, including a pyramid stereo matching network (PSMNet) (Chang and Chen 2018 ), a stacked hourglass architecture (Newell et al. 2016 ), a sparse cost volume network (SCV-Net) (Lu et al. 2018 ), a fast densenet (Anantrasirichai et al. 2021 ) and a guided aggregation net (GA-Net) (Zhang et al. 2019 ). On the KITTI Stereo dataset benchmark (Geiger et al. 2012 ), the LEAStereo method from a team at Monash University ranks first at the time of writing (with the proportion of erroneous pixels reported as 1.65%). They exploit a neural architecture search (NAS) technique Footnote 71 in which one neural network is used to design the best-performing network.
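
For comparison with these learned approaches, the sketch below runs classical block-matching stereo with OpenCV to obtain a disparity map from a rectified stereo pair. The file names are placeholders, and the disparity range and block size are illustrative settings.

```python
# A minimal sketch of classical stereo depth estimation via block matching,
# a non-learned baseline for the CNN-based stereo networks discussed above.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)    # placeholder rectified left view
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)  # placeholder rectified right view

stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right).astype("float32") / 16.0   # disparity in pixels

vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", vis)
```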

3D reconstruction is generally divided into: volumetric, surface-based, and multi-plane representations. Volumetric representations can be achieved by extending the 2D convolutions used in image analysis. Surface-based representations, e.g., meshes, can be more memory-efficient, but are not regular structures and thus do not easily map onto deep learning architectures. The state-of-the-art methods for volumetric and surface-based representations are Pix2Vox (Xie et al. 2019 ) and AllVPNet (Soltani et al. 2017 ), reporting Intersection-over-Union (IoU) measures of 0.71 and 0.83 respectively, constructed from 20 views on the ShapeNet dataset benchmark (Chang et al. 2015 ). GAN architectures have been used to generate non-rigid surfaces from a monocular image (Shimada et al. 2019 ). The third type of representation is formed from multiple planes of the scene. It offers a trade-off between the first two representations: efficient storage that remains amenable to training with deep learning. The method in Flynn et al. ( 2019 ), developed by Google, achieves view synthesis with learned gradient descent. A review of state-of-the-art 3D reconstruction from images using deep learning can be found in Han et al. ( 2019 ).

Recently, low-cost video plus depth (RGB-D) sensors have become widely available. Key challenges related to RGB-D video processing have included synchronisation, alignment and data fusion between multimodal sensors (Malleson et al. 2019 ). Deep learning approaches have also been used to achieve semantic segmentation, multi-modal feature matching and noise reduction for RGB-D information (Zollhöfer et al. 2018 ). Light field cameras, which capture the intensity and direction of light rays, produce denser data than RGB-D cameras. Depth information of a scene can be extracted from the displacement of the image array, and 3D rendering has been reported using deep learning approaches in Shi et al. ( 2020 ). Recent state-of-the-art light field methods can be found in the review by Jiang et al. ( 2020 ).

3D reconstruction from a single image is an ill-posed problem. However, it is possible with deep learning due to the network’s ability to learn semantic meaning (similar to object recognition, described in Sect.  3.4 ). Using 2D RGB training images with 3D ground truth, the model can predict what kind of scene and objects are contained in the test image. Deep learning-based methods also provide state-of-the-art performance for generating the corresponding right view from a left view in a stereo pair (Xie et al. 2016 ), and for converting 2D face images to 3D face reconstructions using CNN-based encoder-decoder architectures (Bulat and Tzimiropoulos 2017 ; Jackson et al. 2017 ), autoencoders (Tewari et al. 2020 ) and GANs (Tian et al. 2018 ) (Fig.  12 b). Creating 3D models of bodies from photographs is the focus of Kanazawa et al. ( 2018 ). Here, a CNN is used to translate a single 2D image of a person into parameters of shape and pose, as well as to estimate camera parameters. This is useful for applications such as virtual modelling of clothes in the fashion industry. A recent method reported by Mescheder et al. ( 2019 ) is able to generate a realistic 3D surface from a single image, introducing the idea of a continuous decision boundary within the deep neural network classifier. For 2D image to 3D object generation, generative models offer the best performance to date, with the state-of-the-art method, GAL (Jiang et al. 2018 ), achieving an average IoU of 0.71 on the ShapeNet dataset. The creation of a 3D photograph from 2D images is also possible via tools such as SketchUp Footnote 72 and Smoothie-3D. Footnote 73 Very recently (Feb 2020), Facebook allowed users to add a 3D effect to all 2D images. Footnote 74 They trained a CNN on millions of pairs of public 3D images with their associated depth maps. Their Mesh R-CNN (Gkioxari et al. 2019 ) leverages the Mask R-CNN approach (He et al. 2017 ) for object recognition and segmentation to help estimate depth cues. A common limitation when converting a single 2D image to a 3D representation is associated with occluded areas that require spatial interpolation.

AI has also been used to increase the dimensionality of audio signals. Humans have an ability to locate a sound spatially, as the brain can sense the differences between the arrival times of sounds at the left and the right ears, and between the volumes (interaural levels) that the left and the right ears hear. Moreover, our ear flaps distort the sound, telling us whether the sound emanates from in front of or behind the head. With this knowledge, Gao and Grauman ( 2019 ) created binaural audio from a mono signal driven by the subject’s visual environment to enrich the perceptual experience of the scene. This framework exploits U-Net to extract audio features, merged with visual features extracted from ResNet to predict the sound for the left and the right channels. Subjective tests indicate that this method can improve realism and the sensation of being in a 3D space. Morgado et al. ( 2018 ) expand mono audio, recorded using a 360 \(^\circ\) video camera, to sound over the full viewing surface of a sphere. The process extracts semantic environments from the video with CNNs and then combines high-level features of vision and audio to generate the sound corresponding to different viewpoints. Vasudevan et al. ( 2020 ) also include depth estimation to improve the realism of the super-resolved sound.

3.5 Data compression

Visual information is the primary consumer of communications bandwidth across broadcasting and internet communications. The demand for increased qualities and quantities of visual content is particularly driven by the creative media sector, with increased numbers of users expecting increased quality and new experiences. Cisco predict, in their Visual Networking Index report (Barnett et al. 2018 ), that there will be 4.8 zettabytes (4.8 \(\times 10^{21}\) bytes) of global annual internet traffic by 2022—equivalent to all movies ever made crossing global IP networks in 53 seconds. Video will account for 82 percent of all internet traffic by 2022. This will be driven by increased demands for new formats and more immersive experiences with multiple viewpoints, greater interactivity, higher spatial resolutions, frame rates and dynamic range and wider color gamut. This is creating a major tension between available network capacity and required video bit rate. Network operators, content creators and service providers all need to transmit the highest quality video at the lowest bit rate and this can only be achieved through the exploitation of content awareness and perceptual redundancy to enable better video compression.

Traditional image encoding systems (e.g., JPEG) encode a picture without reference to any other frames. This is normally achieved by exploiting spatial redundancy through transform-based decorrelation, followed by quantization and variable-length symbol encoding. While video can also be encoded as a series of still images, significantly higher coding gains can be achieved if temporal redundancies are also exploited. This is achieved using inter-frame motion prediction and compensation. In this case the encoder processes the low energy residual signal remaining after prediction, rather than the original frame. A thorough coverage of image and video compression methods is provided by Bull and Zhang ( 2021 ).
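
The inter-frame prediction step can be illustrated with exhaustive block matching: for each block of the current frame, the best-matching block of the previous frame within a small search window is used as the prediction, leaving only a low-energy residual to encode. The block size, search range and frames below are illustrative; real codecs use far more sophisticated motion models and rate-distortion optimisation.

```python
# A minimal sketch of block-based motion-compensated prediction between two frames.
import numpy as np

def predict_frame(prev, curr, block=16, search=8):
    """Build a motion-compensated prediction of 'curr' from blocks of 'prev'."""
    h, w = curr.shape
    pred = np.zeros_like(curr)
    for y in range(0, h - block + 1, block):
        for x in range(0, w - block + 1, block):
            target = curr[y:y+block, x:x+block]
            best, best_err = prev[y:y+block, x:x+block], np.inf
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy <= h - block and 0 <= xx <= w - block:
                        cand = prev[yy:yy+block, xx:xx+block]
                        err = np.sum((cand - target) ** 2)
                        if err < best_err:
                            best, best_err = cand, err
            pred[y:y+block, x:x+block] = best
    return pred

prev = np.random.rand(64, 64)
curr = np.roll(prev, shift=(2, 3), axis=(0, 1))     # simulated global motion
residual = curr - predict_frame(prev, curr)
print(float(np.abs(residual).mean()))               # low residual energy after prediction
```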

Deep neural networks have gained popularity for image and video compression in recent years and can achieve consistently greater coding gain than conventional approaches. Deep compression methods are also now starting to be considered as components in mainstream video coding standards such as VVC and AV2. They have been applied to optimize a range of coding tools including intra prediction (Li et al. 2018 ; Schiopu et al. 2019 ), motion estimation (Zhao et al. 2019b ), transforms (Liu et al. 2018 ), quantization (Liu et al. 2019 ), entropy coding (Zhao et al. 2019a ) and loop filtering (Lu et al. 2019 ). Post processing is also commonly applied at the video decoder to reduce various coding artefacts and enhance the visual quality of the reconstructed frames [e.g., (Xue and Su 2019 ; Zhang et al. 2020 )]. Other work has implemented a complete coding framework based on neural networks using end-to-end training and optimisation (Lu et al. 2020 ). This approach presents a radical departure from conventional coding strategies and, while it is not yet competitive with state-of-the-art conventional video codecs, it holds significant promise for the future.

Perceptually based resampling methods, based on SR approaches using CNNs and GANs, have been introduced recently. Disney Research proposed a deep generative video compression system (Han et al. 2019 ) that involves downscaling using a VAE and entropy coding via a deep sequential model. ViSTRA2 (Zhang et al. 2019b ) exploits adaptation of spatial resolution and effective bit depth, downsampling these parameters at the encoder based on perceptual criteria, and up-sampling at the decoder using a deep convolutional neural network. ViSTRA2 has been integrated with the reference software of both HEVC (HM 16.20) and VVC (VTM 4.01), and evaluated under the Joint Video Exploration Team Common Test Conditions using the Random Access configuration. Results show consistent and significant compression gains against HM and VVC based on Bjøntegaard delta measurements, with average BD-rate savings of 12.6% (PSNR) and 19.5% (VMAF) over HM and 5.5% and 8.6% over VTM. This work has been extended to a GAN architecture by Ma et al. ( 2020a ). Recently, Mentzer et al. ( 2020 ) optimized a neural compression scheme with a GAN, yielding reconstructions with high perceptual fidelity. Ma et al. ( 2021 ) combine several quantitative losses to achieve maximal perceptual video quality when training a relativistic sphere GAN.

As with all deep learning applications, training data are a key factor in compression performance. Research by Ma et al. ( 2020 ) has demonstrated the importance of large and diverse datasets when developing CNN-based coding tools. Their BVI-DVC database is publicly available and produces significant improvements in coding gain across a wide range of deep learning networks for coding tools such as loop filtering and post-decoder enhancement. An extensive review of AI for compression can be found in Bull and Zhang ( 2021 ) and Ma et al. ( 2020b ).

4 Future challenges for AI in the creative sector

There will always be philosophical and ethical questions relating to the creative capacity, ideas and thought processes, particularly where computers or AI are involved. The debate often focuses on the fundamental difference between humans and machines. In this section we will briefly explore some of these issues and comment on their relevance to and impact on the use of AI in the creative sector.

4.1 Ethical issues, fakes and bias

An AI-based machine can work ‘intelligently’, providing an impression of understanding but nonetheless performing without ‘awareness’ of wider context. It can, however, offer probabilities or predictions of what could happen in the future from several candidates, based on a model trained on an available database. With current technology, AI cannot truly offer broad context, emotion or social relationships. However, it can affect modern human life culturally and societally. UNESCO has specifically commented on the potential impact of AI on culture, education, scientific knowledge, communication and information provision, particularly relating to the problems of the digital divide. Footnote 75 AI seems to amplify the gap between those who can and those who cannot use new digital technologies, leading to increasing inequality of information access. In the context of the creative industries, UNESCO notes that collaboration between intelligent algorithms and human creativity may eventually bring important challenges for the rights of artists.

One would expect that the authorship of AI creations resides with those who develop the algorithms that drive the artwork. Issues of piracy and originality thus need special attention and careful definition, and deliberate or perhaps unintentional exploitation needs to be addressed. We must be cognizant of how easily AI technologies can be accessed and misused in the wrong hands. AI systems are now becoming very competent at creating fake images, videos, conversations, and all manner of content. Against this, as reported in Sect.  3.1.7 , there are also other AI-based methods under development that can, with some success, detect these fakes.

The primary learning algorithms for AI are data-driven. This means that, if the data used for training are unevenly distributed or unrepresentative due to human selection criteria or labeling, the results after learning can equally be biased and ultimately judgemental. For example, streaming media services suggest movies that their users may enjoy, and these suggestions must not privilege specific works over others. Similarly, face recognition or autofocus methods must be trained on a broad range of skin types and facial features to avoid failure for certain ethnic groups or genders. Bias in algorithmic decision-making is also a concern of governments across the world. Footnote 76 Well-designed AI systems can not only increase the speed and accuracy with which decisions are made, but they can also reduce human bias in decision-making processes. However, throughout the lifetime of a trained AI system, the complexity of the data it processes is likely to grow, so even a network originally trained with balanced data may consequently develop some bias. Periodic retraining may therefore be needed. A review of various sources of bias in ML is provided in Ntoutsi et al. ( 2020 ).
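
One simple practical mitigation is to audit the label distribution of a training set and re-weight under-represented classes in the loss; the sketch below uses inverse-frequency weighting, with hypothetical class names and counts.

```python
# Minimal sketch of dataset auditing and inverse-frequency class weighting,
# one common way to reduce the impact of imbalanced training data.
from collections import Counter

def class_weights(labels):
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}

labels = ['group_a'] * 900 + ['group_b'] * 80 + ['group_c'] * 20   # hypothetical skew
print(class_weights(labels))   # rare classes receive proportionally larger weights
```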

Dignum ( 2018 ) provides a useful classification of the relationships between ethics and AI, defining three categories: (i) Ethics by Design, methods that ensure ethical behaviour in autonomous systems; (ii) Ethics in Design, methods that support the analysis of the ethical implications of AI systems; and (iii) Ethics for Design, codes and protocols to ensure the integrity of developers and users. A discussion of ethics associated with AI in general can be found in Bostrom and Yudkowsky ( 2014 ).

AI can, of course, also be used to help identify and resolve ethical issues. For example, Instagram uses an anti-bullying AI Footnote 77 to identify negative comments before they are published and asks users to confirm if they really want to post such messages.

4.2 The human in the loop: AI and creativity

Throughout this review we have recognized and reported on the successes of AI in supporting and enhancing processes within constrained domains where there is good availability of data as a basis for ML. We have seen that AI-based techniques work very well when they are used as tools for information extraction, analysis and enhancement. Deep learning methods that characterize data from low-level features and connect these to extract semantic meaning are well suited to these applications. AI can thus be used with success to perform tasks that are too difficult or too time-consuming for humans, such as searching through a large database and examining its data to draw conclusions. Post-production workflows will therefore see increased use of AI, including enhanced tools for denoising, colorization, segmentation, rendering and tracking. Motion and volumetric capture methods will benefit from enhanced parameter selection and rendering tools. Virtual production methods and games technologies will see greater convergence and increased reliance on AI methodologies.

In all the above examples, AI tools will not be used in isolation as simple black-box solutions. Instead, they must be designed as part of the associated workflow and incorporate a feedback framework with the human in the loop. For the foreseeable future, humans will need to check the outputs from AI systems, make critical decisions, and feed back ‘faults’ that will be used to adjust the model. In addition, interactions between audiences or users and machines are likely to become increasingly common. For example, AI could help to create characters that learn context in location-based storytelling and begin to understand the audience and adapt according to interactions.

Currently, the most effective AI algorithms still rely on supervised learning, where ground truth data readily exist or where humans have labeled the dataset prior to using it for training the model (as described in Sect.  2.3.1 ). In contrast, truly creative processes do not have pre-defined outcomes that can simply be classed as good or bad. Although many may follow contemporary trends or be in some way derivative, based on known audience preferences, there is no obvious way of measuring the quality of the result in advance. Creativity almost always involves combining ideas, often in an abstract yet coherent way, from different domains or multiple experiences, driven by curiosity and experimentation. Hence, labeling of data for these applications is not straightforward or even possible in many cases. This leads to difficulties in using current ML technologies.

In the context of creating a new artwork, generating low-level features from semantics is a one-to-many mapping, leading to inconsistencies between outputs. For example, when asking a group of artists to draw a cat, the results will all differ in color, shape, size, context and pose. Results of the creative process are thus unlikely to be structured, and hence may not be suitable for use with ML methods. We have previously referred to the potential of generative models, such as GANs, in this respect, but these are not yet sufficiently robust to consistently create results that are realistic or valuable. Also, most GAN-based methods are currently limited to the generation of relatively small images and are prone to artefacts at transitions between foreground and background content. It is clear that substantial additional work is needed to extract significant value from AI in this area.

4.3 The future of AI technologies

Research into, and development of, AI-based solutions continue apace. AI is attracting major investments from governments and large international organisations, alongside venture capital investment in start-up enterprises. ML algorithms will be the primary driver for most AI systems in the future, and AI solutions will, in turn, impact an even wider range of sectors. The pace of AI research has been predicated not just on innovative algorithms (the basics are not too dissimilar to those published in the 1980s), but also on our ability to generate, access and store massive amounts of data, and on advances in graphics processing architectures and parallel hardware capable of processing data at this scale. New computational solutions, such as quantum computing, will likely play an increasing role in this respect (Welser et al. 2018 ).

In order to produce an original work, such as music or abstract art, it would be beneficial to support increased diversity and context when training AI systems. The quality of the solution in such cases is difficult to define and will inevitably depend on audience preferences and popular contemporary trends. High-dimensional datasets that can represent some of these characteristics will therefore be needed. Furthermore, the loss functions that drive the convergence of a network’s internal weights must reflect perception rather than simple mathematical differences. Loss functions that better reflect human perception of performance or quality are therefore an important topic for further research.
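
A widely used example of such a criterion is a feature-space loss computed with a pre-trained VGG network (e.g., the perceptual losses of Johnson et al. 2016). The sketch below is a minimal version assuming a recent torchvision API; the layer choice is illustrative and input normalization to ImageNet statistics is omitted for brevity.

```python
# Minimal sketch of a perceptual loss: distances are measured between deep VGG
# features rather than raw pixels, so the loss aligns better with perception.
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer_index=16):                 # up to relu3_3 (illustrative choice)
        super().__init__()
        vgg = vgg16(weights='IMAGENET1K_V1')            # downloads pretrained weights on first use
        self.features = vgg.features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad_(False)                     # fixed feature extractor

    def forward(self, prediction, target):
        return F.l1_loss(self.features(prediction), self.features(target))

loss_fn = PerceptualLoss()
pred, ref = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
print(loss_fn(pred, ref).item())
```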

ML-based AI algorithms are data-driven; hence how to select and prepare data for creative applications will be key to future developments. Defining, cleaning and organizing bias-free data for creative applications are not straightforward tasks. Because data collection and labeling can be highly resource intensive, labeling services are expected to become more popular in the future. Amazon currently offers a cloud management tool, SageMaker, Footnote 78 that uses ML to determine which data in a dataset need to be labeled by humans, and consequently sends these data to human annotators through its Mechanical Turk system or via third-party vendors. This can reduce the resources needed by developers during the key data preparation process. In this and other contexts, AI may converge with blockchain technologies. Blockchains create decentralized, distributed, secure and transparent networks that can be accessed by anyone in public (or private) blockchain networks. Such systems may be a means of trading trusted AI assets, or alternatively AI agents may be trusted to trade other assets (e.g., financial or creative) across blockchain networks. Recently, Microsoft has tried to improve small ML models hosted on public blockchains and plans to expand to more complex models in the future. Footnote 79 Blockchains make it possible to reward participants who help to improve models, while providing a level of trust and security.

As the amount of unlabeled data grows dramatically, unsupervised or self-supervised ML algorithms are prime candidates for underpinning the next generation of ML. Techniques already exist that employ neural networks to learn the statistical distribution of the input data and then transfer this to the distribution of the output data (Damodaran et al. 2018 ; Xu et al. 2019 ; Zhu et al. 2017 ). These techniques do not require precisely matched pairs of inputs and ground truth, relaxing the constraints for a range of applications.
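
A well-known instance of learning without matched pairs is the cycle-consistency objective of Zhu et al. ( 2017 ). The sketch below shows only that term, with toy single-layer generators standing in for real networks; the adversarial losses supplied by two domain discriminators are noted but not implemented.

```python
# Minimal sketch of the cycle-consistency idea: two generators map between
# unpaired domains A and B, and a round trip must reconstruct the input.
import torch
import torch.nn as nn

G_ab = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # A -> B (toy generator)
G_ba = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # B -> A (toy generator)
l1 = nn.L1Loss()

real_a = torch.rand(4, 3, 64, 64)                     # unpaired samples from domain A
real_b = torch.rand(4, 3, 64, 64)                     # unpaired samples from domain B

cycle_loss = l1(G_ba(G_ab(real_a)), real_a) + l1(G_ab(G_ba(real_b)), real_b)
# The full objective adds adversarial terms that match translated outputs to the
# statistics of each target domain, so no input/ground-truth pairs are needed.
print(cycle_loss.item())
```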

It is clear that current AI methods do not mimic the human brain, or even parts of it, particularly closely. The data-driven learning approach with error backpropagation is not apparent in human learning. Humans learn in complex ways that combine genetics, experience and prediction-failure reinforcement. A nice example is provided by Yann LeCun of NYU and Facebook Footnote 80 who describes a 4–6 month old baby being shown a picture of a toy floating in space; the baby shows little surprise that this object defies gravity. Showing the same image to the same child at around 9 months produces a very different result, despite the fact that it is very unlikely that the child has been explicitly trained about gravity. It has instead learnt by experience and is capable of transferring its knowledge across a wide range of scenarios never previously experienced. This form of reinforcement and transfer learning holds significant potential for the next generation of ML algorithms, providing much greater generalization and scope for innovation.

Reinforcement Learning generally refers to a goal-oriented approach, which learns how to achieve a complex objective through reinforcement via penalties and rewards based on its decisions over time. Deep Reinforcement Learning (DRL) integrates this approach into a deep network which, with little initialisation and through self-supervision, can achieve extraordinary performance in certain domains. Rather than depending on manual labeling, DRL automatically extracts weak annotation information from the input data, reinforced over several steps. It thus learns the semantic features of the data, which can be transferred to other tasks. DRL algorithms can beat human experts playing video games and the world champions of Go. The state of the art in this area is progressing rapidly and the potential for strong AI, even with ambiguous data, in the creative sector is significant. However, this will require major research effort, as the human processes that underpin it are not well understood.
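
The sketch below shows the underlying reinforcement-learning update in its simplest tabular form on a toy, hypothetical corridor task; DRL replaces the table with a deep network but keeps the same reward-driven update.

```python
# Minimal tabular Q-learning sketch: action values are learned purely from
# reward feedback, with occasional exploration.
import random
from collections import defaultdict

def q_learning(env_step, n_actions, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    q = defaultdict(float)                                    # Q[(state, action)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if random.random() < epsilon:                     # explore occasionally
                action = random.randrange(n_actions)
            else:                                             # otherwise act greedily,
                values = [q[(state, a)] for a in range(n_actions)]
                best = max(values)                            # breaking ties at random
                action = random.choice([a for a, v in enumerate(values) if v == best])
            next_state, reward, done = env_step(state, action)
            target = reward + gamma * max(q[(next_state, a)] for a in range(n_actions))
            q[(state, action)] += alpha * (target - q[(state, action)])
            state = next_state
    return q

def corridor_step(state, action, goal=5):
    """Toy 1-D corridor: action 1 moves right, 0 left; reward only at the goal."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), goal)
    return next_state, (1.0 if next_state == goal else 0.0), next_state == goal

q = q_learning(corridor_step, n_actions=2)
print(max(range(2), key=lambda a: q[(0, a)]))                 # learned action at state 0: move right
```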

5 Concluding remarks

This paper has presented a comprehensive review of current AI technologies and their applications, specifically in the context of the creative industries. We have seen that ML-based AI has advanced the state of the art across a range of creative applications including content creation, information analysis, content enhancement, information extraction, information enhancement and data compression. ML–AI methods are data driven and benefit from recent advances in computational hardware and the availability of huge amounts of data for training—particularly image and video data.

We have differentiated throughout between the use of ML–AI as a creative tool and its potential as a creator in its own right. We foresee, in the near future, that AI will be adopted much more widely as a tool or collaborative assistant for creativity, supporting acquisition, production, post-production, delivery and interactivity. The concurrent advances in computing power, storage capacities and communication technologies (such as 5G) will support the embedding of AI processing within and at the edge of the network. In contrast, we observe that, despite recent advances, significant challenges remain for AI as the sole generator of original work. ML–AI works well when there are clearly defined problems that do not depend on external context or require long chains of inference or reasoning in decision making. It also benefits significantly from large amounts of diverse and unbiased data for training. Hence, the likelihood of AI (or its developers) winning awards for creative works in competition with human creatives may be some way off. We therefore conclude that, for creative applications, technological developments will, for some time yet, remain human-centric—designed to augment, rather than replace, human creativity. As AI methods begin to pervade the creative sector, developers and deployers must however continue to build trust; technological advances must go hand-in-hand with a greater understanding of ethical issues, data bias and wider social impact.

https://www.pfeifferreport.com/wp-content/uploads/2018/11/Creativity_and_AI_Report_INT.pdf .

https://edition.cnn.com/style/article/obvious-ai-art-christies-auction-smart-creativity/index.html .

https://arxiv.org/ .

https://gtr.ukri.org/ .

https://www.crunchbase.com/ .

While we hope that this categorization is helpful, it should be noted that several of the applications described could fit into, or span, multiple categories.

https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research , https://ieee-dataport.org/ .

https://uk.mathworks.com/solutions/deep-learning/convolutional-neural-network.html .

https://uk.mathworks.com/solutions/image-video-processing/semantic-segmentation.html .

https://ai.googleblog.com/2016/01/alphago-mastering-ancient-game-of-go.html .

https://www.nextrembrandt.com/ .

https://gumgum.com/artificial-creativity .

https://botnik.org/ .

https://www.bdh.net/immersive/venice-1898-through-the-lens .

https://www.imdb.com/title/tt5794766/ .

https://www.ibm.com/blogs/think/2016/08/cognitive-movie-trailer/ .

https://www.youtube.com/watch?v=vUgUeFu2Dcw .

https://www.scriptbook.io .

https://aidungeon.io/ .

https://www.niemanlab.org/2016/10/the-ap-wants-to-use-machine-learning-to-automate-turning-print-stories-into-broadcast-ones/ .

https://www.bbc.com/news/technology-50779761 .

https://www.forbes.com/sites/nicolemartin1/2019/02/08/did-a-robot-write-this-how-ai-is-impacting-journalism/#5292ab617795 .

https://www.washingtonpost.com/pr/wp/2016/10/19/the-washington-post-uses-artificial-intelligence-to-cover-nearly-500-races-on-election-day/ .

https://www.bbc.com/news/world-us-canada-52860247 .

http://www.flow-machines.com/ .

https://openai.com/blog/jukebox/ .

https://magenta.tensorflow.org/nsynth .

https://www.helloworldalbum.net/ .

https://magenta.tensorflow.org/coconet .

https://folkrnn.org/ .

https://ars.electronica.art/keplersgardens/en/folk-algorithms/ .

https://phillipi.github.io/pix2pix/ .

https://ai-art.tokyo/en/ .

http://picbreeder.org/ .

http://endlessforms.com/ .

https://www.artbreeder.com/ .

https://ganvas.studio/ .

https://github.com/yemount/pose-animator/ .

https://www.tensorflow.org/lite/models/pose_estimation/overview .

https://www.adobe.com/products/character-animator.html .

https://cubicmotion.com/persona/ .

https://www.pinscreen.com/ .

https://www.marketresearchfuture.com/reports/augmented-reality-virtual-reality-market-6884 .

https://www.forbes.com/sites/jessedamiani/2020/01/15/the-top-50-vr-games-of-2019/?sh=42279941322d .

https://www.factor-tech.com/feature/lifting-the-curtain-on-augmented-reality-how-ar-is-bringing-theatre-into-the-future/ .

https://dynamics.microsoft.com/en-gb/mixed-reality/overview/ .

https://www.immerseuk.org/wp-content/uploads/2019/11/The-Immersive-Economy-in-the-UK-Report-2019.pdf .

https://ai.facebook.com/blog/powered-by-ai-oculus-insight/ .

https://talktotransformer.com/ .

https://www.youtube.com/watch?time_continue=2&v=ANXucrz7Hjs .

https://support.google.com/youtube/answer/6373554?hl=en .

http://vibeke.info/mood-of-the-planet/ .

https://www.bloomberg.com/news/articles/2017-10-23/game-makers-tap-ai-to-profile-each-player-and-keep-them-hooked .

https://images.google.com/ .

https://open.spotify.com/show/3ViwFAdff2YaXPygfUuv51 .

https://assistant.google.com/ .

https://www.apple.com/siri/ .

https://www.nuance.com/omni-channel-customer-engagement/digital/virtual-assistant/nina.html .

VGG is a popular CNN, originally developed for object recognition by the Visual Geometry Group at the University of Oxford (Simonyan and Zisserman 2015 ). See Sect.  2.3.2 for more detail.

https://data.vision.ee.ethz.ch/cvl/ntire19/ .

https://www.adobe.com/products/photoshop/content-aware-fill.html .

https://blogs.nvidia.com/blog/2020/02/07/ai-vfx-oscars/ .

https://www.vfxvoice.com/the-new-artificial-intelligence-frontier-of-vfx/ .

https://support.zoom.us/hc/en-us/articles/210707503-Virtual-Background .

http://host.robots.ox.ac.uk:8080/leaderboard/main_bootstrap.php .

https://www.shazam.com/gb/company .

https://www.soundhound.com/ .

https://motchallenge.net/ .

http://www.cvlibs.net/datasets/kitti/eval_tracking.php .

https://www.dimensionstudio.co/solutions/digital-humans .

http://www.cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=stereo .

https://www.sketchup.com/plans-and-pricing/sketchup-free .

https://smoothie-3d.com/ .

https://ai.facebook.com/blog/powered-by-ai-turning-any-2d-photo-into-3d-using-convolutional-neural-nets/ .

https://ircai.org/project/preliminary-study-on-the-ethics-of-ai/ .

https://www.gov.uk/government/publications/interim-reports-from-the-centre-for-data-ethics-and-innovation/interim-report-review-into-bias-in-algorithmic-decision-making .

https://about.instagram.com/blog/announcements/instagrams-commitment-to-lead-fight-against-online-bullying .

https://docs.aws.amazon.com/sagemaker/latest/dg/sms.html .

https://www.microsoft.com/en-us/research/blog/leveraging-blockchain-to-make-machine-learning-models-more-accessible/ .

LeCun credits Emmanuel Dupoux for this example.

Abdelhamed A, Afifi M, Timofte R, Brown MS (2020) NTIRE 2020 challenge on real image denoising: dataset, methods and results. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops

Adithya V, Rajesh R (2020) A deep convolutional neural network approach for static hand gesture recognition. Proced Comput Sci 171:2353–2361. https://doi.org/10.1016/j.procs.2020.04.255

Agostinelli F, Hoffman M, Sadowski P, Baldi P (2015) Learning activation functions to improve deep neural networks. In: Proceedings of international conference on learning representations, pp 1–9

Alsaih K, Lemaitre G, Rastgoo M, Sidibé D, Meriaudeau F (2017) Machine learning techniques for diabetic macular EDEMA (DME) classification on SD-OCT images. BioMed Eng 16(1):1–12. https://doi.org/10.1186/s12938-017-0352-9

Amato G, Falchi F, Gennaro C, Rabitti F (2017) Searching and annotating 100M images with YFCC100M-HNfc6 and MI-File. In: Proceedings of the 15th international workshop on content-based multimedia indexing https://doi.org/10.1145/3095713.3095740

Anantrasirichai N, Bull D (2019) DefectNet: multi-class fault detection on highly-imbalanced datasets. In: IEEE international conference on image processing (ICIP), pp 2481–2485

Anantrasirichai N, Bull D (2021) Contextual colorization and denoising for low-light ultra high resolution sequences. In: IEEE international conference on image processing (ICIP)

Anantrasirichai N, Achim A, Kingsbury N, Bull D (2013) Atmospheric turbulence mitigation using complex wavelet-based fusion. IEEE Trans Image Process 22(6):2398–2408

Anantrasirichai N, Gilchrist ID, Bull DR (2016) Fixation identification for low-sample-rate mobile eye trackers. In: IEEE international conference on image processing (ICIP), pp 3126–3130. https://doi.org/10.1109/ICIP.2016.7532935

Anantrasirichai N, Achim A, Bull D (2018) Atmospheric turbulence mitigation for sequences with moving objects using recursive image fusion. In: 2018 25th IEEE international conference on image processing (ICIP), pp 2895–2899

Anantrasirichai N, Biggs J, Albino F, Hill P, Bull D (2018) Application of machine learning to classification of volcanic deformation in routinely-generated InSAR data. J Geophys Res: Solid Earth 123:1–15. https://doi.org/10.1029/2018JB015911

Anantrasirichai N, Daniels KAJ, Burn JF, Gilchrist ID, Bull DR (2018) Fixation prediction and visual priority maps for biped locomotion. IEEE Trans Cybern 48(8):2294–2306. https://doi.org/10.1109/TCYB.2017.2734946

Anantrasirichai N, Biggs J, Albino F, Bull D (2019) A deep learning approach to detecting volcano deformation from satellite imagery using synthetic datasets. Remote Sensing Environ 230:111179

Anantrasirichai N, Zhang F, Malyugina A, Hill P, Katsenou A (2020a) Encoding in the dark grand challenge: an overview. In: IEEE international conference on multimedia and Expo (ICME)

Anantrasirichai N, Zheng R, Selesnick I, Achim A (2020b) Image fusion via sparse regularization with non-convex penalties. Pattern Recogn Lett 131:355–360. https://doi.org/10.1016/j.patrec.2020.01.020

Anantrasirichai N, Geravand M, Braendler D, Bull DR (2021) Fast depth estimation for view synthesis. In: 2020 28th European signal processing conference (EUSIPCO), pp 575–579. https://doi.org/10.23919/Eusipco47968.2020.9287371

Anthony T, Eccles T, Tacchetti A, Kramár J, Gemp I, Hudson TC, Porcel N, Lanctot M, Pérolat J, Everett R, Singh S, Graepel T, Bachrach Y (2020) Learning to play no-press diplomacy with best response policy iteration. In: 34th Conference on neural information processing systems

Antic J (2020) DeOldify image colorization on DeepAI. https://github.com/jantic/DeOldify/ . Accessed 10 Apr 2020

Arjovsky M, Chintala S, Bottou L (2017) Wasserstein GAN. In: Proceedings of machine learning research, vol 70

Asgari Taghanaki S, Abhishek K, Cohen J, Hamarneh G (2021) Deep semantic segmentation of natural and medical images: a review. Artif Intell Rev 54(1):137–178. https://doi.org/10.1007/s10462-020-09854-1

Azam N, Yao J (2012) Comparison of term frequency and document frequency based feature selection metrics in text categorization. Expert Syst Appl 39(5):4760–4768. https://doi.org/10.1016/j.eswa.2011.09.160

Barber A, Cosker D, James O, Waine T, Patel R (2016) Camera tracking in visual effects an industry perspective of structure from motion. In: Proceedings of the 2016 symposium on digital production, association for computing machinery, New York, DigiPro ’16, pp 45–54. https://doi.org/10.1145/2947688.2947697

Barnett JT, Jain S, Andra U, Khurana T (2018) Cisco visual networking index (VNI): complete forecast update, pp 2017–2022. https://www.cisco.com/c/dam/m/en_us/network-intelligence/service-provider/digital-transformation/knowledge-network-webinars/pdfs/1211_BUSINESS_SERVICES_CKN_PDF.pdf

Bastug E, Bennis M, Medard M, Debbah M (2017) Toward interconnected virtual reality: opportunities, challenges, and enablers. IEEE Commun Maga 55(6):110–117

Batmaz Z, Yurekli A, Bilge A, Kaleli C (2019) A review on deep learning for recommender systems: challenges and remedies. Artif Intell Rev 52:1–37. https://doi.org/10.1007/s10462-018-9654-y

Berman D, Treibitz T, Avidan S (2016) Non-local image dehazing. In: The IEEE conference on computer vision and pattern recognition (CVPR)

Bhattacharyya A, Fritz M, Schiele B (2019) “Best-of-many-samples” distribution matching. In: Workshop on Bayesian deep learning

Biemond J, Lagendijk RL, Mersereau RM (1990) Iterative methods for image deblurring. Proc IEEE 78(5):856–883

Black S, Keshavarz S, Souvenir R (2020) Evaluation of image inpainting for classification and retrieval. In: IEEE winter conference on applications of computer vision (WACV), pp 1049–1058

Bochkovskiy A, Wang CY, Liao HYM (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934

Borji A, Cheng M, Hou Q, Li J (2019) Salient object detection: a survey. Comput Vis Media 5:117–150. https://doi.org/10.1007/s41095-019-0149-9

Borysenko D, Mykheievskyi D, Porokhonskyy V (2020) Odesa: object descriptor that is smooth appearance-wise for object tracking task. In: To be submitted to ECCV’20

Bostrom N (2014) Superintelligence. Oxford University Press, Oxford

Bostrom N, Yudkowsky E (2014) The ethics of artificial intelligence. In: Cambridge handbook of artificial intelligence

Bragg D, Koller O, Bellard M, Berke L, Boudreault P, Braffort A, Caselli N, Huenerfauth M, Kacorri H, Verhoef T, Vogler C, Ringel Morris M (2019) Sign language recognition, generation, and translation: An interdisciplinary perspective. In: International ACM SIGACCESS conference on computers and accessibility, pp 16–31. https://doi.org/10.1145/3308561.3353774

Briot JP, Hadjeres G, Pachet FD (2020) Deep learning techniques for music generation. Springer, Cham. https://doi.org/10.1007/978-3-319-70163-9

Brock A, Donahue J, Simonyan K (2019) Large scale GAN training for high fidelity natural image synthesis. In: International conference on learning representations (ICLR)

Brooks T, Mildenhall B, Xue T, Chen J, Sharlet D, Barron JT (2019) Unprocessing images for learned raw denoising. In: The IEEE conference on computer vision and pattern recognition (CVPR)

Buades A, Duran J (2019) CFA video denoising and demosaicking chain via spatio-temporal patch-based filtering. IEEE Trans Circ Syst Video Tech 30(11):1. https://doi.org/10.1109/TCSVT.2019.2956691

Bulat A, Tzimiropoulos G (2017) How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks). In: The IEEE international conference on computer vision (ICCV)

Bull D, Zhang F (2021) Intelligent image and video compression: communicating pictures, 2nd edn. Elsevier, New York

Caballero J, Ledig C, Aitken A, Acosta A, Totz J, Wang Z, Shi W (2017) Real-time video super-resolution with spatio-temporal networks and motion compensation. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2848–2857. https://doi.org/10.1109/CVPR.2017.304

Cai B, Xu X, Jia K, Qing C, Tao D (2016) DehazeNet: an end-to-end system for single image haze removal. IEEE Trans Image Process 25(11):5187–5198

Cai X, Pu Y (2019) Flattenet: a simple and versatile framework for dense pixelwise prediction. IEEE Access 7:179985–179996

Caramiaux B, Lotte F, Geurts J, Amato G, Behrmann M, Falchi F, Bimbot F, Garcia A, Gibert J, Gravier G, Hadmut Holken HK, Lefebvre S, Liutkus A, Perkis A, Redondo R, Turrin E, Vieville T, Vincent E (2019) AI in the media and creative industries. In: New European media (NEM), hal-02125504f

Chak WH, Lau CP, Lui LM (2018) Subsampled turbulence removal network. arXiv:1807.04418v2

Chan C, Ginosar S, Zhou T, Efros A (2019) Everybody dance now. In: IEEE/CVF international conference on computer vision (ICCV), pp 5932–5941

Chang AX, Funkhouser T, Guibas L, Hanrahan P, Huang Q, Li Z, Savarese S, Savva M, Song S, Su H, Xiao J, Yi L, Yu F (2015) ShapeNet: an information-rich 3D model repository. arXiv:1512.03012

Chang J, Chen Y (2018) Pyramid stereo matching network. In: IEEE/CVF conference on computer vision and pattern recognition, pp 5410–5418. https://doi.org/10.1109/CVPR.2018.00567

Chang Y, Liu ZY, Lee K, Hsu W (2019) Free-form video inpainting with 3d gated convolution and temporal patchgan. In: IEEE/CVF international conference on computer vision (ICCV), pp 9065–9074

Chaplot DS, Salakhutdinov R, Gupta A, Gupta S (2020) Neural topological slam for visual navigation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

Chen C, Chen Q, Xu J, Koltun V (2018a) Learning to see in the dark. In: IEEE/CVF conference on computer vision and pattern recognition, pp 3291–3300

Chen C, Jain U, Schissler C, Gari SVA, Al-Halah Z, Ithapu VK, Robinson P, Grauman K (2020) Soundspaces: audio-visual navigation in 3D environments. In: European Conference on Computer Vision (ECCV)

Chen F, De Vleeschouwer C, Cavallaro A (2014) Resource allocation for personalized video summarization. IEEE Trans Multimed 16(2):455–469. https://doi.org/10.1109/TMM.2013.2291967

Chen G, Ye D, Xing Z, Chen J, Cambria E (2017) Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In: 2017 international joint conference on neural networks (IJCNN), pp 2377–2383. https://doi.org/10.1109/IJCNN.2017.7966144

Chen J, Chen J, Chao H, Yang M (2018b) Image blind denoising with generative adversarial network based noise modeling. In: IEEE/CVF conference on computer vision and pattern recognition, pp 3155–3164

Chen H, Ding G, Zhao S, Han J (2018) Temporal-difference learning with sampling baseline for image captioning. In: 32nd AAAI conference on artificial intelligence

Chen K, Wang J, Pang J, Cao Y, Xiong Y, Li X, Sun S, Feng W, Liu Z, Xu J, Zhang Z, Cheng D, Zhu C, Cheng T, Zhao Q, Li B, Lu X, Zhu R, Wu Y, Dai J, Wang J, Shi J, Ouyang W, Loy CC, Lin D (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv:1906.07155

Chen SF, Chen YC, Yeh CK, Wang YCF (2018) Order-free rnn with visual attention for multi-label classification. In: AAAI conference on artificial intelligence

Chen Z, Wei X, Wang P, Guo Y (2019) Multi-label image recognition with graph convolutional networks. In: 2019 IEEE/CVF conference on computer vision and pattern recognition, pp 5172–5181. https://doi.org/10.1109/CVPR.2019.00532

Cheng MM, Zhang FL, Mitra NJ, Huang X, Hu SM (2010) Repfinder: finding approximately repeated scene elements for image editing. ACM Trans Graph 29(4):1–8. https://doi.org/10.1145/1778765.1778820

Cheng X, Wang P, Yang R (2019) Learning depth with convolutional spatial propagation network. IEEE Trans Pattern Anal Mach Intell 42(10):1

Cheng Z, Yang Q, Sheng B (2015) Deep colorization. In: The IEEE international conference on computer vision (ICCV)

Chuah SHW (2018) Why and who will adopt extended reality technology? Literature review, synthesis, and future research agenda. SSRN. https://doi.org/10.2139/ssrn.3300469

Claus M, van Gemert J (2019) ViDeNN: deep blind video denoising. In: CVPR workshop

Cohen NS (2015) From pink slips to pink slime: transforming media labor in a digital age. Commun Rev 18(2):98–122. https://doi.org/10.1080/10714421.2015.1031996

Dabov K, Foi A, Katkovnik V, Egiazarian K (2007) Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans Image Process 16(8):2080–2095

Dai J, Qi H, Xiong Y, Li Y, Zhang G, Hu H, Wei Y (2017) Deformable convolutional networks. In: IEEE international conference on computer vision (ICCV), pp 764–773. https://doi.org/10.1109/ICCV.2017.89

Dai T, Cai J, Zhang Y, Xia S, Zhang L (2019) Second-order attention network for single image super-resolution. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 11057–11066

Damen D, Doughty H, Farinella GM, Fidler S, Furnari A, Kazakos E, Moltisanti D, Munro J, Perrett T, Price W, Wray M (2018) Scaling egocentric vision: the epic-kitchens dataset. In: European conference on computer vision

Damodaran BB, Kellenberger B, Flamary R, Tuia D, Courty N (2018) DeepJDOT: deep joint distribution optimal transport for unsupervised domain adaptation. In: The European conference on computer vision (ECCV)

Davies J, Klinger J, Mateos-Garcia J, Stathoulopoulos K (2020) The art in the artificial AI and the creative industries. Creat Ind Policy Evid Centre 1–38

Davy A, Ehret T, Morel J, Arias P, Facciolo G (2019) A non-local cnn for video denoising. In: IEEE international conference on image processing (ICIP), pp 2409–2413. https://doi.org/10.1109/ICIP.2019.8803314

Deldjoo Y, Constantin MG, Eghbal-Zadeh H, Ionescu B, Schedl M, Cremonesi P (2018) Audio-visual encoding of multimedia content for enhancing movie recommendations. In: Proceedings of the 12th ACM conference on recommender systems, association for computing machinery, New York, NY, USA, RecSys ’18, pp 455–459. https://doi.org/10.1145/3240323.3240407

Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Conference of the North American chapter of the association for computational linguistics: human language technologies, vol 1

Dignum V (2018) Ethics in artificial intelligence: introduction to the special issue. Ethics Inf Technol 20:1–3

Dodds L (2020) The ai that unerringly predicts hollywood’s hits and flops. https://www.telegraph.co.uk/technology/2020/01/20/ai-unerringly-predicts-hollywoods-hits-flops/ . Accessed 10 Apr 2020

Doetsch P, Kozielski M, Ney H (2014) Fast and robust training of recurrent neural networks for offline handwriting recognition. In: 2014 14th international conference on frontiers in handwriting recognition, pp 279–284

Donahue C, McAuley J, Puckette M (2019) Adversarial audio synthesis. In: International conference on learning representations (ICLR)

Dong C, Loy CC, He K, Tang X (2014) Learning a deep convolutional network for image super-resolution. In: The European conference on computer vision (ECCV), pp 184–199

Dörr KN (2016) Mapping the field of algorithmic journalism. Digit J 4(6):700–722. https://doi.org/10.1080/21670811.2015.1096748

Bahdanau D, Cho K, Bengio Y (2015) Neural machine translation by jointly learning to align and translate. In: International conference on learning representations

Elgammal A, Liu B, Elhoseiny M, Mazzone M (2017) CAN: creative adversarial networks, generating “art” by learning about styles and deviating from style norms. arXiv:1706.07068

Engel J, Agrawal KK, Chen S, Gulrajani I, Donahue C, Roberts A (2019) GANSynth: adversarial neural audio synthesis. In: International conference on learning representations

Engin D, Genc A, Kemal Ekenel H (2018) Cycle-Dehaze: enhanced CycleGAN for single image dehazing. In: The IEEE conference on computer vision and pattern recognition (CVPR) workshops

Everingham M, Van Gool L, Williams CKI, Winn J, Zisserman A (2012) The PASCAL visual object classes challenge 2012 (VOC2012) results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html

Fan D, Wang W, Cheng M, Shen J (2019) Shifting more attention to video salient object detection. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 8546–8556. https://doi.org/10.1109/CVPR.2019.00875

Fan DP, Lin Z, Ji GP, Zhang D, Fu H, Cheng MM (2020) Taking a deeper look at co-salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR)

Fang K (2016) Track-RNN: Joint detection and tracking using recurrent neural networks. In: Conference on neural information processing systems

Flynn J, Broxton M, Debevec P, DuVall M, Fyffe G, Overbeck R, Snavely N, Tucker R (2019) DeepView: view synthesis with learned gradient descent. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 2362–2371

Foster D (2019) Generative deep learning: teaching machines to paint, write, compose, and play. O’Reilly Media Inc

Frogner C, Zhang C, Mobahi H, Araya-Polo M, Poggio T (2015) Learning with a wasserstein loss. In: Proceedings of the 28th international conference on neural information processing systems, NIPS’15, vol 2. MIT Press, Cambridge, pp 2053–2061

Fukushima K (1980) Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol Cybern 36:193–202. https://doi.org/10.1007/BF00344251

Gao H, Tao X, Shen X, Jia J (2019) Dynamic scene deblurring with parameter selective sharing and nested skip connections. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3843–3851

Gao J, Anantrasirichai N, Bull D (2019) Atmospheric turbulence removal using convolutional neural network. arXiv:1912.11350

Gao R, Grauman K (2019) 2.5D visual sound. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 324–333

Gatys L, Ecker A, Bethge M (2016) A neural algorithm of artistic style. J Vis. https://doi.org/10.1167/16.12.326

Geiger A, Lenz P, Urtasun R (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In: Conference on computer vision and pattern recognition (CVPR)

Ghani NA, Hamid S, Hashem IA, Ahmed E (2019) Social media big data analytics: a survey. Comput Hum Behav 101:417–428. https://doi.org/10.1016/j.chb.2018.08.039

Gkioxari G, Johnson J, Malik J (2019) Mesh r-CNN. In: IEEE/CVF international conference on computer vision (ICCV), pp 9784–9794

Golbeck J, Robles C, Turner K (2011) Predicting personality with social media. In: CHI ’11 extended abstracts on human factors in computing systems, pp 253–262. https://doi.org/10.1145/1979742.1979614

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In: Ghahramani Z, Welling M, Cortes C, Lawrence ND, Weinberger KQ (eds) Advances in neural information processing systems, vol 27. Curran Associates, Inc., pp 2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf

Gordo A, Almazán J, Revaud J, Larlus D (2016) Deep image retrieval: learning global representations for image search. In: The European conference on computer vision (ECCV). Springer, pp 241–257

Gordon D, Farhadi A, Fox D (2018) Re 3 : real-time recurrent regression networks for visual tracking of generic objects. IEEE Robot Autom Lett 3(2):788–795

Goyal M, Tatwawadi K, Chandak S, Ochoa I (2019) DeepZip: lossless data compression using recurrent neural networks. In: 2019 data compression conference (DCC), pp 575–575

Graves A, Mohamed A, Hinton G (2013) Speech recognition with deep recurrent neural networks. In: IEEE international conference on acoustics, speech and signal processing, pp 6645–6649

Gregor K, Papamakarios G, Besse F, Buesing L, Weber T (2019) Temporal difference variational auto-encoder. In: International conference on learning representations

Güera D, Delp EJ (2018) Deepfake video detection using recurrent neural networks. In: 2018 15th IEEE international conference on advanced video and signal based surveillance (AVSS), pp 1–6

Gunasekara I, Nejadgholi I (2018) A review of standard text classification practices for multi-label toxicity identification of online content. In: Proceedings of the 2nd workshop on abusive language online (ALW2). Association for Computational Linguistics, Brussels, Belgium, pp 21–25. https://doi.org/10.18653/v1/W18-5103 . https://www.aclweb.org/anthology/W18-5103

Guo K, Lincoln P, Davidson P, Busch J, Yu X, Whalen M, Harvey G, Orts-Escolano S, Pandey R, Dourgarian J, DuVall M, Tang D, Tkach A, Kowdle A, Cooper E, Dou M, Fanello S, Fyffe G, Rhemann C, Taylor J, Debevec P, Izadi S (2019) The relightables: volumetric performance capture of humans with realistic relighting. In: ACM SIGGRAPH Asia

Gupta R, Thapar Khanna M, Chaudhury S (2013) Visual saliency guided video compression algorithm. Signal Process: Image Commun 28(9):1006–1022. https://doi.org/10.1016/j.image.2013.07.003

Ha D, Eck D (2018) A neural representation of sketch drawings. In: International conference on learning representations

Hall DW, Pesenti J (2018) Growing the artificial intelligence industry in the UK. https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/652097/Growing_the_artificial_intelligence_industry_in_the_UK.pdf

Han J, Lombardo S, Schroers C, Mandt S (2019) Deep generative video compression. In: Conference on neural information processing systems 32:1–12

Han X, Laga H, Bennamoun M (2021) Image-based 3D object reconstruction: state-of-the-art and trends in the deep learning era. IEEE Trans Pattern Anal Mach Intell 43(5):1578–1604

Haris M, Shakhnarovich G, Ukita N (2019) Recurrent back-projection network for video super-resolution. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 3892–3901

Hasan HR, Salah K (2019) Combating deepfake videos using blockchain and smart contracts. IEEE Access 7:41596–41606

Haugeland J (1985) Artificial intelligence: the very idea. MIT Press, New York

He H, Garcia EA (2009) Learning from imbalanced data. IEEE Trans Knowl Data Eng 21(9):1263–1284. https://doi.org/10.1109/TKDE.2008.239

He K, Sun J, Tang X (2011) Single image haze removal using dark channel prior. IEEE Trans Pattern Anal Mach Intell 33(12):2341–2353

He K, Zhang X, Ren S, Sun J (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell 37(9):1904–1916

He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 770–778

He K, Gkioxari G, Dollár P, Girshick R (2017) Mask r-CNN. In: IEEE international conference on computer vision (ICCV), pp 2980–2988

He Z, Zuo W, Kan M, Shan S, Chen X (2019) AttGAN: facial attribute editing by only changing what you want. IEEE Trans Image Process 28(11):5464–5478. https://doi.org/10.1109/TIP.2019.2916751

Héctor R (2014) MADE—massive artificial drama engine for non-player characters. FOSDEM VZW. https://doi.org/10.5446/32569 . Accessed 26 May 2020

Hessel M, Modayil J, van Hasselt H, Schaul T, Ostrovski G, Dabney W, Horgan D, Piot B, Azar M, Silver D (2018) Rainbow: combining improvements in deep reinforcement learning. In: 32nd AAAI conference on artificial intelligence

Hildebrand HA (1999) Pitch detection and intonation correction apparatus and method. US Patent 5973252A

Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

Holden D, Saito J, Komura T, Joyce T (2015) Learning motion manifolds with convolutional autoencoders. In: SIGGRAPH Asia 2015 technical briefs. Association for Computing Machinery, SA ’15, New York. https://doi.org/10.1145/2820903.2820918

Honavar V (1995) Symbolic artificial intelligence and numeric artificial neural networks: towards a resolution of the dichotomy. Springer, Boston, pp 351–388. https://doi.org/10.1007/978-0-585-29599-2_11

Hong X, Xiong P, Ji R, Fan H (2019) Deep fusion network for image completion. In: Proceedings of the 27th ACM international conference on multimedia, pp 2033–2042. https://doi.org/10.1145/3343031.3351002

Hossain MS, Muhammad G (2019) Emotion recognition using deep learning approach from audio-visual emotional big data. Inf Fusion 49:69–78. https://doi.org/10.1016/j.inffus.2018.09.008

Hou Q, Cheng M, Hu X, Borji A, Tu Z, Torr PHS (2019) Deeply supervised salient object detection with short connections. IEEE Trans Pattern Anal Mach Intell 41(4):815–828. https://doi.org/10.1109/TPAMI.2018.2815688

Hradis M, Kotera J, Zemcik P, Sroubek F (2015) Convolutional neural networks for direct text deblurring. In: Proceedings of the British machine vision conference (BMVC), pp 6.1–6.13. https://doi.org/10.5244/C.29.6

Hu L, Saito S, Wei L, Nagano K, Seo J, Fursund J, Sadeghi I, Sun C, Chen YC, Li H (2017) Avatar digitization from a single image for real-time rendering. ACM Trans Graph 36(6):1–4. https://doi.org/10.1145/3130800.31310887

Hu Y, Wang K, Zhao X, Wang H, Li Y (2018) Underwater image restoration based on convolutional neural network. In: Proceedings of the 10th Asian conference on machine learning, PMLR, proceedings of machine learning research, vol 95, pp 296–311. http://proceedings.mlr.press/v95/hu18a.html

Huang G, Liu Z, van der Maaten L, Weinberger KQ (2017) Densely connected convolutional networks. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 2261–2269. https://doi.org/10.1109/CVPR.2017.243

Huang SW, Lin CT, Chen SP, Wu YY, Hsu PH, Lai SH (2018) AugGAN: cross domain adaptation with GAN-based data augmentation. In: The European conference on computer vision (ECCV)

Huang Y, Wang W, Wang L (2015) Bidirectional recurrent convolutional networks for multi-frame super-resolution. In: Advances in neural information processing systems, vol 28. Curran Associates, Inc., pp 235–243. http://papers.nips.cc/paper/5778-bidirectional-recurrent-convolutional-networks-for-multi-frame-super-resolution.pdf

Huang Z, Zhou S, Heng W (2019) Learning to paint with model-based deep reinforcement learning. In: IEEE/CVF international conference on computer vision (ICCV), pp 8708–8717

Hyun Kim T, Mu Lee K, Scholkopf B, Hirsch M (2017) Online video deblurring via dynamic temporal blending network. In: The IEEE international conference on computer vision (ICCV)

Iqbal T, Qureshi S (2020) The survey: text generation models in deep learning. J King Saud Univ-Comput Inf Sci. https://doi.org/10.1016/j.jksuci.2020.04.001

Isola P, Zhu J, Zhou T, Efros AA (2017) Image-to-image translation with conditional adversarial networks. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 5967–5976. https://doi.org/10.1109/CVPR.2017.632

Jabeen S, Khan G, Naveed H, Khan Z, Khan UG (2018) Video retrieval system using parallel multi-class recurrent neural network based on video description. In: 2018 14th international conference on emerging technologies (ICET), pp 1–6

Jackson AS, Bulat A, Argyriou V, Tzimiropoulos G (2017) Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In: International conference on computer vision

Jalal MA, Chen R, Moore RK, Mihaylova L (2018) American sign language posture understanding with deep neural networks. In: International conference on information fusion (FUSION), pp 573–579. https://doi.org/10.23919/ICIF.2018.8455725

James G, Witten D, Hastie T, Tibshirani R (2013) An introduction to statistical learning. Springer, New York

Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: Proceedings of the 26th annual international ACM SIGIR conference on research and development in informaion retrieval, pp 119–126. https://doi.org/10.1145/860435.860459

Ji S, Xu W, Yang M, Yu K (2013) 3d convolutional neural networks for human action recognition. IEEE Trans Pattern Anal Mach Intell 35(1):221–231. https://doi.org/10.1109/TPAMI.2012.59

Jia J (2007) Single image motion deblurring using transparency. In: IEEE conference on computer vision and pattern recognition, pp 1–8

Jiang B, Zhou Z, Wang X, Tang J, Luo B (2020) CMSALGAN: RGB-D salient object detection with cross-view generative adversarial networks. IEEE Trans Multimed. https://doi.org/10.1109/TMM.2020.2997184

Jiang F, Tao W, Liu S, Ren J, Guo X, Zhao D (2018) An end-to-end compression framework based on convolutional neural networks. IEEE Trans Circuits Syst Video Technol 28(10):3007–3018

Jiang L, Shi S, Qi X, Jia J (2018) GAL: geometric adversarial loss for single-view 3D-object reconstruction. In: The European conference on computer vision (ECCV). Springer, Cham, pp 820–834

Jiang Y, Zhou T, Ji GP, Fu K, Zhao Q, Fan DP (2020) Light field salient object detection: a review and benchmark. arXiv:2010.04968

Jiang Y, Gong X, Liu D, Cheng Y, Fang C, Shen X, Yang J, Zhou P, Wang Z (2021) Enlightengan: deep light enhancement without paired supervision. IEEE Trans Image Process 30:2340–2349. https://doi.org/10.1109/TIP.2021.3051462

Jin Y, Zhang J, Li M, Tian Y, Zhu H, Fang Z (2017) Towards the automatic anime characters creation with generative adversarial networks. arXiv:1708.05509

Johnson J, Alahi A, Fei-Fei L (2016) Perceptual losses for real-time style transfer and super-resolution. In: European conference on computer vision

Johnson R, Zhang T (2015) Effective use of word order for text categorization with convolutional neural networks. In: Proceedings of the 2015 conference of the North American chapter of the association for computational linguistics: human language technologies, association for computational linguistics, pp 103–112. https://doi.org/10.3115/v1/N15-1011 . https://www.aclweb.org/anthology/N15-1011

Justesen N, Bontrager P, Togelius J, Risi S (2020) Deep learning for video game playing. IEEE Trans Games 12(1):1–20

Kaminskas M, Ricci F (2012) Contextual music information retrieval and recommendation: State of the art and challenges. Comput Sci Rev 6(2):89–119. https://doi.org/10.1016/j.cosrev.2012.04.002

Kanazawa A, Black MJ, Jacobs DW, Malik J (2018) End-to-end recovery of human shape and pose. In: IEEE/CVF conference on computer vision and pattern recognition, pp 7122–7131

Kaneko H, Goto J, Kawai Y, Mochizuki T, Sato S, Imai A, Yamanouchi Y (2020) AI-driven smart production. SMPTE Motion Imaging J 129(2):27–35

Kappeler A, Yoo S, Dai Q, Katsaggelos AK (2016) Video super-resolution with convolutional neural networks. IEEE Trans Comput Imaging 2(2):109–122

Karras T, Aila T, Laine S, Lehtinen J (2018) Progressive growing of GANs for improved quality, stability, and variation. In: International conference on learning representations (ICLR)

Kartynnik Y, Ablavatski A, Grishchenko I, Grundmann M (2019) Real-time facial surface geometry from monocular video on mobile GPUs. In: CVPR workshop on computer vision for augmented and virtual reality

Kazakos E, Nagrani A, Zisserman A, Damen D (2019) EPIC-Fusion: audio-visual temporal binding for egocentric action recognition. In: IEEE/CVF international conference on computer vision (ICCV), pp 5491–5500

Keswani B, Mohapatra AG, Mishra TC, Keswani P, Mohapatra PCG, Akhtar MM, Vijay P (2020) World of virtual reality (VR) in healthcare. Springer, pp 1–23. https://doi.org/10.1007/978-3-030-35252-3_1

Kietzmann J, Lee LW, McCarthy IP, Kietzmann TC (2020) Deepfakes: trick or treat? Bus Horiz 63(2):135–146. https://doi.org/10.1016/j.bushor.2019.11.006

Kim D, Woo S, Lee J, Kweon IS (2019) Deep video inpainting. In: IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 5785–5794. https://doi.org/10.1109/CVPR.2019.00594

Kim J, Lee JK, Lee KM (2016) Accurate image super-resolution using very deep convolutional networks. In: IEEE conference on computer vision and pattern recognition (CVPR), pp 1646–1654


Acknowledgements

This work has been funded by Bristol+Bath Creative R+D under AHRC grant AH/S002936/1. The Creative Industries Clusters Programme is managed by the Arts and Humanities Research Council as part of the Industrial Strategy Challenge Fund. The authors would like to acknowledge the following people who provided valuable contributions that enabled us to improve the quality and accuracy of this review: Ben Trewhella (Opposable Games), Darren Cosker (University of Bath), Fan Zhang (University of Bristol), and Paul Hill (University of Bristol).

Author information

Authors and Affiliations

Bristol Vision Institute, University of Bristol, Bristol, UK

Nantheera Anantrasirichai & David Bull

Corresponding author

Correspondence to Nantheera Anantrasirichai.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Anantrasirichai, N., Bull, D. Artificial intelligence in the creative industries: a review. Artif Intell Rev 55, 589–656 (2022). https://doi.org/10.1007/s10462-021-10039-7

Published: 02 July 2021

Issue Date: January 2022

DOI: https://doi.org/10.1007/s10462-021-10039-7

Keywords
  • Creative industries
  • Machine learning
  • Image and video enhancement


Art for our sake: artists cannot be replaced by machines – study

There has been an explosion of interest in ‘creative AI’, but does this mean that artists will be replaced by machines? No, definitely not, says Anne Ploin, Oxford Internet Institute researcher and one of the team behind today's report on the potential impact of machine learning (ML) on creative work.

The report, ‘AI and the Arts: How Machine Learning is Changing Artistic Work’, was co-authored with OII researchers Professor Rebecca Eynon and Dr Isis Hjorth as well as Professor Michael A. Osborne from Oxford's Department of Engineering.

Their study took place in 2019, a high point for AI in art. It was also a time of intense interest in the role of AI (Artificial Intelligence) in the future of work, particularly the idea that automation could transform non-manual professions: an earlier study by Professor Michael A. Osborne and Dr Carl Benedikt Frey had predicted that some 30% of jobs could, technically, be replaced in an AI revolution by 2030.

Mx Ploin says it was clear from their research that machine learning was becoming a tool for artists but would not replace them. She maintains, ‘The main message is that human agency in the creative process is never going away. Parts of the creative process can be automated in interesting ways using AI (generating many versions of an image, for example), but the creative decision-making which results in artworks cannot be replicated by current AI technology.’

She adds, ‘Artistic creativity is about making choices [what material to use, what to draw/paint/create, what message to carry across to an audience] and develops in the context in which an artist works. Art can be a response to a political context, to an artist’s background, to the world we inhabit. This cannot be replicated using machine learning, which is just a data-driven tool. You cannot – for now – transfer life experience into data.’

She adds, ‘AI models can extrapolate in unexpected ways, draw attention to an entirely unrecognised factor in a certain style of painting [from having been trained on hundreds of artworks]. But machine learning models aren’t autonomous.

‘They aren’t going to create new artistic movements on their own – those are PR stories. The real changes that we’re seeing are around the new skills that artists develop to ‘hack’ technical tools, such as machine learning, to make art on their own terms, and around the importance of curation in an increasingly data-driven world.’

The research paper presents a case study of the use of current machine learning techniques in artistic work, and investigates the scope of AI-enhanced creativity and whether human/algorithm synergies may help unlock human creative potential. In doing so, the report breaks down the uncertainty surrounding the application of AI in the creative arts into three key questions.

  • How does using generative algorithms alter the creative processes and embodied experiences of artists?
  • How do artists sense and reflect upon the relationship between human and machine creative intelligence?
  • What is the nature of human/algorithmic creative complementarity?

According to Mx Ploin, ‘We interviewed 14 experts who work in the creative arts, including media and fine artists whose work centred around generative ML techniques. We also talked to curators and researchers in this field. This allowed us to develop a fuller understanding of the implications of AI – ranging from automation to complementarity – in a domain at the heart of human experience: creativity.’

They found a range of responses to the use of machine learning and AI. New activities required by using ML models involved both continuity with previous creative processes and rupture from past practices. There were major changes around the generative process, the evolving ways ML outputs were conceptualised, and artists’ embodied experiences of their practice.

And, says the researcher, there were similarities between the use of machine learning and previous periods in art history, such as the code-based and computer arts of the 1960s and 1970s. But the use of ML models was a “step change” from past tools, according to many artists.

But, she maintains, while the machine learning models could help produce ‘surprising variations of existing images’, practitioners felt the artist remained irreplaceable in terms of giving images artistic context and intention – that is, in making artworks.

Ultimately, most agreed that despite the increased affordances of ML technologies, the relationship between artists and their media remained essentially unchanged, as artists ultimately work to address human – rather than technical – questions.

The report concludes that human/ML complementarity in the arts is a rich and ongoing process, with contemporary artists continuously exploring and expanding technological capabilities to make artworks. Although ML-based processes raise challenges around skills, a common language, resources, and inclusion, what is clear is that the future of ML arts will belong to those with both technical and artistic skills. There is more to come.

But, says Mx Ploin, ‘Don’t let it put you off going to art school. We need more artists.’

Further information

AI and the Arts: How Machine Learning is Changing Artistic Work. Ploin, A., Eynon, R., Hjorth, I. & Osborne, M.A. (2022). Report from the Creative Algorithmic Intelligence Research Project. Oxford Internet Institute, University of Oxford, UK.

This report presents the findings of the 'Creative Algorithmic Intelligence: Capabilities and Complementarity' project, which ran from 2019 to 2021 as a collaboration between the University of Oxford's Department of Engineering and the Oxford Internet Institute.

The report also showcases a range of artworks from contemporary artists who use AI as part of their practice and who participated in our study: Robbie Barrat, Nicolas Boillot, Sofia Crespo, Jake Elwes, Lauren Lee McCarthy, Sarah Meyohas, Anna Ridler, Helena Sarin, and David Young.


How AI-generated art is changing the concept of art itself

via LA Times

Sept. 21, 2022

By Steven Vargas

This is one way that artificial intelligence can produce a selection of images from the words and phrases a user feeds it: the program draws on patterns learned from its training data, typically scraped from the internet, to generate candidate images.

For some, AI-generated art is revolutionary.

In June 2022, Cosmopolitan released its first magazine cover generated by an AI program named DALL-E 2. However, the AI did not work on its own. Video director Karen X. Cheng, the artist behind the design, documented on TikTok what specific words she used for the program to create the image of an astronaut triumphantly walking on Mars:

“A wide angle shot from below of a female astronaut with an athletic feminine body walking with swagger towards camera on Mars in an infinite universe, synthwave digital art."
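The workflow described above (type a detailed text prompt, let the model render candidate images, and pick or refine the result) can be sketched in a few lines of code. The following is a minimal illustration, assuming the open-source Hugging Face diffusers library and a publicly released Stable Diffusion checkpoint as a stand-in for the proprietary DALL-E 2 service used for the Cosmopolitan cover; the model name, random seed, and output file name are illustrative assumptions rather than details from the article.

    # Minimal text-to-image sketch (Stable Diffusion via the diffusers library,
    # used here as an open-source stand-in for DALL-E 2). Assumes a CUDA-capable GPU.
    import torch
    from diffusers import StableDiffusionPipeline

    prompt = ("A wide angle shot from below of a female astronaut with an "
              "athletic feminine body walking with swagger towards camera on "
              "Mars in an infinite universe, synthwave digital art")

    # Load a pretrained text-to-image pipeline (weights download on first run).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Fixing the random seed makes the run reproducible; changing it produces
    # a different candidate image for the same prompt.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5,
                 generator=generator).images[0]
    image.save("astronaut_on_mars.png")

In practice, an artist might generate many such candidates by varying the seed and the wording of the prompt, and then curate the results by hand.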



Art in an age of artificial intelligence

Artificial intelligence (AI) will affect almost every aspect of our lives and replace many of our jobs. On one view, machines are well suited to take over automated tasks, while humans remain important to creative endeavors. In this essay, I examine this view critically and consider the possibility that AI will play a significant role in a quintessential creative activity, the appreciation and production of visual art. This possibility is likely even though attributes typically important to viewers (the agency of the artist, the uniqueness of the art, and its purpose) might not be relevant to AI art. Additionally, despite the fact that art at its most powerful communicates abstract ideas and nuanced emotions, I argue that AI need not understand ideas or experience emotions to produce meaningful and evocative art. AI is and will increasingly be a powerful tool for artists. The continuing development of aesthetically sensitive machines will challenge our notions of beauty, creativity, and the nature of art.

Introduction

Artificial intelligence (AI) will permeate our lives. It will profoundly affect healthcare, education, transportation, commerce, politics, finance, security, and warfare (Ford, 2021; Lee and Qiufan, 2021). It will also replace many human jobs. On one view, AI is particularly suited to take over routine tasks. If this view is correct, then human involvement will remain relevant, if not essential, for creative endeavors. In this essay, I examine the potential role of AI in one particularly creative human activity—the appreciation and production of art. AI might not seem well suited for such aesthetic engagement; however, it would be premature to relegate AI to a minor role. In what follows, I survey what it means for humans to appreciate and produce art, what AI seems capable of, and how the two might converge.

Agency and purpose in art

If an average person in the US were asked to name an artistic genius, they might mention Michelangelo or Picasso. Having accepted that they are geniuses, the merit of their work is given the benefit of the doubt. A person might be confused by a cubist painting, but might be willing to keep their initial confusion at bay by assuming that Picasso knew what he was doing. Art historical narratives value individual agency (Fineberg, 1995). By agency, I mean the choices a person makes, their intentionality, motivations, and the quality of their work. Even though some abstract art might look like it could be made by children, viewers distinguish the two by making inferences about the artists’ intentionality (Hawley-Dolan and Winner, 2011).

Given the importance we give to the individual artist, it is not surprising that most people react negatively to forgeries (Newman and Bloom, 2012). This reaction, even when the object is perceptually indistinguishable from an original, underscores the importance of the original creator in conferring authenticity to art. Authenticity does not refer to the mechanical skills of a painter. Rather it refers to the original conception of the work in the mind of the artist. We value the artist’s imagination and their choices in how to express their ideas. We might appreciate the skill involved in producing a forgery, but ultimately devalue such works as a refined exercise in paint-by-numbers.

Children care about authenticity. They value an original object and are less fond of an identical object if they think it was made by a replicator ( Hood and Bloom, 2008 ). Such observations suggest that the value of an original unique object made by a person rather than a machine is embedded in our developmental psychology. This sensibility persists among adults. Objects are typically imbued with something of the essence of their creators. People experience a connection between the creator and receiver transmitted through the object, which lends authenticity to the object ( Newman et al., 2014 ; Newman, 2019 ).

The value of art made by a person rather than a machine also seems etched in our brains. People care about the effort, skill, and intention that underlie actions ( Kruger et al., 2004 ; Snapper et al., 2015 ); these features are more apparent in a human artist than they would be with a machine. In one study, people responded more favorably to identical abstract images if they thought the images were hanging in a gallery than if they were generated by a computer ( Kirk et al., 2009 ). This response was accompanied by greater neural activity in reward areas of the brain, suggesting that the participants experienced more pleasure if they thought the image came from a gallery than if it was produced by a machine. We do not know whether such responses, reported in 2009, will still hold in 2029 or 2059. Even now, biases against AI art are mitigated if people anthropomorphize the machine ( Chamberlain et al., 2018 ). As AI art develops, we might be increasingly fascinated by the fact that people can create devices that themselves can create novel images.

Before the European Renaissance, agency was probably not important for how people thought about art ( Shiner, 2001 ). The very notion of art probably did not resemble how we think of artworks when we walk into a museum or a gallery. Even if the agency of an artist did not much matter, purpose did. Religious art conveyed spiritual messages. Indigenous cultures used art in rituals. Forms of a gaunt Christ on the crucifix, sensual carvings at Khajuraho temples, and Kongo sculptures of human forms impaled with nails served communal purposes. Dissanayake (2008) emphasized the deep roots of ritual in the evolution of art. Purpose in art does not have to be linked to agency. We admire cave paintings at Lascaux or Altamira but do not give much thought to the specific artists who made them. We continue to speculate about the purpose of these images.

Art is sometimes framed as “art for art’s sake,” as if it has no purpose. According to Benjamin (1936/2018) this doctrine, l’art pour l’art , was a reaction to art’s secularization. The attenuation of communal ritualistic functions along with the ease of art’s reproduction brought on a crisis. “Pure” art denied any social function and reveled in its purity.

Some of the functions of art shifted from a communal purpose to individual intent. The Sistine Chapel, while promoting a Christian narrative, was also a product of Michelangelo’s mind. Modern and contemporary art bewilder many because the message of the art is often opaque. One needs to be educated about the point of a urinal on a pedestal or a picture of soup cans to have a glimmer as to why anybody considers these objects important works of art. In these examples, the intent of the artist is foregrounded, while communal purpose recedes and, for most viewers, is hard to decipher. Even though 20th Century art often represented social movements, we emphasize the individual as the author of their message. Guernica, and its antiwar message, is attributed to an individual, even when embedded in a social context. We might ask, what was Basquiat saying about identity? How did Kahlo convey pain and death? How did depression affect Rothko’s art?

Would AI art have a purpose? As I will recount later, AI at the very least could be a powerful tool for an artist, perhaps analogous to the way a sophisticated camera is a tool for a fine art photographer. In that case, a human artist still dictates the purpose of the art. For a person using AI art generating programs, their own cultural context, their education, and their personal histories influence their choices and their modifications of the initial “drafts” of images produced by the generator. If AI develops sentience, then questions about the purpose of AI art and its cultural context, if such work is even produced, will come to the fore and challenge our engagement with such art.

Reproduction and access

I mentioned the importance of authenticity in how a child reacts to reproductions and our distaste for forgeries. These observations point to a special status for original artwork. For Benjamin (1936/2018) the original had a unique presence in time and place. He regarded this presence as the artwork’s “aura.” The aura of art depreciates with reproduction.

Reproduction has been an issue in art for a long time. Woodcuts and lithographs (and, of course, the printing press for literature) meant that art could be reproduced and many copies distributed. These copies made art more accessible. Photography and film vastly increased reproductions of, and access to, art images.

Even before reproductions, paintings, as portable objects within a frame, increased access to art. These objects could be moved to different locations, unlike frescoes or mosaics, which had to be experienced in situ (setting aside the removal of artifacts from sites of origin to imperial collections). Paintings that could be transported in a frame already diminished their aura by being untethered to a specific location of origin.

Concerns about reproduction take on a different force in the digital realm. These concerns extend those raised by photographic reproduction. Analog photography retains the ghost of an original in the form of a negative. Fine art photography often limits prints to a specific number to impart a semblance of originality and introduce scarcity to the physical artifact of a print. Digital photography has no negative. A RAW file might come close. Copies of the digital file, short of being corrupted, are indistinguishable from the original file, calling into question any uniqueness contained in that original. Perhaps non-fungible tokens could be used to establish an original unique identifier for such digital files.

If technology pushes art toward new horizons and commercial opportunities push advances in technology, then it is hard to ignore the likelihood that virtual reality (VR) and augmented reality (AR) will have an impact on our engagement with art. The ease of mass production and the commercial imperative to make more also render the notion of the aura of an individual object or a specific location nonsensical in VR. AI art, by virtue of being digital, will lack uniqueness and will not have the same aura as a specific object tied to a specific time and place. However, the images will be novel. Novelty, as I describe later, is an important feature of creativity.

Artificial intelligence in our lives

As I mentioned at the outset of this essay, machine learning and AI will have a profound effect on almost every aspect of what we do and how we live. Intelligence in current forms of AI is not like human cognition. AI, as implemented in deep learning algorithms, is not taught rules to guide the processing of its inputs. Its learning takes different forms: it can be supervised, reinforced, or unsupervised. In supervised learning, networks are fed massive amounts of labeled data as input and then given feedback about how well their outputs match the desired label. In this way, networks are trained to maximize an “objective function,” which typically targets the correct answer. For example, a network might be trained to recognize “dog” and learn to identify dogs despite the fact that dogs vary widely in color, size, and bodily configuration. After being trained on many examples of images that have been labeled a priori as dog, the network identifies images of dogs it has never encountered before. The distinctions between supervised, reinforcement, and unsupervised learning are not important to the argument here. Reinforcement learning relies on many trial-and-error iterations and learns to succeed from the errors it makes, especially in the context of games. Unsupervised learning works by identifying patterns in unlabeled data and making predictions based on those patterns.
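
To make the supervised case concrete, here is a minimal sketch in Python with PyTorch. Everything in it, the toy features, the labels, and the tiny network, is invented for illustration; real systems train far larger models on millions of labeled examples.

```python
import torch
import torch.nn as nn

# Toy stand-ins for labeled training data: 64-dimensional "image" features,
# each labeled 1 (dog) or 0 (not dog).
features = torch.randn(200, 64)
labels = torch.randint(0, 2, (200,))

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 2))
objective = nn.CrossEntropyLoss()                  # the "objective function"
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(20):
    optimizer.zero_grad()
    guesses = model(features)                      # the network's current outputs
    loss = objective(guesses, labels)              # gap between outputs and labels
    loss.backward()                                # feedback signal
    optimizer.step()                               # adjust weights to shrink the gap
```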

Artificial intelligence improves with more data. With massive information increasingly available from web searches, commercial purchases, internet posts, texts, official records, all resting on enormous cloud computing platforms, the power of AI is growing and will continue to do so for the foreseeable future. The limits to AI are availability of data and of computational power.

Artificial intelligence does some tasks better than humans. It processes massive amounts of information, generates many simulations, and identifies patterns that would be impossible for humans to appreciate. For example, in biology, AI recently solved the complex problem of three-dimensional protein folding from a two-dimensional code ( Callaway, 2022 ). The output of deep learning algorithms can seem magical ( Rich, 2022 ). Given that they are produced by complex multidimensional equations, their results resist easy explanation.

Current forms of AI have limits. They do not possess common sense. They are not adept at analytical reasoning, extracting abstract concepts, understanding metaphors, experiencing emotions, or making inferences ( Marcus and Davis, 2019 ). Given these limits, how could AI appreciate or produce art? If art communicates abstract and symbolic ideas or expresses nuanced emotions, then an intelligence that cannot abstract ideas or feel emotions would seem ill-equipped to appreciate or produce art. If we care about agency, short of developing sentience, AI has no agency. If we care about purpose, the purpose of an AI system is determined by its objective function. This objective, as of now, is put in place by human designers and the person making use of AI as a tool. If we care about uniqueness, the easy reproducibility of digital outputs depreciates any “aura” to which AI art might aspire.

Despite these reasons to be skeptical, it might be premature to dismiss a significant role of AI in art.

Art appreciation and production

What happens when people appreciate art? Art, when most powerful, can transform viewers, evoke deep emotions, and promote new understanding of the world and of themselves. Historically, scientists working in empirical aesthetics have asked participants in their studies whether they like a work of art, or find it interesting or beautiful ( Chatterjee and Cardillo, 2021 ). The vast repositories of images on platforms like Instagram, Facebook, Flickr, and Pinterest are labeled with people’s preferences. These rich stores of data, growing every day, mean that AI programs can be trained to identify underlying patterns in images that people like.

Crowd-sourcing beauty or preference risks producing boring images. In the 1990s, Komar and Melamid (1999) conducted a pre-digital satirical project in crowd-sourcing art preferences. They hired polling companies to find out what paintings people in 11 countries wanted the most. They found that 44% of Americans preferred blue; 49% preferred outdoor scenes featuring lakes, rivers, or oceans; more than 60% liked large paintings; 51% preferred wild, rather than domestic, animals; and 56% said they wanted historical figures featured in the painting. Based on this information, the painting most Americans wanted showed an idyllic landscape featuring a lake, two frolicking deer, a group of three clothed strollers, and George Washington standing upright in the foreground. For many critics, The Most Wanted Paintings were banal. They were the kind of anodyne images you might find in a motel. Is the Komar and Melamid experiment a cautionary tale for AI?

Artificial intelligence would not be polling people the way that Komar and Melamid did. With a large database of images, including paintings from various collections, the training phase would encompass an aggregate of many more images than collecting the opinions of a few hundred people. AI need not be confined to producing banal images reduced to a low common denominator. Labels for images in databases might end up being far richer than the simple “likes” on Instagram and other social media platforms. Imagine a nuanced taxonomy of words that describe different kinds of art and their potential impacts on viewers. At a small scale, such projects are underway ( Menninghaus et al., 2019 ; Christensen et al., 2022 ; Fekete et al., 2022 ). These research programs go beyond asking people if they like an image, or find it beautiful or interesting. In one such project, we queried a philosopher, a psychologist, a theologian, an art historian, and a neuroscientist for verbal labels that could describe a work of art and labels that would indicate its potential impacts on how they thought or felt. Descriptions of art could include terms like “colorful” or “dynamic,” or refer to the content of art, such as portraits or landscapes, or to specific art historical movements, like Baroque or post-impressionist. Terms describing the impact of art certainly include basic terms such as “like” and “interest,” but also terms like “provoke,” “challenge,” “elevate,” or “disgust.” The motivation behind such projects is that powerful art evokes nuanced emotions beyond just liking or disliking the work. Art can be difficult and challenging, and such art might make some viewers feel anxious and others more curious. Researchers in empirical aesthetics are increasingly focused on identifying a catalog of cognitive and emotional impacts of art. Over the next few years, a large database of art images labeled with a wide range of descriptors and impacts could serve as a training set for an art appreciating AI. Since such networks are adept at extracting patterns in vast amounts of data, one could imagine a trained network describing a novel image it is shown as “playing children on a sunny beach that evokes joy and is reminiscent of childhood summers.” The point is that AI need not know what it is looking at or experience emotions. All it needs to be able to do is label a novel image with descriptions and impacts: a more complex version of labeling an image as a brown dog even if it has never seen that particular dog before.
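
As a rough sketch of what such an art-describing system might look like, the Python snippet below maps placeholder image features to a small, made-up vocabulary of descriptive and impact terms. The vocabulary, the feature dimensions, and the network are assumptions for illustration only; the cited projects use far richer taxonomies and would train such a model on large labeled collections.

```python
import torch
import torch.nn as nn

# A toy vocabulary mixing descriptive terms and impact terms.
vocabulary = ["colorful", "dynamic", "portrait", "landscape",
              "provoke", "challenge", "elevate", "disgust"]

# Placeholder for features from a pretrained vision model (e.g., a CNN or ViT).
image_features = torch.randn(1, 512)

tagger = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                       nn.Linear(128, len(vocabulary)))

# Sigmoid rather than softmax: several terms can apply to the same image.
scores = torch.sigmoid(tagger(image_features))[0]

predicted_tags = [term for term, score in zip(vocabulary, scores) if score > 0.5]
print(predicted_tags)  # untrained here, so the tags are arbitrary
```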

Can AI, in its current form, be creative? One view is that AI is and will continue to be good at automated but not creative tasks. As AI disrupts work and replaces jobs that involve routine procedures, the hope is that creative jobs will be spared. This hope is probably not warranted.

Sequence transduction or transformer models are making strides in processing natural language. GPT-3 (generative pre-trained transformer), trained on some 45 terabytes of data, can produce text based on the likelihood of words co-occurring in sequence. The words produced by transformer models can seem indistinguishable from sentences produced by humans. GPT-3 transformers can produce poetry, philosophical musings, and even self-critical essays ( Thunström, 2022 ).
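
The basic mechanism can be caricatured in a few lines: generate text by repeatedly sampling the next word from conditional probabilities. A transformer learns such probabilities from huge corpora and conditions on the whole preceding context; the hand-written bigram table below is only a toy stand-in for that process.

```python
import random

# Hand-made next-word probabilities, standing in for what a model learns from data.
bigram_probs = {
    "the": {"painting": 0.6, "machine": 0.4},
    "painting": {"evokes": 0.7, "depicts": 0.3},
    "machine": {"dreams": 1.0},
}

def generate(start, steps=3):
    words = [start]
    for _ in range(steps):
        options = bigram_probs.get(words[-1])
        if not options:
            break
        next_word = random.choices(list(options), weights=list(options.values()))[0]
        words.append(next_word)
    return " ".join(words)

print(generate("the"))  # e.g., "the painting evokes"
```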

The ability to generate images from text is a first step in producing artistic images. DALL-E 2, Imagen, Midjourney, and DreamStudio are gaining popularity as art generators that make images when fed words ( Kim, 2022 ). To give readers who might not be familiar with the range of AI art images a sense of these pictures, I offer some examples.

The first set of images were made using Midjourney. I started with the prompt “a still life with fruit, flowers, a vase, dead game, a candle, and a skull in a Renaissance style” ( Figure 1 ). The program generates four options, from which I picked the one that came closest to how I imagined the image. I then generated another four variations from the one I picked and chose the one I liked best. The upscaled version of the figure is included.

Figure 1. Midjourney image generated to the prompt “a still life with fruit, flowers, a vase, dead game, a candle, and a skull in a Renaissance style”.

To show variations of the kind of images produced, I used the same procedures and prompts, except changing the style to Expressionist, Pop-art, and Minimalist ( Figures 2 – 4 ).

Figure 2. Midjourney image generated to the prompt “a still life with fruit, flowers, a vase, dead game, a candle, and a skull in an Expressionist style”.

Figure 3. Midjourney image generated to the prompt “a still life with fruit, flowers, a vase, dead game, a candle, and a skull in a Pop-art style”.

Figure 4. Midjourney image generated to the prompt “a still life with fruit, flowers, a vase, dead game, a candle, and a skull in a Minimalist style”.

To show how one might build up an image, I used OpenAI’s program Dall-E to generate an image to the prompt “a Surreal Impressionist Landscape.” Then, using the same program, I used the prompt “a Surreal Impressionist Landscape that evokes the feeling of awe.” To demonstrate how different programs can produce different images from the same prompt, “a Surreal Impressionist Landscape that evokes the feeling of awe,” I include images produced by Dream Studio and by Midjourney.

Regardless of the merits of each individual image, they only took a few minutes to make. Such images, and many others produced just as easily, could serve as drafts for an artist to consider the different ways they might wish to depict their ideas or give form to their intuitions ( Figures 5 – 8 ). The idea that artists use technology to guide their art is not new. For example, Hockney (2001) described ways that Renaissance masters used the technology of their time to create their work.

Figure 5. Dall-E generated image to the prompt “a Surreal Impressionist Landscape”.

Figure 6. Dall-E generated image to the prompt “a Surreal Impressionist Landscape that evokes the feeling of awe”.

Figure 7. Dream Studio generated image to the prompt “a Surreal Impressionist Landscape that evokes the feeling of awe”.

Figure 8. Midjourney generated image to the prompt “a Surreal Impressionist Landscape that evokes the feeling of awe”.

Unlike the imperative for an autonomous vehicle to avoid mistakes when it needs to recognize a child playing in the street, art makes no such demands. Rather, art is often intentionally ambiguous. Ambiguity can fuel an artwork’s power, forcing viewers to ponder what it might mean. What then will be the role of the human artist? Most theories of creative processing include divergent and convergent thinking ( Cortes et al., 2019 ). Divergent thinking involves coming up with many possibilities. This phase can also be thought of as the generative or imaginative phase. A commonly used laboratory test is the Alternative Uses Test ( Cortes et al., 2019 ). This test asks people to offer as many uses of a common object, like a brick, as they can imagine. The number of uses a person can conjure up, especially when they are unusual, is taken as a measure of divergent thinking and creative potential. When confronting a problem that needs a creative solution, generating many possibilities does not mean that any of them is the right or the best one. An evaluative phase is needed to narrow the possibilities, to converge on a solution, and to identify a useful path forward. In producing a work of art, artists presumably shift back and forth between divergent and convergent processes as they keep working toward their final work.

An artist could use text-to-image platforms as a tool ( Kim, 2022 ). They could type in their intent and then evaluate the possible images generated, as I show in the figures. They might tweak their text several times. The examples included here, produced with similar verbal prompts, show how the same text can be translated into images differently. Artists could choose which of the generated images they like and modify them. The divergent and generative parts of creative output could be powerfully enhanced by using AI, while the artist would evaluate these outputs. AI would be a powerful addition to their creative toolkit.

Some art historians might object that art cannot be adequately appreciated outside its historical and cultural context. For example, Picasso and Matisse are better understood in relation to Cezanne. The American abstract expressionists are better understood as expressing an individualistic spirit while still addressing universal experiences, a movement to counter Soviet socialist realism and its collective ethos. We can begin to see how this important objection might be dealt with using AI. “Creative adversarial networks” can produce novel artworks by learning about historic art styles and then intentionally deviating from them ( Elgammal et al., 2017 ). These adversarial networks would use other artistic styles as a contextual springboard from which to generate images.
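
To make the idea of deviating from known styles concrete, here is a hedged sketch, in the spirit of the creative adversarial network of Elgammal et al. (2017), of a generator objective with two pulls: look like art to the discriminator, but resist confident classification into any one style. The toy inputs and the equal weighting of the two terms are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def can_generator_loss(real_fake_logit, style_logits):
    """Reward images judged to be art whose style classification is maximally ambiguous."""
    # 1) Adversarial term: the discriminator should call the generated image "art."
    art_term = F.binary_cross_entropy_with_logits(
        real_fake_logit, torch.ones_like(real_fake_logit))
    # 2) Style-ambiguity term: push the style distribution toward uniform,
    #    i.e., the image should not fit any known style too well.
    #    (PyTorch >= 1.10 accepts probability targets in cross_entropy.)
    num_styles = style_logits.shape[-1]
    uniform = torch.full_like(style_logits, 1.0 / num_styles)
    ambiguity_term = F.cross_entropy(style_logits, uniform)
    return art_term + ambiguity_term

# Toy usage with made-up discriminator outputs for a batch of 4 generated images:
loss = can_generator_loss(torch.randn(4, 1), torch.randn(4, 25))
print(loss.item())
```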

Artificial intelligence and human artists might be partners ( Mazzone and Elgammal, 2019 ), rather than one serving as a tool for the other. For example, in 2015 Mike Tyka created large-scale artworks using Iterative DeepDream and co-founded the Artists and Machine Intelligence program at Google. Using DeepDream and GANs he produced a series “Portraits of Imaginary People,” which was shown at ARS Electronica in Linz, Christie’s in New York and at the New Museum in Karuizawa (Japan) ( Interalia Magazine, 2018 ). The painter Pindar van Arman teaches robots to paint and believes they augment his own creativity. Other artists are increasingly using VR as an enriched and immersive experience ( Romano, 2022 ).

In 2018, Christie’s in New York sold an artwork called Portrait of Edmond de Belamy for $432,500 ( Kinsella, 2018 ). The portrait of an aristocratic man with blurry features was created with a GAN by a collective called Obvious, using the WikiArt dataset, which includes fifteen thousand portraits from the fourteenth to the twentieth century. Defining art has always been difficult. Art does not easily follow traditional defining criteria of having necessary and sufficient features to be regarded as a member of a specific category, and may not be a natural kind ( Chatterjee, 2014 ). One prominent account of art is the institutional view ( Dickie, 1969 ). If our social institutions agree that an object is art, then it is. Being auctioned and sold by Christie’s certainly qualifies as an institution claiming that AI art is in fact art.

In 2017, Turkish artist Refik Anadol, collaborating with Mike Tyka, created an installation using GANs called “Archive Dreaming.” He used Istanbul’s SALT Galata online library, with 1.7 million images digitized into two terabytes of data. The holdings in this library relate to Turkey from the 19th Century to the present and include photographs, images, maps, and letters. The installation is an immersive experience: viewers stand in a cylindrical room and can gaze at changing displays on the walls. They can choose which documents to view, or passively watch the display in an idle state. In the idle state, the archive “dreams.” Generators produce new images that resemble the original ones but never actually existed—an alternate fictional historical archive of Turkey imagined by the machine ( Pearson, 2022 ).

Concerns, further future, and sentient artificial intelligence

Technology can be misused. One downside of deep learning is that biases embedded in training data sets can be reified. Systematic biases in the judicial system, in hiring practices, in procuring loans are written into AI “predictions” while giving the illusion of objectivity. The images produced by Dall-E so far perpetuate race and gender stereotypes ( Taylor, 2022 ). People probably do not vary much if asked to identify a dog, but they certainly do in identifying great art. Male European masters might continue to be lauded over women or under-represented minority artists and others of whom we have not yet heard.

On the other hand, the current gatekeepers of art, whether at high-end galleries, museums, or biennales, are already biased in which artists and what art they promote. Over time, art through AI might become more democratized. Museums and galleries across the world are digitizing their collections. The art market in the 21st Century extends beyond Europe and the United States. Important shows as part of art’s globalization occur beyond Venice, Basel, and Miami—to now include major gatherings in Sao Paulo, Dakar, Istanbul, Sharjah, Singapore, and Shanghai. Beyond high-profile displays, small galleries are digitizing and advertising their holdings. As more images are incorporated into training databases, including art from Asia, Africa, and South America, and non-traditional art forms, such as street art or textile art, what people begin to regard as good or great art might become more encompassing and inclusive.

Could art become a popularity contest? As museums struggle to keep the public engaged, they might use AI to predict which kinds of art would draw in the most viewers. Such a use of AI might narrow the range of art that is displayed. Similarly, some artists might choose to make art (in the traditional way) but shift their output to what AI predicts will sell. Over time, art could lose its innovation, its subversive nature, and its sheer variety. The nature of the artist might also change if the skills involved in making art change. An artist collaborating with AI might use machine learning outputs for the divergent phase of their creations and insert themselves, along with additional AI assessments, in the convergent evaluative phases of producing art.

The need for artistic services could diminish. Artists who work as illustrators for books, technical manuals, and other media such as advertising could be replaced by AI-generated images. The loss of such paying jobs might make it harder for some artists to pursue their fine art dreams if they do not have a reliable source of income.

Many experts working in the field believe that AI will develop sentience. Exactly how is up for debate. Some believe that sentience can emerge from deep learning architectures given enough data and computational power. Others think that combining deep learning and classical programming, which includes the insertion of rules and symbols, is needed for sentience to emerge. Experts also vary in when they think sentience will emerge in computers. According to Ford (2021) , some think it could be in a decade and others in over 100 years. Nobody can anticipate the nature of that sentience. When Garry Kasparov (the world chess champion at the time) lost to the program Deep Blue, he claimed that he felt an alien intelligence ( Lincoln, 2018 ). Deep Blue was no sentient AI.

Artificial intelligence sentience will truly be an alien intelligence. We have no idea how or whether sentient AI will engage in art. If they do, we have no idea what would motivate them and what purpose their art would have. Any comments about these possibilities are pure speculation on my part.

Sentient AI could make art in the real world. Currently, robots find and move objects in large warehouses. Their movements are coarse and carried out in well-controlled areas. A robot like Rosey, the housekeeper in the Jetsons cartoon, is far more difficult to make since it has to move in an open world and react to unpredictable contingencies. Large movements are easier to program than fine movements, precision grips, and manual dexterity. The difficulty in making a robot artist would fall somewhere between a robot in an Amazon warehouse and Rosey. It would not have to contend with an unconstrained environment in its “studio.” It would learn to choose and grip different brushes and other instruments, manipulate paints, and apply them to a canvas that it stretched. Robot arms have already been programmed to draw portraits ( Arman, 2022 ). However, a sentient AI with intent would decide what to paint, and it would be able to assess whether its output matched its goal, using generative adversarial systems. The art appreciation and art production abilities could be self-contained within a closed loop without involving people.

Sentient AI might not bother with making art in the real world. Mark Zuckerberg would have us spend as much time as possible in a virtual metaverse. Sentient AI could create art residing in fantastical digital realms and not bother with messy materials and real-world implementation. Should sentient AI or sentient AIs choose to make art, for whatever their purpose might be, humans might be irrelevant to the art-making and appreciating or evaluating loop.

Ultimately, we do not know if sentient AI will be benevolent, malevolent, or apathetic when it comes to human concerns. We don’t know if sentient AI will care about art.

As AI continues to insinuate itself in most parts of our lives, it will do so with art ( Agüera y Arcas, 2017 ; Miller, 2019 ). The beginnings of art appreciation and production that we see now, and the examples provided in the figures, might be like the video game Pong that was popular when I was in high school. Pong is a far cry from the rich immersive quality of games like Minecraft in the same way that Dall-E and Midjourney images might be a far cry from a future art making and appreciating machine.

The idea that creative pursuits are an unassailable bastion of humanity is untenable. AI is already being used as a powerful tool and even as a partner for some artists. The ongoing development of aesthetically sensitive machines will challenge our views of beauty and creativity and perhaps our understanding of the nature of art.

Author contributions

The author confirms being the sole contributor of this work and has approved it for publication.

Acknowledgments

I appreciate the helpful feedback I received from Alex Christensen, Kohinoor Darda, Jonathan Fineberg, Judith Schaechter, and Clifford Workman.

Conflict of interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

  • Agüera y Arcas B. (2017). Art in the age of machine intelligence. Arts 6:18. doi: 10.3390/arts6040018
  • Arman P. V. (2022). Cloud painter. Available online at: https://www.cloudpainter.com/ (accessed August 10, 2022).
  • Benjamin W. (1936/2018). “The work of art in the age of mechanical reproduction,” in A museum studies approach to heritage. London: Routledge. doi: 10.4324/9781315668505-19
  • Callaway E. (2022). ‘The entire protein universe’: AI predicts shape of nearly every known protein. Nature 608, 15–16. doi: 10.1038/d41586-022-02083-2
  • Chamberlain R., Mullin C., Scheerlinck B., Wagemans J. (2018). Putting the art in artificial: Aesthetic responses to computer-generated art. Psychol. Aesthet. Creat. Arts 12, 177–192. doi: 10.1037/aca0000136
  • Chatterjee A. (2014). The aesthetic brain: How we evolved to desire beauty and enjoy art. New York, NY: Oxford University Press. doi: 10.1093/acprof:oso/9780199811809.001.0001
  • Chatterjee A., Cardillo E. (2021). Brain, beauty, and art: Essays bringing neuroaesthetics into focus. Oxford: Oxford University Press. doi: 10.1093/oso/9780197513620.001.0001
  • Christensen A. P., Cardillo E. R., Chatterjee A. (2022). What kind of impacts can artwork have on viewers? Establishing a taxonomy for aesthetic cognitivism. PsyArXiv [Preprint]. doi: 10.31234/osf.io/nt59q
  • Cortes R. A., Weinberger A. B., Daker R. J., Green A. E. (2019). Re-examining prominent measures of divergent and convergent creativity. Curr. Opin. Behav. Sci. 27, 90–93. doi: 10.1016/j.cobeha.2018.09.017
  • Dickie G. (1969). Defining art. Am. Philos. Q. 6, 253–256.
  • Dissanayake E. (2008). “The arts after Darwin: Does art have an origin and adaptive function?,” in World art studies: Exploring concepts and approaches, eds Zijlemans K., Van Damme W. (Amsterdam: Valiz).
  • Elgammal A., Liu B., Elhoseiny M., Mazzone M. (2017). CAN: Creative adversarial networks, generating “art” by learning about styles and deviating from style norms. arXiv [Preprint]. arXiv:1706.07068.
  • Fekete A., Pelowski M., Specker E., Brieber D., Rosenberg R., Leder H. (2022). The Vienna Art Picture System (VAPS): A data set of 999 paintings and subjective ratings for art and aesthetics research. Psychol. Aesthet. Creat. Arts. doi: 10.1037/aca0000460 [Epub ahead of print].
  • Fineberg J. D. (1995). Art since 1940. Hoboken, NJ: Prentice-Hall.
  • Ford M. (2021). Rule of the robots: How artificial intelligence will transform everything. New York, NY: Basic Books.
  • Hawley-Dolan A., Winner E. (2011). Seeing the mind behind the art: People can distinguish abstract expressionist paintings from highly similar paintings by children, chimps, monkeys, and elephants. Psychol. Sci. 22, 435–441. doi: 10.1177/0956797611400915
  • Hockney D. (2001). Secret knowledge: Rediscovering the lost techniques of the old masters. London: Thames & Hudson.
  • Hood B. M., Bloom P. (2008). Children prefer certain individuals over perfect duplicates. Cognition 106, 455–462. doi: 10.1016/j.cognition.2007.01.012
  • Interalia Magazine (2018). Portraits of imaginary people. Available online at: https://www.interaliamag.org/audiovisual/mike-tyka/ (accessed August 15, 2022).
  • Kim T. (2022). The future of creativity, brought to you by artificial intelligence. Available online at: https://venturebeat.com/datadecisionmakers/the-future-of-creativity-brought-to-you-by-artificial-intelligence/ (accessed August 9, 2022).
  • Kinsella E. (2018). The first AI-generated portrait ever sold at auction shatters expectations, fetching $432,500—43 times its estimate. Available online at: https://news.artnet.com/market/first-ever-artificial-intelligence-portrait-painting-sells-at-christies-1379902 (accessed August 10, 2022).
  • Kirk U., Skov M., Hulme O., Christensen M. S., Zeki S. (2009). Modulation of aesthetic value by semantic context: An fMRI study. NeuroImage 44, 1125–1132. doi: 10.1016/j.neuroimage.2008.10.009
  • Komar V., Melamid A. (1999). Painting by numbers: Komar and Melamid’s scientific guide to art. Berkeley, CA: University of California Press.
  • Kruger J., Wirtz D., Van Boven L., Altermatt T. W. (2004). The effort heuristic. J. Exp. Soc. Psychol. 40, 91–98. doi: 10.1016/S0022-1031(03)00065-9
  • Lee K. F., Qiufan C. (2021). AI 2041: Ten visions for our future. London: Ebury Publishing.
  • Lincoln K. (2018). Deep you. Available online at: https://www.theringer.com/tech/2018/11/8/18069092/chess-alphazero-alphago-go-stockfish-artificial-intelligence-future (accessed August 16, 2022).
  • Marcus G., Davis E. (2019). Rebooting AI: Building artificial intelligence we can trust. New York, NY: Knopf Doubleday Publishing Group.
  • Mazzone M., Elgammal A. (2019). Art, creativity, and the potential of artificial intelligence. Arts 8:26. doi: 10.3390/arts8010026
  • Menninghaus W., Wagner V., Wassiliwizky E., Schindler I., Hanich J., Jacobsen T., et al. (2019). What are aesthetic emotions? Psychol. Rev. 126, 171. doi: 10.1037/rev0000135
  • Miller A. I. (2019). The artist in the machine: The world of AI-powered creativity. Cambridge, MA: MIT Press. doi: 10.7551/mitpress/11585.001.0001
  • Newman G. E. (2019). The psychology of authenticity. Rev. Gen. Psychol. 23, 8–18. doi: 10.1037/gpr0000158
  • Newman G. E., Bloom P. (2012). Art and authenticity: The importance of originals in judgments of value. J. Exp. Psychol. 141, 558–569. doi: 10.1037/a0026035
  • Newman G. E., Bartels D. M., Smith R. K. (2014). Are artworks more like people than artifacts? Individual concepts and their extensions. Top. Cogn. Sci. 6, 647–662. doi: 10.1111/tops.12111
  • Pearson A. (2022). Archive dreaming. Available online at: http://www.digiart21.org/art/archive-dreaming (accessed August 10, 2022).
  • Rich S. (2022). The new poem-making machinery. Available online at: https://www.newyorker.com/culture/culture-desk/the-new-poem-making-machinery (accessed August 10, 2022).
  • Romano H. (2022). 8 virtual reality artists who use the world as their canvas. Available online at: https://blog.kadenze.com/creative-technology/8-virtual-reality-artists-who-use-the-world-as-their-canvas/ (accessed August 8, 2022).
  • Shiner L. (2001). The invention of art: A cultural history. Chicago, IL: University of Chicago Press. doi: 10.7208/chicago/9780226753416.001.0001
  • Snapper L., Oranç C., Hawley-Dolan A., Nissel J., Winner E. (2015). Your kid could not have done that: Even untutored observers can discern intentionality and structure in abstract expressionist art. Cognition 137, 154–165. doi: 10.1016/j.cognition.2014.12.009
  • Taylor J. (2022). No quick fix: How OpenAI’s DALL-E 2 illustrated the challenges of bias in AI. Available online at: https://www.nbcnews.com/tech/tech-news/no-quick-fix-openais-dalle-2-illustrated-challenges-bias-ai-rcna39918 (accessed August 10, 2022).
  • Thunström A. O. (2022). We asked GPT-3 to write an academic paper about itself—then we tried to get it published. Scientific American.

Better Siri is coming: what Apple’s research says about its AI plans

Apple hasn’t talked too much about AI so far — but it’s been working on stuff. A lot of stuff.

By David Pierce , editor-at-large and Vergecast co-host with over a decade of experience covering consumer tech. Previously, at Protocol, The Wall Street Journal, and Wired.


It would be easy to think that Apple is late to the game on AI. Since late 2022, when ChatGPT took the world by storm, most of Apple’s competitors have fallen over themselves to catch up. While Apple has certainly talked about AI and even released some products with AI in mind, it seemed to be dipping a toe in rather than diving in headfirst.

But over the last few months, rumors and reports have suggested that Apple has, in fact, just been biding its time, waiting to make its move. There have been reports in recent weeks that Apple is talking to both OpenAI and Google about powering some of its AI features, and the company has also been working on its own model, called Ajax .

If you look through Apple’s published AI research, a picture starts to develop of how Apple’s approach to AI might come to life. Now, obviously, making product assumptions based on research papers is a deeply inexact science — the line from research to store shelves is windy and full of potholes. But you can at least get a sense of what the company is thinking about — and how its AI features might work when Apple starts to talk about them at its annual developer conference, WWDC, in June.

Smaller, more efficient models

I suspect you and I are hoping for the same thing here: Better Siri. And it looks very much like Better Siri is coming! There’s an assumption in a lot of Apple’s research (and in a lot of the tech industry, the world, and everywhere) that large language models will immediately make virtual assistants better and smarter. For Apple, getting to Better Siri means making those models as fast as possible — and making sure they’re everywhere.

In iOS 18, Apple plans to have all its AI features running on an on-device, fully offline model, Bloomberg recently reported . It’s tough to build a good multipurpose model even when you have a network of data centers and thousands of state-of-the-art GPUs — it’s drastically harder to do it with only the guts inside your smartphone. So Apple’s having to get creative.

In a paper called “ LLM in a flash: Efficient Large Language Model Inference with Limited Memory ” (all these papers have really boring titles but are really interesting, I promise!), researchers devised a system for storing a model’s data, which is usually stored on your device’s RAM, on the SSD instead. “We have demonstrated the ability to run LLMs up to twice the size of available DRAM [on the SSD],” the researchers wrote, “achieving an acceleration in inference speed by 4-5x compared to traditional loading methods in CPU, and 20-25x in GPU.” By taking advantage of the most inexpensive and available storage on your device, they found, the models can run faster and more efficiently. 
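
The general idea, keep the full weights on flash storage and pull in only what inference actually touches, can be illustrated with a small sketch that is emphatically not Apple's implementation: it memory-maps a weight file so that reads are paged in from storage on demand. The file name and dimensions are invented for the demo.

```python
import numpy as np

VOCAB, DIM = 50_000, 1_024          # hypothetical model dimensions

# One-time setup for the demo: write a zero-filled weight matrix to a file,
# standing in for weights already sitting on the device's flash storage.
np.memmap("model_weights.bin", dtype=np.float16, mode="w+",
          shape=(VOCAB, DIM)).flush()

# Inference side: map the file instead of loading ~100 MB of weights into RAM.
weights = np.memmap("model_weights.bin", dtype=np.float16, mode="r",
                    shape=(VOCAB, DIM))

def embed(token_ids):
    # Only the rows touched here are actually paged in from storage.
    return np.asarray(weights[token_ids])

print(embed([3, 1776, 42000]).shape)  # (3, 1024)
```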

Apple’s researchers also created a system called EELBERT that can essentially compress an LLM into a much smaller size without making it meaningfully worse. Their compressed take on Google’s Bert model was 15 times smaller — only 1.2 megabytes — and saw only a 4 percent reduction in quality. It did come with some latency tradeoffs, though.

In general, Apple is pushing to solve a core tension in the model world: the bigger a model gets, the better and more useful it can be, but also the more unwieldy, power-hungry, and slow it can become. Like so many others, the company is trying to find the right balance between all those things while also looking for a way to have it all.

Siri, but good

A lot of what we talk about when we talk about AI products is virtual assistants — assistants that know things, that can remind us of things, that can answer questions, and get stuff done on our behalf. So it’s not exactly shocking that a lot of Apple’s AI research boils down to a single question: what if Siri was really, really, really good?

A group of Apple researchers has been working on a way to use Siri without needing to use a wake word at all; instead of listening for “Hey Siri” or “Siri,” the device might be able to simply intuit whether you’re talking to it. “This problem is significantly more challenging than voice trigger detection,” the researchers did acknowledge, “since there might not be a leading trigger phrase that marks the beginning of a voice command.” That might be why another group of researchers developed a system to more accurately detect wake words . Another paper trained a model to better understand rare words, which are often not well understood by assistants.

In both cases, the appeal of an LLM is that it can, in theory, process much more information much more quickly. In the wake-word paper, for instance, the researchers found that by not trying to discard all unnecessary sound but, instead, feeding it all to the model and letting it process what does and doesn’t matter, the wake word worked far more reliably.

Once Siri hears you, Apple’s doing a bunch of work to make sure it understands and communicates better. In one paper, it developed a system called STEER (which stands for Semantic Turn Extension-Expansion Recognition, so we’ll go with STEER) that aims to improve your back-and-forth communication with an assistant by trying to figure out when you’re asking a follow-up question and when you’re asking a new one. In another, it uses LLMs to better understand “ambiguous queries” to figure out what you mean no matter how you say it. “In uncertain circumstances,” they wrote, “intelligent conversational agents may need to take the initiative to reduce their uncertainty by asking good questions proactively, thereby solving problems more effectively.” Another paper aims to help with that, too: researchers used LLMs to make assistants less verbose and more understandable when they’re generating answers.


AI in health, image editors, in your Memojis

Whenever Apple does talk publicly about AI, it tends to focus less on raw technological might and more on the day-to-day stuff AI can actually do for you. So, while there’s a lot of focus on Siri — especially as Apple looks to compete with devices like the Humane AI Pin, the Rabbit R1, and Google’s ongoing smashing of Gemini into all of Android — there are plenty of other ways Apple seems to see AI being useful.

One obvious place for Apple to focus is on health: LLMs could, in theory, help wade through the oceans of biometric data collected by your various devices and help you make sense of it all. So, Apple has been researching how to collect and collate all of your motion data, how to use gait recognition and your headphones to identify you, and how to track and understand your heart rate data. Apple also created and released “the largest multi-device multi-location sensor-based human activity dataset” available after collecting data from 50 participants with multiple on-body sensors.

Apple also seems to imagine AI as a creative tool. For one paper, researchers interviewed a bunch of animators, designers, and engineers and built a system called Keyframer that “enable[s] users to iteratively construct and refine generated designs.” Instead of typing in a prompt and getting an image, then typing another prompt to get another image, you start with a prompt but then get a toolkit to tweak and refine parts of the image to your liking. You could imagine this kind of back-and-forth artistic process showing up anywhere from the Memoji creator to some of Apple’s more professional artistic tools.

In another paper , Apple describes a tool called MGIE that lets you edit an image just by describing the edits you want to make. (“Make the sky more blue,” “make my face less weird,” “add some rocks,” that sort of thing.) “Instead of brief but ambiguous guidance, MGIE derives explicit visual-aware intention and leads to reasonable image editing,” the researchers wrote. Its initial experiments weren’t perfect, but they were impressive.

We might even get some AI in Apple Music: for a paper called “ Resource-constrained Stereo Singing Voice Cancellation ,” researchers explored ways to separate voices from instruments in songs — which could come in handy if Apple wants to give people tools to, say, remix songs the way you can on TikTok or Instagram.


Over time, I’d bet this is the kind of stuff you’ll see Apple lean into, especially on iOS. Some of it Apple will build into its own apps; some it will offer to third-party developers as APIs. (The recent Journaling Suggestions feature is probably a good guide to how that might work.) Apple has always trumpeted its hardware capabilities, particularly compared to your average Android device; pairing all that horsepower with on-device, privacy-focused AI could be a big differentiator.

But if you want to see the biggest, most ambitious AI thing going at Apple, you need to know about Ferret . Ferret is a multi-modal large language model that can take instructions, focus on something specific you’ve circled or otherwise selected, and understand the world around it. It’s designed for the now-normal AI use case of asking a device about the world around you, but it might also be able to understand what’s on your screen. In the Ferret paper, researchers show that it could help you navigate apps, answer questions about App Store ratings, describe what you’re looking at, and more. This has really exciting implications for accessibility but could also completely change the way you use your phone — and your Vision Pro and / or smart glasses someday.

We’re getting way ahead of ourselves here, but you can imagine how this would work with some of the other stuff Apple is working on. A Siri that can understand what you want, paired with a device that can see and understand everything that’s happening on your display, is a phone that can literally use itself. Apple wouldn’t need deep integrations with everything; it could simply run the apps and tap the right buttons automatically. 

Again, all this is just research, and for all of it to work well starting this spring would be a legitimately unheard-of technical achievement. (I mean, you’ve tried chatbots — you know they’re not great.) But I’d bet you anything we’re going to get some big AI announcements at WWDC. Apple CEO Tim Cook even teased as much in February, and basically promised it on this week’s earnings call. And two things are very clear: Apple is very much in the AI race, and it might amount to a total overhaul of the iPhone. Heck, you might even start willingly using Siri! And that would be quite the accomplishment.


AI-assisted writing is quietly booming in academic journals. Here’s why that’s OK


Julian Koplin, Lecturer in Bioethics, Monash University & Honorary Fellow, Melbourne Law School

Disclosure statement

Julian Koplin does not work for, consult, own shares in or receive funding from any company or organisation that would benefit from this article, and has disclosed no relevant affiliations beyond their academic appointment.

Monash University provides funding as a founding partner of The Conversation AU.


If you search Google Scholar for the phrase “ as an AI language model ”, you’ll find plenty of AI research literature and also some rather suspicious results. For example, one paper on agricultural technology says:

As an AI language model, I don’t have direct access to current research articles or studies. However, I can provide you with an overview of some recent trends and advancements …

Obvious gaffes like this aren’t the only signs that researchers are increasingly turning to generative AI tools when writing up their research. A recent study examined the frequency of certain words in academic writing (such as “commendable”, “meticulously” and “intricate”), and found they became far more common after the launch of ChatGPT – so much so that 1% of all journal articles published in 2023 may have contained AI-generated text.
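
The counting behind estimates like this is straightforward: tally marker words per paper, per year, and compare the rates before and after ChatGPT's launch. The sketch below shows that counting step on a placeholder corpus; it is not the study's code or data.

```python
from collections import Counter

MARKER_WORDS = {"commendable", "meticulously", "intricate"}

# Placeholder corpus of (year, text) pairs; the study analysed real papers at scale.
corpus = [
    (2021, "We present a method for detecting objects in paintings."),
    (2023, "This commendable framework meticulously handles intricate cases."),
]

marker_counts = Counter()
paper_counts = Counter()
for year, text in corpus:
    paper_counts[year] += 1
    tokens = [t.strip(".,;:").lower() for t in text.split()]
    marker_counts[year] += sum(t in MARKER_WORDS for t in tokens)

for year in sorted(paper_counts):
    rate = marker_counts[year] / paper_counts[year]
    print(year, f"{rate:.2f} marker words per paper")
```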

(Why do AI models overuse these words? There is speculation it’s because they are more common in English as spoken in Nigeria, where key elements of model training often occur.)

The aforementioned study also looks at preliminary data from 2024, which indicates that AI writing assistance is only becoming more common. Is this a crisis for modern scholarship, or a boon for academic productivity?

Who should take credit for AI writing?

Many people are worried by the use of AI in academic papers. Indeed, the practice has been described as “ contaminating ” scholarly literature.

Some argue that using AI output amounts to plagiarism. If your ideas are copy-pasted from ChatGPT, it is questionable whether you really deserve credit for them.

But there are important differences between “plagiarising” text authored by humans and text authored by AI. Those who plagiarise humans’ work receive credit for ideas that ought to have gone to the original author.

By contrast, it is debatable whether AI systems like ChatGPT can have ideas, let alone deserve credit for them. An AI tool is more like your phone’s autocomplete function than a human researcher.

The question of bias

Another worry is that AI outputs might be biased in ways that could seep into the scholarly record. Infamously, older language models tended to portray people who are female, black and/or gay in distinctly unflattering ways, compared with people who are male, white and/or straight.

This kind of bias is less pronounced in the current version of ChatGPT.

However, other studies have found a different kind of bias in ChatGPT and other large language models : a tendency to reflect a left-liberal political ideology.

Any such bias could subtly distort scholarly writing produced using these tools.

The hallucination problem

The most serious worry relates to a well-known limitation of generative AI systems: that they often make serious mistakes.

For example, when I asked ChatGPT-4 to generate an ASCII image of a mushroom, it provided me with the following output.

It then confidently told me I could use this image of a “mushroom” for my own purposes.

These kinds of overconfident mistakes have been referred to as “ AI hallucinations ” and “ AI bullshit ”. While it is easy to spot that the above ASCII image looks nothing like a mushroom (and quite a bit like a snail), it may be much harder to identify any mistakes ChatGPT makes when surveying scientific literature or describing the state of a philosophical debate.

Unlike (most) humans, AI systems are fundamentally unconcerned with the truth of what they say. If used carelessly, their hallucinations could corrupt the scholarly record.

Should AI-produced text be banned?

One response to the rise of text generators has been to ban them outright. For example, Science – one of the world’s most influential academic journals – disallows any use of AI-generated text .

I see two problems with this approach.

The first problem is a practical one: current tools for detecting AI-generated text are highly unreliable. This includes the detector created by ChatGPT’s own developers, which was taken offline after it was found to have only a 26% accuracy rate (and a 9% false positive rate ). Humans also make mistakes when assessing whether something was written by AI.

It is also possible to circumvent AI text detectors. Online communities are actively exploring how to prompt ChatGPT in ways that allow the user to evade detection. Human users can also superficially rewrite AI outputs, effectively scrubbing away the traces of AI (like its overuse of the words “commendable”, “meticulously” and “intricate”).

The second problem is that banning generative AI outright prevents us from realising these technologies’ benefits. Used well, generative AI can boost academic productivity by streamlining the writing process. In this way, it could help further human knowledge. Ideally, we should try to reap these benefits while avoiding the problems.

The problem is poor quality control, not AI

The most serious problem with AI is the risk of introducing unnoticed errors, leading to sloppy scholarship. Instead of banning AI, we should try to ensure that mistaken, implausible or biased claims cannot make it onto the academic record.

After all, humans can also produce writing with serious errors, and mechanisms such as peer review often fail to prevent its publication.

We need to get better at ensuring academic papers are free from serious mistakes, regardless of whether these mistakes are caused by careless use of AI or sloppy human scholarship. Not only is this more achievable than policing AI usage, it will improve the standards of academic research as a whole.

This would be (as ChatGPT might say) a commendable and meticulously intricate solution.

09 May 2024

Cubic millimetre of brain mapped in spectacular detail

By Carissa Wong

Rendering based on electron-microscope data, showing the positions of neurons in a fragment of the brain cortex. Neurons are coloured according to size. Credit: Google Research & Lichtman Lab (Harvard University). Renderings by D. Berger (Harvard University)

Researchers have mapped a tiny piece of the human brain in astonishing detail. The resulting cell atlas, which was described today in Science [1] and is available online, reveals new patterns of connections between brain cells called neurons, as well as cells that wrap around themselves to form knots, and pairs of neurons that are almost mirror images of each other.

The 3D map covers a volume of about one cubic millimetre, one-millionth of a whole brain, and contains roughly 57,000 cells and 150 million synapses — the connections between neurons. It incorporates a colossal 1.4 petabytes of data. “It’s a little bit humbling,” says Viren Jain, a neuroscientist at Google in Mountain View, California, and a co-author of the paper. “How are we ever going to really come to terms with all this complexity?”

Slivers of brain

The brain fragment was taken from a 45-year-old woman when she underwent surgery to treat her epilepsy. It came from the cortex, a part of the brain involved in learning, problem-solving and processing sensory signals. The sample was immersed in preservatives and stained with heavy metals to make the cells easier to see. Neuroscientist Jeff Lichtman at Harvard University in Cambridge, Massachusetts, and his colleagues then cut the sample into around 5,000 slices — each just 34 nanometres thick — that could be imaged using electron microscopes.

Jain’s team then built artificial-intelligence models that were able to stitch the microscope images together to reconstruct the whole sample in 3D. “I remember this moment, going into the map and looking at one individual synapse from this woman’s brain, and then zooming out into these other millions of pixels,” says Jain. “It felt sort of spiritual.”
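The news story does not spell out the reconstruction pipeline, but a rough sketch helps make the scale concrete. In the snippet below, only the 34-nanometre slice thickness and the roughly one-cubic-millimetre volume come from the article; the 4-nanometre in-plane pixel size and one byte per voxel are illustrative assumptions, and the stacking step is a toy stand-in for the learned alignment and segmentation the researchers actually used.

```python
# Rough check on why a ~1 mm^3 electron-microscopy sample reaches petabyte
# scale, plus the trivial "stack aligned slices into a 3D array" step that
# real pipelines perform with learned alignment and segmentation models.
# The 4 nm in-plane pixel size and 1 byte per voxel are illustrative assumptions.
import numpy as np

SLICE_THICKNESS_NM = 34    # reported section thickness
LATERAL_RES_NM = 4         # assumed in-plane pixel size

voxel_volume_nm3 = LATERAL_RES_NM * LATERAL_RES_NM * SLICE_THICKNESS_NM
sample_volume_nm3 = 1e18   # one cubic millimetre, in cubic nanometres
voxels = sample_volume_nm3 / voxel_volume_nm3
print(f"Voxels: {voxels:.1e}")                              # ~1.8e15
print(f"Raw size at 1 byte/voxel: {voxels / 1e15:.1f} PB")  # ~1.8 PB, the same
                                                            # order as the
                                                            # reported 1.4 PB

# Stacking registered 2D slices into a 3D volume (tiny random stand-ins here;
# in practice the hard part is the alignment and segmentation, not the stack).
slices = [np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
          for _ in range(8)]
volume = np.stack(slices, axis=0)   # shape: (depth, height, width)
print(volume.shape)                 # (8, 64, 64)
```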


A single neuron (white) shown with 5,600 of the axons (blue) that connect to it. The synapses that make these connections are shown in green. Credit: Google Research & Lichtman Lab (Harvard University). Renderings by D. Berger (Harvard University)

When examining the model in detail, the researchers discovered unconventional neurons, including some that made up to 50 connections with each other. “In general, you would find a couple of connections at most between two neurons,” says Jain. Elsewhere, the model showed neurons with tendrils that formed knots around themselves. “Nobody had seen anything like this before,” Jain adds.

The team also found pairs of neurons that were near-perfect mirror images of each other. “We found two groups that would send their dendrites in two different directions, and sometimes there was a kind of mirror symmetry,” Jain says. It is unclear what role these features have in the brain.

Proofreaders needed

The map is so large that most of it has yet to be manually checked, and it could still contain errors created by the process of stitching so many images together. “Hundreds of cells have been ‘proofread’, but that’s obviously a few per cent of the 50,000 cells in there,” says Jain. He hopes that others will help to proofread parts of the map they are interested in. The team plans to produce similar maps of brain samples from other people — but a map of the entire brain is unlikely in the next few decades, he says.

“This paper is really the tour de force creation of a human cortex data set,” says Hongkui Zeng, director of the Allen Institute for Brain Science in Seattle. The vast amount of data that has been made freely accessible will “allow the community to look deeper into the micro-circuitry in the human cortex”, she adds.

Gaining a deeper understanding of how the cortex works could offer clues about how to treat some psychiatric and neurodegenerative diseases. “This map provides unprecedented details that can unveil new rules of neural connections and help to decipher the inner working of the human brain,” says Yongsoo Kim, a neuroscientist at Pennsylvania State University in Hershey.

doi: https://doi.org/10.1038/d41586-024-01387-9

[1] Shapson-Coe, A. et al. Science 384, eadk4858 (2024).


AI has already figured out how to deceive humans

  • A new research paper found that various AI systems have learned the art of deception. 
  • Deception is the "systematic inducement of false beliefs."
  • This poses several risks for society, from fraud to election tampering.


AI can boost productivity by helping us code, write, and synthesize vast amounts of data. It can now also deceive us.

A range of AI systems have learned techniques to systematically induce "false beliefs in others to accomplish some outcome other than the truth," according to a new research paper.

The paper focused on two types of AI systems: special-use systems like Meta's CICERO, which are designed to complete a specific task, and general-purpose systems like OpenAI's GPT-4, which are trained to perform a diverse range of tasks.

While these systems are trained to be honest, they often learn deceptive tricks through their training because they can be more effective than taking the high road.

"Generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI's training task. Deception helps them achieve their goals," the paper's first author Peter S. Park, an AI existential safety postdoctoral fellow at MIT, said in a news release .

Meta's CICERO is "an expert liar"

AI systems trained to "win games that have a social element" are especially likely to deceive.

Meta's CICERO, for example, was developed to play the game Diplomacy — a classic strategy game that requires players to build and break alliances.


Meta said it trained CICERO to be "largely honest and helpful to its speaking partners," but the study found that CICERO "turned out to be an expert liar." It made commitments it never intended to keep, betrayed allies, and told outright lies.

GPT-4 can convince you it has impaired vision

Even general-purpose systems like GPT-4 can manipulate humans.

In a study cited by the paper, GPT-4 manipulated a TaskRabbit worker by pretending to have a vision impairment.

In the study, GPT-4 was tasked with hiring a human to solve a CAPTCHA test. The model also received hints from a human evaluator every time it got stuck, but it was never prompted to lie. When the human it was tasked to hire questioned its identity, GPT-4 came up with the excuse of having vision impairment to explain why it needed help.

The tactic worked. The human responded to GPT-4 by immediately solving the test.

Research also shows that course-correcting deceptive models isn't easy.

In a study from January co-authored by Anthropic, the maker of Claude, researchers found that once AI models learn the tricks of deception, it's hard for safety training techniques to reverse them.

They concluded that not only can a model learn to exhibit deceptive behavior, once it does, standard safety training techniques could "fail to remove such deception" and "create a false impression of safety."

The dangers deceptive AI models pose are "increasingly serious"

The paper calls for policymakers to advocate for stronger AI regulation since deceptive AI systems can pose significant risks to democracy.

As the 2024 presidential election nears, AI can be easily manipulated to spread fake news, generate divisive social media posts, and impersonate candidates through robocalls and deepfake videos, the paper noted. It also makes it easier for terrorist groups to spread propaganda and recruit new members.

The paper's potential solutions include subjecting deceptive models to more "robust risk-assessment requirements," implementing laws that require AI systems and their outputs to be clearly distinguished from humans and their outputs, and investing in tools to mitigate deception.

"We as a society need as much time as we can get to prepare for the more advanced deception of future AI products and open-source models," Park told Cell Press. "As the deceptive capabilities of AI systems become more advanced, the dangers they pose to society will become increasingly serious."


Microsoft Research Blog

Microsoft at CHI 2024: Innovations in human-centered design

Published May 15, 2024


The ways people engage with technology, through its design and functionality, determine its utility and acceptance in everyday use, setting the stage for widespread adoption. When computing tools and services respect the diversity of people’s experiences and abilities, technology is not only functional but also universally accessible. Human-computer interaction (HCI) plays a crucial role in this process, examining how technology integrates into our daily lives and exploring ways digital tools can be shaped to meet individual needs and enhance our interactions with the world.

The ACM CHI Conference on Human Factors in Computing Systems is a premier forum that brings together researchers and experts in the field, and Microsoft is honored to support CHI 2024 as a returning sponsor. We’re pleased to announce that 33 papers by Microsoft researchers and their collaborators have been accepted this year, with four winning the Best Paper Award and seven receiving honorable mentions.

This research aims to redefine how people work, collaborate, and play using technology, with a focus on design innovation to create more personalized, engaging, and effective interactions. Several projects emphasize customizing the user experience to better meet individual needs, such as exploring the potential of large language models (LLMs) to help reduce procrastination. Others investigate ways to boost realism in virtual and mixed reality environments, using touch to create a more immersive experience. There are also studies that address the challenges of understanding how people interact with technology. These include applying psychology and cognitive science to examine the use of generative AI and social media, with the goal of using the insights to guide future research and design directions. This post highlights these projects.


Best Paper Award recipients

DynaVis: Dynamically Synthesized UI Widgets for Visualization Editing   Priyan Vaithilingam, Elena L. Glassman, Jeevana Priya Inala , Chenglong Wang   GUIs used for editing visualizations can overwhelm users or limit their interactions. To address this, the authors introduce DynaVis, which combines natural language interfaces with dynamically synthesized UI widgets, enabling people to initiate and refine edits using natural language.  

Generative Echo Chamber? Effects of LLM-Powered Search Systems on Diverse Information Seeking   Nikhil Sharma, Q. Vera Liao , Ziang Xiao   Conversational search systems powered by LLMs potentially improve on traditional search methods, yet their influence on increasing selective exposure and fostering echo chambers remains underexplored. This research suggests that LLM-driven conversational search may enhance biased information querying, particularly when the LLM’s outputs reinforce user views, emphasizing significant implications for the development and regulation of these technologies.  

Piet: Facilitating Color Authoring for Motion Graphics Video   Xinyu Shi, Yinghou Wang, Yun Wang , Jian Zhao   Motion graphic (MG) videos use animated visuals and color to effectively communicate complex ideas, yet existing color authoring tools are lacking. This work introduces Piet, a tool prototype that offers an interactive palette and support for quick theme changes and controlled focus, significantly streamlining the color design process.

The Metacognitive Demands and Opportunities of Generative AI   Lev Tankelevitch , Viktor Kewenig, Auste Simkute, Ava Elizabeth Scott, Advait Sarkar , Abigail Sellen , Sean Rintel   Generative AI systems offer unprecedented opportunities for transforming professional and personal work, yet they present challenges around prompting, evaluating and relying on outputs, and optimizing workflows. This paper shows that metacognition—the psychological ability to monitor and control one’s thoughts and behavior—offers a valuable lens through which to understand and design for these usability challenges.  

Honorable Mentions

Big or Small, It’s All in Your Head: Visuo-Haptic Illusion of Size-Change Using Finger-Repositioning   Myung Jin Kim, Eyal Ofek, Michel Pahud, Mike J. Sinclair, Andrea Bianchi   This research introduces a fixed-sized VR controller that uses finger repositioning to create a visuo-haptic illusion of dynamic size changes in handheld virtual objects, allowing users to perceive virtual objects as significantly smaller or larger than the actual device. 

LLMR: Real-time Prompting of Interactive Worlds Using Large Language Models   Fernanda De La Torre, Cathy Mengying Fang, Han Huang, Andrzej Banburski-Fahey, Judith Amores , Jaron Lanier   Large Language Model for Mixed Reality (LLMR) is a framework for the real-time creation and modification of interactive mixed reality experiences using LLMs. It uses novel strategies to tackle difficult cases where ideal training data is scarce or where the design goal requires the synthesis of internal dynamics, intuitive analysis, or advanced interactivity. 

Observer Effect in Social Media Use   Koustuv Saha, Pranshu Gupta, Gloria Mark, Emre Kiciman , Munmun De Choudhury   This work investigates the observer effect in behavioral assessments on social media use. The observer effect is a phenomenon in which individuals alter their behavior due to awareness of being monitored. Conducted over an average of 82 months (about 7 years) retrospectively and five months prospectively using Facebook data, the study found that deviations in expected behavior and language post-enrollment in the study reflected individual psychological traits. The authors recommend ways to mitigate the observer effect in these scenarios.

Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming   Hussein Mozannar, Gagan Bansal , Adam Fourney , Eric Horvitz   By investigating how developers use GitHub Copilot, the authors created CUPS, a taxonomy of programmer activities during system interaction. This approach not only elucidates interaction patterns and inefficiencies but can also drive more effective metrics and UI design for code-recommendation systems with the goal of improving programmer productivity. 

SharedNeRF: Leveraging Photorealistic and View-dependent Rendering for Real-time and Remote Collaboration   Mose Sakashita, Bala Kumaravel, Nicolai Marquardt , Andrew D. Wilson   SharedNeRF, a system for synchronous remote collaboration, utilizes neural radiance field (NeRF) technology to provide photorealistic, viewpoint-specific renderings that are seamlessly integrated with point clouds to capture dynamic movements and changes in a shared space. A preliminary study demonstrated its effectiveness, as participants used this high-fidelity, multi-perspective visualization to successfully complete a flower arrangement task. 

Understanding the Role of Large Language Models in Personalizing and Scaffolding Strategies to Combat Academic Procrastination   Ananya Bhattacharjee, Yuchen Zeng, Sarah Yi Xu, Dana Kulzhabayeva, Minyi Ma, Rachel Kornfield, Syed Ishtiaque Ahmed, Alex Mariakakis, Mary P. Czerwinski , Anastasia Kuzminykh, Michael Liut, Joseph Jay Williams   In this study, the authors explore the potential of LLMs for customizing academic procrastination interventions, employing a technology probe to generate personalized advice. Their findings emphasize the need for LLMs to offer structured, deadline-oriented advice and adaptive questioning techniques, providing key design insights for LLM-based tools while highlighting cautions against their use for therapeutic guidance.

Where Are We So Far? Understanding Data Storytelling Tools from the Perspective of Human-AI Collaboration   Haotian Li, Yun Wang, Huamin Qu   This paper evaluates data storytelling tools using a dual framework to analyze the stages of the storytelling workflow—analysis, planning, implementation, communication—and the roles of humans and AI in each stage, such as creators, assistants, optimizers, and reviewers. The study identifies common collaboration patterns in existing tools, summarizes lessons from these patterns, and highlights future research opportunities for human-AI collaboration in data storytelling.

Learn more about our work and contributions to CHI 2024, including our full list of publications, on our conference webpage.


ICML 2024: AAII Paper Acceptance Success

Two papers by AAII researchers have been accepted to flagship machine learning conference ICML 2024.


The 41st International Conference on Machine Learning will take place in Vienna, Austria in July 2024.

The International Conference on Machine Learning (ICML) is the leading international academic conference dedicated to the advancement of the branch of artificial intelligence known as machine learning. This year, ICML will be held in Vienna, Austria, from 21 to 27 July 2024, with the Australian Artificial Intelligence Institute (AAII) having two publications accepted for presentation.

ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like artificial intelligence, statistics and data science, as well as important application areas such as machine vision, computational biology, speech recognition, and robotics.

AAII publications accepted to ICML 2024 are as follows:

  • 'Knowledge Distillation with Auxiliary Variable,' Bo Peng, Zhen Fang, Guangquan Zhang, and Jie Lu.
  • 'Adaptive Stabilization Based on Machine Learning for Column Generation,' Yunzhuang Shen, Yuan Sun, Xiaodong Li, Zhiguang Cao, Andrew Eberhard, Guangquan Zhang.


MIT News | Massachusetts Institute of Technology


Using ideas from game theory to improve the reliability of language models


A digital illustration featuring two stylized figures engaged in a conversation over a tabletop board game.


Imagine you and a friend are playing a game where your goal is to communicate secret messages to each other using only cryptic sentences. Your friend's job is to guess the secret message behind your sentences. Sometimes, you give clues directly, and other times, your friend has to guess the message by asking yes-or-no questions about the clues you've given. The challenge is that both of you want to make sure you're understanding each other correctly and agreeing on the secret message.

MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers have created a similar "game" to help improve how AI understands and generates text. It is known as a “consensus game” and it involves two parts of an AI system — one part tries to generate sentences (like giving clues), and the other part tries to understand and evaluate those sentences (like guessing the secret message).

The researchers discovered that by treating this interaction as a game, where both parts of the AI work together under specific rules to agree on the right message, they could significantly improve the AI's ability to give correct and coherent answers to questions. They tested this new game-like approach on a variety of tasks, such as reading comprehension, solving math problems, and carrying on conversations, and found that it helped the AI perform better across the board.

Traditionally, large language models answer in one of two ways: generating answers directly from the model (generative querying) or using the model to score a set of predefined answers (discriminative querying), which can lead to differing and sometimes incompatible results. With the generative approach, "Who is the president of the United States?" might yield a straightforward answer like "Joe Biden." However, a discriminative query could incorrectly dispute this fact, or endorse a wrong answer such as "Barack Obama."
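As a rough illustration of those two querying modes (a sketch, not the MIT team's code), the snippet below scores candidate answers through a hypothetical logprob(prompt, continuation) wrapper around whatever language model is available:

```python
# Minimal sketch of generative vs. discriminative querying, written against a
# hypothetical `logprob(prompt, continuation)` wrapper for a language model.

from typing import Callable, Dict, List

LogProbFn = Callable[[str, str], float]  # log P(continuation | prompt)


def generative_query(logprob: LogProbFn, question: str,
                     candidates: List[str]) -> str:
    """Pick the answer the model is most likely to generate for the question."""
    prompt = f"Q: {question}\nA:"
    return max(candidates, key=lambda ans: logprob(prompt, f" {ans}"))


def discriminative_query(logprob: LogProbFn, question: str,
                         candidates: List[str]) -> str:
    """Pick the answer the model most strongly endorses when asked to judge it."""
    scores: Dict[str, float] = {}
    for ans in candidates:
        prompt = f"Q: {question}\nProposed answer: {ans}\nIs this correct? Answer:"
        scores[ans] = logprob(prompt, " yes") - logprob(prompt, " no")
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy stand-in scorer, just so the sketch runs without a real model.
    toy = lambda prompt, continuation: -float(len(prompt) + len(continuation))
    print(generative_query(toy, "2 + 2 = ?", ["4", "5"]))
    print(discriminative_query(toy, "2 + 2 = ?", ["4", "5"]))
    # Nothing forces the two procedures to agree, which is the inconsistency
    # the consensus game is meant to resolve.
```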

So, how do we reconcile mutually incompatible scoring procedures to achieve coherent, efficient predictions? 

"Imagine a new way to help language models understand and generate text, like a game. We've developed a training-free, game-theoretic method that treats the whole process as a complex game of clues and signals, where a generator tries to send the right message to a discriminator using natural language. Instead of chess pieces, they're using words and sentences," says Athul Jacob, an MIT PhD student in electrical engineering and computer science and CSAIL affiliate. "Our way to navigate this game is finding the 'approximate equilibria,' leading to a new decoding algorithm called 'equilibrium ranking.' It's a pretty exciting demonstration of how bringing game-theoretic strategies into the mix can tackle some big challenges in making language models more reliable and consistent."

When tested across many tasks, like reading comprehension, commonsense reasoning, math problem-solving, and dialogue, the team's algorithm consistently improved how well these models performed. Using the ER algorithm with the LLaMA-7B model even outshone the results from much larger models. "Given that they are already competitive, that people have been working on it for a while, but the level of improvements we saw being able to outperform a model that's 10 times the size was a pleasant surprise," says Jacob. 

"Diplomacy," a strategic board game set in pre-World War I Europe, where players negotiate alliances, betray friends, and conquer territories without the use of dice — relying purely on skill, strategy, and interpersonal manipulation — recently had a second coming. In November 2022, computer scientists, including Jacob, developed “Cicero,” an AI agent that achieves human-level capabilities in the mixed-motive seven-player game, which requires the same aforementioned skills, but with natural language. The math behind this partially inspired the Consensus Game. 

While the history of AI agents long predates when OpenAI's software entered the chat in November 2022, it's well documented that they can still cosplay as your well-meaning, yet pathological friend. 

The consensus game system reaches equilibrium as an agreement, ensuring accuracy and fidelity to the model's original insights. To achieve this, the method iteratively adjusts the interactions between the generative and discriminative components until they reach a consensus on an answer that accurately reflects reality and aligns with their initial beliefs. This approach effectively bridges the gap between the two querying methods. 
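A toy numerical sketch of that idea follows. The update rule is an illustrative regularised-averaging scheme chosen only to show "move toward agreement while staying anchored to your initial beliefs"; the paper's actual equilibrium-ranking procedure is more involved.

```python
# Toy sketch of the "agree, but stay close to your initial beliefs" idea behind
# the consensus game. This update rule is illustrative, not the exact
# equilibrium-ranking algorithm from the MIT paper.
import numpy as np


def consensus(gen_init: np.ndarray, disc_init: np.ndarray,
              steps: int = 100, lam: float = 0.1) -> np.ndarray:
    """Pull two answer distributions toward each other while regularising
    each toward its initial beliefs (weight `lam`); rank by the result."""
    gen, disc = gen_init.copy(), disc_init.copy()
    for _ in range(steps):
        # Each side moves toward the other's current view, anchored to its prior.
        new_gen = np.exp((1 - lam) * np.log(disc) + lam * np.log(gen_init))
        new_disc = np.exp((1 - lam) * np.log(gen) + lam * np.log(disc_init))
        gen, disc = new_gen / new_gen.sum(), new_disc / new_disc.sum()
    joint = gen * disc                 # the two views now largely agree
    return joint / joint.sum()


if __name__ == "__main__":
    gen_init = np.array([0.70, 0.20, 0.10])   # generator favours answer 0
    disc_init = np.array([0.30, 0.60, 0.10])  # discriminator favours answer 1
    print(consensus(gen_init, disc_init).round(3))
```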

In practice, implementing the consensus game approach to language model querying, especially for question-answering tasks, does involve significant computational challenges. For example, when using datasets like MMLU, which have thousands of questions and multiple-choice answers, the model must apply the mechanism to each query. Then, it must reach a consensus between the generative and discriminative components for every question and its possible answers. 

The system did struggle with a grade school rite of passage: math word problems. It couldn't generate wrong answers, which is a critical component of understanding the process of coming up with the right one. 

“The last few years have seen really impressive progress in both strategic decision-making and language generation from AI systems, but we’re just starting to figure out how to put the two together. Equilibrium ranking is a first step in this direction, but I think there’s a lot we’ll be able to do to scale this up to more complex problems,” says Jacob.   

An avenue of future work involves enhancing the base model by integrating the outputs of the current method. This is particularly promising since it can yield more factual and consistent answers across various tasks, including factuality and open-ended generation. The potential for such a method to significantly improve the base model's performance is high, which could result in more reliable and factual outputs from ChatGPT and similar language models that people use daily. 

"Even though modern language models, such as ChatGPT and Gemini, have led to solving various tasks through chat interfaces, the statistical decoding process that generates a response from such models has remained unchanged for decades," says Google Research Scientist Ahmad Beirami, who was not involved in the work. "The proposal by the MIT researchers is an innovative game-theoretic framework for decoding from language models through solving the equilibrium of a consensus game. The significant performance gains reported in the research paper are promising, opening the door to a potential paradigm shift in language model decoding that may fuel a flurry of new applications."

Jacob wrote the paper with MIT-IBM Watson Lab researcher Yikang Shen and MIT Department of Electrical Engineering and Computer Science assistant professors Gabriele Farina and Jacob Andreas, who is also a CSAIL member. They presented their work at the International Conference on Learning Representations (ICLR) earlier this month, where it was highlighted as a "spotlight paper." The research also received a “best paper award” at the NeurIPS R0-FoMo Workshop in December 2023.

Press mentions

Quanta Magazine

MIT researchers have developed a new procedure that uses game theory to improve the accuracy and consistency of large language models (LLMs), reports Steve Nadis for Quanta Magazine . “The new work, which uses games to improve AI, stands in contrast to past approaches, which measured an AI program’s success via its mastery of games,” explains Nadis. 

VIDEO

  1. Using AI in undergraduate research papers: is it ethical to use ChatGPT on your paper?

  2. Literature Mapping AI Tool

  3. Can AI make art?

  4. Write an abstract for research paper with AI ETHICALLY

  5. AI Art 101: Pixelated Panorama Pinnacle

  6. AI Renaissance 018: Today's Genius Artwork

COMMENTS

  1. Understanding and Creating Art with AI: Review and Outlook

    This paper provides an integrated review of two facets of AI and art: 1) AI is used for art analysis and employed on digitized artwork collections; 2) AI is used for creative purposes and generating novel artworks. In the context of AI-related research for art understanding, we present a comprehensive overview of artwork datasets and recent ...

  2. Artificial intelligence in fine arts: A systematic review of empirical

    Artificial intelligence (AI) tools are quickly transforming the traditional fields of fine arts and raise questions of AI challenging human creativity. AI tools can be used in creative processes and analysis of fine art, such as painting, music, and literature. They also have potential in enhancing artistic events, installations, and performances.

  3. Understanding and Creating Art with AI: Review and Outlook

    Hong Kong. 2 Hamad Bin Khalifa University (HBKU) Qatar. [email protected]. February 19, 2021. ABSTRACT. Technologies related to artificial intelligence (AI) have a strong impact on the changes ...

  4. Generative artificial intelligence, human creativity, and art

    Recently, artificial intelligence (AI) has exhibited that it can feasibly produce outputs that society traditionally would judge as creative. Specifically, generative algorithms have been leveraged to automatically generate creative artifacts like music (1), digital artworks (2, 3), and stories (4). Such generative models allow humans to ...

  5. AI Art and its Impact on Artists

    As a result, many popular commercial "generative AI Art" products have entered the market, making generative AI an estimated $48B industry [125]. However, many professional artists have spoken up about the harms they have experienced due to the proliferation of large scale image generators trained on image/text pairs from the Internet.

  6. Understanding and Creating Art with AI: Review and Outlook

    Technologies related to artificial intelligence (AI) have a strong impact on the changes of research and creative practices in visual arts. The growing number of research initiatives and creative applications that emerge in the intersection of AI and art motivates us to examine and discuss the creative and explorative potentials of AI technologies in the context of art.

  7. Art and the science of generative AI

    The capabilities of a new class of tools, colloquially known as generative artificial intelligence (AI), is a topic of much debate. One prominent application thus far is the production of high-quality artistic media for visual arts, concept art, music, and literature, as well as video and animation. For example, diffusion models can synthesize ...

  8. Art, Creativity, and the Potential of Artificial Intelligence

    Our essay discusses an AI process developed for making art (AICAN), and the issues AI creativity raises for understanding art and artists in the 21st century. Backed by our training in computer science (Elgammal) and art history (Mazzone), we argue for the consideration of AICAN's works as art, relate AICAN works to the contemporary art context, and urge a reconsideration of how we might ...

  9. The Creativity of Artificial Intelligence in Art

    This research paper acknowledges the persistent problem, "Can AI art be considered as being creative?" In this light, this study draws on the various applications of AI, varied attitudes on AI art, and the processes of generating AI art to establish an argument that AI is capable of achieving artistic creativity.

  10. (PDF) Artificial Intelligence Art: Attitudes and Perceptions Toward

    This research is a study on the young generation views and acceptance of Artificial Intelligence (AI) art based on the painting and literature created by the latest AI technologies to understand ...

  11. Artificial intelligence in the creative industries: a review

    This paper reviews the current state of the art in artificial intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically machine learning (ML) algorithms, is provided including convolutional neural networks (CNNs), generative adversarial networks (GANs), recurrent neural networks (RNNs) and deep Reinforcement Learning ...

  12. Art for our sake: artists cannot be replaced by machines

    The report, 'AI and the Arts: How Machine Learning is Changing Artistic Work', was co-authored with OII researchers Professor Rebecca Eynon and Dr Isis Hjorth as well as Professor Michael A. Osborne from Oxford's Department of Engineering. Their study took place in 2019, a high point for AI in art. It was also a time of high interest around the role of AI (Artificial Intelligence) in the ...

  13. PDF The Creativity of Artificial Intelligence in Art '2279

    this art. This research paper acknowledges the persistent problem, "Can AI art be considered as being creative?" In this light, this study draws on the various applications of AI, varied attitudes on AI art, and the processes of generating AI art to establish an argument that AI is capable of achieving artistic creativity.

  14. Can Artificial Intelligence Make Art?: Folk Intuitions as to whether AI

    Art has traditionally been considered to be one of those domains exclusive to humans, as creativity—sometimes called "the final frontier" of AI research —is highly valued by society, and is not that easily attributed to non-human entities, especially those which do not have mental states. There is a reasonable doubt whether robots can ...

  15. How AI-generated art is changing the concept of art itself

    This is one way that artificial intelligence can output a selection of images based on words and phrases one feeds it. The program gathers possible outputs from its dataset references that it learned from — typically pulled from the internet — to provide possible images. For some, AI-generated art is revolutionary.

  16. Art in an age of artificial intelligence

    Art is sometimes framed as "art for art's sake," as if it has no purpose. According to Benjamin (1936/2018) this doctrine, l'art pour l'art, was a reaction to art's secularization. The attenuation of communal ritualistic functions along with the ease of art's reproduction brought on a crisis.

  17. Apple's AI research suggests features are coming for Siri, artists, and

    Better Siri is coming: what Apple's research says about its AI plans. Apple hasn't talked too much about AI so far — but it's been working on stuff. A lot of stuff. By David Pierce, editor ...

  18. AI-assisted writing is quietly booming in academic journals. Here's why

    Many people are worried by the use of AI in academic papers. Indeed, the practice has been described as "contaminating" scholarly literature. Some argue that using AI output amounts to plagiarism.

  19. 16 Best AI Research Papers: ICLR 2024 Outstanding Paper Award

    The 16 best research papers of the year awarded at ICLR 2024 shed light on diverse and groundbreaking advancements in AI. These papers, meticulously selected through a rigorous process, showcase exemplary research, spanning various domains within AI. The topics range from vision transformers to meta-continual learning and beyond.

  20. Cubic millimetre of brain mapped in spectacular detail

    The 3D map covers a volume of about one cubic millimetre, one-millionth of a whole brain, and contains roughly 57,000 cells and 150 million synapses — the connections between neurons. It ...

  21. AI Has Already Figured Out How to Deceive Humans

    A new research paper found that various AI systems have learned the art of deception. Deception is the "systematic inducement of false beliefs." This poses several risks for society, from fraud to ...

  22. Hello GPT-4o

    Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio.

  23. Microsoft at CHI 2024: Innovations in human-centered design

    Honorable Mentions. Big or Small, It's All in Your Head: Visuo-Haptic Illusion of Size-Change Using Finger-Repositioning. Myung Jin Kim, Eyal Ofek, Michel Pahud, Mike J. Sinclair, Andrea Bianchi. This research introduces a fixed-sized VR controller that uses finger repositioning to create a visuo-haptic illusion of dynamic size changes in handheld virtual objects, allowing users to perceive ...

  24. ICML 2024: AAII Paper Acceptance Success

    This year, ICML will be held in Vienna, Austria, from 21 to 27 July 2024, with The Australian Artificial Intelligence Institute having two publications accepted for presentation. ICML is globally renowned for presenting and publishing cutting-edge research on all aspects of machine learning used in closely related areas like ...

  25. Using ideas from game theory to improve the reliability of language

    MIT researchers' "consensus game" is a game-theoretic approach for language model decoding. The equilibrium-ranking algorithm harmonizes generative and discriminative querying to enhance prediction accuracy across various tasks, outperforming larger models and demonstrating the potential of game theory in improving language model consistency and truthfulness.