A Survey of Deep Learning: Platforms, Applications and Emerging Research Trends


  • Survey Paper
  • Open access
  • Published: 31 March 2021

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

  • Laith Alzubaidi   ORCID: orcid.org/0000-0002-7296-5413 1 , 5 ,
  • Jinglan Zhang 1 ,
  • Amjad J. Humaidi 2 ,
  • Ayad Al-Dujaili 3 ,
  • Ye Duan 4 ,
  • Omran Al-Shamma 5 ,
  • J. Santamaría 6 ,
  • Mohammed A. Fadhel 7 ,
  • Muthana Al-Amidie 4 &
  • Laith Farhan 8  

Journal of Big Data volume 8, Article number: 53 (2021)


In the last few years, the deep learning (DL) computing paradigm has been deemed the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is the ability to learn from massive amounts of data. The DL field has grown fast in the last few years and has been used extensively to address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them tackled only one aspect of it, which leads to an overall lack of knowledge about the field. Therefore, in this contribution, we propose a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including those enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), which are the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). We then present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGA, GPU, and CPU are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and summary and conclusion.

Introduction

Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [ 1 , 2 , 3 , 4 , 5 , 6 ]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [ 7 , 8 , 9 ]. Another name for DL is representation learning (RL). The continuing appearance of novel studies in the fields of deep and distributed learning is due to both the unpredictable growth in the ability to obtain data and the amazing progress made in the hardware technologies, e.g. High Performance Computing (HPC) [ 10 ].

DL is derived from the conventional neural network but considerably outperforms its predecessors. Moreover, DL employs transformations and graph technologies simultaneously in order to build up multi-layer learning models. The most recently developed DL techniques have achieved outstanding performance across a variety of applications, including audio and speech processing, visual data processing, and natural language processing (NLP), among others [ 11 , 12 , 13 , 14 ].

Usually, the effectiveness of an ML algorithm is highly dependent on the integrity of the input-data representation. It has been shown that a suitable data representation provides an improved performance when compared to a poor data representation. Thus, a significant research trend in ML for many years has been feature engineering, which has informed numerous research studies. This approach aims at constructing features from raw data. In addition, it is extremely field-specific and frequently requires sizable human effort. For instance, several types of features were introduced and compared in the computer vision context, such as, histogram of oriented gradients (HOG) [ 15 ], scale-invariant feature transform (SIFT) [ 16 ], and bag of words (BoW) [ 17 ]. As soon as a novel feature is introduced and is found to perform well, it becomes a new research direction that is pursued over multiple decades.

Relatively speaking, feature extraction is achieved in an automatic way throughout the DL algorithms. This encourages researchers to extract discriminative features using the smallest possible amount of human effort and field knowledge [ 18 ]. These algorithms have a multi-layer data representation architecture, in which the first layers extract the low-level features while the last layers extract the high-level features. Note that artificial intelligence (AI) originally inspired this type of architecture, which simulates the process that occurs in core sensorial regions within the human brain. Using different scenes, the human brain can automatically extract data representation. More specifically, the output of this process is the classified objects, while the received scene information represents the input. This process simulates the working methodology of the human brain. Thus, it emphasizes the main benefit of DL.

In the field of ML, DL is currently one of the most prominent research trends owing to its considerable success. In this paper, an overview of DL is presented that adopts various perspectives such as the main concepts, architectures, challenges, applications, computational tools and evolution matrix. The convolutional neural network (CNN) is one of the most popular and widely used DL networks [ 19 , 20 ], and it is largely because of CNNs that DL is so popular nowadays. The main advantage of CNN compared to its predecessors is that it automatically detects the significant features without any human supervision, which has made it the most used. Therefore, we examine CNN in depth by presenting its main components. Furthermore, we elaborate in detail on the most common CNN architectures, starting with the AlexNet network and ending with the High-Resolution network (HR.Net).

Several DL review papers have been published in the last few years. However, each of them addresses only one side, focusing on a single application or topic, such as the review of CNN architectures [ 21 ], DL for classification of plant diseases [ 22 ], DL for object detection [ 23 ], and DL applications in medical image analysis [ 24 ]. Although these reviews cover good topics, they do not provide a full understanding of DL topics such as concepts, detailed research gaps, computational tools, and DL applications. It is first necessary to understand the aspects of DL, including concepts, challenges, and applications, before going deep into specific applications. Achieving that requires extensive time and a large number of research papers. Therefore, we propose a deep review of DL to provide a more suitable starting point from which to develop a full understanding of DL from one review paper. The motivation behind our review was to cover the most important aspects of DL, including open challenges, applications, and computational tools. Furthermore, our review can be the first step towards other DL topics.

The main aim of this review is to present the most important aspects of DL to make it easy for researchers and students to have a clear image of DL from a single review paper. This review will further advance DL research by helping people discover more about recent developments in the field. Researchers will be able to decide on the most suitable direction of work in order to provide more accurate alternatives to the field. Our contributions are outlined as follows:

This is the first review that provides a deep survey of nearly all the most important aspects of deep learning. It helps researchers and students gain a good understanding from one paper.

We explain CNN, the most popular deep learning algorithm, in depth by describing its concepts, theory, and state-of-the-art architectures.

We review current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the proposed solutions tackling these issues.

We provide an exhaustive list of medical imaging applications with deep learning by categorizing them based on the tasks by starting with classification and ending with registration.

We discuss the computational approaches (CPU, GPU, FPGA) by comparing the influence of each tool on deep learning algorithms.

The rest of the paper is organized as follows: “ Survey methodology ” section describes the survey methodology. “ Background ” section presents the background. “ Classification of DL approaches ” section defines the classification of DL approaches. “ Types of DL networks ” section displays types of DL networks. “ CNN architectures ” section shows CNN architectures. “ Challenges (limitations) of deep learning and alternate solutions ” section details the challenges of DL and alternate solutions. “ Applications of deep learning ” section outlines the applications of DL. “ Computational approaches ” section explains the influence of computational approaches (CPU, GPU, FPGA) on DL. “ Evaluation metrics ” section presents the evaluation metrics. “ Frameworks and datasets ” section lists frameworks and datasets. “ Summary and conclusion ” section presents the summary and conclusion.

Survey methodology

We have reviewed the significant research papers in the field published during 2010–2020, mainly from 2019 and 2020, with some papers from 2021. The main focus was papers from the most reputed publishers such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer. Some papers have been selected from ArXiv. We have reviewed more than 300 papers on various DL topics. There are 108 papers from the year 2020, 76 papers from the year 2019, and 48 papers from the year 2018. This indicates that this review focused on the latest publications in the field of DL. The selected papers were analyzed and reviewed to (1) list and define the DL approaches and network types, (2) list and explain CNN architectures, (3) present the challenges of DL and suggest the alternate solutions, (4) assess the applications of DL, (5) assess computational approaches. The main keywords used as search criteria for this review paper are (“Deep Learning”), (“Machine Learning”), (“Convolution Neural Network”), (“Deep Learning” AND “Architectures”), ((“Deep Learning”) AND (“Image”) AND (“detection” OR “classification” OR “segmentation” OR “Localization”)), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “CPU” OR “GPU” OR “FPGA”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Imbalanced Data”), (“Deep Learning” AND “Interpretability of data”), (“Deep Learning” AND “Overfitting”), (“Deep Learning” AND “Underspecification”). Figure 1 shows the search structure of the survey paper. Table 1 presents the details of some of the journals that have been cited in this review paper.

figure 1

Search framework

Background

This section presents the background of DL. We begin with a quick introduction to DL, followed by the difference between DL and ML. We then show the situations that require DL. Finally, we present the reasons for applying DL.

DL, a subset of ML (Fig.  2 ), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [ 18 , 25 ].

figure 2

Deep learning family

Achieving the classification task using conventional ML techniques requires several sequential steps, specifically pre-processing, feature extraction, wise feature selection, learning, and classification. Furthermore, feature selection has a great impact on the performance of ML techniques. Biased feature selection may lead to incorrect discrimination between classes. Conversely, DL has the ability to automate the learning of feature sets for several tasks, unlike conventional ML methods [ 18 , 26 ]. DL enables learning and classification to be achieved in a single shot (Fig.  3 ). DL has become an incredibly popular type of ML algorithm in recent years due to the huge growth and evolution of the field of big data [ 27 , 28 ]. It is still in continuous development regarding novel performance for several ML tasks [ 22 , 29 , 30 , 31 ] and has simplified the improvement of many learning fields [ 32 , 33 ], such as image super-resolution [ 34 ], object detection [ 35 , 36 ], and image recognition [ 30 , 37 ]. Recently, DL performance has come to exceed human performance on tasks such as image classification (Fig.  4 ).

figure 3

The difference between deep learning and traditional machine learning

figure 4

Deep learning performance compared to human

Nearly all scientific fields have felt the impact of this technology. Most industries and businesses have already been disrupted and transformed through the use of DL. The leading technology and economy-focused companies around the world are in a race to improve DL. Even now, human-level performance and capability cannot exceed the performance of DL in many areas, such as predicting the time taken to make car deliveries, decisions to certify loan requests, and predicting movie ratings [ 38 ]. The winners of the 2019 “Nobel Prize” in computing, also known as the Turing Award, were three pioneers in the field of DL (Yann LeCun, Geoffrey Hinton, and Yoshua Bengio) [ 39 ]. Although a large number of goals have been achieved, there is further progress to be made in the DL context. In fact, DL has the ability to enhance human lives by providing additional accuracy in tasks including estimating natural disasters [ 40 ], the discovery of new drugs [ 41 ], and cancer diagnosis [ 42 , 43 , 44 ]. Esteva et al. [ 45 ] found that a DL network has the same ability to diagnose the disease as twenty-one board-certified dermatologists using 129,450 images of 2032 diseases. Furthermore, in grading prostate cancer, US board-certified general pathologists achieved an average accuracy of 61%, while the Google AI [ 44 ] outperformed these specialists by achieving an average accuracy of 70%. In 2020, DL is playing an increasingly vital role in early diagnosis of the novel coronavirus (COVID-19) [ 29 , 46 , 47 , 48 ]. DL has become the main tool in many hospitals around the world for automatic COVID-19 classification and detection using chest X-ray images or other types of images. We end this section with a quote from AI pioneer Geoffrey Hinton: “Deep learning is going to be able to do everything”.

When to apply deep learning

Machine intelligence is useful in many situations and in some cases equals or surpasses human experts [ 49 , 50 , 51 , 52 ], meaning that DL can be a solution to the following problems:

Cases where human experts are not available.

Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).

Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).

Cases where solutions require adaptation based on specific cases (personalization, biometrics).

Cases where the size of the problem is extremely large and exceeds our limited reasoning abilities (sentiment analysis, matching ads on Facebook, calculating webpage ranks).

Why deep learning?

Several performance features may answer this question, e.g.:

Universal Learning Approach: Because DL has the ability to perform in approximately all application domains, it is sometimes referred to as universal learning.

Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.

Generalization: Different data types or different applications can use the same DL technique, an approach frequently referred to as transfer learning (TL), which is explained in a later section. Furthermore, it is a useful approach in problems where data is insufficient.

Scalability: DL is highly scalable. ResNet [ 37 ], which was invented by Microsoft, comprises 1202 layers and is frequently applied at a supercomputing scale. Lawrence Livermore National Laboratory (LLNL), a large enterprise working on evolving frameworks for networks, adopted a similar approach, where thousands of nodes can be implemented [ 53 ].

Classification of DL approaches

DL techniques are classified into three major categories: unsupervised, partially supervised (semi-supervised) and supervised. Furthermore, deep reinforcement learning (DRL), also known as RL, is another type of learning technique, which is mostly considered to fall into the category of partially supervised (and occasionally unsupervised) learning techniques.

Deep supervised learning

Deep semi-supervised learning

In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, which include GRUs and LSTMs, are also employed for partially supervised learning. One of the advantages of this technique is that it minimizes the amount of labeled data needed. On the other hand, one of its disadvantages is that irrelevant input features present in the training data could lead to incorrect decisions. Text document classification is one of the most popular applications of semi-supervised learning: because a large amount of labeled text documents is difficult to obtain, semi-supervised learning is ideal for this task.

Deep unsupervised learning

This technique makes it possible to implement the learning process in the absence of available labeled data (i.e. no labels are required). Here, the agent learns the significant features or interior representation required to discover the unidentified structure or relationships in the input data. Techniques of generative networks, dimensionality reduction and clustering are frequently counted within the category of unsupervised learning. Several members of the DL family have performed well on non-linear dimensionality reduction and clustering tasks; these include restricted Boltzmann machines, auto-encoders and GANs as the most recently developed techniques. Moreover, RNNs, which include GRUs and LSTM approaches, have also been employed for unsupervised learning in a wide range of applications. The main disadvantages of unsupervised learning are that it is unable to provide accurate information concerning data sorting and that it is computationally complex. One of the most popular unsupervised learning approaches is clustering [ 54 ].

Deep reinforcement learning

For solving a task, the selection of the type of reinforcement learning that needs to be performed is based on the space or the scope of the problem. For example, DRL is the best way for problems involving many parameters to be optimized. By contrast, derivative-free reinforcement learning is a technique that performs well for problems with limited parameters. Some of the applications of reinforcement learning are business strategy planning and robotics for industrial automation. The main drawback of Reinforcement Learning is that parameters may influence the speed of learning. Here are the main motivations for utilizing Reinforcement Learning:

It assists you to identify which action produces the highest reward over a longer period.

It assists you to discover which situation requires action.

It also enables the agent to figure out the best approach for reaching large rewards.

Reinforcement Learning also gives the learning agent a reward function.

Reinforcement learning is not suitable in every situation, for example:

When there is sufficient data to resolve the issue with supervised learning techniques.

When the workspace is large, because reinforcement learning is computing-heavy and time-consuming.

Types of DL networks

The most famous types of deep learning networks are discussed in this section: these include recursive neural networks (RvNNs), RNNs, and CNNs. RvNNs and RNNs are explained briefly, while CNNs are explained in depth owing to the importance of this type, which is the most used in several applications among these networks.

Recursive neural networks

RvNN can make predictions over a hierarchical structure and classify the outputs using compositional vectors [ 57 ]. Recursive auto-associative memory (RAAM) [ 58 ] is the primary inspiration for the RvNN development. The RvNN architecture is designed for processing objects that have arbitrarily shaped structures, such as graphs or trees. This approach generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using the back-propagation through structure (BTS) learning system [ 58 ]. The BTS system follows the same technique as the general back-propagation algorithm and has the ability to support a tree-like structure. Auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNN is highly effective in the NLP context. Socher et al. [ 59 ] introduced an RvNN architecture designed to process inputs from a variety of modalities. These authors demonstrate two classification applications: natural language sentences, where each sentence is split into words, and natural images, where each image is separated into various segments of interest. RvNN computes a score for each candidate pair of units, reflecting how plausible it is to merge them, and uses these scores to construct a syntactic tree. The pair with the largest score is merged into a composition vector. Following every merge, RvNN generates (a) a larger area of numerous units, (b) a compositional vector of the area, and (c) a label for the class (for instance, a noun phrase will become the class label for the new area if two units are noun words). The compositional vector for the entire area is the root of the RvNN tree structure. An example RvNN tree is shown in Fig. 5. RvNN has been employed in several applications [ 60 , 61 , 62 ].

figure 5

An example of RvNN tree
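The greedy merge procedure described above can be sketched in a few lines. This is a toy illustration, not the implementation of Socher et al.: the composition function (tanh of the element-wise mean of two child vectors) and the scoring function (sum of the parent's components) are hypothetical stand-ins for the learned weights, and the word vectors are invented.

```python
import math

def compose(left, right):
    """Hypothetical composition: tanh of the element-wise mean of two child vectors."""
    return [math.tanh((a + b) / 2.0) for a, b in zip(left, right)]

def score(parent):
    """Hypothetical merge-plausibility score: sum of the parent's components."""
    return sum(parent)

def greedy_parse(units):
    """Merge the highest-scoring adjacent pair until a single root remains.

    `units` is a list of (label, vector) pairs; returns (tree, root_vector).
    """
    nodes = list(units)
    while len(nodes) > 1:
        # Score every pair of adjacent units.
        candidates = []
        for i in range(len(nodes) - 1):
            parent = compose(nodes[i][1], nodes[i + 1][1])
            candidates.append((score(parent), i, parent))
        # Merge the most plausible pair into one node with a compositional vector.
        _, i, parent = max(candidates, key=lambda c: c[0])
        merged = ((nodes[i][0], nodes[i + 1][0]), parent)
        nodes[i:i + 2] = [merged]
    return nodes[0]

# Invented 2-dimensional "word vectors" for illustration only.
words = [("the", [0.1, 0.2]), ("cat", [0.9, 0.8]),
         ("sat", [0.7, 0.9]), ("down", [-0.5, -0.4])]
tree, root_vector = greedy_parse(words)
print(tree)
```

The vector returned for the final node plays the role of the root compositional vector; a real RvNN would also predict a class label at every merge.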

Recurrent neural networks

RNNs are a commonly employed and familiar algorithm in the discipline of DL [ 63 , 64 , 65 ]. RNN is mainly applied in the area of speech processing and NLP contexts [ 66 , 67 ]. Unlike conventional networks, RNN uses sequential data in the network. Since the embedded structure in the sequence of the data delivers valuable information, this feature is fundamental to a range of different applications. For instance, it is important to understand the context of the sentence in order to determine the meaning of a specific word in it. Thus, it is possible to consider the RNN as a unit of short-term memory, where x represents the input layer, y is the output layer, and s represents the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig.  6 . Pascanu et al. [ 68 ] introduced three different types of deep RNN techniques, namely “Hidden-to-Hidden”, “Hidden-to-Output”, and “Input-to-Hidden”. A deep RNN is introduced that lessens the learning difficulty in the deep network and brings the benefits of a deeper RNN based on these three techniques.
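As a rough illustration of the unfolded computation, the sketch below runs a scalar RNN over a short input sequence, using the names x, s, and y from the text. The weights `u`, `w`, `v` and the tanh activation are assumptions chosen for the example; the text does not specify them.

```python
import math

def rnn_forward(xs, u=0.5, w=0.8, v=1.0, s0=0.0):
    """Unroll a scalar RNN: s_t = tanh(u * x_t + w * s_{t-1}), y_t = v * s_t."""
    s = s0
    states, outputs = [], []
    for x in xs:
        # The new state depends on the current input and the previous state.
        s = math.tanh(u * x + w * s)
        states.append(s)
        outputs.append(v * s)
    return states, outputs

# A single impulse followed by zeros: the state carries a fading memory of it.
states, outputs = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

Because the state at each step is fed back into the next one, the first input still influences later outputs, which is the short-term memory behaviour described above.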

figure 6

Typical unfolded RNN diagram

However, the sensitivity of RNNs to the vanishing and exploding gradient problems is one of the main issues with this approach [ 69 ]. More specifically, during the training process, the repeated multiplication of many large or small derivatives may cause the gradients to explode or decay exponentially. As new inputs arrive, the network gradually forgets the initial ones; that is, its sensitivity to early inputs decays over time. This issue can be handled using LSTM [ 70 ]. This approach offers recurrent connections to memory blocks in the network. Every memory block contains a number of memory cells, which have the ability to store the temporal states of the network. In addition, it contains gated units for controlling the flow of information. In very deep networks [ 37 ], residual connections also have the ability to considerably reduce the impact of the vanishing gradient issue, which is explained in later sections. CNN is considered to be more powerful than RNN, and RNN offers less feature compatibility than CNN.
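The exponential growth or decay of gradients mentioned above can be seen with simple arithmetic. The sketch below assumes, purely for illustration, that the same derivative is multiplied in at every time step; in a real network the factors differ from step to step.

```python
def gradient_magnitude(derivative, steps):
    """Product of `steps` identical per-step derivatives, as in backpropagation through time."""
    grad = 1.0
    for _ in range(steps):
        grad *= derivative
    return grad

vanishing = gradient_magnitude(0.5, 50)  # factors below one shrink the gradient
exploding = gradient_magnitude(1.5, 50)  # factors above one blow it up
print(f"{vanishing:.3e} {exploding:.3e}")
```

After only 50 steps the two products differ by more than twenty orders of magnitude, which is why long sequences are hard to train without gating or residual connections.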

Convolutional neural networks

In the field of DL, the CNN is the most famous and commonly employed algorithm [ 30 , 71 , 72 , 73 , 74 , 75 ]. The main benefit of CNN compared to its predecessors is that it automatically identifies the relevant features without any human supervision [ 76 ]. CNNs have been extensively applied in a range of different fields, including computer vision [ 77 ], speech processing [ 78 ], face recognition [ 79 ], etc. The structure of CNNs was inspired by neurons in human and animal brains, similar to a conventional neural network. More specifically, in a cat’s brain, a complex sequence of cells forms the visual cortex; this sequence is simulated by the CNN [ 80 ]. Goodfellow et al. [ 28 ] identified three key benefits of the CNN: equivariant representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, shared weights and local connections in the CNN are employed to make full use of 2D input-data structures like image signals. This operation utilizes an extremely small number of parameters, which both simplifies the training process and speeds up the network. This is the same as in the visual cortex cells. Notably, only small regions of a scene are sensed by these cells rather than the whole scene (i.e., these cells spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig.  7 .

figure 7

An example of CNN architecture for image classification

The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or \(m \times m \times r\) , where the height (m) is equal to the width. The depth is also referred to as the channel number. For example, in an RGB image, the depth (r) is equal to three. Several kernels (filters) available in each convolutional layer are denoted by k and also have three dimensions ( \(n \times n \times q\) ), similar to the input image; here, however, n must be smaller than m , while q is either equal to or smaller than r . In addition, the kernels are the basis of the local connections, which share similar parameters (bias \(b^{k}\) and weight \(W^{k}\) ) and are convolved with the input to generate k feature maps \(h^{k}\) , each of size \(m-n+1\) per side when a stride of one and no padding are used. The convolution layer calculates a dot product between its input and the weights as in Eq. 1 , similar to an MLP, but the inputs are small regions of the original image. Next, by applying the nonlinearity or an activation function f to the convolution-layer output, we obtain the following:

\(h^{k} = f(W^{k} * x + b^{k})\) (1)

The next step is down-sampling every feature map in the sub-sampling layers. This leads to a reduction in the network parameters, which accelerates the training process and in turn enables handling of the overfitting issue. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size \(p \times p\) , where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction, which represents the last-stage layers as in a typical neural network. The classification scores are generated using the ending layer [e.g. support vector machines (SVMs) or softmax]. For a given instance, every score represents the probability of a specific class.
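To make the last step concrete, the sketch below shows how a softmax ending layer turns raw class scores into probabilities. The three scores are invented for illustration, and softmax is only one of the options the text mentions.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores from the last FC layer for three classes.
probs = softmax([2.0, 1.0, 0.1])
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```

The class with the largest score always receives the largest probability, so the predicted class is unchanged by the normalisation.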

Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.

Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

Large-scale network implementation is much easier with CNN than with other neural networks.

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as an N-dimensional matrix, is convolved with these filters to generate the output feature map.

Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called the kernel weight. Random numbers are assigned to act as the weights of the kernel at the beginning of the CNN training process. In addition, there are several different methods used to initialize the weights. Next, these weights are adjusted at each training epoch; thus, the kernel learns to extract significant features.

Convolutional Operation: We first describe the CNN input format. The vector format is the input of the traditional neural network, while the multi-channeled image is the input of the CNN. For instance, single-channel is the format of the gray-scale image, while the RGB image format is three-channeled. To understand the convolutional operation, let us take an example of a \(4 \times 4\) gray-scale image with a \(2 \times 2\) random weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. At each position, the dot product between the kernel and the underlying image patch is computed: corresponding values are multiplied and then summed to give a single scalar value. The whole process is then repeated until no further sliding is possible. Note that the calculated dot product values represent the feature map of the output. Figure 8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the \(2 \times 2\) kernel, while the light blue color represents the same-size area of the input image. Both are multiplied; the end result after summing up the resulting product values (marked in a light orange color) represents an entry value to the output feature map.

figure 8

The primary calculations executed at each step of convolutional layer

Note that no padding is applied to the input image in the previous example, while a stride of one (the step size of the kernel over the vertical and horizontal locations) is used. It is also possible to use another stride value; increasing the stride yields a feature map of lower dimensions.
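The sliding dot product described above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function name `conv2d`, the pixel values, and the fixed kernel (used instead of a randomly initialized one so that the result can be checked by hand) are assumptions and are not taken from Fig. 8. The optional `padding` argument adds a border of zeros around the image.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Slide `kernel` over `image` (both lists of lists) and return the feature map."""
    if padding > 0:
        # Zero padding: surround the image with `padding` rows/columns of zeros.
        width = len(image[0]) + 2 * padding
        zero_rows = [[0] * width for _ in range(padding)]
        middle = [[0] * padding + list(row) + [0] * padding for row in image]
        image = zero_rows + middle + [[0] * width for _ in range(padding)]
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (len(image) - k_h) // stride + 1
    out_w = (len(image[0]) - k_w) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top, left = i * stride, j * stride
            # Dot product between the kernel and the image patch it covers.
            row.append(sum(image[top + a][left + b] * kernel[a][b]
                           for a in range(k_h) for b in range(k_w)))
        feature_map.append(row)
    return feature_map


# Hypothetical 4x4 gray-scale image and 2x2 kernel (illustrative values only).
image = [
    [1, 0, 2, 1],
    [0, 1, 3, 0],
    [2, 1, 0, 1],
    [1, 0, 1, 2],
]
kernel = [
    [1, 0],
    [0, 1],
]

print(conv2d(image, kernel))            # [[2, 3, 2], [1, 1, 4], [2, 2, 2]]
print(conv2d(image, kernel, stride=2))  # [[2, 2], [2, 2]]
```

With a stride of one and no padding, the \(4 \times 4\) input yields a \(3 \times 3\) feature map; a stride of two shrinks it to \(2 \times 2\), while a padding of one enlarges it to \(5 \times 5\).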

Padding, on the other hand, is highly significant for preserving information at the borders of the input image; without it, features at the border are washed away very quickly. By applying padding, the size of the input image will increase, and in turn, the size of the output feature map will also increase.

Core Benefits of Convolutional Layers

Sparse Connectivity: In FC neural networks, each neuron of a layer is linked to all neurons in the following layer. By contrast, in CNNs, only a few weights exist between two adjacent layers. Thus, the number of required weights or connections is small, and so is the memory required to store them; hence, this approach is memory-efficient. In addition, the matrix operation used in FC layers is computationally much more costly than the dot (.) operation in CNN.

Weight Sharing: In CNN, no weights are allocated to individual pairs of neurons in neighboring layers; instead, the same set of weights operates on all pixels of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.

Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps generated by the convolutional operations. In other words, this approach shrinks large-size feature maps to create smaller feature maps, while maintaining the majority of the dominant information (or features) at every step of the pooling stage. As with the convolutional operation, the sizes of both the stride and the kernel are assigned before the pooling operation is executed. Several types of pooling methods are available for utilization in various pooling layers. These methods include tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are max, min, and GAP pooling. Figure 9 illustrates these three pooling operations.

figure 9

Three types of pooling operations
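As a minimal sketch of the three pooling operations named above, the following plain-Python code applies max, min, and global average pooling to a small feature map. The function names and the example values are assumptions chosen for illustration; they are not the values shown in Fig. 9.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Apply max or min pooling with a square window of side `size`."""
    reduce_window = {"max": max, "min": min}[mode]
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature_map[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled


def global_average_pool(feature_map):
    """Reduce a whole feature map to a single value: its mean (GAP)."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


# Hypothetical 4x4 feature map (illustrative values only).
fmap = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 4, 3, 8],
]

print(pool2d(fmap, mode="max"))   # [[6, 4], [7, 9]]
print(pool2d(fmap, mode="min"))   # [[1, 1], [1, 0]]
print(global_average_pool(fmap))  # 3.625
```

Max and min pooling halve each spatial dimension here because a \(2 \times 2\) window with a stride of two is assumed, whereas GAP collapses the entire map into one number.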

Sometimes, the overall CNN performance decreases as a result of pooling; this represents the main shortfall of the pooling layer. The layer helps the CNN to determine whether or not a certain feature is present in the input image, but it does not retain the precise location of that feature. Thus, the CNN model misses the relevant positional information.

Activation Function (non-linearity): Mapping the input to the output is the core function of all types of activation function in all types of neural network. The input value is determined by computing the weighted summation of the neuron input along with its bias (if present). The activation function then decides whether or not to fire the neuron for a particular input by creating the corresponding output.

Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in CNN architecture. Because of the non-linearity of the activation layers, the mapping of input to output is non-linear; these layers therefore give the CNN the ability to learn highly complex functions. The activation function must also be differentiable, which is an extremely significant feature, as it allows error back-propagation to be used to train the network. The following types of activation functions are most commonly used in CNN and other deep neural networks.

Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2 .

Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3 .

ReLU: The most commonly used function in the CNN context. It converts all negative input values to zero while leaving positive values unchanged. Lower computational load is the main benefit of ReLU over the others. Its mathematical representation, \(f(x) = \max(0, x)\), is given in Eq. 4 .

Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider an error back-propagation step in which a large gradient flows through a ReLU neuron. This gradient can update the weights in such a way that the neuron is never activated again. This issue is referred to as “Dying ReLU”. Some ReLU alternatives exist to solve such issues. The following discusses some of them.

Leaky ReLU: Unlike ReLU, which sets negative inputs to zero, this activation function down-scales them so that they are never completely ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5 , i.e., \(f(x) = x\) if \(x > 0\) and \(f(x) = mx\) otherwise.

Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

Noisy ReLU: This function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6 .

Parametric Linear Units: This is mostly the same as Leaky ReLU. The main difference is that the leak factor in this function is learned during the model training process. The parametric linear unit can be represented mathematically as in Eq. 7 , i.e., \(f(x) = x\) if \(x > 0\) and \(f(x) = ax\) otherwise.

Note that the learnable weight is denoted as a.
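The activation functions listed above can be written compactly in plain Python using their standard textbook definitions. This is a sketch under stated assumptions: the function names are illustrative, the default leak factor follows the value 0.001 mentioned above, and the noise level `sigma` in the noisy ReLU is a hypothetical choice (the form \(\max(0, x + Y)\) with Gaussian noise \(Y\) is assumed).

```python
import math
import random


def sigmoid(x):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash a real number into the interval (-1, 1)."""
    return math.tanh(x)


def relu(x):
    """Pass positive inputs unchanged and map negative inputs to zero."""
    return max(0.0, x)


def leaky_relu(x, m=0.001):
    """Like ReLU, but negative inputs are scaled by the leak factor m."""
    return x if x > 0 else m * x


def noisy_relu(x, sigma=0.1, rng=random):
    """ReLU applied after adding zero-mean Gaussian noise (sigma is an assumption)."""
    return max(0.0, x + rng.gauss(0.0, sigma))


def prelu(x, a):
    """Parametric unit: same form as Leaky ReLU, but `a` is learned in training."""
    return x if x > 0 else a * x


for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), tanh(value), relu(value), leaky_relu(value))
```

In a real network the parameter `a` of `prelu` would be updated by back-propagation together with the weights; here it is simply passed in as an argument.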

Fully Connected Layer: Commonly, this layer is located at the end of each CNN architecture. Inside this layer, each neuron is connected to all neurons of the previous layer, the so-called Fully Connected (FC) approach. It is utilized as the CNN classifier. It follows the basic method of the conventional multiple-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer. This input is in the form of a vector, which is created from the feature maps after flattening. The output of the FC layer represents the final CNN output, as illustrated in Fig.  10 .

figure 10

Fully connected layer

Loss Functions: The previous section has presented various layer-types of CNN architecture. In addition, the final classification is achieved from the output layer, which represents the last layer of the CNN architecture. Some loss functions are utilized in the output layer to calculate the predicted error created across the training samples in the CNN model. This error reveals the difference between the actual output and the predicted one. Next, it will be optimized through the CNN learning process.

The loss function uses two parameters to calculate the error. The CNN estimated output (referred to as the prediction) is the first parameter. The actual output (referred to as the label) is the second parameter. Several types of loss function are employed in various problem types. The following concisely explains some of the loss function types.

Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring the CNN model performance. It is also referred to as the log loss function. Its output is the probability \(p \in [0, 1]\) . In addition, it is usually employed as a substitute for the square error loss function in multi-class classification problems. In the output layer, it employs the softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8 .

Here, \(e^{a_{i}}\) represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of cross-entropy loss function is Eq. 9 .

Euclidean Loss Function: This function is widely used in regression problems. It is also known as the mean square error. The mathematical expression of the estimated Euclidean loss is Eq. 10 .

Hinge Loss Function: This function is commonly employed in problems related to binary classification, in particular maximum-margin-based classification. It is most important for SVMs, which use the hinge loss function, wherein the optimizer attempts to maximize the margin between the two target classes. Its mathematical formula is Eq. 11 .

The margin m is commonly set to 1. Moreover, the predicted output is denoted as \(p_{i}\) , while the desired output is denoted as \(y_{i}\) .
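The three loss functions above can be sketched in plain Python as follows. Several details are assumptions made for illustration rather than the paper's exact formulation: the function names, the \(1/(2N)\) scaling of the Euclidean loss (one common convention), the small constant `eps` that guards against \(\log 0\), and the mapping of binary labels from \(\{0, 1\}\) to \(\{-1, +1\}\) in the hinge loss.

```python
import math


def softmax(logits):
    """Turn non-normalized outputs a_i into probabilities p_i = e^{a_i} / sum_k e^{a_k}."""
    shift = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(a - shift) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, one_hot, eps=1e-12):
    """Cross-entropy between predicted probabilities and a one-hot label."""
    return -sum(y * math.log(p + eps) for p, y in zip(probs, one_hot))


def euclidean_loss(preds, targets):
    """Squared error summed over N outputs and scaled by 1 / (2N) (assumed convention)."""
    n = len(preds)
    return sum((p - y) ** 2 for p, y in zip(preds, targets)) / (2 * n)


def hinge_loss(preds, labels, m=1.0):
    """Hinge loss with margin m; labels in {0, 1} are mapped to {-1, +1} (assumption)."""
    return sum(max(0.0, m - (2 * y - 1) * p) for p, y in zip(preds, labels))


probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
print(cross_entropy(probs, [1, 0, 0]))
print(euclidean_loss([1.0, 2.0], [1.0, 4.0]))  # 1.0
print(hinge_loss([2.0, -0.5], [1, 0]))         # 0.5
```

The softmax output always sums to one, which is what allows the cross-entropy term to be read as the negative log-probability assigned to the correct class.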

Regularization to CNN

For CNN models, over-fitting represents the central issue associated with obtaining well-behaved generalization. The model is said to be over-fitted when it performs especially well on training data but does not succeed on test data (unseen data), as explained further in a later section. An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data. The model is referred to as “just-fitted” if it performs well on both training and testing data. These three types are illustrated in Fig. 11 . Various intuitive concepts are used for regularization to help avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

Dropout: This is a widely utilized technique for generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature selection power is distributed equally across the whole group of neurons, and the model is forced to learn different independent features. During the training process, a dropped neuron does not take part in back-propagation or forward-propagation. By contrast, the full-scale network is utilized to perform prediction during the testing process.

Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.
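The difference between dropping neurons and dropping connections can be sketched as follows. The function names and the drop probability are hypothetical, and the rescaling of surviving activations by \(1/(1 - p)\) (so-called inverted dropout) is an assumption that is not described in the text above; it is one common way of keeping the expected activation unchanged between training and testing.

```python
import random


def dropout(activations, drop_prob, rng, training=True):
    """Zero each neuron's activation with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)  # the full network is used at test time
    keep = 1.0 - drop_prob
    # Inverted dropout (assumption): rescale survivors by 1 / keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def drop_weights(weights, drop_prob, rng, training=True):
    """Zero individual connections (weights) rather than whole neurons."""
    if not training or drop_prob == 0.0:
        return [list(row) for row in weights]
    return [[w if rng.random() >= drop_prob else 0.0 for w in row]
            for row in weights]


rng = random.Random(0)
hidden = [0.5, 1.0, 1.5, 2.0]
print(dropout(hidden, 0.5, rng))                  # some entries zeroed, others doubled
print(dropout(hidden, 0.5, rng, training=False))  # unchanged at test time
print(drop_weights([[0.1, 0.2], [0.3, 0.4]], 0.5, rng))
```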

Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used. Several techniques are utilized to artificially expand the size of the training dataset. More details can be found in a later section, which describes the data augmentation techniques.

Batch Normalization: This method normalizes the output activations so that they follow a unit Gaussian distribution [ 81 ]. Subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this as a pre-processing task at each layer in the network, the operation is also differentiable and can be integrated into other networks. In addition, it is employed to reduce the “internal covariate shift” of the activation layers, which is defined as the variation in the activation distribution in each layer. This shift becomes very high due to the continuous weight updating through training, which may occur if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model will consume extra time for convergence, and in turn, the time required for training will also increase. To resolve this issue, a layer representing the operation of batch normalization is applied in the CNN architecture.

The advantages of utilizing batch normalization are as follows:

It prevents the problem of vanishing gradient from arising.

It effectively mitigates the effects of poor weight initialization.

It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).

It reduces the dependency of training on hyper-parameters.

Chances of over-fitting are reduced, since it has a minor regularization effect.
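The normalization step described above (subtract the mean, divide by the standard deviation) can be sketched for a single activation across a mini-batch. The function name is an assumption, and the scale `gamma`, shift `beta`, and stabilizing constant `eps` are not discussed in the text above; they follow the usual formulation of batch normalization, in which `gamma` and `beta` are learnable.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one activation over a mini-batch to roughly zero mean, unit variance."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    # eps keeps the division stable when the batch variance is close to zero.
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


activations = [1.0, 2.0, 3.0, 4.0]  # hypothetical values of one neuron over a batch
normalized = batch_norm(activations)
print(normalized)
print(sum(normalized) / len(normalized))  # approximately 0
```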

figure 11

Over-fitting and under-fitting issues

Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first issue is the learning algorithm selection (optimizer), while the second issue is the use of many enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to enhance the output.

Minimizing the loss function, i.e., the error (the variation between actual and predicted output), which depends on numerous learnable parameters (e.g., biases, weights, etc.), is the core purpose of all supervised learning algorithms. Gradient-based learning techniques are the usual choice for a CNN network. The network parameters are updated throughout all training epochs, while the network also looks for the locally optimal answer in each training epoch in order to minimize the error.

The learning rate is defined as the step size of the parameter updating. A training epoch represents one complete pass of parameter updating over the entire training dataset. Note that the learning rate is a hyper-parameter and must be selected wisely so that it does not adversely affect the learning process.

Gradient Descent or Gradient-based learning algorithm: To minimize the training error, this algorithm repeatedly updates the network parameters through every training epoch. More specifically, to update the parameters correctly, it computes the gradient (slope) of the objective function by applying a first-order derivative with respect to the network parameters. Next, each parameter is updated in the opposite direction of the gradient to reduce the error. The parameter updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is as Eq. 12 .

The final weight in the current training epoch is denoted by \(w_{ij}^{t}\) , while the weight in the preceding \((t-1)\) training epoch is denoted \(w_{ij}^{t-1}\) . The learning rate is \(\eta \) and the prediction error is E . Different alternatives of the gradient-based learning algorithm are available and commonly employed; these include the following:

Batch Gradient Descent: In this technique [ 82 ], the network parameters are updated only once per epoch, after the entire training dataset has been passed through the network. In more depth, it calculates the gradient over the whole training set and subsequently uses this gradient to update the parameters. For a small-sized dataset, the CNN model converges faster and creates an extra-stable gradient using BGD. However, since every single parameter update requires processing the whole training set, it requires a substantial amount of resources. For a large training dataset, additional time is required for converging, and it could converge to a local optimum (for non-convex instances).

Stochastic Gradient Descent: The parameters are updated at each training sample in this technique [ 83 ]. It is preferable to randomly shuffle the training samples in every epoch before training. For a large-sized training dataset, this technique is both more memory-efficient and much faster than BGD. However, because the parameters are updated so frequently, it takes extremely noisy steps in the direction of the answer, which in turn causes the convergence behavior to become highly unstable.

Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, each of which is a small collection of samples with no overlap between them [ 84 ]. Next, parameter updating is performed following gradient computation on every mini-batch. The advantage of this method comes from combining the advantages of both BGD and SGD techniques. Thus, it has steady convergence, greater computational efficiency, and better memory efficiency. The following describes several enhancement techniques in gradient-based learning algorithms (usually in SGD), which further enhance the CNN training process.
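The three variants differ only in how many samples contribute to each parameter update, which the following sketch makes explicit through a single `batch_size` argument. The setting is deliberately tiny and entirely hypothetical: a one-parameter model \(y = wx\) with a squared-error loss, toy data generated from \(w = 2\), and an arbitrarily chosen learning rate and number of epochs.

```python
import random


def gradient_descent(xs, ys, batch_size, lr=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on the mean squared error of each batch."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in random order every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of 0.5 * mean((w*x - y)^2) with respect to w.
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step in the direction opposite to the gradient
    return w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # toy data generated with w = 2

print(gradient_descent(xs, ys, batch_size=len(xs)))  # BGD: one update per epoch
print(gradient_descent(xs, ys, batch_size=1))        # SGD: one update per sample
print(gradient_descent(xs, ys, batch_size=2))        # mini-batch: in between
```

On this noise-free toy problem all three variants approach \(w = 2\); the differences in stability and cost discussed above only become visible on larger, noisier datasets.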

Momentum: For neural networks, this technique is employed in the objective function. It enhances both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted by a factor \(\lambda \) (known as the momentum factor). The motivation is that a gradient-based learning algorithm can easily become stuck in a local minimum rather than reaching the global minimum, which represents the main disadvantage of such algorithms. Issues of this kind frequently occur if the problem has a non-convex surface (or solution space).

Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13 .

The weight increment in the current \(t\)th training epoch is denoted as \(\Delta w_{ij}^{t}\) , while \(\eta \) is the learning rate, and \(\Delta w_{ij}^{t-1}\) is the weight increment in the preceding \((t-1)\)th training epoch. The momentum factor value is maintained within the range 0 to 1; in turn, the step size of the weight updating increases in the direction of the minimum to minimize the error. As the value of the momentum factor becomes very low, the model loses its ability to avoid local minima. By contrast, as the momentum factor value becomes high, the model develops the ability to converge much more rapidly. If a high value of momentum factor is used together with the LR, then the model could miss the global minimum by crossing over it.

However, when the gradient varies its direction continually throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths out the variations in the weight updates.
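A minimal sketch of the momentum update is given below, assuming the common form in which the current weight increment is the learning rate times the gradient plus the momentum factor times the previous increment. The function name, the quadratic test error \(E(w) = (w - 3)^2\), and the hyper-parameter values are illustrative assumptions.

```python
def momentum_descent(grad_fn, w0, lr=0.1, momentum=0.9, steps=500):
    """Gradient descent in which each step also carries part of the previous step."""
    w, delta = w0, 0.0
    for _ in range(steps):
        # New increment = lr * gradient + momentum factor * previous increment.
        delta = lr * grad_fn(w) + momentum * delta
        w -= delta
    return w


def grad(w):
    """Gradient of the hypothetical error E(w) = (w - 3)^2, minimized at w = 3."""
    return 2.0 * (w - 3.0)


print(momentum_descent(grad, w0=0.0))                # approaches 3.0
print(momentum_descent(grad, w0=0.0, momentum=0.0))  # plain gradient descent
```

Setting `momentum=0.0` recovers plain gradient descent, which makes it easy to compare the two behaviours on the same error surface.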

Adaptive Moment Estimation (Adam): This is another widely used optimization technique or learning algorithm. Adam [ 85 ] represents the latest trends in deep learning optimization. Although it tracks an estimate of the second moment of the gradient, it relies only on first-order derivatives and does not require the Hessian matrix (i.e., second-order derivatives). Adam is a learning strategy that has been designed specifically for training deep neural networks. Greater memory efficiency and lower computational cost are two advantages of Adam. The mechanism of Adam is to calculate an adaptive LR for each parameter in the model. It integrates the pros of both Momentum and RMSprop: like RMSprop, it utilizes the squared gradients to scale the learning rate, and like momentum, it uses the moving average of the gradient. The equation of Adam is represented in Eq. 14 .
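The mechanism described above can be sketched for a single parameter as follows. The update rule follows the commonly published form of Adam, with moving averages of the gradient and of the squared gradient plus bias correction; the function name, the test error \(E(w) = (w - 3)^2\), and the step count are assumptions, and the hyper-parameter defaults other than the learning rate are the customary ones rather than values taken from the text.

```python
import math


def adam(grad_fn, w0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g      # moving average of the gradient
        v = beta2 * v + (1 - beta2) * g * g  # moving average of the squared gradient
        m_hat = m / (1 - beta1 ** t)         # bias correction for the early steps
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return w


def grad(w):
    """Gradient of the hypothetical error E(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


print(adam(grad, w0=0.0))  # approaches 3.0
```

Only first-order gradients are evaluated in this loop; the squared-gradient average merely rescales the step size for each parameter.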

Design of algorithms (backpropagation)

Let’s start with a notation that refers to weights in the network unambiguously. We denote \(w_{ij}^{h}\) to be the weight for the connection from the \(i\)th input (or neuron in the \((h-1)\)th layer) to the \(j\)th neuron in the \(h\)th layer. Figure 12 shows the weight on a connection from a neuron in the first layer to another neuron in the next layer in the network.

figure 12

MLP structure

Here, \(w_{11}^{2}\) represents the weight from the first neuron in the first layer to the first neuron in the second layer. Accordingly, the second weight for the same neuron is \(w_{21}^{2}\), which is the weight from the second neuron in the previous layer to the first neuron in the next layer (the second layer in this network). The bias is not a connection between neurons of different layers, so it is easily handled: each neuron must have its own bias, although in some networks each layer shares a single bias. It can be seen from the above network that each layer has its own bias. Each network has parameters such as the number of layers, the number of neurons in each layer, and the number of weights (connections) between the layers. The number of connections can be easily determined from the number of neurons in each layer; for example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections (weights) between them is \(10 \times 2 = 20\). To show how the error is defined and how the weights are updated, we consider a neural network with two layers,

where \(d\) is the label of the individual \(i\)th input and \(y\) is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on the changes of the cost function (error). Ultimately, this means computing the partial derivatives \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\). To compute those, a local variable \(\delta_{j}^{h}\) is introduced, which is called the local error in the \(j\)th neuron in the \(h\)th layer. Based on that local error, backpropagation gives the procedure to compute \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\). The two-layer neural network considered here is shown in Fig. 13 .

figure 13

Neuron activation functions

Output error \(\delta_{j}^{l}\) for each \(l = 1{:}L\), where \(L\) is the number of neurons in the output layer

where \(e(k)\) is the error at epoch \(k\), as shown in Eq. ( 2 ), and \(\vartheta'(v_{j}(k))\) is the derivative of the activation function for \(v_{j}\) at the output.

Backpropagate the error through all remaining layers except the output layer

where \(\delta_{j}^{l}(k)\) is the output error and \(w_{jl}^{h+1}(k)\) represents the weight after the layer at which the error needs to be obtained.

After finding the error at each neuron in each layer, we can update the weights in each layer based on Eqs. ( 16 ) and ( 17 ).
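The procedure above (compute the output error, back-propagate it to the hidden layer, then update the weights and biases) can be sketched for a network with one hidden layer and one output layer. This is an illustrative sketch under stated assumptions, not the paper's exact formulation: the error is taken to be \(E = \frac{1}{2}\sum (d - y)^2\), the activation is a sigmoid, and the network size, learning rate, initial weights, and sample are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, W1, b1, W2, b2):
    """W1[i][j] links input i to hidden neuron j; W2[j][l] links hidden j to output l."""
    hidden = [sigmoid(sum(W1[i][j] * x[i] for i in range(len(x))) + b1[j])
              for j in range(len(b1))]
    output = [sigmoid(sum(W2[j][l] * hidden[j] for j in range(len(hidden))) + b2[l])
              for l in range(len(b2))]
    return hidden, output


def train_step(x, d, W1, b1, W2, b2, lr=0.5):
    """One backpropagation step; returns the error measured before the update."""
    hidden, output = forward(x, W1, b1, W2, b2)
    # Output local error: (d - y) times the sigmoid derivative y * (1 - y).
    delta_out = [(d[l] - output[l]) * output[l] * (1 - output[l])
                 for l in range(len(output))]
    # Hidden local error: back-propagate the output errors through W2.
    delta_hid = [hidden[j] * (1 - hidden[j])
                 * sum(delta_out[l] * W2[j][l] for l in range(len(output)))
                 for j in range(len(hidden))]
    # Update weights and biases using the local errors.
    for j in range(len(hidden)):
        for l in range(len(output)):
            W2[j][l] += lr * delta_out[l] * hidden[j]
    for l in range(len(output)):
        b2[l] += lr * delta_out[l]
    for i in range(len(x)):
        for j in range(len(hidden)):
            W1[i][j] += lr * delta_hid[j] * x[i]
    for j in range(len(hidden)):
        b1[j] += lr * delta_hid[j]
    return 0.5 * sum((d[l] - output[l]) ** 2 for l in range(len(output)))


rng = random.Random(0)
W1 = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]  # 2 inputs, 3 hidden
b1 = [0.0, 0.0, 0.0]
W2 = [[rng.uniform(-0.5, 0.5)] for _ in range(3)]                    # 3 hidden, 1 output
b2 = [0.0]

sample, label = [1.0, 0.0], [1.0]
errors = [train_step(sample, label, W1, b1, W2, b2) for _ in range(200)]
print(errors[0], errors[-1])  # the error shrinks as training proceeds
```

Because the local errors are computed before any weight is changed, the hidden-layer errors use the same `W2` that produced the forward pass, which is what the chain rule requires.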

Improving performance of CNN

Based on our experiments in different DL applications [ 86 , 87 , 88 ], we can conclude that the most effective solutions that may improve the performance of CNN are:

Expand the dataset with data augmentation or use transfer learning (explained in later sections).

Increase the training time.

Increase the depth (or width) of the model.

Add regularization.

Increase hyperparameter tuning.

CNN architectures

Over the last 10 years, several CNN architectures have been presented [ 21 , 26 ]. Model architecture is a critical factor in improving the performance of different applications. Various modifications have been made to CNN architecture from 1989 until today. Such modifications include structural reformulation, regularization, parameter optimizations, etc. It should be noted, however, that the key upgrade in CNN performance occurred largely due to the processing-unit reorganization, as well as the development of novel blocks. In particular, the most novel developments in CNN architectures were based on the use of network depth. In this section, we review the most popular CNN architectures, beginning from the AlexNet model in 2012 and ending at the High-Resolution (HR) model in 2020. Studying the features of these architectures (such as input size, depth, and robustness) is the key to helping researchers choose a suitable architecture for their target task. Table 2 presents a brief overview of CNN architectures.

The history of deep CNNs began with the appearance of LeNet [ 89 ] (Fig. 14 ). At that time, the CNNs were restricted to handwritten digit recognition tasks and could not be scaled to all image classes. In deep CNN architecture, AlexNet is highly respected [ 30 ], as it achieved innovative results in the fields of image recognition and classification. Krizhevsky et al. [ 30 ] first proposed AlexNet and improved the CNN learning ability by increasing its depth and implementing several parameter optimization strategies. Figure 15 illustrates the basic design of the AlexNet architecture.

figure 14

The architecture of LeNet

figure 15

The architecture of AlexNet

The learning ability of the deep CNN was limited at this time due to hardware restrictions. To overcome these hardware limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, in order to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Although depth enhances generalization for several image resolutions, overfitting was the main drawback related to the depth. Krizhevsky et al. used Hinton’s idea to address this problem [ 90 , 91 ]: to ensure that the learned features were more robust, their algorithm randomly skips several transformational units throughout the training stage. Moreover, ReLU [ 92 ] was utilized as a non-saturating activation function to enhance the rate of convergence by reducing the vanishing gradient problem [ 93 ]. Local response normalization and overlapping subsampling were also performed to enhance the generalization by decreasing the overfitting. To improve on the performance of previous networks, other modifications were made by using large-size filters \((5\times 5 \; \text{and}\; 11 \times 11)\) in the earlier layers. AlexNet has considerable significance in the recent CNN generations, and it began an innovative research era in CNN applications.

Network-in-network

This network model, which has some slight differences from the preceding models, introduced two innovative concepts [ 94 ]. The first was employing multilayer perceptron convolution. These convolutions are executed using a \(1\times 1\) filter, which supports the addition of extra nonlinearity in the networks. Moreover, this supports enlarging the network depth, which may later be regularized using dropout. For DL models, this idea is frequently employed in the bottleneck layer. The second novel concept is the use of GAP as a substitute for an FC layer, which enables a significant reduction in the number of model parameters. In addition, GAP considerably changes the network architecture. When GAP is applied to a large feature map, it is possible to generate a final low-dimensional feature vector with no reduction in the feature map dimensions [ 95 , 96 ]. Figure 16 shows the structure of the network.

figure 16

The architecture of network-in-network

Before 2013, the CNN learning mechanism was basically constructed on a trial-and-error basis, which precluded an understanding of the precise reason behind any improvement. This issue restricted the deep CNN performance on complex images. In response, Zeiler and Fergus introduced DeconvNet (a multilayer de-convolutional neural network) in 2013 [ 97 ]. This method later became known as ZefNet, which was developed in order to quantitatively visualize the network. The purpose of the network activity visualization was to monitor the CNN performance by understanding the neuron activation. Erhan et al. utilized this same concept to optimize deep belief network (DBN) performance by visualizing the features of the hidden layers [ 98 ]. In addition, Le et al. assessed the deep unsupervised auto-encoder (AE) performance by visualizing the image classes created by the output neurons [ 99 ]. DeconvNet operates like a forward-pass CNN, but with the order of the convolutional and pooling operations reversed. Reverse mapping of this kind projects the convolutional layer output backward to create visually observable image shapes, which in turn give a neural interpretation of the internal feature representation learned at each layer [ 100 ]. Monitoring the learning schematic through the training stage was the key concept underlying ZefNet. In addition, it utilized the outcomes to recognize capacity issues in the model. This concept was experimentally proven on AlexNet by applying DeconvNet. This indicated that only certain neurons were working, while the others were out of action in the first two layers of the network. Furthermore, it indicated that the features extracted via the second layer contained aliasing artifacts. Thus, Zeiler and Fergus changed the CNN topology in light of these outcomes. In addition, they executed parameter optimization, and also improved the CNN learning by decreasing the stride and the filter sizes in order to retain all features of the initial two convolutional layers. An improvement in performance was accordingly achieved due to this rearrangement in CNN topology. This rearrangement suggested that the visualization of the features could be employed to identify design weaknesses and conduct appropriate parameter alteration. Figure 17 shows the structure of the network.

figure 17

The architecture of ZefNet

Visual geometry group (VGG)

After CNN was determined to be effective in the field of image recognition, an easy and efficient design principle for CNN was proposed by Simonyan and Zisserman. This innovative design was called Visual Geometry Group (VGG). A multilayer model [ 101 ], it was made up to nineteen layers deep, deeper than ZefNet [ 97 ] and AlexNet [ 30 ], in order to study how depth relates to the representational capacity of the network. In the 2013-ILSVRC competition, ZefNet was the frontier network, and it suggested that filters with small sizes could enhance the CNN performance. With reference to these results, VGG used a stack of \(3\times 3\) filters rather than the \(5\times 5\) and \(11\times 11\) filters in ZefNet. This showed experimentally that stacking these small-size filters could produce the same effect as the large-size filters. In other words, these small-size filters made the receptive field similarly effective to that of the large-size filters \((7 \times 7 \; \text{and}\; 5 \times 5)\) . By decreasing the number of parameters, the use of small-size filters provided the extra advantage of reducing computational complexity. These outcomes established a novel research trend for working with small-size filters in CNN. In addition, by inserting \(1\times 1\) convolutions in the middle of the convolutional layers, VGG regulates the network complexity; these convolutions learn a linear combination of the resulting feature maps. With respect to network tuning, a max pooling layer [ 102 ] is inserted following the convolutional layer, while padding is implemented to maintain the spatial resolution. In general, VGG obtained significant results for localization problems and image classification. While it did not achieve first place in the 2014-ILSVRC competition, it acquired a reputation due to its enlarged depth, homogenous topology, and simplicity. However, VGG’s computational cost was excessive due to its utilization of around 140 million parameters, which represented its main shortcoming. Figure 18 shows the structure of the network.

figure 18

The architecture of VGG
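The claim that stacked \(3\times 3\) filters cover the same receptive field as one large filter with fewer parameters can be checked with a short calculation. The sketch assumes a stride of one, the same number of input and output channels in every layer, and no bias terms; the function names and the channel count of 64 are hypothetical.

```python
def stacked_receptive_field(kernel_size, num_layers):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weights in `num_layers` conv layers with `channels` inputs and outputs each."""
    return num_layers * kernel_size * kernel_size * channels * channels


channels = 64  # hypothetical channel count

print(stacked_receptive_field(3, 2))  # 5, the same as a single 5x5 filter
print(stacked_receptive_field(3, 3))  # 7, the same as a single 7x7 filter
print(conv_weights(3, channels, num_layers=2), conv_weights(5, channels))  # 73728 102400
print(conv_weights(3, channels, num_layers=3), conv_weights(7, channels))  # 110592 200704
```

Under these assumptions, two \(3\times 3\) layers need 18 weights per channel pair instead of 25, and three need 27 instead of 49, while also adding extra non-linearities between the layers.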

In the 2014-ILSVRC competition, GoogLeNet (also called Inception-V1) emerged as the winner [ 103 ]. The core aim of the GoogLeNet architecture is to achieve high accuracy at reduced computational cost. It introduced the inception block (module) concept to CNNs, which combines multiple-scale convolutional transformations by employing split, transform, and merge functions for feature extraction. Figure 19 illustrates the inception block architecture. This architecture incorporates filters of different sizes ( \(5\times 5, 3\times 3, \; \text{and} \; 1\times 1\) ) to capture channel information together with spatial information at diverse ranges of spatial resolution. GoogLeNet replaces the common convolutional layer with small blocks, following the network-in-network (NIN) architecture [ 94 ], which replaced each layer with a micro-neural network. The split, transform, and merge concepts help to address the problem of learning the different types of variation that exist within the same image class. The motivation of GoogLeNet was to improve the efficiency of CNN parameters, as well as to enhance the learning capacity. In addition, it regulates the computation by inserting a \(1\times 1\) convolutional filter, as a bottleneck layer, ahead of the large-size kernels. GoogLeNet employed sparse connections to overcome the redundant information problem and decreased the cost by neglecting irrelevant channels; that is, only some of the input channels are connected to some of the output channels. By employing a GAP layer as the end layer, rather than a FC layer, the density of connections was decreased. These design choices significantly decreased the number of parameters, from 40 to 5 million. Additional regularization measures included the use of RmsProp as the optimizer and batch normalization [ 104 ]. Furthermore, GoogLeNet proposed the idea of auxiliary learners to speed up the rate of convergence. The main shortcoming of GoogLeNet was its heterogeneous topology, which requires adaptation from one module to another. Another shortcoming is the representation jam, which substantially decreases the feature space in the following layer and in turn occasionally leads to the loss of valuable information.
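The saving obtained from the \(1\times 1\) bottleneck can be illustrated with a simple operation count. The following is a minimal sketch; the feature map size and channel counts are hypothetical values chosen only for illustration and are not taken from the GoogLeNet specification.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k convolution (stride 1, 'same' padding)."""
    return h * w * c_in * c_out * k * k

# Hypothetical sizes, for illustration only.
H = W = 28
C_IN, C_MID, C_OUT = 192, 16, 32

# A 5x5 convolution applied directly to all input channels.
direct = conv_macs(H, W, C_IN, C_OUT, 5)

# A 1x1 bottleneck first reduces the channels, then the 5x5 convolution runs.
bottleneck = conv_macs(H, W, C_IN, C_MID, 1) + conv_macs(H, W, C_MID, C_OUT, 5)

print(direct, bottleneck, round(direct / bottleneck, 1))
```

Under these assumed sizes the bottleneck variant needs roughly one tenth of the multiply-accumulates of the direct \(5\times 5\) convolution.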

figure 19

The basic structure of Google Block

Highway network

Increasing the network depth enhances its performance, mainly for complicated tasks. However, it also makes the network harder to train: with many layers, the back-propagated error gradients can become very small at the lower layers. In 2015, Srivastava et al. [ 105 ] suggested a novel CNN architecture, called Highway Network, to overcome this issue. This approach is based on the cross-connectivity concept. Unhindered information flow in the Highway Network is enabled by two gating units inside each layer. The gate mechanism was motivated by LSTM-based RNNs [ 106 , 107 ]. Information is aggregated by merging the information of the \((i-k)\text{th}\) layer with the next \(i\text{th}\) layer, which has a regularizing effect and makes gradient-based training of deeper networks much simpler. This enables the training of networks with more than 100 layers, and even of a network with 900 layers, using the SGD algorithm. A Highway Network with a depth of fifty layers showed a better rate of convergence than both thin and deep architectures [ 108 ]. By contrast, [ 69 ] empirically demonstrated that the performance of a plain network declines when more than ten hidden layers are inserted. It should be noted that even a Highway Network 900 layers in depth converges much more rapidly than the plain network.
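The gating idea can be sketched in a few lines. The example below assumes the commonly used formulation \(y = T(x)\cdot H(x) + (1 - T(x))\cdot x\), in which the carry gate is tied to \(1 - T(x)\); the tanh transform, the helper names, and the toy weights are illustrative assumptions rather than the exact configuration of [ 105 ].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x, with the carry gate tied to 1 - T."""
    h = [math.tanh(v) for v in dense(x, w_h, b_h)]  # candidate transform H(x)
    t = [sigmoid(v) for v in dense(x, w_t, b_t)]    # transform gate T(x)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

x = [0.5, -1.0]
w = [[1.0, 0.0], [0.0, 1.0]]  # toy identity weights

# A strongly negative gate bias closes the transform gate: the input is carried through.
y_closed = highway_layer(x, w, [0.0, 0.0], w, [-50.0, -50.0])
# A strongly positive gate bias opens it: the layer behaves like a plain tanh layer.
y_open = highway_layer(x, w, [0.0, 0.0], w, [50.0, 50.0])
```

With the gate closed the layer reduces to an identity mapping, which is what allows information and gradients to pass through very deep stacks unhindered.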

He et al. [ 37 ] developed ResNet (Residual Network), which was the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected previous networks. Several types of ResNet were developed based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common type was ResNet50, which comprised 49 convolutional layers plus a single FC layer. The overall number of network weights was 25.5 M, while the overall number of MACs was 3.9 G. The novel idea of ResNet is its use of the bypass pathway concept, which had been employed in Highway Nets in 2015 to address the problem of training deeper networks. Figure 20 shows the fundamental ResNet block diagram: a conventional feedforward network plus a residual connection. The input to the residual block is the output of the \((l-1)\text{th}\) layer, \(x_{l-1}\), delivered from the preceding layer. After executing different operations on \(x_{l-1}\) [such as convolution using variable-size filters, or batch normalization, followed by an activation function like ReLU], the output is \(F(x_{l-1})\). The final residual output is \(x_{l}\), which can be mathematically represented as in Eq. 18.

\(x_{l} = F(x_{l-1}) + x_{l-1} \qquad (18)\)

There are numerous basic residual blocks included in the residual network. Based on the type of the residual network architecture, operations in the residual block are also changed [ 37 ].
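The residual relation, in which the block output is the sum of the transformed input \(F(x_{l-1})\) and the unchanged input \(x_{l-1}\), can be sketched as follows. This is a minimal illustration in which \(F\) is assumed to be two small linear maps with a ReLU in between; real residual blocks use convolutions and batch normalization.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def dense(x, weights):
    """Linear map without bias: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def residual_block(x, w1, w2):
    """x_l = F(x_{l-1}) + x_{l-1}, with F = linear -> ReLU -> linear (assumed)."""
    f = dense(relu(dense(x, w1)), w2)
    return [fi + xi for fi, xi in zip(f, x)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]

print(residual_block([1.0, -2.0], zeros, zeros))  # F contributes nothing: input passes through
print(residual_block([1.0, -2.0], eye, eye))      # F adds relu(x) on top of x
```

When the weights of \(F\) are zero the block is an exact identity mapping, which illustrates why adding residual blocks should not make a network harder to optimize than its shallower counterpart.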

figure 20

The block diagram for ResNet

In comparison to the highway network, ResNet introduced shortcut connections within layers to enable cross-layer connectivity; these shortcuts are parameter-free and data-independent. Note that in the highway network the layers represent non-residual functions when a gated shortcut is closed. By contrast, the identity shortcuts in ResNet are never closed, so the residual information is always passed on. Furthermore, ResNet can mitigate the vanishing gradient problem, and the shortcut connections (residual links) accelerate the convergence of deep networks. ResNet was the winner of the 2015-ILSVRC championship with 152 layers of depth; this represents 8 times the depth of VGG and 20 times the depth of AlexNet. In comparison with VGG, it has lower computational complexity, even with enlarged depth.

Inception: ResNet and Inception-V3/4

Szegedy et al. [ 103 , 109 , 110 ] proposed Inception-ResNet and Inception-V3/4 as upgraded types of Inception-V1/2. The concept behind Inception-V3 was to minimize the computational cost without affecting the generalization of the deeper network. Thus, Szegedy et al. used asymmetric small-size filters ( \(1\times 7\) and \(1\times 5\) ) rather than large-size filters ( \( 7\times 7\) and \(5\times 5\) ); moreover, they utilized a bottleneck of \(1\times 1\) convolution prior to the large-size filters [ 110 ]. These changes make the operation of the traditional convolution very similar to cross-channel correlation. Previously, Lin et al. exploited the potential of the 1 × 1 filter in the NIN architecture [ 94 ]. Subsequently, [ 110 ] utilized the same idea in an intelligent manner. By using the \(1\times 1\) convolutional operation in Inception-V3, the input data are mapped into three or four isolated spaces, which are smaller than the initial input space. Next, all correlations are mapped in these smaller spaces through common \(5\times 5\) or \(3\times 3\) convolutions. By contrast, in Inception-ResNet, Szegedy et al. bring together the inception block and the power of residual learning by replacing the filter concatenation with the residual connection [ 111 ]. Szegedy et al. empirically demonstrated that Inception-ResNet (Inception-V4 with residual connections) can achieve a similar generalization power to Inception-V4 with enlarged width and depth and without residual connections. Thus, it is clearly illustrated that using residual connections significantly accelerates the training of the Inception network. Figure 21 shows the basic block diagram for the Inception Residual unit.
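The benefit of asymmetric filters can be seen from a weight count. The sketch below assumes that a \(7\times 7\) filter is replaced by a \(1\times 7\) filter followed by a \(7\times 1\) filter, with a hypothetical channel count of 64 kept constant throughout; biases are ignored.

```python
def conv_weights(k_h, k_w, c_in, c_out):
    """Number of weights in a k_h x k_w convolution (biases ignored)."""
    return k_h * k_w * c_in * c_out

C = 64  # hypothetical channel count, for illustration only

full_7x7 = conv_weights(7, 7, C, C)
factored_7 = conv_weights(1, 7, C, C) + conv_weights(7, 1, C, C)

print(full_7x7, factored_7, full_7x7 / factored_7)
```

For an \(n\times n\) filter the factorized pair needs \(2n\) instead of \(n^{2}\) weights per channel pair, so the saving grows with the filter size.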

figure 21

The basic block diagram for Inception Residual unit

To solve the problem of the vanishing gradient, DenseNet was presented, following the same direction as ResNet and the Highway network [ 105 , 111 , 112 ]. One of the drawbacks of ResNet is that it explicitly preserves information through additive identity transformations, even though several layers contribute very little or no information. In addition, ResNet has a large number of weights, since each layer has its own separate set of weights. DenseNet employed cross-layer connectivity in an improved way to address this problem [ 112 , 113 , 114 ]. It connected each layer to all subsequent layers in the network in a feed-forward fashion, so that the feature maps of every preceding layer serve as input to all of the following layers. A traditional CNN with l layers has l connections, one between each layer and the next, while DenseNet has \(\frac{l(l+1)}{2}\) direct connections. DenseNet demonstrates the influence of cross-layer depth-wise convolutions. Because DenseNet concatenates the features of the preceding layers rather than adding them, the network gains the ability to discriminate clearly between the added and the preserved information. However, due to its narrow layer structure, DenseNet becomes parametrically expensive as the number of feature maps increases. The direct access of all layers to the gradients via the loss function enhances the information flow across the network. In addition, this has a regularizing effect, which reduces overfitting on tasks with small training sets. Figure 22 shows the architecture of DenseNet Network.
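The connection count and the effect of concatenation can be made concrete with a short sketch. The `growth_rate` parameter (the number of feature maps each layer adds) is a term from the DenseNet paper that is not defined in this section, and the numeric values are hypothetical.

```python
def dense_connections(num_layers):
    """Direct connections in a dense block with `num_layers` layers: l(l+1)/2."""
    return num_layers * (num_layers + 1) // 2

def input_channels(c0, growth_rate, layer_index):
    """Channels seen by layer `layer_index` (1-based) when all earlier
    outputs are concatenated rather than added."""
    return c0 + growth_rate * (layer_index - 1)

print(dense_connections(5))       # compared with 5 connections in a plain chain
print(input_channels(64, 32, 6))  # input width of the 6th layer under the assumed sizes
```

Since the input width grows linearly with the layer index, the number of feature maps that later layers must process increases steadily, which is the source of the cost noted above.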

figure 22

(adopted from [ 112 ])

The architecture of DenseNet Network

ResNext is an enhanced version of the Inception Network [ 115 ]. It is also known as the Aggregated Residual Transform Network. Cardinality, a new term introduced by [ 115 ], exploits the split, transform, and merge topology in a simple and effective way. It denotes the size of the transformation set as an extra dimension [ 116 , 117 , 118 ]. The Inception network manages network resources more efficiently and enhances the learning ability of the conventional CNN. However, its transformation branches use different spatial embeddings (e.g. \(5\times 5\) , \(3\times 3\) , and \(1\times 1\) ), so each layer must be customized separately. By contrast, ResNext derives its characteristic features from ResNet, VGG, and Inception. It combined the deep homogeneous topology of VGG with the basic architecture of GoogLeNet by fixing the spatial resolution to \(3\times 3\) filters inside the split, transform, and merge blocks. Figure 23 shows the ResNext building blocks. ResNext utilized multiple transformations inside the split, transform, and merge blocks and defined these transformations in terms of cardinality. As Xie et al. showed, the performance is significantly improved by increasing the cardinality. The complexity of ResNext was regulated by employing \(1\times 1\) filters (low embeddings) ahead of a \(3\times 3\) convolution. In addition, skip connections are used for optimized training [ 115 ].
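How cardinality affects the block size can be sketched by counting weights in a \(1\times 1\), \(3\times 3\), \(1\times 1\) block in which the \(3\times 3\) stage is split into groups. The channel sizes below (256 input and output channels, cardinality 32) are assumptions taken from the configuration commonly reported for ResNext and are used here only for illustration; biases are ignored.

```python
def bottleneck_weights(c_in, width, c_out, groups=1):
    """Weights in a 1x1 -> 3x3 (grouped) -> 1x1 block, biases ignored."""
    reduce = c_in * width                        # 1x1 low-dimensional embedding
    grouped = 3 * 3 * (width // groups) * width  # each group sees width/groups inputs
    expand = width * c_out                       # 1x1 merge back to c_out channels
    return reduce + grouped + expand

# Single transformation path (cardinality 1) with a width of 64.
resnet_like = bottleneck_weights(256, 64, 256, groups=1)
# 32 parallel paths of width 4 each (cardinality 32, total width 128).
resnext_like = bottleneck_weights(256, 128, 256, groups=32)

print(resnet_like, resnext_like)
```

Under these assumptions the two blocks have nearly the same number of weights, so increasing the cardinality changes how the capacity is organized rather than how much of it there is.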

figure 23

The basic block diagram for the ResNext building blocks

The feature reuse problem is the core shortcoming of deep residual networks, since certain feature blocks or transformations contribute very little to learning. Zagoruyko and Komodakis [ 119 ] accordingly proposed WideResNet to address this problem. These authors argued that depth has only a supplemental influence, while the residual units convey the core learning ability of deep residual networks. WideResNet exploited the power of the residual block by making the ResNet wider instead of deeper [ 37 ]. It enlarged the width by introducing an extra factor, k, which controls the network width. In other words, it showed that widening the layers is a more effective way of improving performance than deepening the residual network. While deep residual networks achieve enhanced representational capacity, they also have certain drawbacks, such as the exploding and vanishing gradient problems, the feature reuse problem (inactivation of several feature maps), and time-intensive training. He et al. [ 37 ] tackled the feature reuse problem by including a dropout in each residual block to regularize the network efficiently. In a similar manner, utilizing dropouts, Huang et al. [ 120 ] presented the stochastic depth concept to solve the slow learning and vanishing gradient problems. Earlier research focused on increasing the depth; thus, any small enhancement in performance required the addition of several new layers. An experimental comparison showed that WideResNet has twice as many parameters as ResNet. Nevertheless, WideResNet can be trained more effectively than deep networks [ 119 ]. Note that most architectures prior to residual networks (including the highly effective VGG and Inception) were wider than ResNet, which motivated the development of wider residual networks. However, inserting a dropout between the convolutional layers (as opposed to within the residual block) made the learning more effective in WideResNet [ 121 , 122 ].
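The role of the widening factor k can be illustrated by counting the weights of a single residual block. The sketch assumes a block of two \(3\times 3\) convolutions whose width is a hypothetical base width of 16 multiplied by k; biases and batch normalization parameters are ignored.

```python
def conv3x3_weights(c_in, c_out):
    return 9 * c_in * c_out

def wide_block_weights(base_width, k):
    """Two 3x3 convolutions of width base_width * k (biases ignored)."""
    width = base_width * k
    return 2 * conv3x3_weights(width, width)

for k in (1, 2, 4):
    print(k, wide_block_weights(16, k))
```

Per block, the weight count grows with the square of k, so a wide network can match the capacity of a much deeper one with far fewer layers. The total parameter count of a complete network also depends on how many blocks are kept.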

Pyramidal Net

In previous deep CNN architectures such as ResNet, VGG, and AlexNet, the depth of the feature map increases in succeeding layers due to the deep stacking of multiple convolutional layers. By contrast, the spatial dimension reduces, since a sub-sampling follows each convolutional layer. Thus, the richer feature representation is paid for by a decrease in the size of the feature map. The extreme expansion in the depth of the feature map, alongside the loss of spatial information, interferes with the learning ability of deep CNNs. ResNet obtained notable results in image classification. However, deleting a convolutional block in which both the number of channels and the spatial dimensions vary (channel depth enlarges, while spatial dimension reduces) commonly results in decreased classifier performance. Accordingly, the stochastic ResNet enhanced the performance by decreasing the information loss that accompanies dropping a residual unit. Han et al. [ 123 ] proposed Pyramidal Net to address the ResNet learning interference problem. To counter the sharp increase in depth and the extreme reduction in spatial width in ResNet, Pyramidal Net slowly enlarges the width of the residual units across all positions, rather than keeping the same dimension inside all residual blocks until down-sampling occurs. It was referred to as Pyramidal Net due to the gradual enlargement of the feature map depth in a top-down manner. The depth of the feature map of the l th residual unit, \(d_{l}\), is determined by Eq. 19 .

Here, \(d_{l}\) denotes the dimension of the l th residual unit, n is the overall number of residual units, and \(\lambda \) is the step factor. The depth increase is regulated by the factor \(\frac{\lambda }{n}\) , which distributes the increase uniformly across the feature map dimensions. The residual connections between the layers are inserted using zero-padded identity mapping. In comparison to projection-based shortcut connections, zero-padded identity mapping requires fewer parameters, which in turn leads to enhanced generalization [ 124 ]. Pyramidal Nets use two different approaches for network widening: multiplication-based and addition-based. More specifically, the first approach (multiplication) enlarges the width geometrically, while the second one (addition) enlarges it linearly [ 92 ]. The main problem associated with the width enlargement is that the time and space requirements grow quadratically.
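The two widening schedules can be compared with a small sketch. It assumes that the additive rule increases the width by \(\frac{\lambda }{n}\) per residual unit and floors the result when a layer is built, that the first unit has 16 feature maps, and that the multiplicative rule is scaled to reach the same final width. The starting width, the function names, and the example values are assumptions made for illustration.

```python
import math

def additive_widths(n_units, step, d1=16):
    """Linear widening: every unit adds step / n_units feature maps."""
    return [math.floor(d1 + step * i / n_units) for i in range(n_units + 1)]

def multiplicative_widths(n_units, step, d1=16):
    """Geometric widening that reaches the same final width d1 + step."""
    ratio = ((d1 + step) / d1) ** (1.0 / n_units)
    return [round(d1 * ratio ** i) for i in range(n_units + 1)]

print(additive_widths(4, 48))        # constant increment per unit
print(multiplicative_widths(4, 48))  # constant growth ratio per unit
```

Both schedules start and end at the same width; the additive one spends more feature maps in the early units, whereas the multiplicative one concentrates them near the output.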

Xception can be regarded as an extreme Inception architecture. Its main idea is the depthwise separable convolution [ 125 ]. The Xception model modified the original inception block by making it wider and replacing the differently sized spatial filters with a single dimension ( \(3 \times 3\) ) followed by a \(1 \times 1\) convolution to reduce computational complexity. Figure 24 shows the Xception block architecture. The Xception network becomes more computationally efficient by decoupling channel and spatial correspondence. It first maps the convolved output to a low-dimensional embedding by applying \(1 \times 1\) convolutions. It then performs k spatial transformations, where k is the width-defining cardinality, given by the number of transformations in Xception. The computations are simplified by convolving each channel separately along the spatial axes. The results are subsequently passed to \(1 \times 1\) convolutions (pointwise convolutions), which perform the cross-channel correspondence. The \(1 \times 1\) convolution is utilized in Xception to regulate the channel depth. Xception utilizes a number of transformation segments equal to the number of channels, whereas Inception utilizes three transformation segments and a traditional CNN architecture utilizes only a single transformation segment. The Xception transformation approach achieves better learning efficiency and performance, but it does not reduce the number of parameters [ 126 , 127 ].
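The decoupling of spatial and channel correspondence can be illustrated at the level of a single layer by comparing weight counts. The sketch below uses hypothetical channel sizes and ignores biases; it describes one layer only and says nothing about the size of the complete Xception model, which uses the saved capacity elsewhere.

```python
def standard_conv_weights(k, c_in, c_out):
    """A k x k convolution that mixes space and channels jointly."""
    return k * k * c_in * c_out

def depthwise_separable_weights(k, c_in, c_out):
    depthwise = k * k * c_in   # one k x k spatial filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution for cross-channel mixing
    return depthwise + pointwise

# Hypothetical layer: 3x3 kernel, 128 input channels, 256 output channels.
print(standard_conv_weights(3, 128, 256))
print(depthwise_separable_weights(3, 128, 256))
```

For the assumed sizes the separable layer needs roughly one ninth of the weights of the standard layer, with the pointwise convolution accounting for almost all of them.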

figure 24

The basic block diagram for the Xception block architecture

Residual attention neural network

To improve the network feature representation, Wang et al. [ 128 ] proposed the Residual Attention Network (RAN). Enabling the network to learn aware features of the object is the main purpose of incorporating attention into the CNN. The RAN consists of stacked residual blocks in addition to the attention module; hence, it is a feed-forward CNN. However, the attention module is divided into two branches, namely the mask branch and trunk branch. These branches adopt a top-down and bottom-up learning strategy respectively. Encapsulating two different strategies in the attention model supports top-down attention feedback and fast feed-forward processing in only one particular feed-forward process. More specifically, the top-down architecture generates dense features to make inferences about every aspect. Moreover, the bottom-up feedforward architecture generates low-resolution feature maps in addition to robust semantic information. Restricted Boltzmann machines employed a top-down bottom-up strategy as in previously proposed studies [ 129 ]. During the training reconstruction phase, Goh et al. [ 130 ] used the mechanism of top-down attention in deep Boltzmann machines (DBMs) as a regularizing factor. Note that the network can be globally optimized using a top-down learning strategy in a similar manner, where the maps progressively output to the input throughout the learning process [ 129 , 130 , 131 , 132 ].

The transformation network introduced in a previous study [ 133 ] incorporated the attention concept into convolutional blocks in a simple way. Unfortunately, its attention modules are inflexible, which is the main problem, and they cannot adapt to varying surroundings. By contrast, stacking multiple attention modules has made RAN very effective at recognizing noisy, complex, and cluttered images. RAN’s hierarchical organization gives it the capability to adaptively allocate a weight to every feature map depending on its importance within the layers. Furthermore, incorporating three distinct levels of attention (spatial, channel, and mixed) enables the model to capture object-aware features at these distinct levels.

Convolutional block attention module

The importance of feature map utilization and the attention mechanism has been demonstrated by SE-Network and RAN [ 128 , 134 , 135 ]. The convolutional block attention module (CBAM), a novel attention-based CNN module, was first developed by Woo et al. [ 136 ]. This module is similar to SE-Network and simple in design. SE-Network disregards the object’s spatial locality in the image and considers only the channels’ contribution during image classification. For object detection, however, the spatial location of the object plays a significant role. CBAM infers the attention maps sequentially: it applies channel attention before spatial attention to obtain the refined feature maps. Spatial attention is performed using 1 × 1 convolution and pooling functions, as in the literature. An effective feature descriptor can be generated by pooling features along the spatial axis. In addition, CBAM produces a robust spatial attention map by concatenating the max pooling and average pooling operations. In a similar manner, a combination of GAP and max pooling operations is used to model the feature map statistics. Woo et al. [ 136 ] demonstrated that using GAP alone yields a sub-optimal inference of channel attention, whereas max pooling provides an indication of the distinguishing object features. Thus, using both max pooling and average pooling enhances the network’s representational power. The refined feature maps improve the representational power and help the network focus on the significant portion of the chosen features. Expressing 3D attention maps through a serial learning procedure helps to decrease the computational cost and the number of parameters, as Woo et al. [ 136 ] experimentally showed. Note that CBAM can be easily integrated with any CNN architecture.
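The sequential order of the two attention stages and the pooled descriptors they rely on can be sketched as follows. This is a deliberately simplified illustration: the learned layers of CBAM (the shared network applied to the channel descriptors and the convolution applied to the spatial descriptors) are replaced by a plain sum followed by a sigmoid, which is an assumption made only to keep the example short. Feature maps are nested lists indexed as [channel][row][column].

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def channel_descriptors(fmap):
    """Per-channel average and max over all spatial positions."""
    flat = [[v for row in ch for v in row] for ch in fmap]
    return [sum(c) / len(c) for c in flat], [max(c) for c in flat]

def spatial_descriptors(fmap):
    """Per-position average and max over all channels."""
    h, w = len(fmap[0]), len(fmap[0][0])
    cols = [[[ch[i][j] for ch in fmap] for j in range(w)] for i in range(h)]
    avg = [[sum(p) / len(p) for p in row] for row in cols]
    mx = [[max(p) for p in row] for row in cols]
    return avg, mx

def cbam_like(fmap):
    """Channel attention first, then spatial attention (learned layers omitted)."""
    c_avg, c_max = channel_descriptors(fmap)
    c_att = [sigmoid(a + m) for a, m in zip(c_avg, c_max)]
    refined = [[[v * g for v in row] for row in ch] for ch, g in zip(fmap, c_att)]
    s_avg, s_max = spatial_descriptors(refined)
    return [[[v * sigmoid(s_avg[i][j] + s_max[i][j]) for j, v in enumerate(row)]
             for i, row in enumerate(ch)] for ch in refined]

fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
out = cbam_like(fmap)
```

The output has the same shape as the input, which is why such a module can be inserted after any convolutional block without changing the rest of the network.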

Concurrent spatial and channel excitation mechanism

To make the work valid for segmentation tasks, Roy et al. [ 137 , 138 ] extended the work of Hu et al. [ 134 ] by adding the influence of spatial information to the channel information. Roy et al. [ 137 , 138 ] presented three types of modules: (1) concurrent spatial and channel squeeze and excitation (scSE); (2) exciting spatially and squeezing channel-wise (sSE); (3) exciting channel-wise and squeezing spatially (cSE). For segmentation purposes, they employed auto-encoder-based CNNs. In addition, they suggested inserting the modules after the encoder and decoder layers. In the first module (scSE), they allocated attention to every channel by deriving a scaling factor from both the channel and the spatial information, in order to highlight the object-specific feature maps. In the second module (sSE), the feature map information has lower importance than the spatial locality, as the spatial information plays a significant role during the segmentation process. Therefore, several channel collections are spatially divided and developed so that they can be employed in segmentation. In the final module (cSE), a concept similar to the SE-block is used. Furthermore, the scaling factor is derived based on the contribution of the feature maps to the object detection [ 137 , 138 ].

CNN is an efficient technique for detecting object features and achieving well-behaved recognition performance in comparison with innovative handcrafted feature detectors. A number of restrictions related to CNN are present, meaning that the CNN does not consider certain relations, orientation, size, and perspectives of features. For instance, when considering a face image, the CNN does not count the various face components (such as mouth, eyes, nose, etc.) positions, and will incorrectly activate the CNN neurons and recognize the face without taking specific relations (such as size, orientation etc.) into account. At this point, consider a neuron that has probability in addition to feature properties such as size, orientation, perspective, etc. A specific neuron/capsule of this type has the ability to effectively detect the face along with different types of information. Thus, many layers of capsule nodes are used to construct the capsule network. An encoding unit, which contains three layers of capsule nodes, forms the CapsuleNet or CapsNet (the initial version of the capsule networks).

For example, for MNIST the input consists of \(28\times 28\) images, to which the first convolution layer applies 256 filters of size \(9\times 9\) with stride 1. The output therefore has a spatial size of \(28-9+1=20\) , with 256 feature maps. Next, these outputs are input to the first capsule layer, which is in fact a modified convolution layer that produces an 8D vector rather than a scalar. Note that a stride of 2 with \(9\times 9\) filters is employed in this capsule layer. Thus, the dimension of the output is \(\lfloor (20-9)/2 \rfloor +1=6\) . The initial capsules employ \(8\times 32\) filters, which generate 32 × 8 × 6 × 6 outputs (32 groups, 8 neurons per capsule, and a 6 × 6 spatial grid).
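The size arithmetic above follows the usual output-size rule for an unpadded convolution, which can be checked with a few lines. The helper name is illustrative; the layer settings are those quoted in the text.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - k) // stride + 1

conv1 = conv_output_size(28, 9, stride=1)        # 20
primary = conv_output_size(conv1, 9, stride=2)   # 6
num_capsules = 32 * primary * primary            # 1152 capsules, each an 8D vector

print(conv1, primary, num_capsules)
```

The integer division makes the floor explicit: \((20-9)/2 = 5.5\) is rounded down to 5 before 1 is added.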

Figure  25 represents the complete CapsNet encoding and decoding processes. In the CNN context, a max-pooling layer is frequently employed to handle the translation change. It can detect the feature moves in the event that the feature is still within the max-pooling window. This approach has the ability to detect the overlapped features; this is highly significant in detection and segmentation operations, since the capsule involves the weighted features sum from the preceding layer.

figure 25

The complete CapsNet encoding and decoding processes

In conventional CNNs, a particular cost function is employed to evaluate the global error, which is propagated backward throughout the training process. In such cases, the activation of a neuron will not propagate further once the weight between two neurons becomes zero. In iterative dynamic routing with agreement, by contrast, the signal is routed based on the feature parameters instead of a single size being provided with the complete cost function. Sabour et al. [ 139 ] provide more details about this architecture. When using MNIST to recognize handwritten digits, this innovative CNN architecture gives superior accuracy. From the application perspective, this architecture is more suitable for segmentation and detection approaches than for classification approaches [ 140 , 141 , 142 ].

High-resolution network (HRNet)

High-resolution representations are necessary for position-sensitive vision tasks, such as semantic segmentation, object detection, and human pose estimation. In current state-of-the-art frameworks, the input image is encoded as a low-resolution representation using a subnetwork that is constructed as a connected series of high-to-low resolution convolutions, such as VGGNet and ResNet. The low-resolution representation is then recovered to become a high-resolution one. Alternatively, high-resolution representations are maintained during the entire process using a novel network, referred to as a High-Resolution Network (HRNet) [ 143 , 144 ]. This network has two principal features. First, the high-to-low resolution convolution series are connected in parallel. Second, information is repeatedly exchanged across the resolutions. The resulting representation is more accurate in the spatial domain and richer in the semantic domain. Moreover, HRNet has several applications in the fields of object detection, semantic segmentation, and human pose prediction. For computer vision problems, HRNet represents a more robust backbone. Figure 26 illustrates the general architecture of HRNet.

figure 26

The general architecture of HRNet

Challenges (limitations) of deep learning and alternate solutions

Several difficulties commonly arise when employing DL. The most challenging ones are listed next, together with several possible solutions.

Training data

DL is extremely data-hungry, since it also involves representation learning [ 145 , 146 ]. DL demands an extensively large amount of data to achieve a well-performing model; i.e., as the amount of data increases, the performance of the model improves (Fig. 27 ). In most cases, the available data are sufficient to obtain a good performance model. However, sometimes there is a shortage of data for using DL directly [ 87 ]. Three methods have been suggested to address this issue. The first is to employ the transfer-learning concept after data is collected from similar tasks. Note that while the transferred data will not directly augment the actual data, it will help to enhance both the original input representation of the data and its mapping function [ 147 ]. In this way, the model performance is boosted. Another technique involves employing a well-trained model from a similar task and fine-tuning its last two layers, or even only its last layer, based on the limited original data. Refer to [ 148 , 149 ] for a review of different transfer-learning techniques applied in the DL approach. The second method is data augmentation [ 150 ]. It is very helpful for augmenting image data, since image translation, mirroring, and rotation commonly do not change the image label. However, care is needed when applying this technique to some data types, such as bioinformatics data. For instance, when mirroring an enzyme sequence, the output data may not represent the actual enzyme sequence. The third method is to use simulated data to increase the volume of the training set. If the issue is well understood, it is occasionally possible to create simulators based on the physical process, which can then generate as much data as needed. An example of meeting the data requirements of DL through simulation is given in Ref. [ 151 ].

figure 27

The performance of DL regarding the amount of data

  • Transfer learning

Recent research has revealed a widespread use of deep CNNs, which offer ground-breaking support for answering many classification problems. Generally speaking, deep CNN models require a sizable volume of data to obtain good performance. The common challenge associated with using such models concerns the lack of training data. Indeed, gathering a large volume of data is an exhausting job, and no fully successful solution is available at this time. The undersized dataset problem is therefore currently solved using the TL technique [ 148 , 149 ], which is highly efficient in addressing the lack of training data. The mechanism of TL involves first training the CNN model with large volumes of data. In the next step, the model is fine-tuned on a small target dataset.

The student-teacher relationship is a suitable analogy for clarifying TL. The first step is for the teacher to gather detailed knowledge of the subject [ 152 ]. Next, the teacher provides a “course” by conveying the information within a “lecture series” over time. Put simply, the expert (teacher) transfers the knowledge (information) to the learner (student). Similarly, the DL network is trained using a vast volume of data, learning the biases and the weights during the training process. These weights are then transferred to a different network for retraining or testing a similar novel model. Thus, the novel model can start from pre-trained weights rather than requiring training from scratch. Figure 28 illustrates the conceptual diagram of the TL technique.
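The weight transfer step can be sketched without reference to any particular DL framework. In the example below a network is represented simply as a mapping from layer names to weight lists; the layer names, the toy weights, and the `transfer_weights` helper are hypothetical and serve only to show which parts are copied and which parts remain trainable during fine-tuning.

```python
def transfer_weights(pretrained, layer_order, n_trainable):
    """Copy all pretrained layers; mark only the last `n_trainable` as trainable."""
    cutoff = len(layer_order) - n_trainable
    model = {}
    for index, name in enumerate(layer_order):
        model[name] = {
            "weights": list(pretrained[name]),  # copied, not shared with the source
            "trainable": index >= cutoff,       # earlier layers stay frozen
        }
    return model

# Hypothetical layer names and toy weights.
pretrained = {"conv1": [0.1, 0.2], "conv2": [0.3], "fc": [0.5, 0.6]}
model = transfer_weights(pretrained, ["conv1", "conv2", "fc"], n_trainable=1)

for name, layer in model.items():
    print(name, layer["trainable"])
```

Only the layers marked as trainable would be updated on the small target dataset, while the frozen layers keep the representation learned from the large source dataset.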

Pre-trained models: Many CNN models, e.g. AlexNet [ 30 ], GoogleNet [ 103 ], and ResNet [ 37 ], have been trained on large datasets such as ImageNet for image recognition purposes. These models can then be employed to recognize a different task without the need to train from scratch. Furthermore, the weights remain the same apart from a few learned features. In cases where data samples are lacking, these models are very useful. There are many reasons for employing a pre-trained model. First, training large models on sizeable datasets requires high-priced computational power. Second, training large models can be time-consuming, taking up to multiple weeks. Finally, a pre-trained model can assist with network generalization and speed up the convergence.

A research problem using pre-trained models: Training a DL approach requires a massive number of images, so obtaining good performance is a challenge when only limited data are available. If a huge amount of data is available, deep convolutional neural networks (DCNNs) with several layers can achieve excellent outcomes in image classification or recognition applications, with performance occasionally superior to that of a human [ 37 , 148 , 153 ]. However, avoiding overfitting problems in such applications requires sizable datasets and properly generalizing DCNN models. When training a DCNN model, the dataset size has no lower limit. However, the accuracy of the model becomes insufficient if the model has few layers or if a small dataset is used for training, due to under- or over-fitting problems. Models with fewer layers have poor accuracy because they cannot exploit the hierarchical features of sizable datasets. It is difficult to acquire sufficient training data for DL models. For example, in medical imaging and environmental science, gathering labelled datasets is very costly [ 148 ]. Moreover, the majority of crowdsourcing workers are unable to annotate medical or biological images accurately due to their lack of medical or biological knowledge. Thus, ML researchers often rely on field experts to label such images; however, this process is costly and time consuming. Therefore, producing the large volume of labels required to develop successful deep networks turns out to be unfeasible. Recently, TL has been widely employed to address the latter issue. Nevertheless, although TL enhances the accuracy of several tasks in the fields of pattern recognition and computer vision [ 154 , 155 ], there is an essential issue related to the type of source data used by TL as compared to the target dataset. For instance, CNN models for medical image classification are commonly pre-trained on the ImageNet dataset, which contains natural images, with the aim of enhancing their performance [ 153 ]. However, such natural images are completely dissimilar from raw medical images, meaning that the model performance is not enhanced. It has further been shown that TL from different domains does not significantly affect performance on medical imaging tasks, as lightweight models trained from scratch perform nearly as well as standard ImageNet-transferred models [ 156 ]. Therefore, there exist scenarios in which using pre-trained models is not a worthwhile solution. In 2020, some researchers utilized same-domain TL and achieved excellent results [ 86 , 87 , 88 , 157 ]. Same-domain TL is an approach that uses images similar to the target dataset for training. An example is using X-ray images of different chest diseases to train the model, then fine-tuning and training it on chest X-ray images for COVID-19 diagnosis. More details about same-domain TL and how to implement the fine-tuning process can be found in [ 87 ].

Fig. 28 The conceptual diagram of the TL technique

Data augmentation techniques

If the goal is to increase the amount of available data and avoid the overfitting issue, data augmentation techniques are one possible solution [ 150 , 158 , 159 ]. These techniques are data-space solutions for any limited-data problem. Data augmentation incorporates a collection of methods that improve the attributes and size of training datasets. Thus, DL networks can perform better when these techniques are employed. Next, we list some alternative data augmentation techniques.

Flipping: Flipping the vertical axis is a less common practice than flipping the horizontal one. Flipping has been verified as valuable on datasets like ImageNet and CIFAR-10. Moreover, it is very simple to implement. However, it is not a label-preserving transformation on datasets that involve text recognition (such as SVHN and MNIST).
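As a minimal sketch of flipping, assume that images are stored as NumPy arrays of shape (height, width, channels). The function names below are illustrative and are not taken from the reviewed works.

```python
import numpy as np

def horizontal_flip(image: np.ndarray) -> np.ndarray:
    """Mirror the image left-to-right by reversing the width axis."""
    return image[:, ::-1, ...]

def vertical_flip(image: np.ndarray) -> np.ndarray:
    """Mirror the image top-to-bottom by reversing the height axis."""
    return image[::-1, ...]

def random_horizontal_flip(image: np.ndarray, rng: np.random.Generator,
                           p: float = 0.5) -> np.ndarray:
    """Flip left-to-right with probability p (p is an assumed default)."""
    return horizontal_flip(image) if rng.random() < p else image

# Tiny 2 x 3 single-channel example: rows are [0, 1, 2] and [3, 4, 5].
image = np.arange(6).reshape(2, 3, 1)
flipped = horizontal_flip(image)
```

Applying the same flip twice returns the original image, which is a convenient sanity check when such a function is added to a training pipeline.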

Color space: Digital image data are commonly encoded as a tensor of dimensions ( \(\text{height} \times \text{width} \times \text{color channels}\) ). Performing augmentations in the color space of the channels is an alternative technique that is very practical to implement. A very simple color augmentation involves isolating a single color channel, such as red, green, or blue. An image can be rapidly converted to its representation in one color channel by isolating that channel's matrix and filling the other two color channels with zeros. Furthermore, the image brightness can be increased or decreased by manipulating the RGB values with straightforward matrix operations. More advanced color augmentations can be obtained by deriving a color histogram that describes the image. Lighting alterations are also made possible by adjusting the intensity values in these histograms, similar to what is done in photo-editing applications.
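The color-space operations described above can be sketched as follows, assuming 8-bit RGB images stored as NumPy arrays of shape (height, width, 3). The helper names and the use of an additive brightness offset are assumptions made for illustration.

```python
import numpy as np

def isolate_channel(image: np.ndarray, channel: int) -> np.ndarray:
    """Keep one color channel and fill the other two with zeros."""
    out = np.zeros_like(image)
    out[..., channel] = image[..., channel]
    return out

def adjust_brightness(image: np.ndarray, delta: int) -> np.ndarray:
    """Add delta to every RGB value and clip to the valid 8-bit range."""
    shifted = image.astype(np.int16) + delta  # widen to avoid uint8 wrap-around
    return np.clip(shifted, 0, 255).astype(np.uint8)

def channel_histograms(image: np.ndarray) -> np.ndarray:
    """Return a (channels, 256) array of per-channel intensity counts."""
    return np.stack([
        np.bincount(image[..., c].ravel(), minlength=256)
        for c in range(image.shape[-1])
    ])
```

The histogram helper only describes the image; histogram-based lighting changes would then remap intensities using these counts, which is beyond this sketch.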

Cropping: Cropping a central patch of each image is a practical processing step for image data with mixed height and width dimensions. Furthermore, random cropping may be employed to produce an effect similar to translations. The difference between translations and random cropping is that translations preserve the spatial dimensions of the image, while random cropping reduces the input size [for example from (256, 256) to (224, 224)]. Depending on the reduction threshold selected for cropping, this may not be a label-preserving transformation.
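A minimal sketch of central and random cropping is given below, assuming NumPy images of shape (height, width, channels) and a crop size no larger than the image. The function names are illustrative.

```python
import numpy as np

def center_crop(image: np.ndarray, size: tuple) -> np.ndarray:
    """Cut a (crop_h, crop_w) patch from the middle of the image."""
    h, w = image.shape[:2]
    crop_h, crop_w = size
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return image[top:top + crop_h, left:left + crop_w]

def random_crop(image: np.ndarray, size: tuple,
                rng: np.random.Generator) -> np.ndarray:
    """Cut a (crop_h, crop_w) patch at a uniformly random position."""
    h, w = image.shape[:2]
    crop_h, crop_w = size
    top = rng.integers(0, h - crop_h + 1)   # upper bound is exclusive
    left = rng.integers(0, w - crop_w + 1)
    return image[top:top + crop_h, left:left + crop_w]

# Example from the text: reduce a (256, 256) input to (224, 224).
image = np.zeros((256, 256, 3), dtype=np.uint8)
patch = random_crop(image, (224, 224), np.random.default_rng(0))
```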

Rotation: Rotation augmentations are obtained by rotating an image left or right around an axis by between 0 and 360 degrees. The rotation degree parameter largely determines the suitability of rotation augmentations. In digit recognition tasks, small rotations (from 0 to 20 degrees) are very helpful. By contrast, the data label may no longer be preserved after the transformation as the rotation degree increases.

Translation: To avoid positional bias within the image data, a very useful transformation is to shift the image up, down, left, or right. For instance, if all the images in a dataset are centered, the model would also need to be tested on perfectly centered images. Note that when translating the initial images in a particular direction, the residual space should be filled with Gaussian or random noise, or a constant value such as 255s or 0s. This padding preserves the spatial dimensions of the image post-augmentation.
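The following sketch shows translation with constant-value padding, assuming NumPy images whose first two axes are height and width. The function name, the sign convention (positive shifts move the content down and to the right), and the default fill value are assumptions.

```python
import numpy as np

def translate(image: np.ndarray, dy: int, dx: int, fill: int = 0) -> np.ndarray:
    """Shift the image by dy rows and dx columns, padding with `fill`."""
    h, w = image.shape[:2]
    out = np.full_like(image, fill)
    if abs(dy) >= h or abs(dx) >= w:
        return out  # the content is shifted completely out of the frame
    # Source window that stays visible and its destination in the output.
    src_rows = slice(max(0, -dy), min(h, h - dy))
    dst_rows = slice(max(0, dy), min(h, h + dy))
    src_cols = slice(max(0, -dx), min(w, w - dx))
    dst_cols = slice(max(0, dx), min(w, w + dx))
    out[dst_rows, dst_cols] = image[src_rows, src_cols]
    return out

image = np.arange(9).reshape(3, 3)
shifted_down = translate(image, dy=1, dx=0, fill=255)
```

Because the output has the same shape as the input, the padded result can be fed to a network without resizing, in contrast to random cropping.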

Noise injection: This approach involves injecting a matrix of random values. Such a matrix is commonly drawn from a Gaussian distribution. Moreno-Barea et al. [ 160 ] employed nine datasets to test noise injection. These datasets were taken from the UCI repository [ 161 ]. Injecting noise within images enables the CNN to learn more robust features.
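A minimal sketch of Gaussian noise injection for 8-bit images is shown below. The zero mean, the parameter name `sigma`, and the clipping to [0, 255] are assumptions chosen for illustration rather than settings reported in [ 160 ].

```python
import numpy as np

def add_gaussian_noise(image: np.ndarray, sigma: float,
                       rng: np.random.Generator) -> np.ndarray:
    """Add zero-mean Gaussian noise with standard deviation sigma."""
    noise = rng.normal(loc=0.0, scale=sigma, size=image.shape)
    noisy = image.astype(np.float64) + noise
    # Round and clip so the result is again a valid 8-bit image.
    return np.clip(np.rint(noisy), 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((8, 8, 3), 128, dtype=np.uint8)
noisy_image = add_gaussian_noise(image, sigma=10.0, rng=rng)
```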

Geometric transformations are very good solutions for positional biases present in the training data. Several potential sources of bias can cause the distribution of the testing data to differ from that of the training data. For instance, when all faces are perfectly centered within the frames (as in facial recognition datasets), the problem of positional biases emerges. Thus, geometric translations are a good solution. Geometric transformations are helpful because they are simple to implement and effectively remove positional biases. Several image processing libraries are available, which make it easy to begin with simple operations such as rotation or horizontal flipping. Additional training time, higher computational costs, and additional memory are some shortcomings of geometric transformations. Furthermore, a number of geometric transformations (such as arbitrary cropping or translation) should be manually inspected to ensure that they do not change the image label. Finally, the biases that separate the test data from the training data are often more complicated than translational and positional changes. Hence, deciding when and where geometric transformations are suitable is not trivial.

Imbalanced data

Commonly, biological data tend to be imbalanced, as negative samples are much more numerous than positive ones [ 162 , 163 , 164 ]. For example, compared to COVID-19-positive X-ray images, the volume of normal X-ray images is very large. It should be noted that undesirable results may be produced when training a DL model using imbalanced data. The following techniques are used to solve this issue. First, it is necessary to employ the correct criteria for evaluating the loss, as well as the prediction result. With imbalanced data, the model should perform well on small classes as well as larger ones. Thus, the area under the curve (AUC) should be employed as both the loss and the evaluation criterion [ 165 ]. Second, if the cross-entropy loss is still preferred, the weighted cross-entropy loss should be employed, which ensures that the model will perform well with small classes. In addition, during model training, it is possible either to down-sample the large classes or up-sample the small classes. Finally, to make the data balanced as in Ref. [ 166 ], it is possible to construct models for every hierarchical level, as a biological system frequently has a hierarchical label space. The effect of imbalanced data on the performance of DL models has been comprehensively investigated, and the most frequently used techniques for lessening the problem have been compared. Nevertheless, note that these techniques are not specific to biological problems.
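To make the weighted cross-entropy idea concrete, the sketch below weights each sample by the inverse frequency of its class. Inverse-frequency weighting is one common choice and is an assumption here; the cited works may use a different weighting scheme, and all names are illustrative.

```python
import numpy as np

def inverse_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Weight each class by total / (num_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    return counts.sum() / (num_classes * np.maximum(counts, 1.0))

def weighted_cross_entropy(probs: np.ndarray, labels: np.ndarray,
                           class_weights: np.ndarray) -> float:
    """Weighted mean of -log p(true class); probs has shape (N, classes)."""
    eps = 1e-12  # guards against log(0)
    true_class_probs = probs[np.arange(len(labels)), labels]
    sample_weights = class_weights[labels]
    losses = -sample_weights * np.log(true_class_probs + eps)
    return float(losses.sum() / sample_weights.sum())

# Nine negative samples and one positive sample.
labels = np.array([0] * 9 + [1])
weights = inverse_frequency_weights(labels, num_classes=2)
```

With these weights, a mistake on the single positive sample contributes as much to the loss as mistakes on all nine negative samples combined.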

Interpretability of data

DL techniques are often criticized as acting like a black box. In fact, they can be interpreted. The need for a method of interpreting DL, which is used to obtain the valuable motifs and patterns recognized by the network, is common in many fields, such as bioinformatics [ 167 ]. In the task of disease diagnosis, it is necessary to know not only the diagnosis or prediction results of a trained DL model, but also the evidence on which the model bases its decisions, in order to increase confidence in the prediction outcomes [ 168 ]. To achieve this, it is possible to assign an importance score to every portion of a particular example. Within this solution, back-propagation-based techniques or perturbation-based approaches are used [ 169 ]. In the perturbation-based approaches, a portion of the input is changed and the effect of this change on the model output is observed [ 170 , 171 , 172 , 173 ]. This concept has high computational complexity, but it is simple to understand. In the back-propagation-based techniques, on the other hand, the signal from the output is propagated back to the input layer to obtain the importance scores of the various input portions. These techniques have been proven valuable in [ 174 ]. Model interpretability can have different meanings in different scenarios.
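The perturbation-based idea can be sketched with a simple occlusion test: each patch of the input is replaced by a baseline value, and the resulting drop in the model output is recorded as the importance of that patch. The `toy_model` below is a hypothetical stand-in for a trained network, and the patch size and baseline are assumptions.

```python
import numpy as np

def occlusion_importance(model, image: np.ndarray, patch: int = 4,
                         baseline: float = 0.0) -> np.ndarray:
    """Score each patch by how much occluding it lowers the model output."""
    h, w = image.shape[:2]
    reference = model(image)
    scores = np.zeros((h, w))
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            perturbed = image.copy()
            perturbed[top:top + patch, left:left + patch] = baseline
            # One forward pass per patch: simple, but computationally costly.
            scores[top:top + patch, left:left + patch] = reference - model(perturbed)
    return scores

def toy_model(image: np.ndarray) -> float:
    """Hypothetical scorer that only looks at the top-left 4 x 4 region."""
    return float(image[:4, :4].mean())

image = np.ones((8, 8))
importance = occlusion_importance(toy_model, image, patch=4)
```

The number of forward passes grows with the number of patches, which reflects the high computational complexity noted above.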

Uncertainty scaling

Commonly, the final prediction label is not the only output required when employing DL techniques; the confidence score for each query to the model is also desired. The confidence score is defined as how confident the model is in its prediction [ 175 ]. Since the confidence score prevents belief in unreliable and misleading predictions, it is a significant attribute, regardless of the application scenario. In biology, the confidence score reduces the resources and time expended in verifying the outcomes of misleading predictions. Generally speaking, in healthcare or similar applications, uncertainty scaling is frequently very significant; it helps in evaluating automated clinical decisions and the reliability of machine learning-based disease diagnosis [ 176 , 177 ]. Because different DL models can produce overconfident predictions, the probability score (obtained from the softmax output of the DL model) is often not on the correct scale [ 178 ]. Note that the softmax output requires post-scaling to achieve a reliable probability score. For outputting the probability score on the correct scale, several techniques have been introduced, including Bayesian Binning into Quantiles (BBQ) [ 179 ], isotonic regression [ 180 ], histogram binning [ 181 ], and the well-known Platt scaling [ 182 ]. More specifically, for DL techniques, temperature scaling was recently introduced, which achieves superior performance compared to the other techniques.
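Temperature scaling divides the logits by a single scalar T > 0 before the softmax, with T fitted on held-out data by minimizing the negative log-likelihood. The sketch below uses a plain grid search for T, which is a simplification assumed here in place of a gradient-based optimizer; the function names and the grid are illustrative.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Row-wise softmax of logits / temperature (numerically stabilized)."""
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits: np.ndarray, labels: np.ndarray, temperature: float) -> float:
    """Mean negative log-likelihood of the true labels."""
    p = softmax(logits, temperature)
    return float(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits: np.ndarray, labels: np.ndarray, grid=None) -> float:
    """Pick the temperature on the grid with the lowest validation NLL."""
    if grid is None:
        grid = np.linspace(0.05, 10.0, 200)
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Overconfident toy model: always very sure of class 0, right 3 times out of 4.
val_logits = np.array([[8.0, 0.0]] * 4)
val_labels = np.array([0, 0, 0, 1])
T = fit_temperature(val_logits, val_labels)
calibrated = softmax(val_logits, T)
```

Dividing by T does not change the predicted class, only the confidence. In this toy case the calibrated confidence moves towards the observed accuracy of 0.75.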

Catastrophic forgetting

Catastrophic forgetting occurs when incorporating new information into a plain DL model interferes with the previously learned information. For instance, consider a case where there are 1000 types of flowers and a model is trained to classify these flowers, after which a new type of flower is introduced; if the model is fine-tuned only with this new class, its performance on the older classes will degrade [ 183 , 184 ]. Data are continually collected and renewed, which is in fact a highly typical scenario in many fields, e.g. biology. To address this issue, there is a direct solution that involves employing old and new data to train an entirely new model from scratch. This solution is time-consuming and computationally intensive; furthermore, it leads to an unstable state for the learned representation of the initial data. At this time, three different types of ML techniques, which are inspired by neurophysiological theories of the human brain, are available to mitigate catastrophic forgetting [ 185 , 186 ]. Techniques of the first type are founded on regularizations such as EWC [ 183 ]. Techniques of the second type employ rehearsal training techniques and dynamic neural network architectures like iCaRL [ 187 , 188 ]. Finally, techniques of the third type are founded on dual-memory learning systems [ 189 ]. Refer to [ 190 , 191 , 192 ] for more details.
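As an illustration of the regularization-based type, the sketch below computes an EWC-style quadratic penalty that discourages parameters important for the old task from moving away from their old values. It assumes flattened parameter vectors and a precomputed diagonal Fisher information estimate; the names `fisher_diag` and `lam` are illustrative.

```python
import numpy as np

def ewc_penalty(params: np.ndarray, old_params: np.ndarray,
                fisher_diag: np.ndarray, lam: float) -> float:
    """(lam / 2) * sum_i F_i * (theta_i - theta_old_i)^2."""
    return 0.5 * lam * float(np.sum(fisher_diag * (params - old_params) ** 2))

def ewc_loss(new_task_loss: float, params: np.ndarray, old_params: np.ndarray,
             fisher_diag: np.ndarray, lam: float) -> float:
    """Loss on the new task plus the penalty protecting the old task."""
    return new_task_loss + ewc_penalty(params, old_params, fisher_diag, lam)

old_params = np.array([1.0, 1.0])
fisher_diag = np.array([10.0, 0.0])  # only the first parameter matters for the old task

moved_important = ewc_penalty(np.array([2.0, 1.0]), old_params, fisher_diag, lam=1.0)
moved_unimportant = ewc_penalty(np.array([1.0, 5.0]), old_params, fisher_diag, lam=1.0)
```

Changing the important parameter is penalized, whereas the unimportant one remains free to adapt to the new class.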

Model compression

DL models have intensive memory and computational requirements due to their huge complexity and large numbers of parameters, which makes it difficult to obtain well-trained models that can still be deployed productively [ 193 , 194 ]. Healthcare and environmental science are fields characterized as data-intensive. These requirements limit the deployment of DL on machines with limited computational power, mainly in the healthcare field. The numerous methods of assessing human health and the heterogeneity of the data have made the data far more complicated and vastly larger in size [ 195 ]; thus, the issue requires additional computation [ 196 ]. Furthermore, novel hardware-based parallel processing solutions such as FPGAs and GPUs [ 197 , 198 , 199 ] have been developed to solve the computation issues associated with DL. Recently, numerous techniques for compressing DL models, designed to decrease the computational cost of the models from the outset, have also been introduced. These techniques can be classified into four classes. In the first class, the redundant parameters (which have no significant impact on model performance) are reduced. This class, which includes the famous deep compression method, is called parameter pruning [ 200 ]. In the second class, the larger model uses its distilled knowledge to train a more compact model; thus, it is called knowledge distillation [ 201 , 202 ]. In the third class, compact convolution filters are used to reduce the number of parameters [ 203 ]. In the final class, low-rank factorization is used to estimate the informative parameters to be preserved [ 204 ]. These classes cover the most representative techniques for model compression. A more comprehensive discussion of the topic is provided in [ 193 ].
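Two of the four classes can be sketched on a single weight matrix. The first function zeroes the smallest-magnitude weights, which is a simple form of parameter pruning, and the second replaces a matrix by a product of two thinner matrices through a truncated SVD. Both are illustrative simplifications rather than the specific procedures of [ 200 ] or [ 204 ], and the function names are assumptions.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero (at least) the given fraction of smallest-magnitude weights."""
    k = int(sparsity * weights.size)
    if k == 0:
        return weights.copy()
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def low_rank_factorize(weights: np.ndarray, rank: int):
    """Return A (m x rank) and B (rank x n) with A @ B approximating weights."""
    u, s, vt = np.linalg.svd(weights, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

w = np.array([[0.1, -2.0], [3.0, -0.05]])
pruned = magnitude_prune(w, sparsity=0.5)

rank_one = np.outer([1.0, 2.0, 3.0], [4.0, 5.0])
a, b = low_rank_factorize(rank_one, rank=1)
```

For an m x n layer, the factorized form stores rank * (m + n) values instead of m * n, so the saving depends on how small a rank preserves accuracy.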

Overfitting

DL models are highly prone to overfitting the data at the training stage due to the vast number of parameters involved, which are correlated in a complex manner. Such situations reduce the model’s ability to achieve good performance on the tested data [ 90 , 205 ]. This problem is not limited to a specific field, but arises in different tasks. Therefore, when proposing DL techniques, this problem should be fully considered and accurately handled. In DL, the implicit bias of the training process enables the model to overcome crucial overfitting problems, as recent studies suggest [ 205 , 206 , 207 , 208 ]. Even so, it is still necessary to develop techniques that handle the overfitting problem. The available DL algorithms that ease the overfitting problem can be categorized into three classes. The first class acts on both the model architecture and model parameters and includes the most familiar approaches, such as weight decay [ 209 ], batch normalization [ 210 ], and dropout [ 90 ]. In DL, the default technique is weight decay [ 209 ], which is used extensively in almost all ML algorithms as a universal regularizer. The second class works on model inputs, using techniques such as data corruption and data augmentation [ 150 , 211 ]. One reason for the overfitting problem is the lack of training data, which makes the learned distribution not mirror the real distribution. Data augmentation enlarges the training data. By contrast, marginalized data corruption improves the solution beyond what is obtained solely by augmenting the data. The final class works on the model output. A recently proposed technique penalizes over-confident outputs to regularize the model [ 178 ]. This technique has demonstrated the ability to regularize RNNs and CNNs.
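Two techniques from the first class can be sketched in a few lines. The dropout function uses the common "inverted" formulation, in which kept activations are rescaled during training so that no change is needed at test time, and the update function applies weight decay as an L2 term added to the gradient of a plain SGD step. Both formulations and all names are assumptions made for illustration.

```python
import numpy as np

def dropout(activations: np.ndarray, rate: float, rng: np.random.Generator,
            training: bool = True) -> np.ndarray:
    """Randomly zero a fraction `rate` of activations and rescale the rest."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

def sgd_step_with_weight_decay(weights: np.ndarray, grad: np.ndarray,
                               lr: float, weight_decay: float) -> np.ndarray:
    """One SGD update in which weight decay shrinks the weights towards zero."""
    return weights - lr * (grad + weight_decay * weights)

rng = np.random.default_rng(0)
x = np.ones(1000)
dropped = dropout(x, rate=0.5, rng=rng)
```

Because of the rescaling, the expected value of each activation is unchanged, so the layer behaves consistently between training and inference.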

Vanishing gradient problem

In general, when using backpropagation and gradient-based learning techniques along with ANNs, largely in the training stage, a problem called the vanishing gradient problem arises [ 212 , 213 , 214 ]. More specifically, in each training iteration, every weight of the neural network is updated in proportion to the partial derivative of the error function with respect to the current weight. However, this weight update may not occur in some cases due to a vanishingly small gradient, which in the worst case means that no further training is possible and the neural network stops learning completely. The sigmoid function, like some other activation functions, squashes a large input space into a small output space. Thus, the derivative of the sigmoid function will be small, because a large variation at the input produces only a small variation at the output. In a shallow network, only a few layers use these activations, so this is not a significant issue and the network can still be trained efficiently. Using more layers, however, causes the gradient to become very small in the training stage. The back-propagation technique is used to determine the gradients of the neural networks. This technique computes the derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first layer. By the chain rule, the derivatives of the layers are then multiplied together down the network. For instance, when there are N hidden layers that employ an activation function such as the sigmoid function, N small derivatives are multiplied together. Hence, the gradient declines exponentially while propagating back to the first layer. More specifically, the biases and weights of the first layers cannot be updated efficiently during the training stage because the gradient is small.
Moreover, this condition decreases the overall network accuracy, as these first layers are frequently critical to recognizing the essential elements of the input data. However, such a problem can be avoided by employing activation functions that lack the squashing property, i.e., the tendency to squash the input space into a small space. The ReLU [ 91 ], which maps x to max(0, x), is the most popular selection, as it does not yield a small derivative for positive inputs. Another solution involves employing the batch normalization layer [ 81 ]. As mentioned earlier, the problem occurs once a large input space is squashed into a small space, causing the derivative to vanish. Employing batch normalization mitigates this issue by simply normalizing the input, so that | x | does not reach the outer, saturated regions of the sigmoid function. The normalization process keeps most of the inputs within the region in which the derivative is large enough for further updates. Furthermore, faster hardware, e.g. that provided by GPUs, can help tackle this issue. This makes standard back-propagation feasible for networks several layers deeper than was possible when the vanishing gradient problem was first recognized [ 215 ].
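The exponential decay can be checked numerically. The sigmoid derivative is at most 0.25, so a path through N sigmoid layers scales the gradient by at most 0.25^N, whereas the ReLU derivative is exactly 1 for positive inputs. The sketch below multiplies only the activation derivatives along one path and takes all weights as 1, which is a simplifying assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum value is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (np.asarray(x) > 0).astype(np.float64)

def backprop_factor(pre_activations, grad_fn) -> float:
    """Product of per-layer activation derivatives along one path."""
    return float(np.prod(grad_fn(np.asarray(pre_activations, dtype=np.float64))))

# Ten layers: best case for the sigmoid (all pre-activations at 0) versus ReLU.
sigmoid_factor = backprop_factor(np.zeros(10), sigmoid_grad)
relu_factor = backprop_factor(np.ones(10), relu_grad)
```

Even in the best case, ten sigmoid layers shrink the gradient by a factor of about one million, while the ReLU path leaves it unchanged.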

Exploding gradient problem

The exploding gradient problem is the opposite of the vanishing gradient problem. Specifically, large error gradients accumulate during back-propagation [ 216 , 217 , 218 ]. This leads to extremely large updates to the weights of the network, meaning that the system becomes unstable. Thus, the model will lose its ability to learn effectively. Roughly speaking, when moving backward through the network during back-propagation, the gradient grows exponentially through repeated multiplication of gradients. The weight values could thus become incredibly large and may overflow to become a not-a-number (NaN) value. Some potential solutions include:

Using different weight regularization techniques.

Redesigning the architecture of the network model.
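The overflow behind the exploding gradient problem can be reproduced with a scalar chain in which the gradient is multiplied by the same factor at every layer. This is a deliberately simplified model (one weight per layer, 32-bit floats) and the function name is illustrative.

```python
import numpy as np

def chain_gradient(weight: float, num_layers: int) -> np.float32:
    """Gradient magnitude after back-propagating through num_layers layers."""
    grad = np.float32(1.0)
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        for _ in range(num_layers):
            grad = grad * np.float32(weight)
    return grad

moderate = chain_gradient(1.5, 10)    # grows, but stays finite
exploded = chain_gradient(10.0, 50)   # exceeds the float32 range and becomes inf

with np.errstate(invalid="ignore"):
    # Arithmetic on infinities, e.g. inside a weight update, then yields NaN.
    not_a_number = exploded - exploded
```

A factor below 1 in the same chain shrinks the gradient instead, which is the vanishing case.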

Underspecification

In 2020, a team of computer scientists at Google identified a new challenge called underspecification [ 219 ]. ML models, including DL models, often show surprisingly poor behavior when they are tested in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics. The weak performance is attributed to underspecification. It has been shown that small modifications can force a model towards a completely different solution as well as lead to different predictions in deployment domains. There are different techniques for addressing the underspecification issue. One of them is to design “stress tests” to examine how well a model works on real-world data and to find out the possible issues. Nevertheless, this demands a reliable understanding of the ways in which the model can work inaccurately. The team stated that “Designing stress tests that are well-matched to applied requirements, and that provide good “coverage” of potential failure modes is a major challenge”. Underspecification puts major constraints on the credibility of ML predictions and may require some reconsidering of certain applications. Since ML is closely linked to human lives through applications such as medical imaging and self-driving cars, this issue requires proper attention.

Applications of deep learning

Presently, various DL applications are widespread around the world. These applications include healthcare, social network analysis, audio and speech processing (like recognition and enhancement), visual data processing methods (such as multimedia data analysis and computer vision), and NLP (translation and sentence classification), among others (Fig.  29 ) [ 220 , 221 , 222 , 223 , 224 ]. These applications have been classified into five categories: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own target, there is fundamental overlap in the pipeline implementation of these applications as shown in Fig.  30 . Classification is a concept that categorizes a set of data into classes. Detection is used to locate interesting objects in an image with consideration given to the background. In detection, multiple objects, which could be from dissimilar classes, are surrounded by bounding boxes. Localization is the concept used to locate the object, which is surrounded by a single bounding box. In segmentation (semantic segmentation), the target object edges are surrounded by outlines, which also label them. Registration refers to fitting a single image (which could be 2D or 3D) onto another. One of the most important and wide-ranging DL application areas is healthcare [ 225 , 226 , 227 , 228 , 229 , 230 ]. This area of research is critical due to its relation to human lives. Moreover, DL has shown tremendous performance in healthcare. Therefore, we take DL applications in the medical image analysis field as an example to describe the DL applications.

Fig. 29 Examples of DL applications

Fig. 30 Workflow of deep learning tasks

Classification

Computer-Aided Diagnosis (CADx) is another title sometimes used for classification. Bharati et al. [ 231 ] used a chest X-ray dataset for detecting lung diseases based on a CNN. Another study attempted to read X-ray images by employing CNN [ 232 ]. In this modality, the comparative accessibility of these images has likely enhanced the progress of DL. The authors of [ 233 ] used an improved pre-trained GoogLeNet CNN with more than 150,000 images for the training and testing processes. This dataset was augmented from 1850 chest X-rays. The authors classified the image orientation into lateral and frontal views and achieved approximately 100% accuracy. This orientation classification task has limited clinical use. However, as part of an ultimately fully automated diagnosis workflow, it demonstrated the effectiveness of data augmentation and pre-training in learning the metadata of relevant images. Pneumonia, a common chest infection, is a frequently occurring health problem worldwide and is highly treatable. Rajpurkar et al. [ 234 ] utilized CheXNet, which is an improved version of DenseNet [ 112 ] with 121 convolution layers, for classifying fourteen types of disease. These authors used the ChestX-ray14 dataset [ 235 ], which comprises approximately 112,000 images. This network achieved an excellent performance in recognizing fourteen different diseases. In particular, pneumonia classification accomplished a 0.7632 AUC score using receiver operating characteristics (ROC) analysis. In addition, the network matched or exceeded the performance of both a three-radiologist panel and four individual radiologists. Zuo et al. [ 236 ] adopted a CNN for candidate classification of lung nodules. Shen et al. [ 237 ] employed both Random Forest (RF) and SVM classifiers with CNNs to classify lung nodules. They employed three parallel CNNs, each with two convolutional layers.
The LIDC-IDRI (Lung Image Database Consortium) dataset, which contained 1010 labeled CT lung scans, was used to classify the two types of lung nodules (malignant and benign). Each CNN used image patches of a different scale to extract features, and the output feature vector was constructed from the learned features. Next, these vectors were classified as malignant or benign using either the RF classifier or an SVM with a radial basis function (RBF) kernel. The model was robust to various levels of input noise and achieved an accuracy of 86% in nodule classification. In another study, the model of [ 238 ] uses 3D CNNs to interpolate image data that are missing between PET and MRI images. The Alzheimer Disease Neuroimaging Initiative (ADNI) database, containing 830 PET and MRI patient scans, was utilized in their work. The 3D CNNs were trained with the MRI images as input and the PET images as output. Furthermore, for patients who have no PET images, the trained 3D CNNs were used to rebuild the PET images. The disease recognition outcomes obtained with these rebuilt images approximately matched those obtained with the actual images. However, this approach did not address the overfitting issues, which in turn restricted the technique in terms of its possible capacity for generalization. Diagnosing normal versus Alzheimer’s disease patients has been achieved by several CNN models [ 239 , 240 ]. Hosseini-Asl et al. [ 241 ] attained a state-of-the-art accuracy of 99% in diagnosing normal versus Alzheimer’s disease patients. These authors applied an auto-encoder architecture using 3D CNNs. The generic brain features were pre-trained on the CADDementia dataset. Subsequently, the outcomes of these learned features became inputs to higher layers to differentiate between patient scans of Alzheimer’s disease, mild cognitive impairment, or normal brains based on the ADNI dataset and using fine-tuned deep supervision techniques.
The architectures of VGGNet and residual neural networks, respectively, were the basis of the VoxCNN and ResNet models developed by Korolev et al. [ 242 ]. They also discriminated between Alzheimer’s disease and normal patients using the ADNI database. Accuracy was 79% for VoxCNN and 80% for ResNet. Compared to Hosseini-Asl’s work, both models achieved lower accuracies. However, the implementation of the algorithms was simpler and did not require feature hand-crafting, as Korolev declared. In 2020, Mehmood et al. [ 240 ] trained a CNN-based network that they developed, called “SCNN”, with MRI images for the classification of Alzheimer’s disease. They achieved state-of-the-art results by obtaining an accuracy of 99.05%.

Recently, CNNs have taken some medical imaging classification tasks to a different level, from traditional diagnosis to automated diagnosis with tremendous performance. Examples of these tasks are diabetic foot ulcer (DFU) (as normal and abnormal (DFU) classes) [ 87 , 243 , 244 , 245 , 246 ], sickle cell anemia (SCA) (as normal, abnormal (SCA), and other blood components) [ 86 , 247 ], breast cancer, by classifying hematoxylin–eosin-stained breast biopsy images into four classes: invasive carcinoma, in-situ carcinoma, benign tumor and normal tissue [ 42 , 88 , 248 , 249 , 250 , 251 , 252 ], and multi-class skin cancer classification [ 253 , 254 , 255 ].

In 2020, CNNs played a vital role in the early diagnosis of the novel coronavirus disease (COVID-19). CNNs have become the primary tool for automatic COVID-19 diagnosis in many hospitals around the world using chest X-ray images [ 256 , 257 , 258 , 259 , 260 ]. More details about the classification of medical imaging applications can be found in [ 226 , 261 , 262 , 263 , 264 , 265 ].

Localization

Although applications in anatomy education could increase, the practicing clinician is more likely to be interested in the localization of normal anatomy. Localization could also be applied in completely automatic end-to-end applications, in which radiological images are examined and described without human intervention [ 266 , 267 , 268 ]. Zhao et al. [ 269 ] introduced a new deep learning-based approach to localize pancreatic tumors in projection X-ray images for image-guided radiation therapy without the need for fiducials. Roth et al. [ 270 ] constructed and trained a CNN using five convolutional layers to classify around 4000 transverse-axial CT images. These authors used five categories for classification: legs, pelvis, liver, lung, and neck. After data augmentation techniques were applied, they achieved an AUC score of 0.998 and the classification error rate of the model was 5.9%. For detecting the positions of the spleen, kidney, heart, and liver, Shin et al. [ 271 ] employed stacked auto-encoders on 78 contrast-enhanced MRI scans of the stomach area containing the kidneys or liver. Temporal and spatial domains were used to learn the hierarchical features. Depending on the organ, these approaches achieved detection accuracies of 62–79%. Sirazitdinov et al. [ 268 ] presented an ensemble of two convolutional neural networks, namely RetinaNet and Mask R-CNN, for pneumonia detection and localization.

Detection

Computer-Aided Detection (CADe) is another term used for detection. For both the clinician and the patient, overlooking a lesion on a scan may have dire consequences. Thus, detection is a field of study requiring both accuracy and sensitivity [ 272 , 273 , 274 ]. Chouhan et al. [ 275 ] introduced an innovative deep learning framework for the detection of pneumonia by adopting the idea of transfer learning. Their approach obtained an accuracy of 96.4% with a recall of 99.62% on unseen data. In the area of COVID-19 and pulmonary disease, several convolutional neural network approaches have been proposed for automatic detection from X-ray images, and these have shown excellent performance [ 46 , 276 , 277 , 278 , 279 ].

In the area of skin cancer, several applications have been introduced for the detection task [ 280 , 281 , 282 ]. Thurnhofer-Hemsi et al. [ 283 ] introduced a deep learning approach for skin cancer detection by fine-tuning five state-of-the-art convolutional neural network models. They addressed the issue of a lack of training data by adopting the ideas of transfer learning and data augmentation techniques. The DenseNet201 network showed superior results compared to the other models.

Another interesting area is that of histopathological images, which are progressively being digitized. Several papers have been published in this field [ 284 , 285 , 286 , 287 , 288 , 289 , 290 ]. Human pathologists read these images laboriously; they search for malignancy markers, such as a high index of cell proliferation, using molecular markers (e.g. Ki-67), cellular necrosis signs, abnormal cellular architecture, increased numbers of mitotic figures denoting increased cell replication, and enlarged nucleus-to-cytoplasm ratios. Note that a histopathological slide may contain a huge number of cells (up to thousands). Thus, the risk of overlooking abnormal neoplastic regions is high when wading through these cells at high levels of magnification. Ciresan et al. [ 291 ] employed CNNs of 11–13 layers for identifying mitotic figures. Fifty breast histology images from the MITOS dataset were used. Their technique attained recall and precision scores of 0.7 and 0.88 respectively. Sirinukunwattana et al. [ 292 ] utilized 100 histology images of colorectal adenocarcinoma to detect cell nuclei using CNNs. Roughly 30,000 nuclei were hand-labeled for training purposes. The novelty of this approach was in the use of a Spatially Constrained CNN. This CNN detects the center of nuclei using the surrounding spatial context and spatial regression. Instead of this CNN, Xu et al. [ 293 ] employed a stacked sparse auto-encoder (SSAE) to identify nuclei in histological slides of breast cancer, achieving 0.83 and 0.89 recall and precision scores respectively. They showed that unsupervised learning techniques can also be utilized effectively in this field. Albarqouni et al. [ 294 ] investigated the problem of insufficient labeling in medical images. They crowd-sourced the labeling of mitoses in histology images of breast cancer (from amateurs online).
Feeding the crowd-sourced input labels into the CNN can help to solve the recurrent issue of inadequate labeling during the analysis of medical images. This method represents a remarkable proof-of-concept effort. In 2020, Lei et al. [ 285 ] introduced the employment of deep convolutional neural networks for automatic identification of mitotic candidates from histological sections for mitosis screening. They obtained state-of-the-art detection results on the dataset of the International Conference on Pattern Recognition (ICPR) 2012 Mitosis Detection Competition.

Segmentation

Although MRI and CT image segmentation research covers different organs such as knee cartilage, prostate, and liver, most research work has concentrated on brain segmentation, particularly tumors [ 295 , 296 , 297 , 298 , 299 , 300 ]. This issue is highly significant in surgical preparation, where precise tumor boundaries are needed to keep the surgical resection as small as possible. During surgery, excessive sacrificing of key brain regions may lead to neurological deficits including cognitive damage, emotionlessness, and limb difficulty. Conventionally, medical anatomical segmentation was done by hand; more specifically, the clinician draws outlines slice by slice through the complete stack of the CT or MRI volume. This painstaking work is therefore an ideal candidate for automation. Wadhwa et al. [ 301 ] presented a brief overview of brain tumor segmentation in MRI images. Akkus et al. [ 302 ] wrote an excellent review of brain MRI segmentation that addressed the different metrics and CNN architectures employed. Moreover, they explained several competitions in detail, as well as their datasets, which included Ischemic Stroke Lesion Segmentation (ISLES), Mild Traumatic brain injury Outcome Prediction (MTOP), and Brain Tumor Segmentation (BRATS).

Chen et al. [ 299 ] proposed convolutional neural networks for precise brain tumor segmentation. Their approach combines several techniques for better feature learning, including the DeepMedic model, a novel dual-force training scheme, a label distribution-based loss function, and Multi-Layer Perceptron-based post-processing. They evaluated their method on two recent brain tumor segmentation datasets, BRATS 2017 and BRATS 2015. Hu et al. [ 300 ] introduced a brain tumor segmentation method that adopts a multi-cascaded convolutional neural network (MCCNN) and fully connected conditional random fields (CRFs). The achieved results were excellent compared with the state-of-the-art methods.

Moeskops et al. [ 303 ] employed three parallel-running CNNs, each of which had a 2D input patch of a different size, for segmenting and classifying MRI brain images. The images, from 35 adults and 22 pre-term infants, were classified into various tissue categories such as cerebrospinal fluid, grey matter, and white matter. Using three different input patch sizes lets each patch capture different image aspects: the larger patches incorporate spatial features, while the smallest patches concentrate on local textures. Overall, the algorithm achieved satisfactory accuracy, with Dice coefficients in the range of 0.82–0.87. Although 2D image slices are employed in the majority of segmentation research, Milletari et al. [ 304 ] implemented a 3D CNN for segmenting MRI prostate images. They used the PROMISE2012 challenge dataset, from which fifty MRI scans were used for training and thirty for testing. Their V-Net was inspired by the U-Net architecture of Ronneberger et al. [ 305 ]. This model attained a Dice coefficient of 0.869, the same as the winning teams in the competition. To reduce overfitting and build a deeper CNN with 11 convolutional layers, Pereira et al. [ 306 ] intentionally applied small 3×3 filters. Their model was trained on MRI scans of 274 gliomas (a type of brain tumor). They achieved first place in the 2013 BRATS challenge, as well as second place in the 2015 BRATS challenge. Havaei et al. [ 307 ] also considered gliomas using the 2013 BRATS dataset. They investigated different 2D CNN architectures. Compared to the winner of BRATS 2013, their algorithm performed better, as it required only 3 min to execute rather than 100 min. Their model was based on the concept of a cascaded architecture and is therefore referred to as InputCascadeCNN. Chen et al. [ 308 ] introduced fully connected (FC) Conditional Random Fields (CRFs), atrous spatial pyramid pooling, and up-sampled filters. These authors aimed to enhance localization accuracy and to enlarge the field of view of every filter at multiple scales. Their model, DeepLab, performed excellently on the PASCAL VOC-2012 image segmentation task, attaining 79.7% mIOU (mean Intersection Over Union).
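
The Dice coefficient and mIOU scores quoted above are both overlap measures between a predicted mask and a ground-truth mask. The following is a minimal sketch of how they are commonly computed, assuming masks are given as flat lists of labels; the function names and the toy masks are illustrative and are not taken from any of the cited works.

```python
def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Two empty masks agree perfectly by convention.
    return 1.0 if total == 0 else 2.0 * intersection / total


def iou(pred, target):
    """IoU = |A ∩ B| / |A ∪ B| for two flat binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return 1.0 if union == 0 else intersection / union


def mean_iou(pred_labels, target_labels, num_classes):
    """Average the per-class IoU, skipping classes absent from both masks."""
    scores = []
    for c in range(num_classes):
        p = [int(x == c) for x in pred_labels]
        t = [int(x == c) for x in target_labels]
        if any(p) or any(t):
            scores.append(iou(p, t))
    return sum(scores) / len(scores) if scores else 0.0


# Toy binary masks (hypothetical data, flattened to 1D).
pred_mask = [0, 1, 1, 1, 0, 0]
true_mask = [0, 1, 1, 0, 1, 0]
dice = dice_coefficient(pred_mask, true_mask)
overlap = iou(pred_mask, true_mask)
```

For binary masks the two measures are related by Dice = 2·IoU / (1 + IoU), so a Dice score is never lower than the IoU of the same prediction.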

Recently, automatic segmentation of COVID-19 lung infection from CT images, based on several deep learning techniques, has helped to track the development of the infection [ 309 , 310 , 311 , 312 ].

Registration

Usually, given two input images, the four main stages of the canonical procedure of the image registration task are [ 313 , 314 ]:

Target Selection: determines the input image onto which the second input image must be accurately superimposed.

Feature Extraction: computes the set of features extracted from each input image.

Feature Matching: finds similarities between the previously obtained features.

Pose Optimization: minimizes the distance between both input images.

The result of the registration procedure is then the geometric transformation (e.g. translation, rotation, scaling, etc.) that places both input images within the same coordinate system such that the distance between them is minimal, i.e. their level of superimposition/overlap is optimal. An extensive review of this topic is beyond the scope of this work. Nevertheless, a short summary is given next.
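
To make the Pose Optimization stage concrete, the sketch below solves the simplest version of the problem: a rigid 2D registration (rotation plus translation) between two point sets. It assumes that Target Selection, Feature Extraction, and Feature Matching have already produced matched point pairs, and it uses the classical closed-form least-squares solution rather than any DL-based method. All names and the toy points are hypothetical.

```python
import math


def register_rigid_2d(source, target):
    """Least-squares rotation angle and translation mapping source onto target.

    Assumes source[i] corresponds to target[i] (matching is already done).
    """
    n = len(source)
    if n == 0 or n != len(target):
        raise ValueError("need two non-empty point lists of equal length")
    src_cx = sum(p[0] for p in source) / n
    src_cy = sum(p[1] for p in source) / n
    tgt_cx = sum(q[0] for q in target) / n
    tgt_cy = sum(q[1] for q in target) / n
    # Accumulate dot and cross products of the centered point pairs.
    sum_dot = 0.0
    sum_cross = 0.0
    for (px, py), (qx, qy) in zip(source, target):
        ax, ay = px - src_cx, py - src_cy
        bx, by = qx - tgt_cx, qy - tgt_cy
        sum_dot += ax * bx + ay * by
        sum_cross += ax * by - ay * bx
    theta = math.atan2(sum_cross, sum_dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation aligns the rotated source centroid with the target centroid.
    shift = (tgt_cx - (c * src_cx - s * src_cy),
             tgt_cy - (s * src_cx + c * src_cy))
    return theta, shift


def apply_transform(points, theta, shift):
    """Rotate each point by theta, then translate it by shift."""
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1])
            for x, y in points]


def mean_squared_distance(a, b):
    """Average squared distance between corresponding points."""
    return sum((ax - bx) ** 2 + (ay - by) ** 2
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


# Hypothetical example: the fixed (target) points are the moving points
# rotated by 30 degrees and shifted by (2, -1).
moving = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 1.0)]
fixed = apply_transform(moving, math.radians(30), (2.0, -1.0))
theta, shift = register_rigid_2d(moving, fixed)
residual = mean_squared_distance(apply_transform(moving, theta, shift), fixed)
```

Real registration problems add unknown or noisy correspondences, non-rigid deformations, and 3D data, which is where iterative and learning-based approaches come in.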

Commonly, the input data for a DL-based registration approach can take various forms, e.g. point clouds, voxel grids, and meshes. Additionally, some techniques accept as inputs the results of the Feature Extraction or Feature Matching steps of the canonical scheme. Specifically, the input can be data in a particular form or the result of a step from the classical pipeline (feature vector, matching vector, or transformation). With the newest DL-based methods, however, a novel conceptual type of ecosystem emerges. It contains acquired characteristics of the target, its materials, and their behavior that can be registered with the input data. Such a conceptual ecosystem is formed by a neural network and the manner in which it is trained, and it can be regarded as an input to the registration approach. Nevertheless, it is not an input that can be adopted in every registration situation, since it corresponds to an internal data representation.

From a DL viewpoint, the interpretation of the conceptual design makes it possible to divide the input data of a registration approach into defined and non-defined models. In particular, a defined model depicts particular spatial data (e.g. 2D or 3D), while a non-defined one is a generalization of a data set created by a learning system. Yumer et al. [ 315 ] developed a framework in which the model acquires characteristics of objects, i.e., it learns to identify what a sportier car looks like or what a more comfortable chair is, and adjusts a 3D model to fit those characteristics while maintaining the main characteristics of the original data. Likewise, a fundamental aspect of the unsupervised learning method introduced by Ding et al. [ 316 ] is that there is no target for the registration approach. In this instance, the network is capable of placing each input point cloud in a global space, solving SLAM problems in which many point clouds have to be registered rigidly. On the other hand, Mahadevan [ 317 ] proposed the combination of two conceptual models, building on the development of Imagination Machines, to provide flexible artificial intelligence systems and relationships between the learned phases through training schemes that are not based on labels and classifications. Another practical application of DL, especially CNNs, to image registration is the 3D reconstruction of objects. Wang et al. [ 318 ] applied an adversarial approach using CNNs to rebuild a 3D model of an object from its 2D image. The network learns many objects and accomplishes the registration between the image and the conceptual model. Similarly, Hermoza et al. [ 319 ] utilized a GAN for predicting the missing geometry of damaged archaeological objects, providing the reconstructed object in a voxel grid format together with a label indicating its class.

DL for medical image registration has numerous applications, which have been listed in several review papers [ 320 , 321 , 322 ]. Yang et al. [ 323 ] implemented stacked convolutional layers as an encoder-decoder approach to predict the deformation of the input pixels into their final configuration using MRI brain scans from the OASIS dataset. They employed a registration model known as Large Deformation Diffeomorphic Metric Mapping (LDDMM) and attained remarkable improvements in computation time. Miao et al. [ 324 ] used synthetic X-ray images to train a five-layer CNN to register 3D models of a trans-esophageal probe, a hand implant, and a knee implant onto 2D X-ray images for pose estimation. They reported that their model achieved an execution time of 0.1 s, an important improvement over conventional intensity-based registration techniques; moreover, it achieved effective registrations 79–99% of the time. Li et al. [ 325 ] introduced a neural network-based approach for the non-rigid 2D–3D registration of lateral cephalograms and volumetric cone-beam CT (CBCT) images.

Computational approaches

For computationally intensive applications, complex ML and DL approaches have rapidly been identified as the most significant techniques and are widely used in different fields. Advances in algorithms, combined with high computational performance and large datasets, make it possible to effectively execute several applications that were previously either impossible or difficult to consider.

Currently, several standard DNN configurations are available. The interconnection patterns between layers and the total number of layers represent the main differences between these configurations. Table 2 illustrates the growth rate of the overall number of layers over time, which seems to be far faster than the “Moore’s Law growth rate”. In standard DNNs, the number of layers grew by around 2.3× each year in the period from 2012 to 2016. Recent investigations of future ResNet versions reveal that the number of layers can be extended up to 1000. An SGD technique is employed to fit the weights (or parameters), while different optimization techniques are employed to update the parameters during the DNN training process. Repeated updates are required to improve network accuracy, and each update yields only a small improvement. For example, training ResNet on ImageNet, a large dataset containing more than 14 million images, takes around 30K to 40K iterations to converge to a steady solution. In addition, as an upper-level estimate, the overall computational load may exceed \(10^{20}\) FLOPS when both the training set size and the DNN complexity increase.
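
As a rough illustration of the growth-rate comparison above, the following sketch compounds the quoted 2.3× yearly growth over 2012–2016 and contrasts it with a Moore's Law style rate. The doubling period of two years is an assumption made for this example; the text does not state which formulation of Moore's Law is meant.

```python
def compound_growth(factor_per_year, years):
    """Total growth after compounding a yearly factor for a number of years."""
    return factor_per_year ** years


YEARS = 2016 - 2012             # period quoted in the text
LAYER_FACTOR_PER_YEAR = 2.3     # yearly growth in layer count, from the text
MOORE_DOUBLING_YEARS = 2        # assumption: doubling every two years

layer_growth = compound_growth(LAYER_FACTOR_PER_YEAR, YEARS)
moore_growth = 2 ** (YEARS / MOORE_DOUBLING_YEARS)
ratio = layer_growth / moore_growth
```

Under these assumptions the layer count grows by roughly 28× over the four years, against 4× for the Moore's Law style rate, i.e. about seven times faster.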

Prior to 2008, boosting the training to a satisfactory extent was achieved by using GPUs. Usually, days or weeks are needed for a training session, even with GPU support. Consequently, several optimization strategies were developed to reduce the extensive learning time. The computational requirements are expected to increase as DNNs continue to grow in both complexity and size.

In addition to the computational load, the memory bandwidth and capacity have a significant effect on overall training performance and, to a lesser extent, on inference. More specifically, in several network layers (such as convolutional layers) the parameters are shared across the input data, there is a sizeable amount of data reuse, and the computation exhibits a high computation-to-bandwidth ratio. By contrast, in the FC layers there are no shared parameters, the amount of data reuse is extremely small, and the computation-to-bandwidth ratio is extremely small. Table 3 presents a comparison between different aspects related to the devices. The table is intended to clarify the tradeoffs involved in finding the optimal approach for configuring a system based on FPGA, GPU, or CPU devices. It should be noted that each has its own weaknesses and strengths; accordingly, there is no clear one-size-fits-all solution.
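
The contrast in computation-to-bandwidth ratio can be estimated with simple arithmetic. The sketch below counts floating-point operations per byte moved for one convolutional layer and one FC layer at batch size 1. The layer shapes, the fp32 storage size, and the simplification that every weight and activation is moved exactly once are all assumptions chosen for illustration.

```python
BYTES_PER_VALUE = 4  # assumption: fp32 weights and activations


def conv_layer_intensity(height, width, c_in, c_out, kernel):
    """FLOPs per byte for a stride-1, same-padded convolution (batch size 1)."""
    macs = height * width * c_out * c_in * kernel * kernel
    flops = 2 * macs  # one multiply and one add per multiply-accumulate
    weights = c_out * c_in * kernel * kernel
    activations = height * width * (c_in + c_out)  # input plus output maps
    return flops / ((weights + activations) * BYTES_PER_VALUE)


def fc_layer_intensity(n_in, n_out):
    """FLOPs per byte for a fully connected layer (batch size 1)."""
    flops = 2 * n_in * n_out
    values = n_in * n_out + n_in + n_out  # weights, inputs, outputs
    return flops / (values * BYTES_PER_VALUE)


# Hypothetical layer shapes, chosen only to illustrate the contrast.
conv_ratio = conv_layer_intensity(height=56, width=56, c_in=64, c_out=64, kernel=3)
fc_ratio = fc_layer_intensity(n_in=4096, n_out=4096)
```

With these shapes the convolutional layer performs on the order of a hundred operations per byte because each weight is reused at every spatial position, while the FC layer performs about half an operation per byte because each weight is used once. Larger batch sizes raise the FC ratio by reusing weights across samples.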

Although GPU processing has enhanced the ability to address the computational challenges related to such networks, the maximum GPU (or CPU) performance is not achieved, and several techniques or models have turned out to be strongly bandwidth-bound. In the worst cases, the GPU efficiency is between 15 and 20% of the maximum theoretical performance. Addressing this issue requires enlarging the memory bandwidth, for example by using high-bandwidth stacked memory. Next, different approaches based on FPGA, GPU, and CPU are detailed.

CPU-based approach

CPU nodes usually offer robust network connectivity, storage capabilities, and large memory. Although CPU nodes are more general-purpose than FPGA or GPU nodes, they cannot match them in raw compute capability, since this requires increased network capability and a larger memory capacity.

GPU-based approach

GPUs are extremely effective for several basic DL primitives, which include highly parallel operations such as activation functions, matrix multiplication, and convolutions [ 326 , 327 , 328 , 329 , 330 ]. Incorporating HBM-stacked memory into up-to-date GPU models significantly enhances the bandwidth. This enhancement allows numerous primitives to efficiently utilize all computational resources of the available GPUs. The improvement in GPU performance over CPU performance is usually 10–20:1 for dense linear algebra operations.

Maximizing parallel processing is the basis of the GPU programming model. For example, a GPU model may involve up to sixty-four computational units. There are four SIMD engines per computational unit, and each SIMD engine has sixteen floating-point computation lanes. The peak performance is 25 TFLOPS (fp16) and 10 TFLOPS (fp32) as utilization approaches 100%. Additional GPU performance may be achieved if the vector add and multiply functions are combined into inner-product instructions that match the primitives of matrix operations.
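
The lane counts above translate into a peak throughput figure with a short calculation. The sketch below assumes four SIMD engines for each of the sixty-four computational units and one fused multiply-add (two floating-point operations) per lane per cycle; the clock frequency is not given in the text, so it is back-solved from the quoted 10 TFLOPS (fp32) figure and should be read as an implied value, not a specification.

```python
COMPUTE_UNITS = 64            # from the text
SIMD_PER_UNIT = 4             # from the text
LANES_PER_SIMD = 16           # from the text
FLOPS_PER_LANE_PER_CYCLE = 2  # assumption: one fused multiply-add per cycle

total_lanes = COMPUTE_UNITS * SIMD_PER_UNIT * LANES_PER_SIMD


def peak_tflops(clock_ghz):
    """Theoretical peak throughput at full utilization for a given clock."""
    return total_lanes * FLOPS_PER_LANE_PER_CYCLE * clock_ghz * 1e9 / 1e12


def implied_clock_ghz(target_tflops):
    """Clock frequency that would be needed to reach a given peak throughput."""
    return target_tflops * 1e12 / (total_lanes * FLOPS_PER_LANE_PER_CYCLE) / 1e9


fp32_clock_ghz = implied_clock_ghz(10.0)  # about 1.22 GHz under these assumptions
```

The same arithmetic shows why utilization matters: at 15–20% efficiency, a device with a 10 TFLOPS peak sustains only 1.5–2 TFLOPS.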

For DNN training, the GPU is usually considered to be an optimized design, while for inference operations, it may also offer considerable performance improvements.

FPGA-based approach

FPGAs are widely utilized in various tasks including deep learning [ 199 , 247 , 331 , 332 , 333 , 334 ]. Inference accelerators are commonly implemented using FPGAs. The FPGA can be effectively configured to eliminate the unnecessary or overhead functions involved in GPU systems. Compared to the GPU, the FPGA is limited by weaker floating-point performance and is mainly suited to integer inference. The main FPGA feature is the capability to dynamically reconfigure the array characteristics (at run-time), as well as the capability to configure the array by means of an effective design with little or no overhead.

As mentioned earlier, the FPGA offers better performance per watt and lower latency than the GPU and CPU in DL inference operations. Implementation of custom high-performance hardware, pruned networks, and reduced arithmetic precision are three factors that enable the FPGA to implement DL algorithms with this level of efficiency. In addition, FPGAs may be employed to implement CNN overlay engines with over 80% efficiency, eight-bit precision, and over 15 TOPs peak performance; this has been demonstrated for a few conventional CNNs by Xilinx and partners. Pruning techniques, by contrast, are mostly employed in the LSTM context. Model sizes can be reduced by up to 20×, which provides an important benefit when implementing the optimal solution, as demonstrated for MLP neural processing. A recent study on fixed-point and custom floating-point precision has revealed that lowering precision below 8 bits is extremely promising; moreover, it provides further headroom for peak-performance FPGA implementations of DNN models.
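
Two of the factors named above, reduced arithmetic precision and pruned networks, can be illustrated on a plain list of weights. The sketch below shows symmetric 8-bit quantization and magnitude-based pruning in their simplest forms. It is an assumption-laden illustration and does not reproduce the specific schemes used in the FPGA work referred to in the text; the function names and weight values are hypothetical.

```python
def quantize_int8(values):
    """Symmetric linear quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def prune_by_magnitude(values, sparsity):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(values) * sparsity)
    order = sorted(range(len(values)), key=lambda i: abs(values[i]))
    pruned = list(values)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


# Hypothetical weights for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
q_weights, scale = quantize_int8(weights)
restored = dequantize(q_weights, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
sparse_weights = prune_by_magnitude(weights, sparsity=0.5)
```

Each quantized weight fits in one byte instead of four, and the rounding error per weight is bounded by half the scale. Pruned weights can be skipped entirely by hardware that supports sparse computation.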

Evaluation metrics

Evaluation metrics adopted within DL tasks play a crucial role in achieving an optimized classifier [ 335 ]. They are used in the two main stages of a typical data classification procedure: training and testing. During the training stage, the evaluation metric is used to optimize the classification algorithm. That is, the metric acts as a discriminator that selects the optimized solution, i.e., the one that can generate a more accurate prediction in future evaluations of a specific classifier. During the testing stage, the evaluation metric acts as an evaluator, measuring the effectiveness of the created classifier on unseen data. TN and TP are defined as the number of negative and positive instances, respectively, that are correctly classified (see Eq. 20 ). In addition, FN and FP are defined as the number of misclassified positive and negative instances, respectively. Some of the most well-known evaluation metrics are listed below.

Accuracy: Calculates the ratio of correctly predicted classes to the total number of samples evaluated (Eq. 20 ).

Sensitivity or Recall: Calculates the fraction of positive patterns that are correctly classified (Eq. 21 ).

Specificity: Calculates the fraction of negative patterns that are correctly classified (Eq. 22 ).

Precision: Calculates the fraction of patterns predicted as positive that are truly positive (Eq. 23 ).

F1-Score: Calculates the harmonic mean of the recall and precision rates (Eq. 24 ).

J Score: This metric is also called Youden's J statistic (Eq. 25 ).

False Positive Rate (FPR): This metric refers to the probability of a false alarm (Eq. 26 ).

Area Under the ROC Curve: AUC is a common ranking-type metric. It is utilized to conduct comparisons between learning algorithms [ 336 , 337 , 338 ], as well as to construct an optimal learning model [ 339 , 340 ]. In contrast to probability and threshold metrics, the AUC value exposes the entire classifier ranking performance. The following formula is used to calculate the AUC value for a two-class problem [ 341 ] (Eq. 27 ).

Here, \(S_{p}\) represents the sum of the ranks of all positive samples. The number of negative and positive samples is denoted as \(n_{n}\) and \(n_{p}\) , respectively. The AUC has been shown, both empirically and theoretically, to be better than the accuracy metric, which makes it very helpful for identifying an optimized solution and evaluating the classifier performance during classification training.
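
A minimal sketch of the two-class AUC computation follows. It assumes that Eq. 27 takes the usual rank-sum form \(AUC = \left( S_{p} - n_{p}(n_{p}+1)/2 \right) / \left( n_{p} n_{n} \right)\), where samples are ranked in ascending order of classifier score and tied scores share their average rank. The function name and the toy data are illustrative.

```python
def auc_from_ranks(scores, labels):
    """Two-class AUC from the rank sum of the positive samples.

    labels use 1 for the positive class and 0 for the negative class.
    """
    n_p = sum(1 for y in labels if y == 1)
    n_n = len(labels) - n_p
    if n_p == 0 or n_n == 0:
        raise ValueError("both classes must be present")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores that starts at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    s_p = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)


# Hypothetical scores: one positive sample is ranked below one negative sample.
example_auc = auc_from_ranks([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

The sort dominates the cost, which is where the \(n\log n\) factor in the complexity of AUC computation comes from.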

The AUC performs very well in both the discrimination and evaluation processes. However, for multiclass problems, computing the AUC is costly, particularly when discriminating among a large number of generated solutions. The time complexity for computing the AUC is \(O \left( |C|^{2} \; n\log n\right) \) with respect to the Hand and Till AUC model [ 341 ] and \(O \left( |C| \; n\log n\right) \) according to Provost and Domingos's AUC model [ 336 ].
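
The threshold metrics listed earlier in this section (accuracy through FPR) can all be derived from the four confusion-matrix counts. The sketch below uses their standard textbook definitions, which are assumed to match Eqs. 20–26; the function name and the example counts are hypothetical.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of failing when a denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, tn, fp, fn):
    """Standard threshold metrics from the confusion-matrix counts."""
    recall = safe_div(tp, tp + fn)       # sensitivity
    specificity = safe_div(tn, tn + fp)
    precision = safe_div(tp, tp + fp)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "j_score": recall + specificity - 1,  # Youden's J statistic
        "fpr": safe_div(fp, fp + tn),         # equals 1 - specificity
    }


# Hypothetical counts: 100 positive and 100 negative test samples.
metrics = classification_metrics(tp=70, tn=90, fp=10, fn=30)
```

With these counts the classifier reaches 0.8 accuracy but only 0.7 recall, which shows why a single metric rarely describes a classifier fully.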

Frameworks and datasets

Several DL frameworks and datasets have been developed in the last few years. Various frameworks and libraries have been used to expedite the work with good results, and their use has made the training process easier. Table 4 lists the most utilized frameworks and libraries.

Based on the star ratings on GitHub, as well as our own background in the field, TensorFlow is deemed the most effective and easy to use. It is able to run on several platforms. (GitHub is one of the biggest software hosting sites, and GitHub stars indicate how well-regarded a project is on the site.) Moreover, several benchmark datasets are employed for different DL tasks. Some of these are listed in Table 5 .

Summary and conclusion

Finally, a brief discussion that gathers the relevant points from this extensive review is in order. An itemized analysis is presented next to conclude our review and outline future directions.

DL still experiences difficulties in simultaneously modeling multiple complex data modalities. Multimodal DL has become a common approach in recent DL developments to address this.

DL requires sizeable datasets (preferably labeled) to train the models and to predict unseen data. This challenge is particularly difficult when real-time data processing is required or when the available datasets are limited (such as in the case of healthcare data). To alleviate this issue, TL and data augmentation have been researched over the last few years.

Although ML slowly transitions to semi-supervised and unsupervised learning to manage practical data without the need for manual human labeling, many of the current deep-learning models utilize supervised learning.

CNN performance is greatly influenced by hyper-parameter selection. Even a small change in the hyper-parameter values can affect the overall CNN performance. Therefore, careful hyper-parameter selection is an extremely significant issue that should be considered when developing an optimization scheme.

Powerful and robust hardware resources such as GPUs are required for effective CNN training. They are also required for exploring the efficiency of using CNNs in smart and embedded systems.

In the CNN context, ensemble learning [ 342 , 343 ] represents a prospective research area. The collection of different and multiple architectures will support the model in improving its generalizability across different image categories through extracting several levels of semantic image representation. Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.

Exploiting depth and different structural adaptations significantly improves the CNN learning capacity. Substituting the traditional layer configuration with blocks results in significant advances in CNN performance, as has been shown in the recent literature. Currently, developing novel and efficient block architectures is the main trend in research on new CNN architectures. HRNet is only one example that shows there are always ways to improve the architecture.

It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution to handling the enormous amount of data. It also helps to increase efficiency and reduce costs. Furthermore, it offers the flexibility to train DL architectures.

With the recent development in computational tools including a chip for neural networks and a mobile GPU, we will see more DL applications on mobile devices. It will be easier for users to use DL.

Regarding the lack of training data, it is expected that various transfer learning techniques will be considered, such as training the DL model on large unlabeled image datasets and then transferring the knowledge to train the DL model on a small number of labeled images for the same task.

Last, this overview provides a starting point for members of the community who are interested in the field of DL. Furthermore, it should help researchers decide on the most suitable direction of work in order to provide more accurate alternatives to the field.

Availability of data and materials

Not applicable.

References

Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci. 2020;6:307.

Amrit C, Paauw T, Aly R, Lavric M. Identifying child abuse through text mining and machine learning. Expert Syst Appl. 2017;88:402–18.

Hossain E, Khan I, Un-Noor F, Sikander SS, Sunny MSH. Application of big data and machine learning in smart grid, and associated security concerns: a review. IEEE Access. 2019;7:13960–88.

Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H. Survey of review spam detection using machine learning techniques. J Big Data. 2015;2(1):23.

Deldjoo Y, Elahi M, Cremonesi P, Garzotto F, Piazzolla P, Quadrana M. Content-based video recommendation system based on stylistic visual features. J Data Semant. 2016;5(2):99–113.

Al-Dulaimi K, Chandran V, Nguyen K, Banks J, Tomeo-Reyes I. Benchmarking hep-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape. Pattern Recogn Lett. 2019;125:534–41.

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar S. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR). 2018;51(5):1–36.

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK. A state-of-the-art survey on deep learning theory and architectures. Electronics. 2019;8(3):292.

Potok TE, Schuman C, Young S, Patton R, Spedalieri F, Liu J, Yao KT, Rose G, Chakma G. A study of complex deep learning networks on high-performance, neuromorphic, and quantum computers. ACM J Emerg Technol Comput Syst (JETC). 2018;14(2):1–21.

Adeel A, Gogate M, Hussain A. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf Fusion. 2020;59:163–70.

Tian H, Chen SC, Shyu ML. Evolutionary programming based deep learning feature selection and network construction for visual data classification. Inf Syst Front. 2020;22(5):1053–66.

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag. 2018;13(3):55–75.

Koppe G, Meyer-Lindenberg A, Durstewitz D. Deep learning for small and big data in psychiatry. Neuropsychopharmacology. 2021;46(1):176–90.

Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE; 2005. p. 886–93.

Lowe DG. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE; 1999. p. 1150–7.

Wu L, Hoi SC, Yu N. Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process. 2010;19(7):1908–20.

LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett. 2019;118:14–22.

Dhillon A, Verma GK. Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell. 2020;9(2):85–112.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hasan RI, Yusuf SM, Alzubaidi L. Review of the state of the art of deep learning for plant diseases: a broad analysis and discussion. Plants. 2020;9(10):1302.

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X. A review of object detection based on deep learning. Multimed Tools Appl. 2020;79(33):23729–91.

Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access. 2017;6:9375–89.

Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981333 .

Shrestha A, Mahmood A. Review of deep learning algorithms and architectures. IEEE Access. 2019;7:53040–65.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Goodfellow I, Bengio Y, Courville A. Deep learning, vol. 1. Cambridge: MIT press; 2016.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021;8(1):1–54.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

Bhowmick S, Nagarajaiah S, Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from uav videos. Sensors. 2020;20(21):6299.

Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291–307.

Li Y, Zhang T, Sun S, Gao X. Accelerating flash calculation through deep learning methods. J Comput Phys. 2019;394:153–65.

Yang W, Zhang X, Tian Y, Wang W, Xue JH, Liao Q. Deep learning for single image super-resolution: a brief review. IEEE Trans Multimed. 2019;21(12):3106–21.

Tang J, Li S, Liu P. A review of lane detection methods based on deep learning. Pattern Recogn. 2020;111:107623.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.

Ng A. Machine learning yearning: technical strategy for AI engineers in the era of deep learning. 2019. https://www.mlyearning.org .

Metz C. Turing award won by 3 pioneers in artificial intelligence. The New York Times. 2019;27.

Nevo S, Anisimov V, Elidan G, El-Yaniv R, Giencke P, Gigi Y, Hassidim A, Moshe Z, Schlesinger M, Shalev G, et al. Ml for flood forecasting at scale; 2019. arXiv preprint arXiv:1901.09583 .

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23(6):1241–50.

Benhammou Y, Achchab B, Herrera F, Tabik S. Breakhis based breast cancer automatic diagnosis using deep learning: taxonomy, survey and insights. Neurocomputing. 2020;375:9–24.

Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE. 2020;15(6):e0233678.

Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.

Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Brunese L, Mercaldo F, Reginelli A, Santone A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput Methods Programs Biomed. 2020;196:105608.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

Shorfuzzaman M, Hossain MS. Metacovid: a siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recogn. 2020;113:107700.

Carvelli L, Olesen AN, Brink-Kjær A, Leary EB, Peppard PE, Mignot E, Sørensen HB, Jennum P. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 2020;69:109–19.

De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O’Donoghue B, Visentin D, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.

Van Essen B, Kim H, Pearce R, Boakye K, Chen B. LBANN: Livermore big artificial neural network HPC toolkit. In: Proceedings of the workshop on machine learning in high-performance computing environments; 2015. p. 1–6.

Saeed MM, Al Aghbari Z, Alsharidah M. Big data clustering techniques based on spark: a literature review. PeerJ Comput Sci. 2020;6:321.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1631–42.

Goller C, Kuchler A. Learning task-dependent distributed representations by backpropagation through structure. In: Proceedings of international conference on neural networks (ICNN’96), vol 1. IEEE; 1996. p. 347–52.

Socher R, Lin CCY, Ng AY, Manning CD. Parsing natural scenes and natural language with recursive neural networks. In: ICML; 2011.

Louppe G, Cho K, Becot C, Cranmer K. QCD-aware recursive neural networks for jet physics. J High Energy Phys. 2019;2019(1):57.

Sadr H, Pedram MM, Teshnehlab M. A robust sentiment analysis method based on sequential combination of convolutional and recursive neural networks. Neural Process Lett. 2019;50(3):2745–61.

Urban G, Subrahmanya N, Baldi P. Inner and outer recursive neural networks for chemoinformatics applications. J Chem Inf Model. 2018;58(2):207–11.

Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast. 2020;37(1):388–427.

Jiang Y, Kim H, Asnani H, Kannan S, Oh S, Viswanath P. Learn codes: inventing low-latency codes via recurrent neural networks. IEEE J Sel Areas Inf Theory. 2020;1(1):207–16.

John RA, Acharya J, Zhu C, Surendran A, Bose SK, Chaturvedi A, Tiwari N, Gao Y, He Y, Zhang KK, et al. Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks. Nat Commun. 2020;11(1):1–9.

Batur Dinler Ö, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci. 2020;10(4):1273.

Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing, vol. 2016. NIH Public Access; 2016. p. 856.

Pascanu R, Gulcehre C, Cho K, Bengio Y. How to construct deep recurrent neural networks. In: Proceedings of the second international conference on learning representations (ICLR 2014); 2014.

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 249–56.

Gao C, Yan J, Zhou S, Varshney PK, Liu H. Long short-term memory-based deep recurrent neural networks for target tracking. Inf Sci. 2019;502:279–96.

Zhou DX. Theory of deep convolutional neural networks: downsampling. Neural Netw. 2020;124:319–27.

Jhong SY, Tseng PY, Siriphockpirom N, Hsia CH, Huang MS, Hua KL, Chen YY. An automated biometric identification system using CNN-based palm vein recognition. In: 2020 international conference on advanced robotics and intelligent systems (ARIS). IEEE; 2020. p. 1–6.

Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. Deepcryopicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinform. 2020;21(1):1–38.

Wang T, Lu C, Yang M, Hong F, Liu C. A hybrid method for heartbeat classification via convolutional neural networks, multilayer perceptrons and focal loss. PeerJ Comput Sci. 2020;6:324.

Li G, Zhang M, Li J, Lv F, Tong G. Efficient densely connected convolutional neural networks. Pattern Recogn. 2021;109:107610.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, et al. Recent advances in convolutional neural networks. Pattern Recogn. 2018;77:354–77.

Fang W, Love PE, Luo H, Ding L. Computer vision for behaviour-based safety in construction: a review and future directions. Adv Eng Inform. 2020;43:100980.

Palaz D, Magimai-Doss M, Collobert R. End-to-end acoustic modeling using convolutional neural networks for HMM-based automatic speech recognition. Speech Commun. 2019;108:15–32.

Li HC, Deng ZY, Chiang HH. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20(21):6114.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift; 2015. arXiv preprint arXiv:1502.03167 .

Ruder S. An overview of gradient descent optimization algorithms; 2016. arXiv preprint arXiv:1609.04747 .

Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–86.

Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on. 2012;14(8).

Zhang Z. Improved Adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS). IEEE; 2018. p. 1–2.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Duan Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics. 2020;9(3):427.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Santamaría J, Duan Y, Oleiwi SR. Towards a better understanding of transfer learning for medical imaging: a case study. Appl Sci. 2020;10(13):4523.

Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.

Dahl GE, Sainath TN, Hinton GE. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8609–13.

Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network; 2015. arXiv preprint arXiv:1505.00853 .

Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst. 1998;6(02):107–16.

Lin M, Chen Q, Yan S. Network in network; 2013. arXiv preprint arXiv:1312.4400 .

Hsiao TY, Chang YC, Chou HH, Chiu CT. Filter-based deep-compression with global average pooling for convolutional networks. J Syst Arch. 2019;95:9–18.

Li Z, Wang SH, Fan RR, Cao G, Zhang YD, Guo T. Teeth category classification via seven-layer deep convolutional neural network with max pooling and global average pooling. Int J Imaging Syst Technol. 2019;29(4):577–83.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.

Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Univ Montreal. 2009;1341(3):1.

Le QV. Building high-level features using large scale unsupervised learning. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8595–8.

Grün F, Rupprecht C, Navab N, Tombari F. A taxonomy and library for visualizing learned features in convolutional neural networks; 2016. arXiv preprint arXiv:1606.07757 .

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv preprint arXiv:1409.1556 .

Ranzato M, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007. p. 1–8.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.

Bengio Y, et al. RMSProp and equilibrated adaptive learning rates for nonconvex optimization; 2015. arXiv preprint arXiv:1502.04390 .

Srivastava RK, Greff K, Schmidhuber J. Highway networks; 2015. arXiv preprint arXiv:1505.00387 .

Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans Smart Grid. 2017;10(1):841–51.

Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016;16(1):115.

Cireşan D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning; 2016. arXiv preprint arXiv:1602.07261 .

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Wu S, Zhong S, Liu Y. Deep residual learning for image steganalysis. Multimed Tools Appl. 2018;77(9):10437–53.

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–08.

Rubin J, Parvaneh S, Rahman A, Conroy B, Babaeizadeh S. Densely connected convolutional networks for detection of atrial fibrillation from short single-lead ECG recordings. J Electrocardiol. 2018;51(6):S18-21.

Kuang P, Ma T, Chen Z, Li F. Image super-resolution with densely connected convolutional networks. Appl Intell. 2019;49(1):125–36.

Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–500.

Su A, He X, Zhao X. JPEG steganalysis based on ResNeXt with Gauss partial derivative filters. Multimed Tools Appl. 2020;80(3):3349–66.

Yadav D, Jalal A, Garlapati D, Hossain K, Goyal A, Pant G. Deep learning-based ResNeXt model in phycological studies for future. Algal Res. 2020;50:102018.

Han W, Feng R, Wang L, Gao L. Adaptive spatial-scale-aware deep convolutional neural network for high-resolution remote sensing imagery scene classification. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. p. 4736–9.

Zagoruyko S, Komodakis N. Wide residual networks; 2016. arXiv preprint arXiv:1605.07146 .

Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ. Deep networks with stochastic depth. In: European conference on computer vision. Springer; 2016. p. 646–61.

Huynh HT, Nguyen H. Joint age estimation and gender classification of Asian faces using wide ResNet. SN Comput Sci. 2020;1(5):1–9.

Takahashi R, Matsubara T, Uehara K. Data augmentation using random image cropping and patching for deep CNNs. IEEE Trans Circuits Syst Video Technol. 2019;30(9):2917–31.

Han D, Kim J, Kim J. Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 5927–35.

Wang Y, Wang L, Wang H, Li P. End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access. 2019;7:31959–70.

Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1251–8.

Lo WW, Yang X, Wang Y. An xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP international conference on new technologies, mobility and security (NTMS). IEEE; 2019. p. 1–5.

Rahimzadeh M, Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of xception and resnet50v2. Inform Med Unlocked. 2020;19:100360.

Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3156–64.

Salakhutdinov R, Larochelle H. Efficient learning of deep Boltzmann machines. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 693–700.

Goh H, Thome N, Cord M, Lim JH. Top-down regularization of deep belief networks. Adv Neural Inf Process Syst. 2013;26:1878–86.

Guan J, Lai R, Xiong A, Liu Z, Gu L. Fixed pattern noise reduction for infrared images based on cascade residual attention CNN. Neurocomputing. 2020;377:301–13.

Bi Q, Qin K, Zhang H, Li Z, Xu K. RADC-Net: a residual attention based convolution network for aerial scene classification. Neurocomputing. 2020;377:345–59.

Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2015. p. 2017–25.

Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

Mou L, Zhu XX. Learning to pay attention on spectral domain: a spectral attention module-based convolutional network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2019;58(1):110–22.

Woo S, Park J, Lee JY, So Kweon I. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.

Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 421–9.

Roy AG, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation” blocks. IEEE Trans Med Imaging. 2018;38(2):540–9.

Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 3856–66.

Arun P, Buddhiraju KM, Porwal A. Capsulenet-based spatial-spectral classifier for hyperspectral images. IEEE J Sel Topics Appl Earth Obs Remote Sens. 2019;12(6):1849–65.

Xinwei L, Lianghao X, Yi Y. Compact video fingerprinting via an improved capsule net. Syst Sci Control Eng. 2020;9:1–9.

Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: a genetic DCNN designer for image classification. Neurocomputing. 2020;379:152–61.

Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020. https://doi.org/10.1109/TPAMI.2020.2983686 .

Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. Higherhrnet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR 2020; 2020. https://www.microsoft.com/en-us/research/publication/higherhrnet-scale-aware-representation-learning-for-bottom-up-human-pose-estimation/ .

Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460 .

Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385 .

Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 3320–8.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. In: International conference on artificial neural networks. Springer; 2018. p. 270–9.

Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

Wang F, Wang H, Wang H, Li G, Situ G. Learning from simulation: an end-to-end deep-learning approach for computational ghost imaging. Opt Express. 2019;27(18):25560–72.

Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177:447–53.

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.

Cook D, Feuz KD, Krishnan NC. Transfer learning for activity recognition: a survey. Knowl Inf Syst. 2013;36(3):537–56.

Cao X, Wang Z, Yan P, Li X. Transfer learning for pedestrian detection. Neurocomputing. 2013;100:51–7.

Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2019. p. 3347–57.

Pham TN, Van Tran L, Dao SVT. Early disease classification of mango leaves using feed-forward neural network and hybrid metaheuristic feature selection. IEEE Access. 2020;8:189960–73.

Saleh AM, Hamoud T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J Big Data. 2021;8(1):1–20.

Hirahara D, Takaya E, Takahara T, Ueda T. Effects of data count and image scaling on deep learning training. PeerJ Comput Sci. 2020;6:312.

Moreno-Barea FJ, Strazzera F, Jerez JM, Urda D, Franco L. Forward noise adjustment scheme for data augmentation. In: 2018 IEEE symposium series on computational intelligence (SSCI). IEEE; 2018. p. 728–34.

Dua D, Karra Taniskidou E. UCI machine learning repository. Irvine: University of California, School of Information and Computer Science; 2017. http://archive.ics.uci.edu/ml

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.

Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2011. p. 333–44.

Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Trans Cybern. 2013;44(3):445–55.

Wang S, Sun S, Xu J. AUC-maximized deep convolutional neural fields for sequence labeling; 2015. arXiv preprint arXiv:1511.05265 .

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. Deepre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760–9.

Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.

Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2016. p. 3504–12.

Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170387.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.

Pokuri BSS, Ghosal S, Kokate A, Sarkar S, Ganapathysubramanian B. Interpretable deep learning for guided microstructure-property explorations in photovoltaics. NPJ Comput Mater. 2019;5(1):1–11.

Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

Wang L, Nie R, Yu Z, Xin R, Zheng C, Zhang Z, Zhang J, Cai J. An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data. Nat Mach Intell. 2020;2(11):1–11.

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks; 2017. arXiv preprint arXiv:1703.01365 .

Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999;10(3):61–74.

Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557.

Herzog L, Murina E, Dürr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790.

Pereyra G, Tucker G, Chorowski J, Kaiser Ł, Hinton G. Regularizing neural networks by penalizing confident output distributions; 2017. arXiv preprint arXiv:1701.06548 .

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using Bayesian binning. In: Proceedings of the... AAAI conference on artificial intelligence, vol. 2015. NIH Public Access; 2015. p. 2901.

Li M, Sethi IK. Confidence-based classifier design. Pattern Recogn. 2006;39(7):1230–40.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and Naive Bayesian classifiers. In: ICML, vol. 1, Citeseer; 2001. p. 609–16.

Steinwart I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans Inf Theory. 2005;51(1):128–42.

Lee K, Lee K, Shin J, Lee H. Overcoming catastrophic forgetting with unlabeled data in the wild. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 312–21.

Shmelkov K, Schmid C, Alahari K. Incremental learning of object detectors without catastrophic forgetting. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 3400–09.

Zenke F, Gerstner W, Ganguli S. The temporal paradox of Hebbian learning and homeostatic plasticity. Curr Opin Neurobiol. 2017;43:166–76.

Andersen N, Krauth N, Nabavi S. Hebbian plasticity in vivo: relevance and induction. Curr Opin Neurobiol. 2017;45:188–92.

Zheng R, Chakraborti S. A phase II nonparametric adaptive exponentially weighted moving average control chart. Qual Eng. 2016;28(4):476–90.

Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH. ICARL: Incremental classifier and representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2001–10.

Hinton GE, Plaut DC. Using fast weights to deblur old memories. In: Proceedings of the ninth annual conference of the cognitive science society; 1987. p. 177–86.

Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Soltoggio A, Stanley KO, Risi S. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Netw. 2018;108:48–67.

Parisi GI, Tani J, Weber C, Wermter S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017;96:137–49.

Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35(1):126–36.

Wiedemann S, Kirchhoffer H, Matlage S, Haase P, Marban A, Marinč T, Neumann D, Nguyen T, Schwarz H, Wiegand T, et al. Deepcabac: a universal compression algorithm for deep neural networks. IEEE J Sel Topics Signal Process. 2020;14(4):700–14.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.

Shawahna A, Sait SM, El-Maleh A. FPGA-based accelerators of deep learning networks for learning and classification: a review. IEEE Access. 2018;7:7823–59.

Min Z. Public welfare organization management system based on FPGA and deep learning. Microprocess Microsyst. 2020;80:103333.

Al-Shamma O, Fadhel MA, Hameed RA, Alzubaidi L, Zhang J. Boosting convolutional neural networks performance based on FPGA accelerator. In: International conference on intelligent systems design and applications. Springer; 2018. p. 509–17.

Han S, Mao H, Dally WJ. Deep compression: compressing deep neural networks with pruning, trained quantization and Huffman coding; 2015. arXiv preprint arXiv:1510.00149 .

Chen Z, Zhang L, Cao Z, Guo J. Distilling the knowledge from handcrafted features for human activity recognition. IEEE Trans Ind Inform. 2018;14(10):4334–42.

Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network; 2015. arXiv preprint arXiv:1503.02531 .

Lenssen JE, Fey M, Libuschewski P. Group equivariant capsule networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 8844–53.

Denton EL, Zaremba W, Bruna J, LeCun Y, Fergus R. Exploiting linear structure within convolutional networks for efficient evaluation. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 1269–77.

Xu Q, Zhang M, Gu Z, Pan G. Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing. 2019;328:69–74.

Zhang C, Bengio S, Hardt M, Recht B, Vinyals O. Understanding deep learning requires rethinking generalization. Commun ACM. 2018;64(3):107–15.

Xu X, Jiang X, Ma C, Du P, Li X, Lv S, Yu L, Ni Q, Chen Y, Su J, et al. A deep learning system to screen novel coronavirus disease 2019 pneumonia. Engineering. 2020;6(10):1122–9.

Sharma K, Alsadoon A, Prasad P, Al-Dala’in T, Nguyen TQV, Pham DTH. A novel solution of using deep learning for left ventricle detection: enhanced feature extraction. Comput Methods Programs Biomed. 2020;197:105751.

Zhang G, Wang C, Xu B, Grosse R. Three mechanisms of weight decay regularization; 2018. arXiv preprint arXiv:1810.12281 .

Laurent C, Pereyra G, Brakel P, Zhang Y, Bengio Y. Batch normalized recurrent neural networks. In: 2016 IEEE international conference on acoustics, speech and signal processing (ICASSP), IEEE; 2016. p. 2657–61.

Salamon J, Bello JP. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process Lett. 2017;24(3):279–83.

Wang X, Qin Y, Wang Y, Xiang S, Chen H. ReLTanh: an activation function with vanishing gradient resistance for SAE-based DNNs and its application to rotating machinery fault diagnosis. Neurocomputing. 2019;363:88–98.

Tan HH, Lim KH. Vanishing gradient mitigation with deep learning neural network optimization. In: 2019 7th international conference on smart computing & communications (ICSCC). IEEE; 2019. p. 1–4.

MacDonald G, Godbout A, Gillcash B, Cairns S. Volume-preserving neural networks: a solution to the vanishing gradient problem; 2019. arXiv preprint arXiv:1911.09576 .

Mittal S, Vaishay S. A survey of techniques for optimizing deep learning on GPUs. J Syst Arch. 2019;99:101635.

Kanai S, Fujiwara Y, Iwamura S. Preventing gradient explosions in gated recurrent units. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 435–44.

Hanin B. Which neural net architectures give rise to exploding and vanishing gradients? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2018. p. 582–91.

Ribeiro AH, Tiels K, Aguirre LA, Schön T. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In: International conference on artificial intelligence and statistics, PMLR; 2020. p. 2370–80.

D’Amour A, Heller K, Moldovan D, Adlam B, Alipanahi B, Beutel A, Chen C, Deaton J, Eisenstein J, Hoffman MD, et al. Underspecification presents challenges for credibility in modern machine learning; 2020. arXiv preprint arXiv:2011.03395 .

Chea P, Mandell JC. Current applications and future directions of deep learning in musculoskeletal radiology. Skelet Radiol. 2020;49(2):1–15.

Wu X, Sahoo D, Hoi SC. Recent advances in deep learning for object detection. Neurocomputing. 2020;396:39–64.

Kuutti S, Bowden R, Jin Y, Barber P, Fallah S. A survey of deep learning applications to autonomous vehicle control. IEEE Trans Intell Transp Syst. 2020;22:712–33.

Yolcu G, Oztel I, Kazan S, Oz C, Bunyak F. Deep learning-based face analysis system for monitoring customer interest. J Ambient Intell Humaniz Comput. 2020;11(1):237–48.

Jiao L, Zhang F, Liu F, Yang S, Li L, Feng Z, Qu R. A survey of deep learning-based object detection. IEEE Access. 2019;7:128837–68.

Muhammad K, Khan S, Del Ser J, de Albuquerque VHC. Deep learning for multigrade brain tumor classification in smart healthcare systems: a prospective survey. IEEE Trans Neural Netw Learn Syst. 2020;32:507–22.

Litjens G, Kooi T, Bejnordi BE, Setio AAA, Ciompi F, Ghafoorian M, Van Der Laak JA, Van Ginneken B, Sánchez CI. A survey on deep learning in medical image analysis. Med Image Anal. 2017;42:60–88.

Mukherjee D, Mondal R, Singh PK, Sarkar R, Bhattacharjee D. Ensemconvnet: a deep learning approach for human activity recognition using smartphone sensors for healthcare applications. Multimed Tools Appl. 2020;79(41):31663–90.

Zeleznik R, Foldyna B, Eslami P, Weiss J, Alexander I, Taron J, Parmar C, Alvi RM, Banerji D, Uno M, et al. Deep convolutional neural networks to predict cardiovascular risk from computed tomography. Nature Commun. 2021;12(1):1–9.

Wang J, Liu Q, Xie H, Yang Z, Zhou H. Boosted efficientnet: detection of lymph node metastases in breast cancer using convolutional neural networks. Cancers. 2021;13(4):661.

Yu H, Yang LT, Zhang Q, Armstrong D, Deen MJ. Convolutional neural networks for medical image analysis: state-of-the-art, comparisons, improvement and perspectives. Neurocomputing. 2021. https://doi.org/10.1016/j.neucom.2020.04.157 .

Bharati S, Podder P, Mondal MRH. Hybrid deep learning for detecting lung diseases from X-ray images. Inform Med Unlocked. 2020;20:100391.

Dong Y, Pan Y, Zhang J, Xu W. Learning to read chest X-ray images from 16000+ examples using CNN. In: 2017 IEEE/ACM international conference on connected health: applications, systems and engineering technologies (CHASE). IEEE; 2017. p. 51–7.

Rajkomar A, Lingam S, Taylor AG, Blum M, Mongan J. High-throughput classification of radiographs using deep convolutional neural networks. J Digit Imaging. 2017;30(1):95–101.

Rajpurkar P, Irvin J, Zhu K, Yang B, Mehta H, Duan T, Ding D, Bagul A, Langlotz C, Shpanskaya K, et al. Chexnet: radiologist-level pneumonia detection on chest X-rays with deep learning; 2017. arXiv preprint arXiv:1711.05225 .

Wang X, Peng Y, Lu L, Lu Z, Bagheri M, Summers RM. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2097–106.

Zuo W, Zhou F, Li Z, Wang L. Multi-resolution CNN and knowledge transfer for candidate classification in lung nodule detection. IEEE Access. 2019;7:32510–21.

Shen W, Zhou M, Yang F, Yang C, Tian J. Multi-scale convolutional neural networks for lung nodule classification. In: International conference on information processing in medical imaging. Springer; 2015. p. 588–99.

Li R, Zhang W, Suk HI, Wang L, Li J, Shen D, Ji S. Deep learning based imaging data completion for improved brain disease diagnosis. In: International conference on medical image computing and computer-assisted intervention. Springer; 2014. p. 305–12.

Wen J, Thibeau-Sutre E, Diaz-Melo M, Samper-González J, Routier A, Bottani S, Dormont D, Durrleman S, Burgos N, Colliot O, et al. Convolutional neural networks for classification of Alzheimer’s disease: overview and reproducible evaluation. Med Image Anal. 2020;63:101694.

Mehmood A, Maqsood M, Bashir M, Shuyuan Y. A deep siamese convolution neural network for multi-class classification of Alzheimer disease. Brain Sci. 2020;10(2):84.

Hosseini-Asl E, Ghazal M, Mahmoud A, Aslantas A, Shalaby A, Casanova M, Barnes G, Gimel’farb G, Keynton R, El-Baz A. Alzheimer’s disease diagnostics by a 3d deeply supervised adaptable convolutional network. Front Biosci. 2018;23:584–96.

Korolev S, Safiullin A, Belyaev M, Dodonova Y. Residual and plain convolutional neural networks for 3D brain MRI classification. In: 2017 IEEE 14th international symposium on biomedical imaging (ISBI 2017). IEEE; 2017. p. 835–8.

Alzubaidi L, Fadhel MA, Oleiwi SR, Al-Shamma O, Zhang J. DFU_QUTNet: diabetic foot ulcer classification using novel deep convolutional neural network. Multimed Tools Appl. 2020;79(21):15655–77.

Goyal M, Reeves ND, Davison AK, Rajbhandari S, Spragg J, Yap MH. Dfunet: convolutional neural networks for diabetic foot ulcer classification. IEEE Trans Emerg Topics Comput Intell. 2018;4(5):728–39.

Yap MH., Hachiuma R, Alavi A, Brungel R, Goyal M, Zhu H, Cassidy B, Ruckert J, Olshansky M, Huang X, et al. Deep learning in diabetic foot ulcers detection: a comprehensive evaluation; 2020. arXiv preprint arXiv:2010.03341 .

Tulloch J, Zamani R, Akrami M. Machine learning in the prevention, diagnosis and management of diabetic foot ulcers: a systematic review. IEEE Access. 2020;8:198977–9000.

Fadhel MA, Al-Shamma O, Alzubaidi L, Oleiwi SR. Real-time sickle cell anemia diagnosis based hardware accelerator. In: International conference on new trends in information and communications technology applications, Springer; 2020. p. 189–99.

Debelee TG, Kebede SR, Schwenker F, Shewarega ZM. Deep learning in selected cancers’ image analysis—a survey. J Imaging. 2020;6(11):121.

Khan S, Islam N, Jan Z, Din IU, Rodrigues JJC. A novel deep learning based framework for the detection and classification of breast cancer using transfer learning. Pattern Recogn Lett. 2019;125:1–6.

Alzubaidi L, Hasan RI, Awad FH, Fadhel MA, Alshamma O, Zhang J. Multi-class breast cancer classification by a novel two-branch deep convolutional neural network architecture. In: 2019 12th international conference on developments in eSystems engineering (DeSE). IEEE; 2019. p. 268–73.

Roy K, Banik D, Bhattacharjee D, Nasipuri M. Patch-based system for classification of breast histology images using deep learning. Comput Med Imaging Gr. 2019;71:90–103.

Hameed Z, Zahia S, Garcia-Zapirain B, Javier Aguirre J, María Vanegas A. Breast cancer histopathology image classification using an ensemble of deep learning models. Sensors. 2020;20(16):4373.

Hosny KM, Kassem MA, Foaud MM. Skin cancer classification using deep learning and transfer learning. In: 2018 9th Cairo international biomedical engineering conference (CIBEC). IEEE; 2018. p. 90–3.

Dorj UO, Lee KK, Choi JY, Lee M. The skin cancer classification using deep convolutional neural network. Multimed Tools Appl. 2018;77(8):9909–24.

Kassem MA, Hosny KM, Fouad MM. Skin lesions classification into eight classes for ISIC 2019 using deep convolutional neural network and transfer learning. IEEE Access. 2020;8:114822–32.

Heidari M, Mirniaharikandehei S, Khuzani AZ, Danala G, Qiu Y, Zheng B. Improving the performance of CNN to predict the likelihood of COVID-19 using chest X-ray images with preprocessing algorithms. Int J Med Inform. 2020;144:104284.

Al-Timemy AH, Khushaba RN, Mosa ZM, Escudero J. An efficient mixture of deep and machine learning models for COVID-19 and tuberculosis detection using X-ray images in resource limited settings 2020. arXiv preprint arXiv:2007.08223 .

Abraham B, Nair MS. Computer-aided detection of COVID-19 from X-ray images using multi-CNN and Bayesnet classifier. Biocybern Biomed Eng. 2020;40(4):1436–45.

Nour M, Cömert Z, Polat K. A novel medical diagnosis model for COVID-19 infection detection based on deep features and Bayesian optimization. Appl Soft Comput. 2020;97:106580.

Mallio CA, Napolitano A, Castiello G, Giordano FM, D’Alessio P, Iozzino M, Sun Y, Angeletti S, Russano M, Santini D, et al. Deep learning algorithm trained with COVID-19 pneumonia also identifies immune checkpoint inhibitor therapy-related pneumonitis. Cancers. 2021;13(4):652.

Fourcade A, Khonsari R. Deep learning in medical image analysis: a third eye for doctors. J Stomatol Oral Maxillofac Surg. 2019;120(4):279–88.

Guo Z, Li X, Huang H, Guo N, Li Q. Deep learning-based image segmentation on multimodal medical imaging. IEEE Trans Radiat Plasma Med Sci. 2019;3(2):162–9.

Thakur N, Yoon H, Chong Y. Current trends of artificial intelligence for colorectal cancer pathology image analysis: a systematic review. Cancers. 2020;12(7):1884.

Lundervold AS, Lundervold A. An overview of deep learning in medical imaging focusing on MRI. Zeitschrift für Medizinische Physik. 2019;29(2):102–27.

Yadav SS, Jadhav SM. Deep convolutional neural network based medical image classification for disease diagnosis. J Big Data. 2019;6(1):113.

Nehme E, Freedman D, Gordon R, Ferdman B, Weiss LE, Alalouf O, Naor T, Orange R, Michaeli T, Shechtman Y. DeepSTORM3D: dense 3D localization microscopy and PSF design by deep learning. Nat Methods. 2020;17(7):734–40.

Zulkifley MA, Abdani SR, Zulkifley NH. Pterygium-Net: a deep learning approach to pterygium detection and localization. Multimed Tools Appl. 2019;78(24):34563–84.

Sirazitdinov I, Kholiavchenko M, Mustafaev T, Yixuan Y, Kuleev R, Ibragimov B. Deep neural network ensemble for pneumonia localization from a large-scale chest X-ray database. Comput Electr Eng. 2019;78:388–99.

Zhao W, Shen L, Han B, Yang Y, Cheng K, Toesca DA, Koong AC, Chang DT, Xing L. Markerless pancreatic tumor target localization enabled by deep learning. Int J Radiat Oncol Biol Phys. 2019;105(2):432–9.

Roth HR, Lee CT, Shin HC, Seff A, Kim L, Yao J, Lu L, Summers RM. Anatomy-specific classification of medical images using deep convolutional nets. In: 2015 IEEE 12th international symposium on biomedical imaging (ISBI). IEEE; 2015. p. 101–4.

Shin HC, Orton MR, Collins DJ, Doran SJ, Leach MO. Stacked autoencoders for unsupervised feature learning and multiple organ detection in a pilot study using 4D patient data. IEEE Trans Pattern Anal Mach Intell. 2012;35(8):1930–43.

Li Z, Dong M, Wen S, Hu X, Zhou P, Zeng Z. CLU-CNNs: object detection for medical images. Neurocomputing. 2019;350:53–9.

Gao J, Jiang Q, Zhou B, Chen D. Convolutional neural networks for computer-aided detection or diagnosis in medical image analysis: an overview. Math Biosci Eng. 2019;16(6):6536.

Article   MathSciNet   Google Scholar  

Lumini A, Nanni L. Review fair comparison of skin detection approaches on publicly available datasets. Expert Syst Appl. 2020. https://doi.org/10.1016/j.eswa.2020.113677 .

Chouhan V, Singh SK, Khamparia A, Gupta D, Tiwari P, Moreira C, Damaševičius R, De Albuquerque VHC. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Appl Sci. 2020;10(2):559.

Apostolopoulos ID, Mpesiana TA. COVID-19: automatic detection from X-ray images utilizing transfer learning with convolutional neural networks. Phys Eng Sci Med. 2020;43(2):635–40.

Mahmud T, Rahman MA, Fattah SA. CovXNet: a multi-dilation convolutional neural network for automatic COVID-19 and other pneumonia detection from chest X-ray images with transferable multi-receptive feature optimization. Comput Biol Med. 2020;122:103869.

Tayarani-N MH. Applications of artificial intelligence in battling against COVID-19: a literature review. Chaos Solitons Fractals. 2020;142:110338.

Toraman S, Alakus TB, Turkoglu I. Convolutional capsnet: a novel artificial neural network approach to detect COVID-19 disease from X-ray images using capsule networks. Chaos Solitons Fractals. 2020;140:110122.

Dascalu A, David E. Skin cancer detection by deep learning and sound analysis algorithms: a prospective clinical study of an elementary dermoscope. EBioMedicine. 2019;43:107–13.

Adegun A, Viriri S. Deep learning techniques for skin lesion analysis and melanoma cancer detection: a survey of state-of-the-art. Artif Intell Rev. 2020;54:1–31.

Zhang N, Cai YX, Wang YY, Tian YT, Wang XL, Badami B. Skin cancer diagnosis based on optimized convolutional neural network. Artif Intell Med. 2020;102:101756.

Thurnhofer-Hemsi K, Domínguez E. A convolutional neural network framework for accurate skin cancer detection. Neural Process Lett. 2020. https://doi.org/10.1007/s11063-020-10364-y .

Jain MS, Massoud TF. Predicting tumour mutational burden from histopathological images using multiscale deep learning. Nat Mach Intell. 2020;2(6):356–62.

Lei H, Liu S, Elazab A, Lei B. Attention-guided multi-branch convolutional neural network for mitosis detection from histopathological images. IEEE J Biomed Health Inform. 2020;25(2):358–70.

Celik Y, Talo M, Yildirim O, Karabatak M, Acharya UR. Automated invasive ductal carcinoma detection based using deep transfer learning with whole-slide images. Pattern Recogn Lett. 2020;133:232–9.

Sebai M, Wang X, Wang T. Maskmitosis: a deep learning framework for fully supervised, weakly supervised, and unsupervised mitosis detection in histopathology images. Med Biol Eng Comput. 2020;58:1603–23.

Sebai M, Wang T, Al-Fadhli SA. Partmitosis: a partially supervised deep learning framework for mitosis detection in breast cancer histopathology images. IEEE Access. 2020;8:45133–47.

Mahmood T, Arsalan M, Owais M, Lee MB, Park KR. Artificial intelligence-based mitosis detection in breast cancer histopathology images using faster R-CNN and deep CNNs. J Clin Med. 2020;9(3):749.

Srinidhi CL, Ciga O, Martel AL. Deep neural network models for computational histopathology: a survey. Med Image Anal. 2020;67:101813.

Cireşan DC, Giusti A, Gambardella LM, Schmidhuber J. Mitosis detection in breast cancer histology images with deep neural networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2013. p. 411–8.

Sirinukunwattana K, Raza SEA, Tsang YW, Snead DR, Cree IA, Rajpoot NM. Locality sensitive deep learning for detection and classification of nuclei in routine colon cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1196–206.

Xu J, Xiang L, Liu Q, Gilmore H, Wu J, Tang J, Madabhushi A. Stacked sparse autoencoder (SSAE) for nuclei detection on breast cancer histopathology images. IEEE Trans Med Imaging. 2015;35(1):119–30.

Albarqouni S, Baur C, Achilles F, Belagiannis V, Demirci S, Navab N. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Trans Med Imaging. 2016;35(5):1313–21.

Abd-Ellah MK, Awad AI, Khalaf AA, Hamed HF. Two-phase multi-model automatic brain tumour diagnosis system from magnetic resonance images using convolutional neural networks. EURASIP J Image Video Process. 2018;2018(1):97.

Thaha MM, Kumar KPM, Murugan B, Dhanasekeran S, Vijayakarthick P, Selvi AS. Brain tumor segmentation using convolutional neural networks in MRI images. J Med Syst. 2019;43(9):294.

Talo M, Yildirim O, Baloglu UB, Aydin G, Acharya UR. Convolutional neural networks for multi-class brain disease detection using MRI images. Comput Med Imaging Gr. 2019;78:101673.

Gabr RE, Coronado I, Robinson M, Sujit SJ, Datta S, Sun X, Allen WJ, Lublin FD, Wolinsky JS, Narayana PA. Brain and lesion segmentation in multiple sclerosis using fully convolutional neural networks: a large-scale study. Mult Scler J. 2020;26(10):1217–26.

Chen S, Ding C, Liu M. Dual-force convolutional neural networks for accurate brain tumor segmentation. Pattern Recogn. 2019;88:90–100.

Hu K, Gan Q, Zhang Y, Deng S, Xiao F, Huang W, Cao C, Gao X. Brain tumor segmentation using multi-cascaded convolutional neural networks and conditional random field. IEEE Access. 2019;7:92615–29.

Wadhwa A, Bhardwaj A, Verma VS. A review on brain tumor segmentation of MRI images. Magn Reson Imaging. 2019;61:247–59.

Akkus Z, Galimzianova A, Hoogi A, Rubin DL, Erickson BJ. Deep learning for brain MRI segmentation: state of the art and future directions. J Digit Imaging. 2017;30(4):449–59.

Moeskops P, Viergever MA, Mendrik AM, De Vries LS, Benders MJ, Išgum I. Automatic segmentation of MR brain images with a convolutional neural network. IEEE Trans Med Imaging. 2016;35(5):1252–61.

Milletari F, Navab N, Ahmadi SA. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 fourth international conference on 3D vision (3DV). IEEE; 2016. p. 565–71.

Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation. In: International conference on medical image computing and computer-assisted intervention. Springer; 2015. p. 234–41.

Pereira S, Pinto A, Alves V, Silva CA. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Trans Med Imaging. 2016;35(5):1240–51.

Havaei M, Davy A, Warde-Farley D, Biard A, Courville A, Bengio Y, Pal C, Jodoin PM, Larochelle H. Brain tumor segmentation with deep neural networks. Med Image Anal. 2017;35:18–31.

Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell. 2017;40(4):834–48.

Yan Q, Wang B, Gong D, Luo C, Zhao W, Shen J, Shi Q, Jin S, Zhang L, You Z. COVID-19 chest CT image segmentation—a deep convolutional neural network solution; 2020. arXiv preprint arXiv:2004.10987 .

Wang G, Liu X, Li C, Xu Z, Ruan J, Zhu H, Meng T, Li K, Huang N, Zhang S. A noise-robust framework for automatic segmentation of COVID-19 pneumonia lesions from CT images. IEEE Trans Med Imaging. 2020;39(8):2653–63.

Khan SH, Sohail A, Khan A, Lee YS. Classification and region analysis of COVID-19 infection using lung CT images and deep convolutional neural networks; 2020. arXiv preprint arXiv:2009.08864 .

Shi F, Wang J, Shi J, Wu Z, Wang Q, Tang Z, He K, Shi Y, Shen D. Review of artificial intelligence techniques in imaging data acquisition, segmentation and diagnosis for COVID-19. IEEE Rev Biomed Eng. 2020;14:4–5.

Santamaría J, Rivero-Cejudo M, Martos-Fernández M, Roca F. An overview on the latest nature-inspired and metaheuristics-based image registration algorithms. Appl Sci. 2020;10(6):1928.

Santamaría J, Cordón O, Damas S. A comparative study of state-of-the-art evolutionary image registration methods for 3D modeling. Comput Vision Image Underst. 2011;115(9):1340–54.

Yumer ME, Mitra NJ. Learning semantic deformation flows with 3D convolutional networks. In: European conference on computer vision. Springer; 2016. p. 294–311.

Ding L, Feng C. Deepmapping: unsupervised map estimation from multiple point clouds. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2019. p. 8650–9.

Mahadevan S. Imagination machines: a new challenge for artificial intelligence. AAAI. 2018;2018:7988–93.

Wang L, Fang Y. Unsupervised 3D reconstruction from a single image via adversarial learning; 2017. arXiv preprint arXiv:1711.09312 .

Hermoza R, Sipiran I. 3D reconstruction of incomplete archaeological objects using a generative adversarial network. In: Proceedings of computer graphics international 2018. Association for Computing Machinery; 2018. p. 5–11.

Fu Y, Lei Y, Wang T, Curran WJ, Liu T, Yang X. Deep learning in medical image registration: a review. Phys Med Biol. 2020;65(20):20TR01.

Haskins G, Kruger U, Yan P. Deep learning in medical image registration: a survey. Mach Vision Appl. 2020;31(1):8.

de Vos BD, Berendsen FF, Viergever MA, Sokooti H, Staring M, Išgum I. A deep learning framework for unsupervised affine and deformable image registration. Med Image Anal. 2019;52:128–43.

Yang X, Kwitt R, Styner M, Niethammer M. Quicksilver: fast predictive image registration—a deep learning approach. NeuroImage. 2017;158:378–96.

Miao S, Wang ZJ, Liao R. A CNN regression approach for real-time 2D/3D registration. IEEE Trans Med Imaging. 2016;35(5):1352–63.

Li P, Pei Y, Guo Y, Ma G, Xu T, Zha H. Non-rigid 2D–3D registration using convolutional autoencoders. In: 2020 IEEE 17th international symposium on biomedical imaging (ISBI). IEEE; 2020. p. 700–4.

Zhang J, Yeung SH, Shu Y, He B, Wang W. Efficient memory management for GPU-based deep learning systems; 2019. arXiv preprint arXiv:1903.06631 .

Zhao H, Han Z, Yang Z, Zhang Q, Yang F, Zhou L, Yang M, Lau FC, Wang Y, Xiong Y, et al. Hived: sharing a {GPU} cluster for deep learning with guarantees. In: 14th {USENIX} symposium on operating systems design and implementation ({OSDI} 20); 2020. p. 515–32.

Lin Y, Jiang Z, Gu J, Li W, Dhar S, Ren H, Khailany B, Pan DZ. DREAMPlace: deep learning toolkit-enabled GPU acceleration for modern VLSI placement. IEEE Trans Comput Aided Des Integr Circuits Syst. 2020;40:748–61.

Hossain S, Lee DJ. Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors. 2019;19(15):3371.

Castro FM, Guil N, Marín-Jiménez MJ, Pérez-Serrano J, Ujaldón M. Energy-based tuning of convolutional neural networks on multi-GPUs. Concurr Comput Pract Exp. 2019;31(21):4786.

Gschwend D. Zynqnet: an fpga-accelerated embedded convolutional neural network; 2020. arXiv preprint arXiv:2005.06892 .

Zhang N, Wei X, Chen H, Liu W. FPGA implementation for CNN-based optical remote sensing object detection. Electronics. 2021;10(3):282.

Zhao M, Hu C, Wei F, Wang K, Wang C, Jiang Y. Real-time underwater image recognition with FPGA embedded system for convolutional neural network. Sensors. 2019;19(2):350.

Liu X, Yang J, Zou C, Chen Q, Yan X, Chen Y, Cai C. Collaborative edge computing with FPGA-based CNN accelerators for energy-efficient and time-aware face tracking system. IEEE Trans Comput Soc Syst. 2021. https://doi.org/10.1109/TCSS.2021.3059318 .

Hossin M, Sulaiman M. A review on evaluation metrics for data classification evaluations. Int J Data Min Knowl Manag Process. 2015;5(2):1.

Provost F, Domingos P. Tree induction for probability-based ranking. Mach Learn. 2003;52(3):199–215.

Rakotomamonyj A. Optimizing area under roc with SVMS. In: Proceedings of the European conference on artificial intelligence workshop on ROC curve and artificial intelligence (ROCAI 2004), 2004. p. 71–80.

Mingote V, Miguel A, Ortega A, Lleida E. Optimization of the area under the roc curve using neural network supervectors for text-dependent speaker verification. Comput Speech Lang. 2020;63:101078.

Fawcett T. An introduction to roc analysis. Pattern Recogn Lett. 2006;27(8):861–74.

Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng. 2005;17(3):299–310.

Hand DJ, Till RJ. A simple generalisation of the area under the ROC curve for multiple class classification problems. Mach Learn. 2001;45(2):171–86.

Masoudnia S, Mersa O, Araabi BN, Vahabie AH, Sadeghi MA, Ahmadabadi MN. Multi-representational learning for offline signature verification using multi-loss snapshot ensemble of CNNs. Expert Syst Appl. 2019;133:317–30.

Coupé P, Mansencal B, Clément M, Giraud R, de Senneville BD, Ta VT, Lepetit V, Manjon JV. Assemblynet: a large ensemble of CNNs for 3D whole brain MRI segmentation. NeuroImage. 2020;219:117026.

Download references

Acknowledgements

We would like to thank the professors from the Queensland University of Technology and the University of Information Technology and Communications who gave their feedback on the paper.

This research received no external funding.

Author information

Authors and affiliations.

School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia

Laith Alzubaidi & Jinglan Zhang

Control and Systems Engineering Department, University of Technology, Baghdad, 10001, Iraq

Amjad J. Humaidi

Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001, Iraq

Ayad Al-Dujaili

Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO, 65211, USA

Ye Duan & Muthana Al-Amidie

AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001, Iraq

Laith Alzubaidi & Omran Al-Shamma

Department of Computer Science, University of Jaén, 23071, Jaén, Spain

J. Santamaría

College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005, Iraq

Mohammed A. Fadhel

School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD, UK

Laith Farhan

Contributions

Conceptualization: LA and JZ; methodology: LA, JZ, and JS; software: LA and MAF; validation: LA, JZ, MA, and LF; formal analysis: LA, JZ, YD, and JS; investigation: LA and JZ; resources: LA, JZ, and MAF; data curation: LA and OA; writing, original draft preparation: LA and OA; writing, review and editing: LA, JZ, AJH, AA, YD, OA, JS, MAF, MA, and LF; visualization: LA and MAF; supervision: JZ and YD; project administration: JZ, YD, and JS; funding acquisition: LA, AJH, AA, and YD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laith Alzubaidi.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article.

Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8, 53 (2021). https://doi.org/10.1186/s40537-021-00444-8

Received: 21 January 2021

Accepted: 22 March 2021

Published: 31 March 2021

DOI: https://doi.org/10.1186/s40537-021-00444-8

  • Deep learning
  • Machine learning
  • Convolution neural network (CNN)
  • Deep neural network architectures
  • Deep learning applications
  • Image classification
  • Medical image analysis
  • Supervised learning

Deep Learning

20 Jul 2018 · Nicholas G. Polson, Vadim O. Sokolov

Deep learning (DL) is a high-dimensional data reduction technique for constructing high-dimensional predictors in input-output models. DL is a form of machine learning that uses hierarchical layers of latent features. In this article, we review the state of the art of deep learning from a modeling and algorithmic perspective. We provide a list of successful areas of application in Artificial Intelligence (AI), Image Processing, Robotics and Automation. Deep learning is predictive in nature rather than inferential and can be viewed as a black-box methodology for high-dimensional function estimation.
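
The phrase "hierarchical layers of latent features" can be illustrated with a minimal sketch in which a predictor is a composition of simple layers, each applying an affine map followed by an activation. The function names, the toy weights, and the choice of ReLU are assumptions made for illustration only; they are not taken from the article.

```python
# Illustrative sketch: a deep predictor as a composition of layers.
# All names and weights here are hypothetical, chosen for a hand-checkable example.

def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def identity(values):
    """Pass-through activation, used here for the output layer."""
    return list(values)


def layer(z, weights, bias, activation):
    """One layer: affine map (weights @ z + bias) followed by an activation."""
    pre_activation = [
        sum(w * x for w, x in zip(row, z)) + b
        for row, b in zip(weights, bias)
    ]
    return activation(pre_activation)


def deep_predictor(x, layers):
    """Apply the layers in order; each output is the next layer's latent input."""
    z = x
    for weights, bias, activation in layers:
        z = layer(z, weights, bias, activation)
    return z


# Two toy layers: a 2x2 identity map with ReLU, then a sum plus an offset.
toy_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], relu),
    ([[1.0, 1.0]], [0.5], identity),
]

prediction = deep_predictor([1.0, -2.0], toy_layers)
print(prediction)
```

In this toy setting the first layer maps the input to a latent feature vector and the second layer turns those features into the prediction; a practical model would learn the weights from data rather than fix them by hand.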

Deep learning: recently published documents

Synergic Deep Learning for Smart Health Diagnosis of COVID-19 for Connected Living and Smart Cities

The COVID-19 pandemic has caused significant loss of life and economic damage worldwide. To prevent and control COVID-19, a range of smart, complex, spatially heterogeneous control solutions and strategies have been deployed. Early classification of 2019 novel coronavirus disease (COVID-19) is needed to treat and control the disease. Because no precise automated toolkits exist, secondary diagnosis models are required. Recent findings obtained using radiological imaging techniques show that the images hold noticeable details regarding the COVID-19 virus. Applying recent artificial intelligence (AI) and deep learning (DL) approaches to radiological images proves useful for accurately detecting the disease. This article introduces a new synergic deep learning (SDL)-based smart health diagnosis of COVID-19 using chest X-ray images. The SDL makes use of dual deep convolutional neural networks (DCNNs) that learn mutually from one another. In particular, the image representations learned by both DCNNs are provided as input to a synergic network, which has a fully connected structure and predicts whether the pair of input images belongs to the same class. In addition, the proposed SDL model includes a fuzzy bilateral filtering (FBF) model to pre-process the input image. The integration of FBF and SDL resulted in the effective classification of COVID-19. To investigate the classification outcome of the SDL model, a detailed set of simulations was carried out, confirming the effective performance of the FBF-SDL model over the compared methods.
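
The pairing idea in this abstract, two networks whose learned representations feed a fully connected synergic network that predicts whether two images share a class, can be sketched roughly as follows. This is a toy illustration under stated assumptions: each DCNN is replaced by a single hypothetical linear layer with ReLU, the weights are random, the loss is an assumed binary cross-entropy, and none of the names come from the paper.

```python
# Rough sketch of a synergic pair classifier; every name and size is hypothetical.
import math
import random


def embed(image, weights):
    """Stand-in for one DCNN: a single linear layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, image))) for row in weights]


def synergic_score(features_a, features_b, fc_weights, fc_bias):
    """Fully connected layer over the concatenated features, then a sigmoid.

    The result is read as the probability that both images share a class.
    """
    joint = features_a + features_b  # list concatenation
    logit = sum(w * f for w, f in zip(fc_weights, joint)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))


def synergic_label(class_a, class_b):
    """Target for the synergic network: 1 if the pair shares a class, else 0."""
    return 1 if class_a == class_b else 0


def binary_cross_entropy(probability, target, eps=1e-12):
    """Assumed training loss for the same-class prediction."""
    return -(target * math.log(probability + eps)
             + (1 - target) * math.log(1.0 - probability + eps))


rng = random.Random(0)
n_pixels, n_features = 16, 4  # toy sizes, not from the paper

# Separate weights for the two stand-in networks and the synergic layer.
weights_a = [[rng.uniform(-0.1, 0.1) for _ in range(n_pixels)] for _ in range(n_features)]
weights_b = [[rng.uniform(-0.1, 0.1) for _ in range(n_pixels)] for _ in range(n_features)]
fc_weights = [rng.uniform(-0.1, 0.1) for _ in range(2 * n_features)]

image_a = [rng.random() for _ in range(n_pixels)]
image_b = [rng.random() for _ in range(n_pixels)]

p_same = synergic_score(embed(image_a, weights_a), embed(image_b, weights_b), fc_weights, 0.0)
loss = binary_cross_entropy(p_same, synergic_label(1, 1))
print(p_same, loss)
```

In a full system the synergic loss would presumably be combined with each network's own classification loss so that the two networks learn from one another; the abstract does not specify how, so that step is left out here.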

A deep learning approach for remote heart rate estimation

Weakly supervised spatial deep learning for earth image segmentation based on imperfect polyline labels

In recent years, deep learning has achieved tremendous success in image segmentation for computer vision applications. The performance of these models heavily relies on the availability of large-scale high-quality training labels (e.g., PASCAL VOC 2012). Unfortunately, such large-scale high-quality training data are often unavailable in many real-world spatial or spatiotemporal problems in earth science and remote sensing (e.g., mapping the nationwide river streams for water resource management). Although extensive efforts have been made to reduce the reliance on labeled data (e.g., semi-supervised or unsupervised learning, few-shot learning), the complex nature of geographic data such as spatial heterogeneity still requires sufficient training labels when transferring a pre-trained model from one region to another. On the other hand, it is often much easier to collect lower-quality training labels with imperfect alignment with earth imagery pixels (e.g., through interpreting coarse imagery by non-expert volunteers). However, directly training a deep neural network on imperfect labels with geometric annotation errors could significantly impact model performance. Existing research that overcomes imperfect training labels either focuses on errors in label class semantics or characterizes label location errors at the pixel level. These methods do not fully incorporate the geometric properties of label location errors in the vector representation. To fill the gap, this article proposes a weakly supervised learning framework to simultaneously update deep learning model parameters and infer hidden true vector label locations. Specifically, we model label location errors in the vector representation to partially preserve geometric properties (e.g., spatial contiguity within line segments). Evaluations on real-world datasets in the National Hydrography Dataset (NHD) refinement application illustrate that the proposed framework outperforms baseline methods in classification accuracy.

Prediction of Failure Categories in Plastic Extrusion Process with Deep Learning

Hyperparameters tuning of Faster R-CNN deep learning transfer for persistent object detection in radar images

A comparative study of automated legal text classification using random forests and deep learning

A semi-supervised deep learning approach for vessel trajectory classification based on AIS data

An improved approach towards more robust deep learning models for chemical kinetics

Power system transient security assessment based on deep learning considering partial observability

A multi-attention collaborative deep learning approach for blood pressure prediction

We develop a deep learning model based on Long Short-term Memory (LSTM) to predict blood pressure based on a unique data set collected from physical examination centers capturing comprehensive multi-year physical examination and lab results. In the Multi-attention Collaborative Deep Learning model (MAC-LSTM) we developed for this type of data, we incorporate three types of attention to generate more explainable and accurate results. In addition, we leverage information from similar users to enhance the predictive power of the model due to the challenges with short examination history. Our model significantly reduces predictive errors compared to several state-of-the-art baseline models. Experimental results not only demonstrate our model’s superiority but also provide us with new insights about factors influencing blood pressure. Our data is collected in a natural setting instead of a setting designed specifically to study blood pressure, and the physical examination items used to predict blood pressure are common items included in regular physical examinations for all the users. Therefore, our blood pressure prediction results can be easily used in an alert system for patients and doctors to plan prevention or intervention. The same approach can be used to predict other health-related indexes such as BMI.

  • Review Article
  • Open access
  • Published: 05 April 2022

Recent advances and applications of deep learning methods in materials science

  • Kamal Choudhary   ORCID: orcid.org/0000-0001-9737-8074 1 , 2 , 3 ,
  • Brian DeCost   ORCID: orcid.org/0000-0002-3459-5888 4 ,
  • Chi Chen   ORCID: orcid.org/0000-0001-8008-7043 5 ,
  • Anubhav Jain   ORCID: orcid.org/0000-0001-5893-9967 6 ,
  • Francesca Tavazza   ORCID: orcid.org/0000-0002-5602-180X 1 ,
  • Ryan Cohn   ORCID: orcid.org/0000-0002-7898-0059 7 ,
  • Cheol Woo Park 8 ,
  • Alok Choudhary 9 ,
  • Ankit Agrawal 9 ,
  • Simon J. L. Billinge   ORCID: orcid.org/0000-0002-9734-4998 10 ,
  • Elizabeth Holm 7 ,
  • Shyue Ping Ong   ORCID: orcid.org/0000-0001-5726-2587 5 &
  • Chris Wolverton   ORCID: orcid.org/0000-0003-2248-474X 8  

npj Computational Materials volume 8, Article number: 59 (2022)

  • Atomistic models
  • Computational methods

Deep learning (DL) is one of the fastest-growing topics in materials data science, with rapidly emerging applications spanning atomistic, image-based, spectral, and textual data modalities. DL allows analysis of unstructured data and automated identification of features. The recent development of large materials databases has fueled the application of DL methods in atomistic prediction in particular. In contrast, advances in image and spectral data have largely leveraged synthetic data enabled by high-quality forward models as well as by generative unsupervised DL methods. In this article, we present a high-level overview of deep learning methods followed by a detailed discussion of recent developments of deep learning in atomistic simulation, materials imaging, spectral analysis, and natural language processing. For each modality we discuss applications involving both theoretical and experimental data, typical modeling approaches with their strengths and limitations, and relevant publicly available software and datasets. We conclude the review with a discussion of recent cross-cutting work related to uncertainty quantification in this field and a brief perspective on limitations, challenges, and potential growth areas for DL methods in materials science.

Introduction

“Processing-structure-property-performance” is the key mantra in Materials Science and Engineering (MSE) 1 . The length and time scales of material structures and phenomena vary significantly among these four elements, adding further complexity 2 . For instance, structural information can range from detailed knowledge of atomic coordinates of elements to the microscale spatial distribution of phases (microstructure), to fragment connectivity (mesoscale), to images and spectra. Establishing linkages between the above components is a challenging task.

Both experimental and computational techniques are useful to identify such relationships. Due to rapid growth in automation in experimental equipment and immense expansion of computational resources, the size of public materials datasets has seen exponential growth. Several large experimental and computational datasets 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 have been developed through the Materials Genome Initiative (MGI) 11 and the increasing adoption of Findable, Accessible, Interoperable, Reusable (FAIR) 12 principles. Such an outburst of data requires automated analysis which can be facilitated by machine learning (ML) techniques 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 .

Deep learning (DL) 21 , 22 is a specialized branch of machine learning (ML). Originally inspired by biological models of computation and cognition in the human brain 23 , 24 , one of DL’s major strengths is its potential to extract higher-level features from the raw input data.

DL applications are rapidly replacing conventional systems in many aspects of our daily lives, for example, in image and speech recognition, web search, fraud detection, email/spam filtering, financial risk modeling, and so on. DL techniques have been proven to provide exciting new capabilities in numerous fields (such as playing Go 25 , self-driving cars 26 , navigation, chip design, particle physics, protein science, drug discovery, astrophysics, object recognition 27 , etc).

Recently DL methods have been outperforming other machine learning techniques in numerous scientific fields, such as chemistry, physics, biology, and materials science 20 , 28 , 29 , 30 , 31 , 32 . DL applications in MSE are still relatively new, and the field has not fully explored its potential, implications, and limitations. DL provides new approaches for investigating material phenomena and has pushed materials scientists to expand their traditional toolset.

DL methods have been shown to act as a complementary approach to physics-based methods for materials design. While large datasets are often viewed as a prerequisite for successful DL applications, techniques such as transfer learning, multi-fidelity modelling, and active learning can often make DL feasible for small datasets as well 33 , 34 , 35 , 36 .

Traditionally, materials have been designed experimentally using trial and error methods with a strong dose of chemical intuition. In addition to being a very costly and time-consuming approach, the number of material combinations is so huge that it is intractable to study experimentally, leading to the need for empirical formulation and computational methods. While computational approaches (such as density functional theory, molecular dynamics, Monte Carlo, phase-field, finite elements) are much faster and cheaper than experiments, they are still limited by length and time scale constraints, which in turn limits their respective domains of applicability. DL methods can offer substantial speedups compared to conventional scientific computing, and, for some applications, are reaching an accuracy level comparable to physics-based or computational models.

Moreover, entering a new domain of materials science and performing cutting-edge research requires years of education, training, and the development of specialized skills and intuition. Fortunately, we now live in an era of increasingly open data and computational resources. Mature, well-documented DL libraries make DL research much more easily accessible to newcomers than almost any other research field. Testing and benchmarking methodologies such as underfitting/overfitting/cross-validation 15 , 16 , 37 are common knowledge, and standards for measuring model performance are well established in the community.

Despite their many advantages, DL methods have disadvantages too, the most significant one being their black-box nature 38 which may hinder physical insights into the phenomena under examination. Evaluating and increasing the interpretability and explainability of DL models remains an active field of research. Generally a DL model has a few thousand to millions of parameters, making model interpretation and direct generation of scientific insight difficult.

Although there are several good recent reviews of ML applications in MSE 15 , 16 , 17 , 19 , 39 , 40 , 41 , 42 , 43 , 44 , 45 , 46 , 47 , 48 , 49 , DL for materials has been advancing rapidly, warranting a dedicated review to cover the explosion of research in this field. This article discusses some of the basic principles in DL methods and highlights major trends among the recent advances in DL applications for materials science. As the tools and datasets for DL applications in materials keep evolving, we provide a github repository ( https://github.com/deepmaterials/dlmatreview ) that can be updated as new resources are made publicly available.

General machine learning concepts

It is beyond the scope of this article to give a detailed hands-on introduction to deep learning. There are many materials for this purpose, for example, the free online book “Neural Networks and Deep Learning” by Michael Nielsen ( http://neuralnetworksanddeeplearning.com ), Deep Learning by Goodfellow et al. 21 , and multiple online courses at Coursera, Udemy, and so on. Rather, this article aims to show materials science researchers the types of problems that are amenable to DL, and to introduce some of the basic concepts, jargon, and materials-specific databases and software (at the time of writing) as an on-ramp for getting started. With this in mind, we begin with a very basic introduction to deep learning.

Artificial intelligence (AI) 13 is the development of machines and algorithms that mimic human intelligence, for example, by optimizing actions to achieve certain goals. Machine learning (ML) is a subset of AI that provides the ability to learn from data without being explicitly programmed for a given task, such as playing chess or making social network recommendations. DL, in turn, is the subset of ML that takes inspiration from biological brains and uses multilayer neural networks to solve ML tasks. A schematic of the AI-ML-DL context and some of the key application areas of DL in the materials science and engineering field are shown in Fig. 1 .

Fig. 1: Deep learning is considered a part of machine learning, which is contained in the umbrella term artificial intelligence.

Some of the commonly used ML technologies are linear regression, decision trees, and random forest in which generalized models are trained to learn coefficients/weights/parameters for a given dataset (usually structured i.e., on a grid or a spreadsheet).

Applying traditional ML techniques to unstructured data (such as pixels or features from an image, sounds, text, and graphs) is challenging because users have to first extract generalized meaningful representations or features themselves (such as calculating pair-distribution for an atomic structure) and then train the ML models. Hence, the process becomes time-consuming, brittle, and not easily scalable. Here, deep learning (DL) techniques become more important.

DL methods are based on artificial neural networks and allied techniques. According to the “universal approximation theorem” 50 , 51 , neural networks can approximate any continuous function to arbitrary accuracy. However, it is important to note that the theorem doesn’t guarantee that such functions can be learnt easily 52 .

Neural networks

A perceptron or a single artificial neuron 53 is the building block of artificial neural networks (ANNs) and performs forward propagation of information. For a set of inputs $[x_1, x_2, \ldots, x_m]$ to the perceptron, we assign floating-point weights $[w_1, w_2, \ldots, w_m]$ (and a bias that shifts the weighted sum), multiply each input by its corresponding weight, and add up the products. Some of the common software packages for training NNs are PyTorch 54 , TensorFlow 55 , and MXNet 56 . Please note that certain commercial equipment, instruments, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by NIST, nor is it intended to imply that the materials or equipment identified are necessarily the best available for the purpose.

Activation function

Activation functions (such as sigmoid, hyperbolic tangent (tanh), rectified linear unit (ReLU), leaky ReLU, Swish) are the critical nonlinear components that enable neural networks to compose many small building blocks to learn complex nonlinear functions. For example, the sigmoid activation maps real numbers to the range (0, 1); this activation function is often used in the last layer of binary classifiers to model probabilities. The choice of activation function can affect training efficiency as well as final accuracy 57 .
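The perceptron and activation functions described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names `sigmoid`, `relu`, and `perceptron`, as well as the input, weight, and bias values, are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: pass positive values, zero out the rest."""
    return max(0.0, z)

def perceptron(x, w, b, activation=sigmoid):
    """Weighted sum of the inputs plus a bias, followed by a nonlinearity."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return activation(z)

# Illustrative inputs, weights, and bias (arbitrary values, assumed here).
x = [1.0, 2.0, 3.0]
w = [0.5, -0.25, 0.1]
b = -0.3
y_sigmoid = perceptron(x, w, b)         # weighted sum is ~0, so output is ~0.5
y_relu = perceptron(x, w, b, relu)      # same weighted sum through ReLU
```

Swapping the `activation` argument is all that is needed to compare nonlinearities on the same weighted sum.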

Loss function, gradient descent, and normalization

The weight matrices of a neural network are initialized randomly or obtained from a pre-trained model. These weight matrices are multiplied with the input matrix (or output from a previous layer) and subjected to a nonlinear activation function to yield updated representations, which are often referred to as activations or feature maps. The loss function (also known as an objective function or empirical risk) is calculated by comparing the output of the neural network and the known target value data. Typically, network weights are iteratively updated via stochastic gradient descent algorithms to minimize the loss function until the desired accuracy is achieved. Most modern deep learning frameworks facilitate this by using reverse-mode automatic differentiation 58 to obtain the partial derivatives of the loss function with respect to each network parameter through recursive application of the chain rule. Colloquially, this is also known as back-propagation.

Common gradient descent algorithms include Stochastic Gradient Descent (SGD), Adam, and Adagrad. The learning rate is an important parameter in gradient descent. Unlike plain SGD, methods such as Adam and Adagrad adapt the learning rate during training. Depending on the objective, such as classification or regression, different loss functions such as Binary Cross Entropy (BCE), Negative Log Likelihood (NLL), or Mean Squared Error (MSE) are used.
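To make the loss-minimization loop concrete, the following sketch fits a straight line by gradient descent on the MSE loss. It is a simplified illustration under stated assumptions: it uses full-batch rather than stochastic updates, hand-derived gradients instead of automatic differentiation, and a toy dataset; the names `mse` and `fit_line` and the learning rate are hypothetical choices.

```python
def mse(y_pred, y_true):
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def fit_line(xs, ys, lr=0.05, epochs=2000):
    """Fit y = w * x + b by full-batch gradient descent on the MSE loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        preds = [w * x + b for x in xs]
        # Analytic partial derivatives of the MSE with respect to w and b.
        grad_w = sum(2.0 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2.0 * (p - y) for p, y in zip(preds, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]    # toy targets generated from y = 2x + 1
w, b = fit_line(xs, ys)
final_loss = mse([w * x + b for x in xs], ys)
```

In a DL framework the two gradient lines would be replaced by back-propagation, and the update step by an optimizer such as SGD or Adam.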

The inputs of a neural network are generally scaled i.e., normalized to have zero mean and unit standard deviation. Scaling is also applied to the input of hidden layers (using batch or layer normalization) to improve the stability of ANNs.
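Input scaling of this kind can be sketched with the standard library alone. The helper name `standardize` and the sample values are assumptions for illustration; the population standard deviation is used here, which is one of several reasonable conventions.

```python
import statistics

def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

raw = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up feature values
scaled = standardize(raw)
```

In practice the mean and standard deviation would be computed on the training set only and then reused for the validation and test sets.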

Epoch and mini-batches

A single pass of the entire training data is called an epoch, and multiple epochs are performed until the weights converge. In DL, datasets are usually large and computing gradients for the entire dataset and network becomes challenging. Hence, the forward passes are done with small subsets of the training data called mini-batches.
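The relationship between epochs and mini-batches can be illustrated as follows. This is a sketch only: the generator name `minibatches`, the batch size, and the integer stand-ins for training examples are assumptions, and reshuffling at every epoch is a common practice rather than something prescribed above.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled, non-overlapping mini-batches covering the data once."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

rng = random.Random(0)            # fixed seed so the run is repeatable
samples = list(range(10))         # stand-in for ten training examples
n_epochs = 3
seen_per_epoch = []
for epoch in range(n_epochs):
    # One epoch = one full pass over the data, processed batch by batch.
    batches = list(minibatches(samples, batch_size=4, rng=rng))
    seen_per_epoch.append(sorted(x for batch in batches for x in batch))
```

Each epoch visits every example exactly once; the last batch is simply smaller when the dataset size is not a multiple of the batch size.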

Underfitting, overfitting, regularization, and early stopping

During ML training, the dataset is split into training, validation, and test sets. The test set is never used during the training process. A model is said to underfit if it performs poorly on the training set and lacks the capacity to fully learn the training data. A model is said to overfit if it performs very well on the training data but poorly on the validation data. Overfitting is controlled with regularization techniques such as L2 regularization, dropout, and early stopping 37 .

Regularization discourages the model from simply memorizing the training data, resulting in a model that is more generalizable. Overfitting models are often characterized by neurons that have weights with large magnitudes. L2 regularization reduces the possibility of overfitting by adding a term to the loss function that penalizes large weight values, keeping the values of the weights and biases small during training. Another popular regularization is dropout 59 , in which we randomly set the activations of an NN layer to zero during training. Similar to bagging 60 , dropout has the effect of training a collection of randomly chosen models, which prevents co-adaptation among the neurons and consequently reduces the likelihood of overfitting. In early stopping, training is halted before the model overfits, i.e., once accuracy on the validation set flattens or decreases.
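The three regularization ideas above can be sketched as small helper functions. All names (`l2_penalty`, `dropout`, `early_stopping_epoch`), the `patience` criterion, and the validation curve are illustrative assumptions. The dropout shown is the "inverted" variant that rescales surviving activations, which is a common convention but not the only one.

```python
import random

def l2_penalty(weights, lam):
    """Extra loss term lam * sum(w^2) that discourages large weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, p, rng):
    """Inverted dropout: zero each activation with probability p and rescale
    the survivors by 1 / (1 - p) so the expected value is unchanged."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch at which training would stop: the first
    epoch where the validation loss has not improved for `patience` epochs."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.9, 1.1]   # made-up validation curve
stop_at = early_stopping_epoch(val_losses, patience=2)
dropped = dropout([1.0] * 8, p=0.5, rng=random.Random(0))
```

With this curve the best validation loss occurs at epoch 2, and training stops two epochs later, before the loss climbs further.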

Convolutional neural networks

Convolutional neural networks (CNN) 61 can be viewed as a regularized version of multilayer perceptrons with a strong inductive bias for learning translation-invariant image representations. There are four main components in CNNs: (a) learnable convolution filterbanks, (b) nonlinear activations, (c) spatial coarsening (via pooling or strided convolution), (d) a prediction module, often consisting of fully connected layers that operate on a global instance representation.

In CNNs we use convolution functions with multiple kernels or filters with trainable and shared weights or parameters, instead of general matrix multiplication. These filters/kernels are matrices with a relatively small number of rows and columns that convolve over the input to automatically extract high-level local features in the form of feature maps. The filters slide (convolve) across the input with a fixed stride, taking an element-wise product and sum at each position to produce the feature map, and the information thus learnt is passed to the hidden/fully connected layers. Depending on the input data, these filters can be one, two, or three-dimensional.

Similar to the fully connected NNs, nonlinearities such as ReLU are then applied, which allows the network to deal with nonlinear and complicated data. The pooling operation preserves spatial invariance while downsampling each feature map obtained after convolution and reducing its dimension. These downsampling/pooling operations can be of different types such as maximum-pooling, minimum-pooling, average pooling, and sum pooling. After one or more convolutional and pooling layers, the outputs are usually reduced to a one-dimensional global representation. CNNs are especially popular for image data.
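The convolution, nonlinearity, and pooling steps can be traced on a toy image in plain Python. This is a didactic sketch under several assumptions: no padding, a single hand-set (not learned) filter, and hypothetical function names `conv2d`, `relu_map`, and `max_pool2d`. As in most DL frameworks, the "convolution" is implemented as a cross-correlation.

```python
def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            # Element-wise multiply the window with the kernel and sum.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def relu_map(fmap):
    """Apply ReLU to every entry of a feature map."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

# 5x5 toy image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1]] * 5
kernel = [[-1, 1], [-1, 1]]     # responds to left-to-right intensity increases
feature_map = relu_map(conv2d(image, kernel))   # 4x4 map, nonzero at the edge
pooled = max_pool2d(feature_map)                # 2x2 map after downsampling
```

The feature map is nonzero only where the filter overlaps the edge, and pooling keeps that response while halving each spatial dimension.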

Graph neural networks

Graphs and their variants

Classical CNNs as described above operate on regular-grid Euclidean data (such as the 2D grid of an image). However, real-life data structures, such as social networks, segments of images, word vectors, recommender systems, and atomic/molecular structures, are usually non-Euclidean. In such cases, graph-based non-Euclidean data structures become especially important.

Mathematically, a graph $G$ is defined as a set of nodes/vertices $V$, a set of edges/links $E$, and node features $X$: $G = (V, E, X)$ 62 , 63 , 64 , and can be used to represent non-Euclidean data. An edge is formed between a pair of nodes and contains the relation information between them. Each node and edge can have attributes/features associated with it. An adjacency matrix $A$ is a square matrix whose entries indicate whether each pair of nodes is connected (1) or not (0). A graph can be of various types such as: undirected/directed, weighted/unweighted, homogeneous/heterogeneous, static/dynamic.

An undirected graph captures symmetric relations between nodes, while a directed one captures asymmetric relations such that $A_{ij} \neq A_{ji}$. In a weighted graph, each edge is associated with a scalar weight rather than just 1s and 0s. In a homogeneous graph, all the nodes represent instances of the same type, and all the edges capture relations of the same type, while in a heterogeneous graph, the nodes and edges can be of different types. Heterogeneous graphs provide an easy interface for managing nodes and edges of different types as well as their associated features. When input features or graph topology vary with time, the graph is called dynamic; otherwise it is considered static. If a pair of nodes can be connected by more than one edge, the graph is termed a multi-graph.
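The adjacency-matrix view of directed and undirected graphs can be sketched as follows. The helper names `adjacency_matrix` and `is_symmetric` and the three-node example graph are assumptions for illustration; real graph libraries typically use sparse storage instead of dense nested lists.

```python
def adjacency_matrix(n_nodes, edges, directed=False):
    """Build a dense adjacency matrix from (source, target, weight) triples."""
    A = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i, j, w in edges:
        A[i][j] = w
        if not directed:
            A[j][i] = w      # undirected edges are stored symmetrically
    return A

def is_symmetric(A):
    """True when A[i][j] == A[j][i] for every pair of nodes."""
    return all(A[i][j] == A[j][i] for i in range(len(A)) for j in range(len(A)))

# Hypothetical 3-node graph; weights of 1.0 correspond to an unweighted graph.
edges = [(0, 1, 1.0), (1, 2, 1.0)]
A_undirected = adjacency_matrix(3, edges)
A_directed = adjacency_matrix(3, edges, directed=True)
```

Replacing the 1.0 entries with other scalars turns the same structure into a weighted graph.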

Types of GNNs

Graph neural networks (GNNs) are DL methods that operate on the graph domain and can capture the dependence of graphs via message passing between the nodes and edges of graphs. At present, GNNs are probably the most popular AI method for predicting various materials properties based on structural information 33 , 65 , 66 , 67 , 68 , 69 . There are two key steps in GNN training: (a) aggregate information from neighbors and (b) update the nodes and/or edges. Importantly, aggregation is permutation invariant. Similar to the fully connected NNs, the input node features $X$ (with an embedding matrix) are multiplied with the adjacency matrix and the weight matrices and then passed through a nonlinear activation function to provide outputs for the next layer. This procedure is called the propagation rule.
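A single propagation step of this kind can be written out explicitly. The sketch below is a simplified, GCN-like layer under stated assumptions: self-loops are added so each node keeps its own features, no degree normalization is applied, and the adjacency matrix, features, and weights are made-up numbers. The names `matmul` and `gnn_layer` are hypothetical.

```python
def matmul(P, Q):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def gnn_layer(A, X, W):
    """One propagation step: aggregate neighbor features (including the node
    itself via self-loops), apply shared weights, then a ReLU nonlinearity."""
    n = len(A)
    A_hat = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    Z = matmul(matmul(A_hat, X), W)
    return [[max(0.0, z) for z in row] for row in Z]

# Path graph 0 - 1 - 2 with two features per node (illustrative numbers).
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
W = [[1.0, -1.0],
     [0.5, 0.5]]
H = gnn_layer(A, X, W)   # updated node features, one row per node
```

Because the aggregation is a sum over neighbors, relabeling the nodes only reorders the rows of `H`; the features computed for each node do not depend on the order in which its neighbors are listed.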

Based on the propagation rule and aggregation methodology, there could be different variants of GNNs such as Graph convolutional network (GCN) 70 , Graph attention network (GAT) 71 , Relational-GCN 72 , graph recurrent network (GRN) 73 , Graph isomorphism network (GIN) 74 , and Line graph neural network (LGNN) 75 . Graph convolutional neural networks are the most popular GNNs.

Sequence-to-sequence models

Traditionally, learning from sequential inputs such as text involves generating a fixed-length input from the data. For example, the “bag-of-words” approach simply counts the number of instances of each word in a document and produces a fixed-length vector that is the size of the overall vocabulary.
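The bag-of-words representation is simple enough to sketch directly. The function name `bag_of_words`, the whitespace tokenization, and the toy vocabulary are assumptions made for the example.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Count occurrences of each vocabulary word; word order is discarded."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["band", "gap", "of", "silicon", "the"]        # toy vocabulary
v1 = bag_of_words("the band gap of silicon", vocabulary)
v2 = bag_of_words("silicon gap the band of", vocabulary)    # same words, shuffled
```

The two documents map to the same vector, which illustrates why this representation cannot capture sequential or contextual information.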

In contrast, sequence-to-sequence models can take into account sequential/contextual information about each word and produce outputs of arbitrary length. For example, in named entity recognition (NER), an input sequence of words (e.g., a chemical abstract) is mapped to an output sequence of “entities” or categories where every word in the sequence is assigned a category.

An early form of sequence-to-sequence model is the recurrent neural network, or RNN. Unlike the fully connected NN architecture, where there is no connection between hidden nodes in the same layer, but only between nodes in adjacent layers, RNN has feedback connections. Each hidden layer can be unfolded and processed similarly to traditional NNs sharing the same weight matrices. There are multiple types of RNNs, of which the most common ones are: gated recurrent unit recurrent neural network (GRURNN), long short-term memory (LSTM) network, and clockwork RNN (CW-RNN) 76 .

However, all such RNNs suffer from some drawbacks, including: (i) difficulty of parallelization and therefore difficulty in training on large datasets and (ii) difficulty in preserving long-range contextual information due to the “vanishing gradient” problem. Nevertheless, as we will later describe, LSTMs have been successfully applied to various NER problems in the materials domain.

More recently, sequence-to-sequence models based on a “transformer” architecture, such as Google’s Bidirectional Encoder Representations from Transformers (BERT) model 77 , have helped address some of the issues of traditional RNNs. Rather than passing a state vector that is iterated word-by-word, such models use an attention mechanism to allow access to all previous words simultaneously without explicit time steps. This mechanism facilitates parallelization and also better preserves long-term context.
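The idea of attending to all positions at once can be sketched with scaled dot-product attention, the form commonly used in transformer models. This is an illustrative assumption rather than a description of BERT itself: there are no learned projection matrices, no multiple heads, and no masking, and the token vectors are arbitrary numbers. The function names `softmax` and `attention` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: every query attends to all positions at once."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([sum(w_t * v[c] for w_t, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Three token embeddings of dimension 2 (arbitrary numbers); self-attention uses
# the same vectors as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(tokens, tokens, tokens)
```

No state is carried from one token to the next, so the loop over queries could run in parallel, in contrast to the step-by-step recurrence of an RNN.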

Generative models

While the above DL frameworks are based on supervised machine learning (i.e., the target or ground truth data are known, as in classification and regression) and are discriminative (i.e., they learn differentiating features between various datasets), many AI tasks are unsupervised (such as clustering) and generative (i.e., they aim to learn underlying distributions) 78 .

Generative models are used to (a) generate data samples similar to the training set with variations, i.e., for augmentation and synthetic data, (b) learn good generalized latent features, and (c) guide mixed reality applications such as virtual try-on. There are various types of generative models, of which the most common are: (a) variational autoencoders (VAE), which explicitly define and learn the likelihood of data, and (b) generative adversarial networks (GAN), which learn to directly generate samples from the model’s distribution without defining any density function.

A VAE model has two components, namely an encoder and a decoder. A VAE’s encoder takes input from a target distribution and compresses it into a low-dimensional latent space. Then the decoder takes that latent space representation and reproduces the original image. Once the network is trained, we can generate latent space representations of various images, and interpolate between these before forwarding them through the decoder, which produces new images. A VAE is similar to principal component analysis (PCA), but whereas PCA assumes linear relationships in the data, VAEs work in the nonlinear domain. A GAN model also has two components, namely a generator and a discriminator. The generator produces fake/synthetic data intended to fool the discriminator, while the discriminator tries to distinguish fake data from real data. This process is also termed a “min-max two-player game.” We note that VAE models learn the hidden state distributions during the training process, while a GAN’s hidden state distributions are predefined; the GAN generator instead serves to generate images that could fool the discriminator. These techniques are widely used for images and spectra and have also been recently applied to atomic structures.
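For reference, the min-max game between generator and discriminator is commonly written as the following objective, as in the original GAN formulation. The symbols are notation introduced here, not taken from the text above: $G$ is the generator, $D$ the discriminator, $p_{\text{data}}$ the data distribution, and $p_z$ the predefined latent (noise) distribution.

```latex
\min_{G} \max_{D} \;
\mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right]
+ \mathbb{E}_{z \sim p_{z}}\left[\log\left(1 - D(G(z))\right)\right]
```

The discriminator maximizes this value by assigning high probability to real samples and low probability to generated ones, while the generator minimizes it by producing samples that the discriminator scores as real.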

Deep reinforcement learning

Reinforcement learning (RL) deals with tasks in which a computational agent learns to make decisions by trial and error. Deep RL incorporates DL into the RL framework, allowing agents to make decisions from unstructured input data 79 . In traditional RL, a Markov decision process (MDP) is used, in which an agent at every timestep takes an action, receives a scalar reward, and transitions to the next state according to the system dynamics, with the aim of learning a policy that maximizes returns. However, in deep RL, the states are high-dimensional (such as continuous images or spectra) and act as input to DL methods. DRL architectures can be either model-based or model-free.
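The "return" that the agent maximizes is usually a discounted sum of the scalar rewards it collects. The sketch below assumes a discount factor `gamma`, which is standard in RL but not specified in the text above; the function name `discounted_return` and the reward values are likewise illustrative.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of future rewards, each discounted by gamma per timestep."""
    g = 0.0
    # Work backwards so each step adds its reward to the discounted future.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]          # made-up scalar rewards for three timesteps
G0 = discounted_return(rewards)    # 1.0 + 0.9 * 0.0 + 0.81 * 2.0
```

Setting `gamma=1.0` recovers the plain sum of rewards, while smaller values make the agent favor near-term rewards.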

Scientific machine learning

The nascent field of scientific machine learning (SciML) 80 is creating new opportunities across all paradigms of machine learning, and deep learning in particular. SciML is focused on creating ML systems that incorporate scientific knowledge and physical principles, either directly in the specific form of the model or indirectly through the optimization algorithms used for training. This offers potential improvements in sample and training complexity, robustness (particularly under extrapolation), and model interpretability. One prominent theme can be found in ref. 57 . Such implementations usually involve applying multiple physics-based constraints while training a DL model 81 , 82 , 83 . One of the key challenges of universal function approximation is that an NN can quickly learn spurious features in the data that have nothing to do with the features a researcher is actually interested in. In this sense, physics-based regularization can assist. Physics-based deep learning can also aid in inverse design problems, a challenging but important task 84 , 85 . On the flip side, deep learning using graph neural networks and symbolic regression (stochastically building symbolic expressions) has even been used to “discover” symbolic equations from data that capture known (and unknown) physics behind the data 86 , i.e., to deep learn a physics model rather than to use a physics model to constrain DL.

Overview of applications

Some aspects of successful DL application that require materials-science-specific considerations are:

acquiring large, balanced, and diverse datasets (often on the order of 10,000 data points or more),

determining an appropriate DL approach and suitable vector or graph representation of the input samples, and

selecting appropriate performance metrics relevant to scientific goals.

In the following sections we discuss some of the key areas of materials science in which DL has been applied, with available links to repositories and datasets that help in the reproducibility and extensibility of the work. In this review we categorize materials science applications at a high level by the type of input data considered: (1) atomistic, (2) stoichiometric, (3) spectral, (4) image, and (5) text. We summarize prevailing machine learning tasks and their impact on materials research and development within each broad materials data modality.

Applications in atomistic representations

In this section, we provide a few examples of solving materials science problems with DL methods trained on atomistic data. The atomic structure of a material usually consists of atomic coordinates and atomic composition information. The arbitrary number of atoms and types of elements in a system poses a challenge for applying traditional ML algorithms to atomistic predictions. DL-based methods are an obvious strategy to tackle this problem. There have been several previous attempts to represent crystals and molecules using fixed-size descriptors such as the Coulomb matrix 87 , 88 , 89 , classical force field inspired descriptors (CFID) 90 , 91 , 92 , pair-distribution function (PRDF), and Voronoi tessellation 93 , 94 , 95 . Recently, graph neural network methods have been shown to surpass previous hand-crafted feature sets 28 .
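As an example of a hand-crafted descriptor, the Coulomb matrix can be computed in a few lines. The sketch below uses its standard definition (diagonal $0.5 Z_i^{2.4}$, off-diagonal $Z_i Z_j / |R_i - R_j|$) and assumes atomic units; the function name `coulomb_matrix` and the two-atom geometry are hypothetical.

```python
import math

def coulomb_matrix(charges, coords):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal and
    Z_i * Z_j / |R_i - R_j| off the diagonal (atomic units assumed)."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * charges[i] ** 2.4
            else:
                M[i][j] = charges[i] * charges[j] / math.dist(coords[i], coords[j])
    return M

# Hypothetical H2-like molecule: two hydrogen atoms 1.4 bohr apart.
charges = [1, 1]
coords = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
M = coulomb_matrix(charges, coords)
```

The matrix depends only on interatomic distances, so it is unchanged by translating the molecule, but its size grows with the number of atoms, which illustrates the difficulty of feeding systems of varying size to models that expect fixed-size inputs.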

DL applications for atomistic materials include: (a) force-field development, (b) direct property predictions, (c) materials screening. In addition to the above points, we also elucidate some of the recent generative adversarial network and complementary methods to atomistic approaches.

Databases and software libraries

In Table 1 we provide some of the commonly used datasets for atomistic DL models for molecules, solids, and proteins. We note that the computational methods used for different datasets are different and many of them are continuously evolving. Generally it takes years to generate such databases using conventional methods such as density functional theory; in contrast, DL methods can be used to make predictions with much reduced computational cost and reasonable accuracy.

In Table 1 we provide DL software packages used for atomistic materials design. The types of models include general property (GP) predictors and interatomic force fields (FF). The models have been demonstrated in molecules (Mol), solid-state materials (Sol), or proteins (Prot). For some force fields, high-performance large-scale implementations (LSI) that leverage parallel computing exist. Some of these methods mainly use interatomic distances to build graphs while others use distances as well as bond-angle information. Recently, including bond angles within GNNs has been shown to drastically improve the performance with comparable computational timings.
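The difference between distance-only graphs and graphs that also carry bond-angle information can be illustrated with a small sketch. It is not taken from any of the packages mentioned: the function names `build_graph` and `bond_angles`, the cutoff value, and the three-atom geometry are assumptions, and periodic boundary conditions, which real crystal graphs require, are ignored.

```python
import math
from itertools import combinations

def build_graph(coords, cutoff):
    """Connect every pair of atoms closer than `cutoff` (no periodic images)."""
    return [(i, j) for i, j in combinations(range(len(coords)), 2)
            if math.dist(coords[i], coords[j]) < cutoff]

def bond_angles(coords, edges):
    """Angle (degrees) at each atom shared by two edges, i.e., the extra
    information carried by a line graph built on top of the bond graph."""
    neighbors = {}
    for i, j in edges:
        neighbors.setdefault(i, []).append(j)
        neighbors.setdefault(j, []).append(i)
    angles = {}
    for center, nbrs in neighbors.items():
        for a, b in combinations(sorted(nbrs), 2):
            u = [p - q for p, q in zip(coords[a], coords[center])]
            v = [p - q for p, q in zip(coords[b], coords[center])]
            cos = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
            # Clamp to [-1, 1] to guard against floating-point round-off.
            angles[(a, center, b)] = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
    return angles

# Hypothetical bent triatomic: central atom 0 with two neighbors at right angles.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
edges = build_graph(coords, cutoff=1.2)
angles = bond_angles(coords, edges)
```

Two structures with identical bond lengths but different bond angles give the same edge list with the same distances, and only the angle dictionary tells them apart.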

Force-field development

The first application includes the development of DL-based force fields (FF) 96 , 97 /interatomic potentials. Some of the major advantages of such applications are that they are very fast at making predictions (on the order of hundreds to thousands of times faster 64 ) and that they ease the tedious development of FFs; the disadvantage is that they still require a large dataset, generated with computationally expensive methods, for training.

Models such as Behler-Parrinello neural network (BPNN) and its variants 98 , 99 are used for developing interatomic potentials that can be applied beyond 0 K and to time-dependent behavior using molecular dynamics simulations, such as for nanoparticles 100 . Such FF models have been developed for molecular systems, such as water, methane, and other organic molecules 99 , 101 as well as solids such as silicon 98 , sodium 102 , graphite 103 , and titania (TiO$_2$) 104 .

While the above works are mainly based on NNs, a graph neural network force-field (GNNFF) framework 105 , 106 has also been developed that bypasses both computational bottlenecks. GNNFF can predict atomic forces directly using automatically extracted structural features that are not only translationally invariant but also rotationally covariant with respect to the coordinate space of the atomic positions, i.e., the features and hence the predicted force vectors rotate the same way as the coordinates. In addition to the development of pure NN-based FFs, there have also been recent developments combining traditional FFs such as bond-order potentials with NNs, and ReaxFF with message passing neural networks (MPNN), that can help mitigate the extrapolation issues of NNs 82 , 107 .

Direct property prediction from atomistic configurations

DL methods can be used to establish a relationship between the atomic structure of materials and their properties with high accuracy 28 , 108 . Models such as SchNet, crystal graph convolutional neural network (CGCNN), improved crystal graph convolutional neural network (iCGCNN), directional message passing neural network (DimeNet), atomistic line graph neural network (ALIGNN) and materials graph neural network (MEGNet), shown in Table 1 , have been used to predict up to 50 properties of crystalline and molecular materials. These property datasets are usually obtained from ab-initio calculations. A schematic of such models is shown in Fig. 2 . While SchNet, CGCNN, and MEGNet are primarily based on atomic distances, the iCGCNN, DimeNet, and ALIGNN models capture many-body interactions using GCNN.

figure 2

a CGCNN model in which crystals are converted to graphs with nodes representing atoms in the unit cell and edges representing atom connections. Nodes and edges are characterized by vectors corresponding to the atoms and bonds in the crystal, respectively [Reprinted with permission from ref. 67 Copyright 2019 American Physical Society]. b ALIGNN 65 model in which the convolution layer alternates between message passing on the bond graph and its bond-angle line graph. c MEGNet model in which the initial graph is represented by the set of atomic attributes, bond attributes and global state attributes [Reprinted with permission from ref. 33 Copyright 2019 American Chemical Society]. d iCGCNN model in which multiple edges connect a node to neighboring nodes to show the number of Voronoi neighbors [Reprinted with permission from ref. 122 Copyright 2019 American Physical Society].
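As an illustration of how a crystal can be turned into a distance-based graph of the kind used by these models, the sketch below builds directed edges between atoms that lie within a cutoff radius, including periodic images. It is a simplified assumption-laden example, not the code of any package in Table 1: the function name `crystal_to_graph` is hypothetical, and searching only the 27 nearest cells is valid only when the cutoff is shorter than the cell lengths.

```python
import itertools
import numpy as np

def crystal_to_graph(lattice, frac_coords, cutoff):
    """Return directed edges (i, j, image, distance) within `cutoff`.

    lattice: 3x3 array with lattice vectors as rows (angstrom).
    frac_coords: (n_atoms, 3) fractional coordinates.
    Only the 27 neighboring cells are searched (assumes a small cutoff).
    """
    cart = frac_coords @ lattice
    edges = []
    for i, j in itertools.product(range(len(cart)), repeat=2):
        for image in itertools.product((-1, 0, 1), repeat=3):
            offset = np.array(image) @ lattice
            d = np.linalg.norm(cart[j] + offset - cart[i])
            if 1e-8 < d <= cutoff:  # skip the atom itself
                edges.append((i, j, image, d))
    return edges

# CsCl-type cell with a hypothetical lattice parameter of 4.0 angstrom.
lattice = 4.0 * np.eye(3)
frac = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
edges = crystal_to_graph(lattice, frac, cutoff=3.6)

neighbors_of_0 = [e for e in edges if e[0] == 0]
print(len(edges), len(neighbors_of_0))
```

In a full model, each node would additionally carry an element embedding and each edge an expanded distance feature; angle-aware models add information about triplets of atoms on top of this graph.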

Some of these properties include formation energies, electronic bandgaps, solar-cell efficiency, topological spin-orbit spillage, dielectric constants, piezoelectric constants, 2D exfoliation energies, electric field gradients, elastic modulus, Seebeck coefficients, power factors, carrier effective masses, highest occupied molecular orbital, lowest unoccupied molecular orbital, energy gap, zero-point vibrational energy, dipole moment, isotropic polarizability, electronic spatial extent, internal energy.

For instance, the current state-of-the-art mean absolute error for formation energy for solids at 0 K is 0.022 eV/atom, as obtained by the ALIGNN model 65 . DL is also heavily used for predicting the catalytic behavior of materials, for example in the Open Catalyst Project 109 , which is driven by DL methods for materials design. There is an ongoing effort to continuously improve the models. Usually, models for energy-based properties such as formation and total energies are more accurate than models for electronic properties such as bandgaps and power factors.

In addition to molecules and solids, property predictions models have also been used for bio-materials such as proteins, which can be viewed as large molecules. There have been several efforts for predicting protein-based properties, such as binding affinity 66 and docking predictions 110 .

There have also been several applications for identifying reasonable chemical space using DL methods such as autoencoders 111 and reinforcement learning 112 , 113 , 114 for inverse materials design. Inverse materials design with techniques such as GANs deals with finding chemical compounds with suitable properties and acts as a complement to forward prediction models. While such concepts have been widely applied to molecular systems 115 , recently these methods have been applied to solids as well 116 , 117 , 118 , 119 , 120 .

Fast materials screening

DFT-based high-throughput methods are usually limited to a few thousand compounds and take a long time for calculations; DL-based methods can aid this process and allow much faster predictions. The DL-based property prediction models mentioned above can be used for pre-screening chemical compounds, so DL-based tools can be viewed as a pre-screening step for traditional methods such as DFT. For example, Xie et al. used the CGCNN model to screen stable perovskite materials 67 as well as for hierarchical visualization of materials space 121 . Park et al. 122 used iCGCNN to screen ThCr₂Si₂-type materials. Lugier et al. used DL methods to predict thermoelectric properties 123 . Rosen et al. 124 used graph neural network models to predict the bandgaps of metal-organic frameworks. DL for molecular materials has been used to predict technologically important properties such as aqueous solubility 125 and toxicity 126 .
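The pre-screening idea can be sketched in a few lines: a fast surrogate scores every candidate, and only the most promising ones are passed on to the expensive method. The candidate names, the scores, and the function name `prescreen` below are made-up placeholders for illustration; in practice the surrogate would be a trained property model.

```python
def prescreen(candidates, surrogate, threshold, budget):
    """Keep at most `budget` candidates whose predicted score is <= threshold.

    surrogate: callable returning a predicted value where lower is better
    (for example, a predicted energy above the convex hull).
    """
    scored = sorted(((surrogate(c), c) for c in candidates), key=lambda t: t[0])
    return [c for score, c in scored if score <= threshold][:budget]

# Placeholder predictions (eV/atom); these are not real data.
fake_predictions = {
    "cand_A": 0.01,
    "cand_B": 0.30,
    "cand_C": 0.04,
    "cand_D": 0.00,
    "cand_E": 0.12,
}

shortlist = prescreen(
    candidates=list(fake_predictions),
    surrogate=fake_predictions.get,  # stands in for a DL model
    threshold=0.05,
    budget=2,
)
print(shortlist)  # candidates to pass on to DFT
```

The threshold should be chosen with the surrogate's error in mind, since a cut that is tighter than the model's typical error will discard viable candidates.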

It should be noted that the full atomistic representations and the associated DL models are only possible if the crystal structure and atom positions are available. In practice, the precise atom positions are only available from DFT structural relaxations or experiments, and are one of the goals of materials discovery rather than the starting point. Hence, alternative methods have been proposed to bypass the need for atom positions in building DL models. For example, Jain and Bligaard 127 proposed atomic-position-independent descriptors and used a CNN model to learn the energies of crystals. Such descriptors include information based only on the symmetry (e.g., space group and Wyckoff position). In principle, the method can be applied universally to all crystals. Nevertheless, the model errors tend to be much higher than those of graph-based models. A similar coarse-grained representation using Wyckoff positions was also used by Goodall et al. 128 . Alternatively, Zuo et al. 129 started from hypothetical structures without precise atom positions, and used a Bayesian optimization method coupled with a MEGNet energy model as an energy evaluator to perform direct structural relaxation. Application of the Bayesian optimization with symmetry relaxation (BOWSR) algorithm led to the discovery of the hard materials ReWB (Pca2₁) and MoWC₂ (P6₃/mmc), which were then experimentally synthesized.

Applications in chemical formula and segment representations

One of the earliest applications for DL included SMILES for molecules, elemental fractions and chemical descriptors for solids, and sequence of protein names as descriptors. Such descriptors lack explicit inclusion of atomic structure information but are still useful for various pre-screening applications for both theoretical and experimental data.

SMILES and fragment representation

The simplified molecular-input line-entry system (SMILES) is a method to represent elemental and bonding information for molecular structures using short American Standard Code for Information Interchange (ASCII) strings. SMILES can express structural differences including the chirality of compounds, making it more useful than a simple chemical formula. A SMILES string is a simple grid-like (1-D grid) structure that can also represent molecular sequences such as DNA, macromolecules/polymers, and protein sequences 130 , 131 . In addition to the chemical constituents as in the chemical formula, bonds (such as double and triple bonds) are represented by special symbols (such as '=' and '#'). The presence of a branch point is indicated using a left-hand bracket "(" while the right-hand bracket ")" indicates that all the atoms in that branch have been taken into account. SMILES strings are converted to a distributed representation termed a SMILES feature matrix (a sparse matrix), to which DL can then be applied much as it is to image data. The length of the SMILES matrix is generally kept fixed (such as 400) during training, and in addition to the SMILES string, multiple elemental attributes and bonding attributes (such as chirality and aromaticity) can be used. Key DL tasks for molecules include (a) novel molecule design and (b) molecule screening.
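A fixed-length SMILES feature matrix of the kind described above can be sketched as a one-hot encoding padded with zeros. This is a simplified illustration under stated assumptions: it tokenizes character by character (so two-letter atoms such as Cl or Br would need a proper tokenizer), it omits the extra elemental and bonding attributes, and the function name `smiles_to_matrix` is hypothetical.

```python
import numpy as np

def smiles_to_matrix(smiles, vocab, max_len=400):
    """One-hot encode a SMILES string into a (max_len, len(vocab)) matrix.

    Rows beyond the end of the string stay zero (padding).
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    index = {ch: k for k, ch in enumerate(vocab)}
    matrix = np.zeros((max_len, len(vocab)), dtype=np.float32)
    for position, ch in enumerate(smiles):
        matrix[position, index[ch]] = 1.0
    return matrix

# Ethanol, ethene, hydrogen cyanide, and isopropanol.
examples = ["CCO", "C=C", "C#N", "CC(C)O"]
vocab = sorted(set("".join(examples)))  # characters seen in the examples
X = np.stack([smiles_to_matrix(s, vocab) for s in examples])

print(vocab)
print(X.shape)
```

The resulting array has one mostly empty 2-D matrix per molecule, which is why it can be fed to convolutional or recurrent layers in the same way as image or sequence data.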

Novel molecules with target properties can be designed using VAE-, GAN-, and RNN-based methods 132 , 133 , 134 . These DL-generated molecules might not be physically valid, but the goal is to train the model to learn the patterns in SMILES strings such that the output resembles valid molecules. Chemical intuition can then be used to further screen the molecules. DL for SMILES can also be used for molecular screening, such as predicting molecular toxicity. Some of the common SMILES datasets are ZINC 135 , Tox21 136 , and PubChem 137 .

Because it is difficult to enforce the generation of valid molecular structures from SMILES, fragment-based models such as DeepFrag and DeepFrag-K 138 , 139 have been developed. In fragment-based models, a fragment is removed from a ligand/receptor complex and then a DL model is trained to predict the most suitable fragment substituent. A set of useful tools for SMILES and fragment representations is provided in Table 2 .

Chemical formula representation

There are several ways of using chemical formula-based representations for building ML/DL models, beginning with a simple vector of raw elemental fractions 140 , 141 or of weight percentages of alloying compositions 142 , 143 , 144 , 145 , as well as more sophisticated hand-crafted descriptors or physical attributes that add known chemistry knowledge (e.g., electronegativity, valency, etc. of constituent elements) to the feature representations 146 , 147 , 148 , 149 , 150 , 151 . Statistical and mathematical operations such as average, max, min, median, mode, and exponentiation can be carried out on elemental properties of the constituent elements to get a set of descriptors for a given compound. The number of such composition-based features can range from a few dozen to a few hundred. One of the commonly used representations that has been shown to work for a variety of different use-cases is the materials agnostic platform for informatics and exploration (MagPie) 150 . All these composition-based representations can be used with both traditional ML methods such as Random Forest and with DL.
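To make the construction of composition-based descriptors concrete, the sketch below parses a simple formula into elemental fractions and then computes a few statistics of one elemental property. It is an illustration only and does not reproduce MagPie: the helper names are hypothetical, the parser handles only flat formulas without brackets, and the small table holds rounded Pauling electronegativities for a handful of elements.

```python
import re
import numpy as np

# Rounded Pauling electronegativities for a few elements (illustrative subset).
ELECTRONEGATIVITY = {"H": 2.20, "O": 3.44, "Na": 0.93, "Cl": 3.16, "Si": 1.90, "Ti": 1.54}

def parse_formula(formula):
    """Return elemental fractions for a flat formula such as 'TiO2'."""
    counts = {}
    for element, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[element] = counts.get(element, 0.0) + float(amount or 1)
    total = sum(counts.values())
    return {element: n / total for element, n in counts.items()}

def composition_features(formula, prop=ELECTRONEGATIVITY):
    """Fraction-weighted mean plus max, min, and range of an elemental property."""
    fractions = parse_formula(formula)
    values = np.array([prop[element] for element in fractions])
    weights = np.array(list(fractions.values()))
    return {
        "mean": float(weights @ values),
        "max": float(values.max()),
        "min": float(values.min()),
        "range": float(values.max() - values.min()),
    }

fractions = parse_formula("TiO2")
features = composition_features("TiO2")
print(fractions)
print(features)
```

Repeating the same statistics over many elemental properties is what produces feature vectors with dozens to hundreds of entries.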

It is relevant to note that ElemNet 141 , which is a 17-layer neural network composed of fully connected layers and uses only raw elemental fractions as input, was found to significantly outperform traditional ML methods such as Random Forest, even when they were allowed to use more sophisticated physical attributes based on MagPie as input. Although no periodic table information was provided to the model, it was found to self-learn some interesting chemistry, like groups (element similarity) and charge balance (element interaction). It was also able to predict phase diagrams on unseen materials systems, underscoring the power of DL for representation learning directly from raw inputs without explicit feature extraction. Further increasing the depth of the network was found to adversely affect the model accuracy due to the vanishing gradient problem. To address this issue, Jha et al. 152 developed IRNet, which uses individual residual learning to allow a smoother flow of gradients and enable deeper learning for cases where big data is available. IRNet models were tested on a variety of big and small materials datasets, such as OQMD, AFLOW, Materials Project, JARVIS, using different vector-based materials representations (element fractions, MagPie, structural) and were found to not only successfully alleviate the vanishing gradient problem and enable deeper learning, but also lead to significantly better model accuracy as compared to plain deep neural networks and traditional ML techniques for a given input materials representation in the presence of big data 153 . Further, graph-based methods such as Roost 154 have also been developed which can outperform many similar techniques.
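The effect of residual connections on deep fully connected stacks can be illustrated with a minimal forward pass. This sketch is not the IRNet implementation; it assumes a shortcut connection around every fully connected layer, which is how we read "individual residual learning", and it uses an extreme case (all weights zero) to show that the shortcut keeps an identity path open where a plain stack would lose the signal entirely.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_stack(x, weights):
    """Plain deep network: each layer replaces its input."""
    for W in weights:
        x = relu(x @ W)
    return x

def residual_stack(x, weights):
    """Each layer adds its output to its input (shortcut around every layer)."""
    for W in weights:
        x = x + relu(x @ W)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
dead_weights = [np.zeros((8, 8)) for _ in range(17)]  # layers that learn nothing

plain_out = plain_stack(x, dead_weights)
residual_out = residual_stack(x, dead_weights)
print(np.abs(plain_out).max(), np.allclose(residual_out, x))
```

Because the output of each residual layer contains its input unchanged, the gradient with respect to earlier activations always includes an identity term, which is the usual explanation for why such networks remain trainable at greater depth.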

Such methods have been used for the diverse DFT datasets mentioned above in Table 1 as well as for experimental datasets such as SuperCon 155 , 156 for quick pre-screening applications. In terms of applications, they have been used to predict properties such as formation energy 141 , bandgap and magnetization 152 , superconducting temperatures 156 , and bulk and shear moduli 153 . They have also been used for transfer learning across datasets for enhanced predictive accuracy on small data 34 , even for different source and target properties 157 , which is especially useful for building predictive models for target properties for which big source datasets may not be readily available.

Libraries of such descriptors, such as MatMiner 151 and DScribe 158 , have been developed. Some examples of such models are given in Table 2 . Such representations are especially useful for experimental datasets such as those for superconducting materials where the atomic structure is not tabulated. However, these representations cannot distinguish different polymorphs of a system with different point groups and space groups. It has recently been shown that although composition-based representations can help build ML/DL models that predict some properties such as formation energy with remarkable accuracy, this does not necessarily translate to accurate predictions of other properties such as stability, when compared to DFT's own accuracy 159 .

Spectral models

When electromagnetic radiation hits a material, the interaction between the radiation and matter, measured as a function of the wavelength or frequency of the radiation, produces a spectroscopic signal. By studying spectroscopy, researchers can gain insights into the material's composition and its structural and dynamic properties. Spectroscopic techniques are foundational in materials characterization. For instance, X-ray diffraction (XRD) has been used to characterize the crystal structure of materials for more than a century. Spectroscopic analysis can involve fitting quantitative physical models (for example, Rietveld refinement) or more empirical approaches such as fitting linear combinations of reference spectra, as with X-ray absorption near-edge spectroscopy (XANES). Both approaches require a high degree of researcher expertise through careful design of experiments; specification, revision, and iterative fitting of physical models; or the availability of template spectra of known materials. In recent years, with the advances in high-throughput experiments and computational data, spectroscopic data has multiplied, giving researchers opportunities to learn from the data and potentially displace the conventional methods of analyzing such data. This section covers emerging DL applications in various modes of spectroscopic data analysis, aiming to offer practical examples and insights. Some of the applications are shown in Fig. 3 .

figure 3

a Predicting structure information from the X-ray diffraction 374 , Reprinted according to the terms of the CC-BY license 374 . Copyright 2020. b Predicting catalysis properties from computational electronic density of states data. Reprinted according to the terms of the CC-BY license 202 . Copyright 2021.

Currently, large-scale and element-diverse spectral data mainly exist in computational databases. For example, in ref. 160 , the authors calculated the infrared spectra, piezoelectric tensor, Born effective charge tensor, and dielectric response as a part of the JARVIS-DFT DFPT database. The Materials Project has established the largest computational X-ray absorption database (XASDb), covering the K-edge X-ray absorption near-edge spectroscopy (XANES) 161 , 162 and the L-edge XANES 163 of a large number of material structures. The database currently hosts more than 400,000 K-edge XANES site-wise spectra and 90,000 L-edge XANES site-wise spectra of many compounds in the Materials Project. There are considerably fewer experimental XAS spectra, on the order of hundreds, as seen in the EELSDb and the XASLib. Collecting large experimental spectra databases that cover a wide range of elements is a challenging task. Collective efforts have focused on curating data extracted from different sources, as found in the RRUFF Raman, XRD and chemistry database 164 , the open Raman database 165 , and the SOP spectra library 166 . However, data consistency is not guaranteed. It is also now possible for contributors to share experimental data in a Materials Project curated database, MPContribs 167 . This database is supported by the US Department of Energy (DOE), providing some expectation of persistence. Entries can be kept private or published and are linked to the main Materials Project computational databases. There is an ongoing effort to capture data from DOE-funded synchrotron light sources ( https://lightsources.materialsproject.org/ ) into MPContribs in the future.

Recent advances in sources, detectors, and experimental instrumentation have made high-throughput measurements of experimental spectra possible, giving rise to new possibilities for spectral data generation and modeling. Examples include the HTEM database 10 , which contains 50,000 optical absorption spectra, and the UV-Vis database of 180,000 samples from the Joint Center for Artificial Photosynthesis. Some of the common databases for spectral data are shown in Table 3 . Cloud-based software-as-a-service platforms for high-throughput data analysis are beginning to appear, for example, pair-distribution function (PDF) in the cloud ( https://pdfitc.org ) 168 ; these are backed by structured databases in which data can be kept private or made public. This transition to the cloud from data analysis software installed and run locally on a user's computer will facilitate the sharing and reuse of data by the community.

Applications

Due to the widespread deployment of XRD across many materials technologies, XRD spectra became one of the first test grounds for DL models. Phase identification from XRD can be mapped into a classification task (assuming all phases are known) or an unsupervised clustering task. Unlike the traditional analysis of XRD data, where the spectra are treated as convolved, discrete peak positions and intensities, DL methods treat the data as a continuous pattern similar to an image. Unfortunately, large collections of experimental XRD data gathered in one place are not readily available at the moment. Nevertheless, the availability of extensive, high-quality crystal structure data makes it straightforward to create simulated XRD patterns.
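As a simplified illustration of how a continuous XRD pattern can be simulated from structural data, the sketch below uses Bragg's law, λ = 2d sin θ, with d = a/√(h² + k² + l²) for a cubic lattice, and broadens each peak with a Gaussian. It makes strong assumptions that real simulation tools do not: all allowed reflections of a primitive cubic cell get equal intensity (no structure factors, multiplicities, or instrument corrections), the lattice parameter is hypothetical, and the function names are our own. The default wavelength is Cu Kα1 (1.5406 Å).

```python
import itertools
import numpy as np

def cubic_peak_positions(a, wavelength=1.5406, max_index=3):
    """2-theta positions (degrees) from Bragg's law for a primitive cubic cell."""
    two_theta = set()
    for h, k, l in itertools.product(range(max_index + 1), repeat=3):
        n = h * h + k * k + l * l
        if n == 0:
            continue
        d = a / np.sqrt(n)              # interplanar spacing
        s = wavelength / (2.0 * d)      # sin(theta)
        if s <= 1.0:
            two_theta.add(round(2.0 * float(np.degrees(np.arcsin(s))), 6))
    return sorted(two_theta)

def simulate_pattern(peaks, grid, fwhm=0.2):
    """Sum of equal-height Gaussians on `grid`, normalized to a maximum of 1."""
    sigma = fwhm / (2.0 * np.sqrt(2.0 * np.log(2.0)))
    pattern = np.zeros_like(grid)
    for p in peaks:
        pattern += np.exp(-0.5 * ((grid - p) / sigma) ** 2)
    return pattern / pattern.max()

peaks = cubic_peak_positions(a=4.0)       # hypothetical lattice parameter
grid = np.linspace(10.0, 90.0, 4001)      # 2-theta axis in degrees
pattern = simulate_pattern(peaks, grid)
print(peaks[:3])
```

The resulting 1-D array is the kind of "image-like" input that the CNN models discussed in this section consume.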

Park et al. 169 calculated 150,000 XRD patterns from the Inorganic Crystal Structure Database (ICSD) structural database 170 and then used CNN models to predict structural information from the simulated XRD patterns. The accuracies of the CNN models reached 81.14%, 83.83%, and 94.99% for space-group, extinction-group, and crystal-system classifications, respectively.

Liu et al. 95 obtained similar accuracies by using a CNN for classifying atomic pair-distribution function (PDF) data into space groups. The PDF is obtained by Fourier transforming XRD into real space and is particularly useful for studying the local and nanoscale structure of materials. In the case of the PDF, models were trained, validated, and tested on simulated data from the ICSD. However, the trained model showed excellent performance when given experimental data, something that can be a challenge in XRD data because of the different resolutions and line-shapes of the diffraction data depending on specifics of the sample and experimental conditions. The PDF seems to be more robust against these aspects.

Similarly, Zaloga et al. 171 also used the ICSD database for XRD pattern generation and CNN models to classify crystals. The models achieved 90.02% and 79.82% accuracy for crystal systems and space groups, respectively.

It should be noted that the ICSD database contains many duplicates, and such duplicates should be filtered out to avoid information leakage between training and test sets. There is also a large difference in the number of structures represented in each space group (the label) in the database, resulting in class-imbalance challenges.
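Both issues can be handled with simple preprocessing, sketched below under stated assumptions. Duplicates are detected here by a naive (formula, space group) key, whereas a real pipeline would compare the structures themselves; the inverse-frequency weighting is one common choice and is not necessarily what the cited works used. The function names are hypothetical.

```python
from collections import Counter

def deduplicate(entries):
    """Drop repeated (formula, space_group) entries, keeping first occurrences."""
    seen, unique = set(), []
    for formula, space_group in entries:
        key = (formula, space_group)
        if key not in seen:
            seen.add(key)
            unique.append((formula, space_group))
    return unique

def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}

entries = [
    ("NaCl", 225), ("NaCl", 225),   # duplicate entry
    ("TiO2", 136), ("TiO2", 141),   # rutile and anatase are distinct
    ("Si", 227), ("MgO", 225),
]
unique = deduplicate(entries)
weights = class_weights([sg for _, sg in unique])
print(len(unique), weights)
```

Deduplication should be done before the train/test split, and the weights can then be passed to a weighted loss so that rare space groups are not ignored during training.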

Lee et al. 172 developed a CNN model for phase identification from samples consisting of a mixture of several phases in a limited chemical space relevant for battery materials. The training data are mixed patterns consisting of 1,785,405 synthetic XRD patterns from the Sr-Li-Al-O phase space. The resulting CNN can not only identify the phases but also predict the compound fraction in the mixture. A similar CNN was utilized by Wang et al. 173 for fast identification of metal-organic frameworks (MOFs), where experimental spectral noise was extracted and then synthesized into the theoretical XRD for training data augmentation.

An alternative idea was proposed by Dong et al. 174 . Instead of only recognizing phases with the CNN, the proposed "parameter quantification network" (PQ-Net) was able to extract physico-chemical information. The PQ-Net yields accurate predictions for scale factors, crystallite size, and lattice parameters for simulated and experimental XRD spectra. The work by Aguiar et al. 175 took a step further and proposed a modular neural network architecture that enables the combination of diffraction patterns and chemistry data and provides a ranked list of predictions. The ranked list of predictions provides user flexibility and overcomes some aspects of overconfidence in model predictions. In practical applications, AI-driven XRD identification can be beneficial for high-throughput materials discovery, as shown by Maffettone et al. 176 . In their work, an ensemble of 50 CNN models was trained on synthetic data reproducing experimental variations (missing peaks, broadening, peak shifting, noise). The model ensemble is capable of predicting the probability of each category label. A similar data augmentation idea was adopted by Oviedo et al. 177 , where experimental XRD data for 115 thin-film metal-halides were measured, and CNN models trained on the augmented XRD data achieved accuracies of 93% and 89% for classifying dimensionality and space group, respectively.
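Augmentation of synthetic patterns with experimental-style variations can be sketched as follows. This is a generic illustration, not the procedure of any of the cited works: it applies a random shift, Gaussian broadening, and additive noise to a 1-D pattern, with hypothetical parameter values, and it does not model missing peaks. Note that `np.roll` wraps intensity around the ends of the array, which is acceptable only when the pattern edges are empty.

```python
import numpy as np

def augment_pattern(pattern, rng, max_shift=5, broaden_sigma=1.5, noise_level=0.02):
    """Return a shifted, broadened, noisy copy of a 1-D pattern (max scaled to 1)."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(pattern, shift)                      # peak shifting (in grid points)
    half = int(4 * broaden_sigma) + 1
    offsets = np.arange(-half, half + 1)
    kernel = np.exp(-0.5 * (offsets / broaden_sigma) ** 2)
    kernel /= kernel.sum()
    out = np.convolve(out, kernel, mode="same")        # peak broadening
    out = out + rng.normal(0.0, noise_level, size=out.shape)  # detector-like noise
    out = np.clip(out, 0.0, None)                      # intensities stay non-negative
    return out / out.max()

# Toy pattern with three sharp peaks on a 500-point grid.
grid = np.arange(500)
clean = sum(np.exp(-0.5 * ((grid - c) / 2.0) ** 2) for c in (100, 250, 400))

rng = np.random.default_rng(0)
augmented = np.stack([augment_pattern(clean, rng) for _ in range(8)])
print(augmented.shape)
```

Each call produces a different variant of the same labeled pattern, so a small set of simulated patterns can be expanded into a much larger training set.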

Although not a DL method, an unsupervised machine learning approach, non-negative matrix factorization (NMF), is showing great promise for yielding chemically relevant XRD spectra from time- or spatially-dependent sets of diffraction patterns. NMF is closely related to principal component analysis in that it takes a set of patterns as a matrix and then compresses the data by reducing the dimensionality and finding the most important components. In NMF a constraint is applied that all the components and their weights must be non-negative. This often corresponds to a real physical situation (for example, spectra tend to be positive, as are the weights of chemical constituents). As a result, the mathematical decomposition often yields interpretable, physically meaningful components and weights, as shown by Liu et al. for PDF data 178 . An extension of this showed that in a spatially resolved study, NMF could be used to extract chemically resolved differential PDFs (similar to the information in EXAFS) from non-chemically resolved PDF measurements 179 . NMF is very quick and easy to apply and can be applied to just about any set of spectra. It is likely to become widely used and is being implemented in the PDFitc.org website to make it more accessible to potential users.
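A minimal NMF can be written with the standard multiplicative update rules for the Frobenius-norm objective. The sketch below is a generic textbook version and not the implementation used in the cited works or on PDFitc.org. It is applied to synthetic data: 20 patterns built as time-dependent mixtures of two Gaussian "component spectra", all of which are assumptions made for illustration.

```python
import numpy as np

def nmf(X, n_components, n_iter=1000, eps=1e-9, seed=0):
    """Factorize non-negative X (patterns x points) as W @ H with W, H >= 0."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(0.1, 1.0, size=(X.shape[0], n_components))  # weights
    H = rng.uniform(0.1, 1.0, size=(n_components, X.shape[1]))  # components
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)   # multiplicative update for H
        W *= (X @ H.T) / (W @ H @ H.T + eps)   # multiplicative update for W
    return W, H

# Synthetic series: component 1 fades out while component 2 grows in.
grid = np.arange(200)
components = np.stack([
    np.exp(-0.5 * ((grid - 60) / 10.0) ** 2),
    np.exp(-0.5 * ((grid - 140) / 10.0) ** 2),
])
t = np.linspace(0.0, 1.0, 20)
mixing = np.stack([1.0 - t, t], axis=1)
X = mixing @ components

W, H = nmf(X, n_components=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(rel_error)
```

Because the updates only ever multiply by non-negative ratios, W and H remain non-negative throughout, which is what makes the recovered components readable as spectra and the weights as amounts.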

Besides XRD, XAS, Raman, and infrared spectra also contain rich structure-dependent spectroscopic information about the material. Unlike XRD, where relatively simple theories and equations exist to relate structures to the spectral patterns, the relationships between general spectra and structures are somewhat elusive. This difficulty has created a higher demand for machine learning models to learn structural information from other spectra.

For instance, X-ray absorption spectroscopy (XAS), including X-ray absorption near-edge spectroscopy (XANES) and extended X-ray absorption fine structure (EXAFS), is usually used to analyze structural information at the atomic level. However, the high signal-to-noise XANES region has no equation for data fitting, which makes DL modeling of XAS data an attractive route to new insights. Timoshenko et al. used neural networks to predict the coordination numbers of Pt 180 and Cu 181 in nanoclusters from the XANES. Aside from the high accuracies, the neural network also offers high prediction speed and new opportunities for quantitative XANES analysis. Timoshenko et al. 182 further carried out a novel analysis of EXAFS using DL. Although EXAFS analysis has an explicit equation to fit, the study is limited to the first few coordination shells and to relatively ordered materials. Timoshenko et al. 182 first transformed the EXAFS data into 2D maps with a wavelet transform and then supplied the 2D data to a neural network model. The model can instantly predict relatively long-range radial distribution functions, offering in situ local structure analysis of materials. The advent of high-throughput XAS databases has recently opened up more possibilities for machine learning models to be deployed using XAS data. For example, Zheng et al. 161 used an ensemble learning method to match and quickly search new spectra in the XASDb. Later, the same authors showed that random forest models outperform DL models such as MLPs or CNNs in directly predicting atomic environment labels from the XANES spectra 183 . Similar approaches were also adopted by Torrisi et al. 184 . In practical applications, Andrejevic et al. 185 used the XASDb data together with the topological materials database. They constructed CNN models to classify the topology of materials from the XANES and symmetry group inputs. The model correctly predicted 81% of topological and 80% of trivial cases and achieved 90% accuracy in material classes containing certain elements.

Raman, infrared, and other vibrational spectroscopies provide structural fingerprints and are usually used to discriminate between components in a mixture and estimate their concentrations. For example, Madden et al. 186 used neural network models to predict the concentration of illicit materials in a mixture using Raman spectra. Interestingly, several groups have independently found that DL models outperform chemometrics analysis in vibrational spectroscopies 187 , 188 . For learning vibrational spectra, the number of training spectra is usually less than or on the order of the number of features (intensity points), and the models can easily overfit. Hence, dimensionality reduction strategies are commonly used to compress the information dimension using, for example, principal component analysis (PCA) 189 , 190 . DL approaches do not have such concerns and offer elegant and unified solutions. For example, Liu et al. 191 applied CNN models to the Raman spectra in the RRUFF spectral database and showed that CNN models outperform classical machine learning models such as SVM in classification tasks. More DL applications in vibrational spectral analysis can be found in a recent review by Yang et al. 192 .
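The dimensionality-reduction step mentioned above can be sketched with PCA computed from a singular value decomposition. The example is generic and uses synthetic data chosen for illustration: 30 "spectra" with 500 intensity points each, generated from two latent concentrations plus weak noise, so that there are far more features than samples.

```python
import numpy as np

def pca_reduce(spectra, n_components):
    """Project mean-centered spectra onto their leading principal components.

    Returns the scores, the component vectors, and the explained variance ratio.
    """
    centered = spectra - spectra.mean(axis=0)
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:n_components].T
    explained = float((S[:n_components] ** 2).sum() / (S ** 2).sum())
    return scores, Vt[:n_components], explained

rng = np.random.default_rng(0)
grid = np.arange(500)
basis = np.stack([
    np.exp(-0.5 * ((grid - 150) / 15.0) ** 2),
    np.exp(-0.5 * ((grid - 350) / 15.0) ** 2),
])
concentrations = rng.uniform(0.0, 1.0, size=(30, 2))
spectra = concentrations @ basis + 0.001 * rng.normal(size=(30, 500))

scores, pcs, explained = pca_reduce(spectra, n_components=2)
print(scores.shape, explained)
```

A downstream classifier or regressor would then be trained on the two score columns instead of the 500 raw intensities, which reduces the risk of overfitting when spectra are scarce.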

Although most current DL work focuses on the inverse problem, i.e., predicting structural information from the spectra, some innovative approaches also solve the forward problem by predicting the spectra from the structure. In this case, the spectroscopy data can be viewed simply as a high-dimensional material property of the structure. This is most common in molecular science, where predicting infrared spectra 193 and molecular excitation spectra 194 is of particular interest. In the early 2000s, Selzer et al. 193 and Kostka et al. 195 attempted to predict infrared spectra directly from molecular structural descriptors using neural networks. Non-DL models can also perform such tasks with reasonable accuracy 196 . For DL models, Chen et al. 197 used a Euclidean neural network (E(3)NN) to predict the phonon density of states (DOS) spectra 198 from atom positions and element types. The E(3)NN model captures the symmetries of the crystal structures, with no need to perform data augmentation to achieve target invariances. Hence the E(3)NN model is extremely data-efficient and can give reliable predictions of DOS spectra and heat capacity using relatively sparse data of 1200 calculation results on 65 elements. A similar idea was also used to predict XAS spectra. Carbone et al. 199 used a message passing neural network (MPNN) to predict the O and N K-edge XANES spectra from the molecular structures in the QM9 database 7 . The training XANES data were generated using the FEFF package 200 . The trained MPNN model reproduced all prominent peaks in the predicted XANES, and 90% of the predicted peaks are within 1 eV of the FEFF calculations. Similarly, Rankine et al. 201 started from the two-body radial distribution curve (RDC) and used a deep neural network model to predict the Fe K-edge XANES spectra for arbitrary local environments.
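To show how a predicted phonon DOS connects to heat capacity, the sketch below evaluates the standard harmonic expression C_V(T) = k_B ∫ g(E) x² eˣ / (eˣ − 1)² dE with x = E/(k_B T). This is a textbook formula and not necessarily the exact procedure of the cited work. The DOS is a hypothetical Debye-like model normalized to three modes per atom, and the cutoff energy of 40 meV is an arbitrary illustrative value; a predicted DOS could be substituted for it.

```python
import numpy as np

K_B_MEV_PER_K = 0.08617333  # Boltzmann constant in meV/K

def trapezoid(y, x):
    """Trapezoidal integration of y over x."""
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

def heat_capacity(energies_mev, dos, temperature_k):
    """Harmonic heat capacity in units of k_B (per atom if the DOS integrates to 3)."""
    x = energies_mev / (K_B_MEV_PER_K * temperature_k)
    # x^2 e^x / (e^x - 1)^2 written as (x / (2 sinh(x/2)))^2
    weight = (x / (2.0 * np.sinh(x / 2.0))) ** 2
    return trapezoid(dos * weight, energies_mev)

e_debye = 40.0                                  # hypothetical cutoff energy in meV
energies = np.linspace(1e-3, e_debye, 2000)     # start above zero to avoid 0/0
dos = 9.0 * energies ** 2 / e_debye ** 3        # Debye-like DOS, integrates to ~3

c_high = heat_capacity(energies, dos, 5000.0)
c_low = heat_capacity(energies, dos, 50.0)
print(c_high, c_low)
```

At high temperature the weight approaches 1 and the result tends to the classical limit of 3 k_B per atom, while at low temperature the high-energy modes are frozen out and the heat capacity drops.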

In addition to learning the structure-spectra or spectra-structure relationships, a few works have also explored the possibility of relating spectra to other material properties in a non-trivial way. The DOSnet proposed by Fung et al. 202 (Fig. 3 b) uses the electronic DOS spectra calculated from DFT as inputs to a CNN model to predict the adsorption energies of H, C, N, O, S and their hydrogenated counterparts, CH, CH₂, CH₃, NH, OH, and SH, on bimetallic alloy surfaces. This approach extends the previous d-band theory 203 , where only the d-band center, a scalar, was used to correlate with the adsorption energy on transition metals. Similarly, Kaundinya et al. 204 used the Atomistic Line Graph Neural Network (ALIGNN) to predict the DOS for 56,000 materials in the JARVIS-DFT database using a direct discretized spectrum (D-ALIGNN) and a compressed low-dimensional representation obtained with an autoencoder (AE-ALIGNN). Stein et al. 205 tried to learn the mapping between the image and the UV-vis spectrum of the material using a conditional variational autoencoder (cVAE) with neural network models as the backbone. Such models can generate the UV-vis spectrum directly from a simple material image, offering much faster material characterization. Predicting gas adsorption isotherms for direct air capture (DAC) is also an important application of spectra-based DL models. There have been several important works 206 , 207 on CO₂ capture with high-performance metal-organic frameworks (MOFs), which are important for mitigating climate change.

Image-based models

Computer vision is often credited with precipitating the current wave of mainstream DL applications a decade ago 208 . Naturally, materials researchers have developed a broad portfolio of applications of computer vision for accelerating and improving image-based material characterization techniques. High-level microscopy vision tasks can be organized as follows: image classification (and material property regression); auto-tuning of experimental imaging hyperparameters; pixelwise learning (e.g., semantic segmentation); super-resolution imaging; object/entity recognition, localization, and tracking; and microstructure representation learning.

Often these tasks generalize across many different imaging modalities, spanning optical microscopy (OM), scanning electron microscopy (SEM) techniques, scanning probe microscopy (SPM, as in scanning tunneling microscopy (STM) or atomic force microscopy (AFM)), and transmission electron microscopy (TEM) variants, including scanning transmission electron microscopy (STEM).

The images obtained with these techniques capture features ranging from local atomic to mesoscale structures (microstructure), the distribution and type of defects, and their dynamics, all of which are critically linked to the functionality and performance of the materials. Over the past few decades, atomic-scale imaging has become widespread and near-routine due to aberration-corrected STEM 209 . The collection of large image datasets is increasingly presenting an analysis bottleneck in the materials characterization pipeline, and automated image analysis is becoming an immediate need. Non-DL image analysis methods have driven tremendous progress in quantitative microscopy, but image processing pipelines are often brittle and require too much manual identification of image features to be broadly applicable. Thus, DL is currently the most promising solution for high-performance, high-throughput automated analysis of image datasets. For a good overview of applications in microstructure characterization specifically, see ref. 210 .

Image datasets for materials can come from either experiments or simulations. Software libraries mentioned above can be used to generate images such as STM/STEM. Images can also be obtained from the literature. A few common examples for image datasets are shown below in Table 4 . Recently, there has been a rapid development in the field of image learning tasks for materials leading to several useful packages. We list some of them in Table 4 .

Applications in image classification and regression

DL for images can be used to automatically extract information from images or transform images into a more useful state. The benefits of automated image analysis include higher throughput, better consistency of measurements compared to manual analysis, and even the ability to measure signals in images that humans cannot detect. The benefits of altering images include image super-resolution, denoising, inferring 3D structure from 2D images, and more. Examples of the applications of each task are summarized below.

Image classification and regression

Classification and regression are the processes of predicting one or more values associated with an image. In the context of DL the only difference between the two methods is that the outputs of classification are discrete while the outputs of regression models are continuous. The same network architecture may be used for both classification and regression by choosing the appropriate activation function (i.e., linear for regression or Softmax for classification) for the output of the network. Due to its simplicity image classification is one of the most established DL techniques available in the materials science literature. Nonetheless, this technique remains an area of active research.
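
The choice of output activation described above can be illustrated with a minimal sketch in plain Python. The function names and the example pre-activation values are hypothetical; the point is only that the same final-layer values can feed either a linear (regression) head or a Softmax (classification) head.

```python
import math

def linear_head(logits):
    """Regression head: identity (linear) activation, outputs are unbounded reals."""
    return list(logits)

def softmax_head(logits):
    """Classification head: Softmax maps pre-activations to class probabilities."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# The same (made-up) final-layer pre-activations passed to both heads.
logits = [2.0, 0.5, -1.0]
regression_output = linear_head(logits)   # continuous values, used as-is
class_probs = softmax_head(logits)        # non-negative values that sum to 1
predicted_class = max(range(len(class_probs)), key=class_probs.__getitem__)
```

In practice the loss function is changed together with the head (for example, a squared-error loss for regression and a cross-entropy loss for classification), which this sketch does not show.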

Modarres et al. applied DL with transfer learning to automatically classify SEM images of different material systems 211 . They demonstrated how a single approach can be used to identify a wide variety of features and material systems such as particles, fibers, Microelectromechanical systems (MEMS) devices, and more. The model achieved 90% accuracy on a test set. Misclassifications resulted from images containing objects from multiple classes, which is an inherent limitation of single-class classification. More advanced techniques such as those described in subsequent sections can be applied to avoid these limitations. Additionally, they developed a system to deploy the trained model at scale to process thousands of images in parallel. This approach is essential for large-scale, high-throughput experiments or industrial applications of classification. ImageNet-based deep transfer learning has also been successfully applied for crack detection in macroscale materials images 212 , 213 , as well as for property prediction on small, noisy, and heterogeneous industrial datasets 214 , 215 .

DL has also been applied to characterize the symmetries of simulated measurements of samples. In ref. 216 , Ziletti et al. obtained a large database of perfect crystal structures, introduced defects into the perfect lattices, and simulated diffraction patterns for each structure. DL models were trained to identify the space group of each diffraction pattern. The model achieved high classification performance, even on crystals with significant numbers of defects, surpassing the performance of conventional algorithms for detecting symmetries from diffraction patterns.

DL has also been applied to classify symmetries in simulated STM measurements of 2D material systems 217 . DFT was used to generate simulated STM images for a variety of material systems. A convolutional neural network was trained to identify which of the five 2D Bravais lattices each material belonged to using the simulated STM image as input. The model achieved an average F1 score of around 0.9 for each lattice type.
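
For readers unfamiliar with the metric, a per-class F1 score can be computed as sketched here. The labels are the five 2D Bravais lattice types; the true and predicted labels are made-up toy values, not data from ref. 217, and the helper name is hypothetical.

```python
BRAVAIS_2D = ["oblique", "rectangular", "centered rectangular", "square", "hexagonal"]

def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1}, where F1 = 2*TP / (2*TP + FP + FN) for each label."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true or predicted samples is given 0.0 by convention here.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

# Toy example: one square lattice is misclassified as hexagonal.
y_true = ["square", "square", "hexagonal", "oblique"]
y_pred = ["square", "hexagonal", "hexagonal", "oblique"]
scores = f1_per_class(y_true, y_pred, BRAVAIS_2D)
```

Averaging the per-class values gives the macro-averaged F1 score; whether ref. 217 reports macro or per-class values is not assumed here.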

DL has also been used to improve the analysis of electron backscatter diffraction (EBSD) data, with Liu et al. 218 presenting one of the first DL-based solutions for EBSD indexing, capable of taking an EBSD image as input and predicting the three Euler angles representing the orientation that would have led to the given EBSD pattern. However, they considered the three Euler angles to be independent of each other, creating separate CNNs for each angle, although the three angles should be considered together. Jha et al. 219 built upon that work to train a single DL model to predict the three Euler angles in simulated EBSD patterns of polycrystalline Ni while directly minimizing the misorientation angle between the true and predicted orientations. When tested on experimental EBSD patterns, the model achieved 16% lower disorientation error than dictionary-based indexing. Similarly, Kaufman et al. trained a CNN to predict the corresponding space group for a given diffraction pattern 220 . This enables EBSD to be used for phase identification in samples where the existing phases are unknown, providing a faster or more cost-effective method of characterization than X-ray or neutron diffraction. The results from these studies demonstrate the promise of applying DL to improve the performance and utility of EBSD experiments.
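
To make concrete why the three Euler angles should be treated jointly, the sketch below computes the rotation angle between two orientations from their rotation matrices, using \(\theta = \arccos ((\mathrm{tr}({R}_{a}^{T}{R}_{b})-1)/2)\). It is an illustration only, not the loss used in ref. 219: the z-x-z composition order is an assumed convention, and crystal symmetry operators (needed for the true disorientation) are ignored.

```python
import math

def rot_z(a):
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def rot_x(a):
    c, s = math.cos(a), math.sin(a)
    return [[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def euler_to_matrix(phi1, Phi, phi2):
    """z-x-z Euler angles (radians) to a rotation matrix (assumed convention)."""
    return matmul(rot_z(phi1), matmul(rot_x(Phi), rot_z(phi2)))

def misorientation_angle(euler_a, euler_b):
    """Rotation angle (radians) between two orientations, ignoring crystal symmetry."""
    ra = euler_to_matrix(*euler_a)
    rb = euler_to_matrix(*euler_b)
    # trace(Ra^T Rb) equals the element-wise (Frobenius) inner product of Ra and Rb.
    trace = sum(ra[i][j] * rb[i][j] for i in range(3) for j in range(3))
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.acos(cos_theta)
```

For example, the Euler triplets (0.3, 0, 0.2) and (0.5, 0, 0) differ in two individual angles but describe the same orientation, so their misorientation angle is zero. An angle-by-angle error would penalize such a prediction, whereas a loss based on the misorientation angle would not.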

Recently, DL has also been used to learn crystal plasticity using images of strain profiles as input 221 , 222 . The work in ref. 221 used domain knowledge integration in the form of two-point auto-correlation to enhance the predictive accuracy, while ref. 222 applied residual learning to learn crystal plasticity at the nanoscale. The latter study used strain profiles of materials of varying sample widths ranging from 2 μm down to 62.5 nm obtained from discrete dislocation dynamics to build a deep residual network capable of identifying prior deformation history of the sample as low, medium, or high. Compared to the correlation function-based method (68.24% accuracy), the DL model was found to be significantly more accurate (92.48%) and also capable of predicting stress-strain curves of test samples. This work additionally used saliency maps to try to interpret the developed DL model.
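
As an illustration of the two-point auto-correlation mentioned above, the following sketch computes it directly for a small binary image with periodic boundaries. The function name and the striped toy image are hypothetical, and the brute-force loops stand in for the FFT-based evaluation normally used on real images; this is not the implementation of ref. 221.

```python
def two_point_autocorrelation(image):
    """Periodic two-point auto-correlation of a binary 2D image.

    stats[dr][dc] is the probability that a pixel and the pixel shifted by
    (dr, dc) both belong to the phase labeled 1.
    """
    rows, cols = len(image), len(image[0])
    n = rows * cols
    stats = [[0.0] * cols for _ in range(rows)]
    for dr in range(rows):
        for dc in range(cols):
            hits = 0
            for r in range(rows):
                for c in range(cols):
                    hits += image[r][c] * image[(r + dr) % rows][(c + dc) % cols]
            stats[dr][dc] = hits / n
    return stats

# Toy microstructure: vertical stripes, phase 1 occupies half of the image.
microstructure = [[1, 1, 0, 0] for _ in range(4)]
stats = two_point_autocorrelation(microstructure)
```

The zero-shift entry equals the volume fraction of the phase (0.5 here), and the entries decay along the direction in which the stripes alternate while staying constant along the stripes.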

Pixelwise learning

DL can also be applied to generate one or more predictions for every pixel in an image. This can provide more detailed information about the size, position, orientation, and morphology of features of interest in images. Thus, pixelwise learning has been a significant area of focus with many recent studies appearing in materials science literature.

Azimi et al. applied an ensemble of fully convolutional neural networks to segment martensite, tempered martensite, bainite, and pearlite in SEM images of carbon steels. Their model achieved 94% accuracy, demonstrating a significant improvement over previous efforts to automate the segmentation of different phases in SEM images. Decost, Francis, and Holm applied PixelNet to segment microstructural constituents in the UltraHigh Carbon Steel Database 223 , 224 . In contrast to fully convolutional neural networks, which encode and decode visual signals using a series of convolution layers, PixelNet constructs “hypercolumns”, or concatenations of feature representations corresponding to each pixel at different layers in a neural network. The hypercolumns are treated as individual feature vectors, which can then be classified using any typical classification approach, like a multilayer perceptron. This approach achieved phase segmentation precision and recall scores of 86.5% and 86.5%, respectively. Additionally, this approach was used to segment spheroidite particles in the matrix, achieving precision and recall scores of 91.1% and 91.1%, respectively.
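
The hypercolumn construction can be sketched as follows. This is a simplified illustration under stated assumptions, not the PixelNet implementation: feature maps are plain nested lists indexed as [channel][row][column], coarser layers are sampled by nearest-neighbor lookup, and all names and values are hypothetical.

```python
def hypercolumn(feature_maps, row, col, height, width):
    """Concatenate the features of one pixel across layers of different resolution.

    feature_maps: list of layers, each indexed as [channel][row][col].
    (row, col): pixel position in the full-resolution image of size height x width.
    """
    vector = []
    for fmap in feature_maps:
        fh, fw = len(fmap[0]), len(fmap[0][0])
        # Nearest-neighbor lookup of the pixel in a possibly coarser feature map.
        fr = row * fh // height
        fc = col * fw // width
        vector.extend(channel[fr][fc] for channel in fmap)
    return vector

# Toy network activations: a 1-channel 4x4 layer and a 2-channel 2x2 layer.
layer1 = [[[r * 4 + c for c in range(4)] for r in range(4)]]
layer2 = [[[0.1, 0.2], [0.3, 0.4]],
          [[1.0, 2.0], [3.0, 4.0]]]
pixel_features = hypercolumn([layer1, layer2], row=3, col=2, height=4, width=4)
```

Each resulting vector (three values in this toy case) can then be passed to any per-pixel classifier, such as a multilayer perceptron.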

Pixelwise DL has also been applied to automatically segment dislocations in Ni superalloys 210 . Dislocations are visually similar to \(\gamma -{\gamma }^{\prime}\) interfaces in Ni superalloys. With limited training data, a single segmentation model could not distinguish between these features. To overcome this, a second model was trained to generate a coarse mask corresponding to the deformed region in the material. Overlaying this mask with predictions from the first model selects the dislocations, enabling them to be distinguished from \(\gamma -{\gamma }^{\prime}\) interfaces.

Stan, Thompson, and Voorhees applied pixelwise DL to characterize dendritic growth from serial sectioning and synchrotron computed tomography data 225 . Both of these techniques generate large amounts of data, making manual analysis impractical. Conventional image processing approaches, utilizing thresholding, edge detectors, or other hand-crafted filters, cannot effectively deal with noise, contrast gradients, and other artifacts that are present in the data. Despite having a small training set of labeled images, SegNet automatically segmented these images with much higher performance than the conventional approaches.

Object/entity recognition, localization, and tracking

Object detection or localization is needed when individual instances of recognized objects in a given image need to be distinguished from each other. In cases where instances do not overlap each other by a significant amount, individual instances can be resolved through post-processing of semantic segmentation outputs. This technique has been applied extensively to detect individual atoms and defects in microstructural images.
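
One simple form of such post-processing is connected-component labeling of the binary segmentation mask, sketched here with a breadth-first flood fill. The function name, the 4-connectivity choice, and the toy mask are assumptions made for illustration; the studies discussed in this section may use other methods, such as watershed segmentation.

```python
from collections import deque

def label_instances(mask):
    """Assign a distinct integer label to each 4-connected foreground region."""
    rows, cols = len(mask), len(mask[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if mask[r][c] and labels[r][c] == 0:
                count += 1                      # start a new instance
                labels[r][c] = count
                queue = deque([(r, c)])
                while queue:
                    cr, cc = queue.popleft()
                    for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                        inside = 0 <= nr < rows and 0 <= nc < cols
                        if inside and mask[nr][nc] and labels[nr][nc] == 0:
                            labels[nr][nc] = count
                            queue.append((nr, nc))
    return labels, count

# Toy semantic segmentation output: 1 marks pixels predicted as "object".
mask = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [1, 0, 0, 1]]
labels, n_instances = label_instances(mask)
```

Because any two touching foreground pixels receive the same label, this approach merges objects that are in contact, which is why it is only suitable when instances do not overlap significantly.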

Madsen et al. applied pixelwise DL to detect atoms in simulated atomic-resolution TEM images of graphene 226 . A neural network was trained to detect the presence of each atom as well as predict its column height. Pixelwise results are used as seeds for watershed segmentation to achieve instance-level detection. Analysis of the arrangement of the atoms led to the autonomous characterization of defects in the lattice structure of the material. Interestingly, despite being trained only on simulations, the model successfully detected atomic positions in experimental images.

Maksov et al. demonstrated atomistic defect recognition and tracking across sequences of atomic-resolution STEM images of WS\(_{2}\) 227 . The lattice structure and defects existing in the first frame were characterized through a physics-based approach utilizing Fourier transforms. The positions of atoms and defects in the first frame were used to train a segmentation model. Despite only using the first frame for training, the model successfully identified and tracked defects in the subsequent frames for each sequence, even when the lattice underwent significant deformation. Similarly, Yang et al. 228 used a U-Net architecture (as shown in Fig. 4 ) to detect vacancies and dopants in WSe\(_{2}\) in STEM images with model accuracy of up to 98%. They classified the possible atomic sites based on experimental observations into five different types: tungsten, vanadium substituting for tungsten, selenium with no vacancy, mono-vacancy of selenium, and di-vacancy of selenium.

Fig. 4 a Deep neural networks U-Net model constructed for quantification analysis of annular dark-field in the scanning transmission electron microscope (ADF-STEM) image of V-WSe\(_{2}\). b Examples of training dataset for deep learning of atom segmentation model for five different species. c Pixel-level accuracy of the atom segmentation model as a function of training epoch. d Measurement accuracy of the segmentation model compared with human-based measurements. Scale bars are 1 nm [Reprinted according to the terms of the CC-BY license ref. 228 ].

Roberts et al. developed DefectSegNet to automatically identify defects in transmission and STEM images of steel including dislocations, precipitates, and voids 229 . They provide detailed information on the model’s design, training, and evaluation. They also compare measurements generated from the model to manual measurements performed by several different human experts, demonstrating that the measurements generated by DL are quantitatively more accurate and consistent.

Kusche et al. applied DL to localize defects in panoramic SEM images of dual-phase steel 230 . Manual thresholding was applied to identify dark defects against the brighter matrix. Regions containing defects were classified via two neural networks. The first neural network distinguished between inclusions and ductile damage in the material. The second classified the type of ductile damage (e.g., notching or martensite cracking). Each defect was also segmented via a watershed algorithm to obtain detailed information on its size, position, and morphology.

Applying DL to localize defects and atomic structures is a popular area in materials science research. Thus, several other recent studies on these applications can be found in the literature 231 , 232 , 233 , 234 .

In the above examples, pixelwise DL or classification models are combined with image analysis to distinguish individual instances of detected objects. However, when several adjacent objects of the same class touch or overlap each other in the image, this approach will falsely detect them to be a single, larger object. In this case, DL models designed for detection or instance segmentation can be used to resolve overlapping instances. In one such study, Cohn and Holm applied DL for instance-level segmentation of individual particles and satellites in dense powder images 235 . Segmenting each particle allows for computer vision to generate detailed size and morphology information which can be used to supplement experimental powder characterization for additive manufacturing. Additionally, overlaying the powder and satellite masks yielded the first method for quantifying the satellite content of powder samples, which cannot be measured experimentally.

Super-resolution imaging and auto-tuning experimental parameters

The studies listed so far focus on automating the analysis of existing data after it has been collected experimentally. However, DL can also be applied during experiments to improve the quality of the data itself. This can reduce the time for data collection or improve the amount of information captured in each image. Super-resolution and other DL techniques can also be applied in situ to autonomously adjust experimental parameters.

Recording high-resolution electron microscope images often requires large dwell times, limiting the throughput of microscopy experiments. Additionally, during imaging, interactions between the electron beam and a microscopy sample can result in undesirable effects, including charging of non-conductive samples and damage to sensitive samples. Thus, there is interest in using DL to artificially increase the resolution of images without introducing these artifacts. One method of interest is applying generative adversarial networks (GANs) for this application.

De Haan et al. recorded SEM images of the same regions of interest in carbon samples containing gold nanoparticles at two resolutions 236 . The low-resolution images were used as inputs to a GAN. The corresponding images with twice the resolution were used as the ground truth. After training, the GAN reduced the number of undetected gaps between nanoparticles from 13.9 to 3.7%, indicating that super-resolution was successful. Thus, applying DL led to a four-fold reduction of the interaction time between the electron beam and the sample.

Ede and Beanland collected a dataset of STEM images of different samples 237 . Images were subsampled with spiral and ‘jittered’ grid masks to obtain partial images with resolutions reduced by a factor up to 100. A GAN was trained to reconstruct full images from their corresponding partial images. The results indicated that despite a significant reduction in the sampling area, this approach successfully reconstructed high-resolution images with relatively small errors.

DL has also been applied to automated tip conditioning for SPM experiments. Rashidi and Wolkow trained a model to detect artifacts in SPM measurements resulting from degradation in tip quality 238 . Using an ensemble of convolutional neural networks resulted in 99% accuracy. After detecting that a tip has degraded, the SPM was configured to automatically recondition the tip in situ until the network indicated that the atomic sharpness of the tip has been restored. Monitoring and reconditioning the tip is the most time and labor-intensive part of conducting SPM experiments. Thus, automating this process through DL can increase the throughput and decrease the cost of collecting data through SPM.

In addition to materials characterization, DL can be applied to autonomously adjust parameters during manufacturing. Scime et al. mounted a camera to multiple 3D printers 239 . Images of the build plate were recorded throughout the printing process. A dynamic segmentation convolutional neural network was trained to recognize defects such as recoater streaking, incomplete spreading, spatter, porosity, and others. The trained model achieved high performance and was transferable to multiple printers from three different methods of additive manufacturing. This work is the first step to enabling smart additive manufacturing machines that can correct defects and adjust parameters during printing.

There is also growing interest in establishing instruments and laboratories for autonomous experimentation. Eppel et al. trained multiple models to detect chemicals, materials, and transparent vessels in a chemistry lab setting 240 . This study provides a rigorous analysis of several different approaches for scene understanding. Models were trained to characterize laboratory scenes with different methods including semantic segmentation and instance segmentation, both with and without overlapping instances. The models successfully detected individual vessels and materials in a variety of settings. Finer-grained understanding of the contents of vessels, such as segmentation of individual phases in multi-phase systems, was limited, outlining the path for future work in this area. The results represent an important step towards realizing automated experimentation for laboratory-scale experiments.

Microstructure representation learning

Materials microstructure is often represented in the form of multi-phase, high-dimensional 2D/3D images and thus can readily leverage image-based DL methods to learn robust, low-dimensional microstructure representations. These representations can subsequently be used for building predictive and generative models to learn forward and inverse structure-property linkages, which are typically studied across different length scales (multi-scale modeling). In this context, homogenization and localization refer to the transfer of information from lower length scales to higher length scales and vice versa, respectively. DL using customized CNNs has been used both for homogenization, i.e., predicting the macroscale property of material given its microstructure information 221 , 241 , 242 , as well as for localization, i.e., predicting the strain distribution across a given microstructure for a loading condition 243 .

Transfer learning has also been widely used for analyzing materials microstructure images; methods for improving the use of transfer learning in materials science applications remain an area of active research. Goetz et al. investigated the use of unsupervised domain adaptation as an alternative to simply fine-tuning a pre-trained model 244 . In this technique, a model is first trained on a labeled dataset in the source domain. Next, a discriminator model is used to train the model to generate domain-agnostic features. Compared to simple fine-tuning, unsupervised domain adaptation improved the performance of classification and segmentation neural networks on materials science datasets. However, it was determined that the highest performance was achieved when the source domain was more visually similar to the target (for example, using a different set of microstructural images instead of ImageNet). This highlights the utility of establishing large, publicly available datasets of annotated images in materials science.

Kitahara and Holm used the output of an intermediate layer of a pre-trained convolutional neural network as a feature representation for images of steel surface defects and Inconel fracture surfaces 245 . Images were classified by defect type or fracture surface orientation using unsupervised DL. Even though no labeled data was used to train the neural network or the unsupervised classifier, the model found natural decision boundaries that achieved a classification performance of 98% and 88% for the defect classes and fracture surface orientations, respectively. Visualization of the representations through principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) provided qualitative insights into the representations. Although the detailed physical interpretation of the representations is still a distant goal, this study provides tools for investigating patterns in visual signals contained in image-based datasets in materials science.

Larmuseau et al. investigated the use of triplet networks to obtain consistent representations for visually similar images of materials 246 . Triplet networks are trained with three images at a time. The first image, the reference, is classified by the network. The second image, called the positive, is another image with the same class label. The last image, called the negative, is an image from a separate class. During training, the loss function includes errors in predicting the class of the reference image, the difference in representations of the reference and positive images, and the similarity in representations of the reference and negative images. This process allows the network to learn consistent representations for images in the same class while distinguishing images from different classes. The triplet network outperformed an ordinary convolutional neural network trained for image classification on the same dataset.
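
A loss with the three ingredients described above can be sketched as a classification term plus a margin-based triplet term. The margin formulation, the squared Euclidean distance, the weighting factor, and all function names are assumptions chosen for illustration; the exact loss used in ref. 246 may differ.

```python
import math

def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_term(ref, pos, neg, margin=1.0):
    """Zero once the negative is farther from the reference than the positive by `margin`."""
    return max(0.0, squared_distance(ref, pos) - squared_distance(ref, neg) + margin)

def cross_entropy(class_probs, true_class):
    """Classification error for the reference image."""
    return -math.log(class_probs[true_class])

def combined_loss(ref, pos, neg, ref_probs, ref_class, margin=1.0, weight=1.0):
    """Classification loss on the reference plus a weighted triplet term."""
    return cross_entropy(ref_probs, ref_class) + weight * triplet_term(ref, pos, neg, margin)

# Toy 2D representations: a well-separated triplet and a badly ordered one.
good = triplet_term([0.0, 0.0], [0.1, 0.0], [2.0, 0.0])   # positive close, negative far
bad = triplet_term([0.0, 0.0], [2.0, 0.0], [0.1, 0.0])    # positive far, negative close
```

The triplet term vanishes for the well-separated triplet and is large for the badly ordered one, so minimizing it pulls same-class representations together and pushes different-class representations apart.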

In addition to investigating representations used to analyze existing images, DL can generate synthetic images of materials systems. Generative Adversarial Networks (GANs) are currently the predominant method for synthetic microstructure generation. GANs consist of a generator, which creates a synthetic microstructure image, and a discriminator, which attempts to predict if a given input image is real or synthetic. With careful application, GANs can be a powerful tool for microstructure representation learning and design.

Yang and Li et al. 247 , 248 developed a GAN-based model for learning a low-dimensional embedding of microstructures, which could then be easily sampled and used with the generator of the GAN model to generate realistic, statistically similar microstructure images, thus enabling microstructural materials design. The model was able to capture complex, nonlinear microstructure characteristics and learn the mapping between the latent design variables and microstructures. In order to close the loop, the method was combined with a Bayesian optimization approach to design microstructures with optimal optical absorption performance. The discovered microstructures were found to have up to 17% better property than randomly sampled microstructures. The unique architecture of their GAN model also facilitated generator scalability to generate arbitrary-sized microstructure images and discriminator transferability to build structure-property prediction models. Yang et al. 249 recently combined GANs with MDNs (mixture density networks) to enable inverse modeling in microstructural materials design, i.e., generate the microstructure for a given desired property.

Hsu et al. constructed a GAN to generate 3D synthetic solid oxide fuel cell microstructures 250 . These microstructures were compared to other synthetic microstructures generated by DREAM.3D as well as experimentally observed microstructures measured via sectioning and imaging with PFIB-SEM. Synthetic microstructures generated from the GAN were observed to qualitatively show better agreement to the experimental microstructures than the DREAM.3D microstructures, as evidenced by the more realistic phase connectivity and lower amount of agglomeration of solid phases. Additionally, a statistical analysis of various features such as volume fraction, particle size, and several other quantities demonstrated that the GAN microstructures were quantitatively more similar to the real microstructures than the DREAM.3D microstructures.

In a similar study, Chun et al. generated synthetic microstructures of high energy materials using a GAN 251 . Once again, a synthetic microstructure generated via GAN showed better qualitative visual similarity to an experimentally observed microstructure compared to a synthetic microstructure generated via a transfer learning approach, with sharper phase boundaries and fewer computational artifacts. Additionally, a statistical analysis of the void size, aspect ratio, and orientation distributions indicated that the GAN produced microstructures that were quantitatively more similar to real materials.

Applications of DL to microstructure representation learning can help researchers improve the performance of predictive models used for the applications listed above. Additionally, generative models can produce more realistic simulated microstructures. This can help researchers develop more accurate models for predicting material properties and performance without needing to synthesize and process these materials, significantly increasing the throughput of materials selection and screening experiments.

Mesoscale modeling applications

In addition to image-based characterization, deep learning methods are increasingly used in mesoscale modeling. Dai et al. 252 successfully trained a GNN to predict magnetostriction in a wide range of synthetic polycrystalline systems with around 10% prediction error. The microstructure is represented by a graph where each node corresponds to a single grain, and the edges between nodes indicate an interface between neighboring grains. Five node features (3 Euler angles, volume, and the number of neighbors) were associated with each grain. The GNN outperformed other machine learning approaches for property prediction of polycrystalline materials by accounting for interactions between neighboring grains.
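
The graph representation described above can be sketched as follows. The grain identifiers, feature values, and the fixed averaging rule are hypothetical; a trained GNN such as that of ref. 252 would use learned weights rather than the simple neighbor average shown here.

```python
def build_neighbors(grain_ids, interfaces):
    """Adjacency sets: an edge joins two grains that share an interface."""
    neighbors = {g: set() for g in grain_ids}
    for a, b in interfaces:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors

def node_features(euler_volume, neighbors):
    """Five features per grain: 3 Euler angles, volume, number of neighbors."""
    return {g: list(values) + [float(len(neighbors[g]))]
            for g, values in euler_volume.items()}

def message_passing_step(features, neighbors):
    """Mix each grain's features with the mean of its neighbors' features."""
    updated = {}
    for grain, own in features.items():
        nbrs = neighbors[grain]
        if nbrs:
            mean = [sum(features[n][i] for n in nbrs) / len(nbrs) for i in range(len(own))]
        else:
            mean = [0.0] * len(own)
        updated[grain] = [0.5 * (o + m) for o, m in zip(own, mean)]
    return updated

# Toy polycrystal: three grains in a row, g0 - g1 - g2.
euler_volume = {"g0": (0.0, 0.0, 0.0, 2.0),
                "g1": (0.2, 0.4, 0.6, 4.0),
                "g2": (0.4, 0.8, 1.2, 6.0)}
neighbors = build_neighbors(euler_volume, [("g0", "g1"), ("g1", "g2")])
features = node_features(euler_volume, neighbors)
updated = message_passing_step(features, neighbors)
```

Each application of the step lets information travel one interface further, so repeating it k times gives every grain access to its k-th neighborhood shell.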

Similarly, Cohn and Holm present preliminary work applying GNNs to predict the occurrence of abnormal grain growth (AGG) in Monte Carlo simulations of microstructure evolution 253 . AGG appears to be stochastic, making it notoriously difficult to predict, control, and even observe experimentally in some materials. AGG has been reproduced in Monte Carlo simulations of material systems, but a model that can predict which initial microstructures will undergo AGG has not been established before. A dataset of Monte Carlo simulations was created using SPPARKS 254 , 255 . A microstructure GNN was trained to predict AGG in individual simulations, with 75% classification accuracy. In comparison, an image-based model achieved only 60% accuracy. The GNN also provided physical insight into AGG and indicated that only 2 neighborhood shells are needed to achieve the maximum performance achieved in the study. These early results motivate additional work on applying GNNs to predict the occurrence of AGG in both simulated and real materials during processing.

Natural language processing

Most of the existing knowledge in the materials domain is currently unavailable as structured information and only exists as unstructured text, tables, or images in various publications. There exists a great opportunity to use natural language processing (NLP) techniques to convert text to structured data or to directly learn and make inferences from the text information. However, as a relatively new field within materials science, many challenges remain unsolved in this domain, such as resolving dependencies between words and phrases across multiple sentences and paragraphs.

Datasets for NLP

Datasets relevant to natural language processing include peer-reviewed journal articles, articles published on preprint servers such as arXiv or ChemRxiv, patents, and online material such as Wikipedia. Unfortunately, accessing or parsing most such datasets remains difficult. Peer-reviewed journal articles are typically subject to copyright restrictions and thus difficult to obtain, especially in the large numbers required for machine learning. Many publishers now offer text and data mining (TDM) agreements that can be signed online, allowing at least a limited, restricted amount of work to be performed. However, gaining access to the full text of many publications still typically requires strict and dedicated agreements with each publisher. The major advantage of working with publishers is that they have often already converted the articles from a document format such as PDF into an easy-to-parse format such as HyperText Markup Language (HTML). In contrast, articles on preprint servers and patents are typically available with fewer restrictions, but are commonly available only as PDF files. It remains difficult to properly parse text from PDF files in a reliable manner, even when the text is embedded in the PDF. Therefore, new tools that can easily and automatically convert such content into well-structured HTML format with few residual errors would likely have a major impact on the field. Finally, online sources of information such as Wikipedia can serve as another type of data source. However, such online sources are often more difficult to verify in terms of accuracy and also do not contain as much domain-specific information as the research literature.

Software libraries for NLP

Applying NLP to a raw dataset involves multiple steps. These steps include retrieving the data, various forms of “pre-processing” (sentence and word tokenization, word stemming and lemmatization, featurization such as word vectors or part of speech tagging), and finally machine learning for information extraction (e.g., named entity recognition, entity-relationship modeling, question and answer, or others). Multiple software libraries exist to aid in materials NLP, as described in Table 5 . We note that although many of these steps can in theory be performed by general-purpose NLP libraries such as NLTK 256 , SpaCy 257 , or AllenNLP 258 , the specialized nature of chemistry and materials science text (including the presence of complex chemical formulas) often leads to errors. For example, researchers have developed specialized codes to perform preprocessing that better detect chemical formulas (and not split them into separate tokens or apply stemming/lemmatization to them) and scientific phrases and notation such as oxidation states or symbols for physical units.
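
The tokenization issue can be illustrated with a short sketch using regular expressions. Both tokenizers here are deliberately simple stand-ins: the naive pattern does not reproduce the behavior of NLTK, SpaCy, or AllenNLP, and the formula pattern is a hypothetical heuristic rather than the approach of any library listed in Table 5.

```python
import re

# Naive stand-in: runs of letters, numbers, or single punctuation marks.
NAIVE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")

# Heuristic: two or more element-like units (capital letter, optional lowercase
# letter, optional count) are kept together as one formula token.
CHEM_AWARE = re.compile(
    r"\b(?:[A-Z][a-z]?(?:\d+(?:\.\d+)?)?){2,}\b"  # e.g. TiO2, LiFePO4, Li0.5CoO2
    r"|[A-Za-z]+|\d+(?:\.\d+)?|\S"
)

def tokenize(text, pattern):
    return pattern.findall(text)

sentence = "LiFePO4 and TiO2 were annealed at 600 C."
naive_tokens = tokenize(sentence, NAIVE)
chem_tokens = tokenize(sentence, CHEM_AWARE)
```

The naive pattern breaks each formula into a letter run and a number, whereas the chemistry-aware pattern keeps the formulas whole. The heuristic also treats all-capital abbreviations as formula-like, which is harmless for tokenization but shows why dedicated chemistry-aware tools are preferred in practice.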

Similarly, chemistry-specific codes for extracting entities are better at extracting the names of chemical elements (e.g., recognizing that “He” likely represents helium and not a male pronoun) and abbreviations for chemical formulas. Finally, word embeddings that convert words such as “manganese” into numerical vectors for further data mining are more informative when trained specifically on materials science text versus more generic texts, even when the latter datasets are larger 259 . Thus, domain-specific tools for NLP are required in nearly all aspects of the pipeline. The main exception is that the architectures of the specific neural network models used for information extraction (e.g., LSTM, BERT, or architectures used to generate word embeddings such as word2vec or GloVe) are typically not modified specifically for the materials domain. Thus, much of the materials and chemistry-centric work currently concerns data retrieval and appropriate preprocessing. A longer discussion of this topic, with specific examples, can be found in refs. 260 , 261 .

NLP methods for materials have been applied for information extraction and search (particularly as applied to synthesis prediction) as well as materials discovery. As the domain is rapidly growing, we suggest dedicated reviews on this topic by Olivetti et al. 261 and Kononova et al. 260 for more information.

One of the major uses of NLP methods is to extract datasets from the text in published studies. Conventionally, compiling such datasets required manual entry by researchers combing the literature, a laborious and time-consuming process. Recently, software tools such as ChemDataExtractor 262 and other methods 263 based on more conventional machine learning and rule-based approaches have enabled automated or semi-automated extraction of datasets such as Curie and Néel magnetic phase transition temperatures 264 , battery properties 265 , UV-vis spectra 266 , and surface and pore characteristics of metal-organic frameworks 267 . In the past few years, DL approaches such as LSTMs and transformer-based models have been employed to extract various categories of information 268 , and in particular materials synthesis information 269 , 270 , 271 from text sources. Such data have been used to predict synthesis maps for titania nanotubes 272 , various binary and ternary oxides 273 , and perovskites 274 .

Databases based on natural language processing have also been used to train machine learning models to identify materials with useful functional properties, such as the recent discovery of the large magnetocaloric properties of HoBe 2 275 . Similarly, Cooper et al. 276 demonstrated a “design to device approach” for designing dye-sensitized solar cells that are co-sensitized with two dyes. This study used automated text mining to compile a list of candidate dyes for the application along with measured properties such as maximum absorption wavelengths and extinction coefficients. The resulting list of 9431 dyes extracted from the literature was downselected to 309 candidates using various criteria such as molecular structure and ability to absorb in the solar spectrum. These candidates were evaluated for suitable combinations for co-sensitization, yielding 33 dyes that were further downselected using density functional theory calculations and experimental constraints. The resulting 5 dyes were evaluated experimentally, both individually and in combinations, resulting in a combination of dyes that not only outperformed any of the individual dyes but also demonstrated performance comparable to an existing standard material. This study demonstrates the possibility of using literature-based extraction to identify materials candidates for new applications from the vast body of published work, which may have never tested those materials for the desired application.

It is even possible that natural language processing can directly make materials predictions without intermediary models. In a study reported by Tshitoyan et al. 259 (as shown in Fig. 5 ), word embeddings (i.e., numerical vectors representing distinct words) trained on materials science literature could directly predict materials applications through a simple dot product between the trained embedding for a composition word (such as PbTe) and an application word (such as thermoelectrics). The researchers demonstrated that such an approach, if applied in the past using historical data, could have predicted many thermoelectric materials that were reported only later; they also presented a list of potentially interesting thermoelectric compositions using the known literature at the time. Since then, several of these predictions have been tested either computationally 277 , 278 , 279 , 280 , 281 , 282 or experimentally 283 as potential thermoelectrics. Such approaches have recently been applied to search for understudied areas of metallocene catalysis 284 , although challenges still remain in such direct approaches to materials prediction.
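The dot-product ranking described above can be sketched as follows. The three-dimensional vectors are made-up numbers for illustration only (real embeddings have hundreds of dimensions and are learned from a corpus), and normalizing the dot product to a cosine similarity is an assumption of this sketch.

```python
import math


def cosine(u, v):
    """Dot product of two vectors after normalizing each to unit length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; the values are invented for this example.
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "PbTe": [0.8, 0.2, 0.1],
    "Bi2Te3": [0.6, 0.1, 0.5],
    "NaCl": [0.1, 0.9, 0.2],
}


def rank_candidates(application, candidates, emb):
    """Sort candidate composition words by similarity to an application word."""
    target = emb[application]
    scored = [(name, cosine(emb[name], target)) for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_candidates("thermoelectric", ["NaCl", "Bi2Te3", "PbTe"], embeddings)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Compositions that rank highly but have not yet been reported for the application are the candidates such a screen would flag for follow-up.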

Fig. 5

a Network for training word embeddings for natural language processing applications. A one-hot encoded vector at left represents each distinct word in the corpus; the role of the hidden layer is to predict the probability of neighboring words in the corpus. This network structure trains a relatively small hidden layer of 100–200 neurons to contain information on the context of words in the entire corpus, with the result that similar words end up with similar hidden layer weights (word embeddings). Such word embeddings can transform words in text form into numerical vectors that may be useful for a variety of applications. b Projection of word embeddings for various materials science words, as trained on a corpus of scientific abstracts, into two dimensions using principal component analysis. Without any explicit training, the word embeddings naturally preserve relationships between chemical formulas, their common oxides, and their ground state structures. [Reprinted from ref. 259 under the terms of the CC-BY license].

Uncertainty quantification

Uncertainty quantification (UQ) is an essential step in evaluating the robustness of DL. Specifically, DL models have been criticized for lack of robustness, interpretability, and reliability, and the addition of carefully quantified uncertainties would go a long way towards addressing such shortcomings. While most of the focus in the DL field currently goes into developing new algorithms or training networks to high accuracy, there is increasing attention to UQ, as exemplified by the detailed review of Abdar et al. 285 . However, determining the uncertainty associated with DL predictions is still challenging and far from a solved problem.

The main drawback to estimating uncertainty when performing DL is that most of the currently available UQ implementations do not work for arbitrary, off-the-shelf models without retraining or redesign. Bayesian NNs are the exception; however, they require significant modifications to the training procedure, are computationally expensive compared to non-Bayesian NNs, and become increasingly inefficient as the dataset size grows. A considerable fraction of the current research in DL UQ focuses on exactly this issue: how to evaluate uncertainty without requiring computationally expensive retraining or DL code modifications. An example of such an effort is the work of Mi et al. 286 , where three scalable methods are explored to evaluate the variance of the output from a trained NN without requiring any retraining. Another example is Teye, Azizpour, and Smith’s exploration of the use of batch normalization as a way to approximate inference in Bayesian models 287 .

Before reviewing the most common methods used to evaluate uncertainty in DL, let us briefly point out key reasons to add UQ to DL modeling. Reaching high accuracy when training DL models implicitly assumes the availability of a sufficiently large and diverse training dataset. Unfortunately, this rarely occurs in material discovery applications 288 . ML/DL models are prone to perform poorly on extrapolation 289 . It is also extremely difficult for ML/DL models to recognize ambiguous samples 290 . In general, determining the amount of data necessary to train a DL model to the required accuracy is a challenging problem. Careful evaluation of the uncertainty associated with DL predictions would not only increase the reliability of predicted results but would also provide guidance on estimating the needed training dataset size, as well as suggesting what new data should be added to reach the target accuracy (uncertainty-guided decision). Zhang, Kailkhura, and Han’s work emphasizes how including a UQ-motivated reject option in the DL model substantially improves performance on the remaining material data 288 . Such a reject option is associated with the detection of out-of-distribution samples, which is only possible through UQ analysis of the predicted results.
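A reject option of the kind mentioned above can be sketched in a few lines. This is a minimal illustration, not the method of ref. 288: the sample identifiers, predicted values, uncertainties, and the threshold are all hypothetical.

```python
def split_by_uncertainty(predictions, threshold):
    """Separate (id, mean, std) predictions into accepted and rejected lists.

    A prediction is rejected when its estimated standard deviation exceeds
    the threshold, i.e. when the model is too uncertain to be trusted.
    """
    accepted, rejected = [], []
    for sample_id, mean, std in predictions:
        bucket = accepted if std <= threshold else rejected
        bucket.append((sample_id, mean, std))
    return accepted, rejected


# Hypothetical predictions: (sample id, predicted value, estimated std).
predictions = [
    ("mat-1", 1.20, 0.05),
    ("mat-2", 0.85, 0.40),  # large uncertainty, likely out of distribution
    ("mat-3", 2.10, 0.08),
]
accepted, rejected = split_by_uncertainty(predictions, threshold=0.10)
print("accepted:", [p[0] for p in accepted])
print("rejected:", [p[0] for p in rejected])
```

Rejected samples are natural candidates for new measurements or calculations, which is one way the uncertainty-guided decision described above could be put into practice.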

Two different uncertainty types are associated with each ML prediction: epistemic uncertainty and aleatory uncertainty. Epistemic uncertainty is related to insufficient training data in part of the input domain. As mentioned above, while DL models are very effective at interpolation tasks, they can have more difficulty with extrapolation. Therefore, it is vital to quantify the lack of accuracy due to localized, insufficient training data. Aleatory uncertainty, instead, is related to parameters not included in the model. It arises when training data that the DL model perceives as very similar are associated with different outputs because of features missing from the model. Ideally, we would like UQ methodologies to distinguish and quantify both types of uncertainty separately.

The most common approaches to evaluate uncertainty in DL are Dropout methods, Deep Ensemble methods, Quantile regression, and Gaussian Processes. Dropout methods are commonly used to avoid overfitting. In this type of approach, network nodes are disabled randomly during training, resulting in the evaluation of a different subset of the network at each training step. When a similar randomization procedure is also applied to the prediction procedure, the methodology becomes Monte-Carlo dropout 291 . Repeating such randomization multiple times produces a distribution over the outputs, from which the mean and variance are determined for each prediction. Gal and Ghahramani 292 likewise use a dropout approach to approximate Bayesian inference in deep Gaussian processes.
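The prediction-time procedure of Monte-Carlo dropout can be sketched without any DL framework. The network below is a hypothetical one-input, four-hidden-unit model with hand-picked weights standing in for a trained network; the training step is omitted, and the dropout rate and sample count are arbitrary choices.

```python
import random
import statistics

# Hypothetical weights standing in for a trained network (1 input, 4 hidden, 1 output).
W1 = [0.5, -0.3, 0.8, 0.1]
B1 = [0.0, 0.1, -0.2, 0.3]
W2 = [0.7, -0.5, 0.2, 0.9]


def forward(x, rng, p_drop=0.5):
    """One stochastic forward pass with dropout kept active at prediction time."""
    out = 0.0
    for w1, b1, w2 in zip(W1, B1, W2):
        hidden = max(0.0, w1 * x + b1)  # ReLU activation
        if rng.random() < p_drop:
            continue  # this hidden unit is dropped for the current pass
        out += w2 * hidden / (1.0 - p_drop)  # inverted-dropout rescaling
    return out


def mc_dropout_predict(x, n_samples=200, seed=0):
    """Mean and variance of the output over repeated random dropout masks."""
    rng = random.Random(seed)
    samples = [forward(x, rng) for _ in range(n_samples)]
    return statistics.mean(samples), statistics.variance(samples)


mean, var = mc_dropout_predict(1.0)
print(f"mean={mean:.3f}, variance={var:.3f}")
```

The rescaling by `1 / (1 - p_drop)` keeps the expected output equal to that of the full network, so the sample mean approximates the deterministic prediction while the sample variance serves as the uncertainty estimate.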

Deep ensemble methodologies 293 , 294 , 295 , 296 combine deep learning modelling with ensemble learning. Ensemble methods utilize multiple models and different random initializations to improve predictive performance. Because multiple predictions are made, statistical distributions of the outputs are generated. By combining these results into a Gaussian distribution, confidence intervals are obtained from the variance. Such a multi-model strategy allows the evaluation of aleatory uncertainty when sufficient training data are provided. For areas without sufficient data, the predicted mean and variance will not be accurate, but the expectation is that a very large variance should be estimated, clearly indicating untrustworthy predictions. Monte-Carlo Dropout and Deep Ensembles approaches can be combined to further improve confidence in the predicted outputs.
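The step from ensemble predictions to a Gaussian confidence interval can be sketched as below. The member predictions are hypothetical numbers standing in for the outputs of independently initialized networks on one input, and the factor 1.96 assumes a two-sided 95% interval under a Gaussian approximation.

```python
import statistics


def ensemble_interval(member_predictions, z=1.96):
    """Mean and Gaussian confidence interval from ensemble member outputs."""
    mean = statistics.mean(member_predictions)
    std = statistics.stdev(member_predictions)  # sample standard deviation
    return mean, (mean - z * std, mean + z * std)


# Hypothetical outputs of five ensemble members for a well-sampled input ...
in_domain = [1.0, 1.2, 0.9, 1.1, 1.0]
# ... and for an input far from the training data, where members disagree.
out_of_domain = [0.2, 2.5, -1.0, 3.1, 0.9]

for label, preds in [("in-domain", in_domain), ("out-of-domain", out_of_domain)]:
    mean, (low, high) = ensemble_interval(preds)
    print(f"{label}: mean={mean:.2f}, interval=({low:.2f}, {high:.2f})")
```

The much wider interval for the second input is the behavior the text expects in regions without sufficient training data.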

Quantile regression can be utilized with DL 297 . In this approach, the loss function is constructed so that the model predicts a chosen quantile a (between 0 and 1). A choice of a = 0.5 corresponds to minimizing the Mean Absolute Error (MAE) and predicting the median of the distribution. Predicting for two more quantile values (a_min and a_max) determines confidence intervals with nominal coverage a_max − a_min. For instance, predicting for a_min = 0.1 and a_max = 0.8 produces confidence intervals covering 70% of the population. The largest drawback of using quantiles to estimate prediction intervals is the need to run the model three times, once for each quantile needed. However, a recent implementation in TensorFlow allows multiple quantiles to be obtained simultaneously in one run.
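The quantile loss referred to above is commonly written as the pinball loss; the sketch below assumes that standard form, since the source does not spell it out. To keep the example framework-free, the "model" is a single constant prediction chosen from the observed values, and the data are a toy sequence.

```python
def pinball_loss(y_true, y_pred, a):
    """Mean quantile (pinball) loss for quantile level a in (0, 1)."""
    total = 0.0
    for y, q in zip(y_true, y_pred):
        diff = y - q
        # Under-prediction is weighted by a, over-prediction by (1 - a).
        total += a * diff if diff >= 0 else (a - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, a):
    """Observed value that minimizes the pinball loss as a constant prediction."""
    return min(y_true, key=lambda c: pinball_loss(y_true, [c] * len(y_true), a))


y = [float(v) for v in range(1, 12)]  # toy data: 1.0, 2.0, ..., 11.0

median = best_constant(y, 0.5)  # a = 0.5 recovers the median
lower = best_constant(y, 0.1)   # lower quantile level from the text's example
upper = best_constant(y, 0.8)   # upper quantile level from the text's example
print(f"median={median}, interval=({lower}, {upper})")
```

At a = 0.5 the pinball loss equals half the MAE, which is why minimizing it yields the median; the other two levels bracket an interval with nominal coverage of 0.8 − 0.1 = 0.7.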

Lastly, Gaussian Processes (GP) can be used within a DL approach as well and have the side benefit of providing UQ information at no extra cost. Gaussian processes are a family of infinite-dimensional multivariate Gaussian distributions completely specified by a mean function and a flexible kernel function (prior distribution). By optimizing such functions to fit the training data, the posterior distribution is determined, which is later used to predict outputs for inputs not included in the training set. Because the prior is a Gaussian process, the posterior distribution is Gaussian as well 298 , thus providing mean and variance information for each prediction. However, in practice standard kernels under-perform 299 . In 2016, Wilson et al. 300 suggested processing inputs through a neural network prior to a Gaussian process model. This procedure could extract high-level patterns and features, but required careful design and optimization. In general, Deep Gaussian processes improve the performance of Gaussian processes by mapping the inputs through multiple Gaussian process ‘layers’. Several groups have followed this avenue and further perfected such an approach (ref. 299 and references within). A common drawback of Bayesian methods is a prohibitive computational cost when dealing with large datasets 292 .
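How a Gaussian process returns both a mean and a variance can be seen in a minimal one-dimensional sketch. It assumes a zero mean function, a squared-exponential kernel with a fixed length scale, and a small noise term for numerical stability; the kernel hyperparameters are not optimized here, and the training points are toy values.

```python
import math


def rbf(x1, x2, length=1.0):
    """Squared-exponential kernel with unit signal variance."""
    return math.exp(-0.5 * ((x1 - x2) / length) ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def gp_predict(x_train, y_train, x_new, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at a single new input."""
    K = [[rbf(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(x_train)] for i, a in enumerate(x_train)]
    k_star = [rbf(x_new, a) for a in x_train]
    alpha = solve(K, y_train)  # K^{-1} y
    v = solve(K, k_star)       # K^{-1} k_*
    mean = sum(k * a for k, a in zip(k_star, alpha))
    var = rbf(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(var, 0.0)  # clamp tiny negative round-off


x_train, y_train = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
print(gp_predict(x_train, y_train, 1.0))   # at a training point
print(gp_predict(x_train, y_train, 10.0))  # far from all training data
```

Near a training point the variance collapses towards the noise level, while far from the data it returns to the prior variance, which is the built-in uncertainty signal the text refers to.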

Limitations and challenges

Although DL methods offer many promising opportunities for materials design, they have several limitations and there is much room for improvement. Assessing the reliability and quality of datasets used in DL tasks is challenging because there is either a lack of ground truth data, or there are not enough metrics for global comparison, or datasets generated with similar or identical set-ups may not be reproducible 301 . This poses an important challenge in relying upon DL-based prediction.

Material representations based on chemical formula alone by definition do not consider structure, which on the one hand makes them more amenable to new compounds for which structure information may not be available, but on the other hand makes it impossible for them to capture phenomena such as phase transitions. Properties of materials depend sensitively on structure, to the extent that they can be quite opposite depending on the atomic arrangement, as with diamond (hard, wide-band-gap insulator) and graphite (soft, semi-metal). It is thus not a surprise that chemical formula-based methods may not be adequate in some cases 159 .

Atomistic graph-based predictions, although considered a full atomistic description, are tested on bulk materials only and not for defective systems or for multi-dimensional phase space exploration such as using genetic algorithms. In general, this underscores that the input features must be predictive for the output labels and not be missing some key information. Although atomistic graph neural network models such as the atomistic line graph neural network (ALIGNN) have achieved remarkable accuracy compared to previous atomistic-based models, the model errors still need to be brought down further to reach something resembling deep learning ‘chemical-accuracies.’

In terms of images and spectra, experimental data are often too noisy and require substantial manipulation before DL can be applied. Theory-based simulated data represent an alternative path forward but may not capture realistic scenarios such as the presence of structured noise 217 .

Uncertainty quantification for deep learning in materials science is important, yet only a few works have been published in this field. To alleviate the black-box 38 nature of DL methods, packages such as GNNExplainer 302 have been tried in the context of materials. Such attempts at greater interpretability will be important moving forward to gain the trust of the materials community.

While training-validation-test split strategies were primarily designed in DL for image classification tasks with a certain number of classes, the same strategies may not be the best approach for regression models in materials science. This is because the model may see during training a material very similar to one in the test set, so that the test error does not reflect how difficult it is for the model to generalize in reality. Best practices need to be developed for data split, normalization, and augmentation to avoid such issues 289 .
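One possible way to limit the leakage described above is to split by groups of related materials instead of by individual entries. The sketch below is an illustration only and not a practice prescribed by the source: grouping by chemical system, the record layout, and the property values are all assumptions.

```python
import random
from collections import defaultdict


def grouped_split(records, group_of, test_fraction=0.2, seed=0):
    """Split records so that every group falls entirely in train or in test."""
    groups = defaultdict(list)
    for rec in records:
        groups[group_of(rec)].append(rec)
    keys = sorted(groups)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)
    n_test = max(1, round(test_fraction * len(keys)))
    test = [r for k in keys[:n_test] for r in groups[k]]
    train = [r for k in keys[n_test:] for r in groups[k]]
    return train, test


# Hypothetical records: formula, chemical system, and a made-up target value.
records = [
    {"formula": "TiO2", "system": "O-Ti", "y": 3.2},
    {"formula": "Ti2O3", "system": "O-Ti", "y": 0.1},
    {"formula": "TiO", "system": "O-Ti", "y": 0.0},
    {"formula": "Fe2O3", "system": "Fe-O", "y": 2.2},
    {"formula": "FeO", "system": "Fe-O", "y": 2.4},
    {"formula": "NaCl", "system": "Cl-Na", "y": 8.5},
    {"formula": "LiCoO2", "system": "Co-Li-O", "y": 2.7},
    {"formula": "ZnO", "system": "O-Zn", "y": 3.4},
]

train, test = grouped_split(records, group_of=lambda r: r["system"])
print("test systems:", sorted({r["system"] for r in test}))
```

With this split, closely related compounds such as TiO2 and Ti2O3 can never end up on opposite sides of the train/test boundary.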

Finally, we note that an important technological challenge is to create a closed-loop autonomous materials design and synthesis process 303 , 304 that can include both machine learning and experimental components in a self-driving laboratory 305 . For an overview of early proof-of-principle attempts, see ref. 306 . For example, in an autonomous synthesis experiment the oxidation state of copper (and therefore the oxide phase) was varied in a sample of copper oxide by automatically flowing more oxidizing or more reducing gas over the sample and monitoring the charge state of the copper using XANES. An algorithmic decision policy was then used to automatically change the gas composition for a subsequent experiment based on the prior experiments, with no human in the loop, in such a way as to autonomously move towards a target copper oxidation state 307 . This simple proof-of-principle experiment provides just a glimpse of what is possible moving forward.

Data availability

The data from new figures are available on reasonable request from the corresponding author. Data from other publishers are not available from the corresponding author of this work but may be available by reaching the corresponding author of the cited work.

Code availability

Software packages mentioned in the article (where made available by the authors) can be found at https://github.com/deepmaterials/dlmatreview . Software for other packages can be obtained by reaching the corresponding author of the cited work.

References

Callister, W. D. et al. Materials Science and Engineering: An Introduction (Wiley, 2021).

Saito, T. Computational Materials Design, Vol. 34 (Springer Science & Business Media, 2013).

Choudhary, K. et al. The joint automated repository for various integrated simulations (jarvis) for data-driven materials design. npj Comput. Mater. 6 , 1–13 (2020).

Kirklin, S. et al. The open quantum materials database (oqmd): assessing the accuracy of dft formation energies. npj Comput. Mater. 1 , 1–15 (2015).

Jain, A. et al. Commentary: The materials project: A materials genome approach to accelerating materials innovation. APL Mater. 1 , 011002 (2013).

Curtarolo, S. et al. Aflow: An automatic framework for high-throughput materials discovery. Comput. Mater. Sci. 58 , 218–226 (2012).

Ramakrishnan, R., Dral, P. O., Rupp, M. & Von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Sci. Data 1 , 1–7 (2014).

Draxl, C. & Scheffler, M. Nomad: The fair concept for big data-driven materials science. MRS Bull. 43 , 676–682 (2018).

Wang, R., Fang, X., Lu, Y., Yang, C.-Y. & Wang, S. The pdbbind database: methodologies and updates. J. Med. Chem. 48 , 4111–4119 (2005).

Zakutayev, A. et al. An open experimental database for exploring inorganic materials. Sci. Data 5 , 1–12 (2018).

de Pablo, J. J. et al. New frontiers for the materials genome initiative. npj Comput. Mater. 5 , 1–23 (2019).

Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3 , 1–9 (2016).

Friedman, J. et al. The Elements of Statistical Learning, Vol. 1 (Springer series in statistics New York, 2001).

Agrawal, A. & Choudhary, A. Perspective: Materials informatics and big data: Realization of the “fourth paradigm” of science in materials science. APL Mater. 4 , 053208 (2016).

Vasudevan, R. K. et al. Materials science in the artificial intelligence age: high-throughput library generation, machine learning, and a pathway from correlations to the underpinning physics. MRS Commun. 9 , 821–838 (2019).

Schmidt, J., Marques, M. R., Botti, S. & Marques, M. A. Recent advances and applications of machine learning in solid-state materials science. npj Comput. Mater. 5 , 1–36 (2019).

Butler, K. T., Davies, D. W., Cartwright, H., Isayev, O. & Walsh, A. Machine learning for molecular and materials science. Nature 559 , 547–555 (2018).

Xu, Y. et al. Deep dive into machine learning models for protein engineering. J. Chem. Inf. Model. 60 , 2773–2790 (2020).

Schleder, G. R., Padilha, A. C., Acosta, C. M., Costa, M. & Fazzio, A. From dft to machine learning: recent approaches to materials science–a review. J. Phys. Mater. 2 , 032001 (2019).

Agrawal, A. & Choudhary, A. Deep materials informatics: applications of deep learning in materials science. MRS Commun. 9 , 779–792 (2019).

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521 , 436–444 (2015).

McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5 , 115–133 (1943).

Rosenblatt, F. The perceptron: A probabilistic model for information storage and organization in the brain. Psychol. Rev. 65 , 386–408 (1958).

Gibney, E. Google ai algorithm masters ancient game of go. Nat. News 529 , 445 (2016).

Ramos, S., Gehrig, S., Pinggera, P., Franke, U. & Rother, C. Detecting unexpected obstacles for self-driving cars: Fusing deep learning and geometric modeling. in 2017 IEEE Intelligent Vehicles Symposium (IV) , 1025–1032 (IEEE, 2017).

Buduma, N. & Locascio, N. Fundamentals of deep learning: Designing next-generation machine intelligence algorithms (O’Reilly Media, 2017).

Kearnes, S., McCloskey, K., Berndl, M., Pande, V. & Riley, P. Molecular graph convolutions: moving beyond fingerprints. J. Computer Aided Mol. Des. 30 , 595–608 (2016).

Albrecht, T., Slabaugh, G., Alonso, E. & Al-Arif, S. M. R. Deep learning for single-molecule science. Nanotechnology 28 , 423001 (2017).

Ge, M., Su, F., Zhao, Z. & Su, D. Deep learning analysis on microscopic imaging in materials science. Mater. Today Nano 11 , 100087 (2020).

Agrawal, A., Gopalakrishnan, K. & Choudhary, A. In Handbook on Big Data and Machine Learning in the Physical Sciences: Volume 1. Big Data Methods in Experimental Materials Discovery World Scientific Series on Emerging Technologies, 205–230 (World Scientific, 2020).

Erdmann, M., Glombitza, J., Kasieczka, G. & Klemradt, U. Deep Learning for Physics Research (World Scientific, 2021).

Chen, C., Ye, W., Zuo, Y., Zheng, C. & Ong, S. P. Graph networks as a universal machine learning framework for molecules and crystals. Chem. Mater. 31 , 3564–3572 (2019).

Jha, D. et al. Enhancing materials property prediction by leveraging computational and experimental data using deep transfer learning. Nat. Commun . 10 , 1–12 (2019).

Cubuk, E. D., Sendek, A. D. & Reed, E. J. Screening billions of candidates for solid lithium-ion conductors: a transfer learning approach for small data. J. Chem. Phys. 150 , 214701 (2019).

Chen, C., Zuo, Y., Ye, W., Li, X. & Ong, S. P. Learning properties of ordered and disordered materials from multi-fidelity data. Nat. Comput. Sci. 1 , 46–53 (2021).

Artrith, N. et al. Best practices in machine learning for chemistry. Nat. Chem. 13 , 505–508 (2021).

Holm, E. A. In defense of the black box. Science 364 , 26–27 (2019).

Mueller, T., Kusne, A. G. & Ramprasad, R. Machine learning in materials science: Recent progress and emerging applications. Rev. Comput. Chem. 29 , 186–273 (2016).

Wei, J. et al. Machine learning in materials science. InfoMat 1 , 338–358 (2019).

Liu, Y. et al. Machine learning in materials genome initiative: a review. J. Mater. Sci. Technol. 57 , 113–122 (2020).

Wang, A. Y.-T. et al. Machine learning for materials scientists: an introductory guide toward best practices. Chem. Mater. 32 , 4954–4965 (2020).

Morgan, D. & Jacobs, R. Opportunities and challenges for machine learning in materials science. Annu. Rev. Mater. Res. 50 , 71–103 (2020).

Himanen, L., Geurts, A., Foster, A. S. & Rinke, P. Data-driven materials science: status, challenges, and perspectives. Adv. Sci. 6 , 1900808 (2019).

Rajan, K. Informatics for materials science and engineering: data-driven discovery for accelerated experimentation and application (Butterworth-Heinemann, 2013).

Montáns, F. J., Chinesta, F., Gómez-Bombarelli, R. & Kutz, J. N. Data-driven modeling and learning in science and engineering. Comptes Rendus Mécanique 347 , 845–855 (2019).

Aykol, M. et al. The materials research platform: defining the requirements from user stories. Matter 1 , 1433–1438 (2019).

Stanev, V., Choudhary, K., Kusne, A. G., Paglione, J. & Takeuchi, I. Artificial intelligence for search and discovery of quantum materials. Commun. Mater. 2 , 1–11 (2021).

Chen, C. et al. A critical review of machine learning of energy materials. Adv. Energy Mater. 10 , 1903242 (2020).

Cybenko, G. Approximation by superpositions of a sigmoidal function. Math. Control Signals Syst. 2 , 303–314 (1989).

Kidger, P. & Lyons, T. Universal approximation with deep narrow networks . in Conference on learning theory , 2306–2327 (PMLR, 2020).

Lin, H. W., Tegmark, M. & Rolnick, D. Why does deep and cheap learning work so well? J. Stat. Phys. 168 , 1223–1247 (2017).

Minsky, M. & Papert, S. A. Perceptrons: An introduction to computational geometry (MIT press, 2017).

Paszke, A. et al. Pytorch: An imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32 , 8026–8037 (2019).

Abadi, M. et al. TensorFlow: A system for large-scale machine learning. arXiv. https://arxiv.org/abs/1605.08695 (2016).

Chen, T. et al. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv . https://arxiv.org/abs/1512.01274 (2015).

Nwankpa, C., Ijomah, W., Gachagan, A. & Marshall, S. Activation functions: comparison of trends in practice and research for deep learning. arXiv . https://arxiv.org/abs/1811.03378 (2018).

Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: a survey. J. Machine Learn. Res. 18 , 1–43 (2018).

Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv. https://arxiv.org/abs/1207.0580 (2012).

Breiman, L. Bagging predictors. Machine Learn. 24 , 123–140 (1996).

LeCun, Y. et al. The Handbook of Brain Theory and Neural Networks vol. 3361 (MIT press Cambridge, MA, USA 1995).

Wilson, R. J. Introduction to Graph Theory (Pearson Education India, 1979).

West, D. B. et al. Introduction to Graph Theory Vol. 2 (Prentice hall Upper Saddle River, 2001).

Wang, M. et al. Deep graph library: A graph-centric, highly-performant package for graph neural networks. arXiv . https://arxiv.org/abs/1909.01315 (2019).

Choudhary, K. & DeCost, B. Atomistic line graph neural network for improved materials property predictions. npj Comput. Mater. 7 , 1–8 (2021).

Li, M. et al. Dgl-lifesci: An open-source toolkit for deep learning on graphs in life science. arXiv . https://arxiv.org/abs/2106.14232 (2021).

Xie, T. & Grossman, J. C. Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties. Phys. Rev. Lett. 120 , 145301 (2018).

Klicpera, J., Groß, J. & Günnemann, S. Directional message passing for molecular graphs. arXiv . https://arxiv.org/abs/2003.03123 (2020).

Schutt, K. et al. Schnetpack: A deep learning toolbox for atomistic systems. J. Chem. Theory Comput. 15 , 448–455 (2018).

Kipf, T. N. & Welling, M. Semi-supervised classification with graph convolutional networks. arXiv . https://arxiv.org/abs/1609.02907 (2016).

Veličković, P. et al. Graph attention networks. arXiv . https://arxiv.org/abs/1710.10903 (2017).

Schlichtkrull, M. et al. Modeling relational data with graph convolutional networks. arXiv. https://arxiv.org/abs/1703.06103 (2017).

Song, L., Zhang, Y., Wang, Z. & Gildea, D. A graph-to-sequence model for AMR-to-text generation . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 1616–1626 (Association for Computational Linguistics, 2018).

Xu, K., Hu, W., Leskovec, J. & Jegelka, S. How powerful are graph neural networks? arXiv . https://arxiv.org/abs/1810.00826 (2018).

Chen, Z., Li, X. & Bruna, J. Supervised community detection with line graph neural networks. arXiv . https://arxiv.org/abs/1705.08415 (2017).

Jing, Y., Bian, Y., Hu, Z., Wang, L. & Xie, X.-Q. S. Deep learning for drug design: an artificial intelligence paradigm for drug discovery in the big data era. AAPS J. 20 , 1–10 (2018).

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv. https://arxiv.org/abs/1810.04805 (2018).

De Cao, N. & Kipf, T. Molgan: An implicit generative model for small molecular graphs. arXiv . https://arxiv.org/abs/1805.11973 (2018).

Pereira, T., Abbasi, M., Ribeiro, B. & Arrais, J. P. Diversity oriented deep reinforcement learning for targeted molecule generation. J. Cheminformatics 13 , 1–17 (2021).

Baker, N. et al. Workshop report on basic research needs for scientific machine learning: core technologies for artificial intelligence. Tech. Rep. https://doi.org/10.2172/1478744 (2019).

Chan, H. et al. Rapid 3d nanoscale coherent imaging via physics-aware deep learning. Appl. Phys. Rev. 8 , 021407 (2021).

Pun, G. P., Batra, R., Ramprasad, R. & Mishin, Y. Physically informed artificial neural networks for atomistic modeling of materials. Nat. Commun. 10 , 1–10 (2019).

Onken, D. et al. A neural network approach for high-dimensional optimal control. arXiv. https://arxiv.org/abs/2104.03270 (2021).

Zunger, A. Inverse design in search of materials with target functionalities. Nat. Rev. Chem. 2 , 1–16 (2018).

Chen, L., Zhang, W., Nie, Z., Li, S. & Pan, F. Generative models for inverse design of inorganic solid materials. J. Mater. Inform. 1 , 4 (2021).

Cranmer, M. et al. Discovering symbolic models from deep learning with inductive biases. arXiv . https://arxiv.org/abs/2006.11287 (2020).

Rupp, M., Tkatchenko, A., Müller, K.-R. & Von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Phys. Rev. Lett. 108 , 058301 (2012).

Bartók, A. P., Kondor, R. & Csányi, G. On representing chemical environments. Phys. Rev. B 87 , 184115 (2013).

Faber, F. A. et al. Prediction errors of molecular machine learning models lower than hybrid dft error. J. Chem. Theory Comput. 13 , 5255–5264 (2017).

Choudhary, K., DeCost, B. & Tavazza, F. Machine learning with force-field-inspired descriptors for materials: Fast screening and mapping energy landscape. Phys. Rev. Mater. 2 , 083801 (2018).

Choudhary, K., Garrity, K. F., Ghimire, N. J., Anand, N. & Tavazza, F. High-throughput search for magnetic topological materials using spin-orbit spillage, machine learning, and experiments. Phys. Rev. B 103 , 155131 (2021).

Choudhary, K., Garrity, K. F. & Tavazza, F. Data-driven discovery of 3d and 2d thermoelectric materials. J. Phys. Condens. Matter 32 , 475501 (2020).

Ward, L. et al. Including crystal structure attributes in machine learning models of formation energies via voronoi tessellations. Phys. Rev. B 96 , 024104 (2017).

Isayev, O. et al. Universal fragment descriptors for predicting properties of inorganic crystals. Nat. Commun. 8 , 1–12 (2017).

Liu, C.-H., Tao, Y., Hsu, D., Du, Q. & Billinge, S. J. Using a machine learning approach to determine the space group of a structure from the atomic pair distribution function. Acta Crystallogr. Sec. A 75 , 633–643 (2019).

Smith, J. S., Isayev, O. & Roitberg, A. E. Ani-1: an extensible neural network potential with dft accuracy at force field computational cost. Chem. Sci. 8 , 3192–3203 (2017).

Behler, J. Atom-centered symmetry functions for constructing high-dimensional neural network potentials. J. Chem. Phys. 134 , 074106 (2011).

Behler, J. & Parrinello, M. Generalized neural-network representation of high-dimensional potential-energy surfaces. Phys. Rev. Lett. 98 , 146401 (2007).

Ko, T. W., Finkler, J. A., Goedecker, S. & Behler, J. A fourth-generation high-dimensional neural network potential with accurate electrostatics including non-local charge transfer. Nat. Commun. 12 , 398 (2021).

Weinreich, J., Romer, A., Paleico, M. L. & Behler, J. Properties of alpha-brass nanoparticles. 1. neural network potential energy surface. J. Phys. Chem C 124 , 12682–12695 (2020).

Wang, H., Zhang, L., Han, J. & E, W. Deepmd-kit: A deep learning package for many-body potential energy representation and molecular dynamics. Computer Phys. Commun. 228 , 178–184 (2018).

Eshet, H., Khaliullin, R. Z., Kühne, T. D., Behler, J. & Parrinello, M. Ab initio quality neural-network potential for sodium. Phys. Rev. B 81 , 184107 (2010).

Khaliullin, R. Z., Eshet, H., Kühne, T. D., Behler, J. & Parrinello, M. Graphite-diamond phase coexistence study employing a neural-network mapping of the ab initio potential energy surface. Phys. Rev. B 81 , 100103 (2010).

Artrith, N. & Urban, A. An implementation of artificial neural-network potentials for atomistic materials simulations: Performance for tio2. Comput. Mater. Sci. 114 , 135–150 (2016).

Park, C. W. et al. Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture. npj Comput. Mater. 7 , 1–9 (2021).

Chmiela, S., Sauceda, H. E., Müller, K.-R. & Tkatchenko, A. Towards exact molecular dynamics simulations with machine-learned force fields. Nat. Commun. 9 , 1–10 (2018).

Xue, L.-Y. et al. Reaxff-mpnn machine learning potential: a combination of reactive force field and message passing neural networks. Phys. Chem. Chem. Phys. 23 , 19457–19464 (2021).

Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O. & Dahl, G. E. Neural message passing for quantum chemistry. arXiv . https://arxiv.org/abs/1704.01212 (2017).

Zitnick, C. L. et al. An introduction to electrocatalyst design using machine learning for renewable energy storage. arXiv. https://arxiv.org/abs/2010.09435 (2020).

McNutt, A. T. et al. Gnina 1 molecular docking with deep learning. J. Cheminformatics 13 , 1–20 (2021).

Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. in International conference on machine learning , 2323–2332 (PMLR, 2018).

Olivecrona, M., Blaschke, T., Engkvist, O. & Chen, H. Molecular de-novo design through deep reinforcement learning. J. Cheminformatics 9 , 1–14 (2017).

You, J., Liu, B., Ying, R., Pande, V. & Leskovec, J. Graph convolutional policy network for goal-directed molecular graph generation. arXiv. https://arxiv.org/abs/1806.02473 (2018).

Putin, E. et al. Reinforced adversarial neural computer for de novo molecular design. J. Chem. Inf. Model. 58 , 1194–1204 (2018).

Sanchez-Lengeling, B., Outeiral, C., Guimaraes, G. L. & Aspuru-Guzik, A. Optimizing distributions over molecular space. an objective-reinforced generative adversarial network for inverse-design chemistry (organic). ChemRxiv https://doi.org/10.26434/chemrxiv.5309668.v3 (2017).

Nouira, A., Sokolovska, N. & Crivello, J.-C. Crystalgan: learning to discover crystallographic structures with generative adversarial networks. arXiv. https://arxiv.org/abs/1810.11203 (2018).

Long, T. et al. Constrained crystals deep convolutional generative adversarial network for the inverse design of crystal structures. npj Comput. Mater. 7 , 66 (2021).

Noh, J. et al. Inverse design of solid-state materials via a continuous representation. Matter 1 , 1370–1384 (2019).

Kim, S., Noh, J., Gu, G. H., Aspuru-Guzik, A. & Jung, Y. Generative adversarial networks for crystal structure prediction. ACS Central Sci. 6 , 1412–1420 (2020).

Long, T. et al. Inverse design of crystal structures for multicomponent systems. arXiv. https://arxiv.org/abs/2104.08040 (2021).

Xie, T. & Grossman, J. C. Hierarchical visualization of materials space with graph convolutional neural networks. J. Chem. Phys. 149 , 174111 (2018).

Park, C. W. & Wolverton, C. Developing an improved crystal graph convolutional neural network framework for accelerated materials discovery. Phys. Rev. Mater. 4 , 063801 (2020).

Laugier, L. et al. Predicting thermoelectric properties from crystal graphs and material descriptors-first application for functional materials. arXiv. https://arxiv.org/abs/1811.06219 (2018).

Rosen, A. S. et al. Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery. Matter 4 , 1578–1597 (2021).

Lusci, A., Pollastri, G. & Baldi, P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model. 53 , 1563–1575 (2013).

Xu, Y. et al. Deep learning for drug-induced liver injury. J. Chem. Inf. Model. 55 , 2085–2093 (2015).

Jain, A. & Bligaard, T. Atomic-position independent descriptor for machine learning of material properties. Phys. Rev. B 98 , 214112 (2018).

Goodall, R. E., Parackal, A. S., Faber, F. A., Armiento, R. & Lee, A. A. Rapid discovery of novel materials by coordinate-free coarse graining. arXiv . https://arxiv.org/abs/2106.11132 (2021).

Zuo, Y. et al. Accelerating Materials Discovery with Bayesian Optimization and Graph Deep Learning. arXiv . https://arxiv.org/abs/2104.10242 (2021).

Lin, T.-S. et al. Bigsmiles: a structurally-based line notation for describing macromolecules. ACS Central Sci. 5 , 1523–1531 (2019).

Tyagi, A. et al. Cancerppd: a database of anticancer peptides and proteins. Nucleic Acids Res. 43 , D837–D843 (2015).

Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (selfies): a 100% robust molecular string representation. Machine Learn. Sci. Technol. 1 , 045024 (2020).

Lim, J., Ryu, S., Kim, J. W. & Kim, W. Y. Molecular generative model based on conditional variational autoencoder for de novo molecular design. J. Cheminformatics 10 , 1–9 (2018).

Krasnov, L., Khokhlov, I., Fedorov, M. V. & Sosnin, S. Transformer-based artificial neural networks for the conversion between chemical notations. Sci. Rep. 11 , 1–10 (2021).

Irwin, J. J., Sterling, T., Mysinger, M. M., Bolstad, E. S. & Coleman, R. G. Zinc: a free tool to discover chemistry for biology. J. Chem. Inf. Model. 52 , 1757–1768 (2012).

Dix, D. J. et al. The toxcast program for prioritizing toxicity testing of environmental chemicals. Toxicol. Sci. 95 , 5–12 (2007).

Kim, S. et al. Pubchem 2019 update: improved access to chemical data. Nucleic Acids Res. 47 , D1102–D1109 (2019).

Hirohara, M., Saito, Y., Koda, Y., Sato, K. & Sakakibara, Y. Convolutional neural network based on smiles representation of compounds for detecting chemical motif. BMC Bioinformatics 19 , 83–94 (2018).

Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Sci. 4 , 268–276 (2018).

Liu, R. et al. Deep learning for chemical compound stability prediction . In Proceedings of ACM SIGKDD workshop on large-scale deep learning for data mining (DL-KDD) , 1–7. https://rosanneliu.com/publication/kdd/ (ACM SIGKDD, 2016).

Jha, D. et al. Elemnet: Deep learning the chem. mater. from only elemental composition. Sci. Rep. 8 , 1–13 (2018).

Agrawal, A. et al. Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters. Integr. Mater. Manuf. Innov. 3 , 90–108 (2014).

Agrawal, A. & Choudhary, A. A fatigue strength predictor for steels using ensemble data mining: steel fatigue strength predictor . In Proceedings of the 25th ACM International on Conference on information and knowledge management , 2497–2500. https://doi.org/10.1145/2983323.2983343 (2016).

Agrawal, A. & Choudhary, A. An online tool for predicting fatigue strength of steel alloys based on ensemble data mining. Int. J. Fatigue 113 , 389–400 (2018).

Agrawal, A., Saboo, A., Xiong, W., Olson, G. & Choudhary, A. Martensite start temperature predictor for steels using ensemble data mining . in 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA) , 521–530 (IEEE, 2019).

Meredig, B. et al. Combinatorial screening for new materials in unconstrained composition space with machine learning. Phys. Rev. B 89 , 094104 (2014).

Agrawal, A., Meredig, B., Wolverton, C. & Choudhary, A. A formation energy predictor for crystalline materials using ensemble data mining . in 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) , 1276–1279 (IEEE, 2016).

Furmanchuk, A., Agrawal, A. & Choudhary, A. Predictive analytics for crystalline materials: bulk modulus. RSC Adv. 6 , 95246–95251 (2016).

Furmanchuk, A. et al. Prediction of seebeck coefficient for compounds without restriction to fixed stoichiometry: A machine learning approach. J. Comput. Chem. 39 , 191–202 (2018).

Ward, L., Agrawal, A., Choudhary, A. & Wolverton, C. A general-purpose machine learning framework for predicting properties of inorganic materials. npj Comput. Mater. 2 , 1–7 (2016).

Ward, L. et al. Matminer: An open source toolkit for materials data mining. Comput. Mater. Sci. 152 , 60–69 (2018).

Jha, D. et al. Irnet: A general purpose deep residual regression framework for materials discovery . In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining , 2385–2393. https://arxiv.org/abs/1907.03222 (2019).

Jha, D. et al. Enabling deeper learning on big data for materials informatics applications. Sci. Rep. 11 , 1–12 (2021).

Goodall, R. E. & Lee, A. A. Predicting materials properties without crystal structure: Deep representation learning from stoichiometry. Nat. Commun. 11 , 1–9 (2020).

NIMS. Superconducting material database (supercon) . https://supercon.nims.go.jp/ (2021).

Stanev, V. et al. Machine learning modeling of superconducting critical temperature. npj Comput. Mater. 4 , 1–14 (2018).

Gupta, V. et al. Cross-property deep transfer learning framework for enhanced predictive analytics on small materials data. Nat. Commun . 12 , 1–10 (2021).

Himanen, L. et al. Dscribe: Library of descriptors for machine learning in materials science. Computer Phys. Commun. 247 , 106949 (2020).

Bartel, C. J. et al. A critical examination of compound stability predictions from machine-learned formation energies. npj Comput. Mater. 6 , 1–11 (2020).

Choudhary, K. et al. High-throughput density functional perturbation theory and machine learning predictions of infrared, piezoelectric, and dielectric responses. npj Comput. Mater. 6 , 1–13 (2020).

Zheng, C. et al. Automated generation and ensemble-learned matching of X-ray absorption spectra. npj Comput. Mater. 4 , 1–9 (2018).

Mathew, K. et al. High-throughput computational x-ray absorption spectroscopy. Sci. Data 5 , 1–8 (2018).

Chen, Y. et al. Database of ab initio l-edge x-ray absorption near edge structure. Sci. Data 8 , 1–8 (2021).

Lafuente, B., Downs, R. T., Yang, H. & Stone, N. In Highlights in mineralogical crystallography 1–30 (De Gruyter (O), 2015).

El Mendili, Y. et al. Raman open database: first interconnected raman–x-ray diffraction open-access resource for material identification. J. Appl. Crystallogr. 52 , 618–625 (2019).

Fremout, W. & Saverwyns, S. Identification of synthetic organic pigments: the role of a comprehensive digital raman spectral library. J. Raman Spectrosc. 43 , 1536–1544 (2012).

Huck, P. & Persson, K. A. Mpcontribs: user contributed data to the materials project database . https://docs.mpcontribs.org/ (2019).

Yang, L. et al. A cloud platform for atomic pair distribution function analysis: Pdfitc. Acta Crystallogr. A 77 , 2–6 (2021).

Park, W. B. et al. Classification of crystal structure using a convolutional neural network. IUCrJ 4 , 486–494 (2017).

Hellenbrandt, M. The Inorganic Crystal Structure Database (ICSD)—present and future. Crystallogr. Rev. 10 , 17–22 (2004).

Zaloga, A. N., Stanovov, V. V., Bezrukova, O. E., Dubinin, P. S. & Yakimov, I. S. Crystal symmetry classification from powder X-ray diffraction patterns using a convolutional neural network. Mater. Today Commun. 25 , 101662 (2020).

Lee, J.-W., Park, W. B., Lee, J. H., Singh, S. P. & Sohn, K.-S. A deep-learning technique for phase identification in multiphase inorganic compounds using synthetic XRD powder patterns. Nat. Commun. 11 , 86 (2020).

Wang, H. et al. Rapid identification of X-ray diffraction patterns based on very limited data by interpretable convolutional neural networks. J. Chem. Inf. Model. 60 , 2004–2011 (2020).

Dong, H. et al. A deep convolutional neural network for real-time full profile analysis of big powder diffraction data. npj Comput. Mater. 7 , 1–9 (2021).

Aguiar, J. A., Gong, M. L. & Tasdizen, T. Crystallographic prediction from diffraction and chemistry data for higher throughput classification using machine learning. Comput. Mater. Sci. 173 , 109409 (2020).

Maffettone, P. M. et al. Crystallography companion agent for high-throughput materials discovery. Nat. Comput. Sci. 1 , 290–297 (2021).

Oviedo, F. et al. Fast and interpretable classification of small X-ray diffraction datasets using data augmentation and deep neural networks. npj Comput. Mater. 5 , 1–9 (2019).

Liu, C.-H. et al. Validation of non-negative matrix factorization for rapid assessment of large sets of atomic pair-distribution function (pdf) data. J. Appl. Crystallogr. 54 , 768–775 (2021).

Rakita, Y. et al. Studying heterogeneities in local nanostructure with scanning nanostructure electron microscopy (snem). arXiv https://arxiv.org/abs/2110.03589 (2021).

Timoshenko, J., Lu, D., Lin, Y. & Frenkel, A. I. Supervised machine-learning-based determination of three-dimensional structure of metallic nanoparticles. J. Phys. Chem Lett. 8 , 5091–5098 (2017).

Timoshenko, J. et al. Subnanometer substructures in nanoassemblies formed from clusters under a reactive atmosphere revealed using machine learning. J. Phys. Chem C 122 , 21686–21693 (2018).

Timoshenko, J. et al. Neural network approach for characterizing structural transformations by X-ray absorption fine structure spectroscopy. Phys. Rev. Lett. 120 , 225502 (2018).

Zheng, C., Chen, C., Chen, Y. & Ong, S. P. Random forest models for accurate identification of coordination environments from X-ray absorption near-edge structure. Patterns 1 , 100013 (2020).

Torrisi, S. B. et al. Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. npj Comput. Mater. 6 , 1–11 (2020).

Andrejevic, N., Andrejevic, J., Rycroft, C. H. & Li, M. Machine learning spectral indicators of topology. arXiv preprint at https://arxiv.org/abs/2003.00994 (2020).

Madden, M. G. & Ryder, A. G. Machine learning methods for quantitative analysis of raman spectroscopy data . in Opto-Ireland 2002: Optics and Photonics Technologies and Applications , Vol. 4876, 1130–1139 (International Society for Optics and Photonics, 2003).

Conroy, J., Ryder, A. G., Leger, M. N., Hennessey, K. & Madden, M. G. Qualitative and quantitative analysis of chlorinated solvents using Raman spectroscopy and machine learning . in Opto-Ireland 2005: Optical Sensing and Spectroscopy, Vol. 5826, 131–142 (International Society for Optics and Photonics, 2005).

Acquarelli, J. et al. Convolutional neural networks for vibrational spectroscopic data analysis. Anal. Chim. Acta 954 , 22–31 (2017).

O’Connell, M.-L., Howley, T., Ryder, A. G., Leger, M. N. & Madden, M. G. Classification of a target analyte in solid mixtures using principal component analysis, support vector machines, and Raman spectroscopy . in Opto-Ireland 2005: Optical Sensing and Spectroscopy , Vol. 5826, 340–350 (International Society for Optics and Photonics, 2005).

Zhao, J., Chen, Q., Huang, X. & Fang, C. H. Qualitative identification of tea categories by near infrared spectroscopy and support vector machine. J. Pharm. Biomed. Anal. 41 , 1198–1204 (2006).

Liu, J. et al. Deep convolutional neural networks for Raman spectrum recognition: a unified solution. Analyst 142 , 4067–4074 (2017).

Yang, J. et al. Deep learning for vibrational spectral analysis: Recent progress and a practical guide. Anal. Chim. Acta 1081 , 6–17 (2019).

Selzer, P., Gasteiger, J., Thomas, H. & Salzer, R. Rapid access to infrared reference spectra of arbitrary organic compounds: scope and limitations of an approach to the simulation of infrared spectra by neural networks. Chem. Euro. J. 6 , 920–927 (2000).

Ghosh, K. et al. Deep learning spectroscopy: neural networks for molecular excitation spectra. Adv. Sci. 6 , 1801367 (2019).

Kostka, T., Selzer, P. & Gasteiger, J. A combined application of reaction prediction and infrared spectra simulation for the identification of degradation products of s-triazine herbicides. Chemistry 7 , 2254–2260 (2001).

Mahmoud, C. B., Anelli, A., Csányi, G. & Ceriotti, M. Learning the electronic density of states in condensed matter. Phys. Rev. B 102 , 235130 (2020).

Chen, Z. et al. Direct prediction of phonon density of states with Euclidean neural networks. Adv. Sci. 8 , 2004214 (2021).

Kong, S. et al. Density of states prediction for materials discovery via contrastive learning from probabilistic embeddings. arXiv . https://arxiv.org/abs/2110.11444 (2021).

Carbone, M. R., Topsakal, M., Lu, D. & Yoo, S. Machine-learning X-ray absorption spectra to quantitative accuracy. Phys. Rev. Lett. 124 , 156401 (2020).

Rehr, J. J., Kas, J. J., Vila, F. D., Prange, M. P. & Jorissen, K. Parameter-free calculations of X-ray spectra with FEFF9. Phys. Chem. Chem. Phys. 12 , 5503–5513 (2010).

Rankine, C. D., Madkhali, M. M. M. & Penfold, T. J. A deep neural network for the rapid prediction of X-ray absorption spectra. J. Phys. Chem A 124 , 4263–4270 (2020).

Fung, V., Hu, G., Ganesh, P. & Sumpter, B. G. Machine learned features from density of states for accurate adsorption energy prediction. Nat. Commun. 12 , 88 (2021).

Hammer, B. & Nørskov, J. Theoretical surface science and catalysis-calculations and concepts. Adv. Catal. Impact Surface Sci. Catal. 45 , 71–129 (2000).

Kaundinya, P. R., Choudhary, K. & Kalidindi, S. R. Prediction of the electron density of states for crystalline compounds with atomistic line graph neural networks (alignn). arXiv. https://arxiv.org/abs/2201.08348 (2022).

Stein, H. S., Soedarmadji, E., Newhouse, P. F., Guevarra, D. & Gregoire, J. M. Synthesis, optical imaging, and absorption spectroscopy data for 179072 metal oxides. Sci. Data 6 , 9 (2019).

Choudhary, A. et al. Graph neural network predictions of metal organic framework co2 adsorption properties. arXiv . https://arxiv.org/abs/2112.10231 (2021).

Anderson, R., Biong, A. & Gómez-Gualdrón, D. A. Adsorption isotherm predictions for multiple molecules in mofs using the same deep learning model. J. Chem. Theory Comput. 16 , 1271–1283 (2020).

Krizhevsky, A., Sutskever, I. & Hinton, G. E. Imagenet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25 , 1097–1105 (2012).

Varela, M. et al. Materials characterization in the aberration-corrected scanning transmission electron microscope. Annu. Rev. Mater. Res. 35 , 539–569 (2005).

Holm, E. A. et al. Overview: Computer vision and machine learning for microstructural characterization and analysis. Metal. Mater Trans. A 51 , 5985–5999 (2020).

Modarres, M. H. et al. Neural network for nanoscience scanning electron microscope image recognition. Sci. Rep. 7 , 1–12 (2017).

Gopalakrishnan, K., Khaitan, S. K., Choudhary, A. & Agrawal, A. Deep convolutional neural networks with transfer learning for computer vision-based data-driven pavement distress detection. Construct. Build. Mater. 157 , 322–330 (2017).

Gopalakrishnan, K., Gholami, H., Vidyadharan, A., Choudhary, A. & Agrawal, A. Crack damage detection in unmanned aerial vehicle images of civil infrastructure using pre-trained deep learning model. Int. J. Traffic Transp. Eng . 8 , 1–14 (2018).

Yang, Z. et al. Data-driven insights from predictive analytics on heterogeneous experimental data of industrial magnetic materials . In IEEE International Conference on Data Mining Workshops (ICDMW) , 806–813. https://doi.org/10.1109/ICDMW.2019.00119 (IEEE Computer Society, 2019).

Yang, Z. et al. Heterogeneous feature fusion based machine learning on shallow-wide and heterogeneous-sparse industrial datasets . In 25th International Conference on Pattern Recognition Workshops, ICPR 2020 , 566–577. https://doi.org/10.1007/978-3-030-68799-1_41 (Springer Science and Business Media Deutschland GmbH, 2021).

Ziletti, A., Kumar, D., Scheffler, M. & Ghiringhelli, L. M. Insightful classification of crystal structures using deep learning. Nat. Commun. 9 , 2775 (2018).

Choudhary, K. et al. Computational scanning tunneling microscope image database. Sci. Data 8 , 1–9 (2021).

Liu, R., Agrawal, A., Liao, W.-k., Choudhary, A. & De Graef, M. Materials discovery: Understanding polycrystals from large-scale electron patterns . in 2016 IEEE International Conference on Big Data (Big Data) , 2261–2269 (IEEE, 2016).

Jha, D. et al. Extracting grain orientations from EBSD patterns of polycrystalline materials using convolutional neural networks. Microsc. Microanal. 24 , 497–502 (2018).

Kaufmann, K., Zhu, C., Rosengarten, A. S. & Vecchio, K. S. Deep neural network enabled space group identification in EBSD. Microsc. Microanal. 26 , 447–457 (2020).

Yang, Z. et al. Deep learning based domain knowledge integration for small datasets: Illustrative applications in materials informatics . in 2019 International Joint Conference on Neural Networks (IJCNN) , 1–8 (IEEE, 2019).

Yang, Z. et al. Learning to predict crystal plasticity at the nanoscale: Deep residual networks and size effects in uniaxial compression discrete dislocation simulations. Sci. Rep. 10 , 1–14 (2020).

Decost, B. L. et al. Uhcsdb: Ultrahigh carbon steel micrograph database. Integr. Mater. Manuf. Innov. 6 , 197–205 (2017).

Decost, B. L., Lei, B., Francis, T. & Holm, E. A. High throughput quantitative metallography for complex microstructures using deep learning: a case study in ultrahigh carbon steel. Microsc. Microanal. 25 , 21–29 (2019).

Stan, T., Thompson, Z. T. & Voorhees, P. W. Optimizing convolutional neural networks to perform semantic segmentation on large materials imaging datasets: X-ray tomography and serial sectioning. Materials Characterization 160 , 110119 (2020).

Madsen, J. et al. A deep learning approach to identify local structures in atomic-resolution transmission electron microscopy images. Adv. Theory Simulations 1 , 1800037 (2018).

Maksov, A. et al. Deep learning analysis of defect and phase evolution during electron beam-induced transformations in ws 2. npj Comput. Mater. 5 , 1–8 (2019).

Yang, S.-H. et al. Deep learning-assisted quantification of atomic dopants and defects in 2d materials. Adv. Sci. https://doi.org/10.1002/advs.202101099 (2021).

Roberts, G. et al. Deep learning for semantic segmentation of defects in advanced stem images of steels. Sci. Rep. 9 , 1–12 (2019).

Kusche, C. et al. Large-area, high-resolution characterisation and classification of damage mechanisms in dual-phase steel using deep learning. PLoS ONE 14 , e0216493 (2019).

Vlcek, L. et al. Learning from imperfections: predicting structure and thermodynamics from atomic imaging of fluctuations. ACS Nano 13 , 718–727 (2019).

Ziatdinov, M., Maksov, A. & Kalinin, S. V. Learning surface molecular structures via machine vision. npj Comput. Mater. 3 , 1–9 (2017).

Ovchinnikov, O. S. et al. Detection of defects in atomic-resolution images of materials using cycle analysis. Adv. Struct. Chem. Imaging 6 , 3 (2020).

Li, W., Field, K. G. & Morgan, D. Automated defect analysis in electron microscopic images. npj Comput. Mater. 4 , 1–9 (2018).

Cohn, R. et al. Instance segmentation for direct measurements of satellites in metal powders and automated microstructural characterization from image data. JOM 73 , 2159–2172 (2021).

de Haan, K., Ballard, Z. S., Rivenson, Y., Wu, Y. & Ozcan, A. Resolution enhancement in scanning electron microscopy using deep learning. Sci. Rep. 9 , 1–7 (2019).

Ede, J. M. & Beanland, R. Partial scanning transmission electron microscopy with deep learning. Sci. Rep. 10 , 1–10 (2020).

Rashidi, M. & Wolkow, R. A. Autonomous scanning probe microscopy in situ tip conditioning through machine learning. ACS Nano 12 , 5185–5189 (2018).

Scime, L., Siddel, D., Baird, S. & Paquit, V. Layer-wise anomaly detection and classification for powder bed additive manufacturing processes: A machine-agnostic algorithm for real-time pixel-wise semantic segmentation. Addit. Manufact. 36 , 101453 (2020).

Eppel, S., Xu, H., Bismuth, M. & Aspuru-Guzik, A. Computer vision for recognition of materials and vessels in chemistry lab settings and the Vector-LabPics Data Set. ACS Central Sci. 6 , 1743–1752 (2020).

Yang, Z. et al. Deep learning approaches for mining structure-property linkages in high contrast composites from simulation datasets. Comput. Mater. Sci. 151 , 278–287 (2018).

Cecen, A., Dai, H., Yabansu, Y. C., Kalidindi, S. R. & Song, L. Material structure-property linkages using three-dimensional convolutional neural networks. Acta Mater. 146 , 76–84 (2018).

Yang, Z. et al. Establishing structure-property localization linkages for elastic deformation of three-dimensional high contrast composites using deep learning approaches. Acta Mater. 166 , 335–345 (2019).

Goetz, A. et al. Addressing materials’ microstructure diversity using transfer learning. arXiv . arXiv-2107. https://arxiv.org/abs/2107.13841 (2021).

Kitahara, A. R. & Holm, E. A. Microstructure cluster analysis with transfer learning and unsupervised learning. Integr. Mater. Manuf. Innov. 7 , 148–156 (2018).

Larmuseau, M. et al. Compact representations of microstructure images using triplet networks. npj Comput. Mater. 2020 6:1 6 , 1–11 (2020).

Li, X. et al. A deep adversarial learning methodology for designing microstructural material systems . in International Design Engineering Technical Conferences and Computers and Information in Engineering Conference , Vol. 51760, V02BT03A008 (American Society of Mechanical Engineers, 2018).

Yang, Z. et al. Microstructural materials design via deep adversarial learning methodology. J. Mech. Des. 140 , 111416 (2018).

Yang, Z. et al. A general framework combining generative adversarial networks and mixture density networks for inverse modeling in microstructural materials design. arXiv . https://arxiv.org/abs/2101.10553 (2021).

Hsu, T. et al. Microstructure generation via generative adversarial network for heterogeneous, topologically complex 3d materials. JOM 73 , 90–102 (2020).

Chun, S. et al. Deep learning for synthetic microstructure generation in a materials-by-design framework for heterogeneous energetic materials. Sci. Rep. 10 , 1–15 (2020).

Dai, M., Demirel, M. F., Liang, Y. & Hu, J.-M. Graph neural networks for an accurate and interpretable prediction of the properties of polycrystalline materials. npj Comput. Mater. 7 , 1–9 (2021).

Cohn, R. & Holm, E. Neural message passing for predicting abnormal grain growth in Monte Carlo simulations of microstructural evolution. arXiv. https://arxiv.org/abs/2110.09326v1 (2021).

Plimpton, S. et al. SPPARKS Kinetic Monte Carlo Simulator . https://spparks.github.io/index.html . (2021).

Plimpton, S. et al. Crossing the mesoscale no-man’s land via parallel kinetic Monte Carlo. Tech. Rep . https://doi.org/10.2172/966942 (2009).

Xue, N. Steven bird, evan klein and edward loper. natural language processing with python. oreilly media, inc.2009. isbn: 978-0-596-51649-9. Nat. Lang. Eng. 17 , 419–424 (2010).

Honnibal, M. & Montani, I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. https://doi.org/10.5281/zenodo.3358113 (2017).

Gardner, M. et al. Allennlp: A deep semantic natural language processing platform. arXiv. https://arxiv.org/abs/1803.07640 (2018).

Tshitoyan, V. et al. Unsupervised word embeddings capture latent knowledge from materials science literature. Nature 571 , 95–98 (2019).

Kononova, O. et al. Opportunities and challenges of text mining in aterials research. iScience 24 , 102155 (2021).

Olivetti, E. A. et al. Data-driven materials research enabled by natural language processing and information extraction. Appl. Phys. Rev. 7 , 041317 (2020).

Swain, M. C. & Cole, J. M. Chemdataextractor: a toolkit for automated extraction of chemical information from the scientific literature. J. Chem. Inf. Model. 56 , 1894–1904 (2016).

Park, S. et al. Text mining metal–organic framework papers. J. Chem. Inf. Model. 58 , 244–251 (2018).

Court, C. J. & Cole, J. M. Auto-generated materials database of curie and néel temperatures via semi-supervised relationship extraction. Sci. Data 5 , 1–12 (2018).

Huang, S. & Cole, J. M. A database of battery materials auto-generated using chemdataextractor. Sci. Data 7 , 1–13 (2020).

Beard, E. J., Sivaraman, G., Vázquez-Mayagoitia, Á., Vishwanath, V. & Cole, J. M. Comparative dataset of experimental and computational attributes of uv/vis absorption spectra. Sci. Data 6 , 1–11 (2019).

Tayfuroglu, O., Kocak, A. & Zorlu, Y. In silico investigation into h2 uptake in mofs: combined text/data mining and structural calculations. Langmuir 36 , 119–129 (2019).

Weston, L. et al. Named entity recognition and normalization applied to large-scale information extraction from the materials science literature. J. Chem. Inf. Model. 59 , 3692–3702 (2019).

Vaucher, A. C. et al. Automated extraction of chemical synthesis actions from experimental procedures. Nat. Commun. 11 , 1–11 (2020).

He, T. et al. Similarity of precursors in solid-state synthesis as text-mined from scientific literature. Chem. Mater. 32 , 7861–7873 (2020).

Kononova, O. et al. Text-mined dataset of inorganic materials synthesis recipes. Sci. Data 6 , 1–11 (2019).

Kim, E. et al. Materials synthesis insights from scientific literature via text extraction and machine learning. Chem. Mater. 29 , 9436–9444 (2017).

Kim, E., Huang, K., Jegelka, S. & Olivetti, E. Virtual screening of inorganic materials synthesis parameters with deep learning. npj Comput. Mater. 3 , 1–9 (2017).

Kim, E. et al. Inorganic materials synthesis planning with literature-trained neural networks. J. Chem. Inf. Model. 60 , 1194–1201 (2020).

de Castro, P. B. et al. Machine-learning-guided discovery of the gigantic magnetocaloric effect in hob 2 near the hydrogen liquefaction temperature. NPG Asia Mater. 12 , 1–7 (2020).

Cooper, C. B. et al. Design-to-device approach affords panchromatic co-sensitized solar cells. Adv. Energy Mater. 9 , 1802820 (2019).

Yang, X., Dai, Z., Zhao, Y., Liu, J. & Meng, S. Low lattice thermal conductivity and excellent thermoelectric behavior in li3sb and li3bi. J. Phys. Condens. Matter 30 , 425401 (2018).

Wang, Y., Gao, Z. & Zhou, J. Ultralow lattice thermal conductivity and electronic properties of monolayer 1t phase semimetal site2 and snte2. Phys. E 108 , 53–59 (2019).

Jong, U.-G., Yu, C.-J., Kye, Y.-H., Hong, S.-N. & Kim, H.-G. Manifestation of the thermoelectric properties in ge-based halide perovskites. Phys. Rev. Mater. 4 , 075403 (2020).

Yamamoto, K., Narita, G., Yamasaki, J. & Iikubo, S. First-principles study of thermoelectric properties of mixed iodide perovskite cs (b, b’) i3 (b, b’= ge, sn, and pb). J. Phys. Chem. Solids 140 , 109372 (2020).

Viennois, R. et al. Anisotropic low-energy vibrational modes as an effect of cage geometry in the binary barium silicon clathrate b a 24 s i 100. Phys. Rev. B 101 , 224302 (2020).

Haque, E. Effect of electron-phonon scattering, pressure and alloying on the thermoelectric performance of tmcu _3 ch _4(tm= v, nb, ta; ch= s, se, te). arXiv . https://arxiv.org/abs/2010.08461 (2020).

Yahyaoglu, M. et al. Phase-transition-enhanced thermoelectric transport in rickardite mineral cu3–x te2. Chem. Mater. 33 , 1832–1841 (2021).

Ho, D., Shkolnik, A. S., Ferraro, N. J., Rizkin, B. A. & Hartman, R. L. Using word embeddings in abstracts to accelerate metallocene catalysis polymerization research. Computers Chem. Eng. 141 , 107026 (2020).

Abdar, M. et al. A review of uncertainty quantification in deep learning: techniques, applications and challenges. Inf. Fusion . 76 , 243–297 (2021).

Mi, Lu, et al. Training-free uncertainty estimation for dense regression: Sensitivityas a surrogate. arXiv . preprint at arXiv:1910.04858. https://arxiv.org/abs/1910.04858 (2019).

Teye, M., Azizpour, H. & Smith, K. Bayesian uncertainty estimation for batch normalized deep networks . in International Conference on Machine Learning , 4907–4916 (PMLR, 2018).

Zhang, J., Kailkhura, B. & Han, T. Y.-J. Leveraging uncertainty from deep learning for trustworthy material discovery workflows. ACS Omega 6 , 12711–12721 (2021).

Meredig, B. et al. Can machine learning identify the next high-temperature superconductor? examining extrapolation performance for materials discovery. Mol. Syst. Des. Eng. 3 , 819–825 (2018).

Zhang, J., Kailkhura, B. & Han, T. Y.-J. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning . in International Conference on Machine Learning , 11117–11128 (PMLR, 2020).

Seoh, R. Qualitative analysis of monte carlo dropout. arXiv. https://arxiv.org/abs/2007.01720 (2020).

Gal, Y. & Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning . in international conference on machine learning , 1050–1059 (PMLR, 2016).

Jain, S., Liu, G., Mueller, J. & Gifford, D. Maximizing overall diversity for improved uncertainty estimates in deep ensembles . In Proceedings of the AAAI Conference on Artificial Intelligence , 34 , 4264–4271. https://doi.org/10.1609/aaai.v34i04.5849 (2020).

Ganaie, M. et al. Ensemble deep learning: a review. arXiv . https://arxiv.org/abs/2104.02395 (AAAI Technical Track: Machine Learning, 2021).

Fort, S., Hu, H. & Lakshminarayanan, B. Deep ensembles: a loss landscape perspective. arXiv. https://arxiv.org/abs/1912.02757 (2019).

Lakshminarayanan, B., Pritzel, A. & Blundell, C. Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv. https://arxiv.org/abs/1612.01474 (2016).

Moon, S. J., Jeon, J.-J., Lee, J. S. H. & Kim, Y. Learning multiple quantiles with neural networks. J. Comput. Graph. Stat. 30 , 1–11. https://doi.org/10.1080/10618600.2021.1909601 (2021).

Rasmussen, C. E. Summer School on Machine Learning , 63–71 (Springer, 2003).

Hegde, P., Heinonen, M., Lähdesmäki, H. & Kaski, S. Deep learning with differential gaussian process flows. arXiv. https://arxiv.org/abs/1810.04066 (2018).

Wilson, A. G., Hu, Z., Salakhutdinov, R. & Xing, E. P. Deep kernel learning. in Artificial intelligence and statistics , 370–378 (PMLR, 2016).

Hegde, V. I. et al. Reproducibility in high-throughput density functional theory: a comparison of aflow, materials project, and oqmd. arXiv. https://arxiv.org/abs/2007.01988 (2020).

Ying, R., Bourgeois, D., You, J., Zitnik, M. & Leskovec, J. Gnnexplainer: Generating explanations for graph neural networks. Adv. Neural Inf. Process. Syst. 32 , 9240 (2019).

Roch, L. M. et al. Chemos: orchestrating autonomous experimentation. Sci. Robot. 3 , eaat5559 (2018).

Szymanski, N. et al. Toward autonomous design and synthesis of novel inorganic materials. Mater. Horiz. 8 , 2169–2198. https://doi.org/10.1039/D1MH00495F (2021).

MacLeod, B. P. et al. Self-driving laboratory for accelerated discovery of thin-film materials. Sci. Adv. 6 , eaaz8867 (2020).

Stach, E. A. et al. Autonomous experimentation systems for materials development: a community perspective. Matter https://www.cell.com/matter/fulltext/S2590-2385(21)00306-4 (2021).

Rakita, Y. et al. Active reaction control of cu redox state based on real-time feedback from i n situ synchrotron measurements. J. Am. Chem. Soc. 142 , 18758–18762 (2020).

Chmiela, S. et al. Machine learning of accurate energy-conserving molecular force fields. Sci. Adv. 3 , e1603015 (2017).

Thomas, R. S. et al. The us federal tox21 program: a strategic and operational plan for continued leadership. Altex 35 , 163 (2018).

Russell Johnson, N. Nist computational chemistry comparison and benchmark database . In The 4th Joint Meeting of the US Sections of the Combustion Institute . https://ci.confex.com/ci/2005/techprogram/P1309.HTM (2005).

Lopez, S. A. et al. The harvard organic photovoltaic dataset. Sci. Data 3 , 1–7 (2016).

Johnson, R. D. et al. Nist computational chemistry comparison and benchmark database . http://srdata.nist.gov/cccbdb (2006).

Mobley, D. L. & Guthrie, J. P. Freesolv: a database of experimental and calculated hydration free energies, with input files. J. Computer Aided Mol. Des. 28 , 711–720 (2014).

Andersen, C. W. et al. Optimade: an api for exchanging materials data. arXiv. https://arxiv.org/abs/2103.02068 (2021).

Chanussot, L. et al. Open catalyst 2020 (oc20) dataset and community challenges. ACS Catal. 11 , 6059–6072 (2021).

Dunn, A., Wang, Q., Ganose, A., Dopp, D. & Jain, A. Benchmarking materials property prediction methods: the matbench test set and automatminer reference algorithm. npj Comput. Mater. 6 , 1–10 (2020).

Talirz, L. et al. Materials cloud, a platform for open computational science. Sci. Data 7 , 1–12 (2020).

Chung, Y. G. et al. Advances, updates, and analytics for the computation-ready, experimental metal–organic framework database: Core mof 2019. J. Chem. Eng. Data 64 , 5985–5998 (2019).

Sussman, J. L. et al. Protein data bank (pdb): database of three-dimensional structural information of biological macromolecules. Acta Crystallogr. Sec. D Biol. Crystallogr. 54 , 1078–1084 (1998).

Benson, M. L. et al. Binding moad, a high-quality protein–ligand database. Nucleic Acids Res. 36 , D674–D678 (2007).

Fung, V., Zhang, J., Juarez, E. & Sumpter, B. G. Benchmarking graph neural networks for materials chemistry. npj Comput. Mater. 7 , 1–8 (2021).

Louis, S.-Y. et al. Graph convolutional neural networks with global attention for improved materials property prediction. Phys. Chem. Chem. Phys. 22 , 18141–18148 (2020).

Khorshidi, A. & Peterson, A. A. Amp: A modular approach to machine learning in atomistic simulations. Computer Phys. Commun. 207 , 310–324 (2016).

Yao, K., Herr, J. E., Toth, D. W., Mckintyre, R. & Parkhill, J. The tensormol-0.1 model chemistry: a neural network augmented with long-range physics. Chem. Sci. 9 , 2261–2269 (2018).

Doerr, S. et al. Torchmd: A deep learning framework for molecular simulations. J. Chem. Theory Comput. 17 , 2355–2363 (2021).

Kolb, B., Lentz, L. C. & Kolpak, A. M. Discovering charge density functionals and structure-property relationships with prophet: A general framework for coupling machine learning and first-principles methods. Sci. Rep. 7 , 1–9 (2017).

Zhang, L., Han, J., Wang, H., Car, R. & Weinan, E. Deep potential molecular dynamics: a scalable model with the accuracy of quantum mechanics. Phys. Rev. Lett. 120 , 143001 (2018).

Geiger, M. et al. e3nn/e3nn: 2021-06-21 . https://doi.org/10.5281/zenodo.5006322 (2021).

Duvenaud, D. K. et al. Convolutional networks on graphs for learning molecular fingerprints (eds. Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M. & Garnett, R.) in Adv. Neural Inf. Process. Syst. 28 2224–2232 (Curran Associates, Inc., 2015).

Li, X. et al. Deepchemstable: Chemical stability prediction with an attention-based graph convolution network. J. Chem. Inf. Model. 59 , 1044–1049 (2019).

Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9 , 513–530 (2018).

Wang, A. Y.-T., Kauwe, S. K., Murdock, R. J. & Sparks, T. D. Compositionally restricted attention-based network for materials property predictions. npj Comput. Mater. 7 , 77 (2021).

Zhou, Q. et al. Learning atoms for materials discovery. Proc. Natl Acad. Sci. USA 115 , E6411–E6417 (2018).

O’Boyle, N. & Dalke, A. Deepsmiles: An adaptation of smiles for use in machine-learning of chemical structures. ChemRxiv https://doi.org/10.26434/chemrxiv.7097960.v1 (2018).

Green, H., Koes, D. R. & Durrant, J. D. Deepfrag: a deep convolutional neural network for fragment-based lead optimization. Chem. Sci. 12 , 8036–8047. https://doi.org/10.1039/D1SC00163A (2021).

Elhefnawy, W., Li, M., Wang, J. & Li, Y. Deepfrag-k: a fragment-based deep learning approach for protein fold recognition. BMC Bioinformatics 21 , 203 (2020).

Paul, A. et al. Chemixnet: Mixed dnn architectures for predicting chemical properties using multiple molecular representations. arXiv . https://arxiv.org/abs/1811.08283 (2018).

Paul, A. et al. Transfer learning using ensemble neural networks for organic solar cell screening . in 2019 International Joint Conference on Neural Networks (IJCNN) , 1–8 (IEEE, 2019).

Choudhary, K. et al. Computational screening of high-performance optoelectronic materials using optb88vdw and tb-mbj formalisms. Sci. Data 5 , 1–12 (2018).

Wong-Ng, W., McMurdie, H., Hubbard, C. & Mighell, A. D. Jcpds-icdd research associateship (cooperative program with nbs/nist). J. Res. Natl Inst. Standards Technol. 106 , 1013 (2001).

Belsky, A., Hellenbrandt, M., Karen, V. L. & Luksch, P. New developments in the inorganic crystal structure database (icsd): accessibility in support of materials research and design. Acta Crystallogr. Sec. B Struct. Sci. 58 , 364–369 (2002).

Gražulis, S. et al. Crystallography Open Database—an open-access collection of crystal structures. J. Appl. Crystallogr. 42 , 726–729 (2009).

Linstrom, P. J. & Mallard, W. G. The nist chemistry webbook: a chemical data resource on the internet. J. Chem. Eng. Data 46 , 1059–1063 (2001).

Saito, T. et al. Spectral database for organic compounds (sdbs). (National Institute of Advanced Industrial Science and Technology (AIST), 2006).

Steinbeck, C., Krause, S. & Kuhn, S. Nmrshiftdb constructing a free chemical information system with open-source components. J. Chem. inf. Computer Sci. 43 , 1733–1739 (2003).

Fung, V., Hu, G., Ganesh, P. & Sumpter, B. G. Machine learned features from density of states for accurate adsorption energy prediction. Nat. Commun. 12 , 1–11 (2021).

Kong, S., Guevarra, D., Gomes, C. P. & Gregoire, J. M. Materials representation and transfer learning for multi-property prediction. arXiv . https://arxiv.org/abs/2106.02225 (2021).

Bang, K., Yeo, B. C., Kim, D., Han, S. S. & Lee, H. M. Accelerated mapping of electronic density of states patterns of metallic nanoparticles via machine-learning. Sci. Rep . 11 , 1–11 (2021).

Chen, D. et al. Automating crystal-structure phase mapping by combining deep learning with constraint reasoning. Nat. Machine Intell. 3 , 812–822 (2021).

Ophus, C. A fast image simulation algorithm for scanning transmission electron microscopy. Adv. Struct. Chem. imaging 3 , 1–11 (2017).

Aversa, R., Modarres, M. H., Cozzini, S., Ciancio, R. & Chiusole, A. The first annotated set of scanning electron microscopy images for nanoscience. Sci. Data 5 , 1–10 (2018).

Ziatdinov, M. et al. Causal analysis of competing atomistic mechanisms in ferroelectric materials from high-resolution scanning transmission electron microscopy data. npj Comput. Mater. 6 , 1–9 (2020).

Souza, A. L. F. et al. Deepfreak: Learning crystallography diffraction patterns with automated machine learning. arXiv. http://arxiv.org/abs/1904.11834 (2019).

Scime, L. et al. Layer-wise imaging dataset from powder bed additive manufacturing processes for machine learning applications (peregrine v2021-03). Tech. Rep . https://www.osti.gov/biblio/1779073 (2021).

Somnath, S., Smith, C. R., Laanait, N., Vasudevan, R. K. & Jesse, S. Usid and pycroscopy–open source frameworks for storing and analyzing imaging and spectroscopy data. Microsc. Microanal. 25 , 220–221 (2019).

Savitzky, B. H. et al. py4dstem: A software package for multimodal analysis of four-dimensional scanning transmission electron microscopy datasets. arXiv. https://arxiv.org/abs/2003.09523 (2020).

Madsen, J. & Susi, T. The abtem code: transmission electron microscopy from first principles. Open Res. Euro. 1 , 24 (2021).

Koch, C. T. Determination of core structure periodicity and point defect density along dislocations . (Arizona State University, 2002).

Allen, L. J. et al. Modelling the inelastic scattering of fast electrons. Ultramicroscopy 151 , 11–22 (2015).

Maxim, Z., Jesse, S., Sumpter, B. G., Kalinin, S. V. & Dyck, O. Tracking atomic structure evolution during directed electron beam induced si-atom motion in graphene via deep machine learning. Nanotechnology 32 , 035703 (2020).

Khadangi, A., Boudier, T. & Rajagopal, V. Em-net: Deep learning for electron microscopy image segmentation . in 2020 25th International Conference on Pattern Recognition (ICPR) , 31–38 (IEEE, 2021).

Meyer, C. et al. Nion swift: Open source image processing software for instrument control, data acquisition, organization, visualization, and analysis using python. Microsc. Microanal. 25 , 122–123 (2019).

Kim, J., Tiong, L. C. O., Kim, D. & Han, S. S. Deep learning-based prediction of material properties using chemical compositions and diffraction patterns as experimentally accessible inputs. J. Phys. Chem Lett. 12 , 8376–8383 (2021).

Von Chamier, L. et al. Zerocostdl4mic: an open platform to simplify access and use of deep-learning in microscopy. BioRxiv. https://www.biorxiv.org/content/10.1101/2020.03.20.000133v4 (2020).

Jha, D. et al. Peak area detection network for directly learning phase regions from raw x-ray diffraction patterns . in 2019 International Joint Conference on Neural Networks (IJCNN) , 1–8 (IEEE, 2019).

Hawizy, L., Jessop, D. M., Adams, N. & Murray-Rust, P. Chemicaltagger: A tool for semantic text-mining in chemistry. J. Cheminformatics 3 , 1–13 (2011).

Corbett, P. & Boyle, J. Chemlistem: chemical named entity recognition using recurrent neural networks. J. Cheminformatics 10 , 1–9 (2018).

Rocktäschel, T., Weidlich, M. & Leser, U. Chemspot: a hybrid system for chemical named entity recognition. Bioinformatics 28 , 1633–1640 (2012).

Jessop, D. M., Adams, S. E., Willighagen, E. L., Hawizy, L. & Murray-Rust, P. Oscar4: a flexible architecture for chemical text-mining. J. Cheminformatics 3 , 1–12 (2011).

Leaman, R., Wei, C.-H. & Lu, Z. tmchem: a high performance approach for chemical named entity recognition and normalization. J. Cheminformatics 7 , 1–10 (2015).

Suzuki, Y. et al. Symmetry prediction and knowledge discovery from X-ray diffraction patterns using an interpretable machine learning approach. Sci. Rep. 10 , 21790 (2020).

Download references

Acknowledgements

Contributions from K.C. were supported by the financial assistance award 70NANB19H117 from the U.S. Department of Commerce, National Institute of Standards and Technology. E.A.H. and R.C. (CMU) were supported by the National Science Foundation under grant CMMI-1826218 and the Air Force D3OM2S Center of Excellence under agreement FA8650-19-2-5209. A.J., C.C., and S.P.O. were supported by the Materials Project, funded by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, Materials Sciences and Engineering Division under contract no. DE-AC02-05-CH11231: Materials Project program KC23MP. S.J.L.B. was supported by the U.S. National Science Foundation through grant DMREF-1922234. A.A. and A.C. were supported by NIST award 70NANB19H005 and NSF award CMMI-2053929.

Author information

Authors and affiliations.

Materials Science and Engineering Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA

Kamal Choudhary & Francesca Tavazza

Theiss Research, La Jolla, CA, 92037, USA

Kamal Choudhary

DeepMaterials LLC, Silver Spring, MD, 20906, USA

Material Measurement Science Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA

Brian DeCost

Department of NanoEngineering, University of California San Diego, San Diego, CA, 92093, USA

Chi Chen & Shyue Ping Ong

Energy Technologies Area, Lawrence Berkeley National Laboratory, Berkeley, CA, USA

Anubhav Jain

Department of Materials Science and Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213, USA

Ryan Cohn & Elizabeth Holm

Department of Materials Science and Engineering, Northwestern University, Evanston, IL, 60208, USA

Cheol Woo Park & Chris Wolverton

Department of Electrical and Computer Engineering, Northwestern University, Evanston, IL, 60208, USA

Alok Choudhary & Ankit Agrawal

Department of Applied Physics and Applied Mathematics and the Data Science Institute, Fu Foundation School of Engineering and Applied Sciences, Columbia University, New York, NY, 10027, USA

Simon J. L. Billinge

Contributions

The authors contributed equally to the search as well as analysis of the literature and writing of the manuscript.

Corresponding author

Correspondence to Kamal Choudhary.

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Choudhary, K., DeCost, B., Chen, C. et al. Recent advances and applications of deep learning methods in materials science. npj Comput Mater 8 , 59 (2022). https://doi.org/10.1038/s41524-022-00734-6

Received: 25 October 2021

Accepted: 24 February 2022

Published: 05 April 2022

DOI: https://doi.org/10.1038/s41524-022-00734-6

PeerJ Comput Sci

A comprehensive review of deep learning-based single image super-resolution

Syed Muhammad Arsalan Bashir

1 School of Electronics and Information, Northwestern Polytechnical University, Xi’an, Shaanxi, China

2 Quality Assurance, Pakistan Space and Upper Atmosphere Research Commission, Karachi, Sindh, Pakistan

Mahrukh Khan

3 Department of Computer Science, National University of Computer and Emerging Sciences, Karachi, Sindh, Pakistan

4 School of Marine Science and Technology, Northwestern Polytechnical University, Xi’an, Shaanxi, China

Associated Data

The following information was supplied regarding data availability:

The raw data for Figure 11 are available in the Supplemental File.

Image super-resolution (SR) is a vital image processing method in computer vision that improves the resolution of an image. In the last two decades, significant progress has been made in super-resolution, especially through deep learning methods. This survey reviews recent progress in single-image super-resolution from the perspective of deep learning, while also covering the early classical methods used for image super-resolution. The survey classifies image SR methods into four categories, i.e., classical methods, supervised learning-based methods, unsupervised learning-based methods, and domain-specific SR methods. We also introduce the SR problem to provide intuition about image quality metrics, available reference datasets, and SR challenges. Deep learning-based SR approaches are evaluated using a reference dataset. The reviewed state-of-the-art image SR methods include the enhanced deep SR network (EDSR), cycle-in-cycle GAN (CinCGAN), multiscale residual network (MSRN), meta residual dense network (Meta-RDN), recurrent back-projection network (RBPN), second-order attention network (SAN), SR feedback network (SRFBN), and the wavelet-based residual attention network (WRAN). Finally, the survey concludes with future directions and trends in SR and open problems to be addressed by researchers.

Introduction

Image-based computer graphics models lack resolution independence (Freeman, Jones & Pasztor, 2002) because images cannot be zoomed beyond the sample resolution without compromising image quality. This is especially the case for realistic images, for instance, natural photographs. Thus, simple image interpolation leads to blurring of the features and edges within a sample image.

The concept of super-resolution was first used by Gerchberg (1974) to improve the resolution of an optical system beyond the diffraction limit. In the past two decades, super-resolution (SR) has been defined as the method of producing a high-resolution (HR) image from a corresponding low-resolution (LR) image. Initially, this technique was classified as spatial resolution enhancement (Tsai & Huang, 1984). The applications of super-resolution include computer graphics (Kim, Lee & Lee, 2016a, b; Tao et al., 2017), medical imaging (Bates et al., 2007; Fernández-Suárez & Ting, 2008; Huang et al., 2008; Hamaide et al., 2017; Jurek et al., 2020; Teh et al., 2020; Bashir & Wang, 2021a), and security and surveillance (Zhang et al., 2010; Shamsolmoali et al., 2018; Lee, Kim & Heo, 2020), which shows the importance of this topic in recent years.

Although it has been explored for decades, image super-resolution remains a challenging task in computer vision. The problem is fundamentally ill-posed because, for any given LR image, there can be several HR images with slight variations in camera angle, color, brightness, and other variables. Furthermore, there are fundamental uncertainties between the LR and HR data, since downsampling different HR images may lead to a similar LR image, making this conversion a many-to-one process (Yang & Yang, 2013).

Existing methods of image super-resolution can be categorized into single-image super-resolution (SISR) and multiple-image approaches. In single-image SR, learning is performed on a single LR-HR pair for a single image, while in multiple-image SR, learning is performed on a large number of LR-HR pairs for a particular scene, thereby enabling the generation of an HR image from a scene (multiple images) (Kawulok et al., 2020). Video super-resolution deals with multiple successive images (frames) and uses the relationships between frames to super-resolve a target frame; it is a special type of multiple-image SR in which the images are frames of the same scene (Liu et al., 2020b).

In the past, classical SR methods such as statistical, prediction-based, patch-based, edge-based, and sparse representation methods were used to achieve super-resolution. Recently, however, advances in computational power and big data have led researchers to use deep learning (DL) to address the SR problem. In the past decade, deep learning-based SR studies have reported performance superior to that of the classical methods, and DL methods have been used frequently to achieve SR. Researchers have explored SR with a range of methods, from the first convolutional neural network (CNN)-based method (Dong et al., 2014) to the more recent generative adversarial networks (GANs) (Ledig et al., 2017). In principle, deep learning-based SR methods vary in hyper-parameters such as network architecture, learning strategies, activation functions, and loss functions.

This study first gives a brief overview of the classical methods of SR, while its main focus is an overview of the most recent research in SR using deep learning. Previous studies have explored the literature on SR, but most of them emphasize the classical methods (Borman & Stevenson, 1998; Park, Park & Kang, 2003; Van Ouwerkerk, 2006; Yang, Ma & Yang, 2014; Thapa et al., 2016); additionally, Yang, Ma & Yang (2014) and Thapa et al. (2016) used human visual perception to gauge the performance of SR methods.

In recent years, several reviews (Ha et al., 2019; Yang et al., 2019; Zhang et al., 2019c; Zhou & Feng, 2019; Li et al., 2020) have focused on deep learning-based image super-resolution. The study by Yang et al. (2019) focused on deep learning methods for single-image super-resolution. Zhang et al. (2019c) limited the scope of image SR to CNN-based methods for space applications, thereby reviewing only four methods, namely SRCNN, FSRCNN, VDSR, and DRCN. Ha et al. (2019) reviewed the state-of-the-art SISR methods and classified them by type of framework, i.e., CNN-based, RNN-CNN-based, and GAN-based methods. Zhou & Feng (2019) briefly reviewed and introduced some of the state-of-the-art SISR methods without any evaluation or comparison of the methods, while Li et al. (2020) reviewed the state-of-the-art methods in image SR with an emphasis on CNN- and GAN-based methods for real-time applications. These review papers did not encompass the domain of super-resolution as a whole, and this paper fills that research gap by providing an overview of both classical and deep learning-based methods. At the same time, we have organized the deep learning-based methods into subdomains based on their functional blocks, i.e., upsampling methods, SR networks, learning strategies, SR frameworks, and other improvements. This review thus offers a comprehensive account in which a reader can find the overall progress of image super-resolution, with dedicated sections on image quality metrics, SR methods, datasets, applications, and challenges in the field of image SR.

This survey is a comprehensive overview of the recent advances in SR, emphasizing deep learning-based approaches and their achievements. Tables S1 and S2 show the complete lists of symbols and acronyms used in this study, respectively.

The key features of this study are:

  • We give a brief overview of the classical methods in SR and their contributions in light of past studies to provide perspective.
  • We provide a detailed survey of deep learning-based SR, including the definition of the problem, dataset details, performance evaluation, deep learning methods used for SR, and specific applications where these SR methods were used and their performance.
  • We compare and contrast the recent advances in deep learning-based SR methods, summarizing the bounds of the methods and detailing the structural components they use.
  • Finally, we highlight the open problems in SR and the critical challenges that require further probing to provide future directions in SR.

This study is organized as follows:

In “Introduction”, we have introduced the concept of SR and given an overview of this study. Fig. 1 summarizes the hierarchical structure of this review. There are four main parts: classical methods, deep learning-based methods, applications of SR, and discussion and future directions. In “Super-Resolution: Definitions and Terminologies”, we put forward the problem definition and details of the evaluation dataset. “Survey Methodology” discusses the methodology for selecting the studies included in this review. In “Conventional Methods of Super-Resolution”, we compare and contrast the classical methods of SR, whereas in “Supervised Super-Resolution”, the SR methods based on supervised deep learning are explored. “Unsupervised Super-Resolution” covers the studies that used unsupervised deep learning-based methods for SR, and “Domain-Specific Applications of Super-Resolution” discusses various field-specific applications of SR in recent years. “Discussion and Future Directions” summarizes open challenges and limitations in current SR methods and puts forward future research directions, while “Conclusion” highlights the conclusions.

Figure 1. Hierarchical structure of this review. The four main categories are (a) classical methods of image super-resolution, (b) deep learning-based methods for SR, (c) applications of super-resolution, and (d) future research and directions in SR. Green represents first-level sections, blue represents second-level subsections, and orange represents third-level subsections.

Super-resolution: definitions and terminologies

In this section, the problem definition and the associated concepts of image super-resolution are discussed in light of the literature review.

Single image super-resolution—problem definition

Image SR focuses on recovering an HR image from an LR input image. In principle, the LR image $I_x^{LR}$ can be represented as the output of a degradation function, as shown in (1):

$$I_x^{LR} = d\left(I_y^{HR}, \partial\right) \tag{1}$$

where $d$ is the degradation function that converts the HR image to the LR image, $I_y^{HR}$ is the input HR image (reference image), and $\partial$ denotes the input parameters of the degradation function. The degradation parameters are usually the scaling factor, blur type, and noise. In practice, the degradation process and its parameters are unknown, and the SR method has only LR images from which to obtain HR images. The SR process is responsible for predicting the inverse of the degradation function $d$, such that $g = d^{-1}$:

$$I_y^{E} = g\left(I_x^{LR}, \delta\right) \tag{2}$$

where $g$ is the SR function, $\delta$ denotes the input parameters of $g$, and $I_y^{E}$ is the estimated HR image corresponding to the input LR image $I_x^{LR}$. It is also worth noting that the super-resolution problem in (2) is ill-posed because the degradation function $d$ is non-injective; thus, there are infinitely many candidates $I_y^{E}$ for which the condition $d\left(I_y^{E}, \partial\right) = I_x^{LR}$ holds.
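
To make the ill-posedness concrete, the following minimal Python sketch shows two different HR images that degrade to the same LR image. It assumes a simple block-averaging downsampler as a stand-in for the degradation function; the helper `downsample_mean` and the toy 2×2 images are illustrative and do not come from the reviewed studies.

```python
def downsample_mean(image, sf):
    """Average non-overlapping sf x sf blocks.

    Assumed stand-in for the degradation function; the image is a list of
    rows whose height and width are divisible by sf.
    """
    rows, cols = len(image), len(image[0])
    return [
        [
            sum(image[r + i][c + j] for i in range(sf) for j in range(sf)) / (sf * sf)
            for c in range(0, cols, sf)
        ]
        for r in range(0, rows, sf)
    ]


# Two visibly different 2x2 HR "images" (a checkerboard and a flat gray patch).
hr_a = [[0.0, 1.0], [1.0, 0.0]]
hr_b = [[0.5, 0.5], [0.5, 0.5]]

# Both collapse to the same single LR pixel, so the LR image alone
# cannot tell us which HR image produced it.
lr_a = downsample_mean(hr_a, 2)
lr_b = downsample_mean(hr_b, 2)
print(lr_a, lr_b)
```

Any SR method therefore has to choose among many HR candidates that are all consistent with the observed LR image, which is why additional priors or learned models are needed.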

The degradation process for the input LR images is unknown, and it is affected by numerous factors such as sensor-induced noise, artifacts from lossy compression, speckle noise, motion blur, and misfocus. Most studies in the literature have used a single downsampling function as the image degradation function:

$$d\left(I_y^{HR}, \partial\right) = \left(I_y^{HR}\right)\downarrow_{sf}, \quad \{sf\} \subset \partial \tag{3}$$

where $\downarrow_{sf}$ is the downsampling operator and $sf$ is the scaling factor. One of the most frequently used downsampling functions in SR is bicubic interpolation with antialiasing (Shi et al., 2016; Zhang & An, 2017; Shocher, Cohen & Irani, 2018). Some studies, such as Zhang, Zuo & Zhang (2018), have included more operations in the downsampling function, giving the overall degradation:

$$d\left(I_y^{HR}, \partial\right) = \left(I_y^{HR} \otimes \kappa\right)\downarrow_{sf} + n_{\sigma}, \quad \{\kappa, sf, \sigma\} \subset \partial \tag{4}$$

where $I_y^{HR} \otimes \kappa$ denotes the convolution of the HR image $I_y^{HR}$ with the blurring kernel $\kappa$, and $n_{\sigma}$ is additive white Gaussian noise with standard deviation $\sigma$. The degradation function defined in (4) and Fig. 2 is closer to the actual degradation, as it considers more parameters than the simple downsampling function (Zhang, Zuo & Zhang, 2018).

Fig. 2. Noise is added to simulate realistic degradation within an image.
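The blur, downsample, and add-noise degradation described above can be illustrated with a minimal sketch. The function names, the 3 × 3 box kernel, and the use of plain decimation in place of bicubic downsampling are assumptions made for brevity; plain Python lists keep the example dependency-free.

```python
import random

def blur(img, kernel):
    """Filter a 2-D image (list of rows) with a small odd-sized kernel.

    Edges are replicated. This is a correlation, which equals a
    convolution for the symmetric kernels used here.
    """
    h, w = len(img), len(img[0])
    k = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-k, k + 1):
                for dx in range(-k, k + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx] * kernel[dy + k][dx + k]
            out[y][x] = acc
    return out

def degrade(hr, kernel, sf, sigma, rng):
    """LR = (HR blurred with kernel), decimated by sf, plus Gaussian noise."""
    blurred = blur(hr, kernel)
    lr = [row[::sf] for row in blurred[::sf]]  # keep every sf-th pixel
    return [[v + rng.gauss(0.0, sigma) for v in row] for row in lr]

# Hypothetical 8x8 test image and a 3x3 box blur (assumed kernel).
box = [[1 / 9] * 3 for _ in range(3)]
hr = [[float((x + y) % 256) for x in range(8)] for y in range(8)]
lr = degrade(hr, box, sf=2, sigma=1.0, rng=random.Random(0))
```

With `sf=2`, the 8 × 8 input yields a 4 × 4 LR image; setting `sigma=0.0` and using an identity kernel reduces the sketch to the simple downsampling degradation.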

Finally, the purpose of SR is to minimize the loss function as follows:

$$\hat{\phi} = \arg\min_{\phi} L(I_y^{E}, I_y^{HR}) + h\,\Psi(\phi) \tag{5}$$

where $L(I_y^{E}, I_y^{HR})$ is the loss function between the output HR image of SR and the actual HR image, $h$ is the tradeoff parameter, $\Psi(\phi)$ is the regularization term, and $\phi$ denotes the parameters being optimized. The most common loss function used in SR is the pixel-based mean square error (MSE), also referred to as pixel loss. In recent years, researchers have used combinations of various loss functions, and these combinations are further explored in later sections. Further mathematical modeling of the SR problem is discussed in Candès & Fernandez-Granda (2014) .

Methods for quality of SR images

Image quality can be defined in several ways depending on the measurement method; it is generally a measure of the visual attributes of an image as perceived by viewers. Image quality assessment (IQA) methods are categorized into subjective methods (human judgment of whether an image looks natural and of good quality) and objective methods (quantitative methods by which image quality can be numerically computed) ( Thung & Raveendran, 2009 ).

Subjective assessment of the quality-related visual aspects of an image is mostly a good measure, but it requires more resources, especially if the dataset is large ( Wei, Yuan & Cai, 1999 ); thus, in SR and computer vision tasks, objective methods are more suitable. As per ( Saad, Bovik & Charrier, 2012 ), objective IQA methods are primarily divided into three categories: full-reference methods that use the reference image, reduced-reference methods based on features extracted from the actual image, and blind (no-reference) IQA with no information about the ground truth. In this section, the IQA methods primarily used in the domain of SR are further explored.

Peak signal-to-noise ratio

The peak signal-to-noise ratio (PSNR) compares the power of a signal with the power of the noise that corrupts it; for images, it is used as a quantitative measure of the compression quality of an image. In super-resolution, the PSNR of an image is defined by the maximum pixel value and the mean square error (MSE) between the reference image and the SR image, also known as the power of image distortion noise. For a given maximum pixel value $M$, a reference image $I_r$ having $t$ pixels, and the SR image $I_y$, the peak signal-to-noise ratio is defined as:

$$PSNR = 10 \log_{10}\left(\frac{M^2}{MSE}\right) \tag{6}$$

where $M$ is usually 255, the maximum value for an 8-bit color depth, and the $MSE$ is given by:

$$MSE = \frac{1}{t}\sum_{i=1}^{t}\left(I_r(i) - I_y(i)\right)^2 \tag{7}$$

As seen from (6) and (7), the PSNR depends only on the individual pixel intensity values of the SR image and the reference image and is therefore a pixel-based metric of image quality. In some cases ( Almohammad & Ghinea, 2010 ; Horé & Ziou, 2010 ; Goyal, Lather & Lather, 2015 ), this quality metric can be misleading, as the overall image might not be visually similar to the reference image. The metric is still used for image comparisons, especially for comparing the results of a new SR algorithm with previously published results.

For color images, the MSE is averaged over the color channels; an alternative approach is to measure the PSNR on the luminance or greyscale channel only, as the human eye is more sensitive to changes in luminance than to changes in chrominance ( Dabov et al., 2006 ).
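As a minimal sketch of the PSNR and MSE definitions above, the following operates on flattened pixel sequences. The function names and the choice to return infinity for identical images are assumptions of this example.

```python
import math

def mse(ref, est):
    """Mean square error between two equally long pixel sequences."""
    t = len(ref)
    return sum((r - e) ** 2 for r, e in zip(ref, est)) / t

def psnr(ref, est, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value is M (255 for 8-bit)."""
    err = mse(ref, est)
    if err == 0:
        return math.inf  # identical images: no distortion noise
    return 10.0 * math.log10(max_value ** 2 / err)

# Hypothetical two-pixel example: every pixel is off by 10 grey levels.
value = psnr([100, 100], [110, 90])
```

For a color image, one option consistent with the text is to flatten all channels into one sequence, which averages the MSE over the channels; another is to pass only the luminance channel.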

Structural similarity index

Human visual perception is efficient at extracting the structural information within an image, and PSNR does not consider the structural composition of the image ( Rouse & Hemami, 2008 ). The structural similarity index metric (SSIM) was proposed by ( Wang et al., 2004 ) to measure the structural similarity between images by comparing the contrast, luminance, and structural details of an image with those of the reference image.

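For an image $I_r$ with a total of $P$ pixels, the luminance $L_{I_r}$ and the contrast $C_{I_r}$ can be denoted as the mean and the standard deviation of the image intensity, given by:

$$L_{I_r} = \frac{1}{P}\sum_{i=1}^{P} I_r(i)$$

$$C_{I_r} = \sqrt{\frac{1}{P-1}\sum_{i=1}^{P}\left(I_r(i) - L_{I_r}\right)^2}$$

where the $i$-th pixel of the reference image is denoted by $I_r(i)$. The comparisons based on the luminance and contrast between the reference image $I_r$ and the estimated image $\hat{I}$ are:

$$Com_l(I_r, \hat{I}) = \frac{2 L_{I_r} L_{\hat{I}} + \mu_1}{L_{I_r}^2 + L_{\hat{I}}^2 + \mu_1}$$

$$Com_c(I_r, \hat{I}) = \frac{2 C_{I_r} C_{\hat{I}} + \mu_2}{C_{I_r}^2 + C_{\hat{I}}^2 + \mu_2}$$

where $\mu_1 = (k_1 S)^2$ and $\mu_2 = (k_2 S)^2$ are constants that ensure stability, with $k_1 \ll 1$, $k_2 \ll 1$, and $S$ denoting the dynamic range of the pixel values (255 for 8-bit images).

The normalized pixel values $(I_r - L_{I_r}) / C_{I_r}$ represent the image structure, and their inner product is equivalent to the structural similarity between the reference image $I_r$ and the estimated image $\hat{I}$. The covariance $\sigma_{I_r, \hat{I}}$ is given by:

$$\sigma_{I_r, \hat{I}} = \frac{1}{P-1}\sum_{i=1}^{P}\left(I_r(i) - L_{I_r}\right)\left(\hat{I}(i) - L_{\hat{I}}\right)$$

The function for structural comparison $Com_s(I_r, \hat{I})$ is given by:

$$Com_s(I_r, \hat{I}) = \frac{\sigma_{I_r, \hat{I}} + \mu_3}{C_{I_r} C_{\hat{I}} + \mu_3}$$

where $\mu_3$ is a stability constant. The final structural similarity index (SSIM) is given by:

$$SSIM(I_r, \hat{I}) = \left[Com_l(I_r, \hat{I})\right]^{\alpha} \left[Com_c(I_r, \hat{I})\right]^{\beta} \left[Com_s(I_r, \hat{I})\right]^{\gamma}$$

The control parameters $\alpha$, $\beta$ and $\gamma$ can be adjusted to change the relative importance of the luminance, contrast, and structural comparisons in calculating the $SSIM$.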
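The three comparison terms can be combined in a short sketch that computes a single, global SSIM value over flattened pixel sequences. The default constants (k1 = 0.01, k2 = 0.03, and a third constant equal to half the second) and the use of one global window are assumptions of this example; practical implementations usually evaluate the index over local windows and average the results.

```python
import math

def _mean(v):
    return sum(v) / len(v)

def _std(v, m):
    # Sample standard deviation (divides by P - 1).
    return math.sqrt(sum((x - m) ** 2 for x in v) / (len(v) - 1))

def ssim_global(ref, est, S=255.0, k1=0.01, k2=0.03,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Global SSIM from luminance, contrast, and structure comparisons."""
    mu1, mu2 = (k1 * S) ** 2, (k2 * S) ** 2
    mu3 = mu2 / 2.0  # assumed choice for the third stability constant
    l_r, l_e = _mean(ref), _mean(est)
    c_r, c_e = _std(ref, l_r), _std(est, l_e)
    cov = sum((r - l_r) * (e - l_e) for r, e in zip(ref, est)) / (len(ref) - 1)
    com_l = (2 * l_r * l_e + mu1) / (l_r ** 2 + l_e ** 2 + mu1)
    com_c = (2 * c_r * c_e + mu2) / (c_r ** 2 + c_e ** 2 + mu2)
    com_s = (cov + mu3) / (c_r * c_e + mu3)
    return (com_l ** alpha) * (com_c ** beta) * (com_s ** gamma)

same = ssim_global([10, 20, 30, 40], [10, 20, 30, 40])
flipped = ssim_global([10, 20, 30, 40], [40, 30, 20, 10])
```

Identical inputs give a value of 1 (up to rounding), while the reversed sequence has negative covariance with the reference and scores far lower, even though its mean and standard deviation are unchanged.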

Conventionally, PSNR is used in computer vision tasks for evaluation, whereas SSIM is based on human perception of structural information within an image and is therefore widely used for comparing the structural similarity between images ( Blau & Michaeli, 2018 ; Sara, Akter & Uddin, 2019 ). In medical images where the variance or luminance of the reference image is low, SSIM can be very unstable and report misleading results; however, this is not the case for natural images ( Pambrun & Noumeir, 2015 ).

Opinion scoring

Opinion scoring is a qualitative method, which lies in the subjective category of IQA. In this method, the quality testers are asked to grade the quality of images based on specific criteria, e.g., sharpness, natural look, and color, where the final graded score is the mean of the rated scores.

This method has limitations such as non-linearity between the scores, variation in results due to changes in test criteria, and human error. In SR, certain methods have reported good objective quality scores but scored poorly in subjective results, especially in human face reconstruction ( Ledig et al., 2017 ; Nasrollahi & Moeslund, 2014 ; Chen et al., 2018b ). Thus, the opinion scoring method is also used in studies ( Wei, Yuan & Cai, 1999 ; Deng, 2018 ; Ravì et al., 2018 , 2019 ; Vasu, Thekke Madam & Rajagopalan, 2019 ) to measure the quality of human perception.

Perceptual quality

Opinion scoring uses human raters for manual evaluation of the images; while this method can provide accurate results as far as human perception is concerned, it requires many resources, especially for large datasets ( Viswanathan & Viswanathan, 2005 ). Kim & Lee (2017) first proposed a CNN-based full-reference image quality assessment (FR-IQA) model, called DeepQA, in which human behavior was learned using an IQA database that contained distorted images, subjective scores, and error maps.

In Ma et al. (2017a) , the authors used quality-discriminable image pairs (DIP) for training, and the system was called dipIQ (DIP inferred quality index); they used RankNet, a learning-to-rank (L2R) algorithm, to learn opinion-unaware blind IQA. In Ma et al. (2018) , a multi-task end-to-end optimized deep neural network (MEON) was proposed. MEON is trained in two stages: in the first stage, a distortion-type identification network is trained using the large datasets already available, and in the second stage, the output of the first stage is used to train the quality assessment network using stochastic gradient descent. In Talebi & Milanfar (2018) , the authors used a CNN to develop a no-reference IQA method known as NIMA, which was trained on pixel-level and aesthetic quality datasets.

RankIQA ( Liu, Van De Weijer & Bagdanov, 2017 ) trained a Siamese network to rank the quality of images using datasets with known image distortions; CNNs were used to learn the IQA, and this method even outperformed full-reference methods without using the reference image. The IQA model proposed in Bosse et al. (2018) included ten convolution layers and five pooling layers for feature extraction and two fully connected layers for regression; this method performed well for both no-reference and full-reference IQA.

Even though opinion scoring and perceptual quality-based methods reflect human perception in IQA, the kind of quality required is still an open question (i.e., whether we want images to be more natural or more similar to the reference image); thus, PSNR and SSIM remain the primary methods used in computer vision and SR.

Task-based evaluation

Although the primary purpose of image SR is to achieve better resolution, as mentioned earlier, SR is also helpful in other computer vision tasks ( Kim, Lee & Lee, 2016b ; Tao et al., 2017 ; Teh et al., 2020 ; Liu et al., 2018a ). The performance achieved in these can indirectly measure the performance of the SR methods used in those tasks. In the case of medical images, the researchers used the original and SR constructed images to see the performance in the training and prediction phases. In general, computer vision tasks such as classification ( Krizhevsky, Sutskever & Hinton, 2012 ; Cai et al., 2019 ), face recognition ( Nasrollahi & Moeslund, 2014 ; Liu et al., 2015 ; Chen et al., 2018b ), and object segmentation ( Martin et al., 2001 ; Lin et al., 2014 ; Wang et al., 2018b ) can be done using SR images. The performance of these computer vision tasks can be used as a metric to assess the performance of the SR method.

Miscellaneous IQA methods

The development of IQA methods is an open field, and in recent years various researchers have proposed SR metrics, but these methods have not been widely used by the SR community. The feature similarity (FSIM) index ( Zhang et al., 2011 ) evaluates image quality by extracting the feature points considered by the human visual system, based on gradient magnitude and phase congruency. The multi-scale structural similarity (MS-SSIM) ( Wang, Simoncelli & Bovik, 2003 ) used multiple scales to incorporate variations in the viewing conditions when measuring image quality, and its authors proposed that MS-SSIM provides more flexibility in the measurement of image quality than single-scale SSIM. In Li & Bovik (2010) , the authors claimed that SSIM and MS-SSIM do not perform well on distorted and noisy images; thus, they used a four-component weighted method that adjusted the weight of the scores based on the local features. Similarly, SSIM does not perform well on contrast-distorted images, such as those in the TID2013 and CSIQ datasets ( Yao & Liu, 2018 ).

According to Blau & Michaeli (2018) , perceptual quality and image distortion are at odds with each other: as the distortion decreases, the perceptual quality tends to become worse; thus, the accurate measurement of SR image quality is still an open area of research.

The comparison of image quality assessment metrics for super-resolution is shown in Table 1 . The choice of metric depends on the requirements of the method; most methods use PSNR and SSIM to evaluate performance, as these are quantitative metrics.

Operating color channels

In most datasets, the RGB color space is used; thus, SR methods mostly employ RGB images, although the YCbCr space is also used in SR ( Dong et al., 2016 ). The Y component in YCbCr is the luminance component, which represents the light intensity, while Cb and Cr are the chrominance components (i.e., the blue-difference and red-difference chroma channels) ( Shaik et al., 2015 ). In recent years, most SR challenges and datasets have used the RGB color space, which restricts comparisons with the state of the art to the RGB space. Furthermore, the results of IQA based on PSNR vary if the color space in the testing stage is different from that of the training/evaluation stage.
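A minimal sketch of the RGB to YCbCr conversion is given below for 8-bit values. It assumes the full-range ITU-R BT.601 coefficients used by JPEG/JFIF; some toolchains use a limited-range variant with different offsets and scaling, which is one way PSNR values computed on the Y channel can differ between evaluations.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to full-range BT.601 YCbCr (assumed variant)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b               # luminance
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b  # blue-difference chroma
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b  # red-difference chroma
    return y, cb, cr

# A neutral grey pixel carries no chroma: Cb and Cr sit at the 128 midpoint.
grey = rgb_to_ycbcr(90, 90, 90)
```

Evaluating PSNR on `y` alone follows the luminance-only option mentioned for color images.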

Details of the reference dataset

The datasets used in evaluating SR algorithms are summarized in this section; they vary in the total number of example images, image resolution, quality, and imaging hardware setup. A few of the datasets comprise paired LR-HR images for training and testing SR algorithms. The rest of the datasets include only HR images, and the corresponding LR images are usually generated by using bicubic interpolation with antialiasing, as performed in Shi et al. (2016) , Zhang & An (2017) and Shocher, Cohen & Irani (2018) . In MATLAB, this corresponds to the function imresize(I, scale), whose default method is bicubic interpolation with antialiasing and where scale is the downsampling factor passed to the function.

Table 2 comprises a list of datasets frequently used in SR and information on total image count, image format, pixel count, HR resolution, type of dataset, and classes of images.

Most of the datasets for SR are unpaired data, and the LR images are generated using various scale factors using bicubic interpolation with antialiasing. Other than the mentioned datasets in Table 2 , datasets like General-100 ( Dong, Loy & Tang, 2016 ), L20 ( Timofte, Rothe & Van Gool, 2016 ) and ImageNet ( Deng et al., 2009 ) are also used in computer vision tasks. In recent times, researchers have preferred the use of multiple datasets for training/evaluation and testing the SR models; for instance, in Bashir & Ghouri (2014) , Lai et al. (2017) , Sajjadi, Scholkopf & Hirsch (2017) and Tong et al. (2017) , the researchers used SET5, SET14, BSDS100 and URBAN100 for training and testing.

Super-resolution challenges

The most prominent SR challenges, NTIRE ( Agustsson & Timofte, 2017 ; Timofte et al., 2017 ) and PIRM ( Blau et al., 2018 ), are discussed in this section.

The New Trends in Image Restoration and Enhancement (NTIRE) challenge ( Agustsson & Timofte, 2017 ; Timofte et al., 2017 ) was in collaboration with the Conference on Computer Vision and Pattern Recognition (CVPR). NTIRE includes various challenges like colorization, image denoising, and SR. In the case of SR, the DIV2K dataset ( Agustsson & Timofte, 2017 ) was used, which included bicubic downscaled image pairs and blind images with realistic but unknown degradation. This dataset has been widely used to evaluate SR methods under known and unknown conditions to compare against the state-of-the-art methods.

The Perceptual Image Restoration and Manipulation (PIRM) challenges were held in collaboration with the European Conference on Computer Vision (ECCV), and like NTIRE, PIRM contained multiple challenges. Apart from the three challenges mentioned for NTIRE, PIRM also focused on SR for smartphones and compared perceptual quality with generation accuracy ( Blau et al., 2018 ). As mentioned by Blau & Michaeli (2018) , models that focus on distortion often give visually unpleasant SR images, while models focusing on perceptual image quality do not perform well on information fidelity. Using the image quality metrics NIQE ( Mittal, Soundararajan & Bovik, 2013 ) and ( Ma et al., 2017b ), the method that performed best in achieving perceptual quality ( Blau & Michaeli, 2018 ) was the winner. In contrast, in a sub-challenge ( Ignatov et al., 2018b ), SR methods were evaluated under limited resources to assess SR performance on smartphones using the PSNR, MS-SSIM, and opinion scoring metrics. Thus, PIRM encouraged researchers to explore the perception-distortion tradeoff and SR for smartphones.

Survey methodology

The majority of the studies included in this review paper are peer-reviewed publications to ensure the validity of the methods; these studies include conference proceedings and journal papers. The included papers comprise early-access and published versions of recent super-resolution papers from 2008 to 2021. However, some papers on classical methods and some initial papers on image SR from before this range were also included to give the background of the classical methods developed before the deep learning-based methods overtook the field. Google Scholar, IEEE Xplore, and Science Direct were queried to collect the initial list of papers in this research. Specific keywords were used to search the databases, and papers were selected based on their abstracts. A further selection of papers was made using the reference sections of the selected papers, as they contain additional relevant studies in image super-resolution. The last query was made on May 08, 2021. The collected papers were segregated based on their relevance to each section; for example, papers on supervised learning were stored separately for review in “Supervised Super-Resolution”, and studies highlighting the applications of SR methods were grouped for discussion in “Domain-Specific Applications of Super-Resolution”.

The relevant search terms include image super-resolution, super-resolution, deep learning super-resolution, convolutional neural networks, image upsampling methods, super-resolution frameworks, supervised super-resolution, unsupervised super-resolution, super-resolution review, image interpolation, pixel-based methods, super-resolution application, assisted diagnosis using deep learning. The search keywords were not limited to single-image SR because our target was to report other aspects of super-resolution, including classical methods, applications, and datasets for SR.

Logical operators and wildcards were used to combine the keywords further and perform the additional search. Initial screening of the collected papers was performed following the inclusion/exclusion criteria shown in Table 3 . The whole process is graphically shown in Fig. 3 , where 653 studies were collected over one year. A total of 242 studies were included from the initial 653 collected research studies.

Fig. 3. Sample details based on inclusion/exclusion criteria defined in Table 5 .

Conventional methods of super-resolution

Classical methods of SR are briefly discussed in this section to encompass the overall development cycle of the SR. The classical methods include prediction-based, edge-based, statistical, patch-based and sparse representation methods.

The earliest methods were based on prediction. The first method ( Duchon, 1979 ) was based on Lanczos filtering, which filtered the digital data using sigma factors (with a modifiable weight function), and a similar frequency-domain filtering approach was used in Tsai & Huang (1984) for image resampling. Cubic convolution ( Keys, 1981 ) was also used for resampling image data, and the results showed that this prediction method was more accurate than the nearest-neighbor algorithm and linear interpolation of image data ( Parker, Kenyon & Troxel, 1983 ). In Tsai & Huang (1984) , the authors did not consider the blur in the imaging process, while Irani & Peleg (1991) used knowledge of the imaging process and the relative displacements for image interpolation; when the sampling rate was kept constant, this method reduced to deblurring.
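To make the cubic convolution idea concrete, the following sketch interpolates a one-dimensional signal with the piecewise-cubic kernel commonly attributed to Keys, using the parameter a = -0.5. The function names and the edge-replication rule are assumptions of this example; two-dimensional bicubic resampling applies the same kernel along rows and columns.

```python
import math

def keys_kernel(x, a=-0.5):
    """Piecewise-cubic interpolation kernel with support [-2, 2]."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x ** 3 - (a + 3) * x ** 2 + 1
    if x < 2:
        return a * x ** 3 - 5 * a * x ** 2 + 8 * a * x - 4 * a
    return 0.0

def cubic_interpolate(samples, pos):
    """Value at fractional index pos, from the four nearest samples."""
    base = math.floor(pos)
    total = 0.0
    for k in range(base - 1, base + 3):
        idx = min(max(k, 0), len(samples) - 1)  # replicate edge samples
        total += samples[idx] * keys_kernel(pos - k)
    return total

# A linear ramp is reproduced exactly between its samples.
midpoint = cubic_interpolate([0.0, 1.0, 2.0, 3.0, 4.0], 1.5)
```

The kernel equals 1 at zero and 0 at the other integers, so the interpolated signal passes through the original samples.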

The patch-based approach was used in Freeman, Jones & Pasztor (2002) ; the authors used a training set where various patches within the training set were extracted as training patterns, which helped generate detailed high-frequency images using the patch texture information. In Chang, Yeung & Xiong (2004) , the authors used locally linear embedding to use local patches for generating high-resolution images based on the local patch features. In contrast, ( Glasner, Bagon & Irani, 2009 ) used the concept of reoccurrence of geometrically similar patches in natural images to select the best possible pixel value based on the patch redundancy on the same scales. In Baker & Kanade (2002) , the authors introduced the concept of hallucination, where they extracted local features within the LR image first and used these to map the HR image.

Edge-based methods use edge smoothness priors to upsample images. In Sun, Xu & Shum (2008) , a generic image prior, the gradient profile prior, was used to smooth the edges within an image to achieve super-resolution in natural images. In Freedman & Fattal (2011) , the authors used specially designed filters to search for similar patches using the local self-similarity observation, which required fewer nearest-patch computations; this method was able to reconstruct realistic-looking edges, but it performed poorly in clustered regions with fine details.

Statistical methods were also used to perform image super-resolution: in Kim & Kwon (2010) , the authors used kernel ridge regression (KRR) with gradient descent to learn the mapping function from image example pairs. In Xiong, Sun & Wu (2010) , adaptive regularization was used to supervise the energy change during the iterative image resampling process; this provided more accurate results, as the energy map was used to limit the energy change per iteration, which reduced the noise while maintaining the perceptual quality. Yang et al. (2010) and Yang et al. (2008) used sparse representation methods, based on the concept of compressed sensing, to perform image super-resolution.

The robust SR method proposed in Zomet, Rav-Acha & Peleg (2001) used the information of outliers to improve the performance of SR in patches where other methods introduce noise due to these outliers. Additionally, Yang et al. (2007) proposed a post-processing model that enhanced the resolution of a set of images using a single reference image up to a 100x scaling factor. Another way to achieve SR is to use multiple LR images to obtain a single HR image ( Tipping & Bishop, 2003 ). Conventional upsampling methods, such as interpolation-based ones, use only the information within the LR image to generate HR images and do not add any new information to the image ( Farsiu et al., 2004a ). Furthermore, they also introduce some inherent problems, such as noise amplification and blur enhancement. Thus, in recent years, researchers have shifted to the learning-based upsampling methods explored in “Supervised Super-Resolution”.

Supervised super-resolution

Various deep learning methods have been developed over the years to solve the SR problem; the models discussed in this section are trained using both low- and high-resolution images (LR-HR pairs). Although there are significant differences among the supervised SR models, they can be classified based on four components: the upsampling method employed, the deep learning network, the learning algorithm, and the model framework. Any supervised image SR model is based on a combination of these components, and in this section, we summarize the methods employed for these four components in light of recent supervised image SR research studies.

The component-based review of various methods is performed in this section, and the basic overview of the models is shown in Fig. 1 .

Upsampling methods

Upsampling is essential in deep learning-based SR methods: both its position within the network and the method used to perform it have a significant impact on the training and test performance of the model. There are some commonly used methods ( Yang et al., 2010 , 2008 ; Lee, Yang & Oh, 2015 ; Timofte, De Smet & Van Gool, 2015 ), which use conventional CNNs for end-to-end learning. In this subsection, various deep learning-based upsampling layers are discussed.

As mentioned in “Conventional Methods of Super-Resolution”, the interpolation-based methods of upsampling do not add any new information; hence, learning-based methods have been used in image SR over the last decade.

Sub-pixel layer

The end-to-end learning layer ( Shi et al., 2016 ), called the sub-pixel layer, performs upsampling by generating several additional channels using convolution and then reshaping these channels, as shown in Fig. 4 . In this layer, a convolution produces $s_f^2$ times as many channels as the output image requires, where $s_f$ is the scaling factor, as shown in Fig. 4B . Since the input image size is $h \times w \times c$, where $h$ is the height, $w$ is the width, and $c$ is the number of color channels, the output of the convolution has size $h \times w \times c\,s_f^2$. A reshuffling ( Shi et al., 2016 ) operation is then performed to obtain the final output image of size $s_f h \times s_f w \times c$, as shown in Fig. 4C . Since it is an end-to-end layer, this layer is frequently used in SR models ( Ledig et al., 2017 ; Zhang, Zuo & Zhang, 2018 ; Ahn, Kang & Sohn, 2018a ; Zhang et al., 2018b ).

Fig. 4. (A) Input. (B) Convolution. (C) Reshaping.

This layer has a wide receptive field, which helps it learn more contextual information and generate realistic details; however, it may generate some false artifacts at the boundaries of complex patterns due to the uneven distribution of the receptive field. Furthermore, predicting the neighborhood pixels in a block-type region sometimes results in unsmooth outputs that do not look realistic when compared with the true HR image. To address this issue, PixelTCL ( Gao et al., 2020 ) was proposed; it uses an interdependent prediction layer that exploits the information of the interlinked pixels during upsampling. The results were smoother and more realistic when compared with the ground truth image.
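The reshuffling step of the sub-pixel layer can be sketched as follows. The sketch assumes a channel-first layout and the channel ordering commonly used by deep learning frameworks (channel index c·r² + i·r + j feeds sub-pixel position (i, j) of output channel c); the convolution that produces the extra channels is omitted, and the function name is hypothetical.

```python
def pixel_shuffle(feat, r):
    """Rearrange feature maps of shape [c*r*r][h][w] into [c][h*r][w*r]."""
    channels = len(feat) // (r * r)
    h, w = len(feat[0]), len(feat[0][0])
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(channels)]
    for c in range(channels):
        for i in range(r):          # vertical sub-pixel offset
            for j in range(r):      # horizontal sub-pixel offset
                src = feat[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out

# Four 1x1 feature maps become one 2x2 output channel for r = 2.
shuffled = pixel_shuffle([[[0]], [[1]], [[2]], [[3]]], 2)
```

Each group of r² feature maps fills one r × r block per LR position, which is why neighboring output pixels inside a block are predicted independently of one another.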

Deconvolution layer

The deconvolution layer, also referred to as the transposed convolution layer ( Zeiler et al., 2010 ), is the converse of convolution, i.e., it predicts the probable HR image based on the feature maps from the LR image. In this process, additional zeros are inserted to increase the resolution, and afterwards, convolution is performed. For instance, taking a scaling factor of 2 for the SR image and a convolution kernel of $3 \times 3$ (as shown in Figs. 5A, 5B and 5C), the input LR image is expanded to twice its size by inserting zeros, and convolution with the kernel is then performed using a stride and padding of 1.

Fig. 5. (A) Input. (B) Expansion. (C) Convolution.

The deconvolution layer is widely used in SR methods ( Sroubek, Cristobal & Flusser, 2008 ; Hugelier et al., 2016 ; Lam et al., 2017 ; Tong et al., 2017 ; Haris, Shakhnarovich & Ukita, 2018 ), as it generates HR images in an end-to-end way and is compatible with vanilla convolution. As per Odena, Dumoulin & Olah (2017) , in some cases this layer may cause uneven overlapping within the generated HR image, as patterns are replicated in a checkerboard-like fashion; this may result in a non-realistic HR image, thereby decreasing the performance of the SR method.
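The expand-then-convolve procedure can be sketched as below for a single channel. Placing each LR pixel at the top-left of its block and using zero padding are assumptions of this example, and the all-ones kernel stands in for a learned one.

```python
def zero_insert(img, sf):
    """Expand an image by sf, placing each pixel at (y*sf, x*sf) among zeros."""
    h, w = len(img), len(img[0])
    out = [[0.0] * (w * sf) for _ in range(h * sf)]
    for y in range(h):
        for x in range(w):
            out[y * sf][x * sf] = img[y][x]
    return out

def conv3x3_same(img, kernel):
    """3x3 filtering with stride 1 and zero padding 1 (output size = input size)."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += img[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out

def deconv_upsample(lr, kernel, sf=2):
    """Upsample by zero insertion followed by a 3x3 convolution."""
    return conv3x3_same(zero_insert(lr, sf), kernel)

ones = [[1.0] * 3 for _ in range(3)]  # placeholder for a learned kernel
sr = deconv_upsample([[1.0, 2.0], [3.0, 4.0]], ones)
```

In this example the output pixel at (0, 0) is influenced by one LR pixel, the one at (0, 1) by two, and the one at (1, 1) by four, which illustrates the uneven overlap behind checkerboard-like patterns.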

Meta upscaling

The scaling factor was predefined in the previously mentioned methods, so multiple upsampling modules with different factors must be trained, which is often inefficient and does not meet the actual requirements of an SR method. To address this, a meta upscaling module ( Hu et al., 2019 ) was proposed; this module uses arbitrary scaling factors to generate SR images based on meta-learning. The meta upscaling module projects every position in the required HR image to a small patch of size $j \times j \times c_i$ in the given LR feature maps, where $j$ is arbitrary and $c_i$ is the total number of channels within the extracted feature map (64 in Hu et al., 2019 ). Additionally, it generates the convolution weights of size $j \times j \times (c_i \times c_o)$, where $c_o$ represents the number of output image channels, which is usually 3. Thus, the meta upscaling module handles arbitrary scaling factors within a single model, and using a substantial training set, a large number of factors are trained simultaneously. The performance of this layer even surpasses the results produced with fixed-factor models, and even though this module predicts the weights at inference time, the execution time for weight prediction is about 100 times less than the total time required for feature extraction ( Hu et al., 2019 ). In cases where larger magnifications are needed, this module may become unstable, as it predicts the convolution weights for every pixel independently of the image information within those pixels.

This upscaling method has been frequently used in recent years, particularly in post-upsampling frameworks (“Post-Upsampling SR”). The high-level representations extracted from the low-level information are used to construct an HR image using meta upscaling in the last layer of the model, making this method an end-to-end SR approach.
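The projection of HR positions onto the LR feature map can be sketched as follows for a non-integer scale. The floor-based projection and the offset vector (fractional offsets plus the inverse scale) fed to a weight-prediction network are assumptions of this example and are not spelled out in the text above; the weight-prediction network itself is omitted.

```python
import math

def project(i, j, r):
    """Map HR pixel (i, j) to an LR position and an offset vector for scale r."""
    li, lj = math.floor(i / r), math.floor(j / r)
    # Fractional offsets inside the LR pixel, plus 1/r to identify the scale.
    offsets = (i / r - li, j / r - lj, 1.0 / r)
    return (li, lj), offsets

# Hypothetical query: HR pixel (3, 5) at scale 1.5.
lr_pos, offsets = project(3, 5, 1.5)
```

Because the offsets differ from pixel to pixel when the scale is not an integer, a separate set of weights is predicted for each HR position rather than shared across the image.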

The comparison of upsampling methods is shown in Table 4 ; most SR methods use deconvolution or sub-pixel layers for upscaling. However, for multiple scale factors, meta upscaling is used.

Deep learning SR networks

The network design and advancements in design architecture are recent trends in deep learning, and in SR, researchers have tried several design implications along with the SR framework (as seen in “SR Frameworks”) for designing the overall SR network. Some of the fundamental and recent network designs are discussed in this section.

Recursive learning

One of the basic network-based learning strategies is to use the same module for recursively learning high-level features. This method also minimizes the parameters as the strategy is based on the same module being updated recursively, as shown in Fig. 6A .

Fig. 6. (A) Recursive learning, (B) residual learning, (C) dense connection-based learning, (D) multiscale learning, (E) advanced convolution-based learning, (F) attention-based learning.

One of the most used recursive networks is the Deeply-recursive Convolutional Network (DRCN) ( Kim, Lee & Lee, 2016b ). By recursively applying a single convolution layer, DRCN reaches a receptive field of up to $41 \times 41$ without requiring additional parameters, which is much larger than the $13 \times 13$ receptive field of the Super-resolution Convolutional Neural Network (SRCNN) ( Thapa et al., 2016 ). The Deep Recursive Residual Network (DRRN) ( Tai, Yang & Liu, 2017 ) utilized a ResBlock ( He et al., 2016 ) as part of the recursive module for a total of 25 recursions and was reported to achieve better performance than the baseline ResBlock. Using the concept of DRCN, Tai et al. (2017) proposed a memory block-based method, MemNet, which contained six recursive ResBlocks, whereas the Cascading Residual Network (CARN) ( Ahn, Kang & Sohn, 2018a ) also used ResBlocks as recursive units. In this approach, the network shares the weights globally in recursion using an iterative up-and-down sampling-based approach. Apart from end-to-end recursions, researchers also used the Dual-state Recurrent Network (DSRN) ( Han et al., 2018 ), which shared the signals between the LR and generated HR states within the network.

Overall, while reducing the parameters, recursive learning networks can learn the complex representation of the data at the cost of computational performance. Additionally, the increase in computational requirements may result in an exploding or vanishing gradient. Thus, recursive learning is often used in combination with multi-supervision or residual learning for minimizing the risk of exploding or vanishing gradient ( Kim, Lee & Lee, 2016b ; Tai et al., 2017 ; Tai, Yang & Liu, 2017 ; Han et al., 2018 ).
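The receptive-field figures quoted for recursive networks can be checked with a short calculation for stacks of stride-1 convolutions. The layer configurations below (twenty 3 × 3 applications for a DRCN-like network and 9-1-5 kernels for an SRCNN-like network) are assumptions of this example, as is the single-channel weight count.

```python
def receptive_field(kernel_sizes):
    """Receptive field of stacked stride-1 convolutions: 1 + sum(k - 1)."""
    rf = 1
    for k in kernel_sizes:
        rf += k - 1
    return rf

def weight_count(kernel_sizes, shared):
    """Per-channel kernel weights; a shared (recursive) layer is counted once."""
    sizes = set(kernel_sizes) if shared else kernel_sizes
    return sum(k * k for k in sizes)

recursive_rf = receptive_field([3] * 20)   # same 3x3 layer applied 20 times
plain_rf = receptive_field([9, 1, 5])      # three distinct layers
```

With sharing, the twenty applications cost the weights of a single 3 × 3 kernel (`weight_count([3] * 20, shared=True)` is 9 instead of 180), while the receptive field still grows with every recursion.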

Residual learning

Residual learning was widely used in the field of SR ( Bevilacqua et al., 2012 ; Timofte, De Smet & Van Gool, 2015 ; Timofte, De & Van Gool, 2013 ) even before ResNet ( He et al., 2016 ) was proposed for learning residuals, as shown in Fig. 6B . Overall, there are two approaches: local and global residual learning.

The local residual learning approach mitigates the degradation problem ( He et al., 2016 ) caused by increased network depth. Furthermore, the local residual learning also improved the learning rate and reduced the training difficulty; this is frequently used in the SR field ( Protter et al., 2009 ; Mao, Shen & Bin, 2016 ; Han et al., 2018 ; Li et al., 2018 ).

Global residual learning is used when the input and the final output are highly correlated. In image SR, the output HR image is highly correlated with the input LR image; thus, learning the global residual between the LR and HR images is significant in SR. In global residual learning, the model only learns the residual map that transforms the LR image into an HR image by generating the missing high-frequency details in the LR image. Furthermore, because the residuals are small, the learning difficulty and model complexity are significantly reduced in global residual-based learning. This method is also frequently used in SR methods ( Kim, Lee & Lee, 2016a ; Tai, Yang & Liu, 2017 ; Tai et al., 2017 ; Hui, Wang & Gao, 2018 ).

Overall, both methods use residuals to connect the input image with the output HR image; in global residual learning, the connection is made directly, whereas local residual learning uses various layers of different depths to connect the input (using local residuals) with the output.

Dense connection-based learning

This learning method uses dense blocks to address SR, like DenseNet ( Huang et al., 2017 ). Each layer in a dense block takes the feature maps generated by all previous layers as inputs and passes its own feature maps on to all subsequent layers, leading to l ( l − 1 ) / 2 connections in an l-layer ( l ≥ 2 ) dense block. Using dense blocks increases the reusability of the features while alleviating the vanishing gradient problem. Furthermore, dense connections also keep the model size small by utilizing a small growth rate and squeezing the channels after concatenating the input features.
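The connection count quoted above can be checked with a few lines of Python. This is only an illustrative sketch: the function names, and the `in_channels` and `growth_rate` arguments, are hypothetical and are not taken from any of the cited implementations.

```python
def dense_block_connections(num_layers: int) -> int:
    """Count the layer-to-layer links in a dense block, i.e. l(l-1)/2."""
    if num_layers < 2:
        raise ValueError("a dense block needs at least two layers")
    # Layer k (1-indexed) receives the feature maps of the k-1 layers before it.
    return sum(k - 1 for k in range(1, num_layers + 1))


def dense_block_input_channels(in_channels: int, growth_rate: int, num_layers: int) -> list:
    """Input width of each layer when all earlier outputs are concatenated.

    Assumes every layer emits `growth_rate` channels (hypothetical setting).
    """
    return [in_channels + k * growth_rate for k in range(num_layers)]
```

For example, an 8-layer block has 28 such links, and with a small growth rate the concatenated input only grows linearly with depth.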

Dense connections are used in SR to connect the low-level and high-level feature maps for reconstructing a high-quality fine-detailed HR image, as shown in Fig. 6C . SRDenseNet ( Tong et al., 2017 ) proposed a 69-layer network containing dense connections within the dense blocks and dense connections among the dense blocks. In SRDenseNet, each dense block took the feature maps of all preceding blocks as inputs, and its own feature maps were passed on to all subsequent blocks. RDN ( Zhang et al., 2018b ), CARN ( Ahn, Kang & Sohn, 2018a ), MemNet ( Tai et al., 2017 ) and ESRGAN ( Wang et al., 2019c ) also used layer or block-level dense connections, while DBPN ( Haris, Shakhnarovich & Ukita, 2018 ) only used dense connections between the upsampling and downsampling units.

Multi-path learning

In multi-path learning, the features are passed through multiple paths for different representations, and these representations are later combined to gain improved performance. The main types are scale-specific, local, and global multi-path learning.

For different scales, super-resolution models use different feature extraction; in Lim et al. (2017) , the authors proposed a single network-based multi-path learning for multiple scales. The intermediate layers of the model were shared for feature extraction, while the scale-specific paths, including pre-processing and upsampling, were placed at the two ends of the model, i.e., the start and end of the network. During training, only the paths corresponding to the selected scale are enabled and updated, and the proposed multi-scale deep super-resolution (MDSR) method ( Lim et al., 2017 ) also decreases the overall model size because of the sharing of parameters across the scales. A similar multi-path-based approach is also implemented in ProSR and CARN.

Local multi-path learning is inspired by the inception module ( Szegedy et al., 2015 ) and uses a new block for multi-scale feature extraction, as performed in MSRN ( Li et al., 2018 ) (shown in Fig. 6D ). The block consists of convolution layers with 3 × 3 and 5 × 5 kernels, which extract features simultaneously. After combining the outputs of the two convolution layers, the final output goes through a 1 × 1 kernel convolution. Furthermore, a shortcut path links the input and output of the block by element-wise addition. With this local multi-path learning, features are extracted more efficiently than with multi-scale learning.

Another variation of multi-path learning is global multi-path learning; in this method, various features are extracted from multi-paths that can interact. In DSRN ( Han et al., 2018 ), there are two paths for extracting low and high-level information, and there is a continuous sharing of features for improved learning. In contrast, in pixel recursive SR ( Dahl, Norouzi & Shlens, 2017 ), a conditioning path is responsible for extracting global structures, and the prior path further finds the serial codependence among the generated pixels. A different method was employed by Ren, El-Khamy & Lee (2017) , where multi-path learning was performed for unbalanced structures, which were later combined in the final layer to get the SR output.

Advanced convolution-based learning

In SR, most of the explored methods depend on the convolution operation, and various research studies have attempted to modify the convolution operation for better performance. In recent years, research studies have shown that group convolution, as shown in Fig. 6E , decreases the total number of parameters at the cost of a small loss in performance ( Hui, Wang & Gao, 2018 ; Johnson, Alahi & Li, 2016 ). In CARN-M ( Ahn, Kang & Sohn, 2018a ) and IDN ( Hui, Wang & Gao, 2018 ), group convolution was used instead of vanilla convolution. In dilated convolution, contextual information is used to generate realistic-looking SR images ( Zhang et al., 2017 ); dilated convolution was used to double the receptive field, leading to better results.

Another type of convolution is depthwise separable convolution ( Howard et al., 2009 ); although this convolution significantly reduces the total number of parameters, it reduces the overall performance.
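The parameter savings of group and depthwise separable convolution can be made concrete with a small counting sketch. The helper names and the 64-channel, 3 × 3 example are assumptions chosen for illustration; biases are ignored.

```python
def conv_params(c_in: int, c_out: int, k: int, groups: int = 1) -> int:
    """Weights in a k x k convolution (bias ignored) split into `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)  # one k x k filter per channel
    pointwise = conv_params(c_in, c_out, 1)              # mixes channels
    return depthwise + pointwise
```

With 64 input and output channels and a 3 × 3 kernel, this gives 36,864 weights for the vanilla convolution, 9,216 with four groups, and 4,672 for the depthwise separable variant.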

Attention-based learning

In deep learning, attention is the idea that certain factors are given more weight than others when processing the data; here, two types of attention-based learning mechanisms used in SR are discussed. In channel attention, a dedicated block is added to the model in which global average pooling (GAP) squeezes each input channel into a constant; two fully connected layers then process these constants to generate channel-wise scaling factors ( Hu, Shen & Sun, 2018 ), as shown in Fig. 6F . This technique has been incorporated in SR as RCAN ( Zhang et al., 2018a ), which improved performance. Instead of GAP, Dai et al. (2019) used the second-order channel attention (SOCA) module, which uses second-order feature statistics to extract more informative representations through channel-based attention.

In SR, most models use local receptive fields to generate the SR pixels, while in a few cases, textures or patches that are far apart are necessary for generating accurate local patches. In Zhang et al. (2019b) , local and non-local attention blocks were used to extract local and non-local representations between pixel data. Similarly, Dai et al. (2019) incorporated a non-local attention technique to capture contextual information. Chen et al. proposed an SR reconstruction method that applies an attention mechanism to feature maps to facilitate the reconstruction of the image ( Chen et al., 2021 ), while Yang et al. proposed a channel attention and spatial graph convolutional network (CASGCN) for more robust feature extraction and feature correlation modeling ( Yang & Qi, 2021 ).

Wavelet transform-based learning

Wavelet transform (WT) ( Daubechies & Bates, 1993 ; Griffel & Daubechies, 1995 ) represents textures using high-frequency sub-bands and global structural information in low-frequency sub-bands in a highly efficient way. WT was used in SR to generate the residuals of the HR sub-bands using the sub-bands of the interpolated LR wavelet. Using the WT, the LR image is decomposed, while the inverse WT provides the reconstruction of the HR image in SR. Other examples of WT based SR are Wavelet-based residual attention network (WRAN) ( Xue et al., 2020 ), multi-level wavelet CNN (MWCNN) ( Liu et al., 2018b ) and ( Ma et al., 2019 ); these approaches used a hybrid approach by combining WT with other learning methods to improve the overall performance.

Region-recursive-based learning

In SR, most methods follow the underlying assumption that it is a pixel-independent process; thus, no priority is given to the interdependence among the generated pixels. Using the concept of PixelCNN ( Van Den Oord et al., 2016 ), Dahl, Norouzi & Shlens (2017) proposed a method for pixel recursive learning, which performed SR by pixel-by-pixel generation using two networks. The two networks ( Dahl, Norouzi & Shlens, 2017 ) captured information about pixel dependence and global contextual information within the pixel recursive SR method. Under mean opinion score-based evaluation, the pixel recursive method of Dahl, Norouzi & Shlens (2017) performed well compared to other methods for generating SR face images. The attention-based face hallucination method ( Cao et al., 2017a ) also utilized the concept of a path-based attention shifting mechanism to enhance the details in the local patches.

While the region-recursive methods perform marginally better than other methods, the recursive process exponentially increases the training difficulty and computation costs due to long propagation paths.

Other methods

Other SR networks are also used by researchers, such as Desubpixel-based learning ( Vu et al., 2019 ), xUnit-based learning ( Kligvasser, Shaham & Michaeli, 2018 ) and Pyramid Pooling-based learning ( Zhao et al., 2017 ).

To improve the computational speed, the desubpixel-based approach was used to extract features in a low-dimensional space, which does the inverse task of the sub-pixel layer. By segmenting the images spatially and using them as separate channels, the desubpixel-based learning avoids any information loss; after learning the data representations in low-dimensional space, the images are upsampled to get a high-resolution image. This technique is particularly efficient in applications with limited resources such as smartphones.

In xUnit learning, a spatial activation function was proposed for learning complicated features and textures. The ReLU operation was replaced by the xUnit, which generates weight maps through convolution and Gaussian gating. Using xUnit, the model size was decreased by 50% without compromising the SR performance, at the cost of increased computational demand ( Kligvasser, Shaham & Michaeli, 2018 ).

Learning strategies

Learning strategies also dictate the overall performance of any SR algorithm, as the evaluation results depend on the learning strategy selected. In this section, recent research studies are discussed according to the learning strategy utilized in SR, and some of the critical strategies are discussed in detail.

Loss functions

For any application in deep learning, the selection of the loss function is critical, and in SR, these functions are used to measure the error in the reconstruction of the HR image, which further helps optimize the model iteratively. Since the basic element of an image is the pixel, initial research studies employed the pixel-wise L2 loss, but it was found that the pixel loss cannot wholly represent the quality of reconstruction ( Ghodrati et al., 2019 ). Thus, different loss functions such as content loss ( Johnson, Alahi & Li, 2016 ) or adversarial loss ( Ledig et al., 2017 ) are used to measure the generation error, and these loss functions have been widely used in the field of SR. Various loss functions are explored in this section, and the notation follows the previously defined variables except where defined otherwise.

Content Loss. The perceptual quality, as mentioned previously, is essential in the evaluation of an SR model, and this loss was used in SR ( Johnson, Alahi & Li, 2016 ; Dosovitskiy & Brox, 2016 ) to measure the differences between the generated and ground-truth images using an image classification network (N). Let the high-level representation of an image I at the l-th layer be r^(l)(I); the content loss is defined as the Euclidean distance between the high-level representations of the two images I and Î, where I is the original image and Î is the generated SR image, as below:

Where h_l, w_l and c_l are, respectively, the height, width, and number of channels of the image representations in layer l.

Content loss aims to transfer information about image features from the image classification network N to the SR network. This loss function ensures the visual similarity between the original image ( I ) and the generated image ( Î ) by comparing the content and not the individual pixels. Thus, this loss function helps in producing perceptually convincing and more realistic-looking images in the field of SR, as in Ledig et al. (2017) , Wang et al. (2018b) , Sajjadi, Scholkopf & Hirsch (2017) , Wang et al. (2019c) , Johnson, Alahi & Li (2016) and Bulat & Tzimiropoulos (2018) , where the pre-trained CNNs used were ResNet ( He et al., 2016 ) and VGG ( Simonyan & Zisserman, 2015 ).
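As a rough illustration of the content loss described above, the following sketch computes a Euclidean distance between two sets of layer activations and scales it by the feature dimensions. The exact placement of the normalization is an assumption, the function name is hypothetical, and the pre-trained classifier that would produce the activations is outside the sketch.

```python
import math


def content_loss(feat_sr, feat_hr):
    """Euclidean distance between two feature tensors, scaled by 1 / (h * w * c).

    feat_sr, feat_hr: nested lists indexed [channel][row][col], assumed to be
    the layer-l activations of a fixed classifier for the SR and reference
    images. The 1 / (h * w * c) scaling is an assumed normalization.
    """
    c, h, w = len(feat_hr), len(feat_hr[0]), len(feat_hr[0][0])
    squared = 0.0
    for k in range(c):
        for i in range(h):
            for j in range(w):
                diff = feat_sr[k][i][j] - feat_hr[k][i][j]
                squared += diff * diff
    return math.sqrt(squared) / (h * w * c)
```

Because the comparison happens in feature space, two images with slightly shifted textures can have a small content loss even when their pixel-wise difference is large.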

Adversarial Loss. In recent years, after the development of GANs ( Goodfellow et al., 2014 ), GANs have received more consideration due to their ability to learn in a self-supervised manner. A GAN combines two networks performing generation and discrimination tasks, i.e., a generator that produces the actual output and a discriminator network that evaluates the results of the generative network. While training a GAN, two updates are performed alternately: (i) fix the generator and train the discriminator to discriminate more effectively, and (ii) fix the discriminator and train the generator to produce better results. This is an alternating training procedure, and through many iterations of training and evaluation, the generator can generate output that conforms to the distribution of the actual data, at which point the discriminator is unable to differentiate between real and generated data.

In terms of image SR, the purpose of the generative network is to generate an HR image, while the discriminator network is used to evaluate whether the generated image follows the same distribution as the real HR data. This method was first introduced in SR as SRGAN ( Ledig et al., 2017 ); the adversarial loss in Ledig et al. (2017) was represented by:

Where L_GAN_CE_g is the adversarial loss function of the generator in the SR model, while L_GAN_CE_d is the adversarial loss function of the discriminator D , which is a binary classifier. In (17) , the randomly sampled ground truth image is denoted by I_s . The same loss functions were reported by Sajjadi, Scholkopf & Hirsch (2017) .

Instead of the binary classification (cross-entropy) error, Yuan et al. (2018) and Wang et al. (2018a) used the mean square error for improved training and better results compared to Ledig et al. (2017) ; the loss functions are given in (18) and (19) :

Contrary to the loss functions mentioned in (18) and (19) , Park et al. (2018) showed that in some cases, a pixel-level discriminator network generates high-frequency noise; thus, they used another discriminator network to evaluate the first discriminator network for high-frequency representations. Using the two discriminator networks, Park et al. (2018) were able to capture all attributes accurately.

Various opinion scoring systems have been used extensively to test the performance of SR models that use adversarial loss. Although these SR models attained lower PSNR than pixel-loss-based SR, on perceptual quality metrics like opinion scoring the adversarial loss-based SR methods scored very high ( Ledig et al., 2017 ; Sajjadi, Scholkopf & Hirsch, 2017 ). By using a discriminator as the control network for the generator, GANs were able to regenerate some intricate patterns that were very difficult to learn using ordinary deep learning methods. The main drawback of GANs is their training instability ( Arjovsky, Chintala & Bottou, 2017 ; Gulrajani et al., 2017 ; Lee et al., 2018a ; Miyato et al., 2018 ).
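To make the two families of adversarial loss discussed above more tangible, the sketch below evaluates the standard cross-entropy and least-squares GAN objectives for a single pair of discriminator outputs. These are the textbook forms, assumed here rather than copied from the cited papers; `d_real` and `d_fake` are hypothetical names for the discriminator scores of a ground-truth image and a generated image.

```python
import math


def gan_ce_losses(d_real: float, d_fake: float, eps: float = 1e-12):
    """Cross-entropy generator and discriminator losses.

    d_real and d_fake are discriminator probabilities in (0, 1) for a real
    HR image and a generated SR image; eps only guards against log(0).
    """
    loss_g = -math.log(d_fake + eps)
    loss_d = -math.log(d_real + eps) - math.log(1.0 - d_fake + eps)
    return loss_g, loss_d


def gan_ls_losses(d_real: float, d_fake: float):
    """Least-squares (mean square error) generator and discriminator losses."""
    loss_g = (d_fake - 1.0) ** 2
    loss_d = d_fake ** 2 + (d_real - 1.0) ** 2
    return loss_g, loss_d
```

When the discriminator outputs 0.5 for both inputs, it can no longer separate real from generated images; the cross-entropy losses then equal log 2 for the generator and 2 log 2 for the discriminator.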

Pixel Loss. As evident from the name, this loss function performs a pixel-wise comparison between the reference image and the generated image. There are two types of comparison: the L1 loss, also termed the mean absolute error, and the L2 loss, which is the mean square error (MSE).

The L1 loss in some cases becomes numerically unstable to compute; thus, a variant of the L1 loss called the Charbonnier loss ( Farsiu et al., 2004b ; Barron, 2017 , 2019 ; Lai et al., 2017 ) is used, which is given by:

Here ε is a constant which ensures numerical stability.

The pixel loss function ensures that the generated HR image Î has pixel values close to those of the HR image I . Furthermore, the L2 loss uses the square of the pixel-value errors, giving more weight to large differences than to small ones; thus, this loss function may give either too variable results (in the case of outliers) or too smooth results (in the case of minimal error values). Therefore, the L1 loss function is widely used over the L2 loss ( Zhao et al., 2016 ; Lim et al., 2017 ; Ahn, Kang & Sohn, 2018a ). Furthermore, the definition of PSNR is closely tied to the pixel-wise difference, since PSNR is computed from the MSE, so minimizing the pixel loss tends to increase PSNR. Thus, researchers have often used the pixel loss to maximize the PSNR. However, as mentioned earlier, the pixel loss function does not cater to perceptual quality or textures. Thus, SR networks based on this loss function may have fewer high-frequency details, resulting in smooth but unrealistic HR images ( Wang, Simoncelli & Bovik, 2003 ; Wang et al., 2004 ).
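The pixel losses and their link to PSNR can be sketched in a few lines. The code below assumes images flattened into plain lists of pixel values and uses the standard definitions (mean absolute error, mean square error, Charbonnier penalty, and PSNR as 10 log10(MAX² / MSE)); the function names and the default `eps` and `max_value` are illustrative choices.

```python
import math


def l1_loss(sr, hr):
    """Mean absolute error between two flattened images."""
    return sum(abs(a - b) for a, b in zip(sr, hr)) / len(hr)


def l2_loss(sr, hr):
    """Mean square error between two flattened images."""
    return sum((a - b) ** 2 for a, b in zip(sr, hr)) / len(hr)


def charbonnier_loss(sr, hr, eps=1e-3):
    """Smooth variant of L1: mean of sqrt(diff^2 + eps^2)."""
    return sum(math.sqrt((a - b) ** 2 + eps ** 2) for a, b in zip(sr, hr)) / len(hr)


def psnr(sr, hr, max_value=255.0):
    """Peak signal-to-noise ratio in dB, computed from the MSE."""
    mse = l2_loss(sr, hr)
    return float("inf") if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)
```

A single large error dominates the L2 loss much more than the L1 loss, which is one way to see the sensitivity to outliers mentioned above.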

Style Reconstruction Loss. Ideally, the reconstructed HR image should have a style comparable to that of the actual HR image (colors, textures, gradient, contrast); thus, following the research studies ( Sajjadi, Scholkopf & Hirsch, 2017 ; Gatys, Ecker & Bethge, 2015 ), style reconstruction loss was used in SR to match the texture details of the generated image with those of the reference image. The correlation between the feature maps of different channels is given by the Gram matrix ( Levy & Goldberg, 2014 ) G^(l), whose element G_ij^(l) is the dot product of the vectorized feature maps i and j in layer l , which is given by:

Where vec(·) is the vectorization operation and ch_i^(l) denotes the i-th channel of the feature maps in layer l . The texture loss is then given by (24)

Using the texture loss function in (24) , EnhanceNet ( Sajjadi, Scholkopf & Hirsch, 2017 ) reported more realistic results that look visually similar to the reference HR image. Although an optimized texture loss function-based SR generates more realistic-looking images, the selection of patch size is still an open field of research. Selecting a small patch size leads to the generation of artifacts in the textured region, while selecting a large patch size generates artifacts across the whole image, as the texture statistics are averaged over the whole image.
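The Gram matrix and a texture loss built on it can be sketched as follows. Each channel is assumed to be already vectorized into a flat list; the 1 / c² scaling of the loss is an assumed normalization, and the function names are hypothetical.

```python
import math


def gram_matrix(features):
    """Gram matrix of a layer: entry (i, j) is the dot product of channels i and j.

    features: list of c channels, each a flat (vectorized) list of activations.
    """
    c = len(features)
    return [
        [sum(a * b for a, b in zip(features[i], features[j])) for j in range(c)]
        for i in range(c)
    ]


def texture_loss(feat_sr, feat_hr):
    """Distance between the Gram matrices of the SR and reference features.

    The division by c * c is an assumed normalization.
    """
    g_sr, g_hr = gram_matrix(feat_sr), gram_matrix(feat_hr)
    c = len(feat_hr)
    squared = sum((g_sr[i][j] - g_hr[i][j]) ** 2 for i in range(c) for j in range(c))
    return math.sqrt(squared) / (c * c)
```

Since the Gram matrix sums over all spatial positions, it records which channels fire together but not where, which is why it captures texture statistics rather than layout.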

Total Variation Loss. Using the values of neighboring pixels, the total variation loss ( Rudin, Osher & Fatemi, 1992 ) was defined as the sum of the absolute differences between the values of neighboring pixels as:

Total variation loss was used in Ledig et al. (2017) and Yuan et al. (2018) to ensure smoothness across sharp edges/transitions within the generated image.
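A minimal sketch of a total variation term, following the verbal definition above (sum of absolute differences between vertically and horizontally adjacent pixels), is given below. It assumes a single-channel image stored as a list of rows and applies no normalization; other formulations exist.

```python
def total_variation(img):
    """Sum of absolute differences between each pixel and its lower and right neighbors.

    img: single-channel image as a list of rows (assumed layout).
    """
    h, w = len(img), len(img[0])
    tv = 0.0
    for i in range(h):
        for j in range(w):
            if i + 1 < h:  # vertical neighbor
                tv += abs(img[i + 1][j] - img[i][j])
            if j + 1 < w:  # horizontal neighbor
                tv += abs(img[i][j + 1] - img[i][j])
    return tv
```

A constant image has zero total variation, while noisy or checkerboard-like patterns score high, so adding this term to the objective discourages noise in the generated image.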

Cycle Consistency Loss. Inspired by CycleGAN ( Zhu et al., 2017a ), an image SR method using the cycle consistency loss function was presented in Yuan et al. (2018) . Using the generated HR image Î , the network generated another LR image I_LR′ , which is then compared with the input LR image I_LR for cycle consistency.

In practice, various loss functions are combined in SR in the form of a weighted average to control different aspects of the generation process, as in Kim, Lee & Lee (2016a) , Wang et al. (2018b) , Sajjadi, Scholkopf & Hirsch (2017) and Lai et al. (2017) . The selection of appropriate weights for the loss functions is itself another learning problem, as the results vary significantly with the weights of the loss functions in image SR.

Curriculum learning

In the curriculum learning technique ( Bengio et al., 2009 ), the method adapts itself to the variable difficulty of tasks, i.e., starting from simple images with minimum noise and moving to complex images. Since SR always suffers from adverse conditions, the curriculum approach is mainly applied to manage its learning difficulty and network size. To reduce the training difficulty of the network, SR with a small scaling factor is performed in the beginning; in curriculum learning-based SR, the training starts with 2 × upsampling, and the following scaling factors, 4 × , 8 × , and so on, are gradually generated using the output of previously trained networks. ProSR ( Wang et al., 2018a ) uses the upsampled output of the previous level and linearly trains the next level using the previous one, while ADRSR ( Bei et al., 2018 ) concatenates the HR output of the previous levels and further adds another convolution layer. In CARN ( Ahn, Kang & Sohn, 2018b ), the previously generated image is entirely replaced by the next level generated image, updating the HR image in sequential order.

Another alternative is to decompose the image SR problem into N subproblems and solve them gradually; in Park, Kim & Chun (2018) , the 8 × upsampling problem was divided into three problems (i.e., 1 × to 2 × , 2 × to 4 × , and 4 × to 8 × ) and three separate networks were used to solve these problems. Using a combination of the previous reconstructions, the next level was fine-tuned in this method. The same concept was used in Li et al. (2019b) to train the network from low image degradations to high image degradations, thus gradually increasing the noise in the LR input image. Curriculum learning reduces the training difficulty; hence, the total computational time is also reduced.
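The decomposition of a large scaling factor into successive 2 × stages can be sketched as below. The helper name is hypothetical, and restricting the target to powers of two is an assumption made to keep the example short.

```python
def curriculum_stages(target_scale: int) -> list:
    """Split a power-of-two upsampling factor into successive 2x stages.

    Returns (from_scale, to_scale) pairs in the order they would be trained.
    """
    if target_scale < 2 or target_scale & (target_scale - 1):
        raise ValueError("target_scale must be a power of two >= 2")
    stages, scale = [], 1
    while scale < target_scale:
        stages.append((scale, scale * 2))
        scale *= 2
    return stages
```

For an 8 × target this yields the three subproblems 1 × to 2 × , 2 × to 4 × , and 4 × to 8 × , each of which could be handled by its own network or training phase.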

Batch normalization

Batch normalization (BN) was proposed by Ioffe & Szegedy (2015) to stabilize and accelerate the training of deep CNNs by reducing the internal covariate shift of the network. Every mini-batch was normalized, and two additional parameters were used per channel to preserve the representation ability. Batch normalization works on the intermediate feature maps; thus, it mitigates the vanishing gradient issue while allowing high learning rates. This technique is widely used in SR models such as Ledig et al. (2017) , Zhang, Zuo & Zhang (2018) , Tai, Yang & Liu (2017) , Tai et al. (2017) , Sønderby et al. (2017) and Liu et al. (2018b) . In contrast, Lim et al. (2017) claimed that batch normalization-based networks lose the scale information of the generated images, which reduces the flexibility of the network; hence, Lim et al. (2017) removed batch normalization and used the memory saved to design a larger model with superior performance compared to the BN-based network. Other studies, Wang et al. (2019c) , Wang et al. (2018a) and Chen et al. (2018a) , also removed batch normalization and achieved marginally better performance.
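The normalization step can be illustrated for a single channel as follows. This is a simplified training-time sketch that omits running statistics; `gamma` and `beta` stand for the two learnable per-channel parameters, and the names and default `eps` are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's values across a mini-batch, then scale and shift.

    batch: list of activations of a single channel over the mini-batch.
    """
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]
```

Feeding `[1, 2, 3, 4]` and `[10, 20, 30, 40]` through this function gives almost identical outputs, which is one way to picture how the original scale of the features is discarded by the normalization.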

Multi-supervision

Multi-supervision refers to using numerous supervision signals within the same model to improve gradient propagation and evade the exploding/vanishing gradient problem. In Kim, Lee & Lee (2016b) , multi-supervision is incorporated within the recursive units to address the gradient problems. In SR, the multi-supervision learning technique is implemented by adding a few extra terms to the loss function, which improves the back-propagation path and reduces the training difficulty of the model.

SR frameworks

Since SR is an ill-posed problem, upsampling is critical in defining the performance of an SR method. Based on learning strategies, upsampling methods, and network types, there are several frameworks for SR; here, four of them are discussed in detail, especially in light of the upsampling method used within the framework, as shown in Figs. 7 – 10 .

Pre-upsampling SR

In this framework, the LR image is upsampled at the beginning, and convolution layers in a deep neural network are then used to iteratively extract representations and learn the mapping to the HR image. Using this concept, Dong et al. (2014 , 2016) introduced the pre-upsampling-based SR framework (SRCNN), as shown in Fig. 7 . SRCNN was used to learn the end-to-end mapping of LR-HR image conversion using CNNs. Using the classical methods of upsampling as discussed in “Conventional Methods of Super-Resolution”, the LR image is first upsampled to the HR size, and then deep CNNs are used to learn the representations for mapping to the HR image.

Since the pre-upsampling layer already performs the actual pixel conversion task, the network only needs to refine the results using CNNs; this results in reduced learning difficulty. Compared to single-scale SR ( Kim, Lee & Lee, 2016a ), which uses specific scales of input, these models can handle images of arbitrary size for refinement and have similar performance. In recent years, many application-oriented research studies have used this framework ( Kim, Lee & Lee, 2016b ; Shocher, Cohen & Irani, 2018 ; Tai, Yang & Liu, 2017 ; Tai et al., 2017 ); these models differ in the deep learning layers employed after the upsampling. The main drawback of this framework is the use of a predefined classical method of pre-upsampling, which often introduces blur and noise amplification in the upsampled image and later affects the quality of the final HR image. Moreover, the dimensions of the image are increased at the start of the method. Thus, the computational cost and memory requirements of this framework are higher ( Shi et al., 2016 ).

Post-upsampling SR

To minimize the memory requirements and increase computational efficiency, the post-upsampling method was used in SR to utilize deep learning to learn the mapping functions in low-dimensional space. This concept was first used in SR by Shi et al. (2016) and Dong, Loy & Tang (2016) , and the network diagram is shown in Fig. 8 .

Due to its low computational cost and the use of low-dimensional space for deep learning, which reduces the complexity of the model, this framework has been widely used in SR ( Ledig et al., 2017 ; Lim et al., 2017 ; Tong et al., 2017 ; Han et al., 2018 ).
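The difference in computational cost between the pre-upsampling and post-upsampling frameworks can be estimated with a simple multiply-add count. The sketch below is a back-of-the-envelope model: it assumes stride-1 convolutions that preserve spatial size, ignores the cost of the upsampling step itself, and uses hypothetical function names and layer settings.

```python
def conv_multi_adds(height, width, c_in, c_out, k):
    """Multiply-adds of one k x k convolution producing a height x width output."""
    return height * width * c_in * c_out * k * k


def framework_cost(lr_h, lr_w, scale, layers, pre_upsampling):
    """Total multiply-adds of a stack of convolutions.

    layers: list of (c_in, c_out, k) tuples. With pre-upsampling the layers
    run at HR resolution; with post-upsampling they run at LR resolution.
    """
    h, w = (lr_h * scale, lr_w * scale) if pre_upsampling else (lr_h, lr_w)
    return sum(conv_multi_adds(h, w, c_in, c_out, k) for c_in, c_out, k in layers)
```

For the same stack of layers, running in HR space costs scale² times more than running in LR space, i.e., 16 times more for 4 × SR.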

Iterative up-and-down sampling SR

Since the LR–HR mapping is an ill-posed problem, back-projection ( Irani & Peleg, 1991 ) was used in SR ( Timofte, Rothe & Van Gool, 2016 ) to learn efficiently from the LR–HR image pair. The resulting SR framework is called iterative up-and-down sampling SR, as shown in Fig. 9 . This model refines the image using recursive back-projection, i.e., continuously measuring the reconstruction error and using it to refine the HR estimate. The DBPN method proposed in Haris, Shakhnarovich & Ukita (2018) used this concept to perform continuous upsampling and downsampling, and the final image was constructed using the intermediate generations of the HR image.

Similarly, SRFBN ( Li et al., 2019b ) used this technique with densely connected layers for image SR, while RBPN ( Haris, Shakhnarovich & Ukita, 2019 ) used recurrent back-projection with iterative up-and-down sampling for video SR. This framework has shown significant improvement over the other frameworks; still, the back-projection modules and their appropriate use require further exploration, as this concept was only recently introduced.

Progressive-upsampling SR

Since the post-upsampling framework uses a single upsampling layer at the end and the learning is fixed for a given scaling factor, multi-scale SR increases the computational cost of the post-upsampling framework. Thus, progressive upsampling within the framework was proposed to gradually achieve the required scaling, as seen in Fig. 10 . An example of this framework is the LapSRN ( Lai et al., 2017 ), which uses cascaded CNN-based modules, each responsible for mapping a single scaling factor, and the output of one module acts as the input LR image to the next module. This framework was also used in ProSR ( Wang et al., 2018a ) and MS-LapSRN ( Lai et al., 2017 ).

This model learns more easily because the SR operation is split into several small upscaling tasks, which are more straightforward for CNNs to learn. Furthermore, this model has built-in support for multi-scale SR, as the images are scaled with various intermediate scaling factors. Training stability and convergence are the main issues with this framework, and they require further research.

Other improvements

Apart from the four primary considerations in image SR, other factors have a significant effect on the performance of a super-resolution method, and in this section, a few are discussed in light of recent research.

Data augmentation

Data augmentation is a common technique in deep learning, and this concept is used to further enhance the performance of a deep learning model by generating more training data using the same dataset. In the case of image super-resolution, some of the augmentation techniques are flipping, cropping, angular rotation, skew, and color degradation ( Timofte, Rothe & Van Gool, 2016 ; Lai et al., 2017 ; Lim et al., 2017 ; Tai, Yang & Liu, 2017 ; Han et al., 2018 ). Recoloring the image using channel shuffling in the LR-HR image pair is also used as data augmentation in image SR ( Bei et al., 2018 ).

Enhanced prediction

This data augmentation method acts on the output HR image: multiple augmented versions of the LR image are generated using rotation and flipping functions ( Timofte, Rothe & Van Gool, 2016 ). These augmented images are fed to the model for reconstruction, the reconstructed outputs are inversely transformed, and the final HR image is based on the mean ( Timofte, Rothe & Van Gool, 2016 ; Wang et al., 2018a ) or median ( Shocher, Cohen & Irani, 2018 ) pixel values of the corresponding augmented outputs.
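The enhanced prediction procedure described above can be sketched with plain nested lists. The example assumes a single-channel image, uses the eight combinations of a horizontal flip and 90-degree rotations, and averages the outputs (the mean variant); `model` is a hypothetical callable standing in for any SR network.

```python
def rot90(img):
    """Rotate a 2-D list 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror a 2-D list left to right."""
    return [row[::-1] for row in img]


def self_ensemble(model, lr):
    """Average the model outputs over 8 flip/rotation variants of the LR input."""
    outputs = []
    for do_flip in (False, True):
        for turns in range(4):
            x = flip(lr) if do_flip else lr
            for _ in range(turns):
                x = rot90(x)
            y = model(x)
            # Undo the transforms in reverse order: rotate back, then unflip.
            for _ in range((4 - turns) % 4):
                y = rot90(y)
            if do_flip:
                y = flip(y)
            outputs.append(y)
    h, w = len(outputs[0]), len(outputs[0][0])
    return [[sum(o[i][j] for o in outputs) / len(outputs) for j in range(w)]
            for i in range(h)]
```

If the model were perfectly equivariant to flips and rotations, all eight outputs would coincide; in practice they differ slightly, and averaging them tends to smooth out orientation-dependent errors.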

Network fusion and interpolation

This technique uses multiple models to predict the HR image, and each prediction acts as the input to the following network, as in context-wise network fusion (CNF) ( Ren, El-Khamy & Lee, 2017 ). The CNF was based on three individual SRCNNs, and this model achieved performance comparable with the state-of-the-art SR models ( Ren, El-Khamy & Lee, 2017 ).

Network interpolation combines a PSNR-based model and a GAN-based model to boost SR performance. In the network interpolation strategy ( Wang et al., 2019c , 2019b ), a PSNR-based model was trained first and a GAN-based model was then obtained by fine-tuning; the corresponding parameters of the two networks were interpolated to obtain the weights of the final model, whose results had few artifacts and looked realistic.
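Parameter interpolation between two networks with identical architectures can be sketched as a linear blend of their weights. In this illustration the networks are represented as dictionaries mapping hypothetical parameter names to flat lists of values, and `alpha` is the interpolation coefficient.

```python
def interpolate_networks(theta_psnr, theta_gan, alpha):
    """Blend two parameter sets: (1 - alpha) * PSNR weights + alpha * GAN weights.

    theta_psnr, theta_gan: dicts with the same keys, each mapping a parameter
    name to a flat list of floats (assumed representation).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if theta_psnr.keys() != theta_gan.keys():
        raise ValueError("both networks must share the same parameters")
    return {
        name: [(1.0 - alpha) * p + alpha * g
               for p, g in zip(theta_psnr[name], theta_gan[name])]
        for name in theta_psnr
    }
```

Setting `alpha` to 0 recovers the PSNR-oriented model and 1 the GAN-based one, so intermediate values trade smoothness against perceptual sharpness without retraining.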

Multi-task learning

Multi-task learning is used for learning various problems jointly and obtaining a generalized model from the representations learned, for example, in image segmentation, object detection, and facial recognition ( Caruana, 1997 ; Collobert & Weston, 2008 ). In the field of super-resolution, Wang et al. (2018b) used semantic maps as input to the model and predicted the parameters of the affine transformation on the transitional feature maps. The SFT-GAN in Wang et al. (2018b) generated more realistic and crisp-looking images with good visual details in the textured regions. In DNSR ( Bei et al., 2018 ), a denoising network was proposed to denoise the output generated by the SR network; using this closed-loop system, Bei et al. (2018) were able to achieve good results. Like DNSR, Yuan et al. (2018) proposed an unsupervised SR method using the cycle-in-cycle GAN (CinCGAN) for denoising during the SR task. Using a multi-tasking framework may increase the computational difficulty, but the system's performance is also enhanced in terms of PSNR and perceptual quality indexes.

State-of-the-art SR methods

Recent years have seen rapid progress in SR models, especially those using supervised deep learning, and these models have achieved state-of-the-art performance. The previous sections discussed various aspects of SR models and their underlying components in light of their strengths and weaknesses. Recently, the use of multiple learning strategies has become common, and most state-of-the-art methods combine these strategies.

The first innovation was the dual-branched network (DBCN) ( Gao, Zhang & Mou, 2019 ), which increased the computational efficiency of the single-branched network by using a smaller number of convolutional layers for representation. In RCAN ( Zhang et al., 2018a ), attention-based learning was combined with residual learning, an L1 pixel loss function, and the sub-pixel upsampling method to achieve state-of-the-art results in image SR. Various models, their reported results, and some key factors are summarized in Table 5 .

“US,” “Rec.,” “Res.,” “Attent.,” “Dense,” “Pre.,” “Post.,” “Iter.,” and “Prog.” represent upsampling methods, recursive learning, residual learning, attention-based learning, dense connections, pre-upsampling framework, post-upsampling framework, iterative up-down upsampling framework, and progressive upsampling framework respectively.

In previous sections, we discussed various strategies and compared and contrasted them; while these are important, the performance of any SR algorithm relative to its computational cost and number of parameters is also vital. In Fig. 11 , we graphically show the performance of SR methods in terms of PSNR compared to their size (represented as the number of parameters) and computational cost (measured by the number of Multi-Adds). The datasets used in the measurements are Set14, B100 and Urban100; the overall PSNR is the average score over the three datasets, and the scaling factor for these models was fixed to 2 × .
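For reference, the PSNR score and the three-dataset average used in this comparison can be sketched as follows. The sketch assumes 8-bit images (peak value 255) and an unweighted mean over datasets; details such as colour-space conversion or border cropping, which published benchmarks often apply, are not modelled here.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def average_psnr(per_dataset_scores):
    """Unweighted mean of per-dataset PSNR scores."""
    return sum(per_dataset_scores) / len(per_dataset_scores)

hr = np.zeros((4, 4))
sr = np.full((4, 4), 255.0)
worst_case = psnr(hr, sr)    # MSE equals max_value**2, so the ratio is 1
```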

Figure 11. Image quality index is represented by PSNR (in blue), which is a significant evaluation indicator of any super-resolution method; the total number of parameters learned by every method is shown in green. The computational efficiency is measured in tera multiply-adds and is shown in orange.

As evident in Fig. 11 , the five best-performing methods on the selected datasets based on PSNR are WRAN (34.790 dB), RCAN (34.540 dB), SAN (34.480 dB), Meta-RDN (34.400 dB) and EDSR (34.330 dB). The average PSNR reported by these methods varies by only 0.46 dB. There is a significant difference in the number of parameters reported by these five methods; WRAN reported only 2.710 million parameters, while EDSR reported the highest number among the five methods, i.e., 40.74 million. WRAN and RCAN performed well in terms of PSNR, the number of parameters, and computational cost, making them among the best methods for image super-resolution.
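The spread and the parameter gap quoted above follow directly from the reported numbers, as this short check shows. Only figures stated in the text are used; nothing is re-measured.

```python
# Average PSNR (dB) and parameter counts (millions) as quoted in the text above.
psnr_db = {"WRAN": 34.790, "RCAN": 34.540, "SAN": 34.480,
           "Meta-RDN": 34.400, "EDSR": 34.330}
params_millions = {"WRAN": 2.710, "EDSR": 40.74}

# Gap between the best and the fifth-best average PSNR.
psnr_spread = round(max(psnr_db.values()) - min(psnr_db.values()), 2)
# How many times more parameters EDSR has than WRAN.
param_ratio = params_millions["EDSR"] / params_millions["WRAN"]

print(f"PSNR spread: {psnr_spread} dB")
print(f"EDSR uses about {param_ratio:.1f}x the parameters of WRAN")
```

In other words, roughly fifteen times more parameters separate EDSR from WRAN while their average PSNR differs by less than half a decibel.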

Unsupervised super-resolution

In this section, methods of unsupervised SR are discussed, which do not require LR–HR pairs. The limitation of supervised learning methods is that the LR images are usually generated using known degradations. In supervised learning, the model learns the reverse transformation of the degradation function to convert the LR image into the HR image. Thus, using an unsupervised model to upsample LR images is a field of growing interest, where the model learns the real-world image degradation to achieve SR using the information in unpaired LR and HR images. A few of the unsupervised SR models are discussed in the following sub-sections.

Weakly-supervised super-resolution

Weakly supervised deep learning was the first approach to address the reliance on a known degradation for generating LR images; it uses unpaired LR and HR images to train the model. Although this approach still requires both LR and HR images, the associations between them are not defined. There are two possible approaches: the first is to learn the degradation function first, use it to generate degraded LR images, and then train the model to generate the HR images. The other is to perform degradation function learning and LR-HR mapping cyclically, so that the two validate each other ( Ignatov et al., 2018a ).

Cyclic weakly-supervised SR

This method treats the unpaired LR and HR images as two separate, uncorrelated datasets and uses a cycle-in-cycle approach to predict the mapping functions between them, i.e., from LR to HR and from HR to LR images. This is a recursive process in which the mapping functions generate images with the same distribution, and these images are fed to the second prediction cyclically.

Using the deep learning-based CycleGAN ( Zhu et al., 2017a ), a cycle-in-cycle SR framework was proposed in Yuan et al. (2018) . This framework used a total of four generators and two discriminators; the two GANs learned the representations of the degraded LR to LR and LR to HR mappings. In Yuan et al. (2018) , the first generator is a simple denoising element that generates denoised LR images of the same scale; these denoised images act as input to the second generator to regenerate the HR image, which is further validated by the adversarial network, i.e., a discriminator. Thus, using different loss functions, the CycleGAN achieves image SR using weakly supervised learning.

Although this method has achieved comparable results, especially on very noisy images where the classical degradation functions of supervised learning cannot be used, there is room for research to decrease the learning difficulty and computational cost of this method.
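The cyclic mapping idea above rests on a cycle-consistency term: an image sent through one mapping and back through the other should return close to itself. The sketch below illustrates only that term, with two hypothetical hand-written mappings (`g_lr_to_hr`, `f_hr_to_lr`) standing in for learned generators; the adversarial losses and the denoising generator of the cited framework are not modelled.

```python
import numpy as np

def g_lr_to_hr(lr, scale=2):
    """Toy LR-to-HR mapping (assumption): nearest-neighbour upsampling."""
    return np.repeat(np.repeat(lr, scale, axis=0), scale, axis=1)

def f_hr_to_lr(hr, scale=2):
    """Toy HR-to-LR mapping (assumption): average pooling over scale x scale blocks."""
    h, w = hr.shape
    return hr.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def cycle_consistency_loss(lr, hr):
    """Mean absolute error after a round trip in each direction."""
    lr_cycle = f_hr_to_lr(g_lr_to_hr(lr))   # LR -> HR -> LR
    hr_cycle = g_lr_to_hr(f_hr_to_lr(hr))   # HR -> LR -> HR
    return float(np.abs(lr_cycle - lr).mean() + np.abs(hr_cycle - hr).mean())

lr = np.arange(16, dtype=np.float64).reshape(4, 4)
hr = np.arange(64, dtype=np.float64).reshape(8, 8)   # unpaired with lr
loss = cycle_consistency_loss(lr, hr)
```

With these toy mappings the LR round trip is exact, while the HR round trip loses detail, so the loss is driven entirely by the HR cycle; a learned pair of generators would be trained to shrink both terms.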

Learning the degradation function

This approach is similar in concept to cyclic SR, but the two networks, i.e., a degradation learning network and an LR-HR mapping network, are trained independently. In Bulat, Yang & Tzimiropoulos (2018) , a two-stage method of image SR was proposed, where the first GAN learns the representations of the HR to LR transformation while the second GAN is trained using the paired output of the first GAN to learn the mapping representations of the LR to HR transformation. This two-stage model outperformed the state-of-the-art in Fréchet Inception Distance (FID) ( Heusel et al., 2017 ) with 10% failure cases. This method reported superior reconstruction of HR human facial features.

Zero-shot super-resolution

By training at test time, zero-shot SR (ZSSR) ( Shocher, Cohen & Irani, 2018 ) uses a single image to train a deep learning network, relying on image augmentation techniques to learn the degradation function. ZSSR used the method of Michaeli & Irani (2013) to predict the degradation kernel, which was then used to generate scaled and augmented images. The final step was to train an SRCNN network to learn the representations of this dataset; in this way, ZSSR uses augmentation and the input image data to achieve SR. This model outperformed the state-of-the-art for non-bicubic, noisy, and blurred LR images by 1 dB in the case of estimated kernels and 2 dB for known kernels.

Since this model requires training for every input image at the test time, the overall inference time is substantial.
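The test-time training set described above can be pictured as pairs built from the input image alone: each augmented copy serves as a target, and its downscaled version serves as the network input. The sketch below makes two simplifying assumptions that are not taken from the cited work: block averaging stands in for the estimated degradation kernel, and only rotations and flips are used as augmentations.

```python
import numpy as np

def downscale(img, scale=2):
    """Assumed degradation: block averaging (stands in for an estimated kernel)."""
    h, w = img.shape
    if h % scale or w % scale:
        raise ValueError("image sides must be multiples of the scale factor")
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def make_training_pairs(test_image, scale=2):
    """Build (input, target) pairs from rotated/flipped copies of one image."""
    pairs = []
    for flip in (False, True):
        for k in range(4):
            target = np.fliplr(test_image) if flip else test_image
            target = np.rot90(target, k)
            pairs.append((downscale(target, scale), target))
    return pairs

test_image = np.random.default_rng(0).random((8, 12))
pairs = make_training_pairs(test_image)
```

A small network fitted on these pairs is then applied to the original image itself, which is why every new input requires its own training run and inference is slow.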

Image prior in SR

The low-level details in any learning problem can be mapped using CNNs; thus, a randomly initialized CNN can be used as an image prior ( Ulyanov, Vedaldi & Lempitsky, 2020 ) to perform SR. The network is not trained; instead, it uses a random vector $v$ as input to the model, and it generates the HR image $I_y^{HR}$. This method aims to determine an image $\hat{I}_y^{HR}$ that, when downsampled, returns an LR image that is similar to the input LR image $I_x^{LR}$. The model performed 2 dB below the state-of-the-art methods but outperformed the conventional bicubic upsampling method by 1 dB.
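The search for such an image is commonly written as an optimization over the network parameters for the single input, even though no training dataset is involved. The notation below is an assumption made for illustration: $f_\theta$ is the randomly initialized CNN, $v$ the fixed random input, $d(\cdot)$ the downsampling operator, and $I^{LR}$ the given LR image.

```latex
% Fit the network output so that its downsampled version matches the LR input.
\theta^{*} = \arg\min_{\theta} \left\lVert d\big(f_{\theta}(v)\big) - I^{LR} \right\rVert_{2}^{2},
\qquad
\hat{I}^{HR} = f_{\theta^{*}}(v)
```

Here the structure of the CNN itself acts as the prior: the only data term is agreement with the LR input after downsampling.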

Domain-specific applications of super-resolution

In this section, various applications of SR grouped by the application domains are discussed.

Face image super-resolution

Face hallucination (FH) is perhaps the ultimate target utility of image SR for face-recognition-based tasks ( Gunturk et al., 2003 ; Taigman et al., 2014 ; Korshunova et al., 2017 ; Zhang et al., 2018c ; Grm, Scheirer & Štruc, 2020 ). Facial images contain face-specific structural information; thus, image priors are commonly used to achieve FH.

Using techniques such as in CBN ( Zhu et al., 2016 ), the generated HR images can be constrained to face-related features, forcing the model to output HR images containing facial features. In CBN, this was achieved by using a facial prior and a dense correspondence field estimation. While in FSRNet ( Chen et al., 2018b ) facial parsing maps and facial landmark heatmaps were used as priors to the learning network to achieve face image SR, SICNN ( Zhang et al., 2018c ) used a joint training approach to recover the real identity using a super-identity loss function. Super-FAN ( Bulat & Tzimiropoulos, 2018 ) approached FH using end-to-end learning with FAN to ensure the generated images are consistent with human facial features.

Using implicit methods for solving the face misalignment problem is another way to approach FH; for instance, in Yu & Porikli (2017) , the spatial transformation is achieved using transformation networks ( Jaderberg et al., 2015 ). Another method based on Jaderberg et al. (2015) is TDAE ( Wang et al., 2019b ), which uses a three-module D-E-D (decoder-encoder-decoder) model to achieve FH; the first decoder performs denoising and upsampling, while the encoder downsamples the denoised image, which is fed to the final decoder for FH. Another approach is to use HR exemplars from datasets to decompose the facial features of an LR image and project the HR features into the exemplar dataset to achieve FH ( Yang, Liu & Yang, 2018 ). In Song et al. (2018) , an adversarial discriminative network was proposed for feature learning on both feature space and raw pixel space; this method performed well for heterogeneous face recognition (HFR).

In other research studies, human perception of attention shifting ( Najemnik & Geisler, 2005 ) was used in Attention-FH ( Cao et al., 2017a ) to learn face patches for local enhancement of FH. In Xu et al. (2017) , a multi-class GAN network was proposed for FH, which was composed of multiple generators and discriminators, while in Yu & Porikli (2016) , the authors adopted a network model analogous to SRGAN ( Ledig et al., 2017 ). Using conditional GAN ( Gauthier, 2014 ), the studies ( Lee et al., 2018b ; Yu et al., 2018 ) used additional facial features to achieve FH with predefined attributes. Gao et al. proposed an efficient multilayer locality-constrained matrix regression (MLCMR) framework for face super-resolution of highly degraded LR images ( Gao et al., 2021 ).

Real-world image super-resolution

In real-world images, the sensors used to capture them already introduce degradations, as the final RGB (8-bit) image is converted from the raw image (usually 14-bit or higher). Thus, using these images as a reference for SR is not optimal, as the images have already been degraded ( Wang, Chen & Hoi, 2020 ). To approach this problem, research studies such as Zhang et al. (2019a) and Chen et al. (2019) have proposed methods for developing real-world image datasets. In Zhang et al. (2019a) , the authors developed the SR-RAW dataset, which contains raw-HR-LR(RGB) pairs generated using the optical zoom in cameras, while in Chen et al. (2019) , the authors explored image resolution and its relationship with the field of view (FoV) to generate a real-world dataset called City100.

Depth map super-resolution

In the field of computer vision, problems like image segmentation ( Zaitoun & Aqel, 2015 ; Yu & Koltun, 2016 ; Kirillov et al., 2019 ) and pose estimation ( Wei et al., 2016 ; Cao et al., 2017b ; Chen & Ramanan, 2017 ) have been approached by using depth maps. Depth maps retain the distance information between the scene and the observer, but they are of low resolution because of the hardware constraints of modern camera systems. Thus, image SR is used to increase the resolution of the depth maps.

Using multiple cameras to record the same scene and generate multiple HR images is the most suitable way of performing depth map SR. In Hui, Loy & Tang (2016) , the authors used two separate CNNs to concurrently downsample the HR image and upsample the LR depth map; after the generation of RGB features from the downsampling CNN, these features were used to fine-tune the upsampling process of depth maps, while Riegler, Rüther & Bischof (2016) used an energy minimization model (such as Bashir & Ghouri (2014) ) to guide the model for generating HR depth maps without the need for reference images.

Remote sensing and satellite imaging

The use of SR in improving the resolution of remote sensing and satellite imaging has increased in the past years ( Shermeyer & Van Etten, 2019 ). In Li et al. (2017b) , the authors used the concept of multi-line cameras to utilize multiple LR images to generate a high-quality HR image from the ZY-3 (TLC) satellite image dataset. In Zhu et al. (2017b) and Benecki et al. (2018) , the authors argued that the conventional methods of evaluating SR techniques are not valid for satellite imaging, as the degradation functions and operating conditions of the satellite hardware belong to an entirely different environment; thus, Benecki et al. (2018) proposed a new way of validating SR methods for satellite images. An adaptive multi-scale detail enhancement (AMDE-SR) was proposed in Zhu et al. (2018) to use the multi-scale SR method to generate highly detailed HR images with accurate textural and high-frequency information. GAN-based methods provide superior performance for remote sensing image SR; Liu et al. developed a novel cascaded conditional Wasserstein generative adversarial network (CCWGAN) to generate HR images for remote sensing ( Liu et al., 2020a ). Bashir et al. proposed a YOLOv3-based small-object detection framework SRCGAN-RFA-YOLO ( Bashir & Wang, 2021b ), where the authors used residual feature aggregation and cyclic GAN to improve the resolution of remote sensing images before performing object detection.

Video super-resolution

In video SR, multiple frames represent the same scene; thus, there is inter-frame and intra-frame spatial dependency in the video, which includes the information of brightness, colors, and relative motion of objects. Using the optical flow-based method ( Sun, Roth & Black, 2010 ; Liao et al., 2015 ), Sun et al. and Liao et al. proposed a method to generate probable HR candidate images and ensemble these images using CNNs. Using the Drulea algorithm ( Drulea & Nedevschi, 2011 ), CVSRnet ( Kappeler et al., 2016 ) addressed the effect of motion by using CNNs on the images in successive frames to generate HR images.

Apart from directly learning motion compensation, a trainable spatial transformer ( Jaderberg et al., 2015 ) was used in VESPCN ( Caballero et al., 2017 ) to learn the motion compensation mapping using data from successive frames for end-to-end mapping. Using a sub-pixel layer-based module, Tao et al. (2017) achieved super-resolution and motion compensation simultaneously.

Another approach is to use recurrent networks to indirectly grasp the spatial and temporal interdependency to address the motion compensation. In STCN ( Guo & Chao, 2017 ), the authors used a bidirectional LSTM ( Graves, Fernández & Schmidhuber, 2005 ) and deep CNNs to extract the temporal and spatial information from the video frames, while BRCN ( Huang, Wang & Wang, 2015 ) utilized RNNs, CNNs, and conditional CNNs respectively for temporal, spatial and temporal-spatial interdependency mapping. Using 3D convolution filters of small size to replace the large-sized filter, FSTRN ( Li et al., 2019a ) achieves state-of-the-art performance using deep CNNs, sustaining a low computational cost. A novel spatio-temporal matching network (STMN) for video SR was proposed, which worked on the wavelet transform to minimize the dependence on motion estimations ( Zhu et al., 2021 ).

SR for medical imaging

Other fields have also used image super-resolution to achieve high-resolution images. For example, Mahapatra, Bozorgtabar & Garnavi (2019) proposed the use of progressive GANs to enhance the image quality of magnetic resonance (MR) images. DeepResolve ( Chaudhari et al., 2018 ) used image SR methods to generate thin-sliced knee MR images from thick-sliced input images. Since diffusion MRI has a high image acquisition time and low resolution, Super-resolution Reconstruction Diffusion Tensor Imaging (SRR-DTI) reconstructed HR diffusion parameters from LR diffusion-weighted (DW) images ( Van Steenkiste et al., 2016 ). Hamaide et al. (2017) also used SRR-DTI to find the structural sex variances in the adult zebra finch brain.

Assisted diagnosis using super-resolution has been a recent trend; for instance, researchers used deep learning-based SR methods to assist the diagnosis of movement disorders like isolated dystonia ( Bashir & Wang, 2021a ).

Other applications

Other applications of SR include object detection ( Li et al., 2017a ; Tan, Yan & Bare, 2018 ), stereo image SR ( Duan & Xiao, 2019 ; Guo, Chen & Huang, 2019 ; Wang et al., 2019a ), and super-resolution in optical microscopy ( Qiao et al., 2021 ). Overall, SR plays a vital role in many disciplines, from medical science and computer vision to satellite imaging and remote sensing.

Discussion and future directions

This paper gives an overall review of literature for image super-resolution, and the contribution of this paper is discussed in this section.

Learning strategies in image SR are introduced in “Learning Strategies”; while the learning strategies are well matured in image SR, there are research directions in the development of alternative loss functions and alternatives to batch normalization.

There are various loss functions in SR, and the choice of loss function depends upon the task; finding an optimal loss function that fits all SR frameworks is still an open research area. A combination of loss functions is currently used to optimize the learning process, and there are no standard criteria for the selection of loss functions; thus, exploring various probable loss functions for super-resolution is a promising future direction.

Batch normalization is a technique that performs well in computer vision tasks, reducing the overall training runtime and enhancing performance; however, in SR, batch normalization proved to be sub-optimal ( Lim et al., 2017 ; Wang et al., 2018a ; Chen et al., 2018a ). In this regard, normalization techniques for super-resolution should be explored further.
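A combination of loss functions is typically formed as a weighted sum of individual terms. The sketch below shows that pattern with two simple pixel losses; the choice of terms and the weights are hypothetical, which reflects the point made above that no standard selection criteria exist.

```python
import numpy as np

def l1_loss(sr, hr):
    """Mean absolute pixel error."""
    return float(np.abs(sr - hr).mean())

def l2_loss(sr, hr):
    """Mean squared pixel error."""
    return float(((sr - hr) ** 2).mean())

def combined_loss(sr, hr, terms):
    """Weighted sum of loss terms; `terms` maps a loss function to its weight."""
    return sum(weight * loss_fn(sr, hr) for loss_fn, weight in terms.items())

hr = np.zeros((4, 4))
sr = np.full((4, 4), 2.0)
# Hypothetical weights, chosen only to illustrate the weighting mechanism.
total = combined_loss(sr, hr, {l1_loss: 1.0, l2_loss: 0.1})
```

Perceptual or adversarial terms would enter the same sum with their own weights, which is where most of the tuning effort goes in practice.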

Network design

Network design strategies require further exploration in SR, as the network design inherently dictates the overall performance of any SR method. Some of the key research areas are highlighted in this section.

As discussed in “Upsampling Methods”, current upsampling methods have significant drawbacks: the deconvolution layer may produce checkerboard artifacts, the sub-pixel layer is susceptible to the non-uniform distribution of receptive fields, the meta-scale method has stability issues, and the interpolation-based methods lack end-to-end learning. Thus, further research is required to explore upsampling methods that can be generic to SR models and can be applied to LR images with any scaling factor.
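To make the sub-pixel layer mentioned above concrete, the sketch below shows its rearrangement step, which turns groups of `scale**2` channels into spatial detail. It follows the widely used channel ordering in which output pixel (h * scale + i, w * scale + j) of channel c is taken from input channel c * scale**2 + i * scale + j; the convolution that normally precedes this step is omitted.

```python
import numpy as np

def pixel_shuffle(features, scale):
    """Rearrange (C * scale**2, H, W) feature maps into (C, H * scale, W * scale)."""
    c_in, h, w = features.shape
    if c_in % (scale * scale) != 0:
        raise ValueError("channel count must be divisible by scale**2")
    c_out = c_in // (scale * scale)
    x = features.reshape(c_out, scale, scale, h, w)
    x = x.transpose(0, 3, 1, 4, 2)           # -> (C, H, scale, W, scale)
    return x.reshape(c_out, h * scale, w * scale)

features = np.arange(4 * 2 * 3).reshape(4, 2, 3)   # 4 channels of size 2 x 3
hr = pixel_shuffle(features, scale=2)               # 1 channel of size 4 x 6
```

Each 2 x 2 block of the output is filled from four different input channels at the same spatial location, so neighbouring output pixels are produced by different filters.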

For human perception in SR, further research is required in attention-based SR, where the models may be trained to give more attention to some image features than others like the human visual system does.

Using a combination of low and high-level representations simultaneously to accelerate the SR process is another field in network design for fast and accurate reconstruction of the HR image.

Network architectures that can be implemented in practical applications also need to be explored, since current methods use deep neural networks, which increase the performance of SR at the expense of higher computational cost; thus, the development of network architectures that are minimal yet provide optimal performance is another promising research direction.

Evaluation metrics

The image quality metrics used in SR act as benchmark scores. The two most commonly used metrics, PSNR and SSIM, help gauge the performance of SR, but these metrics introduce inherent issues in the generated image. Using PSNR as an evaluation metric usually introduces non-realistic smooth surfaces, while SSIM works with textures, structures, brightness, and contrast to imitate human perception. These metrics cannot completely grasp the perceptual quality of images ( Ledig et al., 2017 ; Sajjadi, Scholkopf & Hirsch, 2017 ). Opinion scoring is a metric that ensures perceptual quality, but it is impractical for evaluating SR methods on large datasets; thus, a probable research direction is developing a universal quality metric for SR.

In the past two years, unsupervised SR methods have gained popularity, but collecting scenes at various resolutions for the same pose is still difficult; thus, bicubic interpolation is used instead to generate an unpaired SR dataset. In effect, the unsupervised SR methods learn the inverse mapping of this interpolation to reconstruct HR images, and learning actual SR with unsupervised methods is still an open research field.

This survey paper presents a detailed review of classical SR and recent advances in SR with deep learning. The central theme of this survey was to discuss deep learning-based SR techniques and the application of SR in various fields. Although image SR has achieved a lot in the last decade, some open problems are highlighted in “Discussion and Future Directions”. This survey is intended for researchers in the field of SR and for researchers from other fields who wish to use image SR in their respective fields of interest.
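To show how SSIM combines brightness, contrast, and structure, the sketch below computes a simplified, global variant of the index. Standard SSIM averages the same expression over local windows; computing it once over the whole image is an assumption made here to keep the example short, and the constants `k1` and `k2` are the commonly used defaults.

```python
import numpy as np

def global_ssim(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole image instead of over local windows."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1, c2 = (k1 * max_value) ** 2, (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    # Brightness agreement between the two images.
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    # Joint contrast and structure agreement.
    contrast_structure = (2 * cov_xy + c2) / (var_x + var_y + c2)
    return float(luminance * contrast_structure)

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(16, 16)).astype(np.float64)
same = global_ssim(image, image)
inverted = global_ssim(image, 255.0 - image)
```

An image compared with itself scores 1, while its inverted copy scores far lower because the structure term turns negative.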

Supplemental Information

Supplemental information 1, supplemental information 2, supplemental information 3.

A comparison of image SR methods.

Acknowledgments

We show our gratitude to the authors of all referred research studies for sharing results, especially to the authors of Kim, Lee & Lee (2016b) , Ledig et al. (2017) , Dong, Loy & Tang (2016) , Lai et al. (2017) , Haris, Shakhnarovich & Ukita (2018) , Hu et al. (2019) , Tai, Yang & Liu (2017) , Tai et al. (2017) , Li et al. (2018) , Lim et al. (2017) , Zhang et al. (2018a) , Dai et al. (2019) , Xue et al. (2020) and Caballero et al. (2017) .

Funding Statement

This work was supported by the National Natural Science Foundation of China (No. 62071384) and the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2019JM-311). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Additional Information and Declarations

The authors declare that they have no competing interests.

Syed Muhammad Arsalan Bashir conceived and designed the experiments, performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Yi Wang analyzed the data, performed the computation work, authored or reviewed drafts of the paper, and approved the final draft.

Mahrukh Khan performed the experiments, analyzed the data, performed the computation work, prepared figures and/or tables, authored or reviewed drafts of the paper, and approved the final draft.

Yilong Niu analyzed the data, authored or reviewed drafts of the paper, and approved the final draft.


A Review of New Developments in Finance with Deep Learning: Deep Hedging and Deep Calibration

Yuji Shinozaki

The application of machine learning to the field of finance has recently become the subject of active discussion. In particular, deep learning is expected to significantly advance the techniques of hedging and calibration. As these two techniques play a central role in financial engineering and mathematical finance, their application attracts the attention of both practitioners and researchers. Deep hedging, which applies deep learning to hedging, is expected to make it possible to analyze how factors such as transaction costs affect hedging strategies. Since the impact of these factors has been difficult to assess quantitatively due to computational costs, deep hedging opens possibilities not only for refining and automating hedging operations for derivatives but also for broader applications in risk management. Deep calibration, which applies deep learning to calibration, is expected to make the parameter optimization calculation, which is an essential procedure in derivative pricing and risk management, faster and more stable. This paper provides an overview of the existing literature and suggests future research directions from both practical and academic perspectives. Specifically, the paper shows the implications of deep learning for existing theoretical frameworks and practical motivations in finance and identifies potential future developments that deep learning can bring about, as well as the practical challenges.

Keywords: Financial engineering; Mathematical finance; Derivatives; Hedging; Calibration; Numerical optimization

Views expressed in the paper are those of the authors and do not necessarily reflect those of the Bank of Japan or Institute for Monetary and Economic Studies.

Copyright © 2024 Bank of Japan All Rights Reserved.


International Conference on Space Information Network

SINC 2023: Space Information Networks pp 17–33 Cite as

Applications of Deep Learning in Satellite Communication: A Survey

  • Yuanzhi He 6 , 7 ,
  • Biao Sheng 6 ,
  • Yuan Li 7 ,
  • Changxu Wang 7 ,
  • Xiang Chen 6 &
  • Jinchao Liu 8  
  • Conference paper
  • First Online: 28 March 2024


Part of the book series: Communications in Computer and Information Science ((CCIS,volume 2057))

Satellite communication is a key aspect of future 6G networks, and the impact of artificial intelligence technology utilizing deep learning on satellite communications has garnered significant interest. This paper outlines the current research status of deep learning applications in satellite communication from the perspective of the physical layer, data link layer, and network layer. It also examines the limitations of deep learning in satellite communication applications and anticipates potential research directions for the future.


Duan, T., Dinavahi, V.: Starlink space network-enhanced cyber-physical power system. IEEE Trans. Smart Grid 12 (4), 3673–3675 (2021)

Article   Google Scholar  

Wang, Z.J., Du, X.J., Yin, J.W., et al.: Development and prospect of LEO satellite Internet. Appl. Electron. Tech. 46 (7), 49–52 (2020)

Google Scholar  

Wang, P., Zhu, S., Li, C., et al.: Analysis on development of satellite internet standardization. Radio Commun. Technol. 49 (5), 1–7 (2023)

Wang, C.T., Zhai, L.J., Xu, X.F.: Development and prospects of space-terrestrial integrated information network. Radio Commun. Technol. 46 (5), 493–504 (2020)

Fang, X., Feng, W., Wei, T., et al.: 5G embraces satellites for 6G ubiquitous IoT: basic models for integrated satellite terrestrial networks. IEEE Internet Things J. 8 (18), 14399–14417 (2021)

Zhang, S.J., Zhao, X.T., Zhao, Y.F., et al.: Integration of satellite internet and terrestrial networks: integrated mode, frequency usage and application prospects. Radio Commun. Technol. 49 (5) (2023)

Sun, Y.H., Peng, M.G.: Low earth orbit satellite communication supporting direct connection with mobile phones: key technologies, recent progress and future directions. Telecommun. Sci. 39 (02), 25–36 (2023)

MathSciNet   Google Scholar  

ITU-R, Workshop on “IMT for 2030 and beyond”. https://www.itu.int/en/ITU-R/study-groups/rsg5/rwp5d/imt-2030/Pages/default.aspx

Xiao, Z., et al.: LEO satellite access network (LEO-SAN) towards 6G: challenges and approaches. IEEE Wirel. Commun. 1–8 (2022)

Azari, M.M., Solanki, S., Chatzinotas, S., et al.: Evolution of non-terrestrial networks from 5G to 6G: a survey. IEEE Commun. Surv. Tutor. 24 (4), 2633–2672 (2022)

Fourati, F., Alouini, M.S.: Artificial intelligence for satellite communication: a review. Intell. Converg. Netw. 2 (3), 213–243 (2021)

Wang, X., Shen, W., Xing, C., et al.: Joint Bayesian channel estimation and data detection for OTFS systems in LEO satellite communications. IEEE Trans. Commun. 70 (7), 4386–4399 (2020)

Yan, W.K., Yan, Y., Fan, Y.N., Yao, X.J., Gao, X., Sun, W.: A modulation recognition algorithm based on wavelet transform entropy and high-order cumulant for satellite signal modulation. Chin. J. Space Sci. 241 (6), 968–975 (2021). (in Chinese)

Li, J., Tang, X., Gao, L., Chen, L.: Satellite communication anti-jamming based on artificial bee colony blind source separation. In: 2021 6th International Conference on Communication, Image and Signal Processing (CCISP), pp. 240–244 (2021)

Subramanian, V., Karunamurthy, J.V., Ramachandran, B.: Hardware doppler shift emulation and compensation for LoRa LEO satellite communication. In: 2023 International Conference on IT Innovation and Knowledge Discovery (ITIKD), pp. 1–6 (2023)

Kang, M.J., Lee, J.H., Chae, S.H.: Channel estimation with DnCNN in massive MISO LEO satellite systems. In: 2023 Fourteenth International Conference on Ubiquitous and Future Networks (ICUFN), pp. 825–827(2023)

Zhang, Y., Wu, Y., Liu, A., et al.: Deep learning-based channel prediction for LEO satellite massive MIMO communication system. IEEE Wirel. Commun. Lett. 10 (8), 1835–1839 (2021)

Güven, E., Kurt, G.K.: CNN-aided channel and carrier frequency offset estimation for HAPS-LEO links. In: 2022 IEEE Symposium on Computers and Communications (ISCC), pp. 1–6 (2022)

Zha, X., Peng, H., Qin, X., Li, T.Y., Li, G.: Satellite amplitude-phase signals modulation identification and demodulation algorithm based on the cyclic neural network. Acta Electron. Sin. 11 (47), 2443–2448 (2019)

Ren, J., Ji, L.B., Dang, L.: Satellite signal modulation recognition algorithm based on deep learning. Radio Eng. 52 (4), 529–535 (2022)

Han, C., Huo, L., Tong, X., et al.: Spatial anti-jamming scheme for internet of satellites based on the deep reinforcement learning and stackelberg game. IEEE Trans. Veh. Technol. 69 (5), 5331–5342 (2020)

Li, H., Liu, Y., Shi, J., Zhou, Y., Zhuo, R., Li, S.: Multimodal LSTM forecasting for LEO satellite communication terminal access. In: 2023 IEEE 97th Vehicular Technology Conference (VTC2023-Spring), pp. 1–5 (2023)

Liao, X., Hu, X., Liu, Z., et al.: Distributed intelligence: a verification for multi-agent DRL based multibeam satellite resource allocation. IEEE Commun. Lett. 42 (12), 2785–2789 (2020)

Wu, X.W., Ling, X., Zhu, L.D.: Access and mobility management technologies for 6G satellite communications network. Telecommun. Sci. 37 (06), 78–90 (2021)

Jiang, Z., Li, W., Wang, X., et al.: A LEO satellite handover strategy based on graph and multiobjective multiagent path finding. Int. J. Aerosp. Eng. 2023 , 1–16 (2023)

Hu, X., Zhang, Y., Liao, X., et al.: Dynamic beam hopping method based on multi-objective deep reinforcement learning for next generation satellite broadband systems. IEEE Trans. Broadcast. 66 (3), 630–646 (2020)

Li, X., Zhang, H., Zhou, H., et al.: Multi-agent DRL for resource allocation and cache design in terrestrial-satellite networks. IEEE Trans. Wireless Commun. 22 (8), 5031–5042 (2023)

Ma, S., Hu, X., Liao, X., Wang, W.: Deep reinforcement learning for dynamic bandwidth allocation in multi-beam satellite systems. In: 2021 IEEE 6th International Conference on Computer and Communication Systems (ICCCS), pp. 955–959 (2021)

Makki, B., Chitti, K., Behravan, A., et al.: A survey of NOMA: current status and open research challenges. IEEE Open J. Commun. Soc. 1 , 179–189 (2020)

Zhu, X., Jiang, C., Kuang, L., et al.: Non-orthogonal multiple access based integrated terrestrial-satellite networks. IEEE J. Sel. Areas Commun. 35 (10), 2253–2267 (2017)

Zhang, Q., An, K., Yan, X., et al.: User pairing for delay-limited NOMA-based satellite networks with deep reinforcement learning. Sensors 23 (16), 7062 (2023)

Lee, J.H., Seo, H., Park, J., et al.: Learning emergent random access protocol for LEO satellite networks. IEEE Trans. Wireless Commun. 22 (1), 257–269 (2023)

Yang, J., Xiao, Z., Cui, H., et al.: DQN-ALrM-based intelligent handover method for satellite-ground integrated network. IEEE Trans. Cogn. Commun. Netw. 9 (4), 977–990 (2023)

Wang, J., Mu, W., Liu, Y., Guo, L., Zhang, S., Gui, G.: Deep reinforcement learning-based satellite handover scheme for satellite communications. In: 2021 13th International Conference on Wireless Communications and Signal Processing (WCSP), pp. 1–6 (2021)

Leng, T., Xu, Y., Cui, G., et al.: Caching-aware intelligent handover strategy for LEO satellite networks. Remote Sens. 13 (11), 22–30 (2021)

Xu, H., Li, D., Liu, M., et al.: QoE-driven intelligent handover for user-centric mobile satellite networks. IEEE Trans. Veh. Technol. 69 (9), 10127–10139 (2020)

Lei, L., Lagunas, E., Yuan, Y., Kibria, M.G., Chatzinotas, S., Ottersten, B.: Deep learning for beam hopping in multibeam satellite systems. In: 2020 IEEE 91st Vehicular Technology Conference (VTC2020-Spring), pp. 1–5 (2020)

Cao, X., Li, Y., Xiong, X., et al.: Dynamic routings in satellite networks: an overview. Sensors 22 (12), 45–52 (2022)

Liu, Y., Yu, J.J.Q., Kang, J., et al.: Privacy-preserving traffic flow prediction: a federated learning approach. IEEE Internet Things J. 7 (8), 7751–7763 (2020)

Wu, G., Luo, Q., Zhu, Y., et al.: Flexible task scheduling in data relay satellite networks. IEEE Trans. Aerosp. Electron. Syst. 58 (2), 1055–1068 (2022)

Liu, D., Zhang, J., Cui, J., et al.: Deep learning aided routing for space-air-ground integrated networks relying on real satellite, flight, and shipping data. IEEE Wirel. Commun. 29 (2), 177–184 (2022)

Wang, F., Jiang, D., Wang, Z., et al.: Fuzzy-CNN based multi-task routing for integrated satellite-terrestrial networks. IEEE Trans. Veh. Technol. 71 (2), 1913–1926 (2023)

Wan, X., Fu, X., Li, J., et al.: Research on satellite traffic classification based on deep packet recognition and convolution neural network. In: 2023 8th International Conference on Computer and Communication Systems (ICCCS), pp. 494–498 (2023)

Zhu, F., Liu, L., Lin, T.: An LSTM-based traffic prediction algorithm with attention mechanism for satellite network. In: Proceedings of the 2020 3rd International Conference on Artificial Intelligence and Pattern Recognition, pp. 205–209 (2020)

Zhang, S., Liu, A., Han, C., et al.: Multi-agent reinforcement learning-based orbital edge offloading in SAGIN supporting internet of remote things. IEEE Internet Things J. 10 (23), 20472–20483 (2023)

Lan, W., Chen, K., Li, Y., et al.: Deep Reinforcement Learning for Privacy-Preserving Task Offloading in Integrated Satellite-Terrestrial Networks. arXiv (2023). http://arxiv.org/abs/2306.17183

Zhan, H., Xi, S., Jiang, H., et al.: Resource allocation and offloading strategy for UAV-assisted LEO satellite edge computing. Drones 7 (6), 383 (2023)

Han, D., Ye, Q., Peng, H., et al.: Two-timescale learning-based task offloading for remote IoT in integrated satellite-terrestrial networks. IEEE Internet Things J. 10 (12), 10131–10145 (2023)

Luo, X., Chen, H.H., Guo, Q.: Semantic communications: overview, open issues, and future research directions. IEEE Wirel. Commun. 29 (1), 210–219 (2022)

Dai, J., Zhang, P., Niu, K., et al.: Communication beyond transmitting bits: semantics-guided source and channel coding. IEEE Wirel. Commun. 4 , 170–177 (2023)


Author information

Authors and affiliations.

School of Systems Science and Engineering, Sun Yat-Sen University, Guangzhou, 100876, China

Yuanzhi He, Biao Sheng & Xiang Chen

Institute of Systems Engineering, Academy of Military Sciences, Beijing, 100141, China

Yuanzhi He, Yuan Li & Changxu Wang

China Coast Guard, Beijing, 100141, China

Jinchao Liu


Corresponding author

Correspondence to Yuanzhi He .

Editor information

Editors and affiliations.

Institute of China Electronic Equipment System Engineering Corporation, Beijing, China

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

He, Y., Sheng, B., Li, Y., Wang, C., Chen, X., Liu, J. (2024). Applications of Deep Learning in Satellite Communication: A Survey. In: Yu, Q. (eds) Space Information Networks. SINC 2023. Communications in Computer and Information Science, vol 2057. Springer, Singapore. https://doi.org/10.1007/978-981-97-1568-8_3


DOI : https://doi.org/10.1007/978-981-97-1568-8_3

Published : 28 March 2024

Publisher Name : Springer, Singapore

Print ISBN : 978-981-97-1567-1

Online ISBN : 978-981-97-1568-8

eBook Packages : Computer Science Computer Science (R0)

Published on 16.4.2024 in Vol 26 (2024)

Adverse Event Signal Detection Using Patients’ Concerns in Pharmaceutical Care Records: Evaluation of Deep Learning Models

Original Paper

  • Satoshi Nishioka 1 , PhD   ; 
  • Satoshi Watabe 1 , BSc   ; 
  • Yuki Yanagisawa 1 , PhD   ; 
  • Kyoko Sayama 1 , MSc   ; 
  • Hayato Kizaki 1 , MSc   ; 
  • Shungo Imai 1 , PhD   ; 
  • Mitsuhiro Someya 2 , BSc   ; 
  • Ryoo Taniguchi 2 , PhD   ; 
  • Shuntaro Yada 3 , PhD   ; 
  • Eiji Aramaki 3 , PhD   ; 
  • Satoko Hori 1 , PhD  

1 Division of Drug Informatics, Keio University Faculty of Pharmacy, Tokyo, Japan

2 Nakajima Pharmacy, Hokkaido, Japan

3 Nara Institute of Science and Technology, Nara, Japan

Corresponding Author:

Satoko Hori, PhD

Division of Drug Informatics

Keio University Faculty of Pharmacy

1-5-30 Shibakoen

Tokyo, 105-8512

Phone: 81 3 5400 2650

Email: [email protected]

Background: Early detection of adverse events and their management are crucial to improving anticancer treatment outcomes, and listening to patients’ subjective opinions (patients’ voices) can make a major contribution to improving safety management. Recent progress in deep learning technologies has enabled various new approaches for the evaluation of safety-related events based on patient-generated text data, but few studies have focused on the improvement of real-time safety monitoring for individual patients. In addition, no study has yet been performed to validate deep learning models for screening patients’ narratives for clinically important adverse event signals that require medical intervention. In our previous work, novel deep learning models have been developed to detect adverse event signals for hand-foot syndrome or adverse events limiting patients’ daily lives from the authored narratives of patients with cancer, aiming ultimately to use them as safety monitoring support tools for individual patients.

Objective: This study was designed to evaluate whether our deep learning models can screen clinically important adverse event signals that require intervention by health care professionals. The applicability of our deep learning models to data on patients’ concerns at pharmacies was also assessed.

Methods: Pharmaceutical care records at community pharmacies were used for the evaluation of our deep learning models. The records followed the SOAP format, consisting of subjective (S), objective (O), assessment (A), and plan (P) columns. Because of the unique combination of patients’ concerns in the S column and the professional records of the pharmacists, this was considered a suitable data source for the present purpose. Our deep learning models were applied to the S records of patients with cancer, and the extracted adverse event signals were assessed in relation to medical actions and prescribed drugs.

Results: From 30,784 S records of 2479 patients with at least 1 prescription of anticancer drugs, our deep learning models extracted true adverse event signals with more than 80% accuracy for both hand-foot syndrome (n=152, 91%) and adverse events limiting patients’ daily lives (n=157, 80.1%). The deep learning models were also able to screen adverse event signals that require medical intervention by health care providers. The extracted adverse event signals could reflect the side effects of anticancer drugs used by the patients based on analysis of prescribed anticancer drugs. “Pain or numbness” (n=57, 36.3%), “fever” (n=46, 29.3%), and “nausea” (n=40, 25.5%) were common symptoms out of the true adverse event signals identified by the model for adverse events limiting patients’ daily lives.

Conclusions: Our deep learning models were able to screen clinically important adverse event signals that require intervention for symptoms. It was also confirmed that these deep learning models could be applied to patients’ subjective information recorded in pharmaceutical care records accumulated during pharmacists’ daily work.

Introduction

Increasing numbers of people are expected to develop cancers in our aging society [ 1 - 3 ]. Thus, there is increasing interest in how to detect and manage the side effects of anticancer therapies in order to improve treatment regimens and patients’ quality of life [ 4 - 8 ]. The primary approaches for side effect management are “early signal detection and early intervention” [ 9 - 11 ]. Thus, more efficient approaches for this purpose are needed.

It has been recognized that patients’ voices concerning adverse events represent an important source of information. Several studies have indicated that the number, severity, and time of occurrence of adverse events might be underevaluated by physicians [ 12 - 15 ]. Thus, patient-reported outcomes (PROs) have recently received more attention in the drug evaluation process, reflecting patients’ real voices. Various kinds of PRO measures have been developed and investigated in different disease populations [ 16 , 17 ]. Health care authorities have also encouraged the pharmaceutical industry to use PROs for drug evaluation [ 18 , 19 ], and it is becoming more common to take PRO assessment results into consideration for drug marketing approval [ 20 , 21 ]. Similar trends can be seen in the clinical management of individual patients. Thus, health care professionals have an interest in understanding how to appropriately gather patients’ concerns in order to improve safety management and clinical decisions [ 22 - 24 ].

The applications of deep learning for natural language processing have expanded dramatically in recent years [ 25 ]. Since the development of a high-performance deep learning model in 2018 [ 26 ], attempts to apply cutting-edge deep learning models to various kinds of patient-generated text data for the evaluation of safety events or the analysis of unscalable subjective information from patients have been accelerating [ 27 - 31 ]. Most studies have been conducted to use patients’ narrative data for pharmacovigilance [ 27 , 32 - 35 ], while few have been aimed at improvement of real-time safety monitoring for individual patients. In addition, there have been some studies on adverse event severity grading based on health care records [ 36 - 39 ], but none has yet aimed to extract clinically important adverse event signals that require medical intervention from patients’ narratives. It is important to know whether deep learning models could contribute to the detection of such important adverse event signals from concern texts generated by individual patients.

To address this question, we have developed deep learning models to detect adverse event signals from individual patients with cancer based on patients’ blog articles in online communities, following other types of natural language processing–related previous work [ 40 , 41 ]. One deep learning model focused on the specific symptom of hand-foot syndrome (HFS), which is one of the typical side effects of anticancer treatments [ 42 ], and another focused on a broad range of adverse events that impact patients’ activities of daily living [ 43 ]. We showed that our models can provide good performance scores in targeting adverse event signals. However, the evaluation relied on patients’ narratives from the patients’ blog data used for deep learning model training, so further evaluation is needed to ensure the validity and applicability of the models to other texts regarding patients’ concerns. In addition, the blog data source did not contain medical information, so it was not feasible to assess whether the models could contribute to the extraction of clinically important adverse event signals.

To address these challenges, we focused on pharmaceutical care records written by pharmacists at community pharmacies. The gold standard format for pharmaceutical care records in Japan is the SOAP (subjective, objective, assessment, plan)-based document that follows the “problem-oriented system” concept proposed by Weed [ 44 ] in 1968. Pharmacists track patients’ subjective concerns in the S column, provide objective information or observations in the O column, give their assessment from the pharmacist perspective in the A column, and suggest a plan for moving forward in the P column [ 45 , 46 ]. We considered that SOAP-based pharmaceutical care records could be a unique data source suitable for further evaluation of our deep learning models because they contain both patients’ concerns and professional health care records by pharmacists, including the medication prescription history with time stamps. Therefore, this study was designed to assess whether our deep learning models could extract clinically important adverse event signals that require intervention by medical professionals from these records. We also aimed to evaluate the characteristics of the models when applied to patients’ subjective information noted in the pharmaceutical care records, as there have been only a few studies on the application of deep learning models to patients’ concerns recorded during pharmacists’ daily work [ 47 - 49 ].

Here, we report the results of applying our deep learning models to patients’ concern text data in pharmaceutical care records, focusing on patients receiving anticancer treatment.

Data Source

The original data source was 2,276,494 pharmaceutical care records for 303,179 patients, created from April 2020 to December 2021 at community pharmacies belonging to the Nakajima Pharmacy Group in Japan [ 50 ]. To focus on patients with cancer, records of patients with at least 1 prescription for an anticancer drug were retrieved by sorting individual drug codes (YJ codes) used in Japan (YJ codes starting with 42 refer to anticancer drugs). Records in the S column (ie, S records) were collected from the patients with cancer as the text data of patients’ concerns for deep learning model analysis.
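The cohort selection described above amounts to a prefix filter on drug codes followed by collection of the S column. The following is a minimal sketch only: the record layout and the field names `patient_id`, `yj_codes`, and `s_text` are assumptions made for illustration, since the schema of the pharmacy data is not described here.

```python
from typing import Iterable

ANTICANCER_PREFIX = "42"  # YJ codes starting with 42 refer to anticancer drugs


def select_cancer_s_records(records: Iterable[dict]) -> list[str]:
    """Return S-column texts for patients with at least 1 anticancer prescription.

    The field names (patient_id, yj_codes, s_text) are hypothetical.
    """
    records = list(records)
    # First pass: find patients with any prescription whose YJ code starts with 42.
    cancer_patients = {
        r["patient_id"]
        for r in records
        if any(code.startswith(ANTICANCER_PREFIX) for code in r["yj_codes"])
    }
    # Second pass: keep every non-empty S record belonging to those patients,
    # including records from visits without an anticancer prescription.
    return [
        r["s_text"]
        for r in records
        if r["patient_id"] in cancer_patients and r["s_text"]
    ]
```

The two-pass design reflects the wording of the selection rule: a patient qualifies through any single anticancer prescription, and all of that patient's S records are then retained.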

Deep Learning Models

The deep learning models used for this research were those that we constructed based on patients’ narratives in blog articles posted in an online community and that showed the best performance score in each task in our previous work (ie, a Bidirectional Encoder Representations From Transformers [BERT]–based model for HFS signal extraction [ 42 ] and a T5-based model for adverse event signal extraction [ 43 ]). BERT [ 26 ] and T5 [ 51 ] both belong to a type of deep learning model that has recently shown high performance in several studies [ 29 , 52 ]. Hereafter, we refer to the deep learning model for HFS signals as the HFS model, the model for any adverse event signals as All AE (ie, all or any adverse events) model, and the model for adverse event signals limited to patients’ activities of daily living as the AE-L (adverse events limiting patients’ daily lives) model. It was also confirmed that these deep learning models showed similar or higher performance scores for the HFS, All AE, or AE-L identification tasks using 1000 S records randomly extracted from the data source of this study compared to the values obtained in our previous work [ 42 , 43 ] (the performance scores of sentence-level tasks from our previous work are comparable, as the mean number of words in the sentences in the data source in our previous work was 32.7 [SD 33.9], which is close to that of the S records used in this study, 38.8 [SD 29.4]). The method and results of the performance-level check are described in detail in Multimedia Appendix 1 [ 42 , 43 ]. We applied the deep learning models to all text data in this study without any adjustment in setting parameters from those used in constructing them based on patient-authored texts in our previous work [ 42 , 43 ].

Evaluation of Extracted S Records by the Deep Learning Models

In this study, we focused on the evaluation of S records that our deep learning models extracted as HFS or AE-L positive. Each positive S record was assessed for whether it was a true adverse event signal, what type of adverse event symptom it described, and whether or not an intervention was made by health care professionals. We also investigated the kind of anticancer treatment prescription in connection with each adverse event signal identified in S records.

To assess whether an extracted positive S record was a true adverse event signal, we used the same annotation guidelines as in our previous work [ 43 ]. In brief, each S record was treated as an “adverse event signal” if any untoward medical occurrence happened to the patient, regardless of the cause. For the AE-L model only, if a positive S record was confirmed as an adverse event signal, it was further categorized into 1 or more of the following adverse event symptoms: “fatigue,” “nausea,” “vomiting,” “diarrhea,” “constipation,” “appetite loss,” “pain or numbness,” “rash or itchy,” “hair loss,” “menstrual irregularity,” “fever,” “taste disorder,” “dizziness,” “sleep disorder,” “edema,” or “others.”

For the assessment of interventions by health care professionals and anticancer treatment prescriptions, information from the O, A, and P columns and drug prescription history in the data source was investigated for the extracted positive S records. The interventions by health care professionals were categorized into one of the following: “adding symptomatic treatment for the adverse event signal,” “dose reduction or discontinuation of causative anticancer treatment,” “consultation with physician,” “others,” or “no intervention (ie, just following up the adverse event signal).” The actions categorized as “others” were further evaluated individually. For this assessment, we also randomly extracted 200 S records and evaluated them in the same way for comparison with the results from the deep learning model. Prescription history of anticancer treatment was analyzed by primary category of mechanism of action (MoA) with subcategories if applicable (eg, target molecule for kinase inhibitors).

Applicability Check to Other Text Data Including Patients’ Concerns

To check the applicability of our deep learning models to data from a different source, interview transcripts from patients with cancer were also evaluated. The interview transcripts were created by the Database of Individual Patient Experiences-Japan (DIPEx-Japan) [ 53 ]. DIPEx-Japan divides the interview transcripts into sections for each topic, such as “onset of disease” and “treatment,” and posts the processed texts on its website. Processing is conducted by accredited researchers based on qualitative research methods established by the University of Oxford [ 54 ]. In this study, interview text data created from interviews with 52 patients with breast cancer conducted from January 2008 to October 2018 were used to assess whether our deep learning models can extract adverse event signals from this source. In total, 508 interview transcripts were included with the approval of DIPEx-Japan.

Ethical Considerations

This study was conducted with anonymized data following approval by the ethics committee of the Keio University Faculty of Pharmacy (210914-1 and 230217-1) and in accordance with relevant guidelines and regulations and the Declaration of Helsinki. Informed consent specific to this study was waived due to the retrospective observational design of the study with the approval of the ethics committee of the Keio University Faculty of Pharmacy. To respect the will of each individual stakeholder, however, we provided patients and pharmacists of the pharmacy group with an opportunity to refuse the sharing of their pharmaceutical care records by posting an overview of this study at each pharmacy store or on their web page regarding the analysis using pharmaceutical care records. Interview transcripts from DIPEx-Japan were provided through a data sharing arrangement for using narrative data for research and education. Consent for interview transcription and its sharing from DIPEx-Japan was obtained from the participants when the interviews were recorded.

From the original data source of 2,276,494 pharmaceutical care records for 303,179 patients, S records written by pharmacists for patients with a history of at least 1 prescription of an anticancer drug were extracted. This yielded 30,784 S records for 2479 patients with cancer ( Table 1 ). The mean and median number of words in the S records were 38.8 (SD 29.4) and 32 (IQR 20-50), respectively. We applied our deep learning models, HFS, All AE, and AE-L, to these 30,784 S records for the evaluation of the deep learning models for adverse event signal detection.

For interview transcripts created by DIPEx-Japan, the mean and median number of words were 428.9 (SD 160.9) and 416 (IQR 308-526), respectively, in the 508 transcripts for 52 patients with breast cancer.

a SOAP: subjective, objective, assessment, plan.

b S: subjective.

Application of the HFS Model

First, we applied the HFS model to the S records for patients with cancer. The BERT-based model was used for this research as it showed the best performance score in our previous work [ 42 ].

S Records Extracted as HFS Positive

The S records extracted as HFS positive by the HFS model ( Table 2 ) amounted to 167 (0.5%) records for 119 (4.8%) patients. A majority of the patients had 1 HFS-positive record in their S records (n=91, 76.5%), while 2 patients had as many as 6 (1.7%) HFS-positive records. When we examined whether the extracted S records were true adverse event signals or not, 152 records were confirmed to be adverse event signals, while the other 15 records were false-positives. All the false-positive S records were descriptions about the absence of symptoms or confirmation of improving condition (eg, “no diarrhea, mouth ulcers, or limb pain so far” or “the skin on the soles of my feet has calmed down a lot with this ointment”). Some examples of S records that were predicted as HFS positive by the model are shown in Table S1 in Multimedia Appendix 2 .

The same examination was conducted with interview transcripts from DIPEx-Japan. Only 1 (0.2%) transcript was extracted as HFS positive by the HFS model, and it was a true adverse event signal (100%). The actual transcript extracted as HFS positive is shown in Table S2 in Multimedia Appendix 2 .

a S: subjective.

b HFS: hand-foot syndrome.

c All false-positive S records were denial of symptoms or confirmation of improving condition.

Interventions by Health Care Professionals

The 167 S records extracted as HFS positive as well as 200 randomly selected records were checked for interventions by health care professionals ( Figure 1 ). The proportion showing any action by health care professionals was 64.1% for 167 HFS-positive S records compared to 13% for the 200 random S records. Among the actions taken for HFS positives, “adding symptomatic treatment” was the most common, accounting for around half (n=79, 47.3%), followed by “other” (n=18, 10.8%). Most “other” actions were educational guidance from pharmacists, such as instructions on moisturizing, nail care, or application of ointment and advice on daily living (eg, “avoid tight socks”).


Anticancer Drugs Prescribed

The types of anticancer drugs prescribed for HFS-positive patients are summarized based on the prescription histories in Table 3 . For the 152 adverse event signals identified by the HFS model in the previous section, the most common MoA class of anticancer drugs used for the patients was antimetabolite (n=62, 40.8%), specifically fluoropyrimidines (n=59, 38.8%). Kinase inhibitors were next (n=49, 32.2%), with epidermal growth factor receptor (EGFR) inhibitors and multikinase inhibitors as major subgroups (n=28, 18.4% and n=14, 9.2%, respectively). The third and fourth most common MoAs were aromatase inhibitors (n=24, 15.8%) and antiandrogen or estrogen drugs (n=7, 4.6% each) for hormone therapy.

a EGFR: epidermal growth factor receptor.

b VEGF: vascular endothelial growth factor.

c HER2: human epidermal growth factor receptor-2.

d CDK4/6: cyclin-dependent kinase 4/6.

Application of the All AE or AE-L model

The All AE and AE-L models were also applied to the same S records for patients with cancer. The T5-based model was used for this research as it gave the best performance score in our previous work [ 43 ].

S Records Extracted as All AE or AE-L positive

The numbers of S records extracted as positive were 7604 (24.7%) for 1797 patients and 196 (0.6%) for 142 patients for All AE and AE-L, respectively. In the case of All AE, patients tended to have multiple adverse event positives in their S records (n=1315, 73.2% of patients had at least 2 positives). In the case of AE-L, most patients had only 1 AE-L positive (n=104, 73.2%), and the largest number of AE-L positives for 1 patient was 4 (2.8%; Table 4 ).

We focused on AE-L evaluation due to its greater importance from a medical viewpoint and lower workload for manual assessment, considering the number of positive S records. Of the 196 AE-L–positive S records, it was confirmed that 157 (80.1%) records accurately extracted adverse event signals, while 39 (19.9%) records were false-positives that did not include any adverse event signals ( Table 4 ). The contents of the 39 false-positives were all descriptions about the absence of symptoms or confirmation of improving condition, showing a similar tendency to the HFS false-positives (eg, “The diarrhea has calmed down so far. Symptoms in hands and feet are currently fine” and “No symptoms for the following: upset in stomach, diarrhea, nausea, abdominal pain, abdominal pain or stomach cramps, constipation”). Examples of S records that were predicted as AE-L positive are shown in Table S3 in Multimedia Appendix 2 .

The deep learning models were also applied to interview transcripts from DIPEx-Japan in the same manner. The deep learning models identified 84 (16.5%) and 18 (3.5%) transcripts as All AE and AE-L positive, respectively. Of the 84 All AE–positive transcripts, 73 (86.9%) were true adverse event signals. The false-positives of All AE (n=11, 13.1%) were categorized into one of the following 3 types: explanations about the disease or its prognosis, stories about when their cancer was discovered, or emotional changes that did not include clear adverse event mentions. With regard to AE-L, all the 18 (100%) positives were true adverse event signals (Table S4 in Multimedia Appendix 2 ). Examples of actual transcripts extracted as All AE or AE-L positive are shown in Table S5 in Multimedia Appendix 2 .

b All AE: all (or any of) adverse event.

c AE-L: adverse events limiting patients’ daily lives.

d All false-positive S records were denial of symptoms or confirmation of improving condition.

Whether or not interventions were made by health care professionals was investigated for the 196 AE-L–positive S records. As in the HFS model evaluation, data from 200 randomly selected S records were used for comparison ( Figure 2 ). In total, 91 (46.4%) records in the 196 AE-L–positive records were accompanied by an intervention, while the corresponding figure in the 200 random records was 26 (13%) records. The most common action in response to adverse event signals identified by the AE-L model was “adding symptomatic treatment” (n=71, 36.2%), followed by “other” (n=11, 5.6%). “Other” included educational guidance from pharmacists, inquiries from pharmacists to physicians, or recommendations for patients to visit a doctor.
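The percentages in this comparison can be rechecked directly from the counts reported above. The snippet below is a small arithmetic sketch; the helper name `percent` is ours and is not part of the study's analysis code.

```python
def percent(part: int, whole: int) -> float:
    """Percentage rounded to 1 decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


AE_L_POSITIVE = 196  # AE-L-positive S records
RANDOM_SAMPLE = 200  # randomly selected S records used for comparison

# Records accompanied by any intervention by health care professionals.
ae_l_intervention = percent(91, AE_L_POSITIVE)
random_intervention = percent(26, RANDOM_SAMPLE)

# Breakdown of the two most common actions among AE-L positives.
symptomatic_treatment = percent(71, AE_L_POSITIVE)
other_actions = percent(11, AE_L_POSITIVE)

print(ae_l_intervention, random_intervention, symptomatic_treatment, other_actions)
```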


The types of anticancer drugs prescribed for patients with adverse event signals identified by the AE-L model were summarized based on the prescription histories ( Table 5 ). In connection with the 157 adverse event signals, the most common MoA of the prescribed anticancer drugs was antimetabolite (n=62, 39.5%), with fluoropyrimidine (n=53, 33.8%) accounting for the majority of this class. Kinase inhibitor (n=31, 19.7%) was the next largest category with multikinase inhibitor (n=14, 8.9%) as the major subgroup. These were followed by antiandrogen (n=27, 17.2%), antiestrogen (n=10, 6.4%), and aromatase inhibitor (n=10, 6.4%) for hormone therapy.

b JAK: janus kinase.

c VEGF: vascular endothelial growth factor.

d BTK: bruton tyrosine kinase.

e FLT3: FMS-like tyrosine kinase-3.

f PARP: poly-ADP ribose polymerase.

g CDK4/6: cyclin-dependent kinase 4/6.

h CD20: cluster of differentiation 20.

Adverse Event Symptoms

For the 157 adverse event signals identified by the AE-L model, the symptoms were categorized according to the predefined guideline in our previous work [ 43 ]. “Pain or numbness” (n=57, 36.3%) accounted for the largest proportion followed by “fever” (n=46, 29.3%) and “nausea” (n=40, 25.5%; Table 6 ). Symptoms classified as “others” included chills, tinnitus, running tears, dry or peeling skin, and frequent urination. When comparing the proportion of the symptoms associated with or without interventions by health care professionals, a trend toward a greater proportion of interventions was observed in “fever,” “nausea,” “diarrhea,” “constipation,” “vomiting,” and “edema” ( Figure 3 , black boxes). On the other hand, a smaller proportion was observed in “pain or numbness,” “fatigue,” “appetite loss,” “rash or itchy,” “taste disorder,” and “dizziness” ( Figure 3 , gray boxes).


This study was designed to evaluate our deep learning models, previously constructed based on patient-authored texts posted in an online community, by applying them to pharmaceutical care records that contain both patients’ subjective concerns and medical information created by pharmacists. Based on the results, we discuss whether these deep learning models can extract clinically important adverse event signals that require medical intervention, and what characteristics they show when applied to data on patients’ concerns in pharmaceutical care records.

Performance for Adverse Event Signal Extraction

The first requirement for the deep learning models is to extract adverse event signals from patients’ narratives precisely. In this study, we evaluated the proportion of true adverse event signals in positive S records extracted by the HFS or AE-L model. True adverse event signals amounted to 152 (91%) and 157 (80.1%) for the HFS and AE-L models, respectively (Tables 2 and 4). Given that only 54 (27%) of 200 randomly extracted S records contained true adverse event signals without the deep learning models (categories other than “no adverse event” in Figures 1 and 2), the HFS and AE-L models were able to concentrate S records with adverse event mentions. Although 15 (9%) of the records extracted by the HFS model and 39 (19.9%) of those extracted by the AE-L model were false-positives, we confirmed that all of the false-positive records described a lack of symptoms or confirmation of an improving condition. We consider that such false-positives are due to a unique feature of pharmaceutical care records, in which pharmacists might proactively interview patients about potential side effects of their medications. As the data set of blog articles we used to construct the deep learning models included few such cases (especially comments on lack of symptoms), our models seemed unable to exclude them correctly. Even though we confirmed that the proportion of true “adverse event” signals extracted from the S records by the HFS or AE-L model was more than 80%, the performance scores for extracting true “HFS” or “AE-L” signals were not high in the performance check using 1000 randomly extracted S records (F1-scores were 0.50 and 0.22 for true HFS and AE-L signals, respectively; Table S1 in Multimedia Appendix 1). 
We consider that the performance in extracting true HFS and AE-L signals was relatively low because of the short length of the texts in the S records, which provide less context for judging the impact on patients’ daily lives, especially for the AE-L model (the mean word number of the S records was 38.8 [SD 29.4; Table 1], similar to the sentence-level tasks in our previous work [42,43]). However, we consider that a true adverse event signal proportion of more than 80% in this study represents a promising outcome, as this is the first attempt to apply our deep learning models to a different source of patients’ concern data, and the extracted positive cases would be worthy of evaluation by a medical professional, as the potential adverse events could be caused by drugs taken by the patients.
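The proportions quoted above can be reproduced from the reported counts. The following is a minimal sketch, not the authors' evaluation code: the class and function names are illustrative, and the denominators (167 and 196 extracted records) are an assumption obtained by adding the reported true signals and false-positives for each model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionResult:
    """Counts for the S records one model flagged as positive (as quoted in the text)."""
    name: str
    true_signals: int      # records confirmed to contain an adverse event signal
    false_positives: int   # records flagged by the model without a true signal

    @property
    def extracted(self) -> int:
        # Assumption: every flagged record is either a true signal or a false-positive.
        return self.true_signals + self.false_positives

    @property
    def precision(self) -> float:
        """Proportion of true adverse event signals among flagged records."""
        return self.true_signals / self.extracted


def enrichment(precision: float, baseline: float) -> float:
    """How many times more concentrated true signals are than in a random sample."""
    return precision / baseline


# 54 of 200 randomly extracted S records contained a true adverse event signal.
BASELINE = 54 / 200

RESULTS = [
    ExtractionResult("HFS", true_signals=152, false_positives=15),
    ExtractionResult("AE-L", true_signals=157, false_positives=39),
]

for r in RESULTS:
    print(
        f"{r.name}: {r.true_signals}/{r.extracted} = {r.precision:.1%} true signals, "
        f"{enrichment(r.precision, BASELINE):.2f}x the random-sample rate"
    )
```

The enrichment ratio is not reported in the study; it is shown here only to make the comparison with the 27% baseline concrete.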

When the deep learning models were applied to DIPEx-Japan interview transcripts, which include patients’ concerns, the proportion of true adverse event signals was also more than 80% (for All AE: n=73, 86.9% and for HFS and AE-L: n=18, 100%). The difference between the results for pharmaceutical care S records and DIPEx-Japan interview transcripts lay in the features of the false-positives: descriptions of a lack of symptoms or confirmation of an improving condition in S records versus explanations about the disease or its prognosis, stories about when the cancer was discovered, or emotional changes in interview transcripts. We consider that this reflects the difference in the nature of the data sources; the pharmaceutical care records were generated in real time by pharmacists through their daily work, in which adverse event signals are proactively monitored, while the interview transcripts were based purely on patients’ retrospective memories. Our deep learning models were able to extract adverse event signals from both text data sources with a true-signal proportion of more than 80%, in spite of the difference in their nature. With a view to future implementation of the deep learning models in society (discussed in the Potential for Deep Learning Model Implementation in Society section), it may be desirable to further adjust the deep learning models to reduce false-positives depending on the features of the data source.

Identification of Important Adverse Events Requiring Medical Intervention

To assess whether the models could extract clinically important adverse event signals, we investigated interventions by health care professionals connected with the adverse event signals identified by our deep learning models. Of the 200 randomly extracted S records, only 26 (13%) contained adverse event signals that led to any intervention by health care professionals. In contrast, the proportion of signals associated with interventions increased to 107 (64.1%) and 91 (46.4%) in the S records extracted as positive by the HFS and AE-L models, respectively (Figures 1 and 2). These results suggest that both deep learning models can screen clinically important adverse event signals that require intervention from health care professionals. The performance level in screening adverse event signals requiring medical intervention was higher in the HFS model than in the AE-L model (n=107, 64.1% vs n=91, 46.4%; Figures 1 and 2). Since HFS is one of the typical side effects of some anticancer drugs and its adverse event signals were specifically and narrowly defined, we consider that health care providers paid special attention to HFS-related signals and took action proactively. With both deep learning models, similar trends were observed in the actions taken by health care professionals in response to extracted adverse event signals; common actions were attempts to manage adverse event symptoms by symptomatic treatment or other mild interventions, including educational guidance from pharmacists or recommendations for patients to visit a doctor. More direct interventions focused on the causative drugs (ie, “dose reduction or discontinuation of anticancer treatment”) amounted to less than 5%: 7 (4.2%) for the HFS model and 6 (3.1%) for the AE-L model (Figures 1 and 2). 
Thus, it appears that our deep learning models can contribute to screening mild to moderate adverse event signals that require preventive actions such as symptomatic treatments or professional advice from health care providers, especially for patients with less sensitivity to adverse event signals or who have few opportunities to visit clinics and pharmacies.
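The intervention figures above can be checked in the same way. This is a minimal sketch under stated assumptions: the dictionary layout and function names are hypothetical, and the denominators of 167 (HFS) and 196 (AE-L) flagged records are inferred from the reported percentages rather than stated in this section.

```python
# Counts quoted in the text; "records" for the two models is an inferred denominator.
BASELINE_SAMPLE = {"records": 200, "any_intervention": 26}
MODEL_POSITIVES = {
    "HFS": {"records": 167, "any_intervention": 107, "dose_reduction_or_discontinuation": 7},
    "AE-L": {"records": 196, "any_intervention": 91, "dose_reduction_or_discontinuation": 6},
}


def proportion(count: int, total: int) -> float:
    """Return count / total, rejecting an empty denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def intervention_summary(positives: dict, baseline: dict) -> dict:
    """Compare intervention rates in model-flagged records with the random sample."""
    base_rate = proportion(baseline["any_intervention"], baseline["records"])
    summary = {}
    for name, counts in positives.items():
        rate = proportion(counts["any_intervention"], counts["records"])
        summary[name] = {
            "intervention_rate": rate,
            "ratio_to_baseline": rate / base_rate,
            "direct_drug_action_rate": proportion(
                counts["dose_reduction_or_discontinuation"], counts["records"]
            ),
        }
    return summary


SUMMARY = intervention_summary(MODEL_POSITIVES, BASELINE_SAMPLE)

for name, stats in SUMMARY.items():
    print(
        f"{name}: any intervention {stats['intervention_rate']:.1%} "
        f"({stats['ratio_to_baseline']:.2f}x the random sample), "
        f"dose reduction or discontinuation {stats['direct_drug_action_rate']:.1%}"
    )
```

The ratio to the random sample is a derived quantity added for illustration; the study itself reports only the counts and percentages.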

Ability to Catch Real Side Effect Signals of Anticancer Drugs

Based on the drug prescription history associated with S records extracted as HFS or AE-L positive, we investigated the type and duration of anticancer drugs taken by patients experiencing the adverse event signals. For the HFS model, the most common MoA of the anticancer drugs was antimetabolite (fluoropyrimidine: n=59, 38.8%), followed by kinase inhibitors (n=49, 32.2%, of which EGFR inhibitors and multikinase inhibitors accounted for n=28, 18.4% and n=14, 9.2%, respectively) and aromatase inhibitors (n=24, 15.8%; Table 3). Fluoropyrimidine and multikinase inhibitors are known to be typical HFS-inducing drugs [55-58], suggesting that the HFS model accurately extracted HFS side effect signals derived from these drugs. Note that symptoms such as acneiform rash, xerosis, eczema, paronychia, changes in the nails, arthralgia, or stiffness of limb joints, which are common side effects of EGFR inhibitors or aromatase inhibitors [59,60], might have been extracted because they are expressed in terms closely related to those of HFS signals. For patients with adverse event signals identified by the AE-L model, antimetabolite (fluoropyrimidine) was again the most common MoA (n=53, 33.8%), as with the HFS model, followed by kinase inhibitors (n=31, 19.7%) and antiandrogens (n=27, 17.2%; Table 5). Since the AE-L model targets a broad range of adverse event symptoms, it is difficult to rationalize the relationship between the adverse event signals and the types of anticancer drugs. However, the types of anticancer drugs would presumably correspond closely to the standard treatments for the patients’ cancer types. Based on the prescribed anticancer drugs, we can infer that a large percentage of the patients had breast or lung cancer, indicating that our study results were based on data from such a population. 
Thus, a possible direction for the expansion of this research would be adjusting the deep learning models by additional training with expressions for typical side effects associated with standard treatments of other cancer types. To interpret these results correctly, it should be noted that we could not investigate anticancer treatments conducted outside of the pharmacies (eg, the time-course relationship with intravenously administered drugs would be missed, as the administration will be done at hospitals). To further evaluate how useful this model is in side effect signal monitoring for patients with cancer, comprehensive medical information for the eligible patients would be required.

Suitability of the Deep Learning Models for Specific Adverse Event Symptoms

The symptoms in the adverse event signals identified by the AE-L model were categorized according to a predefined annotation guideline that we previously developed [43]. The most frequently recorded adverse event signals identified by the AE-L model were “pain or numbness” (n=57, 36.3%), “fever” (n=46, 29.3%), and “nausea” (n=40, 25.5%; Table 6). Since the pharmaceutical care records contained information about interventions by health care professionals, we examined how often interventions were present or absent for each symptom. A trend toward a greater proportion of interventions was observed for “fever,” “nausea,” “diarrhea,” “constipation,” “vomiting,” and “edema” (Figure 3, black boxes). There seem to be 2 possible explanations for this: either these symptoms are of high importance and require early medical intervention, or effective symptomatic treatments are available for them in clinical practice, so that medical intervention is an easy option. On the other hand, a smaller proportion of adverse event signals tended to result in interventions for “pain or numbness,” “fatigue,” “appetite loss,” “rash or itchy,” “taste disorder,” and “dizziness” (Figure 3, gray boxes). The reason for this may be the lack of effective symptomatic treatments or the difficulty of judging whether the severity of these symptoms justifies medical intervention by health care providers. In either case, there may be room for improvement in the quality of medical care for these symptoms. We expect that our research will contribute to improving the quality of safety monitoring in clinical practice by supporting adverse event signal detection in a cost-effective manner.

Potential for Deep Learning Model Implementation in Society

Although we evaluated our deep learning models using pharmaceutical care records in this study, the main target of future implementation of our deep learning models in society would be narrative texts that patients write directly to record their daily experiences. For example, applying these deep learning models to electronic media where patients record their daily experiences of living with disease (eg, health care–related e-communities and disease diary applications) could enable information about the onset of adverse event signals to be provided to health care providers in a timely manner. Adverse event signals could be automatically identified and shared with health care providers based on the concern texts that patients post to any platform. Such a system would have the advantage that health care providers could efficiently grasp safety-related events that patients experience outside of clinic visits, so that they could conduct more focused or personalized interactions with patients at their clinic visits. However, care should be taken to avoid placing an excessive burden on health care providers. For instance, limiting the sharing of adverse event signals to those of high severity, or summarizing adverse event signals over a week rather than sharing each one in real time, may be reasonable approaches for medical staff. We also need to think about how to encourage patients to record their daily experiences using electronic tools. Not only technical progress and support but also the establishment of an ecosystem in which both patients and medical staff can feel a benefit will be required. Prospective studies that use the deep learning models to follow up patients in the long term and evaluate outcomes will be needed. We primarily looked at patient-authored texts as targets of implementation, but our deep learning models may also be worth applying to medical data that include patients’ subjective concerns, such as pharmaceutical care S records. 
As this study confirmed that our deep learning models are applicable to patients’ concern texts tracked by pharmacists, it should be possible to use them to analyze other “patient voice-like” medical text data that have not been actively investigated so far.
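The two burden-reducing options mentioned above (sharing only high-severity signals, and summarizing over a week instead of in real time) can be illustrated with a small sketch. Everything here is hypothetical: the `Signal` record, the 1 to 3 severity scale, and the `weekly_digest` function are assumptions made for illustration and do not describe a system built in this study.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Signal:
    """One adverse event signal flagged in a patient's text (hypothetical schema)."""
    patient_id: str
    day: date
    symptom: str
    severity: int  # hypothetical scale: 1 = mild, 2 = moderate, 3 = severe


def weekly_digest(signals, min_severity=2):
    """Group signals by patient and ISO week, keeping only those at or above min_severity.

    Returns {(patient_id, iso_year, iso_week): sorted list of distinct symptoms}.
    """
    digest = defaultdict(set)
    for s in signals:
        if s.severity < min_severity:
            continue  # low-severity signals are not forwarded to staff
        iso = s.day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        digest[(s.patient_id, iso[0], iso[1])].add(s.symptom)
    return {key: sorted(symptoms) for key, symptoms in digest.items()}


# Made-up example data, for illustration only.
EXAMPLE_SIGNALS = [
    Signal("p1", date(2024, 4, 15), "nausea", 2),
    Signal("p1", date(2024, 4, 17), "fever", 3),
    Signal("p1", date(2024, 4, 18), "fatigue", 1),
    Signal("p1", date(2024, 4, 23), "nausea", 2),
]

DIGEST = weekly_digest(EXAMPLE_SIGNALS)
for (patient, year, week), symptoms in sorted(DIGEST.items()):
    print(f"{patient} {year}-W{week:02d}: {', '.join(symptoms)}")
```

Batching by ISO week gives staff one summary per patient per week; the severity threshold is a tunable parameter, and how severity would be assigned is outside the scope of this sketch.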

Limitations

First, the major limitation of this study was that we were not able to collect complete medical information about the patients. Although we designed this study to analyze patients’ concerns extracted by the deep learning models and their relationship with medical information contained in the pharmaceutical care records, some information could not be tracked (eg, the history of medical interventions or anticancer treatment at hospitals, as well as the diagnosis of patients’ primary cancers). Second, there might be a data creation bias in the S records, because patients’ concerns are recorded by pharmacists. For example, symptoms that have little impact on intervention decisions might be less likely to be recorded. It should also be noted that the characteristics of S records may not be consistent across community pharmacies.

Conclusions

Our deep learning models were able to screen clinically important adverse event signals that require intervention by health care professionals from patients’ concerns in pharmaceutical care records. Thus, these models have the potential to support real-time adverse event monitoring of individual patients taking anticancer treatments in an efficient manner. We also confirmed that these deep learning models constructed based on patient-authored texts could be applied to patients’ subjective information recorded by pharmacists through their daily work. Further research may help to expand the applicability of the deep learning models for implementation in society or for analysis of data on patients’ concerns accumulated in professional records at pharmacies or hospitals.

Acknowledgments

This work was supported by Japan Society for the Promotion of Science, Grants-in-Aid for Scientific Research (KAKENHI; grant 21H03170) and Japan Science and Technology Agency, Core Research for Evolutional Science and Technology (CREST; grant JPMJCR22N1), Japan. Mr Yuki Yokokawa and Ms Sakura Yokoyama at our laboratory advised SN about the structure of pharmaceutical care records. This study would not have been feasible without the high quality of pharmaceutical care records created by many individual pharmacists at Nakajima Pharmacy Group through their daily work.

Data Availability

The data sets generated and analyzed during this study are available from the corresponding author on reasonable request.

Authors' Contributions

SN and SH designed the study. SN retrieved the subjective records of patients with cancer from the data source for the application of deep learning models and organized other data for subsequent evaluations. SN ran the deep learning models with the support of SW. SN, YY, and KS checked the adverse event signals for each subjective record that was extracted as positive by the models for hand-foot syndrome or adverse events limiting patients’ daily lives and evaluated the adverse event signal symptoms, details of interventions taken by health care professionals, and types of anticancer drugs prescribed for patients based on available data from the data source. HK and SI advised on the study concept and process. MS and RT provided pharmaceutical records at their community pharmacies along with advice on how to use and interpret them. SY and EA supervised the natural language processing research as specialists. SH supervised the study overall. SN drafted and finalized the paper. All authors reviewed and approved the paper.

Conflicts of Interest

SN is an employee of Daiichi Sankyo Co, Ltd. All other authors declare no conflicts of interest.

Performance evaluation of deep learning models.

Examples of S records and sample interview transcripts.

  • Global cancer observatory: cancer over time. World Health Organization. URL: https://gco.iarc.fr/overtime/en [accessed 2023-07-02]
  • Mattiuzzi C, Lippi G. Current cancer epidemiology. J Epidemiol Glob Health. 2019;9(4):217-222. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Montazeri F, Komaki H, Mohebi F, Mohajer B, Mansournia MA, Shahraz S, et al. Editorial: disparities in cancer prevention and epidemiology. Front Oncol. 2022;12:872051. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lasala R, Santoleri F. Association between adherence to oral therapies in cancer patients and clinical outcome: a systematic review of the literature. Br J Clin Pharmacol. 2022;88(5):1999-2018. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Pudkasam S, Polman R, Pitcher M, Fisher M, Chinlumprasert N, Stojanovska L, et al. Physical activity and breast cancer survivors: importance of adherence, motivational interviewing and psychological health. Maturitas. 2018;116:66-72. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Markman M. Chemotherapy-associated neurotoxicity: an important side effect-impacting on quality, rather than quantity, of life. J Cancer Res Clin Oncol. 1996;122(9):511-512. [ CrossRef ] [ Medline ]
  • Jitender S, Mahajan R, Rathore V, Choudhary R. Quality of life of cancer patients. J Exp Ther Oncol. 2018;12(3):217-221. [ Medline ]
  • Di Nardo P, Lisanti C, Garutti M, Buriolla S, Alberti M, Mazzeo R, et al. Chemotherapy in patients with early breast cancer: clinical overview and management of long-term side effects. Expert Opin Drug Saf. 2022;21(11):1341-1355. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Cuomo RE. Improving cancer patient outcomes and cost-effectiveness: a Markov simulation of improved early detection, side effect management, and palliative care. Cancer Invest. 2023;41(10):858-862. [ CrossRef ] [ Medline ]
  • Pulito C, Cristaudo A, La Porta C, Zapperi S, Blandino G, Morrone A, et al. Oral mucositis: the hidden side of cancer therapy. J Exp Clin Cancer Res. 2020;39(1):210. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Bartal A, Mátrai Z, Szûcs A, Liszkay G. Main treatment and preventive measures for hand-foot syndrome, a dermatologic side effect of cancer therapy. Magy Onkol. 2011;55(2):91-98. [ FREE Full text ] [ Medline ]
  • Basch E, Jia X, Heller G, Barz A, Sit L, Fruscione M, et al. Adverse symptom event reporting by patients vs clinicians: relationships with clinical outcomes. J Natl Cancer Inst. 2009;101(23):1624-1632. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Basch E. The missing voice of patients in drug-safety reporting. N Engl J Med. 2010;362(10):865-869. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Fromme EK, Eilers KM, Mori M, Hsieh YC, Beer TM. How accurate is clinician reporting of chemotherapy adverse effects? A comparison with patient-reported symptoms from the Quality-of-Life Questionnaire C30. J Clin Oncol. 2004;22(17):3485-3490. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Liu L, Suo T, Shen Y, Geng C, Song Z, Liu F, et al. Clinicians versus patients subjective adverse events assessment: based on patient-reported outcomes version of the common terminology criteria for adverse events (PRO-CTCAE). Qual Life Res. 2020;29(11):3009-3015. [ CrossRef ] [ Medline ]
  • Churruca K, Pomare C, Ellis LA, Long JC, Henderson SB, Murphy LED, et al. Patient-reported outcome measures (PROMs): a review of generic and condition-specific measures and a discussion of trends and issues. Health Expect. 2021;24(4):1015-1024. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Pérez-Alfonso KE, Sánchez-Martínez V. Electronic patient-reported outcome measures evaluating cancer symptoms: a systematic review. Semin Oncol Nurs. 2021;37(2):151145. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Patient-reported outcome measures: use in medical product development to support labeling claims. U.S. Food & Drug Administration. 2009. URL: https://www.fda.gov/regulatory-information/search-fda-guidance-documents/patient-reported-outcome-measures-use-medical-product-development-support-labeling-claims [accessed 2023-11-26]
  • Appendix 2 to the guideline on the evaluation of anticancer medicinal products in man - the use of patient-reported outcome (PRO) measures in oncology studies - scientific guideline. European Medicines Agency. 2016. URL: https://www.ema.europa.eu/en/appendix-2-guideline-evaluation-anticancer-medicinal-products-man-use-patient-reported-outcome-pro [accessed 2023-11-26]
  • Weber SC. The evolution and use of patient-reported outcomes in regulatory decision making. RF Q. 2023;3(1):4-9. [ FREE Full text ]
  • Teixeira MM, Borges FC, Ferreira PS, Rocha J, Sepodes B, Torre C. A review of patient-reported outcomes used for regulatory approval of oncology medicinal products in the European Union between 2017 and 2020. Front Med (Lausanne). 2022;9:968272. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Newell S, Jordan Z. The patient experience of patient-centered communication with nurses in the hospital setting: a qualitative systematic review protocol. JBI Database System Rev Implement Rep. 2015;13(1):76-87. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Yagasaki K, Takahashi H, Ouchi T, Yamagami J, Hamamoto Y, Amagai M, et al. Patient voice on management of facial dermatological adverse events with targeted therapies: a qualitative study. J Patient Rep Outcomes. 2019;3(1):27. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Giardina TD, Korukonda S, Shahid U, Vaghani V, Upadhyay DK, Burke GF, et al. Use of patient complaints to identify diagnosis-related safety concerns: a mixed-method evaluation. BMJ Qual Saf. 2021;30(12):996-1001. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lauriola I, Lavelli A, Aiolli F. An introduction to deep learning in natural language processing: models, techniques, and tools. Neurocomputing. 2022;470:443-456. [ CrossRef ]
  • Devlin J, Chang MW, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2019. Presented at: 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies; June 2-7, 2019;4171-4186; Minneapolis, MN, USA.
  • Dreisbach C, Koleck TA, Bourne PE, Bakken S. A systematic review of natural language processing and text mining of symptoms from electronic patient-authored text data. Int J Med Inform. 2019;125:37-46. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Sim JA, Huang X, Horan MR, Stewart CM, Robison LL, Hudson MM, et al. Natural language processing with machine learning methods to analyze unstructured patient-reported outcomes derived from electronic health records: a systematic review. Artif Intell Med. 2023;146:102701. [ CrossRef ] [ Medline ]
  • Weissenbacher D, Banda JM, Davydova V, Estrada-Zavala D, Gascó Sánchez L, Ge Y, et al. Overview of the seventh social media mining for health applications (#SMM4H) shared tasks at COLING 2022. 2022. Presented at: Proceedings of the Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task; October 12-17, 2022;221-241; Gyeongju, Republic of Korea. URL: https://aclanthology.org/2022.smm4h-1.54/
  • Matsuda S, Ohtomo T, Okuyama M, Miyake H, Aoki K. Estimating patient satisfaction through a language processing model: model development and evaluation. JMIR Form Res. 2023;7(1):e48534. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Yu D, Vydiswaran VGV. An assessment of mentions of adverse drug events on social media with natural language processing: model development and analysis. JMIR Med Inform. 2022;10(9):e38140. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Liu X, Chen H. A research framework for pharmacovigilance in health social media: identification and evaluation of patient adverse drug event reports. J Biomed Inform. 2015;58:268-279. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Nikfarjam A, Sarker A, O'Connor K, Ginn R, Gonzalez G. Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J Am Med Inform Assoc. 2015;22(3):671-681. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Kakalou C, Dimitsaki S, Dimitriadis VK, Natsiavas P. Exploiting social media for active pharmacovigilance: the PVClinical social media workspace. Stud Health Technol Inform. 2022;290:739-743. [ CrossRef ] [ Medline ]
  • Bousquet C, Dahamna B, Guillemin-Lanne S, Darmoni SJ, Faviez C, Huot C, et al. The adverse drug reactions from patient reports in social media project: five major challenges to overcome to operationalize analysis and efficiently support pharmacovigilance process. JMIR Res Protoc. 2017;6(9):e179. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Young IJB, Luz S, Lone N. A systematic review of natural language processing for classification tasks in the field of incident reporting and adverse event analysis. Int J Med Inform. 2019;132:103971. [ CrossRef ] [ Medline ]
  • Jacobsson R, Bergvall T, Sandberg L, Ellenius J. Extraction of adverse event severity information from clinical narratives using natural language processing. Pharmacoepidemiol Drug Saf. 2017;26(S2):37. [ FREE Full text ]
  • Liang C, Gong Y. Predicting harm scores from patient safety event reports. Stud Health Technol Inform. 2017;245:1075-1079. [ CrossRef ] [ Medline ]
  • Jiang G, Wang L, Liu H, Solbrig HR, Chute CG. Building a knowledge base of severe adverse drug events based on AERS reporting data using semantic web technologies. Stud Health Technol Inform. 2013;192(1-2):496-500. [ CrossRef ] [ Medline ]
  • Usui M, Aramaki E, Iwao T, Wakamiya S, Sakamoto T, Mochizuki M. Extraction and standardization of patient complaints from electronic medication histories for pharmacovigilance: natural language processing analysis in Japanese. JMIR Med Inform. 2018;6(3):e11021. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Watanabe T, Yada S, Aramaki E, Yajima H, Kizaki H, Hori S. Extracting multiple worries from breast cancer patient blogs using multilabel classification with the natural language processing model bidirectional encoder representations from transformers: infodemiology study of blogs. JMIR Cancer. 2022;8(2):e37840. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Nishioka S, Watanabe T, Asano M, Yamamoto T, Kawakami K, Yada S, et al. Identification of hand-foot syndrome from cancer patients' blog posts: BERT-based deep-learning approach to detect potential adverse drug reaction symptoms. PLoS One. 2022;17(5):e0267901. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Nishioka S, Asano M, Yada S, Aramaki E, Yajima H, Yanagisawa Y, et al. Adverse event signal extraction from cancer patients' narratives focusing on impact on their daily-life activities. Sci Rep. 2023;13(1):15516. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Weed LL. Medical records that guide and teach. N Engl J Med. 1968;278(11):593-600. [ CrossRef ] [ Medline ]
  • Podder V, Lew V, Ghassemzadeh S. SOAP Notes. Treasure Island, FL. StatPearls Publishing; 2023.
  • Shenavar Masooleh I, Ramezanzadeh E, Yaseri M, Sahere Mortazavi Khatibani S, Sadat Fayazi H, Ali Balou H, et al. The effectiveness of training on daily progress note writing by medical interns. J Adv Med Educ Prof. 2021;9(3):168-175. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Grothen AE, Tennant B, Wang C, Torres A, Sheppard BB, Abastillas G, et al. Application of artificial intelligence methods to pharmacy data for cancer surveillance and epidemiology research: a systematic review. JCO Clin Cancer Inform. 2020;4:1051-1058. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Ohno Y, Kato R, Ishikawa H, Nishiyama T, Isawa M, Mochizuki M, et al. Using the natural language processing system MedNER-J to analyze pharmaceutical care records. medRxiv. Preprint posted online on October 2, 2023. [ CrossRef ]
  • Ranchon F, Chanoine S, Lambert-Lacroix S, Bosson JL, Moreau-Gaudry A, Bedouch P. Development of artificial intelligence powered apps and tools for clinical pharmacy services: a systematic review. Int J Med Inform. 2023;172:104983. [ CrossRef ] [ Medline ]
  • Nakajima Pharmacy. URL: https://www.nakajima-phar.co.jp/ [accessed 2023-12-07]
  • Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. 2020;21(140):1-67. [ FREE Full text ]
  • Pathak A. Comparative analysis of transformer based language models. Comput Sci Inf Technol. 2021:165-176. [ FREE Full text ] [ CrossRef ]
  • DIPEx Japan. URL: https://www.dipex-j.org/ [accessed 2024-02-04]
  • Herxheimer A, McPherson A, Miller R, Shepperd S, Yaphe J, Ziebland S. Database of patients' experiences (DIPEx): a multi-media approach to sharing experiences and information. Lancet. 2000;355(9214):1540-1543. [ CrossRef ] [ Medline ]
  • Lara PE, Muiño CB, de Spéville BD, Reyes JJ. Hand-foot skin reaction to regorafenib. Actas Dermosifiliogr. 2016;107(1):71-73. [ CrossRef ]
  • Zaiem A, Hammamia SB, Aouinti I, Charfi O, Ladhari W, Kastalli S, et al. Hand-foot syndrome induced by chemotherapy drug: case series study and literature review. Indian J Pharmacol. 2022;54(3):208-215. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • McLellan B, Ciardiello F, Lacouture ME, Segaert S, Van Cutsem E. Regorafenib-associated hand-foot skin reaction: practical advice on diagnosis, prevention, and management. Ann Oncol. 2015;26(10):2017-2026. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Ai L, Xu Z, Yang B, He Q, Luo P. Sorafenib-associated hand-foot skin reaction: practical advice on diagnosis, mechanism, prevention, and management. Expert Rev Clin Pharmacol. 2019;12(12):1121-1127. [ CrossRef ] [ Medline ]
  • Tenti S, Correale P, Cheleschi S, Fioravanti A, Pirtoli L. Aromatase inhibitors-induced musculoskeletal disorders: current knowledge on clinical and molecular aspects. Int J Mol Sci. 2020;21(16):5625. [ FREE Full text ] [ CrossRef ] [ Medline ]
  • Lacouture ME, Melosky BL. Cutaneous reactions to anticancer agents targeting the epidermal growth factor receptor: a dermatology-oncology perspective. Skin Therapy Lett. 2007;12(6):1-5. [ FREE Full text ] [ Medline ]

Edited by G Eysenbach; submitted 25.12.23; peer-reviewed by CY Wang, L Guo; comments to author 24.01.24; revised version received 14.02.24; accepted 09.03.24; published 16.04.24.

©Satoshi Nishioka, Satoshi Watabe, Yuki Yanagisawa, Kyoko Sayama, Hayato Kizaki, Shungo Imai, Mitsuhiro Someya, Ryoo Taniguchi, Shuntaro Yada, Eiji Aramaki, Satoko Hori. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 16.04.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

Title: A Survey on Retrieval-Augmented Text Generation for Large Language Models

Abstract: Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
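The four stages named in the abstract (pre-retrieval, retrieval, post-retrieval, and generation) can be pictured with a toy pipeline. This is a minimal sketch under loud assumptions: all function names are hypothetical, retrieval uses simple token overlap in place of a real retriever, and the generation step only assembles a prompt because no language model is called. It is not the survey's method.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pre_retrieval(query: str) -> str:
    """Pre-retrieval: clean the query (here, only whitespace normalization)."""
    return " ".join(query.split())


def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Retrieval: score each document by token overlap and keep the top k."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(doc)), doc) for doc in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps corpus order on ties
    return scored[:k]


def post_retrieval(scored: list, min_overlap: int = 1) -> list:
    """Post-retrieval: drop passages that share too few tokens with the query."""
    return [doc for score, doc in scored if score >= min_overlap]


def generate(query: str, passages: list) -> str:
    """Generation: build the prompt a language model would receive (no model is called)."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def rag_prompt(query: str, corpus: list) -> str:
    """Run the four stages in order and return the assembled prompt."""
    cleaned = pre_retrieval(query)
    passages = post_retrieval(retrieve(cleaned, corpus))
    return generate(cleaned, passages)


# Made-up corpus, for illustration only.
CORPUS = [
    "Retrieval-augmented generation adds retrieved passages to the prompt.",
    "Convolutional networks are widely used for images.",
    "A retriever ranks passages for a query.",
]

PROMPT = rag_prompt("How does   retrieval-augmented generation work?", CORPUS)
print(PROMPT)
```

In a real system each stage would be replaced by a stronger component (query rewriting, a dense or sparse retriever, a reranker, and an LLM call); the value of the sketch is only in showing where each stage sits.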

COMMENTS

  1. Deep Learning: A Comprehensive Overview on Techniques ...

    Deep learning (DL), a branch of machine learning (ML) and artificial intelligence (AI), is nowadays considered a core technology of today's Fourth Industrial Revolution (4IR or Industry 4.0). Due to its ability to learn from data, DL technology, which originated from the artificial neural network (ANN), has become a hot topic in the context of computing, and is widely applied in various ...

  2. A Survey of Deep Learning: Platforms, Applications and Emerging

    In this paper, we seek to provide a thorough investigation of deep learning in its applications and mechanisms. Specifically, as a categorical collection of state of the art in deep learning research, we hope to provide a broad reference for those seeking a primer on deep learning and its various implementations, platforms, algorithms, and uses ...

  3. Deep learning in computer vision: A critical review of emerging

    Deep learning has been overwhelmingly successful in computer vision (CV), natural language processing, and video/speech recognition. In this paper, our focus is on CV. We provide a critical review of recent achievements in terms of techniques and applications.

  4. PDF Deep Learning: A Comprehensive Overview on Techniques ...

    This paper is organized as follows. Section "Why Deep Learning in Today's Research and Applications?" motivates why deep learning is important to build data-driven intelligent systems. In Section "Deep Learning Techniques and Applications", we present our DL taxonomy by taking into account the variations of deep learning tasks and how they

  5. Deep learning

    Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically ...

  6. Deep Learning: A Comprehensive Overview on Techniques, Taxonomy

    Moreover, DL models are typically considered as "black-box" machines that hamper the standard development of deep learning research and applications. Thus for clear understanding, in this paper, we present a structured and comprehensive view on DL techniques considering the variations in real-world problems and tasks.

  7. [1404.7828] Deep Learning in Neural Networks: An Overview

    Juergen Schmidhuber. In recent years, deep artificial neural networks (including recurrent ones) have won numerous contests in pattern recognition and machine learning. This historical survey compactly summarises relevant work, much of it from the previous millennium. Shallow and deep learners are distinguished by the depth of their credit ...

  8. PDF The Principles of Deep Learning Theory arXiv:2106.10165v2 [cs.LG] 24

    The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Daniel A. Roberts and Sho Yaida, based on research in collaboration with Boris Hanin. arXiv:2106.10165v2 [cs.LG], 24 Aug 2021.

  9. A survey on deep learning and its applications

    Abstract. Deep learning, a branch of machine learning, is a frontier for artificial intelligence, aiming to be closer to its primary goal—artificial intelligence. This paper mainly adopts the summary and the induction methods of deep learning. Firstly, it introduces the global development and the current situation of deep learning.

  10. Review of deep learning: concepts, CNN architectures, challenges

    Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [1,2,3,4,5,6]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [7,8,9].

  11. Deep learning in mental health outcome research: a scoping review

    Deep learning (DL), as one of the most recent generation of AI technologies, has demonstrated superior performance in many real-world applications ranging from computer vision to healthcare. The ...

  12. Deep Learning

    Deep Learning. Deep learning (DL) is a high dimensional data reduction technique for constructing high-dimensional predictors in input-output models. DL is a form of machine learning that uses hierarchical layers of latent features. In this article, we review the state-of-the-art of deep learning from a modeling and algorithmic perspective.

  13. deep learning Latest Research Papers

    The application of recent artificial intelligence (AI) and deep learning (DL) approaches to radiological images is useful for accurately detecting the disease. This article introduces a new synergic deep learning (SDL)-based smart health diagnosis of COVID-19 using chest X-ray images. The SDL makes use of dual deep convolutional ...

  14. 7 Best Research Papers To Read To Get Started With Deep Learning

    Research Paper: Deep Residual Learning for Image Recognition. Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Summary: There are several transfer learning models that are used by data scientists to achieve optimal results on a particular task. The AlexNet model was the first to be introduced to win an image processing challenge in ...

  15. Recent advances and applications of deep learning methods in ...

    Deep learning (DL) is one of the fastest-growing topics in materials data science, with rapidly emerging applications spanning atomistic, image-based, spectral, and textual data modalities. DL ...

  16. Efficient Deep Learning: A Survey on Making Deep Learning Models

    Deep learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all increased significantly. Consequently, it has become important to pay attention to these ...

  17. A comprehensive review on ensemble deep learning ...

    Thus, several research efforts have applied deep ensemble learning in many fields, and most of these efforts are articulated around simple ensemble methods. This paper provided a comprehensive review of the various strategies for ensemble learning, especially in the case of deep learning.

  18. [2104.05314] Machine learning and deep learning

    This article introduces the fundamentals of machine learning and deep learning, a machine learning concept based on artificial neural networks, for intelligent systems. It covers the process of automated model building, the challenges of human-machine interaction and artificial intelligence servitization, and the applications of deep learning in electronic markets and networked business.

  19. A comprehensive review of deep learning-based single image super

    These review papers did not encompass the domain of super-resolution as a whole, and this paper fills that research gap by providing an overview of both classical and deep learning-based methods. At the same time, we have grouped the deep learning-based methods into subdomains based on their functional blocks, i.e., upsampling methods, SR ...

  20. Getting started with reading Deep Learning Research papers: The Why and

    The WHY. In the answer to a question on Quora, asking how to test if one is qualified to pursue a career in Machine Learning, Andrew Ng (founder Google Brain, former head of Baidu AI group) said that anyone is qualified for a career in Machine Learning. He said that after you have completed some ML related courses, "to go even further, read research papers.

  21. Artificial Intelligence Driving Materials Discovery? Perspective on the

    In an article in Nature published in November 2023, Merchant et al. describe the application of artificial intelligence and machine learning (AI/ML) techniques such as deep learning of experimental databases and computational data to the discovery of new inorganic materials, including classical inorganic compounds such as oxides and halides, as well as other main group compounds and ...

  22. A Review of New Developments in Finance with Deep Learning: Deep

    This paper reviews cutting-edge research from both practical and academic perspectives. ... Specifically, the paper shows the implications of deep learning to existing theoretical frameworks and practical motivations in finance and identifies potential future developments that deep learning can bring about and the practical challenges.

  23. Integrating Large Language Models (LLMs) and Deep ...

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... (AIGC) generative deep learning ...

  24. Applications of Deep Learning in Satellite Communication: A Survey

    This paper analyzes the applications of deep learning in satellite communications from the perspective of physical layer, link layer and network layer, which is helpful to understand the role of deep learning at different levels. Most of the existing reviews on satellite communication are aimed at a specific application or method.

  25. Deep Learning Approaches on Image Captioning: A Review

    Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and more recently vision-language pre-training techniques has revolutionized the field, leading to more sophisticated methods and improved performance. In this survey paper, we provide a structured review of deep ...

  26. Journal of Medical Internet Research

    Background: Early detection of adverse events and their management are crucial to improving anticancer treatment outcomes, and listening to patients' subjective opinions (patients' voices) can make a major contribution to improving safety management. Recent progress in deep learning technologies has enabled various new approaches for the evaluation of safety-related events based on patient ...

  27. A Survey on Retrieval-Augmented Text Generation for Large Language Models

    Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but incorrect responses by LLMs, thereby ...