Explained: Neural networks


In the past 10 years, the best-performing artificial-intelligence systems — such as the speech recognizers on smartphones or Google’s latest automatic translator — have resulted from a technique called “deep learning.”

Deep learning is in fact a new name for an approach to artificial intelligence called neural networks, which have been going in and out of fashion for more than 70 years. Neural networks were first proposed in 1943 by Warren McCulloch and Walter Pitts, two University of Chicago researchers who moved to MIT in 1952 as founding members of what’s sometimes called the first cognitive science department.

Neural nets were a major area of research in both neuroscience and computer science until 1969, when, according to computer science lore, they were killed off by the MIT mathematicians Marvin Minsky and Seymour Papert, who a year later would become co-directors of the new MIT Artificial Intelligence Laboratory.

The technique then enjoyed a resurgence in the 1980s, fell into eclipse again in the first decade of the new century, and has returned like gangbusters in the second, fueled largely by the increased processing power of graphics chips.

“There’s this idea that ideas in science are a bit like epidemics of viruses,” says Tomaso Poggio, the Eugene McDermott Professor of Brain and Cognitive Sciences at MIT, an investigator at MIT’s McGovern Institute for Brain Research, and director of MIT’s Center for Brains, Minds, and Machines . “There are apparently five or six basic strains of flu viruses, and apparently each one comes back with a period of around 25 years. People get infected, and they develop an immune response, and so they don’t get infected for the next 25 years. And then there is a new generation that is ready to be infected by the same strain of virus. In science, people fall in love with an idea, get excited about it, hammer it to death, and then get immunized — they get tired of it. So ideas should have the same kind of periodicity!”

Weighty matters

Neural nets are a means of doing machine learning, in which a computer learns to perform some task by analyzing training examples. Usually, the examples have been hand-labeled in advance. An object recognition system, for instance, might be fed thousands of labeled images of cars, houses, coffee cups, and so on, and it would find visual patterns in the images that consistently correlate with particular labels.

Modeled loosely on the human brain, a neural net consists of thousands or even millions of simple processing nodes that are densely interconnected. Most of today’s neural nets are organized into layers of nodes, and they’re “feed-forward,” meaning that data moves through them in only one direction. An individual node might be connected to several nodes in the layer beneath it, from which it receives data, and several nodes in the layer above it, to which it sends data.

To each of its incoming connections, a node will assign a number known as a “weight.” When the network is active, the node receives a different data item — a different number — over each of its connections and multiplies it by the associated weight. It then adds the resulting products together, yielding a single number. If that number is below a threshold value, the node passes no data to the next layer. If the number exceeds the threshold value, the node “fires,” which in today’s neural nets generally means sending the number — the sum of the weighted inputs — along all its outgoing connections.

When a neural net is being trained, all of its weights and thresholds are initially set to random values. Training data is fed to the bottom layer — the input layer — and it passes through the succeeding layers, getting multiplied and added together in complex ways, until it finally arrives, radically transformed, at the output layer. During training, the weights and thresholds are continually adjusted until training data with the same labels consistently yield similar outputs.
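As a rough illustration (not from the article), the node behavior just described can be written in a few lines of code; the function name and numbers below are made up:

```python
# A toy sketch of a single node: weight each incoming value, sum the products,
# and pass the sum along only if it clears the threshold.
def node_output(inputs, weights, threshold):
    total = sum(w * x for w, x in zip(weights, inputs))
    return total if total >= threshold else None  # None: nothing passed to the next layer

print(node_output([0.5, 0.8], [1.2, -0.4], threshold=0.2))  # 0.28 >= 0.2, so the node "fires"
print(node_output([0.5, 0.8], [1.2, -0.4], threshold=0.5))  # below threshold: no output
```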

Minds and machines

The neural nets described by McCulloch and Pitts in 1943 had thresholds and weights, but they weren’t arranged into layers, and the researchers didn’t specify any training mechanism. What McCulloch and Pitts showed was that a neural net could, in principle, compute any function that a digital computer could. The result was more neuroscience than computer science: The point was to suggest that the human brain could be thought of as a computing device.

Neural nets continue to be a valuable tool for neuroscientific research. For instance, particular network layouts or rules for adjusting weights and thresholds have reproduced observed features of human neuroanatomy and cognition, an indication that they capture something about how the brain processes information.

The first trainable neural network, the Perceptron, was demonstrated by the Cornell University psychologist Frank Rosenblatt in 1957. The Perceptron’s design was much like that of the modern neural net, except that it had only one layer with adjustable weights and thresholds, sandwiched between input and output layers.

Perceptrons were an active area of research in both psychology and the fledgling discipline of computer science until 1969, when Minsky and Papert published a book titled “Perceptrons,” which demonstrated that executing certain fairly common computations on Perceptrons would be impractically time consuming.

“Of course, all of these limitations kind of disappear if you take machinery that is a little more complicated — like, two layers,” Poggio says. But at the time, the book had a chilling effect on neural-net research.

“You have to put these things in historical context,” Poggio says. “They were arguing for programming — for languages like Lisp. Not many years before, people were still using analog computers. It was not clear at all at the time that programming was the way to go. I think they went a little bit overboard, but as usual, it’s not black and white. If you think of this as this competition between analog computing and digital computing, they fought for what at the time was the right thing.”

Periodicity

By the 1980s, however, researchers had developed algorithms for modifying neural nets’ weights and thresholds that were efficient enough for networks with more than one layer, removing many of the limitations identified by Minsky and Papert. The field enjoyed a renaissance.

But intellectually, there’s something unsatisfying about neural nets. Enough training may revise a network’s settings to the point that it can usefully classify data, but what do those settings mean? What image features is an object recognizer looking at, and how does it piece them together into the distinctive visual signatures of cars, houses, and coffee cups? Looking at the weights of individual connections won’t answer that question.

In recent years, computer scientists have begun to come up with ingenious methods for deducing the analytic strategies adopted by neural nets. But in the 1980s, the networks’ strategies were indecipherable. So around the turn of the century, neural networks were supplanted by support vector machines, an alternative approach to machine learning that’s based on some very clean and elegant mathematics.

The recent resurgence in neural networks — the deep-learning revolution — comes courtesy of the computer-game industry. The complex imagery and rapid pace of today’s video games require hardware that can keep up, and the result has been the graphics processing unit (GPU), which packs thousands of relatively simple processing cores on a single chip. It didn’t take long for researchers to realize that the architecture of a GPU is remarkably like that of a neural net.

Modern GPUs enabled the one-layer networks of the 1960s and the two- to three-layer networks of the 1980s to blossom into the 10-, 15-, even 50-layer networks of today. That’s what the “deep” in “deep learning” refers to — the depth of the network’s layers. And currently, deep learning is responsible for the best-performing systems in almost every area of artificial-intelligence research.

Under the hood

The networks’ opacity is still unsettling to theorists, but there’s headway on that front, too. In addition to directing the Center for Brains, Minds, and Machines (CBMM), Poggio leads the center’s research program in Theoretical Frameworks for Intelligence. Recently, Poggio and his CBMM colleagues have released a three-part theoretical study of neural networks.

The first part, which was published last month in the International Journal of Automation and Computing, addresses the range of computations that deep-learning networks can execute and when deep networks offer advantages over shallower ones. Parts two and three, which have been released as CBMM technical reports, address the problems of global optimization, or guaranteeing that a network has found the settings that best accord with its training data, and overfitting, or cases in which the network becomes so attuned to the specifics of its training data that it fails to generalize to other instances of the same categories.

There are still plenty of theoretical questions to be answered, but CBMM researchers’ work could help ensure that neural networks finally break the generational cycle that has brought them in and out of favor for seven decades.



A neural network is a  machine learning program, or model, that makes decisions in a manner similar to the human brain, by using processes that mimic the way biological neurons work together to identify phenomena, weigh options and arrive at conclusions.

Every neural network consists of layers of nodes, or artificial neurons—an input layer, one or more hidden layers, and an output layer. Each node connects to others, and has its own associated weight and threshold. If the output of any individual node is above the specified threshold value, that node is activated, sending data to the next layer of the network. Otherwise, no data is passed along to the next layer of the network.

Neural networks rely on training data to learn and improve their accuracy over time. Once they are fine-tuned for accuracy, they are powerful tools in computer science and  artificial intelligence , allowing us to classify and cluster data at a high velocity. Tasks in speech recognition or image recognition can take minutes versus hours when compared to the manual identification by human experts. One of the best-known examples of a neural network is Google’s search algorithm.

Neural networks are sometimes called artificial neural networks (ANNs) or simulated neural networks (SNNs). They are a subset of machine learning, and at the heart of deep learning models.


Think of each individual node as its own linear regression model, composed of input data, weights, a bias (or threshold), and an output. The formula would look something like this:

∑ᵢ wᵢxᵢ + bias = w₁x₁ + w₂x₂ + w₃x₃ + bias

output = f(x) = 1 if ∑ᵢ wᵢxᵢ + b ≥ 0; 0 if ∑ᵢ wᵢxᵢ + b < 0

Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output than other inputs. All inputs are then multiplied by their respective weights and summed. The sum is passed through an activation function, which determines the node’s output. If that output exceeds a given threshold, it “fires” (or activates) the node, passing data to the next layer in the network. The output of one node thus becomes the input of the next node. This process of passing data from one layer to the next defines this neural network as a feedforward network.

Let’s break down what one single node might look like using binary values. We can apply this concept to a more tangible example, like whether you should go surfing (Yes: 1, No: 0). The decision to go or not to go is our predicted outcome, or y-hat. Let’s assume that there are three factors influencing your decision-making:

  • Are the waves good? (Yes: 1, No: 0)
  • Is the line-up empty? (Yes: 1, No: 0)
  • Has there been a recent shark attack? (Yes: 0, No: 1)

Then, let’s assume the following, giving us the following inputs:

  • X1 = 1, since the waves are pumping
  • X2 = 0, since the crowds are out
  • X3 = 1, since there hasn’t been a recent shark attack

Now, we need to assign some weights to determine importance. Larger weights signify that particular variables are of greater importance to the decision or outcome.

  • W1 = 5, since large swells don’t come around often
  • W2 = 2, since you’re used to the crowds
  • W3 = 4, since you have a fear of sharks

Finally, we’ll also assume a threshold value of 3, which would translate to a bias value of –3. With all the various inputs, we can start to plug values into the formula to get the desired output.

Y-hat = (1*5) + (0*2) + (1*4) – 3 = 6

If we use the activation function from the beginning of this section, we can determine that the output of this node would be 1, since 6 is greater than 0. In this instance, you would go surfing; but if we adjust the weights or the threshold, we can achieve different outcomes from the model. When we observe one decision, like in the above example, we can see how a neural network could make increasingly complex decisions depending on the output of previous decisions or layers.
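As a quick check, the same arithmetic can be expressed in a few lines of code. This is a minimal sketch using only the inputs, weights, and bias assumed above:

```python
# The surf-or-not node from the example above: binary inputs, the stated weights,
# and the threshold of 3 expressed as a bias of -3.
inputs  = [1, 0, 1]   # good waves, crowded line-up, no recent shark attack
weights = [5, 2, 4]
bias    = -3

weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias  # (1*5) + (0*2) + (1*4) - 3 = 6
decision = 1 if weighted_sum >= 0 else 0                           # step activation from earlier
print(decision)  # 1 -> go surfing
```

Running it prints 1, matching the hand calculation above.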

In the example above, we used perceptrons to illustrate some of the mathematics at play here, but neural networks leverage sigmoid neurons, which are distinguished by having values between 0 and 1. Since neural networks behave similarly to decision trees, cascading data from one node to another, having x values between 0 and 1 will reduce the impact of any given change of a single variable on the output of any given node, and subsequently, the output of the neural network.
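To make the contrast concrete, here is a minimal sketch (values chosen for illustration) of a step activation next to a sigmoid activation, which squashes any weighted sum into a value between 0 and 1:

```python
import math

def step(z):
    return 1 if z >= 0 else 0          # perceptron-style output: hard 0 or 1

def sigmoid(z):
    return 1 / (1 + math.exp(-z))      # sigmoid output: smooth value in (0, 1)

for z in (6, 0.5, -1):                 # weighted sums, e.g. the 6 from the surfing example
    print(z, step(z), round(sigmoid(z), 3))
```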

As we start to think about more practical use cases for neural networks, like image recognition or classification, we’ll leverage supervised learning, or labeled datasets, to train the algorithm. As we train the model, we’ll want to evaluate its accuracy using a cost (or loss) function. This is also commonly referred to as the mean squared error (MSE). In the equation below,

  • i represents the index of the sample,
  • y-hat is the predicted outcome,
  • y is the actual value, and
  • m is the number of samples.

Cost Function = MSE = (1/2m) ∑ᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²

Ultimately, the goal is to minimize our cost function to ensure correctness of fit for any given observation. As the model adjusts its weights and bias, it uses the cost function to reach the point of convergence, or the local minimum. The algorithm adjusts its weights through gradient descent, which allows the model to determine the direction to take to reduce errors (or minimize the cost function). With each training example, the parameters of the model adjust to gradually converge at the minimum.
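The loop below is a minimal sketch of that idea, assuming a single sigmoid neuron, the 1/(2m) MSE cost defined above, and plain gradient descent; the data and learning rate are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w, b, lr, m = np.zeros(3), 0.0, 0.5, len(X)

for step in range(2000):
    y_hat = sigmoid(X @ w + b)
    cost = ((y_hat - y) ** 2).sum() / (2 * m)     # the MSE cost function above
    grad = (y_hat - y) * y_hat * (1 - y_hat)      # chain rule through the sigmoid
    w -= lr * (X.T @ grad) / m                    # step each weight downhill
    b -= lr * grad.mean()

print(round(cost, 4), np.round(sigmoid(X @ w + b)))  # cost shrinks; predictions approach y
```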

See this IBM Developer article for a deeper explanation of the quantitative concepts involved in neural networks.

Most deep neural networks are feedforward, meaning they flow in one direction only, from input to output. However, you can also train your model through backpropagation; that is, move in the opposite direction from output to input. Backpropagation allows us to calculate and attribute the error associated with each neuron, allowing us to adjust and fit the parameters of the model(s) appropriately.
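As an illustration of that error attribution, here is a minimal sketch of backpropagation for a network with one hidden layer; NumPy, sigmoid activations, and the random data are illustrative assumptions rather than anything prescribed above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 3))                               # 8 samples, 3 features
y = rng.integers(0, 2, (8, 1)).astype(float)         # binary targets

W1, b1 = rng.normal(size=(3, 4)), np.zeros((1, 4))   # input  -> hidden
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))   # hidden -> output
lr, m = 0.5, len(X)

for _ in range(1000):
    # Forward pass (input to output)
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)

    # Backward pass (output to input): attribute the error to each layer
    d_out = (y_hat - y) * y_hat * (1 - y_hat)        # error at the output neurons
    d_hid = (d_out @ W2.T) * h * (1 - h)             # error propagated to hidden neurons

    W2 -= lr * (h.T @ d_out) / m
    b2 -= lr * d_out.mean(axis=0, keepdims=True)
    W1 -= lr * (X.T @ d_hid) / m
    b1 -= lr * d_hid.mean(axis=0, keepdims=True)
```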


Neural networks can be classified into different types, which are used for different purposes. While this isn’t a comprehensive list, the following are representative of the most common types of neural networks you’ll come across, along with their common use cases:

The perceptron is the oldest neural network, created by Frank Rosenblatt in 1958.

Feedforward neural networks, or multi-layer perceptrons (MLPs), are what we’ve primarily been focusing on within this article. They consist of an input layer, a hidden layer or layers, and an output layer. While these neural networks are also commonly referred to as MLPs, it’s important to note that they are actually composed of sigmoid neurons, not perceptrons, since most real-world problems are nonlinear. Data is usually fed into these models to train them, and they are the foundation for computer vision, natural language processing, and other neural networks.

Convolutional neural networks (CNNs) are similar to feedforward networks, but they’re usually utilized for image recognition, pattern recognition, and/or computer vision. These networks harness principles from linear algebra, particularly matrix multiplication, to identify patterns within an image.
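As a rough sketch of that matrix arithmetic (a plain NumPy “valid” convolution, not any particular CNN library), the loop below slides a small kernel over an image and responds where a local pattern, here a left-to-right jump in intensity, appears:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution: slide the kernel over the image and take dot products."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
edge_kernel = np.array([[-1, 1],
                        [-1, 1]], dtype=float)   # responds where intensity rises left to right
print(conv2d(image, edge_kernel))                # strongest response along the vertical edge
```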

Recurrent neural networks (RNNs) are identified by their feedback loops. These learning algorithms are primarily leveraged when using time-series data to make predictions about future outcomes, such as stock market predictions or sales forecasting.

Deep Learning and neural networks tend to be used interchangeably in conversation, which can be confusing. As a result, it’s worth noting that the “deep” in deep learning is just referring to the depth of layers in a neural network. A neural network that consists of more than three layers—which would be inclusive of the inputs and the output—can be considered a deep learning algorithm. A neural network that only has two or three layers is just a basic neural network.

To learn more about the differences between neural networks and other forms of artificial intelligence, like machine learning, please read the blog post “AI vs. Machine Learning vs. Deep Learning vs. Neural Networks: What’s the Difference?”

The history of neural networks is longer than most people think. While the idea of “a machine that thinks” can be traced to the Ancient Greeks, we’ll focus on the key events that led to the evolution of thinking around neural networks, which has ebbed and flowed in popularity over the years:

1943: Warren S. McCulloch and Walter Pitts published “A logical calculus of the ideas immanent in nervous activity” (link resides outside ibm.com). This research sought to understand how the human brain could produce complex patterns through connected brain cells, or neurons. One of the main ideas that came out of this work was the comparison of neurons with a binary threshold to Boolean logic (i.e., 0/1 or true/false statements).

1958: Frank Rosenblatt is credited with the development of the perceptron, documented in his research, “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain” (link resides outside ibm.com). He takes McCulloch and Pitts’s work a step further by introducing weights to the equation. Leveraging an IBM 704, Rosenblatt was able to get a computer to learn how to distinguish cards marked on the left vs. cards marked on the right.

1974: While numerous researchers contributed to the idea of backpropagation, Paul Werbos was the first person in the US to note its application within neural networks in his PhD thesis (link resides outside ibm.com).

1989: Yann LeCun published a paper (link resides outside ibm.com) illustrating how the use of constraints in backpropagation and its integration into the neural network architecture can be used to train algorithms. This research successfully leveraged a neural network to recognize hand-written zip code digits provided by the U.S. Postal Service.



Multimodal neurons in artificial neural networks

We’ve discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.


Fifteen years ago, Quiroga et al. [^reference-1]  discovered that the human brain possesses multimodal neurons. These neurons respond to clusters of abstract concepts centered around a common high-level theme, rather than any specific visual feature. The most famous of these was the “Halle Berry” neuron, a neuron featured in both  Scientific American  and  The New York Times , that responds to photographs, sketches, and the text “Halle Berry” (but not other names).

Two months ago, OpenAI announced CLIP, a general-purpose vision system that matches the performance of a ResNet-50, [^reference-2] but outperforms existing vision systems on some of the most challenging datasets. Each of these challenge datasets, ObjectNet, ImageNet Rendition, and ImageNet Sketch, stress-tests the model’s robustness to not just simple distortions or changes in lighting or pose, but also to complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.

Now, we’re releasing our discovery of the presence of multimodal neurons in CLIP. One such neuron, for example, is a “Spider-Man” neuron (bearing a remarkable resemblance to the “Halle Berry” neuron) that responds to an image of a spider, an image of the text “spider,” and the comic book character “Spider-Man” either in costume or illustrated.

Our discovery of multimodal neurons in CLIP gives us a clue as to what may be a common mechanism of both synthetic and natural vision systems—abstraction. We discover that the highest layers of CLIP organize images as a loose semantic collection of ideas, providing a simple explanation for both the model’s versatility and the representation’s compactness.


Using the tools of interpretability, we give an unprecedented look into the rich visual concepts that exist within the weights of CLIP. Within CLIP, we discover high-level concepts that span a large subset of the human visual lexicon—geographical regions, facial expressions, religious iconography, famous people and more. By probing what each neuron affects downstream, we can get a glimpse into how CLIP performs its classification.

Multimodal neurons in CLIP

Our  paper  builds on nearly a decade of research into interpreting convolutional networks, [^reference-3] [^reference-4] [^reference-5] [^reference-6] [^reference-7] [^reference-8] [^reference-9] [^reference-10] [^reference-11] [^reference-12]  beginning with the observation that many of these classical techniques are directly applicable to CLIP. We employ two tools to understand the activations of the model:  feature visualization , [^reference-6] [^reference-5] [^reference-12]  which maximizes the neuron’s firing by doing gradient-based optimization on the input, and  dataset examples , [^reference-4]  which looks at the distribution of maximal activating images for a neuron from a dataset.

Using these simple techniques, we’ve found the majority of the neurons in CLIP RN50x4 (a ResNet-50 scaled up 4x using the EfficientNet scaling rule) to be readily interpretable. Indeed, these neurons appear to be extreme examples of “multi-faceted neurons,”  [^reference-11]  neurons that respond to multiple distinct cases, only at a higher level of abstraction.
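For readers who want to try the first of these tools, the snippet below is a minimal sketch of feature visualization by gradient ascent on the input, written with PyTorch and a standard torchvision ResNet-50 rather than CLIP itself; the layer and channel index are arbitrary placeholders, not neurons from the paper:

```python
import torch
import torchvision.models as models

model = models.resnet50(weights="IMAGENET1K_V1").eval()
target_channel = 123                                  # placeholder unit to visualize

activations = {}
def hook(module, inputs, output):                     # capture the chosen layer's output
    activations["feat"] = output
model.layer4.register_forward_hook(hook)

image = torch.randn(1, 3, 224, 224, requires_grad=True)   # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -activations["feat"][0, target_channel].mean()  # maximize the unit's mean activation
    loss.backward()
    optimizer.step()
# `image` now roughly shows what excites the chosen channel.
```

In practice, published feature visualizations add regularizers and image transformations to keep the result natural-looking; this bare loop only shows the core idea.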

[Figure: feature visualizations of selected neurons, labeled “self + relief,” “child’s drawing,” “West Africa,” and “Architecture.”]

Selected neurons from the final layer of four CLIP models. Each neuron is represented by a feature visualization with a human-chosen concept label to help quickly provide a sense of each neuron. Labels were picked after looking at hundreds of stimuli that activate the neuron, in addition to feature visualizations. We chose to include some of the examples here to demonstrate the model’s proclivity towards stereotypical depictions of regions, emotions, and other concepts. We also see discrepancies in the level of neuronal resolution: while certain countries like the US and India were associated with well-defined neurons, the same was not true of countries in Africa, where neurons tended to fire for entire regions. We discuss some of these biases and their implications in later sections.

Indeed, we were surprised to find many of these categories appear to mirror neurons in the medial temporal lobe documented in epilepsy patients with intracranial depth electrodes. These include neurons that respond to emotions, [^reference-17]  animals, [^reference-18]  and famous people. [^reference-1]

But our investigation into CLIP reveals many more such strange and wonderful abstractions, including neurons that appear to count [ 17 ,  202 ,  310 ], neurons responding to art styles [ 75 ,  587 ,  122 ], even images with evidence of digital alteration [ 1640 ].

Absent concepts

While this analysis shows a great breadth of concepts, we note that a simple analysis on a neuron level cannot represent a complete documentation of the model’s behavior. The authors of CLIP have demonstrated, for example, that the model is capable of very precise geolocation, [^reference-19]  (Appendix E.4, Figure 20) with a granularity that extends down to the level of a city and even a neighborhood. In fact, we offer an anecdote: we have noticed, by running our own personal photos through CLIP, that CLIP can often recognize if a photo was taken in San Francisco, and sometimes even the neighborhood (e.g., “Twin Peaks”).

Despite our best efforts, however, we have not found a “San Francisco” neuron, nor did it seem from attribution that San Francisco decomposes nicely into meaningful unit concepts like “California” and “city.” We believe this information to be encoded within the activations of the model somewhere, but in a more exotic way, either as a direction or as some other more complex manifold. We believe this to be a fruitful direction for further research.

How multimodal neurons compose

These multimodal neurons can give us insight into understanding how CLIP performs classification. With a sparse linear probe, [^reference-19] we can easily inspect CLIP’s weights to see which concepts combine to achieve a final ImageNet classification.


For text classification, a key observation is that these concepts are contained within neurons in a way that, similar to the word2vec objective, [^reference-20] is almost linear. The concepts therefore form a simple algebra that behaves similarly to a linear probe. By linearizing the attention, we too can inspect any sentence, much like a linear probe.


Fallacies of abstraction

The degree of abstraction in CLIP surfaces a new vector of attack that we believe has not manifested in previous systems. Like many deep networks, the representations at the highest layers of the model are completely dominated by such high-level abstractions. What distinguishes CLIP, however, is a matter of degree—CLIP’s multimodal neurons generalize across the literal and the iconic, which may be a double-edged sword.

Through a series of carefully-constructed experiments, we demonstrate that we can exploit this reductive behavior to fool the model into making absurd classifications. We have observed that the excitations of the neurons in CLIP are often controllable by its response to  images of text , providing a simple vector of attacking the model.

The finance neuron [ 1330 ], for example, responds to images of piggy banks, but also responds to the string “$$$”. By forcing the finance neuron to fire, we can fool our model into classifying a dog as a piggy bank.

Attacks in the wild

We refer to these attacks as  typographic attacks . We believe attacks such as those described above are far from simply an academic concern. By exploiting the model’s ability to read text robustly, we find that even  photographs of hand-written text  can often fool the model. Like the Adversarial Patch, [^reference-21]  this attack works in the wild; but unlike such attacks, it requires no more technology than pen and paper.

We also believe that these attacks may take a more subtle, less conspicuous form. An image, given to CLIP, is abstracted in many subtle and sophisticated ways, and these abstractions may over-abstract common patterns, oversimplifying and, by virtue of that, overgeneralizing.

Bias and overgeneralization

Our model, despite being trained on a curated subset of the internet, still inherits its many unchecked biases and associations. Many of the associations we have discovered appear to be benign, but we have found several cases where CLIP holds associations that could result in representational harm, such as denigration of certain individuals or groups.

We have observed, for example, a “Middle East” neuron  [1895]  with an association with terrorism; and an “immigration” neuron  [395]  that responds to Latin America. We have even found a neuron that fires for both dark-skinned people and gorillas [ 1257 ], mirroring earlier photo tagging incidents in other models we consider unacceptable. [^reference-22]

These associations present obvious challenges to applications of such powerful visual systems. [^footnote-1] Whether fine-tuned or used zero-shot, it is likely that these biases and associations will remain in the system, with their effects manifesting in both visible and nearly invisible ways during deployment. Many biased behaviors may be difficult to anticipate a priori, making their measurement and correction difficult. We believe that these tools of interpretability may give practitioners the ability to preempt potential problems, by discovering some of these associations and ambiguities ahead of time.

Our own understanding of CLIP is still evolving, and we are still determining if and how we would release large versions of CLIP. We hope that further community exploration of the released versions as well as the tools we are announcing today will help advance general understanding of multimodal systems, as well as inform our own decision-making.

Alongside the publication of “Multimodal Neurons in Artificial Neural Networks,” we are also releasing some of the tools we have ourselves used to understand CLIP—the OpenAI  Microscope  catalog has been updated with feature visualizations, dataset examples, and text feature visualizations for every neuron in CLIP RN50x4. We are also releasing the weights of CLIP RN50x4 and RN101  to further accommodate such research. We believe these investigations of CLIP only scratch the surface in understanding CLIP’s behavior, and we invite the research community to join in improving our understanding of CLIP and models like it.

  • Gabriel Goh
  • Chelsea Voss
  • Daniela Amodei
  • Shan Carter
  • Michael Petrov
  • Justin Jay Wang
  • Nick Cammarata

Acknowledgments

Sandhini Agarwal, Greg Brockman, Miles Brundage, Jeff Clune, Steve Dowling, Jonathan Gordon, Gretchen Krueger, Faiz Mandviwalla, Vedant Misra, Reiichiro Nakano, Ashley Pilipiszyn, Alec Radford, Aditya Ramesh, Pranav Shyam, Ilya Sutskever, Martin Wattenberg & Hannah Wong

  • Survey Paper
  • Open access
  • Published: 31 March 2021

Review of deep learning: concepts, CNN architectures, challenges, applications, future directions

  • Laith Alzubaidi (ORCID: orcid.org/0000-0002-7296-5413) 1,5
  • Jinglan Zhang 1
  • Amjad J. Humaidi 2
  • Ayad Al-Dujaili 3
  • Ye Duan 4
  • Omran Al-Shamma 5
  • J. Santamaría 6
  • Mohammed A. Fadhel 7
  • Muthana Al-Amidie 4
  • Laith Farhan 8

Journal of Big Data, volume 8, Article number: 53 (2021)


In the last few years, the deep learning (DL) computing paradigm has been deemed the gold standard in the machine learning (ML) community. Moreover, it has gradually become the most widely used computational approach in the field of ML, achieving outstanding results on several complex cognitive tasks, matching or even beating human performance. One of the benefits of DL is its ability to learn from massive amounts of data. The DL field has grown quickly in the last few years and has been used successfully to address a wide range of traditional applications. More importantly, DL has outperformed well-known ML techniques in many domains, e.g., cybersecurity, natural language processing, bioinformatics, robotics and control, and medical information processing, among many others. Although several works have reviewed the state of the art in DL, each of them has tackled only one aspect of it, leading to an overall lack of knowledge about the field. Therefore, in this contribution, we propose a more holistic approach in order to provide a more suitable starting point from which to develop a full understanding of DL. Specifically, this review attempts to provide a more comprehensive survey of the most important aspects of DL, including the enhancements recently added to the field. In particular, this paper outlines the importance of DL and presents the types of DL techniques and networks. It then presents convolutional neural networks (CNNs), the most utilized DL network type, and describes the development of CNN architectures together with their main features, starting with the AlexNet network and closing with the High-Resolution network (HR.Net). Finally, we present the challenges and suggested solutions to help researchers understand the existing research gaps, followed by a list of the major DL applications. Computational tools including FPGAs, GPUs, and CPUs are summarized along with a description of their influence on DL. The paper ends with the evolution matrix, benchmark datasets, and a summary and conclusion.

Introduction

Recently, machine learning (ML) has become very widespread in research and has been incorporated in a variety of applications, including text mining, spam detection, video recommendation, image classification, and multimedia concept retrieval [ 1 , 2 , 3 , 4 , 5 , 6 ]. Among the different ML algorithms, deep learning (DL) is very commonly employed in these applications [ 7 , 8 , 9 ]. Another name for DL is representation learning (RL). The continuing appearance of novel studies in the fields of deep and distributed learning is due to both the unpredictable growth in the ability to obtain data and the amazing progress made in the hardware technologies, e.g. High Performance Computing (HPC) [ 10 ].

DL is derived from the conventional neural network but considerably outperforms its predecessors. Moreover, DL employs transformations and graph technologies simultaneously in order to build up multi-layer learning models. The most recently developed DL techniques have obtained outstanding performance across a variety of applications, including audio and speech processing, visual data processing, and natural language processing (NLP), among others [ 11 , 12 , 13 , 14 ].

Usually, the effectiveness of an ML algorithm is highly dependent on the integrity of the input-data representation. It has been shown that a suitable data representation provides improved performance when compared to a poor data representation. Thus, a significant research trend in ML for many years has been feature engineering, which has informed numerous research studies. This approach aims at constructing features from raw data. In addition, it is extremely field-specific and frequently requires sizable human effort. For instance, several types of features were introduced and compared in the computer vision context, such as histogram of oriented gradients (HOG) [ 15 ], scale-invariant feature transform (SIFT) [ 16 ], and bag of words (BoW) [ 17 ]. As soon as a novel feature is introduced and found to perform well, it becomes a new research direction that is pursued over multiple decades.

By contrast, feature extraction is achieved automatically in DL algorithms. This encourages researchers to extract discriminative features using the smallest possible amount of human effort and field knowledge [ 18 ]. These algorithms have a multi-layer data-representation architecture, in which the first layers extract low-level features while the last layers extract high-level features. Note that artificial intelligence (AI) originally inspired this type of architecture, which simulates the process that occurs in core sensorial regions within the human brain. Using different scenes, the human brain can automatically extract data representations: the received scene information is the input, and the classified objects are the output. This process simulates the working methodology of the human brain and emphasizes the main benefit of DL.

In the field of ML, DL, due to its considerable success, is currently one of the most prominent research trends. In this paper, an overview of DL is presented that adopts various perspectives, including the main concepts, architectures, challenges, applications, computational tools, and evolution matrix. The convolutional neural network (CNN) is one of the most popular and widely used DL networks [ 19 , 20 ], and it is largely responsible for DL’s current popularity. The main advantage of CNNs compared to their predecessors is that they automatically detect significant features without any human supervision, which has made them the most used network type. Therefore, we dig deep into CNNs by presenting their main components. Furthermore, we elaborate in detail on the most common CNN architectures, starting with the AlexNet network and ending with the High-Resolution network (HR.Net).

Several DL review papers have been published in the last few years. However, each of them addresses only one side of DL, focusing on a single application or topic, such as reviews of CNN architectures [ 21 ], DL for classification of plant diseases [ 22 ], DL for object detection [ 23 ], and DL applications in medical image analysis [ 24 ]. Although these reviews cover good topics, they do not provide a full understanding of DL, including its concepts, detailed research gaps, computational tools, and applications. A reader first needs to understand DL's concepts, challenges, and applications before going deep into any particular application, which otherwise requires extensive time and a large number of research papers. Therefore, we propose a deep review of DL to provide a more suitable starting point from which to develop a full understanding of DL from one review paper. The motivation behind our review is to cover the most important aspects of DL, including open challenges, applications, and the computational-tools perspective. Furthermore, our review can be a first step towards other DL topics.

The main aim of this review is to present the most important aspects of DL, making it easy for researchers and students to gain a clear picture of DL from a single review paper. This review will further advance DL research by helping people discover more about recent developments in the field, and it will help researchers decide on the more suitable directions of work to take in order to provide more accurate alternatives to the field. Our contributions are outlined as follows:

This is the first review that provides a deep survey of almost all the most important aspects of deep learning. It helps researchers and students gain a good understanding from one paper.

We explain CNNs, the most popular deep learning algorithm, in depth by describing the concepts, theory, and state-of-the-art architectures.

We review the current challenges (limitations) of deep learning, including lack of training data, imbalanced data, interpretability of data, uncertainty scaling, catastrophic forgetting, model compression, overfitting, the vanishing gradient problem, the exploding gradient problem, and underspecification. We additionally discuss the proposed solutions tackling these issues.

We provide an exhaustive list of medical imaging applications of deep learning, categorizing them by task, starting with classification and ending with registration.

We discuss the computational approaches (CPU, GPU, FPGA) by comparing the influence of each tool on deep learning algorithms.

The rest of the paper is organized as follows: the “Survey methodology” section describes the survey methodology. The “Background” section presents the background. The “Classification of DL approaches” section defines the classification of DL approaches. The “Types of DL networks” section displays the types of DL networks. The “CNN architectures” section shows CNN architectures. The “Challenges (limitations) of deep learning and alternate solutions” section details the challenges of DL and alternate solutions. The “Applications of deep learning” section outlines the applications of DL. The “Computational approaches” section explains the influence of computational approaches (CPU, GPU, FPGA) on DL. The “Evaluation metrics” section presents the evaluation metrics. The “Frameworks and datasets” section lists frameworks and datasets. The “Summary and conclusion” section presents the summary and conclusion.

Survey methodology

We reviewed significant research papers in the field published during 2010–2020, mainly from 2019 and 2020, with some papers from 2021. The main focus was on papers from the most reputed publishers, such as IEEE, Elsevier, MDPI, Nature, ACM, and Springer; some papers were selected from arXiv. We reviewed more than 300 papers on various DL topics: 108 papers from 2020, 76 papers from 2019, and 48 papers from 2018, which indicates that this review focuses on the latest publications in the field of DL. The selected papers were analyzed and reviewed to (1) list and define the DL approaches and network types, (2) list and explain CNN architectures, (3) present the challenges of DL and suggest alternate solutions, (4) assess the applications of DL, and (5) assess the computational approaches. The main keywords used as search criteria for this review were (“Deep Learning”), (“Machine Learning”), (“Convolution Neural Network”), (“Deep Learning” AND “Architectures”), ((“Deep Learning”) AND (“Image”) AND (“detection” OR “classification” OR “segmentation” OR “Localization”)), (“Deep Learning” AND “detection” OR “classification” OR “segmentation” OR “Localization”), (“Deep Learning” AND “CPU” OR “GPU” OR “FPGA”), (“Deep Learning” AND “Transfer Learning”), (“Deep Learning” AND “Imbalanced Data”), (“Deep Learning” AND “Interpretability of data”), (“Deep Learning” AND “Overfitting”), and (“Deep Learning” AND “Underspecification”). Figure 1 shows the search structure of the survey, and Table 1 presents details of some of the journals cited in this review.

Figure 1: Search framework

Background

This section presents a background of DL. We begin with a quick introduction to DL, followed by the difference between DL and ML. We then show the situations that require DL. Finally, we present the reasons for applying DL.

DL, a subset of ML (Fig.  2 ), is inspired by the information processing patterns found in the human brain. DL does not require any human-designed rules to operate; rather, it uses a large amount of data to map the given input to specific labels. DL is designed using numerous layers of algorithms (artificial neural networks, or ANNs), each of which provides a different interpretation of the data that has been fed to them [ 18 , 25 ].

Figure 2: Deep learning family

Achieving the classification task using conventional ML techniques requires several sequential steps, specifically pre-processing, feature extraction, wise feature selection, learning, and classification. Furthermore, feature selection has a great impact on the performance of ML techniques. Biased feature selection may lead to incorrect discrimination between classes. Conversely, DL has the ability to automate the learning of feature sets for several tasks, unlike conventional ML methods [ 18 , 26 ]. DL enables learning and classification to be achieved in a single shot (Fig. 3). DL has become an incredibly popular type of ML algorithm in recent years due to the huge growth and evolution of the field of big data [ 27 , 28 ]. It is still under continuous development, delivering novel performance on several ML tasks [ 22 , 29 , 30 , 31 ], and has simplified the improvement of many learning fields [ 32 , 33 ], such as image super-resolution [ 34 ], object detection [ 35 , 36 ], and image recognition [ 30 , 37 ]. Recently, DL performance has come to exceed human performance on tasks such as image classification (Fig. 4).

Figure 3: The difference between deep learning and traditional machine learning

Figure 4: Deep learning performance compared to human

Nearly all scientific fields have felt the impact of this technology. Most industries and businesses have already been disrupted and transformed through the use of DL. The leading technology- and economy-focused companies around the world are in a race to improve DL. Even now, human-level performance and capability cannot exceed the performance of DL in many areas, such as predicting the time taken to make car deliveries, decisions to certify loan requests, and predicting movie ratings [ 38 ]. The winners of the 2019 “Nobel Prize” in computing, also known as the Turing Award, were three pioneers in the field of DL (Yann LeCun, Geoffrey Hinton, and Yoshua Bengio) [ 39 ]. Although a large number of goals have been achieved, there is further progress to be made in the DL context. In fact, DL has the ability to enhance human lives by providing additional accuracy in diagnosis, including estimating natural disasters [ 40 ], the discovery of new drugs [ 41 ], and cancer diagnosis [ 42 , 43 , 44 ]. Esteva et al. [ 45 ] found that a DL network has the same ability to diagnose disease as twenty-one board-certified dermatologists, using 129,450 images of 2032 diseases. Furthermore, in grading prostate cancer, US board-certified general pathologists achieved an average accuracy of 61%, while the Google AI [ 44 ] outperformed these specialists by achieving an average accuracy of 70%. In 2020, DL played an increasingly vital role in the early diagnosis of the novel coronavirus (COVID-19) [ 29 , 46 , 47 , 48 ]. DL has become the main tool in many hospitals around the world for automatic COVID-19 classification and detection using chest X-ray images or other types of images. We end this section with the words of AI pioneer Geoffrey Hinton: “Deep learning is going to be able to do everything.”

When to apply deep learning

Machine intelligence is useful in many situations, equaling or exceeding human experts in some cases [ 49 , 50 , 51 , 52 ], meaning that DL can be a solution to the following problems:

Cases where human experts are not available.

Cases where humans are unable to explain decisions made using their expertise (language understanding, medical decisions, and speech recognition).

Cases where the problem solution updates over time (price prediction, stock preference, weather prediction, and tracking).

Cases where solutions require adaptation based on specific cases (personalization, biometrics).

Cases where the size of the problem is extremely large and exceeds our inadequate reasoning abilities (sentiment analysis, matching ads to Facebook, calculating webpage ranks).

Why deep learning?

Several performance features may answer this question, for example:

Universal Learning Approach: Because DL has the ability to perform in approximately all application domains, it is sometimes referred to as universal learning.

Robustness: In general, precisely designed features are not required in DL techniques. Instead, the optimized features are learned in an automated fashion related to the task under consideration. Thus, robustness to the usual changes of the input data is attained.

Generalization: Different data types or different applications can use the same DL technique, an approach frequently referred to as transfer learning (TL), which is explained in a later section. Furthermore, it is a useful approach for problems where data is insufficient.

Scalability: DL is highly scalable. ResNet [ 37 ], which was invented by Microsoft, comprises 1202 layers and is frequently applied at a supercomputing scale. Lawrence Livermore National Laboratory (LLNL), a large enterprise working on evolving frameworks for networks, adopted a similar approach, where thousands of nodes can be implemented [ 53 ].

Classification of DL approaches

DL techniques are classified into three major categories: unsupervised, partially supervised (semi-supervised) and supervised. Furthermore, deep reinforcement learning (DRL), also known as RL, is another type of learning technique, which is mostly considered to fall into the category of partially supervised (and occasionally unsupervised) learning techniques.

Deep supervised learning

Deep semi-supervised learning

In this technique, the learning process is based on semi-labeled datasets. Occasionally, generative adversarial networks (GANs) and DRL are employed in the same way as this technique. In addition, RNNs, which include GRUs and LSTMs, are also employed for partially supervised learning. One of the advantages of this technique is that it minimizes the amount of labeled data needed. On the other hand, one of its disadvantages is that irrelevant input features present in the training data could furnish incorrect decisions. A text document classifier is one of the most popular examples of an application of semi-supervised learning; because of the difficulty of obtaining a large amount of labeled text documents, semi-supervised learning is ideal for the text document classification task.

Deep unsupervised learning

This technique makes it possible to implement the learning process in the absence of available labeled data (i.e., no labels are required). Here, the agent learns the significant features or interior representation required to discover the unidentified structure or relationships in the input data. Techniques of generative networks, dimensionality reduction, and clustering are frequently counted within the category of unsupervised learning. Several members of the DL family have performed well on non-linear dimensionality reduction and clustering tasks; these include restricted Boltzmann machines, auto-encoders, and GANs as the most recently developed techniques. Moreover, RNNs, which include GRUs and LSTM approaches, have also been employed for unsupervised learning in a wide range of applications. The main disadvantages of unsupervised learning are its inability to provide accurate information concerning data sorting and its computational complexity. One of the most popular unsupervised learning approaches is clustering [ 54 ].

Deep reinforcement learning

For solving a task, the selection of the type of reinforcement learning that needs to be performed is based on the space or the scope of the problem. For example, DRL is the best way for problems involving many parameters to be optimized. By contrast, derivative-free reinforcement learning is a technique that performs well for problems with limited parameters. Some applications of reinforcement learning are business strategy planning and robotics for industrial automation. The main drawback of reinforcement learning is that its parameters may influence the speed of learning. Here are the main motivations for utilizing reinforcement learning:

It assists you to identify which action produces the highest reward over a longer period.

It assists you to discover which situation requires action.

It also enables the agent to figure out the best approach for obtaining large rewards.

Reinforcement Learning also gives the learning agent a reward function.

Reinforcement learning cannot be utilized in all situations, for example:

When there are sufficient data to resolve the issue with supervised learning techniques.

When the task is computation-heavy and time-consuming for reinforcement learning, especially when the workspace is large.
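The sketch below is an illustrative example only (it is not taken from the survey): a tabular Q-learning agent on a tiny five-state chain, with the reward, learning rate, discount factor, and exploration rate chosen arbitrarily for demonstration.

```python
# Illustrative tabular Q-learning: the agent moves left/right along a chain and
# is rewarded only on reaching the final state.
import numpy as np

n_states, n_actions = 5, 2            # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.5, 0.9, 0.1     # learning rate, discount, exploration

rng = np.random.default_rng(0)
for episode in range(200):
    s = 0
    while s != n_states - 1:
        if rng.random() < eps:
            a = int(rng.integers(n_actions))                       # explore
        else:
            a = int(rng.choice(np.flatnonzero(Q[s] == Q[s].max())))  # greedy, random ties
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next

print(Q.round(2))   # the learned values favor moving right toward the reward
```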

Types of DL networks

The most famous types of deep learning networks are discussed in this section: these include recursive neural networks (RvNNs), RNNs, and CNNs. RvNNs and RNNs are explained briefly in this section, while CNNs are explained in depth because of the importance of this type; furthermore, it is the most widely used of these networks across applications.

Recursive neural networks

An RvNN can make predictions in a hierarchical structure and classify the outputs using compositional vectors [ 57 ]. Recursive auto-associative memory (RAAM) [ 58 ] was the primary inspiration for the development of the RvNN. The RvNN architecture was designed to process objects with arbitrarily shaped structures, such as graphs or trees. This approach generates a fixed-width distributed representation from a variable-size recursive data structure. The network is trained using an introduced back-propagation through structure (BTS) learning scheme [ 58 ]. The BTS scheme follows the same technique as the general back-propagation algorithm and is able to support a tree-like structure. Auto-association trains the network to regenerate the input-layer pattern at the output layer. RvNNs are highly effective in the NLP context. Socher et al. [ 59 ] introduced an RvNN architecture designed to process inputs from a variety of modalities. These authors demonstrated two applications: classifying natural language sentences, where each sentence is split into words, and classifying natural scene images, where each image is separated into various segments of interest. The RvNN calculates a score related to the merge plausibility for every pair of units, and the pair with the largest score is then merged into a compositional vector; in this way a syntactic tree is constructed. Following every merge, the RvNN generates (a) a larger region consisting of numerous units, (b) a compositional vector of the region, and (c) a label for the class (for instance, a noun phrase becomes the class label for the new region if two units are noun words). The compositional vector for the entire region is the root of the RvNN tree structure. An example RvNN tree is shown in Fig.  5 . RvNN has been employed in several applications [ 60 , 61 , 62 ].

figure 5

An example of RvNN tree

Recurrent neural networks

RNNs are a commonly employed and familiar class of algorithms in the discipline of DL [ 63 , 64 , 65 ]. RNNs are mainly applied in speech processing and NLP contexts [ 66 , 67 ]. Unlike conventional networks, RNNs operate on sequential data. This feature is fundamental to a range of different applications, since the structure embedded in the data sequence delivers valuable information. For instance, it is important to understand the context of a sentence in order to determine the meaning of a specific word in it. Thus, it is possible to consider the RNN as a unit of short-term memory, where x represents the input layer, y is the output layer, and s represents the state (hidden) layer. For a given input sequence, a typical unfolded RNN diagram is illustrated in Fig.  6 . Pascanu et al. [ 68 ] introduced three different types of deep RNN techniques, namely “Hidden-to-Hidden”, “Hidden-to-Output”, and “Input-to-Hidden”. Based on these three techniques, a deep RNN was introduced that lessens the learning difficulty in the deep network and brings the benefits of a deeper RNN.

figure 6

Typical unfolded RNN diagram

However, the sensitivity of RNNs to the exploding and vanishing gradient problems represents one of the main issues with this approach [ 69 ]. More specifically, during the training process, the repeated multiplication of several large or small derivatives may cause the gradients to explode or decay exponentially. As new inputs arrive, the network's sensitivity to the initial inputs decays over time, so the network effectively stops considering them. This issue can be handled using LSTM [ 70 ]. This approach offers recurrent connections to memory blocks in the network. Every memory block contains a number of memory cells, which have the ability to store the temporal states of the network. In addition, it contains gated units for controlling the flow of information. In very deep networks [ 37 ], residual connections also have the ability to considerably reduce the impact of the vanishing gradient issue, as explained in later sections. CNNs are considered to be more powerful than RNNs, as RNNs offer less feature compatibility than CNNs.
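As an illustrative sketch only (assuming PyTorch, which the survey does not prescribe), the snippet below shows an LSTM-based sequence classifier in which the gated memory cells described above carry the temporal state; the layer sizes and sequence length are arbitrary.

```python
# Minimal sketch of an LSTM that maps an input sequence to a single prediction.
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, input_size=8, hidden_size=32, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):                 # x: (batch, seq_len, input_size)
        outputs, (h_n, c_n) = self.lstm(x)
        return self.fc(h_n[-1])           # classify from the final hidden state

model = SequenceClassifier()
x = torch.randn(4, 10, 8)                 # batch of 4 sequences of length 10
print(model(x).shape)                     # torch.Size([4, 2])
```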

Convolutional neural networks

In the field of DL, the CNN is the most famous and commonly employed algorithm [ 30 , 71 , 72 , 73 , 74 , 75 ]. The main benefit of CNN compared to its predecessors is that it automatically identifies the relevant features without any human supervision [ 76 ]. CNNs have been extensively applied in a range of different fields, including computer vision [ 77 ], speech processing [ 78 ], Face Recognition [ 79 ], etc. The structure of CNNs was inspired by neurons in human and animal brains, similar to a conventional neural network. More specifically, in a cat’s brain, a complex sequence of cells forms the visual cortex; this sequence is simulated by the CNN [ 80 ]. Goodfellow et al. [ 28 ] identified three key benefits of the CNN: equivalent representations, sparse interactions, and parameter sharing. Unlike conventional fully connected (FC) networks, shared weights and local connections in the CNN are employed to make full use of 2D input-data structures like image signals. This operation utilizes an extremely small number of parameters, which both simplifies the training process and speeds up the network. This is the same as in the visual cortex cells. Notably, only small regions of a scene are sensed by these cells rather than the whole scene (i.e., these cells spatially extract the local correlation available in the input, like local filters over the input).

A commonly used type of CNN, which is similar to the multi-layer perceptron (MLP), consists of numerous convolution layers preceding sub-sampling (pooling) layers, while the ending layers are FC layers. An example of CNN architecture for image classification is illustrated in Fig.  7 .

figure 7

An example of CNN architecture for image classification

The input x of each layer in a CNN model is organized in three dimensions: height, width, and depth, or \(m \times m \times r\) , where the height (m) is equal to the width. The depth is also referred to as the channel number. For example, in an RGB image, the depth (r) is equal to three. Several kernels (filters) available in each convolutional layer are denoted by k and also have three dimensions ( \(n \times n \times q\) ), similar to the input image; here, however, n must be smaller than m , while q is either equal to or smaller than r . In addition, the kernels are the basis of the local connections, which share similar parameters (bias \(b^{k}\) and weight \(W^{k}\) ) for generating k feature maps \(h^{k}\) , each with a size of ( \(m-n+1\) ), and are convolved with the input, as mentioned above. The convolution layer calculates a dot product between its input and the weights as in Eq. 1 , similar to the MLP, but the inputs are undersized areas of the initial image size. Next, by applying the nonlinearity or an activation function to the convolution-layer output, we obtain the following:

The next step is down-sampling every feature map in the sub-sampling layers. This leads to a reduction in the network parameters, which accelerates the training process and in turn enables handling of the overfitting issue. For all feature maps, the pooling function (e.g. max or average) is applied to an adjacent area of size \(p \times p\) , where p is the kernel size. Finally, the FC layers receive the mid- and low-level features and create the high-level abstraction, which represents the last-stage layers as in a typical neural network. The classification scores are generated using the ending layer [e.g. support vector machines (SVMs) or softmax]. For a given instance, every score represents the probability of a specific class.
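To make the pipeline above concrete, here is a minimal sketch (assuming PyTorch; the layer widths and the 32 x 32 input size are arbitrary choices, not taken from the survey) of a CNN with convolution, pooling, and FC layers that produces raw class scores.

```python
# Minimal sketch of the pipeline described above:
# convolution -> pooling -> fully connected layer -> class scores.
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 6 * 6, num_classes),   # sized for 32x32 RGB inputs
        )

    def forward(self, x):
        return self.classifier(self.features(x))  # raw class scores (logits)

scores = SmallCNN()(torch.randn(1, 3, 32, 32))
print(scores.shape)                                # torch.Size([1, 10])
```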

Benefits of employing CNNs

The benefits of using CNNs over other traditional neural networks in the computer vision environment are listed as follows:

The main reason to consider CNN is the weight sharing feature, which reduces the number of trainable network parameters and in turn helps the network to enhance generalization and to avoid overfitting.

Concurrently learning the feature extraction layers and the classification layer causes the model output to be both highly organized and highly reliant on the extracted features.

Large-scale network implementation is much easier with CNN than with other neural networks.

The CNN architecture consists of a number of layers (or so-called multi-building blocks). Each layer in the CNN architecture, including its function, is described in detail below.

Convolutional Layer: In CNN architecture, the most significant component is the convolutional layer. It consists of a collection of convolutional filters (so-called kernels). The input image, expressed as N-dimensional matrices, is convolved with these filters to generate the output feature map.

Kernel definition: A grid of discrete numbers or values describes the kernel. Each value is called a kernel weight. Random numbers are assigned to act as the weights of the kernel at the beginning of the CNN training process; in addition, several different methods can be used to initialize the weights. These weights are then adjusted at each training epoch, and thus the kernel learns to extract significant features.

Convolutional Operation: Initially, the CNN input format is described. The vector format is the input of the traditional neural network, while the multi-channeled image is the input of the CNN. For instance, the gray-scale image format is single-channeled, while the RGB image format is three-channeled. To understand the convolutional operation, let us take an example of a \(4 \times 4\) gray-scale image with a \(2 \times 2\) random weight-initialized kernel. First, the kernel slides over the whole image horizontally and vertically. In addition, the dot product between the input image and the kernel is determined, where their corresponding values are multiplied and then summed up to create a single scalar value. The whole process is then repeated until no further sliding is possible. Note that the calculated dot product values represent the feature map of the output. Figure  8 graphically illustrates the primary calculations executed at each step. In this figure, the light green color represents the \(2 \times 2\) kernel, while the light blue color represents the similarly sized area of the input image. Both are multiplied; the end result after summing up the resulting product values (marked in a light orange color) represents an entry value of the output feature map.

figure 8

The primary calculations executed at each step of convolutional layer
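The following NumPy sketch reproduces the walkthrough above for illustration: a 2 x 2 kernel slides over a 4 x 4 gray-scale image with a stride of one and no padding, and each dot product becomes one entry of the output feature map (the image and kernel values are arbitrary).

```python
# Manual 2D convolution of a 4x4 image with a 2x2 kernel (stride 1, no padding).
import numpy as np

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 input
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])                   # toy 2x2 kernel

out_h = image.shape[0] - kernel.shape[0] + 1        # = 3
out_w = image.shape[1] - kernel.shape[1] + 1        # = 3
feature_map = np.zeros((out_h, out_w))

for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 2, j:j + 2]
        feature_map[i, j] = np.sum(patch * kernel)   # dot product of patch and kernel

print(feature_map)
```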

Note that padding is not applied to the input image in the previous example, while a stride of one (the selected step size over all vertical or horizontal locations) is applied to the kernel. It is also possible to use other stride values; increasing the stride value produces a feature map of lower dimensions.

On the other hand, padding is highly significant for preserving the information at the borders of the input image; without it, the border-side features are washed away very quickly. By applying padding, the size of the input image will increase, and in turn, the size of the output feature map will also increase.

Core Benefits of Convolutional Layers

Sparse Connectivity: Each neuron of a layer in FC neural networks links with all neurons in the following layer. By contrast, in CNNs, only a few weights are available between two adjacent layers. Thus, the number of required weights or connections is small, and the memory required to store these weights is also small; hence, this approach is memory-effective. In addition, the full matrix operation is computationally much more costly than the dot (.) operation used in the CNN.

Weight Sharing: In a CNN, there are no dedicated weights between any two neurons of neighboring layers; instead, the same set of weights operates on each and every pixel of the input matrix. Learning a single group of weights for the whole input significantly decreases the required training time and various costs, as it is not necessary to learn additional weights for each neuron.

Pooling Layer: The main task of the pooling layer is the sub-sampling of the feature maps. These maps are generated by following the convolutional operations. In other words, this approach shrinks large-size feature maps to create smaller feature maps. Concurrently, it maintains the majority of the dominant information (or features) in every step of the pooling stage. In a similar manner to the convolutional operation, both the stride and the kernel are initially size-assigned before the pooling operation is executed. Several types of pooling methods are available for utilization in various pooling layers. These methods include tree pooling, gated pooling, average pooling, min pooling, max pooling, global average pooling (GAP), and global max pooling. The most familiar and frequently utilized pooling methods are the max, min, and GAP pooling. Figure  9 illustrates these three pooling operations.

figure 9

Three types of pooling operations
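For illustration, the NumPy sketch below applies 2 x 2 max and average pooling with a stride of two to a toy 4 x 4 feature map (the values are arbitrary); global average pooling would simply average the entire map into a single value per channel.

```python
# 2x2 max and average pooling with stride 2 over a 4x4 feature map.
import numpy as np

fmap = np.arange(16, dtype=float).reshape(4, 4)

def pool2x2(x, op):
    h, w = x.shape[0] // 2, x.shape[1] // 2
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            window = x[2 * i:2 * i + 2, 2 * j:2 * j + 2]
            out[i, j] = op(window)       # keep one value per 2x2 window
    return out

print(pool2x2(fmap, np.max))    # max pooling
print(pool2x2(fmap, np.mean))   # average pooling
```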

The main shortfall of the pooling layer is that, while it helps the CNN to determine whether or not a certain feature is present in the particular input image, it loses track of the exact location of that feature. Thus, the CNN model can miss relevant spatial information, and the overall CNN performance is sometimes decreased as a result.

Activation Function (non-linearity): Mapping the input to the output is the core task of every activation function in all types of neural network. The input value of a neuron is determined by computing the weighted summation of the neuron's inputs along with its bias (if present). This means that the activation function decides whether or not to fire a neuron with reference to a particular input by creating the corresponding output.

Non-linear activation layers are employed after all layers with weights (so-called learnable layers, such as FC layers and convolutional layers) in CNN architecture. This non-linear behavior of the activation layers means that the mapping of input to output will be non-linear; moreover, these layers give the CNN the ability to learn more complicated mappings. The activation function must also be differentiable, an extremely significant property, as it allows error back-propagation to be used to train the network. The following types of activation functions are most commonly used in CNNs and other deep neural networks.

Sigmoid: The input of this activation function is real numbers, while the output is restricted to between zero and one. The sigmoid function curve is S-shaped and can be represented mathematically by Eq. 2 .

Tanh: It is similar to the sigmoid function, as its input is real numbers, but the output is restricted to between − 1 and 1. Its mathematical representation is in Eq. 3 .

ReLU: The most commonly used function in the CNN context. It maps all negative input values to zero and passes non-negative values through unchanged. Its lower computational load is the main benefit of ReLU over the others. Its mathematical representation is in Eq. 4 .

Occasionally, a few significant issues may occur during the use of ReLU. For instance, consider an error back-propagation algorithm with a large gradient flowing through it. Passing this gradient through the ReLU function may update the weights in such a way that the neuron is never activated again. This issue is referred to as “Dying ReLU”. Some ReLU alternatives exist to solve such issues; several of them are discussed below.

Leaky ReLU: Instead of zeroing out the negative inputs as ReLU does, this activation function down-scales them so that they are never completely ignored. It is employed to solve the Dying ReLU problem. Leaky ReLU can be represented mathematically as in Eq. 5 .

Note that the leak factor is denoted by m. It is commonly set to a very small value, such as 0.001.

Noisy ReLU: This function employs a Gaussian distribution to make ReLU noisy. It can be represented mathematically as in Eq. 6 .

Parametric Linear Units: This is mostly the same as Leaky ReLU. The main difference is that the leak factor in this function is updated through the model training process. The parametric linear unit can be represented mathematically as in Eq. 7 .

Note that the learnable weight is denoted as a.
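For illustration, the NumPy definitions below follow the standard textbook forms of the activation functions listed above (Eqs. 2–7 are not reproduced here); the leak factor and the learnable slope values are example assumptions, not the paper's.

```python
# Standard forms of the activation functions discussed above.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))           # output in (0, 1)

def tanh(x):
    return np.tanh(x)                          # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)                  # negatives mapped to zero

def leaky_relu(x, m=0.001):
    return np.where(x >= 0, x, m * x)          # small leak instead of zero

def parametric_relu(x, a):
    return np.where(x >= 0, x, a * x)          # slope a is learned during training

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x), leaky_relu(x))
```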

Fully Connected Layer: Commonly, this layer is located at the end of each CNN architecture. Inside this layer, each neuron is connected to all neurons of the previous layer, the so-called Fully Connected (FC) approach. It is utilized as the CNN classifier. It follows the basic method of the conventional multiple-layer perceptron neural network, as it is a type of feed-forward ANN. The input of the FC layer comes from the last pooling or convolutional layer. This input is in the form of a vector, which is created from the feature maps after flattening. The output of the FC layer represents the final CNN output, as illustrated in Fig.  10 .

figure 10

Fully connected layer

Loss Functions: The previous sections have presented the various layer types of the CNN architecture; the final classification is obtained from the output layer, which represents the last layer of the CNN architecture. Loss functions are utilized in the output layer to calculate the prediction error produced across the training samples in the CNN model. This error reveals the difference between the actual output and the predicted one, and it is then minimized through the CNN learning process.

However, two parameters are used by the loss function to calculate the error. The CNN estimated output (referred to as the prediction) is the first parameter. The actual output (referred to as the label) is the second parameter. Several types of loss function are employed in various problem types. The following concisely explains some of the loss function types.

Cross-Entropy or Softmax Loss Function: This function is commonly employed for measuring the CNN model performance. It is also referred to as the log loss function. Its output is a probability \(p \in [0, 1]\) . In addition, it is usually employed as a substitute for the square error loss function in multi-class classification problems. In the output layer, it employs softmax activations to generate the output as a probability distribution. The mathematical representation of the output class probability is Eq. 8 .

Here, \(e^{a_{i}}\) represents the non-normalized output from the preceding layer, while N represents the number of neurons in the output layer. Finally, the mathematical representation of cross-entropy loss function is Eq. 9 .

Euclidean Loss Function: This function is widely used in regression problems. It is also known as the mean squared error. The mathematical expression of the estimated Euclidean loss is Eq. 10 .

Hinge Loss Function: This function is commonly employed in problems related to binary classification. This problem relates to maximum-margin-based classification; this is mostly important for SVMs, which use the hinge loss function, wherein the optimizer attempts to maximize the margin around dual objective classes. Its mathematical formula is Eq. 11 .

The margin m is commonly set to 1. Moreover, the predicted output is denoted as \(p_{i}\) , while the desired output is denoted as \(y_{i}\) .
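As an illustration only (standard definitions, with variable names that are assumptions rather than the paper's notation), the following NumPy sketch implements the three loss functions described above.

```python
# Standard forms of the cross-entropy (softmax), Euclidean, and hinge losses.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())                    # numerically stable softmax
    return e / e.sum()

def cross_entropy(probs, target_index):
    return -np.log(probs[target_index])        # log loss of the true class

def euclidean_loss(pred, target):
    return 0.5 * np.sum((pred - target) ** 2)  # mean-squared-error style loss

def hinge_loss(pred, y, margin=1.0):
    # y holds +1 or -1 labels; predictions inside the margin are penalized
    return np.maximum(0.0, margin - y * pred).sum()

scores = np.array([2.0, 0.5, -1.0])
print(cross_entropy(softmax(scores), 0))
print(euclidean_loss(np.array([0.9, 0.1]), np.array([1.0, 0.0])))
print(hinge_loss(np.array([0.8, -0.3]), np.array([1, -1])))
```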

Regularization to CNN

For CNN models, over-fitting represents the central issue associated with obtaining well-behaved generalization. A model is called over-fitted when it performs especially well on training data but does not succeed on test data (unseen data); this is explained further in a later section. An under-fitted model is the opposite; this case occurs when the model does not learn a sufficient amount from the training data. The model is referred to as “just-fitted” if it performs well on both training and testing data. These three cases are illustrated in Fig.  11 . Various intuitive concepts are used to help regularization avoid over-fitting; more details about over-fitting and under-fitting are discussed in later sections.

Dropout: This is a widely utilized technique for generalization. During each training epoch, neurons are randomly dropped. In doing this, the feature selection power is distributed equally across the whole group of neurons, as well as forcing the model to learn different independent features. During the training process, the dropped neuron will not be a part of back-propagation or forward-propagation. By contrast, the full-scale network is utilized to perform prediction during the testing process.

Drop-Weights: This method is highly similar to dropout. In each training epoch, the connections between neurons (weights) are dropped rather than dropping the neurons; this represents the only difference between drop-weights and dropout.

Data Augmentation: Training the model on a sizeable amount of data is the easiest way to avoid over-fitting. To achieve this, data augmentation is used: several techniques are utilized to artificially expand the size of the training dataset. More details can be found in a later section, which describes the data augmentation techniques.

Batch Normalization: This method normalizes the output activations so that they follow a unit Gaussian distribution [ 81 ]. Subtracting the mean and dividing by the standard deviation normalizes the output at each layer. While it is possible to consider this as a pre-processing task at each layer in the network, it is also possible to differentiate it and integrate it with other layers. In addition, it is employed to reduce the “internal covariance shift” of the activation layers, i.e., the variation in the activation distribution in each layer. This shift becomes very large due to the continuous weight updating through training, which may occur if the samples of the training data are gathered from numerous dissimilar sources (for example, day and night images). Thus, the model will consume extra time for convergence, and in turn, the time required for training will also increase. To resolve this issue, a layer representing the operation of batch normalization is applied in the CNN architecture (a minimal sketch combining dropout and batch normalization follows the list of advantages below).

The advantages of utilizing batch normalization are as follows:

It prevents the problem of vanishing gradient from arising.

It can effectively control the poor weight initialization.

It significantly reduces the time required for network convergence (for large-scale datasets, this will be extremely useful).

It helps to decrease the dependence of training on the hyper-parameters.

The chances of over-fitting are reduced, since batch normalization has a slight regularization effect.
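The sketch below (assuming PyTorch; the layer sizes and dropout rate are arbitrary) combines two of the regularization methods described above, batch normalization and dropout, in a tiny convolutional block.

```python
# A small convolutional block with batch normalization and dropout.
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),     # normalize each channel's activations per mini-batch
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly drop activations during training only
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 10),
)

x = torch.randn(4, 3, 8, 8)
block.train()
print(block(x).shape)       # dropout active in training mode
block.eval()
print(block(x).shape)       # the full network is used at test time
```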

figure 11

Over-fitting and under-fitting issues

Optimizer selection

This section discusses the CNN learning process. Two major issues are included in the learning process: the first issue is the learning algorithm selection (optimizer), while the second issue is the use of many enhancements (such as AdaDelta, Adagrad, and momentum) along with the learning algorithm to enhance the output.

Minimizing a loss function, which is defined over numerous learnable parameters (e.g. biases, weights, etc.) and measures the error (the variation between the actual and the predicted output), is the core purpose of all supervised learning algorithms. Gradient-based learning techniques are the usual choice for a CNN network. The network parameters should be updated throughout all training epochs, and the network should look for the locally optimal answer in all training epochs in order to minimize the error.

The learning rate is defined as the step size of the parameter update. A training epoch represents a complete pass of the parameter-update process over the entire training dataset. Note that, although the learning rate is a hyper-parameter, it needs to be selected wisely so that it does not adversely influence the learning process.

Gradient Descent or Gradient-based learning algorithm: To minimize the training error, this algorithm repetitively updates the network parameters through every training epoch. More specifically, to update the parameters correctly, it needs to compute the objective function gradient (slope) by applying a first-order derivative with respect to the network parameters. Next, the parameters are updated in the direction opposite to the gradient to reduce the error. The parameter-updating process is performed through network back-propagation, in which the gradient at every neuron is back-propagated to all neurons in the preceding layer. The mathematical representation of this operation is Eq. 12 .

The final weight in the current training epoch is denoted by \(w_{ij}^{t}\) , while the weight in the preceding \((t-1)\) training epoch is denoted \(w_{ij}^{t-1}\) . The learning rate is \(\eta \) and the prediction error is E . Different alternatives of the gradient-based learning algorithm are available and commonly employed; these include the following:

Batch Gradient Descent: In this technique [ 82 ], the network parameters are updated only once, after the entire training dataset has been passed through the network. In more depth, it calculates the gradient of the whole training set and subsequently uses this gradient to update the parameters. For a small-sized dataset, the CNN model converges faster and creates an extra-stable gradient using BGD. Since the parameters are changed only once per training epoch, it requires a substantial amount of resources. By contrast, for a large training dataset, additional time is required for convergence, and the method may converge to a local optimum (for non-convex instances).

Stochastic Gradient Descent: In this technique [ 83 ], the parameters are updated after each individual training sample. It is preferable to randomly shuffle the training samples in every epoch before training. For a large-sized training dataset, this technique is both more memory-effective and much faster than BGD. However, because the parameters are updated so frequently, it takes extremely noisy steps in the direction of the answer, which in turn causes the convergence behavior to become highly unstable.

Mini-batch Gradient Descent: In this approach, the training samples are partitioned into several mini-batches, where every mini-batch can be considered an under-sized collection of samples with no overlap between them [ 84 ]. Next, parameter updating is performed after the gradient is computed on every mini-batch. The advantage of this method comes from combining the advantages of both the BGD and SGD techniques. Thus, it has steady convergence, more computational efficiency, and extra memory effectiveness. The following describes several enhancement techniques for gradient-based learning algorithms (usually applied to SGD), which further powerfully enhance the CNN training process.

Momentum: For neural networks, this technique is employed together with the objective function. It enhances both the accuracy and the training speed by adding the gradient computed at the preceding training step, weighted via a factor \(\lambda \) (known as the momentum factor). Note that a plain gradient-based learning algorithm can simply become stuck in a local minimum rather than the global minimum; this represents the main disadvantage of gradient-based learning algorithms, and issues of this kind frequently occur when the problem has a non-convex surface (or solution space).

Together with the learning algorithm, momentum is used to solve this issue, which can be expressed mathematically as in Eq. 13 .

The weight increment in the current \(t{\text{th}}\) training epoch is denoted as \( \Delta w_{ij}^{t}\) , \(\eta \) is the learning rate, and \( \Delta w_{ij}^{t-1}\) is the weight increment in the preceding \((t-1){\text{th}}\) training epoch. The momentum factor value is maintained within the range 0 to 1; in turn, the step size of the weight update increases in the direction of the minimum, which reduces the error. If the momentum factor value is very low, the model loses its ability to escape local minima. By contrast, as the momentum factor value becomes high, the model converges much more rapidly. However, if a high value of the momentum factor is used together with a high learning rate, the model could miss the global minimum by crossing over it.

However, when the gradient continually changes direction throughout the training process, a suitable value of the momentum factor (which is a hyper-parameter) smooths out the variations in the weight updates.

Adaptive Moment Estimation (Adam): This is another widely used optimization technique or learning algorithm. Adam [ 85 ] represents the latest trend in deep learning optimization. Unlike second-order methods based on the Hessian matrix, which employ second-order derivatives, Adam relies only on first-order gradients. It is a learning strategy that has been designed specifically for training deep neural networks. Greater memory efficiency and lower computational requirements are two advantages of Adam. The mechanism of Adam is to calculate an adaptive learning rate for each parameter in the model. It integrates the pros of both Momentum and RMSprop: it utilizes the squared gradients to scale the learning rate, as in RMSprop, and it uses the moving average of the gradient, as in momentum. The Adam update is represented in Eq. 14 .
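As an illustrative sketch (textbook forms of the updates; the constants such as the decay rates are assumptions, not values from the paper), the NumPy snippet below performs a single momentum step and a single Adam step on the same gradient.

```python
# One momentum update and one Adam update on a toy parameter vector.
import numpy as np

def momentum_step(w, grad, velocity, eta=0.01, lam=0.9):
    velocity = lam * velocity - eta * grad        # accumulate weighted past gradients
    return w + velocity, velocity

def adam_step(w, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad                  # moving average of gradients
    v = b2 * v + (1 - b2) * grad ** 2             # moving average of squared gradients
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)   # bias correction
    return w - eta * m_hat / (np.sqrt(v_hat) + eps), m, v

w = np.array([1.0, -2.0])
grad = np.array([0.5, -0.1])
w, vel = momentum_step(w, grad, np.zeros_like(w))
w, m, v = adam_step(w, grad, np.zeros_like(w), np.zeros_like(w), t=1)
print(w)
```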

Design of algorithms (backpropagation)

Let’s start with a notation that refers to the weights in the network unambiguously. We denote by \(w_{ij}^{h}\) the weight for the connection from the \(i\)th input (or the \(i\)th neuron in the \((h-1)\)th layer) to the \(j\)th neuron in the \(h\)th layer. Thus, Fig. 12 shows the weight on a connection from a neuron in the first layer to another neuron in the next layer of the network.

figure 12

MLP structure

Here, \(w_{11}^{2}\) represents the weight from the first neuron in the first layer to the first neuron in the second layer; accordingly, the second weight for the same neuron is \(w_{21}^{2}\) , i.e., the weight coming from the second neuron in the previous layer to the first neuron in the next layer (the second layer in this network). Regarding the bias: since the bias is not a connection between the neurons of two layers, it is handled separately, and each neuron must have its own bias (in some networks, each layer has a single bias). It can be seen from the above network that each layer has its own biases. Each network is described by parameters such as the number of layers in the network, the number of neurons in each layer, and the number of weights (connections) between the layers; the number of connections can easily be determined from the number of neurons in each layer. For example, if ten inputs are fully connected to two neurons in the next layer, then the number of connections (weights) between them is \(10 \times 2 = 20\) . To describe how the error is defined and how the weights are updated, we will imagine that there are two layers in our neural network,

where \(d\) is the label of the \(i\)th individual input and \(y\) is the output for the same individual input. Backpropagation is about understanding how to change the weights and biases in a network based on changes of the cost function (error). Ultimately, this means computing the partial derivatives \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\) . To compute these, a local variable \(\delta _{j}^{h}\) is introduced, which is called the local error of the \(j\)th neuron in the \(h\)th layer. Based on that local error, backpropagation gives the procedure to compute \(\partial E / \partial w_{ij}^{h}\) and \(\partial E / \partial b_{j}^{h}\) for the two-layer neural network shown in Fig. 13 .

figure 13

Neuron activation functions

The output error \(\delta _{j}^{1}\) is computed for each \(j = 1, \ldots , L\) , where \(L\) is the number of neurons in the output layer,

where \(e(k)\) is the error of epoch \(k\) as shown in Eq. ( 2 ) and \(\vartheta ^{\prime }\left( v_{j}(k)\right) \) is the derivative of the activation function for \(v_{j}\) at the output.

Next, the error is backpropagated to all of the remaining layers except the output layer,

where \(\delta _{j}^{1}(k)\) is the output error and \(w_{jl}^{h+1}(k)\) represents the weights of the layer following the layer for which the error needs to be obtained.

After finding the error at each neuron in each layer, the weights in each layer can be updated based on Eqs. ( 16 ) and ( 17 ).
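To make the procedure concrete, the following NumPy sketch is an illustration only: a two-layer network with a sigmoid hidden layer, a linear output, and a squared-error loss (all assumptions, with notation simplified relative to the paper's equations). It computes the local errors and updates the weights and biases for a single training sample.

```python
# Backpropagation for a tiny two-layer network on one labeled sample.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x, d = rng.normal(size=3), 1.0                 # one input sample and its label
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)  # layer 1: 3 inputs -> 4 hidden
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)  # layer 2: 4 hidden -> 1 output
eta = 0.1

for epoch in range(50):
    # Forward pass.
    v1 = W1 @ x + b1
    h = sigmoid(v1)
    y = (W2 @ h + b2)[0]
    # Local errors (deltas), propagated from the output back to the hidden layer.
    delta_out = y - d                          # dE/dy for E = 0.5 * (y - d)^2
    delta_hidden = (W2[0] * delta_out) * h * (1 - h)
    # Weight and bias updates derived from the local errors.
    W2 -= eta * delta_out * h[np.newaxis, :]
    b2 -= eta * delta_out
    W1 -= eta * np.outer(delta_hidden, x)
    b1 -= eta * delta_hidden

print(round(y, 3))                             # approaches the label d = 1.0
```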

Improving performance of CNN

Based on our experiments in different DL applications [ 86 , 87 , 88 ], we can conclude that the most effective solutions for improving the performance of a CNN are the following:

Expand the dataset with data augmentation or use transfer learning (explained in later sections).

Increase the training time.

Increase the depth (or width) of the model.

Add regularization.

Increase hyperparameter tuning.

CNN architectures

Over the last 10 years, several CNN architectures have been presented [ 21 , 26 ]. Model architecture is a critical factor in improving the performance of different applications. Various modifications have been made to CNN architecture from 1989 until today, including structural reformulation, regularization, parameter optimization, etc. Conversely, it should be noted that the key upgrades in CNN performance occurred largely due to the reorganization of processing units, as well as the development of novel blocks. In particular, the most novel developments in CNN architectures concern the use of network depth. In this section, we review the most popular CNN architectures, beginning with the AlexNet model in 2012 and ending with the High-Resolution (HR) model in 2020. Studying the features of these architectures (such as input size, depth, and robustness) is key to helping researchers choose the suitable architecture for their target task. Table  2 presents a brief overview of the CNN architectures.

The history of deep CNNs began with the appearance of LeNet [ 89 ] (Fig.  14 ). At that time, the CNNs were restricted to handwritten digit recognition tasks, which cannot be scaled to all image classes. In deep CNN architecture, AlexNet is highly respected [ 30 ], as it achieved innovative results in the fields of image recognition and classification. Krizhevesky et al. [ 30 ] first proposed AlexNet and consequently improved the CNN learning ability by increasing its depth and implementing several parameter optimization strategies. Figure  15 illustrates the basic design of the AlexNet architecture.

figure 14

The architecture of LeNet

figure 15

The architecture of AlexNet

The learning ability of the deep CNN was limited at this time due to hardware restrictions. To overcome these hardware limitations, two GPUs (NVIDIA GTX 580) were used in parallel to train AlexNet. Moreover, in order to enhance the applicability of the CNN to different image categories, the number of feature extraction stages was increased from five in LeNet to seven in AlexNet. Regardless of the fact that depth enhances generalization for several image resolutions, it was in fact overfitting that represented the main drawback related to the depth. Krizhevesky et al. used Hinton’s idea to address this problem [ 90 , 91 ]. To ensure that the features learned by the algorithm were extra robust, Krizhevesky et al.’s algorithm randomly passes over several transformational units throughout the training stage. Moreover, by reducing the vanishing gradient problem, ReLU [ 92 ] could be utilized as a non-saturating activation function to enhance the rate of convergence [ 93 ]. Local response normalization and overlapping subsampling were also performed to enhance the generalization by decreasing the overfitting. To improve on the performance of previous networks, other modifications were made by using large-size filters \((5\times 5 \; \text{and}\; 11 \times 11)\) in the earlier layers. AlexNet has considerable significance in the recent CNN generations, as well as beginning an innovative research era in CNN applications.

Network-in-network

This network model, which has some slight differences from the preceding models, introduced two innovative concepts [ 94 ]. The first was employing multiple layers of perception convolution. These convolutions are executed using a 1×1 filter, which supports the addition of extra nonlinearity in the networks. Moreover, this supports enlarging the network depth, which may later be regularized using dropout. For DL models, this idea is frequently employed in the bottleneck layer. As a substitution for a FC layer, the GAP is also employed, which represents the second novel concept and enables a significant reduction in the number of model parameters. In addition, GAP considerably updates the network architecture. Generating a final low-dimensional feature vector with no reduction in the feature maps dimension is possible when GAP is used on a large feature map [ 95 , 96 ]. Figure  16 shows the structure of the network.

figure 16

The architecture of network-in-network

Before 2013, the CNN learning mechanism was basically constructed on a trial-and-error basis, which precluded an understanding of the precise purpose behind each enhancement. This issue restricted the performance of deep CNNs on convoluted images. In response, Zeiler and Fergus introduced DeconvNet (a multilayer de-convolutional neural network) in 2013 [ 97 ]. This method later became known as ZefNet, which was developed in order to quantitatively visualize the network. Monitoring the CNN performance via understanding the neuron activations was the purpose of the network activity visualization. Earlier, Erhan et al. utilized this exact concept to optimize deep belief network (DBN) performance by visualizing the features of the hidden layers [ 98 ]. Moreover, Le et al. assessed the performance of the deep unsupervised auto-encoder (AE) by visualizing the image classes created using the output neurons [ 99 ]. By reversing the operation order of the convolutional and pooling layers, DeconvNet operates like a forward-pass CNN. Reverse mapping of this kind launches the convolutional layer output backward to create visually observable image shapes that accordingly give the neural interpretation of the internal feature representation learned at each layer [ 100 ]. Monitoring the learning schematic through the training stage was the key concept underlying ZefNet. In addition, it utilized the outcomes to identify capability issues with the model. This concept was experimentally proven on AlexNet by applying DeconvNet, which indicated that only certain neurons were working, while the others were out of action in the first two layers of the network. Furthermore, it indicated that the features extracted via the second layer contained aliasing artifacts. Based on these outcomes, Zeiler and Fergus changed the CNN topology. In addition, they performed parameter optimization and improved the CNN learning by decreasing the stride and the filter sizes in order to retain all features of the initial two convolutional layers. An improvement in performance was accordingly achieved due to this rearrangement in CNN topology. This rearrangement suggested that the visualization of the features could be employed to identify design weaknesses and guide appropriate parameter alteration. Figure  17 shows the structure of the network.

figure 17

The architecture of ZefNet

Visual geometry group (VGG)

After CNN was determined to be effective in the field of image recognition, an easy and efficient design principle for CNN was proposed by Simonyan and Zisserman. This innovative design was called Visual Geometry Group (VGG). A multilayer model [ 101 ], it featured nineteen layers, deeper than ZefNet [ 97 ] and AlexNet [ 30 ], to examine the relationship between network representational capacity and depth. In the 2013-ILSVRC competition, ZefNet had been the frontier network, and it suggested that filters with small sizes could enhance CNN performance. With reference to these results, VGG replaced the \(11\times 11\) and \(5\times 5\) filters of ZefNet with a stack of \(3\times 3\) filter layers, and demonstrated experimentally that placing these small-size filters on top of one another could produce the same effect as the large-size filters. In other words, these small-size filters made the receptive field similarly effective to that of the large-size filters \((7 \times 7 \; \text{and}\; 5 \times 5)\) . By decreasing the number of parameters, an extra advantage of reduced computational complexity was achieved by using small-size filters. These outcomes established a novel research trend of working with small-size filters in CNN. In addition, by inserting \(1\times 1\) convolutions in the middle of the convolutional layers, VGG regulates the network complexity: it learns a linear combination of the subsequent feature maps. With respect to network tuning, a max pooling layer [ 102 ] is inserted following the convolutional layer, while padding is implemented to maintain the spatial resolution. In general, VGG obtained significant results for localization problems and image classification. While it did not achieve first place in the 2014-ILSVRC competition, it acquired a reputation due to its enlarged depth, homogeneous topology, and simplicity. However, VGG’s computational cost was excessive due to its utilization of around 140 million parameters, which represented its main shortcoming. Figure  18 shows the structure of the network.

figure 18

The architecture of VGG

In the 2014-ILSVRC competition, GoogleNet (also called Inception-V1) emerged as the winner [ 103 ]. Achieving high-level accuracy with decreased computational cost is the core aim of the GoogleNet architecture. It proposed a novel inception block (module) concept in the CNN context, since it combines multiple-scale convolutional transformations by employing merge, transform, and split functions for feature extraction. Figure  19 illustrates the inception block architecture. This architecture incorporates filters of different sizes ( \(5\times 5, 3\times 3, \; \text{and} \; 1\times 1\) ) to capture channel information together with spatial information at diverse ranges of spatial resolution. The common convolutional layer of GoogLeNet is substituted by small blocks using the same concept of network-in-network (NIN) architecture [ 94 ], which replaced each layer with a micro-neural network. The GoogLeNet concepts of merge, transform, and split were utilized, supported by attending to an issue correlated with different learning types of variants existing in a similar class of several images. The motivation of GoogLeNet was to improve the efficiency of CNN parameters, as well as to enhance the learning capacity. In addition, it regulates the computation by inserting a \(1\times 1\) convolutional filter, as a bottleneck layer, ahead of using large-size kernels. GoogleNet employed sparse connections to overcome the redundant information problem. It decreased cost by neglecting the irrelevant channels. It should be noted here that only some of the input channels are connected to some of the output channels. By employing a GAP layer as the end layer, rather than utilizing a FC layer, the density of connections was decreased. The number of parameters was also significantly decreased from 40 to 5 million parameters due to these parameter tunings. The additional regularity factors used included the employment of RmsProp as optimizer and batch normalization [ 104 ]. Furthermore, GoogleNet proposed the idea of auxiliary learners to speed up the rate of convergence. Conversely, the main shortcoming of GoogleNet was its heterogeneous topology; this shortcoming requires adaptation from one module to another. Other shortcomings of GoogleNet include the representation jam, which substantially decreased the feature space in the following layer, and in turn occasionally leads to valuable information loss.

figure 19

The basic structure of Google Block

Highway network

Increasing the network depth enhances its performance, mainly for complicated tasks; by contrast, the network training becomes more difficult. The presence of several layers in deeper networks may result in small gradient values of the back-propagated error at the lower layers. In 2015, Srivastava et al. [ 105 ] suggested a novel CNN architecture, called Highway Network, to overcome this issue. This approach is based on the cross-connectivity concept. The unhindered information flow in the Highway Network is enabled by introducing two gating units inside the layer. The gate mechanism concept was motivated by LSTM-based RNNs [ 106 , 107 ]. Information aggregation is conducted by merging the information of the \((i-k)\)th layer with that of the \(i\)th layer to generate a regularization impact, which makes the gradient-based training of the deeper network very simple. This enables the training of networks with more than 100 layers, such as a deeper network of 900 layers, with the SGD algorithm. A Highway Network with a depth of fifty layers presented an improved rate of convergence, better than that of thin but deep architectures [ 108 ]. By contrast, [ 69 ] empirically demonstrated that plain network performance declines when more than ten hidden layers are inserted. It should be noted that even a Highway Network 900 layers in depth converges much more rapidly than the plain network.

He et al. [ 37 ] developed ResNet (Residual Network), which was the winner of ILSVRC 2015. Their objective was to design an ultra-deep network free of the vanishing gradient issue that affected the previous networks. Several types of ResNet were developed, based on the number of layers (starting with 34 layers and going up to 1202 layers). The most common type was ResNet50, which comprises 49 convolutional layers plus a single FC layer. The overall number of network weights is 25.5 M, while the overall number of MACs is 3.9 M. The novel idea of ResNet is its use of the bypass pathway concept, as shown in Fig.  20 , which was employed in Highway Nets to address the problem of training a deeper network in 2015. Figure  20 contains the fundamental ResNet block diagram: a conventional feedforward network plus a residual connection. The residual layer input can be identified as the output \(x_{l-1}\) , which is delivered from the preceding layer. After executing different operations [such as convolution using variable-size filters, or batch normalization, before applying an activation function like ReLU on \(x_{l-1}\) ], the output is \(F(x_{l-1})\) . The final residual output is \(x_{l}\) , which can be mathematically represented as in Eq. 18 .

There are numerous basic residual blocks included in the residual network. Based on the type of the residual network architecture, operations in the residual block are also changed [ 37 ].

figure 20

The block diagram for ResNet
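For illustration (assuming PyTorch; the channel count and the composition of F are arbitrary), the sketch below implements the residual block of Eq. 18, in which the block output is \(F(x_{l-1}) + x_{l-1}\) .

```python
# A basic residual block: output = F(x_prev) + x_prev (shortcut connection).
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU()

    def forward(self, x_prev):
        return self.relu(self.F(x_prev) + x_prev)   # residual (shortcut) connection

x = torch.randn(1, 16, 8, 8)
print(ResidualBlock()(x).shape)                     # torch.Size([1, 16, 8, 8])
```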

In comparison to the highway network, ResNet presented shortcut connections inside layers to enable cross-layer connectivity; these connections are parameter-free and data-independent. Note that the layers characterize non-residual functions when a gated shortcut is closed in the highway network. By contrast, in ResNet the identity shortcuts are never closed, and the residual information is always passed. Furthermore, ResNet has the potential to prevent the problem of vanishing gradients, as the shortcut connections (residual links) accelerate the deep network convergence. ResNet won the 2015-ILSVRC championship with 152 layers of depth; this represents 8 times the depth of VGG and 20 times the depth of AlexNet. In comparison with VGG, it has lower computational complexity, even with enlarged depth.

Inception: ResNet and Inception-V3/4

Szegedy et al. [ 103 , 109 , 110 ] proposed Inception-ResNet and Inception-V3/4 as upgraded types of Inception-V1/2. The concept behind Inception-V3 was to minimize the computational cost with no effect on the deeper network generalization. Thus, Szegedy et al. used asymmetric small-size filters ( \(1\times 5\) and \(1\times 7\) ) rather than large-size filters ( \( 7\times 7\) and \(5\times 5\) ); moreover, they utilized a bottleneck of \(1\times 1\) convolution prior to the large-size filters [ 110 ]. These changes make the operation of the traditional convolution very similar to cross-channel correlation. Previously, Lin et al. utilized the 1 × 1 filter potential in NIN architecture [ 94 ]. Subsequently, [ 110 ] utilized the same idea in an intelligent manner. By using \(1\times 1\) convolutional operation in Inception-V3, the input data are mapped into three or four isolated spaces, which are smaller than the initial input spaces. Next, all of these correlations are mapped in these smaller spaces through common \(5\times 5\) or \(3\times 3\) convolutions. By contrast, in Inception-ResNet, Szegedy et al. bring together the inception block and the residual learning power by replacing the filter concatenation with the residual connection [ 111 ]. Szegedy et al. empirically demonstrated that Inception-ResNet (Inception-4 with residual connections) can achieve a similar generalization power to Inception-V4 with enlarged width and depth and without residual connections. Thus, it is clearly illustrated that using residual connections in training will significantly accelerate the Inception network training. Figure  21 shows The basic block diagram for Inception Residual unit.

figure 21

The basic block diagram for Inception Residual unit

To solve the problem of the vanishing gradient, DenseNet was presented, following the same direction as ResNet and the Highway network [ 105 , 111 , 112 ]. One of the drawbacks of ResNet is that it explicitly preserves information by means of additive identity transformations, so several layers contribute extremely little or no information. In addition, ResNet has a large number of weights, since each layer has an isolated group of weights. DenseNet employed cross-layer connectivity in an improved approach to address this problem [ 112 , 113 , 114 ]. It connects each layer to all subsequent layers in the network in a feed-forward manner; therefore, the feature maps of all preceding layers are used as inputs to all following layers. In traditional CNNs, there are l connections between the previous layer and the current layer, while in DenseNet, there are \(\frac{l(l+1)}{2}\) direct connections. DenseNet demonstrates the influence of cross-layer depth-wise convolutions. The network thereby gains the ability to discriminate clearly between the added and the preserved information, since DenseNet concatenates the features of the preceding layers rather than adding them. However, due to its narrow layer structure, DenseNet becomes parametrically expensive along with the increased number of feature maps. The direct access of every layer to the gradients via the loss function enhances the information flow all across the network. In addition, this has a regularizing effect, which reduces overfitting on tasks with smaller training sets. Figure  22 shows the architecture of the DenseNet network.

figure 22

(adopted from [ 112 ])

The architecture of DenseNet Network

ResNext is an enhanced version of the Inception Network [ 115 ]. It is also known as the Aggregated Residual Transform Network. Cardinality, which is a new term presented by [ 115 ], utilized the split, transform, and merge topology in an easy and effective way. It denotes the size of the transformation set as an extra dimension [ 116 , 117 , 118 ]. However, the Inception network manages network resources more efficiently, as well as enhancing the learning ability of the conventional CNN. In the transformation branch, different spatial embeddings (employing e.g. \(5\times 5\) , \(3\times 3\) , and \(1\times 1\) ) are used. Thus, customizing each layer is required separately. By contrast, ResNext derives its characteristic features from ResNet, VGG, and Inception. It employed the VGG deep homogenous topology with the basic architecture of GoogleNet by setting \(3\times 3\) filters as spatial resolution inside the blocks of split, transform, and merge. Figure  23 shows the ResNext building blocks. ResNext utilized multi-transformations inside the blocks of split, transform, and merge, as well as outlining such transformations in cardinality terms. The performance is significantly improved by increasing the cardinality, as Xie et al. showed. The complexity of ResNext was regulated by employing \(1\times 1\) filters (low embeddings) ahead of a \(3\times 3\) convolution. By contrast, skipping connections are used for optimized training [ 115 ].

figure 23

The basic block diagram for the ResNext building blocks

The feature reuse problem is the core shortcoming related to deep residual networks, since certain feature blocks or transformations contribute a very small amount to learning. Zagoruyko and Komodakis [ 119 ] accordingly proposed WideResNet to address this problem. These authors advised that the depth has a supplemental influence, while the residual units convey the core learning ability of deep residual networks. WideResNet utilized the residual block power via making the ResNet wider instead of deeper [ 37 ]. It enlarged the width by presenting an extra factor, k, which handles the network width. In other words, it indicated that layer widening is a highly successful method of performance enhancement compared to deepening the residual network. While enhanced representational capacity is achieved by deep residual networks, these networks also have certain drawbacks, such as the exploding and vanishing gradient problems, feature reuse problem (inactivation of several feature maps), and the time-intensive nature of the training. He et al. [ 37 ] tackled the feature reuse problem by including a dropout in each residual block to regularize the network in an efficient manner. In a similar manner, utilizing dropouts, Huang et al. [ 120 ] presented the stochastic depth concept to solve the slow learning and gradient vanishing problems. Earlier research was focused on increasing the depth; thus, any small enhancement in performance required the addition of several new layers. When comparing the number of parameters, WideResNet has twice that of ResNet, as an experimental study showed. By contrast, WideResNet presents an improved method for training relative to deep networks [ 119 ]. Note that most architectures prior to residual networks (including the highly effective VGG and Inception) were wider than ResNet. Thus, wider residual networks were established once this was determined. However, inserting a dropout between the convolutional layers (as opposed to within the residual block) made the learning more effective in WideResNet [ 121 , 122 ].

Pyramidal Net

The depth of the feature map increases in the succeeding layer due to the deep stacking of multi-convolutional layers, as shown in previous deep CNN architectures such as ResNet, VGG, and AlexNet. By contrast, the spatial dimension reduces, since a sub-sampling follows each convolutional layer. Thus, augmented feature representation is recompensed by decreasing the size of the feature map. The extreme expansion in the depth of the feature map, alongside the spatial information loss, interferes with the learning ability in the deep CNNs. ResNet obtained notable outcomes for the issue of image classification. Conversely, deleting a convolutional block—in which both the number of channel and spatial dimensions vary (channel depth enlarges, while spatial dimension reduces)—commonly results in decreased classifier performance. Accordingly, the stochastic ResNet enhanced the performance by decreasing the information loss accompanying the residual unit drop. Han et al. [ 123 ] proposed Pyramidal Net to address the ResNet learning interference problem. To address the depth enlargement and extreme reduction in spatial width via ResNet, Pyramidal Net slowly enlarges the residual unit width to cover the most feasible places rather than saving the same spatial dimension inside all residual blocks up to the appearance of the down-sampling. It was referred to as Pyramidal Net due to the slow enlargement in the feature map depth based on the up-down method. Factor l, which was determined by Eq. 19 , regulates the depth of the feature map.

Here, the dimension of the l th residual unit is indicated by \(d_{l}\) ; moreover, n indicates the overall number of residual units, the step factor is indicated by \(\lambda \) , and the depth increase is regulated by the factor \(\frac{\lambda }{n}\) , which uniformly distributes the width increase across the dimensions of the feature maps. Zero-padded identity mapping is used to insert the residual connections among the layers. In comparison with projection-based shortcut connections, zero-padded identity mapping requires fewer parameters, which in turn leads to enhanced generalization [ 124 ]. Multiplication-based and addition-based widening are the two approaches used in Pyramidal Nets for network widening; the first (multiplication) enlarges the width geometrically, while the second (addition) enlarges it linearly [ 92 ]. The main problem associated with width enlargement is the quadratic growth in the time and space requirements.
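A small illustration of the additive (linear) widening schedule described above is sketched below; the starting width of 16 channels and the total widening of 48 are illustrative assumptions, not values taken from the text.

```python
import math

def pyramidal_widths(start_width=16, total_widening=48, n_units=10):
    """Additive widening: unit l has width d_l = d_0 + l * (lambda / n),
    so the feature-map depth grows linearly across the residual units."""
    step = total_widening / n_units  # the lambda/n increment
    return [math.floor(start_width + l * step) for l in range(1, n_units + 1)]

print(pyramidal_widths())  # [20, 25, 30, 35, 40, 44, 49, 54, 59, 64]
```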

Xception

Extreme Inception architecture is the main characteristic of Xception, and the main idea behind it is depthwise separable convolution [ 125 ]. The Xception model modified the original Inception block by making it wider and by using a single spatial dimension ( \(3 \times 3\) ) followed by a \(1 \times 1\) convolution to reduce computational complexity. Figure  24 shows the Xception block architecture. The Xception network becomes more computationally efficient by decoupling the spatial and cross-channel correspondence. It first maps the convolved output to a low-dimensional embedding by applying \(1 \times 1\) convolutions, and then performs k spatial transformations. Note that k here represents the width-defining cardinality, which is obtained from the number of transformations in Xception. The computations are made simpler in Xception by convolving each channel separately along the spatial axes; the \(1 \times 1\) (pointwise) convolutions are subsequently applied to capture the cross-channel correspondence. The \(1 \times 1\) convolution is also utilized in Xception to regulate the channel depth. Whereas traditional CNN architectures use a single transformation segment and Inception uses three, the convolutional operation in Xception uses a number of transformation segments equal to the number of channels. Although the suggested Xception transformation approach achieves extra learning efficiency and better performance, it does not reduce the number of parameters [ 126 , 127 ].

figure 24

The basic block diagram for the Xception block architecture
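To make the decoupling of spatial and cross-channel correspondence concrete, below is a minimal sketch of a depthwise separable convolution (assuming PyTorch; the channel counts are illustrative).

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Each input channel is convolved separately along the spatial axes
    (depthwise, groups=in_channels), then a 1x1 pointwise convolution mixes
    the channels, i.e. spatial and cross-channel correlations are decoupled."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels, bias=False)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 64, 56, 56)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 56, 56])
```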

Residual attention neural network

To improve the feature representation of the network, Wang et al. [ 128 ] proposed the Residual Attention Network (RAN). The main purpose of incorporating attention into the CNN is to enable the network to learn object-aware features. The RAN is a feed-forward CNN consisting of stacked residual blocks together with attention modules. Each attention module is divided into two branches, namely the mask branch and the trunk branch, which adopt top-down and bottom-up learning strategies, respectively. Encapsulating the two strategies in one attention module supports top-down attention feedback and fast feed-forward processing within a single feed-forward pass. More specifically, the top-down architecture generates dense features to make inferences about every aspect of the input, while the bottom-up feed-forward architecture generates low-resolution feature maps with strong semantic information. Restricted Boltzmann machines employed a similar top-down bottom-up strategy in previously proposed studies [ 129 ]. During the training reconstruction phase, Goh et al. [ 130 ] used the top-down attention mechanism in deep Boltzmann machines (DBMs) as a regularizing factor. Note that the network can be globally optimized in a similar manner using a top-down learning strategy, where the feature maps are progressively refined towards the input throughout the learning process [ 129 , 130 , 131 , 132 ].

A previous study [ 133 ] used a transformation network to incorporate the attention concept with convolutional blocks in a straightforward way. Unfortunately, the main problem with such approaches is that they are inflexible and cannot be applied to varying surroundings. By contrast, stacking multiple attention modules has made RAN very effective at recognizing noisy, complex, and cluttered images. RAN’s hierarchical organization gives it the capability to adaptively allocate a weight to every feature map depending on its importance within the layers. Furthermore, incorporating three distinct levels of attention (spatial, channel, and mixed) enables the model to capture object-aware features at each of these distinct levels.

Convolutional block attention module

The importance of feature-map exploitation and the attention mechanism is confirmed by SE-Network and RAN [ 128 , 134 , 135 ]. The convolutional block attention module (CBAM), a novel attention-based CNN module, was first developed by Woo et al. [ 136 ]. This module is similar in spirit to SE-Network and simple in design. SE-Network disregards the spatial locality of the object in the image and considers only the channels’ contribution during image classification; however, the spatial location of the object plays a significant role in object detection. CBAM infers the attention maps sequentially: it applies channel attention before spatial attention to obtain the refined feature maps. Spatial attention is performed using 1 × 1 convolution and pooling functions, as in the literature. Pooling features along the spatial axis generates an effective feature descriptor. In addition, CBAM concatenates the max-pooling and average-pooling operations, which makes it possible to generate a robust spatial attention map. In a similar manner, a combination of GAP and max pooling is used to model the feature-map statistics. Woo et al. [ 136 ] demonstrated that using GAP alone returns a sub-optimal inference of channel attention, whereas max pooling provides an indication of distinguishing object features; thus, using both max pooling and average pooling enhances the network’s representational power. The refined feature maps improve the representational power and help the network focus on the significant portions of the chosen features. Woo et al. [ 136 ] also experimentally showed that expressing the 3D attention map through a serial learning procedure helps to decrease the computational cost and the number of parameters. Note that CBAM can be simply integrated into any CNN architecture.
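A compact sketch of this channel-then-spatial attention ordering is given below (assuming PyTorch; the reduction ratio and the 7 × 7 spatial kernel are common implementation choices, not values taken from the text above).

```python
import torch
import torch.nn as nn

class SimpleCBAM(nn.Module):
    """Channel attention (avg + max pooled descriptors through a shared MLP),
    followed by spatial attention (avg + max pooled maps through a convolution)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Channel attention: pool over the spatial axes, then the shared MLP.
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: pool over the channel axis, then a convolution.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

y = SimpleCBAM(64)(torch.randn(2, 64, 32, 32))  # same shape as the input
```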

Concurrent spatial and channel excitation mechanism

To make this idea applicable to segmentation tasks, Roy et al. [ 137 , 138 ] extended the work of Hu et al. [ 134 ] by adding the influence of spatial information to the channel information. They presented three types of modules: (1) concurrent spatial and channel squeeze and excitation (scSE); (2) channel squeeze and spatial excitation (sSE); and (3) spatial squeeze and channel excitation (cSE). They employed auto-encoder-based CNNs for segmentation purposes and suggested inserting the modules after the encoder and decoder layers. In the first module (scSE), attention is allocated to every channel by deriving a scaling factor from both the channel and the spatial information, so that the object-specific feature maps are specifically highlighted. In the second module (sSE), the spatial locality is treated as more important than the feature-map (channel) information, since spatial information plays a significant role during segmentation; the feature maps are therefore squeezed along the channel axis and excited spatially so that they can be employed in segmentation. In the final module (cSE), the same concept as in the SE block is used, and the scaling factor is derived based on the contribution of the feature maps to object detection [ 137 , 138 ].
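The three variants can be sketched compactly as follows (assuming PyTorch; the reduction ratio is an illustrative choice).

```python
import torch
import torch.nn as nn

class SCSE(nn.Module):
    """cSE: squeeze spatially (global average pool) and excite the channels.
    sSE: squeeze the channels (1x1 conv) and excite spatially.
    scSE: apply both concurrently and combine the two recalibrated maps."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.cse = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.sse = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.cse(x) + x * self.sse(x)

y = SCSE(32)(torch.randn(1, 32, 64, 64))  # same shape as the input
```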

CapsuleNet

The CNN is an efficient technique for detecting object features and achieves good recognition performance in comparison with innovative handcrafted feature detectors. Nevertheless, CNNs have a number of restrictions: they do not consider certain relations among features, such as orientation, size, and perspective. For instance, when considering a face image, the CNN does not take into account the positions of the various face components (such as the mouth, eyes, and nose); the neurons may therefore be activated incorrectly and the face recognized without considering specific relations (such as size and orientation). Now consider a neuron that carries, in addition to its activation probability, feature properties such as size, orientation, and perspective. A specific neuron/capsule of this type has the ability to detect the face effectively along with these different types of information. Thus, the capsule network is constructed from many layers of capsule nodes. The encoding unit, which contains three layers of capsule nodes, forms the CapsuleNet or CapsNet (the initial version of the capsule networks).

For example, in the MNIST architecture, the \(28\times 28\) input images are first processed by 256 filters of size \(9\times 9\) with stride 1. The output size is \(28-9+1=20\), giving 256 feature maps of size \(20\times 20\). These outputs are then fed to the first capsule layer, which is a modified convolution layer producing an 8D vector rather than a scalar. This layer employs \(9\times 9\) filters with stride 2, so the output dimension is \((20-9)/2+1=6\). The initial capsules employ \(8\times 32\) filters, generating 32 × 8 × 6 × 6 outputs (32 groups, 8 neurons per capsule, and a 6 × 6 spatial size).
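The arithmetic above can be reproduced with a small helper (plain Python; a sketch only, using the layer hyperparameters quoted in the text).

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a convolution (integer division models flooring)."""
    return (size - kernel + 2 * padding) // stride + 1

s1 = conv_out(28, kernel=9, stride=1)   # first conv layer: 20 -> 256 maps of 20x20
s2 = conv_out(s1, kernel=9, stride=2)   # primary capsule conv: 6
print(s1, s2)                           # 20 6
print("primary capsule output:", (32, 8, s2, s2))  # 32 groups x 8D vectors x 6 x 6
```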

Figure  25 represents the complete CapsNet encoding and decoding processes. In the CNN context, a max-pooling layer is frequently employed to handle translation variance: it can detect a moved feature provided that the feature still falls within the max-pooling window. In contrast, because a capsule contains the weighted sum of the features from the preceding layer, this approach has the ability to detect overlapped features, which is highly significant in detection and segmentation operations.

figure 25

The complete CapsNet encoding and decoding processes

In conventional CNNs, a particular cost function is employed to evaluate the global error that propagates backward throughout the training process; in such cases, once the weight between two neurons becomes zero, the activation of a neuron is not propagated any further. In CapsNet, by contrast, the signal is directed according to the feature parameters through iterative dynamic routing-by-agreement, rather than through a single global cost function. Sabour et al. [ 139 ] provide more details about this architecture. When recognizing handwritten digits on MNIST, this innovative CNN architecture gives superior accuracy. From the application perspective, this architecture is better suited to segmentation and detection approaches than to classification approaches [ 140 , 141 , 142 ].

High-resolution network (HRNet)

High-resolution representations are necessary for position-sensitive vision tasks, such as semantic segmentation, object detection, and human pose estimation. In most state-of-the-art frameworks, the input image is encoded as a low-resolution representation using a subnetwork constructed as a connected series of high-to-low resolution convolutions, such as VGGNet and ResNet, and the high-resolution representation is then recovered from this low-resolution encoding. Alternatively, high-resolution representations can be maintained throughout the entire process using a novel network referred to as the High-Resolution Network (HRNet) [ 143 , 144 ]. This network has two principal features. First, the high-to-low resolution convolution streams are connected in parallel. Second, the information across the resolutions is repeatedly exchanged. The advantage is a representation that is more accurate in the spatial domain and richer in the semantic domain. HRNet has several applications in the fields of object detection, semantic segmentation, and human pose prediction, and it represents a more robust backbone for computer vision problems. Figure  26 illustrates the general architecture of HRNet.

figure 26

The general architecture of HRNet
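A toy two-branch version of the repeated cross-resolution exchange could look like the following sketch (assuming PyTorch; the channel counts and the single exchange step are illustrative simplifications of the full HRNet).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchExchange(nn.Module):
    """Keeps a high- and a low-resolution stream in parallel and fuses them:
    the high stream is downsampled into the low one, and the low stream is
    upsampled (plus a 1x1 conv) into the high one."""
    def __init__(self, high_ch=32, low_ch=64):
        super().__init__()
        self.down = nn.Conv2d(high_ch, low_ch, kernel_size=3, stride=2, padding=1)
        self.up = nn.Conv2d(low_ch, high_ch, kernel_size=1)

    def forward(self, x_high, x_low):
        fused_low = x_low + self.down(x_high)
        up = F.interpolate(self.up(x_low), size=x_high.shape[-2:],
                           mode="bilinear", align_corners=False)
        fused_high = x_high + up
        return fused_high, fused_low

h, l = TwoBranchExchange()(torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32))
print(h.shape, l.shape)  # torch.Size([1, 32, 64, 64]) torch.Size([1, 64, 32, 32])
```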

Challenges (limitations) of deep learning and alternate solutions

Several difficulties must often be taken into consideration when employing DL. The most challenging of these are listed next, and possible alternative solutions are provided for each.

Training data

DL is extremely data-hungry, since it also involves representation learning [ 145 , 146 ]. DL demands an extensively large amount of data to achieve a well-behaved performance model; as the data increase, an even better-performing model can be achieved (Fig.  27 ). In most cases, the available data are sufficient to obtain a good performance model. However, sometimes there is a shortage of data for using DL directly [ 87 ]. Three suggested methods are available to properly address this issue. The first involves employing the transfer-learning concept, in which data are collected from similar tasks. Note that while the transferred data will not directly augment the actual data, they help in enhancing both the original input representation of the data and its mapping function [ 147 ], thereby boosting model performance. A related technique involves taking a model that is well trained on a similar task and fine-tuning the final one or two layers using the limited original data; refer to [ 148 , 149 ] for a review of the different transfer-learning techniques applied in DL. The second method is data augmentation [ 150 ]. This is very helpful for augmenting image data, since translating, mirroring, and rotating an image commonly does not change its label. Conversely, care must be taken when applying this technique to some data, such as bioinformatics data; for instance, mirroring an enzyme sequence yields output data that may not represent an actual enzyme sequence. In the third method, simulated data can be used to increase the volume of the training set. If the underlying physical process is well understood, it is sometimes possible to create simulators and then to simulate as much data as needed. Ref. [ 151 ] provides an example of meeting the data requirements of DL through simulation.

figure 27

The performance of DL regarding the amount of data

  • Transfer learning

Recent research has revealed the widespread use of deep CNNs, which offer ground-breaking support for answering many classification problems. Generally speaking, deep CNN models require a sizable volume of data to obtain good performance, and the common challenge associated with using such models is the lack of training data. Indeed, gathering a large volume of data is an exhausting job, and no fully satisfactory solution is available at this time. The undersized dataset problem is therefore currently addressed using the TL technique [ 148 , 149 ], which is highly efficient in dealing with the lack of training data. The mechanism of TL involves first training the CNN model on a large volume of data and then fine-tuning the model on a small target dataset.

The student-teacher relationship is a suitable way to clarify TL. The teacher first gathers detailed knowledge of the subject [ 152 ]. Next, the teacher gives a “course” by conveying the information within a “lecture series” over time; put simply, the teacher transfers the information to the student. In more detail, the expert (teacher) transfers the knowledge (information) to the learner (student). Similarly, a DL network is trained using a vast volume of data and learns the bias and the weights during the training process. These weights are then transferred to a different network for retraining or testing on a similar, novel model. Thus, the novel model starts from pre-trained weights rather than requiring training from scratch. Figure  28 illustrates the conceptual diagram of the TL technique.

Pre-trained models: Many CNN models, e.g. AlexNet [ 30 ], GoogleNet [ 103 ], and ResNet [ 37 ], have been trained on large datasets such as ImageNet for image recognition purposes. These models can then be employed to recognize a different task without the need to train from scratch; the weights remain the same apart from a few learned features. In cases where data samples are lacking, these models are very useful. There are several reasons for employing a pre-trained model. First, training large models on sizeable datasets requires expensive computational power. Second, training large models can be time-consuming, taking up to multiple weeks. Finally, a pre-trained model can assist with network generalization and speed up convergence.
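For illustration, the usual fine-tuning recipe could be sketched as follows (assuming PyTorch/torchvision; the dataset, the number of target classes, and the choice of freezing all but the final layer are illustrative assumptions, not a prescription from the text).

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a model pre-trained on ImageNet and reuse its weights.
model = models.resnet18(pretrained=True)

# Freeze the transferred weights so only the new head is trained.
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with one sized for the target task
# (here a hypothetical 5-class problem).
model.fc = nn.Linear(model.fc.in_features, 5)

# Only the parameters of the new head are passed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch.
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 5, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```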

A research problem when using pre-trained models: Training a DL approach requires a massive number of images; thus, obtaining good performance is a challenge when little data is available. Achieving excellent outcomes in image classification or recognition applications, with performance occasionally superior to that of a human, becomes possible through the use of deep convolutional neural networks (DCNNs) comprising several layers, provided a huge amount of data is available [ 37 , 148 , 153 ]. However, avoiding overfitting problems in such applications requires sizable datasets and properly generalizing DCNN models. When training a DCNN model, the dataset size has no lower limit; however, the accuracy of the model becomes insufficient if the utilized model has fewer layers or if a small dataset is used for training, due to over- or under-fitting problems. Because they cannot exploit the hierarchical features of sizable datasets, models with fewer layers have poor accuracy. It is difficult to acquire sufficient training data for DL models. For example, in medical imaging and environmental science, gathering labelled datasets is very costly [ 148 ]. Moreover, the majority of crowdsourcing workers are unable to make accurate annotations on medical or biological images due to their lack of medical or biological knowledge. Thus, ML researchers often rely on field experts to label such images; however, this process is costly and time-consuming. Therefore, producing the large volume of labels required to develop flourishing deep networks turns out to be unfeasible. Recently, TL has been widely employed to address the latter issue. Nevertheless, although TL enhances the accuracy of several tasks in the fields of pattern recognition and computer vision [ 154 , 155 ], there is an essential issue related to the type of source data used for TL as compared to the target dataset. For instance, enhancing the medical image classification performance of CNN models is attempted by pre-training the models on the ImageNet dataset, which contains natural images [ 153 ]. However, such natural images are completely dissimilar from raw medical images, meaning that the model performance is not enhanced. It has further been shown that TL from a different domain does not significantly affect performance on medical imaging tasks, as lightweight models trained from scratch perform nearly as well as standard ImageNet-transferred models [ 156 ]. Therefore, there exist scenarios in which using pre-trained models is not a viable solution. In 2020, some researchers utilized same-domain TL and achieved excellent results [ 86 , 87 , 88 , 157 ]. Same-domain TL is the approach of training on images that look similar to the target dataset; for example, using X-ray images of different chest diseases to train the model, then fine-tuning and training it on chest X-ray images for COVID-19 diagnosis. More details about same-domain TL and how to implement the fine-tuning process can be found in [ 87 ].

figure 28

The conceptual diagram of the TL technique

Data augmentation techniques

If the goal is to increase the amount of available data while avoiding the overfitting issue, data augmentation techniques are one possible solution [ 150 , 158 , 159 ]. These techniques are data-space solutions for any limited-data problem. Data augmentation incorporates a collection of methods that improve the attributes and size of training datasets; thus, DL networks can perform better when these techniques are employed. Some alternate data augmentation solutions are listed next.

Flipping: Flipping about the vertical axis is a less common practice than flipping about the horizontal one. Flipping has been verified as valuable on datasets like ImageNet and CIFAR-10, and it is very simple to implement. However, it is not a label-preserving transformation on datasets that involve text recognition (such as SVHN and MNIST).

Color space: Digital image data are commonly encoded as a tensor of dimensions ( \(height \times width \times color channels\) ). Performing augmentations in the color space of the channels is an alternative technique that is very practical to implement. A very simple color augmentation involves isolating a single color channel, such as red, green, or blue: an image can be quickly converted to a single-color-channel image by keeping that matrix and inserting zeros for the remaining two channels. Furthermore, the image brightness can be increased or decreased using straightforward matrix operations that manipulate the RGB values. Additional, more sophisticated color augmentations can be obtained by deriving a color histogram that describes the image; lighting alterations can then be made by adjusting the intensity values in the histogram, similar to those employed in photo-editing applications.

Cropping: Cropping a dominant patch of every image can be employed as a processing step for image data with mixed height and width dimensions. Furthermore, random cropping may be employed to produce an effect similar to translation. The difference is that translation conserves the spatial dimensions of the image, while random cropping reduces the input size [for example, from (256, 256) to (224, 224)]. Depending on the selected reduction threshold for cropping, the transformation may not be label-preserving.

Rotation: Rotation augmentations are obtained by rotating an image left or right by between 0 and 360 degrees around its axis. The rotation degree parameter strongly determines the suitability of the rotation augmentations. In digit recognition tasks, small rotations (from 0 to 20 degrees) are very helpful; by contrast, as the rotation degree increases, the data label may no longer be preserved post-transformation.

Translation: To avoid positional bias within the image data, a very useful transformation is to shift the image up, down, left, or right. For instance, it is common for all of the dataset images to be centered; moreover, the tested dataset should then be made up entirely of centered images to test the model. Note that when translating the initial images in a particular direction, the residual space should be filled with Gaussian or random noise, or with a constant value such as 255 or 0. This padding preserves the spatial dimensions of the image post-augmentation.

Noise injection: This approach involves injecting a matrix of arbitrary values, commonly drawn from a Gaussian distribution. Moreno-Barea et al. [ 160 ] tested noise injection on nine datasets taken from the UCI repository [ 161 ]. Injecting noise into images enables the CNN to learn more robust features.

Geometric transformations provide good solutions for the positional biases present within the training data. Several potential sources of bias can separate the distribution of the testing data from that of the training data. For instance, when all faces should be completely centered within the frames (as in facial recognition datasets), the problem of positional bias emerges, and geometric translations are then the best solution. Geometric translations are helpful due to their simplicity of implementation, as well as their effectiveness in removing positional biases. Several image-processing libraries are available, which makes it easy to begin with simple operations such as rotation or horizontal flipping. Additional training time, higher computational costs, and additional memory are some of the shortcomings of geometric transformations. Furthermore, a number of geometric transformations (such as arbitrary cropping or translation) must be manually checked to ensure that they do not change the image label. Finally, the biases that separate the test data from the training data can be more complicated than translational and positional changes. Hence, answering when and where geometric transformations are suitable is not trivial.
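As a concrete illustration of the transformations listed above, a minimal augmentation pipeline could be written as follows (assuming torchvision; the particular magnitudes, such as 15 degrees of rotation, are illustrative, not values recommended by the text).

```python
import torch
from torchvision import transforms

# Flipping, cropping, rotation, translation, color-space changes, and noise
# injection combined into one training-time augmentation pipeline.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomResizedCrop(224),                          # random cropping
    transforms.RandomRotation(degrees=15),                      # small rotations
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # translation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),       # color space
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise injection
])
```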

Imbalanced data

Commonly, biological data tend to be imbalanced, as negative samples are much more numerous than positive ones [ 162 , 163 , 164 ]. For example, compared with COVID-19-positive X-ray images, the volume of normal X-ray images is very large. It should be noted that undesirable results may be produced when training a DL model on imbalanced data. The following techniques are used to solve this issue. First, it is necessary to employ the correct criteria for evaluating the loss as well as the prediction result: with imbalanced data, the model should perform well on the small classes as well as the larger ones, so the area under the curve (AUC) should be employed as the criterion and as the basis of the loss [ 165 ]. Second, if cross-entropy loss is still preferred, a weighted cross-entropy loss should be employed, which ensures the model will also perform well on the small classes. Simultaneously, during model training, it is possible either to down-sample the large classes or to up-sample the small classes. Finally, to make the data balanced, as in Ref. [ 166 ], it is possible to construct models for every hierarchical level, as a biological system frequently has a hierarchical label space. The effect of imbalanced data on the performance of DL models has been comprehensively investigated, and the most frequently used techniques for lessening the problem have been compared; note, however, that these techniques are not specific to biological problems.
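A minimal sketch of the weighted cross-entropy idea mentioned above (assuming PyTorch; the class frequencies and the weighting scheme are illustrative).

```python
import torch
import torch.nn as nn

# Suppose the training set contains 900 negative and 100 positive samples.
class_counts = torch.tensor([900.0, 100.0])
# Weight each class inversely to its frequency so the minority class
# contributes more to the loss.
weights = class_counts.sum() / (len(class_counts) * class_counts)

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)            # dummy model outputs
labels = torch.randint(0, 2, (8,))    # dummy ground truth
print(criterion(logits, labels))
```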

Interpretability of data

DL techniques are sometimes criticized for acting as a black box; in fact, however, they are interpretable. A method for interpreting DL, which can be used to obtain the valuable motifs and patterns recognized by the network, is needed in many fields, such as bioinformatics [ 167 ]. In the task of disease diagnosis, it is necessary to know not only the diagnosis or prediction given by a trained DL model, but also how and why that decision was reached, since such verification increases the surety of the prediction outcomes [ 168 ]. To achieve this, it is possible to assign an importance score to every portion of a particular example. Within this solution, either back-propagation-based techniques or perturbation-based approaches are used [ 169 ]. In the perturbation-based approaches, a portion of the input is changed and the effect of this change on the model output is observed [ 170 , 171 , 172 , 173 ]. This concept has high computational complexity, but it is simple to understand. In the back-propagation-based techniques, on the other hand, the signal from the output is propagated back to the input layer to check the importance score of the various input portions. These techniques have been proven valuable in [ 174 ]. In different scenarios, model interpretability can take on various meanings.
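As an illustration of the perturbation-based idea, a small occlusion test could be sketched as follows (assuming PyTorch; the patch size, baseline value, and the model itself are placeholders).

```python
import torch

def occlusion_importance(model, image, target_class, patch=8, baseline=0.0):
    """Slide a patch over the image, occlude it, and record how much the
    predicted probability of the target class drops (perturbation-based score)."""
    model.eval()
    with torch.no_grad():
        base_prob = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        _, h, w = image.shape
        rows, cols = (h + patch - 1) // patch, (w + patch - 1) // patch
        heatmap = torch.zeros(rows, cols)
        for i in range(0, h, patch):
            for j in range(0, w, patch):
                occluded = image.clone()
                occluded[:, i:i + patch, j:j + patch] = baseline
                prob = torch.softmax(model(occluded.unsqueeze(0)),
                                     dim=1)[0, target_class]
                heatmap[i // patch, j // patch] = base_prob - prob  # importance
    return heatmap

# Usage with any image classifier taking (N, C, H, W) tensors:
# heatmap = occlusion_importance(trained_model, some_image, target_class=3)
```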

Uncertainty scaling

Commonly, the final prediction label is not the only output required when employing DL techniques; a confidence score for every query made of the model is also desired. The confidence score is defined as how confident the model is in its prediction [ 175 ]. Since the confidence score prevents belief in unreliable and misleading predictions, it is a significant attribute regardless of the application scenario. In biology, the confidence score reduces the resources and time expended in verifying the outcomes of misleading predictions. Generally speaking, in healthcare and similar applications, uncertainty scaling is frequently very significant; it helps in evaluating automated clinical decisions and the reliability of machine learning-based disease diagnosis [ 176 , 177 ]. Because different DL models can output overconfident predictions, the probability score (obtained from the softmax output of the network) is often not on the correct scale [ 178 ], and the softmax output requires post-scaling to achieve a reliable probability score. Several techniques have been introduced for outputting the probability score on the correct scale, including Bayesian Binning into Quantiles (BBQ) [ 179 ], isotonic regression [ 180 ], histogram binning [ 181 ], and the well-known Platt scaling [ 182 ]. More recently, temperature scaling was introduced specifically for DL techniques, and it achieves superior performance compared with the other techniques.
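A minimal sketch of temperature scaling follows (assuming PyTorch; the temperature is fitted on held-out validation logits, which are dummy placeholders here).

```python
import torch
import torch.nn as nn

def fit_temperature(val_logits, val_labels, steps=100, lr=0.01):
    """Learn a single temperature T > 0 that rescales the logits (z / T)
    so the softmax probabilities are better calibrated on validation data."""
    log_t = torch.zeros(1, requires_grad=True)   # T = exp(log_t) keeps T positive
    optimizer = torch.optim.Adam([log_t], lr=lr)
    nll = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nll(val_logits / log_t.exp(), val_labels)
        loss.backward()
        optimizer.step()
    return log_t.exp().item()

# Dummy validation logits/labels stand in for a real held-out set.
val_logits, val_labels = torch.randn(256, 10) * 5, torch.randint(0, 10, (256,))
T = fit_temperature(val_logits, val_labels)
calibrated_probs = torch.softmax(val_logits / T, dim=1)
```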

Catastrophic forgetting

Catastrophic forgetting refers to the difficulty of incorporating new information into a plain DL model without interfering with the previously learned information. For instance, consider a case in which there are 1000 types of flowers and a model has been trained to classify them, after which a new type of flower is introduced; if the model is fine-tuned only with this new class, its performance on the older classes will collapse [ 183 , 184 ]. Data are continually collected and renewed, which is in fact a highly typical scenario in many fields, e.g. biology. One direct solution to this issue is to use the old and new data together to train an entirely new model from scratch; however, this solution is time-consuming and computationally intensive, and it leads to an unstable state for the learned representation of the initial data. Currently, three types of ML techniques that do not suffer from catastrophic forgetting are available, founded on neurophysiological theories of how the human brain addresses this problem [ 185 , 186 ]. Techniques of the first type are founded on regularizations, such as EWC [ 183 ]. Techniques of the second type employ rehearsal training techniques and dynamic neural network architectures, like iCaRL [ 187 , 188 ]. Finally, techniques of the third type are founded on dual-memory learning systems [ 189 ]. Refer to [ 190 , 191 , 192 ] for more details.

Model compression

DL models have intensive memory and computational requirements due to their huge complexity and large numbers of parameters, which makes it hard to deploy well-trained models productively [ 193 , 194 ]. Healthcare and environmental science are examples of data-intensive fields, and these requirements restrict the deployment of DL on machines with limited computational power, mainly in the healthcare field. The numerous methods of assessing human health and the heterogeneity of the data have become far more complicated and vastly larger in size [ 195 ]; thus, the issue requires additional computation [ 196 ]. Novel hardware-based parallel processing solutions such as FPGAs and GPUs [ 197 , 198 , 199 ] have been developed to address the computational issues associated with DL. Recently, numerous techniques for compressing DL models, designed to decrease the computational burden of the models from the outset, have also been introduced. These techniques can be classified into four classes. In the first class, redundant parameters (which have no significant impact on model performance) are reduced; this class, which includes the famous deep compression method, is called parameter pruning [ 200 ]. In the second class, a larger model distills its knowledge to train a more compact model; this is called knowledge distillation [ 201 , 202 ]. In the third class, compact convolution filters are used to reduce the number of parameters [ 203 ]. In the final class, low-rank factorization is used to estimate the informative parameters for preservation [ 204 ]. These four classes represent the most representative model compression techniques; a more comprehensive discussion of the topic is provided in [ 193 ].

Overfitting

DL models have an excessively high possibility of overfitting the data at the training stage due to the vast number of parameters involved, which are correlated in a complex manner. Such situations reduce the model’s ability to achieve good performance on the test data [ 90 , 205 ]. This problem is not limited to a specific field, but affects many different tasks; therefore, when proposing DL techniques, this problem should be fully considered and accurately handled. Recent studies suggest that, in DL, the implicit bias of the training process enables the model to overcome crucial overfitting problems [ 205 , 206 , 207 , 208 ]. Even so, it is still necessary to develop techniques that handle overfitting. The available DL algorithms that ease the overfitting problem can be categorized into three classes. The first class acts on both the model architecture and the model parameters, and includes the most familiar approaches, such as weight decay [ 209 ], batch normalization [ 210 ], and dropout [ 90 ]. Weight decay [ 209 ] is the default technique in DL, used extensively in almost all ML algorithms as a universal regularizer. The second class works on the model inputs, for example data corruption and data augmentation [ 150 , 211 ]. One cause of overfitting is the lack of training data, which makes the learned distribution diverge from the real distribution; data augmentation enlarges the training data, while marginalized data corruption improves the solution without explicitly augmenting the data. The final class works on the model output: a recently proposed technique penalizes over-confident outputs to regularize the model [ 178 ], and it has been shown to regularize both RNNs and CNNs.
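The first class of techniques can be illustrated with a few lines (assuming PyTorch; the dropout rate and the weight-decay coefficient are illustrative choices).

```python
import torch
import torch.nn as nn

# A small classifier combining batch normalization and dropout (model-side
# regularizers) ...
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 10),
)
# ... with weight decay (L2 regularization) applied through the optimizer.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-4)
```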

Vanishing gradient problem

In general, when using backpropagation and gradient-based learning techniques with ANNs, largely in the training stage, a problem called the vanishing gradient problem arises [ 212 , 213 , 214 ]. More specifically, in each training iteration, every weight of the neural network is updated in proportion to the partial derivative of the error function with respect to the current weight. However, this weight update may effectively stop in some cases due to a vanishingly small gradient, which in the worst case means that no further training is possible and the neural network stops learning completely. The sigmoid function, like some other activation functions, shrinks a large input space into a tiny output space; thus, the derivative of the sigmoid function is small, because a large variation at the input produces only a small variation at the output. In a shallow network, only a few layers use these activations, which is not a significant issue. However, when more layers are used, the gradient becomes very small during the training stage, and the network no longer works efficiently. The gradients of a neural network are determined using the back-propagation technique: it first determines the derivatives of each layer in the reverse direction, starting from the last layer and progressing back to the first layer, and then multiplies the derivatives of each layer down the network. For instance, when there are N hidden layers employing an activation function such as the sigmoid, N small derivatives are multiplied together; hence, the gradient declines exponentially while propagating back to the first layers. Because the gradient is small, the biases and weights of the first layers cannot be updated efficiently during the training stage. Moreover, this condition decreases the overall network accuracy, as these first layers are frequently critical to recognizing the essential elements of the input data. However, this problem can be avoided by employing activation functions that lack the squishing property, i.e. that do not squash the input space into a small range. The ReLU [ 91 ], which maps x to max(0, x), is the most popular selection, as it does not yield a small derivative for positive inputs. Another solution involves employing the batch normalization layer [ 81 ]. As mentioned earlier, the problem occurs once a large input space is squashed into a small one, causing the derivative to vanish. Batch normalization mitigates this issue by simply normalizing the input so that | x | does not reach the outer, saturated boundaries of the sigmoid function; the normalization keeps most inputs in the region where the derivative is large enough for further learning. Furthermore, faster hardware, e.g. GPUs, can also help, as it makes standard back-propagation feasible for many deeper layers of the network within an acceptable training time [ 215 ].
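The exponential decay of the back-propagated signal can be seen numerically in a few lines (plain NumPy; the depth and the input value are arbitrary illustrations).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)      # at most 0.25, so repeated products shrink fast

x = 1.0
n_layers = 20
# Product of N small derivatives, as when back-propagating through N sigmoid layers.
grad_sigmoid = np.prod([sigmoid_derivative(x)] * n_layers)
# ReLU's derivative is 1 for positive inputs, so the product does not shrink.
grad_relu = np.prod([1.0] * n_layers)

print(grad_sigmoid)  # ~7e-15: the signal has effectively vanished
print(grad_relu)     # 1.0
```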

Exploding gradient problem

The exploding gradient problem is the opposite of the vanishing gradient problem: large error gradients accumulate during back-propagation [ 216 , 217 , 218 ]. This leads to extremely large updates to the network weights, so the system becomes unstable and the model loses its ability to learn effectively. Roughly speaking, as back-propagation moves backward through the network, the gradient grows exponentially through the repeated multiplication of gradients. The weight values can thus become incredibly large and may overflow to a not-a-number (NaN) value. Some potential solutions include:

Using different weight regularization techniques.

Redesigning the architecture of the network model.
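The exponential growth described above can likewise be reproduced numerically (plain NumPy; the per-layer gradient factor of 1.5 is an arbitrary illustration).

```python
import numpy as np

# If each layer multiplies the back-propagated signal by a factor > 1,
# the gradient grows exponentially with depth and eventually overflows.
factor = 1.5
for depth in (10, 100, 1000, 10000):
    grad = np.prod(np.full(depth, factor, dtype=np.float64))
    print(depth, grad)
# depth 10    ->  ~5.8e1
# depth 100   ->  ~4.1e17
# depth 1000  ->  ~1.2e176
# depth 10000 ->  inf (overflow); subsequent arithmetic on inf yields NaN weights
```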

Underspecification

In 2020, a team of computer scientists at Google identified a new challenge called underspecification [ 219 ]. ML models, including DL models, often show surprisingly poor behavior when they are tested in real-world applications such as computer vision, medical imaging, natural language processing, and medical genomics. The reason for the weak performance is underspecification: it has been shown that small modifications can push a model towards a completely different solution and lead to different predictions in deployment domains. There are different techniques for addressing the underspecification issue. One of them is to design “stress tests” to examine how well a model works on real-world data and to uncover possible issues; however, this demands a reliable understanding of the ways in which the model can work inaccurately. The team stated that “Designing stress tests that are well-matched to applied requirements, and that provide good ‘coverage’ of potential failure modes is a major challenge”. Underspecification places major constraints on the credibility of ML predictions and may require some reconsideration of certain applications. Since ML is linked to humans through applications such as medical imaging and self-driving cars, this issue will require proper attention.

Applications of deep learning

Presently, various DL applications are widespread around the world. These applications include healthcare, social network analysis, audio and speech processing (such as recognition and enhancement), visual data processing (such as multimedia data analysis and computer vision), and NLP (translation and sentence classification), among others (Fig.  29 ) [ 220 , 221 , 222 , 223 , 224 ]. These applications have been classified into five categories: classification, localization, detection, segmentation, and registration. Although each of these tasks has its own target, there is fundamental overlap in the pipeline implementation of these applications, as shown in Fig.  30 . Classification is a concept that categorizes a set of data into classes. Detection is used to locate interesting objects in an image with consideration given to the background; in detection, multiple objects, which may be from dissimilar classes, are surrounded by bounding boxes. Localization is the concept used to locate an object, which is surrounded by a single bounding box. In segmentation (semantic segmentation), the edges of the target objects are surrounded by outlines, which also label them; registration, finally, refers to fitting one image (which could be 2D or 3D) onto another. One of the most important and wide-ranging DL application areas is healthcare [ 225 , 226 , 227 , 228 , 229 , 230 ]. This area of research is critical due to its relation to human lives, and DL has shown tremendous performance in healthcare. Therefore, we take DL applications in the medical image analysis field as an example to describe the DL applications.

figure 29

Examples of DL applications

figure 30

Workflow of deep learning tasks

Classification

Computer-Aided Diagnosis (CADx) is another title sometimes used for classification. Bharati et al. [ 231 ] used a chest X-ray dataset for detecting lung diseases based on a CNN. Another study attempted to read X-ray images by employing a CNN [ 232 ]; in this modality, the comparative accessibility of these images has likely enhanced the progress of DL. The authors of [ 233 ] used an improved pre-trained GoogLeNet CNN with more than 150,000 images for the training and testing processes; this dataset was augmented from 1850 chest X-rays. The creators classified the image orientation into lateral and frontal views and achieved approximately 100% accuracy. Although this orientation-classification task has limited clinical use on its own, as part of an ultimately fully automated diagnosis workflow it demonstrated the efficiency of data augmentation and pre-training in learning the metadata of relevant images. Chest infection, commonly referred to as pneumonia, is a commonly occurring health problem worldwide and is extremely treatable. Rajpurkar et al. [ 234 ] utilized CheXNet, an improved version of DenseNet [ 112 ] with 121 convolution layers, for classifying fourteen types of disease. These authors used the CheXNet14 dataset [ 235 ], which comprises 112,000 images. The network achieved excellent performance in recognizing the fourteen different diseases; in particular, pneumonia classification attained a 0.7632 AUC score using receiver operating characteristic (ROC) analysis. In addition, the network performed better than or equal to both a three-radiologist panel and four individual radiologists. Zuo et al. [ 236 ] adopted CNNs for candidate classification of lung nodules. Shen et al. [ 237 ] employed both Random Forest (RF) and SVM classifiers with CNNs to classify lung nodules. They employed three parallel CNNs, each with two convolutional layers. The LIDC-IDRI (Lung Image Database Consortium) dataset, which contains 1010 labeled CT lung scans, was used to classify the two types of lung nodules (malignant and benign). Every CNN extracted features from image patches at a different scale, and the output feature vector was constructed from the learned features. Next, these vectors were classified into malignant or benign using either the RF classifier or an SVM with a radial basis function (RBF) kernel. The model was robust to various levels of input noise and achieved an accuracy of 86% in nodule classification. The model of [ 238 ], in contrast, interpolates the image data missing between PET and MRI images using 3D CNNs. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) database, containing 830 PET and MRI patient scans, was utilized in their work. The 3D CNNs were trained with the PET and MRI images, first as input and then as output; for patients who had no PET images, the trained 3D CNNs were used to rebuild the PET images. These rebuilt images approximately fitted the actual disease recognition outcomes. However, this approach did not address overfitting, which in turn restricted the technique’s capacity for generalization. Diagnosing normal versus Alzheimer’s disease patients has been achieved by several CNN models [ 239 , 240 ]. Hosseini-Asl et al. [ 241 ] attained 99% accuracy, a state-of-the-art outcome, in diagnosing normal versus Alzheimer’s disease patients. These authors applied an auto-encoder architecture using 3D CNNs. The generic brain features were pre-trained on the CADDementia dataset.
Subsequently, the outcomes of these learned features became inputs to higher layers to differentiate between patient scans of Alzheimer’s disease, mild cognitive impairment, or normal brains based on the ADNI dataset, using fine-tuned deep supervision techniques. The VOXCNN and ResNet models developed by Korolev et al. [ 242 ] were based on the architectures of VGGNet and residual networks, respectively. They also discriminated between Alzheimer’s disease and normal patients using the ADNI database, with an accuracy of 79% for VoxCNN and 80% for ResNet. Compared with Hosseini-Asl’s work, both models achieved lower accuracies; conversely, as Korolev declared, the implementation of the algorithms was simpler and did not require hand-crafting of features. In 2020, Mehmood et al. [ 240 ] trained a newly developed CNN-based network called “SCNN” with MRI images for the task of Alzheimer’s disease classification, achieving a state-of-the-art accuracy of 99.05%.

Recently, CNNs have taken some medical imaging classification tasks to a new level, from traditional diagnosis to automated diagnosis with tremendous performance. Examples of these tasks are diabetic foot ulcer (DFU) classification (into normal and abnormal (DFU) classes) [ 87 , 243 , 244 , 245 , 246 ], sickle cell anemia (SCA) classification (into normal, abnormal (SCA), and other blood components) [ 86 , 247 ], breast cancer classification, in which hematoxylin–eosin-stained breast biopsy images are classified into four classes: invasive carcinoma, in-situ carcinoma, benign tumor, and normal tissue [ 42 , 88 , 248 , 249 , 250 , 251 , 252 ], and multi-class skin cancer classification [ 253 , 254 , 255 ].

In 2020, CNNs played a vital role in the early diagnosis of the novel coronavirus (COVID-19). CNNs have become the primary tool for automatic COVID-19 diagnosis in many hospitals around the world using chest X-ray images [ 256 , 257 , 258 , 259 , 260 ]. More details about medical imaging classification applications can be found in [ 226 , 261 , 262 , 263 , 264 , 265 ].

Localization

While applications in anatomy education may arise, the localization of normal anatomy is less likely to interest the practicing clinician; localization could, however, be applied in completely automatic end-to-end applications, in which radiological images are examined and described without human intervention [ 266 , 267 , 268 ]. Zhao et al. [ 269 ] introduced a new deep learning-based approach to localize pancreatic tumors in projection X-ray images for image-guided radiation therapy without the need for fiducials. Roth et al. [ 270 ] constructed and trained a CNN using five convolutional layers to classify around 4000 transverse-axial CT images into five categories: legs, pelvis, liver, lung, and neck. After data augmentation techniques were applied, they achieved an AUC score of 0.998, and the classification error rate of the model was 5.9%. For detecting the positions of the spleen, kidney, heart, and liver, Shin et al. [ 271 ] employed stacked auto-encoders on 78 contrast-enhanced MRI scans of the abdominal area containing the kidneys or liver; the hierarchical features were learned in the temporal and spatial domains. Depending on the organ, these approaches achieved detection accuracies of 62–79%. Sirazitdinov et al. [ 268 ] presented a combination of two convolutional neural networks, namely RetinaNet and Mask R-CNN, for pneumonia detection and localization.

Detection

Computer-Aided Detection (CADe) is another term used for detection. For both the clinician and the patient, overlooking a lesion on a scan may have dire consequences; thus, detection is a field of study requiring both accuracy and sensitivity [ 272 , 273 , 274 ]. Chouhan et al. [ 275 ] introduced an innovative deep learning framework for the detection of pneumonia by adopting the idea of transfer learning. Their approach obtained an accuracy of 96.4% with a recall of 99.62% on unseen data. In the area of COVID-19 and pulmonary disease, several convolutional neural network approaches have been proposed for automatic detection from X-ray images and have shown excellent performance [ 46 , 276 , 277 , 278 , 279 ].

In the area of skin cancer, several applications have been introduced for the detection task [ 280 , 281 , 282 ]. Thurnhofer-Hemsi et al. [ 283 ] introduced a deep learning approach for skin cancer detection by fine-tuning five state-of-the-art convolutional neural network models. They addressed the issue of a lack of training data by adopting transfer learning and data augmentation techniques. The DenseNet201 network showed superior results compared to the other models.

Another interesting area is that of histopathological images, which are progressively being digitized, and several papers have been published in this field [ 284 , 285 , 286 , 287 , 288 , 289 , 290 ]. Human pathologists read these images laboriously; they search for malignancy markers such as a high index of cell proliferation (using molecular markers, e.g. Ki-67), signs of cellular necrosis, abnormal cellular architecture, enlarged numbers of mitotic figures denoting augmented cell replication, and enlarged nucleus-to-cytoplasm ratios. Note that a histopathological slide may contain a huge number of cells (up to the thousands); thus, the risk of overlooking abnormal neoplastic regions is high when wading through these cells at excessive levels of magnification. Ciresan et al. [ 291 ] employed CNNs of 11–13 layers for identifying mitotic figures, using fifty breast histology images from the MITOS dataset; their technique attained recall and precision scores of 0.7 and 0.88, respectively. Sirinukunwattana et al. [ 292 ] utilized 100 histology images of colorectal adenocarcinoma to detect cell nuclei using CNNs, with roughly 30,000 nuclei hand-labeled for training purposes. The novelty of this approach was its use of a Spatially Constrained CNN, which detects the center of nuclei using the surrounding spatial context and spatial regression. Xu et al. [ 293 ], by contrast, employed a stacked sparse auto-encoder (SSAE) to identify nuclei in histological slides of breast cancer, achieving recall and precision scores of 0.83 and 0.89, respectively; this showed that unsupervised learning techniques can also be utilized effectively in this field. Albarquoni et al. [ 294 ] investigated the problem of insufficient labeling in medical images. They crowd-sourced the actual mitosis labeling in histology images of breast cancer (from amateurs online). The recurrent issue of inadequate labeling during the analysis of medical images can be addressed by feeding the crowd-sourced input labels into the CNN; this method signifies a remarkable proof-of-concept effort. In 2020, Lei et al. [ 285 ] introduced the employment of deep convolutional neural networks for the automatic identification of mitotic candidates from histological sections for mitosis screening. They obtained state-of-the-art detection results on the dataset of the International Pattern Recognition Conference (ICPR) 2012 Mitosis Detection Competition.

Segmentation

Although MRI and CT image segmentation research covers different organs such as knee cartilage, prostate, and liver, most research work has concentrated on brain segmentation, particularly of tumors [ 295 , 296 , 297 , 298 , 299 , 300 ]. This task is highly significant in surgical planning, where the precise tumor limits must be obtained for the smallest possible surgical resection: during surgery, excessive sacrifice of key brain regions may lead to neurological deficits including cognitive damage, emotionlessness, and limb difficulty. Conventionally, medical anatomical segmentation was done by hand; more specifically, the clinician draws out lines within the complete stack of the CT or MRI volume, slice by slice. It is therefore ideal to implement a solution that automates this painstaking work. Wadhwa et al. [ 301 ] presented a brief overview of brain tumor segmentation in MRI images. Akkus et al. [ 302 ] wrote a brilliant review of brain MRI segmentation that addressed the different metrics and CNN architectures employed. Moreover, they explained several competitions and their datasets in detail, including the Ischemic Stroke Lesion Segmentation (ISLES), Mild Traumatic Brain Injury Outcome Prediction (MTOP), and Brain Tumor Segmentation (BRATS) challenges.

Chen et al. [ 299 ] proposed convolutional neural networks for precise brain tumor segmentation. Their approach combines several techniques for better feature learning, including the DeepMedic model, a novel dual-force training scheme, a label distribution-based loss function, and Multi-Layer Perceptron-based post-processing. They evaluated their method on the two most recent brain tumor segmentation datasets, i.e. the BRATS 2017 and BRATS 2015 datasets. Hu et al. [ 300 ] introduced a brain tumor segmentation method that adopts a multi-cascaded convolutional neural network (MCCNN) and fully connected conditional random fields (CRFs). The achieved results were excellent compared with the state-of-the-art methods.

Moeskops et al. [ 303 ] employed three parallel-running CNNs, each with a 2D input patch of a different size, for segmenting and classifying MRI brain images. These images, from 35 adults and 22 pre-term infants, were classified into various tissue categories such as cerebrospinal fluid, grey matter, and white matter. The benefit of employing three different input patch sizes is that each patch concentrates on capturing different image aspects: the bigger sizes incorporate the spatial features, while the smallest patch sizes concentrate on the local textures. Overall, the algorithm achieved satisfactory accuracy, with Dice coefficients in the range of 0.82–0.87. Although 2D image slices are employed in the majority of segmentation research, Milletari et al. [ 304 ] implemented a 3D CNN for segmenting MRI prostate images. They used the PROMISE2012 challenge dataset, from which fifty MRI scans were used for training and thirty for testing. Their V-Net was inspired by the U-Net architecture of Ronneberger et al. [ 305 ], and it attained a Dice coefficient score of 0.869, the same as the winning teams in the competition. To reduce overfitting and create a deeper model with eleven convolutional layers, Pereira et al. [ 306 ] deliberately applied small-sized filters of 3 × 3. Their model used MRI scans of 274 gliomas (a type of brain tumor) for training; it achieved first place in the 2013 BRATS challenge, as well as second place in the 2015 BRATS challenge. Havaei et al. [ 307 ] also considered gliomas using the 2013 BRATS dataset and investigated different 2D CNN architectures. Compared with the winner of BRATS 2013, their algorithm worked better, as it required only 3 min to execute rather than 100 min. Their model is based on the concept of a cascaded architecture and is therefore referred to as InputCascadeCNN. Chen et al. [ 308 ] introduced techniques employing fully connected Conditional Random Fields (CRFs), atrous spatial pyramid pooling, and up-sampled filters; these authors aimed to enhance the localization accuracy and enlarge the field of view of every filter at multiple scales. Their model, DeepLab, attained 79.7% mIOU (mean Intersection Over Union) and obtained excellent performance on the PASCAL VOC-2012 image segmentation benchmark.

Recently, automatic segmentation of COVID-19 lung infection from CT images, performed with several deep learning techniques, has helped to detect and monitor the development of COVID-19 infection [309, 310, 311, 312].

Registration

Usually, given two input images, the four main stages of the canonical procedure of the image registration task are [ 313 , 314 ]:

Target Selection: it determines the reference (fixed) input image to which the second (moving) input image must be accurately superimposed.

Feature Extraction: it extracts a set of features from each input image.

Feature Matching: it finds correspondences between the previously extracted features.

Pose Optimization: it aims to minimize the distance (misalignment) between both input images.

The result of the registration procedure is then the geometric transformation (e.g., translation, rotation, scaling) that brings both input images into the same coordinate system such that the distance between them is minimal, i.e., their level of superimposition/overlap is optimal. An extensive review of this topic is beyond the scope of this work; nevertheless, a short summary is introduced next.
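As an illustration of the four canonical stages above, the following is a minimal sketch of feature-based 2D registration using OpenCV (ORB keypoints, brute-force Hamming matching, and RANSAC homography estimation). The file names and parameter values are illustrative assumptions, not taken from the cited works, and real medical registration pipelines typically use more elaborate similarity measures and transformation models.

```python
import cv2
import numpy as np

# Target selection: 'fixed' is the reference image; 'moving' must be aligned to it.
fixed = cv2.imread("fixed.png", cv2.IMREAD_GRAYSCALE)    # hypothetical file names
moving = cv2.imread("moving.png", cv2.IMREAD_GRAYSCALE)

# Feature extraction: ORB keypoints and binary descriptors for both images.
orb = cv2.ORB_create(nfeatures=2000)
kp_f, des_f = orb.detectAndCompute(fixed, None)
kp_m, des_m = orb.detectAndCompute(moving, None)

# Feature matching: brute-force Hamming matching with cross-checking.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_m, des_f), key=lambda m: m.distance)

# Pose optimization: estimate the geometric transformation (here a homography)
# that minimizes the misalignment between matched points, robust to outliers via RANSAC.
src = np.float32([kp_m[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp_f[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
H, inliers = cv2.findHomography(src, dst, cv2.RANSAC, ransacReprojThreshold=3.0)

# Bring both images into the same coordinate system.
registered = cv2.warpPerspective(moving, H, (fixed.shape[1], fixed.shape[0]))
```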

Commonly, the input images for a DL-based registration approach may come in various forms, e.g., point clouds, voxel grids, and meshes. Additionally, some techniques accept as inputs the results of the Feature Extraction or Matching steps of the canonical scheme. In other words, the input could be raw data in a particular form as well as the intermediate results of the classical pipeline (feature vector, matching vector, and transformation). Furthermore, with the newest DL-based methods, a novel conceptual type of ecosystem arises. It contains learned characteristics about the target, the materials, and their behavior, which can be registered with the input data. Such a conceptual ecosystem is formed by a neural network and its training procedure, and it can be considered an input to the registration approach. Nevertheless, it is not an input that one can adopt in every registration situation, since it corresponds to an internal data representation.

From a DL viewpoint, this conceptual design allows the input data of a registration approach to be divided into defined and non-defined models. In particular, defined models depict particular spatial data (e.g., 2D or 3D), while a non-defined model is a generalization of a dataset created by a learning system. Yumer et al. [315] developed a framework in which the model acquires characteristics of objects, meaning it learns what a sportier car or a more comfortable chair looks like, and adjusts a 3D model to fit those characteristics while maintaining the main characteristics of the original data. Likewise, a fundamental aspect of the unsupervised learning method introduced by Ding et al. [316] is that there is no explicit target for the registration approach. In this instance, the network is able to place each input point cloud in a global space, solving SLAM problems in which many point clouds have to be rigidly registered. Mahadevan [317], in turn, proposed combining two conceptual models through the development of Imagination Machines to produce flexible artificial intelligence systems and relationships between the learned phases via training schemes that are not based on labels and classifications. Another practical application of DL, especially CNNs, to image registration is the 3D reconstruction of objects. Wang et al. [318] applied an adversarial approach using CNNs to reconstruct a 3D model of an object from its 2D image; the network learns many objects and automatically accomplishes the registration between the image and the conceptual model. Similarly, Hermoza et al. [319] also utilized a GAN to predict the missing geometry of damaged archaeological objects, providing the reconstructed object in a voxel grid format together with a label selecting its class.

DL for medical image registration has numerous applications, which have been listed in several review papers [320, 321, 322]. Yang et al. [323] implemented stacked convolutional layers as an encoder–decoder to predict the deformation that morphs the input pixels into their final configuration, using MRI brain scans from the OASIS dataset. They employed a registration model known as Large Deformation Diffeomorphic Metric Mapping (LDDMM) and attained remarkable improvements in computation time. Miao et al. [324] used synthetic X-ray images to train a five-layer CNN to register 3D models of a trans-esophageal probe, a hand implant, and a knee implant onto 2D X-ray images for pose estimation. They reported that their model achieved an execution time of 0.1 s, an important improvement over conventional intensity-based registration techniques; moreover, it achieved successful registrations 79–99% of the time. Li et al. [325] introduced a neural network-based approach for the non-rigid 2D–3D registration of the lateral cephalogram and volumetric cone-beam CT (CBCT) images.
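To give a flavor of how a DL-based registration model can be set up, the sketch below (PyTorch) trains a tiny CNN to predict a dense 2D displacement field and warps the moving image with bilinear sampling, in the spirit of unsupervised deformable registration. This is a simplified illustration under stated assumptions, not the actual architecture of Yang et al., Miao et al., or Li et al.; the network size, loss weights, and random placeholder images are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRegNet(nn.Module):
    """Predicts a dense 2D displacement field from a (fixed, moving) image pair."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),  # 2 output channels: (dx, dy) per pixel
        )

    def forward(self, fixed, moving):
        return self.net(torch.cat([fixed, moving], dim=1))

def warp(moving, flow):
    """Warp 'moving' by the displacement field (assumed in normalized [-1, 1] coordinates)."""
    n, _, h, w = moving.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    grid = base + flow.permute(0, 2, 3, 1)
    return F.grid_sample(moving, grid, align_corners=True)

# One unsupervised training step: minimize intensity difference after warping,
# plus a small penalty on the displacement magnitude to regularize the field.
model = TinyRegNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
fixed, moving = torch.rand(1, 1, 64, 64), torch.rand(1, 1, 64, 64)  # placeholder "MRI slices"
flow = model(fixed, moving)
loss = F.mse_loss(warp(moving, flow), fixed) + 0.01 * flow.abs().mean()
opt.zero_grad(); loss.backward(); opt.step()
```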

Computational approaches

For computationally intensive applications, complex ML and DL approaches have rapidly been identified as the most significant techniques and are widely used in different fields. Improved algorithms, combined with well-behaved computational performance and large datasets, now make it possible to effectively execute several applications that were previously impossible or difficult to consider.

Currently, several standard DNN configurations are available. The interconnection patterns between layers and the total number of layers represent the main differences between these configurations. Table 2 illustrates the growth rate of the overall number of layers over time, which appears to be far faster than the "Moore's Law growth rate": in typical DNNs, the number of layers grew by around 2.3× each year in the period from 2012 to 2016. Recent investigations of future ResNet versions reveal that the number of layers can be extended up to 1000. Typically, an SGD technique is employed to fit the weights (or parameters), while different optimization techniques are employed to update the parameters during the DNN training process. Repeated updates are required to enhance network accuracy, with only a small rate of improvement per update. For example, training on a large dataset such as ImageNet, which contains more than 14 million images, with ResNet as the network model takes around 30K to 40K iterations to converge to a steady solution. In addition, as an upper-level estimate, the overall computational load may exceed \(10^{20}\) FLOPs when both the training set size and the DNN complexity increase (a back-of-the-envelope sketch is given below).
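The following back-of-the-envelope sketch shows how such an overall training-compute figure can be estimated; the per-image FLOPs and batch size are illustrative assumptions, while the iteration count follows the 30K–40K range mentioned above.

```python
# Back-of-the-envelope training-compute estimate (all numbers are illustrative assumptions).
flops_per_image = 8e9      # assumed forward + backward FLOPs for one image through a large CNN
batch_size = 256           # assumed mini-batch size
iterations = 40_000        # upper end of the 30K-40K iteration range mentioned above

total_flops = flops_per_image * batch_size * iterations
print(f"~{total_flops:.1e} FLOPs per training run")  # ~8.2e16 with these assumptions;
# larger models, bigger inputs, and more epochs push this toward the 10^20 figure quoted above.
```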

Prior to 2008, boosting the training to a satisfactory extent was achieved by using GPUs. Even with GPU support, days or weeks are usually needed for a training session. Consequently, several optimization strategies have been developed to reduce this extensive learning time. The computational requirements are expected to keep increasing as DNNs continue to grow in both complexity and size.

In addition to the computational load, memory bandwidth and capacity have a significant effect on overall training performance and, to a lesser extent, on inference. More specifically, in convolutional layers the parameters are shared across the input data, a sizeable amount of data is reused, and the computation exhibits a high computation-to-bandwidth ratio. By contrast, in the fully connected (FC) layers there is no parameter sharing, the amount of reused data is extremely small, and the computation-to-bandwidth ratio is extremely low. Table 3 presents a comparison between different aspects of the devices; the table is intended to familiarize the reader with the tradeoffs involved in configuring a system based on FPGA, GPU, or CPU devices. It should be noted that each has corresponding weaknesses and strengths; accordingly, there is no clear one-size-fits-all solution.

Although GPU processing has enhanced the ability to address the computational challenges of such networks, the maximum GPU (or CPU) performance is rarely achieved, and several techniques or models turn out to be strongly bandwidth-bound. In the worst cases, GPU efficiency is between 15 and 20% of the maximum theoretical performance. Addressing this issue requires enlarging the memory bandwidth using high-bandwidth stacked memory. Next, different approaches based on FPGA, GPU, and CPU devices are detailed accordingly.

CPU-based approach

CPU nodes usually offer robust network connectivity, storage capacity, and large memory. Although CPU nodes are more general-purpose than FPGA or GPU nodes, they cannot match them in raw computation capability; competing with them requires increased network capability and a larger memory capacity.

GPU-based approach

GPUs are extremely effective for several basic DL primitives, which include highly parallel operations such as activation functions, matrix multiplication, and convolutions [326, 327, 328, 329, 330]. Incorporating HBM stacked memory into the latest GPU models significantly enhances the bandwidth. This enhancement allows numerous primitives to efficiently utilize all of the available GPU computational resources. The improvement of GPU performance over CPU performance is usually 10–20:1 for dense linear algebra operations.

Maximizing parallel processing is the basis of the initial GPU programming model. For example, a GPU model may involve up to sixty-four computational units, with four SIMD engines per computational unit and sixteen floating-point computation lanes per SIMD engine. The peak performance is 25 TFLOPS (fp16) and 10 TFLOPS (fp32) as utilization approaches 100% (a worked example of this arithmetic is sketched below). Additional GPU performance may be achieved when the vector multiply–add functions are combined with inner-product instructions for matrix-operation primitives.
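As a worked example of where such peak figures come from, the sketch below multiplies out the organization described above (64 computational units × 4 SIMD engines × 16 lanes, with a fused multiply–add counted as two operations); the clock frequency is an assumption, since it is not stated in the text.

```python
# Peak-throughput arithmetic for the GPU organization described above.
compute_units = 64       # computational units
simd_per_cu = 4          # SIMD engines per computational unit
lanes_per_simd = 16      # floating-point lanes per SIMD engine
ops_per_lane = 2         # a fused multiply-add counts as two floating-point operations
clock_ghz = 1.5          # assumed clock frequency (not stated in the text)

fp32_tflops = compute_units * simd_per_cu * lanes_per_simd * ops_per_lane * clock_ghz / 1e3
fp16_tflops = 2 * fp32_tflops  # packed fp16 typically doubles the rate
print(fp32_tflops, fp16_tflops)  # ~12.3 and ~24.6 TFLOPS, in line with the 10/25 TFLOPS quoted above
```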

For DNN training, the GPU is usually considered to be an optimized design, while for inference operations, it may also offer considerable performance improvements.

FPGA-based approach

FPGAs are widely utilized in various tasks, including deep learning [199, 247, 331, 332, 333, 334]. Inference accelerators are commonly implemented on FPGAs. The FPGA can be effectively configured to reduce the unnecessary or overhead functions involved in GPU systems. Compared to the GPU, the FPGA has weaker floating-point performance and is largely restricted to integer inference. The main FPGA advantage is the capability to dynamically reconfigure the array characteristics (at run-time), as well as the capability to configure the array through an effective design with little or no overhead.

As mentioned earlier, the FPGA offers better performance and latency per watt than the GPU and CPU in DL inference operations. Implementation of custom high-performance hardware, pruned networks, and reduced arithmetic precision are the three factors that enable the FPGA to run DL algorithms at this level of efficiency. In addition, FPGAs may be employed to implement CNN overlay engines with over 80% efficiency, eight-bit precision, and over 15 TOPs peak performance for a few conventional CNNs, as Xilinx and its partners recently demonstrated. By contrast, pruning techniques are mostly employed in the LSTM context, where model sizes can be efficiently reduced by up to 20×, providing an important benefit during the implementation of the optimal solution, as demonstrated for MLP neural processing. A recent study in the field of implementing fixed-point and custom floating-point precision has revealed that going below 8-bit precision is extremely promising; moreover, it supplies additional advances toward peak-performance FPGA implementations of DNN models.
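To illustrate the reduced-precision idea in the simplest possible form, the following NumPy sketch performs symmetric post-training quantization of a weight tensor to 8-bit integers; it is a generic illustration, not the Xilinx overlay engine or any specific FPGA toolflow, and the weight matrix is a hypothetical example.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization of a weight tensor to signed 8-bit integers."""
    scale = np.abs(weights).max() / 127.0  # one scale per tensor (per-channel scales are also common)
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(256, 128).astype(np.float32)  # hypothetical FC-layer weight matrix
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).mean()
print(f"int8 storage: {q.nbytes} bytes (vs {w.nbytes} bytes fp32), mean abs error {err:.4f}")
```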

Evaluation metrics

Evaluation metrics adopted within DL tasks play a crucial role in obtaining an optimized classifier [335]. They are utilized within a typical data classification procedure through two main stages: training and testing. During the training stage, the evaluation metric is utilized to optimize the classification algorithm; that is, it acts as a discriminator to select the optimized solution, e.g., one that can generate a more accurate forecast of upcoming evaluations for a specific classifier. During the testing stage, the evaluation metric is utilized as an evaluator to measure the efficiency of the created classifier on held-out (unseen) data. As used in Eq. 20, TN and TP are defined as the number of negative and positive instances, respectively, that are correctly classified, while FN and FP are defined as the number of misclassified positive and negative instances, respectively. Next, some of the most well-known evaluation metrics are listed below.

Accuracy: Calculates the ratio of correct predicted classes to the total number of samples evaluated (Eq. 20 ).

Sensitivity or Recall: Utilized to calculate the fraction of positive patterns that are correctly classified (Eq. 21 ).

Specificity: Utilized to calculate the fraction of negative patterns that are correctly classified (Eq. 22 ).

Precision: Utilized to calculate the fraction of positive patterns that are correctly predicted among all patterns predicted as positive (Eq. 23).

F1-Score: Calculates the harmonic average between recall and precision rates (Eq. 24 ).

J Score: This metric is also called Youden's J statistic. Eq. 25 represents the metric.

False Positive Rate (FPR): This metric refers to the probability of a false alarm, as calculated in Eq. 26.

Area Under the ROC Curve: AUC is a common ranking-type metric. It is utilized to conduct comparisons between learning algorithms [336, 337, 338], as well as to construct an optimal learning model [339, 340]. In contrast to probability and threshold metrics, the AUC value exposes the entire ranking performance of the classifier. The following formula is used to calculate the AUC value for the two-class problem [341] (Eq. 27).

Here, \(S_{p}\) represents the sum of the ranks of all positive samples, and the numbers of negative and positive samples are denoted \(n_{n}\) and \(n_{p}\), respectively. Compared to the accuracy metric, the AUC value has been verified both empirically and theoretically to be very helpful for identifying an optimized solution and evaluating classifier performance during classification training.

When considering the discrimination and evaluation processes, the AUC performs very well. However, for multiclass problems, computing the AUC becomes costly when a large number of generated solutions must be discriminated. In addition, the time complexity of computing the AUC is \(O \left( |C|^{2} \; n\log n\right) \) with respect to the Hand and Till AUC model [341] and \(O \left( |C| \; n\log n\right) \) according to Provost and Domingo's AUC model [336]. A minimal computational sketch of the metrics above is given below.
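The following minimal Python sketch (not taken from the cited works) computes the threshold metrics listed above from confusion-matrix counts, and the two-class AUC using the rank-sum form corresponding to Eq. 27; ties between scores are not handled in this sketch.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Threshold metrics from the confusion-matrix counts defined above (Eqs. 20-26)."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)                # Eq. 20
    recall      = tp / (tp + fn)                                 # sensitivity / recall, Eq. 21
    specificity = tn / (tn + fp)                                 # Eq. 22
    precision   = tp / (tp + fp)                                 # Eq. 23
    f1          = 2 * precision * recall / (precision + recall)  # Eq. 24
    j_score     = recall + specificity - 1                       # Youden's J, Eq. 25
    fpr         = fp / (fp + tn)                                 # Eq. 26
    return dict(accuracy=accuracy, recall=recall, specificity=specificity,
                precision=precision, f1=f1, j_score=j_score, fpr=fpr)

def auc_two_class(scores_pos, scores_neg):
    """Rank-based two-class AUC: S_p is the sum of the ranks of the positive samples (Eq. 27)."""
    ranked = sorted(scores_pos + scores_neg)
    s_p = sum(ranked.index(s) + 1 for s in scores_pos)  # 1-based ranks; ties are not handled here
    n_p, n_n = len(scores_pos), len(scores_neg)
    return (s_p - n_p * (n_p + 1) / 2) / (n_p * n_n)

print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
print(auc_two_class([0.9, 0.8, 0.7], [0.6, 0.4, 0.2]))  # 1.0: every positive outranks every negative
```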

Frameworks and datasets

Several DL frameworks and datasets have been developed over the last few years. Various frameworks and libraries have been used to expedite this work and have delivered good results; through their use, the training process has become easier. Table 4 lists the most widely utilized frameworks and libraries.

Based on star ratings on GitHub, as well as our own background in the field, TensorFlow is deemed the most effective and easiest to use. It also has the ability to work on several platforms. (GitHub is one of the largest software hosting sites, and GitHub stars indicate how well-regarded a project is on the site.) Moreover, several other benchmark datasets are employed for different DL tasks; some of these are listed in Table 5.
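To show how such a framework is used in practice, here is a minimal, self-contained TensorFlow/Keras sketch that trains a small CNN classifier; MNIST is used purely as an illustrative benchmark, and the hyper-parameter values are assumptions.

```python
import tensorflow as tf

# Load an illustrative benchmark dataset and scale pixel values to [0, 1].
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

# A small CNN: two conv/pool stages followed by a dense softmax classifier.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, batch_size=128,
          validation_data=(x_test, y_test))
```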

Summary and conclusion

Finally, a brief discussion is required to gather the relevant findings provided throughout this extensive review. Next, an itemized analysis is presented in order to conclude our review and outline future directions.

DL still experiences difficulties in simultaneously modeling multiple complex data modalities. Multimodal DL has therefore become a common approach in recent DL developments.

DL requires sizeable datasets (labeled data preferred) to predict unseen data and to train the models. This challenge turns out to be particularly difficult when real-time data processing is required or when the provided datasets are limited (such as in the case of healthcare data). To alleviate this issue, TL and data augmentation have been researched over the last few years.

Although ML is slowly transitioning to semi-supervised and unsupervised learning to handle practical data without the need for manual human labeling, many of the current deep learning models still rely on supervised learning.

The CNN performance is greatly influenced by hyper-parameter selection. Any small change in the hyper-parameter values will affect the general CNN performance. Therefore, careful parameter selection is an extremely significant issue that should be considered during optimization scheme development.

Powerful and robust hardware resources such as GPUs are required for effective CNN training. Moreover, they are also required for exploring the efficiency of CNNs in smart and embedded systems.

In the CNN context, ensemble learning [342, 343] represents a prospective research area. Combining multiple and diverse architectures can improve a model's generalizability across different image categories by extracting several levels of semantic image representation (a minimal prediction-averaging sketch is given below). Similarly, ideas such as new activation functions, dropout, and batch normalization also merit further investigation.
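As a minimal illustration of prediction-level ensembling, the sketch below averages the softmax outputs of several independently trained Keras classifiers; the `models` list and `x_test` array are assumed to exist and are hypothetical.

```python
import numpy as np

def ensemble_predict(models, x_test):
    """Soft-voting ensemble: average the softmax outputs of several trained Keras classifiers."""
    probs = np.mean([m.predict(x_test, verbose=0) for m in models], axis=0)
    return probs.argmax(axis=1)  # predicted class per sample
```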

The exploitation of depth and of different structural adaptations significantly improves the learning capacity of CNNs. Substituting blocks for the traditional layer configuration results in significant advances in CNN performance, as has been shown in the recent literature. Currently, developing novel and efficient block architectures is the main trend in new research on CNN architectures. HRNet is only one example showing that there are always ways to improve the architecture.

It is expected that cloud-based platforms will play an essential role in the future development of computational DL applications. Utilizing cloud computing offers a solution to handling the enormous amount of data. It also helps to increase efficiency and reduce costs. Furthermore, it offers the flexibility to train DL architectures.

With the recent development of computational tools, including dedicated neural-network chips and mobile GPUs, we will see more DL applications on mobile devices, making DL easier for users to apply.

Regarding the lack of training data, it is expected that various transfer learning techniques will be considered, such as training a DL model on large unlabeled image datasets and then transferring that knowledge to train the DL model on a small number of labeled images for the same task (a minimal fine-tuning sketch is given below).
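The sketch below illustrates the transfer step of that idea in its most common supervised form (an ImageNet-pretrained backbone fine-tuned on a small labeled set) rather than the self-supervised pretraining variant described above; the backbone choice, the 5-class head, and the `small_labeled_dataset` object are all hypothetical assumptions.

```python
import tensorflow as tf

# Reuse an ImageNet-pretrained backbone and train only a new classification head.
base = tf.keras.applications.MobileNetV2(include_top=False, weights="imagenet",
                                         input_shape=(224, 224, 3), pooling="avg")
base.trainable = False  # freeze the transferred representation

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="softmax"),  # assumed 5-class target task
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(small_labeled_dataset, epochs=10)  # 'small_labeled_dataset' is hypothetical
```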

Last, this overview provides a starting point for the community interested in the field of DL. Furthermore, it should help researchers decide the most suitable direction of work to take in order to provide more accurate alternatives to the field.

Availability of data and materials

Not applicable.

Rozenwald MB, Galitsyna AA, Sapunov GV, Khrameeva EE, Gelfand MS. A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features. PeerJ Comput Sci. 2020;6:307.


Amrit C, Paauw T, Aly R, Lavric M. Identifying child abuse through text mining and machine learning. Expert Syst Appl. 2017;88:402–18.

Hossain E, Khan I, Un-Noor F, Sikander SS, Sunny MSH. Application of big data and machine learning in smart grid, and associated security concerns: a review. IEEE Access. 2019;7:13960–88.

Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H. Survey of review spam detection using machine learning techniques. J Big Data. 2015;2(1):23.

Deldjoo Y, Elahi M, Cremonesi P, Garzotto F, Piazzolla P, Quadrana M. Content-based video recommendation system based on stylistic visual features. J Data Semant. 2016;5(2):99–113.

Al-Dulaimi K, Chandran V, Nguyen K, Banks J, Tomeo-Reyes I. Benchmarking hep-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape. Pattern Recogn Lett. 2019;125:534–41.

Liu W, Wang Z, Liu X, Zeng N, Liu Y, Alsaadi FE. A survey of deep neural network architectures and their applications. Neurocomputing. 2017;234:11–26.

Pouyanfar S, Sadiq S, Yan Y, Tian H, Tao Y, Reyes MP, Shyu ML, Chen SC, Iyengar S. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv (CSUR). 2018;51(5):1–36.

Alom MZ, Taha TM, Yakopcic C, Westberg S, Sidike P, Nasrin MS, Hasan M, Van Essen BC, Awwal AA, Asari VK. A state-of-the-art survey on deep learning theory and architectures. Electronics. 2019;8(3):292.

Potok TE, Schuman C, Young S, Patton R, Spedalieri F, Liu J, Yao KT, Rose G, Chakma G. A study of complex deep learning networks on high-performance, neuromorphic, and quantum computers. ACM J Emerg Technol Comput Syst (JETC). 2018;14(2):1–21.

Adeel A, Gogate M, Hussain A. Contextual deep learning-based audio-visual switching for speech enhancement in real-world environments. Inf Fusion. 2020;59:163–70.

Tian H, Chen SC, Shyu ML. Evolutionary programming based deep learning feature selection and network construction for visual data classification. Inf Syst Front. 2020;22(5):1053–66.

Young T, Hazarika D, Poria S, Cambria E. Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag. 2018;13(3):55–75.

Koppe G, Meyer-Lindenberg A, Durstewitz D. Deep learning for small and big data in psychiatry. Neuropsychopharmacology. 2021;46(1):176–90.

Dalal N, Triggs B. Histograms of oriented gradients for human detection. In: 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR’05), vol. 1. IEEE; 2005. p. 886–93.

Lowe DG. Object recognition from local scale-invariant features. In: Proceedings of the seventh IEEE international conference on computer vision, vol. 2. IEEE; 1999. p. 1150–7.

Wu L, Hoi SC, Yu N. Semantics-preserving bag-of-words models and applications. IEEE Trans Image Process. 2010;19(7):1908–20.


LeCun Y, Bengio Y, Hinton G. Deep learning. Nature. 2015;521(7553):436–44.

Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition. Pattern Recogn Lett. 2019;118:14–22.

Dhillon A, Verma GK. Convolutional neural network: a review of models, methodologies and applications to object detection. Prog Artif Intell. 2020;9(2):85–112.

Khan A, Sohail A, Zahoora U, Qureshi AS. A survey of the recent architectures of deep convolutional neural networks. Artif Intell Rev. 2020;53(8):5455–516.

Hasan RI, Yusuf SM, Alzubaidi L. Review of the state of the art of deep learning for plant diseases: a broad analysis and discussion. Plants. 2020;9(10):1302.

Xiao Y, Tian Z, Yu J, Zhang Y, Liu S, Du S, Lan X. A review of object detection based on deep learning. Multimed Tools Appl. 2020;79(33):23729–91.

Ker J, Wang L, Rao J, Lim T. Deep learning applications in medical image analysis. IEEE Access. 2017;6:9375–89.

Zhang Z, Cui P, Zhu W. Deep learning on graphs: a survey. IEEE Trans Knowl Data Eng. 2020. https://doi.org/10.1109/TKDE.2020.2981333 .

Shrestha A, Mahmood A. Review of deep learning algorithms and architectures. IEEE Access. 2019;7:53040–65.

Najafabadi MM, Villanustre F, Khoshgoftaar TM, Seliya N, Wald R, Muharemagic E. Deep learning applications and challenges in big data analytics. J Big Data. 2015;2(1):1.

Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning, vol. 1. Cambridge: MIT press; 2016.


Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for COVID-19. J Big Data. 2021;8(1):1–54.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.

Bhowmick S, Nagarajaiah S, Veeraraghavan A. Vision and deep learning-based algorithms to detect and quantify cracks on concrete surfaces from uav videos. Sensors. 2020;20(21):6299.

Goh GB, Hodas NO, Vishnu A. Deep learning for computational chemistry. J Comput Chem. 2017;38(16):1291–307.

Li Y, Zhang T, Sun S, Gao X. Accelerating flash calculation through deep learning methods. J Comput Phys. 2019;394:153–65.

Yang W, Zhang X, Tian Y, Wang W, Xue JH, Liao Q. Deep learning for single image super-resolution: a brief review. IEEE Trans Multimed. 2019;21(12):3106–21.

Tang J, Li S, Liu P. A review of lane detection methods based on deep learning. Pattern Recogn. 2020;111:107623.

Zhao ZQ, Zheng P, Xu ST, Wu X. Object detection with deep learning: a review. IEEE Trans Neural Netw Learn Syst. 2019;30(11):3212–32.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 770–8.

Ng A. Machine learning yearning: technical strategy for AI engineers in the era of deep learning. 2019. https://www.mlyearning.org .

Metz C. Turing award won by 3 pioneers in artificial intelligence. The New York Times. 2019;27.

Nevo S, Anisimov V, Elidan G, El-Yaniv R, Giencke P, Gigi Y, Hassidim A, Moshe Z, Schlesinger M, Shalev G, et al. Ml for flood forecasting at scale; 2019. arXiv preprint arXiv:1901.09583 .

Chen H, Engkvist O, Wang Y, Olivecrona M, Blaschke T. The rise of deep learning in drug discovery. Drug Discov Today. 2018;23(6):1241–50.

Benhammou Y, Achchab B, Herrera F, Tabik S. Breakhis based breast cancer automatic diagnosis using deep learning: taxonomy, survey and insights. Neurocomputing. 2020;375:9–24.

Wulczyn E, Steiner DF, Xu Z, Sadhwani A, Wang H, Flament-Auvigne I, Mermel CH, Chen PHC, Liu Y, Stumpe MC. Deep learning-based survival prediction for multiple cancer types using histopathology images. PLoS ONE. 2020;15(6):e0233678.

Nagpal K, Foote D, Liu Y, Chen PHC, Wulczyn E, Tan F, Olson N, Smith JL, Mohtashamian A, Wren JH, et al. Development and validation of a deep learning algorithm for improving Gleason scoring of prostate cancer. NPJ Digit Med. 2019;2(1):1–10.


Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115–8.

Brunese L, Mercaldo F, Reginelli A, Santone A. Explainable deep learning for pulmonary disease and coronavirus COVID-19 detection from X-rays. Comput Methods Programs Biomed. 2020;196:105608.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and COVID-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

Shorfuzzaman M, Hossain MS. Metacovid: a siamese neural network framework with contrastive loss for n-shot diagnosis of COVID-19 patients. Pattern Recogn. 2020;113:107700.

Carvelli L, Olesen AN, Brink-Kjær A, Leary EB, Peppard PE, Mignot E, Sørensen HB, Jennum P. Design of a deep learning model for automatic scoring of periodic and non-periodic leg movements during sleep validated against multiple human experts. Sleep Med. 2020;69:109–19.

De Fauw J, Ledsam JR, Romera-Paredes B, Nikolov S, Tomasev N, Blackwell S, Askham H, Glorot X, O’Donoghue B, Visentin D, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med. 2018;24(9):1342–50.

Topol EJ. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44–56.

Kermany DS, Goldbaum M, Cai W, Valentim CC, Liang H, Baxter SL, McKeown A, Yang G, Wu X, Yan F, et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell. 2018;172(5):1122–31.

Van Essen B, Kim H, Pearce R, Boakye K, Chen B. Lbann: livermore big artificial neural network HPC toolkit. In: Proceedings of the workshop on machine learning in high-performance computing environments; 2015. p. 1–6.

Saeed MM, Al Aghbari Z, Alsharidah M. Big data clustering techniques based on spark: a literature review. PeerJ Comput Sci. 2020;6:321.

Mnih V, Kavukcuoglu K, Silver D, Rusu AA, Veness J, Bellemare MG, Graves A, Riedmiller M, Fidjeland AK, Ostrovski G, et al. Human-level control through deep reinforcement learning. Nature. 2015;518(7540):529–33.

Arulkumaran K, Deisenroth MP, Brundage M, Bharath AA. Deep reinforcement learning: a brief survey. IEEE Signal Process Mag. 2017;34(6):26–38.

Socher R, Perelygin A, Wu J, Chuang J, Manning CD, Ng AY, Potts C. Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 conference on empirical methods in natural language processing; 2013. p. 1631–42.

Goller C, Kuchler A. Learning task-dependent distributed representations by backpropagation through structure. In: Proceedings of international conference on neural networks (ICNN’96), vol 1. IEEE; 1996. p. 347–52.

Socher R, Lin CCY, Ng AY, Manning CD. Parsing natural scenes and natural language with recursive neural networks. In: ICML; 2011.

Louppe G, Cho K, Becot C, Cranmer K. QCD-aware recursive neural networks for jet physics. J High Energy Phys. 2019;2019(1):57.

Sadr H, Pedram MM, Teshnehlab M. A robust sentiment analysis method based on sequential combination of convolutional and recursive neural networks. Neural Process Lett. 2019;50(3):2745–61.

Urban G, Subrahmanya N, Baldi P. Inner and outer recursive neural networks for chemoinformatics applications. J Chem Inf Model. 2018;58(2):207–11.

Hewamalage H, Bergmeir C, Bandara K. Recurrent neural networks for time series forecasting: current status and future directions. Int J Forecast. 2020;37(1):388–427.

Jiang Y, Kim H, Asnani H, Kannan S, Oh S, Viswanath P. Learn codes: inventing low-latency codes via recurrent neural networks. IEEE J Sel Areas Inf Theory. 2020;1(1):207–16.

John RA, Acharya J, Zhu C, Surendran A, Bose SK, Chaturvedi A, Tiwari N, Gao Y, He Y, Zhang KK, et al. Optogenetics inspired transition metal dichalcogenide neuristors for in-memory deep recurrent neural networks. Nat Commun. 2020;11(1):1–9.

Batur Dinler Ö, Aydin N. An optimal feature parameter set based on gated recurrent unit recurrent neural networks for speech segment detection. Appl Sci. 2020;10(4):1273.

Jagannatha AN, Yu H. Structured prediction models for RNN based sequence labeling in clinical text. In: Proceedings of the conference on empirical methods in natural language processing, vol. 2016. NIH Public Access; 2016. p. 856.

Pascanu R, Gulcehre C, Cho K, Bengio Y. How to construct deep recurrent neural networks. In: Proceedings of the second international conference on learning representations (ICLR 2014); 2014.

Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 249–56.

Gao C, Yan J, Zhou S, Varshney PK, Liu H. Long short-term memory-based deep recurrent neural networks for target tracking. Inf Sci. 2019;502:279–96.

Zhou DX. Theory of deep convolutional neural networks: downsampling. Neural Netw. 2020;124:319–27.


Jhong SY, Tseng PY, Siriphockpirom N, Hsia CH, Huang MS, Hua KL, Chen YY. An automated biometric identification system using CNN-based palm vein recognition. In: 2020 international conference on advanced robotics and intelligent systems (ARIS). IEEE; 2020. p. 1–6.

Al-Azzawi A, Ouadou A, Max H, Duan Y, Tanner JJ, Cheng J. Deepcryopicker: fully automated deep neural network for single protein particle picking in cryo-EM. BMC Bioinform. 2020;21(1):1–38.

Wang T, Lu C, Yang M, Hong F, Liu C. A hybrid method for heartbeat classification via convolutional neural networks, multilayer perceptrons and focal loss. PeerJ Comput Sci. 2020;6:324.

Li G, Zhang M, Li J, Lv F, Tong G. Efficient densely connected convolutional neural networks. Pattern Recogn. 2021;109:107610.

Gu J, Wang Z, Kuen J, Ma L, Shahroudy A, Shuai B, Liu T, Wang X, Wang G, Cai J, et al. Recent advances in convolutional neural networks. Pattern Recogn. 2018;77:354–77.

Fang W, Love PE, Luo H, Ding L. Computer vision for behaviour-based safety in construction: a review and future directions. Adv Eng Inform. 2020;43:100980.

Palaz D, Magimai-Doss M, Collobert R. End-to-end acoustic modeling using convolutional neural networks for hmm-based automatic speech recognition. Speech Commun. 2019;108:15–32.

Li HC, Deng ZY, Chiang HH. Lightweight and resource-constrained learning network for face recognition with performance optimization. Sensors. 2020;20(21):6114.

Hubel DH, Wiesel TN. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J Physiol. 1962;160(1):106.

Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift; 2015. arXiv preprint arXiv:1502.03167 .

Ruder S. An overview of gradient descent optimization algorithms; 2016. arXiv preprint arXiv:1609.04747 .

Bottou L. Large-scale machine learning with stochastic gradient descent. In: Proceedings of COMPSTAT’2010. Springer; 2010. p. 177–86.

Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on. 2012;14(8).

Zhang Z. Improved Adam optimizer for deep neural networks. In: 2018 IEEE/ACM 26th international symposium on quality of service (IWQoS). IEEE; 2018. p. 1–2.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Duan Y. Deep learning models for classification of red blood cells in microscopy images to aid in sickle cell anemia diagnosis. Electronics. 2020;9(3):427.

Alzubaidi L, Fadhel MA, Al-Shamma O, Zhang J, Santamaría J, Duan Y, Oleiwi SR. Towards a better understanding of transfer learning for medical imaging: a case study. Appl Sci. 2020;10(13):4523.

Alzubaidi L, Al-Shamma O, Fadhel MA, Farhan L, Zhang J, Duan Y. Optimizing the performance of breast cancer classification by employing the same domain transfer learning from hybrid deep convolutional neural network model. Electronics. 2020;9(3):445.

LeCun Y, Jackel LD, Bottou L, Cortes C, Denker JS, Drucker H, Guyon I, Muller UA, Sackinger E, Simard P, et al. Learning algorithms for classification: a comparison on handwritten digit recognition. Neural Netw Stat Mech Perspect. 1995;261:276.

Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.


Dahl GE, Sainath TN, Hinton GE. Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8609–13.

Xu B, Wang N, Chen T, Li M. Empirical evaluation of rectified activations in convolutional network; 2015. arXiv preprint arXiv:1505.00853 .

Hochreiter S. The vanishing gradient problem during learning recurrent neural nets and problem solutions. Int J Uncertain Fuzziness Knowl Based Syst. 1998;6(02):107–16.

Lin M, Chen Q, Yan S. Network in network; 2013. arXiv preprint arXiv:1312.4400 .

Hsiao TY, Chang YC, Chou HH, Chiu CT. Filter-based deep-compression with global average pooling for convolutional networks. J Syst Arch. 2019;95:9–18.

Li Z, Wang SH, Fan RR, Cao G, Zhang YD, Guo T. Teeth category classification via seven-layer deep convolutional neural network with max pooling and global average pooling. Int J Imaging Syst Technol. 2019;29(4):577–83.

Zeiler MD, Fergus R. Visualizing and understanding convolutional networks. In: European conference on computer vision. Springer; 2014. p. 818–33.

Erhan D, Bengio Y, Courville A, Vincent P. Visualizing higher-layer features of a deep network. Univ Montreal. 2009;1341(3):1.

Le QV. Building high-level features using large scale unsupervised learning. In: 2013 IEEE international conference on acoustics, speech and signal processing. IEEE; 2013. p. 8595–8.

Grün F, Rupprecht C, Navab N, Tombari F. A taxonomy and library for visualizing learned features in convolutional neural networks; 2016. arXiv preprint arXiv:1606.07757 .

Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition; 2014. arXiv preprint arXiv:1409.1556 .

Ranzato M, Huang FJ, Boureau YL, LeCun Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In: 2007 IEEE conference on computer vision and pattern recognition. IEEE; 2007. p. 1–8.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2015. p. 1–9.

Bengio Y, et al. Rmsprop and equilibrated adaptive learning rates for nonconvex optimization; 2015. arXiv:1502.04390 corr abs/1502.04390

Srivastava RK, Greff K, Schmidhuber J. Highway networks; 2015. arXiv preprint arXiv:1505.00387 .

Kong W, Dong ZY, Jia Y, Hill DJ, Xu Y, Zhang Y. Short-term residential load forecasting based on LSTM recurrent neural network. IEEE Trans Smart Grid. 2017;10(1):841–51.

Ordóñez FJ, Roggen D. Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition. Sensors. 2016;16(1):115.

CireşAn D, Meier U, Masci J, Schmidhuber J. Multi-column deep neural network for traffic sign classification. Neural Netw. 2012;32:333–8.

Szegedy C, Ioffe S, Vanhoucke V, Alemi A. Inception-v4, inception-resnet and the impact of residual connections on learning; 2016. arXiv preprint arXiv:1602.07261 .

Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z. Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2016. p. 2818–26.

Wu S, Zhong S, Liu Y. Deep residual learning for image steganalysis. Multimed Tools Appl. 2018;77(9):10437–53.

Huang G, Liu Z, Van Der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 4700–08.

Rubin J, Parvaneh S, Rahman A, Conroy B, Babaeizadeh S. Densely connected convolutional networks for detection of atrial fibrillation from short single-lead ECG recordings. J Electrocardiol. 2018;51(6):S18-21.

Kuang P, Ma T, Chen Z, Li F. Image super-resolution with densely connected convolutional networks. Appl Intell. 2019;49(1):125–36.

Xie S, Girshick R, Dollár P, Tu Z, He K. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1492–500.

Su A, He X, Zhao X. Jpeg steganalysis based on ResNeXt with gauss partial derivative filters. Multimed Tools Appl. 2020;80(3):3349–66.

Yadav D, Jalal A, Garlapati D, Hossain K, Goyal A, Pant G. Deep learning-based ResNeXt model in phycological studies for future. Algal Res. 2020;50:102018.

Han W, Feng R, Wang L, Gao L. Adaptive spatial-scale-aware deep convolutional neural network for high-resolution remote sensing imagery scene classification. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. p. 4736–9.

Zagoruyko S, Komodakis N. Wide residual networks; 2016. arXiv preprint arXiv:1605.07146 .

Huang G, Sun Y, Liu Z, Sedra D, Weinberger KQ. Deep networks with stochastic depth. In: European conference on computer vision. Springer; 2016. p. 646–61.

Huynh HT, Nguyen H. Joint age estimation and gender classification of Asian faces using wide ResNet. SN Comput Sci. 2020;1(5):1–9.

Takahashi R, Matsubara T, Uehara K. Data augmentation using random image cropping and patching for deep cnns. IEEE Trans Circuits Syst Video Technol. 2019;30(9):2917–31.

Han D, Kim J, Kim J. Deep pyramidal residual networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 5927–35.

Wang Y, Wang L, Wang H, Li P. End-to-end image super-resolution via deep and shallow convolutional networks. IEEE Access. 2019;7:31959–70.

Chollet F. Xception: Deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 1251–8.

Lo WW, Yang X, Wang Y. An xception convolutional neural network for malware classification with transfer learning. In: 2019 10th IFIP international conference on new technologies, mobility and security (NTMS). IEEE; 2019. p. 1–5.

Rahimzadeh M, Attar A. A modified deep convolutional neural network for detecting COVID-19 and pneumonia from chest X-ray images based on the concatenation of xception and resnet50v2. Inform Med Unlocked. 2020;19:100360.

Wang F, Jiang M, Qian C, Yang S, Li C, Zhang H, Wang X, Tang X. Residual attention network for image classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 3156–64.

Salakhutdinov R, Larochelle H. Efficient learning of deep boltzmann machines. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics; 2010. p. 693–700.

Goh H, Thome N, Cord M, Lim JH. Top-down regularization of deep belief networks. Adv Neural Inf Process Syst. 2013;26:1878–86.

Guan J, Lai R, Xiong A, Liu Z, Gu L. Fixed pattern noise reduction for infrared images based on cascade residual attention CNN. Neurocomputing. 2020;377:301–13.

Bi Q, Qin K, Zhang H, Li Z, Xu K. RADC-Net: a residual attention based convolution network for aerial scene classification. Neurocomputing. 2020;377:345–59.

Jaderberg M, Simonyan K, Zisserman A, et al. Spatial transformer networks. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2015. p. 2017–25.

Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2018. p. 7132–41.

Mou L, Zhu XX. Learning to pay attention on spectral domain: a spectral attention module-based convolutional network for hyperspectral image classification. IEEE Trans Geosci Remote Sens. 2019;58(1):110–22.

Woo S, Park J, Lee JY, So Kweon I. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV); 2018. p. 3–19.

Roy AG, Navab N, Wachinger C. Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International conference on medical image computing and computer-assisted intervention. Springer; 2018. p. 421–9.

Roy AG, Navab N, Wachinger C. Recalibrating fully convolutional networks with spatial and channel “squeeze and excitation’’ blocks. IEEE Trans Med Imaging. 2018;38(2):540–9.

Sabour S, Frosst N, Hinton GE. Dynamic routing between capsules. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2017. p. 3856–66.

Arun P, Buddhiraju KM, Porwal A. Capsulenet-based spatial-spectral classifier for hyperspectral images. IEEE J Sel Topics Appl Earth Obs Remote Sens. 2019;12(6):1849–65.

Xinwei L, Lianghao X, Yi Y. Compact video fingerprinting via an improved capsule net. Syst Sci Control Eng. 2020;9:1–9.

Ma B, Li X, Xia Y, Zhang Y. Autonomous deep learning: a genetic DCNN designer for image classification. Neurocomputing. 2020;379:152–61.

Wang J, Sun K, Cheng T, Jiang B, Deng C, Zhao Y, Liu D, Mu Y, Tan M, Wang X, et al. Deep high-resolution representation learning for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2020. https://doi.org/10.1109/TPAMI.2020.2983686 .

Cheng B, Xiao B, Wang J, Shi H, Huang TS, Zhang L. Higherhrnet: scale-aware representation learning for bottom-up human pose estimation. In: CVPR 2020; 2020. https://www.microsoft.com/en-us/research/publication/higherhrnet-scale-aware-representation-learning-for-bottom-up-human-pose-estimation/ .

Karimi H, Derr T, Tang J. Characterizing the decision boundary of deep neural networks; 2019. arXiv preprint arXiv:1912.11460 .

Li Y, Ding L, Gao X. On the decision boundary of deep neural networks; 2018. arXiv preprint arXiv:1808.05385 .

Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural networks? In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2014. p. 3320–8.

Tan C, Sun F, Kong T, Zhang W, Yang C, Liu C. A survey on deep transfer learning. In: International conference on artificial neural networks. Springer; 2018. p. 270–9.

Weiss K, Khoshgoftaar TM, Wang D. A survey of transfer learning. J Big Data. 2016;3(1):9.

Shorten C, Khoshgoftaar TM. A survey on image data augmentation for deep learning. J Big Data. 2019;6(1):60.

Wang F, Wang H, Wang H, Li G, Situ G. Learning from simulation: an end-to-end deep-learning approach for computational ghost imaging. Opt Express. 2019;27(18):25560–72.

Pan W. A survey of transfer learning for collaborative recommendation with auxiliary data. Neurocomputing. 2016;177:447–53.

Deng J, Dong W, Socher R, Li LJ, Li K, Fei-Fei L. Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. IEEE; 2009. p. 248–55.

Cook D, Feuz KD, Krishnan NC. Transfer learning for activity recognition: a survey. Knowl Inf Syst. 2013;36(3):537–56.

Cao X, Wang Z, Yan P, Li X. Transfer learning for pedestrian detection. Neurocomputing. 2013;100:51–7.

Raghu M, Zhang C, Kleinberg J, Bengio S. Transfusion: understanding transfer learning for medical imaging. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2019. p. 3347–57.

Pham TN, Van Tran L, Dao SVT. Early disease classification of mango leaves using feed-forward neural network and hybrid metaheuristic feature selection. IEEE Access. 2020;8:189960–73.

Saleh AM, Hamoud T. Analysis and best parameters selection for person recognition based on gait model using CNN algorithm and image augmentation. J Big Data. 2021;8(1):1–20.

Hirahara D, Takaya E, Takahara T, Ueda T. Effects of data count and image scaling on deep learning training. PeerJ Comput Sci. 2020;6:312.

Moreno-Barea FJ, Strazzera F, Jerez JM, Urda D, Franco L. Forward noise adjustment scheme for data augmentation. In: 2018 IEEE symposium series on computational intelligence (SSCI). IEEE; 2018. p. 728–34.

Dua D, Karra Taniskidou E. UCI machine learning repository. Irvine: University of California, School of Information and Computer Science; 2017. http://archive.ics.uci.edu/ml

Johnson JM, Khoshgoftaar TM. Survey on deep learning with class imbalance. J Big Data. 2019;6(1):27.

Yang P, Zhang Z, Zhou BB, Zomaya AY. Sample subset optimization for classifying imbalanced biological data. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2011. p. 333–44.

Yang P, Yoo PD, Fernando J, Zhou BB, Zhang Z, Zomaya AY. Sample subset optimization techniques for imbalanced and ensemble learning problems in bioinformatics applications. IEEE Trans Cybern. 2013;44(3):445–55.

Wang S, Sun S, Xu J. Auc-maximized deep convolutional neural fields for sequence labeling 2015. arXiv preprint arXiv:1511.05265 .

Li Y, Wang S, Umarov R, Xie B, Fan M, Li L, Gao X. Deepre: sequence-based enzyme EC number prediction by deep learning. Bioinformatics. 2018;34(5):760–9.

Li Y, Huang C, Ding L, Li Z, Pan Y, Gao X. Deep learning in bioinformatics: introduction, application, and perspective in the big data era. Methods. 2019;166:4–21.

Choi E, Bahadori MT, Sun J, Kulas J, Schuetz A, Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In: Advances in neural information processing systems. San Mateo: Morgan Kaufmann Publishers; 2016. p. 3504–12.

Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, et al. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface. 2018;15(141):20170,387.

Zhou J, Troyanskaya OG. Predicting effects of noncoding variants with deep learning-based sequence model. Nat Methods. 2015;12(10):931–4.

Pokuri BSS, Ghosal S, Kokate A, Sarkar S, Ganapathysubramanian B. Interpretable deep learning for guided microstructure-property explorations in photovoltaics. NPJ Comput Mater. 2019;5(1):1–11.

Ribeiro MT, Singh S, Guestrin C. “Why should I trust you?” explaining the predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 1135–44.

Wang L, Nie R, Yu Z, Xin R, Zheng C, Zhang Z, Zhang J, Cai J. An interpretable deep-learning architecture of capsule networks for identifying cell-type gene expression programs from single-cell RNA-sequencing data. Nat Mach Intell. 2020;2(11):1–11.

Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks; 2017. arXiv preprint arXiv:1703.01365 .

Platt J, et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif. 1999;10(3):61–74.

Nair T, Precup D, Arnold DL, Arbel T. Exploring uncertainty measures in deep networks for multiple sclerosis lesion detection and segmentation. Med Image Anal. 2020;59:101557.

Herzog L, Murina E, Dürr O, Wegener S, Sick B. Integrating uncertainty in deep neural networks for MRI based stroke analysis. Med Image Anal. 2020;65:101790.

Pereyra G, Tucker G, Chorowski J, Kaiser Ł, Hinton G. Regularizing neural networks by penalizing confident output distributions; 2017. arXiv preprint arXiv:1701.06548 .

Naeini MP, Cooper GF, Hauskrecht M. Obtaining well calibrated probabilities using bayesian binning. In: Proceedings of the... AAAI conference on artificial intelligence. AAAI conference on artificial intelligence, vol. 2015. NIH Public Access; 2015. p. 2901.

Li M, Sethi IK. Confidence-based classifier design. Pattern Recogn. 2006;39(7):1230–40.

Zadrozny B, Elkan C. Obtaining calibrated probability estimates from decision trees and Naive Bayesian classifiers. In: ICML, vol. 1, Citeseer; 2001. p. 609–16.

Steinwart I. Consistency of support vector machines and other regularized kernel classifiers. IEEE Trans Inf Theory. 2005;51(1):128–42.

Lee K, Lee K, Shin J, Lee H. Overcoming catastrophic forgetting with unlabeled data in the wild. In: Proceedings of the IEEE international conference on computer vision; 2019. p. 312–21.

Shmelkov K, Schmid C, Alahari K. Incremental learning of object detectors without catastrophic forgetting. In: Proceedings of the IEEE international conference on computer vision; 2017. p. 3400–09.

Zenke F, Gerstner W, Ganguli S. The temporal paradox of Hebbian learning and homeostatic plasticity. Curr Opin Neurobiol. 2017;43:166–76.

Andersen N, Krauth N, Nabavi S. Hebbian plasticity in vivo: relevance and induction. Curr Opin Neurobiol. 2017;45:188–92.

Zheng R, Chakraborti S. A phase ii nonparametric adaptive exponentially weighted moving average control chart. Qual Eng. 2016;28(4):476–90.

Rebuffi SA, Kolesnikov A, Sperl G, Lampert CH. ICARL: Incremental classifier and representation learning. In: Proceedings of the IEEE conference on computer vision and pattern recognition; 2017. p. 2001–10.

Hinton GE, Plaut DC. Using fast weights to deblur old memories. In: Proceedings of the ninth annual conference of the cognitive science society; 1987. p. 177–86.

Parisi GI, Kemker R, Part JL, Kanan C, Wermter S. Continual lifelong learning with neural networks: a review. Neural Netw. 2019;113:54–71.

Soltoggio A, Stanley KO, Risi S. Born to learn: the inspiration, progress, and future of evolved plastic artificial neural networks. Neural Netw. 2018;108:48–67.

Parisi GI, Tani J, Weber C, Wermter S. Lifelong learning of human actions with deep neural network self-organization. Neural Netw. 2017;96:137–49.

Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35(1):126–36.

Wiedemann S, Kirchhoffer H, Matlage S, Haase P, Marban A, Marinč T, Neumann D, Nguyen T, Schwarz H, Wiegand T, et al. Deepcabac: a universal compression algorithm for deep neural networks. IEEE J Sel Topics Signal Process. 2020;14(4):700–14.

Mehta N, Pandit A. Concurrence of big data analytics and healthcare: a systematic review. Int J Med Inform. 2018;114:57–65.

Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, Cui C, Corrado G, Thrun S, Dean J. A guide to deep learning in healthcare. Nat Med. 2019;25(1):24–9.


Acknowledgements

We would like to thank the professors from the Queensland University of Technology and the University of Information Technology and Communications who gave their feedback on the paper.

This research received no external funding.

Author information

Authors and Affiliations

School of Computer Science, Queensland University of Technology, Brisbane, QLD, 4000, Australia

Laith Alzubaidi & Jinglan Zhang

Control and Systems Engineering Department, University of Technology, Baghdad, 10001, Iraq

Amjad J. Humaidi

Electrical Engineering Technical College, Middle Technical University, Baghdad, 10001, Iraq

Ayad Al-Dujaili

Faculty of Electrical Engineering & Computer Science, University of Missouri, Columbia, MO, 65211, USA

Ye Duan & Muthana Al-Amidie

AlNidhal Campus, University of Information Technology & Communications, Baghdad, 10001, Iraq

Laith Alzubaidi & Omran Al-Shamma

Department of Computer Science, University of Jaén, 23071, Jaén, Spain

J. Santamaría

College of Computer Science and Information Technology, University of Sumer, Thi Qar, 64005, Iraq

Mohammed A. Fadhel

School of Engineering, Manchester Metropolitan University, Manchester, M1 5GD, UK

Laith Farhan


Contributions

Conceptualization: LA, and JZ; methodology: LA, JZ, and JS; software: LA, and MAF; validation: LA, JZ, MA, and LF; formal analysis: LA, JZ, YD, and JS; investigation: LA, and JZ; resources: LA, JZ, and MAF; data curation: LA, and OA; writing–original draft preparation: LA, and OA; writing–review and editing: LA, JZ, AJH, AA, YD, OA, JS, MAF, MA, and LF; visualization: LA, and MAF; supervision: JZ, and YD; project administration: JZ, YD, and JS; funding acquisition: LA, AJH, AA, and YD. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Laith Alzubaidi.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Alzubaidi, L., Zhang, J., Humaidi, A.J. et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 8, 53 (2021). https://doi.org/10.1186/s40537-021-00444-8


Received: 21 January 2021

Accepted: 22 March 2021

Published: 31 March 2021

DOI: https://doi.org/10.1186/s40537-021-00444-8


Keywords: Deep learning, Machine learning, Convolutional neural network (CNN), Deep neural network architectures, Deep learning applications, Image classification, Medical image analysis, Supervised learning

  • Perspective
  • Open access
  • Published: 22 March 2023

Catalyzing next-generation Artificial Intelligence through NeuroAI

  • Anthony Zador,
  • Sean Escola,
  • Blake Richards,
  • Bence Ölveczky,
  • Yoshua Bengio,
  • Kwabena Boahen,
  • Matthew Botvinick,
  • Dmitri Chklovskii,
  • Anne Churchland,
  • Claudia Clopath,
  • James DiCarlo,
  • Surya Ganguli,
  • Jeff Hawkins,
  • Konrad Körding,
  • Alexei Koulakov,
  • Yann LeCun,
  • Timothy Lillicrap,
  • Adam Marblestone,
  • Bruno Olshausen,
  • Alexandre Pouget,
  • Cristina Savin,
  • Terrence Sejnowski,
  • Eero Simoncelli,
  • Sara Solla,
  • David Sussillo,
  • Andreas S. Tolias &
  • Doris Tsao

Nature Communications volume 14, Article number: 1597 (2023)


  • Computer science
  • Neuroscience

Neuroscience has long been an essential driver of progress in artificial intelligence (AI). We propose that to accelerate progress in AI, we must invest in fundamental research in NeuroAI. A core component of this is the embodied Turing test, which challenges AI animal models to interact with the sensorimotor world at skill levels akin to their living counterparts. The embodied Turing test shifts the focus from those capabilities like game playing and language that are especially well-developed or uniquely human to those capabilities – inherited from over 500 million years of evolution – that are shared with all animals. Building models that can pass the embodied Turing test will provide a roadmap for the next generation of AI.


Introduction

Over the coming decades, Artificial Intelligence (AI) will transform society and the world economy in ways that are as profound as the computer revolution of the last half century and likely at an even faster pace. This AI revolution presents tremendous opportunities to unleash human creativity and catalyze economic growth, relieving workers from performing the most dangerous and menial jobs. However, to reach this potential, we still require advances that will make AI more human-like in its capabilities. Historically, neuroscience has been a critical driver and source of inspiration for improvements in AI, particularly those that made AI more proficient in areas that humans and other animals excel at, such as vision, reward-based learning, interacting with the physical world, and language 1 , 2 . It can still play this role. To accelerate progress in AI and realize its vast potential, we must invest in fundamental research in “NeuroAI.”

The seeds of the current AI revolution were planted decades ago, mainly by researchers attempting to understand how brains compute 3 . Indeed, the earliest efforts to build an “artificial brain” led to the invention of the modern “von Neumann computer architecture,” for which John von Neumann explicitly drew upon the very limited knowledge of the brain available to him in the 1940s 4 , 5 . Later, the Nobel-prize winning work of David Hubel and Torsten Wiesel on visual processing circuits in the cat neocortex inspired the deep convolutional networks that have catalyzed the recent revolution in modern AI 6 , 7 , 8 . Similarly, the development of reinforcement learning was directly inspired by insights into animal behavior and neural activity during learning 9 , 10 , 11 , 12 , 13 , 14 , 15 . Now, decades later, applications of artificial neural networks (ANNs) and reinforcement learning (RL) are coming so quickly that many observers assume that the long-elusive goal of human-level intelligence—sometimes referred to as “artificial general intelligence”—is within our grasp. However, in contrast to the optimism of those outside the field, many front-line AI researchers believe that major breakthroughs are needed before we can build artificial systems capable of doing all that a human, or even a much simpler animal like a mouse, can do.

Although AI systems can easily defeat any human opponent in games such as chess 16 and Go 17 , they are not robust and often struggle when faced with novel situations. Moreover, we have yet to build effective systems that can walk to the shelf, take down the chess set, set up the pieces, and move them around during a game, although recent progress is encouraging 18 . Similarly, no machine can build a nest, forage for berries, or care for young. Today’s AI systems cannot compete with the sensorimotor capabilities of a four-year old child or even simple animals. Many basic capacities required to navigate new situations—capacities that animals have or acquire effortlessly—turn out to be deceptively challenging for AI, partly because AI systems lack even the basic abilities to interact with an unpredictable world. A growing number of AI researchers doubt that merely scaling up current approaches will overcome these limitations. Given the need to achieve more natural intelligence in AI, it is quite likely that new inspiration from naturally intelligent systems is needed 19 .

Historically, many key AI advances, such as convolutional ANNs and reinforcement learning, were inspired by neuroscience. Neuroscience continues to provide guidance—e.g., attention-based neural networks were loosely inspired by attention mechanisms in the brain 20 , 21 , 22 , 23 —but this is often based on findings that are decades old. The fact that such cross-pollination between AI and neuroscience is far less common than in the past represents a missed opportunity. Over the last decades, through efforts such as the NIH BRAIN initiative and others, we have amassed an enormous amount of knowledge about the brain. The emerging field of NeuroAI, at the intersection of neuroscience and AI, is based on the premise that a better understanding of neural computation will reveal fundamental ingredients of intelligence and catalyze the next revolution in AI. This will eventually lead to artificial agents with capabilities that match those of humans. The NeuroAI program we advocate is driven by the recognition that AI historically owes much to neuroscience and the promise that AI will continue to learn from it–but only if there is a large enough community of researchers fluent in both domains. We believe the time is right for a large-scale effort to identify and understand the principles of biological intelligence and abstract those for application in computer and robotic systems.

It is tempting to focus on the most characteristically human aspects of intelligent behavior, such as abstract thought and reasoning. However, the basic ingredients of intelligence—adaptability, flexibility, and the ability to make general inferences from sparse observations—are already present in some form in basic sensorimotor circuits, which have been evolving for hundreds of millions of years. As AI pioneer Hans Moravec 24 put it, abstract thought “is a new trick, perhaps less than 100 thousand years old….effective only because it is supported by this much older and much more powerful, though usually unconscious, sensorimotor knowledge.” This implies that the bulk of the work in developing general AI can be achieved by building systems that match the perceptual and motor abilities of animals and that the subsequent step to human-level intelligence would be considerably smaller. This is good news because progress on the first goal can rely on the favored subjects of neuroscience research—rats, mice, and non-human primates—for which extensive and rapidly expanding behavioral and neural datasets can guide the way. Thus, we believe that the NeuroAI path will lead to necessary advances if we figure out the core capabilities that all animals possess in embodied sensorimotor interaction with the world.

NeuroAI grand challenge: the embodied Turing test

In 1950, Alan Turing proposed the “imitation game” 25 as a test of a machine’s ability to exhibit intelligent behavior indistinguishable from that of a human (Fig.  1 , left). In that game, now known as the Turing test, a human judge evaluates natural language conversations between a real human and a machine trained to mimic human responses. By focusing on conversational abilities, Turing evaded asking whether a machine could “think,” a question he considered impossible to answer. The Turing test is based on the implicit belief that language represents the pinnacle of human intelligence and that a machine capable of conversation must surely be intelligent.

Figure 1. Left: The original Turing test as proposed by Alan Turing 25. If a human tester cannot determine whether their interlocutor is an AI system or another human, the AI passes the test. Modern large language models have made substantial progress towards passing this test 26. Right: The embodied Turing test. An AI animal model—whether robotic or in simulation—passes the test if its behavior is indistinguishable from that of its living counterpart. No AI systems are close to passing this test. Here, an artificial beaver is tested on the species-specific behavior of dam construction.

Until recently, no artificial system could come close to passing the Turing test. However, a class of modern AI systems called “large language models” can now engage in surprisingly cogent conversations 26 . In part, their success reveals how easily we can be tricked into imputing intelligence, agency, and even consciousness to our interlocutor 27 . Impressive though these systems are, because they are not grounded in real-world experiences, they nonetheless continue to struggle with many basic aspects of causal reasoning and physical common-sense. Thus, the Turing test does not probe our prodigious perceptual and motor abilities to interact with and reason about the physical world, abilities shared with animals and honed through countless generations of natural selection.

We therefore propose an expanded “embodied Turing test,” one that includes advanced sensorimotor abilities (Fig.  1 , right). The spirit of the original Turing test was to establish a simple qualitative standard against which our progress toward building artificially intelligent machines can be judged. This embodied Turing test would benchmark and compare the interactions with the world of artificial systems versus humans and other animals. Similar ideas have been proposed previously 28 , 29 , 30 , 31 , 32 . However, in light of recent advances enabling large-scale behavioral and neural measurements, as well as large-scale simulations of embodied agents in silico, we believe the time is ripe to instantiate a major research effort in this direction. As each animal has its own unique set of abilities, each animal defines its own embodied Turing test: An artificial beaver might be tested on its ability to build a dam, and an artificial squirrel on its ability to jump through trees. Nonetheless, many core sensorimotor capabilities are shared by almost all animals, and the ability of animals to rapidly evolve the sensorimotor skills needed to adapt to new environments suggests that these core skills provide a solid foundation. This implies that after developing an AI system to faithfully reproduce the behavior of one species, the adaptation of this system to other species—and even to humans—may be straightforward. Below we highlight a few of the characteristics that are shared across species.

Animals engage their environments

The defining feature of animals is their ability to move around and interact with their environment in purposeful ways. Despite recent advances in optimal control, reinforcement learning, and imitation learning, robotics is still far from achieving animal-level abilities in controlling their bodies and manipulating objects, even in simulation. Of course, neuroscience can provide guidance about the kinds of modular and hierarchical architectures that could be adapted to artificial systems to give them these capabilities 33 . It can also provide us with design principles like partial autonomy (how low-level modules in a hierarchy act semi-autonomously in the absence of input from high-level modules) and amortized control (how movements generated at first by a slow planning process are eventually transferred to a fast reflexive system). These principles could guide the design of systems for perception, action selection, locomotion, and fine-grained control of limbs, hands, and fingers. Understanding how specific neural circuits participate in different tasks could also inspire solutions for other forms of ‘intelligence,’ including in more cognitive realms. For example, we speculate that incorporating principles of circuitry for low-level motor control could help provide a better basis for higher-level motor planning in AI systems.
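To make the two principles named above concrete, the toy sketch below (in Python) distills a slow planner into a fast reflexive policy (amortized control) while a low-level module keeps acting when no high-level command arrives (partial autonomy). Every function and class name is invented for illustration; this is a minimal sketch of the idea, not a model drawn from the cited work.

```python
# Toy sketch only: illustrates "amortized control" and "partial autonomy"
# in the simplest possible 1-D setting. All names here are invented.
import random

def slow_planner(state, goal):
    """Expensive deliberation: test a few candidate actions and keep the
    one that moves the state closest to the goal."""
    candidates = [-1.0, -0.5, 0.0, 0.5, 1.0]
    return min(candidates, key=lambda a: abs((state + a) - goal))

class FastReflex:
    """Cheap linear policy trained to imitate the planner, so that control
    is gradually 'amortized' into a fast reflexive system."""
    def __init__(self):
        self.w = 0.0
        self.b = 0.0

    def act(self, state, goal):
        return self.w * (goal - state) + self.b

    def imitate(self, state, goal, planned_action, lr=0.05):
        # One gradient step on the squared imitation error.
        err = self.act(state, goal) - planned_action
        self.w -= lr * err * (goal - state)
        self.b -= lr * err

def low_level_module(state, command=None):
    """Partial autonomy: with no command from the high level, fall back to
    a stabilizing reflex instead of doing nothing."""
    return command if command is not None else -0.1 * state

reflex = FastReflex()
state, goal = 0.0, 3.0
for step in range(200):
    planned = slow_planner(state, goal)
    reflex.imitate(state, goal, planned)
    # Early on the slow planner drives behavior; later the distilled reflex does.
    command = planned if step < 100 else reflex.act(state, goal)
    if random.random() < 0.1:   # occasionally the high level is silent
        command = None
    state += low_level_module(state, command)

print(f"final state {state:.2f} (goal was {goal})")
```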

Animals behave flexibly

Another goal is to develop AI systems that can engage a large repertoire of flexible and diverse tasks in a manner that echoes the incredible range of behaviors that individual animals can generate. Modern AI can easily learn to outperform humans at video games like Breakout using nothing more than pixels on a screen and game scores 34 . However, these systems, unlike human players, are brittle and highly sensitive to small perturbations: changing the rules of the game slightly, or even a few pixels on the input, can lead to catastrophically poor performance 35 . This is because these systems learn a mapping from pixels to actions that need not involve an understanding of the agents and objects in the game and the physics that governs them. Similarly, a self-driving car does not inherently know about the danger of a crate falling off a truck in front of it unless it has literally seen examples of crates falling off trucks leading to bad outcomes. And even if it has been trained on the dangers of falling crates, the system might consider an empty plastic bag being blown out of the car in front of it as an obstacle to avoid at all costs rather than an irritant, again, because it doesn’t actually understand what a plastic bag is or how unthreatening it is physically. This inability to handle scenarios that have not appeared in the training data is a significant challenge to widespread reliance on AI systems.

To be successful in an unpredictable and changing world, an agent must be flexible and master novel situations by using its general knowledge about how such situations are likely to unfold. This is arguably what animals do. Animals are born with most of the skills needed to thrive or can rapidly acquire them from limited experience, thanks to their strong foundation in real-world interaction, courtesy of evolution and development 36 . Thus, it is clear that training from scratch for a specific task is not how animals obtain their impressive skills; animals do not arrive into the world tabula rasa and then rely on large labeled training sets to learn. Although machine learning has been pursuing approaches for sidestepping this tabula rasa limitation, including self-supervised learning, transfer learning, continual learning, meta learning, one-shot learning and imitation learning 37 , none of these approaches comes close to achieving the flexibility found in most animals. Thus, we argue that understanding the neural circuit-level principles that provide the foundation for behavioral flexibility in the real-world, even in simple animals, has the potential to greatly increase the flexibility and utility of AI systems. Put another way, we can greatly accelerate our search for general-purpose circuits for real-world interaction by taking advantage of the optimization process that evolution has already engaged in 38 , 39 , 40 , 41 , 42 , 43 , 44 , 45 .

Animals compute efficiently

One important challenge for modern AI—that our brains have overcome—is energy efficiency. Training a neural network requires enormous amounts of energy. For example, training a large language model such as GPT-3 requires over 1000 megawatt-hours, enough to power a small town for a day 46 . Biological systems are, by contrast, much more energy efficient: The human brain uses about 20 watts 47 . The difference in energy requirement between brains and computers derives from differences in information processing. First, at an algorithmic level, modern large-scale ANNs, such as large language models 26 , rely on very large feedforward architectures with self-attention to process sequences over time 23 , ignoring the potential power of recurrence for processing sequential information. One reason for this is that currently we do not have efficient mechanisms for credit assignment calculations in recurrent networks. In contrast, brains utilize flexible recurrent architectures that can solve the temporal credit assignment problem with great efficiency. Uncovering the mechanisms by which this happens could potentially enable us to increase the energy efficiency of artificial systems. Alternatively, it has been proposed that the synaptic dynamics within adjacent dendritic spines could serve as a mechanism for learning sequential structure, a scheme that could potentially be efficiently implemented in hardware 48 . Second, at an implementation level, neural circuits differ from digital computers. Neural circuits compute effectively despite the presence of unreliable or “noisy” components. For example, synaptic release, the primary means of communication between neurons, can be so unreliable that only one in every ten messages is transmitted 49 . Furthermore, neurons interact mainly by transmitting action potentials (spikes), an asynchronous communication protocol. Like the interactions between conventional digital elements, the output of a neuron can be viewed as a string of 0s and 1s; but unlike a digital computer, the energy cost of a “1” (i.e., of a spike) is several orders of magnitude higher than that of a “0” 50 . As biological circuits operate in a regime where spikes are sparse—even very active neurons rarely fire at more than 100 spikes per second and typical cortical firing rates may be less than 1 spike/second—they are much more energy efficient 51 . Spike-based computation has also been shown to be orders of magnitude faster and more energy efficient in recent hardware implementation 52 .
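As a rough back-of-envelope illustration of this gap, the short sketch below works through the figures quoted above; the neuron count and mean firing rate are the approximate round numbers discussed in this article, and none of these values are measurements taken from the cited references.

```python
# Back-of-envelope arithmetic only, using the figures quoted above.
gpt3_training_mwh = 1000                                # "over 1000 megawatt-hours"
gpt3_training_joules = gpt3_training_mwh * 1e6 * 3600   # MWh -> joules

brain_power_watts = 20                                  # "about 20 watts"
seconds_per_day = 24 * 3600
brain_joules_per_day = brain_power_watts * seconds_per_day

brain_days = gpt3_training_joules / brain_joules_per_day
print(f"Quoted training budget ~= {brain_days:,.0f} brain-days of operation")
# -> roughly 2 million brain-days, i.e. several thousand brain-years

# Sparse spiking: about 1e11 neurons at ~1 spike/second on average (rough
# round numbers, assumed here for illustration).
neurons = 1e11
mean_rate_hz = 1.0
print(f"~{neurons * mean_rate_hz:.1e} spikes per second across the brain")
```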

A roadmap for solving the embodied Turing test

How might artificial systems that pass the embodied Turing test be developed? One natural approach would be to do so incrementally, guided by our evolutionary history. For example, almost all animals engage in goal-directed locomotion; they move toward some stimuli (e.g., food sources) and away from others (e.g., threats). Layered on top of these foundational abilities are more sophisticated skills, such as the ability to combine different streams of sensory information (e.g., visual and olfactory), to use this sensory information to distinguish food sources and threats, to navigate to previous locations, to weigh possible rewards and threats to achieve goals, and to interact with the world in precise ways in service of these goals. Most of these—and many other—sophisticated abilities are found to some extent in even very simple organisms, such as worms. In more complex animals, such as fish and mammals, these abilities are elaborated and combined with new strategies to enable more powerful behavioral strategies.

This evolutionary perspective suggests a strategy for passing the embodied Turing test by breaking it down into a series of incrementally challenging ones that build on each other, and iteratively optimizing on this series 53 . Specifically, the embodied Turing test comprises challenges that include a wide range of organisms used in neuroscience research, including worms, flies, fish, rodents and primates. This would enable us to deploy the vast amount of knowledge we have accumulated about the behavior, biomechanics, and neural mechanisms of these model organisms to both precisely define each species-specific embodied Turing test and serve as strong inductive biases to guide the development of robust AI controllers that can pass it.

The performance of these artificial agents could be compared with that of animals. Rich behavioral datasets representing a large swath of a species’ ethogram have now been collected and can be deployed to benchmark performance on species-specific embodied Turing tests. Furthermore, these datasets are being rapidly expanded given new tools in 3D videography 54 , 55 , 56 , 57 . Additionally, detailed biomechanical measurements support highly realistic animal body models, complete with skeletal constraints, muscles, tendons, and paw features 58 . Combined with the open-sourcing of powerful, fast physics simulators and virtual environments 59 , 60 , these models will afford the opportunity for embodied Turing test research to be performed in silico at scale 33 . Finally, existing extensive neural datasets with simultaneous neural recordings across multiple brain regions during behavior, combined with increasingly detailed neural anatomy and connectomics, provide a powerful roadmap for the design of AI systems that can control virtual animals to recapitulate the behaviors of their in vivo counterparts and thus pass the embodied Turing test.

Importantly, the specifics of the embodied Turing test for each species can be tuned to the needs of different groups of researchers. We can test the capacity of AI systems in terms of sensorimotor control, self-supervised and continual learning, generalization, memory-guided behavior on both short and life-long timescales, and social interactions. Despite these potentially different areas of interest, the challenges that compose the embodied Turing test can be standardized to permit the quantification of progress and comparison between research efforts. Standardization can be fostered by stakeholders including government and private funders, large research organizations such as the Allen Institute, and major collaborations like the International Brain Lab, with an eye toward the development of common APIs and support for competitions as has been an important impetus for much progress in machine learning and robotics 61 , 62 . Ultimately, virtual organisms that demonstrate successful recapitulation of behaviors of interest can be adapted to the physical world with additional efforts in robotics and deployed to solve real-world problems.
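As a purely hypothetical illustration of what such a standardized interface might look like, the sketch below outlines a minimal species-specific embodied Turing test benchmark in Python; all class, method, and task names are invented for illustration and do not describe any existing API or competition.

```python
# Hypothetical sketch of a shared interface for species-specific embodied
# Turing tests. Every name below is invented for illustration only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EpisodeResult:
    species: str
    task: str
    score: float      # similarity of agent behavior to reference animal data
    passed: bool

class EmbodiedTuringTest:
    """One species-specific test: a set of tasks plus a scoring function
    that compares an agent's behavior against animal reference data."""
    def __init__(self, species: str, tasks: List[str],
                 score_fn: Callable[[str, Callable], float],
                 pass_threshold: float = 0.9):
        self.species = species
        self.tasks = tasks
        self.score_fn = score_fn
        self.pass_threshold = pass_threshold

    def evaluate(self, policy: Callable) -> List[EpisodeResult]:
        results = []
        for task in self.tasks:
            score = self.score_fn(task, policy)
            results.append(EpisodeResult(self.species, task, score,
                                         score >= self.pass_threshold))
        return results

# Toy usage with a stub scoring function and a trivial policy.
def stub_score(task: str, policy: Callable) -> float:
    # A real benchmark would roll out the policy in a physics simulator and
    # compare the resulting behavior to recorded animal behavior.
    return policy(task)

beaver_test = EmbodiedTuringTest(
    species="beaver",
    tasks=["dam_construction", "goal_directed_locomotion"],
    score_fn=stub_score,
)
for result in beaver_test.evaluate(lambda task: 0.42):
    print(result)
```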

What we need

Achieving these goals will require significant resources deployed in three main areas. First, we must train a new generation of AI researchers who are equally at home in engineering/computational science and neuroscience . These researchers will chart fundamentally new directions in AI research by drawing on decades of progress in neuroscience. The greatest challenge will be in determining how to exploit the synergies and overlaps in neuroscience, computational science, and other relevant fields to advance our quest: identifying what details of the brain’s circuitry, biophysics, and chemistry are important and what can be disregarded in the application to AI. There is thus a critical need for researchers with dual training in AI and neuroscience to apply insights from neuroscience to advance AI and to help design experiments that generate new insights relevant to AI. Although there is already some research of this type, it exists largely at the margins of mainstream neuroscience; training in neuroscience has thus far been motivated and funded mainly by the goal of improving human health and of understanding the brain as such. This lack of alignment between fields might explain, e.g., the multi-decade gap between Hubel and Wiesel’s discovery of the structure of the visual system 6 and the development and application of convolutional neural networks in modern machine learning 8 . Thus, the success of a NeuroAI research program depends on the formation of a community of researchers for whom the raison d'être of their training is to exploit synergies between neuroscience and AI. Explicit design of new training programs can ensure that the NeuroAI research community reflects the demographics of society as a whole and is equipped with the ethical tools needed to ensure that the development of AI benefits society 63 .

Second, we must create a shared platform capable of developing and testing these virtual agents. One of the greatest technical challenges that we will face in creating an iterative, embodied Turing test and evolving artificial organisms to pass it is the amount of computational power required. Currently, training just one large neural network model on a single embodied task (e.g. control of a body in 3-dimensional space) can take days on specialized distributed hardware 64 . For multiple research groups to iteratively work together to optimize and evaluate a large number of agents over multiple generations on increasingly complex embodied Turing tests, a large investment in a shared computational platform will be required. Much like a particle accelerator in physics or large telescope in astronomy, this sort of large-scale shared resource will be essential for moving the brain-inspired AI research agenda forward. It will require a major organizational effort, with government and ideally also industry support, that has as its central goal scientific progress on animal and human-like intelligence.

Third, we must support fundamental theoretical and experimental research on neural computation . We have learned a tremendous amount about the brain over the last decades, through the efforts of the NIH, in no small measure due to the BRAIN Initiative, and other major funders, and we are now reaching an understanding of the vast diversity of the brain’s individual cellular elements, neurons, and how they function as parts of simple circuits. With these building blocks in place, we are poised to shift our focus toward understanding how the brain functions as an integrated intelligent system. This will require insight into how a hundred billion neurons of a thousand different types, each one communicating with thousands of other neurons, with variable, adaptable connections, are wired together, and the computational capabilities—the intelligence—that emerges. We must reverse engineer the brain to abstract the underlying principles. Taking advantage of the powerful synergies between neuroscience and AI will require program and infrastructure support to organize and enable research across the disciplines at a large scale.

Fortunately, there is now broad political agreement that investments in AI research are essential to humanity’s technological future. Indeed, IARPA (Intelligence Advanced Research Projects Activity) was a pioneer in this field, launching the Machine Intelligence from Cortical Networks (MICrONS) project. This project spearheaded the collection of an unprecedented data set consisting of a portion of a mouse connectome and associated functional responses with the specific goal of catalyzing the development of next-generation AI algorithms 65 . Nonetheless, community-wide efforts to bridge the fields of neuroscience and AI will require robust investments from government resources, as well as oversight of project milestones, commercialization support, ethics, and big bets on innovative ideas. In the U.S., there are currently some lines of federal resourcing, such as the NSF’s National Artificial Intelligence Research Institutes, explicitly dedicated to driving innovation and discovery in AI from neuroscience research, but these are largely designed to support a traditional academic model with different groups investigating different questions, rather than the creation of a centralized effort that could create something like the embodied Turing test. Likewise, AI support grants in the U.S. are predominantly ancillary programs through the NIH, NSF, DoD, and even the EPA—each of which have their own directives and goals—and this pattern is shared by funding agencies globally. This leaves a significant funding gap for technology development as an end in itself. The creation of overarching directives either through existing entities, or as a stand-alone agency, to support NeuroAI and AI research would drive this mission and consequently unlock the potential for AI to benefit humanity.

Conclusions

Despite the long history of neuroscience driving advances in AI and the tremendous potential for future advances, most engineers and computational scientists in the field are unaware of the history and opportunities. The influence of neuroscience on shaping the thinking of von Neumann, Turing and other giants of computational theory is rarely mentioned in a typical computer science curriculum. Leading AI conferences such as NeurIPS, which once served to showcase the latest advances in both computational neuroscience and machine learning, now focus almost exclusively on the latter. Even some researchers who are aware of the historical importance of neuroscience in shaping the field often argue that it has lost its relevance. “Engineers don’t study birds to build better planes” is the usual refrain. However, the analogy fails, in part because pioneers of aviation did indeed study birds [66, 67], and some still do [68, 69]. Moreover, the analogy fails at a more fundamental level: the goal of modern aeronautical engineering is not to achieve “bird-level” flight, whereas a major goal of AI is indeed to achieve (or exceed) “human-level” intelligence. Just as computers exceed humans in many respects, such as the ability to compute prime numbers, so too do planes exceed birds in characteristics such as speed, range and cargo capacity. However, if the goal of aeronautical engineers were indeed to build a machine with the “bird-level” ability to fly through dense forest foliage and alight gently on a branch, they would be well-advised to pay very close attention to how birds do it. Similarly, if AI aims to achieve animal-level common-sense sensorimotor intelligence, researchers would be well-advised to learn from animals and the solutions they evolved to behave in an unpredictable world.

References

Hassabis, D., Kumaran, D., Summerfield, C. & Botvinick, M. Neuroscience-inspired artificial intelligence. Neuron 95, 245–258 (2017).

Macpherson, T. et al. Natural and artificial intelligence: a brief introduction to the interplay between AI and neuroscience research. Neural Netw. 144 , 603–613 (2021).

McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5 , 115–133 (1943).

von Neumann, J. First Draft of a Report on the EDVAC . https://doi.org/10.5479/sil.538961.39088011475779 (1945).

von Neumann, J. The Computer and the Brain (Yale University Press, 2012).

Hubel, D. H. & Wiesel, T. N. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160 , 106–154 https://doi.org/10.1113/jphysiol.1962.sp006837 (1962).

Fukushima, K. Neocognitron: a self organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biol. Cybern. 36 , 193–202 (1980).

LeCun, Y. & Bengio, Y. Convolutional networks for images, speech, and time series. In: The Handbook of Brain Theory and Neural . 255–258 (ACM, 1995).

Thorndike, E. L. Animal intelligence: an experimental study of the associative processes in animals. https://doi.org/10.1037/10780-000 (1898).

Thorndike, E. L. The law of effect. The Am. J. Psychol. 39 , 212 https://doi.org/10.2307/1415413 (1927).

Thorndike, E. L. The fundamentals of learning. https://doi.org/10.1037/10976-000 (1932).

Crow, T. J. Cortical synapses and reinforcement: a hypothesis. Nature 219 , 736–737 (1968).

Rescorla, R. A. A theory of pavlovian conditioning: variations in the effectiveness of reinforcement and nonreinforcement. In: Black, A. H. & Prokasy, W. F. (eds.) Classical Conditioning II: Current Research and Theory . 64–99 (Century-Crofts, 1972).

Klopf, A. H. Brain Function and Adaptive Systems: A Heterostatic Theory (AIR FORCE CAMBRIDGE RESEARCH LABS HANSCOM AFB MA, 1972).

Schultz, W., Dayan, P. & Montague, P. R. A neural substrate of prediction and reward. Science 275 , 1593–1599 (1997).

Campbell, M., Hoane, A. J. & Hsu, F.-H. Deep blue. Artif. Intell. 134 , 57–83 (2002).

Silver, D. et al. Mastering the game of Go with deep neural networks and tree search. Nature 529 , 484–489 (2016).

Reed, S. et al. A generalist agent. https://arxiv.org/abs/2205.06175 (2022).

Sinz, F. H., Pitkow, X., Reimer, J., Bethge, M. & Tolias, A. S. Engineering a less artificial intelligence. Neuron 103 , 967–979 (2019).

Itti, L., Koch, C. & Niebur, E. A model of saliency-based visual attention for rapid scene analysis. IEEE Trans. Pattern Anal. Mach. Intell. 20 , 1254–1259 (1998).

Larochelle, H. & Hinton, G. Learning to combine foveal glimpses with a third-order Boltzmann machine. Adv. Neural Inform. Process. Syst. 23 , 1243–1251 (2010).

Xu, K. et al. Show, attend and tell: neural image caption generation with visual attention. In: Proceedings of the 32nd International Conference on Machine Learning (eds. Bach, F. & Blei, D.) vol. 37, 2048–2057 (PMLR, 2015).

Vaswani, A. et al. Attention is all you need. Adv. Neural Inf. Process. Syst . 30 , 6000–6010 (2017).

Moravec, H. Mind Children: The Future of Robot and Human Intelligence (Harvard University Press, 1988).

Turing, A. M. I.—Computing machinery and intelligence. Mind LIX , 433–460 (1950).

Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33 , 1877–1901 (2020).

Sejnowski, T. Large language models and the reverse turing test. https://arxiv.org/abs/2207.14382 (2022).

Brooks, R. A. Intelligence without representation. Artificial Intelligence . 47 , 139–159 https://doi.org/10.1016/0004-3702(91)90053-m (1991).

Meyer, J.-A. & Wilson, S. W. From Animals to Animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior (Bradford Books, 1991).

Pfeifer, R. & Scheier, C. Understanding intelligence. https://doi.org/10.7551/mitpress/6979.001.0001 (2001).

Pfeifer, R. & Bongard, J. How the Body Shapes the Way We Think: A New View of Intelligence (MIT Press, 2006).

Ortiz, C. L. Why we need a physically embodied turing test and what it might look like. AI Magazine . vol. 37, 55–62 https://doi.org/10.1609/aimag.v37i1.2645 (2016).

Merel, J., Botvinick, M. & Wayne, G. Hierarchical motor control in mammals and machines. Nat. Commun. 10 , 5489 (2019).

Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518 , 529–533 (2015).

Huang, S., Papernot, N., Goodfellow, I., Duan, Y. & Abbeel, P. Adversarial attacks on neural network policies. https://arxiv.org/abs/1702.02284 (2017).

Zador, A. M. A critique of pure learning and what artificial neural networks can learn from animal brains. Nat. Commun. 10 , 3770 (2019).

Bommasani, R. et al. On the opportunities and risks of foundation models. https://arxiv.org/abs/2108.07258 (2021).

Elman, J. L. Learning and development in neural networks: the importance of starting small. Cognition 48 , 71–99 (1993).

Lake, B. M., Ullman, T. D., Tenenbaum, J. B. & Gershman, S. J. Building machines that learn and think like people. Behav. Brain Sci. 40 , e253 (2017).

Doya, K. & Taniguchi, T. Toward evolutionary and developmental intelligence. Curr. Opin. Behav. Sci. 29 , 91–96 https://doi.org/10.1016/j.cobeha.2019.04.006 (2019).

Pehlevan, C. & Chklovskii, D. B. Neuroscience-inspired online unsupervised learning algorithms: artificial neural networks. IEEE Signal Process. Mag. 36 , 88–96 (2019).

Stanley, K. O., Clune, J., Lehman, J. & Miikkulainen, R. Designing neural networks through neuroevolution. Nat. Mach. Intell. 1 , 24–35 (2019).

Gupta, A., Savarese, S., Ganguli, S. & Fei-Fei, L. Embodied intelligence via learning and evolution. Nat. Commun. 12 , 5721 (2021).

Stöckl, C., Lang, D. & Maass, W. Structure induces computational function in networks with diverse types of spiking neurons. bioRxiv. https://doi.org/10.1101/2021.05.18.444689 (2022).

Koulakov, A., Shuvaev, S., Lachi, D. & Zador, A. Encoding innate ability through a genomic bottleneck. bioRxiv . https://doi.org/10.1101/2021.03.16.435261 (2022).

Patterson, D. et al. Carbon emissions and large neural network training. https://arxiv.org/abs/2104.10350 (2021).

Sokoloff, L. The metabolism of the central nervous system in vivo. Handb. Physiol. Sect. I Neurophysiol. 3 , 1843–1864 (1960).

Boahen, K. Dendrocentric learning for synthetic intelligence. Nature 612 , 43–50 (2022).

Dobrunz, L. E. & Stevens, C. F. Heterogeneity of release probability, facilitation, and depletion at central synapses. Neuron 18 , 995–1008 (1997).

Attwell, D. & Laughlin, S. B. An energy budget for signaling in the grey matter of the brain. J. Cereb. Blood Flow. Metab. 21 , 1133–1145 (2001).

Lennie, P. The cost of cortical computation. Curr. Biol. 13 , 493–497 (2003).

Davies, M. et al. Advancing neuromorphic computing with loihi: a survey of results and outlook. Proc. IEEE Inst. Electr. Electron. Eng. 109 , 911–934 (2021).

Cisek, P. & Hayden, B. Y. Neuroscience needs evolution. Philos. Trans. R. Soc. Lond. B Biol. Sci. 377 , 20200518 (2022).

Mathis, A. et al. DeepLabCut: markerless pose estimation of user-defined body parts with deep learning. Nat. Neurosci. 21 , 1281–1289 (2018).

Wu, A. et al. Deep Graph Pose: a semi-supervised deep graphical model for improved animal pose tracking. Adv. Neural Inf. Process. Syst. 33 , 6040–6052 (2020).

Marshall, J. D. et al. Continuous whole-body 3D kinematic recordings across the rodent behavioral repertoire. Neuron 109 , 420–437.e8 (2021).

Pereira, T. D. et al. Publisher Correction: SLEAP: A deep learning system for multi-animal pose tracking. Nat. Methods 19 , 628 (2022).

Merel, J. et al. Deep neuroethology of a virtual rodent. in International Conference on Learning Representations (Association for Computing Machinery, 2020).

Todorov, E., Erez, T. & Tassa, Y. MuJoCo: A physics engine for model-based control. in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IEEE, 2012).

Brockman, G. et al. OpenAI Gym. (2016) https://doi.org/10.48550/arXiv.1606.01540 .

Kitano, H., Asada, M., Kuniyoshi, Y., Noda, I. & Osawa, E. RoboCup: The Robot World Cup Initiative. in: Proceedings of the first international conference on Autonomous Agents . 340–347 (Association for Computing Machinery, 1997).

Bell, R. M. & Koren, Y. Lessons from the Netflix prize challenge. ACM SIGKDD Explorations Newsletter . vol. 9, 75–79 https://doi.org/10.1145/1345448.1345465 (2007).

Doya, K., Ema, A., Kitano, H., Sakagami, M. & Russell, S. Social impact and governance of AI and neurotechnologies. Neural Netw. 152 , 542–554 (2022).

Liu, S. et al. From motor control to team play in simulated humanoid football. https://arxiv.org/abs/2105.12196 (2021).

MICrONS Consortium et al. Functional connectomics spanning multiple areas of mouse visual cortex. bioRxiv https://doi.org/10.1101/2021.07.28.454025 (2021).

Lilienthal, O. Birdflight as the Basis of Aviation: A Contribution Towards a System of Aviation, Compiled from the Results of Numerous Experiments Made by O and G Lilienthal . (Longmans, Green, 1911).

Culick, F. What the Wright Brothers did and did not understand about flight mechanics-In modern terms. in 37th Joint Propulsion Conference and Exhibit (American Institute of Aeronautics and Astronautics, 2001).

Shyy, W., Lian, Y., Tang, J., Viieru, D. & Liu, H. Aerodynamics of Low Reynolds Number Flyers . (Cambridge University Press, 2008).

Akos, Z., Nagy, M., Leven, S. & Vicsek, T. Thermal soaring flight of birds and unmanned aerial vehicles. Bioinspir. Biomim. 5 , 045003 (2010).


Acknowledgements

This Perspective grew out of discussions following the first conference From Neuroscience to Artificially Intelligent Systems (NAISys), held in CSHL in 2020, and includes many of the invited speakers from that meeting. A.Z. would like to thank Cat Donaldson for convincing him that this manuscript would be worth writing, and for contributing to early versions. B.Ö. would like to acknowledge valuable contributions on the manuscript from Diego Aldorando. The authors would like to acknowledge the following funding sources—A.Z.: Schmidt Foundation, Eleanor Schwartz Foundation, Robert Lourie Foundation; S.E.: NIH 5U19NS104649, 5R01NS105349; R.B.: CIFAR, NSERC RGPIN-2020-05105, RGPAS-2020-00031; B.Ö.: NIH R01NS099323, R01GM136972; Y.B.: CIFAR; K.B.: Stanford Institute for Human-Centered Artificial Intelligence, NSF 2223827; A.C.: NIH U19NS123716; J.D.: Semiconductor Research Corporation, DARPA, Office of Naval Research MURI-114407, MURI-N00014-21-1-2801, N00014-20-1-2589, NSF CCF-1231216, NSF 2124136, Simons SCGB 542965; S.G.: Simons Foundation, James S. McDonnell Foundation, NSF CAREER Award; A.M.: Federation of American Scientists; B.O.: NSF IIS-1718991; C.S.: NIH 1R01MH125571-01, R01NS127122, NSF 1922658, Google faculty award; E.S.: Simons Foundation, NIH EY022428; S.S.: NIH NINDS NS053603; D.S.: Simons 543049; A.T.: IARPA D16PC00003, DARPA HR0011-18-2-0025, NIH R01EY026927, R01MH109556, NSF NeuroNex DBI-1707400; D.T.: HHMI.

Author information

These authors contributed equally: Anthony M. Zador, Sean Escola.

Authors and Affiliations

Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA

Anthony Zador & Alexei Koulakov

Department of Psychiatry, Columbia University, New York, NY, 10027, USA

Sean Escola

Mila, Montréal, QC, H2S 3H1, Canada

Blake Richards & Yoshua Bengio

School of Computer Science, McGill University, Montreal, Canada

Blake Richards

Montreal Neurological Institute, McGill University, Montreal, Canada

Department of Neurology & Neurosurgery, McGill University, Montreal, Canada

Learning in Machines and Brains Program, CIFAR, Toronto, Canada

Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, 02138, USA

Bence Ölveczky

Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA

Kwabena Boahen

Google Deepmind, London, N1C 4AG, UK

Matthew Botvinick & Timothy Lillicrap

Flatiron Institute, Simons Foundation, New York, NY, 10010, USA

Dmitri Chklovskii

Department of Neurobiology, University of California Los Angeles, Los Angeles, CA, 90095, USA

Anne Churchland

Department of Bioengineering, Imperial College London, London, SW7 2BW, UK

Claudia Clopath

Department of Brain and Cognitive Sciences, MIT, Cambridge, MA, 02139, USA

James DiCarlo

Department of Applied Physics, Stanford University, Stanford, CA, 94305, USA

Surya Ganguli

Numenta, Redwood City, CA, 94063, USA

Jeff Hawkins

Department of Neuroscience, University of Pennsylvania, Philadelphia, PA, 19104, USA

Konrad Körding

Meta, Menlo Park, CA, 94025, USA

Yann LeCun & David Sussillo

Department of Electrical and Computer Engineering, NYU, Brooklyn, NY, 11201, USA

Media Lab, MIT, Cambridge, MA, 02140, USA

Adam Marblestone

Helen Wills Neuroscience Institute, University of California Berkeley, Berkeley, CA, 94720, USA

Bruno Olshausen & Doris Tsao

Department of Basic Neurosciences, University of Geneva, Genève, 1211, Switzerland

Alexandre Pouget

Center for Neural Science, NYU, New York, NY, 10003, USA

Cristina Savin

Salk Institute for Biological Studies, La Jolla, CA, 92037, USA

Terrence Sejnowski

Departments of Neural Science, Mathematics, and Psychology, NYU, New York, NY, 10003, USA

Eero Simoncelli

Department of Physiology, Northwestern University, Chicago, IL, 60611, USA

Department of Electrical Engineering, Stanford University, Stanford, CA, 94305, USA

David Sussillo

Department of Neuroscience, Baylor College of Medicine, Houston, TX, 77030, USA

Andreas S. Tolias

Contributions

A.Z. conceived the project and wrote the first draft. A.Z., S.E., B.R., B.Ö. provided extensive editing to successive drafts. Y.B., K.B., M.B., D.C., A.C., C.C., J.D., S.G., J.H., K.K., A.K., Y.L., T.L., A.M., B.O., A.P., C.S., T.S., E.S., S.S., D.S., A.T., and D.T. provided additional guidance and edits at various stages.

Corresponding author

Correspondence to Anthony Zador .

Ethics declarations

Competing interests.

The authors declare their engagements with the following relevant for-profit entities—A.Z.: Cajal Neuroscience; S.E.: Herophilus; R.B.: Google DeepMind; B.Ö.: Blackbird Neuroscience; K.B.: Femtosense Inc., Radical Semiconductor, Neurovigil; C.C.: Google DeepMind; S.G.: Meta; J.H.: Numenta; K.K.: Paradromics; T.L.: Google DeepMind; A.M.: Google DeepMind, Kernel; D.S.: Meta; A.T.: Vathes Inc., Upload AI LLC, BioAvatar LLC. The remaining authors declare no competing interests.

Peer review

Peer review information.

Nature Communications thanks Wolfgang Maass and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Zador, A., Escola, S., Richards, B. et al. Catalyzing next-generation Artificial Intelligence through NeuroAI. Nat Commun 14 , 1597 (2023). https://doi.org/10.1038/s41467-023-37180-x

Received: 11 September 2022

Accepted: 03 March 2023

Published: 22 March 2023

DOI: https://doi.org/10.1038/s41467-023-37180-x


Deep Learning: A Comprehensive Overview on Techniques, Taxonomy, Applications and Research Directions

Iqbal H. Sarker

1 Swinburne University of Technology, Melbourne, VIC 3122 Australia

2 Chittagong University of Engineering & Technology, Chittagong, 4349 Bangladesh

Abstract

Deep learning (DL), a branch of machine learning (ML) and artificial intelligence (AI), is nowadays considered a core technology of today’s Fourth Industrial Revolution (4IR or Industry 4.0). Owing to its ability to learn from data, DL technology, which originated from artificial neural networks (ANN), has become a hot topic in the context of computing and is widely applied in various application areas such as healthcare, visual recognition, text analytics, cybersecurity, and many more. However, building an appropriate DL model is a challenging task, due to the dynamic nature of and variations in real-world problems and data. Moreover, the lack of core understanding turns DL methods into black-box machines that hamper development at the standard level. This article presents a structured and comprehensive view of DL techniques, including a taxonomy that considers various types of real-world tasks, both supervised and unsupervised. In our taxonomy, we take into account deep networks for supervised or discriminative learning, unsupervised or generative learning, as well as hybrid learning and relevant others. We also summarize real-world application areas where deep learning techniques can be used. Finally, we point out ten potential aspects for future-generation DL modeling with research directions. Overall, this article aims to draw a big picture of DL modeling that can be used as a reference guide for both academia and industry professionals.

Introduction

In the late 1980s, neural networks became a prevalent topic in the area of Machine Learning (ML) as well as Artificial Intelligence (AI), due to the invention of various efficient learning methods and network structures [52]. Multilayer perceptron networks trained by “backpropagation”-type algorithms, self-organizing maps, and radial basis function networks were among these innovative methods [26, 36, 37]. Although neural networks were successfully used in many applications, interest in researching the topic decreased later on. Then, in 2006, “Deep Learning” (DL) was introduced by Hinton et al. [41], based on the concept of the artificial neural network (ANN). Deep learning became a prominent topic thereafter, resulting in a rebirth of neural network research, and is hence sometimes referred to as “new-generation neural networks”. This is because deep networks, when properly trained, have produced significant success in a variety of classification and regression challenges [52].

Nowadays, DL technology is considered one of the hot topics within the areas of machine learning, artificial intelligence, and data science and analytics, due to its ability to learn from the given data. Many corporations, including Google, Microsoft, and Nokia, study it actively, as it can provide significant results in different classification and regression problems and datasets [52]. In terms of working domain, DL is considered a subset of ML and AI, and thus DL can be seen as an AI function that mimics the human brain’s processing of data. The worldwide popularity of “deep learning” is increasing day by day, as shown in our earlier paper [96] based on historical data collected from Google Trends [33]. Deep learning differs from standard machine learning in terms of efficiency as the volume of data increases, discussed briefly in Section “Why Deep Learning in Today’s Research and Applications?”. DL technology uses multiple layers to represent the abstractions of data to build computational models. While deep learning takes a long time to train a model due to its large number of parameters, it takes a short amount of time to run during testing compared to other machine learning algorithms [127].

While today’s Fourth Industrial Revolution (4IR or Industry 4.0) typically focuses on technology-driven “automation, smart and intelligent systems”, DL technology, which originated from ANN, has become one of the core technologies for achieving this goal [103, 114]. A typical neural network is mainly composed of many simple, connected processing elements or processors called neurons, each of which generates a series of real-valued activations for the target outcome. Figure 1 shows a schematic representation of the mathematical model of an artificial neuron, i.e., a processing element, highlighting the input (Xi), weight (w), bias (b), summation function (∑), activation function (f), and corresponding output signal (y). Neural network-based DL technology is now widely applied in many fields and research areas such as healthcare, sentiment analysis, natural language processing, visual recognition, business intelligence, cybersecurity, and many more, which are summarized in the latter part of this paper.

Fig. 1: Schematic representation of the mathematical model of an artificial neuron (processing element), highlighting input (Xi), weight (w), bias (b), summation function (∑), activation function (f) and output signal (y)
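
To make the neuron model in Fig. 1 concrete, here is a minimal sketch that computes the output signal y = f(sum_i w_i * x_i + b) with a sigmoid activation. It assumes NumPy; the input values, weights, and bias are arbitrary illustrative numbers, not values from the text.

```python
import numpy as np

def sigmoid(z):
    # Activation function f: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative inputs X_i, weights w, and bias b (arbitrary values)
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.4, 0.1, -0.6])
b = 0.2

z = np.dot(w, x) + b   # summation function: sum_i w_i * x_i + b
y = sigmoid(z)         # output signal y = f(z)
print(y)
```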

Although DL models are successfully applied in the various application areas mentioned above, building an appropriate deep learning model is a challenging task, due to the dynamic nature of and variations in real-world problems and data. Moreover, DL models are typically considered “black-box” machines, which hampers the standard development of deep learning research and applications. Thus, for a clear understanding, in this paper we present a structured and comprehensive view of DL techniques considering the variations in real-world problems and tasks. To achieve our goal, we briefly discuss various DL techniques and present a taxonomy that takes into account three major categories: (i) deep networks for supervised or discriminative learning, which are utilized to provide a discriminative function in supervised deep learning or classification applications; (ii) deep networks for unsupervised or generative learning, which are used to characterize the high-order correlation properties or features for pattern analysis or synthesis, and thus can be used as preprocessing for supervised algorithms; and (iii) deep networks for hybrid learning, which integrate both supervised and unsupervised models, and relevant others. We take into account such categories based on the nature and learning capabilities of different DL techniques and how they are used to solve problems in real-world applications [97]. Moreover, identifying key research issues and prospects, including effective data representation, new algorithm design, data-driven hyper-parameter learning and model optimization, integrating domain knowledge, adapting to resource-constrained devices, etc., is one of the key targets of this study, which can lead to “Future Generation DL-Modeling”. Thus, the goal of this paper is to serve as a reference guide for those in academia and industry who want to research and develop data-driven smart and intelligent systems based on DL techniques.

The overall contribution of this paper is summarized as follows:

  • This article focuses on different aspects of deep learning modeling, i.e., the learning capabilities of DL techniques in different dimensions such as supervised or unsupervised tasks, to function in an automated and intelligent manner, which can serve as a core technology of today’s Fourth Industrial Revolution (Industry 4.0).
  • We explore a variety of prominent DL techniques and present a taxonomy by taking into account the variations in deep learning tasks and how they are used for different purposes. In our taxonomy, we divide the techniques into three major categories such as deep networks for supervised or discriminative learning, unsupervised or generative learning, as well as deep networks for hybrid learning, and relevant others.
  • We have summarized several potential real-world application areas of deep learning, to assist developers as well as researchers in broadening their perspectives on DL techniques. Different categories of DL techniques highlighted in our taxonomy can be used to solve various issues accordingly.
  • Finally, we point out and discuss ten potential aspects with research directions for future generation DL modeling in terms of conducting future research and system development.

This paper is organized as follows. Section “Why Deep Learning in Today’s Research and Applications?” motivates why deep learning is important for building data-driven intelligent systems. In Section “Deep Learning Techniques and Applications”, we present our DL taxonomy, taking into account the variations of deep learning tasks and how they are used in solving real-world issues, and briefly discuss the techniques while summarizing the potential application areas. In Section “Research Directions and Future Aspects”, we discuss various research issues of deep learning-based modeling and highlight promising topics for future research within the scope of our study. Finally, Section “Concluding Remarks” concludes this paper.

Why Deep Learning in Today’s Research and Applications?

The main focus of today’s Fourth Industrial Revolution (Industry 4.0) is typically technology-driven automation and smart, intelligent systems in various application areas, including smart healthcare, business intelligence, smart cities, cybersecurity intelligence, and many more [95]. Deep learning approaches have grown dramatically in terms of performance in a wide range of applications, including security technologies, particularly as an excellent solution for uncovering complex architecture in high-dimensional data. Thus, DL techniques can play a key role in building intelligent data-driven systems according to today’s needs, because of their excellent capability to learn from historical data. Consequently, DL can change the world as well as humans’ everyday life through its automation power and learning from experience. DL technology is therefore relevant to artificial intelligence [103], machine learning [97], and data science with advanced analytics [95], which are well-known areas of computer science, particularly today’s intelligent computing. In the following, we first discuss the position of deep learning in AI, i.e., how DL technology is related to these areas of computing.

The Position of Deep Learning in AI

Nowadays, artificial intelligence (AI), machine learning (ML), and deep learning (DL) are three popular terms that are sometimes used interchangeably to describe systems or software that behaves intelligently. In Fig. 2, we illustrate the position of deep learning compared with machine learning and artificial intelligence. According to Fig. 2, DL is a part of ML, which is in turn a part of the broad area of AI. In general, AI incorporates human behavior and intelligence into machines or systems [103], while ML is the method of learning from data or experience [97], which automates analytical model building. DL also represents learning methods from data, where the computation is done through multi-layer neural networks and processing. The term “deep” in the deep learning methodology refers to the concept of multiple levels or stages through which data is processed to build a data-driven model.

Fig. 2: An illustration of the position of deep learning (DL) compared with machine learning (ML) and artificial intelligence (AI)

Thus, DL can be considered one of the core technologies of AI, a frontier for artificial intelligence, which can be used for building intelligent systems and automation. More importantly, it pushes AI to a new level, termed “smarter AI”. As DL is capable of learning from data, there is also a strong relation between deep learning and “Data Science” [95]. Typically, data science represents the entire process of finding meaning or insights in data in a particular problem domain, where DL methods can play a key role in advanced analytics and intelligent decision-making [104, 106]. Overall, we can conclude that DL technology is capable of changing the current world, particularly in terms of a powerful computational engine, and of contributing to technology-driven automation and smart, intelligent systems, thereby meeting the goals of Industry 4.0.

Understanding Various Forms of Data

As DL models learn from data, an in-depth understanding and representation of data are important to build a data-driven intelligent system in a particular application area. In the real world, data can be in various forms, which typically can be represented as below for deep learning modeling:

  • Sequential Data Sequential data is any kind of data where the order matters, i.e., a set of sequences. The sequential nature of the input data needs to be explicitly accounted for while building the model. Text streams, audio fragments, video clips, and time-series data are some examples of sequential data.
  • Image or 2D Data A digital image is made up of a matrix, a rectangular array of numbers, symbols, or expressions arranged in rows and columns. Matrix, pixels, voxels, and bit depth are the four essential characteristics or fundamental parameters of a digital image.
  • Tabular Data A tabular dataset consists primarily of rows and columns. Thus tabular datasets contain data in a columnar format, as in a database table. Each column (field) must have a name and may only contain data of the defined type. Overall, it is a logical and systematic arrangement of data in the form of rows and columns based on data properties or features. Deep learning models can learn efficiently from tabular data and allow us to build data-driven intelligent systems.

The data forms discussed above are common in the real-world application areas of deep learning. Different categories of DL techniques perform differently depending on the nature and characteristics of the data, as discussed briefly in Section “Deep Learning Techniques and Applications” with a taxonomy presentation. However, in many real-world application areas, the standard machine learning techniques, particularly logic-rule or tree-based techniques [93, 101], perform well depending on the nature of the application. Figure 3 shows the performance comparison of DL and ML modeling as a function of the amount of data. In the following, we highlight several cases where deep learning is useful for solving real-world problems, according to our main focus in this paper.

Fig. 3: An illustration of the performance comparison between deep learning (DL) and other machine learning (ML) algorithms, where DL modeling from large amounts of data can increase the performance

DL Properties and Dependencies

A DL model typically follows the same processing stages as machine learning modeling. In Fig. 4, we show a deep learning workflow for solving real-world problems, which consists of three processing steps: data understanding and preprocessing, DL model building and training, and validation and interpretation. However, unlike ML modeling [98, 108], feature extraction in a DL model is automated rather than manual. K-nearest neighbors, support vector machines, decision trees, random forests, naive Bayes, linear regression, association rules, and k-means clustering are some examples of machine learning techniques commonly used in various application areas [97]. On the other hand, DL models include the convolutional neural network, recurrent neural network, autoencoder, deep belief network, and many more, discussed briefly with their potential application areas in Section 3. In the following, we discuss the key properties and dependencies of DL techniques that need to be taken into account before starting to work on DL modeling for real-world applications.

Fig. 4: A typical DL workflow for solving real-world problems, which consists of three sequential stages: (i) data understanding and preprocessing, (ii) DL model building and training, and (iii) validation and interpretation

  • Data Dependencies Deep learning typically depends on a large amount of data to build a data-driven model for a particular problem domain. The reason is that when the data volume is small, deep learning algorithms often perform poorly [64]. In such circumstances, however, the performance of standard machine-learning algorithms can be better, particularly when explicitly specified rules are used [64, 107].
  • Hardware Dependencies DL algorithms require large numbers of computational operations while training a model on large datasets. Because the advantage of a GPU over a CPU grows with the size of the computation, GPUs are mostly used to perform these operations efficiently. Thus, GPU hardware is necessary for deep learning training to work properly, and DL relies more on high-performance machines with GPUs than standard machine learning methods do [19, 127].
  • Feature Engineering Process Feature engineering is the process of extracting features (characteristics, properties, and attributes) from raw data using domain knowledge. A fundamental distinction between DL and other machine-learning techniques is the attempt to extract high-level characteristics directly from data [ 22 , 97 ]. Thus, DL decreases the time and effort required to construct a feature extractor for each problem.
  • Model Training and Execution Time In general, training a deep learning model takes a long time because of the large number of parameters involved. For instance, DL models can take more than a week to complete a training session, whereas training with ML algorithms takes relatively little time, from seconds to hours [107, 127]. During testing, however, deep learning algorithms take very little time to run [127] compared to certain machine learning methods.
  • Black-box Perception and Interpretability Interpretability is an important factor when comparing DL with ML. It is difficult to explain how a deep learning result was obtained, i.e., DL behaves as a “black box”. On the other hand, machine-learning algorithms, particularly rule-based machine learning techniques [97], provide explicit logic rules (IF-THEN) for making decisions that are easily interpretable by humans. For instance, in our earlier works we have presented several rule-based machine learning techniques [100, 102, 105], where the extracted rules are human-understandable and easier to interpret, update, or delete according to the target applications.

The most significant distinction between deep learning and regular machine learning is how well it performs as the volume of data grows. An illustration of the performance comparison between DL and standard ML algorithms is shown in Fig. 3, where DL modeling can increase performance with the amount of data. Thus, DL modeling is extremely useful when dealing with a large amount of data because of its capacity to process vast numbers of features to build an effective data-driven model. In terms of developing and training DL models, deep learning relies on parallelized matrix and tensor operations as well as computing gradients and optimization. Several DL libraries and resources [30], such as PyTorch [82] (with a high-level API called Lightning) and TensorFlow [1] (which also offers Keras as a high-level API), offer these core utilities, including many pre-trained models, as well as many other functions necessary for implementation and DL model building.
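
As a minimal, hedged sketch of the gradient machinery these libraries provide (shown here in PyTorch; the toy loss, parameter values, and learning rate are illustrative assumptions rather than anything from the article), automatic differentiation produces the gradients that optimizers such as SGD or Adam consume:

```python
import torch

# A toy parameter vector and loss; requires_grad tells autograd to track operations on w
w = torch.tensor([1.5, -0.5], requires_grad=True)
x = torch.tensor([2.0, 3.0])
loss = ((w * x).sum() - 1.0) ** 2   # a simple squared-error-style loss (illustrative)

loss.backward()                      # backpropagation: compute d(loss)/dw
print(w.grad)                        # the gradient an optimizer would use

# One manual gradient-descent step with an illustrative learning rate of 0.01
with torch.no_grad():
    w -= 0.01 * w.grad
```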

Deep Learning Techniques and Applications

In this section, we go through the various types of deep neural network techniques, which typically consist of several layers of information-processing stages arranged in hierarchical structures for learning. A typical deep neural network contains multiple hidden layers, including input and output layers. Figure 5 shows the general structure of a deep neural network (number of hidden layers N, with N ≥ 2) compared with a shallow network (a single hidden layer). We also present our taxonomy of DL techniques in this section, based on how they are used to solve various problems. However, before exploring the details of the DL techniques, it is useful to review the various types of learning tasks: (i) Supervised: a task-driven approach that uses labeled training data; (ii) Unsupervised: a data-driven process that analyzes unlabeled datasets; (iii) Semi-supervised: a hybridization of the supervised and unsupervised methods; and (iv) Reinforcement: an environment-driven approach; these are discussed briefly in our earlier paper [97]. Thus, to present our taxonomy, we divide DL techniques broadly into three major categories: (i) deep networks for supervised or discriminative learning; (ii) deep networks for unsupervised or generative learning; and (iii) deep networks for hybrid learning combining both, and relevant others, as shown in Fig. 6. In the following, we briefly discuss each of these techniques, which can be used to solve real-world problems in various application areas according to their learning capabilities.

Fig. 5: A general architecture of (a) a shallow network with one hidden layer and (b) a deep neural network with multiple hidden layers

Fig. 6: A taxonomy of DL techniques, broadly divided into three major categories: (i) deep networks for supervised or discriminative learning, (ii) deep networks for unsupervised or generative learning, and (iii) deep networks for hybrid learning and relevant others

Deep Networks for Supervised or Discriminative Learning

This category of DL techniques is utilized to provide a discriminative function in supervised or classification applications. Discriminative deep architectures are typically designed to give discriminative power for pattern classification by describing the posterior distributions of classes conditioned on visible data [ 21 ]. Discriminative architectures mainly include Multi-Layer Perceptron (MLP), Convolutional Neural Networks (CNN or ConvNet), Recurrent Neural Networks (RNN), along with their variants. In the following, we briefly discuss these techniques.

Multi-layer Perceptron (MLP)

Multi-layer Perceptron (MLP), a supervised learning approach [83], is a type of feedforward artificial neural network (ANN). It is also known as the foundation architecture of deep neural networks (DNN) or deep learning. A typical MLP is a fully connected network consisting of an input layer that receives the input data, an output layer that makes a decision or prediction about the input signal, and one or more hidden layers between these two that are considered the network’s computational engine [36, 103]. The output of an MLP network is determined using a variety of activation functions, also known as transfer functions, such as ReLU (Rectified Linear Unit), Tanh, Sigmoid, and Softmax [83, 96]. MLPs are trained with “backpropagation” [36], the most extensively used supervised learning algorithm, which is also known as the most basic building block of a neural network. During the training process, various optimization approaches such as Stochastic Gradient Descent (SGD), Limited-memory BFGS (L-BFGS), and Adaptive Moment Estimation (Adam) are applied. MLP requires tuning several hyperparameters, such as the number of hidden layers, neurons, and iterations, which can make solving a complicated model computationally expensive. However, through partial fitting, MLP offers the advantage of learning non-linear models in real time, i.e., online [83].
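
A brief, hedged sketch of such an MLP classifier, using scikit-learn's MLPClassifier (consistent with the partial-fit capability noted above); the synthetic dataset and the hyperparameter choices (two hidden layers, ReLU activation, Adam optimizer) are illustrative only:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy labeled dataset (illustrative only)
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers, ReLU activation, Adam optimizer: all tunable hyperparameters
clf = MLPClassifier(hidden_layer_sizes=(64, 32), activation='relu',
                    solver='adam', max_iter=300, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```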

Convolutional Neural Network (CNN or ConvNet)

The Convolutional Neural Network (CNN or ConvNet) [65] is a popular discriminative deep learning architecture that learns directly from the input without the need for human feature extraction. Figure 7 shows an example of a CNN with multiple convolution and pooling layers. As a result, the CNN enhances the design of traditional ANNs such as regularized MLP networks. Each layer in a CNN takes into account optimal parameters for a meaningful output and reduces model complexity. CNNs also use ‘dropout’ [30], which can deal with the problem of over-fitting that may occur in a traditional network.

Fig. 7: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

CNNs are specifically intended to deal with a variety of 2D shapes and are thus widely employed in visual recognition, medical image analysis, image segmentation, natural language processing, and many more areas [65, 96]. The capability of automatically discovering essential features from the input without the need for human intervention makes a CNN more powerful than a traditional network. Several variants of the CNN exist, including the visual geometry group (VGG) networks [38], AlexNet [62], Xception [17], Inception [116], and ResNet [39], which can be used in various application domains according to their learning capabilities.
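
A minimal sketch of a CNN with two convolution-and-pooling stages followed by dropout and a classifier head, written in PyTorch; the layer sizes and the assumption of 28x28 single-channel input are illustrative choices, not taken from the article:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # convolution layer 1
            nn.ReLU(),
            nn.MaxPool2d(2),                               # pooling layer 1
            nn.Conv2d(16, 32, kernel_size=3, padding=1),   # convolution layer 2
            nn.ReLU(),
            nn.MaxPool2d(2),                               # pooling layer 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                 # dropout to reduce over-fitting
            nn.Linear(32 * 7 * 7, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = SmallCNN()
dummy = torch.randn(8, 1, 28, 28)   # a batch of 8 single-channel 28x28 images
print(model(dummy).shape)           # torch.Size([8, 10])
```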

Recurrent Neural Network (RNN) and its Variants

A Recurrent Neural Network (RNN) is another popular neural network, which employs sequential or time-series data and feeds the output from the previous step as input to the current step [27, 74]. Like feedforward networks and CNNs, recurrent networks learn from training input; however, they are distinguished by their “memory”, which allows them to influence the current input and output by using information from previous inputs. Unlike a typical DNN, which assumes that inputs and outputs are independent of one another, the output of an RNN depends on prior elements within the sequence. However, standard recurrent networks suffer from vanishing gradients, which makes learning long data sequences challenging. In the following, we discuss several popular variants of the recurrent network that mitigate these issues and perform well in many real-world application domains.

  • Long short-term memory (LSTM) This is a popular form of RNN architecture that uses special units to deal with the vanishing gradient problem, introduced by Hochreiter et al. [42]. A memory cell in an LSTM unit can store data for long periods, and the flow of information into and out of the cell is managed by three gates. For instance, the ‘Forget Gate’ determines what information from the previous cell state will be kept and what will be removed as no longer useful, the ‘Input Gate’ determines which information should enter the cell state, and the ‘Output Gate’ determines and controls the outputs. As it solves the issues of training a recurrent network, the LSTM network is considered one of the most successful RNN variants.
  • Bidirectional RNN/LSTM Bidirectional RNNs connect two hidden layers that run in opposite directions to a single output, allowing them to use information from both the past and the future. Bidirectional RNNs, unlike traditional recurrent networks, are trained to process both positive and negative time directions at the same time. A Bidirectional LSTM, often known as a BiLSTM, is an extension of the standard LSTM that can increase model performance on sequence classification problems [113]. It is a sequence processing model comprising two LSTMs: one takes the input forward and the other takes it backward. The Bidirectional LSTM in particular is a popular choice in natural language processing tasks.
  • Gated recurrent unit (GRU) A GRU is a simplified variant of the LSTM that merges the cell state and hidden state and uses only two gates, a reset gate and an update gate (Fig. 8), so it has fewer parameters and is typically faster to train while performing comparably on many sequence tasks.

Fig. 8: Basic structure of a gated recurrent unit (GRU) cell consisting of reset and update gates

Overall, the basic property of a recurrent network is that it has at least one feedback connection, which enables activations to loop. This allows the networks to do temporal processing and sequence learning, such as sequence recognition or reproduction, temporal association or prediction, etc. Popular application areas of recurrent networks include prediction problems, machine translation, natural language processing, text summarization, speech recognition, and many more.
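
As an illustrative sketch of an LSTM-based sequence classifier in PyTorch (the feature size, sequence length, hidden size, and number of classes are arbitrary assumptions); setting bidirectional=True gives the BiLSTM variant described above:

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    def __init__(self, n_features=8, hidden=32, n_classes=2, bidirectional=True):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden * (2 if bidirectional else 1)
        self.head = nn.Linear(out_dim, n_classes)

    def forward(self, x):                  # x: (batch, time, features)
        output, (h_n, c_n) = self.lstm(x)
        return self.head(output[:, -1])    # classify from the last time step

model = SequenceClassifier()
x = torch.randn(4, 20, 8)                  # 4 sequences of length 20
print(model(x).shape)                      # torch.Size([4, 2])
```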

Deep Networks for Generative or Unsupervised Learning

This category of DL techniques is typically used to characterize the high-order correlation properties or features for pattern analysis or synthesis, as well as the joint statistical distributions of the visible data and their associated classes [21]. The key idea of generative deep architectures is that precise supervisory information, such as target class labels, is not of concern during the learning process. As a result, the methods in this category are essentially applied to unsupervised learning, as they are typically used for feature learning or data generation and representation [20, 21]. Generative modeling can thus also be used as preprocessing for supervised learning tasks, which can improve the accuracy of the discriminative model. Commonly used deep neural network techniques for unsupervised or generative learning are the Generative Adversarial Network (GAN), Autoencoder (AE), Restricted Boltzmann Machine (RBM), Self-Organizing Map (SOM), and Deep Belief Network (DBN), along with their variants.

Generative Adversarial Network (GAN)

A Generative Adversarial Network (GAN), designed by Ian Goodfellow [32], is a type of neural network architecture for generative modeling that creates new, plausible samples on demand. It involves automatically discovering and learning regularities or patterns in the input data so that the model can be used to generate new examples resembling the original dataset. As shown in Fig. 9, GANs are composed of two neural networks: a generator G that creates new data with properties similar to the original data, and a discriminator D that predicts the likelihood of a subsequent sample being drawn from the actual data rather than from data produced by the generator. Thus, in GAN modeling, the generator and discriminator are trained to compete with each other: while the generator tries to fool and confuse the discriminator by creating more realistic data, the discriminator tries to distinguish the genuine data from the fake data generated by G.

Fig. 9: Schematic structure of a standard generative adversarial network (GAN)

Generally, GANs are designed for unsupervised learning tasks, but they have also proven to be a good solution for semi-supervised and reinforcement learning, depending on the task [3]. GANs are also used in state-of-the-art transfer learning research to enforce the alignment of the latent feature space [66]. Inverse models, such as the Bidirectional GAN (BiGAN) [25], can also learn a mapping from data to the latent space, similar to how the standard GAN model learns a mapping from a latent space to the data distribution. The potential application areas of GANs include healthcare, image analysis, data augmentation, video generation, voice generation, pandemics, traffic control, cybersecurity, and many more, and they are increasing rapidly. Overall, GANs have established themselves as a comprehensive domain of independent data expansion and as a solution to problems requiring a generative approach.
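
A minimal sketch of the generator G, the discriminator D, and one opposing update step for each (PyTorch; the dimensions, learning rates, and the random stand-in for "real" data are illustrative assumptions, and a full training loop would repeat these steps over real batches):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 2

# G maps a latent noise vector to a fake data sample
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# D predicts the probability that a sample came from the real data
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)

real = torch.randn(32, data_dim)        # stand-in for a batch of real data
z = torch.randn(32, latent_dim)
fake = G(z)

# Discriminator step: real samples labeled 1, generated samples labeled 0
loss_D = bce(D(real), torch.ones(32, 1)) + bce(D(fake.detach()), torch.zeros(32, 1))
opt_D.zero_grad(); loss_D.backward(); opt_D.step()

# Generator step: try to make D label the generated samples as real
loss_G = bce(D(fake), torch.ones(32, 1))
opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```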

Auto-Encoder (AE) and Its Variants

An auto-encoder (AE) [31] is a popular unsupervised learning technique in which neural networks are used to learn representations. Typically, auto-encoders are used to work with high-dimensional data, where dimensionality reduction explains how a set of data can be represented compactly. The encoder, the code, and the decoder are the three parts of an autoencoder. The encoder compresses the input and produces the code, which the decoder subsequently uses to reconstruct the input. AEs have recently also been used to learn generative data models [69]. The auto-encoder is widely used in many unsupervised learning tasks, e.g., dimensionality reduction, feature extraction, efficient coding, generative modeling, denoising, anomaly or outlier detection, etc. [31, 132]. Principal component analysis (PCA) [99], which is also used to reduce the dimensionality of huge data sets, is essentially similar to a single-layered AE with a linear activation function. Regularized autoencoders such as sparse, denoising, and contractive autoencoders are useful for learning representations for later classification tasks [119], while variational autoencoders can be used as generative models [56], as discussed below.
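
A minimal sketch of the encoder-code-decoder structure and its reconstruction loss (PyTorch; the 784-dimensional input and 32-dimensional code are illustrative assumptions):

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        # Encoder compresses the input into a low-dimensional code
        self.encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(),
                                     nn.Linear(128, code_dim))
        # Decoder reconstructs the input from the code
        self.decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(),
                                     nn.Linear(128, input_dim))

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code)

model = AutoEncoder()
x = torch.rand(16, 784)                   # a batch of flattened inputs
loss = nn.MSELoss()(model(x), x)          # reconstruction error to minimize
# For a denoising autoencoder, a corrupted copy of x (e.g. x plus noise) would be
# fed to the model while the loss is still computed against the clean x.
```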

Fig. 10: Schematic structure of a sparse autoencoder (SAE) with several active units (filled circles) in the hidden layer

  • Denoising Autoencoder (DAE) A denoising autoencoder is a variant of the basic autoencoder that attempts to improve the representation (to extract useful features) by altering the reconstruction criterion, and thus reduces the risk of learning the identity function [31, 119]. In other words, it receives a corrupted data point as input and is trained to recover the original, undistorted input as its output by minimizing the average reconstruction error over the training data, i.e., cleaning (denoising) the corrupted input. Thus, in the context of computing, DAEs can be considered very powerful filters that can be utilized for automatic pre-processing. A denoising autoencoder, for example, could be used to automatically pre-process an image, thereby boosting its quality for recognition accuracy.
  • Contractive Autoencoder (CAE) The idea behind a contractive autoencoder, proposed by Rifai et al. [90], is to make the autoencoder robust to small changes in the training dataset. In its objective function, a CAE includes an explicit regularizer that forces the model to learn an encoding that is robust to small changes in input values. As a result, the learned representation’s sensitivity to the training input is reduced. While DAEs encourage the robustness of the reconstruction, as discussed above, CAEs encourage the robustness of the representation.
  • Variational Autoencoder (VAE) A variational autoencoder [55] has a fundamentally unique property that distinguishes it from the classical autoencoders discussed above and makes it effective for generative modeling. Unlike traditional autoencoders, which map the input onto a latent vector, VAEs map the input data onto the parameters of a probability distribution, such as the mean and variance of a Gaussian. A VAE assumes that the source data has an underlying probability distribution and then tries to discover the distribution’s parameters. Although this approach was initially designed for unsupervised learning, its use has been demonstrated in other domains such as semi-supervised learning [128] and supervised learning [51].

Although the earlier concept of the AE was typically for dimensionality reduction or feature learning, as mentioned above, AEs have recently been brought to the forefront of generative modeling; even the generative adversarial network is one of the popular methods in this area. AEs have been effectively employed in a variety of domains, including healthcare, computer vision, speech recognition, cybersecurity, natural language processing, and many more. Overall, we can conclude that the auto-encoder and its variants can play a significant role in unsupervised feature learning with neural network architectures.

Kohonen Map or Self-Organizing Map (SOM)

A Self-Organizing Map (SOM) or Kohonen Map [ 59 ] is another form of unsupervised learning technique for creating a low-dimensional (usually two-dimensional) representation of a higher-dimensional data set while maintaining the topological structure of the data. A SOM is also known as a neural network-based dimensionality reduction algorithm that is commonly used for clustering [ 118 ]. A SOM adapts to the topological form of a dataset by repeatedly moving its neurons closer to the data points, allowing us to visualize enormous datasets and find probable clusters. The first layer of a SOM is the input layer, and the second layer is the output layer or feature map. Unlike other neural networks that use error-correction learning, such as backpropagation with gradient descent [ 36 ], SOMs employ competitive learning, which uses a neighborhood function to retain the topological features of the input space. SOMs are widely utilized in a variety of applications, including pattern identification, health or medical diagnosis, anomaly detection, and virus or worm attack detection [ 60 , 87 ]. The primary benefit of employing a SOM is that it makes high-dimensional data easier to visualize and analyze in order to understand the patterns. The reduction of dimensionality and grid clustering make it easy to observe similarities in the data. As a result, SOMs can play a vital role in developing a data-driven effective model for a particular problem domain, depending on the data characteristics.
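The competitive learning rule can be sketched as follows (a minimal NumPy sketch; the grid size, learning-rate schedule, and neighborhood-radius schedule are illustrative assumptions): each sample selects a best matching unit, and that unit together with its grid neighbors is pulled toward the sample.

```python
# Minimal NumPy sketch of SOM training (competitive learning with a Gaussian neighborhood).
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, dim = 10, 10, 3                      # 10x10 map of 3-d weight vectors (illustrative)
weights = rng.random((grid_h, grid_w, dim))
coords = np.stack(np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij"), axis=-1)

def train_som(data, epochs=20, lr0=0.5, sigma0=3.0):
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in data:
            # 1) competition: find the best matching unit (BMU)
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # 2) cooperation: Gaussian neighborhood around the BMU, decaying over time
            t = step / n_steps
            lr, sigma = lr0 * (1 - t), sigma0 * (1 - t) + 1e-3
            grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
            h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
            # 3) adaptation: move the BMU and its neighbors toward the data point
            weights[...] = weights + lr * h * (x - weights)
            step += 1

data = rng.random((200, dim))                        # synthetic data purely for illustration
train_som(data)
```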

Restricted Boltzmann Machine (RBM)

A Restricted Boltzmann Machine (RBM) [ 75 ] is also a generative stochastic neural network capable of learning a probability distribution over its inputs. Boltzmann machines typically consist of visible and hidden nodes, with every node connected to every other node, which helps us understand irregularities by learning how the system behaves under normal circumstances. RBMs are a special case of Boltzmann machines in which connections exist only between the visible layer and the hidden layer, i.e., there are no connections within a layer [ 77 ]. This restriction permits training algorithms, such as the gradient-based contrastive divergence algorithm, to be more efficient than those for general Boltzmann machines [ 41 ]. RBMs have found applications in dimensionality reduction, classification, regression, collaborative filtering, feature learning, topic modeling, and many others. In the area of deep learning modeling, they can be trained in either a supervised or an unsupervised manner, depending on the task. Overall, RBMs can recognize patterns in data automatically and develop probabilistic or stochastic models, which are utilized for feature selection or extraction, as well as for forming a deep belief network.
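The contrastive divergence idea can be sketched as a single CD-1 update (a minimal NumPy sketch; the layer sizes, learning rate, and binary inputs are illustrative assumptions): hidden units are sampled from the data, the visible units are reconstructed, and the weights move toward the difference between the data and reconstruction statistics.

```python
# Minimal NumPy sketch of an RBM weight update with one step of contrastive divergence (CD-1).
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 784, 64, 0.05              # illustrative sizes and learning rate
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0):
    """One CD-1 update on a batch of binary visible vectors v0 of shape (batch, n_visible)."""
    global W, b_v, b_h
    # Positive phase: hidden probabilities and samples given the data
    p_h0 = sigmoid(v0 @ W + b_h)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: reconstruct the visible units, then re-infer the hidden units
    p_v1 = sigmoid(h0 @ W.T + b_v)
    p_h1 = sigmoid(p_v1 @ W + b_h)
    # Gradient approximation: data statistics minus reconstruction statistics
    batch = v0.shape[0]
    W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
    b_v += lr * (v0 - p_v1).mean(axis=0)
    b_h += lr * (p_h0 - p_h1).mean(axis=0)

# v_batch is assumed to be a (batch, 784) array of binary values:
# cd1_update(v_batch)
```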

Deep Belief Network (DBN)

A Deep Belief Network (DBN) [ 40 ] is a multi-layer generative graphical model built by stacking several individual unsupervised networks, such as AEs or RBMs, that use each network's hidden layer as the input for the next layer, i.e., they are connected sequentially. Thus, we can divide DBNs into (i) the AE-DBN, known as the stacked AE, and (ii) the RBM-DBN, known as the stacked RBM, where the AE-DBN is composed of autoencoders and the RBM-DBN is composed of restricted Boltzmann machines, discussed earlier. The ultimate goal is to develop a fast, unsupervised training technique for each sub-network that depends on contrastive divergence [ 41 ]. A DBN can capture a hierarchical representation of input data based on its deep structure. The primary idea behind the DBN is to pre-train unsupervised feed-forward neural networks with unlabeled data before fine-tuning the network with labeled input. One of the most important advantages of the DBN, as opposed to typical shallow learning networks, is that it permits the detection of deep patterns, which allows for reasoning abilities and the capture of the deep difference between normal and erroneous data [ 89 ]. A continuous DBN is simply an extension of a standard DBN that allows a continuous range of decimals instead of binary data. Overall, the DBN model can play a key role in a wide range of high-dimensional data applications due to its strong feature extraction and classification capabilities, and it has become one of the significant topics in the field of neural networks.

In summary, the generative learning techniques discussed above typically allow us to generate a new representation of data through exploratory analysis. As a result, these deep generative networks can be utilized as preprocessing for supervised or discriminative learning tasks, as well as ensuring model accuracy, where unsupervised representation learning can allow for improved classifier generalization.

Deep Networks for Hybrid Learning and Other Approaches

In addition to the above-discussed deep learning categories, hybrid deep networks and several other approaches such as deep transfer learning (DTL) and deep reinforcement learning (DRL) are popular, which are discussed in the following.

Hybrid Deep Neural Networks

Generative models are adaptable, with the capacity to learn from both labeled and unlabeled data. Discriminative models, on the other hand, are unable to learn from unlabeled data yet outperform their generative counterparts in supervised tasks. A framework for training both deep generative and discriminative models simultaneously can enjoy the benefits of both models, which motivates hybrid networks.

Hybrid deep learning models are typically composed of multiple (two or more) deep basic learning models, where the basic model is a discriminative or generative deep learning model discussed earlier. Based on the integration of different basic generative or discriminative models, the below three categories of hybrid deep learning models might be useful for solving real-world problems. These are as follows:

  • Hybrid Model_1: An integration of different generative or discriminative models to extract more meaningful and robust features. Examples could be CNN+LSTM, AE+GAN, and so on (see the CNN+LSTM sketch after this list).
  • Hybrid Model_2: An integration of a generative model followed by a discriminative model. Examples could be DBN+MLP, GAN+CNN, AE+CNN, and so on.
  • Hybrid Model_3: An integration of a generative or discriminative model followed by a non-deep-learning classifier. Examples could be AE+SVM, CNN+SVM, and so on.
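As an illustration of the first category above, the following sketch combines a CNN feature extractor applied per frame with an LSTM over the resulting sequence (Keras API; the clip length, frame size, and class count are illustrative assumptions, not taken from the cited works):

```python
# Sketch of a CNN+LSTM hybrid (Hybrid Model_1): the CNN extracts per-frame features,
# and the LSTM models the temporal sequence of those features.
from tensorflow.keras import layers, models

n_frames, h, w, c, n_classes = 16, 64, 64, 3, 10     # e.g., short video clips (illustrative)

frame_encoder = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", input_shape=(h, w, c)),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
])

inputs = layers.Input(shape=(n_frames, h, w, c))
x = layers.TimeDistributed(frame_encoder)(inputs)    # apply the CNN to every frame
x = layers.LSTM(64)(x)                               # model temporal dependencies
outputs = layers.Dense(n_classes, activation="softmax")(x)

hybrid = models.Model(inputs, outputs)
hybrid.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```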

Thus, in a broad sense, we can conclude that hybrid models can be either classification-focused or non-classification-focused depending on the target use. However, most of the hybrid learning-related studies in the area of deep learning are classification-focused or supervised learning tasks, summarized in Table 1. The unsupervised generative models with meaningful representations are employed to enhance the discriminative models. Generative models with useful representations can provide more informative and low-dimensional features for discrimination, and they can also help enhance the quality and quantity of the training data, providing additional information for classification.

Table 1 A summary of deep learning tasks and methods in several popular real-world application areas

Deep Transfer Learning (DTL)

Transfer learning is a technique for effectively using previously learned model knowledge to solve a new task with minimal training or fine-tuning. In comparison to typical machine learning techniques [ 97 ], DL requires a large amount of training data. As a result, the need for a substantial volume of labeled data is a significant barrier to addressing some essential domain-specific tasks, particularly in the medical sector, where creating large-scale, high-quality annotated medical or health datasets is both difficult and costly. Furthermore, the standard DL model demands a lot of computational resources, such as a GPU-enabled server, even though researchers are working hard to improve this. As a result, Deep Transfer Learning (DTL), a DL-based transfer learning method, can help to address this issue. Figure 11 shows the general structure of the transfer learning process, where knowledge from a pre-trained model is transferred into a new DL model. It is especially popular in deep learning right now since it allows deep neural networks to be trained with very little data [ 126 ].

Fig. 11 A general structure of the transfer learning process, where knowledge from a pre-trained model is transferred into a new DL model

Transfer learning is a two-stage approach for training a DL model that consists of a pre-training step and a fine-tuning step in which the model is trained on the target task. Since deep neural networks have gained popularity in a variety of fields, a large number of DTL methods have been presented, making it crucial to categorize and summarize them. Based on the techniques used in the literature, DTL can be classified into four categories [ 117 ]. These are (i) instance-based deep transfer learning, which utilizes instances from the source domain with appropriate weights; (ii) mapping-based deep transfer learning, which maps instances from the two domains into a new data space with better similarity; (iii) network-based deep transfer learning, which reuses part of the network pre-trained in the source domain; and (iv) adversarial-based deep transfer learning, which uses adversarial techniques to find transferable features that are suitable for both domains. Due to its high effectiveness and practicality, adversarial-based deep transfer learning has exploded in popularity in recent years. Transfer learning can also be classified into inductive, transductive, and unsupervised transfer learning depending on the circumstances of the source and target domains and activities [ 81 ]. While most current research focuses on supervised learning, how deep neural networks can transfer knowledge in unsupervised or semi-supervised learning may gain further interest in the future. DTL techniques are useful in a variety of fields including natural language processing, sentiment classification, visual recognition, speech recognition, spam filtering, and other relevant areas.
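A common form of network-based deep transfer learning can be sketched as follows (Keras API; the MobileNetV2 backbone, input size, and number of target classes are illustrative assumptions): the pre-trained backbone is frozen to preserve source-domain knowledge, and a new task-specific head is trained on the small target dataset.

```python
# Sketch of network-based deep transfer learning: reuse a pre-trained backbone,
# freeze it, and train a new task-specific head on the target data.
from tensorflow.keras import layers, models, applications

n_classes = 5                                        # illustrative number of target classes
base = applications.MobileNetV2(weights="imagenet", include_top=False,
                                input_shape=(224, 224, 3))
base.trainable = False                               # keep the pre-trained (source) knowledge fixed

inputs = layers.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(n_classes, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# Fine-tuning step: fit the head on the target dataset, optionally unfreezing the
# top of the backbone afterwards with a lower learning rate.
```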

Deep Reinforcement Learning (DRL)

Reinforcement learning takes a different approach to solving the sequential decision-making problem than other approaches we have discussed so far. The concepts of an environment and an agent are often introduced first in reinforcement learning. The agent can perform a series of actions in the environment, each of which has an impact on the environment’s state and can result in possible rewards (feedback) - “positive” for good sequences of actions that result in a “good” state, and “negative” for bad sequences of actions that result in a “bad” state. The purpose of reinforcement learning is to learn good action sequences through interaction with the environment, typically referred to as a policy.

Deep reinforcement learning (DRL or deep RL) [ 9 ] integrates neural networks with a reinforcement learning architecture to allow agents to learn the appropriate actions in a virtual environment, as shown in Fig. 12. In the area of reinforcement learning, model-based RL is based on learning a transition model that enables modeling of the environment without interacting with it directly, whereas model-free RL methods learn directly from interactions with the environment. Q-learning is a popular model-free RL technique for determining the best action-selection policy for any (finite) Markov Decision Process (MDP) [ 86 , 97 ]. An MDP is a mathematical framework for modeling decisions based on states, actions, and rewards [ 86 ]. In addition, Deep Q-Networks, Double DQN, Bi-directional Learning, Monte Carlo Control, etc. are used in the area [ 50 , 97 ]. DRL methods incorporate DL models, e.g., deep neural networks (DNNs), based on the MDP principle [ 71 ], as policy and/or value function approximators. A CNN, for example, can be used as a component of an RL agent to learn directly from raw, high-dimensional visual inputs. In the real world, DRL-based solutions can be used in several application areas including robotics, video games, natural language processing, computer vision, and other relevant areas.
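For reference, the tabular Q-learning update mentioned above can be sketched as follows (a simplified Gym-style environment interface, where reset() returns a state and step(a) returns (next_state, reward, done), is assumed; the learning rate, discount factor, and exploration rate are illustrative):

```python
# Minimal NumPy sketch of tabular Q-learning (the model-free method named above).
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection: explore occasionally, otherwise exploit
            if rng.random() < epsilon:
                action = int(rng.integers(n_actions))
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done = env.step(action)   # assumed simplified interface
            # Q-learning update: move Q(s, a) toward the bootstrapped target
            target = reward + gamma * np.max(Q[next_state]) * (not done)
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```

In a DRL method such as a Deep Q-Network, the table Q would be replaced by a neural network approximator, but the update target has the same form.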

Fig. 12 Schematic structure of deep reinforcement learning (DRL) highlighting a deep neural network

Deep Learning Application Summary

During the past few years, deep learning has been successfully applied to numerous problems in many application areas. These include natural language processing, sentiment analysis, cybersecurity, business, virtual assistants, visual recognition, healthcare, robotics, and many more. In Fig. 13, we have summarized several potential real-world application areas of deep learning. Various deep learning techniques according to our presented taxonomy in Fig. 6, which includes discriminative learning, generative learning, as well as hybrid models, discussed earlier, are employed in these application areas. In Table 1, we have also summarized various deep learning tasks and techniques that are used to solve the relevant tasks in several real-world application areas. Overall, from Fig. 13 and Table 1, we can conclude that the future prospects of deep learning modeling in real-world application areas are huge and there is ample scope for further work. In the next section, we also summarize the research issues in deep learning modeling and point out potential aspects for future-generation DL modeling.

Fig. 13 Several potential real-world application areas of deep learning

Research Directions and Future Aspects

While existing methods have established a solid foundation for deep learning systems and research, this section outlines the below ten potential future research directions based on our study.

  • Automation in Data Annotation According to the existing literature, discussed in Section 3, most deep learning models are trained on publicly available datasets that are annotated. However, to build a system for a new problem domain or a recent data-driven system, raw data needs to be collected from relevant sources. Thus, data annotation, e.g., categorization, tagging, or labeling of a large amount of raw data, is important for building discriminative deep learning models or supervised tasks, which is challenging. A technique with the capability of automatic and dynamic data annotation, rather than manual annotation or hiring annotators, particularly for large datasets, could be more effective for supervised learning as well as minimizing human effort. Therefore, a more in-depth investigation of data collection and annotation methods, or designing an unsupervised learning-based solution, could be one of the primary research directions in the area of deep learning modeling.
  • Data Preparation for Ensuring Data Quality As discussed earlier throughout the paper, deep learning algorithms are highly impacted by the quality and availability of data for training, and consequently so is the resultant model for a particular problem domain. Thus, deep learning models may become worthless or yield decreased accuracy if the data is bad, e.g., data sparsity, non-representativeness, poor quality, ambiguous values, noise, data imbalance, irrelevant features, data inconsistency, insufficient quantity, and so on. Consequently, such issues in data can lead to poor processing and inaccurate findings, which is a major problem while discovering insights from data. Thus deep learning models also need to adapt to such rising issues in data, to capture approximated information from observations. Therefore, effective data pre-processing techniques need to be designed according to the nature of the data problem and its characteristics, to handle such emerging challenges, which could be another research direction in the area.
  • Black-box Perception and Proper DL/ML Algorithm Selection In general, it is difficult to explain how a deep learning result is obtained or how a particular model arrives at its ultimate decisions. Although DL models achieve significant performance while learning from large datasets, as discussed in Section 2, this “black-box” perception of DL modeling typically reflects weak statistical interpretability, which could be a major issue in the area. On the other hand, ML algorithms, particularly rule-based machine learning techniques, provide explicit logic rules (IF-THEN) for making decisions that are easier to interpret, update, or delete according to the target applications [ 97 , 100 , 105 ]. If the wrong learning algorithm is chosen, unanticipated results may occur, resulting in a loss of effort as well as of the model's efficacy and accuracy. Thus, by taking into account performance, complexity, model accuracy, and applicability, selecting an appropriate model for the target application is challenging, and in-depth analysis is needed for better understanding and decision making.
  • Deep Networks for Supervised or Discriminative Learning: According to our designed taxonomy of deep learning techniques, as shown in Fig. 6, discriminative architectures mainly include MLP, CNN, and RNN, along with their variants, which are applied widely in various application domains. However, designing new techniques or variants of such discriminative techniques, taking into account model optimization, accuracy, and applicability according to the target real-world application and the nature of the data, could be a novel contribution, which can also be considered a major future aspect in the area of supervised or discriminative learning.
  • Deep Networks for Unsupervised or Generative Learning As discussed in Section 3, unsupervised learning or generative deep learning modeling is one of the major tasks in the area, as it allows us to characterize the high-order correlation properties or features in data, or to generate a new representation of data through exploratory analysis. Moreover, unlike supervised learning [ 97 ], it does not require labeled data due to its capability to derive insights directly from the data and to support data-driven decision making. Consequently, it can be used as preprocessing for supervised learning or discriminative modeling, as well as for semi-supervised learning tasks, which ensures learning accuracy and model efficiency. According to our designed taxonomy of deep learning techniques, as shown in Fig. 6, generative techniques mainly include GAN, AE, SOM, RBM, DBN, and their variants. Thus, designing new techniques or variants of them for effective data modeling or representation according to the target real-world application could be a novel contribution, which can also be considered a major future aspect in the area of unsupervised or generative learning.
  • Hybrid/Ensemble Modeling and Uncertainty Handling According to our designed taxonomy of DL techniques, as shown in Fig. 6, this is considered another major category of deep learning tasks. As hybrid modeling enjoys the benefits of both generative and discriminative learning, an effective hybridization can outperform others in terms of performance as well as uncertainty handling in high-risk applications. In Section 3, we have summarized various types of hybridization, e.g., AE+CNN/SVM. Since a group of neural networks is trained with distinct parameters or with separate sub-sampled training datasets, hybridization or ensembles of such techniques, i.e., DL with DL/ML, can play a key role in the area. Thus, designing effective blended discriminative and generative models, rather than a naive method, could be an important research opportunity to solve various real-world issues including semi-supervised learning tasks and model uncertainty.
  • Dynamism in Selecting Threshold/Hyper-parameter Values, and Network Structures with Computational Efficiency In general, the relationship among performance, model complexity, and computational requirements is a key issue in deep learning modeling and applications. A combination of algorithmic advancements with improved accuracy as well as maintained computational efficiency, i.e., achieving the maximum throughput while consuming the least amount of resources without significant information loss, can lead to a breakthrough in the effectiveness of deep learning modeling in future real-world applications. The concept of incremental approaches or recency-based learning [ 100 ] might be effective in several cases depending on the nature of the target applications. Moreover, assuming network structures with a static number of nodes and layers, fixed hyper-parameter values or threshold settings, or selecting them by a trial-and-error process may not be effective in many cases, as they may need to change due to changes in the data. Thus, a data-driven approach to select them dynamically could be more effective while building a deep learning model in terms of both performance and real-world applicability. Such data-driven automation can lead to future-generation deep learning modeling with additional intelligence, which could be a significant future aspect in the area as well as an important research direction to contribute to.
  • Lightweight Deep Learning Modeling for Next-Generation Smart Devices and Applications: In recent years, the Internet of Things (IoT), consisting of billions of intelligent and communicating things, and mobile communications technologies have become popular for detecting and gathering human and environmental information (e.g., geo-information, weather data, bio-data, human behaviors, and so on) for a variety of intelligent services and applications. Every day, these ubiquitous smart things or devices generate large amounts of data, requiring rapid data processing on a variety of smart mobile devices [ 72 ]. Deep learning technologies can be incorporated to discover underlying properties and to effectively handle such large amounts of sensor data for a variety of IoT applications including health monitoring and disease analysis, smart cities, traffic flow prediction and monitoring, smart transportation, manufacturing inspection, fault assessment, smart industry or Industry 4.0, and many more. Although the deep learning techniques discussed in Section 3 are considered powerful tools for processing big data, lightweight modeling is important for resource-constrained devices, due to their high computational cost and considerable memory overhead. Thus, several techniques such as optimization, simplification, compression, pruning, generalization, important feature extraction, etc. might be helpful in several cases. Therefore, constructing lightweight deep learning techniques based on a baseline network architecture to adapt the DL model for next-generation mobile, IoT, or resource-constrained devices and applications could be considered a significant future aspect in the area.
  • Incorporating Domain Knowledge into Deep Learning Modeling Domain knowledge, as opposed to general knowledge or domain-independent knowledge, is knowledge of a specific, specialized topic or field. For instance, in terms of natural language processing, the properties of the English language typically differ from those of other languages like Bengali, Arabic, French, etc. Thus, integrating domain-based constraints into the deep learning model could produce better results for such a particular purpose. For instance, a task-specific feature extractor considering domain knowledge in smart manufacturing for fault diagnosis can resolve the issues in traditional deep-learning-based methods [ 28 ]. Similarly, domain knowledge in medical image analysis [ 58 ], financial sentiment analysis [ 49 ], cybersecurity analytics [ 94 , 103 ], as well as conceptual data models in which semantic information (i.e., meaningful for a system, rather than merely correlational) [ 45 , 121 , 131 ] is included, can play a vital role in the area. Transfer learning could be an effective way to get started on a new challenge with domain knowledge. Moreover, contextual information such as spatial, temporal, social, and environmental contexts [ 92 , 104 , 108 ] can also play an important role in incorporating context-aware computing with domain knowledge for smart decision making as well as for building adaptive and intelligent context-aware systems. Therefore, understanding domain knowledge and effectively incorporating it into the deep learning model could be another research direction.
  • Designing General Deep Learning Framework for Target Application Domains One promising research direction for deep learning-based solutions is to develop a general framework that can handle data diversity, dimensions, stimulation types, etc. Such a general framework would require two key capabilities: an attention mechanism that focuses on the most valuable parts of the input signals, and the ability to capture latent features that enable the framework to capture distinctive and informative features. Attention models have been a popular research topic because of their intuition, versatility, and interpretability, and they are employed in various application areas like computer vision, natural language processing, text or image classification, sentiment analysis, recommender systems, user profiling, etc. [ 13 , 80 ]. The attention mechanism can be implemented based on learning algorithms such as reinforcement learning, which is capable of finding the most useful part through a policy search [ 133 , 134 ]. Similarly, a CNN can be integrated with a suitable attention mechanism to form a general classification framework, where the CNN is used as a feature learning tool for capturing features at various levels and ranges. Thus, designing a general deep learning framework considering attention as well as latent features for target application domains could be another area to contribute to.

To summarize, deep learning is a fairly open topic to which academics can contribute by developing new methods or improving existing methods to handle the above-mentioned concerns and tackle real-world problems in a variety of application areas. This can also help the researchers conduct a thorough analysis of the application’s hidden and unexpected challenges to produce more reliable and realistic outcomes. Overall, we can conclude that addressing the above-mentioned issues and contributing to proposing effective and efficient techniques could lead to “Future Generation DL” modeling as well as more intelligent and automated applications.

Concluding Remarks

In this article, we have presented a structured and comprehensive view of deep learning technology, which is considered a core part of artificial intelligence as well as data science. It starts with a history of artificial neural networks and moves to recent deep learning techniques and breakthroughs in different applications. Then, the key algorithms in this area, as well as deep neural network modeling in various dimensions are explored. For this, we have also presented a taxonomy considering the variations of deep learning tasks and how they are used for different purposes. In our comprehensive study, we have taken into account not only the deep networks for supervised or discriminative learning but also the deep networks for unsupervised or generative learning, and hybrid learning that can be used to solve a variety of real-world issues according to the nature of problems.

Deep learning, unlike traditional machine learning and data mining algorithms, can produce extremely high-level data representations from enormous amounts of raw data. As a result, it has provided an excellent solution to a variety of real-world problems. A successful deep learning technique must possess the relevant data-driven modeling depending on the characteristics of raw data. The sophisticated learning algorithms then need to be trained through the collected data and knowledge related to the target application before the system can assist with intelligent decision-making. Deep learning has shown to be useful in a wide range of applications and research areas such as healthcare, sentiment analysis, visual recognition, business intelligence, cybersecurity, and many more that are summarized in the paper.

Finally, we have summarized and discussed the challenges faced and the potential research directions, and future aspects in the area. Although deep learning is considered a black-box solution for many applications due to its poor reasoning and interpretability, addressing the challenges or future aspects that are identified could lead to future generation deep learning modeling and smarter systems. This can also help the researchers for in-depth analysis to produce more reliable and realistic outcomes. Overall, we believe that our study on neural networks and deep learning-based advanced analytics points in a promising path and can be utilized as a reference guide for future research and implementations in relevant application domains by both academic and industry professionals.

Declarations

The author declares no conflict of interest.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K. N. and M. Shivakumar.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

  • Open access
  • Published: 19 April 2021

Research progress in optical neural networks: theory, applications and developments

  • Jia Liu 1 ,
  • Qiuhao Wu 1 ,
  • Xiubao Sui 1 ,
  • Qian Chen 1 ,
  • Guohua Gu 1 ,
  • Liping Wang 1 &
  • Shengcai Li 2  

PhotoniX volume  2 , Article number:  5 ( 2021 ) Cite this article


With the advent of the era of big data, artificial intelligence has attracted continuous attention from all walks of life and has been widely used in medical image analysis, molecular and material science, language recognition, and other fields. As the basis of artificial intelligence, neural network research has produced remarkable results. However, because electrical signals are inherently susceptible to interference and their processing speed is tied to energy loss, researchers have turned their attention to light, trying to build neural networks in the optical domain and make full use of the parallel processing ability of light to overcome the problems of electronic neural networks. After continuous research and development, the optical neural network has become a research frontier. Here, we mainly introduce the development of this field, summarize and compare some classical studies and algorithmic theories, and look forward to the future of the optical neural network.

Introduction

As one of the most active fields in computer science, artificial intelligence focuses on simulating the structure of the nervous system by constructing artificial neural networks (ANNs), which establish connections between neurons in the various layers of the network and give it good generalization ability and robustness. Since the 1980s, research on ANNs has made great progress. It has also successfully solved many practical problems that are difficult for modern computers in the fields of pattern recognition, intelligent robotics, automatic control, prediction and estimation, biomedicine, economics, etc., with good intelligence characteristics.

At present, electronic computing is still the most important source of computing power for implementing artificial intelligence algorithms, especially deep ANN models. Although the specific hardware architectures differ, in short they all adopt the von Neumann computing principle and complete the computing task with complex logic circuits and processor chips [ 1 ]. The original neural network architectures used the CPU for computing, but it could not meet the requirement for a large number of floating-point operations in deep networks, especially in the training phase. Moreover, its parallel computing efficiency was too low, and it was quickly replaced by the GPU with its strong parallel computing capability. It can be said that the GPU promoted the development of deep learning.

However, the demand for computational power in deep learning is endless. Limited by the interference of electrical signals, energy consumption, and physical limits [ 2 , 3 ], traditional deep learning has quietly reached a bottleneck, even though electronic components based on silicon can still support it for now. Academia and industry are attempting to find alternative approaches that can overcome the defects of electronics and secure future computing power. Since the speed of light is as high as 300,000 km per second, about 300 times faster than that of an electron, and the information-carrying ability and variety of an optical channel are about 2*10^4 times greater than those of electrical channels, together with high parallelism and strong anti-interference [ 4 , 5 ], light has great advantages in information transmission and optical computing. Replacing electricity with light has become a potential and promising working mode, which is the trend of the times.

Therefore, people have tried to build neural networks by optical means to realize deep learning architectures, and the optical neural network (ONN) has emerged as the times require. It has the characteristics of high bandwidth, high interconnectivity, and internal parallel processing, which can accelerate parts of the operations of software and electronic hardware, potentially up to the "speed of light", and it is a promising method to replace the artificial neural network. In a photonic neural network, matrix multiplication can be performed at the speed of light, which can effectively handle the dense matrix multiplications in the artificial neural network, so as to reduce the consumption of energy and time. Moreover, the nonlinearity in the ANN can also be realized by nonlinear optical elements. Once the training of the optical neural network is completed, the entire structure can perform optical signal calculation at the speed of light without additional energy input. In 1978, Goodman of Stanford University first proposed the theoretical model of the optical vector-matrix multiplier [ 6 ], which became an important step in optical computing [ 7 , 8 ] and promoted the development of the optical matrix multiplier (OMM) [ 9 , 10 ] and the photonic neural network.

In this paper, we discuss a hot topic in the field of deep learning: optical deep learning, that is, building and training a neural network by optical means instead of the traditional artificial neural network. Such a network has a large number of linear layers connected with each other. The structure of the paper is as follows. The first chapter briefly introduces how the artificial neural network developed into the optical neural network. An ANN is mainly composed of two core components, a linear part and a nonlinear activation, and is then trained to adjust and optimize the weights of each connection until the network converges. Therefore, the second and third chapters start from these two core operations respectively and, after introducing the basic principles, describe in detail how researchers realize the linear operation and the nonlinear activation function optically, so as to successfully build the optical neural network. The fourth chapter elaborates, according to the different training methods, the particular training processes of optical neural networks, and compares experiments and results for some typical applications. Finally, in the fifth chapter, we analyze and discuss the optical neural network and describe possible future research directions and developments of the ONN; a brief and to-the-point summary is given in the sixth chapter.

The optical realization of linear operation

In the introduction we mentioned that the ONN is the optical implementation of both the linear and the nonlinear operations of the ANN. According to the structure of the ANN [ 11 ] and the working principle of neurons [ 12 ], namely the linear operation \( z_i = b_i + \sum_j W_{ij} x_j \) and the nonlinear activation \( a_i = \phi(z_i) \) in Fig.  1 , it can be seen that the neural network requires a large number of linear multiplication and summation operations. The most direct embodiment of such a multiply-and-sum operation in an algorithm is to take two groups of data and carry out the multiplications and additions in a "for" loop. If we think about this problem simply, we will find that many iterations are needed to complete this operation, which wastes a lot of computing resources. Thus, people began to seek a faster method, vectorization, which turns the computation into the multiplication of two matrices, namely the input matrix and the weight matrix.

Fig. 1 a The structure of ANN [ 11 ]. b The working principle of neurons [ 12 ]

We know that it is easy to achieve the computation between two matrices using an electronic computer, but it becomes difficult when the matrix dimension is very large. For example, to multiply two matrices of size n*n, \( n^3 \) multiplications and \( n^3 \) additions need to be performed, which is \( 2n^3 \) operations in total. If n is very large, say 1024, it requires 2,147,483,648 calculations, a huge number. It can be seen that using a computer to carry out such multiplication is very time-consuming. However, if the high speed, high parallelism, and anti-interference of light are used to realize this operation by optical means, it is likely to require only a few or even a single operation. In the training of a neural network, the data we need to process and analyze is extremely large. Here, the characteristics of optics are extremely important and can bring great convenience to the calculation. The appearance of the optical matrix multiplier lays the foundation of optical calculation and provides a development path for the optical implementation of neural networks.

Next, we briefly introduce the optical matrix multiplier, which is the basic optical realization of the linear multiply-and-sum operation, namely matrix multiplication, and then explain how the linear operation is realized in optical neural networks according to the different principles used to implement the multiplication.

Optical matrix multiplier

Matrix multiplication is a very important operation in matrix algebra, and its calculation process is complicated. Simply put, in the multiplication of two matrices the elements of row i of the first matrix and of column j of the second matrix are multiplied and added one by one, giving the result matrix element \( c_{ij} \); this is also called the inner product operation. The full result matrix can be obtained by traversing the rows or columns of the two matrices once. If \( A={\left({a}_{ij}\right)}_{m\times s} \) and \( B={\left({b}_{ij}\right)}_{s\times n} \), the matrix multiplication operation is defined as follows: \( C=AB={\left({c}_{ij}\right)}_{m\times n} \) with \( {c}_{ij}=\sum_{k=1}^{s}{a}_{ik}{b}_{kj} \).

In fact, multiplication can be viewed as a process of repeated accumulation, and, correspondingly, matrix multiplication is the sum of several such repeated accumulations. In an electronic computer, the accumulator, as the core arithmetic unit, can be used to carry out the matrix multiplication operation. Similarly, such an optical multiplier can be designed as the core of a photonic computing system, which has two-dimensional parallelism. Optical multiplication is the process in which optical information is loaded and converted, and the optical multiplier is responsible for realizing this process. The principle of the optical multiplier is sketched in Fig.  2 (a).

Fig. 2 Optical matrix multiplier. a The optical multiplier. b The structure of the vector-matrix multiplication system [ 6 ]. c Matrix-matrix multiplication realized by a 4f system [ 13 ]

If the function f in the graph is replaced by a matrix, the graph can be simply represented as multiplication between matrices. And matrices can be considered as a combination of vectors, so we can start from multiplying vectors by matrices to multiplying matrices by matrices. The vector-matrix multiplication system model was first proposed by Goodman [ 6 ]. After continuous research and improvement by scholars, the final structure of vector-matrix multiplier is shown in Fig. 2 (b).

Let us take an m*n matrix A multiplied by an n-dimensional vector B to get an m-dimensional vector C. Firstly, vector B is realized with a linear-array light source, where the light intensities of the n sources of the linear array correspond to the elements of the input vector B. Then, the light beam emitted by the linear-array source passes through the collimation lens L1 to form parallel light and irradiates the cylindrical lens CL1. Due to its fan-out effect in the horizontal direction, B is duplicated by CL1 in the vertical direction to form a light band. After that, the beam reaches the SLM, which is controlled by the computer to load the matrix A, and the two are multiplied. Then, the beam passes through the collimating lens CL2. Due to its fan-in effect in the vertical direction, the light of all pixels in the i-th row of the SLM is concentrated on the i-th detector of the CCD. It can be seen that in vector-matrix multiplication, the optical system first replicates the vector and expands it into a matrix, and then multiplies it with another matrix; from another point of view, this is a special kind of matrix-matrix multiplication. In 1993, an optical 4f system was proposed to realize the multiplication between matrices, as shown in Fig. 2 (c), which mainly uses the Fourier transform of a lens and the convolution principle [ 13 ].
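Numerically, the fan-out, modulation, and fan-in stages described above amount to an ordinary matrix-vector product, as the following NumPy sketch illustrates (the array sizes are illustrative assumptions):

```python
# NumPy sketch mirroring the three stages of the optical vector-matrix multiplier.
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 6                            # illustrative sizes
A = rng.random((m, n))                 # matrix loaded onto the SLM
B = rng.random(n)                      # input vector encoded as source intensities

fan_out = np.tile(B, (m, 1))           # CL1: replicate the vector into a light band
modulated = A * fan_out                # SLM: element-wise modulation by the matrix
C = modulated.sum(axis=1)              # CL2 + detectors: sum each row onto one pixel

assert np.allclose(C, A @ B)           # identical to the standard matrix-vector product
```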

The optical matrix multiplier fully embodies the parallel computing power of light, and the optical linear operation completed by OMM is essentially to realize the modulation of information-carrying light by means of certain approaches and some properties of light, such as diffraction, interference and so on.

Diffraction of light to realize linear operation

Light travels along a straight line in the air. When encountering an obstacle or a small hole, light deviates from the straight-line propagation path, resulting in an uneven distribution of light intensity, a phenomenon called diffraction. After the discovery of diffraction in 1665, it attracted the attention of many scholars, who invested a great deal of effort in this field and, after long-term development, formed a complete theoretical system.

In 1678, the Dutch physicist Huygens proposed that every point on a wave surface could be regarded as the source of an emitted secondary wave, each emitting a spherical secondary wave; at a later time, the envelope surface of these secondary waves would be the new wave surface at that time. This is the Huygens principle [ 14 ]. Although the Huygens principle explains the refraction, reflection, and birefringence of light well, it does not involve the analysis of light-wave intensity and wavelength, and cannot explain the diffraction phenomenon well. After the appearance of Young's double-slit interference experiment in 1810 [ 15 ], Fresnel supplemented the Huygens principle with the coherent superposition of wavelets in 1815 and developed the qualitative Huygens principle into a semi-quantitative principle with mathematical proof, called the Huygens-Fresnel principle [ 16 ], expressed as \( \overset{\sim }{E}(P)=\frac{A}{i\lambda}\underset{\Sigma}{\iint}\frac{\exp\ (ikR)}{R}K\left(\theta \right) d\sigma \) . However, this principle is only semi-quantitative: there is no specific functional representation of the tilt factor, and the meaning of the proportionality coefficient is not clear, so it has limitations. Therefore, Kirchhoff and Sommerfeld derived diffraction formulas from the general wave theory and gave the specific form of the tilt factor and the proportionality coefficient. Kirchhoff used Green's theorem [ 17 ] to solve the Helmholtz equation [ 18 ], obtained the complex amplitude of monochromatic light in free space, and finally arrived at the Kirchhoff integral theorem [ 19 ], which gives concrete expression to the basic concepts of the Huygens-Fresnel principle. Kirchhoff's diffraction formula is as follows: \( U\left({P}_0\right)=\frac{A}{j\lambda}\underset{\Sigma}{\iint}\frac{e^{j k\left(r+l\right)}}{rl}\frac{\mathit{\cos}<\overrightarrow{n},\overrightarrow{r}>-\mathit{\cos}<\overrightarrow{n},\overrightarrow{l}>}{2} ds \) . Although Kirchhoff's diffraction formula works well in practice, the boundary conditions of the Kirchhoff hypothesis violate the potential field theorem [ 20 ]. Therefore, Sommerfeld adopted another Green's formula to overcome the problem that the Kirchhoff boundary-condition assumption violates the potential theorem, making the theory self-consistent. Its specific form is: \( \left\{\begin{array}{c}{U}_{\mathrm{I}}\left({P}_1\right)=\frac{A}{j\lambda}\underset{\Sigma}{\iint}\frac{e^{j k\left(r+l\right)}}{rl}\mathit{\cos}<\overrightarrow{n},\overrightarrow{r}> ds\\ {}{U}_{\mathrm{I}\mathrm{I}}\left({P}_1\right)=-\frac{A}{j\lambda}\underset{\Sigma}{\iint}\frac{e^{j k\left(r+l\right)}}{rl}\mathit{\cos}<\overrightarrow{n},\overrightarrow{l}> ds\end{array}\right. \) , which is the Rayleigh-Sommerfeld equation [ 21 ].

The above equations are all based on Fresnel diffraction. In addition, Fraunhofer diffraction was also discovered [ 22 ]; it is a special case of Fresnel diffraction and belongs to far-field diffraction. Because the Fraunhofer diffraction field is easy to calculate theoretically, has great application value, and is not difficult to realize experimentally, people pay more attention to it. In particular, the rise of Fourier optics in modern transform optics endows classical Fraunhofer diffraction with new modern optical significance. With the rise of the optical Fourier transform, the transformation from the space domain to the frequency domain is realized, light can represent more content, and the distribution of light in Fresnel diffraction can be analyzed in more detail. Kirchhoff and Rayleigh-Sommerfeld diffraction both describe the propagation of light in the spatial domain, while the propagation of light in the frequency domain is summarized as the angular spectrum theory [ 23 ].

Diffraction is a very broad optical phenomenon that encompasses a lot of content, and the theories related to it can be collectively called diffraction theory. Because light is an electromagnetic wave, the diffraction problem cannot be separated from the classical electromagnetic field theory based on Maxwell's equations, and the electromagnetic field is a vector field, so the strict diffraction theory is the vector diffraction theory. When only one component of the light vector matters, or when the propagation and polarization state of the diffracted light are not involved and the aperture is much larger than the wavelength of light, the light can be regarded as a scalar; accordingly, this is the scalar diffraction theory.

The implementation based on Rayleigh-Sommerfeld equation

Any obstacle can cause light to diffract, but only when the size of the obstacle or hole is smaller than or comparable to the wavelength of light can an obvious diffraction phenomenon be observed. Diffraction produces numerous wavelets at the small aperture. These wavelets superimpose on each other when they reach the viewing screen; the degree of reinforcement or cancellation varies regularly across the overlap, forming bright and dark fringes. In fact, diffraction is the coherent superposition of infinitely many continuous wavelets, which is mathematically represented as an integral. Therefore, the optical diffraction phenomenon can be used to design the linear operation of the optical neural network and realize the linear multiply-and-sum operation of the neural network.

According to the Rayleigh-Sommerfeld equation of diffraction theory, we can regard each neuron of a given diffraction layer as a secondary wave source described by the optical model \( {w}_i^l\left(x,y,z\right)=\frac{z-{z}_i}{r^2}\left(\frac{1}{2\pi r}+\frac{1}{j\lambda}\right)\exp \left(\frac{j2\pi r}{\lambda}\right) \) , which is also the basis of many diffractive network architectures. In these networks, the transmittance is taken as a learnable parameter W , and training and learning are then carried out to complete identification and classification tasks. Under normal circumstances, when a network using the diffraction principle modulates light waves and analyzes the diffracted light, it is assumed that the vibration direction of the light vector does not change across the whole light-wave field, or that only one component of the light vector is considered, so vector diffraction is generally simplified to scalar diffraction.
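For illustration, the secondary-source weight above can be evaluated numerically as follows (a NumPy sketch; the wavelength, geometry, and the learned complex transmittances t_i are illustrative or hypothetical placeholders, not values from the cited works):

```python
# NumPy sketch of the Rayleigh-Sommerfeld secondary-source weight used in diffractive
# networks: the complex field a neuron at (xi, yi, zi) contributes at a point (x, y, z).
import numpy as np

def rs_weight(x, y, z, xi, yi, zi, wavelength):
    r = np.sqrt((x - xi) ** 2 + (y - yi) ** 2 + (z - zi) ** 2)
    return ((z - zi) / r ** 2) * (1.0 / (2 * np.pi * r) + 1.0 / (1j * wavelength)) \
           * np.exp(1j * 2 * np.pi * r / wavelength)

# The field at one point of the next layer is the coherent sum over all neurons of the
# previous layer, each scaled by its learned complex transmittance t_i (hypothetical):
# E_next = sum_i t_i * rs_weight(x, y, z, xi[i], yi[i], zi, wavelength) * E_prev[i]
```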

In June 2018, Lin Xing, a researcher from the University of California, Los Angeles (UCLA), and colleagues innovatively proposed an all-optical deep learning framework based on light diffraction, which they called the diffractive deep neural network (D²NN) [ 24 ]. A D²NN is composed of multiple diffraction surfaces that form the physical layers; through the cooperation of these diffraction surfaces, the linear operation of the neural network can be performed in the form of light. The principle and structure of the whole network are shown in Fig.  3 (a): it is composed of an input layer, several diffraction layers, and an output layer. In the input layer, the information is encoded into the amplitude channel or phase channel of the input surface by irradiating it with coherent light. There are several holes on the input surface, and diffraction of the beam occurs there, which results in coherent superposition of wavelets, changes the amplitude and phase of the input wave, and completes the encoding process. Through optical diffraction, light goes from the input layer through the diffraction layers to the output layer, achieving layer-by-layer connections. Similar to the input layer, each diffraction layer has certain parameters. Light at terahertz frequencies can transmit through the diffraction layer; after being modulated by the layer's parameters, the coherent superposition of wavelets is carried out, so as to realize the modulation of the light wave and complete the forward propagation, that is, the optical linear calculation of the neural network. In the output layer, a photoelectric detector array detects the output light intensity. In 2019, this research group also proposed a broadband diffractive neural network based on the same architecture [ 27 ], which means the model's demand for light sources is no longer limited to monochromatic coherent sources and it can process information modulated by temporally incoherent light sources, expanding the application range of ONNs realized with this architecture.

Fig. 3 Optical neural networks using light diffraction to realize linear operation. a Schematic diagram of the diffractive deep neural network D²NN [ 24 ]. b Diffraction grating network system [ 25 ]. c Metasurface implementation of optical logic operations [ 26 ]

The 3D-printed deep diffractive optical neural network architecture achieves high-speed and low-power calculation, which is unique and innovative, but it still has some big problems. The first is the diffraction layer. Although the manufacturing cost of the diffraction layer is relatively low and the accuracy can reach 91.75%, it is difficult to miniaturize the device or to handle complex data and image analysis. Moreover, none of the parameters can be reprogrammed after 3D printing. The second problem is the light source: the THz source used in this study makes the system expensive and bulky. The third is the surrounding experimental environment. In this study, an optical platform is required to carry the network architecture, and because of the optical diffraction, the requirements on the surrounding environment, such as vibration and the optical environment, are quite severe. Ozcan said that although the research uses light at terahertz frequencies, it will also be possible to use light at visible, near-infrared, or other frequencies in the future, and such networks could also be made by photolithography or other techniques. Therefore, inspired by the diffractive deep neural network, more and more scholars have begun to devote themselves to the study of variants based on the D²NN.

In December 2019, a team from Tianjin University developed a matrix grating to replace the 3D-printed diffraction layer [ 25 ], using a carbon dioxide laser tube to emit 10.6 μm infrared light that is detected by an HgCdTe detector array, as shown in Fig. 3 (b). Similarly, the superposition of light waves is realized through the diffraction at each slit of the grating and the interference between slits, thereby achieving the optical linear operation. It is worth mentioning that the infrared light source used in the network has the following advantages: firstly, it reduces the cost of the whole network architecture; secondly, the size of a single neuron can be reduced to 5 μm, and the characteristic size is reduced by a factor of 80 compared with the previous network. In this way, a 1 mm*1 mm matrix grating can contain 200*200 neurons, and the distance between layers can also be shortened. The miniaturized matrix grating will be very helpful for integration into silicon photonic platforms and for acquiring more extensive applications.

In 2020, a team from Zhejiang University proposed realizing optical logic operations with metasurfaces based on a diffractive neural network [ 26 ]. The optical logic computation is treated as a classification task: an optical logic unit is designed based on the diffractive neural network, and the logic operation is finally realized. The feasibility and completeness of this method are proved theoretically. Figure  3 (c) shows the layout of the diffractive neural network for optical logic operations. Each region of the input layer is assigned a specific logical operator or an input logical state, with two different states of light transmittance. In other words, the input layer only needs to set the transmission state of each region, and then the input plane wave can be spatially encoded for a specific optical logic operation. The hidden layer is composed of the metasurfaces. According to the Huygens-Fresnel diffraction principle, taking the AND, OR, and NOT logic units as examples, the hyperparameters and weight coefficients of the diffractive neural network are obtained through learning and training. Then, according to these parameters, an efficient dielectric metasurface is used to construct the phase mask. As a hidden layer, it is designed to decode the encoded input light and generate the output logic state. Two regions are set in the output layer, and the light passing through the hidden layer is directionally scattered by the metasurface to one of the two designated regions in the output layer. Compared with the 3D-printed diffraction system and the matrix-grating network system, this method does not require a complex optical control system and only needs a simple plane wave as input. By selectively activating sub-regions of the input layer, different logic functions can be realized.

To sum up, the D²NN based on the Rayleigh-Sommerfeld equation is able to perform, at a speed close to the speed of light and without energy consumption, various complex functions that traditional computer-based neural networks can achieve. It opens up new opportunities for using passive, artificial-intelligence-based components to quickly analyze data and images and classify objects, so as to realize all-optical image analysis, feature detection, and object classification. For example, a driverless car using this technology could respond immediately to a stop sign: as soon as it receives the light diffracted from the sign, the D²NN can read the sign's information. The technology can also be used to categorize large numbers of targets, such as looking for indications of disease in millions of cell samples. In addition, new camera designs and optical components that use a D²NN to perform tasks can be implemented passively in medical technology, robotics, security, and any application that requires image and video data. For example, an all-optical diffractive neural network can be used to construct holograms that realize terahertz imaging at very low cost through 3D printing [ 28 ], reconstructing high-quality images at high speed.

The implementation based on the Fourier transform

The Fourier transform of light is also a member of the great family of diffraction phenomena: it grows out of Fraunhofer diffraction and plays an extremely important role in modern optics because of special properties such as the convolution theorem. In Fourier optics, a Fourier lens performs the Fourier transform, converting between the space (time) domain and the frequency domain. According to the convolution theorem [ 29 ], the convolution of two two-dimensional continuous functions in the space domain can be obtained as the inverse Fourier transform of the product of their Fourier transforms; conversely, convolution in the frequency domain corresponds to the Fourier transform of a product in the space domain. Hence, a convolution can be carried out by multiplying in the frequency domain and then applying an inverse Fourier transform.
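
As a quick numerical sanity check of the convolution theorem that the lens-based architectures below rely on, the following Python sketch (an illustration only; the array sizes and random test data are arbitrary assumptions) compares a direct 2D convolution with the FFT route of transform, multiply, and inverse transform:

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
image = rng.random((64, 64))          # stand-in for an input light field (intensity)
kernel = rng.random((64, 64))         # stand-in for the mask placed in the Fourier plane

# Direct (space-domain) linear convolution for reference.
direct = convolve2d(image, kernel, mode="full", boundary="fill")

# Convolution theorem: conv(image, kernel) = IFFT( FFT(image) * FFT(kernel) ).
# Zero-pad both arrays to the full linear-convolution size to avoid wrap-around.
shape = (image.shape[0] + kernel.shape[0] - 1, image.shape[1] + kernel.shape[1] - 1)
via_fft = np.fft.ifft2(np.fft.fft2(image, shape) * np.fft.fft2(kernel, shape)).real

print(np.allclose(direct, via_fft))   # True: both routes give the same convolution
```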

Moreover, one of the simplest and most basic functions of a lens is to converge a light beam, which can, to a certain extent, be regarded as a summation operation. Therefore, the Fourier-transform property of a lens and its ability to converge and superpose light waves can together be used to realize the linear multiply-and-sum operation of an optical neural network.

In 1989, Tai Wei Lu proposed a two-dimensional programmable optical neural network based on a lens-array interconnection structure [ 30 ]. Linear summation is realized by the lens array, and the system offers good parallel computing and programming capability, but imaging aberrations and light detection severely limit the number of neurons. In 1997, Yang's team used a coaxial lens array to build an optical neural network with 32 × 32 neurons [ 31 ], which significantly reduced aberration and improved light efficiency. Earlier, in 1993, Yasunori Kuratomi proposed an optical neural network with vector feature extraction [ 32 ]. The network consists of four layers: an input layer, two hidden layers and an output layer. In the input layer, a flat plate converts letters into a binary grid pattern. In hidden layer 1, a 2 × 2 lens array implements four feature-extraction planes that extract characteristic line segments and focus them onto a feature-extracting optical neuron device (FEOND), which acts as hidden layer 2 and extracts the feature vector. The FEOND output is read out with a readout beam through crossed polarizers and detected by a CCD for the recognition task. The neurons in the output layer are fully connected with those in hidden layer 2.

In the early stage, lens-based networks exploited the focusing and light-gathering function of lenses to achieve the linear summation operation. As Fourier theory matured, the convolution theorem began to be exploited as well.

In August 2018, Julie Chang et al. from Stanford University proposed an optoelectronic hybrid neural network based on diffractive optical elements [ 33 ]. A layer of optical convolution is added in front of the electronic computation; it contains a "4f system" composed of two convex lenses, both of focal length f, which realizes two cascaded Fourier transforms, as shown in Fig.  4 (a). Because the convolution is performed optically, the computational load of the whole network is greatly reduced.

figure 4

Optical neural networks with linear operation realized by Fourier transform of light [ 33 ]. a Optical convolution operation is realized by 4f system. b Experimental realization of optical neurons in AONN [ 34 ]. c The linear operating system of AONN [ 34 ]

In September 2019, researchers from the Hong Kong University of Science and Technology demonstrated an all-optical neural network (AONN) [ 34 ] with tunable linear operations and an optical nonlinear activation function. Figure  4 (b) shows the experimental scheme of one optical neuron, whose linear operation is implemented programmably by a spatial light modulator (SLM) and a Fourier lens. In the linear operation, laser spots represent the elements of the input vector, and the SLM splits the laser beam into different directions; the incident light power in different regions of the SLM represents the different input-layer nodes. By superposing multiple phase gratings, the incident light is diffracted into different directions with prescribed weights. The Fourier lens then superimposes all diffracted beams travelling in the same direction onto a single point in its focal plane, realizing the linear summation. The complete linear operating system is shown in Fig. 4 (c).

Interference of light to realize linear operation

When multiple beams with the same frequency, the same vibration direction and a fixed phase difference are superimposed in space, the resulting intensity distribution differs from the simple sum of the original beam intensities; this phenomenon is called interference [ 35 ]. Interference and diffraction are essentially the same: both are superpositions of waves with a non-uniform spatial distribution of light and dark. They differ, however, in their forming conditions, distribution rules and mathematical treatment. Diffraction is the superposition of innumerable elementary amplitude contributions and is calculated by integration, whereas interference is the superposition of a finite number of beams and is calculated by summation. Diffraction can thus be regarded as a complex form of interference, and in practice the two often occur together. Both can be used to realize linear summation.

Shen, Y. et al. proposed a new photonic chip system for an all-optical neural network, as shown in Fig.  5 [ 36 ]. The computation performed by the beams in the photonic chip follows the basic principle of interference, and the linear operation is realized by a cascaded array of 56 programmable Mach-Zehnder interferometers (MZIs). The network consists of a cascade of optical interference units (OIUs) and optical nonlinearity units (ONUs). In the OIU, matrix multiplication is based on the singular value decomposition (SVD): any real matrix M can be decomposed as M = U Σ V †, where U and V † can be implemented with optical beam splitters and phase shifters and Σ can be realized with optical attenuators. By tuning the phase shifters integrated in the MZIs, a matrix operation of any size can be applied to the input. This approach lets multiple propagating beams interfere, and the resulting interference pattern conveys the desired result of the operation. In principle, an optical chip with this architecture can run conventional artificial-intelligence algorithms much faster than traditional electronic chips while consuming less than one-thousandth of the energy.
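
The SVD factorization underlying the OIU can be sketched numerically; in the sketch below (illustrative only, with a randomly chosen 4 × 4 real matrix) U and V† stand for the unitary MZI meshes and the diagonal Σ for the attenuator stage:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((4, 4))          # arbitrary real weight matrix of one network layer

# Singular value decomposition: M = U @ diag(S) @ Vh.
U, S, Vh = np.linalg.svd(M)

# In the photonic implementation, U and Vh would be programmed into two MZI meshes
# (unitary transformations) and diag(S) into a set of optical attenuators/amplifiers.
x = rng.standard_normal(4)               # input vector encoded in the optical amplitudes
y_optical = U @ (np.diag(S) @ (Vh @ x))  # three physical stages applied in sequence

print(np.allclose(M @ x, y_optical))     # True: the cascade reproduces M @ x
```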

figure 5

Using the principle of interference to realize the linear operation in the neural network [ 36 ]. a Coherent nanophotonic circuit. b Each layer of the neural network is composed of optical interference unit OIU and optical nonlinear unit ONU. c The internal structure of OIU unit

Scattering of light to realize linear operation

When light meets an obstacle or an aperture, diffraction occurs; when multiple beams of light meet, interference occurs; and when light is incident on an opaque surface or a random medium, it is redirected in all directions by tiny particles, a phenomenon known as scattering. Scattering [ 37 ] is the phenomenon in which the spatial distribution, polarization state or frequency of the light is changed by the action of the molecules or atoms in the propagation medium; a medium that causes scattering is called a scattering medium.

In 1990, it was shown theoretically that a random scattering medium can be used as a thin lens to image a target [ 38 ]. In 2007, I. M. Vellekoop et al. from the University of Twente in the Netherlands verified I. Freund's idea experimentally: using feedback control of a spatial light modulator (SLM) to shape the wavefront phase of the light incident on a scattering medium, they compensated the phase distortion caused by scattering, so that the originally chaotic scattered light was focused to a specified location. This is the wavefront-shaping focusing technique [ 39 ]. In 2010, Vellekoop combined wavefront-shaping focusing with the optical memory effect of random scattering media [ 40 ] and successfully observed, by scanning imaging, fluorescent structures located behind a random scattering medium [ 41 ]. Building on this technique, E. G. van Putten et al. achieved scanning microscopy of gold nanoparticles beyond the diffraction limit by using a "random scattering lens" made of gallium phosphide (GaP) [ 42 ], with a resolution of 97 nm. This was the first demonstration of imaging beyond the diffraction limit based on a random scattering medium; it opened a new chapter in far-field super-resolution imaging and triggered a worldwide surge of research on imaging through random scattering media. Subsequently, optical coherence tomography [ 43 ], speckle correlation imaging [ 44 ], optical phase conjugation [ 45 ] and other techniques emerged, providing more options for observing targets through random scattering media such as biological tissue.

With the continuous development of deep learning and of imaging through scattering media, and given the strong learning ability of deep learning, researchers have tried to combine the two in search of new breakthroughs. One example is target recognition through a scattering medium based on machine learning applied directly to speckle intensity images [ 46 ]. In that experiment, a camera captures the speckle intensity image produced by an amplitude or phase object displayed on a spatial light modulator, and a support vector machine classifies the acquired speckle images into face and non-face categories.
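
That pipeline can be sketched in a few lines of Python: a fixed random complex transmission matrix stands in for the scattering medium (a common simplification, not the apparatus of [ 46 ]), speckle intensities are generated for two toy object classes, and an SVM from scikit-learn is trained on them:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_pixels, n_speckle, n_samples = 16 * 16, 32 * 32, 200

# Fixed random complex transmission matrix: a standard model of a thin scattering medium.
T = (rng.standard_normal((n_speckle, n_pixels)) +
     1j * rng.standard_normal((n_speckle, n_pixels))) / np.sqrt(2 * n_pixels)

def speckle(obj):
    """Intensity speckle pattern recorded by the camera for a given object field."""
    return np.abs(T @ obj.ravel()) ** 2

# Two toy object classes (stand-ins for "face" vs "non-face"): different mean patterns plus noise.
class0 = rng.random((n_samples // 2, n_pixels)) * 0.5
class1 = rng.random((n_samples // 2, n_pixels)) * 0.5 + 0.5
X = np.array([speckle(o) for o in np.vstack([class0, class1])])
y = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="linear").fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```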

In the D 2 NN-type networks mentioned above, the diffraction modulation layer behaves much like a scattering medium: after the light wave passes through, optical parameters such as the spatial distribution and polarization state change, and a speckle-like pattern with fine-grained features is obtained. Indeed, according to the computer training simulations, each diffraction pattern closely resembles a speckle pattern, so such a network modulation layer can be regarded as analogous to a scattering medium. In those networks, however, the 3D-printed layers or gratings are still designed according to learned parameters, and each parameter or pixel can be regarded as a neuron, so the number of neurons is limited. A scattering medium is different: its internal assembly of disordered dielectric particles can provide thousands of optical computing neurons, albeit with larger scattering loss.

The deep ONN constructed from living tumor cells in 2018 embodies this design well [ 47 ]. It uses a living three-dimensional tumor brain model and demonstrates that a trained random neural network can detect the morphological dynamics of the tumor through image transmission. The tumor brain cells act as scattering mediators and play the role of hidden layers, and the number of waveguide mixing nodes, i.e. neurons, is in the tens of thousands. In this three-dimensional tumor brain model, each cell is a scattering center with a complex transfer function. By training the SLM weights of the input layer, a design that uses scattering media to construct an ONN is obtained. The specific structure is shown in Fig.  6 (a): the input layer is realized by a spatial light modulator whose pattern is obtained by iterative training, the middle layer is the three-dimensional spherical tumor model, and the output is a CCD that detects the intensity distribution.

figure 6

Optical neural networks using scattering to realize linear operation. a Deep ONN constructed by active tumor cells [ 47 ]. b Nanophotonic neural medium NNM [ 48 ]

Whether we consider the diffractive modulation layers of a diffraction network or the scattering ONN made of tumor cells, these networks are layered. Previous studies have shown that a neural network needs an appropriate number of layers to complete a specific task with low loss, high accuracy and good overall performance. With too few layers, the network cannot be trained to the desired accuracy; with too many, gradient vanishing and overfitting are likely to occur, giving poor results and extremely long training times. In a photonic neural network, since the task is completed at the speed of light, one would like as many layers as possible, provided the experimental performance is maintained, so that a better network can be trained and more accurate results obtained. This suggests considering, as a limiting case, an optical neural network with an effectively unlimited number of layers.

In August 2019, Erfan Khoram et al. designed a new type of nanomedium, called the nanophotonic neural medium (NNM) [ 48 ], composed of a silicon dioxide host material and a large number of dopants. The dopants may be either air pores or materials whose refractive index differs from that of the host. The many dopants strongly scatter the incident light both forwards and backwards, and their positions and shapes play the role of the weight parameters of a conventional neural network. Scattering mixes the incident light in space, and since the incident light carries the information of the input image, this mixing is analogous to the linear matrix multiplication of a conventional neural network. As shown in Fig. 6 (b), linear materials in the NNM perform the linear matrix multiplication while nonlinear materials implement the activation function. This nanostructure, which redistributes light energy in different spatial directions, can perform the computation between neurons and has a stronger expressive capability than layered optical networks. In fact, a layered network is a subset of the NNM, because the medium can be molded into connected waveguides just like a layered network. The light only needs to pass through the scattering medium, which may surpass the earlier layered feedforward networks and behave like a very deep neural network, yet without the vanishing-gradient problem of deep networks. In addition, the NNM does not need to follow any specific geometry, so it can be integrated easily into existing vision or communication devices and will have a wider range of applications.

In 2020, Y. Qu et al. from Oregon State University, inspired by the NNM, proposed an integrated ONN framework based on optical scattering units [ 49 ], taking as its prototype the coherent nanophotonic circuit, which integrates optical interference units and optical nonlinearity units. The core of the framework is an integrated nanophotonic computing element, the optical scattering unit (OSU), which consists of a multimode interference (MMI) coupler with a nanopatterned coupler region and implements matrix multiplication; it plays the same role as the matrix multiplication unit OIU. The OSU can be designed as a coherent architecture, like the OIU, to realize arbitrary unitary matrix multiplication; alternatively, a more advantageous incoherent architecture can be designed that manipulates light intensity directly to realize arbitrary matrix multiplication. In addition, the researchers realized the optical convolution operation of a CNN based on the incoherent OSU. The key to implementing convolution in the OSU is to execute a "kernel matrix" in the photonic circuit, converting the convolution into an optical kernel-matrix multiplication. The image is divided into blocks and vectorized; by vectorizing and stacking each kernel, the set of kernels is converted into a "kernel matrix", so that the one-dimensional image blocks can be multiplied efficiently by the kernel matrix, which is equivalent to the convolution. Since the nanopatterning scatters light within a small region of the coupler and increases the number of degrees of freedom, the unit can be optimized by inverse design.
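
The kernel-matrix trick described above is the standard im2col reformulation of convolution; the short Python sketch below (with arbitrary image and kernel sizes chosen for illustration) shows how stacking vectorized kernels into a matrix turns the convolution into a single matrix multiplication:

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(3)
image = rng.random((6, 6))
kernels = rng.random((4, 3, 3))                  # 4 kernels of size 3x3

# im2col: slide a 3x3 window over the image and vectorize each patch into a row.
k = 3
patches = np.array([image[i:i + k, j:j + k].ravel()
                    for i in range(image.shape[0] - k + 1)
                    for j in range(image.shape[1] - k + 1)])    # (num_patches, 9)

# "Kernel matrix": each kernel vectorized and stacked as a row.
kernel_matrix = kernels.reshape(len(kernels), -1)               # (4, 9)

# Convolution (here: correlation, the usual CNN convention) as one matrix product.
out_matmul = patches @ kernel_matrix.T                          # (num_patches, 4)

# Reference: per-kernel 2D correlation with 'valid' boundaries.
out_direct = np.stack([correlate2d(image, ker, mode="valid").ravel() for ker in kernels], axis=1)
print(np.allclose(out_matmul, out_direct))                      # True
```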

Wavelength division multiplexing (WDM) to realize linear operation

When the principle of diffraction is used to implement the optical linear operation, the optical signals propagate in free space. A specific transmission medium can also be used: a scattering medium, as discussed above (Section 2.2.4), or an optical fiber. With its wide transmission bandwidth, low loss, strong immunity to interference, light weight and low cost, optical fiber has obvious advantages for light transport. In fiber transmission, wavelength division multiplexing (WDM) is the dominant technique at present [ 50 ]; it effectively increases transmission capacity and allows light to be separated and recombined by wavelength. Optical fiber can therefore also be used for computing on very large amounts of data.

In 2012, Y. Paquot et al. [ 51 ] constructed an optoelectronic hybrid serial recurrent neural network based on a fiber-optic system, whose structure is shown in Fig.  7 (a). Signals are injected from an arbitrary waveform generator (AWG) and modulated onto light by amplifiers and modulators. The reservoir layer in the middle consists of variable optical attenuators, delay lines, feedback photodiodes, mixers, amplifiers, and a Mach-Zehnder modulator. A photodiode converts the outputs of the system into electrical signals for readout. By training and controlling the output weights, the system can distinguish square waves from sine waves; it can also perform communication-channel equalization, extending photonic neural networks into the communications domain. In the same year, F. Duport et al. used a fiber-optic system to construct an all-optical recurrent neural network [ 52 ], adopting a fiber delay loop around a single nonlinear node with offline training; its structure is shown in Fig. 7 (b). Besides delay lines, devices such as micro-ring arrays and multimode-interference splitter arrays can also provide the required delays [ 55 ]. At the same time, multistage or more complex time-division multiplexing can be adopted, greatly improving the information processing speed and yielding better processing results [ 55 ]. To explore the multiplexing capability of light further, it was verified in [ 56 ] that two optical modes can carry out two independent information-processing tasks simultaneously in the same reservoir.

figure 7

The neural networks with reservoir computing function realized by optical fiber. a Optoelectronic reservoir computing [ 51 ]. b All-optical reservoir computing [ 52 ]. c , d , e respectively describes the principle of reservoir computing, and the structures of the neuron and neural network constructed based on the optical fiber communication system [ 53 ]. f A serial electro-optic neural network based on time-domain stretching [ 54 ]

Based on the above research, T. Cheng et al. proposed, in October 2019, an optical neural network system for optical reservoir computing built on an optical fiber communication system [ 53 ]. The RC optical system is composed of an input layer, a reservoir layer and an output layer; the schematic is shown in Fig. 7 (c). The input weight matrix W_in of the input layer is implemented by directional couplers and scales the input data to the dimension of the reservoir layer, which corresponds to the matrix W. The reservoir layer is composed of multiple neurons and plays the role of the hidden layer of a neural network. Each neuron consists of two directional couplers, optical fiber and an EDFA; its structure is shown in Fig. 7 (d). It has two outputs: one serves as the output of the optical neuron, and the other can be fed back to the same neuron or connected to other neurons to achieve signal recirculation and interconnection, as shown in Fig. 7 (e). The output layer uses an optical coupler consisting of a Mach-Zehnder phase modulator and a directional coupler to implement the readout matrix W_out, which converts the state of the reservoir layer into the output of the RC system. The directional couplers set the weights among the neurons and the fibers establish the connections, together realizing the linear operation among neurons. However, because of the optical fiber, directional couplers and EDFAs required, such a fiber network is limited in dimension and scale. To expand the dimension of a photonic neural network, time can be traded for space, enlarging the network without reducing the computing speed too much. A serial electro-optic neural network (TS-NN) based on time-domain stretching was proposed by Chen Hongwei's group at Tsinghua University [ 54 ]; its structure is shown in Fig. 7 (f). The system is a loop: n − 1 passes through the loop realize an n-layer network. Each cycle involves two operations, linear computation (matrix multiplication) and a nonlinear transformation. The linear operation uses time stretching: an ultrashort pulse is broadened and flattened by dispersive fiber and a wavelength converter; the weight matrix then modulates the stretched pulse, and the output of the previous cycle, used as the input of the current one, modulates it again, realizing the optical multiplication of the input vector by the weight matrix. Finally, a DSP processes the results with signal-processing algorithms. This method realizes an optoelectronic hybrid fully connected neural network through a parallel-to-serial scheme and can therefore support large-scale networks. Although it is not an all-optical neural network, the idea of expanding the network scale by exchanging time for space is worth noting.
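
The generic reservoir-computing scheme that these fiber systems implement can be sketched in software; the minimal echo-state-style example below (with arbitrary sizes, a random fixed reservoir, and a ridge-regression readout — all assumptions, not the parameters of [ 51 - 53 ]) trains only the output matrix W_out:

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_res, T = 1, 100, 1000

# Fixed (untrained) input and reservoir coupling matrices, as in reservoir computing.
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))        # scale spectral radius below 1 for stability

# Toy task: predict the next sample of a sine wave.
u = np.sin(np.linspace(0, 20 * np.pi, T + 1))
states = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    # tanh reservoir update; tanh stands in for the physical nonlinearity (e.g. EDFA saturation).
    x = np.tanh(W_in @ u[t:t + 1] + W @ x)
    states[t] = x

# Only the readout W_out is trained, here by ridge regression.
target = u[1:T + 1]
ridge = 1e-6
W_out = np.linalg.solve(states.T @ states + ridge * np.eye(n_res), states.T @ target)
pred = states @ W_out
print("train MSE:", np.mean((pred - target) ** 2))
```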

Whether for optical-wave transmission or for communication, optical fiber is a very promising direction for the future. Fiber technology is already mature in WDM, in broadband amplification (e.g. the erbium-doped fiber amplifier, EDFA), in dispersion compensation and in soliton WDM transmission. As for optical networks, traditional networks are already all-optical between nodes, but electrical devices are still used at the network nodes, which limits further development. All-optical networks that replace electrical nodes with optical nodes will be an important direction for optical fiber in the future, and combining fiber and optical networking in the 5G era to realize a true all-optical network is an engineering problem that deserves further study.

Within neural networks, another typical application of WDM technology is the all-optical spiking neural network of 2019. The working mechanism of the neurons used in this network resembles the synaptic mechanism of human brain neurons: they emit spikes and thus naturally reflect the behavior of biological neurons, and are known as spiking (or pulse) neurons. Spiking neurons date back to 1997, when W. Maass first proposed the spiking neural network [ 57 ], which uses impulse functions to model the signals transmitted between neurons. Neuromorphic silicon photonics was put forward by Alexander N. Tait of Princeton University in December 2017 and is described as the world's first photonic neural network [ 58 ], shown in Fig.  8 (a). Each node in the network works at a specific optical wavelength; the light from all nodes is detected and summed in total power before being sent to a laser, whose output is fed back to the node, creating a feedback loop with nonlinear characteristics. Such a photonic network can be used, for example, to solve differential equations, and its nodes behave like the firing mechanism of human brain neurons, i.e. as pulse or spiking neurons.

figure 8

Optical spiking neural networks realized by WDM principle. a The structure of the first photonic neural network [ 58 ]. b Graphene spiking neuron [ 59 ]. c Bipolar integration-firing neuron with GST [ 60 ]. d The all-optical spiking neurosynaptic network structure formed based on the principle of WDM realized by PCM and MRR array [ 61 ]

In fact, the neural networks discussed above abstract the network input into matrices or vectors, and the neurons mainly perform matrix multiplication, whereas biological neurons process information in the form of impulses. Those networks therefore retain only the structure of neural networks and greatly simplify the neuron model; their elements are better described as "units" than as neurons. In contrast, pulse/spiking neurons are closer to the biological model of brain neurons and exist in two states, activated and inactive. They are activated only when their membrane potential reaches a threshold, so they are not activated in every propagation step, somewhat like dropout regularization in artificial neural networks. When a neuron is activated, it produces a signal and transmits it to other neurons, raising or lowering the membrane potential of the neurons downstream. In a pulse/spiking neural network, the current activation level of a neuron is usually modeled by a differential equation: it rises when a stimulus pulse arrives, persists for some time, and then gradually decays. Spiking neural networks enhance the ability to process spatio-temporal data. On the one hand, the neurons are connected only with nearby neurons and process input blocks separately, which strengthens the processing of spatial information; on the other hand, because training depends on the timing of the pulses, information lost in binary encoding can be recovered from the pulse timing, which strengthens the processing of temporal information. Pulse/spiking neurons have been shown to be more powerful computing units than traditional artificial neurons and represent a major development trend. However, owing to the difficulty of training and of hardware implementation, spiking neural networks have not yet been widely used, and most research still focuses on theory and the verification of simple structures. Nevertheless, more and more researchers are now devoting themselves to training algorithms and to hardware (optical) implementations of spiking neural networks.
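
As a concrete illustration of the membrane-potential dynamics described above, the following sketch implements a simple leaky integrate-and-fire neuron (one common spiking-neuron model; the time constants, threshold and input spike train are arbitrary choices, not taken from any of the cited systems):

```python
import numpy as np

rng = np.random.default_rng(5)
dt, T = 1e-3, 0.5                      # time step (s) and total duration (s)
tau, v_th, v_reset = 20e-3, 1.0, 0.0   # membrane time constant, firing threshold, reset value

steps = int(T / dt)
input_current = (rng.random(steps) < 0.05) * 3.0   # sparse random input pulses

v = 0.0
spikes = []
for t in range(steps):
    # Leaky integration: dv/dt = (-v + I) / tau  (Euler step).
    v += dt * (-v + input_current[t]) / tau
    if v >= v_th:                      # threshold crossed: emit a spike and reset
        spikes.append(t * dt)
        v = v_reset

print(f"{len(spikes)} spikes emitted; first few spike times (s): {spikes[:5]}")
```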

In 2016, Prucnal’s research team in the Princeton University proposed a spike processing system based on the activated graphene fiber laser, in which the activated graphene fiber laser plays the role of spiking neuron [ 59 ], as a basic component of spike information processing. In 2018, a neural mimicry photonic integrated circuit based on distributed feedback (DFB) laser structure was proposed [ 62 ]. The laser has two photodetectors, which can generate both inhibitory and excitatory stimuli at the same time. The system is compatible with the broadband-and-weight (B&W) protocol [ 63 ]. In the same year, superconducting photoelectric spike ring neurons were designed, known as Loop Neurons [ 64 ]. These neurons are composed of single-photon detectors, Josephson junction and light-emitting diodes. Josephson junction detects event integration and converts it into supercurrent, and finally store it in the superconducting circuit. Also in 2018, a new spiking neuron equipment——a bipolar integration-firing neuron [ 60 ] was introduced, including integral unit that consists of two double bus ring resonator with embedded phase change materials (PCM) (i.e., Ge 2 Sb 2 Te 5 (GST)), which controls the propagation in the loop, sums the output of the resonator, and is used to stimulate the ignition unit that is composed of a photon amplifier, a circulator and a rectangular waveguide with a GST component on the top. This spiking neuron can be connected with photon synapse to form an all-photon spiking neural network. We show graphene spiking neurons and bipolar integration-firing neurons in Fig. 8 (b) and (c), respectively.

Building on this, Feldmann et al. reported an all-optical spiking synaptic network using PCM in May 2019 [ 61 ]. The structure of this photonic neural network is shown in Fig. 8 (d); it is a fully connected network. When a pulse is input, the PCM cells on the waveguides perform the weighting within each neuron, and a micro-ring resonator (MRR) array acting as a WDM multiplexer performs the summation; the spiking mechanism is implemented by a PCM cell on a ring resonator. PCM is special in that it has two states, crystalline and amorphous, which affect the input pulse differently; because of this, PCM can modulate the pulse and realize the weighting operation. For amorphous PCM cells the synaptic waveguide is highly transmissive, giving strong connections between neurons; in the crystalline state most of the light reaching the PCM is absorbed, giving weak connections. After the pulses are weighted by the PCM cells, they are combined by WDM and sent together to a ring resonator integrated with a PCM cell, which carries out the summation. In this way, the linear operation of the ONN is achieved.

The optical realization of nonlinear activation function

In the previous chapter we discussed the optical realization of the linear operations in neural networks, but linearity alone is not enough: a neural network also requires nonlinear activation functions, analogous to the role of synapses in the brain's nervous system. Nonlinear functions accelerate the convergence of the network and improve recognition accuracy; they are an indispensable part of a neural network. Without them, no matter how many layers the network has, it reduces to one large linear operation, whereas most real problems are nonlinear. The activation function introduces the nonlinearity that allows a neural network to approximate arbitrary nonlinear functions and hence to be applied to a wide range of nonlinear models.

In an electronic neural network, we can use existing nonlinear activation functions or define our own. In a photonic neural network, however, the nonlinearity becomes a bottleneck: nonlinear optical components generally require high-power lasers, making nonlinear functions harder to realize than with electronic devices, and the functions obtained often have non-ideal characteristics. In 1967, Seldon et al. proposed a saturable-absorber model, and electronic modules have also been used [ 65 ], to realize nonlinear operations in photonic neural networks, but such approaches are difficult to control accurately and require converting optical signals into electrical signals through photodiodes, which reduces the computing speed. At present there are two ways to realize the nonlinear operation in a photonic neural network: one uses electronic or optoelectronic methods, the other exploits the nonlinear effects of special materials. In the following we first describe nonlinear optical effects in detail and then introduce the different activation implementations, and the corresponding optical neural networks, according to the effect they exploit.

Nonlinear optical effect

A nonlinear optical effect is an effect caused by the nonlinear polarization of a medium under the action of intense light; it originates from the nonlinear polarizability of molecules and materials and manifests itself as a nonlinear relationship between the optical field applied to the medium and the medium's response [ 66 ]. Under the incident light field, the motion state and charge distribution of the atoms, molecules or ions that make up the medium change, forming electric dipoles, generating an electric dipole moment and radiating a new light wave. In this process the electric polarization vector P of the medium is the key physical quantity, and it has a nonlinear relationship with the incident light field E :

\( P={\varepsilon}_0\left({\chi}^{(1)}E+{\chi}^{(2)}{E}^2+{\chi}^{(3)}{E}^3+\cdots \right) \)

where χ (1) , χ (2) and χ (3) are, respectively, the first-order (linear), second-order and third-order (nonlinear) susceptibilities of the medium. Studies show that χ (1) , χ (2) and χ (3) decrease successively in magnitude.

For ordinary incident light, the second- and higher-order polarization terms can be ignored: the medium exhibits only linear optical properties, and its polarization P has a simple linear relationship with the incident field intensity E . When the medium is illuminated by an intense monochromatic laser, however, the optical field strength E becomes comparable to the average electric field ∣ E 0 ∣ inside the atom; the contributions of the second- and third-order polarization can no longer be ignored, P acquires a power-series dependence on E , and nonlinear optical effects appear.

There are many kinds of nonlinear optical effects. According to the relationship between the polarization and the electric field, they can be divided into second-order, third-order and higher-order effects; in general only second- and third-order effects are studied. According to the mode of interaction between the laser and the medium, i.e. whether energy is exchanged between them, they can be divided into active and passive nonlinear optical effects. According to the parameter that changes, they can also be classified as optical frequency conversion, nonlinear absorption, the optical Kerr effect and self-focusing, optical bistability, optical phase conjugation, stimulated scattering, and so on.

Implementation of nonlinear activation in photonic neural network

In current studies of photonic neural networks, the optical nonlinear activation is sometimes absent or simulated electronically. For example, the diffraction network D 2 NN has no activation function, and in the serial photonic neural network based on time-domain stretching the nonlinear transformation is realized by functions such as a non-negative sigmoid simulated by the electronic devices in the system. Other networks exploit nonlinear optical effects for their activation designs. At present, saturable absorption, optical bistability and the Kerr effect are considered the most promising activation mechanisms for ONNs.

Nonlinear optical absorption

Optical absorption means that when photons enter a medium, its atoms and molecules absorb the photon energy and undergo energy-level transitions [ 67 ]. If the light is strong enough, the absorption coefficient of the medium changes with the light intensity; this change can be linear or nonlinear, giving linear and nonlinear optical absorption, respectively. The main mechanisms of nonlinear optical absorption are saturable absorption, reverse-saturable absorption and two-photon absorption.

When a laser is incident on a medium, the absorption coefficient decreases as the light intensity in the medium increases; once the input intensity exceeds a threshold value, the absorption of the medium begins to saturate. This nonlinear behavior is called saturable absorption, and it is caused by the transition of the particles constituting the medium from the ground state to the first excited state. For saturable absorption, the relationship between the absorption coefficient and the light intensity I in the medium can be expressed as \( \alpha (I)=\frac{\alpha_0}{1+I/{I}_c} \) , as shown in Fig.  9 (a). Correspondingly, the transmittance-versus-intensity curve has the opposite trend and resembles a sigmoid function, as shown in Fig. 9 (b).
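
A saturable absorber can therefore serve directly as a sigmoid-like activation; the short sketch below (a toy model with arbitrary α0, Ic and absorber thickness, and a thin-absorber approximation that evaluates α at the input intensity) maps input intensity to transmitted intensity through the Beer-Lambert law with the intensity-dependent absorption coefficient given above:

```python
import numpy as np

alpha_0 = 5.0      # small-signal absorption coefficient (1/mm), arbitrary value
I_c = 1.0          # saturation intensity, arbitrary units
L = 1.0            # absorber thickness (mm), arbitrary value

def saturable_absorber(I_in):
    """Transmitted intensity of a thin saturable absorber used as an activation function."""
    alpha = alpha_0 / (1.0 + I_in / I_c)        # intensity-dependent absorption coefficient
    return I_in * np.exp(-alpha * L)            # Beer-Lambert transmission over length L

I_in = np.linspace(0.0, 10.0, 6)
print(np.round(saturable_absorber(I_in), 4))    # low inputs are suppressed, high inputs pass
```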

figure 9

a , b Absorption coefficient and transmittance curves of saturation absorption. c The structure of Kerr-type network [ 68 ]. d Phase change material PCM realizes nonlinear activation [ 61 ]

By contrast, reverse-saturable absorption is the effect in which the absorption coefficient increases with increasing light intensity; its characteristic curve is also somewhat sigmoid-like, but it is not commonly used for nonlinear activation. Two-photon absorption, as the name suggests, occurs when an atom in the medium absorbs two photons simultaneously and is excited from the ground state to an excited state. When two beams with frequencies ω 1 and ω 2 pass through a nonlinear medium, and ω 1  +  ω 2 is close to a transition frequency of the medium, both beams are attenuated simultaneously. Two-photon absorption is a third-order nonlinear optical effect.

Nonlinear optical absorption can be realized with optoelectronic devices or by purely optical means. The saturation of optoelectronic devices such as optical attenuator/amplifiers, erbium-doped fiber amplifiers (EDFAs) and semiconductor optical amplifiers can serve as the nonlinear activation. In the on-chip all-optical reservoir computing based on a semiconductor optical amplifier array proposed by F. Duport et al., the gain-saturation effect of the amplifiers realizes the network nonlinearity. In the reservoir computing based on the optical fiber communication system, each neuron consists of two directional couplers, optical fiber and an EDFA, and the EDFA provides the nonlinear activation of each neuron. On the purely optical side, saturable absorbers such as optical dyes, graphene and C60 can play the role of the nonlinear activation. In the 2014 optical reservoir computing work [ 69 ], graphene saturable absorbers or two-photon absorption [ 70 ] were used as the optical nonlinear units. In 2016, Prucnal's group at Princeton proposed the spike-processing system based on a graphene excitable fiber laser, which also uses graphene as a saturable absorber to perform the nonlinear activation. In the 2017 coherent nanophotonic circuit, the nonlinear unit (ONU) is intended to be realized by a saturable absorber that can be integrated into the nanophotonic circuit, such as dye molecules, semiconductors, graphene saturable absorbers or saturable amplifiers. For incident intensity I in , the output intensity I out is given by a nonlinear relation I out  =  f ( I in ); using the saturable-absorber model \( \sigma {\tau}_s{I}_0=\frac{1}{2}\frac{\ln \left({T}_m/{T}_0\right)}{1-{T}_m} \) , once I 0 is given, T m ( I 0 ) can be solved from this formula and the output intensity obtained as I out  =  I 0   T m ( I 0 ). In the 2020 nanophotonic neural medium, the nonlinear activation is likewise achieved by dopants made of dye semiconductors or graphene saturable absorbers. These dopants perform distributed nonlinear activation that behaves essentially like a ReLU function, letting signals whose intensity exceeds a set threshold pass while blocking signals below the threshold.
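
The saturable-absorber relation quoted above is implicit in T m, so in practice it is solved numerically; the sketch below (with arbitrary values for σ τ s and T 0, used only to demonstrate the procedure) finds T m ( I 0 ) by root-finding and then evaluates I out = I 0 T m ( I 0 ):

```python
import numpy as np
from scipy.optimize import brentq

sigma_tau_s = 1.0   # product of absorption cross-section and relaxation time, arbitrary units
T0 = 0.1            # small-signal (unsaturated) transmittance, arbitrary value

def transmittance(I0):
    """Solve sigma*tau_s*I0 = 0.5*ln(Tm/T0)/(1 - Tm) for Tm in (T0, 1)."""
    f = lambda Tm: sigma_tau_s * I0 - 0.5 * np.log(Tm / T0) / (1.0 - Tm)
    return brentq(f, T0 + 1e-12, 1.0 - 1e-9)   # the bracket always contains a sign change

for I0 in [0.1, 1.0, 5.0, 20.0]:
    Tm = transmittance(I0)
    print(f"I0 = {I0:5.1f}  ->  Tm = {Tm:.3f},  I_out = {I0 * Tm:.3f}")
```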

Optical Bistability

When a light beam passes through a suitable optical system, the transmitted intensity can depend nonlinearly on the incident intensity, which makes optical switching possible, for instance optical limiting, optical bistability and various interference switches. In electronics, a bistable circuit is a unit that presents two different resistance values for the same input electrical signal; in photonics, a bistable element is one that exhibits two different transmittance levels for the same incident light intensity, which is called optical bistability. It is of great significance for the storage, processing and logical manipulation of optical information.

In a nonlinear optical system, when the input light intensity is small the output intensity is also small. When the input intensity increases to a certain critical value, the output jumps to a high-intensity state, as if a switch were turned on. If the input intensity is then reduced, the system does not return to the low state at the original critical value; instead there is another, lower critical intensity at which it jumps from the high state back to the low state. The input-output transfer relationship of the optical system therefore exhibits hysteresis, similar to the hysteresis loop in electromagnetism.

Optical bistable devices may be used in high-speed optical communication, optical image processing, optical storage, optical limiters and optical logic elements. In particular, bistable devices made of semiconductor materials, with their small size, low power and short switching time (~10 −12 s), are likely to become the logic components of future optical computers. Because of this great potential, optical bistability has become a very active research field.

In reservoir computing, besides saturable absorption, bistability [ 71 ] can also be combined with a ring resonator to realize the nonlinear activation structure of a neural network [ 72 ]; this is reflected in [ 73 ]. In coherent nanophotonic networks, the nonlinear activation of the ONU element can likewise be realized by a bistable nonlinear effect in addition to saturable absorption.

Optical Kerr effect

The Kerr effect [ 74 ] is a third-order nonlinear effect. Under an applied electric field, the refractive indices n // and n ⊥ of light polarized parallel and perpendicular to the field change by different amounts, and their difference Δn is proportional to the square of the field, producing induced birefringence. Normally the applied field is a DC or low-frequency alternating field; if the optical field itself replaces the applied field, the same phenomenon occurs when the light is sufficiently intense. In that case Δn is proportional to the intensity of the laser beam in the medium and produces a nonlinear phase shift; this is the optical Kerr effect. If the parameter to be optimized is a phase, the optical Kerr effect can be used to realize nonlinear activation. Exploiting the Kerr nonlinearity, S. R. Skinner proposed an all-optical neural network structure using Kerr-type nonlinear optical materials in 1994 [ 68 ]. Figure  9 (c) depicts this all-optical feedforward network: thin layers of nonlinear medium separated by free space realize both the weighted connections and the nonlinear neuron processing. The thick linear layers (free space) carry the light and realize the weighted connections, while the thin nonlinear optical layers act as weight layers (except the first, which is the input layer) and perform the nonlinear processing. Two formulas describe the network. The first, \( {E}_{i+1}\left(\beta \right)=\frac{j{C}_i}{\pi }{\int}_{\Omega_i}{F}_i\left(\alpha \right){e}^{-j{C}_i{\left(\beta -\alpha \right)}^2} d\alpha \) with \( {C}_i=\frac{k_0}{2\Delta {L}_i} \) , describes the propagation of light from the coordinate α  = ( x ,  y ) at the beginning of the i-th layer to the coordinate β  = ( x ∗ ,  y ∗ ) just before the nonlinear layer of the (i + 1)-th layer. The second, \( {F}_i\left(\alpha \right)={E}_i\left(\alpha \right){e}^{-j{k}_0\Delta N{L}_i{n}_2\left({\left|{\Gamma}_i\left(\alpha \right)\right|}^2+{\left|{E}_i\left(\alpha \right)\right|}^2\right)} \) with \( {\Gamma}_0\left(\alpha \right)=I\left(\alpha \right) \) and \( {\Gamma}_{i>0}\left(\alpha \right)={W}_i\left(\alpha \right) \) , describes the effect of the nonlinear layer, where E i ( α ) is the light entering the i-th nonlinear layer at coordinate α  = ( x ,  y ).
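
The action of one Kerr layer, i.e. an intensity-dependent phase shift applied to the field, can be sketched as follows (a one-dimensional toy with arbitrary n2, layer thickness and weight field, intended only to show the structure of the second formula above):

```python
import numpy as np

rng = np.random.default_rng(6)
k0 = 2 * np.pi / 1.55e-6      # free-space wavenumber for a 1.55 um wavelength (arbitrary choice)
n2 = 1e-4                     # Kerr coefficient in these toy units, arbitrary value
dL_nl = 1e-6                  # thickness of the thin nonlinear layer (m), arbitrary value

E = rng.standard_normal(8) + 1j * rng.standard_normal(8)   # field entering the nonlinear layer
W = rng.random(8)                                           # weight field Gamma_i applied in this layer

# Kerr layer: F_i = E_i * exp(-j * k0 * dL * n2 * (|Gamma_i|^2 + |E_i|^2)),
# i.e. each point acquires a phase shift proportional to the local intensity.
F = E * np.exp(-1j * k0 * dL_nl * n2 * (np.abs(W) ** 2 + np.abs(E) ** 2))

print(np.round(np.angle(F) - np.angle(E), 4))   # intensity-dependent phase shifts (mod 2*pi)
```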

Such a layered network can not only process the forward-propagating signal but also realize backward error propagation. This nonlinear approach has an advantage over other optical implementations because the Kerr-type nonlinearity responds very quickly in the material, and the resulting network is comparatively simple: other optical networks usually require separate, dedicated optical hardware for the weighted connections and for the neuron processing.

Taking advantage of the fast response of Kerr nonlinear materials and of two-photon absorption (a third-order nonlinear effect), researchers have combined the Kerr effect with two-photon absorption as a nonlinear mechanism and used it, together with an InGaAsP ring resonator, to realize all-optical reservoir computing [ 75 ].

In the quantum optical neural network (QONN) architecture proposed by G. R. Steinbrecher et al. in 2019 [ 76 ], the input is a single-photon Fock state, the single-site nonlinearity is a Kerr-type interaction, and a quadratic phase is applied to the photon number. The readout is provided by photon-number-resolving detectors, which measure the number of photons in each output mode. The single-mode Kerr interaction provides the photon-coherent nonlinearity.

Other nonlinear activations

In 2018, R. Amin, J. George et al. pointed out that electro-optic absorption modulation can realize nonlinear modulation of light waves [ 77 , 78 , 79 ]; they discussed how to map nonlinear activation functions onto the transfer function of an electro-optic modulator and noted that different activation functions can be implemented by using different electro-optic materials. For example, a ReLU-like function can be realized by exploiting the population-inversion-based light-absorption mechanism of quantum dots (QDs) [ 80 ].

In the same year, M. Miscuglio et al. designed an optical nonlinearity that relies on the reversible induced transparency arising from Fano resonance in a plasmonic oscillator system and on the reverse-saturable absorption of a Buckyball (C60) film [ 81 ], realizing fast and efficient all-optical nonlinearity, improving the throughput of the neural network, and reducing latency and power consumption.

In May 2019, the Nature paper "An all-optical spiking neurosynaptic network" described how the network is built around the nonlinearity of phase-change materials (PCM) [ 61 ]. PCM combined with MRRs achieves the weight modulation, and PCM integrated on a ring resonator realizes the spiking function as the nonlinear activation. If the power of the summed input pulses exceeds a certain threshold, the state of the PCM changes and a spike is produced; otherwise, the probe pulse remains on resonance with the ring resonator. This behaviour is similar to the nonlinear response represented by the ReLU function.

In addition, the two-layer AONN designed by the Hong Kong team [ 34 ] uses a special nonlinear activation function based on electromagnetically induced transparency (EIT), a photo-induced quantum interference effect between atomic transitions, in laser-cooled atoms. The EIT nonlinear optical activation is implemented with laser-cooled 85 Rb atoms in a dark-line two-dimensional magneto-optical trap (MOT), as shown in Fig.  10 (a); the atomic energy levels are shown in Fig.  10 (b). The atoms are prepared in the ground state ∣ 1⟩. The output beam of the linear operation, a circularly polarized coupling beam ω c resonant with the ∣ 2 ⟩   ⟷    ∣  3⟩ transition, is incident transversely on the atomic cloud, and a counter-propagating probe beam ω p is resonant with ∣ 1 ⟩   ⟶    ∣  3⟩. In the absence of the coupling beam, the atomic medium is opaque to the resonant probe beam, which is maximally absorbed, as shown by the transmission spectrum in Fig. 10 (c). In the presence of the coupling beam, quantum interference between the transition paths opens the EIT [ 82 ] spectral window, shown by the dashed curve in the figure. The peak transmission and bandwidth of the resonance are controlled by the coupling laser intensity, and the output of the resonant probe beam can be expressed as \( {I}_{p, out}={I}_{p, in}{e}^{- OD\frac{4{\gamma}_{12}{\gamma}_{13}}{\Omega_c^2+4{\gamma}_{12}{\gamma}_{13}}}=\varphi \left({\Omega}_c^2\right) \) . The probe output is therefore a nonlinear function of the coupling (input) beam.
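
Reading the formula above directly as an activation function gives the following small sketch (the optical depth OD and the rates γ 12, γ 13 are placeholder values chosen for illustration, not the experimental parameters of [ 34 ]):

```python
import numpy as np

OD = 10.0            # optical depth of the atomic cloud, placeholder value
gamma_12 = 0.01      # ground-state dephasing rate, placeholder value (arbitrary units)
gamma_13 = 3.0       # excited-state decay rate, placeholder value (arbitrary units)

def eit_activation(omega_c_sq, I_p_in=1.0):
    """Probe transmission I_p,out = I_p,in * exp(-OD * 4*g12*g13 / (Omega_c^2 + 4*g12*g13))."""
    return I_p_in * np.exp(-OD * 4 * gamma_12 * gamma_13 /
                           (omega_c_sq + 4 * gamma_12 * gamma_13))

# Weak coupling -> probe absorbed (output near 0); strong coupling -> transparent (output near I_p_in).
for omega_c_sq in [0.0, 0.1, 1.0, 10.0, 100.0]:
    print(f"Omega_c^2 = {omega_c_sq:6.1f}  ->  I_p,out = {eit_activation(omega_c_sq):.4f}")
```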

figure 10

a , b , c represents the preparation, energy level diagram, and transmission diagram of laser-cooled 85 Rb atom, respectively [ 34 , 82 ]. d An electro-optical nonlinear activation function structure can be used to realize the optical nonlinear element ONU [ 83 ]

In January 2020, a nonlinear activation function structure for optical neural networks was proposed [ 83 ] that achieves an optical-to-optical nonlinearity by converting a small fraction of the optical input power into a voltage. As the remainder of the optical signal passes through an interferometer, it is modulated in phase and amplitude by this voltage. For an input signal with amplitude z, the resulting nonlinear optical activation function f ( z ) is determined by the response of the interferometer under this modulation together with the elements in the electrical signal path. The schematic structure is shown in Fig.  10 (d). The authors also demonstrated another implementation of the activation function, a nonlinear MZI in which one arm contains a material with a Kerr nonlinear optical response. The two implementations were demonstrated and compared experimentally, showing that the electro-optic activation structure achieves the lower activation threshold.

Training, experimental demonstration and analysis

For a neural network, training is a crucial and indispensable step that determines the performance of the network. Training computes a target loss function from the gap between the network output and the desired output and minimizes it so as to optimize the network parameters and make the network converge; only then can the network deliver the desired results when it finally performs the prediction task. In electronic neural networks, training falls into two categories, supervised and unsupervised learning, and the parameters are optimized with back-propagation together with gradient descent, Adam, momentum and similar methods to minimize the cost function. Since training involves gradient computation and even more complex calculations, how to train the network is a difficult and important question for optical neural networks. At present, most ONNs are trained in software to obtain the weight parameters, and only inference is carried out on the ONN hardware. Training in the electrical domain, however, is task-specific and creates a dependence on electronics. Training methods can instead be customized to the optical architecture of the ONN, making full use of photonic technology, although this is more complicated. Below, training methods in ONNs are introduced according to the way the training, or the gradient computation, is performed.

Backpropagation algorithm

Backpropagation is a learning algorithm for multi-layer neural networks based on gradient descent. Its main idea is as follows: after the forward-propagation pass, the error between the network's estimate and the actual value is computed and propagated backwards from the output layer through the hidden layers to the input layer; during this backward pass, the parameters are adjusted according to the errors, and the process is iterated until the network converges. In training D 2 NN-type networks, the backpropagation algorithm plays a central role.
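
As a reminder of the mechanics, the sketch below trains a tiny two-layer network on a toy regression task with manual backpropagation and plain gradient descent (NumPy only; the sizes, learning rate and task are arbitrary choices — this is a generic illustration of gradient-based training, not the actual D 2 NN training code):

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy regression task: learn y = sin(x) on a small grid.
X = np.linspace(-np.pi, np.pi, 64).reshape(-1, 1)
Y = np.sin(X)

# One hidden layer with tanh activation; weights initialized randomly.
W1, b1 = rng.standard_normal((1, 16)) * 0.5, np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)) * 0.5, np.zeros(1)
lr = 0.1

for step in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    y_hat = h @ W2 + b2
    err = y_hat - Y                         # drives the gradient of 0.5 * mean squared error

    # Backward pass: propagate the error from the output layer back towards the input layer.
    dW2 = h.T @ err / len(X)
    db2 = err.mean(axis=0)
    dh = err @ W2.T * (1 - h ** 2)          # tanh'(u) = 1 - tanh(u)^2
    dW1 = X.T @ dh / len(X)
    db1 = dh.mean(axis=0)

    # Gradient-descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final MSE:", float(np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - Y) ** 2)))
```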

In the all-optical machine learning D 2 NN [ 24 ], a photodetector array at the output layer detects the output light intensity, which is compared with the target intensity. The loss function is defined as the mean squared error, and backpropagation with stochastic gradient descent is used to update the amplitude or phase of the entire network. This training process is completed on an electronic computer; the trained parameters of each layer are then modelled and 3D printed, and the light source, the fabricated 3D diffractive modulation layers and the detector array are assembled into the photonic neural network for inference. To test the inference ability and performance of the network, the authors carried out experiments on the MNIST and Fashion-MNIST datasets and reached high accuracy with a 5-layer D 2 NN and with designs that further increased the number of diffraction layers; the specific results are shown in Table  1 . Later, the same group analysed the diffractive network architecture and different parameter designs in detail and used five phase-only diffraction modulation layers for handwritten digit recognition and fashion product recognition, achieving 97.18% and 89.13% accuracy respectively. They also analysed the influence of the loss function on the performance of the optical network and the mitigation of gradient vanishing in error backpropagation [ 84 ].

Replacing the 3D-printed layers with diffraction gratings, the network was trained by the same method [ 25 ]; the optimized parameters were then used for the grating design, and the corresponding diffraction grating was etched with semiconductor processing technology. The phase value of a neuron is related to the step thickness of the etched grating, i.e. the height of the Ge layer, by \( \Delta z=\frac{\lambda \phi}{2\pi \Delta n}=0.5618\phi \) . To train and test this D 2 NN classifier, the MNIST dataset was used, and a higher recognition accuracy was obtained; Table  2 shows the specific experimental results.
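
The phase-to-thickness conversion is a one-line mapping; the sketch below applies it to a few example phase values (λ = 10.6 μm and the Δn implied by the quoted coefficient 0.5618 μm/rad are assumptions consistent with the formula as written, not independently verified device parameters):

```python
import numpy as np

wavelength_um = 10.6   # CO2-laser wavelength used in the matrix-grating network
delta_n = 3.0          # refractive-index contrast assumed so that lam/(2*pi*dn) ~ 0.5618 um/rad

def phase_to_thickness(phi):
    """Etch-step height (um) that imprints a phase delay phi (rad): dz = lam * phi / (2*pi*dn)."""
    return wavelength_um * phi / (2 * np.pi * delta_n)

phases = np.array([0.0, np.pi / 2, np.pi, 3 * np.pi / 2])   # example trained phase values
print(np.round(phase_to_thickness(phases), 3))               # corresponding step heights in um
```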

In the neural-network system that implements logic operations, metasurfaces are used as the diffractive modulation layers [ 26 ]. Each metasurface is composed of an array of scatterers whose dimensions control the amplitude and phase of the scattered light. The network is trained in the same way as above, and the trained parameters are then converted into scatterer sizes that modulate the amplitude or phase of the light transmitted by each layer. First, the three basic logic operations NOT, OR and AND were demonstrated experimentally with 100% accuracy. Then a three-layer phase-only diffractive neural network was used to realize all seven optical logic gates in one optical system, and the accuracy obtained by evaluating the intensity distribution in the two designated regions remained satisfactory. In addition, the team proposed a possible scheme for cascading optical logic gates and pointed out that, besides multilayer metasurfaces, other platforms such as metamaterials/nanophotonics could also support optical logic gates, as shown in Fig.  11 (a).

figure 11

a Possible solutions for optical logic gate cascading and other possible implementation platforms [ 26 ]. b Describes a SLM cascade neural network, and uses 4f system and SLM to realize optical field measurement, thereby carrying out optical training of diffraction ONN [ 85 ]

From the above description it can be concluded that, for the deep diffractive network D2NN, the computer-based learning and training of the network parameters is identical in physical essence whether the modulation layer is a 3D-printed layer, a diffraction grating or a metasurface: the parameter training is always completed on a computer with the same overall architecture. Moreover, once the target task changes, the network has to be retrained. According to the computer configurations reported in these papers, retraining the network and refabricating the diffractive layers takes a long time, so these operations consume considerable time and resources.

In June 2020, Qionghai Dai's team proposed an SLM-cascaded neural network that uses a 4f system and SLMs to measure the optical field, and an error-measurement module to realize network training [85], as shown in Fig. 11(b). The cascaded SLMs serve as the hidden layers. Based on the principle of optical reciprocity and phase conjugation, the gradient of the loss function with respect to the weights of each diffractive layer is computed accurately by measuring the forward- and backward-propagating light fields. The high-speed spatial light modulators are then programmed to update the diffractive modulation weights so as to minimize the error between the prediction and the target output, and inference is performed at the speed of light. The most distinctive feature of this work is the realization of the back-propagation algorithm itself: optical methods are used to carry out back propagation and to train linear and nonlinear diffractive optical neural networks in situ, which speeds up training and improves the energy efficiency of the core computing module. Therefore, it realizes not only an optical network structure but also an optical, real-time programmable training process.

Beyond diffractive networks, in the training of the coherent nanophotonic circuit the authors also used the traditional back-propagation algorithm with stochastic gradient descent to update the parameters, and constructed a two-layer fully connected neural network for a speech recognition experiment. The recognition accuracy was only 76.7%, while an equivalent electronic neural network achieved 91.7%, so there is still much room for improvement in this method.

Forward propagation on Chip

The back-propagation algorithm is widely used and is currently the most common and most effective algorithm for training artificial neural networks. However, for some ANNs in which the number of effective parameters greatly exceeds the number of distinct parameters, especially RNNs and CNNs, training with back propagation is notoriously inefficient. Specifically, because of its recurrent nature an RNN unrolls into an extremely deep network whose depth equals the sequence length, so the vanishing-gradient problem is more common and especially serious. Meanwhile, in a CNN, parameter sharing, in which the same weights are reused to extract features from different parts of the image, runs through the whole network.

In addition to using back propagation for training the coherent nanophotonic circuit, the research team also proposed a way to obtain the gradient of each distinct parameter directly on the ONN using only forward propagation and the finite-difference method [36]. The procedure is as follows: first, compute the two forward passes J(W_ij) and J(W_ij + δ_ij) in constant time; then compute ΔW_ij = (J(W_ij + δ_ij) − J(W_ij))/δ_ij. In this way the gradient ΔW_ij of each weight parameter is obtained with forward propagation alone.
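
A minimal sketch of this forward-only gradient estimate is given below; the loss J and the weights W are placeholders, with a toy quadratic loss standing in for the on-chip measurement of J.

```python
# Finite-difference gradient from forward passes only, as described above.
import numpy as np

def finite_difference_gradient(J, W, delta=1e-4):
    """Estimate dJ/dW_ij as (J(W + delta*e_ij) - J(W)) / delta for every entry of W."""
    grad = np.zeros_like(W)
    J0 = J(W)                               # one forward pass with the unperturbed weights
    for idx in np.ndindex(W.shape):
        W_pert = W.copy()
        W_pert[idx] += delta                # perturb a single weight (one extra forward pass each)
        grad[idx] = (J(W_pert) - J0) / delta
    return grad

# Toy usage: quadratic loss whose true gradient is 2*W
W = np.random.randn(3, 3)
grad = finite_difference_gradient(lambda M: np.sum(M ** 2), W)
```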

Of course, this on-chip forward-propagation method is essentially a simple finite-difference scheme. Although it is simple in form and convenient to use, it requires one forward propagation per independent parameter, together with two evaluations of the loss function and one division. When there are many parameters, the efficiency is therefore very low.

In-situ Back propagation and Adjoint method

As an all-optical neural network, the coherent nanophotonic circuit mentioned above can carry out both linear operations and nonlinear activation in the optical path, has good forward-propagation speed and power efficiency, and has a promising development prospect. Its training can either use the traditional back-propagation algorithm or use forward propagation alone to train the network directly on the photonic chip, realizing a programmable optical neural network. However, Mach-Zehnder interferometers, directional couplers and phase modulators occupy a large area, so it is difficult to construct an optical network with more than 1000 neurons. In addition, owing to factors such as the finite precision of phase encoding, thermal crosstalk between phase shifters and photodetection noise in the MZIs, the recognition accuracy does not reach the expected level and remains far below that of an equivalent electronic neural network. Such inefficient training methods therefore cannot be applied to neural networks on integrated photonic platforms, and it is difficult to reach the goal of large-scale, fast, programmable and high-precision photonic neural networks.

In 2018, Tyler W. Hughes et al. from Stanford University proposed a method for training such a network efficiently and in situ: the gradients of the optical parameters are obtained during backward propagation through an adjoint-variable method, analogous to the way gradients are computed in a conventional neural network [86]. Moreover, these gradients can be obtained simply by measuring light intensities in the device.

As shown in Fig. 12(a), the transmission matrix W between the input and output of each layer is determined by the permittivity ε_l of the phase shifters in that layer. Using the mean square error (MSE) as the loss function L of the system, one first calculates the derivative of the loss function with respect to the permittivity ε_l of the last layer, and then recursively computes the gradient of each preceding layer by the chain rule. The gradient itself is evaluated with the electromagnetic adjoint-variable method: the derivative of the loss function with respect to ε_l can be expressed in terms of the original field e_og and the adjoint field e_aj as \( \frac{dL}{d{\varepsilon}_l}={k}_0^2R\left\{\sum \limits_{r\in {r}_{\phi }}{e}_{aj}(r){e}_{og}(r)\right\} \).

figure 12

a, b An efficient in situ training method for photonic neural networks [86]. c Layout of the wave-physics simulation of a recurrent network [87]. d Neuron pattern with unsupervised learning [61]

The last term in the intensity pattern produced by the interference of e_og and e_aj is exactly the quantity needed for the gradient: I = |e_og|² + |e_aj|² + 2R{e_og e_aj}. Thus, as long as e_aj can be generated in the OIU, the gradient can be obtained simply by measuring light intensity. Figure 12(b) shows the experimental procedure. First, in step (1), the original field X_{l−1} is sent in the forward direction and the intensity at each phase shifter, |e_og|², is recorded. Then, in step (2), the difference between the actual output and the ideal output is sent in the backward direction and the intensity at each phase shifter, \( {\left|{e}_{aj}^{\ast}\right|}^2 \), is recorded; the field injected backwards is the time-reversed adjoint field, which can be calculated as \( {X}_{TR}^{\ast }={\hat{W}}_l^T{\delta}_l \). Finally, as shown in step (3), the original field and the time-reversed field are input simultaneously so that they interfere, the intensity at each phase shifter is recorded, and the constant intensity terms measured in steps (1) and (2) are subtracted to obtain the gradient.
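
The intensity-subtraction step can be checked numerically with synthetic fields; the sketch below records the three intensities described above and recovers the interference (gradient-carrying) term. Physically the adjoint field is injected in its time-reversed form, which is why the recovered cross term corresponds to 2R{e_og e_aj} in the expression above.

```python
# Recover the gradient-carrying interference term from three intensity measurements.
import numpy as np

rng = np.random.default_rng(0)
e_og = rng.standard_normal(8) + 1j * rng.standard_normal(8)   # forward ("original") field at the phase shifters
e_aj = rng.standard_normal(8) + 1j * rng.standard_normal(8)   # adjoint field

I_fwd = np.abs(e_og) ** 2                # step (1): forward input only
I_adj = np.abs(e_aj) ** 2                # step (2): adjoint input only
I_both = np.abs(e_og + e_aj) ** 2        # step (3): both fields interfering

cross = I_both - I_fwd - I_adj           # interference term left after subtracting the constant intensities
assert np.allclose(cross, 2 * np.real(e_og * np.conj(e_aj)))
# dL/d(eps_l) is then proportional to k0^2 times the sum of this term over the phase-shifter region.
```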

Furthermore, the in situ back-propagation algorithm and the adjoint method for gradient measurement are also used to train the NNM [48], although the trained parameters are different. The training of the nanophotonic neural medium is governed by the nonlinear Maxwell's equations: an input image serves as the light source of an iterative nonlinear Maxwell solve. Before training, the electric field is randomly initialized to E_0, from which the permittivity can be calculated. A new electric field E_1 is then obtained by solving the equations with FDFD, E_1 is used to update ε, and the iteration continues until the field converges. Next, the gradient of the loss function with respect to the permittivity is computed and the structure of the NNM is updated, which completes the training step for one image. The process is then repeated with different images. At the beginning of training the dopants are randomly distributed; as training proceeds they move, merge and finally converge together, and the decision boundary gradually emerges. Although the training appears to update the permittivity ε, it is actually changing the distribution of dopants inside the medium, in other words the material density of the entire structure.
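
The control flow of this training loop is sketched below. This is not the authors' implementation: toy_solve is a placeholder for the nonlinear FDFD solve, and the adjoint gradient dL/dε is replaced here by a numerical finite-difference estimate purely for illustration.

```python
# Schematic NNM-style training loop: fixed-point nonlinear solve, gradient w.r.t. permittivity, update.
import numpy as np

rng = np.random.default_rng(0)
N = 8

def toy_solve(eps, source, kerr=0.05, n_iter=10):
    """Fixed-point iteration standing in for the nonlinear Maxwell solve: eps depends on |E|^2."""
    E = np.zeros_like(source)
    for _ in range(n_iter):
        eps_eff = eps + kerr * np.abs(E) ** 2     # field-dependent permittivity (Kerr-like)
        E = source / (1.0 + eps_eff)              # placeholder "solve", NOT real electromagnetics
    return E

def loss(eps, source, label):
    E = toy_solve(eps, source)
    p_top = np.sum(np.abs(E[: N // 2]) ** 2)      # intensity routed to two output regions
    p_bot = np.sum(np.abs(E[N // 2 :]) ** 2)
    return ((p_top / (p_top + p_bot)) - label) ** 2

def grad_eps(eps, source, label, d=1e-6):
    """Finite-difference stand-in for the adjoint-method gradient dL/d(eps)."""
    g = np.zeros_like(eps)
    L0 = loss(eps, source, label)
    for idx in np.ndindex(eps.shape):
        e = eps.copy()
        e[idx] += d
        g[idx] = (loss(e, source, label) - L0) / d
    return g

eps = rng.uniform(1.0, 2.0, (N, N))               # random initial dopant distribution
for img, label in [(rng.random((N, N)), 1.0), (rng.random((N, N)), 0.0)]:
    eps -= 0.5 * grad_eps(eps, img, label)        # updating eps == redistributing the dopants
```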

Maxwell's equations, however, describe not only light but every kind of electromagnetic wave, and the discovery and development of electromagnetic waves were inseparable from research on mechanical waves such as water waves and sound waves. Mechanical and electromagnetic waves have different generation mechanisms and their own characteristics, yet they are all waves and share many common rules: all of them exhibit reflection, refraction, interference, diffraction and other phenomena; wave speed, wavelength and frequency obey the same relationship; and their oscillation behaviour and energy distribution are similar. Therefore, the training of light waves in an NNM can be extended to other waves with similar characteristics, such as sound waves, so as to carry out deep learning tasks in other fields.

In December 2019, Tyler W. Hughes et al. analyzed neural networks constructed by simulating wave physics [87]. They first showed that the dynamics of the wave equation is conceptually equivalent to the dynamics of an RNN, and then designed an inhomogeneous medium, demonstrating how the material distribution can be trained through the wave-equation dynamics to classify vowels. The system layout is shown in Fig. 12(c). For the demonstration a binary system consisting of two materials was realized. As with the NNM, the initial distribution of wave velocity is a uniform region whose velocity lies between those of the two constituent materials. During training, the wave-equation model is used to carry out back propagation, and the gradient of the cross-entropy loss of the measured output with respect to the material density of each pixel in the trainable region is computed; this procedure is mathematically equivalent to the adjoint method. The Adam optimization algorithm then updates the material density with this gradient information, and the process is repeated until the final structure converges. The experimental results show that the resulting structure can indeed identify vowels: the average accuracy is 92.6 ± 1.1% on the training dataset and 86.3 ± 4.3% on the test dataset. The system predicts the ae vowel almost perfectly and can also distinguish the iy and ei vowels, though with lower accuracy, especially on samples outside the training set.
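
The conceptual equivalence stated above can be made concrete: one explicit finite-difference update of the scalar wave equation acts like an RNN cell, with the pair of field snapshots as the hidden state and the wave-speed map playing the role of the trainable weights. The sketch below is illustrative only; the grid, time step and source are toy assumptions.

```python
# One leapfrog step of the scalar wave equation, viewed as an RNN recurrence.
import numpy as np

def wave_rnn_cell(u_prev, u_curr, c, source, dt=1e-3, dx=1e-2):
    """u_{t+1} = 2 u_t - u_{t-1} + (c*dt/dx)^2 * laplacian(u_t) + source"""
    lap = (np.roll(u_curr, 1, 0) + np.roll(u_curr, -1, 0) +
           np.roll(u_curr, 1, 1) + np.roll(u_curr, -1, 1) - 4 * u_curr)
    u_next = 2 * u_curr - u_prev + (c * dt / dx) ** 2 * lap + source
    return u_curr, u_next                      # new hidden state, like an RNN recurrence

N = 32
c = np.full((N, N), 1.0)                       # trainable wave-speed (material) distribution
u_prev, u_curr = np.zeros((N, N)), np.zeros((N, N))
for t in range(100):                           # unrolling in time == running the RNN
    src = np.zeros((N, N))
    src[N // 2, 2] = np.sin(0.3 * t)           # injected signal at the source location (toy)
    u_prev, u_curr = wave_rnn_cell(u_prev, u_curr, c, src)
```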

Although neural network systems that combine scattering with deep learning can perform classification and recognition tasks, their material fabrication presents some difficulties. Active tumour slices are not easy to obtain, and the size of a nanophotonic medium is on the order of microns to millimetres, so its fabrication is not simple either. The materials of scattering media, by contrast, are diverse and easy to obtain. For linear materials on an optical platform, linear dopants such as air pores can be used; in an acoustic environment, the material distribution may combine air, with a sound speed of 331 m/s, and porous silicone rubber, with a sound speed of about 150 m/s [88]. For nonlinear materials on an optical platform, exploiting the Kerr nonlinearity is the most direct way to realize a nonlinear wave velocity; silicon (Si) and chalcogenide glasses (such as As₂S₃) are two widely used nonlinear optical materials on integrated platforms, and chalcogenide glass has one of the highest damage thresholds [89]. Another commonly used optical nonlinearity is saturable absorption, an intensity-dependent absorption/damping defined mathematically as \( b(u)=\frac{b_0}{1+{\left(\frac{u}{u_{th}}\right)}^2} \). One possible way to achieve this effect is to place graphene or other absorbing 2D materials on a linear optical circuit etched in a medium such as silicon. Acoustically, many fluids, especially bubbly ones such as carbonated water, exhibit strong nonlinear responses. Thus, not only light and sound waves but any wave governed by equations analogous to Maxwell's can be used to build a network system trained on an inhomogeneous medium.
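
For reference, the saturable-absorption damping just quoted is straightforward to evaluate; the symbol b(u) follows from the b0 numerator, and the values of b0 and u_th below are purely illustrative.

```python
# Saturable-absorption damping b(u) = b0 / (1 + (u/u_th)^2), with illustrative constants.
import numpy as np

def saturable_damping(u, b0=1.0, u_th=0.5):
    return b0 / (1.0 + (u / u_th) ** 2)

print(saturable_damping(np.array([0.0, 0.5, 5.0])))   # damping decreases as the field amplitude grows
```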

Training spiking neural network (SNN) with STDP mechanism

In the all-optical spiking neural network, not only supervised learning for simple image recognition but also unsupervised learning was demonstrated [61]. In the supervised learning experiment, the synaptic weights of the network are trained on a computer with the back-propagation algorithm: given a set of training data consisting of input patterns and expected outputs, the synaptic weights are adjusted according to the deviation between expected and actual output until the deviation is minimized and the network converges. In unsupervised learning, the network updates its own weights through a feedback loop and adapts to specific patterns in this way, without external computer control. The unsupervised neuron pattern is shown in Fig. 12(d). Unsupervised learning uses the spike-timing-dependent plasticity (STDP) rule to update the weights: the change in a synaptic weight depends on the time difference between the pre-synaptic and post-synaptic neuron pulses [90]. If an input signal arrives just before the output spike, that input likely contributed to reaching the trigger threshold and the corresponding weight is increased; if the input pulse arrives after the output pulse, the synaptic weight is reduced. The magnitude of the increase or decrease is a function of the time difference between the input and output spikes. Eqs. (3) and (4) give the weight updates of two mainstream unsupervised STDP learning algorithms.
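
Since Eqs. (3) and (4) are not reproduced here, the sketch below uses the standard exponential STDP form, which matches the qualitative description above; the amplitudes and time constants are assumed values, not the ones used in [61].

```python
# Standard exponential STDP update as a function of the post-minus-pre spike time difference.
import numpy as np

def stdp_delta_w(dt, A_plus=0.01, A_minus=0.012, tau_plus=20e-3, tau_minus=20e-3):
    """Weight change for a spike pair with dt = t_post - t_pre (seconds)."""
    if dt >= 0:
        return A_plus * np.exp(-dt / tau_plus)    # pre before post: potentiation
    return -A_minus * np.exp(dt / tau_minus)      # pre after post: depression

print(stdp_delta_w(5e-3), stdp_delta_w(-5e-3))    # small positive / negative weight changes
```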

In 2020, Shuiying Xiang proposed a hardware architecture for a multi-layer photonic spiking neural network [91], which uses a vertical-cavity surface-emitting laser embedded with a saturable absorber as the spiking neuron, supporting two polarization modes. If the polarization of two connected neurons is the same, the connection is regarded as an excitatory synapse; if it is orthogonal, it is regarded as an inhibitory synapse. In addition, applying the photonic STDP criterion, a supervised learning algorithm based on the Tempotron rule and the photonic STDP rule was designed, suitable for multi-layer photonic spiking neural networks. This neuromorphic network can solve the classical XOR problem, and the influence of the physical parameters of the photonic neurons on training convergence was also considered. Furthermore, the multi-layer photonic SNN was extended to realize other logical tasks.

Pseudo-inverse matrix method for reservoir computing

Reservoir computing based on WDM technology is a special neural network system in which both the trainable parameters and the training method differ from those of other neural networks. An RC system has a fixed reservoir, the so-called hidden layer, whose input matrix and internal connection matrix are given randomly and kept fixed. Thus, RC only needs to train the output matrix between the reservoir layer and the output layer. Existing training methods include the pseudo-inverse matrix method, ridge regression and the least-squares method, of which the pseudo-inverse matrix method is the most commonly used. We take the RC system based on optical fiber communication as an example to illustrate its application to the training of reservoir computing [53].

The RC system is a recursive system governed by the time-dependent internal state x(n); each neuron state can be described as a function of the current input and the previous internal state.

The output of the network can be expressed as y(n) = W_out[1; u(n); x(n)]. By collecting the training data [1; u(n); x(n)] and the training target signal, the readout matrix can be obtained with the pseudo-inverse matrix method. To estimate the difference between the theoretical output and the system output, the normalized root mean square error (NRMSE) is used as the indicator.
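
The numerical sketch below shows this readout-only training, assuming a tanh reservoir update x(n) = f(W_in[1; u(n)] + W x(n−1)) and one common NRMSE convention; the reservoir sizes, scaling and the toy delay task are illustrative and not taken from [53].

```python
# Pseudo-inverse training of the readout matrix W_out of a reservoir computer.
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res, T = 1, 100, 2000
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in + 1))        # random, fixed input weights
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))         # fixed reservoir, spectral radius < 1

u = rng.uniform(-1, 1, (T, n_in))                       # input sequence
y_target = np.roll(u, 1, axis=0)                        # toy target: one-step-delayed input

x = np.zeros(n_res)
states = np.zeros((T, 1 + n_in + n_res))
for n in range(T):
    x = np.tanh(W_in @ np.concatenate(([1.0], u[n])) + W @ x)   # reservoir state update
    states[n] = np.concatenate(([1.0], u[n], x))                # collect [1; u(n); x(n)]

W_out = np.linalg.pinv(states) @ y_target               # pseudo-inverse solution for the readout
y = states @ W_out

nrmse = np.sqrt(np.mean((y - y_target) ** 2) / np.var(y_target))   # one common NRMSE definition
print(nrmse)
```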

To evaluate the performance of the optical RC system, simulation experiments on recognizing input signal waveforms were carried out with commercial optical-fiber-communication software, comparing configurations without and with interconnections between the optical neurons. In the optical implementation, the pseudo-inverse training step is replaced by tuning the optical delay of the phase modulator (PM) between each optical neuron and the directional coupler of the output layer, and the output is optimized by searching for the lowest NRMSE. The final results show that the ONN performs well in recognizing input signal waveforms, and that reservoir computing with randomly interconnected optical neurons performs even better. In addition, the performance of the RC system when the EDFA operates in its linear and nonlinear regimes was studied. Finally, it is shown that the optical neurons in an RC system should be activated by a nonlinear activation function in order to obtain signal-recognition ability.

Discussion and outlook

ONN is a promising alternative to ENN, with two obvious advantages. Firstly, the matrix multiplication on which an ANN relies can be performed at the speed of light in an ONN and detected at rates above 50 GHz [92]. Secondly, after training an ONN is passive, so optical computation can be carried out with minimal power consumption [48]. A large number of different types of ONNs have been reported, including ONNs based on diffractive and free-space optics, integrated photonic circuits based on interference and on spiking synaptic mechanisms, and even neural networks that use wavelength division multiplexing for reservoir computing. Here we summarize and discuss the network technologies covered in this paper, and point out the current challenges and possible future directions for the implementation of optical neural networks.

Diffractive neural networks make good use of the phenomenon of light diffraction and realize full connectivity between the neurons of adjacent layers, which gives the model strong learning ability. However, these studies lack an important ingredient, nonlinear activation, and the researchers themselves note that their scheme does not involve nonlinear activation functions. In the future, such optical diffractive networks could be complemented with nonlinearity, for example by using nonlinear optical media such as photorefractive crystals and magneto-optical traps, or by adopting nonlinear activation functions that have already been studied, so as to compensate for this absence experimentally. Beyond that, diffractive networks and Fourier-transform-based ONNs belong to free-space architectures; because of bulky optical elements such as diffractive components and lenses, it is challenging to scale them up to a very large number of neurons. Scattering-based neural networks are also worth studying: owing to the disorder of scattering, light can be scattered in all directions, so light passing through a scattering medium is equivalent to a large number of computations and may surpass a conventional layered feedforward network, and the particular nature of nano-scattering media makes many kinds of real-time training possible. Chip-based optical neural networks, such as the coherent nanophotonic circuit and the spiking network, offer a CMOS-compatible, scalable approach to optical deep learning; they have great advantages in device miniaturization and in expanding the network size, and they operate on light with strong computing power and minimal resource consumption. However, chip-based networks are extremely expensive and technically demanding, requiring substantial manpower and material resources, so despite their excellent prospects there are still technical challenges to overcome.

Nonlinear operation is the root of the strong expressive power of ANNs: it enables the network to learn complex mappings between input and output, speeds up convergence and improves recognition accuracy, making it an indispensable component of a neural network. Excellent nonlinear activations based on graphene, PCM, EIT and other mechanisms have emerged, yet implementing nonlinear functions in the optical domain still faces major challenges. Firstly, optical nonlinearities are relatively weak and generally require very high optical power, which greatly increases energy consumption and may damage other optical devices in the system. Secondly, optical nonlinearity has to be balanced against working bandwidth, which limits the information-processing capability of the ONN; moreover, many elements in an optical circuit must maintain a consistent resonant response with one another, requiring additional control circuits to calibrate each element [93]. Thirdly, photonic artificial-intelligence chips demand flexible nonlinear activation functions, but optical nonlinear effects are difficult to control: once a nonlinear device is fabricated its response is essentially fixed, which cannot meet the need for flexibility, and integrating nonlinear optical units on chip also raises many problems of process compatibility and device consistency. In summary, how to realize optical nonlinear activation functions with low power consumption, high speed, easy implementation and rich expressive forms is an urgent technical problem for researchers in this field.

Apart from seeking breakthroughs in linear operation and nonlinear activation, effort can also be devoted to the training of these networks. At present, many networks complete the training process on a computer and then perform only identification and classification tasks in the optical system; such an approach is inevitably task-specific. It is therefore extremely important to find training methods that operate in the optical domain and allow real-time training. The coherent nanophotonic circuit realizes forward-propagation-based programmable training, and an efficient in situ training method has also been proposed for it. The training of the nano-scattering medium is another good example: the permittivity of the material can be changed by controlling the electric field, thereby controlling the distribution of internal dopants until a stable configuration is reached, and the training can be repeated for different tasks or goals.

Of course, the successful implementation of ONNs cannot be separated from technologies in other fields. For instance, the fabrication of the 3D diffractive layers in the D2NN uses 3D printing and Poisson surface reconstruction; the grating diffractive layer is etched with semiconductor processing technology; and silicon photonic integration is needed for the coherent nanophotonic circuit, the spiking network and the on-chip nanophotonic neural medium. There is even cooperation with the fields of metamaterials, scattering imaging and so on. Moreover, realizing nonlinear activation also requires knowledge of chemistry and materials science.

Conclusions

In this paper, we have introduced and analyzed in detail an advanced branch of deep learning, the optical neural network. Firstly, we described how linear connections and optical nonlinear activation are realized in ONNs, and then how ONNs are trained, in terms of the different training and gradient-calculation methods. Finally, we summarized and discussed optical neural network techniques and pointed out current challenges and future developments, with simple data analysis and comparison for some typical applications. As an interdisciplinary product of photonic technology and artificial intelligence, the photonic neural network can combine the advantages of both to build high-speed, low-power, large-bandwidth network structures and break through the bottlenecks of traditional electronic neural networks. However, photonic neural networks still need to overcome problems such as real-time training, the implementation of nonlinear activation functions, and the expansion of scale and applications. It is believed that in the near future photonic neural networks will better exploit the advantages brought by combining optoelectronic technology with artificial intelligence, and so help to build a green, intelligent world.

Availability of data and materials

Data sharing is not applicable to this article as no new datasets were created in this review.

Abbreviations

  • Artificial Neural Network
  • Arbitrary Waveform Generator
  • Charge Coupled Device
  • Diffraction Deep Neural Network
  • Distributed Feedback
  • Digital Signal Processor
  • Erbium Doped Optical Fiber Amplifier
  • Electromagnetic Induced Transparency
  • Finite-Difference Frequency-Domain
  • Feature-Extracting Optical Neuron Device
  • Gale-Shapley algorithm
  • Magneto-Optical Trap
  • Microring Resonator
  • Mean Square Error
  • Mach-Zehnder Interferometer
  • Nanophotonic Neural Medium
  • Normalized Root Mean Square Error
  • Optical Matrix Multiplier
  • Optical Neural Network
  • Phase Change Material
  • Phase Modulator
  • Quantum Optical Neural Network
  • Quantum Dots
  • Recurrent Neural Network
  • Spatial Light Modulator
  • Spiking Neural Network
  • Spike Timing Dependent Plasticity
  • Singular Value Decomposition
  • Time-Stretch Electro-Optical Neural Network
  • Wavelength Division Multiplexing

Hines ML, Carnevale NT. The neuron simulation environment. Neural Comput. 1997;9(6):1179–209. https://doi.org/10.1162/neco.1997.9.6.1179 .

Schwabe RJ, Zelinger S, Key TS, Phipps KO. Electronic lighting interference. IEEE Ind Appl Mag. 1998;4:46–8.

Markram H, Muller E, Ramaswamy S. Reconstruction and simulation of neocortical microcircuitry. Cell. 2015;163(2):456–92. https://doi.org/10.1016/j.cell.2015.09.029 .

Tsai F-CF, O'Brien CJ, Petrović NS, Rakić AD. Analysis of optical channel cross talk for free-space optical interconnects in the presence of higher-order transverse modes. Appl Optics. 2005;44(30):6380–7. https://doi.org/10.1364/AO.44.006380 .

Hu W, Li X, Yang J, Kong D. Crosstalk analysis of aligned and misaligned free-space optical interconnect systems. J Opt Soc Am A. 2010;27(2):200–5. https://doi.org/10.1364/JOSAA.27.000200 .

Goodman JW, Dias AR, Woody LM. Fully parallel, high-speed incoherent optical method for performing discrete fourier transforms. Opt Lett. 1978;2(1):1–3. https://doi.org/10.1364/OL.2.000001 .

Hu X, Wang A, Zeng M, Long Y, Zhu L, Fu L, et al. Graphene-assisted multiple-input high-base optical computing. Sci Rep. 2016;6:32911.

Caulfield HJ, Dolev S. Why future supercomputing requires optics. Nat Photon. 2010;4(5):261–3. https://doi.org/10.1038/nphoton.2010.94 .

Mosca EP, Griffin RD, Pursel FP, Lee JN. Acoustooptical matrix-vector product processor: implementationissues. Appl Optics. 1989;28(18):3843–51. https://doi.org/10.1364/AO.28.003843 .

Sun C-C, Chang M-W, Hsu KY. Matrix-matrix multiplication by using anisotropic self-diffraction in BaTiO3. Appl Optics. 1994;33:4501–7.

Nasr MB, Chtourou M. A hybrid training algorithm for feedforward neural networks. Neural Process Lett. 2006;24(2):107–17. https://doi.org/10.1007/s11063-006-9013-x .

de Lima TF, Shastri BJ, Tait AN, Nahmias MA, Prucnal PR. Progress in neuromorphic photonics. Nanophotonics. 2017;6(3):577–99. https://doi.org/10.1515/nanoph-2016-0139 .

Chen Y. 4f-type optical system for matrix multiplication. Optim Eng. 1993;32.

PIAGGIO HTH. The mathematical theory of huygens' principle. Nature. 1940;145(3675):531–2. https://doi.org/10.1038/145531a0 .

Young T. The Bakerian lecture. Experiments and calculations relative to physical optics. Abstr Pap Print Philos Transactions Royal Soc Lond. 1832;1:131–2.

Mandel L, Wolf E. Some properties of coherent light*. J Opt Soc Am. 1961;51(8):815–9. https://doi.org/10.1364/JOSA.51.000815 .

Porter MB. Concerning green's theorem and the cauchy-riemann differential equations. Ann Math Sec Ser. 1905;7(1):1–2. https://doi.org/10.2307/1967189 .

AL-Jawary MA, Wrobel LC. Numerical solution of the two-dimensional helmholtz equation with variable coefficients by the radial integration boundary integral and integro-differential equation methods. Int J Comput Math. 2012;89:1463–87.

Umul YZ. Young-kirchhoff-rubinowicz theory of diffraction in the light of sommerfeld's solution. J Opt Soc Am A. 2008;25(11):2734–42. https://doi.org/10.1364/JOSAA.25.002734 .

Sommerfeld A. Optics. Lectures on theoretical physics, vol. iv. Am J Phys. 1955;23(7):477–8. https://doi.org/10.1119/1.1934064 .

Goodman J. Introduction to Fourier optics: 2nd Edition, Roberts and Company Publishers, Englewood; 1995. p. 35.

Karczewski B. Fraunhofer diffraction of an electromagnetic wave. J Opt Soc Am. 1961;51(10):1055–7. https://doi.org/10.1364/JOSA.51.001055 .

Wang X, Xu Q, Liu E. Angular spectrum theory to calculate coupling efficiency in rectangular waveguide resonators. Opt Laser Technol. 2000;32(3):177–81. https://doi.org/10.1016/S0030-3992(00)00037-2 .

Lin X, Rivenson Y, Yardimci NT, Veli M, Luo Y, Jarrahi M, et al. All-optical machine learning using diffractive deep neural networks. Science. 2018;361(6406):1004–8. https://doi.org/10.1126/science.aat8084 .

Lu L, Zhu L, Zhang Q, Zhu B, Yao Q, Yu M, et al. Miniaturized diffraction grating design and processing for deep neural network. IEEE Photon Technol Lett. 2019;31(24):1952–5. https://doi.org/10.1109/LPT.2019.2948626 .

Qian C, Lin X, Xu J, Sun Y, Li E, Zhang B, et al. Performing optical logic operations by a diffractive neural network. Light Sci Appl. 2020;9(1):59. https://doi.org/10.1038/s41377-020-0303-2 .

Luo Y, Mengu D, Yardimci NT, Rivenson Y, Veli M, Jarrahi M, et al. Design of task-specific optical systems using broadband diffractive neural networks. Light Sci Appl. 2019;8(1):112. https://doi.org/10.1038/s41377-019-0223-1 .

Liao D, Chan KF, Chan CH, Zhang Q, Wang H. All-optical diffractive neural networked terahertz hologram. Opt Lett. 2020;45(10):2906–9. https://doi.org/10.1364/OL.394046 .

Blackwell CA, Simpson RS. The convolution theorem in modern analysis. IEEE Transact Educ. 1966;9(1):29–32. https://doi.org/10.1109/TE.1966.4321930 .

Lu T, Wu S, Xu X, Yu FTS. Two-dimensional programmable optical neural network. Appl Optics. 1989;28(22):4908–13. https://doi.org/10.1364/AO.28.004908 .

Gao S, Yang J, Feng Z, Zhang Y. Implementation of a large-scale optical neural network by use of a coaxial lenslet array for interconnection. Appl Optics. 1997;36(20):4779–83. https://doi.org/10.1364/AO.36.004779 .

Kuratomi Y, Takimoto A, Akiyama K, Ogawa H. Optical neural network using vector-feature extraction. Appl Optics. 1993;32(29):5750–8. https://doi.org/10.1364/AO.32.005750 .

Chang J, Sitzmann V, Dun X, Heidrich W, Wetzstein G. Hybrid optical-electronic convolutional neural networks with optimized diffractive optics for image classification. Sci Rep. 2018;8:12324.

Zuo Y, Li B, Zhao Y, Jiang Y, Chen Y-C, Chen P, et al. All-optical neural network with nonlinear activation functions. Optica. 2019;6(9):1132–7. https://doi.org/10.1364/OPTICA.6.001132 .

Breit G. The interference of light and the quantum theory. Proc Natl Acad Sci. 1923;9(7):238–43. https://doi.org/10.1073/pnas.9.7.238 .

Shen Y, Harris NC, Skirlo S, Prabhu M, Baehr-Jones T, Hochberg M, et al. Deep learning with coherent nanophotonic circuits. Nat Photon. 2017;11:441–6.

Elson JM, Rahn JP, Bennett JM. Light scattering from multilayer optics: comparison of theory and experiment. Appl Optics. 1980;19(5):669–79. https://doi.org/10.1364/AO.19.000669 .

Rochon P, Bissonnette D. Lensless imaging due to back-scattering. Nature. 1990;348(6303):708–10. https://doi.org/10.1038/348708a0 .

Vellekoop IM, Mosk AP. Focusing coherent light through opaque strongly scattering media. Opt Lett. 2007;32(16):2309–11. https://doi.org/10.1364/OL.32.002309 .

Katz O, Small E, Silberberg Y. Looking around corners and through thin turbid layers in real time with scattered incoherent light. Nat Photon. 2012;6(8):549–53. https://doi.org/10.1038/nphoton.2012.150 .

Vellekoop IM, Lagendijk A, Mosk AP. Exploiting disorder for perfect focusing. Nat Photon. 2010;4(5):320–2. https://doi.org/10.1038/nphoton.2010.3 .

Bertolotti J, van Putten EG, Akbulut D, Vos WL, Lagendjk A, Mosk AP. Scattering optics resolve nanostructure. In: Proc. SPIE 8102, Nanoengineering: fabrication, properties, optics, and devices VIII; 2011. p. 810206.

Huang D, Swanson EA, Lin CP, Schuman JS, Stinson WG, Chang W, et al. Optical coherence tomography. Science. 1991;254(5035):1178–81. https://doi.org/10.1126/science.1957169 .

Katz O, Heidmann P, Fink M, Gigan S. Non-invasive single-shot imaging through scattering layers and around corners via speckle correlations. Nat Photon. 2014;8(10):784–90. https://doi.org/10.1038/nphoton.2014.189 .

Yaqoob Z, Psaltis D, Feld MS, Yang C. Optical phase conjugation for turbidity suppression in biological samples. Nat Photon. 2008;2(2):110–5. https://doi.org/10.1038/nphoton.2007.297 .

Ando T, Horisaki R, Tanida J. Speckle-learning-based object recognition through scattering media. Opt Express. 2015;23(26):33902–10. https://doi.org/10.1364/OE.23.033902 .

Pierangeli D, Marcucci G, Moriconi C, Perini G, Spirito MD, Papi EAM. Deep optical neural network by living tumour brain cells. Physis. 2018.

Khoram E, Chen A, Liu D, Ying L, Wang Q, Yuan M, et al. Nanophotonic media for artificial neural inference. Photon Res. 2019;7(8):823–7. https://doi.org/10.1364/PRJ.7.000823 .

Qu Y, Zhu HZ, Shen YC, Zhang J, Tao CN, Ghosh P, et al. Inverse design of an integrated-nanophotonics optical neural network. Sci Bull. 2020;65(14):1177–83. https://doi.org/10.1016/j.scib.2020.03.042 .

Koester CJ. Wavelength multiplexing in fiber optics. J Opt Soc Am. 1968;58(1):63–70. https://doi.org/10.1364/JOSA.58.000063 .

Paquot Y, Duport F, Smerieri A, Dambre J, Schrauwen B, Haelterman M, et al. Optoelectronic reservoir computing. Sci Rep. 2012;2:287.

Duport F, Schneider B, Smerieri A, Haelterman M, Massar S. All-optical reservoir computing. Opt Express. 2012;20(20):22783–95. https://doi.org/10.1364/OE.20.022783 .

Cheng T-Y, Chou D-Y, Liu C-C, Chang Y-J, Chen C-C. Optical neural networks based on optical fiber-communication. Neurocomputing. 2019;364:239–44. https://doi.org/10.1016/j.neucom.2019.07.051 .

Zang Y, Chen M, Yang S, Chen H. Electro-optical neural networks based on time-stretch method. IEEE J Sel Top Quantum Electron. 2020;26(1):1–10. https://doi.org/10.1109/JSTQE.2019.2957446 .

Zhang H, Feng X, Li B, Wang Y, Cui K, Liu F, et al. Integrated photonic reservoir computing based on hierarchical time-multiplexing structure. Opt Express. 2014;22(25):31356–70. https://doi.org/10.1364/OE.22.031356 .

Nguimdo RM, Verschaffelt G, Danckaert J, der Sande GV. Simultaneous computation of two independent tasks using reservoir computing based on a single photonic nonlinear node with optical feedback. IEEE Transact Neur Netw Learn Syst. 2015;26(12):3301–7. https://doi.org/10.1109/TNNLS.2015.2404346 .

Maass W. Networks of spiking neurons: the third generation of neural network models. Neural Netw. 1997;10(9):1659–71. https://doi.org/10.1016/S0893-6080(97)00011-7 .

Tait AN, de Lima TF, Zhou E, Wu AX, Nahmias MA, Shastri BJ, et al. Neuromorphic photonic networks using silicon photonic weight banks. Sci Rep. 2017;7:7430.

Shastri BJ, Nahmias MA, Tait AN, Rodriguez AW, Wu B, Prucnal PR. Spike processing with a graphene excitable laser. Sci Rep. 2016;6:19126.

Chakraborty I, Saha G, Sengupta A, Roy K. Toward fast neural computing using all-photonic phase change spiking neurons. Sci Rep. 2018;8:12980.

Feldmann J, Youngblood N, Wright CD, Bhaskaran H, Pernice WHP. All-optical spiking neurosynaptic networks with self-learning capabilities. Nature. 2019;569(7755):208–14. https://doi.org/10.1038/s41586-019-1157-8 .

Nahmias MA, Peng H, de Lima TF, Huang C, Tait AN, Shastri BJ, Prucnal PR. A TeraMAC neuromorphic photonic processor. In: 2018 IEEE photonics Conf. (IPC); 2018. p. 1–2.

Tait AN, Nahmias MA, Shastri BJ, Prucnal PR. Broadcast and weight: an integrated network for scalable photonic spike processing. J Light Technol. 2014;32(21):4029–41. https://doi.org/10.1109/JLT.2014.2345652 .

Shainline JM, Buckley SM, McCaughan AN, Chiles J, Jafari-Salim A, Mirin RP, et al. Circuit designs for superconducting optoelectronic loop neurons. J Appl Phys. 2018;124(15):152130. https://doi.org/10.1063/1.5038031 .

Selden AC. Pulse transmission through a saturable absorber. Br J Appl Phys. 1967;18(6):743–8. https://doi.org/10.1088/0508-3443/18/6/306 .

Braunstein R. Nonlinear optical effects. Phys Rev. 1962;125(2):475–7. https://doi.org/10.1103/PhysRev.125.475 .

Cotton A. Recherches Sur l'absorption et la dispersion de la lumiere par les milieux doues du pouvoir rotatoire. J Phys Theor Appl. 1896;5(1):237–44. https://doi.org/10.1051/jphystap:018960050023700 .

Skinner SR, Steck JE, Behrman EC. Optical neural network using Kerr-type nonlinear materials. In: Proceedings of the fourth international conference on microelectronics for neural networks and fuzzy systems: IEEE; 1994. p. 12–5.

Dejonckheere A, Duport F, Smerieri A, Fang L, Oudar J-L, Haelterman M, et al. All-optical reservoir computer based on saturation of absorption. Opt Express. 2014;22(9):10868–81. https://doi.org/10.1364/OE.22.010868 .

Cheng Z, Tsang HK, Wan X, Xu K, Xu J. In-plane optical absorption and free carrier absorption in graphene-on-silicon waveguides. IEEE J Sel Top Quant Electron. 2013;20:43–8.

Soljacic M, Ibanescu M, Johnson SG, Fink Y, Joannopoulos JD. Optimal bistable switching in nonlinear photonic crystals. Phys Rev E. 2002;66(5):055601. https://doi.org/10.1103/PhysRevE.66.055601 .

Coarer FD, Sciamanna M, Katumba A, Freiberger M, Dambre J, Bienstman P, et al. All-optical reservoir computing on a photonic chip using silicon-based ring resonators. IEEE J Sel Top Quant Electron. 2018;24(6):1–8. https://doi.org/10.1109/JSTQE.2018.2836985 .

Serber R. The theory of depolarization, optical anisotropy, and the Kerr effect. Phys Rev. 1933;43(12):1003–10. https://doi.org/10.1103/PhysRev.43.1003 .

Weinberger P. John Kerr and his effects found in 1877 and 1878. Philos Mag Lett. 2008;88(12):897–907. https://doi.org/10.1080/09500830802526604 .

Mesaritakis C, Kapsalis A, Syvridis D. All-optical reservoir computing system based on ingaasp ring resonators for high-speed identification and optical routing in optical networks. Quant Sens Nanophoton Devices XII. 2015;9370:608–14.

Steinbrecher GR, Olson JP, Englund D, Carolan J. Quantum optical neural networks. NPJ Quant Inf. 2019;5:60.

Amin R, George J, Khurgin J, El-Ghazawi T, Prucnal PR, Sorger VJ. Attojoule modulators for photonic neuromorphic computing. In: Conference on lasers and electro-optics: Optical Society of America; 2018. p. ATh1Q.4.

Amin R, Khan S, Lee CJ, Dalir H, Sorger VJ. 110 attojoule-per-bit efficient graphene-based plasmon modulator on silicon. In: Conference on lasers and electro-optics: Optical Society of America; 2018. p. SM1I.5.

George JK, Mehrabian A, Amin R, Meng J, de Lima TF, Tait AN, et al. Neuromorphic photonics with electro-absorption modulators. Opt Express. 2019;27(4):5181–91. https://doi.org/10.1364/OE.27.005181 .

George J, Amin R, Mehrabian A, Khurgin J, El-Ghazawi T, Prucnal PR, Sorger VJ. Electrooptic nonlinear activation functions for vector matrix multiplications in optical neural networks. In: Advanced photonics 2018 (BGPP, IPR, NP, NOMA, sensors, networks, SPPCom, SOF): Optical Society of America; 2018. p. SpW4G.3.

Miscuglio M, Mehrabian A, Hu Z, Azzam SI, George J, Kildishev AV, et al. All-optical nonlinear activation function for photonic neural networks. Opt Mater Express. 2018;8:3851–63.

Fleischhauer M, Imamoglu A, Marangos JP. Electromagnetically induced transparency: optics in coherent media. Rev Mod Phys. 2005;77(2):633–73. https://doi.org/10.1103/RevModPhys.77.633 .

Williamson IAD, Hughes TW, Minkov M, Bartlett B, Pai S, Fan S. Reprogrammable electro-optic nonlinear activation functions for optical neural networks. IEEE J Sel Top Quantum Electron. 2020;26(1):1–12. https://doi.org/10.1109/JSTQE.2019.2930455 .

Mengu D, Luo Y, Rivenson Y, Ozcan A. Analysis of diffractive optical neural networks and their integration with electronic neural networks. IEEE J Sel Top Quantum Electron. 2020;26(1):1–14. https://doi.org/10.1109/JSTQE.2019.2921376 .

Zhou T, Fang L, Yan T, Wu J, Li Y, Fan J, et al. In situ optical backpropagation training of diffractive optical neural networks. Photon Res. 2020;8(6):940–53. https://doi.org/10.1364/PRJ.389553 .

Hughes TW, Minkov M, Shi Y, Fan S. Training of photonic neural networks through in situ backpropagation and gradient measurement. Optica. 2018;5(7):864–71. https://doi.org/10.1364/OPTICA.5.000864 .

Hughes TW, Williamson IAD, Minkov M, Fan S. Wave physics as an analog recurrent neural network. Sci Adv. 2019;5(12):eaay6946.

Ba A, Kovalenko A, Aristegui C, Mondain-Monval O, Brunet T. Soft porous silicone rubbers with ultra-low sound speeds in acoustic metamaterials. Sci Rep. 2017;7:40106.

Qiu J, Si J, Hirao K. Photoinduced stable second-harmonic generation in chalcogenide glasses. Opt Lett. 2001;26(12):914–6. https://doi.org/10.1364/OL.26.000914 .

Karmarkar UR, Najarian MT, Buonomano DV. Mechanisms and significance of spike-timing dependent plasticity. Biol Cybern. 2002;87(5-6):373–82. https://doi.org/10.1007/s00422-002-0351-0 .

Xiang S, Ren Z, Zhang Y, Song Z, Guo X, Han G, et al. Training a multi-layer photonic spiking neural network with modified supervised learning algorithm based on photonic STDP. IEEE J Sel Top Quantum Electron. 2020;27:1–9.

Vivien L, Polzer A, Marris-Morini D, Osmond J, Hartmann JM, Crozat P, et al. Zero-bias 40Gbit/s germanium waveguide photodetector on silicon. Opt Express. 2012;20(2):1096–101. https://doi.org/10.1364/OE.20.001096 .

Radulaski M, Bose R, Tran T, Van Vaerenbergh T, Kielpinski D, Beausoleil RG. Thermally tunable hybrid photonic architecture for nonlinear optical circuits. ACS Photon. 2018;5(11):4323–9. https://doi.org/10.1021/acsphotonics.8b00376 .

Acknowledgements

Not applicable.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 11773018 and Grant 61727802, in part by the Key Research and Development programs in Jiangsu China under Grant BE2018126, in part by the Fundamental Research Funds for the Central Universities under Grant 30919011401 and Grant 30920010001, and in part by the Leading Technology of Jiangsu Basic Research Plan under Grant BK20192003.

Author information

Authors and affiliations.

School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China

Jia Liu, Qiuhao Wu, Xiubao Sui, Qian Chen, Guohua Gu & Liping Wang

Institute of Armored Forces, Army Research Institute, Beijing, China

Shengcai Li

Contributions

Jia Liu finished the manuscript and prepared the figures, tables and references, and was a major contributor in writing the manuscript. Qiuhao Wu gave guidance and participated in the revision of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Xiubao Sui .

Ethics declarations

Competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article.

Liu, J., Wu, Q., Sui, X. et al. Research progress in optical neural networks: theory, applications and developments. PhotoniX 2 , 5 (2021). https://doi.org/10.1186/s43074-021-00026-0

Received : 23 December 2020

Accepted : 09 March 2021

Published : 19 April 2021

DOI : https://doi.org/10.1186/s43074-021-00026-0

  • Optical neural network
  • Deep learning
  • Optical linear operation
  • Optical nonlinearity
  • Training method

April 10, 2024

Neuroscience study taps into brain network patterns to understand deep focus, attention

by Jess Hunt-Ralston, Georgia Institute of Technology

From completing puzzles and playing music, to reading and exercising, growing up Dolly Seeburger loved activities that demanded her full attention. "It was in those times that I felt most content, like I was in the zone," she remembers. "Hours would pass, but it would feel like minutes."

While this deep focus state is essential to highly effective work, it's still not fully understood. Now, a new study led by Seeburger, a graduate student in the School of Psychology, alongside her advisor, Eric Schumacher, a professor in the School of Psychology, is unearthing the mechanisms behind it.

The interdisciplinary Georgia Tech team also includes Nan Xu, Sam Larson and Shella Keilholz (Coulter Department of Biomedical Engineering), alongside Marcus Ma (College of Computing), and Christine Godwin (School of Psychology).

The researchers' study, " Time-varying functional connectivity predicts fluctuations in sustained attention in a serial tapping task ," published in Cognitive, Affective, and Behavioral Neuroscience , investigates brain activity via fMRI during periods of deep focus and less-focused work.

The work is the first to investigate low-frequency fluctuations between different networks in the brain during focus, and could act as a springboard to study more complex behaviors and focus states.

"Your brain is dynamic. Nothing is just on or off," Seeburger explains. "This is the phenomenon we wanted to study. How does one get into the zone? Why is it that some people can sustain their attention better than others? Is this something that can be trained? If so, can we help people get better at it?"

The dynamic brain

The team's work is also the first to study the relationship between fluctuations in attention and the brain network patterns within these low-frequency 20-second cycles.

"For quite a while, the studies on neural oscillations focused on faster temporal frequencies, and the appreciation of these very low-frequency oscillations is relatively new," Seeburger says. "But, these low-frequency fluctuations may play a key role in regulating higher cognition such as sustained attention."

"One of the things we've discovered in previous research is that there's a natural fluctuation in activity in certain brain networks. When a subject is not doing a specific task while in the MRI scanner, we see that fluctuation happen roughly every 20 seconds," adds co-author Schumacher, explaining that the team was interested in the pattern because it is quasi-periodic, meaning that it doesn't repeat exactly every 20 seconds, and it varies between different trials and subjects.

By studying these quasi-periodic cycles, the team hoped to measure the relationship between the brain fluctuation in these networks and the behavioral fluctuation associated with changes in attention.

Your attention needed

To measure attention, participants tapped along to a metronome while in an fMRI scanner. The team could measure how "in the zone" participants were by measuring how much variability was in each participant's taps—more variability suggested the participant was less focused, while precise tapping suggested the participant was "in the zone."

The researchers found that when a subject's focus level changed, different regions of the brain synchronized and desynchronized, in particular the fronto-parietal control network (FPCN) and the default mode network (DMN). The FPCN is engaged when a person is trying to stay on task, whereas the DMN is correlated with internally oriented thoughts (which a participant might be having when less focused).

"When one is out-of-the-zone, these two networks synchronize, and are in phase in the low frequency," Seeburger explains. "When one is in the zone, these networks desynchronize."

The results suggest that the 20-second patterns could help predict if a person is sustaining their attention or not, and could provide key insight for researchers developing tools and techniques that help us deeply focus.

The big picture

While the direct relationship between behavior and brain activity is still unknown, these 20-second patterns in brain fluctuation are seen universally, and across species.

"If you put someone in a scanner and their mind is wandering, you find these fluctuations. You can find these quasi-period patterns in rodents. You can find it in primates," Schumacher says. "There's something fundamental about this brain network activity."

"I think it answers a really fundamental question about the relationship between behavior and brain activity ," he adds. "Understanding how these brain networks work together and impact behavior could lead to new therapies to help people organize their brain networks in the most efficient way."

And while this simple task might not investigate complex behaviors, the study could act as a springboard to move into more complicated behaviors and focus states.

"Next, I would like to study sustained attention in a more naturalistic way," Seeburger says. "I hope that we can further the understanding of attention and help people get a better handle on their ability to control, sustain, and increase it."

The proposed framework’s effectiveness is underscored by its ability to recover constraints utilized in GDL, demonstrating its potential as a general-purpose framework for deep learning. GDL, which uses a group-theoretic perspective to describe neural layers, has shown promise across various applications by preserving symmetries. However, it encounters limitations when faced with complex data structures. The category theory-based approach overcomes these limitations and provides a structured methodology for implementing diverse neural network architectures.
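The symmetry-preserving constraints that GDL formalizes can be made concrete with a small, hedged sketch. The PyTorch snippet below is an ordinary permutation-equivariant "DeepSets-style" layer, not the paper's categorical construction: permuting the elements of an input set before the layer yields the same result as permuting the layer's output, which is exactly the kind of constraint a group-theoretic specification, and a categorical one more generally, pins down.

```python
import torch
import torch.nn as nn

class PermEquivariantLinear(nn.Module):
    """DeepSets-style layer: commutes with permutations of the set elements."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.elementwise = nn.Linear(d_in, d_out)          # applied to each element
        self.pooled = nn.Linear(d_in, d_out, bias=False)   # applied to the set mean

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, set_size, d_in)
        return self.elementwise(x) + self.pooled(x.mean(dim=1, keepdim=True))

layer = PermEquivariantLinear(8, 16)
x = torch.randn(2, 5, 8)
perm = torch.randperm(5)
# Permuting the inputs and then applying the layer equals applying the layer and then permuting:
same = torch.allclose(layer(x[:, perm]), layer(x)[:, perm], atol=1e-6)
print(same)  # True -- the symmetry constraint this layer is built to satisfy
```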

At the centre of this research is the application of category theory to understanding and creating neural network architectures. This approach enables the creation of neural networks that are more closely aligned with the structures of the data they process, enhancing both the efficiency and effectiveness of these models. The research highlights the universality and flexibility of category theory as a tool for neural network design, offering new insights into the integration of constraints and operations within neural network models.

In conclusion, this research introduces a groundbreaking framework based on category theory for designing neural network architectures. By bridging the gap between the specification of constraints and their implementations, the framework offers a comprehensive approach to neural network design. The application of category theory not only recovers and extends the constraints used in frameworks like GDL but also opens up new avenues for developing sophisticated neural network architectures. 


Connecting lab-grown brain cells provides insight into how our own brains work

The idea of growing functioning human brain-like tissue in a dish has always sounded pretty far-fetched, even to researchers in the field. Working toward that goal, a Japanese and French research team has developed a technique for connecting lab-grown brain-mimicking tissues in a way that resembles circuits in our brain.

It is challenging to study the exact mechanisms of brain development and function. Animal studies are limited by differences between species in brain structure and function, and brain cells grown in the lab tend to lack the characteristic connections of cells in the human brain. What's more, researchers are increasingly realizing that these interregional connections, and the circuits that they create, are important for many of the brain functions that define us as humans.

Previous studies have tried to create brain circuits under laboratory conditions, advancing the field. Researchers from The University of Tokyo have recently found a way to create more physiological connections between lab-grown "neural organoids," experimental model tissues in which human stem cells are grown into three-dimensional, developing brain-mimicking structures. The team did this by linking the organoids via axonal bundles, similar to how regions are connected in the living human brain.

"In single-neural organoids grown under laboratory conditions, the cells start to display relatively simple electrical activity," says co-lead author of the study Tomoya Duenki. "when we connected two neural organoids with axonal bundles, we were able to see how these bidirectional connections contributed to generating and synchronizing activity patterns between the organoids, showing some similarity to connections between two regions within the brain."

The cerebral organoids that were connected with axonal bundles showed more complex activity than single organoids or those connected using previous techniques. In addition, when the research team stimulated the axonal bundles using a technique known as optogenetics, the organoid activity was altered accordingly and the organoids were affected by these changes for some time, in a process known as plasticity.

"These findings suggest that axonal bundle connections are important for developing complex networks," explains Yoshiho Ikeuchi, senior author of the study. "Notably, complex brain networks are responsible for many profound functions, such as language, attention, and emotion."

Given that alterations in brain networks have been associated with various neurological and psychiatric conditions, a better understanding of brain networks is important. The ability to study lab-grown human neural circuits will improve our knowledge of how these networks form and change over time in different situations, and may lead to improved treatments for these conditions.


Story Source:

Materials provided by the Institute of Industrial Science, The University of Tokyo. Note: Content may be edited for style and length.

Journal Reference :

  • Tatsuya Osaki, Tomoya Duenki, Siu Yu A. Chow, Yasuhiro Ikegami, Romain Beaubois, Timothée Levi, Nao Nakagawa-Tamagawa, Yoji Hirano, Yoshiho Ikeuchi. Complex activity and short-term plasticity of human cerebral organoids reciprocally connected with axons. Nature Communications, 2024; 15(1). DOI: 10.1038/s41467-024-46787-7


International Conference on Intelligent Information Technologies for Industry

IITI 2023: Proceedings of the Seventh International Scientific Conference “Intelligent Information Technologies for Industry” (IITI’23), pp. 396–409

Research on Neural Network Defense Problem Based on Random Noise Injection

  • Juan Kang,
  • Enzhe Zhao,
  • Zhichang Guo,
  • Shibo Wang,
  • Weijia Su &
  • Xing Zhang
  • Conference paper
  • First Online: 21 September 2023


Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 776)

Because a neural network is a data-driven black-box model, people cannot directly understand its decision basis, and once adversarial samples are crafted against a neural network, they can lead it to wrong conclusions with high confidence. Therefore, many researchers focus on the robustness of neural networks. This paper mainly studies neural network defense based on random noise injection. In theory, injecting exponential-family noise into any layer of a neural network can ensure its robustness, but experiments show that the resistance to perturbations varies greatly with the noise distribution. We investigate the robustness of neural networks under injection of exponential and Gaussian noise, and give the upper bound of the Rényi divergence under these two types of noise. Experimentally, we use the CIFAR-10 dataset to evaluate a variety of neural network structures. We find that random noise injection can effectively reduce the effect of adversarial-sample attacks and make the neural network more robust. However, when the noise is too high, the classification accuracy of the neural network itself declines. This paper proposes adding Gaussian noise with small variance to the image subject and Gaussian noise with large variance to the background, so as to achieve a better defense effect.
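As a concrete illustration of the defense described above, here is a minimal PyTorch sketch of noise injection inserted into an ordinary classifier. The layer sizes, noise variances, and the masked subject/background variant are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GaussianNoiseInjection(nn.Module):
    """Adds zero-mean Gaussian noise to intermediate activations as a
    randomized defense against adversarial perturbations."""
    def __init__(self, sigma: float = 0.1):
        super().__init__()
        self.sigma = sigma

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sigma * torch.randn_like(x)

def subject_background_noise(image, subject_mask, sigma_subject=0.05, sigma_background=0.25):
    """Illustrative version of the paper's proposal: small-variance noise on the
    image subject, larger-variance noise on the background (mask assumed given)."""
    sigma = torch.where(subject_mask.bool(),
                        torch.full_like(image, sigma_subject),
                        torch.full_like(image, sigma_background))
    return image + sigma * torch.randn_like(image)

# A toy CIFAR-10-sized classifier with a noise layer between convolutional blocks.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    GaussianNoiseInjection(sigma=0.1),          # inject noise into an intermediate layer
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
)
x = torch.randn(4, 3, 32, 32)                   # a batch of CIFAR-10-sized images
print(model(x).shape)                            # torch.Size([4, 10])
```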

  • Neural networks
  • Adversarial attacks
  • Network explanation
  • Exponential family noise
  • Network defense

Supported by Saint Petersburg State University, project ID: 94062114.



Author information

Authors and Affiliations

Harbin Institute of Technology, Harbin, China

Juan Kang, Enzhe Zhao, Zhichang Guo, Shibo Wang, Weijia Su & Xing Zhang

Researcher, Research Centre of Artificial Intelligence and Data Science, St. Petersburg State University, Saint Petersburg, Russia

Zhichang Guo


Corresponding author

Correspondence to Enzhe Zhao.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kang, J., Zhao, E., Guo, Z., Wang, S., Su, W., Zhang, X. (2023). Research on Neural Network Defense Problem Based on Random Noise Injection. In: Kovalev, S., Kotenko, I., Sukhanov, A. (eds) Proceedings of the Seventh International Scientific Conference “Intelligent Information Technologies for Industry” (IITI’23). IITI 2023. Lecture Notes in Networks and Systems, vol 776. Springer, Cham. https://doi.org/10.1007/978-3-031-43789-2_37


DOI: https://doi.org/10.1007/978-3-031-43789-2_37

Published: 21 September 2023

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-43788-5

Online ISBN: 978-3-031-43789-2

eBook Packages: Intelligent Technologies and Robotics (R0)

EDITORIAL article

Editorial: Theoretical Advances and Practical Applications of Spiking Neural Networks

Gaetano Di Caterina

  • 1 University of Strathclyde, Glasgow, United Kingdom
  • 2 Ohio University, Athens, Ohio, United States
  • 3 University of Electronic Science and Technology of China, Chengdu, China

The final, formatted version of the article will be published soon.


Keywords: Spiking neural networks (SNN), neuromorphic engineering (NE), event-based sensing, neural networks, Artificial Intelligence

Received: 25 Mar 2024; Accepted: 29 Mar 2024.

Copyright: © 2024 Di Caterina, Liu and Zhang. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

* Correspondence: Gaetano Di Caterina, University of Strathclyde, Glasgow, United Kingdom

Disclaimer: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article or claim that may be made by its manufacturer is not guaranteed or endorsed by the publisher.
