An overview of drug discovery and development

Affiliation.

  • 1 Department of biomedical Science, Nazarbayev University School of Medicine, Nur-Sultan 010000, Kazakhstan.
  • PMID: 32270704
  • DOI: 10.4155/fmc-2019-0307

A new medicine will take an average of 10-15 years and more than US$2 billion before it can reach the pharmacy shelf. Traditionally, drug discovery relied on natural products as the main source of new drug entities, but it later shifted toward high-throughput synthesis and combinatorial chemistry-based development. New technologies such as ultra-high-throughput drug screening and artificial intelligence are being heavily employed to reduce the cost and the time of early drug discovery, but both remain relatively unchanged. However, are there other potentially faster and cheaper means of drug discovery? Is drug repurposing a viable alternative? In this review, we discuss the different means of drug discovery including their advantages and disadvantages.

Keywords: drug repurposing; high throughput; natural sources; small molecule.

Publication types

  • Artificial Intelligence
  • Drug Development*
  • Drug Evaluation, Preclinical

Deep learning in drug discovery: an integrative review and future challenges

  • Open access
  • Published: 17 November 2022
  • Volume 56 , pages 5975–6037, ( 2023 )

  • Heba Askr 1 ,
  • Enas Elgeldawi 2 ,
  • Heba Aboul Ella 4 ,
  • Yaseen A. M. M. Elshaier 5 ,
  • Mamdouh M. Gomaa 2 &
  • Aboul Ella Hassanien 3  

Recently, using artificial intelligence (AI) in drug discovery has received much attention since it significantly shortens the time and cost of developing new drugs. Deep learning (DL)-based approaches are increasingly being used in all stages of drug development as DL technology advances and drug-related data grows. Therefore, this paper presents a systematic literature review (SLR) that integrates the recent DL technologies and applications in drug discovery, including drug–target interactions (DTIs), drug–drug similarity interactions (DDIs), drug sensitivity and responsiveness, and drug-side effect predictions. We present a review of more than 300 articles published between 2000 and 2022. The benchmark data sets, the databases, and the evaluation measures are also presented. In addition, this paper provides an overview of how explainable AI (XAI) supports drug discovery problems. Drug dosing optimization and success stories are discussed as well. Finally, digital twinning (DT) and open issues are suggested as future research challenges for drug discovery problems. Challenges to be addressed and future research directions are identified, and an extensive bibliography is also included.

1 Introduction

Drug discovery is the study of how various drugs interact with the body and how a medication needs to act on the body to have a therapeutic effect. Drug discovery strategies comprise different approaches, such as physiology-based and target-based approaches, and rely on information about the ligand and the target. In this regard, our attention is directed to certain topics, especially drug (ligand)–target interactions, drug sensitivity and response, drug–drug interaction, and drug–drug similarity. For certain diseases such as cancer, or in pandemic situations such as COVID-19, a combination of more than one drug is required to improve the prognosis and counter the pathogenesis. Despite all the recent advances in pharmaceuticals, medication development is still a labor-intensive and costly process. As a result, several computational algorithms have been proposed to speed up the drug discovery process (Betsabeh and Mansoor 2021).

As DL models progress and drug data sets grow, a slew of new DL-based approaches is appearing at every stage of the drug development process (Kim et al. 2021). In addition, large pharmaceutical corporations have migrated toward AI in the wake of the development of DL approaches, abandoning outmoded, ineffective procedures to increase patient benefit while also increasing their own profit (Nag et al. 2022). Despite the impressive performance of DL, drug discovery remains a critical and challenging task, and there is room for researchers to develop algorithms that improve drug discovery performance. Therefore, this paper presents an SLR that integrates the recent DL technologies and applications in drug discovery. This review is the first to incorporate the recent DL models and applications for the different categories of drug discovery problems, such as DTIs, DDIs similarity, drug sensitivity and response, and drug-side effect predictions, and to present new challenging topics such as XAI and DT and how they help advance drug discovery. In addition, the paper provides researchers with the most frequently used datasets in the field.

The paper is developed based on six building blocks as shown in Fig.  1 . More than 300 articles are presented in this paper, and they are divided across these building blocks. The papers are selected using the following criteria:

Papers published between 2000 and 2022.

Papers published by IEEE, ACM, Elsevier, and Springer, which were given higher priority.

figure 1

The main building blocks of the paper

The following analytical questions are discussed and answered in the paper:

AQ1: What DL algorithms have been used to predict the different categories of drug discovery problems?

AQ2: Which deep learning methods are mostly used in drug dosing optimization?

AQ3: Are there any success stories about drug discovery and DL?

AQ4: What about the newest technologies such as XAI and DT in drug discovery?

AQ5: What are the future and open works related to drug discovery and DL?

The remainder of this review paper is organized as follows: Sect. 2 presents a review of related studies; Sect. 3 gives an overview of the various DL techniques. Section 4 presents the organization of DL applications in drug discovery problems by explaining each drug discovery problem category and gives a literature review of the DL techniques used. Section 5 discusses the numerous benchmark data sets and databases that have been employed in the drug development process. Section 6 presents the evaluation metrics used for each drug discovery problem category. Drug dose optimization, success stories, and XAI are introduced in Sects. 7, 8, and 9. DT and open problems are suggested as future research challenges in Sects. 10 and 11. Section 12 presents a discussion of the analytical questions. Finally, Sect. 13 concludes the paper.

2 Review of related studies

Although drug discovery is a large field with different research categories, there are only a few review studies about it, and each related study has focused on a single research category, such as reviewing the DL applications for DTIs. This section reviews these related studies; a summary is presented in Table 1.

Kim et al. (2021) presented a survey of DL models for the prediction of drug–target interaction (DTI) and new medication development. They start by providing a thorough summary of the many representations of drugs and proteins, DL applications, and widely used benchmark data sets for training and testing models. A strength of this study is that it identifies several obstacles to the future of de novo drug design and DL-based DTI prediction. However, its major drawback is that it does not consider the latest technologies in DL applications for DTIs, such as XAI and DTs.

Rifaioglu et al. (2019) presented the recent ML applications in virtual screening (VS), together with the techniques, tools, databases, and resources used to create the models. They outline what VS is and how crucial it is to the process of finding new drugs. As strengths, the study highlights the DL technologies that are available as open access programming libraries, tool kits, and frameworks that can be employed in future computational drug discovery (including DTI prediction), and it provides instances of VS investigations that resulted in the discovery of novel bioactive chemicals and medications. However, it does not consider drug dose optimization in its literature review.

Sachdev and Gupta (2019) presented the various feature-based chemogenomic methods for DTI prediction. They offer a thorough and current review of the different feature-based methodologies, and describe the relevant datasets, the methods for determining drug or target properties, and the evaluation measures. Although the study is considered the first integrated review to concentrate solely on feature-based DTI techniques, it does not consider the latest technologies in DL applications for DTIs, such as XAI and DTs.

3 Deep learning (DL) techniques

Spam detection, video recommendation, image classification, and multimedia concept retrieval are just a few of the applications where machine learning (ML) has lately gained favor in research. Deep learning (DL) is one of the most extensively utilized ML methods in these applications. The ongoing appearance of new DL studies is due to the unpredictable growth in data acquisition and the incredible progress made in hardware technologies. DL is based on conventional neural networks but outperforms them significantly. Furthermore, DL uses transformations and graph technology to build multi-layer learning models (Kim et al. 2021). ML and DL have changed the way problems are tackled: DL models come in various shapes and sizes and can effectively resolve problems that are too complex for standard approaches. This section reviews the various DL models (Sarker 2021).

3.1 Classic neural networks

As shown in Fig. 2, fully connected neural networks are commonly identified with the multi-layer perceptron (MLP). The model is adapted to simple binary data inputs (Mukhamediev et al. 2021). This paradigm allows both linear and nonlinear functions to be included. The linear function is a single line that multiplies its inputs by a constant multiplier. The sigmoid curve, the hyperbolic tangent, and the rectified linear unit (ReLU) are three common nonlinear functions. This model is best suited to classification and regression problems with real-valued data, and more generally to cases where a flexible model is needed.

figure 2

Multilayer Perceptron or ANN
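The three nonlinear functions named above, and the way a fully connected layer combines a linear step with one of them, can be illustrated with a minimal sketch. The network size, weights, and function names below are hypothetical and chosen only for illustration; they are not taken from the reviewed papers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def relu(x):
    return max(0.0, x)

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b), computed row by row."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

# Hypothetical 2-input, 2-hidden-unit, 1-output network with hand-picked weights.
hidden = dense([1.0, 0.0], [[0.5, -0.5], [-1.0, 1.0]], [0.0, 0.0], relu)
output = dense(hidden, [[1.0, 1.0]], [0.0], sigmoid)
```

In a trained MLP the weights and biases would be learned by backpropagation rather than fixed by hand; the sketch only shows the forward pass.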

3.2 Convolutional neural networks (CNN)

As shown in Fig. 3, the classic convolutional neural network (CNN) model is an advanced, high-potential variant of the ANN that was developed to manage escalating levels of complexity, as well as data preprocessing and compilation. It is based on how the neurons in an animal's visual cortex are arranged (Amashita et al. 2018). CNNs are among the most flexible algorithms for processing both image and non-image data. A CNN is organized in four parts:

An input layer, often a 2D array of neurons, for analyzing basic visual data such as picture pixels.

A one-dimensional output layer of neurons, which in some CNNs processes the images on the inputs through the distributed convolutional layers connected to it.

A third layer, called the sampling layer, which restricts the number of neurons taking part in the corresponding network layers.

One or more connected layers that join the sampling layer to the output layer.

figure 3

Convolutional Neural Networks (CNN)

This network concept can potentially aid in extracting relevant visual data in pieces or smaller units. In the CNN, the neurons are responsible for the group of neurons from the preceding layer.

After the input data has been included into the convolutional model, the CNN is constructed in four steps:

Convolution: Feature maps are produced from the supplied data, and a function is then applied to these maps.

Max-Pooling: It helps the CNN detect an image even when the input has been modified.

Flattening: The data is flattened in this stage so that the CNN can analyze it.

Full Connection: It is sometimes referred to as a "hidden layer", which creates the loss function for the model.
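The four steps listed above can be traced on a toy input. The following is a minimal sketch in plain Python, assuming a single hand-picked 2×2 kernel, a ReLU as the function applied to the feature map, and a single output unit; none of these choices come from the source, and a real CNN would learn its kernels and weights.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation (the 'convolution' used in CNNs), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b] for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def relu_map(fmap):
    """Function applied to the feature map after convolution (ReLU assumed here)."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max-pooling; assumes both dimensions are divisible by size."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]), size)
        ]
        for i in range(0, len(fmap), size)
    ]

def flatten(fmap):
    return [v for row in fmap for v in row]

def fully_connected(vector, weights, bias):
    return sum(w * x for w, x in zip(weights, vector)) + bias

# Toy 5x5 "image" containing a vertical bright stripe.
image = [
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
kernel = [[-1, 1], [-1, 1]]  # hypothetical kernel: responds to dark-to-bright edges

features = relu_map(conv2d(image, kernel))   # step 1: convolution + function
pooled = max_pool(features)                  # step 2: max-pooling
vector = flatten(pooled)                     # step 3: flattening
score = fully_connected(vector, [1.0] * len(vector), 0.0)  # step 4: full connection
```

Because max-pooling keeps only the strongest response in each 2×2 window, shifting the stripe by one pixel inside a window leaves the pooled map unchanged, which is the tolerance to input modifications that the Max-Pooling step refers to.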

Image recognition, image analysis, image segmentation, video analysis, and natural language processing (NLP) (Chauhan et al. 2018 ; Tajbakhsh et al. May 2016 ; Mohamed et al. 2020 ; Zhang et al. 2018 ) are among the tasks that CNNs are capable of.

3.3 Recurrent neural networks (RNNs)

RNNs were first created to help with sequence prediction. These networks rely solely on data streams of varying lengths as inputs. To make its most recent prediction, the RNN uses the knowledge of its previous state as an input value. As a result, it gives the network a form of short-term memory (Tehseen et al. 2019). The Long Short-Term Memory (LSTM) method shown in Fig. 4, for example, is renowned for its adaptability.

figure 4

LSTM Network

Two forms of RNN designs aid in the study of problems: LSTMs, which are useful for predicting data in time sequences using memory and have three gates (Input, Output, and Forget), and gated RNNs, which are likewise helpful for temporal sequence prediction using memory. Both types of algorithms can be used to address a range of issues, including image classification (Chandra and Sharma 2017), sentiment analysis (Failed 2018), video classification (Abramovich et al. 2018), language translation (Hermanto et al. 2015), and more.
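How the three gates interact with the cell's memory can be seen in a single time step. The sketch below uses the standard LSTM update equations on scalar inputs to keep it short; the parameter layout (a dictionary mapping each gate to an input weight, a recurrent weight, and a bias) and the numeric values are assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell. p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much of the candidate to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate cell content
    c = f * c_prev + i * g            # updated memory
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Hypothetical shared parameters for all gates, run over a short sequence.
params = {name: (0.5, 0.5, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, params)
```

The hidden state `h` and cell state `c` carried from one iteration to the next are what give the network its memory of earlier elements in the sequence.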

3.4 Generative adversarial networks: GAN

As shown in Fig. 5, a GAN combines two DL neural networks, a Generator and a Discriminator. The Generator network creates fake data, while the Discriminator helps to distinguish between real and fake data (Alankrita et al. 2021).

figure 5

GAN: Generative Adversarial Networks

The two networks compete with one another: the Discriminator keeps distinguishing between real and fake data, and the Generator keeps making fake data look more like real data. If a library of pictures is needed, for example, the Generator network generates simulated data resembling the authentic photos, which involves building a deconvolution neural network. An image detector network is then used to discriminate between fake and real images. This competition eventually improves the network's performance. GANs can be employed for creating images and text, enhancing images, and discovering new drugs.
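The competition between the two networks is usually expressed through a pair of opposing loss functions. The sketch below assumes the common binary cross-entropy formulation with a non-saturating Generator loss, which the text does not specify; the Discriminator outputs are hypothetical probabilities that a sample is real.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Binary cross-entropy: real samples labelled 1, generated samples labelled 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def generator_loss(d_fake):
    """Non-saturating Generator loss: low when fakes are scored as real."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

# Hypothetical Discriminator outputs on two real and two generated samples.
d_real = [0.9, 0.8]
d_fake = [0.2, 0.1]
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake)
```

Training alternates between lowering `d_loss` by updating the Discriminator and lowering `g_loss` by updating the Generator, so an improvement for one network raises the loss of the other.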

3.5 Self-organizing maps (SOM)

As shown in Fig. 6, Self-Organizing Maps (SOMs) operate by leveraging unsupervised data to decrease the number of random variables in a model (Kohonen 1990). Because every synapse is linked to both its input and output nodes, the output dimension in this DL approach is fixed as a two-dimensional model. Each data point competes for its representation in the model, and the weights of the closest nodes, the Best Matching Units (BMUs), are adjusted. How much the weights change depends on how close a node is to the BMU. Because weights are themselves a node attribute, the value represents the node's position in the network. SOMs are well suited to evaluating datasets that do not have a Y-axis value and to project explorations of such datasets.

figure 6

Self-Organizing Maps (SOM)
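The BMU search and the distance-dependent weight adjustment described above can be sketched for a single training sample. The example assumes a 2×2 grid, a Gaussian neighbourhood function, and fixed values for the learning rate and radius; these are illustrative choices, and in practice both quantities usually decay over training.

```python
import math

def best_matching_unit(weights, sample):
    """Return the (row, col) of the node whose weight vector is closest to sample."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((wi - si) ** 2 for wi, si in zip(w, sample))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best

def som_update(weights, sample, learning_rate=0.5, radius=1.0):
    """Pull every node toward sample, scaled by its grid distance to the BMU."""
    br, bc = best_matching_unit(weights, sample)
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            grid_dist_sq = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist_sq / (2.0 * radius ** 2))
            row[c] = [wi + learning_rate * influence * (si - wi) for wi, si in zip(w, sample)]
    return (br, bc)

# Hypothetical 2x2 map with 2-dimensional weight vectors.
grid = [
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
]
bmu = som_update(grid, [0.9, 0.9])
```

The BMU itself moves the most, and nodes farther away on the grid move progressively less, which is how neighbouring nodes come to represent similar inputs.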

3.6 Boltzmann machines

As shown in Fig.  7 , the nodes are connected in a circular pattern because there is no set orientation in this network model. This deep learning technique is utilized to generate model parameters because of its uniqueness. The Boltzmann Machines model is stochastic, unlike all preceding deterministic network models. It can monitor systems, create a binary recommendation platform, and analyze specific datasets (Hinton 2011 ).

figure 7

Boltzmann Machines

The architecture of the Boltzmann Machine is a two-layer neural network. The visible or input layer is the first, while the hidden layer is the second. Both are made up of several neuron-like nodes that carry out computations. These nodes are connected across the two layers but are not linked to other nodes in the same layer. As a result, there is no connectivity within a layer, which is one of the Boltzmann machine's restrictions. When data is supplied to these nodes, it is transformed into a graph; the nodes process it and learn all the parameters, patterns, and relations in the data before deciding whether to transmit it. For this reason, a Boltzmann Machine is often described as an unsupervised DL model.
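The two-layer structure with connections only between the visible and hidden layers corresponds to what is usually called a restricted Boltzmann machine. The following minimal sketch shows one stochastic visible-to-hidden-to-visible pass under that assumption; the weight matrix, biases, and function names are hypothetical, and no learning rule is included.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hidden_probs(visible, weights, hidden_bias):
    """P(h_j = 1 | v): each hidden unit depends only on the visible layer."""
    return [
        sigmoid(sum(v * weights[i][j] for i, v in enumerate(visible)) + hidden_bias[j])
        for j in range(len(hidden_bias))
    ]

def visible_probs(hidden, weights, visible_bias):
    """P(v_i = 1 | h): each visible unit depends only on the hidden layer."""
    return [
        sigmoid(sum(h * weights[i][j] for j, h in enumerate(hidden)) + visible_bias[i])
        for i in range(len(visible_bias))
    ]

def sample(probs, rng):
    """Draw binary unit states from their activation probabilities."""
    return [1 if rng.random() < p else 0 for p in probs]

def gibbs_step(visible, weights, visible_bias, hidden_bias, rng):
    """One stochastic visible -> hidden -> visible pass."""
    hidden = sample(hidden_probs(visible, weights, hidden_bias), rng)
    return sample(visible_probs(hidden, weights, visible_bias), rng)

rng = random.Random(0)
W = [[2.0, -2.0], [2.0, -2.0], [-2.0, 2.0]]  # 3 visible units x 2 hidden units
reconstruction = gibbs_step([1, 1, 0], W, [0.0] * 3, [0.0] * 2, rng)
```

Because unit states are sampled rather than computed deterministically, repeated calls with a different random seed can give different reconstructions, which reflects the stochastic nature of the model.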

3.7 Autoencoders

As shown in Fig. 8, the autoencoder, one of the most popular deep learning algorithms, operates automatically on its inputs: it encodes them, applies an activation function, and decodes the result at the end. Because of this bottleneck, fewer categories of data are produced, and the built-in data structures are used to their fullest extent (Zhai et al. 2018).

figure 8

Autoencoders

There are various types of autoencoders:

Sparse: When the hidden layer is larger than the input layer, a generalization technique is used to reduce overfitting. It constrains the loss function and prevents the autoencoder from using all its nodes simultaneously.

Denoising: A randomly chosen subset of the inputs is set to 0.

Contractive: When the hidden layer is larger than the input layer, a penalty factor is added to the loss function to avoid overfitting and data duplication.

Stacked: When another hidden layer is added to an autoencoder, it results in two stages of encoding and an initial stage of decoding.

Feature identification, building a strong recommendation model, and adding features to enormous datasets are some of the problems autoencoders can address.
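The encode, bottleneck, decode flow and the denoising variant can be sketched with a tiny linear autoencoder. The weights below are hand-picked so that the example is easy to follow, and the activation function is omitted for brevity; both are simplifying assumptions rather than a description of any model in the reviewed literature.

```python
import random

def corrupt(inputs, drop_prob, rng):
    """Denoising corruption: set each input to 0 with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else x for x in inputs]

def encode(inputs, enc_weights):
    """Project the inputs onto a smaller hidden (bottleneck) layer."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in enc_weights]

def decode(code, dec_weights):
    """Map the bottleneck code back to the input dimension."""
    return [sum(w * h for w, h in zip(row, code)) for row in dec_weights]

def reconstruction_loss(original, reconstructed):
    """Mean squared error between the clean input and the reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

# Hypothetical 4 -> 2 -> 4 autoencoder: the encoder averages pairs, the decoder copies.
enc = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
dec = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

x = [1.0, 1.0, 3.0, 3.0]
clean_loss = reconstruction_loss(x, decode(encode(x, enc), dec))

rng = random.Random(0)
noisy = corrupt(x, 0.5, rng)
noisy_loss = reconstruction_loss(x, decode(encode(noisy, enc), dec))
```

In a denoising autoencoder the loss is measured against the clean input `x`, not the corrupted one, so training pushes the network to recover the values that were zeroed out.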

4 Organization of DL applications in drug discovery problems

The development of safe and effective treatments for humans is the primary goal of drug discovery (Kim et al. 2021). Drug discovery is the problem of finding suitable drugs to treat a disease (i.e., a target protein), which relies on several interactions. This paper divides drug discovery problems into four main categories, as presented in Fig. 9: drug–target interactions, drug–drug similarity, drug combination side effects, and drug sensitivity and response predictions. The following subsections provide a literature review of DL for these problems, and some of the investigated articles related to each category are summarized in Table 2.

figure 9

Drug discovery problem categories

4.1 Drug–target interactions prediction using DL

Drug repurposing attempts to uncover new uses for drugs that are already approved and on the market. It has attracted much attention since it takes less time, costs less money, and has a greater success rate than traditional de novo drug development (Thafar et al. 2022). The discovery of drug–target interactions is the initial step in creating new medications, as well as one of the most crucial aspects of drug screening and drug-guided synthesis (Wang et al. 2020a). Exploring the link between possible drugs and targets can help researchers better understand the pathophysiology of targets at the drug level, which in turn can help with early disease detection, treatment prognosis, and drug design. This task is known as drug–target interaction (DTI) prediction (Lian et al. 2021). The success of drug repositioning relies largely on DTI prediction because it reduces the number of potential drug candidates for specific targets. Traditional computational methods follow two basic strategies: approaches based on molecular docking and approaches based on drugs. When the 3D structures of target proteins are not available, the effectiveness of molecular docking is limited. When there are only a few known binding molecules for a target, drug-based techniques typically produce subpar prediction results. DL technologies overcome the restrictions imposed by the high-dimensional structure of drugs and target proteins by using structure-free approaches that do not need 3D structural data or docking for DTI prediction. Therefore, this section provides a recent comprehensive review of DL-based DTI prediction models (Chen et al. 2012).

As shown in Fig. 10, there are known interactions (solid lines) and unknown interactions (dashed lines) between diseases (proteins) and drugs. DTI prediction forecasts the unknown interactions, that is, which diseases (or target proteins) a new drug might treat. According to their input features, we divide the latest DL models used to predict DTIs into three categories: drug-based models, structure (graph)-based models, and drug-protein(disease)-based models.

figure 10

DL models used for predicting the DTIs are grouped into three categories: a drug-based models, b structure (graph)-based models, and c drug-protein(disease)-based models

4.1.1 Drug-based models

Figure 10a shows drug-based models, which assume that a potential drug will be similar to the known drugs for the target proteins. They calculate the DTI using the drug information known for the target. These models use similarity search strategies, which postulate that structurally similar substances have similar biological functions (Thafar et al. 2019; Matsuzaka and Uesawa 2019). Such methods have been used for decades to select compounds from vast compound libraries, either through massive computing jobs or through manual calculations. Deep neural network models gradually narrow the gap between in silico prediction and empirical study, and DL technology can shorten these time-consuming procedures and manual operations.
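A similarity search of the kind described above can be sketched as ranking library compounds by their similarity to a known active drug. The text does not name a specific similarity measure; the Tanimoto coefficient on binary fingerprints is assumed here because it is a common choice. The fingerprints are hypothetical sets of "on" bit positions, not computed from real molecules.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) coefficient between two sets of 'on' fingerprint bits."""
    if not fp_a and not fp_b:
        return 1.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)

def rank_by_similarity(query, library):
    """Sort library compounds from most to least similar to the query."""
    return sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)

# Hypothetical fingerprints: a known active drug and three library candidates.
known_active = {1, 4, 7, 9, 12}
library = {
    "candidate_a": {1, 4, 7, 9, 13},
    "candidate_b": {2, 3, 5},
    "candidate_c": {1, 4, 8},
}
ranking = rank_by_similarity(known_active, library)
```

Under the similarity assumption, the top-ranked candidates would be prioritized for testing against the same target as the known active drug.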

Thanks to benchmark packages like MoleculeNet (Wu et al. 2018) and DeepChem, researchers can now use deep neural networks to analyze drugs and predict drug-related properties, including bioactivities and physicochemical properties. As a result, basic neural networks like MLPs and CNNs have been used in numerous drug-based DL approaches (Zeng et al. 2020; Yang et al. 2019; Liu et al. 2017). ADMET investigations often focused on the representation power of molecular descriptors rather than on the model itself (Zhai et al. 2018; Liu et al. 2017; Kim et al. 2016; Tang et al. 2014). Hirohara et al. trained a CNN model on SMILES strings and then used the learned features to discover motifs with significant structures, such as protein-binding sites or unidentified functional groups (Hirohara et al. 2018). Wenzel et al. (2019) employed atom pairs and pharmacophoric donor–acceptor pairs as descriptors in multi-task deep neural networks to predict microsomal metabolic liability. Gao et al. (2019) compared six kinds of 2D fingerprints for predicting protein–drug affinity using ML methods such as RF, single-task DNN, and multi-task DNN models. Matsuzaka and Uesawa (2019) used 2D images of 3D chemical compounds to train a CNN model to predict constitutive androstane receptor agonists. They optimized performance using snapshots of a 3D ball-and-stick model taken at various angles or coordinates. The resulting method outperformed seven common 3D chemical structure-based predictions.

Since the development of the GCN, drug-related GCN models have built graph representations of molecules that incorporate details of the chemical structure by summing up the properties of adjacent atoms (Gilmer et al. 2017).
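The idea of summing the properties of adjacent atoms can be shown on a three-atom molecular graph. The sketch below performs one unweighted aggregation step followed by a sum readout; the one-hot atom features are hypothetical, and a real GCN would additionally apply learned weight matrices and a nonlinearity at each step.

```python
def aggregate_neighbors(features, bonds):
    """One message-passing step: each atom's new feature is its own plus its neighbours'."""
    updated = [list(f) for f in features]
    for a, b in bonds:
        for k in range(len(features[a])):
            updated[a][k] += features[b][k]
            updated[b][k] += features[a][k]
    return updated

def readout(features):
    """Sum atom features into one fixed-size molecule-level vector."""
    return [sum(col) for col in zip(*features)]

# Heavy atoms of ethanol (C-C-O) with hypothetical one-hot features [is_carbon, is_oxygen].
atoms = [[1, 0], [1, 0], [0, 1]]
bonds = [(0, 1), (1, 2)]
step1 = aggregate_neighbors(atoms, bonds)
molecule_vector = readout(step1)
```

After one step, the middle carbon's feature already reflects both of its neighbours, so atoms with the same element type but different chemical environments receive different representations.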

GCNs have been employed as 3D descriptors instead of SMILES strings in many studies, and these learned descriptors have been found to outperform standard descriptors in prediction tests and to be easier to interpret (Shin et al. 2019; Ozturk et al. 2018; Yu et al. 2019). Chemi-net employed GCN models to represent molecules and compared the performance of single-task and multi-task DNNs on its own QSAR datasets (Liu et al. 2019a). Yang et al. (2019) introduced a more advanced model, the directed message passing neural network (D-MPNN), which uses a directed message-passing paradigm. They tested their approach on 19 publicly available and 16 privately held datasets and found that in most cases the D-MPNN models outperformed the previous models. On two datasets, however, they underperformed, and they were not as robust as typical 3D descriptors when the sample was small or unbalanced. The D-MPNN model was then employed by another research group to correctly predict an antibiotic named halicin, which demonstrated bactericidal effects in mouse models (Stokes et al. 2020). This was the first instance of an antibiotic being found by using DL methods to explore a large-scale chemical space that current experimental methodologies cannot afford to cover. The application of attention-based graph neural networks is another interesting contemporary method (Sun et al. 2020a). Because a molecule's graph representation can be altered by edge properties, edge weights and node features can be learned together. Accordingly, Shang et al. suggested a multi-relational GCN with edge attention (Shang et al. 2018). They built a dictionary of attention weights for each edge. Because it is shared across the whole molecule, the approach can handle a wide range of input sizes.

On the Tox21 and HIV benchmark datasets, they found that this model performed better than the random forest model. As a result, the model can effectively learn pre-aligned features from the inherent properties of the molecular graph. Withnall et al. (2020) extended the MPNN model to the attention MPNN (AMPNN), in which the message-passing step uses an attention mechanism for weighted summation. Moreover, they extended the D-MPNN model with the same attention mechanism as the AMPNN and termed it the edge memory neural network (EMNN). Although it is computationally more intensive than other models, this model fared better than the others on the uniformly absent information of the maximum unbiased validation (MUV) benchmark.
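The attention-weighted summation used in the message-passing step can be sketched independently of any particular model. The example below assumes that the attention scores are already given and normalizes them with a softmax; how AMPNN or EMNN actually compute these scores is not described in the text, so the scores and messages here are purely hypothetical.

```python
import math

def softmax(scores):
    """Turn raw attention scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_aggregate(messages, scores):
    """Weighted sum of neighbour messages, with weights given by softmax(scores)."""
    weights = softmax(scores)
    dim = len(messages[0])
    return [sum(w * m[k] for w, m in zip(weights, messages)) for k in range(dim)]

# Hypothetical messages from two neighbouring atoms and their attention scores.
messages = [[1.0, 0.0], [0.0, 1.0]]
equal = attention_aggregate(messages, [0.0, 0.0])     # both neighbours count equally
focused = attention_aggregate(messages, [10.0, 0.0])  # first neighbour dominates
```

With equal scores the result reduces to a plain average of the neighbour messages, whereas a large score difference lets one neighbour dominate the update.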

4.1.2 Structure (graph)-based models

Unlike the drug-based models, the structure-based models in Fig. 10b require both protein target and drug information. Typical molecular docking simulation methods aim to predict geometrically feasible binding between drugs and proteins with known tertiary structures. Both the drug and the target can be expressed as sequences, of atoms and of amino acid residues respectively. Sequence-based descriptors were selected because DL approaches can then be applied directly, with minimal pre-processing of the input data.

DeepDTA, proposed by Ozturk et al. (2018), outperformed moderate ML approaches such as KronRLS (Nascimento et al. 2016) and SimBoost (Tong et al. 2017) by using only sequence information, namely SMILES strings and amino acid sequences, in a CNN model. The Davis kinase binding affinity dataset (Davis et al. 2011) and the KIBA dataset (Sun et al. 2020a) were used in that study. Wen et al. used common and basic features, such as ECFPs and protein sequence composition descriptors, and trained on them using semi-supervised learning via a deep belief network (Wen et al. 2017). Another study, DeepConv-DTI, built a deep CNN model using only an RDKit Morgan fingerprint and protein sequences (Lee et al. 2019). It also used the pooled convolution results to capture local residue patterns of target protein sequences, resulting in high values for critical protein regions such as actual binding sites.

For structure-based regression models, the key measure is the scoring function, which ranks protein–drug interactions using 3D structures and parameterizes the training data in order to predict binding affinity values or binding pocket sites of the target proteins. AtomNet incorporated the 3D structural characteristics of protein–drug complexes into CNNs (Wallach et al. 2015). It placed 3D grids with fixed sizes (i.e., voxels) over protein–drug complexes, with every cell in the grid representing structural properties at that position. Since then, several researchers have investigated deep CNN models that use voxels to predict binding pocket location or binding affinity (Wang et al. 2020b; Ashburner et al. 2000; Zhao et al. 2019). In comparison to common docking approaches such as AutoDock Vina (Trott and Olson 2010) or Smina (Koes et al. 2013), these models have shown enhanced performance. This is because CNN models can be trained even with large input sizes and are relatively robust to noise in the input data.
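Placing a fixed-size 3D grid over a complex can be sketched as binning atom coordinates into voxels. The example below records only an atom count per cell, which is a simplification: AtomNet-style models store richer structural properties in each cell, and the coordinates, grid origin, and voxel size used here are hypothetical.

```python
def voxelize(atoms, origin, voxel_size, grid_dim):
    """Count atoms per cell of a cubic grid with grid_dim cells along each axis."""
    grid = [[[0] * grid_dim for _ in range(grid_dim)] for _ in range(grid_dim)]
    for x, y, z in atoms:
        idx = [int((c - o) // voxel_size) for c, o in zip((x, y, z), origin)]
        # Atoms that fall outside the grid are ignored.
        if all(0 <= i < grid_dim for i in idx):
            grid[idx[0]][idx[1]][idx[2]] += 1
    return grid

# Hypothetical atom coordinates (in angstroms) of a small protein-drug complex.
atoms = [(0.5, 0.5, 0.5), (0.7, 0.2, 0.9), (2.5, 1.5, 0.5), (9.0, 9.0, 9.0)]
grid = voxelize(atoms, origin=(0.0, 0.0, 0.0), voxel_size=1.0, grid_dim=3)
occupied = sum(v for plane in grid for row in plane for v in row)
```

The resulting 3D array has the same shape regardless of how many atoms the complex contains, which is what allows it to be fed to a 3D CNN in the same way that a 2D image is fed to an ordinary CNN.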

Many DTI investigations using GCNs in structure-based approaches have been reported (Feng et al. 2018; Liu et al. 2016). Feng et al. (2018) used both ECFPs and GCNs as drug features. On the Davis et al. (2011), Metz et al. (2011), and KIBA (Tang et al. 2014) benchmark datasets, their methods outperformed prior models such as KronRLS (Nascimento et al. 2016) and SimBoost (Tong et al. 2017). However, they acknowledged that their GCN model could not beat their ECFP model because of time and resource constraints in implementing the GCN. In a different DTI study, Torng et al. employed an unsupervised graph-based model to learn fixed-size representations of protein binding sites (Torng and Altman 2019). The pre-trained model was then used to train the protein pocket GCN, whereas the drug GCN model was trained on automatically generated features. They concluded that their model effectively captured protein–drug binding interactions without relying on target–drug complexes.

Because the models that implement the attention mechanism have key qualities that enable the model to be interpreted, attention-based DTI prediction approaches have evolved (Hirohara et al. 2018 ; Liu et al. 2016 ; Perozzi et al. 2014 ).

Gao et al. (2017) used LSTM RNNs with compressed vectors for protein sequences and a GCN for drug structures. They concentrated on demonstrating their method's capacity to deliver biological insights into DTI predictions. To do so, two-way attention mechanisms were employed to calculate the binding of drug–target pairs (DTPs), allowing for flexible interpretation of high-level information from target proteins, such as GO terms. Shin et al. (2019) introduced the Molecule Transformer DTI (MT-DTI) approach, which uses the self-attention mechanism for drug representations. Using parameters pre-trained on 97 million PubChem compounds, the MT-DTI model was fine-tuned and evaluated on two publicly available benchmark datasets, Davis (Davis et al. 2011) and KIBA (Tang et al. 2014). However, the attention mechanism was not used to represent the protein targets because computing it over the target sequence would take too long, and pre-training was not possible because of a lack of target information.

On the other hand, AttentionDTA, presented by Zhao et al., incorporates an attention mechanism into a CNN model to establish the weighted connections between drug and protein sequences (Zhao et al. 2019). They showed that these attention-based drug and protein representations perform well on the affinity prediction task with an MLP model. DeepDTIs used external, experimental DTPs to infer the probability of interaction for any given DTP. Four of the top ten predicted DTIs had previously been identified, and one was discovered to have a poor glucocorticoid receptor binding affinity (Huang et al. 2018). DeepCPI was also used to predict drug–target interactions, and its predicted small-molecule interactions with the glucagon-like peptide-1 receptor, the glucagon receptor, and the vasoactive intestinal peptide receptor were tested experimentally (Wan et al. 2019).

4.1.3 Drug–protein(disease)-based models

According to polypharmacology, most drugs act on multiple targets, both primary and secondary. These effects are influenced by the biological networks involved and by the drug's dose. As a result, the drug–protein(disease)-based models shown in Fig. 10c are particularly useful when evaluating protein promiscuity or drug selectivity (Cortes-Ciriano et al. 2015). Furthermore, multi-task neural networks are well suited to learning the properties of many types of data simultaneously (Camacho et al. 2018). Several DL model applications, such as those based on drug-induced gene-expression patterns and DTI-related heterogeneous networks, leverage relational information from distinct views. A network-based strategy employs heterogeneous networks that include various kinds of nodes and edges (Luo et al. 2017; David et al. 2019). The nodes in these networks exhibit local similarity, which is a significant aspect of these models. When a similarity network with drugs as its nodes and drug–drug similarity values as its edge weights is investigated, DTIs can be predicted from its connections and topological features. Machine learning techniques that use heterogeneous networks as prediction frameworks include support vector machines (Bleakley and Yamanishi 2009; Keum and Nam 2017), the regularized least squares (RLS) model (Liu et al. 2016; Xia et al. 2010; Hao et al. 2016), and the random walk with restart model (Lian et al. 2021; Nascimento et al. 2016). Owing to the growing interest in DL technologies, network-based DTI prediction studies have used DL to improve association prediction by evaluating the topological structures of drug and target networks that form bipartite and tripartite linked networks (drug, target, and disease networks) (Hassan-Harrirou et al. 2020; Lamb et al. 2006; Korkmaz 2020; Townshend et al. 2012; Vazquez et al. 2020). Zong et al. (2017) used the DeepWalk approach to capture local latent information, compute topology-based similarity in tripartite networks, and demonstrate the technique's promise for drug repurposing.
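As an illustration of the random walk with restart idea mentioned above, the following sketch propagates a score from a seed drug over a small weighted network. The four-node network, its edge weights, and the restart probability of 0.3 are assumptions chosen for the example and are not taken from the cited studies.

```python
def random_walk_with_restart(adj, seed, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities of a walker that, at each
    step, jumps back to the seed node with probability `restart`.

    adj is a symmetric weighted adjacency matrix given as a list of lists.
    """
    n = len(adj)
    # Column-normalise the adjacency matrix so each column sums to 1.
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]
    p0 = [0.0] * n
    p0[seed] = 1.0
    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p

# Hypothetical network: nodes 0 and 1 are drugs, nodes 2 and 3 are targets.
# Drug 0 is known to bind target 2, drug 1 binds target 3, and the two
# drugs have a similarity of 0.8.
adj = [
    [0.0, 0.8, 1.0, 0.0],
    [0.8, 0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
scores = random_walk_with_restart(adj, seed=0)
```

In this toy case target 3 receives a non-zero score for drug 0 only because of the drug–drug similarity edge, which is how similarity information can suggest candidate interactions that are not yet recorded.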

Some network-based DTI prediction studies used relationship-based features obtained by training an AE. Zhao et al. (2020) developed a DTI-CNN prediction model that used low-dimensional but rich deep features learned from a heterogeneous network with the stacked AE technique. To construct the topological similarity matrices of drugs and targets, Wang et al. used a deep AE together with positive pointwise mutual information (Wang et al. 2020b). In another study, Peng et al. (2020) employed a denoising autoencoder to select network-based features and reduce the dimensionality of the representation.

By training the autoencoder to remove noise, the denoising autoencoder copes with high-dimensional input data that is noisy and incomplete, allowing the encoder to learn more robust representations. These approaches, however, have a drawback: it is difficult to make predictions for new drugs or targets, a difficulty known in recommendation systems as the "cold start" problem (Bedi et al. 2015). The size and shape of the network also have a large impact on these models, so if the network is not big enough, they cannot cover drugs or targets that are absent from it (Lamb et al. 2006).

Various studies have also used gene expression patterns as chemogenomic features to predict DTIs. These studies assume that drugs with similar expression patterns have similar effects on the same targets (Hizukuri et al. 2015; Sawada et al. 2018).

The LINCS-L1000 database, the revised version of CMAP, has been integrated into DL-based DTI models in recent works (Subramanian et al. 2017; Thafar et al. 2020; Karpov et al. 2020; Arus-Pous et al. 2020). Using a deep neural network, Xie et al. developed a binary classification model based on the LINCS drug perturbation and gene knockout data (Xie et al. 2018).

On the other hand, Lee and Kim used expression signature genes as the source of drug and target features. They used node2vec to learn from this rich data by examining three aspects of protein function, including pathway-level memberships and PPI (Lee and Kim 2019). Shao and Zhang employed a GCN model to extract drug and target features from LINCS data and a CNN model to predict DTPs from the extracted latent features in DTIGCCN (Shao et al. 2020). The Gaussian kernel function was found to help produce high-quality graphs, and as a result this hybrid model scored better on classification tests.

DeepDTnet employs a heterogeneous drug–gene–disease network containing fifteen types of chemical, genomic, phenotypic, and cellular network properties to identify the targets of known drugs. DeepDTnet predicted, and experiments confirmed, that topotecan is a new direct inhibitor of the human retinoic acid receptor-related orphan receptor (Zeng et al. 2020).

4.2 Drug sensitivity and response prediction using DL

Drug response is the clinical outcome of treatment with the drug of interest ( https://www.sciencedirect.com/topics/drug-response ). The main idea of drug response prediction is shown in Fig. 11. The DL method takes the heterogeneous network of drug and protein interactions as input and predicts the response scores. Despite the widespread use of deep neural network (DNN) approaches in various domains and sectors, including related topics like computational chemistry (Gómez-Bombarelli et al. 2018), DNNs have only lately made their way into drug response prediction. This is due to the typically low ratio of samples to measurements per sample, which makes traditional feedforward neural networks unsuitable: overparameterization, overfitting, and poor generalization are common outcomes on such datasets. However, more public data has become available recently, and newly built DNN models have shown promise. This section therefore summarizes current DL computational problems and breakthroughs in drug response prediction.

figure 11

Drug binding with proteins and drug sensitivity (response) scores prediction

Neural networks have been used to predict drug response since the 1990s. El-Deredy et al. (1997) showed that data from tumor nuclear magnetic resonance (NMR) spectra could be used to train a neural network to predict drug response in gliomas and to provide information on the metabolic pathways involved in drug response.

In 2018, Chang et al. (2018) created the CDRscan model, which uses a CNN architecture trained on 1000 drug response experiments per molecule. Their model performed much better than traditional ML algorithms such as RF and SVM. CDRscan's ability to incorporate genomic data and molecular fingerprints is one of the reasons it outperformed these baseline models. Furthermore, its convolutional design has proven useful in various machine learning areas. An autoencoder is a neural network that compresses its input and then attempts to reconstruct the original data from the compressed form. As shown by Way and Greene (2018), this is very useful for feature extraction: they condensed a gene expression profile with 5000 dimensions to at most 100 dimensions, some of which corresponded to significant characteristics such as the patient's sex or melanoma status. Using variational autoencoders, Dincer et al. (2018) created DeepProfile, a technique that learns an eight-dimensional representation of gene expression in AML patients, which is then fitted with a Lasso linear model for treatment response prediction and gives better results than the same model without feature extraction.

Ding et al. ( 2018 ) proposed a deep autoencoder model for representation learning of cancer cells from input data consisting of gene expression, CNV, and somatic mutations.

In 2019, MOLI (Multi-omics Late Integration) (Sharifi-Noghabi et al. 2019) was proposed as a deep learning model that incorporates multi-omics data and somatic mutations to characterize a cell line. Three separate subnetworks of MOLI learn representations for each type of omics data. A final network classifies a cell as a responder or non-responder based on the concatenated features. These methods share two characteristics: integration of multiple input data types (multi-omics) and binary classification of the drug response. Although combining several forms of omics data can improve the learning of cell line status, it may limit the method's applicability to other cell lines or patients because the model requires data beyond gene expression.

Furthermore, binary classification of the drug response requires a threshold on the IC50 values to be set beforehand, and this threshold may vary with the experimental conditions, such as drug or tumor type. Twin CNN for drugs in SMILES format (TCNNS) (Liu et al. 2019b) takes a one-hot encoded representation of drugs and feature vectors of cell lines as the inputs to two encoding subnetworks, each a one-dimensional (1D) CNN. The one-hot encodings of drugs in TCNNS are derived from Simplified Molecular Input Line Entry System (SMILES) strings, which describe a drug compound's chemical structure. Binary feature vectors of cell lines represent 735 mutation states or CNVs of a cell. KekuleScope (Cortés-Ciriano and Bender 2019) adopts transfer learning, starting from a CNN pre-trained on ImageNet data. The pre-trained CNN is then trained on images of drug compounds represented as Kekulé structures to predict the drug response.
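A one-hot encoding of a SMILES string can be sketched as follows. This is a simplified character-level illustration, not the TCNNS preprocessing itself: the alphabet, the padding length, and the function name are assumptions, and a realistic tokenizer would treat multi-character atoms such as Cl and Br as single symbols.

```python
def one_hot_smiles(smiles, alphabet, max_len):
    """Encode a SMILES string as a max_len x len(alphabet) binary matrix.

    Each row corresponds to one character position and contains a single 1
    in the column of that character. Positions beyond the end of the string
    are left as all-zero padding rows. Characters missing from the alphabet
    raise a KeyError.
    """
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

# Hypothetical alphabet built from two example molecules:
# ethanol ("CCO") and benzene ("c1ccccc1").
alphabet = sorted(set("CCO" + "c1ccccc1"))  # ['1', 'C', 'O', 'c']
ethanol = one_hot_smiles("CCO", alphabet, max_len=8)
```

The resulting fixed-size matrix is the kind of input a 1D CNN can convolve over along the sequence axis.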

Yuan et al. (2019) proposed GNNDR, a GNN-based technique with high learning capacity that predicts drug response by combining protein–protein interaction (PPI) information with genomic features. The value of including protein information was demonstrated empirically, and the method offers a viable avenue for the discovery of anti-cancer drugs. Rampášek et al. (2019) examined semi-supervised variational autoencoders for predicting monotherapy response. In contrast to many conventional ML approaches, they developed a drug response model that exploits gene expression measured in cell lines before and after treatment, and it showed improved performance on a variety of FDA-approved drugs. Chiu et al. (2019) trained a deep drug response predictor after pre-training autoencoders on mutation data and expression features from the TCGA dataset. The use of pre-training distinguishes their strategy from others: rather than relying only on the labeled gene expression profiles obtained from drug response assays, pre-training can exploit unlabeled data from outside sources such as TCGA, which significantly increases the number of samples available and improves performance.

Chiu et al. (2019) and Li et al. (2019) combined autoencoders with deep neural networks to predict drug responses in genomically characterized cell lines and tumors. To predict the responses of cell lines to drug combinations, the work cited at https://string-db.org/cgi/download.pl?sessionId=uKr0odAK9hPs used deep neural encoders to link genetic characteristics with drug profiles.

In 2020, Wei et al. (2020) predicted drug risk levels based on adverse drug reactions (ADRs), using SMOTE and machine learning techniques. The proposed framework was used to investigate the mechanism of ADRs, to estimate degrees of drug risk, and to support decision-making on switching drugs from prescription to over-the-counter status. They showed that the best combination built with this architecture was PRR-SMOTE-RF and that its macro-ROC curve indicated strong classification performance. They suggested that this framework could be used by drug regulatory organizations, including the FDA and CFDA, as a simple but dependable method for ADR signal detection and drug classification, and as an auxiliary basis for experts deciding on the status change of Rx drugs to OTC drugs. They proposed that more ML or DL classification algorithms be tested in the future and that computational complexity be factored into the comparison. Kuenzi et al. (2020) built DrugCell, an interpretable DL model of human cancer cells trained on the responses of 1235 tumor cell lines to 684 drugs. Cancer genotypes induce states in cellular subsystems, which are combined with drug structure to predict therapeutic response while also learning the molecular mechanisms underlying the response. DrugCell's predictions in cell lines are accurate and also help to stratify clinical outcomes. Analysis of DrugCell mechanisms led to the design of synergistic drug combinations, which were tested using combinatorial CRISPR, in vitro drug–drug screening, and patient-derived xenografts. DrugCell thus offers a template for building interpretable models for predictive medicine.

Artificial Neural Networks (ANNs) that operate on graphs as inputs are known as Graph Neural Networks (GNNs). Deep GNNs were recently employed for learning low-dimensional representations of biomolecular networks (Hamilton 2020; Wu et al. 2020). Ahmed et al. (2020) used two separate GNN methods to build models from gene expression (GE) data and a gene co-expression network, that is, a network that depicts the relationship between the expression of gene pairs.

The CNN is one of the neural network models adopted for drug response prediction. The CNN has been actively used for image, video, text, and sound data due to its strong ability to preserve the local structure of data and learn hierarchies of features. In 2021, several methods were developed for drug response prediction, each of which uses different input data for prediction (Baptista et al. 2021).

Nguyen et al. (2021) proposed a method to predict drug response called GraphDRP, which integrates two subnetworks for drug and cell line features, like the CNN in Liu et al. (2019b). Qiu et al. (2021) used gene expression data from cancer cell lines together with drug response data to find predictor genes for drugs of interest and to provide a reliable and accurate drug response prediction model. After pre-selecting genes using the Pearson correlation coefficient, they employed the ElasticNet regression model to predict drug response and refine the gene selection. To obtain a more trustworthy set of predictor genes, they ran the regression on each drug twice, once using the IC50 and once using the area under the curve (AUC, or activity area). The Pearson correlation coefficient for each of the 12 drugs they examined was greater than 0.6. With IC50 as the response measure, the highest Pearson correlation coefficient was 0.811, for 17-AAG; with AUC, the highest was 0.81.

Even though the model developed in this study has excellent predictive performance on GDSC, it still has certain limitations. First, the properties of cancer cell lines may differ significantly from those of in vivo malignancies, and it remains to be determined whether the model will be useful in a clinical setting. Second, they primarily use gene expression data to predict drug response. However, drug response is influenced not only by gene expression levels but also by structural changes such as gene mutations. More research is needed to integrate such data into the model and improve its predictive capacity.
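For readers unfamiliar with the metric reported above, the Pearson correlation coefficient between observed and predicted responses can be computed as in this short sketch. The response values are invented for illustration and do not come from the study.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical observed and predicted drug responses (e.g. log IC50)
# for five cell lines.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.2, 1.9, 3.2, 3.8, 5.1]
r = pearson(observed, predicted)
```

A value close to 1 means the predictions rank and scale with the observations almost linearly; the same function can also serve for the gene pre-selection step, by correlating each gene's expression with the response.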

In 2022, Ren et al. (2022) proposed deep learning-based graph regularized matrix factorization (DeepGRMF), which integrates neural networks, graph models, and matrix-factorization approaches to predict cell response to drugs. It uses a variety of information, including drug chemical structures, their effects on cellular signaling mechanisms, and the states of cancer cells. DeepGRMF learns drug embeddings such that drugs with similar structures and mechanisms of action (MOAs) lie close together in the embedding space. It learns embeddings for cells in the same way, so that cells with similar biological states and drug responses are close together. On the Genomics of Drug Sensitivity in Cancer (GDSC) and Cancer Cell Line Encyclopedia (CCLE) datasets, DeepGRMF outperforms competing models in prediction performance. On the Cancer Genome Atlas (TCGA) dataset, the model could predict the effectiveness of a treatment plan on lung cancer patients' outcomes. The limited expressiveness of their VAE-based chemical structure representation may explain why prediction for new cell lines is more accurate than sensitivity prediction for new drugs. A family of graph neural networks has recently been shown to represent chemical structures better, and this can be investigated in the future. Pouryahya et al. (2022) proposed a new network-based clustering approach for predicting drug response based on optimal mass transport (OMT) theory. Gene-expression profiles and cheminformatic drug characteristics were represented as data networks and used to cluster cell lines and drugs. An RF model was then trained for each pair of cell-line and drug clusters. Such cluster-based prediction models, fitted to more homogeneous data, are expected to improve the accuracy of drug sensitivity prediction and its biological interpretability.

4.3 Drug–drug interactions (DDIs) side effect prediction using DL

Drugs are chemical compounds that people take and that interact with protein targets to produce a change. They may affect the human body positively or negatively. Drug side effects are the undesirable changes that drugs cause in the human body. These adverse effects range from moderate headaches to life-threatening events such as cardiac arrest, malignancy, and death. They differ depending on the person's age, gender, stage of illness, and other factors (Kuijper et al. 2019). In the laboratory, several tests are conducted on drugs to determine whether they have adverse side effects. However, these tests are both costly and time-consuming. Recently, many computational algorithms for detecting adverse drug effects have been developed, and computational methods are increasingly replacing laboratory experiments.

On the other hand, these methods do not provide adequate data to predict drug–drug interactions (DDIs). The phenomenon of DDIs is illustrated in Fig. 12. A drug's overall effect on the human body comprises the desired effects resulting from its interaction with the intended target and the undesirable effects arising from its interactions with off-targets. Even though a drug has a strong binding affinity for one target, it also binds to several other proteins with varying affinities, which may cause adverse effects (Liu et al. 2021). Predicting DDIs can help reduce the likelihood of adverse reactions and optimize the drug development and post-market monitoring processes (Arshed et al. 2022). Side effects of DDIs are often regarded as the leading cause of drug failure in pharmaceutical development, and drugs with major side effects are quickly withdrawn from the market. As a result, predicting side effects is a fundamental requirement in the drug discovery process, both to keep development costs and timelines in check and to launch a drug that benefits patients' recovery.

figure 12

Drug binding with proteins and DDI side effects

Furthermore, the average drug research and development cost is $2.6 billion (Liu et al. 2019). As a result, determining the likelihood of adverse effects is important for lowering the cost and risk of drug development. Researchers use various computational tools to speed up the process. In pharmacology and clinical practice, DDI prediction is a difficult problem, and correctly detecting possible DDIs in clinical studies is crucial for patients and the public. Researchers have recently achieved a series of successes using deep learning to predict DDIs from drug structural properties and graph theory (Han et al. 2022). AI has successfully detected potential drug interactions, allowing doctors to make informed decisions before prescribing drug combinations to patients with complex or multiple conditions (Fokoue et al. 2016).

Therefore, this section comprehensively reviews the DL algorithms most commonly used by researchers to predict DDIs.

In 2016, Tiresias, a framework for discovering DDIs, was proposed by Achille Fokoue et al. (2017). The Tiresias framework takes a large amount of drug-related data as input and generates DDI predictions. DDI detection begins with semantic integration of the input data, resulting in a knowledge graph that represents drug properties and interactions together with additional entities such as enzymes, chemical structures, and pathways. Numerous similarity metrics between all drugs were then computed from the knowledge graph in a scalable and distributed setting. To predict the DDIs, a large-scale logistic regression model uses the calculated similarity metrics. According to the findings, the Tiresias framework helped identify new interactions among currently available drugs and between newly developed and existing drugs. A drawback of Tiresias is its need for large-scale drug data, which makes the model costly.

In 2017, Reza et al. (2017) developed a computational technique for predicting DDIs based on functional similarities among all drugs. The model was built from several major biological elements: carriers, enzymes, transporters, and targets (CETT). The approach was applied to 2189 approved drugs, for which the associated CETTs were obtained and encoded as binary vectors used to find DDIs. A total of 2,394,767 potential drug–drug interactions were assessed, and over 250,000 previously unknown possible DDIs were identified. Among the several similarity measures used, inner product-based similarity measures (IPSMs) gave good predictive values for detecting DDIs. A key weakness of this strategy was the lack of pharmacological data, which led to errors in detecting potential DDI pairs.
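The binary-vector representation and an inner product similarity can be sketched as below. The source does not give the exact normalisation of the IPSMs, so this shows only the plain inner product; the four-entry vocabulary and the three drugs are hypothetical.

```python
def binary_vector(features, vocabulary):
    """Encode a drug as a 0/1 vector over a fixed vocabulary of
    carriers, enzymes, transporters, and targets (CETT)."""
    return [1 if item in features else 0 for item in vocabulary]

def inner_product(u, v):
    """Plain inner product; for binary vectors this counts shared entries."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical CETT vocabulary and drug annotations, for illustration only.
vocabulary = ["CYP3A4", "CYP2D6", "P-gp", "albumin"]
drug_a = binary_vector({"CYP3A4", "P-gp"}, vocabulary)
drug_b = binary_vector({"CYP3A4", "P-gp", "albumin"}, vocabulary)
drug_c = binary_vector({"CYP2D6"}, vocabulary)

similarity_ab = inner_product(drug_a, drug_b)
similarity_ac = inner_product(drug_a, drug_c)
```

Pairs that share more CETT entries score higher and would be ranked as more likely to interact; a pair with no shared entries scores zero.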

In 2018, Ryu et al. (2018) proposed a model that predicts more DDI types using the drugs' chemical structures as inputs and applied multi-task learning to DDI type prediction. In the same vein, Decagon (Zitnik et al. 2018) models polypharmacy side effects using a relational GNN. To learn representations of intricate nonlinear pharmacological interactions, Chu et al. (2018) used an auto-encoder for factorization. To predict DDIs, Liu et al. (2019c) presented DDI-MDAE, a multimodal deep auto-encoder based on a shared latent representation. Recently, interest in using graph neural networks (GNNs) to predict DDIs has increased. Different aggregation schemes for efficiently combining the feature vectors of a node's neighbors lead to different GNN variants. Asada et al. (2018) used a graph convolutional network (GCN) to encode molecular structures when extracting DDIs from text. Furthermore, Ma et al. (2018) incorporated attentive multi-view graph auto-encoders into a coherent model.

Chen (2018) devised a model for predicting adverse drug reactions (ADRs). SVM, LR, RF, and GBT were all used in the predictive model. The model used the DEMO dataset, which contains properties such as the patient's age, weight, and sex, and the DRUG dataset, which includes features such as the drug's name, role, and dosage. Males make up 46% of the sample and females 54%. The model achieved fair prediction accuracy on a representative sample set. However, the results also showed that the model is accurate only when a large amount of data is available.

To predict possible DDIs, Kastrin et al. (2018) employed statistical learning approaches. DDIs were represented as a complex network, with nodes representing drugs and links representing their potential interactions. Link prediction on DDI networks was framed as a binary classification task. A large DDI database was randomly selected for the predictions. Several supervised and unsupervised ML approaches, such as SVM, classification trees, boosting, and RF, were applied for edge prediction in various DDI networks. Compared to unsupervised techniques, the supervised link prediction strategy produced encouraging results. To detect links between drugs, the proposed method requires Unified Medical Language System (UMLS) filtering, which posed a difficulty for the researchers. Furthermore, the system only considers fixed network snapshots, which is problematic because the DDI network is dynamic.
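An unsupervised link prediction baseline of the kind compared above can be sketched with the Jaccard coefficient of two drugs' neighbor sets. The choice of this particular index, the five-drug network, and the function name are assumptions for illustration; the source does not specify which unsupervised scores were used.

```python
from itertools import combinations

def jaccard_scores(edges):
    """Score every unconnected node pair by the Jaccard coefficient of
    their neighbor sets: |N(u) & N(v)| / |N(u) | N(v)|."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    known = {frozenset(edge) for edge in edges}
    scores = {}
    for u, v in combinations(sorted(neighbours), 2):
        if frozenset((u, v)) in known:
            continue  # only score pairs without a recorded interaction
        union = neighbours[u] | neighbours[v]
        scores[(u, v)] = len(neighbours[u] & neighbours[v]) / len(union)
    return scores

# Hypothetical DDI network: nodes are drugs, edges are known interactions.
edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
scores = jaccard_scores(edges)
best_pair = max(scores, key=scores.get)
```

Here drugs B and C interact with exactly the same partners, so they receive the highest score and would be the top candidate for an unrecorded interaction. A supervised variant would feed such scores as features to a classifier.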

In 2019, Lee et al. (2019) proposed a deep learning system for accurately predicting the effects of DDIs. To learn the pharmacological effects of a variety of DDIs, the method used a set of auto-encoders and a deep feed-forward neural network, trained with a combination of well-known techniques. The results showed that adding GSP and TSP improves prediction accuracy over using SSP alone, and that the autoencoder is more effective than PCA at reducing profile features. In addition, the model outperformed existing approaches and identified numerous novel DDIs. Yue et al. (2020) combined several graph embedding methods for the DDI task, while Karim et al. (2019) modeled DDI prediction as link prediction with the help of a knowledge graph. A co-attention approach (Andreea and Huang 2019) presented a deep learning model based solely on side-effect data and molecular drug structure. CASTER (Huang et al. 2020), also based on drug chemical structures, develops a dictionary learning framework to predict DDIs, and Chu et al. (2019) proposed using semi-supervised learning to extract meaningful information for DDI prediction from both labeled and unlabeled drug data. Shtar et al. (2019) used a mix of computational techniques to predict drug interactions, including artificial neural networks and factor propagation over graph nodes, namely adjacency matrix factorization (AMF) and adjacency matrix factorization with propagation (AMFP). The model was trained on a DrugBank release containing 1142 drugs and 45,297 drug–drug interactions and tested on the most recent DrugBank release, with 1442 drugs and 248,146 drug–drug interactions. AMF and AMFP were also combined into an ensemble-based classifier, and the outcomes were assessed using the receiver operating characteristic (ROC) curve. The findings showed that the proposed ensemble classifier provides important information for drug development and for drug prescription, even from noisy data. In addition, the drug embeddings learned while training the models on the interaction networks have been made available. To predict adverse drug events caused by DDIs, Hou et al. (2019) proposed a deep neural network model. The model is based on a database of 5000 drug codes obtained from DrugBank. Using the computed features, it identifies 80 different types of DDIs. The model was built with TensorFlow-GPU and takes 4432 drug features as input.
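Since several of the studies above report results in terms of the ROC curve, the following sketch shows one standard way to compute the area under it: as the fraction of positive–negative pairs in which the positive example receives the higher score. The labels and scores are invented for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    example is scored higher; ties count as half a win.
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical predictions for six drug pairs (1 = known interaction).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

In this example eight of the nine positive–negative pairs are ordered correctly, so the AUC is 8/9. The quadratic pairwise loop is chosen for clarity; rank-based formulas are preferable for large datasets.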

The trained model can predict the reactions of medicines for inflammatory bowel disease (IBD) with an accuracy of 88 percent. The findings also showed that the model performs best when many datasets are used. Wang et al. (2019) proposed a DNN model for detecting adverse drug effects. The model predicts ADRs using chemical, biological, and biomedical information about drugs, and it also incorporates drug data from the SIDER databases. The system's performance was improved by using distributed representations of the target drugs in a vector space, obtained with a word-embedding approach, to capture the associations between drugs. The system's fundamental weakness was that it only worked well with the standard SIDER databases.

In 2020, numerous AI-based methods were developed for DDI event prediction, including methods that evaluate chemical structural similarity using graph neural networks (Huang et al. 2020). Attempts to predict DDIs from different data sources have also been made, such as using similarity features to construct drug features for the DDI event prediction task (Deng et al. 2020).

With the help of word embeddings, part-of-speech tags, and distance embeddings, Bai et al. (2020) proposed a deep learning technique that performs the DDI extraction task and supports the drug development cycle and drug repurposing. According to the experimental data, the technique better avoids instance misclassifications with minimal pre-processing. Moreover, the model employs an attention mechanism to emphasize the significance of each hidden state in the Bi-LSTM layers.

Feng et al. (2020) proposed DPDDI, which consists of a feature extractor based on a graph convolutional network (GCN) and a predictor based on a DNN. It is an effective and robust approach for predicting potential DDIs that uses data from the DDI network without considering drug characteristics (i.e., drug chemical and biological properties). DPDDI is a useful tool for predicting DDIs, and it should also be helpful in other DDI-related scenarios, such as recognizing unanticipated side effects and guiding drug combinations. The disadvantage of this approach is that it ignores drug characteristics.

Zaikis and Vlahavas (2020) proposed a multi-level GNN framework for predicting links between biological entities. It is built as a bi-level network in which the higher level represents the network of interactions among biological entities, while the lower level represents the individual biological entities, such as drugs and proteins. However, the accuracy of the proposed model needs to be improved.

In 2021, To overcome the DDI prediction, Lin et al. ( 2021 ) suggested an end-to-end system called Knowledge Graph Neural Network (KGNN). KGNN expands the use of spatial GNN algorithms to the knowledge graph by selectively various aggregators of neighborhood data, allowing it to learn the knowledge graph's topological structural information, semantic relations, and the neighborhood of drugs and drug-related entities. Medical risks are reduced when numerous medications are used correctly, and drug synergy advantages are maximized. For multi-typed DDI pharmacological effect prediction, Yue et al. ( 2021 ) used knowledge graph summarization. Lyu et al. ( 2021 ) also introduced a Multimodal Deep Neural Network (MDNN) framework for DDI event prediction. On the drug knowledge graph, a graph neural network was used, MDNN effectively utilizes topological information and semantic relations. MDNN additionally uses joint representation structure information, and heterogeneous traits are studied, which successfully investigates the multimodal data's complementarity across modes. Karim et al. ( 2019 ) built a knowledge graph that used CNN and LSTM models to extract local and global pharmacological properties across the network. DANN-DDI is a deep attention neural network framework proposed by Liu et al. ( 2021 ). To anticipate unknown DDIs, it carefully incorporates different pharmacological properties (Chun and Yi-Ping Phoebe 2021 ) and developed a deep hybrid learning (DL) model to provide a descriptive forecasting of pharmacological adverse reactions. It was one of the initial hybrid DL models through conception models that could be interpreted. The model includes a graph CNN through conception models to improve the learning efficiency of chemical drug properties and bidirectional long short-term memory (BiLSTM) recurrent neural networks to link drug structure to adverse effects. 
After the outputs of the two networks (GCNN and BiLSTM) are concatenated, a fully connected network is used to predict adverse drug reactions. Independently of the classification threshold, the model obtains an AUC of 0.846, and it has a precision score of 0.925. Even though a small drug data set was used for adverse drug reaction (ADR) prediction, the Bilingual Evaluation Understudy (BLEU) scores were 0.973, 0.938, 0.927, and 0.318, indicating considerable achievements. Furthermore, the model can correctly form words to describe adverse drug reactions and link them to the drug's name and molecular structure. The predicted relationship between drug structure and ADRs can guide safety pharmacology research at the preclinical stage and make ADR detection easier early in the drug development process. It can also aid in the detection of unknown ADRs in existing medications. Mohsen and Hossein () proposed a deep neural network model for extracting DDIs from medical literature. This model employs a novel attention mechanism to improve the separation of important words from other terms based on word similarity and position relative to the candidate drugs. Before recognizing the type of DDI, the method computes attention weights over the outputs of a bi-directional long short-term memory (Bi-LSTM) model in the deep network architecture. The proposed approach was tested on the standard DDI Extraction 2013 dataset. According to the experimental results, it achieved an F1-score of 78.30, which is comparable to the best results reported for existing approaches.

In 2022, Pietro et al. (2022) introduced DruGNN, a GNN-based technique for predicting DDI side effects. The prediction is framed as a multi-class, multi-label node classification problem in which each DDI corresponds to a class. To predict the side effects of novel drugs, they use a combined inductive-transductive learning scheme that takes advantage of drug and gene features (inductive path) and knowledge of known drug side effects (transductive path). The entire procedure is adaptable because the machine learning base can still be used if the graph dataset is enlarged to include more node features and associations. Zhang et al. (2022) proposed CNN-DDI, a new semi-supervised algorithm for predicting DDIs that uses a CNN architecture. They first extracted interaction features from drug categories, targets, pathways, and enzymes as feature vectors. They then proposed a novel convolutional neural network as a predictor of DDI-related events based on this feature representation. The predictor consists of five convolutional layers, two fully connected layers, and a CNN-based SoftMax layer. The results show that CNN-DDI is superior to other state-of-the-art techniques, but it takes longer to run. Jing et al. (2022) presented DTSyn, a novel dual-transformer-based approach that can select probable cancer drug combinations. It uses a multi-head attention mechanism to extract chemical substructure-gene, chemical-chemical, and chemical-cell-line associations. DTSyn is the first model to incorporate two transformer blocks to extract associations among genes, drugs, and cell lines, allowing a better understanding of drug action mechanisms. Despite DTSyn's excellent performance, its balanced accuracy on independent data sets remains limited. Collecting more training data is expected to solve the problem. 
Another issue is that the fine-granularity transformer was only trained on 978 signature genes, which could result in some chemical-target interactions being lost.

Furthermore, DTSyn used expression data as the only cell line features. To represent the cell line more fully, additional omics data, including methylation and genetic data, may be added in the future. He et al. (2022) proposed MFFGNN, a new end-to-end learning framework for DDI prediction that can effectively combine information from drug molecular graphs, SMILES sequences, and DDI graphs. The MFFGNN model uses a molecular graph feature extraction module to extract global and local features from molecular graphs.

The authors ran thorough tests on a variety of real-world datasets. According to the findings, the MFFGNN model consistently outperforms other state-of-the-art models. Furthermore, the multi-type feature fusion module uses a gating mechanism to limit the amount of neighborhood information passed to each node.

4.4 Drug–drug similarity prediction using DL

Drug similarity studies presume that medications with comparable pharmacological properties have similar mechanisms of action and side effects and are used to treat similar conditions (Brown 2017; Zeng et al. 2019).

Pharmacological drug similarity is critical for various purposes, including identifying drug targets, predicting side effects, predicting drug–drug interactions, and repositioning drugs. Features of the chemical structure (Lu et al. 2017; O’Boyle 2016), protein targets (Vilar 2016; Wang et al. 2014), side-effect profiles (Campillos et al. 2008; Tatonetti et al. 2012), and gene expression profiles (Iorio et al. 2010) provide multiple perspectives for predicting similar drugs. Combining them can compensate for gaps in individual data sources and offer fresh insights into drug repositioning and other applications. The main idea of drug–drug similarity is presented in Fig. 13. Each vector represents the features of a drug, and the links reflect the similarity between two drugs.

Fig. 13 Drug–drug similarity main idea

4.4.1 Drug similarity measures

The similarity estimates are calculated from chemical structure, target protein sequence, target protein function, and drug-induced pathways.

4.4.1.1 The similarity in chemical structure

DrugBank (2019) provides the chemical structures of small-molecule drugs in SDF molecular format. Invalid SDFs, such as those with an NA value or fewer than three columns in the atom or bond blocks, can be identified and removed. For valid compounds, atom pair descriptors can be computed. The pairwise similarity of compounds, \(\delta_c(d_i, d_j)\), was evaluated on atom pairs using the Tanimoto coefficient, which is defined as the number of atom pairs shared by two compounds divided by the number of atom pairs in their union (Eq. 1):

\[\delta_c(d_i, d_j) = \frac{\left| AP_i \cap AP_j \right|}{\left| AP_i \cup AP_j \right|} \tag{1}\]

where \(AP_i\) and \(AP_j\) are the atom pairs of drugs \(d_i\) and \(d_j\), respectively. The numerator is the number of atom pairs common to both compounds, while the denominator is the total number of distinct atom pairs across the two compounds.
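The Tanimoto calculation described above (shared atom pairs divided by the union of atom pairs) can be sketched in a few lines. This is a minimal illustration, assuming each drug has already been reduced to a set of atom-pair descriptors. The descriptor strings below are hypothetical placeholders, and returning 0.0 for two empty sets is an assumption rather than something the source specifies.

```python
def tanimoto(ap_i, ap_j):
    """Tanimoto coefficient between two collections of atom-pair descriptors."""
    a, b = set(ap_i), set(ap_j)
    union = a | b
    if not union:
        # Assumption: two empty descriptor sets are treated as dissimilar.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical atom-pair descriptors (placeholder strings, not real output
# of any cheminformatics toolkit).
drug_a = {"C-C:1", "C-N:2", "C-O:3", "N-O:4"}
drug_b = {"C-C:1", "C-N:2", "C-S:2"}

# Two shared descriptors out of five distinct ones.
print(tanimoto(drug_a, drug_b))
```

In practice the descriptor sets would come from a cheminformatics toolkit that parses the SDF records; only the final ratio is shown here.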

4.4.1.2 Target protein sequence-based similarity

DrugBank provides the target sequences of all small-molecule drugs in FASTA format. The basic dynamic programming algorithm of Needleman and Wunsch (1970) for global alignment can be used to compare protein sequences pairwise. The percentage of pairwise sequence identity (Raghava 2006) can be taken as the corresponding sequence similarity. Equation 2 was used to calculate drug–drug similarity based on target sequence similarities:

\[\delta_t(d_i, d_j) = \frac{\sum_{x \in T_i} \max_{y \in T_j} S(x, y) + \sum_{y \in T_j} \max_{x \in T_i} S(x, y)}{\left| T_i \right| + \left| T_j \right|} \tag{2}\]

where \(\delta_t(d_i, d_j)\) denotes the target-based similarity between drugs \(d_i\) and \(d_j\), \(T_i\) is the set of proteins targeted by drug \(d_i\), \(T_j\) is the set of proteins targeted by drug \(d_j\), and \(S(x, y)\) is a symmetric sequence-based similarity measure between two targeted proteins, \(x \in T_i\) and \(y \in T_j\). Overall, Eq. 2 calculates the average of the best matches, wherein each target of the first drug is matched only to the most similar target of the second drug, and vice versa.
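The best-match average described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code. The function name `best_match_average`, the protein identifiers, and the lookup table of pairwise identities are all hypothetical. In a real pipeline the similarity function would return the sequence identity from a global alignment.

```python
def best_match_average(targets_i, targets_j, sim):
    """Average of best matches between two target sets under similarity `sim`."""
    if not targets_i or not targets_j:
        # Assumption: a drug without known targets gets similarity 0.0.
        return 0.0
    # Each target of the first drug is matched to its most similar target
    # of the second drug, and vice versa.
    best_i = sum(max(sim(x, y) for y in targets_j) for x in targets_i)
    best_j = sum(max(sim(x, y) for x in targets_i) for y in targets_j)
    return (best_i + best_j) / (len(targets_i) + len(targets_j))


# Hypothetical pairwise sequence identities between target proteins.
toy_identity = {
    ("P1", "P3"): 0.9, ("P1", "P4"): 0.2,
    ("P2", "P3"): 0.4, ("P2", "P4"): 0.6,
}


def toy_sim(x, y):
    """Symmetric lookup into the hypothetical identity table."""
    return toy_identity.get((x, y), toy_identity.get((y, x), 0.0))


print(best_match_average(["P1", "P2"], ["P3", "P4"], toy_sim))
```

Because the measure sums best matches in both directions, it is symmetric in the two drugs even when their target sets differ in size.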

4.4.1.3 Target protein functional similarity

Protein targets that are overrepresented by comparable biological functions and have similar sequences imply shared pharmacological mechanisms and downstream effects (Passi et al. 2018). Accordingly, each protein is associated with a set of Gene Ontology (GO) terms from all three categories: cellular components (CC), molecular functions (MF), and biological processes (BP). GO terms that were either very specific (with fewer than 15 linked genes) or very general (with more than 100 genes) were filtered out. DrugBank (2019) provided the Human Protein–Protein Interaction (PPI) network. Wang et al. (2007) proposed leveraging the topology of the GO graph structure to determine the semantic similarity of GO terms, which was used to determine how functionally similar two drugs are, i.e., \(\delta_f(d_i, d_j)\). Using a best-match average technique, the pairwise semantic similarities between the GO terms associated with \(d_i\) and \(d_j\) were aggregated into a single semantic similarity measure and entered into a final similarity matrix.

4.4.1.4 Drug-induced pathway similarity

A drug pair that triggers similar or overlapping pathways indicates that the drugs' mechanisms of action are similar, which is useful information for drug similarity and repositioning research (Zeng et al. 2015). KEGG (Kanehisa and Goto 2000) was used to find the pathways activated by each small-molecule drug. The pairwise similarity of any two pathways was calculated with the Dice similarity coefficient, based on the overlap of their constituent genes. After that, a pathway-based similarity score for each drug pair \(d_i\) and \(d_j\), i.e., \(\delta_p(d_i, d_j)\), was calculated using Eq. 3:

\[\delta_p(d_i, d_j) = \max_{x \in P_i,\, y \in P_j} DSC(x, y) \tag{3}\]

where \(P_i\) and \(P_j\) are the sets of pathways induced by drugs \(d_i\) and \(d_j\), respectively; \(x\) and \(y\) are two pathways, each represented by the set of its constituent genes; and \(DSC(x, y) = 2\left| x \cap y \right| / \left( \left| x \right| + \left| y \right| \right)\) is the Dice similarity coefficient, which measures how much the two pathways overlap. When no gene is shared by any two pathways induced by the drug pair being compared, the similarity is set to 0.0. Overall, Eq. 3 implies that if two drugs induce one or more identical pathways, the maximum pathway-based similarity is achieved.
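A short sketch of the pathway-based score follows. It assumes each pathway is a set of gene identifiers and that the drug-level score is the best Dice overlap over all pathway pairs. Taking the maximum is an assumption, chosen because it matches the statement that sharing an identical pathway yields the maximum similarity. The gene and pathway contents are hypothetical.

```python
def dice(x, y):
    """Dice similarity coefficient between two gene sets."""
    x, y = set(x), set(y)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def pathway_similarity(pathways_i, pathways_j):
    """Best Dice overlap over all pathway pairs; 0.0 when nothing is shared."""
    return max(
        (dice(x, y) for x in pathways_i for y in pathways_j),
        default=0.0,
    )


# Hypothetical drug-induced pathways, each given as a set of gene symbols.
drug_i_pathways = [{"G1", "G2", "G3"}, {"G4", "G5"}]
drug_j_pathways = [{"G2", "G3", "G6"}, {"G7"}]

print(pathway_similarity(drug_i_pathways, drug_j_pathways))
```

With this definition, two drugs that induce at least one identical pathway score 1.0, and drugs whose pathways share no gene score 0.0.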

4.4.2 DL for drug similarity prediction

Wang et al. (2019) introduced a gated recurrent unit (GRU) model that uses similarity to predict drug–disease interactions. In this approach, the Chemistry Development Kit (CDK) converted the SMILES strings into 2D chemical fingerprints, and the Jaccard score of the 2D chemical fingerprints was used to compare two drugs. This section reviews the DL algorithms most widely used by researchers to predict drug similarity.

Hirohara et al. (2018) employed a CNN to learn molecular representations. In this approach, the molecule's SMILES notation is given to the network as input to the convolutional layers. The Tox21 dataset was used.

To conduct similarity analysis, Cheng et al. (2019) used the Anatomical Therapeutic Chemical (ATC) classification system and the ATC code-based similarities of drug pairs. The authors created interaction networks, performed drug pair similarity analyses, and developed a network-based methodology for identifying clinically effective treatment combinations for a specific condition.

Xin et al. (2016) presented a Ranking-based k-Nearest Neighbour (Re-KNN) technique for drug repositioning. The method's key feature is that it combines the Ranking SVM (Support Vector Machine) algorithm with the traditional KNN algorithm. The similarity computation methods they used cover chemical structural similarity, target-based similarity, side-effect similarity, and topological similarity. The Tanimoto score was then used to determine the similarity between two profiles.

Seo et al. (2020) proposed an approach that combined drug–drug interactions from DrugBank, network-based drug–drug interactions, single nucleotide polymorphisms, and the anatomical hierarchy of side effects, as well as indications, targets, and chemical structures.

Zeng et al. (2019) developed a measure of clinical drug–drug similarity derived from clinical data and used electronic health records (EHRs) to analyse and establish drug–diagnosis connections. Using the Bonferroni-adjusted hypergeometric P value, they created connections between drugs and diagnoses in an electronic medical record (EMR) dataset. The distances between drugs were assessed using the Jaccard similarity coefficient, and a k-means algorithm was applied to form drug clusters.

Dai et al. (2020) reviewed and summarized representative methods and discussed the value and applications of patient similarity and patient similarity networks. They also discussed ways to measure the similarity or distance between each pair of patients and classified these methods as unsupervised, supervised, or semi-supervised.

Yan et al. (2019) created BiRWDDA, a new computational methodology for drug repositioning that combines a bi-random walk with multiple similarity measures to uncover potential associations between diseases and drugs. First, multiple drug–drug and disease–disease similarities are computed. The information entropy of each similarity is then evaluated to select the most appropriate drug and disease similarities. Four drug–drug similarity measures and three disease–disease similarity measures were calculated from drug- and disease-related characteristics to create a heterogeneous network. The four drug–drug similarities were based, respectively, on the drug's protein sequence information; drug interactions extracted from DrugBank, scored with the Jaccard coefficient; chemical structure, derived from the canonical SMILES in DrugBank; and side effects.

Yi et al. (2021) constructed a deep gated recurrent unit model to predict potential drug–disease interactions, using a wide range of similarity measures and a Gaussian interaction profile kernel. The similarity measure is used to derive distinguishing drug features from the drugs' chemical fingerprints. Meanwhile, the Gaussian interaction profile kernel is used to derive efficient disease features from established disease–disease relationships. A deep gated recurrent unit model is then built to predict potential drug–disease interactions. The experimental results showed that the proposed algorithm could be used to predict novel drug indications or disease treatments and to speed up drug repositioning and related drug research and discovery.

To predict DDIs, Yan et al. (2022) proposed a semi-supervised learning technique (DDI-IS-SL). DDI-IS-SL uses the cosine similarity method to calculate drug feature similarity by combining chemical, biological, and phenotypic data. The integrated drug information includes drug chemical structures, drug–target interactions, drug enzymes, drug transporters, drug pathways, drug indications, drug side effects, harmful effects of drug discontinuation, and known DDIs.

Heba et al. (2021) used DrugBank to develop a similarity-based machine learning framework called "SMDIP" (Similarity-based ML for Drug Interaction Prediction). They calculated drug–drug similarity with the Russell–Rao metric on the biological and structural data currently available in DrugBank to represent the limited feature space. The DDI classification is carried out using logistic regression, with an emphasis on finding the main similarity predictors. The key DDI features are then passed to six machine learning models (NB: naive Bayes; LR: logistic regression; KNN: k-nearest neighbours; ANN: artificial neural network; RFC: random forest classifier; SVM: support vector machine).
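For readers unfamiliar with the Russell–Rao metric, the sketch below shows its usual definition on binary feature vectors: the number of positions where both vectors are 1, divided by the total number of positions. The feature vectors are hypothetical, and this is not the SMDIP implementation.

```python
def russell_rao_similarity(u, v):
    """Russell-Rao similarity: shared 1-positions divided by vector length."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    if not u:
        return 0.0
    both_present = sum(1 for a, b in zip(u, v) if a and b)
    return both_present / len(u)


# Hypothetical binary feature vectors (1 = drug has the feature).
drug_a = [1, 1, 0, 1, 0, 0, 1, 0]
drug_b = [1, 0, 0, 1, 0, 1, 1, 0]

# Three positions are 1 in both vectors, out of eight positions.
print(russell_rao_similarity(drug_a, drug_b))
```

Unlike the Jaccard or Tanimoto coefficient, the denominator counts every position, so jointly absent features lower the score; two identical sparse vectors therefore score less than 1.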

For large-scale DDI prediction, Vilar et al. (2014) provided a procedure combining five drug similarity fingerprints (two-dimensional structural fingerprints, interaction profile fingerprints, target profile fingerprints, ADE profile fingerprints, and three-dimensional pharmacophoric approaches).

Song et al. (2022) used similarity theory and a convolutional neural network to create global structural similarity features. They employed a transformer to extract and produce local chemical sub-structure semantic features for drugs and proteins. To create the global structural similarity features of drugs and proteins, the study uses the Tanimoto coefficient, the Levenshtein distance, and a CNN.
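The Levenshtein distance mentioned above counts the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into another. The sketch below is a standard dynamic programming version. The normalized similarity helper and the example strings are illustrative assumptions and are not taken from Song et al. (2022).

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Two short SMILES-like strings that differ in a single character.
print(levenshtein("CCO", "CCN"), normalized_similarity("CCO", "CCN"))
```

Only one row of the dynamic programming table is kept at a time, so memory use grows with the length of the second string rather than with the product of both lengths.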

5 Benchmark datasets and databases

Drug discovery and development draw on a range of direct and indirect data sources, which have regularly demonstrated strong predictive capability in finding confirmed repositioning candidates and in other applications of computer-aided drug design. This section reviews the most important available benchmark datasets and databases used in drug discovery problems, which researchers may need depending on the problem category. Thirty-five datasets are summarized in Table 3.

6 Evaluation metrics

Performance measures are required for evaluating machine learning models (Benedek et al. 2021). They serve as a tool for comparing different techniques and identifying the best one to deploy. This section describes the metrics defined for the four categories of drug discovery problems.

Table 4 shows the metrics employed in drug discovery problems. Understanding these metrics helps in assessing the effectiveness of various prediction systems. True positives (TP) are drug side effects that exist and were correctly predicted by the model. False positives (FP) are side effects that are not present but were predicted by the model. True negatives (TN) are side effects that do not exist and that the model correctly did not predict. False negatives (FN) are side effects that exist but that the model failed to predict.
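To make the four confusion-matrix counts concrete, the sketch below derives several standard classification metrics from them. The selection of metrics is an assumption and may not coincide exactly with the entries of Table 4. The counts in the example are invented for illustration.

```python
def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics computed from confusion-matrix counts."""
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": safe_div(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Invented counts for a side-effect classifier evaluated on 100 examples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Threshold-free measures such as AUC need the ranked prediction scores rather than these four counts, so they are not covered by this sketch.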

7 Drug dosing optimization

Drugs are vital to human health, and choosing the proper treatment and dose for the right patient is a constant problem for clinicians. Even when taken as studied and prescribed, drugs have adverse effect profiles and varying response rates. As a result, all medications must be well managed, especially those used to treat critical illnesses or those with a narrow exposure window between efficacy and toxicity. Clinicians follow standard guidelines for the first dose, which is not always optimal or safe for every patient, especially if the drug has not been evaluated at various doses in various patient types. Precision dosing can transform health care by increasing benefits while reducing the risks of drug therapy. While precision dosing will probably have a significant impact on some drugs, it may not be essential or practical to apply it to all drugs or therapeutic classes. As a result, recognizing the characteristics that make drugs suitable targets for precision dosing will help direct resources to where they will have the most impact. Giving high priority to precision dosing for selected drugs and therapeutic classes could be crucial in achieving better health care performance, safety, and cost-effectiveness (Tyson et al. 2020).

Because of standard, fixed dosing procedures or gaps in knowledge, imprecise drug dosing in specific subpopulations increases the risk of adverse outcomes due to supratherapeutic or subtherapeutic concentrations (Watanabe et al. 2018). Currently, the Food and Drug Administration (FDA) simply requires a drug to be statistically better than placebo or non-inferior to the existing standard of treatment. This does not guarantee that the drug will benefit most patients in clinical trials, especially for malignancies that are difficult to treat, such as diffuse intrinsic pontine glioma (DIPG) and unresectable meningioma, where therapy response rates can be exceedingly low (Fleischhack et al. 2019).

Several aspects are essential to dose optimization ( https://friendsofcancerresearch.org/wpcontent/uploads/Optimizing_Dosing_in_Oncology_Drug_Development.pdf ); finding the most effective dose depends on the product, the target population, and the available data:

Therapeutic properties: Drug features such as small molecule vs. large molecule and agonist vs. antagonist impact how drugs interact with the body regarding safety and efficacy. The therapeutic characteristics impact the first doses used in dose-finding studies and the procedures used to determine which doses should be used in registrational trials.

Patient populations: Patient demographics vary depending on tumour type, stage of disease, and comorbidities. Understanding how these diverse factors influence the drug's efficacy may justify modifying the dose accordingly, especially in the context of expanded clinical trial populations.

Supplemental versus original approval: Differences in disease features and patient demographics between tumour types and treatment settings, such as monotherapy versus combination therapy, must be considered when assessing whether additional dose exploration is required for a supplemental application. In cases when more dose exploration is required, the research design can include previous exposure-response knowledge from the initial approval.

8 Drug discovery and XAI

XAI addresses one of the most serious flaws in ML and DL algorithms: the lack of model interpretability and explainability. Understanding how and why a prediction is formed becomes increasingly crucial as algorithms grow more sophisticated and can predict with greater accuracy. Without interpretability and explainability, it would be impossible to trust the predictions of real-world AI applications. Human-comprehensible explanations will increase system safety while encouraging trust in, and sustained acceptance of, machine learning technologies (). XAI has been studied as a way to circumvent the limitations that AI technologies have because of their black-box nature. In contrast to AI approaches such as DL, XAI can provide justifications for a model and its decisions (Zhang et al. 2022). XAI approaches have attracted attention (Lipton 2018; Murdoch et al. 2019) as a way to compensate for the lack of interpretability of some ML models and to aid human decision-making and reasoning (Goebel et al. 2018). The purpose of presenting relevant explanations alongside mathematical models is to help users understand them better by (1) making the decision-making process more transparent (Doshi-Velez and Kim 2017), (2) avoiding correct predictions made for the wrong reasons (Lapuschkin et al. 2019), (3) avoiding unfair or unethical biases and discrimination (Miller 2019), and (4) closing the gap between ML and other scientific disciplines. Effective XAI can also help scientists navigate the scientific process (Goebel et al. 2018), enabling people to fine-tune their understanding of and opinions on the process under inquiry (Chander et al. 2018). This section provides an overview of recent XAI research in drug discovery.

XAI has a place in drug development. While the precise definition of XAI is still under debate (Guidotti et al. 2018), the following characteristics of XAI are clearly beneficial in drug design applications (Lipton 2018):

Transparency: Understanding how the system reached a specific result.

Justification: Explaining why the model's response is appropriate.

Informativeness: Providing new information to human decision-makers.

Uncertainty estimation: Quantifying the reliability of a prediction.

The molecular explanation of pharmacological activity is already possible with XAI (Xu et al. 2017; Ciallella and Zhu 2019), as are drug safety assessment and organic synthesis planning (Dey et al. 2018). Over time, XAI will be important in processing and interpreting increasingly complex chemical data, as well as in generating new pharmaceutical hypotheses, all while preventing human bias (Boobier et al. 2017). Pressing drug discovery challenges such as the coronavirus pandemic may boost the development of application-specific XAI techniques that respond quickly to specific scientific questions relating to human biology and pathophysiology.

AI tools can improve their predictive performance by increasing model complexity. As a result, these models become opaque, with no clear understanding of how they operate. Because of this opacity, AI models are not widely used in critical domains such as medical care. XAI therefore focuses on understanding what goes into an AI model's prediction in order to meet the demand for transparency in AI tools. AI model interpretability approaches can be categorized according to the algorithms used, the scale of interpretation, and the type of data (Adadi and Mohammed 2018). With regard to the objectives of interpretability, approaches can be grouped into white-box model development, black-box model explanation, model fairness enhancement, and predictive sensitivity testing (Guidotti et al. 2018).

The gradient-based attribution technique (Simonyan et al. 2014) attributes a prediction to the network's input features. Because this strategy is commonly used to explain the predictions of DNN systems, it may be a suitable solution for various black-box DNN models in DDI prediction (Quan et al. 2016; Sun et al. 2018). In addition, DeepLIFT is a commonly used method applied on top of DNN models that has been shown to be superior to gradient-based techniques (Shrikumar et al. 2017). Alternatively, Guided Backpropagation can be applied to suitable network architectures (Springenberg 2015). In a CNN, a convolutional layer with increased stride can be used instead of max pooling to deal with loss of precision. This method could be employed in CNN-based DDI prediction, as in Zeng et al. (2015).
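The idea behind gradient-based attribution can be illustrated on the simplest possible model. The sketch below uses a single logistic unit, for which the gradient of the output with respect to each input feature has a closed form, and scores each feature by the absolute value of that gradient. The weights and inputs are hypothetical. For a deep network, the same gradient would be obtained by backpropagation in an automatic differentiation framework.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Output of a single logistic unit, f(x) = sigmoid(w . x + b)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def saliency(x, w, b):
    """Absolute gradient of the output with respect to each input feature.

    For a logistic unit, df/dx_k = f(x) * (1 - f(x)) * w_k.
    """
    p = predict(x, w, b)
    return [abs(p * (1.0 - p) * wi) for wi in w]


# Hypothetical weights and input features of one drug pair.
weights = [2.0, -0.5, 0.0]
bias = 0.0
features = [1.0, 1.0, 1.0]

# The first feature receives the largest attribution, the third none.
print(saliency(features, weights, bias))
```

A feature with a large absolute gradient is one whose small perturbation changes the prediction most, which is the sense in which the method attributes the prediction to that feature.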

Furthermore, Tao et al. (2016) applied a rationale-based approach to neural networks that process natural language. The method aims to extract small pieces of the input text, called rationales, that justify the prediction. Its design comprises two parts, a generator and an encoder, which together seek text subsets that are closely connected to the predicted outcome. Because NLP-based models are used to extract DDIs (Quan et al. 2016), the above methods should be examined as a way to improve model clarity.

Aside from that, XAI has produced methods for developing white-box models, including linear, decision tree, rule-based, and more advanced but still transparent models. However, these approaches receive less attention because of their weak predictive power, particularly in NLP-based tasks such as DDI extraction. Several ideas for addressing AI fairness have also been proposed. Nonetheless, only a small number of these studies examined fairness for non-tabular data, such as the text-based data used in DDI extraction. Many DDI experiments used word embedding methods (Quan et al. 2016; Zhang 2020; Bolukbasi 2016). As a result, efforts to ensure fairness in DDI research deserve more consideration. To ensure the reliability of AI models, numerous methods also examine model sensitivity. In their adversarial example-based sensitivity analysis, Zügner et al. (2018) explored graph-structured data. The technique perturbs the links between nodes or the node features to attack node classification models. Because graph-based methods are frequently used in DDI research (Lin et al. 2021; Sun et al. 2020b), similar methods might be applied to DDI prediction models. For RNNs, word embedding perturbations (Miyato et al. 1605) are also worth considering. Notably, the input reduction strategy used by Feng et al. (2018) to expose hypersensitivity in NLP models could be applied to DDI extraction studies. In their DDI study, Schwarz et al. (2021) attempted to provide model interpretability using attention scores derived at all levels of modeling. These scores are used to determine the importance of the similarity matrices to the drug representation vectors and to identify the drug features that contribute to a better encoding. 
This method makes use of information that flows through all layers of the network.

Graph neural networks (GNNs) and their explainability are rapidly evolving in the field of graph data. GNNExplainer (Ying et al. 2019) uses mask optimization to learn soft masks for edges and node features in order to explain the predictions. The soft masks are initialized at random and treated as trainable variables. GNNExplainer then combines the masks with the original graph by element-wise multiplication. The masks are optimized by maximizing the mutual information between the predictions for the original graph and the predictions for the newly obtained graph. Even though various regularization terms, such as element-wise entropy, encourage the optimized masks to be discrete, the resulting masks remain soft.
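Two ingredients of the soft-mask idea described above can be shown in isolation: applying a mask to the graph by element-wise multiplication, and the element-wise entropy term that pushes mask values toward 0 or 1. The sketch below covers only these two steps, with mask logits passed through a sigmoid as an assumed parameterization. It omits the GNN and the mutual information objective, and it is not the GNNExplainer implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def apply_soft_mask(adjacency, mask_logits):
    """Element-wise product of the adjacency matrix and sigmoid(mask logits)."""
    return [
        [a * sigmoid(m) for a, m in zip(adj_row, mask_row)]
        for adj_row, mask_row in zip(adjacency, mask_logits)
    ]


def mask_entropy(mask_logits, eps=1e-12):
    """Mean element-wise binary entropy; small when mask values are near 0 or 1."""
    total, count = 0.0, 0
    for row in mask_logits:
        for m in row:
            p = sigmoid(m)
            total -= p * math.log(p + eps) + (1.0 - p) * math.log(1.0 - p + eps)
            count += 1
    return total / count


# Hypothetical 3-node graph; zero logits give an uninformative mask of 0.5.
adjacency = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
logits = [[0.0] * 3 for _ in range(3)]

print(apply_soft_mask(adjacency, logits))
print(mask_entropy(logits))
```

In a full explainer, the logits would be updated by gradient descent on an objective that combines the prediction term with regularizers such as this entropy, which is why the learned masks tend toward, but do not reach, discrete values.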

In addition, because the masks are tuned for each input graph separately, it’s possible that the explanations aren't comprehensive enough. To elaborate on the forecasts, PGExplainer (Luo et al. 2020 ) discovers approximated discrete edge masks. To forecast edge masks, it develops a mask predictor that is parameterized. It starts by concatenating node embeddings to get the embeddings for each edge in an input graph. The predictor then forecasts the chances of each edge being selected using the edge embeddings, that regarded as an evaluation of significance. The reparameterization approach is then used to sample the approximated discrete masks. Finally, the mutual information between the previous and new forecasts is optimized to train the mask predictor. GraphMask (Schlichtkrull et al. 2010 ) describes the relevance of edges in each GNN layer after the fact. It uses a classifier, like the PGExplainer, to forecast if an edge may be eliminated and does not impact the original predictions. A binary concrete distribution (Louizos et al. 1712 ) and a reparameterization method are used to roughly represent separate masks. The classifier is additionally trained by removing a term for a difference, which evaluates the difference between network predictions over the entire dataset. ZORRO (Thorben et al. 2021 ) employs discrete masks to pinpoint key input nodes and characteristics. A greedy method is used to choose nodes or node attributes from an input network. ZORRO chooses one node characteristic with the greatest fidelity score for each stage. The objective function, fidelity score, measures the degree of the recent forecasts resemble the model's original predictions by replacing the rest of the nodes/features with random noise values and repairing chosen nodes/features. The non-differentiable limitation of discrete masks is overcome because no training process is used.

Furthermore, ZORRO avoids the problem of "introduced evidence" by wearing protective masks. The greedy mask selection process, on the other hand, may result in optimal local explanations. Furthermore, because masks are generated for each graph separately, the explanations may lack a global understanding. Causal Screening (Xiang et al. 2021 ) investigates the attribution of causality to various edges in the input graph. It locates the explanatory subgraph's edge mask. The essential concept behind causal attribution is to look at how predictions change when an edge is added to the present explanatory subgraph, called the influence of causality. It examines the causal consequences of many edges at each step and selects one to include in the paragraph. It selects edges using the individual causal effect (ICE), which assesses the difference in information between parties after additional edges are introduced to the subgraph.

Causal Screening, like ZORRO, is a greedy algorithm that generates discrete masks without any training. As a result, it does not suffer from the introduced evidence problem. However, it may lack global understanding and become stuck in locally optimal explanations. SubgraphX (Yuan et al. 2102) investigates subgraph-level explanations for deep graph models. It uses the Monte Carlo Tree Search (MCTS) algorithm (Silver et al. 2017) to explore different subgraphs efficiently by pruning nodes, and chooses the most important subgraph from the leaves of the search tree as the explanation for the prediction.

Furthermore, Shapley values can be used as the objective function for updating the mask generation algorithm. The subgraphs it produces are more understandable to humans and better suited to graph data than those of previous perturbation-based approaches. However, the computational cost is higher because the MCTS algorithm must explore many different subgraphs.
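To make the role of Shapley values concrete, the sketch below computes them exactly for a tiny cooperative game in which a candidate subgraph is treated as one player and the remaining nodes as the other players. The game `toy_value` is hypothetical, and exact enumeration over all orderings is used only because the example is tiny; it is not meant to reproduce how SubgraphX evaluates subgraphs on a real model.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution of each player
    over all orderings (feasible only for a handful of players)."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical game: the candidate subgraph "S" is one player and the
# remaining nodes "a" and "b" are the other players.
def toy_value(coalition):
    score = 0.0
    if "S" in coalition:
        score += 0.6
    if "a" in coalition:
        score += 0.1
    if "S" in coalition and "b" in coalition:
        score += 0.2  # "b" only helps when the subgraph is present
    return score

phi = shapley_values(["S", "a", "b"], toy_value)
```

The interaction term is split equally between "S" and "b", so the subgraph's score reflects its interplay with neighbouring structure rather than only its stand-alone prediction.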

9 Success stories about using DL in drug discovery

As DL methodologies have advanced, big pharmaceutical companies have migrated toward AI, moving away from conventional approaches in order to maximize both patient benefit and company profit. AstraZeneca is a multinational, science-driven pharmaceutical company that has successfully used artificial intelligence in each stage of drug development, from virtual screening to clinical trials. By incorporating AI into medical science, the company could better understand current diseases, identify new targets, plan higher-quality clinical trials, and speed up the entire process. AstraZeneca's success illustrates how combining AI with medical science can yield strong results. Its collaborations with other AI-based companies demonstrate its continuing efforts to increase AI utilization. One such collaboration is with Ali Health, an Alibaba subsidiary that aims to provide AI-assisted screening and diagnosis systems in China (Nag et al. 2022).

The SARS-CoV-2 outbreak placed many businesses under pressure to develop the best medicine in the shortest possible time. These businesses have turned to AI, in conjunction with the available data, to attain their goals. Below are some examples of firms whose efforts have succeeded in identifying viable strategies to combat COVID-19.

Deargen, a South Korean startup, developed MT-DTI (Molecule Transformer Drug Target Interaction Model), a DL-based drug-protein interaction prediction model. In this approach, the strength of the interaction between a drug and its target protein is predicted using simplified chemical sequences (SMILES strings) rather than 2D or 3D molecular structures. The model predicted that the FDA-approved antiviral drug atazanavir, a therapy for HIV, is highly likely to bind to and inhibit a critical protein of SARS-CoV-2, the virus that causes COVID-19. It also identified three more antivirals, as well as Remdesivir, which at the time was a not-yet-approved medicine being studied in patients. Deargen's ability to uncover antivirals using DL approaches is a significant step forward in pharmaceutical research, making it less time-consuming and more efficient. If such treatments are thoroughly evaluated, there is a good chance that we will be able to stop the epidemic in its tracks (Beck et al. 2020; Scudellari 2020).
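To illustrate what working from chemical sequences rather than 2D or 3D structures means in practice, the following sketch turns a SMILES string into integer tokens that a sequence model could consume. It is a simplified illustration and not the MT-DTI pipeline: the regular expression covers only a small subset of SMILES syntax, and aspirin is used purely as a familiar example molecule.

```python
import re

# Two-letter halogens and bracket atoms must be matched before single characters.
SMILES_TOKEN = re.compile(r"Cl|Br|\[[^\]]+\]|%\d{2}|.")

def tokenize_smiles(smiles):
    """Split a SMILES string into atom, bond, ring and branch tokens."""
    return SMILES_TOKEN.findall(smiles)

def encode(tokens, vocab):
    """Map tokens to integer ids, growing the vocabulary as new tokens appear."""
    return [vocab.setdefault(tok, len(vocab)) for tok in tokens]

aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
tokens = tokenize_smiles(aspirin)
vocab = {}
ids = encode(tokens, vocab)
```

The resulting id sequence plays the same role as a tokenized sentence in a language model, which is why Transformer-style architectures can be applied to molecules without explicit structural coordinates.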

Another example is Benevolent AI, a biotechnology company in London that leverages medical information, AI, and machine learning to speed up health-related research. They have identified six medicines so far, one of which, Ruxolitinib, is reported to be in clinical trials for COVID-19 (Gatti et al. 2021). To find prospective medications that might impede the viral replication process of SARS-CoV-2, the company has been using a massive repository of medical information together with data extracted from the scientific literature by its AI and ML system. FDA permission was granted to use Baricitinib, the medication they proposed, in combination with Remdesivir, which resulted in a higher recovery rate for hospitalized COVID-19 patients (Richardson et al. 2020).

Skin cancer is one of the most common cancers worldwide. As its incidence continues to rise, diagnosing it at an early stage is becoming increasingly crucial; research demonstrates that early identification and therapy improve the survival rate of skin cancer patients. With the advancement of medical research and AI, several skin cancer smartphone applications have been introduced to the market, allowing people with worrisome lesions to check whether they should seek medical care. According to studies, over 235 dermatology smartphone apps were developed between 2014 and 2017 (Flaten et al. 2020). Previously, such apps worked by sending a photograph of the lesion over the internet to a health care provider. Now, thanks to AI algorithms running on the smartphone, these applications can classify images of lesions as high or low risk, immediately assess the patient's risk, and offer advice. SkinVision (Carvalho et al. 2019) is an example of a successful application.

10 Future challenges

10.1 Digital twinning in drug discovery

The development and implementation of emerging Industry 4.0 technologies allow for the creation of digital twins (DTs), which promote the transformation of the industrial sector into a more agile and intelligent one. A DT is a digital representation of a real entity that interacts with the original through dynamic, two-way links. Today, DTs are being used in a variety of industries. Even though the pharmaceutical sector has embraced digitization in line with Industry 4.0, there is yet to be a comprehensive implementation of DT in pharmaceutical manufacturing. It is therefore vital to assess the pharmaceutical industry's progress in applying DT solutions (Chen et al. 1088).

New digital technologies are essential in today's competitive marketplaces to promote innovation, increase efficiency, and increase profitability (Legner et al. 2017 ). AI (Venkatasubramanian 2019 ), Internet of Things (IoT) devices (Venkatasubramanian 2019 ; Oztemel and Gursev 2018 ), and DTs have all piqued the interest of governments, agencies, academic institutions, and corporations (Bao et al. 2018 ). Industry 4.0 is a concept offered by a professional community to increase the level of automation to boost productivity and efficiency in the workplace.

This section provides a brief look at the evolution of DT and its application in pharmaceutical and biopharmaceutical production. We begin with an overview of the technology's principles and a brief history, then present various examples of DTs in pharmacology and drug discovery. This is followed by a discussion of the significant technical and other issues that arise in these kinds of applications.

10.1.1 History and main concepts of digital twin

The idea of making a "twin" of a process or a product dates back to NASA's Apollo project in the late 1960s (Rosen et al. 2015; Mayani et al. 2018; Schleich et al. 2017), when two identical spacecraft were built. In this scenario, the "twin" was used to mirror its counterpart's behaviour in real time.

The DT, according to Guo et al. (2018), is a type of digital data structure that is generated as a separate entity and linked to the actual system. Michael Grieves first presented the concept of a DT in 2002 at the University of Michigan as part of an industry presentation on product lifecycle management (PLM) (Grieves 2014; Grieves and Vickers 2017; Stark et al. 2019). However, the first actual use of this notion, which gave rise to the current name, occurred in 2010, when NASA (the United States National Aeronautics and Space Administration) attempted to create virtual spacecraft simulators for testing (Glaessgen and Stargel 2012).

In theory, a DT is a digital reproduction or representation of a physical thing, process, or service. It is a computer simulation with unique features that dynamically connect the physical and digital worlds. The purpose of a DT is to model, evaluate, and improve a physical object in virtual space until it matches the expected performance, at which time it can be created, or enhanced if already built, in the real world (Kamel et al. 2021; Marr 2017).

Since then, DT technology has gained popularity in both business and academia. The main components of DTs are shown in Fig. 14. The conceptual model comprises three parts: the real entity in the actual world, the digital entity in the virtual space, and the interconnection between them (Glaessgen and Stargel 2012).

Fig. 14 Main components of DT

In an ideal world, the digital component would have all the system's information that could be acquired from its physical counterpart (Kritzinger et al. 2018 ). When integrated with AI, IoT, and other recent intelligent systems, a DT can forecast how an object or process will perform.
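The three-part conceptual model (real entity, digital entity, and the connection between them) can be expressed as a small sketch. All class names, the toy reactor dynamics, and the trivial control rule are assumptions made for illustration; a real DT would replace them with validated process models and live sensor feeds.

```python
from dataclasses import dataclass, field

@dataclass
class PhysicalEntity:
    """Stand-in for a real asset, e.g. a reactor with a temperature setpoint."""
    temperature: float
    setpoint: float

    def step(self):
        # Move halfway toward the setpoint each time step (toy dynamics).
        self.temperature += 0.5 * (self.setpoint - self.temperature)

@dataclass
class DigitalEntity:
    """Virtual counterpart holding the mirrored state and its history."""
    temperature: float = 0.0
    history: list = field(default_factory=list)

    def recommend_setpoint(self, target):
        # Trivial "analysis": steer the physical entity toward the target.
        return target

class Connection:
    """Two-way link: data flows physical -> digital, decisions flow back."""
    def __init__(self, physical, digital):
        self.physical, self.digital = physical, digital

    def sync(self, target):
        self.digital.temperature = self.physical.temperature
        self.digital.history.append(self.physical.temperature)
        self.physical.setpoint = self.digital.recommend_setpoint(target)

reactor = PhysicalEntity(temperature=20.0, setpoint=20.0)
twin = DigitalEntity()
link = Connection(reactor, twin)
for _ in range(3):
    link.sync(target=60.0)
    reactor.step()
```

The point of the sketch is the direction of the arrows: measurements travel from the physical entity to the twin, and decisions travel back, which is what distinguishes a DT from an offline simulation.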

10.1.2 Digital twin in pharmaceutical manufacturing

Developing a drug is lengthy and costly, requires efforts in biology, chemistry, and manufacturing, and has a low success rate. An estimated 50,000 hits (trial versions of compounds that are subsequently modified to develop a medication) are evaluated to develop one successful drug. Only one in every 12 therapeutic compounds that enter clinical trials in humans makes it to market. Toxicity and lack of efficacy (a medication's capacity to give a patient relief and slow the progression of a disease) account for more than 60% of all drug failures (Subramanian 2020).

Making the appropriate decisions about which targets, hits, leads, and compounds to pursue is important to a drug's successful market introduction. However, these decisions are based on in vitro systems (experiments in a test tube or petri dish) and in vivo systems (experiments in animals), both of which correlate poorly with clinical outcomes (Mak et al. 2014). An ideal decision support system for drug discovery would answer the following questions:

What is the magnitude of any target's influence on the desired clinical result?

Is the potential compound changing the target enough to change clinical outcomes?

Is the chemical sufficiently selective and free of side effects or harmful consequences?

Is the ineffectiveness attributable to the drug's failure to reach its target?

Has the trial chosen the appropriate dose and dosing regimen?

Are there any surrogate markers or biomarkers, such as cholesterol, that serve as a proxy for the illness's root cause and can forecast a drug's success or failure?

Have the correct patients been chosen for the study?

Is it possible to identify hyper- and hypo-responders before the study begins?

Therapeutic failures are prevalent and difficult to address, given the complex process of developing drugs based on the points above. This issue must be addressed by combining data and observations from many stages of the drug development process and developing a system that can forecast an experiment's outcome or a chemical modification's influence on a therapeutic molecule. This highlights the significance of DT in the field of drug discovery.

In the United States, funding organizations such as DARPA, NSF, and DOE have actively supported bioprocess modeling at the genomic and cellular levels, resulting in high-profile programs such as BioSPICE (Kumar and Feidler 2003). These groups have shown that smaller models built to answer specific questions can greatly improve drug development efficiency. This would make it possible to apply the prediction methodology to various stages of the drug discovery and research process, including target validation, lead optimization, candidate selection, biomarker identification, assay and screen development, and the improvement of clinical trials.

The pharmaceutical industry is embracing the overall digitization trend, in tandem with the US FDA's ambition to establish an agile, adaptable pharmaceutical manufacturing sector that delivers high-quality pharmaceuticals without extensive regulatory oversight (O’Connor et al. 2016). Companies are beginning to implement Industry 4.0 and DT principles and to use them in research and development (Barenji et al. 2019; Steinwandter et al. 2019; Lopes et al. 2019; Kumar et al. 2020; Reinhardt et al. 2020). Pharma 4.0 (Ierapetritou et al. 2016) is a digitalization initiative that integrates Industry 4.0 with International Council for Harmonisation (ICH) criteria to define a combined operational model and production control plan.

As shown in Fig. 15, live monitoring of the system by Process Analytical Technology (PAT), data collection from the machinery and from the intermediate and finished products, and global modelling and data analysis software are some of the key requirements for achieving smart manufacturing with DT (Barenji et al. 2019). Quality-by-Design (QbD) and Continuous Manufacturing (CM) (Boukouvala et al. 2012), flowsheet modeling (Kamble et al. 2013), and PAT implementations (James et al. 2006) have all been used by the pharmaceutical industry to achieve this. Although some of these tools have been thoroughly examined, the full integration and development of DTs is still a work in progress.

Fig. 15 Main categories of smart manufacturing with DT

The pharmaceutical industry has used PAT in different programs across the steps involved in producing drugs (Nagy et al. 2013). Even though this has led to a rise in the use of PAT instruments, their implementation is mostly limited to research and development rather than large-scale manufacturing (Papadakis et al. 2018). In the small number of cases where they have been used in manufacturing, they have succeeded in decreasing production costs and enhancing product quality monitoring (Simon et al. 2019). The development of various PAT approaches, together with their sound implementation, is a vital component of a monitoring and control strategy (Boukouvala et al. 2012) and provides a foundation for obtaining essential data from the physical component.

Papadakis et al. (2018) recently provided a framework for identifying efficient reaction pathways for pharmaceutical manufacturing (Rantanen and Khinast 2015), which comprises modeling workflows for reaction pathway identification, reaction and separation analysis, process simulation, evaluation, optimization, and operation (Sajjia et al. 2017).

To develop models, data-driven modeling methods require data from a large number of experiments, and the resulting models rely solely on the datasets provided. Artificial neural networks (ANN) (Pandey et al. 2006; Cao et al. 2018), multivariate statistical analysis, and Monte Carlo methods (Badr and Sugiyama 2020) are all commonly used in pharmaceutical manufacturing. These methods are less computationally costly, but prediction outside the dataset space is frequently unsatisfactory because the trained models lack an understanding of the underlying physics. Using IoT devices in pharmaceutical manufacturing lines results in massive volumes of collected data. The virtual component must receive this collection of process data and critical quality attributes (CQAs) quickly and effectively. Additionally, several pharmaceutical process models need material properties for accurate prediction. As a result, a central database is necessary to give the virtual component access to all datasets (Lin-Gibson and Srinivasan 2019).

10.1.3 Digital twin in biopharmaceutical manufacturing

The synthesis of large-molecule-based products in various forms, with applications in the treatment of inflammatory, microbial, and cancer-related diseases, is the focus of biopharmaceutical manufacturing (Glaessgen and Stargel 2012; Narayanan et al. 2020). The demand for biologic-based medications has risen in recent years, necessitating greater production efficiency and efficacy (Kamel et al. 2021). As a result, many businesses are switching from batch to continuous production and implementing intelligent manufacturing systems (Lin-Gibson and Srinivasan 2019). A DT, which incorporates the physical plant, data collection, data analysis, and system control, can aid in decision-making, risk analysis, product development, and process prediction (Tao et al. 2018).

The components and structures of biological products are closely connected to treatment effectiveness (Read et al. 2010) and are very sensitive to the cell line and operating conditions. A thorough virtual description of the actual plant in a simulation environment is required to apply DT in biopharmaceutical manufacturing (Tao et al. 2018). This means that the simulation of each unit operation inside an integrated model should accurately reflect the crucial process dynamics. Previous reviews by Narayanan et al. (2020), Tang et al. (2020), Farzan et al. (2017), Baumann and Hubbuch (2017), Smiatek et al. (2020), and Olughu et al. (2019) focused on process modelling methodologies for both upstream and downstream operations.

Data from a biopharmaceutical monitoring system are typically heterogeneous in data type and time scale. A considerable amount of data is collected during biopharmaceutical manufacture thanks to the deployment of real-time PAT sensors. As a result, data pre-processing is required to handle missing data, visualize data, and reduce dimensionality (Gangadharan et al. 2019). For batch biopharmaceutical production, Casola et al. (2019) presented data mining-based techniques for stemming, classifying, filtering, and clustering historical real-time data. Lee et al. (2012) combined different spectroscopic techniques and used data fusion to forecast the composition of raw materials.
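Two of the pre-processing steps mentioned above, handling missing data and reducing dimensions, can be sketched as follows. The sensor log is hypothetical, and mean imputation and variance filtering are chosen here only because they are the simplest representatives of each step, not because the cited works use them.

```python
from statistics import mean, pvariance

def impute_column_means(rows):
    """Replace None entries with the mean of the observed values in that column."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]

def drop_low_variance(rows, threshold):
    """Crude dimension reduction: keep only columns whose variance exceeds threshold."""
    columns = list(zip(*rows))
    keep = [j for j, col in enumerate(columns) if pvariance(col) > threshold]
    return [[row[j] for j in keep] for row in rows], keep

# Hypothetical sensor log: columns are pH, temperature (deg C), agitation (rpm).
readings = [
    [7.0, 36.5, 200.0],
    [None, 37.0, 200.0],
    [7.2, None, 200.0],
    [7.1, 37.5, 200.0],
]
filled = impute_column_means(readings)
reduced, kept = drop_low_variance(filled, threshold=1e-6)
```

Here the constant agitation column carries no information and is dropped, leaving the pH and temperature columns for downstream modelling.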

10.2 AI-driven digital twins in today's pharmaceutical drug discovery

In the pharmaceutical industry, challenges emerging from clinical studies make drug development incomplete, slow, uncertain, and possibly dangerous. For example, clinical trials are not a true reflection of reality: only a small portion of a large and diverse population, out of the billions of people on the planet, is represented, so it is not possible to see how each person will respond to a medicine. The strict physical and mental health requirements of clinical trials can also lead to failure because of a lack of eligible participants. Pharmaceutical firms struggle to recruit the precise number and kind of participants needed to comply with the stringent requirements of clinical trial designs. Also, in most trials some participants receive a placebo instead of the actual drug, to show how sick individuals fare when they are not given the experimental medication; this implies that at least some trial participants do not receive the treatment. These issues can be addressed by using digital twins, which can imitate a range of patient characteristics, giving a fair representation of how a medicine affects a larger population. AI-enabled digital twinning may shorten trial setup by revealing how a patient fares against the various inclusion and exclusion criteria, so that patients can be identified rapidly. Digital twins can also predict a patient's response, so that placebos may not be required. The new treatment could therefore be given to every patient in the trial, and digital twins can reduce the harmful impact of drugs in the early stages by decreasing the number of patients who need to be tested in the real world. Figure 16 illustrates a framework in which all possible combinations are run: all treatment protocols are tested on a digital twin of the patient to discover an appropriate treatment protocol for this patient. Doing this quickly and accurately can provide the best-quality treatment for the patient without experimenting on the patient, which saves effort and cost and improves accuracy in determining an appropriate treatment protocol.
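The idea of running all treatment protocols on a patient's digital twin can be sketched as an exhaustive search. Everything below is hypothetical: the response model `simulate_response`, the drug names, the sensitivity values, and the toxicity limit are placeholders for what would in practice be a validated, patient-specific model.

```python
from itertools import product

def simulate_response(twin, drug, dose_mg, interval_h):
    """Hypothetical twin model returning (efficacy, toxicity) in [0, 1]."""
    exposure = dose_mg / interval_h * twin["sensitivity"][drug]
    efficacy = exposure / (exposure + 1.0)       # saturating benefit
    toxicity = min(1.0, 0.02 * exposure ** 2)    # harm grows faster than benefit
    return efficacy, toxicity

def best_protocol(twin, drugs, doses_mg, intervals_h, max_toxicity):
    """Test every combination on the twin and keep the most effective safe one."""
    best, best_eff = None, -1.0
    for drug, dose, interval in product(drugs, doses_mg, intervals_h):
        eff, tox = simulate_response(twin, drug, dose, interval)
        if tox <= max_toxicity and eff > best_eff:
            best, best_eff = (drug, dose, interval), eff
    return best, best_eff

patient_twin = {"sensitivity": {"drug_A": 0.5, "drug_B": 0.2}}  # assumed values
protocol, efficacy = best_protocol(
    patient_twin, ["drug_A", "drug_B"], [50, 100, 200], [8, 12, 24], max_toxicity=0.3
)
```

Exhaustive enumeration is feasible only for small protocol spaces; with more drugs, doses, and schedules the number of combinations grows multiplicatively, and a guided search would be needed.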

Fig. 16 AI-driven digital twins in today's pharmaceutical drug discovery

11 Open problems

This section discusses important issues in the progression from preclinical to clinical development and in practical implementation. These issues call for new ML solutions that support transparent, usable, and data-driven decision-making in order to accelerate drug discovery and decrease the number of failures in the clinical development phases.

Complex disorders, such as viral infections and advanced malignancies, frequently necessitate drug combinations (Julkunen et al. 2020; White et al. 2021). For example, kinase inhibitor combinations, or single compounds that block several kinases, may improve therapeutic efficacy and duration while combating treatment resistance in cancer (Attwood et al. 2021). While several ML models have been created to predict the response to pairwise drug–dose combinations, systematically predicting higher-order combination effects involving more than two drugs or targets remains a problem. In cancer cell lines, tensor learning methods have permitted reliable prediction of pairwise drug combination dose-response matrices (Smiatek et al. 2020). This computationally efficient learning approach could use extensive pharmacogenomic data to determine which drug combinations are most promising for additional in vitro or in vivo testing in many kinds of preclinical models, including higher-order combinations among novel therapeutic compounds and doses.

While potential toxicity and on-target efficacy are both important criteria for clinical development success, most existing ML models for predicting therapy response emphasize efficacy as the primary outcome. As a result, careful assessment and prediction of toxic effects in simulated and preclinical settings is required to strike an acceptable balance between therapeutic efficacy and toxicity and so accelerate the next stages of drug development (Narayanan et al. 2020). Applying single-cell data and ML algorithms to design anticancer drug combinations has shown the potential to boost the likelihood of clinical success (Tao et al. 2018). Transfer learning and in silico cell type deconvolution techniques (Avila et al. 2020) may offer effective ways to reduce the need to generate large amounts of single-cell data in order to predict combination therapy responders and toxic effects, as well as the recommended dosage that optimizes both efficacy and safety.

In addition, patient data and clinical profiles must be used to validate the in silico therapy response predictions. Such real-world validation of ML predictions is crucial for medical progress, for establishing practical value, and for providing clinical guidance in decision-making. For example, a no-go decision can be made early if the compound has harmful effects. Many of the present issues encountered when using machine learning for drug discovery, particularly in clinical development, arise because current AI algorithms do not meet the requirements for clinical research. As a result, ML model validation requires systematic, comprehensive, high-quality clinical data sets. The discovery methods must be thoroughly evaluated for accuracy and reproducibility using community-agreed performance measures in various settings, not just on a small collection of exemplary data sets. Sharing and exploiting private patient information is possible with systems that isolate the code from the data or use the model-to-data approach (Guinney and Saez-Rodriguez 2018), which makes it possible for federated learning to use patient-level data for model construction and thorough assessment.
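The principle behind federated learning, that model parameters travel while patient records stay at each site, can be shown with a minimal federated averaging sketch. The one-parameter linear model, the learning rate, and the two hospital datasets are assumptions chosen for brevity; this is not a description of any specific system cited above.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Train y = w * x on one site's private data; only w leaves the site."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_average(w_global, sites, rounds=10):
    """Each round: broadcast w, train locally, average weighted by site size."""
    for _ in range(rounds):
        updates = [(local_update(w_global, d), len(d)) for d in sites]
        total = sum(n for _, n in updates)
        w_global = sum(w * n for w, n in updates) / total
    return w_global

# Hypothetical dose (x) vs. response (y) records held by two hospitals;
# both follow y = 2x, and the raw records are never pooled.
site_a = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
site_b = [(4.0, 8.0), (5.0, 10.0)]
w = federated_average(0.0, [site_a, site_b])
```

The coordinating server sees only the locally trained values of `w` and the site sizes, which is the property that makes this style of training compatible with restrictions on sharing patient-level data.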

Even though there are many applications in drug discovery, the majority of ML models, and particularly DL models, remain "black boxes", and interpretation by a human specialist is sometimes difficult (Jiménez-Luna et al. 2020). Mathematical models implemented as online decision support tools must be understandable to users in order to gain their confidence. Comprehensible, accessible, and explainable models should clearly state the optimization goals, such as synergy, efficacy, and/or toxicity.

DTI prediction is a notable area of drug discovery research. Work in this area has been ongoing for more than 10 years and aims to improve the effectiveness of computational models using various technologies. The most recent computational approaches for predicting DTIs are DL technologies. These use non-structure-based approaches, which do not need 3D structural data or docking, to overcome the restrictions imposed by the high-dimensional structures of the drug and the target protein. Despite the outstanding performance of DL, regression-based DTI prediction remains a critical and difficult issue, and researchers could develop several strategies to improve prediction accuracy. Furthermore, data scarcity and the lack of a standardized benchmark database are still considered current research gaps.

While DL approaches show promise in detecting drug responses, especially when dealing with large amounts of data, drug response prediction research is in its first stages, and more efficient and relevant models are needed.

While DL techniques have shown to be effective in detecting DDIs, especially when dealing with large amounts of data, more promising algorithms that focus on complex molecular reactions need to be developed.

Only a few studies in the drug discovery field have investigated the explainability of their models, leaving much room for improvement. The explanations generated by XAI for human decision-making must be non-trivial, non-artificial, and useful to the scientific community. For now, ensuring that XAI techniques achieve their goals and produce trustworthy responses will require a combined effort among DL specialists, cheminformaticians, chemists, biologists, data scientists, and other subject matter experts. As a result, we believe that more developed methodologies for explaining black-box models in drug discovery fields such as DDIs, drug–target interactions, drug sensitivity, and drug side effects must be considered in the future to ensure model fairness or strict sensitivity evaluations of models. Further exploration of the capabilities and constraints of the existing chemical language for defining these models will be critical. The development of novel interpretable molecular representations for DL and the deployment of self-explanatory algorithms alongside sufficiently accurate predictions will be a critical area of research in the coming years. Because there are currently no methods that combine all the stated desirable XAI characteristics (transparency, justification, informativeness, and uncertainty estimation), consensus techniques that draw on the advantages of many XAI approaches and boost model reliability will play a major role in the short and medium term. Currently, there is no open-community platform for exchanging and refining XAI software and model interpretations in drug discovery. As a result, we believe that future study of XAI in drug development has much potential.

12 Discussion

This section briefly describes how the analytical questions posed in Sect. 2 are answered throughout the paper.

Several DL algorithms have been used to address the different categories of drug discovery problems, as illustrated in detail in Sect. 4 with respect to the main categories of drug discovery problems in Fig. 8. In addition, a summary of a sample of these algorithms, with their methods, advantages, and weaknesses, is presented in Table 2.

Recognizing the characteristics that make medications suitable targets for precision dosing will help direct resources to where they will have the most impact. Employing DL in drug dosing optimization is a major challenge, and one that can improve health care performance, safety, and cost-effectiveness, as presented in Sect. 7.

With the advancement of DL methods, big pharmaceutical businesses have migrated toward AI. One example is AstraZeneca, a multinational pharmaceutical company that has successfully used AI in every stage of drug development. Several success stories are presented in Sect. 9.

AQ4: What about using the newest technologies such as XAI and DT in drug discovery?

The topic of XAI addresses one of the most serious flaws in ML and DL algorithms: the lack of model interpretability and explainability. It would be impossible to trust the predictions of real-world AI applications without interpretability and explainability. Section 8 presents the literature that addresses this issue. A digital twin (DT) is a virtual representation of a real entity that is connected to it in dynamic, reciprocal ways. Today, DTs are being used in a variety of industries. Even though the pharmaceutical sector has embraced digitization in line with Industry 4.0, there is yet to be a comprehensive implementation of DT in pharmaceutical manufacturing. Work on employing DT in drug discovery is presented in Sect. 10.

AQ5: What are the future directions and open problems related to drug discovery and DL?

Throughout the paper, we show how DL has succeeded in all aspects of drug discovery problems. However, it still poses very important challenges for future research. Section 11 covers these challenges.

Figure 17 presents the percentage of the different DL applications for each building block of our study. The largest segment is dedicated to drug discovery and DL because it is the core of our research.

Fig. 17 Percentages of DL applications for each category

13 Conclusion

Despite all the breakthroughs in pharmacology, developing new drugs still requires considerable time and cost. As DL technology advances and the amount of drug-related data grows, a slew of new DL-based approaches are emerging at every stage of the drug development process. In addition, we have seen large pharmaceutical corporations migrate toward AI in the wake of the development of DL approaches.

Although drug discovery is a large field with different research categories, there are few review studies of this field, and each related study has focused on only one research category, such as reviewing the DL applications for DTIs. The main goal of our research is therefore to present a systematic literature review (SLR) that integrates the recent DL technologies and applications for the different categories of drug discovery problems, including drug–target interactions (DTIs), drug–drug similarity interactions (DDIs), drug sensitivity and responsiveness, and drug side-effect predictions, together with the associated benchmark data sets and databases. Related topics such as XAI and DT, and how they support drug discovery problems, are also discussed. In addition, drug dosing optimization and success stories are presented. Finally, we suggest open problems as future research challenges.

Although DL has proved its strength in drug discovery problems, it remains a promising open research area for interested researchers. This paper offers them an overview of the use of DL in various drug discovery problems, together with success stories and open areas for future research.

Given the recent success of DL approaches and their use by pharmaceutical companies in identifying new medications, it seems clear that current DL techniques will be highly regarded in the next generation of large-scale data investigation and evaluation for drug discovery and development.

References

Abramovich I, Ben-Yehuda T, Cohen R (2018) Low-complexity video classification using recurrent neural networks. IEEE Int Conf Sci Electr Eng Israel (ICSEE) 2018:1–4. https://doi.org/10.1109/ICSEE.2018.8646076

Adadi A, Mohammed B (2018) Peeking inside the black-box: a survey on explainable artificial intelligence (XAI). IEEE Access 6:2169–3536

Google Scholar  

Ahmed KT, Park S, Jiang Q et al (2020) Network-based drug sensitivity prediction. BMC Med Genomics 13:193

Alankrita A, Mamta M, Gopi B (2021) Generative adversarial network: an overview of theory and applications. Int J Inf Manag Data Insights 1(1):100004

Amashita R, Nishio M, Do RKG et al (2018) Convolutional neural networks: an overview and application in radiology. Insights Imaging 9:611–629. https://doi.org/10.1007/s13244-018-0639-9

Andreea D, Yu-Hsiang H, Petar V, Pietro L, Jian T (2019) Drug–drug adverse effect prediction with graph co-attention. https://arxiv.org/abs/1905.00534

Arshed MA, Mumtaz S, Riaz O, Sharif W, Abdullah S (2022) A deep learning framework for multi drug side effects prediction with drug chemical substructure. Int J Innovat Sci Technol 4(1):19–31

Arus-Pous J, Patronov A, Bjerrum EJ, Tyrchan C, Reymond JL, Chen H, Engkvist O (2020) SMILES-based deep generative scaffold decorator for de-novo drug design. J Cheminform 12:1–18

Asada M, Miwa M, Sasaki Y (2018) Enhancing drug–drug interaction extraction from texts by molecular structure information. In: proceedings of the 56th annual meeting of the association for computational linguistics. 2, pp 680–685, https://doi.org/10.18653/v1/P18-2108

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29

Attwood MM, Fabbro D, Sokolov AV et al (2021) Trends in kinase drug discovery: targets, indications and inhibitor design. Nat Rev Drug Discov 20(11):839–861

Avila C, Alquicira-Hernandez J, Powell JE et al (2020) Benchmarking of cell type deconvolution pipelines for transcriptomics data. Nat Commun 11(1):5650

Azad AKM, Dinarvand M, Nematollahi A, Swift J, Lutze-Mann L, Vafaee F (2021) A comprehensive integrated drug similarity resource for in-silico drug repositioning and beyond. Brief Bioinform 22(3):bbaa126. https://doi.org/10.1093/bib/bbaa126

Badr S, Sugiyama H (2020) A PSE perspective for the efficient production of monoclonal antibodies: integration of process, cell, and product design aspects. Curr Opin Chem Eng 27:121–128

Bao J, Guo D, Li J, Zhang J (2018) The modelling and operations for the digital twin in the context of manufacturing. Enterp Inf Syst 13:534–556

Baptista D, Ferreira PG, Rocha M (2021) Deep learning for drug response prediction in cancer. Briefings Bioinform 22:360–379

Barenji RV, Akdag Y, Yet B, Oner L (2019) Cyber-physical-based PAT (CPbPAT) framework for Pharma 4.0. Int J Pharm 567:118445

Baumann P, Hubbuch J (2017) Downstream process development strategies for effective bioprocesses: Trends, progress, and combinatorial approaches. Eng Life Sci 17:1142–1158

Beck BR, Shin B, Choi Y, Park S, Kang K (2020) Predicting commercially available antiviral drugs that may act on the novel coronavirus (SARS-CoV-2) through a drug–target interaction deep learning model. Comput Struct Biotechnol J 18:784–790

Bedi P, Sharma C, Vashisth P, Goel D, Dhanda M (2015) Handling cold start problem in Recommender Systems by using Interaction Based Social Proximity factor. In: Proceeding of the 2015 international conference on advances in computing, communications and informatics, Kerala, India, 10–13 August 2015; pp 1987–1993

Benedek R, Stephen B, Andriy N, Michael U, Sebastian N, Eliseo P (2021) A unified view of relational deep learning for drug pair scoring. coRR V. https://arxiv.org/abs/2111.02916 .

Betsabeh T, Mansoor ZJ (2021) Using drug–drug and protein-protein similarities as feature vector for drug–target binding prediction. Chemom Intell Lab Syst 217:104405. https://doi.org/10.1016/j.chemolab.2021.104405

Bleakley K, Yamanishi Y (2009) Supervised prediction of drug–target interactions using bipartite local models. Bioinformatics 25:2397–2403

Bolukbasi T (2016) Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 2016; 29. In Identifying gender and sexuality of data subjects. https://cis.pubpub.org/pub/debiasing-word-embeddings-2016 .

Bongini P, Pancino N, Dimitri GM, Bianchini M, Scarselli F, Lio P (2022) Modular multi-source prediction of drug side-effects with DruGNN. http://arxiv.org/abs/2202.08147 .

Boobier S, Osbourn A, Mitchell JB (2017) Can human experts predict solubility better than computers? J Cheminform 9:63

Boukouvala F, Niotis V, Ramachandran R, Muzzio FJ, Ierapetritou MG (2012) An integrated approach for dynamic flowsheet modeling and sensitivity analysis of a continuous tablet manufacturing process. Comput Chem Eng 42:30–47

Brown AS, Patel CJ (2017) MeSHDD: literature-based drug-drug similarity for drug repositioning. J Am Med Inf Assoc 24(3):614–618

Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ (2018) Next-generation machine learning for biological networks. Cell 173:1581–1592

Campillos M et al (2008) Drug target identification using side-effect similarity. Science 321(5886):263–666. https://doi.org/10.1126/science.1158140

Cao H, Mushnoori S, Higgins B, Kollipara C, Fermier A, Hausner D, Jha S, Singh R, Ierapetritou M, Ramachandran R (2018) A systematic framework for data management and integration in a continuous pharmaceutical manufacturing processing line. Processes 6:53

Casola G, Siegmund C, Mattern M, Sugiyama H (2019) Data mining algorithm for pre-processing biopharmaceutical drug product manufacturing records. Comput Chem Eng 124:253–269

Chabner BA (2016) NCI-60 cell line screening: a radical departure in its time. J Natl Cancer Inst. https://doi.org/10.1093/jnci/djv388

Chander A, Srinivasan R, Chelian S, Wang J, Uchino K (2018) Working with beliefs: AI transparency in the enterprise. In: Joint proceedings of the ACM IUI 2018 workshops co-located with the 23rd acm conference on intelligent user interfaces 2068 (eds Said, A. and Komatsu, T.) (CEUR-WS.org, 2018)

Chandra B, Sharma RK (2017) On improving recurrent neural network for image classification. Int Jt Conf Neural Netw (IJCNN) 2017:1904–1907. https://doi.org/10.1109/IJCNN.2017.7966083

Chang Y, Park H, Yang HJ, Lee S, Lee KY, Kim TS, Jung J, Shin JM (2018) Cancer drug response profile scan (CDRscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Sci Rep 8:1–11

Chauhan R, Ghanshala KK, Joshi RC (2018) Convolutional neural network (CNN) for image detection and recognition. First Int Conf Secure Cyber Comput Commun (ICSCCC) 2018:278–282. https://doi.org/10.1109/ICSCCC.2018.8703316

Chen AW (2018) Predicting adverse drug reaction outcomes with machine learning. Int J Commun Med Public Health 5(3):901–904

Chen JY, Mamidipalli S, Huan T (2009) Happi: an online database of comprehensive human annotated and predicted protein interactions. BMC Genomics 10(1):S16

Chen X, Liu M-X, Yan G-Y (2012) Drug–target interaction prediction by random walk on the heterogeneous network. Mol BioSyst 8:1970–1978. https://doi.org/10.1039/C2MB00002D

Chen Y, Yang O, Sampat C, Bhalode P, Ramachandran R, Ierapetritou M (2020) Digital twins in pharmaceutical and biopharmaceutical manufacturing: a literature review. Processes 8(9):1088. https://doi.org/10.3390/pr8091088

Cheng F, Kovács IA, Barabási AL (2019) Network-based prediction of drug combinations. Nat Commun 10(1):1–11

Chiu Y-C, Chen H-IH, Zhang T, Zhang S, Gorthi A, Wang L-J, Huang Y, Chen Y (2019) Predicting drug response of tumors from integrated genomic profiles by deep neural networks. BMC Med Genomics 12:119

Chu X, Lin Y, Gao J, Wang J, Wang Y, Wang L (2018) Multi-label robust factorization autoencoder and its applicationin predicting drug–drug interactions. arXiv:1811.00208 .

Chu X, Lin Y, Wang Y, Wang L, Wang J, Mlrda JG (2019) A multitask semi-supervised learning framework for drug–drug interaction prediction. In: proceedings of the international joint conference on artificial intelligence, pp 4518– 4524

Ciallella HL, Zhu H (2019) Advancing computational toxicology in the big data era by artificial intelligence: data-driven and mechanism-driven modeling for chemical toxicity. Chem Res Toxicol 32:536–547

Cortes-Ciriano I, Ain QU, Subramanian V, Lenselink EB, Méndez-Lucio O, IJzerman AP, Wohlfahrt G, Prusis P, Malliavin TE, van Westen GJP et al (2015) Polypharmacology modelling using proteochemometrics (PCM): recent methodological developments, applications to target families, and future prospects. Medchemcomm 6:24–50

Cortés-Ciriano I, Bender A (2019) KekuleScope: prediction of cancer cell line sensitivity and compound potency using convolutional neural networks trained on compound images. J Cheminform 11:1–16

Dai L, Zhu H, Liu D (2020) Patient similarity: methods and applications. http://arxiv.org/abs/2012.01976

David L, Arús-Pous J, Karlsson J, Engkvist O, Bjerrum EJ, Kogej T, Kriegl JM, Beck B, Chen H (2019) Applications of deep-learning in exploiting large-scale and heterogeneous compound data in industrial pharmaceutical research. Front Pharmacol 10:1303

Davis MI, Hunt JP, Herrgard S, Ciceri P, Wodicka LM, Pallares G, Hocker M, Treiber DK, Zarrinkar PP (2011) Comprehensive analysis of kinase inhibitor selectivity. Nat Biotechnol 29:1046–1051

De Carvalho TM, Noels E, Wakkee M, Udrea A, Nijsten T (2019) Development of smartphone apps for skin cancer risk assessment: progress and promise. JMIR Dermatol 2(1):e13376

De Kuijper GM, Risselada A, van Dijken R (2019) Monitoring drug side-effects. Handbook of intellectual disabilities. Springer, Cham, pp 275–301

“deepchem/deepchem: Democratizing Deep-Learning for Drug Discovery”; Quantum Chemistry, Materials Science and Biology; Available online: https://github.com/deepchem/deepchem (accessed on 15 April 2022).

Dey S, Luo H, Fokoue A, Hu J, Zhang P (2018) Predicting adverse drug reactions through interpretable deep learning framework. BMC Bioinform 19:476

Dincer AB, Celik S, Hiranuma N, Lee S-I (2018) DeepProfile: deep learning of cancer molecular profiles for precision medicine. bioRxiv. https://doi.org/10.1101/278739

Ding MQ, Chen L, Cooper GF, Young JD, Lu X (2018) Precision oncology beyond targeted therapy: combining omics data with machine learning matches the majority of cancer cells to effective therapeutics. Mol Cancer Res 16:269–278

Doshi-Velez F, Kim B (2017) Towards a rigorous science of interpretable machine learning. https://arxiv.org/abs/1702.08608

DrugBank (2019) DrugBank Release Version 5.1.3, chemical structures. https://www.drugbank.com

Dua D, Graff C (2017) UCI machine learning repository. https://archive.ics.uci.edu/ml/index.php

El-Deredy W et al (1997) Pretreatment prediction of the chemotherapeutic response of human glioma cell cultures using nuclear magnetic resonance spectroscopy and artificial neural networks. Cancer Res 57:4196–4199

Farzan P, Mistry B, Ierapetritou MG (2017) Review of the important challenges and opportunities related to modeling of mammalian cell bioreactors. AIChE J 63:398–408

Fatehifar M, Karshenas H (2021) Drug–drug interaction extraction using a position and similarity fusion-based attention mechanism. J Biomed Inf 115:103707. https://doi.org/10.1016/j.jbi.2021.103707

Feng S, et al (2018) Pathologies of neural models make interpretations difficult. http://arxiv.org/abs/1804.07781

Feng Q, Dueva E, Cherkasov A, Ester M (2018) PADME: a deep learning-based framework for drug–target interaction prediction. arXiv 2018; arXiv:1807.09741

Feng YH, Zhang SW, Shi JY (2020) DPDDI: a deep predictor for drug–drug interactions. BMC Bioinform 21:419. https://doi.org/10.1186/s12859-020-03724-x

Ferdousi R, Safdari R, Omidi Y (2017) Computational prediction of drug–drug interactions based on drugs functional similarities. J Biomed Inform. https://doi.org/10.1016/j.jbi.2017.04.021

Finn RD et al (2013) Pfam: the protein families database. Nucleic Acids Res 42(D1):D222–D230

Flaten HK, St Claire C, Schlager E, Dunnick CA, Dellavalle RP (2020) Growth of mobile applications in dermatology. Dermatol Online J 24(2):13–16

Fleischhack G, Massimino M, Warmuth-Metz M, Khuhlaeva E, Janssen G, Graf N et al (2019) Nimotuzumab and radiotherapy for treatment of newly diagnosed diffuse intrinsic pontine glioma (DIPG): a phase III clinical study. J Neurooncol 143:107–113. https://doi.org/10.1007/s11060-019-03140-z

Fokoue A, Sadoghi M, Hassanzadeh O, Zhang P (2016) Predicting drug–drug interactions through large-scale similarity-based link prediction. In: European semantic web conference 2016 May 29; pp 774–789

Fushman D, Shooshan SE, Rodriguez L, Aronson AR, Lang F, Rogers W, Tonning J (2018) A dataset of 200 structured product labels annotated for adverse drug reactions. Sci Data 5:180001

Gangadharan N, Turner R, Field R, Oliver SG, Slater N, Dikicioglu D (2019) Metaheuristic approaches in biopharmaceutical process development data analysis. Bioprocess Biosyst Eng 42:1399–1408

Gao Z et al (2008) PDTD: a web-accessible protein database for drug target identification. BMC Bioinf 9(1):104

Gao KY, Fokoue A, Luo H, Iyengar A, Dey S, Zhang P (2017) Interpretable drug target prediction using deep neural representation. In: Proceedings of the international joint conference on artificial intelligence, Melbourne, Australia, 19–25 August 2017

Gao K, Duy Nguyen D, Sresht V, Mathiowetz AM, Tu M, Wei G-W (2019) Are 2D fingerprints still valuable for drug discovery? Phys Chem Chem Phys 22:8373–8390

Gatti M, Turrini E, Raschi E, Sestili P, Fimognari C (2021) Janus kinase inhibitors and coronavirus disease (COVID)-19: rationale, clinical evidence and safety issues. Pharmaceuticals 14(8):738

Gaulton A et al (2011) ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res 40(D1):D1100–D1107

Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. 34th Int Conf Mach Learn ICML 3:2053–2070

Glaessgen EH, Stargel DS (2012) The digital twin paradigm for future NASA and US Air Force vehicles. In: Proceedings of the 53rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics and materials conference, Honolulu, HI, USA. https://ntrs.nasa.gov/citations/20120008178

Goebel R et al (2018) Explainable AI: the new 42? In: Holzinger A, Kieseberg P, Tjoa A, Weippl E (eds) Machine learning and knowledge extraction. CD-MAKE Lecture Notes in Computer Science. Springer, New York

Gómez-Bombarelli R et al (2018) Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent Sci 4:268–276

Grieves M (2014) Digital twin: manufacturing excellence through virtual factory replication. Glob J Eng Sci Res. https://doi.org/10.5281/zenodo.1493930

Grieves M, Vickers J (2017) Digital twin: mitigating unpredictable undesirable emergent behavior in complex systems. Springer, Cham, pp 85–113

Guidotti R et al (2018) A survey of methods for explaining black box models. ACM Comput Surv 51:93

Guinney J, Saez-Rodriguez J (2018) Alternative models for sharing confidential biomedical data. Nat Biotechnol 36(5):391–392

Gunther S et al (2007) SuperTarget and Matador: resources for exploring drug–target relationships. Nucleic Acids Res 36:D919–D922

Hamilton WL (2020) Graph representation learning. Synth Lect Artif Intell Mach Learn 14:1–159

MATH   Google Scholar  

Han X, Xie R, Li X, Li J (2022) SmileGNN: drug–drug interaction prediction based on the smiles and graph neural network. Life (basel). 12(2):319. https://doi.org/10.3390/life12020319

Hao M, Wang Y, Bryant SH (2016) Improved prediction of drug–target interactions using regularized least squares integrating with kernel fusion technique. Anal Chim Acta 909:41

Hassan-Harrirou H, Zhang C, Lemmin T (2020) RosENet: improving binding affinity prediction by leveraging molecular mechanics energies with an ensemble of 3D convolutional neural networks. J Chem Inf Model 60:2791–2802

He C, Liu Y, Li H, Zhang H, Mao Y, Qin X, Liu L, Zhang X (2022) Multi-type feature fusion based on graph neural network for drug-drug interaction prediction. BMC Bioinf 23(1):1–8

Hecker N et al (2011) SuperTarget goes quantitative: update on drug–target interactions. Nucleic Acids Res 40(D1):D1113–D1117

Hermanto A, Adji TB, Setiawan NA (2015) Recurrent neural network language model for English-Indonesian machine translation: experimental study. Int Conf Sci Inf Technol (ICSITech) 2015:132–136. https://doi.org/10.1109/ICSITech.2015.7407791

Hinton G (2011) Boltzmann machines. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning. Springer, Boston

Hirohara M, Saito Y, Koda Y, Sato K, Sakakibara Y (2018) Convolutional neural network based on SMILES representation of compounds for detecting chemical motif. BMC Bioinform 19:83–94

Hizukuri Y, Sawada R, Yamanishi Y (2015) Predicting target proteins for drug candidate compounds based on drug-induced gene expression data in a chemical structure-independent manner. BMC Med Genomics 8:82

Hou X, You J, Hu P (2019) Predicting drug–drug interactions using deep neural network. In: proceedings of the 11 th international conference on machine learning and computing, pp 168–172

http://zinc.docking.org

https://bioinf-applied.charite.de/supernatural_new/index.php .

https://friendsofcancerresearch.org/wpcontent/uploads/Optimizing_Dosing_in_Oncology_Drug_Development.pdf .

https://ncats.nih.gov/tox21

https://pharmacodb.pmgenomics.ca/datasets/4

https://sites.broadinstitute.org/ccle/

https://string-db.org/cgi/download.pl?sessionId=uKr0odAK9hPs

https://www.cancer.gov/about-nci/organization/ccct/ctrp

https://www.ebi.ac.uk/chebi/

https://www.sciencedirect.com/topics/drug-response

Hu J, Gao J, Fang X, Liu Z, Wang F, Huang W, Wu H, Zhao G (2022) DTSyn: a dual-transformer-based neural network to predict synergistic drug combinations. bioRxiv. https://doi.org/10.1101/2022.03.29.486200

Huang C-T et al (2018) A large-scale gene expression intensity-based similarity metric for drug repositioning. iScience 7:40–52

Huang K, Xiao C, Hoang TN, Glass LM, Sun J (2020) Caster: predicting drug interactions with chemical substructure representation. In: AAAI 2020 34th AAAI Conference on Artificial Intelligence, American Association for Artificial Intelligence (AAAI) Press, pp 702–709

Ibrahim H, El Kerdawy AM, Abdo A, Eldin AS (2021) Similarity-based machine learning framework for predicting safety signals of adverse drug–drug interactions. Inf Med Unlocked 26:100699

Ierapetritou M, Muzzio F, Reklaitis G (2016) Perspectives on the continuous manufacturing of powder-based pharmaceutical processes. AIChE J 62:1846–1862

Iorio F et al (2010) Discovery of drug mode of action and drug repositioning from transcriptional responses. PNAS 107(33):14621–14626. https://doi.org/10.1073/pnas.1000138107

Iorio F, Knijnenburg TA, Vis DJ, Bignell GR, Menden MP, Schubert M, Aben N, Gonçalves E, Barthorpe S, Lightfoot H et al (2016) A landscape of pharmacogenomic interactions in cancer. Cell 166:740–754

James M, Stanfield CF, Bir G (2006) A review of process analytical technology (PAT) in the US pharmaceutical industry. Curr Pharm Anal 2:405–414

Ji ZL, Han LY, Yap CW, Sun LZ, Chen X, Chen YZ (2003) Drug adverse reaction target database (DART). Drug Saf 26(10):685–690

Jiménez-Luna J, Grisoni F, Schneider G (2020) Drug discovery with explainable artificial intelligence. Nat Mach Intell 2(10):573–584

Julkunen H, Cichonska A, Gautam P et al (2020) Leveraging multi-way interactions for systematic prediction of pre-clinical drug combination effects. Nat Commun 11(1):6136

Kamath U, Liu J (2021) Explainable artificial intelligence: an introduction to interpretable machine learning. Springer, Cham

Kamble R, Sharma S, Varghese V, Mahadik K (2013) Process analytical technology (PAT) in pharmaceutical development and its application. Int J Pharm Sci Rev Res 23:212–223

Kamel Boulos MN, Zhang P (2021) Digital twins: from personalised medicine to precision public health. J Person Med 11(8):745

Kanehisa M, Goto S (2000) KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res 28(1):27–30

Karim MR, Cochez M, Jares JB, Uddin M, Beyan O, Decker S (2019) Drug–drug interaction prediction based on knowledge graph embeddings and convolutional-LSTM network. In: Proceedings of the 10th ACM international conference on bioinformatics, computational biology and health informatics, pp 113–123

Karim MR, Cochez M, Jares JB, Uddin M, Beyan O, Decker S (2019) Drug–drug interaction prediction based on knowledge graph embeddings and convolutional-LSTM network. In: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics 2019, pp 113–123

Karpov P, Godin G, Tetko IV (2020) Transformer-CNN: Swiss knife for QSAR modeling and interpretation. J Cheminform 12:17

Kastrin A, Ferk P, Leskošek B (2018) Predicting potential drug–drug interactions on topological and semantic similarity features using statistical learning. PLoS ONE 13(5):e0196865

Keum J, Nam H (2017) SELF-BLM: prediction of drug–target interactions via self-training SVM. PLoS ONE 12:e0171839

Kim S, Thiessen PA, Bolton EE, Chen J, Fu G, Gindulyte A, Han L, He J, He S, Shoemaker BA et al (2016) PubChem substance and compound databases. Nucleic Acids Res 44:D1202–D1213

Kim J, Park S, Min D, Kim W (2021) comprehensive survey of recent drug discovery using deep learning. Int J Mol Sci 22:9983. https://doi.org/10.3390/ijms22189983

Koes DR, Baumgartner MP, Camacho CJ (2013) Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise. J Chem Inf Model 53:1893–1904

Kohonen T (1990) The self-organizing map. Proc IEEE 78(9):1464–1480

Korkmaz S (2020) Deep learning-based imbalanced data classification for drug discovery. J Chem Inf Model 60:4180–4190

Kritzinger W, Karner M, Traar G, Henjes J, Sihn W (2018) Digital Twin in manufacturing: a categorical literature review and classification. IFAC-PapersOnLine 51:1016–1022

Kuenzi BM et al (2020) Predicting drug response and synergy using a deep learning model of human cancer cells. J Elsevier Cancer Cell 38(5):1535–6108. https://doi.org/10.1016/j.ccell.2020.09.014

Kuhn M et al (2010) A side effect resource to capture phenotypic effects of drugs. Mol Syst Biol 6(1):343

Kuhn M et al (2013) STITCH 4: integration of protein–chemical interactions with user data. Nucleic Acids Res 42(D1):D401–D407

Kumar SP, Feidler JC (2003) BioSPICE: a computational infrastructure for integrative biology. OMICS J Integr Biol 7(3):225. https://doi.org/10.1089/153623103322452350

Kumar S, Talasila D, Gowrav M, Gangadharappa H (2020) Adaptations of pharma 4.0 from industry 4.0. Drug Invent Today 14:405–415

Lamb J, Crawford ED, Peck D, Modell JW, Blat IC, Wrobel MJ, Lerner J, Brunet JP, Subramanian A, Ross KN et al (2006) The connectivity map: using gene-expression signatures to connect small molecules, genes, and disease. Science 313:1929–1935

Lapuschkin S et al (2019) Unmasking clever Hans predictors and assessing what machines really learn. Nat Commun 10:1096

Lee CY, Chen YP (2021) Descriptive prediction of drug side-effects using a hybrid deep learning model. Int J Intell Syst 36(6):2491–2510

MathSciNet   Google Scholar  

Lee H, Kim W (2019) Comparison of target features for predicting drug–target interactions by deep neural network based on large-scale drug-induced transcriptome data. Pharmaceutics 11:377

Lee HW, Christie A, Xu J, Yoon S (2012) Data fusion-based assessment of raw materials in mammalian cell culture. Biotechnol Bioeng 109:2819–2828

Lee G, Park C, Ahn J (2019) Novel deep learning model for more accurate prediction of drug–drug interaction effects. BMC Bioinform 20(1):415

Lee I, Keum J, Nam H (2019) DeepConv-DTI: prediction of drug–target interactions via deep learning with convolution on protein sequences. PLoS Comput Biol 15:1–21

Legner C, Eymann T, Hess T, Matt C, Böhmann T, Drews P, Mädche A, Urbach N, Ahlemann F (2017) Digitalization: opportunity and challenge for the business and information systems engineering community. Bus Inf Syst Eng 59:301–308

Lei T, Barzilay R, Jaakkola T (2016) Rationalizing neural predictions. In: 2016 conference on empirical methods in natural language processing, 2016; Austin, Texas: Association for computational linguistics, pp 107—117. https://aclanthology.org/D16-1011

Li M, Wang Y, Zheng R, Shi X, Wu F, Wang J, et al. (2019) Deepdsc: a deep learning method to predict drug sensitivity of cancer cell lines. IEEE/ACM transactions on computational biology and bioinformatics

Lian M, Du W, Wang X, Yao Q (2021) Drug–target interaction prediction based on multi-similarity fusion and sparse dual-graph regularized matrix factorization. IEEE Access 9:99718–99730. https://doi.org/10.1109/ACCESS.2021.3096830

Lin X, Quan Z, Wang Z-J, Ma T, Zeng X (2021) KGNN: knowledge graph neural network for drug–drug interaction prediction. In: Proceedings of the twenty-ninth international joint conference on artificial intelligence, Jaban; IJCAI'20

Lin-Gibson S, Srinivasan V (2019) Recent industrial roadmaps to enable smart manufacturing of biopharmaceuticals. IEEE Trans Autom Sci Eng 2019:1–8

Lipton ZC (2018) The mythos of model interpretability. Queue 16:31–57

Liu Y, Wu M, Miao C, Zhao P, Li X-L (2016) Neighborhood regularized logistic matrix factorization for drug–target interaction prediction. PLoS Comput Biol 12:e1004760

Liu B, Ramsundar B, Kawthekar P, Shi J, Gomes J, Luu Nguyen Q, Ho S, Sloane J, Wender P, Pande V (2017) Retrosynthetic reaction prediction using neural sequence-to-sequence models. R ACS Cent Sci 3:1103–1113

Liu N, Chen CB, Kumara S (2019) Semi-supervised learning algorithm for identifying high-priority drug–drug interactions. IEEE J Biomedic Health Inform. https://doi.org/10.1109/JBHI.2019.2932740

Liu K, Sun X, Jia L, Ma J, Xing H, Wu J, Gao H, Sun Y, Boulnois F, Fan J (2019a) Chemi-net: a molecular graph convolutional network for accurate drug property prediction. Int J Mol Sci 20:3389

Liu P, Li H, Li S, Leung KS (2019b) Improving prediction of phenotypic drug response on cancer cell lines using deep convolutional network. BMC Bioinform 20:408

Liu S, Huang Z, Qiu Y, Chen Y-PP, Zhang W (2019c) Structural network embedding using multi-modal deep auto-encoders for predicting drug–drug interactions. IEEE Int Conf Bioinform Biomed 2019:445–450. https://doi.org/10.1109/BIBM47256.2019.8983337

Liu S, Zhang Y, Cui Y, Qiu Y, Deng Y, Zhang W, Zhang Z (2021) Enhancing drug–drug interaction prediction using deep attention neural networks. BioRxiv. https://doi.org/10.1101/2021.03.16.435553

Lopes MR, Costigliola A, Pinto R, Vieira S, Sousa JMC (2019) Pharmaceutical quality control laboratory digital twin—a novel governance model for resource planning and scheduling. Int J Prod Res 58:1–15

Louizos C, Welling M, Kingma DP (2017) Learning sparse neural networks through l 0 regularization. http://arxiv.org/abs/1712.01312 .

Lu Y, Guo Y, Korhonen AJB (2017) Link prediction in drug–target interactions network using similarity indices. BMC Bioinf 18(1):39. https://doi.org/10.1186/s12859-017-1460-z

Luo Y, Zhao X, Zhou J, Yang J, Zhang Y, Kuang W, Peng J, Chen L, Zeng J (2017) A network integration approach for drug–target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun 8:573

Luo D, Cheng W, Xu D, Yu W, Zong B, Chen H, Zhang X (2020) Parameterized explainer for graph neural network. Adv Neural Inf Process Syst 33:19620–19631

Lyu T, Gao J, Tian L, Li Z, Zhang P, Zhang J (2021) MDNN: a multimodal deep neural network for predicting drug–drug interaction events. In: Proceedings of the thirtieth international joint conference on artificial intelligence (IJCAI-21), pp 3536–3542. https://doi.org/10.24963/ijcai.2021/487

Ma T, Xiao C, Zhou J, Wang F (2018) Drug similarity integration through attentive Multiview graph auto-encoders. In: IJCAI 2018, proceedings of the 27th international joint conference on artificial intelligence, pp 3477–3483

Mahajan D, Kumar D (2018) Sentiment analysis using RNN and Google translator. In: 2018 8th international conference on cloud computing, data science & engineering (Confluence), pp 798–802. https://doi.org/10.1109/CONFLUENCE.2018.8442924

Mak IWY, Evaniew N, Ghert M (2014) Lost in translation: animal models and clinical trials in cancer treatment. Am J Transl Res 6:114–118

Marr B (2017) What is digital twin technology and why is it so important? Forbes. https://www.forbes.com/sites/bernardmarr/2017/03/06/what-is-digital-twin-technology-and-why-is-it-so-important

Matsuzaka Y, Uesawa Y (2019) Prediction model with high-performance constitutive androstane receptor (CAR) using DeepSnap-deep learning approach from the tox21 10K compound library. Int J Mol Sci 20:4855

Maul J-T, Djamei V, Kolios AG, Meier B, Czernielewskiand J, Jungo P (2016) Efficacy and survival of systemic psoriasis treatments: an analysis of the SWISS registry SDNTT. Dermatology 232(6):640–647

Mayani MG, Svendsen M, Oedegaard SI (2018) Drilling digital twin success stories the last 10 years. In: Proceedings of the SPE Norway one day seminar, Bergen, Norway. https://doi.org/10.2118/191336-MS

Metz JT, Johnson EF, Soni NB, Merta PJ, Kifle L, Hajduk PJ (2011) Navigating the kinome. Nat Chem Biol 7:200–202

Miller T (2019) Explanation in artificial intelligence: insights from the social sciences. Artif Intell 267:1–38

MathSciNet   MATH   Google Scholar  

Miyato T, Dai AM, Goodfellow I (2016) Adversarial training methods for semisupervised text classification. http://arxiv.org/abs/1605.07725

Mohamed C, Nsiri B, Abdelmajid S, Abdelghani EM, Brahim B (2020) Deep convolutional networks for image segmentation: application to optic disc detection. Int Conf Electr Inf Technol (ICEIT) 2020:1–3. https://doi.org/10.1109/ICEIT48248.2020.9113204

Mukhamediev RI, Symagulov A, Kuchin Y, Yakunin K, Yelis M (2021) From classical machine learning to deep neural networks: a simplified scientometric review. Appl Sci 11:5541. https://doi.org/10.3390/app11125541

Murdoch WJ, Singh C, Kumbier K, Abbasi-Asl R, Yu B (2019) Definitions, methods, and applications in interpretable machine learning. Proc Natl Acad Sci USA 116:22071–22080

Nag S, Baidya ATK, Mandal A et al (2022) Deep learning tools for advancing drug discovery and development. 3 Biotech 12:110. https://doi.org/10.1007/s13205-022-03165-8

Nagy ZK, Fevotte G, Kramer H, Simon LL (2013) Recent advances in the monitoring, modelling, and control of crystallization systems. Chem Eng Res Des 91:1903–1922

Narayanan H, Luna MF, von Stosch M, Cruz Bournazou MN, Polotti G, Morbidelli M, Butte A, Sokolov M (2020) Bioprocessing in the digital age: the role of process models. Biotechnol J 15:e1900172

Nascimento ACA, Prudêncio RBC, Costa IG (2016) A multiple kernel learning algorithm for drug–target interaction prediction. BMC Bioinforma 17:46

Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48(3):443–453

Nguyen T, Nguyen TT, Nguyen T, Le DH (2021) Graph convolutional networks for drug response prediction. IEEE/ACM Trans Comput Biol Bioinform 19:146–154

O’Connor TF, Yu LX, Lee SL (2016) Emerging technology: a key enabler for modernizing pharmaceutical manufacturing and advancing product quality. Int J Pharm 509:492–498

Oboyle NM, Sayle RA (2016) Comparing structural fingerprints using a literature-based similarity benchmark. J Cheminform 8(1):1–14. https://doi.org/10.1186/s13321-016-0148-0

Olughu W, Deepika G, Hewitt C, Rielly C (2019) Insight into the large-scale upstream fermentation environment using scaled-down models. J Chem Technol Biotechnol 94:647–657

Oughtred R, Rust J, Chang C, Breitkreutz BJ, Stark C, Willems A, Boucher L, Leung G, Kolas N, Zhang F, Dolma S (2021) The BioGRID database: a comprehensive biomedical resource of curated protein, genetic, and chemical interactions. Protein Sci 30(1):187–200

Oztemel E, Gursev S (2018) Literature review of Industry 4.0 and related technologies. J Intell Manuf 31:127–182

Ozturk H, Ozturk A, Ozkirimli E (2018) DeepDTA: Deep drug–target binding affinity prediction. Bioinformatics 34:i821–i829

Pandey P, Katakdaunde M, Turton R (2006) Modeling weight variability in a pan coating process using Monte Carlo simulations. AAPS Pharm Sci Tech 7:E2–E11

Papadakis E, Woodley JM, Gani R (2018) Perspective on PSE in pharmaceutical process development and innovation. In: Process systems engineering for pharmaceutical manufacturing. Elsevier, Amsterdam, pp 597–656

Passi A et al (2018) RepTB: a gene ontology-based drug repurposing approach for tuberculosis. J Cheminform 10(1):24. https://doi.org/10.1186/s13321-018-0276-9

Peng J, Li J, Shang X (2020) A learning-based method for drug–target interaction prediction based on feature representation learning and deep neural network. BMC Bioinform 21:1–13

Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceeding of the ACM SIGKDD international conference on knowledge discovery and data mining, New York, NY, USA, 24–27 August 2014, pp 701–710

Poluzzi E, Raschi E, Piccinni C, De Ponti F (2012) Data mining techniques in pharmacovigilance: analysis of the publicly accessible FDA adverse event reporting system (AERS). In: Data mining applications in engineering and medicine. London, United Kingdom: IntechOpen. https://doi.org/10.5772/50095

Pouryahya M, Oh JH, Mathews JC, Belkhatir Z, Moosmüller C, Deasy JO, Tannenbaum AR (2022) Pan-cancer prediction of cell-line drug sensitivity using network-based methods. Int J Mol Sci 23:1074. https://doi.org/10.3390/ijms23031074

Qiu K, Lee J, Kim H, Yoon S, Kang K (2021) Machine learning based anti-cancer drug response prediction and search for predictor genes using cancer cell line gene expression. Genomics Inform. https://doi.org/10.5808/gi.20076

Quan C et al (2016) Multichannel convolutional neural network for biological relation extraction. BioMed Res Int. https://doi.org/10.1155/2016/1850404

Raghava GP, Barton GJ (2006) Quantification of the variation in percentage identity for protein sequence alignments. BMC Bioinf 7(1):415. https://doi.org/10.1186/1471-2105-7-415

Rampášek L et al (2019) Improving drug response prediction via modeling of drug perturbation effects. Bioinformatics. https://doi.org/10.1093/bioinformatics/btz158

Rantanen J, Khinast J (2015) The future of pharmaceutical manufacturing sciences. J Pharm Sci 104:3612–3638

Read EK, Park JT, Shah RB, Riley BS, Brorson KA, Rathore AS (2010) Process analytical technology (PAT) for biopharmaceutical products: Part I. Concepts and applications. Biotechnol Bioeng 105:276–284

Reinhardt IC, Oliveira DJC, Ring DDT (2020) Current perspectives on the development of industry 4.0 in the pharmaceutical sector. J Ind Inf Integr 18:100131

Ren S, Tao Y, Yu K et al (2022) De novo prediction of Cell-Drug sensitivities using deep learning-based graph regularized matrix factorization. Pacif Symp Biocomput. https://doi.org/10.7490/f1000research.1118807.1

Reza F, Reza S, Yadollah O (2017) Computational prediction of drug–drug interactions based on drugs functional similarities. J Biomed Inform 70:54–64

Richardson P, Griffin I, Tucker C, Smith D, Oechsle O, Phelan A, Rawling M, Savory E, Stebbing J (2020) Baricitinib as potential treatment for 2019-nCoV acute respiratory disease. Lancet (London, England) 395(10223):e30

Rifaioglu AS, Atas H, Martin MJ, Cetin-Atalay R, Atalay V, Dogan T (2019) Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Brief Bioinform 20:1878–1912

Rosen R, von Wichert G, Lo G, Bettenhausen KD (2015) About the importance of autonomy and digital twins for the future of manufacturing. IFAC-PapersOnLine 48:567–572

Ryu JY, Kim HU, Lee SY (2018) Deep learning improves prediction of drug–drug and drug–food interactions. PNAS 115(18):E4304–E4311

Sachdev K, Gupta MK (2019) A comprehensive review of feature-based methods for drug–target interaction prediction. J Biomed Inform 93:103159

Sajjia M, Shirazian S, Kelly CB, Albadarin AB, Walker G (2017) ANN analysis of a roller compaction process in the pharmaceutical industry. Chem Eng Technol 40:487–492

Sarker IH (2021) Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput Sci 2:420. https://doi.org/10.1007/s42979-021-00815-1

Sawada R, Iwata M, Tabei Y, Yamato H, Yamanishi Y (2018) Predicting inhibitory and activatory drug targets by chemically and genetically perturbed transcriptome signatures. Sci Rep 8:156

Schleich B, Anwer N, Mathieu L, Wartzack S (2017) Shaping the digital twin for design and production engineering. CIRP Ann 66:141–144

Schlichtkrull MS, De Cao N, Titov I (2020) Interpreting graph neural networks for NLP with differentiable edge masking. http://arxiv.org/abs/2010.00577

Schwarz K (2021) AttentionDDI: Siamese attention-based deep learning method for drug–drug interaction predictions. BMC Bioinf 22(1):412

Scudellari M (2020) Five companies using AI to fight coronavirus. https://spectrum.ieee.org/the-human-os/artificial-intelligence/medical-ai/companies-ai-coronavirus

Seo S, Lee T, Kim MH, Yoon Y (2020) Prediction of side effects using comprehensive similarity measures. BioMed Res Int. https://doi.org/10.1155/2020/1357630

Shang C, Liu Q, Chen KS, Sun J, Lu J, Yi J, Bi J (2018) Edge attention-based multi-relational graph convolutional networks. arXiv:1802.04944

Shao K, Zhang Z, He S, Bo X (2020) DTIGCCN: prediction of drug–target interactions based on GCN and CNN. In: Proceedings of the 2020 IEEE 32nd international conference on tools with artificial intelligence (ICTAI), Baltimore, MD, USA, 9–11 November 2020, pp 337–342

Sharifi-Noghabi H, Zolotareva O, Collins CC, Ester M (2019) MOLI: multi-omics late integration with deep neural networks for drug response prediction. Bioinformatics 35:i501–i509

Shin B, Park S, Kang K, Ho JC (2019) Self-attention based molecule representation for predicting drug–target interaction. Proc Mach Learn Res 106:1–18

Shoemaker RH (2006) The NCI60 human tumour cell line anticancer drug screen. Nat Rev Cancer 6:813–823

Shrikumar A, Greenside P, Kundaje A (2017) Learning important features through propagating activation differences. In: Proceedings of the 34th international conference on machine learning 2017; 70, JMLR.org: Sydney, NSW, Australia. pp 3145–3153

Shtar G, Rokach L, Shapira B (2019) Detecting drug–drug interactions using artificial neural networks and classic graph similarity measures. PLoS ONE 14(8):e0219796

Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A et al (2017) Mastering the game of go without human knowledge. Nature 550(7676):354–359

Simon LL, Kiss AA, Cornevin J, Gani R (2019) Process engineering advances in pharmaceutical and chemical industries: Digital process design, advanced rectification, and continuous filtration. Curr Opin Chem Eng 25:114–121

Simonyan K, Vedaldi A, Zisserman A (2014) Deep inside convolutional networks: visualising image classification models and saliency maps. In: 2nd international conference on learning representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Workshop Track Proceedings; http://arxiv.org/abs/1312.6034

Smiatek J, Jung A, Bluhmki E (2020) Towards a digital bioprocess replica: computational approaches in biopharmaceutical development and manufacturing. Trends Biotechnol 38(10):1141–1153. https://doi.org/10.1016/j.tibtech.2020.05.008

Song T, Zhang X, Ding M, Rodriguez-Paton A, Wang S, Wang G (2022) DeepFusion: a deep learning based multi-scale feature fusion method for predicting drug–target interactions. Methods 204:269–277

Springenberg JT (2015) Striving for simplicity: the all-convolutional Net. CoRR, http://arxiv.org/abs/1412.6806

Stark R, Fresemann C, Lindow K (2019) Development and operation of digital twins for technical systems and services. CIRP Ann 68:129–132

Steinwandter V, Borchert D, Herwig C (2019) Data science tools and applications on the way to Pharma 4.0. Drug Discov Today 24:1795–1805

Stokes JM, Yang K, Swanson K, Jin W, Cubillos-Ruiz A, Donghia NM, MacNair CR, French S, Carfrae LA, Bloom-Ackerman Z et al (2020) A deep learning approach to antibiotic discovery. Cell 180:688-702.e13

Subramanian K (2020) Digital twin for drug discovery and development—the virtual liver. J Indian Inst Sci 100:653–662. https://doi.org/10.1007/s41745-020-00185-2

Subramanian A, Narayan R, Corsello SM, Peck DD, Natoli TE, Lu X, Gould J, Davis JF, Tubelli AA, Asiedu JK et al (2017) A next generation connectivity map: L1000 platform and the first 1,000,000 profiles. Cell 171:1437-1452.e17

Sun X, Ma L, Du X, Feng J, Dong K (2018) Deep convolution neural networks for drug–drug interaction extraction. In: 2018 IEEE international conference on bioinformatics and biomedicine (BIBM), pp 1662–1668. https://doi.org/10.1109/BIBM.2018.8621405

Sun M, Zhao S, Gilvary C, Elemento O, Zhou J, Wang F (2020a) Graph convolutional networks for computational drug development and discovery. Brief Bioinform 21:919–935

Sun M, Wang F, Elemento O, Zhou J (2020b) Structure-based drug–drug interaction detection via expressive graph convolutional networks and deep sets. Proc AAAI Conf Artif Intell 34(10):13927–13928. https://doi.org/10.1609/aaai.v34i10.7236

System HSL (2006) Psychoactive Drug Screening Program. https://www.hsls.pitt.edu/obrc/index.php?page=URL1133202727

Tajbakhsh N et al (2016) Convolutional neural networks for medical image analysis: full training or fine tuning? IEEE Trans Med Imaging 35(5):1299–1312. https://doi.org/10.1109/TMI.2016.2535302

Tang J, Szwajda A, Shakyawar S, Xu T, Hintsanen P, Wennerberg K, Aittokallio T (2014) Making sense of large-scale kinase inhibitor bioactivity data sets: a comparative and integrative analysis. J Chem Inf Model 54:735–743

Tang P, Xu J, Louey A, Tan Z, Yongky A, Liang S, Li ZJ, Weng Y, Liu S (2020) Kinetic modeling of Chinese hamster ovary cell culture: factors and principles. Crit Rev Biotechnol 40:265–281

Tao F, Cheng J, Qi Q, Zhang M, Zhang H, Sui F (2018) Digital twin-driven product design, manufacturing and service with big data. Int J Adv Manuf Technol 94:3563–3576

Tatonetti NP et al (2012) Data-driven prediction of drug effects and interactions. Sci Transl Med 4(125):12531. https://doi.org/10.1126/scitranslmed.3003377

Tatonetti NP, Patrick PY, Daneshjou R, Altman RB (2012) Data driven prediction of drug effects and interactions. Sci Transl Med 4(125):125ra31-125ra31

Tehseen Z, Usman Z (2019) Long short-term memory recurrent neural network architectures for Urdu acoustic modelling. Int J Speech Technol 22(1):21–30. https://doi.org/10.1007/s10772-018-09573-7

Thafar M, Raies AB, Albaradei S, Essack M, Bajic VB (2019) Comparison study of computational prediction tools for drug–target binding affinities. Front Chem 7:782. https://doi.org/10.3389/fchem.2019.00782

Thafar MA, Olayan RS, Ashoor H, Albaradei S, Bajic VB, Gao X et al (2020) DTiGEMS: drug–target interaction prediction using graph embedding, graph mining, and similarity-based techniques. J Cheminform 12:1–17

Thafar MA, Alshahrani M, Albaradei S et al (2022) Affinity2Vec: drug–target binding affinity prediction through representation learning, graph mining, and machine learning. Sci Rep 12:4751. https://doi.org/10.1038/s41598-022-08787-9

Thorben F, Megha Kh, Avishek A (2021) Hard masking for explaining graph neural networks. In Submitted to international conference on learning representations https://openreview.net/forum?id=uDN8pRAdsoC

Tian X, Xin M, Luo J, Jiang Z (2016) Using the ranking-based KNN approach for drug repositioning based on multiple information. Springer, Cham, pp 317–327

Tong H, Heidemeyer M, Ban F, Cherkasov A, Ester M (2017) SimBoost: A read-across approach for predicting drug–target binding affinities using gradient boosting machines. J Cheminform 9:1–14

Torng W, Altman RB (2019) Graph convolutional neural networks for predicting drug–target interactions. J Chem Inf Model 59:4131–4149

Townshend RJL, Powers A, Eismann S, Derry A (2021) ATOM3D: tasks on molecules in three dimensions. arXiv 2021: arXiv:2012.04035

Trißl S, Rother K, Müller H et al (2005) Columba: an integrated database of proteins, structures, and annotations. BMC Bioinformatics 6:81. https://doi.org/10.1186/1471-2105-6-81

Trott O, Olson AJ (2010) AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization and multithreading. J Comput Chem 31:455

Tyson RJ, Park CC, Powell JR, Patterson JH, Weiner D, Watkins PB, Gonzalez D (2020) Precision dosing priority criteria: drug, disease, and patient population variables. J Front Pharmacol. https://doi.org/10.3389/fphar.2020.00420

U. Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43(D1):D204–D212

Vazquez J, Lopez M, Gibert E, Herrero E, Luque FJ (2020) Merging ligand-based and structure-based methods in drug discovery: an overview of combined virtual screening approaches. Molecules 25:4723

Venkatasubramanian V (2019) The promise of artificial intelligence in chemical engineering: is it here, finally? AIChE J 65:466–478

Vermeer NS, Straus SM, Mantel-Teeuwisse AK, Domergue F, Egberts TC, Leufkens HG, De Bruin ML (2013) Traceability of biopharmaceuticals in spontaneous reporting systems: a cross sectional study in the FDA adverse event reporting system (FAERS) and surveillance databases. Drug Saf 36(8):617–625

Vilar S, Hripcsak GJ (2016) Leveraging 3D chemical similarity, target and phenotypic data in the identification of drug-protein and drug-adverse effect associations. J Cheminform 8(1):35. https://doi.org/10.1186/s13321-016-0147-1

Vilar S, Uriarte E, Santana L, Lorberbaum T, Hripcsak G, Friedman C, Tatonetti NP (2014) Similarity-based modeling in large-scale prediction of drug–drug interactions. Nat Protoc 9(9):2147–2163. https://doi.org/10.1038/nprot.2014.151

Wallach I, Dzamba M, Heifets A (2015) AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv:1510.02855

Wan F et al (2019) DeepCPI: a deep learning-based framework for large-scale in silico drug screening. Genom Proteomics Bioinform 17:478–495

Wang JZ et al (2007) A new method to measure the semantic similarity of GO terms. Bioinformatics 23(10):1274–1281. https://doi.org/10.1093/bioinformatics/btm087

Wang W et al (2014) Drug repositioning by integrating target information through a heterogeneous network model. Bioinformatics 30(20):2923–2930. https://doi.org/10.1093/bioinformatics/btu403

Wang CS, Lin PJ, Cheng CL, Tai SH, Kao Yang YH, Chiang JH (2019) Detecting potential adverse drug reactions using a deep neural network model. J Med Internet Res 21(2):e11016

Wang T, Yi HC, You ZH, Li LP, Wang YB, Hu L, Wong L (2019) A gated recurrent unit model for drug repositioning by combining comprehensive similarity measures and Gaussian interaction profile kernel. In: International conference on intelligent computing. Springer, Cham. pp 344–353

Wang YB, You ZH, Yang S et al (2020a) A deep learning-based method for drug–target interaction prediction based on long short-term memory neural network. BMC Med Inform Decis Mak 20:49. https://doi.org/10.1186/s12911-020-1052-0

Wang H, Wang J, Dong C, Lian Y, Liu D, Yan Z (2020b) A novel approach for drug–target interactions prediction based on multimodal deep autoencoder. Front Pharmacol 10:1–19

Watanabe JH, McInnis T, Hirsch JD (2018) Cost of prescription drug-related morbidity and mortality. Ann Pharmacother 52:829–837. https://doi.org/10.1177/1060028018765159

Way GP, Greene CS (2018) Extracting a biologically relevant latent space from cancer transcriptomes with variational autoencoders. Pac Symp Biocomput 23:80–91

Wei J, Lu Z, Qiu K, Li P, Sun H (2020) Predicting drug risk level from adverse drug reactions using SMOTE and machine learning approaches. IEEE Access 8:185761–185775. https://doi.org/10.1109/ACCESS.2020.3029446

Weinstein JN (2004) Integromic analysis of the NCI-60 cancer cell lines. Breast Dis 19:11–22

Wen M, Zhang Z, Niu S, Sha H, Yang R, Yun Y, Lu H (2017) Deep-learning-based drug–target interaction prediction. J Proteome Res 16:1401–1409

Wenzel J, Matter H, Schmidt F (2019) Predictive multitask deep neural network models for ADME-Tox properties: learning from large data sets. J Chem Inf Model 59:1253–1268

White J, Schiffer JT, Bender R et al (2021) Drug combinations as a first line of defense against coronaviruses and other emerging viruses. Mbio 12(6):e0334721

Withnall M, Lindelöf E, Engkvist O, Chen H (2020) Building attention and edge message passing neural networks for bioactivity and physical-chemical property prediction. J Cheminform 12:1

Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530

Wu Z, Pan S, Chen F et al (2020) A comprehensive survey on graph neural networks. IEEE Trans Neural Netw Learn Syst 32:4–24

Xia Z, Wu LY, Zhou X, Wong ST (2010) Semi-supervised drug-protein interaction prediction from heterogeneous biological spaces. BMC Syst Biol 4:S6

Xiang W, Yingxin W, An Z, Xiangnan H, Tat-seng C (2021) Causal screening to interpret graph neural networks. In Submitted to international conference on learning representations. https://www.openreview.net/forum?id=nzKv5vxZfge

Xie L, He S, Song X, Bo X, Zhang Z (2018) Deep learning-based transcriptome data classification for drug–target interaction prediction. BMC Genomics 19:13–16

Xie Y, Peng J, Zhou Y et al (2019) Integrating protein-protein interaction information into drug response prediction by graph neural encoding. 16 December 2019, available at Research Square. https://doi.org/10.21203/rs.2.18936/v1

Xu Y, Pei J, Lai L (2017) Deep learning-based regression and multiclass models for acute oral toxicity prediction with automatic chemical feature extraction. J Chem Inf Model 57:2672–2685

Yan CK, Wang WX, Zhang G et al (2019) BiRWDDA: a novel drug repositioning method based on multisimilarity fusion. J Comput Biol 26(11):1230–1242

Yan C, Duan G, Zhang Y, Wu F-X, Pan Y, Wang J (2022) Predicting drug–drug interactions based on integrated similarity and semi-supervised learning. IEEE/ACM Trans Comput Biol Bioinf 19(1):168–179. https://doi.org/10.1109/TCBB.2020.2988018

Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M et al (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59:3370–3388

Yi HC, You ZH, Wang L et al (2021) In silico drug repositioning using deep learning and comprehensive similarity measures. BMC Bioinf 22:293. https://doi.org/10.1186/s12859-020-03882-y

Yifan D, Xinran X, Yang Q, Jingbo X, Wen Z, Shichao L (2020) A multimodal deep learning framework for predicting drug–drug interaction events. Bioinformatics 36:4316–4322

Ying Z, Bourgeois D, You J, Zitnik M, Leskovec J (2019) GNNExplainer: generating explanations for graph neural networks. Adv Neural Inf Process Syst 32:9244–9255

Yu Y, Si X, Hu C, Zhang J (2019) A review of recurrent neural networks: LSTM cells and network architectures. Neural Comput 31:1235–1270

Yu Y, Huang K, Zhang C, Glass LM, Sun J, Xiao C (2021) SumGNN: multi-typed drug interaction prediction via efficient knowledge graph summarization. Bioinformatics 37(18):2988–2995

Yuan H, Yu H, Wang J, Li K, Ji S (2021) On explainability of graph neural networks via subgraph explorations. http://arxiv.org/abs/2102.05152

Yue X, Wang Z, Huang J, Parthasarathy S, Moosavinasab S, Huang Y, Lin SM, Zhang W, Zhang P, Sun H (2020) Graph embedding on biomedical networks: methods, applications, and evaluations. Bioinformatics 36(4):1241–1251. https://doi.org/10.1093/bioinformatics/btz718

Yunsheng B, Ken G, Yizhou S, Wei W (2020) Bi-level graph neural networks for drug–drug interaction prediction. J Comput Eng arXiv:2006.14002

Zaikis D, Vlahavas I (2020) Drug–drug interaction classification using attention based neural networks. In: 11th Hellenic conference on artificial intelligence, pp 34–40. https://doi.org/10.1145/3411408.3411461

Zeng H, Qiu C, Cui QJD (2015) Drug-path: a database for drug-induced pathways. J Biol Databases Curation. https://doi.org/10.1093/database/bav061

Zeng T, Rongjian L, Ravi M, Jieping Y, Shuiwang J (2015) Deep convolutional neural networks for annotating gene expression patterns in the mouse brain. BMC Bioinformatics 16(1):147

Zeng X et al (2019) Measure clinical drug–drug similarity using electronic medical records. Int J Med Inf 124:97–103. https://doi.org/10.1016/j.ijmedinf.2019.02.003

Zeng X, Zhu S, Lu W, Liu Z, Huang J, Zhou Y, Fang J, Huang Y, Guo H, Li L et al (2020) Target identification among known drugs by deep learning from heterogeneous networks. Chem Sci 11:1775–1797

Zhai J, Zhang S, Chen J, He Q (2018) Autoencoder and its various variants. In: 2018 IEEE international conference on systems, man, and cybernetics (SMC), pp 415–419. https://doi.org/10.1109/SMC.2018.00080

Zhang Y (2020) Predicting drug–drug interactions using multi-modal deep autoencoders based network embedding and positive-unlabeled learning. Methods 179:37–46

Zhang M-L, Zhou Z-H (2007) Ml-knn: a lazy learning approach to multi-label learning. Pattern Recogn 40(7):2038–2048

Zhang H, Liu D, Xiong Z (2018) Convolutional neural network-based video super-resolution for action recognition. In: 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018), pp 746–750. https://doi.org/10.1109/FG.2018.00117

Zhang Y, Weng Y, Lund J (2022) Applications of explainable artificial intelligence in diagnosis and surgery. Diagnostics 12:237. https://doi.org/10.3390/diagnostics12020237

Zhang C, Lu Y, Zang T (2022) CNN-DDI: a learning-based method for predicting drug–drug interactions using convolution neural networks. BMC Bioinf 23:88. https://doi.org/10.1186/s12859-022-04612-2

Zhao Y, Zheng K, Guan B, Guo M, Song L, Gao J, Qu H, Wang Y, Shi D, Zhang Y (2020) DLDTI: a learning-based framework for drug–target interaction identification using neural networks and network representation. J Transl Med 18:434

Zhao Q, Xiao F, Yang M, Li Y, Wang J (2019) AttentionDTA: prediction of drug–target binding affinity using attention model. In: Proceedings of the 2019 IEEE international conference on bioinformatics and biomedicine, San Diego, CA, USA, 18–21 November 2019, pp 64–69

Zhou Y, Zhang Y, Lian X, Li F, Wang C, Zhu F, Qiu Y, Chen Y (2022) Therapeutic target database update 2022: facilitating drug discovery with enriched comparative data of targeted agents. Nucleic Acids Res 50:1398–1407

Zitnik M, Agrawal M, Leskovec J (2018) Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 34(13):i457–i466

Zitnik SM, Sosic R, Leskovec J (2018) Biosnap datasets: Stanford biomedical network dataset collection. http://snap.stanford.edu/biodata

Zong N, Kim H, Ngo V, Harismendy O (2017) Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations. Bioinformatics 33:2337–2344

Zügner D, Akbarnejad A, Günnemann S (2018) Adversarial attacks on neural networks for graph data. In: Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery and Data Mining. 2018, Association for Computing Machinery: London, United Kingdom. pp 2847–2856


Open access funding provided by The Science, Technology & Innovation Funding Authority (STDF) in cooperation with The Egyptian Knowledge Bank (EKB).

Author information

Authors and affiliations.

Faculty of Computers and Artificial Intelligence, University of Sadat City, Sadat City, Egypt

Computer Science Department, Faculty of Science, Minia University, Minia, Egypt

Enas Elgeldawi & Mamdouh M. Gomaa

Faculty of Computers and Artificial Intelligence, Cairo University, Cairo, Egypt

Aboul Ella Hassanien

Faculty of Pharmacy and Drug Technology, Chinese University in Egypt (CUE), Cairo, Egypt

Heba Aboul Ella

Faculty of Pharmacy, University of Sadat City, Sadat City, Menoufia, Egypt

Yaseen A. M. M. Elshaier


Contributions

Askr wrote the main text; HA wrote the digital twining part; EE wrote the deep learning part; YAMME wrote the data sets part; MMG wrote the similarity part; AEH suggested the idea of the review and supervised the work.

Corresponding author

Correspondence to Aboul Ella Hassanien .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Askr, H., Elgeldawi, E., Aboul Ella, H. et al. Deep learning in drug discovery: an integrative review and future challenges. Artif Intell Rev 56, 5975–6037 (2023). https://doi.org/10.1007/s10462-022-10306-1


Accepted: 24 October 2022

Published: 17 November 2022

Issue Date: July 2023

DOI: https://doi.org/10.1007/s10462-022-10306-1


  • Drug discovery
  • Artificial intelligence
  • Deep learning
  • Drug–target interactions
  • Drug–drug similarity
  • Drug side-effects
  • Drug sensitivity and response
  • Drug dosing optimization
  • Explainable artificial intelligence
  • Digital twining

Assay Development in Drug Discovery and Development Process

Did you know that it can take around 12-15 years and close to $1 billion for a drug to reach the market?

Of that period, 3 to 6 years go to establishing the efficacy of a drug candidate in an animal model, following the identification, validation, optimization, and efficacy testing of the drug compound. However, even after such extensive research, one-third of developed drugs fail at the first clinical stage, and half of them fail because they show toxicity in humans. This is mainly because candidates are assumed to be non-toxic to humans once their efficacy has been demonstrated in animal models.

The other half of the drugs fail because they prove ineffective in humans during late-stage clinical trials. For these reasons, testing the chosen drug candidate's effectiveness in the early stages of drug development, using in vitro assays and human cells and tissues, is a best practice that accelerates the drug discovery and development process and saves money.

Assays are procedures that assess the effectiveness of the chosen drug candidate on the desired target, which can be molecular or biochemical. Running assays can also be described as the process of “hit” discovery (Figure 1). After target identification and validation, assays are developed for the efficient optimization of the chosen compound. An assay should be suitable for high-throughput screening, biologically relevant, and economical.

In this article, we review the types of assays developed during drug development, their process, and what challenges drug manufacturers face while designing an efficient assay to assess drugs.

Types of Assays in Drug Discovery

Assay development, that is, creating a test system to assess the effects of chosen drug candidates on desired biological processes, whether cell-based or biochemical (Figure 2), is one of the first steps of drug development.

High-throughput screening of compound libraries enables researchers to perform pharmacological or genetic tests using automated software, robotic machines, and sensitive detectors. The process narrows a library down to compounds with therapeutic properties, or with functions that contribute to human health or are associated with various diseases. These tested compounds (also known as probes) are further optimized for their application in the drug development pipeline as therapeutic candidates.

Cell-based assays

Cell-based assays can evaluate the efficacy of drug molecules more effectively than biochemical assays. In vitro cell cultures are more reliable and provide deeper insight into the effect of a small molecule on humans. To track cellular activities with temporal (time of occurrence) and spatial (location) resolution, researchers often combine advanced microscopic techniques with cell-based assays.

Some cell-based assays used in the drug development process include:

  • On-chip, cell-based microarray immunofluorescence assay: Used for high-throughput target protein analysis.
  • Beta-lactamase protein fragment complementation assays: Used to study protein-protein interactions.
  • The ToxTracker assay: Used to study the toxicity of the chosen drug candidates.
  • Fluorescence-based IHC assays: Used to study morphologic features and molecular target activities.
  • Reporter gene assay: Used to detect primary signal pathway modulators.
  • Mammalian two-hybrid assay: Used to study mammalian protein interactions in the cellular environment.

These assays can be applied in a 2-D cell culture environment and in 3-D cell culture systems as well.

Biochemical assays

Biochemical assays are used to test the binding affinity or inhibitory activity of the tested drug candidate with the target enzyme or receptor molecule.

Some assays extensively used in drug discovery and development processes are:

  • Quenched fluorescence resonance energy transfer (FRET) technology-based assay: Used to screen inhibitors and monitor proteolytic activity of
  • High-performance liquid chromatography (HPLC) technique: Used to assess proteolytic action (mainly for chromogenic compounds); it can also be used to screen inhibitors to a certain extent.
  • Enzyme-linked immunosorbent assay (ELISA): Used to analyze the inhibitory activity of the tested or chosen drug compound.
  • Surface plasmon resonance (SPR) techniques: Used to study the interaction of the lead compound with the target protein.

In silico assays

In silico methods are computation-based approaches. They are effective techniques for screening a large library of compounds and evaluating their affinity and efficacy from their structure, even before the compounds enter the development process. They are widely applied in assessing the pharmacodynamic (PD) and pharmacokinetic (PK) properties of molecules. The two types of virtual screening methods used in drug development are ligand-based methods (utilizing topological fingerprints and pharmacophore similarity) and target-based methods. Docking and consensus scoring are target-based approaches used to estimate the binding affinity and inhibitory activity of the selected compound. Quantitative Structure-Activity Relationship (QSAR) modeling is another technique, which predicts the quantitative relationship between a chemical structure and its biological activity.
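As a rough illustration of the ligand-based route, the sketch below ranks a small library against a query compound by Tanimoto similarity of topological fingerprints. It is a minimal sketch under stated assumptions: fingerprints are represented as plain Python sets of "on" bit indices, and the compound names, bit values, and the helper names `tanimoto` and `rank_library` are hypothetical. A real workflow would derive fingerprints from chemical structures with a cheminformatics toolkit.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto similarity: shared on-bits divided by the union of on-bits."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_library(query: set, library: dict) -> list:
    """Return (compound, similarity) pairs, most similar to the query first."""
    scores = {name: tanimoto(query, fp) for name, fp in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy fingerprints (sets of on-bit indices), for illustration only.
query = {1, 4, 7, 9}
library = {
    "cmpd_A": {1, 4, 7, 8},
    "cmpd_B": {2, 3, 5},
    "cmpd_C": {1, 4, 7, 9},
}

ranked = rank_library(query, library)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Compounds scoring above a chosen similarity cutoff would then be carried forward to wet-lab assays; the cutoff itself is a project-specific choice that this sketch does not prescribe.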

What are the factors responsible for variation of Assays?

Factors influencing cell-based assays are as follows:

  • Culturing media
  • Culturing conditions
  • Passage number

Factors influencing biochemical assays are as follows:

  • Temperature
  • Concentration of Ions
  • Solubility of Reagents
  • Stability of Reagents
  • Aggregation of Reagents

Assay Development Process

The first step in the drug development process is the identification and validation of disease targets, such as DNA, enzymes, receptors, and ion channels. This is followed by designing assays to evaluate the pharmacokinetics and the molecular and biological activity of the hit molecule.

Different biochemical and cell-based assays are designed to assess a compound's effect on the desired target. After designing the assay, several validation steps are implemented to improve the chances of success for a drug candidate. The assay conditions and procedures are optimized to minimize the influence of factors that could introduce errors in measuring a specific substance (analyte) or a biological endpoint. The rates of false positives and false negatives determine the selectivity and sensitivity of an assay. Furthermore, various other factors, including assay automation, reagent stability, pipetting, and data analysis models, play roles in determining the validity of an experiment.
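To make the link between error rates and assay performance concrete, the following sketch computes sensitivity and specificity from a validation run in which the true status of every compound is known. It takes specificity as the usual statistical counterpart of what the text calls selectivity, and the counts are invented for illustration.

```python
def assay_performance(tp, fp, tn, fn):
    """Summarize a validation run from confusion-matrix counts.

    tp/fn: known actives scored active/inactive
    tn/fp: known inactives scored inactive/active
    """
    return {
        "sensitivity": tp / (tp + fn),          # fraction of true actives detected
        "specificity": tn / (tn + fp),          # fraction of true inactives rejected
        "false_negative_rate": fn / (tp + fn),
        "false_positive_rate": fp / (tn + fp),
    }


# Hypothetical validation set: 50 known actives, 950 known inactives.
performance = assay_performance(tp=45, fp=10, tn=940, fn=5)
for metric, value in performance.items():
    print(f"{metric}: {value:.3f}")
```

A higher false-negative rate lowers sensitivity, and a higher false-positive rate lowers specificity, which is why both rates are tracked during optimization.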

To facilitate compound screening, a range of assay formats are available that are chosen by researchers based on the goal, available facility and equipment, and screening scale.

When choosing an assay for drug development, it is important to consider the following factors:

  • Pharmacological Relevance: Conduct studies with a known ligand and the target to assess if the assay pharmacology predicts the disease state and identifies compounds with desired potency and mechanism of action.
  • Assay Reproducibility: The assay must demonstrate reproducibility across plates, screen days, and the entire duration of the drug discovery program, which may span several years.
  • Assay Quality: Quality is assessed using the Z' factor, with values above 0.4 considered robust for screening. Monitoring pharmacological controls is crucial, and high-quality assays result from simple protocols, stable reagents, and optimal instrumentation.
  • Assay Costs: Assays typically involve microtiter plates, with reagents and volumes selected to minimize costs. The format (e.g., 96-well or 384-well) depends on the context (academia, industry, or high-throughput screening).
  • Effects of Compounds: Assays need to be configured to be insensitive to solvent concentrations. Cell-based assays tolerate up to 1% DMSO, while biochemical assays can handle up to 10%. Based on the false negative and false positive hit rate, the assays are reconfigured or optimized for their further applications in the drug development process.
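The Z' factor mentioned under Assay Quality is computed from control wells as Z' = 1 − 3(σp + σn) / |μp − μn|, where μ and σ are the mean and standard deviation of the positive (p) and negative (n) controls. The sketch below is a minimal version using sample standard deviations; the control readings are made up, and the 0.4 cutoff is the one quoted in the list above.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z' factor from raw control-well signals (sample standard deviations)."""
    mu_p, mu_n = mean(positive_controls), mean(negative_controls)
    sd_p, sd_n = stdev(positive_controls), stdev(negative_controls)
    return 1.0 - 3.0 * (sd_p + sd_n) / abs(mu_p - mu_n)


# Hypothetical control wells from a single plate.
positive = [100, 102, 98, 101, 99]
negative = [10, 12, 8, 11, 9]

z = z_prime(positive, negative)
print(f"Z' = {z:.3f}")
print("robust for screening" if z > 0.4 else "needs further optimization")
```

Computing Z' per plate and per screening day gives a simple way to monitor the reproducibility requirement described above.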

Ensuring the reproducibility and transferability of measurements relies heavily on the integration of standard operating procedures (SOPs) and comprehensive method documentation. In this context, bioinformatic support is unquestionably pivotal, playing a critical role in both configuring assays and analyzing data.

Challenges in Assay Development

Assay development is the process of designing assays and optimizing their environment and procedure for the specific hit molecule or drug compound. Assays designed for drug development are of three types: biochemical assays, which provide insight into the chemistry of the drug molecule; cell-based assays, which offer insight into the efficacy and toxicity of drug molecules in humans; and computation-based approaches, which predict in advance how the lead molecule interacts with the target.

In recent times, High-Throughput Screening (HTS) campaigns have leaned towards employing target-directed and specialized libraries, along with assay formats designed to minimize artifacts and yield more comprehensive information. Further, primary cells and 3D cultures have become more common in high-content screening of lead compounds.

Besides improving current technologies, researchers are actively seeking out new detection methods in High-Throughput (HT) formats. This ongoing exploration covers possibilities such as localized surface plasmon resonance (SPR), time-resolved anisotropy analysis (TRA), intrinsic protein fluorescence, multi-photon excitation techniques, and electrochemical screens. The refinement of analysis software is also seen as a way to elevate the effectiveness of various High-Throughput Screening methods.

Conclusion and Future Directions in Assay Development

Assay development is a crucial stage of drug development that determines the success of a drug candidate in later stages. Thus, one needs to ask the right questions and carefully design the assay protocol based on the goal and requirements. However, this is not easy and poses several challenges.

No assay is without limitations; thus, to ensure reliable and accurate data, several assays must be designed to evaluate the drug candidate. For example, biochemical screening assays face limitations in capturing the intricate dynamics of living systems, which necessitates the development of more sophisticated biological assays for accurate evaluation of the biological activity of the hit molecule or lead compound and its potential toxicity in humans.

Further, the application of advanced approaches needs years of study to assess the effectiveness of their procedure in high throughput screening. For example, the use of 3D cell culture in HCS and HTS campaigns is still challenging as there are no high-throughput methods for analyzing cells within a 3D environment.

Understanding that all assays have limitations, it's crucial to create counter-assays. These counter-assays are essential for filtering out compounds that work in undesirable ways. Additionally, to learn in-depth about the complexity of biological responses, the development of secondary assays is required.

What is assay development and validation?

Assay development and validation is the process of designing specific assays, based on the lead compound and target that need to be assessed, and evaluating their suitability for intended purposes.

What are the steps in assay development?

Assay development begins with the assay design phase, followed by multiple validation steps, including pre-screen validation, in-screen validation, and cross-validation. Failure at any validation step requires redesigning the assay for its intended purpose.

What is the importance of assay validation?

It is crucial to ensure that the designed assays are both robust and specific. This is essential to guarantee the effectiveness of a drug candidate in treating a particular condition and to assess its safety in humans.

What are the assays used in drug development?

Many biochemical, cell-based, and in-silico assays are used in drug development. Some examples are the ToxTracker assay, fluorescence-based IHC assays, reporter gene assay, enzyme-linked immunosorbent assay (ELISA), molecular docking, and many others.

What is the role of assay in drug development?

Assays are analytical procedures employed to qualitatively evaluate a substance or investigate its effects on identified molecular, cellular, or biochemical targets. In drug development, they are used at all stages, including identifying hit molecules, narrowing down on the lead compound, and testing the efficacy and safety of the drug compound.

NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Markossian S, Grossman A, Arkin M, et al., editors. Assay Guidance Manual [Internet]. Bethesda (MD): Eli Lilly & Company and the National Center for Advancing Translational Sciences; 2004-.

Early drug discovery and development guidelines: for academic researchers, collaborators, and start-up companies.

Jeffrey Strovel, Sitta Sittampalam, Nathan P. Coussens, Michael Hughes, James Inglese, Andrew Kurtz, Ali Andalibi, Lavonne Patton, Chris Austin, Michael Baltezor, Michael Beckloff, Michael Weingarten, and Scott Weir.

Published May 1, 2012; Last Update: July 1, 2016.

Setting up drug discovery and development programs in academic, non-profit and other life science research companies requires careful planning. This chapter contains guidelines to develop therapeutic hypotheses, target and pathway validation, proof of concept criteria and generalized cost analyses at various stages of early drug discovery. Various decision points in developing a New Chemical Entity (NCE), description of the exploratory Investigational New Drug (IND) and orphan drug designation, drug repurposing and drug delivery technologies are also described and geared toward those who intend to develop new drug discovery and development programs.

Note: The estimates and discussions below are modeled for an oncology drug New Molecular Entity (NME) and repurposed drugs. For other disease indications these estimates might be significantly higher or lower.

Medical innovation in America today calls for new collaboration models that span government, academia, industry and disease philanthropy. Barriers to translation and ultimate commercialization will be lowered by bringing best practices from industry into academic settings, and not only by training a new generation of 'translational' scientists prepared to move a therapeutic idea forward into proof of concept in humans, but also by developing a new cadre of investigators skilled in regulatory science.

As universities begin to focus on commercializing research, there is an evolving paradigm for drug discovery and early development focused innovation within the academic enterprise. The innovation process -- moving from basic research to invention and to commercialization and application -- will remain a complex and costly journey. New funding mechanisms, the importance of collaborations within and among institutions, the essential underpinnings of public-private partnerships that involve some or all sectors, the focus of the new field of regulatory science, and new appropriate bridges between federal health and regulatory agencies all come to bear in this endeavor.

We developed these guidelines to assist academic researchers, collaborators and start-up companies in advancing new therapies from the discovery phase into early drug development, including evaluation of therapies in human and/or clinical proof of concept. This chapter outlines necessary steps required to identify and properly validate drug targets, define the utility of employing probes in the early discovery phase, medicinal chemistry, lead optimization, and preclinical proof of concept strategies, as well as address drug delivery needs through preclinical proof of concept. Once a development candidate has been identified, the guidelines provide an overview of human and/or clinical proof of concept enabling studies required by regulatory agencies prior to initiation of clinical trials. Additionally, the guidelines help to ensure quality project plans are developed and projects are advanced consistently. We also outline the expected intellectual property required at key decision points and the process by which decisions may be taken to move a project forward.

The purpose of this chapter is to define:

  • Discovery and early development of a New Chemical Entity (NCE)
  • Discovery of new, beneficial activity that currently marketed drugs possess against novel drug targets, also referred to as “drug repurposing”
  • Application of novel platform technology to the development of improved delivery of currently marketed drugs

  • Key data required at each decision point, targets and expectations required to support further development
  • An estimate of the financial resources needed to generate the data at each decision point
  • Opportunities available to outsource activities to optimally leverage strengths within the institution
  • Offer opportunities to initiate meaningful discussions with regulatory agencies to define requirements for advancement of new cancer therapies to human evaluation
  • Afford opportunities to license technologies to university start-up, biotechnology and major pharmaceutical companies
  • Define potential role(s) the National Institutes of Health SBIR programs may play in advancing new cancer therapies along the drug discovery and early development path

Within these guidelines, drug discovery and early drug development span target identification through human (Phase I) and/or clinical (Phase IIa) proof of concept. This chapter describes an approach to drug discovery and development for the treatment, prevention, and control of cancer. The guidelines and decision points described herein may serve as the foundation for collaborative projects with other organizations in multiple therapeutic areas.

Assumptions

These guidelines are being written with target identification as the initial decision point, although the process outlined here applies to a project initiated at any of the subsequent points.

The final decision point referenced in this chapter is human and/or clinical proof of concept. Although the process for new drug approval is reasonably well defined, it is very resource intensive and beyond the focus of most government, academic, and disease philanthropy organizations conducting drug discovery and early drug development activities.

The decision points in this chapter are specific to the development of a drug for the treatment of relapsed or refractory late-stage cancer patients. Many of the same criteria apply to the development of drugs intended for other indications and therapeutic areas, but each disease should be approached with a logical customization of this plan. Development of compounds for the prevention and control of cancer would follow a more conservative pathway as the benefit/risk evaluation for these compounds would be different. When considering prevention of a disease one is typically treating patients at risk, but before the disease has developed in individuals that are otherwise healthy. The development criteria for these types of compounds would be more rigorous initially and would typically include a full nonclinical development program to support the human studies. Similarly, compounds being developed to control cancer suggest that the patients may have a prolonged life expectation such that long-term toxicity must be fully evaluated before exposing a large patient population to the compound. The emphasis of the current chapter is on the development of compounds for the treatment of late-stage cancer patients.

Human and/or clinical proof of concept strategies will differ depending upon the intent of the product (treatment, prevention, or control). The concepts and strategies described in this chapter can be modified for the development of a drug for prevention or control of multiple diseases.

The cost estimates and decision points are specific to the development of a small molecule drug. Development of large molecules will require the evaluation of additional criteria and may be very specific to the nature of the molecule under development.

This plan is written to describe the resources required at each decision point and does not presume that licensing will occur only at the final decision point. It is incumbent upon the stakeholders involved to decide the optimal point at which the technology should move outside their institution.

The plan described here does not assume that the entire infrastructure necessary to generate the data underlying each decision criterion is available at any single institution. The estimates of financial resource requirements are based on an assumption that these services can be purchased from an organization (or funded through a collaborator) with the necessary equipment, instrumentation, and trained personnel to conduct the studies.

The costs associated with the tasks in the development plan are based on the experiences of the authors. It is reasonable to assume that variability in the costs and duration of specific data-generating activities will depend upon the nature of the target and molecule under development.

Definitions

At Risk Initiation – The decision by the project team to begin activities that do not directly support the next unmet decision point, but will instead support a subsequent decision point. At Risk Initiation is sometimes recommended to decrease the overall development time.

Commercialization Point – In this context, the authors use the term to describe the point at which a commercial entity is involved to participate in the development of the drug product. This most commonly occurs through a direct licensing arrangement between the university and an organization with the resources to continue the development of the product.

Counter-screen – A screen performed in parallel with or after the primary screen. The assay used in the counter-screen is developed to identify compounds that have the potential to interfere with the assay used in the primary screen (the primary assay). Counter-screens can also be used to eliminate compounds that possess undesirable properties, for example, a counter-screen for cytotoxicity ( 1 ).

Cumulative Cost – This describes the total expenditure by the project team from project initiation to the point at which the project is either completed or terminated.

Decision Point – The latest moment at which a predetermined course of action is initiated. Project advancement based on decision points balances the need to conserve scarce development resources with the requirement to develop the technology to a commercialization point as quickly as possible. Failure to meet the criteria listed for the following decision points will lead to a No Go recommendation.

False positive – Generally related to the ‘‘specificity’’ of an assay. In screening, a compound may be active in an assay but inactive toward the biological target of interest. For this chapter, this does not include activity due to spurious, non-reproducible activity (such as lint in a sample that causes light-scatter or spurious fluorescence and other detection related artifacts). Compound interference that is reproducible is a common cause of false positives, or target-independent activity ( 1 ).

Go Decision – The project conforms to key specifications and criteria and will continue to the next decision point.

High-Throughput Screen (HTS) – A large-scale automated experiment in which large libraries (collections) of compounds are tested for activity against a biological target or pathway. It can also be referred to as a “screen” for short ( 1 ).

Hits – A term for putative activity observed during the primary high-throughput screen, usually defined by percent activity relative to control compounds ( 1 ).

Chemical Lead Compound – A member of a biologically and pharmacologically active compound series with desired potency, selectivity, pharmacokinetic, pharmacodynamic and toxicity properties that can advance to IND-enabling studies for clinical candidate selection.

Incremental Cost – A term used to describe the additional cost of activities that support decision criteria for any given decision point, independent of other activities that may have been completed or initiated to support decision criteria for any other decision point.

Library – A collection of compounds that meet the criteria for screening against disease targets or pathways of interest ( 1 ).

New Chemical Entity (NCE) – A molecule emerging from the discovery process that has not previously been evaluated in clinical trials.

No Go Decision – The project does not conform to key specifications and criteria and will not continue.

Off-Target Activity – Compound activity that is not directed toward the biological target of interest but can give a positive read-out, and thus can be classified as an active in the assay ( 1 ).

Orthogonal Assay – An assay performed following (or in parallel to) the primary assay to differentiate between compounds that generate false positives from those compounds that are genuinely active against the target ( 1 ).

Primary Assay – The assay used for the high-throughput screen ( 1 ).

Qualified Task – A task that should be considered, but not necessarily required to be completed at a suggested point in the project plan. The decision is usually guided by factors outside the scope of this chapter. Such tasks will be denoted in this chapter by enclosing the name of the tasks in parentheses in the Gantt chart, e.g. (qualified task).

Secondary Assay – An assay used to test the activity of compounds found active in the primary screen (and orthogonal assay) using robust assays of relevant biology. Ideally, these are of at least medium-throughput to allow establishment of structure-activity relationships between the primary and secondary assays and establish a biologically plausible mechanism of action ( 1 ).

Section 1. Discovery and Development of New Chemical Entities

The Gantt chart ( Table 1 ) illustrates the scope of this chapter. The left-hand portion of the chart includes the name of each decision point as well as the incremental cost for the activities that support that task. The black bars on the right-hand portion of the chart represent the duration of the summary task (combined criteria) to support a decision point as well as the cumulative cost for the project at the completion of that activity. A similar layout applies to each of the subsequent figures; however, the intent of these figures is to articulate the activities that underlie each decision point.

Table 1:

Composite Gantt Chart Roll-up Representing Target ID through Clinical POC

The submission of regulatory documents, for the purpose of this example, reflects the preparation of an Investigational New Drug (IND) application in Common Technical Document (CTD) format. The CTD format is required for preparation of regulatory documents in Europe (according to the Investigational Medicinal Product Dossier [IMPD]), Canada for investigational applications (Clinical Trial Application) and is accepted by the United States Food and Drug Administration (FDA) for INDs. The CTD format is required for electronic CTD (eCTD) submissions. The advantages of the CTD are that it facilitates global harmonization and lays the foundation upon which the marketing application can be prepared. The sections of the CTD are prepared early in development (at the IND stage) and are then updated, as needed, until submission of the marketing application.

Decision Point #1 - Target Identification

Target-based drug discovery begins with identifying the function of a possible therapeutic target and its role in the disease ( 2 ). There are two criteria that justify advancement of a project beyond target identification. These are:

  • Previously published (peer-reviewed) data on a particular disease target pathway or target, OR
  • Evidence of new biology that modulates a disease pathway or target of interest

Resource requirements to support this initial stage of drug discovery can vary widely as the novelty of the target increases. In general, the effort required to elucidate new biology can be significant. Most projects will begin with these data in hand, whether from a new or existing biology. We estimate that an additional investment might be needed to support the target identification data that might already exist ( Table 2 ). However, as reflected in Table 2 , if additional target validation activities proceed at risk , the total cost of the project at a “No Go” decision will reach approximately $468,500 (estimated).

Table 2:

Target Identification and Target Validation

Decision Point #2 - Target Validation

Target validation requires a demonstration that a molecular target is directly involved in a disease process, and that modulation of the target is likely to have a therapeutic effect ( 2 ). There are seven criteria for evaluation prior to advancement beyond target validation. These are:

  • Known molecules modulate the target
  • Type of target has a history of success (e.g. ion channel, GPCR, nuclear receptor, transcription factor, cell cycle, enzyme, etc.)
  • Genetic confirmation (e.g. Knock-out, siRNA, shRNA, SNP, known mutations, etc.)
  • Availability of known animal models
  • Low-throughput target validation assay that represents biology
  • Intellectual property of the target
  • Market potential of the disease/target space

The advancement criteria supporting target validation can usually be completed in approximately 12 months by performing most activities in parallel. In an effort to reduce the overall development timeline, we recommend starting target validation activities at risk (prior to a “Go” decision on target identification). Table 2 illustrates the dependencies between the criteria supporting the first two decision points. The incremental cost of the activities supporting decision-making criteria for target validation is approximately $268,500. However, a decision to initiate target validation prior to completion of target identification (recommended) and subsequent initiation of identification of actives at risk would lead to a total project cost (estimate) of $941,000 if a “No Go” decision were reached at the conclusion of target validation.

Decision Point #3 - Identification of Actives

An active is defined as a molecule that shows significant biological activity in a validated screening assay that represents the disease biology and physiology. By satisfying the advancement criteria listed below for identification of actives, the project team will begin to define new composition of matter by linking a chemical structure to modulation of the target. There are five (or six if invention disclosure occurs at this stage) criteria for evaluation at the identification of actives decision point. These are:

  • Acquisition of screening reagents
  • Primary HTS assay development and validation
  • Compound library available to screen
  • Actives criteria defined
  • Perform high-throughput screen
  • (Composition of Matter invention disclosure)

The advancement criteria supporting identification of actives can be completed in approximately 12 months in most cases by performing activities in parallel. Table 3 illustrates the dependencies and timing associated with a decision to begin activities supporting confirmation of hits prior to a “Go” decision on decision point #3. The incremental cost associated with decision point #3 is estimated to be $472,500 (assuming the assay is transferred and validated without difficulty). The accumulated project cost associated with a “No Go” decision at this point is estimated to be $1.46 million. This assumes an at risk initiation of activities supporting decision point #4.
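One common way to express the "actives criteria" is percent activity relative to control wells, with a cutoff above which a compound is called active. The sketch below assumes an inhibition-style assay in which the neutral control defines 0% and the full-effect control defines 100%. The well signals, compound IDs, and the 50% threshold are hypothetical choices, not values from this chapter.

```python
def percent_activity(raw, neutral_mean, full_effect_mean):
    """Normalize a raw well signal to 0% (neutral control) .. 100% (full effect)."""
    return 100.0 * (raw - neutral_mean) / (full_effect_mean - neutral_mean)


def call_actives(plate, neutral_mean, full_effect_mean, threshold=50.0):
    """Return {compound_id: percent activity} for wells at or above the threshold."""
    actives = {}
    for compound_id, raw in plate.items():
        activity = percent_activity(raw, neutral_mean, full_effect_mean)
        if activity >= threshold:
            actives[compound_id] = activity
    return actives


# Hypothetical plate: signal drops from about 1000 (no inhibition) to about 200 (full inhibition).
plate = {"cmpd_001": 950.0, "cmpd_002": 520.0, "cmpd_003": 240.0}
actives = call_actives(plate, neutral_mean=1000.0, full_effect_mean=200.0)

for compound_id, activity in actives.items():
    print(f"{compound_id}: {activity:.1f}% activity")
```

Fixing the normalization and the threshold before the screen is run keeps the selection of actives objective and reproducible across plates.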

Table 3:

Identification of Actives

Decision Point #4 - Confirmation of Hits

A hit is defined as consistent activity of a molecule (with confirmed purity and identity) in a biochemical and/or cell based secondary assay. Additionally, this is the point at which the project team will make an assessment of the molecular class of each of the hits. There are six (or seven if initial invention disclosure occurs at this stage) criteria for evaluation at the confirmation of hits decision point. These are:

  • Confirmation based on repeat assay, Concentration Response Curve (CRC)
  • Secondary assays for specificity, selectivity, and mechanisms
  • Confirmed identity and purity
  • Cell-based assay confirmation of biochemical assay when appropriate
  • Druggability of the chemical class (reactivity, stability, solubility, synthetic feasibility)
  • Chemical Intellectual Property (IP)

The advancement criteria supporting decision point #4 can usually be completed in approximately 18 months, depending upon the existence of cell-based assays for confirmation. If the assays need to be developed or validated at the screening lab, we recommend starting that activity at risk concurrent with the CRC and mechanistic assays. Table 4 represents the dependencies and timing associated with the decision to begin activities supporting confirmation of hits prior to a “Go” decision on decision point #3. The incremental cost of confirmation of hits is $522,000. The accumulated project cost at a “No Go” decision on decision point #4 can be as high as $1.8 million if a proceed at risk decision is made on identification of a chemical lead (decision point #5).
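A concentration response curve is often described with a four-parameter logistic (Hill) model. The sketch below simulates an inhibition curve from that model and then estimates the half-maximal concentration by log-linear interpolation between the two points that bracket 50% response. The parameter values and concentrations are hypothetical, and the interpolation is a deliberately simple stand-in for the nonlinear curve fit a screening lab would normally perform.

```python
import math


def four_pl(conc, bottom, top, ic50, hill):
    """Four-parameter logistic response for an inhibition curve."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)


def interpolate_ic50(concs, responses, half=50.0):
    """Estimate where the response crosses `half` by log-linear interpolation."""
    points = sorted(zip(concs, responses))
    for (c_lo, r_lo), (c_hi, r_hi) in zip(points, points[1:]):
        crosses = (r_lo - half) * (r_hi - half) <= 0
        if crosses and r_lo != r_hi:
            frac = (r_lo - half) / (r_lo - r_hi)
            log_c = math.log10(c_lo) + frac * (math.log10(c_hi) - math.log10(c_lo))
            return 10 ** log_c
    return None  # response never crosses `half` in the tested range


# Hypothetical 5-point dilution series (µM) and simulated % remaining activity.
concs = [0.01, 0.1, 1.0, 10.0, 100.0]
responses = [four_pl(c, bottom=0.0, top=100.0, ic50=2.0, hill=1.0) for c in concs]

estimate = interpolate_ic50(concs, responses)
print(f"interpolated IC50 ~ {estimate:.2f} µM (simulated with IC50 = 2 µM)")
```

The interpolated value lands near, but not exactly on, the simulated IC50, which illustrates why denser dilution series and a proper curve fit are preferred when confirming hits.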

Table 4:

Confirmation of Hits

Decision Point #5 - Identification of Chemical Lead

A chemical lead is defined as a synthetically feasible, stable, and drug-like molecule active in primary and secondary assays with acceptable specificity and selectivity for the target. This requires definition of the Structure-Activity Relationship (SAR) as well as determination of synthetic feasibility and preliminary evidence of in vivo efficacy and target engagement (Note: projects at this stage might be eligible for Phase I SBIR). Characteristics of a chemical lead are:

  • SAR defined
  • Druggability (preliminary toxicity, hERG, Ames)
  • Synthetic feasibility
  • Select mechanistic assays
  • In vitro assessment of drug resistance and efflux potential
  • Evidence of in vivo efficacy of chemical class
  • PK/Toxicity of chemical class known based on preliminary toxicity or in silico studies

In order to decrease the number of compounds that fail in the drug development process, a druggability assessment is often conducted. This assessment is important in transforming a compound from a lead molecule into a drug. For a compound to be considered druggable it should have the potential to bind to a specific target; however, also important is the compound’s pharmacokinetic profile regarding absorption, distribution, metabolism, and excretion. Other assays will evaluate the potential toxicity of the compound in screens such as the Ames test and cytotoxicity assay. When compounds are being developed for indications where the predicted patient survival is limited to a few years, it is important to note that a positive result in the cytotoxicity assays would not necessarily limit the development of the compound, and other druggability factors (such as the pharmacokinetic profile) would be more relevant for determining the potential for development.

The advancement criteria supporting decision point #5 will most likely be completed in approximately 12-18 months due to the concurrent activities. We recommend that SAR and druggability assessments begin at risk prior to a “Go” on confirmation of hits. Synthetic feasibility and PK assessment will begin at the completion of decision point #4. The cost of performing the recommended activities to support identification of a chemical lead is estimated to be $353,300 ( Table 5 ). The accumulated project costs at the completion of decision point #5 are estimated to be $2.1 million including costs associated with at risk initiation of activities to support decision point #6.

Table 5:

Identification of a Chemical Lead

Decision Point #6 - Selection of Optimized Chemical Lead

An optimized chemical lead is a molecule that will enter IND-enabling GLP studies and for which GMP supplies will be produced for clinical trials. We will describe the activities that support GLP and GMP development in the next section. This section focuses on the decision process to identify those molecules (Note: projects at this stage may be eligible for Phase II SBIR). Criteria for selecting optimized candidates are listed below:

  • Acceptable in vivo PK and toxicity
  • Feasible formulation
  • In vivo preclinical efficacy (properly powered)
  • Dose Range Finding (DRF) pilot toxicology
  • Process chemistry assessment of scale up feasibility
  • Regulatory and marketing assessments

The advancement criteria supporting decision point #6 can be completed in approximately 12-15 months. As indicated above, we recommend commencing activities to support selection of an optimized chemical lead prior to a “Go” decision on decision point #5. In particular, the project team should place emphasis on criterion 6.3 (in vivo preclinical efficacy). A strong lead will have clearly defined pharmacodynamic endpoints at the preclinical stage and will set the stage for strong indicators of efficacy at decision point #11 (clinical proof of concept). The cost of performing the recommended activities to support decision point #6 is estimated to be $302,500 (Table 6). The accumulated project costs at the completion of decision point #6 are estimated to be $2.4 million, including costs associated with at-risk initiation of activities to support decision point #7.

Table 6: Selection of an Optimized Chemical Lead

Decision Point #7 - Selection of a Development Candidate

A development candidate is a molecule for which the intent is to begin Phase I evaluation. Prior to submission of an IND, the project team must evaluate the likelihood of successfully completing the IND-enabling work that will be required as part of the regulatory application for first-in-human testing. Prior to decision point #7, many projects will advance as many as 7-10 molecules. Typically, most pharma and biotech companies will select a single development candidate with one designated backup. Here, we recommend that the anointed “Development Candidate” be the molecule that rates the best on the six criteria below. In many cases, a Pre-IND meeting with the regulatory agency might be considered. Failure of a molecule to address all six criteria should warrant a “No Go” decision by the project team. The following criteria should be minimally met for a development candidate:

  • Acceptable PK (with a validated bioanalytical method)
  • Demonstrated in vivo efficacy/activity
  • Acceptable safety margin (toxicity in rodents or dogs when appropriate)
  • Feasibility of GMP manufacture
  • Acceptable drug interaction profile
  • Well-developed clinical endpoints

The advancement criteria supporting decision point #7 are estimated to be completed in 12 months, but may be compressed to as little as 6 months. The primary rate-limiting step among the decision criteria is the determination of the safety margin, as this can be affected by the formulation and dosing strategies selected earlier. In this case, the authors have presented a project that includes a 7-day repeat dose study in rodents to demonstrate an acceptable safety margin. The incremental costs of activities to support the selection of a development candidate (as shown) are estimated to be approximately $275,000. The accumulated project cost at this point is approximately $2.4 million to complete decision points #6, #7, and the FDA Pre-IND meeting (Table 7). If the development plan requires a longer toxicology study at this point, costs can be higher (approximately $190,000 for a 14-day repeat dose study in rats and $225,000 in dogs).

Table 7: Selection of a Development Candidate

Decision Point #8 - Pre-IND Meeting with the FDA

Pre-IND advice from the FDA may be requested for issues related to the data needed to support the rationale for testing a drug in humans; the design of nonclinical pharmacology, toxicology, and drug activity studies, including design and potential uses of any proposed treatment studies in animal models; data requirements for an IND application; initial drug development plans; and regulatory requirements for demonstrating safety and efficacy (1). We recommend that this meeting take place after the initiation, but before the completion, of tasks to support decision point #7 (selection of a development candidate). The feedback from the FDA might necessitate adjustments to the project plan. Making these changes prior to candidate selection will save time and money. Pre-IND preparation will require the following:

  • Prepare pre-IND meeting request to the FDA, including specific questions
  • Prepare pre-IND meeting package, which includes adequate information for the FDA to address the specific questions (clinical plan, safety assessments summary, CMC plan, etc.)
  • Prepare the team for the pre-IND meeting
  • Conduct pre-IND meeting with the FDA
  • Adjust project plan to address the FDA comments
  • Target product profile

The advancement criteria supporting decision point #8 should be completed in 12 months. We recommend preparing the pre-IND meeting request approximately 3 to 6 months prior to selection of a development candidate (provided that the data supporting that decision point are promising). The cost of performing the recommended activities to support decision point #8 (pre-IND preparation) is estimated to be $37,000.

Decision Point #9 - Preparation and Submission of an IND Application

The decision to submit an IND application presupposes that all of the components of the application have been addressed. The largest expense associated with preparation of the IND is related to the CMC activities (manufacture and release of GMP clinical supplies). A “Go” decision is contingent upon all of the requirements for the IND having been addressed and upon the regulatory agency agreeing with the clinical plan. (Note: projects at this stage may be eligible for SBIR BRIDGE awards.) The following criteria should be addressed in addition to addressing comments from the pre-IND meeting:

  • Well-developed clinical plan
  • Acceptable clinical dosage form
  • Acceptable preclinical drug safety profile
  • Clear IND regulatory path
  • Human Proof of Concept (HPOC)/Clinical Proof of Concept (CPOC) plan is acceptable to regulatory agency (pre-IND meeting)
  • Reevaluate IP positions

The advancement criteria supporting decision point #9 are estimated to be completed in 12 months, but might be compressed to as little as 6 months if necessary. We recommend initiating “at risk” as long as there is confidence that a qualified development candidate is emerging before completion of decision point #7 and the plan remains largely unaltered after the pre-IND meeting (decision point #8). The incremental costs of completing decision point #9 are estimated to be $780,000. The accumulated project cost at this point will be approximately $3.2 million (Table 8).

Table 8: Submit IND Application

Decision Point #10 - Human Proof of Concept

Most successful Phase I trials in oncology require 12-21 months for completion, in some cases because of very restrictive enrollment criteria. There is no “at risk” initiation of Phase I; therefore, the timeline cannot be shortened in that manner. The most important factors in determining the length of a Phase I study are a logically written clinical protocol and an available patient population. A “Go” decision clearly rests on the safety of the drug, but many project teams will decide not to proceed if there is not at least some preliminary indication of efficacy during Phase I (decision point #10, below). Proceeding to Phase II trials will depend on:

  • IND clearance
  • Acceptable Maximum Tolerated Dose (MTD)
  • Acceptable Dose Response (DR)
  • Evidence of human pharmacology
  • Healthy volunteer relevance

We estimate the incremental cost of an oncology Phase I study will be approximately $1 million. This can increase significantly if additional patients are required to demonstrate MTD, DR, pharmacology and/or efficacy. Our estimate is based on a 25-patient (outpatient) study completed in 18 months. The accumulated project cost at completion of decision point #10 will be approximately $4.2 million (Table 9).

Table 9: Human Proof of Concept

Decision Point #11: Clinical Proof of Concept

With acceptable Dose Ranging and Maximum Tolerable Dose having been defined during Phase I, in Phase II the project team will attempt to statistically demonstrate efficacy. More specifically, the outcome of Phase II should reliably predict the likelihood of success in Phase III randomized trials.

  • Meeting the IND objectives
  • Acceptable human PK/PD profile
  • Safety and tolerance assessments

We estimate the incremental cost of an oncology Phase IIa study will be approximately $5.0 million (Table 10). This cost is largely dependent on the number of patients required and the number of centers involved. Our estimate is based on 150 outpatients with studies completed in 24 months. The accumulated project cost at the completion of decision point #11 will be approximately $9.2 million.
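The accumulated figures for the clinical stages follow directly from the stated increments. As a minimal sketch, the Python snippet below adds the Phase I and Phase IIa estimates to the approximately $3.2 million accumulated at decision point #9. The dollar values come from the text above; the function and variable names are hypothetical and used only for illustration.

```python
# Stated accumulated cost at completion of decision point #9 (USD, from the text).
ACCUMULATED_AFTER_DP9 = 3_200_000

# Stated incremental estimates for the clinical decision points (USD, from the text).
INCREMENTAL_COSTS = {
    "DP10 (Phase I, human proof of concept)": 1_000_000,
    "DP11 (Phase IIa, clinical proof of concept)": 5_000_000,
}


def running_totals(baseline: int, increments: dict[str, int]) -> dict[str, int]:
    """Return the accumulated cost after each stage, in the order given."""
    totals = {}
    running = baseline
    for stage, cost in increments.items():
        running += cost
        totals[stage] = running
    return totals


if __name__ == "__main__":
    for stage, total in running_totals(ACCUMULATED_AFTER_DP9, INCREMENTAL_COSTS).items():
        print(f"{stage}: ${total / 1e6:.1f} million accumulated")
```

The same helper can be reused with a project's own estimates; the guideline's figures are approximations and will shift with patient numbers and the number of centers.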

Table 10: Decision Point #11 in Detail

Section 2. Repurposing of Marketed Drugs

Drug repurposing and rediscovery development projects frequently seek to employ the 505(b)(2) drug development strategy. This strategy leverages studies conducted and data generated by the innovator firm that are available in the published literature, in product monographs, or in product labeling. Improving the quality of drug development plans will reduce the time of 505(b)(2) development cycles, and reduce the time and effort required by the FDA during the NDA review process. Drug repurposing projects seek a new indication in a different patient population and perhaps a different formulated drug product than what is currently described on the product label. By leveraging existing nonclinical data and clinical safety experience, sponsors have the opportunity to design and execute novel, innovative clinical trials to characterize safety and efficacy in a different patient population. The decision points for drug repurposing are summarized in Table 11.

Table 11: Summary of Decision Points for Drug Repurposing

Decision Point #1: Identification of Actives

For drug repurposing, actives are identified as follows ( Table 12 ):

Table 12:

  • Acquisition of Active Pharmaceutical Ingredients (API) for screening
  • Primary HTS assay development, validation
  • Perform HTS
  • (Submit invention disclosure and consider use patent)

Decision Point #2: Confirmation of Hits

Hits are confirmed as follows for a drug repurposing project ( Table 13 ):

Table 13:

  • Confirmation based on repeat assay and concentration-response curve (CRC)

Decision Point #3: Gap Analysis/Development Plan

When considering the 505(b)(2) NDA approach, it is important to understand what information is available to support the proposed indication and what additional information might be needed. The development path is dependent upon the proposed indication, change in formulation, route, and dosing regimen. The gap analysis/development plan that is prepared will take this information into account in order to determine what studies might be needed prior to submission of an IND and initiating first-in-man studies. A thorough search of the literature is important in order to capture information available to satisfy the data requirements for the IND. Any gaps identified would need to be filled with studies conducted by the sponsor. A pre-IND meeting with the FDA will allow the sponsor to present their plan to the FDA and gain acceptance prior to submission of the IND and conducting the first-in-man study (Table 14).

Table 14: Gap Analysis/Development Plan

  • CMC program strategy
  • Preclinical program strategy
  • Clinical proof of concept strategy
  • Draft clinical protocol design
  • Pre-IND meeting with the FDA
  • Commercialization/marketing strategy and target product profile

Decision Point #4: Clinical Formulation Development

The clinical formulation development will include the following ( Table 15 ):

Table 15: Clinical Formulation Development

  • Prototype development
  • Analytical methods development
  • Prototype stability
  • Prototype selection
  • Clinical supplies release specification
  • (Submit invention disclosure on novel formulation)

Decision Point #5: Preclinical Safety Data Package

Preparation of the gap analysis/development plan will identify any additional studies that might be needed to support the development of the compound for the new indication. Based on this assessment, as well as the intended patient population, the types of studies that will be needed to support the clinical program will be determined. It is possible that a pharmacokinetic study evaluating exposure would be an appropriate bridge to the available data in the literature ( Table 16 ).

Table 16: Preclinical Safety Data Package

  • Preclinical oral formulation development
  • Bioanalytical method development
  • Qualify GLP test article
  • Transfer plasma assay to GLP laboratory
  • ICH S7A (Safety Pharmacology) & S7B (Cardiac Tox) core battery of tests
  • Toxicology bridging study
  • PK/PD/Tox studies if formulation & route of administration is different

Decision Point #6: Clinical Supplies Manufacture

Clinical supplies will need to be manufactured. The list below provides some of the considerations that need to be made for manufacturing clinical supplies ( Table 17 ):

Table 17: Clinical Supplies Manufacture

  • Select cGMP supplier and transfer manufacturing process
  • Cleaning validation development
  • Scale-up lead formulation at GMP facility
  • Clinical label design
  • Manufacture clinical supplies

Decision Point #7: IND Preparation and Submission

Following the pre-IND meeting with the FDA, and after any additional studies have been conducted, the IND is prepared in common technical document format to support the clinical protocol. The IND is prepared in 5 separate modules that include administrative information, summaries (CMC, nonclinical, clinical), quality data (CMC), nonclinical study reports and literature, and clinical study reports and literature (Table 18). Following submission of the IND to the FDA, there is a 30-day review period during which the FDA may ask for additional data or clarity on the information submitted. If after 30 days the FDA has communicated that there is no objection to the proposed clinical study, the IND is considered active and the clinical study can commence.

Table 18: IND Preparation and Submission

  • Investigator’s brochure preparation
  • Protocol preparation and submission to IRB
  • IND preparation and submission

Decision Point #8: Human Proof of Concept

Human proof of concept may commence following successful submission of an IND (i.e., an IND that has not been placed on ‘clinical hold’). The list below provides some information concerning human proof of concept (Table 19):

Table 19:

  • IND Clearance
  • Acceptable MTD
  • Acceptable DR

Section 3. Development of Drug Delivery Platform Technology

Historically about 40% of NCEs identified as possessing promise for development, based on drug-like qualities, progress to evaluation in humans. Of those that do make it into clinical trials, about 9 out of 10 fail. In many cases, innovative drug delivery technology can provide a “second chance” for promising compounds that have consumed precious drug-discovery resources, but were abandoned in early clinical trials due to unfavorable side-effect profiles. As one analyst observed, “pharmaceutical companies are sitting on abandoned goldmines that should be reopened and excavated again using the previously underutilized or unavailable picks and shovels developed by the drug delivery industry” (SW Warburg Dillon Read). Although this statement was made more than 10 years ago, it continues to apply.

Beyond enablement of new drugs, innovative approaches to drug delivery also hold potential to enhance marketed drugs (e.g., through improvement in convenience, tolerability, safety, and/or efficacy); expand their use (e.g., through broader labeling in the same therapeutic area and/or increased patient acceptance/compliance); or transform them by enabling their suitability for use in other therapeutic areas. These opportunities contribute enormously to the potential for value creation in the drug delivery field. Table 20 summarizes the decision points for the development of drug delivery platform technology.

Table 20: Summary of Decision Points for Drug Delivery Platform Technology

Decision Point #1: Clinical Formulation Development

See Table 21 for a schematic representation of the time and costs associated with development at this stage.

Table 21:

Decision Point #2: Development Plan

Preparation of a development plan allows the sponsor to evaluate the available information regarding the compound of interest (whether at the development stage or a previously marketed compound) to understand what information might be available to support the proposed indication and what additional information may be needed. The development path is dependent upon the proposed indication, change in formulation, route, and dosing regimen. The development plan that is prepared will take this information into account in order to determine what information or additional studies might be needed prior to submission of an IND and initiating first-in-man studies. A thorough search of the literature is important in order to capture available information to satisfy the data requirements for the IND. Any gaps identified would need to be filled with studies conducted by the sponsor. A pre-IND meeting with the FDA will allow the sponsor to present their plan to the FDA and gain acceptance (de-risk the program) prior to submission of the IND and conducting the first-in-man study ( Table 22 ).

Table 22: Development Plan

Decision Point #3: Clinical Supplies Manufacture

  • Scale up lead formulation at GMP facility

See Table 23 for a schematic representation of the time and costs associated with development at this stage.

Table 23:

Decision Point #4: Preclinical Safety Package

Preparation of the gap analysis/development plan will identify any additional studies that might be needed to support the development of the new delivery platform for the compound. Based on this assessment, as well as the intended patient population, the types of studies that will be needed to support the clinical program will be determined. It is possible that a pharmacokinetic study evaluating exposure would be an appropriate bridge to the available data in the literature ( Table 24 ).

Table 24: Preclinical Safety Package

  • Transfer drug exposure/bioavailability assays to GLP laboratory

Decision Point #5: IND Preparation and Submission

Following the pre-IND meeting with the FDA, and after any additional studies have been conducted, the IND is prepared in common technical document format to support the clinical protocol. The IND is prepared in 5 separate modules, which include administrative information, summaries (CMC, nonclinical, clinical), quality data (CMC), nonclinical study reports and literature, and clinical study reports and literature. Following submission of the IND to the FDA, there is a 30-day review period during which the FDA might ask for additional data or clarity on the information submitted. If after 30 days the FDA has communicated that there is no objection to the proposed clinical study, the IND is considered active and the clinical study can commence (Table 25).

Table 25:

Decision Point #6: Human Proof of Concept

Human proof of concept may commence following successful submission of an IND (i.e., an IND that has not been placed on ‘clinical hold’). The list below provides some information concerning human proof of concept (Table 26):

Table 26:

Decision Point #7: Clinical Proof of Concept

With acceptable DR and MTD having been defined during Phase I, in Phase II the project team will attempt to statistically demonstrate efficacy. More specifically, the outcome of Phase II should reliably predict the likelihood of success in Phase III randomized trials ( Table 27 ).

Table 27: Clinical Proof of Concept

  • Acceptable PK/PD profile
  • Direct and indirect biomarkers

Section 4. Alternative NCE Strategy: Exploratory IND

The plans outlined previously in these guidelines describe advancement of novel drugs, as well as repurposed or reformulated marketed drug products, to human and/or clinical proof of concept trials using the traditional (conventional) early drug development IND approach. This section of the guidelines outlines an alternative approach to accelerating novel drugs and imaging molecules to humans employing a Phase 0, exploratory IND strategy (exploratory IND). The exploratory IND strategy was first issued in the form of draft guidance in April 2005. Following a great deal of feedback from the public and private sectors, the final guidance was published in January 2006.

Phase 0 describes clinical trials that occur very early in the Phase I stage of drug development. Phase 0 trials limit drug exposure to humans (up to 7 days) and have no therapeutic intent. Phase 0 studies are viewed by the FDA and National Cancer Institute (NCI) as important tools for accelerating novel drugs to the clinic. There is some flexibility in data requirements for an exploratory IND. These requirements are dependent on the goals of the investigation (e.g., receptor occupancy, pharmacokinetics, human biomarker validation), the clinical testing approach, and anticipated risks.

Exploratory IND studies provide the sponsor with an opportunity to evaluate up to five chemical entities (optimized chemical lead candidates) or formulations at once. When an optimized chemical lead candidate or formulation is selected, the exploratory IND is then closed, and subsequent drug development proceeds along the traditional IND pathway. This approach allows one, when applicable, to characterize the human pharmacokinetics and target interaction of chemical lead candidates. Exploratory IND goals are typically to:

  • Characterize the relationship between mechanism of action and treatment of the disease; in other words, to validate proposed drug targets in humans
  • Characterize the human pharmacokinetics
  • Select the most promising chemical lead candidate from a group of optimized chemical lead candidates (note that the chemical lead candidates do not necessarily have the same chemical scaffold origins)
  • Explore the bio-distribution of chemical lead candidates employing imaging strategies (e.g., PET studies)

Exploratory IND studies are broadly described as “microdosing” studies and clinical studies attempting to demonstrate a pharmacologic effect. Exploratory IND or Phase 0 strategies must be discussed with the relevant regulatory agency before implementation. These studies are described below.

Microdosing studies are intended to characterize the pharmacokinetics of chemical lead candidates or the imaging of specific human drug targets. Microdosing studies are not intended to produce a pharmacologic effect. Doses are limited to less than 1/100th of the dose predicted (based on preclinical data) to produce a pharmacologic effect in humans, or a dose of less than 100 μg/subject, whichever is less. Exploratory IND-enabling preclinical safety requirements for microdosing studies are substantially less than those of the conventional IND approach. In the US, a single-dose, single-species toxicity study employing the clinical route of administration is required. Animals are observed for 14 days following administration of the single dose. Routine toxicology endpoints are collected. The objective of this toxicology study is to identify the minimally toxic dose, or alternatively, demonstrate a large margin of safety (e.g., 100x). Genotoxicity studies are not required. The EMEA, in contrast to the FDA, requires toxicology studies employing two routes of administration, intravenous and the clinical route, prior to initiating microdosing studies, and it requires genotoxicity studies (bacterial mutation and micronucleus). Exploratory IND workshops have discussed or proposed the allowance of up to five microdoses administered to each subject participating in an exploratory IND study, provided each dose does not exceed 1/100th the NOAEL or 1/100th of the anticipated pharmacologically active dose, or the total dose administered is less than 100 μg, whichever is less. In this case, doses would be separated by a washout period of at least six pharmacokinetic terminal half-lives. Fourteen-day repeat toxicology studies encompassing the predicted therapeutic dose range (but less than the MTD) have also been proposed to support expanded dosing in microdosing studies.
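The single-dose ceiling and the proposed washout interval described above reduce to simple arithmetic. The Python sketch below is illustrative only: the function names and example inputs are hypothetical, the residual-drug calculation assumes simple first-order elimination, and the code encodes only the limits as stated in this section rather than any regulatory calculation.

```python
MICRODOSE_ABSOLUTE_CAP_UG = 100.0  # 100 µg per subject, as stated in the text
WASHOUT_HALF_LIVES = 6             # proposed minimum washout between microdoses


def microdose_ceiling_ug(predicted_active_dose_ug: float) -> float:
    """Upper bound for a single microdose, in µg.

    The text limits a microdose to LESS than the smaller of
    (a) 1/100th of the predicted pharmacologically active human dose and
    (b) 100 µg per subject, so an actual dose must fall below this value.
    """
    if predicted_active_dose_ug <= 0:
        raise ValueError("predicted active dose must be positive")
    return min(predicted_active_dose_ug / 100.0, MICRODOSE_ABSOLUTE_CAP_UG)


def minimum_washout_hours(terminal_half_life_h: float) -> float:
    """Minimum spacing between repeated microdoses: six terminal half-lives."""
    return WASHOUT_HALF_LIVES * terminal_half_life_h


def fraction_remaining_after_washout() -> float:
    """Fraction of drug left after six half-lives (first-order elimination assumed)."""
    return 0.5 ** WASHOUT_HALF_LIVES


if __name__ == "__main__":
    # Hypothetical candidate: predicted active dose 5 mg, terminal half-life 4 h.
    print(microdose_ceiling_ug(5_000.0))
    print(minimum_washout_hours(4.0))
    print(fraction_remaining_after_washout())
```

For a candidate with a predicted active dose above 10 mg, the 100 μg absolute cap becomes the binding limit; below that, the 1/100th rule governs.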

Exploratory IND clinical trials designed to produce a pharmacologic effect were proposed by PhRMA in May 2004, based on a retrospective analysis of 106 drugs that supported the accelerated preclinical safety-testing paradigm. In Phase 0 studies designed to produce a pharmacologic effect, up to five compounds can be studied. The compounds must have a common drug target, but do not necessarily have to be structurally related. Healthy volunteers or minimally ill patients may receive up to 7 repeated doses in the clinic. The goal is to achieve a pharmacologic response but not to define the MTD. Preclinical safety requirements are greater than for microdosing studies. Fourteen-day repeat toxicology studies are required and conducted in rodents (i.e., rats), with full clinical and histopathology evaluation. In addition, a full safety pharmacology battery, as described by ICH S7A, is required; in other words, untoward pharmacologic effects on the cardiovascular, respiratory, and central nervous systems are characterized prior to Phase 0. Genotoxicity studies employing bacterial mutation and micronucleus assays are also required. In addition to the 14-day rodent toxicology study, a repeat dose study in a non-rodent species (typically dog) is conducted at the rat NOAEL dose. The duration of the non-rodent repeat dose study is equivalent to the duration of dosing planned for the Phase 0 trial. If toxicity is observed in the non-rodent species at the rat NOAEL, the chemical lead candidate will not proceed to Phase 0. The starting dose for Phase 0 studies is typically defined as 1/50th the rat NOAEL on a per meter squared basis. Dose escalation in these studies is terminated when: 1) a pharmacologic effect or target modulation is observed, 2) a dose equivalent (e.g., scaled to humans on a per meter squared basis) to one-fourth the rat NOAEL is reached, or 3) human systemic exposure reflected as AUC reaches ½ the AUC observed in the rat or dog in the 14-day repeat toxicology studies, whichever is less.
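The starting dose and the three stopping conditions can be expressed as a small rule set. The following Python sketch is one possible reading, not a regulatory algorithm: the class and function names are hypothetical, all doses are assumed to be already scaled to mg/m², and "whichever is less" is interpreted here as using the lower of the rat and dog AUC values.

```python
from dataclasses import dataclass


@dataclass
class Phase0Limits:
    """Preclinical inputs for a pharmacologically active Phase 0 study (hypothetical names)."""
    rat_noael_mg_per_m2: float  # rat NOAEL, scaled to mg/m^2
    rat_auc: float              # AUC in the 14-day rat study
    dog_auc: float              # AUC in the dog repeat-dose study (same units as rat_auc)

    @property
    def starting_dose(self) -> float:
        # Typical starting dose: 1/50th of the rat NOAEL on a per-m^2 basis.
        return self.rat_noael_mg_per_m2 / 50

    @property
    def dose_cap(self) -> float:
        # Escalation stops at a dose equivalent to one-fourth of the rat NOAEL.
        return self.rat_noael_mg_per_m2 / 4

    @property
    def auc_cap(self) -> float:
        # Half the animal AUC; the lower of rat and dog is used (an assumption).
        return 0.5 * min(self.rat_auc, self.dog_auc)


def should_stop(limits: Phase0Limits, dose_mg_per_m2: float,
                human_auc: float, effect_observed: bool) -> bool:
    """True as soon as any one of the three stopping conditions is met."""
    return (
        effect_observed
        or dose_mg_per_m2 >= limits.dose_cap
        or human_auc >= limits.auc_cap
    )


if __name__ == "__main__":
    # Hypothetical example values, not taken from any study.
    limits = Phase0Limits(rat_noael_mg_per_m2=200.0, rat_auc=80.0, dog_auc=60.0)
    print(limits.starting_dose, limits.dose_cap, limits.auc_cap)
    print(should_stop(limits, dose_mg_per_m2=10.0, human_auc=5.0, effect_observed=False))
```

Because escalation ends at the first condition met, a study that reaches target modulation at a low dose never approaches the NOAEL-based or exposure-based caps.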

Early phase clinical trials with terminally ill patients without therapeutic options, involving potentially promising drugs for life threatening diseases, may be studied under limited (e.g., up to 3 days dosing) conditions employing a facilitated IND strategy. As with the Phase 0 strategies described above, it is imperative that this approach be defined in partnership with the FDA prior to implementation.

The reduced preclinical safety requirements are scaled to the goals, duration, and scope of Phase 0 studies. Phase 0 strategies have merit when the initial clinical experience is not driven by toxicity, when pharmacokinetics are a primary determinant in selection from a group of chemical lead candidates (and a bioanalytical method is available to quantify drug concentrations at microdoses), when pharmacodynamic endpoints in surrogate (e.g., blood) or tumor tissue are of primary interest, or to assess PK/PD relationships (e.g., receptor occupancy studies employing PET scanning).

PhRMA conducted a pharmaceutical industry survey in 2007 to characterize the industry’s perspective on the current and future utility of exploratory IND studies (3). Of the 16 firms that provided survey responses, 56% indicated they had either executed or were planning to execute exploratory IND development strategies. The authors concluded that the merits of exploratory INDs continue to be debated; however, this approach provides a valuable option for advancing drugs to the clinic.

There are limitations to the exploratory IND approach. Doses employed in Phase 0 studies might not be predictive of doses over the human dose range (up to the maximum tolerated dose). Phase 0 studies in patients raise ethical issues compared to conventional Phase I, in that escalation into a pharmacologically active dose range might not be possible under the exploratory IND guidance. The Phase 0 strategy is designed to kill drugs early that are likely to fail based on PK or PK/PD. Should Phase 0 lead to a “Go” decision, however, a conventional IND is required for subsequent clinical trials, adding cost and time. Perhaps one of the most compelling arguments for employing an exploratory IND strategy is in the context of characterizing tissue distribution (e.g., receptor occupancy following PET studies) after microdosing.

Section 5. Orphan Drug Designation

Development programs for cancer drugs are often much more complex as compared to drugs used to treat many other indications. This complexity often results in extended development and approval timelines. In addition, oncology patient populations are often much smaller by comparison to other more prevalent indications. These factors (e.g., limited patent life and smaller patient populations) often complicate commercialization strategies and can, ultimately, make it more difficult to provide patient access to important new therapies.

To help manage and expedite the commercialization of drugs used to treat rare diseases, including many cancers, the Orphan Drug Act was signed into law in 1983. This law provides incentives to help sponsors and investigators develop new therapies for diseases and conditions of fewer than 200,000 cases per year, allowing for more realistic commercialization.

The specific incentives for orphan-designated drugs are as follows:

  • Seven years of exclusive marketing rights to the sponsor of a designated orphan drug product for the designated indication once approval to market has been received from the FDA
  • A credit against tax for qualified clinical research expenses incurred in developing a designated orphan product
  • Eligibility to apply for specific orphan drug grants

A sponsor may request orphan drug designation for:

  • A previously unapproved drug
  • A new indication for a marketed drug
  • A drug that already has orphan drug status—if the sponsor is able to provide valid evidence that their drug may be clinically superior to the first drug

A sponsor, investigator, or an individual may apply for orphan drug designation prior to establishing an active clinical program or can apply at any stage of development (e.g., Phase 1–3). If orphan drug designation is granted, clinical studies to support the proposed indication are required. A drug is not given orphan drug status, and thus marketing exclusivity, until the FDA approves a marketing application. Orphan drug status is granted to the first sponsor to obtain FDA approval and not necessarily the sponsor originally submitting the orphan drug designation request.

There is no formal application for an orphan drug designation. However, the regulations (e.g., 21 CFR 316) identify the components to be included. An orphan drug designation request is typically a five- to ten-page document with appropriate literature references appended to support the prevalence statements of fewer than 200,000 cases/year. The orphan drug designation request generally includes:

  • The specific rare disease or condition for which orphan drug designation is being requested
  • Sponsor contact, drug names, and sources
  • A description of the rare disease or condition with a medically plausible rationale for any patient subset type of approach
  • A description of the drug and the scientific rationale for the use of the drug for the rare disease or condition
  • A summary of the regulatory status and marketing history of the drug
  • Documentation (for a treatment indication for the disease or condition) that the drug will affect fewer than 200,000 people in the United States (prevalence)
  • Documentation (for a prevention indication [or a vaccine or diagnostic drug] for the disease or condition) that the drug will affect fewer than 200,000 people in the United States per year (incidence)
  • Alternatively, a rationale may be provided for why there is no reasonable expectation that costs of research and development of the drug for the indication can be recovered by sales of the drug in the United States

Following receipt of the request, the FDA Office of Orphan Products Development (OOPD) will acknowledge receipt of the orphan drug designation request. The official response will typically be provided within 1 to 3 months following submission. Upon notification of granting an orphan drug designation, the name of the sponsor and the proposed rare disease or condition will be published in the Federal Register as part of the public record. The complete orphan drug designation request is placed in the public domain once the drug has received marketing approval, in accordance with the Freedom of Information Act.

Finally, the sponsor of an orphan-designated drug must provide annual updates that contain a brief summary of any ongoing or completed nonclinical or clinical studies, a description of the investigational plan for the coming year, any anticipated difficulties in development, testing, and marketing, and a brief discussion of any changes that may affect the orphan drug status of the product.

While many authors have described the general guidelines for drug development (4,5, etc.), no one has outlined the process of developing drugs in an academic setting. It is well known that the propensity for late-stage failures has led to a dramatic increase in the overall cost of drug development over the last 15 years. It is also commonly accepted that the best way to prevent late-stage failures is by increasing scientific rigor in the discovery, preclinical, and early clinical stages. Where many authors present drug discovery as a single monolithic process, we intend to reflect here that there are multiple decision points contained within this process.

An alternative approach is the exploratory IND (Phase 0) under which the endpoint is proof of principle demonstration of target inhibition ( 6 ). This potentially paradigm-shifting approach might dramatically improve the probability of late stage success and may offer additional opportunities for academic medical centers to become involved in drug discovery and development.

Literature Cited

Additional references.

  • Eckstein, Jens. ISOA/ARF Drug Development Tutorial .

Behind each Decision Point are decision-making criteria that are defined in detail later in this chapter.

All Assay Guidance Manual content, except where otherwise noted, is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported license (CC BY-NC-SA 3.0), which permits copying, distribution, transmission, and adaptation of the work, provided the original work is properly cited and not used for commercial purposes. Any altered, transformed, or adapted form of the work may only be distributed under the same or similar license to this one.

  • Cite this Page Strovel J, Sittampalam S, Coussens NP, et al. Early Drug Discovery and Development Guidelines: For Academic Researchers, Collaborators, and Start-up Companies. 2012 May 1 [Updated 2016 Jul 1]. In: Markossian S, Grossman A, Arkin M, et al., editors. Assay Guidance Manual [Internet]. Bethesda (MD): Eli Lilly & Company and the National Center for Advancing Translational Sciences; 2004-.

Drug Discovery

386 papers with code • 28 benchmarks • 25 datasets

Drug discovery is the task of applying machine learning to discover new candidate drugs.

Most implemented papers

Semi-supervised classification with graph convolutional networks.

We present a scalable approach for semi-supervised learning on graph-structured data that is based on an efficient variant of convolutional neural networks which operate directly on graphs.

Neural Message Passing for Quantum Chemistry

Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science.

Gated Graph Sequence Neural Networks

Graph-structured data appears frequently in domains including chemistry, natural language semantics, social networks, and knowledge bases.

Self-Normalizing Neural Networks

We introduce self-normalizing neural networks (SNNs) to enable high-level abstract representations.

Junction Tree Variational Autoencoder for Molecular Graph Generation

We evaluate our model on multiple tasks ranging from molecular generation to optimization.

Convolutional Networks on Graphs for Learning Molecular Fingerprints

We introduce a convolutional neural network that operates directly on graphs.

PhysNet: A Neural Network for Predicting Energies, Forces, Dipole Moments and Partial Charges

Further, two new datasets are generated in order to probe the performance of ML models for describing chemical reactions, long-range interactions, and condensed phase systems.

Molecule Attention Transformer

Designing a single neural network architecture that performs competitively across a range of molecule property prediction tasks remains largely an open challenge, and its solution may unlock a widespread use of deep learning in the drug discovery industry.

Fast and Uncertainty-Aware Directional Message Passing for Non-Equilibrium Molecules

Many important tasks in chemistry revolve around molecules during reactions.

Generating Focussed Molecule Libraries for Drug Discovery with Recurrent Neural Networks

In de novo drug design, computational strategies are used to generate novel molecules with good affinity to the desired biological target.

  • Published: 26 April 2023

Computational approaches streamlining drug discovery

  • Anastasiia V. Sadybekov   ORCID: orcid.org/0000-0003-3925-983X 1 , 2 &
  • Vsevolod Katritch   ORCID: orcid.org/0000-0003-3883-4505 1 , 2 , 3  

Nature volume  616 ,  pages 673–685 ( 2023 ) Cite this article

  • Cheminformatics
  • Virtual screening

Computer-aided drug discovery has been around for decades, although the past few years have seen a tectonic shift towards embracing computational technologies in both academia and pharma. This shift is largely defined by the flood of data on ligand properties and binding to therapeutic targets and their 3D structures, abundant computing capacities and the advent of on-demand virtual libraries of drug-like small molecules in their billions. Taking full advantage of these resources requires fast computational methods for effective ligand screening. This includes structure-based virtual screening of gigascale chemical spaces, further facilitated by fast iterative screening approaches. Highly synergistic are developments in deep learning predictions of ligand properties and target activities in lieu of receptor structure. Here we review recent advances in ligand discovery technologies, their potential for reshaping the whole process of drug discovery and development, as well as the challenges they encounter. We also discuss how the rapid identification of highly diverse, potent, target-selective and drug-like ligands to protein targets can democratize the drug discovery process, presenting new opportunities for the cost-effective development of safer and more effective small-molecule treatments.

Despite amazing progress in basic life sciences and biotechnology, drug discovery and development (DDD) remain slow and expensive, taking on average approximately 15 years and approximately US$2 billion to make a small-molecule drug 1 . Although it is accepted that clinical studies are the priciest part of the development of each drug, most time-saving and cost-saving opportunities reside in the earlier discovery and preclinical stages. Preclinical efforts themselves account for more than 43% of expenses in pharma, in addition to major public funding 1 , driven by the high attrition rate at every step from target selection to hit identification and lead optimization to the selection of clinical candidates. Moreover, the high failure rate in clinical trials (currently 90%) 2 is largely explained by issues rooted in early discovery such as inadequate target validation or suboptimal ligand properties. Finding fast and accessible ways to discover more diverse pools of higher-quality chemical probes, hits and leads with optimal absorption, distribution, metabolism, excretion and toxicology (ADMET) and pharmacokinetics (PK) profiles at the early stages of DDD would improve outcomes in preclinical and clinical studies and facilitate more effective, accessible and safer drugs.

The concept of computer-aided drug discovery 3 was developed in the 1970s and popularized by Fortune magazine in 1981, and has since been through several cycles of hype and disillusionment 4 . There have been success stories along the way 5 and, in general, computer-assisted approaches have become an integral, yet modest, part of the drug discovery process 6 , 7 . In the past few years, however, several scientific and technological breakthroughs resulted in a tectonic shift towards embracing computational approaches as a key driving force for drug discovery in both academia and industry. Pharmaceutical and biotech companies are expanding their computational drug discovery efforts or hiring their first computational chemists. Numerous new and established drug discovery companies have raised billions in the past few years with business models that heavily rely on a combination of advanced physics-based molecular modelling with deep learning (DL) and artificial intelligence (AI) 8 . Although it is too early yet to expect approved drugs from the most recent computationally driven discovery efforts, they are producing a growing number of clinical candidates, with some campaigns specifically claiming target-to-lead times as low as 1–2 months 9 , 10 , or target-to-clinic time under 1 year 11 . Are these the signs of a major shift in the role that computational approaches have in drug discovery or just another round of the hype cycle?

Let us look at the key factors defining the recent changes (Fig. 1 ). First, the structural revolution—from automation in crystallography 12 to microcrystallography 13 , 14 and most recently cryo-electron microscopy technology 15 , 16 —has made it possible to reveal 3D structures for the majority of clinically relevant targets, often in a state or molecular complex relevant to its biological function. Especially impressive has been the recent structural turnaround for G protein-coupled receptors (GPCRs) 17 and other membrane proteins that mediate the action of more than 50% of drugs 18 , providing 3D templates for ligand screening and lead optimization. The second factor is a rapid and marked expansion of drug-like chemical space, easily accessible for hit and lead discovery. Just a few years ago, this space was limited to several million on-shelf compounds from vendors and in-house screening libraries in pharma. Now, screening can be done with ultra-large virtual libraries and chemical spaces of drug-like compounds, which can be readily made on-demand, rapidly growing beyond billions of compounds 19 , and even larger generative spaces with theoretically predicted synthesizability (Box 1 ). The third factor involves emerging computational approaches that strive to take full advantage of the abundance of 3D structures and ligand data, supported by the broad availability of cloud and graphics processing unit (GPU) computing resources to support these methods at scale. This includes structure-based virtual screening of ultra-large libraries 20 , 21 , 22 , using accelerated 23 , 24 , 25 and modular 26 screening approaches, as well as recent growth of data-driven machine learning (ML) and DL methods for predicting ADMET and PK properties and activities 27 .

Fig. 1

a , More than 200,000 protein structures in the PDB, plus private collections, have more than 90% of protein families covered with high-resolution X-ray and more recently cryo-electron microscopy structures, often in distinct functional states, with remaining gaps also filled by homology or AlphaFold2 models. b , The chemical space available for screening and fast synthesis has grown from about 10^7 on-shelf compounds in 2015 to more than 3 × 10^10 on-demand compounds in 2022, and can be rapidly expanded beyond 10^15 diverse and novel compounds. c , Computational methods for VLS include advances in fast flexible docking, modular fragment-based algorithms, DL models and hybrid approaches. d , Computational tools are supported by rapid growth of affordable cloud computing, GPU acceleration and specialized chips.

Although the impacts of the recent structural revolution 17 and computing hardware in drug discovery 28 are comprehensively reviewed elsewhere, here we focus on the ongoing expansion of accessible drug-like chemical spaces as well as current developments in computational methods for ligand discovery and optimization. We detail how emerging computational tools applied in gigaspace can facilitate the cost-effective discovery of hundreds or even thousands of highly diverse, potent, target-selective and drug-like ligands for a desired target, and put them in the context of experimental approaches (Table 1 ). Although the full impact of new computational technologies is only starting to affect clinical development, we suggest that their synergistic combination with experimental testing and validation in the drug discovery ecosystem can markedly improve its efficiency in producing better therapeutics.

Box 1
 Types of chemical libraries and spaces for drug discovery

Pharma companies amass collections of compounds for screening in-house, whereas in-stock collections from vendors (see the figure, part a ) allow fast (less than 1 week) delivery, contain unique and advanced chemical scaffolds, are easily searchable and are HTS compatible. However, the high cost of handling physical libraries, their slow linear growth, limited size and novelty constrain their applications.

More recently, virtual on-demand chemical databases (fully enumerated) and spaces (not enumerated) allow fast parallel synthesis from available building blocks, using validated or optimized protocols, with synthetic success of more than 80% and delivery in 2–3 weeks (see the figure, part b ). The virtual chemical spaces assure high chemical novelty and allow fast polynomial growth with the addition of new synthons and reaction scaffolds, including 4+ component reactions. Examples include Enamine REAL, Galaxy by WuXi, CHEMriya by Otava and private databases and spaces at pharmaceutical companies.

Generative spaces, unlike on-demand spaces, comprise theoretically possible molecules and collectively could comprise all chemical space (see the figure, part c ). Such spaces are limited only by theoretical plausibility, estimated at 10^23–10^60 drug-like compounds. Although they allow comprehensive space coverage, the reaction path and success rate of generated compounds are unknown, and thus require computational prediction of their practical synthesizability. Examples of generative spaces and their subsets include GDB-13, GDB-17, GDB-18 and GDBChEMBL.

Expansion of accessible chemical space

Why bigger is better.

The limited size and diversity of screening libraries have long been a bottleneck for detection of novel potent ligands and for the whole process of drug discovery. An average ‘affordable’ high-throughput screening (HTS) campaign 29 uses screening libraries of about 50,000–500,000 compounds and is expected to yield only a few true hits after secondary validation. Those hits, if any, are usually rather weak, non-selective, have suboptimal ADMET and PK properties and unknown binding mode, so their discovery entails years of painstaking trial-and-error optimization efforts to produce a lead molecule with satisfying potency and all the other requirements for preclinical development. Scaling of HTS to a few million compounds can be afforded only in big pharma, and it still does not make that much difference in terms of the quality of resulting hits. Likewise, virtual libraries that use in silico screening were traditionally limited to a collection of compounds available in stock from vendors, usually comprising fewer than 10 million unique compounds, therefore the scale advantage over HTS was marginal.

Although chasing the full coverage of the enormous drug-like chemical space (estimated at more than 10^63 compounds) 30 is a futile endeavour, expanding the screening of on-demand libraries by several orders of magnitude to billions and more of previously unexplored drug-like compounds, either physical or virtual, is expected to change the drug discovery model in several ways. First, it can proportionally increase the number of potential hits in the initial screening 31 (Fig. 2 ). This abundance of ligands in the library also increases the chances of identification of more potent or selective ligands, as well as ligands with better physicochemical properties. This has been demonstrated in ultra-large virtual screening campaigns for several targets, revealing highly potent ligands with affinities often in the mid-nanomolar to sub-nanomolar range 20 , 21 , 22 , 23 , 26 . Second, the accessibility of hit analogues in the same on-demand spaces streamlines a generation of meaningful structure–activity relationship (SAR)-by-catalogue and further optimization steps, reducing the amount of elaborate custom synthesis. Last, although the library scale is important, properly constructed gigascale libraries can expand chemical diversity (even with a few chemical reactions 32 ), chemical novelty and patentability of the hits, as almost all on-demand compounds have never been synthesized before.

figure 2

The red curves in log scale illustrate the distribution of screening hits with binding scores better than X for libraries of 10 billion, 100 million and 1 million compounds, as estimated from previous VLS and V-SYNTHES screening campaigns. The blue curves illustrate the approximate dependence of the experimental hit rate on the predicted docking score for 10-µM, 1-µM and 100-nM thresholds 20 . This analysis (semi-quantitative, as it varies from target to target) suggests that screening of more than 100 million compounds lifts the limitations of smaller libraries, extending the tail of the hit distribution towards better binding scores with high hit rates, and allowing for identification of proportionally more experimental hits with higher affinity. Note also two important factors justifying further growth of screening libraries to 10 billion and more: (1) the candidate hits for synthesis and experimental testing are usually picked as a result of target-dependent post-processing of several thousands of top-scoring compounds, which selects for novelty, diversity, drug likeness and often interactions with specific receptor residues. Thus, the more good-scoring compounds that are identified, the better overall selection can be made. (2) Saturation of the hit rate curves at best scores is not a universal rule but a result of the limited accuracy of fast scoring functions used in screening. Using more accurate docking or scoring approaches (flexible docking, quantum mechanical and free energy perturbation) in the post-processing step can extend a meaningful correlation of binding score with affinity further left (grey dashed curves), potentially bringing even more high-affinity hits for gigascale chemical spaces.
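The scaling argument in this caption can be illustrated with a toy calculation. The sketch below assumes, purely for illustration, that docking scores across a library follow a normal distribution; the mean and standard deviation are hypothetical values in arbitrary score units, not parameters from the cited campaigns, and real score distributions vary from target to target. It estimates how many compounds are expected to score better (more negative) than a threshold X for libraries of 1 million, 100 million and 10 billion compounds.

```python
import math


def expected_count_below(library_size, threshold, mean=-20.0, sd=5.0):
    """Expected number of compounds scoring at or below `threshold`.

    Assumes scores are normally distributed with the given (hypothetical)
    mean and standard deviation; lower scores mean better predicted binding.
    """
    z = (threshold - mean) / sd
    # Normal CDF written with the complementary error function.
    tail_fraction = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return library_size * tail_fraction


if __name__ == "__main__":
    for size in (10**6, 10**8, 10**10):
        # Illustrative threshold far in the favourable tail of the distribution.
        count = expected_count_below(size, threshold=-40.0)
        print(f"library of {size:.0e} compounds: ~{count:.1f} expected below -40")
```

Under this model the expected count at any fixed threshold grows in direct proportion to library size, so a threshold that is effectively empty in a million-compound library can still hold thousands of candidates at gigascale. Whether those candidates are true binders depends on the accuracy of the scoring function, which this sketch does not model.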

Physical libraries

Several approaches have been developed recently to push the library size limits in HTS, including combinatorial chemistry and large-scale pooling of the compounds for parallel assays. For example, affinity-selection mass spectrometry techniques can be applied to identify binders directly in pools of thousands of compounds 33 without the need for labelling. DNA-encoded libraries (DELs) and cost-effective approaches to generate and screen them have also been developed 34 , making it possible to work with as many as approximately 10^10 compounds in a single test tube 35 . These methods have their own limitations; as DELs are created by tagging ligands with unique DNA sequences through a linker, DNA conjugation limits the chemistries possible for the combinatorial assembly of the library. Screening of DELs may also yield a large number of false negatives by blocking important moieties for binding and, more importantly, false positives by nonspecific binding of DNA labels, so expensive off-DNA resynthesis of hit compounds is needed for their validation. To avoid this resynthesis, it has been suggested to use ML models trained on DEL results for each target to predict drug-like ligands from on-demand chemical spaces, as described in ref. 36 .

Virtual on-demand libraries

In silico screening of virtual libraries by fast computational approaches has long been touted as a cost-effective way to overcome the limitations of physical libraries. Only recently, however, have synthetic chemistry and cheminformatics approaches been developed to break out of these limits and construct virtual on-demand libraries that explore much larger chemical space, as reviewed in refs. 37 , 38 . In 2017, the readily accessible (REAL) database by Enamine 19 , 39 became the first commercially available on-demand library based on the robust reaction principle 40 , whereas the US National Institutes of Health developed synthetically accessible virtual inventory (SAVI) 41 , which also uses Enamine building blocks. The REAL database uses carefully selected and optimized parallel synthesis protocols and a curated collection of in-stock building blocks, making it possible to guarantee the fast (less than 4 weeks), reliable (80% success rate) and affordable synthesis of a set of compounds 21 . Driven by new reactions and diverse building blocks, the fully enumerated REAL database has grown from approximately 170 million compounds in 2017 to more than 5.5 billion compounds in 2022 and comprises the bulk of the popular ZINC20 virtual screening database 42 . The practical utility of the REAL database has been recently demonstrated in several major prospective screening campaigns 20 , 21 , 23 , 24 , some of them taking further hit optimization steps in the same chemical space, yielding selective nanomolar and even sub-nanomolar ligands without any custom synthesis 20 , 21 . Similar ultra-large virtual libraries (that is, GalaXi ( http://www.wuxiapptec.com ) and CHEMriya ( http://chemriya.com )) are available commercially, although their synthetic success rates are yet to be published.

Virtual chemical spaces

The modular nature of on-demand virtual libraries supports further growth by the addition of reactions and building blocks. However, building, maintaining and searching fully enumerated chemical libraries comprising more than a few billion compounds become slow and impractical. Such gigascale virtual libraries are therefore usually maintained as non-enumerated chemical spaces, defined by a specific set of building blocks and reactions (or transforms), as comprehensively reviewed in ref. 38 . Within pharma, one of the first published examples includes PGVL by Pfizer 37 , 43 , the most recent version of which uses a set of 1,244 reactions and in-house reagents to account for 10^14 compounds. Other biopharma companies have their own virtual chemical spaces 38 , 44 , although their details are often not in the public domain. Among commercially available chemical spaces, GalaXi Space by WuXi (approximately 8 billion compounds), CHEMriya by Otava (11.8 billion compounds) and Enamine REAL Space (36 billion compounds) 45 are among the largest and most established. In addition to their enormous sizes, these virtual spaces are highly novel and diverse, and have minimal overlap (less than 10%) between each other 46 . Currently, the largest commercial space, Enamine REAL Space, is an extension to the REAL database that maintains the same synthetic speed, rate and cost guarantees, covering more than 170 reactions and more than 137,000 building blocks (Box 1 ). Most of these reactions are two-component or three-component, but more four-component or even five-component reactions are being explored, enabling higher-order combinatorics. This space can be easily expanded to 10^15 compounds based on available reactions and extended building block sets, for example, 680 million make-on-demand (MADE) building blocks 47 , although synthesis of such compounds involves more steps and is more expensive.
To represent and navigate combinatorial chemical spaces without their full enumeration, specialized cheminformatics tools have been developed, from fragment-based chemical similarity searches 48 to more elaborate 3D molecular similarity search methods based on atomic property fields such as rapid isostere discovery engine (RIDE) 38 .
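To make the idea of a non-enumerated chemical space concrete, here is a minimal sketch. It assumes a toy representation in which each reaction is a list of component slots and each slot lists its compatible building blocks. The reaction names, building-block labels and the `toy_space` layout are hypothetical placeholders, not the format used by REAL Space, PGVL or any vendor. The sketch counts the products implied by the space and retrieves any single product by index, without ever building the full product list.

```python
from math import prod

# Hypothetical toy space: reaction name -> one list of building blocks per component slot.
toy_space = {
    "two_component_coupling": [
        ["acid_1", "acid_2", "acid_3"],
        ["amine_1", "amine_2"],
    ],
    "three_component_reaction": [
        ["a1", "a2"],
        ["b1", "b2"],
        ["c1", "c2", "c3"],
    ],
}


def reaction_size(slots):
    """Number of products of one reaction: the product of its slot sizes."""
    return prod(len(slot) for slot in slots)


def space_size(space):
    """Total number of products across all reactions, computed without enumeration."""
    return sum(reaction_size(slots) for slots in space.values())


def product_at(slots, index):
    """Return the building-block combination at `index` by mixed-radix decoding."""
    if not 0 <= index < reaction_size(slots):
        raise IndexError("index outside this reaction's product range")
    chosen = []
    for slot in reversed(slots):
        index, position = divmod(index, len(slot))
        chosen.append(slot[position])
    return tuple(reversed(chosen))


if __name__ == "__main__":
    print(space_size(toy_space))  # 3*2 + 2*2*3 products
    print(product_at(toy_space["two_component_coupling"], 5))
```

Because each reaction contributes the product of its slot sizes, a reaction with k components and n building blocks per slot contributes n^k compounds. This is why adding building blocks or moving to four- and five-component reactions grows the space so quickly, and why storing only the reactions and building blocks is far cheaper than storing every product.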

An alternative approach to building chemical spaces generates hypothetically synthesizable compounds following simple rules of synthetic feasibility and chemical stability. Thus, the generated databases (GDB) predict compounds that can be made of a specific number of atoms; for example, GDB-17 contained 166.4 billion molecules of up to 17 atoms of C, N, O, S and halogens 49 , whereas GDB-18 made up of 18 atoms would reach an estimated 10^13 compounds 38 . Other generative approaches based on narrower definitions of chemical spaces are now used in de novo ligand design with DL-based generative chemistry (for example, ref. 50 ), as discussed below.

Although the synthetic success rates for some of the commercial on-demand chemical spaces (for example, Enamine REAL Space) have been thoroughly validated 20 , 21 , 22 , 23 , 24 , 26 , 42 , synthetic accessibilities and success rates of other chemical spaces remain unpublished 38 . These are important metrics for the practical sustainability of on-demand synthesis because reduced success rates or unreasonable time and cost would diminish its advantage over custom synthesis.

Computational approaches to drug design

Challenges of gigascale screening.

Chemical spaces of gigascale and terascale, provided that they maintain high drug likeness and diversity, are expected to harbour millions of potential hits and thousands of potential lead series for any target. Moreover, their highly tractable robust synthesis simplifies any downstream medicinal chemistry efforts towards final drug candidates.

Dealing with such virtual libraries, however, calls for new computational approaches that meet special requirements for both speed and accuracy. They have to be fast enough to handle gigascale libraries. If docking of a compound takes 10 s per CPU core, it would take more than 3,000 years to screen 10^10 compounds on a single CPU core, or cost approximately US$1 million on a computing cloud at the cheapest CPU rates. At the same time, gigascale screening must be extremely accurate, safeguarding against false-positive hits that effectively cheat the scoring function by exploiting its holes and approximations 31 . Even a one-in-a-million rate of false positives in a 10^10 compound library would comprise 10,000 false hits, which may flood out any hit candidate selection. The artefact rate and nature may depend on the target and screening algorithms and should be carefully addressed in screening and post-processing. Although there is no one simple solution for such artefacts, some practical and reasonably cost-effective remedies include: (1) selection based on the consensus of two different scoring functions, (2) selection of highly diverse hits (many artefacts cluster to similar compounds), (3) hedging the bets from several ranges of scores 31 and (4) manually curating the final list of compounds for any unusual interactions. Ultimately, it is highly desirable to fix as many remaining ‘holes in the scoring functions’ as possible, and reoptimize them for high selectivity in the range of scores where the top true hits of gigaspace are found. Missing some hits in screening (false negatives) would be well tolerated because of the huge number of potential hits in the 10^10 space (for example, losing 50% of a million potential hits is perfectly fine), so some trade-off in score sensitivity is acceptable.
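The back-of-the-envelope figures in this paragraph can be reproduced with a few lines of arithmetic. In the sketch below, the docking time (10 s per compound per core), the library size (10^10 compounds) and the one-in-a-million false-positive rate come from the text. The cloud price of US$0.036 per core-hour is an assumption, chosen only because it reproduces the roughly US$1 million estimate; it is not a quoted rate from any provider.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def screening_cost(n_compounds, seconds_per_compound, usd_per_core_hour):
    """Return (single-core years, cloud cost in USD) for a brute-force docking screen."""
    core_seconds = n_compounds * seconds_per_compound
    core_years = core_seconds / SECONDS_PER_YEAR
    cost_usd = core_seconds / 3600 * usd_per_core_hour
    return core_years, cost_usd


def expected_false_positives(n_compounds, one_in):
    """Expected number of false hits when one compound in `one_in` is an artefact."""
    return n_compounds / one_in


if __name__ == "__main__":
    # usd_per_core_hour is an assumed illustrative price, not a quoted rate.
    years, cost = screening_cost(10**10, 10, usd_per_core_hour=0.036)
    print(f"{years:,.0f} single-core years, about US${cost:,.0f}")
    print(f"{expected_false_positives(10**10, 10**6):,.0f} expected false positives")
```

The same functions make the trade-offs easy to explore: halving the per-compound docking time halves both the core-years and the cost, whereas the number of false positives depends only on the library size and the artefact rate of the scoring function.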

The major types of computational approaches to screening a protein target for potential ligands are summarized in Table 2 . Below, we discuss some emerging technologies and how they can best fit into the overall DDD pipeline to take full advantage of growing on-demand chemical spaces.

Receptor structure-based screening

In silico screening by docking molecules of the virtual library into a receptor structure and predicting its ‘binding score’ is a well-established approach to hit and lead discovery and had a key role in recent drug discovery success stories 11 , 17 , 51 . The docking procedure itself can use molecular mechanics, often in internal coordinate representation, for rapid conformational sampling of fully flexible ligands 52 , 53 , using empirical 3D shape-matching approaches 54 , 55 , or combining them in a hybrid docking funnel 56 , 57 . Special attention is devoted to ligand scoring functions, which are designed to reliably remove non-binders to minimize false-positive predictions, which is especially relevant with the growth of library size. Blind assessments of the performance of structure-based algorithms have been routinely performed as a D3R Grand Challenge community effort 58 , 59 , showing continuous improvements in ligand pose and binding energy predictions for the best algorithms.

Results of the many successful structure-based prospective screening campaigns have been published over the years covering all major classes of targets, most recently GPCRs, as reviewed in refs. 17 , 51 , 60 , whereas countless more have been used in industry. The focused candidate ligand sets, predicted by such screening, often show useful (10–40%) hit rates in experimental testing 60 , yielding novel hits for many targets with potencies in the 0.1–10-μM range (for those that are published, at least). Further steps in optimization of the initial hits obtained from standard screening libraries of less than 10 million compounds, however, usually require expensive custom synthesis of analogues, which has been afforded only in a few published cases 20 , 61 .

Identification of hits directly in much larger chemical spaces such as REAL Space not only can bring more and better hits 31 but also supports their optimization, as any resulting hit has thousands of analogues and derivatives in the same on-demand space. This advantage was especially helpful for such challenging targets as SARS-CoV-2 main protease (Mpro), for which hundreds of standard virtual ligand screening (VLS) attempts came up empty-handed 62 (see discussion on Mpro challenges in ‘Hybrid in vitro–in silico approaches’ below). Although the initial hit rates were low even in the ultra-large screens, VirtualFlow 24 screening of the REAL database with 1.4 billion compounds still identified hits in the 10–100-μM range, which were optimized via on-demand synthesis 63 to yield quality leads with the best compound Z222979552 (half maximal inhibitory concentration (IC50) = 1.0 μM). Another ultra-large screen of 235 million compounds, based on a newer Mpro structure with a non-covalent inhibitor (Protein Data Bank (PDB) ID: 6W63), also produced viable hits, fast optimization of which resulted in the discovery of nanomolar Mpro inhibitors in just 4 months by a combination of on-demand and simple custom chemistry 64 . The best compound in this work had good in vitro ADMET properties, with an affinity of 38 nM and a cell-based antiviral potency of 77 nM, which are comparable to clinically used PF-07321332 (nirmatrelvir) 65 .

With increasing library sizes, the computational time and cost of docking itself become the main bottleneck in screening, even with massively parallel cloud computing 60 . Iterative approaches have been recently suggested to tackle libraries of this size; for example, VirtualFlow used stepwise filtering of the whole library with docking algorithms of increasing accuracy to screen approximately 1.4 billion Enamine REAL compounds 23 , 24 . Although improving speed several-fold, the method still requires a fully enumerated library and its computational cost grows linearly with the number of compounds, limiting its applicability in rapidly expanding chemical spaces.

Modular synthon-based approaches

The idea of designing molecules from a limited set of fragments to optimally fill the receptor binding pocket has been entertained from the early years of drug discovery, implemented, for example, in the LUDI algorithm 66 . However, custom synthesis of the designed compounds remained the major bottleneck of such approaches. The recently developed virtual synthon hierarchical enumeration screening (V-SYNTHES) 26 technology applies fragment-based design to on-demand chemical spaces, thus avoiding the challenges of custom synthesis (Fig. 3 ). Starting with the catalogue of REAL Space reactions and building blocks (synthons), V-SYNTHES first prepares a minimal library of representative chemical fragments by fully enumerating synthons at one of the attachment points, capping the other position (or positions) with a methyl or phenyl group. Docking-based screening then allows selection of the top-scoring fragments (for example, the top 0.1%) that are predicted to bind well into the target pocket. This is repeated for a second position (and then third and fourth positions, if available), and the resulting focused libraries are screened at each iteration against the target pocket. At the final step, the top approximately 50,000 full compounds from REAL Space are docked with more elaborate and accurate docking parameters or methods, and the top-ranking candidates are filtered for novelty, diversity and variety of desired drug-like properties. In post-processing, the best 50–500 compounds are selected for synthesis and testing. Our assessment suggests that combining synthons with the scaffolds and capping them with dummy minimal groups in the V-SYNTHES algorithm is a critical requirement for optimal fragment predictions because reactive groups of building blocks and scaffolds often create strong, yet false, interactions that are not present in the full molecule. Another important part of the algorithm is the evaluation of the fragment-binding pose in the target, which prioritizes those hits with minimal caps pointed into a region of the pocket where the fragment has space to grow.
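The hierarchical logic described above can be illustrated with a toy Python sketch for a two-component reaction. It is not the published V-SYNTHES implementation: the additive `toy_dock` score, the synthon names and the 10% selection fraction are all hypothetical stand-ins, and pose evaluation and property filtering are omitted. Because the toy score is additive, the hierarchical search is guaranteed to recover the best full compound here; real docking scores offer no such guarantee.

```python
import random

CAP = "cap"  # stands in for the minimal methyl or phenyl capping group


def toy_dock(r1: str, r2: str, contrib: dict) -> float:
    """Hypothetical docking score (lower is better); NOT a real scoring function."""
    return contrib[r1] + contrib[r2]


def best_scoring(scored: list, k: int) -> list:
    """Return the k best (lowest-scoring) entries of a list of (item, score) pairs."""
    return sorted(scored, key=lambda pair: pair[1])[:k]


def hierarchical_screen(r1_synthons, r2_synthons, contrib, fraction=0.1, n_final=5):
    # Dock every synthon at one position with the other position capped.
    fragments = [((r1, CAP), toy_dock(r1, CAP, contrib)) for r1 in r1_synthons]
    fragments += [((CAP, r2), toy_dock(CAP, r2, contrib)) for r2 in r2_synthons]
    kept = best_scoring(fragments, max(1, int(len(fragments) * fraction)))

    # Replace the cap of each kept fragment with every available synthon.
    grown = set()
    for (r1, r2), _ in kept:
        if r2 == CAP:
            grown.update((r1, partner) for partner in r2_synthons)
        else:
            grown.update((partner, r2) for partner in r1_synthons)

    # Dock only the focused library of fully enumerated compounds.
    full = [(pair, toy_dock(*pair, contrib)) for pair in sorted(grown)]
    n_docked = len(fragments) + len(full)
    return best_scoring(full, n_final), n_docked


rng = random.Random(0)
r1_synthons = [f"A{i}" for i in range(100)]
r2_synthons = [f"B{i}" for i in range(100)]
contrib = {name: rng.gauss(0.0, 1.0) for name in r1_synthons + r2_synthons}
contrib[CAP] = 0.0

hits, n_docked = hierarchical_screen(r1_synthons, r2_synthons, contrib)
print(hits[0], n_docked, "docked of", len(r1_synthons) * len(r2_synthons), "possible")
```

In this toy setting at most 2,200 docking evaluations are needed to search 10,000 possible compounds, and the saving grows with the number of synthons per position.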

Fig. 3: An overview of the V-SYNTHES algorithm allowing effective screening of more than 31 billion compounds in REAL Space or even larger chemical spaces, while performing enumeration and docking of only small fractions of molecules. The algorithm, illustrated here using a two-component reaction based on a sulfonamide scaffold with R1 and R2 synthons, can be applied to hundreds of optimized two-component, three-component or more-component reactions by iteratively repeating steps 3 and 4 until fully enumerated molecules optimally fitting the target pocket are obtained. PAINS, pan assay interference compounds.

Initially applied to discover new chemotypes for cannabinoid receptor CB2 antagonists, V-SYNTHES has shown a hit rate of 23% for submicromolar ligands, which exceeded the hit rate of standard VLS by fivefold, while taking about 100 times less computational resources 26 . A similar hit rate was found for the ROCK1 kinase screening in the same study, with one hit in the low nanomolar range 26 . V-SYNTHES is being applied to other therapeutically relevant targets with well-defined pocket structures.

A similar approach, chemical space docking, has been implemented by BioSolveIT, so far for two-component reactions 67 . This method is even faster, as it docks individual building block fragments and then enumerates them with scaffolds and other synthons. However, there are trade-offs for the extra speed: docking of smaller fragments without scaffolds is less reliable, and their reactive groups often have dissimilar properties from the reaction product. This may introduce strong receptor interactions that are irrelevant to the final compound and can misguide the fragment selection. This is especially true for cycloaddition reactions and three-component scaffolds, which need further validation in chemical space docking.

Apart from supporting the abundance, chemical diversity and potential quality of hits, structure-based modular approaches are especially effective in identifying hits with robust chemical novelty, as they (1) do not rely on information for existing ligands and (2) identify ligands that have never been synthesized before. This is an important factor in assuring the patentability of the chemical matter for hit compounds and the lead series arising from gigascale screening. Moreover, thousands of easily synthesizable analogues assure extensive SAR-by-catalogue for the best hits, which, for example, enabled approximately 100-fold potency and selectivity improvement for the CB2 V-SYNTHES hits 26 . Availability of the multilayer on-demand chemical space extensions (for example, supported by MADE building blocks 47 ) can also greatly streamline the next steps in lead optimization through ‘virtual MedChem’, thus reducing extensive custom synthesis.

Data-driven approaches and DL

In the era of AI-based face recognition, ChatGPT and AlphaFold 68 , there is enormous interest in applications of data-driven DL approaches across drug discovery, from target identification to lead optimization to translational medicine (as reviewed in refs. 69 , 70 , 71 ).

Data-driven approaches have a long history in drug discovery, in which ML algorithms such as support vector machine, random forest and neural networks have been used extensively to predict ligand properties and on-target activities, albeit with mixed results. Accurate quantitative structure–property relationship (QSPR) models can predict physicochemical (for example, solubility and lipophilicity) and pharmacokinetic (for example, bioavailability and blood–brain barrier penetration) properties, in which large and broad experimental datasets for model training are available and continue to grow 72 , 73 , 74 . ML is also implemented in many quantitative SAR (QSAR) algorithms 75 , in which the training set and the resulting models are focused on a given target and a chemical scaffold, helping to guide lead affinity and potency optimization. Methods based on extensive ligand–target binding datasets, chemical similarity clustering and network-based approaches have also been suggested for drug repurposing 76 , 77 .

The advent of DL takes data-driven models to the next level, allowing analysis of much larger and diverse datasets while deriving more complicated non-linear relationships, with vast literature describing specific DL methodologies and applications to drug discovery 27 , 70 . By its ‘learning from examples’ nature, AI requires comprehensive ligand datasets for training the predictive models. For QSPR, large public and private databases have been accumulated, with various properties such as solubility, lipophilicity or in vitro proxies for oral bioavailability and brain permeability experimentally measured for many thousands of diverse compounds, allowing prediction of these properties in a broad range of new compounds.

The quality of QSAR models, however, differs for different target classes depending on data availability, with the most advances achieved for the kinase superfamily and aminergic GPCRs. An unbiased benchmark of the best ML QSAR models was given by a recent IDG-DREAM Drug-Kinase Binding Prediction Challenge with the participation of more than 200 experts 78 . The top predictive models in this blind assessment included kernel learning, gradient boosting and DL-based algorithms. The top-performing model (from team Q.E.D) used a kernel regression, protein sequence similarity and affinity values of more than 60,000 compound–kinase pairs between 13,608 compounds and 527 kinases from ChEMBL 79 and Drug Target Commons 80 databases as the training data. The best DL model used as many as 900,000 experimental ligand-binding data points for training, but still trailed the much simpler kernel model in performance. The best models achieved a Spearman rank coefficient of 0.53 with a root-mean-square error of 0.95 for the predicted versus experimental pKd values in the challenge set. Such accuracy was found to be on par with the accuracy and recall of single-point experimental assays for kinase inhibition, and may be useful in screenings for the initial hits for less explored kinases and guiding lead optimization. Note, however, that the kinase family is unique as it is the largest class of more than 500 targets, all possessing similar orthosteric binding pockets and sharing high cross-selectivity. The distant second family with systematic cross-reactivity comprises about 50 aminergic GPCRs, whereas other GPCR families and other cross-reactive protein families are much smaller. The performance and generalizability of ML and DL methods for these and other targets remain to be tested.
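For readers less familiar with the two metrics quoted for the challenge, the following Python sketch shows how a Spearman rank coefficient and a root-mean-square error can be computed for predicted versus experimental pKd values. The five data points are invented purely for illustration and are unrelated to the challenge set; the helper names are likewise hypothetical.

```python
import math


def rmse(pred, obs):
    """Root-mean-square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(pred, obs):
    """Spearman rank coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(pred), average_ranks(obs))


# Invented example values (pKd units), for illustration only.
observed = [5.0, 6.2, 7.1, 8.0, 9.3]
predicted = [5.5, 6.0, 8.2, 7.4, 8.8]

print(f"Spearman = {spearman(predicted, observed):.2f}, RMSE = {rmse(predicted, observed):.2f}")
```

The rank coefficient rewards getting the ordering of compounds right, whereas the RMSE penalizes absolute errors in pKd, so a model can do well on one metric and poorly on the other.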

The development of broadly generalizable or even universal models is the key aspiration of AI-driven drug discovery. One of the directions here is to extract general models of binding affinities (binding score functions) from data on both known ligand activities and corresponding protein–ligand 3D structures, for example, collected in the PDBbind database 81 or obtained from docking. Such models explore various approaches to represent the data and network architectures, including spatial graph-convolutional models 82 , 83 , 3D deep convolutional neural networks 84 , 85 or their combinations 86 . A recent study, however, found that regardless of neural network architecture, an explicit description of non-covalent intermolecular interactions in the PDBbind complexes does not provide any statistical advantage compared with simpler approximations of only ligand or only receptor that omit the interactions 87 . Therefore, the good performances of DL models based on PDBbind rely on memorizing similar ligands and receptors, rather than on capturing general information about their binding. One possible explanation for this phenomenon is that the PDBbind database does not have an adequate presentation of ‘negative space’, that is, ligands with suboptimal interaction patterns to enforce the training.

This mishap exemplifies the need for a better understanding of behaviour of DL models and their dependence on the training data, which is widely recognized in the AI community. It has been shown that DL models, especially based on limited datasets lacking negative data, are prone to overtraining and spurious performance, sometimes leading to whole classes of models deemed ‘useless’ 88 or severely biased by subjective factors defining the training dataset 89 . Statistical tools are being developed to define the applicability range and carefully validate the performance of the models. One of the proposed concepts is the predictability, computability and stability framework for ‘veridical data science’ 90 . Adequate selection of quality data has been specifically identified by leaders of the AI community as the major requirement for closing the ‘production gap’, or the inability of ML models to succeed when they are deployed in the real world, thus calling for a data-centric approach to AI 91 , 92 . There have also been attempts to develop tools to make AI ‘explainable’, that is, able to formulate some general trends in the data, specifically in the drug discovery applications 93 .

Despite these challenges and limitations, AI is already starting to make a substantial effect on drug discovery, with the first AI-based drug candidates making it into preclinical and clinical studies. For kinases, the AI-driven compounds were reported as potent and effective in vivo inhibitors of the receptor tyrosine kinase DDR1, which is involved in fibrosis 9 . Phase I clinical trials have been announced for ISM001-055 (also known as INS018_055) for the treatment of idiopathic pulmonary fibrosis 10 , although the identity of the compound and its target has not been disclosed. For GPCRs, AI-driven compounds targeting 5-HT1A, dual 5-HT1A–5-HT2A and A2A receptors have recently entered clinical trials, providing further support for the AI-driven drug discovery concept. These first success stories are coming from kinase and GPCR families with already well-studied pharmacology, and the compounds show close chemical similarity to known high-affinity scaffolds 94 . It is important for the next generation of DL drug candidates to improve in novelty and applicability range.

Hybrid computational approaches

As discussed above, physics-based and data-driven approaches have distinct advantages and limitations in predicting ligand potency. Structure-based docking predictions are naturally generalizable to any target with 3D structures and can be more accurate, especially in eliminating false positives as the main challenge of screening. Conversely, data-driven methods may work in lieu of structures and can be faster, especially with GPU acceleration, although they struggle to generalize beyond data-rich classes of targets. Therefore, there are numerous ongoing efforts to combine physics-based and data-driven approaches in some synergistic ways in general 95 , and in drug discovery specifically 96 .

In virtual screening approaches, a synergetic use of physics-based docking with data-based scoring functions may be highly beneficial. Moreover, if the physics-based and data-based scoring functions are relatively independent and both generate enrichment in the selected focused libraries, their combination can reduce the false-positive rates and improve the quality of the hits. This synergy is reflected in the latest D3R Grand Challenge 4 results for ligand IC50 predictions 59 , in which the top methods that used a combination of both physics-based and ML scoring outperformed those that did not use ML. Going forward, thorough benchmarking of physics-based, ML and hybrid approaches will be a key focus of a new Critical Assessment of Computational Hit-finding Experiments (CACHE), which will assess five specific scenarios relevant to practical hit and lead discovery and optimization 97 .
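The benefit of combining two relatively independent scoring functions can be made concrete with a small expected-value calculation in Python. The library size and the number of potential hits reuse the round figures from earlier in this section; the sensitivity and false-positive rate of each scoring function are hypothetical, and strict statistical independence is an idealizing assumption.

```python
def screen_outcome(n_library, n_true, sensitivity, false_positive_rate):
    """Expected true hits, false hits and precision of one selection step."""
    true_hits = n_true * sensitivity
    false_hits = (n_library - n_true) * false_positive_rate
    return true_hits, false_hits, true_hits / (true_hits + false_hits)


n_library, n_true = 1e10, 1e6   # library size and potential hits (round figures)
sens, fpr = 0.5, 1e-4           # hypothetical values for each scoring function

single = screen_outcome(n_library, n_true, sens, fpr)
# Independence assumption: a compound is selected only if it passes both
# functions, so the pass probabilities multiply.
consensus = screen_outcome(n_library, n_true, sens * sens, fpr * fpr)

for label, (tp, fp, precision) in (("single", single), ("consensus", consensus)):
    print(f"{label:9s} true hits {tp:,.0f}  false hits {fp:,.0f}  precision {precision:.4f}")
```

With these assumed rates the consensus selection loses half of the true hits retained by a single function but removes nearly all false positives, which matches the argument that false negatives are tolerable in gigascale spaces whereas false positives are not.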

At a deeper level, the results of accurate physics-based docking (in addition to experimental data, for example, from PDBbind 81 ) can be used to train generalized graph or 3D DL models predicting ligand–receptor affinity. This would help to markedly expand the training dataset and balance positive and negative (suboptimal binding) examples, which is important to avoid the overtraining issues described in ref. 87 . Such DL-based 3D scoring functions for predicting molecular binding affinity from a docked protein−ligand complex are being developed and benchmarked, most recently RTCNN 98 , although their practical utility remains to be demonstrated.

To expand the range of structure-based docking applicability to those targets lacking high-resolution structures, it is also tempting to use AI-derived AlphaFold2 (refs. 99 , 100 ) or RosettaFold 101 3D models, which already show utility in many applications, including protein–protein and protein–peptide docking 102 . Traditional homology models based on close protein similarity, especially when refined with known ligands 103 , have been used in small-molecule docking and virtual screening 104 ; AlphaFold2 is therefore expected to further expand the scope of structural modelling and its accuracy. In a recent report, AlphaFold2 models, augmented by other AI approaches, helped to identify a cyclin-dependent kinase 20 (CDK20) small-molecule inhibitor, although at a modest affinity of 8.9 μM (ref. 105 ). More general benchmarking of the performance of AlphaFold2 models in virtual screening, however, gives mixed results. In a benchmark focused on targets with existing crystal structures, most AlphaFold2 models had to be cleaned from loops blocking the binding pocket and/or augmented with known ion or other cofactors to achieve reasonable enrichment of hits 106 . For the more practical cases of targets lacking experimental structures, especially for target classes with less obvious structural homologies in the ligand-binding pocket, the performance of AlphaFold2 models in small-molecule docking showed disappointing results in recent assessments for GPCR and antibacterial targets 107 , 108 . The recently developed AlphaFill approach 109 for ‘transplanting’ small-molecule cofactors and ligands from PDB structures to homologous AlphaFold2 models can potentially help to validate and optimize these models, although further assessment of their utility for docking and virtual screening is ongoing.

To speed up virtual screening of ultra-large chemical libraries, several groups have suggested hybrid iterative approaches, in which results of structure-based docking of a sparse library subset are used to train ML models, which are then used to filter the whole library to further reduce its size. These methods, including MolPal 25 , Active Learning 110 and DeepDocking 111 , report as much as a 14–100-fold reduction in the computational cost for libraries of 1.4 billion compounds, although it is not clear how they would scale to rapidly growing chemical spaces.
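The general pattern of these hybrid iterative methods (dock a sparse subset, train a cheap surrogate, dock only what the surrogate shortlists) can be sketched in a few lines of Python. This is a single-round toy under strong assumptions and does not reproduce MolPal, Active Learning or DeepDocking: each compound has one hypothetical descriptor, `toy_dock` is a made-up score, and the surrogate is a least-squares line rather than a deep model.

```python
import random

rng = random.Random(0)
N = 10_000
descriptor = [rng.random() for _ in range(N)]    # one hypothetical feature per compound
noise = [rng.gauss(0.0, 0.3) for _ in range(N)]


def toy_dock(i: int) -> float:
    """Hypothetical 'expensive' docking score for compound i (lower is better)."""
    return -5.0 * descriptor[i] + noise[i]


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


# 1. Dock a sparse random subset of the library and train the surrogate on it.
train = rng.sample(range(N), 200)
slope, intercept = fit_line([descriptor[i] for i in train], [toy_dock(i) for i in train])

# 2. Rank the whole library with the cheap surrogate and keep the best 5%.
ranked = sorted(range(N), key=lambda i: slope * descriptor[i] + intercept)
shortlist = ranked[:500]

# 3. Dock only the shortlist.
docked = {i: toy_dock(i) for i in shortlist}
n_docked = len(train) + len(shortlist)

# Evaluation only: docking everything is affordable in this toy, not in practice.
true_top = set(sorted(range(N), key=toy_dock)[:100])
recall = len(true_top & set(shortlist)) / len(true_top)
print(f"docked {n_docked} of {N} compounds, recall of true top 100: {recall:.2f}")
```

Published methods typically repeat the train-and-filter cycle several times and use far richer molecular representations; the sketch only shows where the computational saving comes from and why some true hits are inevitably lost.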

We should emphasize here that scoring functions in fast-docking algorithms and ML models are primarily designed and trained to effectively separate potential target binders from non-binders, although they are not very accurate in predictions of binding affinities or potencies. For more accurate potency predictions, the smaller focused library of candidate binders selected by the initial AI or docking-based screening can be further analysed and ranked using more elaborate physics-based tools, including free energy perturbation methods for relative 112 and absolute 113 , 114 , 115 free energy of ligand binding. Although these methods are much slower, utilization of GPU accelerated calculations 28 holds the potential for their broader application in post-processing in virtual screening campaigns to further enrich the hit rates for high-affinity candidates (Fig. 2 ), as well as in lead optimization stages.

Future challenges

Further growth of readily accessible chemical spaces.

The advent of fast and practical methods for screening gigascale chemical spaces for drug discovery stimulates further growth of these on-demand spaces, supporting better diversity and the overall quality of identified hits and leads. Specifically developed for V-SYNTHES screening, the xREAL extension of Enamine REAL Space now comprises 173 billion compounds 116 , and can be further expanded to 10¹⁵ compounds and beyond by tapping into an even larger building block set (for example, to 680 million MADE building blocks 47 ), by including four-component or five-component scaffolds, and by using new click-like chemistries as they are discovered. Real-world testing of MADE-enhanced REAL Space, and other commercial and proprietary chemical spaces will allow a broader assessment of their synthesizability and overall utility 38 , 117 , 118 . In parallel, specialized ultra-large libraries can be built for important scaffolds underrepresented in general purpose on-demand spaces; for example, screening of a virtual library of 75 million easily synthesizable tetrahydropyridines recently yielded potent agonists for the 5-HT2A receptor 119 .

Further growth of the on-demand chemical space size and diversity is also supported by recent development of new robust reactions for the click-like assembly of building blocks. As well as ‘classical’ azide-alkyne cycloaddition click chemistry 120 , recognized by the 2022 Nobel Prize in chemistry 121 , and optimized click-like reactions including SuFEx 122 , more recent developments such as Ni-electrocatalysed doubly decarboxylative cross-coupling 123 show promise. Other carbon–carbon forming reactions use methyliminodiacetic acid boronates for C(sp²)–C(sp²) couplings 124 , and most recently tetramethyl N-methyliminodiacetic acid boronates 125 for stereospecific C(sp³)–C bond formation. Each of these reactions applied iteratively can generate new on-demand chemical spaces of billions of diverse compounds operating with a limited number of building blocks. Similar to the routinely used automatic assembly of amino acids in peptide synthesis, fully automated processes could be carried out with robots capable of producing a library of drug-like compounds on demand using combinations of a few thousand diverse building blocks 126 , 127 , 128 . Such machines are already working, although scaling-up production of thousands of specialized building blocks remains the bottleneck.

The development of more robust generative chemical spaces can also be supported by new computational approaches in synthetic chemistry, for example, predictions of new iterative reaction sequences 129 or synthetic routes and feasibility from DL-based retrosynthetic analysis 130 . In generative models, synthesizability predictions can be coupled with predictions of potency and other properties towards higher levels of automated chemical design 131 . Thus, generative adversarial networks combined with reinforcement learning (GAN-RL) were recently used to predict synthetic feasibility, novelty and biological activity of compounds, enabling the iterative cycle of in silico optimization, synthesis and testing of the ligands in vitro 50 , 132 . When applied within a set of well-established reactions and pharmacologically explored classes of targets, these approaches already yield useful hits and leads, leading to clinical candidates 50 , 132 . However, the wider potential of automated chemical design concepts and robotic synthesis in drug discovery remains to be seen.

Hybrid in vitro–in silico approaches

Although blind benchmarking and recent prospective screening success stories for the growing number of targets support utility of modern computational tools, there are whole classes of challenging targets, in which existing in silico screening approaches are not expected to fare very well by themselves. Some of the hardest cases are targets with cryptic or shallow pockets that have to open or undergo a substantial induced fit to engage ligand, as often found when targeting allosteric sites, for example, in kinases or GPCRs, or protein–protein interactions in signalling pathways.

Although bioinformatics and molecular dynamics approaches can help to detect and analyse allosteric and cryptic pockets 133 , computational tools alone are often insufficient to support ligand discovery for such challenging sites. The cryptic and shallow pockets, however, have been rather successfully handled by fragment-based drug discovery approaches, which start with experimental screening for the binding of small fragments. The initial hits are found by very sensitive methods, such as BIACORE, NMR, X-ray 134 , 135 and potentially cryo-electron microscopy 136 , to reliably detect weak binding, usually in the 10–100-μM range. The initial screening of the target can be also performed with fragments decorated by a chemical warhead enabling proximity-driven covalent attachment of a low-affinity ligand 137 . In either case, elaboration of initial fragment hits to full high-affinity ligands is the key bottleneck of fragment-based drug discovery, which requires a major effort involving ‘growing’ the fragment or linking two or more fragments together. This is usually an iterative process involving custom ligand design and synthesis that can take many years 134 , 138 . At the same time, structure-based virtual screening can help to computationally elaborate the fragments to match the experimentally identified conformation of the target binding pocket. Most cost-effectively, this approach can be applied when fragment hits are identified from the on-demand space building blocks or their close analogues for easy elaboration in the same on-demand space 139 .

The recent examples of hybrid fragment-based computational design approaches targeting SARS-CoV-2 inhibitors highlight the challenges presented by such targets and allow head-to-head comparisons to ultra-large VLS. One of the studies was aimed at the SARS-CoV-2 NSP3 conserved macrodomain enzyme (Mac1), which is a target critical for the pathogenesis and lethality of the virus. Building on crystallographic detection of the low-affinity (180 μM) fragments weakly binding Mac1 (ref. 139 ), merging of the fragments identified a 1-μM hit, quickly optimized by catalogue synthesis to a 0.4-μM lead 140 . In the same study, an ultra-scale screening of 400 million REAL database compounds identified more than 100 new diverse chemotypes of drug-like ligands, with follow-up SAR-by-catalogue optimization yielding a 1.7-μM lead 140 . For the SARS-CoV-2 main protease Mpro, the COVID Moonshot initiative published results of crystallographic screening of 1,500 small fragments with 71 hits bound in different subpockets of the shallow active site, albeit none of them showing in vitro inhibition of protease even at 100 μM (ref. 141 ). Numerous groups crowdsourcing the follow-up computational design and screening of merged and growing fragments helped to discover several SAR series, including a non-covalent Mpro inhibitor with an enzymatic IC50 of 21 μM. Further optimization by both structure-based and AI-driven computational approaches, which used more than 10 million MADE Enamine building blocks, led to the discovery of preclinical candidates with cell-based IC50 in the approximately 100-nM range, approaching the potency of nirmatrelvir 65 . The enormous scale, urgency and complexity of this Moonshot effort, with more than 2,400 compounds synthesized on demand and measured in more than 10,000 assays, are unprecedented and highlight the challenges of de novo design of non-covalent inhibitors of Mpro.

Beyond the Moonshot initiative, a flood of virtual screening efforts yielded mostly disappointing results 62 ; for example, the antimalaria drug ebselen, which was proposed in an early virtual screen 142 , failed in clinical trials. Most of these studies, however, screened small-ligand sets focused on repurposing existing drugs, lacked experimental support and used the first structure of Mpro solved in a covalent ligand complex (PDB ID: 6LU7) that was suboptimal for docking non-covalent molecules 142 .

In comparison, several studies screening ultra-large libraries were able to identify de novo non-covalent Mpro inhibitors in the 10–100-μM range 24 , 62 , 63 , 143 , while experimentally testing only a few hundred synthesized on-demand compounds. One of these studies further elaborated on these weak VLS hits by testing their Enamine on-demand analogues, revealing a lead with IC50 = 1 μM in cell-based assays, and validating its non-covalent binding crystallographically 63 . Another study based on a later, more suitable non-covalent co-crystal structure of Mpro (PDB ID: 6W63) used an ultra-large docking and optimization strategy to discover even more potent 38-nM lead compounds 64 . Note that, although the results of the initial ultra-large screenings for Mpro were modest, they were on par with the much more elaborate and expensive efforts of the Moonshot hybrid approach, with simple on-demand optimization leading to similar-quality preclinical candidates. These examples suggest that even for challenging shallow pockets, structure-based virtual screening can often provide a viable alternative when performed at gigascale and supported by accurate structures, sufficient testing and optimization effort.

Outlook towards computer-driven drug discovery

With all the challenges and caveats, the emerging capability of in silico tools to effectively tap into the enormous abundance and diversity of drug-like on-demand chemical spaces at the key target-to-hit-to-lead-to-clinic stages makes it tempting to call for the transformation of the DDD ecosystem from computer-aided to computer-driven 144 (Fig. 4 ). At the early hit identification stage, the ultra-scale virtual screening approaches, both structure-based and AI-based, are becoming mainstream in providing fast and cost-effective entry points into drug discovery campaigns. At the hit-to-lead stage, the more elaborate potency prediction tools such as free energy perturbation and AI-based QSAR often guide rational optimization of ligand potency. Beyond the on-target potency and selectivity, various data-driven computational tools are routinely used in multiparameter optimization of the lead series that includes ADMET and PK properties. Of note, chemical spaces of more than 10¹⁰ diverse compounds are likely to contain millions of initial hits for each target 20 (Box 1 ), thousands of potent and selective leads and, with some limited medicinal chemistry in the same highly tractable chemical space, drug candidates ready for preclinical studies. To harness this potential, the computational tools need to become more robust and better integrated into the overall discovery pipeline to ensure their impact in translating initial hits into preclinical and clinical development.

Figure 4. Schematic comparison of the standard HTS plus custom synthesis-driven discovery pipeline versus the computationally driven pipeline. The latter is based on easily accessible on-demand or generative virtual chemical spaces, as well as structure-based and AI-based computational tools that streamline each step of the drug discovery process.

One should not forget here that no computational model, however useful or accurate, can ensure that all of its predictions are correct. In practice, the best virtual screening campaigns result in 10–40% of candidate hits confirmed in experimental validation, whereas the best affinity predictions used in optimization rarely have accuracy better than 1 kcal mol⁻¹ root-mean-square error. Similar limitations apply to current computational models predicting ADMET and PK properties. Therefore, computational predictions always need experimental validation in robust in vitro and in vivo assays at each step of the pipeline. At the same time, experimental testing of predictions also provides data that can feed back into improving the quality of the models by expanding their training datasets, especially for the ligand property predictions. Thus, the DL-based QSPR models will greatly benefit from further accumulation of data from cell-permeability assays such as CACO-2 and MDCK, as well as from new advanced technologies such as organs-on-a-chip or functional organoids that provide better estimates of ADMET and PK properties without cumbersome in vivo experiments. The ability to train ADMET and PK models with in vitro assay data representing the most relevant species for drug development (typically mouse, rat and human) would also help to address species variability as a major challenge for successful translational studies. All of this creates a virtuous cycle for improving computational models to the point at which they can drive compound selection for most DDD end points. When combined with more accurate in vitro testing, this may reduce and eventually eliminate animal test requirements (as recently indicated by the FDA) 145.
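The two accuracy figures quoted in this paragraph can be translated into more tangible quantities. The minimal Python sketch below uses the standard relation ΔG = RT ln Kd to convert a free-energy error into a multiplicative error in binding affinity, and multiplies a hit rate by the number of compounds tested. The temperature of 298.15 K, the example of 300 tested compounds and the function names are assumptions made for illustration.

```python
import math

R_KCAL_PER_MOL_K = 1.987204e-3  # molar gas constant in kcal mol^-1 K^-1


def affinity_fold_error(error_kcal_per_mol: float, temperature_k: float = 298.15) -> float:
    """Multiplicative error in Kd implied by a free-energy error, from dG = RT ln Kd."""
    return math.exp(error_kcal_per_mol / (R_KCAL_PER_MOL_K * temperature_k))


def expected_confirmed_hits(n_tested: int, hit_rate: float) -> float:
    """Expected number of experimentally confirmed hits among n_tested candidates."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return n_tested * hit_rate


# A 1 kcal/mol error at room temperature (assumed 298.15 K).
print(f"fold error in Kd: {affinity_fold_error(1.0):.1f}")

# Hypothetical campaign testing 300 compounds at the quoted 10-40% hit rates.
for rate in (0.10, 0.40):
    print(f"hit rate {rate:.0%}: {expected_confirmed_hits(300, rate):.0f} confirmed hits")
```

At room temperature a 1 kcal mol⁻¹ error corresponds to an uncertainty of roughly fivefold in Kd, and a 10–40% hit rate on 300 tested compounds would yield about 30 to 120 confirmed hits, which illustrates why experimental validation remains necessary at each step.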

Building hybrid in silico–in vitro pipelines with easy access to the enormous on-demand chemical space at all stages of the gene-to-lead process can help to generate abundant pools of diverse lead compounds with optimal potency, selectivity and ADMET and PK properties, resulting in less compromise in multiparameter optimization for clinical candidates. Running such data-rich computationally driven pipelines requires overarching data management tools for drug discovery, many of which are being implemented in pharma and academic DDD centres 146,147. Building computationally driven pipelines will also help to reveal weak or missing links, in which new approaches and additional data may be needed to generate improved models, thus helping to fill the remaining computational gaps in the DDD pipeline. Provided this systematic integration continues, computer-driven ligand discovery has great potential to reduce the entry barriers for generating molecules for numerous lines of inquiry, whether it is in vivo probes for new and understudied targets 148, polypharmacology and pluridimensional signalling, or drug candidates for rare diseases and personalized medicine.

Austin, D. & Hayford, T. Research and development in the pharmaceutical industry. CBO https://www.cbo.gov/publication/57126 (2021).

Sun, D., Gao, W., Hu, H. & Zhou, S. Why 90% of clinical drug development fails and how to improve it? Acta Pharm. Sin. B 12 , 3049–3062 (2022).

Bajorath, J. Computer-aided drug discovery. F1000Res. 4 , F1000 Faculty Rev-1630 (2015).

Van Drie, J. H. Computer-aided drug design: the next 20 years. J. Comput. Aided Mol. Des. 21 , 591–601 (2007).

Talele, T. T., Khedkar, S. A. & Rigby, A. C. Successful applications of computer aided drug discovery: moving drugs from concept to the clinic. Curr. Top. Med. Chem. 10 , 127–141 (2010).

Macalino, S. J. Y., Gosu, V., Hong, S. & Choi, S. Role of computer-aided drug design in modern drug discovery. Arch. Pharmacal. Res. 38 , 1686–1701 (2015).

Sabe, V. T. et al. Current trends in computer aided drug design and a highlight of drugs discovered via computational techniques: a review. Eur. J. Med. Chem. 224 , 113705 (2021).

Jayatunga, M. K., Xie, W., Ruder, L., Schulze, U. & Meier, C. AI in small-molecule drug discovery: a coming wave. Nat. Rev. Drug Discov. 21 , 175–176 (2022).

Zhavoronkov, A. et al. Deep learning enables rapid identification of potent DDR1 kinase inhibitors. Nat. Biotechnol. 37 , 1038–1040 (2019). This study claims the discovery of a lead candidate in just 21 days, using generative AI, synthesis, and in vitro and in vivo testing of the compounds .

US National Library of Medicine. ClinicalTrials.gov https://clinicaltrials.gov/ct2/show/NCT05154240#contactlocation (2022).

Schrodinger. Schrödinger announces FDA clearance of investigational new drug application for SGR-1505, a MALT1 inhibitor. Schrodinger https://ir.schrodinger.com/node/8621/pdf (2022). This press release states that combined physics-based and ML methods enabled a computational screen of 8.2 billion compounds and the selection of a clinical candidate after 10 months and only 78 molecules synthesized .

Jones, N. Crystallography: atomic secrets. Nature 505 , 602–603 (2014).

Liu, W. et al. Serial femtosecond crystallography of G protein–coupled receptors. Science 342 , 1521–1524 (2013).

Nannenga, B. L. & Gonen, T. The cryo-EM method microcrystal electron diffraction (MicroED). Nat. Methods 16 , 369–379 (2019).

Fernandez-Leiro, R. & Scheres, S. H. Unravelling biological macromolecules with cryo-electron microscopy. Nature 537 , 339–346 (2016).

Renaud, J.-P. et al. Cryo-EM in drug discovery: achievements, limitations and prospects. Nat. Rev. Drug Discov. 17 , 471–492 (2018).

Congreve, M., de Graaf, C., Swain, N. A. & Tate, C. G. Impact of GPCR structures on drug discovery. Cell 181 , 81–91 (2020).

Santos, R. et al. A comprehensive map of molecular drug targets. Nat. Rev. Drug Discov. 16 , 19–34 (2017).

Grygorenko, O. O. et al. Generating multibillion chemical space of readily accessible screening compounds. iScience 23 , 101681 (2020).

Lyu, J. et al. Ultra-large library docking for discovering new chemotypes. Nature 566 , 224–229 (2019). This ultra-large docking study also carefully assessed the advantages and potential pitfalls of expanding chemical space.

Stein, R. M. et al. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 579 , 609–614 (2020). This study shows ultra-large docking that resulted in subnanomolar hits for a GPCR .

Alon, A. et al. Structures of the sigma2 receptor enable docking for bioactive ligand discovery. Nature 600 , 759–764 (2021).

Gorgulla, C. et al. An open-source drug discovery platform enables ultra-large virtual screens. Nature 580 , 663–668 (2020). This study shows an iterative library filtering as a first approach to accelerate ultra-large virtual screening .

Gorgulla, C. et al. A multi-pronged approach targeting SARS-CoV-2 proteins using ultra-large virtual screening. iScience 24 , 102021 (2021).

Graff, D. E., Shakhnovich, E. I. & Coley, C. W. Accelerating high-throughput virtual screening through molecular pool-based active learning. Chem. Sci. 12 , 7866–7881 (2021). This study introduces acceleration of ultra-large screening by iteratively combining DL and docking .

Sadybekov, A. A. et al. Synthon-based ligand discovery in virtual libraries of over 11 billion compounds. Nature 601 , 452–459 (2022). This study introduces the modular concept for screening gigascale spaces, V-SYNTHES, and validates its performance on GPCR and kinase targets .

Yang, X., Wang, Y., Byrne, R., Schneider, G. & Yang, S. Concepts of artificial intelligence for computer-assisted drug discovery. Chem. Rev. 119 , 10520–10594 (2019).

Pandey, M. et al. The transformational role of GPU computing and deep learning in drug discovery. Nat. Mach. Intell. 4 , 211–221 (2022).

Blay, V., Tolani, B., Ho, S. P. & Arkin, M. R. High-throughput screening: today’s biochemical and cell-based approaches. Drug Discov. Today 25 , 1807–1821 (2020).

Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16 , 3–50 (1996).

Lyu, J., Irwin, J. J. & Shoichet, B. K. Modeling the expansion of virtual screening libraries. Nat. Chem. Biol. https://doi.org/10.1038/s41589-022-01234-w (2023).

Tomberg, A. & Boström, J. Can easy chemistry produce complex, diverse, and novel molecules? Drug Discov. Today 25 , 2174–2181 (2020).

Muchiri, R. N. & van Breemen, R. B. Affinity selection–mass spectrometry for the discovery of pharmacologically active compounds from combinatorial libraries and natural products. J. Mass Spectrom. 56 , e4647 (2021).

Fitzgerald, P. R. & Paegel, B. M. DNA-encoded chemistry: drug discovery from a few good reactions. Chem. Rev. 121 , 7155–7177 (2021).

Neri, D. & Lerner, R. A. DNA-encoded chemical libraries: a selection system based on endowing organic compounds with amplifiable information. Annu. Rev. Biochem. 87 , 479–502 (2018).

McCloskey, K. et al. Machine learning on DNA-encoded libraries: a new paradigm for hit finding. J. Med. Chem. 63 , 8857–8866 (2020).

Walters, W. P. Virtual chemical libraries. J. Med. Chem. 62 , 1116–1124 (2019).

Warr, W. A., Nicklaus, M. C., Nicolaou, C. A. & Rarey, M. Exploration of ultralarge compound collections for drug discovery. J. Chem. Inf. Model. 62 , 2021–2034 (2022). This is a comprehensive review of the history and recent developments of the on-demand and generative chemical spaces .

Enamine. REAL Database. Enamine https://enamine.net/compound-collections/real-compounds/real-database (2020).

Hartenfeller, M. et al. A collection of robust organic synthesis reactions for in silico molecule design. J. Chem. Inf. Model. 51 , 3093–3098 (2011).

Patel, H. et al. SAVI, in silico generation of billions of easily synthesizable compounds through expert-system type rules. Sci. Data 7 , 384 (2020).

Irwin, J. J. et al. ZINC20-A free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60 , 6065–6073 (2020).

Hu, Q. et al. Pfizer Global Virtual Library (PGVL): a chemistry design tool powered by experimentally validated parallel synthesis information. ACS Comb. Sci. 14 , 579–589 (2012).

Nicolaou, C. A., Watson, I. A., Hu, H. & Wang, J. The Proximal Lilly Collection: mapping, exploring and exploiting feasible chemical space. J. Chem. Inf. Model. 56 , 1253–1266 (2016).

Enamine. REAL Space. Enamine https://enamine.net/library-synthesis/real-compounds/real-space-navigator (2022).

Bellmann, L., Penner, P., Gastreich, M. & Rarey, M. Comparison of combinatorial fragment spaces and its application to ultralarge make-on-demand compound catalogs. J. Chem. Inf. Model. 62 , 553–566 (2022).

Enamine. Make on-demand building blocks (MADE). Enamine https://enamine.net/building-blocks/made-building-blocks (2022).

Hoffmann, T. & Gastreich, M. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov. Today 24 , 1148–1156 (2019).

Ruddigkeit, L., van Deursen, R., Blum, L. C. & Reymond, J.-L. Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17. J. Chem. Inf. Model. 52 , 2864–2875 (2012).

Vanhaelen, Q., Lin, Y.-C. & Zhavoronkov, A. The advent of generative chemistry. ACS Med. Chem. Lett. 11 , 1496–1505 (2020).

Ballante, F., Kooistra, A. J., Kampen, S., de Graaf, C. & Carlsson, J. Structure-based virtual screening for ligands of G protein-coupled receptors: what can molecular docking do for you? Pharmacol. Rev. 73 , 527–565 (2021).

Neves, M. A., Totrov, M. & Abagyan, R. Docking and scoring with ICM: the benchmarking results and strategies for improvement. J. Comput. Aided Mol. Des. 26 , 675–686 (2012).

Meiler, J. & Baker, D. ROSETTALIGAND: protein-small molecule docking with full side-chain flexibility. Proteins 65 , 538–548 (2006).

Lorber, D. M. & Shoichet, B. K. Flexible ligand docking using conformational ensembles. Protein Sci. 7 , 938–950 (1998).

Trott, O. & Olson, A. J. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31 , 455–461 (2010).

Halgren, T. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 47 , 1750–1759 (2004).

Friesner, R. A. et al. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47 , 1739–1749 (2004).

Gaieb, Z. et al. D3R grand challenge 3: blind prediction of protein-ligand poses and affinity rankings. J. Comput. Aided Mol. Des. 33 , 1–18 (2019).

Parks, C. D. et al. D3R grand challenge 4: blind prediction of protein-ligand poses, affinity rankings, and relative binding free energies. J. Comput. Aided Mol. Des. 34 , 99–119 (2020).

Bender, B. J. et al. A practical guide to large-scale docking. Nat. Protoc. 16 , 4799–4832 (2021).

Manglik, A. et al. Structure-based discovery of opioid analgesics with reduced side effects. Nature 537 , 185–190 (2016).

Cerón-Carrasco, J. P. When virtual screening yields inactive drugs: dealing with false theoretical friends. ChemMedChem 17 , e202200278 (2022).

Rossetti, G. G. et al. Non-covalent SARS-CoV-2 Mpro inhibitors developed from in silico screen hits. Sci. Rep. 12 , 2505 (2022).

Luttens, A. et al. Ultralarge virtual screening identifies SARS-CoV-2 main protease inhibitors with broad-spectrum activity against coronaviruses. J. Am. Chem. Soc. 144 , 2905–2920 (2022). This study compares fragment-based and ultra-large screening-based discovery of lead candidates for the challenging target .

Owen, D. R. et al. An oral SARS-CoV-2 Mpro inhibitor clinical candidate for the treatment of COVID-19. Science 374 , 1586–1593 (2021).

Böhm, H.-J. The computer program LUDI: a new method for the de novo design of enzyme inhibitors. J. Comput. Aided Mol. Des. 6 , 61–78 (1992).

Beroza, P. et al. Chemical space docking enables large-scale structure-based virtual screening to discover ROCK1 kinase inhibitors. Nat. Commun. 13 , 6447 (2022).

Jumper, J. et al. Applying and improving AlphaFold at CASP14. Proteins 89 , 1711–1721 (2021).

Vamathevan, J. et al. Applications of machine learning in drug discovery and development. Nat. Rev. Drug Discov. 18 , 463–477 (2019).

Schneider, P. et al. Rethinking drug design in the artificial intelligence era. Nat. Rev. Drug Discov. 19 , 353–364 (2020). This article provides a comprehensive introduction to DL approaches in drug discovery .

Elbadawi, M., Gaisford, S. & Basit, A. W. Advanced machine-learning techniques in drug discovery. Drug Discov. Today 26 , 769–777 (2021).

Bender, A. & Cortés-Ciriano, I. Artificial intelligence in drug discovery: what is realistic, what are illusions? Part 1: ways to make an impact, and why we are not there yet. Drug Discov. Today 26 , 511–524 (2021).

Davies, M. et al. Improving the accuracy of predicted human pharmacokinetics: lessons learned from the AstraZeneca drug pipeline over two decades. Trends Pharmacol. Sci. 41 , 390–408 (2020).

Schneckener, S. et al. Prediction of oral bioavailability in rats: transferring insights from in vitro correlations to (deep) machine learning models using in silico model outputs and chemical structure parameters. J. Chem. Inf. Model. 59 , 4893–4905 (2019).

Cherkasov, A. et al. QSAR modeling: where have you been? Where are you going to? J. Med. Chem. 57 , 4977–5010 (2014).

Keiser, M. J. et al. Predicting new molecular targets for known drugs. Nature 462 , 175–181 (2009).

Guney, E., Menche, J., Vidal, M. & Barabási, A.-L. Network-based in silico drug efficacy screening. Nat. Commun. 7 , 10331 (2016).

Cichońska, A. et al. Crowdsourced mapping of unexplored target space of kinase inhibitors. Nat. Commun. 12 , 3307 (2021).

Gaulton, A. et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Res. 40 , D1100–D1107 (2012).

Tang, J. et al. Drug Target Commons: a community effort to build a consensus knowledge base for drug–target interactions. Cell Chem. Biol. 25 , 224–229.e222 (2018).

Liu, Z. et al. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics 31 , 405–412 (2015).

Gaudelet, T. et al. Utilizing graph machine learning within drug discovery and development. Brief. Bioinform. 22 , bbab159 (2021).

Son, J. & Kim, D. Development of a graph convolutional neural network model for efficient prediction of protein–ligand binding affinities. PLoS ONE 16 , e0249404 (2021).

Stepniewska-Dziubinska, M. M., Zielenkiewicz, P. & Siedlecki, P. Improving detection of protein–ligand binding sites with 3D segmentation. Sci. Rep. 10 , 5035 (2020).

Jiménez, J., Škalič, M., Martínez-Rosell, G. & De Fabritiis, G. KDEEP: protein–ligand absolute binding affinity prediction via 3D-convolutional neural networks. J. Chem. Inf. Model. 58 , 287–296 (2018).

Jones, D. et al. Improved protein–ligand binding affinity prediction with structure-based deep fusion inference. J. Chem. Inf. Model. 61 , 1583–1592 (2021).

Volkov, M. et al. On the frustration to predict binding affinities from protein–ligand structures with deep neural networks. J. Med. Chem. 65 , 7946–7958 (2022).

Roberts, M. et al. Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nat. Mach. Intell. 3 , 199–217 (2021).

Beker, W. et al. Machine learning may sometimes simply capture literature popularity trends: a case study of heterocyclic Suzuki–Miyaura coupling. J. Am. Chem. Soc. 144 , 4819–4827 (2022).

Yu, B. & Kumbier, K. Veridical data science. Proc. Natl Acad. Sci. USA 117 , 3920–3929 (2020). This perspective article lays a foundation for veridical AI .

Ng, A., Laird, D. & He, L. Data-centric AI competition. DeepLearning AI https://https-deeplearning-ai.github.io/data-centric-comp/ (2021).

Miranda, L. J. Towards data-centric machine learning: a short review. LJ Miranda https://ljvmiranda921.github.io/notebook/2021/07/30/data-centric-ml/ (2021).

Jiménez-Luna, J., Grisoni, F. & Schneider, G. Drug discovery with explainable artificial intelligence. Nat. Mach. Intell. 2 , 573–584 (2020).

Wills, T. AI drug discovery: assessing the first AI-designed drug candidates to go into human clinical trials. CAS https://www.cas.org/resources/cas-insights/drug-discovery/ai-designed-drug-candidates (2022).

Meng, C., Seo, S., Cao, D., Griesemer, S. & Liu, Y. When physics meets machine learning: a survey of physics-informed machine learning. Preprint at https://doi.org/10.48550/arXiv.2203.16797 (2022).

Thomas, M., Bender, A. & de Graaf, C. Integrating structure-based approaches in generative molecular design. Curr. Opin. Struct. Biol. 79 , 102559 (2023).

Ackloo, S. et al. CACHE (Critical Assessment of Computational Hit-finding Experiments): a public–private partnership benchmarking initiative to enable the development of computational methods for hit-finding. Nat. Rev. Chem. 6 , 287–295 (2022). This is an important community initiative for comprehensive performance assessment of computational drug discovery methods .

MolSoft. Rapid isostere discovery engine (RIDE). MolSoft http://molsoft.com/RIDE.html (2022).

Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596 , 583–589 (2021).

Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596 , 590–596 (2021).

Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373 , 871–876 (2021).

Akdel, M. et al. A structural biology community assessment of AlphaFold2 applications. Nat. Struct. Mol. Biol. 29 , 1056–1067 (2022).

Katritch, V., Rueda, M. & Abagyan, R. Ligand-guided receptor optimization. Methods Mol. Biol. 857 , 189–205 (2012).

Carlsson, J. et al. Ligand discovery from a dopamine D3 receptor homology model and crystal structure. Nat. Chem. Biol. 7 , 769–778 (2011).

Ren, F. et al. AlphaFold accelerates artificial intelligence powered drug discovery: efficient discovery of a novel cyclin-dependent kinase 20 (CDK20) small molecule inhibitor. Chem. Sci. 14 , 1443–1452 (2023).

Zhang, Y. et al. Benchmarking refined and unrefined AlphaFold2 structures for hit discovery. J. Chem. Inf. Model. 63 , 1656–1667 (2023).

He, X.-h. et al. AlphaFold2 versus experimental structures: evaluation on G protein-coupled receptors. Acta Pharmacol. Sin. 44 , 1–7 (2022).

Wong, F. et al. Benchmarking AlphaFold-enabled molecular docking predictions for antibiotic discovery. Mol. Syst. Biol. 18 , e11081 (2022).

Hekkelman, M. L., de Vries, I., Joosten, R. P. & Perrakis, A. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nat. Methods 20 , 205–213 (2023).

Yang, Y. et al. Efficient exploration of chemical space with docking and deep learning. J. Chem. Theory Comput. 17 , 7106–7119 (2021).

Gentile, F. et al. Artificial intelligence-enabled virtual screening of ultra-large chemical libraries with deep docking. Nat. Protoc. 17 , 672–697 (2022).

Schindler, C. E. M. et al. Large-scale assessment of binding free energy calculations in active drug discovery projects. J. Chem. Inf. Model. 60 , 5457–5474 (2020).

Chen, W., Cui, D., Abel, R., Friesner, R. A. & Wang, L. Accurate calculation of absolute protein–ligand binding free energies. Preprint at https://doi.org/10.26434/chemrxiv-2022-2t0dq-v2 (2022).

Khalak, Y. et al. Alchemical absolute protein–ligand binding free energies for drug design. Chem. Sci. 12 , 13958–13971 (2021).

Cournia, Z. et al. Rigorous free energy simulations in virtual screening. J. Chem. Inf. Model. 60 , 4153–4169 (2020).

xREAL Chemical Space, Chemspace , https://chem-space.com/services#v-synthes (2023).

Rarey, M., Nicklaus, M. C. & Warr, W. Special issue on reaction informatics and chemical space. J. Chem. Inf. Model. 62 , 2009–2010 (2022).

Zabolotna, Y. et al. A close-up look at the chemical space of commercially available building blocks for medicinal chemistry. J. Chem. Inf. Model. 62 , 2171–2185 (2022).

Kaplan, A. L. et al. Bespoke library docking for 5-HT2A receptor agonists with antidepressant activity. Nature 610 , 582–591 (2022).

Krasiński, A., Fokin, V. V. & Sharpless, K. B. Direct synthesis of 1,5-disubstituted-4-magnesio-1,2,3-triazoles, revisited. Org. Lett. 6 , 1237–1240 (2004).

The Nobel Prize in Chemistry. nobelprize.org , https://www.nobelprize.org/prizes/chemistry/2022/summary/ (2022)

Dong, J., Sharpless, K. B., Kwisnek, L., Oakdale, J. S. & Fokin, V. V. SuFEx-based synthesis of polysulfates. Angew. Chem. Int. Ed. Engl. 53 , 9466–9470 (2014).

Zhang, B. et al. Ni-electrocatalytic C(sp3)–C(sp3) doubly decarboxylative coupling. Nature 606 , 313–318 (2022).

Gillis, E. P. & Burke, M. D. Iterative cross-coupling with MIDA boronates: towards a general platform for small molecule synthesis. Aldrichimica Acta 42 , 17–27 (2009).

Blair, D. J. et al. Automated iterative C(sp3)–C bond formation. Nature 604 , 92–97 (2022). This study provides a chemical approach for automation of the C–C bond formation in small-molecule synthesis.

Li, J. et al. Synthesis of many different types of organic small molecules using one automated process. Science 347 , 1221–1226 (2015).

Trobe, M. & Burke, M. D. The molecular industrial revolution: automated synthesis of small molecules. Angew. Chem. Int. Ed. 57 , 4192–4214 (2018).

Bubliauskas, A. et al. Digitizing chemical synthesis in 3D printed reactionware. Angew. Chem. Int. Ed. 61 , e202116108 (2022).

Molga, K. et al. A computer algorithm to discover iterative sequences of organic reactions. Nat. Synth. 1 , 49–58 (2022).

Segler, M. H. S., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555 , 604–610 (2018).

Goldman, B., Kearnes, S., Kramer, T., Riley, P. & Walters, W. P. Defining levels of automated chemical design. J. Med. Chem. 65 , 7073–7087 (2022).

Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Sci. Adv. 7 , eabg3338 (2021).

Wagner, J. R. et al. Emerging computational methods for the rational discovery of allosteric drugs. Chem. Rev. 116 , 6370–6390 (2016).

Davis, B. J. & Hubbard, R. E. in Structural Biology in Drug Discovery (ed. Renaud, J.-P.) 79–98 (2020).

de Souza Neto, L. R. et al. In silico strategies to support fragment-to-lead optimization in drug discovery. Front. Chem. 8 , 93 (2020).

Saur, M. et al. Fragment-based drug discovery using cryo-EM. Drug Discov. Today 25 , 485–490 (2020).

Kuljanin, M. et al. Reimagining high-throughput profiling of reactive cysteines for cell-based screening of large electrophile libraries. Nat. Biotechnol. 39 , 630–641 (2021).

Muegge, I., Martin, Y. C., Hajduk, P. J. & Fesik, S. W. Evaluation of PMF scoring in docking weak ligands to the FK506 binding protein. J. Med. Chem. 42 , 2498–2503 (1999).

Schuller, M. et al. Fragment binding to the Nsp3 macrodomain of SARS-CoV-2 identified through crystallographic screening and computational docking. Sci. Adv. 7 , eabf8711 (2021).

Gahbauer, S. et al. Iterative computational design and crystallographic screening identifies potent inhibitors targeting the Nsp3 macrodomain of SARS-CoV-2. Proc. Natl Acad. Sci. USA 120 , e2212931120 (2023). This article demonstrates the application of both hybrid fragment screening-and-merging design and ultra-large library screening to a challenging viral target .

Achdout, H. et al. Open science discovery of oral non-covalent SARS-CoV-2 main protease inhibitor therapeutics. Preprint at https://doi.org/10.1101/2020.10.29.339317 (2022).

Jin, Z. et al. Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors. Nature 582 , 289–293 (2020).

Ton, A. T., Gentile, F., Hsing, M., Ban, F. & Cherkasov, A. Rapid identification of potential inhibitors of SARS-CoV-2 main protease by deep docking of 1.3 billion compounds. Mol. Inform. 39 , e2000028 (2020).

Frye, L., Bhat, S., Akinsanya, K. & Abel, R. From computer-aided drug discovery to computer-driven drug discovery. Drug Discov. Today Technol. 39 , 111–117 (2021).

Wadman, M. FDA no longer needs to require animal tests before human drug trials. Science , https://doi.org/10.1126/science.adg6264 (2023).

Stiefl, N. et al. FOCUS—development of a global communication and modeling platform for applied and computational medicinal chemists. J. Chem. Inf. Model. 55 , 896–908 (2015).

Schrodinger. LiveDesign. Schrodinger https://www.schrodinger.com/sites/default/files/general_ld_rgb_080119_forweb.pdf (accessed 5 April 2023).

Müller, S. et al. Target 2035—update on the quest for a probe for every protein. RSC Med. Chem. 13 , 13–21 (2022).

Verdonk, M. L., Cole, J. C., Hartshorn, M. J., Murray, C. W. & Taylor, R. D. Improved protein–ligand docking using GOLD. Proteins 52 , 609–623 (2003).

Miller, E. B. et al. Reliable and accurate solution to the induced fit docking problem for protein–ligand binding. J. Chem. Theory Comput. 17 , 2630–2639 (2021).

Chemical space docking. BioSolveIT https://www.biosolveit.de/application-academy/chemical-space-docking/ (2022).

Cavasotto, C. N. in Quantum Mechanics in Drug Discovery (ed. Heifetz, A.) 257–268 (Springer, 2020).

Dixon, S. L. et al. AutoQSAR: an automated machine learning tool for best-practice quantitative structure–activity relationship modeling. Future Med. Chem. 8 , 1825–1839 (2016).

Totrov, M. Atomic property fields: generalized 3D pharmacophoric potential for automated ligand superposition, pharmacophore elucidation and 3D QSAR. Chem. Biol. Drug Des. 71 , 15–27 (2008).

Schaller, D. et al. Next generation 3D pharmacophore modeling. WIREs Comput. Mol. Sci. 10 , e1468 (2020).

Chakravarti, S. K. & Alla, S. R. M. Descriptor free QSAR modeling using deep learning with long short-term memory neural networks. Front. Artif. Intell. 2 , 17 (2019).

Deng, Z., Chuaqui, C. & Singh, J. Structural interaction fingerprint (SIFt): a novel method for analyzing three-dimensional protein-ligand binding interactions. J. Med. Chem. 47 , 337–344 (2004).

Acknowledgements

We thank A. Brooun, A. A. Sadybekov, S. Majumdar, M. M. Babu, Y. Moroz and V. Cherezov for helpful discussions.

Author information

Authors and affiliations.

Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA

Anastasiia V. Sadybekov & Vsevolod Katritch

Center for New Technologies in Drug Discovery and Development, Bridge Institute, Michelson Center for Convergent Biosciences, University of Southern California, Los Angeles, CA, USA

Department of Chemistry, University of Southern California, Los Angeles, CA, USA

Vsevolod Katritch

Contributions

All authors contributed to the writing of the manuscript.

Corresponding author

Correspondence to Vsevolod Katritch .

Ethics declarations

Competing interests.

The University of Southern California is in the process of applying for a patent (application no. 63159888) covering the V-SYNTHES method that lists V.K. as a co-inventor.

Peer review

Peer review information.

Nature thanks Alexander Tropsha and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article.

Sadybekov, A.V., Katritch, V. Computational approaches streamlining drug discovery. Nature 616 , 673–685 (2023). https://doi.org/10.1038/s41586-023-05905-z

Received: 21 April 2022

Accepted: 01 March 2023

Published: 26 April 2023

Issue Date: 27 April 2023

DOI: https://doi.org/10.1038/s41586-023-05905-z

This article is cited by

Natural product-inspired strategies towards the discovery of novel bioactive molecules.

  • Sunita Gagare
  • Pranita Patil
  • Ashish Jain

Future Journal of Pharmaceutical Sciences (2024)

Small molecule autoencoders: architecture engineering to optimize latent space utility and sustainability

  • Marie Oestreich
  • Matthias Becker

Journal of Cheminformatics (2024)

Neural multi-task learning in drug design

  • Stephan Allenspach
  • Jan A. Hiss
  • Gisbert Schneider

Nature Machine Intelligence (2024)

Genetics of human brain development

  • Hongjun Song
  • Guo-li Ming

Nature Reviews Genetics (2024)

Machine learning accelerates pharmacophore-based virtual screening of MAO inhibitors

  • Marcin Cieślak
  • Tomasz Danel
  • Justyna Kalinowska-Tłuścik

Scientific Reports (2024)

COMMENTS

  1. Drug Design and Discovery: Principles and Applications

    Drug discovery is the process through which potential new therapeutic entities are identified, using a combination of computational, experimental, translational, and clinical models (see, e.g., [1,2]). Despite advances in biotechnology and understanding of biological systems, drug discovery is still a lengthy, costly, difficult, and inefficient process with a high attrition rate of new ...

  2. The Stages of Drug Discovery and Development Process

    Abstract and Figures. Drug discovery is a process which aims at identifying a compound therapeutically useful in curing and treating disease. This process involves the identification of candidates ...

  3. An overview of drug discovery and development

    Abstract. A new medicine will take an average of 10-15 years and more than US$2 billion before it can reach the pharmacy shelf. Traditionally, drug discovery relied on natural products as the main source of new drug entities, but was later shifted toward high-throughput synthesis and combinatorial chemistry-based development.

  4. Nature Reviews Drug Discovery

    Nature Reviews Drug Discovery is a journal for people interested in drug discovery and development. It features reviews, news, analysis and research highlights.

  5. PDF HOW AI IS ACCELERATING AND TRANSFORMING DRUG DISCOVERY

    Leonard Lee, head of growth and customer success for Accelerated Discovery at IBM, discusses the ways AI is transforming drug discovery and assisting scientists today. Artificial intelligence (AI ...

  6. Machine Learning in Drug Discovery: A Review

    In drug discovery, ... In literature, several papers provided information relates to predictive models and biomarkers, and last, few were utilized in clinical trials. Various factors like model rebuilding, designing, data accessing, data quality and software, model selection are necessary for a clinical setting. ...

  7. Phenotypic drug discovery: recent successes, lessons learned ...

    Phenotypic drug discovery has re-emerged over the past decade as an approach to systematically pursue drug discovery based on therapeutic effects in realistic disease models. This article ...

  8. Deep learning in drug discovery: an integrative review and future

    Recently, using artificial intelligence (AI) in drug discovery has received much attention since it significantly shortens the time and cost of developing new drugs. Deep learning (DL)-based approaches are increasingly being used in all stages of drug development as DL technology advances, and drug-related data grows. Therefore, this paper presents a systematic Literature review (SLR) that ...

  9. Drug Discovery Today

    Drug Discovery Today delivers informed and highly current reviews for the discovery community. The magazine addresses not only the rapid scientific developments in drug discovery associated technologies but also the management, commercial and regulatory issues that increasingly play a part in how R&D is planned, structured and executed. Features include comment by international experts, news ...

  10. PDF Angewandte Essays Drug Development Advancing the Drug Discovery and

    The current state of affairs in the drug discovery and development process is briefly summarized and then ways to take advantage of the ever-increasing fundamental knowledge and technical knowhow in chemistry and biology and related disciplines are discussed. The primary motivation of this Essay is to celebrate the great achievements of chemistry ...

  11. Advancing the Drug Discovery and Development Process

    The current state of affairs in the drug discovery and development process is briefly summarized and then ways to take advantage of the ever-increasing fundamental knowledge and technical knowhow in chemistry and biology and related disciplines are discussed. The primary motivation of this Essay is to celebrate the great achievements of ...

  12. Principles of early drug discovery

    Abstract. Developing a new drug from original idea to the launch of a finished product is a complex process which can take 12-15 years and cost in excess of $1 billion. The idea for a target can come from a variety of sources including academic and clinical research and from the commercial sector. It may take many years to build up a body of ...

  13. Artificial intelligence in drug discovery and development

    The use of artificial intelligence (AI) has been increasing in various sectors of society, particularly the pharmaceutical industry. In this review, we highlight the use of AI in diverse sectors of the pharmaceutical industry, including drug discovery and development, drug repurposing, improving pharmaceutical productivity, and clinical trials, among others; such use reduces the human workload ...

  14. Full article: Artificial intelligence in drug discovery: recent

    1. Introduction. Machine learning algorithms have been widely applied for computer-assisted drug discovery [1-3]. Deep learning approaches, that is, artificial neural networks with several hidden processing layers [4, 5], have recently gathered renewed attention owing to their ability to perform automatic feature extractions from the input data, and their potential ...

  15. (PDF) Recent Advances in Drug Discovery: Innovative ...

    Abstract. Drug discovery is a dynamic field constantly evolving with the aim of identifying novel therapeutic agents to combat various diseases. In this review, we present an overview of recent ...

  16. Research articles

    This analysis of the publicly available ChEMBL database, which includes more than 500,000 drug discovery and marketed oral drug compounds, suggests that the perceived benefit of high in vitro ...

  17. A Career in Drug Discovery and Development

    Consider what is needed to bring a new drug to clinical use. First, we must understand the underlying molecular, biochemical, and genetic mechanisms that contribute to disease. Together with the use of bioinformatics and genomics, this leads to potential drug targets that might play a critical role in disease. Chemists can then design compounds ...

  18. Drug Discovery, Development and Delivery

    Section Information. This Section publishes original and innovative scientific papers, as well as comprehensive review papers. It is organized into topics mainly covering drug discovery, drug design, basic and clinical pharmacology, formulation and delivery, and toxicology.

  19. Drug Discovery

    The Drug Discovery Series covers all aspects of drug discovery and medicinal chemistry and contains over seventy books published since 2010. Providing comprehensive coverage of this important and far-reaching area, the books encourage learning in a range of different topics and provide valuable reference sources for scientists working outside their own areas of expertise.

  20. Assay Development in Drug Discovery

    Types of Assays in Drug Discovery. Assay development, or creating a test system to assess the effects of chosen drug candidates on desired biological processes, including cellular-based, and biochemical processes (Figure 2), is one of the first steps of drug development. High throughput screening of compound libraries enables researchers to ...

  21. Drug discovery and development: Role of basic biological research

    2. Discovery: From target to clinical candidate. The goal of a preclinical drug discovery program is to deliver one or more clinical candidate molecules, each of which has sufficient evidence of biologic activity at a target relevant to a disease as well as sufficient safety and drug-like properties so that it can be entered into human testing.

  22. Early Drug Discovery and Development Guidelines: For Academic

    Setting up drug discovery and development programs in academic, non-profit and other life science research companies requires careful planning. This chapter contains guidelines to develop therapeutic hypotheses, target and pathway validation, proof of concept criteria and generalized cost analyses at various stages of early drug discovery. Various decision points in developing a New Chemical ...

  23. Drug Discovery

    Drug Discovery. 378 papers with code • 28 benchmarks • 25 datasets. Drug discovery is the task of applying machine learning to discover new candidate drugs.

  24. Computational approaches streamlining drug discovery

    The concept of computer-aided drug discovery³ was developed in the 1970s and popularized by Fortune magazine in 1981, and has since been through several cycles of hype and disillusionment⁴ ...