automatic hypothesis generation

Hypothesis Maker

An AI-powered research hypothesis generator.

  • Scientific Research: Generate a hypothesis for your experimental or observational study based on your research question.
  • Academic Studies: Formulate a hypothesis for your thesis, dissertation, or academic paper.
  • Market Research: Develop a hypothesis for your market research study to understand consumer behavior or market trends.
  • Social Science Research: Create a hypothesis for your social science research to explore societal or behavioral patterns.


Automated Hypothesis Generation


Automated hypothesis generation: when machine-learning systems produce ideas, not just test them.

Testing ideas at scale. Fast.

While algorithms are widely used to crunch numbers and test-drive ideas, they have rarely been used to generate the ideas themselves, let alone at scale.

Rather than thinking up one idea at a time and testing it, what if a machine could generate millions of ideas automatically? What if that same machine then autonomously tested and ranked the ideas, discovering which are best supported by the data? A machine that could even identify the kind of data that would refute one’s theories and challenge existing practices.

This machine lies at the heart of SparkBeyond Discovery: its Hypothesis Engine. The engine automatically generates millions of ideas, many of them novel, and asks questions we would never think to ask.

The Hypothesis Engine integrates the world’s largest collection of algorithms and bypasses human cognitive bias to produce millions of ideas, hypotheses, and questions in minutes. These hypotheses ensure that any meaningful signals in the data are surfaced. The signals are often immediately actionable and can be used as predictive features in machine learning models.
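In miniature, the loop described above can be sketched as follows: enumerate candidate hypotheses as feature transformations over the data, score each against a target signal, and rank them. This is an illustrative sketch only, not SparkBeyond's implementation; the transformation set and the correlation-based scoring are assumptions.

```python
import numpy as np

def generate_hypotheses(columns):
    """Enumerate simple candidate hypotheses (feature transformations)."""
    transforms = {
        "identity": lambda x: x,
        "log1p": lambda x: np.log1p(np.abs(x)),
        "square": lambda x: x ** 2,
    }
    for col in columns:
        for name in transforms:
            yield (col, name, transforms[name])

def rank_hypotheses(data, target):
    """Score each hypothesis by absolute correlation with the target."""
    scored = []
    for col, name, fn in generate_hypotheses(data.keys()):
        feature = fn(np.asarray(data[col], dtype=float))
        corr = np.corrcoef(feature, target)[0, 1]
        scored.append((abs(corr), f"{name}({col})"))
    return sorted(scored, reverse=True)

# Toy data: the target is driven by squared temperature plus noise.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 10, 200)
data = {"temperature": temp, "noise": rng.normal(size=200)}
target = temp ** 2 + rng.normal(scale=1.0, size=200)
best_score, best_hypothesis = rank_hypotheses(data, target)[0]
```

A real engine would draw its transformations from a large function library and score them with many metrics, but the generate-score-rank shape is the same.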

Going beyond the bias

Human ideation is inherently limited by cognitive bottlenecks and biases, which restrict our ability to generate and test ideas at scale and high throughput. We are also limited by the speed at which we can communicate, and we lack the capacity to read and comprehend the thousands of scientific articles and patents published every day.

What’s more, the questions we ask are biased by our experience and knowledge, or even our mood.

In data science and research workflows, there are key bottlenecks that limit what a person or team can accomplish while working on a problem within a finite amount of time. 

For example, when exploring for useful patterns in data, a data scientist only has time to conceive, engineer, and evaluate a limited number of distinct hypotheses, leaving many areas unexplored. 

One of these areas is the gaps within an organization’s own data. This internal data may only reveal part of the story, whereas augmented external data sources can provide valuable contextual information. Without it, hypotheses based only on internal data don’t take into account the influence of external factors, such as weather and local events, or macro-economic factors and market conditions. 

Instead, by mapping out the entire spectrum of dynamics that happen on earth, SparkBeyond Discovery connects the dots between every data set that exists and offers a comprehensive viewpoint.

Tap into humanity's collective intelligence

Just like search engines crawl the web for text, our machine started indexing the code, data and knowledge on the web, and amassed one of the world's largest libraries of open-source code functions. 

Using both automation and AI, the Hypothesis Engine employs these functions to generate four million hypotheses per minute—a capacity that allows the technology to work through hundreds of good and bad ideas every second.


Open access | Published: 20 October 2022

An automatic hypothesis generation for plausible linkage between xanthium and diabetes

Arida Ferti Syafiandini, Gyuri Song, Yuri Ahn, Heeyoung Kim & Min Song

Scientific Reports volume 12, Article number: 17547 (2022)


Subjects: Computational science, Scientific data

Abstract

There has been a significant increase in text mining implementation for biomedical literature in recent years. Previous studies introduced the implementation of text mining and literature-based discovery to generate hypotheses of potential candidates for drug development. By conducting a hypothesis-generation step and using evidence from published journal articles or proceedings, previous studies have managed to reduce experimental time and costs. First, we applied the closed discovery approach from Swanson’s ABC model to collect publications related to 36 Xanthium compounds or diabetes. Second, we extracted biomedical entities and relations using a knowledge extraction engine, the Public Knowledge Discovery Engine for Java (PKDE4J). Third, we built a knowledge graph using the obtained bio entities and relations and then generated paths with Xanthium compounds as source nodes and diabetes as the target node. Lastly, we employed graph embeddings to rank each path and evaluated the results based on domain experts’ opinions and the literature. Among 36 Xanthium compounds, 35 had direct paths to five diabetes-related nodes. We ranked 2,740,314 paths in total between 35 Xanthium compounds and three diabetes-related phrases: type 1 diabetes, type 2 diabetes, and diabetes mellitus. Based on the top five percentile paths, we concluded that adenosine, choline, beta-sitosterol, rhamnose, and scopoletin were potential candidates for diabetes drug development using natural products. Our framework for hypothesis generation employs a closed discovery from Swanson’s ABC model that has proven very helpful in discovering biological linkages between bio entities. The PKDE4J tool we used to capture bio entities from our document collection could label entities into five categories: genes, compounds, phenotypes, biological processes, and molecular functions. Using the BioPREP model, we managed to interpret the semantic relatedness between two nodes and provided paths containing valuable hypotheses. Lastly, using a graph-embedding algorithm in our path-ranking analysis, we exploited semantic relatedness while preserving the graph structure properties.

Similar content being viewed by others

automatic hypothesis generation

From language models to large-scale food and biomedical knowledge graphs

Gjorgjina Cenikj, Lidija Strojnik, … Tome Eftimov

automatic hypothesis generation

Integrating and formatting biomedical data as pre-calculated knowledge graph embeddings in the Bioteque

Adrià Fernández-Torras, Miquel Duran-Frigola, … Patrick Aloy

automatic hypothesis generation

DrugMechDB: A Curated Database of Drug Mechanisms

Adriana Carolina Gonzalez-Cavazos, Anna Tanska, … Andrew I. Su

Introduction

Drug development is both expensive and time-consuming. Therefore, many studies have focused on reducing the time and costs of drug development. Multidisciplinary approaches and the implementation of computational methods are strongly encouraged to reduce the workload in drug development. Previous studies have applied artificial intelligence approaches to help reduce drug development costs 1 , 2 . As the quantity of biomedical literature has increased, there has been steadfast interest in applying text-mining techniques and Literature-based Discovery (LBD) to generate applicable drug compound candidates 3 . We can extract information from biomedical literature and generate facts using text-mining techniques. Then, we can employ the LBD concept to generate hypotheses for drug development using those facts. Analyzing existing facts garnered from biomedical literature to generate new hypotheses is called “Conceptual Biology” 4 .

Previous works suggest that combining LBD and text-mining techniques to generate hypotheses for drug development can significantly decrease experiment time and cost 5 , 6 , 7 , 8 . However, despite the significant growth of studies in this field, hypothesis generation for drug development purposes remains challenging. The high dimensionality of biological substances and the number of related publications can be a significant obstacle to discovering possible linkages between entities 9 . Moreover, the number of paths generated during hypothesis generation is relatively high, making it difficult to gain insights. Therefore, to tackle such problems, this paper proposes a complete framework for hypothesis generation utilizing LBD and text-mining techniques, with additional path-ranking steps to select critical paths and recommend them for further experiments in drug development.

We investigated the natural compounds found in Xanthium and their connectedness with diabetes as a case study. Those compounds are extracted from medicinal plants in the Xanthium genus such as Xanthium strumarium 10 and Xanthium sibiricum 11 . A previous study found that Xanthium strumarium might have an anti-diabetic effect because its fruit reduced the elevation of plasma glucose levels in diabetic rats 12 . Another study found that Xanthium sibiricum Patrin ex Widder water extracts (CEW) could increase the sugar tolerance in normal mice and decrease the blood sugar level in diabetic mice 13 . Further experiments 14 also proved that Xanthium compounds and diabetes are significantly related but there have been few studies about the complete biological interactions between these two. In addition, we found that diabetes is one of the most common endocrine disorders with a high probability of severe complications 15 , 16 . Diabetes is also a lifelong disease with no available cure. Several synthetic drugs are available for diabetes treatments but they are costly, have many side effects, and are unsuitable for long-term consumption 17 . Therefore, there is an urgent need to find natural compounds for long-term diabetes treatment.

To generate hypotheses for Xanthium compounds and diabetes, we applied Swanson’s ABC model 18 , a well-known LBD model for bio-literature mining. This model has two main steps: constructing a knowledge base and generating paths. We utilized the PKDE4J tool 19 and the BioPREP model 20 to extract entities and relations from retrieved PubMed articles and construct the knowledge base. PKDE4J is a dictionary-based Named Entity Recognition (NER) and rule-based relation extraction tool, while BioPREP is a pre-trained language model specifically built for learning biomedical text. The BioPREP model learns sentences, transforms them into embedding representations, and forwards them for predicate (relation) classification. We generated simple paths from our knowledge base and ranked them using our proposed path-ranking algorithm, which combines a graph-embedding approach 21 with an encoder–decoder architecture. We relied on a literature-based study and experts’ opinions to validate our path-ranking results.

We highlighted our contributions in this paper: constructing a Xanthium compounds-diabetes knowledge base, proposing a path-ranking approach using graph-embedding values, and generating hypotheses for drug development experiments using Xanthium compounds.

Related works

Swanson first implemented a literature-based discovery approach to investigate linkages between dietary fish oil and Raynaud’s syndrome 18 . This approach, known as the ABC model, pioneered biomedical literature mining. Swanson’s ABC model generated constructive hypotheses and was helpful for further investigation. With the growing number of publications and digitalization, more studies have applied and adapted Swanson’s ABC model with text-mining techniques for knowledge discovery and hypothesis generation 22 . Text-mining techniques let us cover more extensive collections and significantly reduce analysis bias. Moreover, they increase the probability of discovering new biological concepts and produce more compact hypotheses for drug development 23 , 24 , 25 .

The basic principle of the ABC model is constructing a knowledge base (usually represented as a graph) using an open or closed discovery approach. Essentially, open discovery aims to discover C instances given the A and B instances, while closed discovery aims to discover B instances given the A and C instances. We can directly observe and choose the A, B, or C instances for small collections. For large collections, however, we need an automated approach to identify those instances. PKDE4J 19 —a dictionary-based tool for entity recognition and relation extraction—is one solution for processing large-scale data collections. PKDE4J can automatically identify entities and label the relationships between two entities in sentences. For entity extraction, PKDE4J utilizes multiple dictionaries with a vast vocabulary. In a previous evaluation, PKDE4J outperformed several machine learning-based tools—including Neji 26 —in the NER task. PKDE4J gave better performance, especially for matching and labeling bio entities with multiple terms.

PKDE4J employed a rule-based approach for relation extraction, which might be powerful but may not cover all conditions. Other than rule-based approaches, previous studies proposed supervised approaches that utilize neural network structures to extract relation information from texts 27 , 28 , 29 . However, those methods were less efficient because they required determining features beforehand. The development of a pre-trained language model such as BERT 30 has enabled the processing of texts without additional feature-processing steps. BERT employs bidirectional encoders that learn sentences and passages in a contextual manner. We can fine-tune BERT for specific vocabularies and collections such as biomedical literature; BioPREP is one of several BERT models explicitly trained for biomedicine 20 . BioPREP fine-tuned the previously available BERT models SciBERT 31 and BioBERT 32 using SemMedDB 33 . SemMedDB is a publicly available large dataset for biomedical entity and relation extraction. Fine-tuning a language model with SemMedDB can tackle the coverage problem when building a relation-extraction model.

Once we finish the knowledge base construction, we need to generate paths and draw hypotheses from those paths. Depending on the knowledge base size and path depths, the number of generated paths can be enormous, and analyzing them individually would be excessive. Therefore, we need an automated approach such as a path-ranking algorithm (PRA). A PRA helps identify critical paths for hypothesis generation and has emerged as a promising method for learning inference paths in large knowledge graphs 34 . The most common steps in a PRA are calculating the triple score (node–relation–node) and calculating the path score. A previous study 35 proposed a triple score calculation using semantic relatedness between nodes and compared their approach with baseline approaches such as co-occurrence, word embedding, COALS, and random indexing. They concluded that their approach performed well compared to those baselines. Despite its effectiveness, their approach depended on the quantity of collected data and was not suitable for handling networks with multiple relations. Therefore, this paper proposes a PRA that employs a graph-embedding approach called ComplEx 21 to calculate the triple score. The ComplEx algorithm considers relation information on edges and maps it into complex space. Using this algorithm, we can obtain embedding values that reflect multiple-relation conditions and the importance of triples.

Previous studies have developed various LBD tools for generating hypotheses to support drug discovery. One early tool in LBD, Swanson’s Arrowsmith, utilized the term co-occurrence to identify associations between entities 36 . Other tools such as BITOLA 37 , DAD 38 , LitLinker 39 , Manjal 40 , and LION 41 provide similar LBD functions focusing on biomedical literature mining. The success of hypothesis generation using the LBD approach significantly depended on path selection and scoring efficiency. Previous studies attempted to use various representation models to calculate path scores and filter paths based on those scores. Despite numerous advantages in implementing a graph-embedding algorithm on heterogeneous networks (knowledge bases) 42 , it has not been widely implemented in the LBD framework.

Our hypothesis generation framework followed the closed discovery approach of Swanson’s ABC model 22 . The closed discovery approach tries to identify B entities that connect the A entity to the C entity. Both A and C entities are known entities that we can use as source and tail nodes in path retrieval. This paper defined A entities as Xanthium compounds and C entities as diabetes-related terms or phrases. Since we aimed to discover B entities (multiple types of bio entities) that connect those entities, we formulated search queries using Xanthium compounds and diabetes to retrieve documents from PubMed.

Previous studies 10 , 43 discovered 243 compounds from Xanthium, only 36 of which were closely related to diabetes. Therefore, we used those 36 compounds in our search queries to retrieve titles and abstracts from PubMed. We retrieved documents using the queries in Table 1 in January 2021 and collected 805,839 titles and abstracts. After pre-processing and duplicate removal, 763,155 titles and abstracts remained in our collection. We then tokenized each sentence from the abstracts and titles and used them for the NER and relation-extraction tasks. We provide a sample document related to 4,5-dicaffeoylquinic acid in Table 2.
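The retrieval step can be sketched as simple query construction. The paper's actual query strings are listed in its Table 1 (not reproduced here), so the [MH]-based template and the helper `build_pubmed_query` below are illustrative assumptions, not a reconstruction of those queries.

```python
def build_pubmed_query(compound, disease="diabetes"):
    """Combine a compound term with a disease MeSH heading ([MH]).

    The paper's actual queries are given in its Table 1; this template
    is an assumed shape for illustration only.
    """
    return f'("{compound}"[All Fields]) AND ("{disease}"[MH])'

# Three of the 36 Xanthium compounds mentioned in the text.
compounds = ["4,5-dicaffeoylquinic acid", "beta-sitosterol", "scopoletin"]
queries = [build_pubmed_query(c) for c in compounds]
```

Queries of this shape can then be submitted to PubMed (for example via the NCBI E-utilities) to retrieve titles and abstracts.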

Knowledge base construction

We extracted bio entities and relations from our document collection to construct a knowledge base (graph) for hypothesis generation. There are two steps in our knowledge base construction: entity extraction (NER task) and relation extraction. Figure  1 illustrates the complete flow of our research.

figure 1

Our research framework.

Entity extraction (NER task)

To extract bio entities from our document collection, we used a knowledge extraction engine called PKDE4J 19 . This tool has a dictionary-based NER module where we can use custom dictionaries depending on which entities we want to extract. We decided to use eight bio entities related to drug development: genes (including protein and RNA), compounds (including Xanthium compounds), phenotypes, biological processes, and molecular functions. In addition, we utilized five different dictionaries from several biological databases, as described in Table 3 .

As mentioned in the data section, we used the [MH] code for our document retrieval. Hence, during the retrieval process, not only were “diabetes”-related documents retrieved, we also retrieved documents related to “diabetes mellitus,” “type 1 diabetes,” “type 2 diabetes,” “gestational diabetes,” and “pre-diabetes.” This paper analyzed every possible hypothesis (path) between Xanthium compounds and those five diabetes-related phrases.

Relation extraction

For relation extraction, we examined every sentence in our document collection. If there were two or more unique bio entities in a sentence, we proceeded with the relation-extraction step using a pre-trained model called BioPREP 20 . BioPREP employs a BioBERT-based model fine-tuned on the SemMedDB dataset 33 . Using the BioPREP model, we extracted 28 relations, namely: “process of,” “part of,” “location of,” “diagnoses,” “interacts with,” “treats,” “coexists with,” “is a,” “uses,” “precedes,” “associated with,” “causes,” “affects,” “administered to,” “disrupts,” “occurs in,” “complicates,” “inhibits,” “stimulates,” “augments,” “compared with,” “prevents,” “method of,” “neg interacts with,” “neg affects,” “produces,” “manifestation of,” and “higher than.”

The pre-trained BioPREP model required entity type information for predicate classification. Hence, we needed to substitute entities with entity types before processing our sentences using the model, as illustrated in Fig.  2 . If there were more than two unique bio entities in a sentence, we processed the entire sentence for relation extraction.

figure 2

Pre-processing sentences by substituting entities with their type.
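The substitution illustrated in Fig. 2 can be sketched as a small pre-processing function. Only the mechanism (replacing recognized entity mentions with their entity types before feeding sentences to BioPREP) comes from the text; the type labels and the example sentence are illustrative assumptions.

```python
def substitute_entities(sentence, entities):
    """Replace recognized entity mentions with entity-type labels, as the
    BioPREP input format requires (per the paper's Fig. 2).

    `entities` maps surface form -> type label. Longer mentions are
    replaced first to avoid clobbering multi-word entities.
    """
    for mention in sorted(entities, key=len, reverse=True):
        sentence = sentence.replace(mention, entities[mention])
    return sentence

sent = "quercetin reduces insulin resistance in diabetic rats"
types = {"quercetin": "COMPOUND", "insulin resistance": "PHENOTYPE"}
processed = substitute_entities(sent, types)
# -> "COMPOUND reduces PHENOTYPE in diabetic rats"
```

The processed sentence, now expressed over entity types rather than surface forms, is what the predicate classifier consumes.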

Proposed path-ranking algorithm

After obtaining nodes and relations in the previous step, we built a knowledge graph and evaluated each possible path from Xanthium compounds to diabetes using our proposed path-ranking algorithm (PRA) framework. Our PRA framework consists of three steps: (1) transforming nodes and relations from the graph into vector representations (graph embedding); (2) calculating triple (head node–relation–tail node) scores, where the triples are bio entity pairs and relations obtained from the relation extraction step (“Relation extraction” section); and (3) calculating path scores as the average of the triple scores and ranking paths accordingly. Paths with high scores have more inference possibilities, which might be necessary for constructing hypotheses.

Previous works in PRA employed co-occurrence and ontology-based node similarity to calculate the triple score (node–relation–node) 44 . However, using co-occurrence counts in PRA neglects the semantic relatedness between nodes because it ignores the relation/edge type. Similarly, the previous ontology-based approach to triple score calculation focused solely on hierarchical positioning and neglected semantic relations between nodes 58 . These conditions might not be the best option for inference or hypothesis generation from path-ranking results. Therefore, this paper proposes a PRA framework that includes relations in the triple score calculation.

Our framework employed a graph embedding approach called ComplEx 21 to transform nodes and relations into vector representations (complex space). Previous research used ComplEx embeddings to execute link prediction tasks in knowledge graph completion 59 . ComplEx treats a knowledge base as a three-way tensor to model asymmetric relations, matching the relations in our knowledge graph, and decomposes the tensor into low-dimensional vectors representing the embedding values of entities and relations.
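The scoring function at the core of ComplEx can be shown directly: a triple's plausibility is the real part of a trilinear product over complex-valued embeddings, and the conjugate on the tail vector is what lets the model capture asymmetric relations. The random vectors below stand in for embeddings that would actually be learned from the knowledge graph.

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx triple score: Re(<h, r, conj(t)>) (Trouillon et al., 2016).

    h, r, t are complex embedding vectors for the head node, relation,
    and tail node. Because of the conjugate, score(h, r, t) generally
    differs from score(t, r, h), modeling asymmetric relations.
    """
    return np.real(np.sum(h * r * np.conj(t)))

# Random stand-ins for learned embeddings (dimension 4).
rng = np.random.default_rng(1)
dim = 4
h = rng.normal(size=dim) + 1j * rng.normal(size=dim)
r = rng.normal(size=dim) + 1j * rng.normal(size=dim)
t = rng.normal(size=dim) + 1j * rng.normal(size=dim)
forward, backward = complex_score(h, r, t), complex_score(t, r, h)
```

In the paper's pipeline these learned embedding vectors are then concatenated into triple vectors for the encoder–decoder stage.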

First, we trained our knowledge graph using the ComplEx embedding model and obtained vector representations for nodes and relations. Then, we concatenated the head node, relation, and tail node vectors to construct triple vectors. We used the triple vectors as inputs to an encoder–decoder architecture to obtain the weight values for calculating triple scores, as illustrated in Fig. 3 . These weight values transform the n-dimensional vector into a probability that represents the triple score.

figure 3

For calculating the triple score, we transform each node and edge into vector representation and construct triple vectors. Then, using the encoder–decoder architecture, we automatically generate weight values for the triple score calculation.

Our encoder–decoder architecture has seven layers. The first three are encoder layers, the fourth is the middle layer, and the last three are decoder layers. Our experiment only included the weight values from the last decoder layer, as it encodes the latent representation of the data. We used the mean-squared error loss during training to preserve model correctness. We obtained the triple score by calculating the triple vector using Eq. (1). To rank paths, we calculated the path score of each path by averaging the triple scores. For example, for a path with a depth of two (containing two triples), the path score is the sum of the two triple scores divided by two.

where n is the dimension of the triple vector v and h is the vector of weight values obtained from the hidden layer.
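Since Eq. (1) appears here only through its variable definitions, the triple score below, a sigmoid over the mean weighted component of the triple vector, is an assumed form consistent with that description; the path score as the average of triple scores is stated directly in the text.

```python
import math

def triple_score(v, h):
    """Map a triple vector to a probability using learned weights.

    Assumed form of Eq. (1): a sigmoid over the mean of the weighted
    components, where n is the triple-vector dimension and h holds the
    hidden-layer weight values (per the surrounding description).
    """
    n = len(v)
    z = sum(hi * vi for hi, vi in zip(h, v)) / n
    return 1.0 / (1.0 + math.exp(-z))

def path_score(triple_scores):
    """A path's score is the average of its triple scores, e.g. for a
    depth-two path, the sum of two triple scores divided by two."""
    return sum(triple_scores) / len(triple_scores)

weights = [1.0, 0.5, 2.0]  # illustrative hidden-layer weights
scores = [triple_score([0.2, -0.1, 0.4], weights),
          triple_score([0.9, 0.3, 0.1], weights)]
p = path_score(scores)
```

Each triple score lands in (0, 1), so path scores are directly comparable across paths of different depths.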

Hypothesis generation and evaluation

After executing the path-ranking algorithm, we conducted a thorough study of the biological linkages from the top-n paths. Our experts examined the top five percentile and concluded which paths were most plausible for drug development experiments. Furthermore, our experts examined paths in the middle and lower ranks to validate the performance of our proposed ranking algorithm.

Xanthium compounds-diabetes knowledge base

The first step in knowledge base construction is entity extraction or NER task. We employed PKDE4J 19 to process sentences and found that only 3,397,178 sentences contained bio entities. Initially, there were 145,246 unique bio entity terms; after normalization and disambiguation processes, only 84,176 bio entities remained. We provided a summary of NER task results in Table 4 . Among 26,343 compounds, 144 compounds in total were related to Xanthium .

We used the bio entity and entity type information obtained from the NER task to pre-process sentences for the second step, relation extraction. We should note that we only processed sentences with two or more entities and skipped sentences with only one entity. Table 5 gives the sample triples from relation extraction results. Similar node types might have more than one relation; for example, a phenotype can be a process of another phenotype or one phenotype can cause another phenotype, depending on the sentences registered as the BioPREP 20 model input.

We constructed a knowledge base using the obtained triple data from the relation extraction step. Then, we generated paths from 36 Xanthium compounds to five diabetes nodes (diabetes mellitus, type 1 diabetes, type 2 diabetes, gestational diabetes, and pre-diabetes). The generated paths were paths with a depth of two, three, and four. Unfortunately, we found no connecting paths between water-soluble glycosides and five diabetes nodes. This might be due to limited available information about water-soluble glycosides, as we only collected 14 related articles (as of January 2021). There are 12,437 paths with a depth of two, 3,612,585 with a depth of three, and 1,151,267,082 with a depth of four for 35 Xanthium compounds to five diabetes nodes.
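Path generation of this kind can be sketched as a depth-first search over the extracted graph, bounded by a maximum depth (number of edges). The toy graph and node names below are illustrative stand-ins, not entities from the paper's knowledge base.

```python
def simple_paths(graph, source, target, max_depth):
    """Enumerate simple (cycle-free) paths from source to target with at
    most max_depth edges. `graph` maps each node to its successor list."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) > max_depth:  # path already has max_depth edges
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep paths simple
                stack.append((nxt, path + [nxt]))

# Toy directed graph; real nodes come from the extraction step.
graph = {
    "quercetin": ["insulin", "oxidative stress"],
    "insulin": ["type 2 diabetes"],
    "oxidative stress": ["inflammation"],
    "inflammation": ["type 2 diabetes"],
}
paths = list(simple_paths(graph, "quercetin", "type 2 diabetes", max_depth=3))
```

The combinatorial growth the paper reports (12 thousand paths at depth two versus over a billion at depth four) is exactly what this enumeration produces on a dense graph, which is why a ranking step is needed.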

Given the large number of paths generated, we focused on “type 1 diabetes,” “type 2 diabetes,” and “diabetes mellitus” as tail nodes and a depth of two and three for further analysis. We provided the path summary between 35 Xanthium compounds and three diabetes-related phrases, “type 1 diabetes,” “type 2 diabetes,” and “diabetes mellitus,” in Table 6 and illustrated the subgraph of our knowledge base in Fig.  4 . More paths were found from compounds like adenosine, choline, hexadecenoic acid, and quercetin than other compounds; these might indicate the high relatedness between those compounds and diabetes. We calculated scores for those paths and ranked them accordingly.

figure 4

An ego graph of the compound “1,3_di_o_caffeoylquinic_acid” with radius = 1.

We can find the 35 compounds mentioned in Table 6 in the root, leaf, fruit, and aerial parts of Xanthium plants 60 . Although syringaresinol has been reported as a potential therapeutic agent for diabetes, inhibiting inflammation, fibrosis, and oxidative stress 61 , we found fewer paths connecting the compound to diabetes. Similarly, there were relatively few paths for atractyloside and formononetin despite their significant relatedness to type 2 diabetes progression. Meanwhile, for compounds with laboratory evidence—such as beta-sitosterol and emodin—we found an adequate number of paths connecting those compounds to diabetes.

Path-ranking performance evaluation

To ensure the performance of our proposed path-ranking algorithm, we conducted separate experiments using the Hetionet dataset. Hetionet is a bio entity network built from 29 publicly available databases. Hetionet (version 1.0) contains 2,250,197 edges of 24 types connecting 47,031 nodes from 11 types of bio entities (compounds, diseases, genes, biological pathways, etc.). Although we can consider Hetionet a fairly complete biological network (given how many datasets were integrated), it has little information regarding Xanthium compounds. A previous project called Rephetio 62 used Hetionet to identify paths from compound to disease and discriminate between treatments and non-treatments. The Rephetio project gives a clear idea of how network-based data analysis significantly impacts drug development 63 .

The Rephetio project predicted the probability of treatment for 209,168 compound–disease pairs (het.io/repurpose) and used two external sets of treatments for validation. This was an open study that received real-time evaluations from community members. For compound–disease prediction, the project also provided network support analysis with information about path scores and meta-path contributions (the significance of each meta path in treatment prediction). Path scores were calculated using the residual degree-weighted path count (R-DWPC), a modification of the DWPC method introduced in 64 . Unlike the original DWPC method, R-DWPC reflects the specific relationship between source and target nodes in paths. By assuming that the path score represents the level of significance (the higher, the better), we can also use the path scores provided by the Rephetio project for path ranking.
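The core idea of a degree-weighted path count is easy to sketch: each path contributes the product of its nodes' degrees raised to a negative exponent, so paths routed through highly connected hub nodes count for less. The per-metaedge degree bookkeeping of the full (R-)DWPC method is simplified away here and the toy degrees are assumptions; w = 0.4 is the damping exponent used in the Hetionet work.

```python
def dwpc(paths, degree, w=0.4):
    """Simplified degree-weighted path count: each path contributes the
    product of its nodes' degrees raised to -w, damping paths through
    hub nodes. `degree` maps node -> degree. The full method computes
    degrees per metaedge; that bookkeeping is omitted in this sketch."""
    total = 0.0
    for path in paths:
        product = 1.0
        for node in path:
            product *= degree[node] ** (-w)
        total += product
    return total

# Toy degrees (assumed values, not taken from Hetionet).
degree = {"glyburide": 3, "KCNJ11": 40, "type 2 diabetes": 25}
paths = [["glyburide", "KCNJ11", "type 2 diabetes"]]
score = dwpc(paths, degree)
```

With this weighting, a path through a gene connected to 40 other nodes contributes far less than one through a sparsely connected gene, which is the behavior the comparison in this section relies on.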

To validate our path-ranking algorithm, we extracted paths between diabetes-related compounds and type 2 diabetes mellitus from Hetionet and compared the ranking based on Hetionet’s path scores with our path-ranking results. We retrieved paths with depths of one, two, and three. We did not retrieve paths with a depth of four or more because Hetionet only provides path scores for paths with a depth of three or less. The compounds we used as source nodes for path retrieval were: Glyburide, Glipizide, Gemfibrozil, Tolazamide, Tolbutamide, Glimepiride, Telmisartan, Chlorpropamide, Losartan, Irbesartan, Eprosartan, Valsartan, Alogliptin, Nateglinide, Olmesartan, Gliclazide, Rosiglitazone, Methylergometrine, Repaglinide, and Fenofibrate. These compounds are recommended for diabetes treatment 56 and have a high probability of treating type 2 diabetes mellitus according to Hetionet.

Using these 20 compounds as source nodes and type 2 diabetes mellitus as the target node, we retrieved 13 paths of depth one, 132 paths of depth two, and 26,194 paths of depth three. To compare the ranking results, we employed rank-biased overlap (RBO) 65 to calculate the degree of similarity between our PRA ranking and the Hetionet ranking. Table 7 shows the RBO-based similarity for the 20 compounds (paths of depth two). In addition, Table 8 provides sample depth-two path-ranking results from Glyburide to type 2 diabetes mellitus. Path-ranking results for the other essential compounds are provided in the additional material.
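
Rank-biased overlap can be computed as in the following minimal sketch. This is the truncated form for finite rankings without duplicates; the persistence parameter `p = 0.9` is an assumed common default, and the published measure also defines extrapolated variants for indefinite rankings.

```python
def rbo(s, t, p=0.9):
    """Truncated rank-biased overlap between two rankings s and t.

    The agreement at depth d is the size of the intersection of the
    top-d prefixes divided by d; depths are weighted geometrically by
    p**(d-1), so agreement near the top of the rankings dominates.
    """
    k = min(len(s), len(t))
    score = 0.0
    for d in range(1, k + 1):
        agreement = len(set(s[:d]) & set(t[:d])) / d
        score += p ** (d - 1) * agreement
    return (1 - p) * score
```

For two identical length-k rankings this evaluates to 1 - p**k, approaching 1 as k grows, while completely disjoint rankings score 0.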

For paths of depth two, the similarity ranged from 49.5 to 100% with an average of 71.2%; the Alogliptin and Methylergometrine paths reached 100% similarity. The average similarity for paths of depth three was slightly lower because the number of paths increased: the average was 49.8% and the range was 44.5–54.7%. We should note that our path-scoring approach and Hetionet's differ considerably. Hetionet weights each edge along a path by the product of its endpoint node degrees raised to a negative exponent, whereas our approach weights each path using graph-embedding values translated from complex space.

Path-ranking results (Xanthium compounds—diabetes)

There were 2,740,314 paths between 35 Xanthium compounds and the three diabetes nodes: type 1 diabetes, type 2 diabetes, and diabetes mellitus. We calculated each path score by averaging the triple scores obtained from Eq. ( 1 ), sorted the paths, and analyzed those in the top five percent: 34,774 paths for type 1 diabetes, 42,670 for type 2 diabetes, and 59,575 for diabetes mellitus. Among the top-ranked paths linked to type 1 diabetes, compounds such as adenosine, alkaloids, quercetin, choline, and oleic acid were dominant. For type 2 diabetes, based on the number of occurrences in top-percentile paths, the most significant compounds were adenosine, quercetin, alkaloids, choline, and caffeic acid. Lastly, for diabetes mellitus, the most significant compounds were adenosine, quercetin, alkaloids, choline, and hexadecenoic acid. Table 9 provides the top ten paths for each diabetes term.
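
The scoring and selection steps above can be sketched as follows. Here `triple_score` is a stand-in for the embedding-based score of Eq. ( 1 ), and `top_percentile` with `pct=0.05` mirrors the top-five-percent analysis; both function names are illustrative, not the paper's.

```python
def path_score(triples, triple_score):
    """Score a path as the mean of its (head, relation, tail) triple scores."""
    return sum(triple_score(h, r, t) for h, r, t in triples) / len(triples)

def top_percentile(paths, triple_score, pct=0.05):
    """Sort paths by descending score and keep the top fraction `pct`."""
    ranked = sorted(paths, key=lambda p: path_score(p, triple_score), reverse=True)
    return ranked[: max(1, int(len(ranked) * pct))]
```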

Based on the top-percentile paths, diabetes is strongly related to adenosine, alkaloids, choline, and quercetin. The clinical trials section of DrugBank 56 contains records relating adenosine and choline to diabetes. Adenosine was used in diabetes mellitus and type 2 diabetes experiments, although no further information about the trial phase or purpose was given. For choline, one record mentioned a phase four clinical trial for the treatment of diabetic peripheral neuropathic pain, and another clinical experiment used choline for type 2 diabetes mellitus treatment and reached phase three.

Alkaloids are natural chemical compounds derived from plants, animals, bacteria, or fungi with various pharmacological activities 17 . Naturally derived alkaloids have proven effective for diabetic nephropathy treatment and suitable for patients who do not respond well to synthetic drugs or conventional therapeutic medications 66 . Alkaloids could therefore be strong candidates in the discovery of new anti-diabetic agents. In addition to alkaloids, quercetin might be a potential candidate for diabetes treatment. Quercetin is a plant-based flavonoid with various potent biological properties, including anti-inflammatory, antioxidative, anti-hypertensive, anticancer, antiviral, neuroprotective, hepatoprotective, and anti-diabetic effects 67 . Although there is no clinical trial record of quercetin for diabetes treatment, a phase one clinical trial of quercetin for the treatment of high blood pressure (hypertension) has been completed. Previous work noted that diabetes patients with hypertension are more predisposed to several complications 68 .

In addition to adenosine, alkaloids, choline, and quercetin, we found that caffeic acid, hexadecenoic acid, and oleic acid were also significantly related to diabetes. These acids matter in the diets of diabetes patients: caffeic acid could suppress the progression of type 2 diabetes states 69 , a diet high in hexadecenoic (palmitoleic) acid was associated with a higher risk of diabetes 70 , and oleic acid helped prevent type 2 diabetes and cardiovascular diseases 71 . According to clinical trial records (as of January 2021), among the 36 Xanthium compounds, only two (adenosine and choline) have reportedly been used in diabetes clinical trials. These two compounds were also among the top selections in our PRA results. After matching the findings from the top-percentile paths with previous research, including clinical trials, we concluded that our PRA framework distinguishes critical paths for hypothesis generation.

Hypothesis generation

Our experts analyzed the top-ranked paths (the top five percent) and compiled information on Xanthium compounds and diabetes. Previous research showed significant relationships between diabetes and adenosine, oleic acid, choline, caffeic acid, and stigmasterol. In the constructed Xanthium compounds–diabetes knowledge base, there were direct edges between those compounds and diabetes, and several paths of depth two or three also connected them. Based on those paths, we concluded that choline and betaine intake is supplementary in type 2 diabetes 72 , that caffeic acid has antioxidant properties that might prevent several chronic diseases, including diabetes 73 , and that stigmasterol has the potential to protect beta-cell function during diabetes progression 74 . Other compounds were connected to diabetes through intermediary nodes that most likely accelerate diabetes progression, such as hypertension and infections.

The type 1 diabetes-related paths showed significant relatedness between several Xanthium compounds and glucose. Glucose is the main compound in carbohydrate metabolism and provides energy through ATP synthesis. Cells in diabetic patients cannot process glucose effectively due to decreased insulin, resulting in high glucose levels. Compounds such as adenosine, beta-sitosterol, rhamnose, and scopoletin could decrease glucose levels. In our document collection, we found 2808 documents supporting the link between adenosine, glucose, and diabetes, versus 73 articles on beta-sitosterol, 63 on rhamnose, and 12 on scopoletin. These numbers suggest that researchers have explored adenosine and diabetes thoroughly, while only a few have shown interest in the other three compounds, which may therefore be more appropriate selections for hypothesis generation than adenosine. Since there are only a few publications relating those compounds to diabetes, we believe more discoveries remain to be made; we strongly recommend them for further experiments concerning glucose levels in diabetes cases.

Based on the top-percentile paths, we found that adenosine plays a significant role in diabetes prognosis. Adenosine is an agonist of adenosine receptors, whose binding triggers biological reactions. Adenosine receptor signaling plays an essential role in inflammation, the immune system, and oxidative stress 75 . Thus, adenosine was highly related to heart disease, ischemic heart disease, autoimmune disease, and lymphoma, which are metabolic syndromes related to diabetes. Since we only observed paths of depths two and three, the intermediary nodes (between Xanthium compounds and diabetes-related terms) were mostly compound or disease nodes. For further experiments with more variation in intermediary nodes, we therefore recommend using paths of depth greater than three.

Based on our findings about the top five percentile paths, we concluded the following hypotheses.

Compounds that negatively affect glucose level (lowering effect) are potential candidates for diabetes drug development.

Compounds that are beneficial to treat diseases related to higher diabetes risks are potential candidates for diabetes drug development.

We recommended adenosine, choline, beta-sitosterol, rhamnose, and scopoletin for further studies in diabetes drug development.

Conclusions

Previous hypothesis-generation approaches depended on how experts summarized published scientific documents or interpreted knowledge bases. Similarly, we combined published scientific documents with expert judgment to generate hypotheses for diabetes drug development using compounds from Xanthium . Our hypothesis-generation framework used evidence from scientific publications retrieved from PubMed to build a Xanthium compounds–diabetes knowledge base and generate hypotheses from it. First, we employed a dictionary-based tool for the NER task and extracted bio-entities such as genes, compounds, phenotypes, biological processes, and molecular functions; depending on the size and coverage of its dictionaries, a dictionary-based tool can be effective for recognizing bio-entities. Second, we classified possible relations between pairs of entities using the entity-type information obtained from the NER step and the sentence context, training a deep learning model in a supervised manner on sentences in which two entities co-occurred. The relation classification step yielded triples (node–relation–node), which enabled us to construct a knowledge base.

Using the constructed knowledge base, we generated simple paths from Xanthium compounds to three diabetes-related phrases: type 1 diabetes, type 2 diabetes, and diabetes mellitus. We applied several cutoffs to generate paths, analyzed paths of depths two and three, and ranked them using our proposed PRA. Our PRA approach uses a graph-embedding model to transform nodes and relations (edges) into vector representations, constructs a triple (node–relation–node) representation by concatenating the individual vectors, and uses it to calculate a triple score. Each path score is then the average of the triple scores along the path. We considered paths with high scores as significant paths that might be helpful for hypothesis generation. Using PRA, we shortlisted important information from an extensive knowledge base, which helped our experts generate hypotheses relating Xanthium compounds to diabetes. Since our PRA approach employs graph embedding, the results depend on how well the graph is constructed; a larger graph with more complete information might give better results than a smaller one. We only experimented with one graph-embedding algorithm in this research, ComplEx, and plan more comprehensive experiments with other graph-embedding algorithms in further analysis.
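
For reference, the canonical ComplEx scoring function that underlies this embedding step can be sketched as follows; the concatenation-based triple score used in the paper may differ in detail from this standard form, so treat the sketch as illustrative.

```python
def complex_score(h, r, t):
    """Canonical ComplEx triple score: Re(sum_k h_k * r_k * conj(t_k)).

    h, r, t are equal-length lists of Python complex numbers, the
    embeddings of the head entity, relation, and tail entity.
    """
    return sum((hk * rk * tk.conjugate()).real for hk, rk, tk in zip(h, r, t))
```

A useful property of this form is that real-valued relation embeddings yield symmetric relations while purely imaginary ones yield antisymmetric relations, which is why ComplEx can model both kinds of edges in a knowledge graph.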

Data availability

The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Liu, B., He, H., Luo, H., Zhang, T. & Jiang, J. Artificial intelligence and big data facilitated targeted drug discovery. Stroke Vasc. Neurol. 4 , 206–213. https://doi.org/10.1136/svn-2019-000290 (2019).


Smalley, E. AI-powered drug discovery captures pharma interest. Nat. Biotechnol. 35 , 604–605. https://doi.org/10.1038/nbt0717-604 (2017).


Zheng, S., Dharssi, S., Wu, M., Li, J. & Lu, Z. Text mining for drug discovery. Methods Mol. Biol. 1939 , 231–252. https://doi.org/10.1007/978-1-4939-9089-4_13 (2019).

Blagosklonny, M. V. & Pardee, A. B. Conceptual biology: Unearthing the gems. Nature 416 , 373. https://doi.org/10.1038/416373a (2002).


Kim, Y. H., Beak, S. H., Charidimou, A. & Song, M. Discovering new genes in the pathways of common sporadic neurodegenerative diseases: A bioinformatics approach. J. Alzheimers Dis. 51 , 293–312. https://doi.org/10.3233/JAD-150769 (2016).

Lee, S., Choi, J., Park, K., Song, M. & Lee, D. Discovering context-specific relationships from biological literature by using multi-level context terms. BMC Med. Inform. Decis. Mak. 12 , S1. https://doi.org/10.1186/1472-6947-12-S1-S1 (2012).

Sang, S. et al. SemaTyP: A knowledge graph based literature mining method for drug discovery. BMC Bioinformatics 19 , 193. https://doi.org/10.1186/s12859-018-2167-5 (2018).

Yu, L. et al. Inferring drug-disease associations based on known protein complexes. BMC Med. Genomics 8 , S2. https://doi.org/10.1186/1755-8794-8-S2-S2 (2015).

Spangler, S. et al. Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining , 1877–1886. https://doi.org/10.1145/2623330.2623667 (2014).

Fan, W. et al. Traditional uses, botany, phytochemistry, pharmacology, pharmacokinetics and toxicology of Xanthium strumarium L.: A review. Molecules https://doi.org/10.3390/molecules24020359 (2019).

Jiang, H. et al. Four new glycosides from the fruit of Xanthium sibiricum Patr. Molecules 18 , 12464–12473. https://doi.org/10.3390/molecules181012464 (2013).


Hsu, F. L., Chen, Y. C. & Cheng, J. T. Caffeic acid as active principle from the fruit of Xanthium strumarium to lower plasma glucose in diabetic rats. Planta Med. 66 , 228–230. https://doi.org/10.1055/s-2000-8561 (2000).

Guo, F., Zeng, Y. & Li, J. Inhibition of α-glucosidase activity by water extracts of Xanthium sibiricum Patrin ex Widder and their effects on blood sugar in mice. Zhejiang Da Xue Xue Bao Yi Xue Ban (J. Zhejiang Univ. Med. Sci.) 42 , 632–637 (2013).

Hwang, S. H., Wang, Z., Yoon, H. N. & Lim, S. S. Xanthium strumarium as an inhibitor of α-glucosidase, protein tyrosine phosphatase 1β, protein glycation and ABTS+ for diabetic and its complication. Molecules 21 . https://doi.org/10.3390/molecules21091241 (2016).

Kaul, K., Tarr, J. M., Ahmad, S. I., Kohner, E. M. & Chibber, R. Introduction to diabetes mellitus. Adv. Exp. Med. Biol. 771 , 1–11. https://doi.org/10.1007/978-1-4614-5441-0_1 (2012).

Menini, S., Iacobini, C., Vitale, M. & Pugliese, G. The inflammasome in chronic complications of diabetes and related metabolic disorders. Cells https://doi.org/10.3390/cells9081812 (2020).

Kumar, A. et al. Role of plant-derived alkaloids against diabetes and diabetes-related complications: A mechanism-based approach. Phytochem. Rev. 18 , 1277–1298. https://doi.org/10.1007/s11101-019-09648-6 (2019).


Swanson, D. R. Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspect. Biol. Med. 30 , 7–18. https://doi.org/10.1353/pbm.1986.0087 (1986).

Song, M., Kim, W. C., Lee, D., Heo, G. E. & Kang, K. Y. PKDE4J: Entity and relation extraction for public knowledge discovery. J. Biomed. Inform. 57 , 320–332. https://doi.org/10.1016/j.jbi.2015.08.008 (2015).


Hong, G., Kim, Y., Choi, Y. & Song, M. BioPREP: Deep learning-based predicate classification with SemMedDB. J. Biomed. Inform. 122 , 103888. https://doi.org/10.1016/j.jbi.2021.103888 (2021).

Trouillon, T., Welbl, J., Riedel, S., Gaussier, E. & Bouchard, G. Complex embeddings for simple link prediction. In Proceedings of the 33rd International Conference on Machine Learning (ICML'16) , 2071–2080. https://doi.org/10.5555/3045390.3045609 (2016).

Weeber, M., Klein, H., de Jong-van den Berg, L. T. W. & Vos, R. Using concepts in literature-based discovery: Simulating Swanson’s Raynaud-fish oil and migraine-magnesium discoveries. J. Am. Soc. Inf. Sci. Technol. 52 , 548–557. https://doi.org/10.1002/asi.1104 (2001).

Kim, Y. H. & Song, M. A context-based ABC model for literature-based discovery. PLoS ONE 14 , e0215313. https://doi.org/10.1371/journal.pone.0215313 (2019).

May, B. H., Lu, C., Lu, Y., Zhang, A. L. & Xue, C. C. L. Chinese herbs for memory disorders: A review and systematic analysis of classical herbal literature. J. Acupunct. Meridian Stud. 6 , 2–11. https://doi.org/10.1016/j.jams.2012.11.009 (2013).

Hu, R.-F. & Sun, X.-B. Design of new traditional Chinese medicine herbal formulae for treatment of type 2 diabetes mellitus based on network pharmacology. Chin. J. Nat. Med. 15 , 436–441. https://doi.org/10.1016/S1875-5364(17)30065-1 (2017).

Campos, D., Matos, S. & Oliveira, J. L. A modular framework for biomedical concept recognition. BMC Bioinform. 14 , 281. https://doi.org/10.1186/1471-2105-14-281 (2013).


Sahu, S. K. & Anand, A. Drug-drug interaction extraction from biomedical texts using long short-term memory network. J. Biomed. Inform. 86 , 15–24. https://doi.org/10.1016/j.jbi.2018.08.005 (2018).

Zhang, Y. et al. A hybrid model based on neural networks for biomedical relation extraction. J. Biomed. Inform. 81 , 83–92. https://doi.org/10.1016/j.jbi.2018.03.011 (2018).

Li, F., Zhang, M., Fu, G. & Ji, D. A neural joint model for entity and relation extraction from biomedical text. BMC Bioinform. 18 , 198. https://doi.org/10.1186/s12859-017-1609-9 (2017).

Devlin, J., Chang, M. W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL HLT 2019–2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies - Proceedings of the Conference , Vol. 1 4171–4186 (2019).

Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) , 3613–3618. https://doi.org/10.18653/v1/D19-1371 (2019).

Lee, J. et al. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics https://doi.org/10.1093/bioinformatics/btz682 (2019).

Kilicoglu, H., Shin, D., Fiszman, M., Rosemblat, G. & Rindflesch, T. C. SemMedDB: A PubMed-scale repository of biomedical semantic predications. Bioinformatics 28 , 3158–3160. https://doi.org/10.1093/bioinformatics/bts591 (2012).

Lao, N., Mitchell, T. & Cohen, W. W. Random walk inference and learning in a large scale knowledge base. In EMNLP 2011—Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference , 529–539 (2011).

Heo, G. E., Xie, Q., Song, M. & Lee, J.-H. Combining entity co-occurrence with specialized word embeddings to measure entity relation in Alzheimer’s disease. BMC Med. Inform. Decis. Mak. 19 , 240. https://doi.org/10.1186/s12911-019-0934-5 (2019).

Swanson, D. R. & Smalheiser, N. R. An interactive system for finding complementary literatures: A stimulus to scientific discovery. Artif. Intell. 91 , 183–203. https://doi.org/10.1016/S0004-3702(97)00008-8 (1997).


Baud, R. Improving literature based discovery support by genetic knowledge integration. In The New Navigators: From Professionals to Patients , Vol. 95, 68 (2003).

Weeber, M. et al. Text-based discovery in biomedicine: The architecture of the DAD-system. In Proceedings of the AMIA Symposium , 903 (2000).

Pratt W. & Yetisgen-Yildiz, M. LitLinker: Capturing connections across the biomedical literature. In Proceedings of the 2nd International Conference on Knowledge Capture , 105–112. https://doi.org/10.1145/945645.945662 (2003).

Srinivasan, P. Text mining: Generating hypotheses from MEDLINE. J. Am. Soc. Inf. Sci. Technol. 55 , 396–413. https://doi.org/10.1002/asi.10389 (2004).

Pyysalo, S. et al. LION LBD: A literature-based discovery system for cancer biology. Bioinformatics 35 , 1553–1561 (2019).

Saxena, A., Tripathi, A., & Talukdar, P. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , 4498–4507. https://doi.org/10.18653/v1/2020.acl-main.412 (2020).

Yoo, S. et al. A data-driven approach for identifying medicinal combinations of natural products. IEEE Access 6 , 58106–58118. https://doi.org/10.1109/ACCESS.2018.2874089 (2018).

Brown, G. R. et al. Gene: A gene-centered information resource at NCBI. Nucleic Acids Res. 43 , D36–D42. https://doi.org/10.1093/nar/gku1055 (2015).

Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46 , D754–D761. https://doi.org/10.1093/nar/gkx1098 (2018).

Oughtred, R. et al. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 47 , D529–D541. https://doi.org/10.1093/nar/gky1079 (2019).

Whirl-Carrillo, M. et al. Pharmacogenomics knowledge for personalized medicine. Clin. Pharmacol. Ther. 92 , 414–417. https://doi.org/10.1038/clpt.2012.96 (2012).

Bateman, A. et al. UniProt: The universal protein knowledgebase. Nucleic Acids Res. 45 , D158–D169. https://doi.org/10.1093/nar/gkw1099 (2017).

Federhen, S. The NCBI taxonomy database. Nucleic Acids Res 40 , D136–D143. https://doi.org/10.1093/nar/gkr1178 (2012).

Kim, S. et al. PubChem 2019 update: Improved access to chemical data. Nucleic Acids Res 47 , D1102–D1109. https://doi.org/10.1093/nar/gky1033 (2019).

Mendez, D. et al. ChEMBL: Towards direct deposition of bioassay data. Nucleic Acids Res 47 , D930–D940. https://doi.org/10.1093/nar/gky1075 (2019).

Hastings, J. et al. ChEBI in 2016: Improved services and an expanding collection of metabolites. Nucleic Acids Res. 44 , D1214–D1219. https://doi.org/10.1093/nar/gkv1031 (2016).

Park, J., Kim, J.-S. & Bae, S. Cas-database: Web-based genome-wide guide RNA library design for gene knockout screens using CRISPR-Cas9. Bioinformatics 32 , 2017–2023. https://doi.org/10.1093/bioinformatics/btw103 (2016).

Gilson, M. K. et al. BindingDB in 2015: A public database for medicinal chemistry, computational chemistry and systems pharmacology. Nucleic Acids Res. 44 , D1045–D1053. https://doi.org/10.1093/nar/gkv1072 (2016).

Kanehisa, M., Sato, Y., Furumichi, M., Morishima, K. & Tanabe, M. New approach for understanding genome variations in KEGG. Nucleic Acids Res. 47 , D590–D595. https://doi.org/10.1093/nar/gky962 (2019).

Wishart, D. S. et al. DrugBank 5.0: A major update to the DrugBank database for 2018. Nucleic Acids Res. 46 , D1074–D1082. https://doi.org/10.1093/nar/gkx1037 (2018).

Ashburner, M. et al. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25 , 25–29. https://doi.org/10.1038/75556 (2000).

Garla, V. N. & Brandt, C. Semantic similarity in the biomedical domain: An evaluation across knowledge sources. BMC Bioinform. 13 , 261. https://doi.org/10.1186/1471-2105-13-261 (2012).

Trouillon, T. et al. Knowledge graph completion via complex tensor factorization. J. Mach. Learn. Res . 18 , 4735–4772. https://doi.org/10.5555/3045390.3045609 (2017).


Fan, W. et al. Traditional uses, botany, phytochemistry, pharmacology, pharmacokinetics and toxicology of Xanthium strumarium L.: A review. Molecules 24 , 359. https://doi.org/10.3390/molecules24020359 (2019).


Li, G. et al. Syringaresinol protects against type 1 diabetic cardiomyopathy by alleviating inflammation responses, cardiac fibrosis, and oxidative stress. Mol. Nutr. Food Res. 64 , 2000231. https://doi.org/10.1002/mnfr.202000231 (2020).

Himmelstein, D. S. et al. Systematic integration of biomedical knowledge prioritizes drugs for repurposing. Elife 6 , 1–35. https://doi.org/10.7554/eLife.26726 (2017).

Recanatini, M. & Cabrelle, C. Drug research meets network science: Where are we? J. Med. Chem. 63 , 8653–8666. https://doi.org/10.1021/acs.jmedchem.9b01989 (2020).

Himmelstein, D. S. & Baranzini, S. E. Heterogeneous network edge prediction: A data integration approach to prioritize disease-associated genes. PLOS Comput. Biol. 11 , e1004259. https://doi.org/10.1371/journal.pcbi.1004259 (2015).


Webber, W., Moffat, A. & Zobel, J. A similarity measure for indefinite rankings. ACM Trans. Inf. Syst. https://doi.org/10.1145/1852102.1852106 (2010).

Ajebli, M., Khan, H. & Eddouks, M. Natural alkaloids and diabetes mellitus: A review. Endocr. Metab. Immune Disord. Drug Targets 21 , 111–130. https://doi.org/10.2174/1871530320666200821124817 (2021).

Yang, D. K. & Kang, H.-S. Anti-diabetic effect of cotreatment with quercetin and resveratrol in streptozotocin-induced diabetic rats. Biomol. Ther. 26 , 130–138. https://doi.org/10.4062/biomolther.2017.254 (2018).

Naha, S., Gardner, M. J., Khangura, D., Kurukulasuriya, L. R. & Sowers, J. R. Hypertension in diabetes, Endotext (2021).

Jung, U. J., Lee, M.-K., Park, Y. B., Jeon, S.-M. & Choi, M.-S. Antihyperglycemic and antioxidant properties of caffeic acid in db/db mice. J. Pharmacol. Exp. Ther. 318 , 476–483. https://doi.org/10.1124/jpet.106.105163 (2006).

Qureshi, W. et al. Risk of diabetes associated with fatty acids in the de novo lipogenesis pathway is independent of insulin sensitivity and response: The Insulin Resistance Atherosclerosis Study (IRAS). BMJ Open Diabetes Res. Care 7 , e000691. https://doi.org/10.1136/bmjdrc-2019-000691 (2019).

Granado-Casas, M. & Mauricio, D. Oleic acid in the diet and what it does: Implications for diabetes and its complications. In Bioactive Food as Dietary Interventions for Diabetes , 211–229 (Elsevier, 2019). https://doi.org/10.1016/B978-0-12-813822-9.00014-X .

Virtanen, J. K., Tuomainen, T.-P. & Voutilainen, S. Dietary intake of choline and phosphatidylcholine and risk of type 2 diabetes in men: The Kuopio Ischaemic Heart Disease Risk Factor Study. Eur. J. Nutr. 59 , 3857–3861. https://doi.org/10.1007/s00394-020-02223-2 (2020).

Socała, K., Szopa, A., Serefko, A., Poleszak, E. & Wlaź, P. Neuroprotective effects of coffee bioactive compounds: A review. Int. J. Mol. Sci. 22 , 50. https://doi.org/10.3390/ijms22010107 (2020).

Ward, M. G., Li, G., Barbosa-Lorenzi, V. C. & Hao, M. Stigmasterol prevents glucolipotoxicity induced defects in glucose-stimulated insulin secretion. Sci. Rep. 7 , 9536. https://doi.org/10.1038/s41598-017-10209-0 (2017).

Peleli, M. & Carlstrom, M. Adenosine signaling in diabetes mellitus and associated cardiovascular and renal complications. Mol. Aspects Med. 55 , 62–74. https://doi.org/10.1016/j.mam.2016.12.001 (2017).


This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT) (No. 2022R1A2B5B02002359).

Author information

Authors and Affiliations

Department of Library and Information Science, Yonsei University, Seoul, Republic of Korea

Arida Ferti Syafiandini, Gyuri Song, Yuri Ahn, Heeyoung Kim & Min Song


Contributions

Y.A. and G.S. collected data and performed experiments. A.F.S. performed experiments and was a major contributor in writing the manuscript. H.K. validated the results and generated hypotheses. M.S. designed and supervised experiments. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Min Song .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Supplementary Information 1.

Supplementary Information 2.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Syafiandini, A.F., Song, G., Ahn, Y. et al. An automatic hypothesis generation for plausible linkage between xanthium and diabetes. Sci Rep 12 , 17547 (2022). https://doi.org/10.1038/s41598-022-20752-0


Received : 07 December 2021

Accepted : 19 September 2022

Published : 20 October 2022

DOI : https://doi.org/10.1038/s41598-022-20752-0





An AI Tool for Automated Research Question and Hypothesis Generation from a given Scientific Literature

bhaskatripathi/HypothesisHub

HypothesisHub

HypothesisHub is an AI Tool for the Automated Generation of Research Questions and Hypotheses from Scientific Literature. It applies a chain of reasoning to scientific literature to generate questions and hypotheses. OpenAI and Langchain serve as the underlying technologies for the tool.

Open In Colab

  • Generates research questions from a given scientific literature
  • Generates a null hypothesis (H0) and an alternate hypothesis (H1) for each research question
  • Handles cases where either H0 or H1 is not present
  • Automatically generates missing H1 using the LLMChain if needed
  • Negates hypothesis statement if H0 is missing
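
Ignoring the LLM machinery, the H0/H1 completion rules in the list above can be sketched as plain logic. The names `complete_hypotheses` and `generate_h1` are hypothetical, `generate_h1` stands in for the LLMChain call, and the literal string negation is a deliberate simplification of what the tool asks the model to do.

```python
def complete_hypotheses(question, h0=None, h1=None, generate_h1=None):
    """Fill in a missing null (H0) or alternate (H1) hypothesis.

    - If H1 is missing, delegate to `generate_h1` (an LLM call in the
      real tool; here any callable taking the research question).
    - If H0 is missing, derive it by negating the H1 statement.
    """
    if h1 is None and generate_h1 is not None:
        h1 = generate_h1(question)
    if h0 is None and h1 is not None:
        h0 = "It is not the case that " + h1
    return h0, h1
```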

Sequence Diagram


Please give a star if you like this project and find it useful.




Computer Science > Artificial Intelligence

Title: Automating Psychological Hypothesis Generation with AI: Large Language Models Meet Causal Graph

Abstract: Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using a LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 potential psychological hypotheses focusing on `well-being', then compared them against research ideas conceived by doctoral scholars and those produced solely by the LLM. Interestingly, our combined approach of a LLM and causal graphs mirrored the expert-level insights in terms of novelty, clearly surpassing the LLM-only hypotheses (t(59) = 3.34, p=0.007 and t(59) = 4.32, p<0.001, respectively). This alignment was further corroborated using deep semantic analysis. Our results show that combining LLM with machine learning techniques such as causal knowledge graphs can revolutionize automated discovery in psychology, extracting novel insights from the extensive literature. This work stands at the crossroads of psychology and artificial intelligence, championing a new enriched paradigm for data-driven hypothesis generation in psychological research.


Artificial intelligence to automate the systematic review of scientific literature

  • Regular Paper
  • Open access
  • Published: 11 May 2023
  • Volume 105, pages 2171–2194 (2023)


  • José de la Torre-López, ORCID: orcid.org/0000-0003-2612-395X
  • Aurora Ramírez, ORCID: orcid.org/0000-0002-1916-6559
  • José Raúl Romero, ORCID: orcid.org/0000-0002-4550-6385


Artificial intelligence (AI) has acquired notable relevance in modern computing as it effectively solves complex tasks traditionally done by humans. AI provides methods to represent and infer knowledge, efficiently manipulate texts and learn from vast amounts of data. These characteristics are applicable in many activities that humans find laborious or repetitive, as is the case of the analysis of scientific literature. Manually preparing and writing a systematic literature review (SLR) takes considerable time and effort, since it requires planning a strategy, conducting the literature search and analysis, and reporting the findings. Depending on the area under study, the number of papers retrieved can be in the hundreds or thousands, meaning that filtering the relevant ones and extracting the key information becomes a costly and error-prone process. However, some of the tasks involved are repetitive and, therefore, subject to automation by means of AI. In this paper, we present a survey of AI techniques proposed in the last 15 years to help researchers conduct systematic analyses of scientific literature. We describe the tasks currently supported, the types of algorithms applied, and the available tools proposed in 34 primary studies. This survey also provides a historical perspective on the evolution of the field and the role that humans can play in an increasingly automated SLR process.


1 Introduction

Artificial intelligence (AI) has come to relieve people of tasks they do repeatedly at work but that require some human abilities to succeed. Scientists are no exception, and they too demand powerful computational techniques to accelerate their results. In this sense, starting new research often involves an in-depth analysis of related scientific literature in order to understand the context and find relevant works addressing the same or a similar problem. Moreover, searching, screening and extracting key information from an extensive collection of papers is a time-consuming task that, done without experience or clear guidelines, can lead to missing important contributions. Potential biases and errors can be mitigated by providing a rigorous methodology for literature search and analysis [ 1 ]. A systematic literature review (SLR) is a secondary study that follows a well-established methodology to find relevant papers, extract information from them and properly present their key findings [ 2 ]. The literature review is expected to provide a complete overview of a research topic, often including a historical perspective that helps identify trends and open issues. Literature reviews have become an important piece of work in many scientific disciplines, such as medicine, the area with the largest number of reviews published (13,510), and computing (6,342). Footnote 1

Conducting a literature review is known to be costly in time, especially if the authors cover a broad field. To support the SLR process, several tools have been created in recent years for different purposes [ 3 ]. Among other features, SLR tools can import literature search results from electronic databases, mark them as relevant based on the inclusion criteria, or provide visual assistance to analyse meta-information from authors and citations. Going one step further, automating the SLR process is gaining attention as an application domain in computing research [ 4 ], mostly through methods that semi-automatically build search strings or retrieve papers from scientific databases. The use of automated approaches has proven to save time and resources when it comes to selecting relevant papers [ 5 ] or sketching the report of findings [ 6 ]. Nevertheless, some authors still suggest that their practical use is limited due to the required learning curve and the lack of studies evaluating their benefits [ 7 ].

In this paper, we focus on the automation of SLR tasks using AI as the main driver, seeking to augment the capabilities of automated methods and tools with additional knowledge and recommendations. The first use of AI techniques for automating SLR tasks dates back to 2006 [ 8 ], when a neural network was proposed to automatically select primary studies based on information extracted via text mining. Following this idea, other authors have explored further text mining strategies [ 9 , 10 ] and, more recently, machine learning (ML) and natural language processing (NLP) [ 4 ]. The possibilities that AI brings to the analysis of scientific literature are wide, considering all the repetitive tasks that the SLR methodology entails. However, the role that humans play in the process should not be diminished, since they have a holistic view of the process that current AI techniques still lack.

The application of AI techniques to automate the SLR process is still a young discipline that is expected to continue growing in the coming years. The increasing interest suggests that it is a good moment to analyse the AI techniques currently proposed to address the different SLR tasks, with special emphasis on their purpose, inputs and outputs, and human intervention, if any. Some of the secondary studies published so far in the area have already included AI techniques in their analysis of methods and tools for supporting SLR tasks. However, these studies either take a more general perspective, focusing on any kind of automation and not necessarily on AI [ 3 , 4 ], or are specialised in a particular AI technique (e.g. ML) [ 11 ] or SLR task (e.g. paper selection) [ 9 , 12 ]. Furthermore, these studies may lack an in-depth explanation of the AI concepts and techniques applicable to the whole SLR process. Therefore, this paper presents a complete survey of the area, while also examining in depth the role that humans play in a semi-automatic SLR process, a perspective not considered by any previous literature review. With these goals in mind, we analyse the current state of AI-based SLR automation guided by the following research questions (RQs):

RQ1. Which phases of the SLR process have been automated using AI?

RQ2. Which are the AI techniques supporting the automation of SLR tasks?

RQ3. To what extent is the human involved in SLR automation with AI?

To respond to these RQs, we conduct a systematic literature search as part of our survey. We identify 34 primary studies from more than 9,000 references retrieved through both automatic and manual search. Footnote 2 An analysis of these works is carried out to understand the purpose of using AI for solving a specific task. Then, we focus on the characteristics of the proposed methods, including their inputs, outputs and algorithmic choices. We also collect information on how each approach is experimentally evaluated, including the performance metrics and corpus of papers used for comparison. From our analysis, we found that some tasks are far more studied than others, and that some ML techniques proposed in the early stages are still in use. However, we also discover some recent works exploring new ML approaches in which the human can be more involved. The discussion of our findings to answer each RQ has helped us identify some open issues and challenges related to unsupported tasks, additional AI techniques not yet considered, and experimental reproducibility.

2 Background

A systematic literature review is a secondary study that rigorously unifies and analyses scientific literature in order to synthesise current knowledge, critically discuss existing proposals and identify trends. A SLR follows a well-established methodology to conduct evidence-based research [ 2 ], including the definition of research questions (RQs) and a replicable procedure to find relevant papers, a.k.a. primary studies, from which information will be extracted.

Conducting a SLR brings benefits to both its authors and the target research community. For authors, the SLR represents an opportunity to study a topic in depth, which is particularly recommended for graduate students [ 13 ]. For readers, SLRs provide a comprehensible and up-to-date overview of their field of interest, usually becoming a reference work to identify key studies and discover the latest advances. SLRs are known to have some drawbacks too, such as the long time needed to complete them or the difficulty of evaluating the quality of primary studies [ 14 ]. Recently, common threats related to SLR replicability have been analysed [ 15 ], pointing out problems that arise due to the lack of a clear methodology. The methodology proposed by Kitchenham and Charters [ 2 ] divides the SLR process into the following phases:

Planning phase . The need for a SLR in the research area is motivated, thus guaranteeing that it will contribute to fill a gap and spread knowledge. Research questions are formulated to set the scope of the SLR and guide its development. They can follow predefined structures, e.g. PICO (Population, Intervention, Comparison and Outcome) or SPICE (Setting, Perspective, Intervention, Comparison and Evaluation) [ 16 ]. During this phase, a review protocol is prepared with a detailed strategy for all phases of the review. The protocol includes the search procedure and its sources, e.g. scientific databases and libraries; the definition of inclusion and exclusion criteria to select papers; and guidelines for data extraction and quality evaluation.

Conducting phase . Automatic searches in databases and digital libraries are executed with search strings derived from the RQs or built with some supporting method [ 17 ]. It is worth considering other sources too, such as grey literature and snowballing [ 18 ]. The former consists of including sources like theses, dissertations, presentations and others that are not part of formal or commercial publications. Snowballing is a manual method where new literature is obtained by looking at references and citations in papers previously found. This helps access a more comprehensive collection of information on the topic. After the search, relevant studies have to be identified from the retrieved results, a process that includes duplicate removal, identification of candidates, usually based on title and abstract, and the application of exclusion and inclusion criteria. These criteria specify the quality requirements that each paper must satisfy in order to be considered in scope [ 19 ]. The primary studies are then analysed to extract information. Summary statistics can be obtained to synthesise and visualise the collected data.

Reporting phase . This phase mostly refers to the writing process, including mechanisms to evaluate the completeness and quality of the final report. The authors should decide how the information is presented and discussed, and determine whether the review report is ready for publication. Guidelines in the form of checklists have been proposed to verify that the SLR report contains the essential information [ 20 ].

3 Methodology

Figure  1 shows the methodological steps followed to retrieve papers and extract information from them [ 2 , 21 ]. Next, each step is explained in detail.

Figure 1

Steps for searching and selecting relevant papers in AI-based SLR automation

3.1 Search strategy

The search strategy comprises both automatic and manual search. For automatic search, the following sources are queried: ACM Library, IEEE Xplore, Scopus, SpringerLink and Web of Science. The search string defined to retrieve papers is composed of multiple terms that combine keywords related to systematic reviews with words referring to automation. We choose general terms related to automation instead of a list of specific AI techniques for two reasons: (1) such a list might bias the results towards particular techniques, preventing less common approaches from appearing in the results; and (2) a fully detailed list of techniques would result in long and complex search strings, which are difficult for databases to handle. Figure  2 shows the resulting search string, which was adapted to each data source when needed. The fields considered for the search are title, keywords and abstract.

Figure 2

Search string defined for retrieving papers related to SLR automation

Figure 3

Distribution of primary studies per year

After the execution of the search queries, 9027 references are returned. Figure  1 shows the number of papers retrieved from each source. From this set, 2417 references are duplicates and, therefore, excluded from the total count. Then, a manual inspection of title and abstract is carried out to obtain a list of 44 candidate papers. Based on this list, manual search is performed via backwards snowballing. From the 8 papers initially found, 6 are added to the final list of candidate papers after reading their title and abstract.

The 50 candidate papers are further analysed to confirm that they are within scope. With this aim, exclusion and inclusion criteria are established. Excluded papers correspond to manuscripts not written in English, those whose full content cannot be reached, or publications without evidence of a peer review process. Inclusion criteria specify restrictions applied to the content of the paper. To be considered for this survey, the paper should be focused on the automation of one or more phases of a SLR, and explicitly mention the application of some AI-based approach. This general criterion is decomposed into a number of mutually exclusive options: (1) the paper describes a new algorithm, tool or technique supporting the automation or semi-automation of a SLR; (2) the paper analyses the importance of the automation of a SLR and provides a retrospective of the state-of-art in this field; (3) the paper reports a summary of tools that are related to one or more phases of a SLR.

After applying these criteria, 34 papers are finally selected as primary studies. Figure  3 shows their distribution over the years, divided into conference (32%) and journal papers (68%). The first study appeared in 2006, and it is not until 2009 that other proposals were published. After that, the number of papers per year remains fairly constant, without a clear predominance of conferences or journals. However, it is noticeable that 57% of the journal papers (13) have been published in the last five years.

3.2 Data extraction

Once all primary studies are identified, they are thoroughly analysed to gather information using a data extraction form [ 2 ]. Each paper is reviewed by one author, with a second reviewer involved in case of doubt. The data extraction form includes meta-information, e.g. authors and their affiliation, type of study and publication year, and categories to characterise the AI approach. More specifically, the content of each paper is summarised according to:

SLR phase and task . The paper is classified according to the SLR phase(s) that it automates, detailing the specific step(s) in that phase.

AI area and technique . The paper is assigned to one or more AI areas, including a short description of the algorithm or method used. We also annotate whether the human is somehow involved in the process.

Experimental framework . The type of primary study is identified among empirical, theoretical, application or review. For empirical studies, we collect the data corpus and the performance metrics used for evaluation.

Reproducibility . We check whether the algorithms, datasets and tools included in the paper are publicly available. To do so, we visit any website or repository mentioned as additional material to confirm that the content is reachable.

4 AI techniques for SLR automation

This section presents the AI techniques organised by SLR phase, namely planning (Sect.  4.1 ), conducting (Sect.  4.2 ) and reporting (Sect.  4.3 ).

4.1 AI techniques for the planning phase

At the beginning of the planning phase, it is recommended to perform a preliminary analysis of the scope and magnitude of the SLR [ 22 ]. In the context of health research, “scoping” reviews are a way to quickly identify research themes, for which papers need to be catalogued in order to obtain a “map” of the research topic. Due to its descriptive nature, unsupervised learning is suitable because it does not need data labels, i.e. predefined research topics in this case. In particular, clustering becomes a relevant approach here, as it is able to identify groups of entities, such as papers, that share characteristics. Lingo3G Footnote 3 is a document clustering algorithm that has been used to group similar papers based on their title and abstract [ 22 ]. It allows papers to be associated with more than one cluster, and can also generate hierarchical clusters, thus providing a more refined topic classification. After clustering, the reviewer can map clusters to concepts. The method was evaluated using the results of previous “scoping” reviews from a health institution, comparing the topics automatically generated by clustering with those assigned by manual review.
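Lingo3G is proprietary, but the underlying idea, grouping papers whose titles share vocabulary, can be conveyed with a crude one-pass greedy clustering; the titles and threshold below are made up for illustration:

```python
def jaccard_sim(title_a, title_b):
    """Word-overlap (Jaccard) similarity between two titles."""
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    return len(a & b) / len(a | b)

def greedy_cluster(titles, threshold=0.25):
    """One-pass greedy clustering: each title joins the first cluster
    whose seed (first member) it overlaps enough with, otherwise it
    starts a new cluster."""
    clusters = []
    for title in titles:
        for cluster in clusters:
            if jaccard_sim(title, cluster[0]) >= threshold:
                cluster.append(title)
                break
        else:
            clusters.append([title])
    return clusters
```

Unlike Lingo3G, this sketch assigns each paper to a single flat cluster; it only illustrates the grouping-by-content idea.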

Although the review process itself should be analysed throughout the whole SLR development, decisions about the available resources and task prioritisation should be taken during the planning stage. Process mining has been studied as a potential approach to understand the required effort and usual organisation of SLR activities [ 23 ]. Process mining encompasses, among other methods, a number of data mining techniques that analyse business processes by means of event logs. Its main goal is the identification of trends and patterns with the aim of generating knowledge and increasing the efficiency of the business process. The method proposed by Pham et al. [ 23 ] analyses event logs produced by 12 manual SLR processes simulated by a multidisciplinary team. The logs are the input to the process mining method, which is able to extract information about task assignment, timelines and effort measured in person-hours. More specifically, a heuristic mining algorithm analyses the frequency of events to determine the most relevant activities (e.g. searching papers, selecting them or reporting findings) and how they are temporally distributed. To do this, a dependency graph is built to discover sequence patterns between the SLR tasks, e.g. whether a task is usually followed by another. Also, a fuzzy mining algorithm is executed to abstract different review models (how people conduct SLRs) by excluding less relevant activities or their characteristics (time spent, people involved, etc.). The algorithm uses two metrics, significance and correlation, to decide which events and relationships between them should be highlighted, aggregated or removed to simplify the process model.

4.2 AI techniques for the conducting phase

This phase has attracted great attention from the AI perspective, with 59% of the primary studies related to its tasks. The selection of primary studies stands out as the most frequently supported task, with a total of 18 papers. ML is the most widely used branch of AI at this phase, often combined with NLP and text mining. Therefore, we first describe how paper selection is addressed from the ML perspective. Then, we focus on those tasks within the conducting phase that have been automated with other AI techniques.

The automatic selection of primary studies using ML requires two main steps: (1) the extraction of features to characterise the papers and (2) the training of a classifier to discern between those papers to be included and those to be excluded from the SLR. Feature extraction for paper selection often requires creating a list of topics or keywords from the title and abstract. NLP and text mining are applied to computationally handle and process such textual information. NLP provides efficient mechanisms for information retrieval and extraction from pieces of text so that they can be processed by a machine. NLP involves a series of steps to process and synthesise the data, such as word tokenisation, removal of stop words, and stemming. Text mining, which combines NLP steps with data mining methods, allows processing and analysing large fragments of text. Text mining is particularly relevant for inferring non-explicit knowledge and dealing with semantic aspects. In the second step, each candidate paper is processed by the learning algorithm based on its features, and a decision is made about the relevance of the paper with respect to the SLR topic. Here, three ML paradigms have been considered: supervised learning, active learning and reinforcement learning. In supervised learning, a labelled dataset is required to train the decision model. Active learning does not assume the availability of labelled data, but considers that labels can be obtained at a certain cost. Reinforcement learning evaluates the rewards obtained when taking decisions over the data.
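The preprocessing steps named above (word tokenisation, stop-word removal, stemming) can be sketched in a few lines; the stop-word list and suffix rules are deliberately minimal stand-ins for a real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "for", "and", "in", "to", "on", "with"}

def crude_stem(word):
    # Very rough Porter-style suffix stripping, for illustration only.
    for suffix in ("ing", "ies", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())         # word tokenisation
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    return [crude_stem(t) for t in tokens]               # stemming
```

The resulting token lists are what a BoW representation counts over in the studies discussed below.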

4.2.1 Supervised learning techniques for paper selection

Supervised methods have been extensively explored for paper selection, using existing SLRs to create labelled datasets to train from. The pioneering work combines text mining with neural networks [ 8 ]. More specifically, the voting perceptron algorithm is used to train a classifier able to discern between relevant and non-relevant papers. The decision is based on a bag-of-words (BoW) representation of the papers, which is obtained from title and abstract via text mining, using the Porter stemming algorithm and removing stop words. This work is also important for its definition of the WSS@95% evaluation metric, which has become a reference in many later studies. These authors use the same BoW representation in a subsequent study [ 24 ], which applies a fast implementation of support vector machine (SVM) called SVMlight. They also propose a novel way to train the model with a combination of topic-specific and non-topic-specific papers. By “topic” they refer to the research area for which the SLR is conducted, whereas non-topic papers are not strictly related to the field under study but to a close discipline. Such non-topic papers can be useful when the SLR covers a new research field with few publications so far. As the authors report, topic-specific classification can be biased, and very few papers were deemed relevant. In contrast, enlarging the training data with non-topic papers increased the performance of the method. In another study, SVMlight is trained with 19 systematic reviews of different topics conducted in a medical institution [ 25 ]. Each paper is identified as included, excluded due to general criteria like the type of paper or publication source, or excluded due to topic-specific criteria. To characterise each paper, the authors combine the publication type with words extracted from title, abstract and indexing terms.

The performance of SVM and logistic regression (LR) with different sets of features has been compared against human screening [ 26 ]. A BoW approach is used to build the features for the SVM classifier, using the TF-IDF (Term Frequency-Inverse Document Frequency) metric to weight the importance of each word. As for LR, BoW features are combined with 300 topics extracted by a topic modelling algorithm (Latent Dirichlet Allocation, LDA). The authors study the performance of each method and the discrepancies between machine predictions and human decisions. Thomas et al. [ 27 ] also consider a BoW approach, using title and abstract, to build an ensemble classifier. More specifically, the ensemble is comprised of two SVM models. The first SVM is trained with terms of one, two or three words in order to preserve some semantics. The second SVM only takes one-word terms into account and applies an oversampling method to improve the classification rate of the minority class. To combine both SVM scores in the ensemble, a logistic regression model known as Platt scaling is applied. This scaling generates an output in the interval [0,1], which represents the probability that a paper is selected.
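Platt scaling itself is a logistic transform of the raw classifier margin. A minimal sketch, assuming illustrative coefficients (in practice a and b are fitted by logistic regression on held-out scores, and the ensemble above combines the two SVM outputs with the fitted model rather than the plain average used here):

```python
import math

def platt(margin, a=-1.7, b=0.0):
    """Platt scaling: map a raw SVM margin to a probability in [0, 1].
    The coefficients a and b are illustrative, not fitted values."""
    return 1.0 / (1.0 + math.exp(a * margin + b))

def ensemble_probability(svm1_margin, svm2_margin):
    """Put both SVM scores on a common probabilistic scale, then
    combine them (simple averaging, for illustration)."""
    return (platt(svm1_margin) + platt(svm2_margin)) / 2
```

A margin of 0 maps to probability 0.5, and larger positive margins map monotonically towards 1.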

Naive Bayes (NB) classification is another supervised approach that has been studied for automating paper selection. FCNB/WE (Factorized Complement Naive Bayes/Weight Engineering) combines a modified version of Complement NB (CNB) with feature engineering to assign different weights to the features. CNB amends the Multinomial NB (MNB) algorithm to use word count normalisation [ 28 ]. A comparison against the algorithm proposed by Cohen et al. [ 8 ] using the same corpus of papers is included to assess the improvement achieved by the proposal. CNB has also been trained under two additional methodologies to deal with the imbalanced training set of candidate papers [ 29 ]. First, the authors use a human-annotated training corpus for which three representations are compared: (1) BoW as in [ 8 ], (2) a more specialised collection of terms from a medical knowledge repository, and (3) the combination of both. Only abstracts are considered to classify papers, simulating an early step of candidate identification. The second approach, referred to as per-question classification, requires building a classifier for each inclusion criterion. Different voting aggregation methods are studied to finally decide whether the candidate paper is selected or not. In short, SVM and NB are the supervised techniques most frequently applied, even though other classification algorithms have also been employed. García Adeva et al. [ 30 ] combine the use of seven feature selection methods and four classification techniques. Feature selection is applied to keep only a proportion of the most relevant terms, which are measured using popular text mining metrics like term frequency (TF), document frequency (DF) and inverse document frequency (IDF).  As for classification, the authors compare NB, SVM, k-nearest neighbours (kNN) and Rocchio.
The experiments suggest that SVM outperforms the other algorithms when the papers are characterised by their title or by a combination of title and abstract. When only abstracts are considered, Rocchio and NB show better performance. Almeida et al. [ 31 ] present another study comparing several classifiers and feature sets, which is specialised for biomedical literature. The papers are represented as BoW taken from abstract and title, alone or in combination with a specific list of biomedical terms. They select a subset of words using two metrics: IDF and odds ratio. The authors compare NB, SVM and a logistic model tree (LMT), which builds a decision tree (DT) with logistic regression models at its nodes. The best results were obtained by LMT on the combination of BoW with biomedical terms.
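TF-IDF, the weighting scheme recurring throughout these studies, is easy to compute by hand. A minimal version over tokenised documents (one common variant; real implementations add smoothing):

```python
import math

def tf_idf(term, doc, corpus):
    """One common TF-IDF variant: relative term frequency times the
    log of the inverse document frequency (no smoothing)."""
    tf = doc.count(term) / len(doc)            # term frequency in this doc
    df = sum(1 for d in corpus if term in d)   # document frequency
    return tf * math.log(len(corpus) / df)     # weight = TF * IDF
```

On a toy corpus such as [["svm", "screening"], ["svm", "triage"], ["nurse", "burnout"]], the distinctive word "screening" outweighs the ubiquitous "svm" in the first document, which is exactly why TF-IDF is preferred over raw counts for feature selection.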

4.2.2 Active learning techniques for paper selection

All the above methods follow a supervised strategy. Note that, in these cases, the corpus of candidate papers could comprise thousands of irrelevant papers retrieved by automatic search if the search string is too generic or not sufficiently refined. Active learning has emerged as a relevant paradigm for paper selection, since it is founded on the idea that labelling is a costly process that can only be partially done by querying an external oracle during the learning process. The classification can be performed by the usual techniques for supervised learning, SVM being the preferred one for paper selection. Based on this idea, Abstrackr applies SVM under an active learning approach where the oracle is the human reviewer [ 32 ]. Implemented as a web tool, Abstrackr shows the title, keywords and abstract of a paper to be labelled as relevant, irrelevant or borderline. Reviewers are asked to highlight the terms that support their decision, which will then be exploited for learning by the SVM classifier.

The labels annotated by a human reviewer can be propagated to similar unlabelled papers following different strategies [ 33 ]. One possibility is that the label assigned by the reviewer is propagated to neighbouring unlabelled papers using the cosine distance between the paper representations: BoW or a low-dimensional representation obtained by a technique similar to principal component analysis. The underlying classifier, SVM, predicts the label of the remaining papers together with a certainty level. In each new cycle, the reviewer is asked to provide new labels for a sample of either the least or the most uncertain predictions. FASTREAD [ 34 ] is a conceptual active learning approach, also using SVM as the underlying classifier, which can be “instantiated” into 32 different learning models depending on: (1) when to start training, (2) which document to query next, (3) when to stop training, and (4) how to balance the training data. The 4,000 terms from title and abstract with the highest TF-IDF scores become the features for learning. The authors are particularly interested in analysing the ability of these methods to exclude irrelevant works, showing that a specific configuration of their abstract method leads to better performance than state-of-the-art algorithms. Building upon these findings, a later work presents FAST², an improved active learner [ 35 ]. FAST² includes a new strategy to identify the first relevant paper using domain knowledge, an LR-based estimation to decide when learning should stop, and a method to revise disagreements in paper labelling between the learner and the human.
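The query loop shared by these active learners can be sketched independently of the underlying classifier; `predict_proba` stands in for whatever model is retrained between queries (the retraining step is elided here):

```python
def most_uncertain(unlabelled, predict_proba):
    """Uncertainty sampling: pick the paper whose predicted probability
    of relevance is closest to 0.5."""
    return min(unlabelled, key=lambda paper: abs(predict_proba(paper) - 0.5))

def active_learning_loop(pool, oracle, predict_proba, budget):
    """Query the human oracle for `budget` labels, always asking about
    the most uncertain remaining paper. A real system would retrain
    the classifier after each query; that step is elided here."""
    labelled = {}
    for _ in range(budget):
        remaining = [p for p in pool if p not in labelled]
        paper = most_uncertain(remaining, predict_proba)
        labelled[paper] = oracle(paper)  # human reviewer supplies the label
    return labelled
```

Swapping the `min` for a `max` over predicted relevance gives the "most certain" querying variant that some of the tools above also offer.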

4.2.3 Other methods to support paper selection

As suggested above, the selection of primary studies is strongly related to the quality of the search, so the former task could also benefit from an automatic definition of search strings [ 36 ]. The method starts from an initial set of accepted papers, whose title, abstract and keywords are used to infer the search strings by means of a DT (ID3 algorithm). Automatic search is then executed to collect candidate papers, which will undergo the ML-based paper selection. First, a BoW representation, extracted from title, abstract and keywords, is combined with a list of topics discovered by LDA to build the features. Since the authors argue that paper selection should be interactive and iterative, they propose the use of semi-supervised learning approaches: active learning (AL) and reinforcement learning (RL). The former shows the reviewer those papers with the highest probability of being primary studies, or those for which the classifier is most uncertain. The latter combines both ideas (probability and uncertainty) to explore papers that are not necessarily the most relevant ones as a way to avoid local optima. SVM and LR are internally used as classifiers for AL and RL, respectively. The authors also include greedy variants of SVM and LR that automatically select the paper with the highest probability.

Some other AI-based techniques have been proposed to assist in the process of paper selection, but they are not directly intended to automatically select the set of primary studies. Rather, the pool of candidate papers is inspected with additional information in order to evaluate their quality. In a first study, text mining and interactive visualisation techniques are combined [ 37 ]. In visual text mining, visualisation techniques are incorporated to show relations between documents and help inspect textual data [ 38 ]. These techniques are used to build a “document map” showing the relationships among candidate papers based on content similarity. Content similarity is calculated as the cosine distance between papers, each represented as a BoW vector. The extracted words are weighted using the TF-IDF metric. Clustering with the k-means algorithm is applied over the map, whose results should later be analysed by the reviewer using additional information. For instance, a citation map showing co-citation relationships extracted from BibTeX files can be used to assess the quality of a paper. The visual analysis is supported by Revis, a tool for document map creation, which was extended to incorporate citation maps. In a subsequent study, the authors propose the score citation automatic selection (SCAS) strategy, which again combines paper content and citation information to select candidate papers [ 39 ]. Two tools support their method: StArt, which provides a classification score based on the frequency of appearance of the search string in title, abstract and keywords; and Revis, for the analysis of cross-references among research papers. SCAS takes two inputs, the StArt score and whether the candidate paper is cited or not, to train a DT (J48 algorithm).
The tree classifies the papers into four classes (included, excluded or two categories of “to be reviewed”), also making it possible to identify the cut-off point of the StArt score that separates included papers from excluded ones. Labels are obtained from manual selection using three SLRs as case studies. Thirdly, the work by Langlois et al. [ 40 ] automatically classifies papers into empirical and non-empirical studies. The former are considered relevant, while the latter are discarded. kNN, NB, SVM, DT and ensembles (bagging and boosting strategies for DT) are applied as classification techniques. In this case, the authors first build the classification models with words extracted from the title, the abstract and a thesaurus of medical terms. Then, they analyse the classification performance under different ratios of full-text availability, concluding that adding words from full texts slightly improves the results.
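The “document map” computation described above can be sketched in a few lines (TF-IDF vectors, pairwise cosine distances, and k-means clusters for the reviewer to inspect); the corpus and the number of clusters are illustrative.

```python
# Illustrative document map: content similarity plus clustering
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

papers = [
    "software testing coverage criteria",
    "mutation testing tools comparison",
    "cost estimation with regression models",
    "cost estimation using neural networks",
]

# BoW vectors weighted by TF-IDF
X = TfidfVectorizer().fit_transform(papers)

# Pairwise cosine distance: the basis of the map layout
dist = cosine_distances(X)

# k-means groups content-similar papers for joint inspection
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

On this toy corpus the two testing papers and the two estimation papers end up in separate clusters, which the reviewer could then inspect alongside citation information.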

4.2.4 AI techniques for data extraction and summarisation

Finally, a few AI techniques focus on the data extraction task with the purpose of supporting knowledge representation. In this sense, ontologies are the main mechanism to capture real-world concepts and their semantic relationships. Ontology-based systems use a representation language, e.g. first-order logic or fuzzy logic, to encode such knowledge, which is combined with automatic reasoning techniques to make inferences. In the context of automated data extraction, the SLROnt ontology defines the concepts that appear in two key elements of an SLR: the review protocol and the set of primary studies [ 41 ]. The method focuses on automatic reasoning about primary studies, using information from abstracts to describe their most important characteristics. Such a description is based on the usual categories of structured abstracts (background, objective, method, results and conclusion). Similarly, the use of ontologies with information extracted from abstracts has been proposed as a means of providing a short description of biomedical papers [ 42 ]. A semantic representation of each paper is then derived, mapping words to concepts from three medical ontologies and setting predefined relationships among them. The paper description is generated from the semantic information by filling a PICO-based template. ML is applied for entity recognition during concept parsing, even though the details of the algorithm are not provided in the paper.

Data extraction has also been treated as a learning problem, whose goal is to classify sentences that are relevant for summarising experimental results [ 43 ]. In particular, this method identifies key sentences about medical treatment comparisons in full texts. SVM classifiers with linear and Gaussian kernels are trained on 100 sentences using manually assigned words and concepts. The method follows a multi-class approach, trying to identify the entities and treatment characteristics that appear in the comparison sentences.
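As a hedged illustration of such a multi-class sentence classifier (the sentence classes, training sentences and drug names are invented, and only the linear kernel is shown):

```python
# Hypothetical multi-class SVM over word features for key-sentence detection
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

sentences = [
    "drug alpha outperformed drug beta in survival",
    "drug beta showed fewer side effects than placebo",
    "patients were recruited from three hospitals",
    "the cohort included 120 participants",
    "results were statistically significant overall",
    "reported values were below the usual threshold",
]
classes = ["comparison", "comparison", "population",
           "population", "outcome", "outcome"]

# Linear-kernel SVM over bag-of-words sentence features
model = make_pipeline(CountVectorizer(), SVC(kernel="linear"))
model.fit(sentences, classes)

pred = model.predict(["drug alpha was compared with drug beta"])[0]
# pred == "comparison"
```

A Gaussian kernel would simply replace `kernel="linear"` with `kernel="rbf"`; the cited work trains both.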

4.3 AI techniques for the reporting phase

This last phase of the SLR process has received little attention so far. Current AI approaches only support two tasks: writing the SLR report and evaluating it.

The automatic generation of content for the SLR report is a complex task that was not addressed until very recently. A summary of each selected primary study is a good starting point for writing an SLR report. Teslyuk et al. envision a system combining NLP and deep neural networks to generate such summaries [ 44 ]. Deep learning is suitable here due to its ability to learn complex concepts from simple ones using layered architectures. The conceptual model takes a set of papers as input, for which up to five sentences located around citations are extracted using NLP. A pre-trained biomedical language representation model, called BioBERT, is responsible for encoding the sentences, which are then transformed into summaries by means of a long short-term memory (LSTM) recurrent neural network. An LSTM efficiently processes sequences of data, e.g. text, selectively keeping and forgetting parts of the inferred information.

A way of evaluating the SLR report is to analyse whether the relevant aspects of the primary studies are well reflected in the report. To do this, Liu et al. [ 45 ] propose the use of NLP to generate automatic questions about the content of the papers. These questions address the subject of research, its aim and contributions, the method and datasets used, the results obtained, and the strengths and limitations of the method. A named entity tagger, called LBJ, is the NLP technique applied for automatic question generation, together with phrase parsers and regular expressions. LBJ has a language model based on functions, constraints and an inference mechanism to support NLP tasks such as part-of-speech tagging, chunking and semantic labelling [ 46 ]. In the primary study, LBJ automates the identification of author names in citations. Then, the method formulates questions about the sentence explaining the cited work.

4.4 Previous analyses of the field and tool evaluations

During the literature search, we found works that cannot be classified in a particular phase. These works compare existing tools or analyse research literature related to the use of AI for SLR automation. They complement our analysis from different viewpoints and allow us to obtain a historical perspective.

A mapping study of tools to support SLRs in a computing field (software engineering) is based on the analysis of 14 papers [ 3 ]. The authors found that text mining tools, including those that integrate visualisation techniques, are prevalent in the area (57%). Extensions of Revis and the SLROnt ontology mentioned in Sect. 4.2 appear in this study, as they were evaluated with corpora of papers related to software testing and cost estimation, respectively. The authors conclude that the analysed tools were at an early stage of development. Moreover, experiments to assess their effectiveness were still very preliminary.

Tsafnat et al. [ 47 ] provide an overview of SLR automation in the domain of evidence-based medicine. Focusing on AI-based tools, they only include Abstrackr (see Sect. 4.2 ) in their analysis. Other techniques, like ontologies, clustering, supervised classification and NLP, are mentioned, but only as part of reference managers and specialised biomedical systems, without an in-depth analysis. In addition, the authors see great potential in the application of AI for: (1) automatic hypothesis generation, (2) improvements on inclusion criteria through reasoning, (3) duplication detection via NLP, (4) abstract screening combining ML and heuristics, and (5) better text analysis using NLP and optical character recognition for multi-language support.

Two other secondary studies focus on the analysis of ML techniques for the paper selection task [ 9 , 12 ]. The former provides a retrospective of different approaches to analyse how they contribute to workload reduction and the challenges that their application entails. From their analysis, the achievable workload reduction varies greatly depending on the experiments (30–70%). Among the identified problems, the authors highlight imbalanced data, i.e. the percentage of relevant studies is very low compared with the number of non-selected papers. They suggest class weighting and undersampling as possible solutions to this problem. Focusing on the techniques, the authors conclude that active learning ensures higher recall. The second work presents a more detailed analysis of the text mining techniques required for preprocessing as part of paper screening. The studied techniques are characterised in terms of the method used to extract features for learning, the type of classifier, the performance measures for evaluation and the corpus of papers. Feature representation is mostly based on term frequency (66%), including works that use TF-IDF and other information gain metrics to weight words. They found 13 different classification algorithms, SVM and ensembles being the most widely applied.
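The two remedies mentioned above can be sketched as follows (synthetic data; the class ratio and the classifier are illustrative): class weighting makes the learner penalise errors on the minority class more heavily, while undersampling discards majority-class examples to balance the training set.

```python
# Hedged sketch of class weighting vs. undersampling for screening imbalance
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.array([1] * 5 + [0] * 95)  # 5% relevant papers, 95% irrelevant

# Remedy 1: weight classes inversely to their frequency
weighted_clf = LinearSVC(class_weight="balanced").fit(X, y)

# Remedy 2: undersample the majority class to match the minority size
minority = np.flatnonzero(y == 1)
majority = rng.choice(np.flatnonzero(y == 0), size=len(minority), replace=False)
keep = np.concatenate([minority, majority])
X_bal, y_bal = X[keep], y[keep]
```

Either the weighted classifier or a classifier trained on `(X_bal, y_bal)` avoids the trivial "reject everything" solution that plain training on such skewed data tends to produce.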

A different perspective of the field is provided in two recent studies [ 11 , 48 ]. On the one hand, Beller et al. [ 48 ] present the principles that should guide the development of automated methods for SLR, which were derived from an international meeting of members of the ICASR (International Collaboration for the Automation of Systematic Reviews) group. The desired principles include improvements in efficiency, coverage of multiple tasks, flexibility to use and combine methods, and better replicability promoting the use of open source resources, among others. On the other hand, Marshall et al. [ 11 ] develop a practical guide for the use of ML methods to conduct SLRs in the medicine domain. The study is conceived as an introduction for non-experts, discussing the scope of each tool, as well as its strengths and limitations. Therefore, they only analyse tools accessible in an online catalogue named SR Toolbox, omitting scientific literature unless a supporting tool is also available. Thirteen tools are analysed, classified according to the SLR task: literature search, paper selection and data extraction. The authors suggest that most of these tools should be viewed as assistants, where the user plays a key role in validating the provided results. However, they also warn about the usability of these tools, since most of them are still prototypes or research-oriented tools. Nonetheless, new tools have appeared and others have evolved in recent years. We provide an up-to-date analysis of SLR tools using AI in our supplementary material.

5 Analysis of current trends

We discuss the state of the field in terms of SLR phases currently supported (RQ1), the selection of AI techniques (RQ2) and human intervention (RQ3).

5.1 SLR phases currently supported

Focusing on RQ1, our literature analysis indicates that all phases of the SLR process have been covered by at least one primary study, but the conducting phase stands out as the most studied by far due to the strong interest in the automatic selection of primary studies. This prevalence is in line with the conclusions drawn by the most recent review on SLR automation [ 4 ]. In contrast, this review also concluded that no study, whether using AI or not, supports the planning and reporting phases, although there are some primary studies applying AI techniques in these phases, as explained in Sects. 4.1 and 4.3 . The effort required during the selection of primary studies might well explain the high number of AI proposals to automate it. Indeed, several studies have measured the time spent on manual and semi-automatic selection, suggesting that AI-based methods can reduce the screening burden by up to 60% [ 49 ] and represent time savings of more than 80 h [ 50 ]. However, only a couple of tools supporting paper selection, Abstrackr and EPPI-Reviewer, seem to be relatively popular in the medicine domain. The fact that most of the proposed methods are not available as tools or integrated into other systems like reference managers seems to be hampering their use in practice. This also applies to the rest of the phases and tasks, since most of the surveyed publications only cover a very specific problem without giving complete support to the SLR process. According to our findings, only two papers address more than one task [ 23 , 36 ].

From a historical perspective, it is also interesting to note that the selection of primary studies has continued to attract attention since the publication of the first paper [ 8 ]. Five new methods have been proposed in the last four years [ 26 , 27 , 34 , 35 , 40 ], and supporting tools are the subject of evaluations [ 11 , 49 , 50 ].

5.2 Selection of AI techniques

In response to RQ2, ML is the most frequent AI area, with contributions exploring different learning paradigms: supervised and active learning for classification and, less often, unsupervised learning for clustering. Active learning has become the reference approach for paper selection [ 33 , 34 , 35 , 36 ]. With this approach, the cost of labelling is explicitly modelled, without assuming endless availability of previously labelled training data. Another recurrent characteristic is class imbalance in the paper selection task, for which authors have selected algorithms specifically designed for problems with imbalanced class distributions [ 29 ], or have incorporated some data balancing technique [ 27 , 34 ].

Focusing on ML algorithms, SVM is frequently adopted for classification (13 out of 17 papers), under either supervised or active learning approaches. SVM is known to be highly effective at coping with high-dimensional feature spaces [ 51 ], as is the case of the paper selection problem using a BoW feature representation. The other classifiers explored for paper selection are NB (5), DT (3), LR (2) and neural networks (2). Nevertheless, the number of papers is too low to draw conclusions about why a particular algorithm was chosen.

Since most of the primary studies focus on the paper selection problem, we further analyse the characteristics of the methods in terms of required inputs, types of outputs and availability of paper corpora for training. Table 1 summarises this information for the 11 papers focused on paper selection. Text mining is the usual approach to extract representative words from the candidate papers, which are later used to build the features for learning. In general, words are obtained from the title and abstract, and less often from the keywords too. Inspecting only these parts of candidate papers is the standard procedure during manual screening [ 2 ], and the most common approach in SLR automation [ 4 ]. However, text mining techniques are powerful enough to manage large pieces of text, so AI methods could increase the amount and quality of the information used. This would allow the inclusion of more details about the paper content that might not appear in the header section, i.e. title, abstract or keywords, but at the expense of many more words to be processed. To reduce the dimensionality of the feature space, many authors rely on scoring methods, such as TF-IDF, to weight the words and keep only the most representative terms. Another alternative is the application of the LDA algorithm, which allows a predefined number of high-level topics to be extracted. In terms of tools, Abstrackr is more flexible in this sense, because it lets researchers interactively highlight the relevant and irrelevant words at their convenience [ 32 ].

Guidelines for SLRs often refer to criteria based on meta-information or quality for defining the selection strategy in the review protocol. Language, length or type of publication are exclusion criteria that can greatly reduce the number of candidate papers to be inspected. Despite this, very few works include features beyond the paper content. Only two methods complement word processing with other kinds of information, namely citations and cross-references [ 37 , 39 ]. In both cases, visualisation mechanisms and clustering methods are developed to build assistant tools that facilitate the analysis. The rest of the algorithms perform classification in one step, i.e. a binary decision on whether the paper should be selected or not. Departing from this idea, a few methods [ 26 , 36 ] propose that the output should be a ranking, similarly to Abstrackr, where researchers can rate papers as relevant, irrelevant or borderline [ 32 ]. Overall, most of the AI-based methods detect a reduced list of papers within scope, without really simulating a criteria-guided evaluation.

5.3 Human intervention

During the data extraction process, the need for human intervention was carefully examined in order to answer RQ3. AI approaches were classified as fully automated (68%) or semi-automated (32%). The former case corresponds to those primary studies in which the human does not take part in the execution of the AI approach. This category includes supervised learning techniques and any other method requiring an input corpus of papers, even if that corpus was previously created or annotated by a human. Hence, to be classified as semi-automated, an approach must explicitly state that some kind of human intervention is required.

Abstrackr is an interactive tool whose classifier is trained based on the feedback provided by one or more reviewers. More specifically, they can perform two actions: (1) highlighting relevant and irrelevant words within the title and abstract; and (2) marking the paper as accepted, rejected or borderline. For borderline papers, reviewers also have to indicate the number of SLRs that they have conducted in the past as an indicator of their expertise. Borderline papers are then shown to more experienced reviewers. Abstrackr is the AI-based tool whose performance has been evaluated by the largest number of independent researchers. In such studies, participants have been asked to use Abstrackr to reproduce the paper screening of real SLRs with the purpose of measuring the time saved and the precision of the final paper selection.

The rest of the active learning methods mention humans as an oracle for providing labels, though the presented experiments do not involve actual participants. Kontonatsios et al. [ 33 ] use the label assigned to one paper by the human reviewer to tag other similar papers that remain unlabelled. The authors present two strategies to decide which papers should be shown to the human: (1) choose the most relevant papers according to the classifier, or (2) let the human classify those papers for which the classifier has the least confidence in its prediction. In the experiments, both approaches are automatically evaluated by taking a percentage of labelled papers from a training set, showing that the classifier can achieve 92% performance with only 5% of labelled papers. Such a percentage seems manageable for a scenario of collaboration with a human.
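The label propagation idea behind this method can be sketched as follows (toy corpus; the similarity threshold is our own illustrative choice, not a value reported in [ 33 ]):

```python
# Hedged sketch: a reviewer label spreads to content-similar unlabelled papers
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = [
    "screening tool for randomized trials",      # labelled relevant by the reviewer
    "automated screening of randomized trials",  # unlabelled
    "history of the printing press",             # unlabelled
]
X = TfidfVectorizer().fit_transform(papers)

# Similarity of the labelled paper to each unlabelled one
sim = cosine_similarity(X[0], X[1:]).ravel()

# Propagate the label where similarity exceeds a (hypothetical) threshold
propagated = sim > 0.3
```

Here only the second paper inherits the "relevant" label; the third, sharing no vocabulary with the labelled one, stays unlabelled for the classifier to decide.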

Ros et al. [ 36 ] present a proof-of-concept in which the reviewer should validate papers suggested by the tool. The information displayed to the human includes the most relevant terms used by the classifier to make a decision, as well as information about how the paper was found, i.e. snowballing or automatic search. The papers to be validated are selected following two strategies: (1) picking papers close to the decision boundary built by the classifier, and (2) promoting papers predicted as positive by the classifier. The experimental validation is performed automatically by looking up the manual labels assigned within a training set created from an SLR previously conducted by the authors.

For their general FASTREAD method, Yu et al. [ 34 ] explore the same strategies as Ros et al. [ 36 ]. The authors note that it would be desirable to allow multiple reviewers, assigning different sets of papers to each one. This idea represents a challenge, since the ML algorithm would need to deal with potential disagreements between humans. This particular problem is addressed in a subsequent work [ 35 ], although still focused on a single human reviewer. Here, FAST\(^2\) analyses the class probability estimation each time the human oracle labels 50 new papers, and those papers on which the active learner and the human reviewer strongly disagree are marked to be rechecked. To test their strategy, the authors simulate inconsistencies in the human evaluation.

6 Open issues and challenges

We have identified a number of open issues that lead to challenges:

6.1 A single task is predominant

Research into SLR automation with AI is strongly biased towards the conducting phase and, more specifically, the paper selection task. Although this task is time-consuming, the application of AI to other tasks demands attention. Some initial works have appeared, but they are less mature than the algorithms proposed for paper selection. We identify AI-driven writing tasks, e.g. formulating RQs, defining exclusion/inclusion criteria or reporting SLR results, as the main challenge in this direction.

6.2 AI techniques are still to be explored

The spectrum of AI areas and techniques is wide, but some of them have not yet been applied to SLR automation. For instance, optimisation and search techniques have not been explored for any SLR task. These techniques have traditionally been used to solve planning problems, so we speculate that they could be applied during the first phase to prioritise resources, e.g. choosing the best databases, or to distribute work, e.g. assigning papers to reviewers based on their skills. Compared to ML, knowledge representation and NLP appear less often, and most of the proposals seem to be at an initial stage. Consequently, there is a lack of tools and frameworks to develop solutions based on these techniques.

6.3 Specialised algorithms can replace general purpose approaches

Focusing on ML for paper selection, SVM has become a reference algorithm, probably due to the choice of the high-dimensional BoW representation. It would be interesting to study the applicability of other algorithms under the same or other feature spaces. The need for approaches specifically designed for the paper selection problem, and for other tasks in SLR automation, should be explored in depth. Some challenges here relate to combining types of input information to enrich the process, as well as to obtaining more flexible outputs beyond selected/non-selected. For SLR tasks requiring text analysis, the methods must be retrained or adapted to learn the specific vocabulary of the scientific discipline (medicine, computing, etc.) under review.

6.4 More complete information can improve decision-making

As for the features, the BoW representation of title and abstract clearly dominates. Content from other paper sections, as well as meta-information and citation analysis, may be considered as well. Nevertheless, relying heavily on paper content implies that the classifier can only use the “vocabulary” of the field to make decisions, missing those papers adopting a different or emerging terminology, or simply those covering new or disruptive topics. Therefore, the analysis of related research communities, including co-authorships and cross-references, could be necessary to identify emergent topics for which a standardised terminology has not been comprehensively developed yet.

6.5 More active human involvement can benefit AI

The level and nature of the cooperation between the human and the AI methods or tools is still limited. At the moment, the role of the human is mostly restricted to providing labels for paper selection under an active learning approach. The planning and writing phases, which clearly demand more human skills, could benefit from interactive AI. Involving humans in this process would also have other beneficial effects, such as adapting the results to their preferences.

6.6 End-users of SLR automation are not necessarily AI experts

Most of the ML techniques considered so far, e.g. SVM or neural networks, are known as “black-box” techniques. The fact that SLRs are conducted by scientists from diverse disciplines, not necessarily experts in AI, poses the challenge of a lack of trust in automatic results. Interpretable models, such as rule-based systems or small decision trees, have barely been explored. We also envision that recent explainable AI methods could complement the output of the black-box solutions developed in this area.

6.7 AI-based automation of SLRs can be scaled up

Most current proposals have been validated in the fields of medicine or computing, sometimes using domain-specific ontologies or concepts to build the feature set. Probably the hardest limitation here lies in the availability of benchmarks, since real SLRs are not always fully replicable. Even when the sets of candidate, excluded and included papers are available, the decisions made for their selection might not be explicitly linked to inclusion and exclusion criteria. Further progress should be made in extending the evaluation of AI methods to cover a wider variety of SLRs, as well as broadening the scope of topics.

6.8 Performance comparison between different methods and fields

The performance of AI-based techniques for SLR automation has been studied in fields like medicine or computing. Applying one technique to solve the same SLR task in a different field may not be trivial due to each field's specific terminology or types of research papers. Studying the applicability of techniques across fields is necessary to determine how they should be adapted. It would also be useful to compare methods to find out to what extent their performance depends on the application field, or whether some methods fit the specific characteristics of a given field of knowledge better than others.

6.9 Open science fosters the development of practicable methods

In terms of reproducibility, the availability of implementations and corpora is still rare. Some tools and algorithms were originally made public but are no longer accessible. As interest in this area continues to grow, there is an increasing need to provide access to algorithms and to establish common experimental frameworks that allow comparisons between proposals. This point seems less challenging, but it still requires considerable effort from the community to make artefacts not only accessible but also fully functional.

Finally, we provide suggestions based on our own experience when trying to use some of the reviewed methods to accelerate SLR tasks. In particular, we tested two paper selection tools (Abstrackr [ 32 ] and FAST\(^2\) [ 35 ]) to replicate our own search for primary studies. FAST\(^2\) was considerably more effective than Abstrackr, since we were able to find almost 95% of the primary studies with less than 10% of the papers screened. In contrast, Abstrackr found only 10% of the primary studies after screening the same number of papers (300). Despite some configuration issues due to the required dataset format, we found these tools useful and intuitive. We suggest some improvements regarding the information shown to the reviewers, e.g. why a paper was selected, and how they can add information to improve the process, e.g. by adding keywords at the beginning instead of iteratively. Even if some tool support is available, we consider that the success of an SLR still rests on researchers' shoulders in terms of methodological steps (a clear review protocol, checkpoints for replication) and analytical capabilities (summarising papers and analysing trends).

7 Conclusions

The application of artificial intelligence has proven effective in automating many tasks that humans find costly and repetitive, as is the case of conducting literature reviews. Planning, conducting and reporting an SLR involve many individual tasks, so it is not surprising that not all of them have been automated yet. Our findings reveal a clear interest in applying AI, especially ML, to support paper screening, a burdensome task aimed at identifying relevant works from thousands of candidate papers. Regarding other tasks, we can highlight the use of ontologies and NLP to deal with semantic information. Nevertheless, studies in these areas are still far less abundant.

Future efforts should be devoted to supporting the planning and reporting phases, whose tasks are more difficult to automate. Advances in automatic writing can be expected in the near future, given the appearance of some conceptual approaches based on deep learning.

Source: Results of searching “Systematic literature review” on Scopus by February 1st, 2022.

The search was completed up to the 30th of June of 2021.

https://carrotsearch.com/lingo3g/ (Last accessed: February 14, 2023)

http://systematicreviewtools.com/ (Last accessed: February 14, 2023)

Booth A, Sutton A, Papaioannou D (2016) Systematic approaches to a successful literature review, 2nd edn. SAGE Publications, Cambridge

Acknowledgements

Grant PID2020-115832GB-I00 funded by MICIN/AEI/10.13039/501100011033. Andalusian Regional Government (postdoctoral grant DOC_00944).

Open Access funding provided by Universidad de Córdoba / CBUA thanks to the CRUE-CSIC agreement with Springer Nature.

Author information

Authors and Affiliations

Department of Computer Science and Numerical Analysis, University of Córdoba, Rabanales Campus, 14071, Córdoba, Spain

José de la Torre-López, Aurora Ramírez & José Raúl Romero

Contributions

All authors; Methodology: all authors; Formal analysis and investigation: all authors; Writing—original draft preparation: JTL; Writing—review and editing: AR, JRR; Funding acquisition and supervision: JRR.

Corresponding author

Correspondence to Aurora Ramírez .

Ethics declarations

Conflict of interest.

The authors have no competing interests to declare.

Additional information

Supplementary information.

Detailed results of the literature search and an analysis of tools are available from: https://www.uco.es/kdis/ai4slr/survey .

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

Reprints and permissions

About this article

de la Torre-López, J., Ramírez, A. & Romero, J.R. Artificial intelligence to automate the systematic review of scientific literature. Computing 105 , 2171–2194 (2023). https://doi.org/10.1007/s00607-023-01181-x

Received : 08 February 2022

Accepted : 19 April 2023

Published : 11 May 2023

Issue Date : October 2023

DOI : https://doi.org/10.1007/s00607-023-01181-x

Keywords

  • Artificial intelligence
  • Machine learning
  • Systematic literature review

Mathematics Subject Classification

  • 68T01 General topics in artificial intelligence

AGATHA: Automatic Graph-mining And Transformer based Hypothesis generation Approach

Medical research is risky and expensive. Drug discovery, as an example, requires that researchers efficiently winnow thousands of potential targets to a small candidate set for more thorough evaluation. However, research groups spend significant time and money to perform the experiments necessary to determine this candidate set long before seeing intermediate results. Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criterion, this system quickly prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions. We massively validate our system with a temporal holdout wherein we predict connections first introduced after 2015 using data published beforehand. We additionally explore biomedical sub-domains, and demonstrate AGATHA's predictive capacity across the twenty most popular relationship types. This system achieves best-in-class performance on an established benchmark, and demonstrates high recommendation scores across subdomains. Reproducibility: All code, experimental data, and pre-trained models are available online: sybrandt.com/2020/agatha
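
The temporal holdout described above can be sketched in a few lines. The snippet below is an illustrative toy, not AGATHA's code: connections are split by the year they first appear, a naive degree-based scorer stands in for the learned ranking criterion, and candidate pairs from the held-out years are ranked with it. All terms and years are hypothetical.

```python
from collections import Counter

def temporal_split(edges, cutoff_year):
    """Split (term_a, term_b, year) connections into train/test by first-seen year."""
    train = [(a, b) for a, b, y in edges if y <= cutoff_year]
    test = [(a, b) for a, b, y in edges if y > cutoff_year]
    return train, test

def degree_score(train_edges):
    """Naive stand-in scorer: rank a candidate pair by summed node degree in training data."""
    deg = Counter()
    for a, b in train_edges:
        deg[a] += 1
        deg[b] += 1
    return lambda pair: deg[pair[0]] + deg[pair[1]]

# Toy term-pair connections tagged with the year they first appeared.
edges = [
    ("stem cells", "paralysis", 2010),
    ("stem cells", "stroke", 2016),   # only appears after the cutoff
    ("DDX3", "Wnt signaling", 2012),
    ("DDX3", "cancer", 2017),
]
train, test = temporal_split(edges, cutoff_year=2015)
score = degree_score(train)
ranked = sorted(test, key=score, reverse=True)
```

The scorer only ever sees pre-cutoff data, so any correct ranking of the post-cutoff pairs is a genuine prediction rather than a leak.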

Justin Sybrandt

Ilya Tyagin

Michael Shtutman

Related Research

  • Validation and topic-driven ranking for biomedical hypothesis generation systems
  • Literature-based discovery for landscape planning
  • Accelerating COVID-19 research with graph mining and transformer-based learning
  • MOLIERE: automatic biomedical hypothesis generation system
  • GT4SD: generative toolkit for scientific discovery
  • GeneDisco: a benchmark for experimental design in drug discovery
  • Temporal positive-unlabeled learning for biomedical hypothesis generation via risk estimation

Explainable Automatic Hypothesis Generation via High-Order Graph Walks

29 Sep 2021 · Uchenna Akujuobi, Xiangliang Zhang, Sucheendra Palaniappan, Michael Spranger

In this paper, we study the automatic hypothesis generation (HG) problem, focusing on explainability. Given pairs of biomedical terms, we focus on link prediction to explain how the prediction was made. This more transparent process encourages trust in the biomedical community for automatic hypothesis generation systems. We use a reinforcement learning strategy to formulate the HG problem as a guided node-pair embedding-based link prediction problem via a directed graph walk. Given nodes in a node-pair, the model starts a graph walk, simultaneously aggregating information from the visited nodes and their neighbors for an improved node-pair representation. Then at the end of the walk, it infers the probability of a link from the gathered information. This guided walk framework allows for explainability via the walk trajectory information. By evaluating our model on predicting the links between millions of biomedical terms in both transductive and inductive settings, we verified the effectiveness of our proposed model on obtaining higher prediction accuracy than baselines and understanding the reason for a link prediction.
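
The guided-walk idea can be approximated with a much simpler stand-in (plain random walks rather than the paper's learned, reinforcement-driven policy): score a candidate link by how often short walks from the source reach the target, and keep the successful trajectories as the explanation. The graph, node names, and parameters below are invented for illustration.

```python
import random

def walk_link_score(graph, source, target, n_walks=200, max_len=4, seed=0):
    """Score a candidate link by the fraction of short random walks from
    `source` that reach `target`; keep successful paths as explanations."""
    rng = random.Random(seed)
    hits, explanations = 0, []
    for _ in range(n_walks):
        node, path = source, [source]
        for _ in range(max_len):
            neighbors = graph.get(node, [])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            path.append(node)
            if node == target:
                hits += 1
                explanations.append(path)
                break
    return hits / n_walks, explanations

# Toy biomedical term graph as adjacency lists.
graph = {
    "gene_A": ["protein_B", "pathway_C"],
    "protein_B": ["disease_D"],
    "pathway_C": ["disease_D"],
    "disease_D": [],
}
score, paths = walk_link_score(graph, "gene_A", "disease_D")
```

Each entry in `paths` is a concrete trajectory a reviewer can inspect, which is the sense in which walk-based prediction is explainable.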


MOLIERE: Automatic Biomedical Hypothesis Generation System

Justin Sybrandt

1 Clemson University, School of Computing, Clemson SC, USA

Michael Shtutman

2 University of South Carolina, Drug Discovery and Biomedical Sciences, Columbia SC, USA

Hypothesis generation is becoming a crucial time-saving technique which allows biomedical researchers to quickly discover implicit connections between important concepts. Typically, these systems operate on domain-specific fractions of public medical data. MOLIERE , in contrast, utilizes information from over 24.5 million documents. At the heart of our approach lies a multi-modal and multi-relational network of biomedical objects extracted from several heterogeneous datasets from the National Center for Biotechnology Information (NCBI). These objects include but are not limited to scientific papers, keywords, genes, proteins, diseases, and diagnoses. We model hypotheses using Latent Dirichlet Allocation applied on abstracts found near shortest paths discovered within this network, and demonstrate the effectiveness of MOLIERE by performing hypothesis generation on historical data. Our network, implementation, and resulting data are all publicly available for the broad scientific community.

1 Introduction

Vast amounts of biomedical information accumulate in modern databases such as MEDLINE [ 3 ], which currently contains the bibliographic data of over 24.5 million medical papers. These ever-growing datasets impose a great difficulty on researchers trying to survey and evaluate new information in the existing biomedical literature, even when advanced ranking methods are applied. On the one hand, the vast quantity and diversity of available data has inspired many scientific breakthroughs. On the other hand, as the set of searchable information continues to grow, it becomes impossible for human researchers to query and understand all of the data relevant to a domain of interest.

In 1986 Swanson hypothesized that novel discoveries could be found by carefully studying the existing body of scientific research [ 45 ]. Since then, many groups have attempted to mine the wealth of public knowledge. Efforts such as Swanson’s own Arrowsmith generate hypotheses by finding concepts which implicitly link two queried keywords. His method and others are discussed at length in Section 1.3. Ideally, an effective hypothesis generation system greatly increases the productivity of researchers. For example, imagine that a medical doctor believed that stem cells could be used to repair the damaged neural pathways of stroke victims (as some did in 2014 [ 22 ]). If no existing research directly linked stem cells to stroke victims, this doctor would typically have no choice but to follow his/her intuition. Hypothesis generation allows this researcher to quickly learn the likelihood of such a connection by simply running a query. Our hypothetical doctor may query the topics stem cells and stroke for example. If the system returned topics such as paralysis then not only would the doctor’s intuition be validated, but he/she would be more likely to invest in exploring such a connection. In this manner, an intelligent hypothesis generation system can increase the likelihood that a researcher’s study yields usable new findings.

1.1 Our Contribution

We introduce a deployed system, MOLIERE [ 47 ], with the goal of generating more usable results than previously proposed hypothesis generation systems. We develop a novel method for constructing a large network of public knowledge and devise a query process which produces human readable text highlighting the relationships present between nodes.

To the best of our knowledge, MOLIERE is the first hypothesis generation system to utilize the entire MEDLINE data set. By using state-of-the-art tools, such as ToPMine [ 16 ] and FastText [ 9 ], we are able to find novel hypotheses without restricting the domain of our knowledge network or the resulting vocabulary when creating topics. As a result, MOLIERE is more generalized and yet still capable of identifying useful hypotheses.

We provide our network and findings online for others in the scientific community [ 47 ]. Additionally, to aid interested biomedical researchers, we supply an online service where users can request specific query results at http://jsybran.people.clemson.edu/mForm.php . Furthermore, MOLIERE is entirely open-source in order to facilitate similar projects. See https://github.com/JSybrandt/MOLIERE for the code needed to generate and query the MOLIERE knowledge network.

In the following paper we describe our process for creating and querying a large knowledge network built from MEDLINE and other NCBI data sources. We use natural language processing methods, such as Latent Dirichlet Allocation (LDA) [ 8 ] and topical phrase mining [ 16 ], along with other data mining techniques to conceptually link together abstracts and biomedical objects (such as biomedical keywords and n -grams) in order to form our network. Using this network we can run shortest path queries to discover a pathway between two concepts which are non-trivially connected. We then find clouds of documents around these pathways which contain knowledge representative of the path as a whole. PLDA+, a scalable implementation of LDA [ 28 ], allows us to quickly find topic models in these clouds. Unlike similar systems, we do not restrict PLDA+ to any set vocabulary. Instead, by using topical phrase mining, we identify meaningful n-grams in order to improve the performance, flexibility, and understandability of our LDA models. These models result in both quantitative and qualitative connections which human researchers can use to inform their decision making.

We evaluate our system by running queries on historical data in order to discover landmark findings. For example, using data published on or before 2009, we find strong evidence that the protein Dead Box RNA Helicase 3 (DDX3) can be applied to treat cancer. We also verify the ability of MOLIERE to make predictions similar to previous systems with restricted LDA [ 49 ].

1.2 Our Method in Summary

We focus on the domain of medicine because of the large wealth of public information provided by the National Library of Medicine (NLM). MEDLINE is a database containing over 24.5 million references to medical publications dating all the way back to the late 1800s [ 3 ]. Over 23 million of these references include the paper’s title and abstract text. In addition to MEDLINE, the NLM also maintains the Unified Medical Language System (UMLS) which is comprised of three main resources: the metathesaurus, the semantic network, and the SPECIALIST natural language processing (NLP) tools. These resources, along with the rest of our data, are described in section 2.1.

Our knowledge base starts as XML files provided by MEDLINE, from which we extract each publication’s title, document ID, and abstract text. We first process these results with the SPECIALIST NLP toolset. The result is a corpus of text which has standardized spellings (for example “colour” becomes “color”), no stop words (including medical specific stop words such as Not Otherwise Specified (NOS) ), and other characteristics which improve later algorithms on this corpus. Then we use ToPMine to identify multi-word phrases from that corpus such as “asthma attack,” allowing us to treat phrases as single tokens [ 16 ]. Next, we send the corpus through FastText , the most recent word2vec implementation, which maps each unique token in the corpus to a vector [ 30 ]. We can then fit a centroid to each publication and use the Fast Library for Approximate Nearest Neighbors (FLANN) to generate a nearest neighbors graph [ 32 ]. The result is a network of MEDLINE papers, each of which are connected to other papers sharing a similar topic. This network, combined with the UMLS metathesaurus and semantic network, constitutes our full knowledge base. The network construction process is described in greater detail in Section 2.
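
The centroid-and-nearest-neighbor step can be illustrated with a small sketch. This is not the MOLIERE code: FLANN is replaced here by brute-force cosine similarity, and the word vectors are tiny hand-made stand-ins for FastText output.

```python
import math

def centroid(vectors):
    """Average the word vectors of a document's tokens."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_graph(doc_vecs, k=1):
    """Connect each document to its k most similar documents (brute force)."""
    edges = {}
    for name, vec in doc_vecs.items():
        others = [(cosine(vec, v), other) for other, v in doc_vecs.items() if other != name]
        others.sort(reverse=True)
        edges[name] = [other for _, other in others[:k]]
    return edges

# Hand-made word vectors: topically similar documents get similar centroids.
word_vecs = {"stroke": [1.0, 0.0], "stem": [0.9, 0.1], "cells": [0.8, 0.2], "tax": [0.0, 1.0]}
docs = {
    "paper1": ["stroke", "stem"],
    "paper2": ["stem", "cells"],
    "paper3": ["tax"],
}
doc_vecs = {d: centroid([word_vecs[t] for t in toks]) for d, toks in docs.items()}
graph = knn_graph(doc_vecs, k=1)
```

An approximate index such as FLANN replaces the quadratic loop above when the corpus holds millions of abstracts; the structure of the resulting graph is the same.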

With our network, a researcher can query for the connections between two keywords. We find the shortest path between the two keywords in the knowledge network, and extend this path to identify a significant set of related abstracts. This subset contains many documents which, due to our network construction process, all share common topics. We perform topic modeling on these documents using PLDA+ [ 28 ]. The result is a set of plain text topics which represent different concepts which likely connect the two queried keywords. More information about the query process is detailed in Section 3.
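
The path-finding step of the query can be sketched as a standard Dijkstra search over a toy keyword-paper network (an illustration only; node names and weights are invented):

```python
import heapq

def shortest_path(graph, start, goal):
    """Dijkstra over a weighted adjacency dict {node: [(neighbor, weight), ...]}."""
    dist, prev = {start: 0.0}, {}
    queue = [(0.0, start)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for nbr, w in graph.get(node, []):
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr], prev[nbr] = nd, node
                heapq.heappush(queue, (nd, nbr))
    # Reconstruct the path by walking predecessors back from the goal.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1]

# Toy knowledge network: keywords linked through papers.
graph = {
    "stem cells": [("paper_17", 1.0)],
    "paper_17": [("paralysis", 1.0), ("stem cells", 1.0)],
    "paralysis": [("paper_42", 1.0), ("paper_17", 1.0)],
    "paper_42": [("stroke", 1.0), ("paralysis", 1.0)],
    "stroke": [("paper_42", 1.0)],
}
path = shortest_path(graph, "stem cells", "stroke")
```

The abstracts lying on and near such a path form the document cloud that is then handed to the topic modeler.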

We use landmark historical findings in order to validate our methods. For example, we show the implicit link between Venlafaxine and HTR1A, and the involvement of DDX3 on Wnt signaling. These queries and results are detailed in Section 4. In Sections 5 and 6 we discuss challenges and open research questions we have uncovered during our work.

1.3 Related Work

The study and exploration of undiscovered public knowledge began in 1986 with Swanson’s landmark paper [ 45 ]. Swanson hypothesized that fragments of information from the set of public knowledge could be connected in such a way as to shed light on new discoveries. With this idea, Swanson continued his research to develop Arrowsmith , a text-based search application meant to help doctors make connections from within the MEDLINE data set [ 38 , 44 , 46 ]. To use Arrowsmith , researchers supply two UMLS keywords which are used to find two sets of abstracts, A and C . The system then attempts to find a set B ≈ A ∩ C . Assuming sets A and C do not overlap initially, implicit textual links are used to expand both sets until some sizable set B is discovered. The experimental process was computationally expensive, and queries were typically run on a subset of the MEDLINE data set (according to [ 46 ] around 1,000 documents).
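
Swanson's A-B-C pattern can be expressed with plain set operations. The sketch below uses his well-known fish oil and Raynaud's syndrome finding as a toy corpus; the helper function is illustrative, not Arrowsmith's implementation.

```python
def abc_candidates(docs, term_a, term_c):
    """Return intermediate B terms linking term_a and term_c (Swanson's ABC model):
    terms that co-occur with A in some documents and with C in others,
    provided A and C never co-occur directly."""
    with_a = set().union(*(d for d in docs if term_a in d)) - {term_a}
    with_c = set().union(*(d for d in docs if term_c in d)) - {term_c}
    directly_linked = any(term_a in d and term_c in d for d in docs)
    return (with_a & with_c) if not directly_linked else set()

# Toy corpus: each document is its set of terms.
docs = [
    {"fish oil", "blood viscosity"},
    {"blood viscosity", "raynaud's syndrome"},
    {"fish oil", "platelet aggregation"},
    {"platelet aggregation", "raynaud's syndrome"},
]
b_terms = abc_candidates(docs, "fish oil", "raynaud's syndrome")
```

On real data the expansion is iterative and far more expensive, which is why early Arrowsmith queries ran over small document subsets.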

Spangler has also been a driving force in the field of hypothesis generation and mining undiscovered public knowledge. His textbook [ 41 ] details many text mining techniques as well as an example application related to hypothesis generation in the MEDLINE data set. His research in this field has focused on p53 kinases and how these undiscovered interactions might aid drug designers [ 42 , 41 ]. His method leverages unstructured text mining techniques to identify a network entities and relationships from medical text. Our work differs from this paradigm by utilizing the structured UMLS keywords, their known connections, and mined phrases. We do, however, rely on similar unstructured text mining techniques, such as FastText and FLANN, to make implicit connections between the abstracts.

Rzhetsky and Evans notice that current information gathering methods struggle to keep up with the growing wealth of forgotten and hard-to-find information [ 17 ]. Their work in the field of hypothesis generation has included a study on the assumptions made when constructing biomedical models [ 15 ] and digital representations of hypotheses [ 40 ].

Divoli et al. analyze the assumptions made in medical research [ 15 ]. They note that scientists often reach contradictory conclusions due to differences in each person's underlying assumptions. The study in [ 15 ] highlights the variance of these preconceptions by surveying medical researchers on the topic of cancer metastasis. Surprisingly, 27 of the 28 researchers surveyed disagree with the textbook process of cancer metastasis. When asked to provide the "correct" metastasis scenario, none of the surveyed scientists agree. Divoli's study highlights a major problem for hypothesis generation. Scientists often disagree, even in published literature. Therefore, a hypothesis generation system must be able to produce reliable results from a set of contradicting information.

In [ 40 ], Soldatova and Rzhetsky describe a standardized way to represent scientific hypotheses. By creating a formal and machine readable standard, they envision a collection of hypotheses which clearly describes the full spectrum of existing theories on a given topic. Soldatova and Rzhetsky extend existing approaches by representing hypotheses as logical statements which can be interpreted by Adam , a robot scientist capable of starting one thousand experiments a day. Adam is successful, in part, because they model hypotheses as an ontology which allows for Bayesian inference to govern the likelihood of a specific hypothesis being correct.

DiseaseConnect, an online system that allows researchers to query for concepts intersecting two keywords, is a notable contribution to hypothesis generation [ 27 ]. This system, proposed by Liu et al., is similar to both our system and Arrowsmith [ 39 ] in its focus on UMLS keywords and MEDLINE literature mining. Unlike our system, Liu et al. restrict DiseaseConnect to simply 3 of the 130 semantic types. They supplement this subset with concepts from the OMIM [ 19 ] and GWAS [ 6 ] databases, two genome specific data sets. Still, their network size is approximately 10% of the size of MOLIERE . DiseaseConnect uses its network to identify diseases which can be grouped by their molecular mechanisms rather than symptoms. The process of finding these clusters depends on the relationships between different types of entities present in the DiseaseConnect network. Users can view sub-networks relevant to their query online and related entities are displayed alongside the network visualization.

Barabási et al. improve upon the network analytic approach to understanding biomedical data in both their work on the disease network [ 19 ] and their more generalized symptoms-disease network [ 51 ]. In the former [ 19 ], the authors construct a bipartite network of disease phenomes and genomes, which they refer to as the Diseasome. Their inspiration is the observation that genes related to similar disorders are likely to be related themselves. They use the Diseasome to create two projected networks: the human disease network (HDN) and the disease gene network (DGN). In the latter [ 51 ], they construct a more generalized human symptoms-disease network (HSDN) using both UMLS keywords and bibliographic data. HSDN is built from the subset of MEDLINE abstracts that contain at least one disease and at least one symptom, approximately 850,000 records. From this set, Goh et al. calculated keyword co-occurrence statistics in order to build their network. They validate their approach using 1,000 randomly selected MEDLINE documents and, with the help of medical experts, manually confirm that the relationships described in these documents are reflected meaningfully in HSDN. Ultimately, Goh et al. find strong correlations between the symptoms and genes shared by common diseases.
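
The keyword co-occurrence counting behind a network like HSDN can be sketched as follows (toy records and term sets, not the authors' pipeline):

```python
from collections import Counter
from itertools import product

def cooccurrence_weights(records, symptoms, diseases):
    """Count how often each (symptom, disease) pair appears in the same abstract."""
    weights = Counter()
    for terms in records:
        s_hits = symptoms & terms
        d_hits = diseases & terms
        for pair in product(sorted(s_hits), sorted(d_hits)):
            weights[pair] += 1
    return weights

# Toy term sets and abstract records.
symptoms = {"headache", "fever"}
diseases = {"influenza", "migraine"}
records = [
    {"fever", "influenza"},
    {"headache", "migraine"},
    {"headache", "fever", "influenza"},
]
weights = cooccurrence_weights(records, symptoms, diseases)
```

Edge weights in the resulting bipartite network are simply these counts (in practice normalized, e.g. by term frequency, before analysis).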

Bio-LDA is a modification of LDA which limits the set of keywords to the set present in UMLS [ 49 ]. This reduction improves the meaning and readability of topics generated by LDA. Wang et al. also show in this work that their method can imply connections between keywords which do not show up in the same document. For example, they note that Venlafaxine and HTR1A both appear in the same topic even though both do not appear in the same abstract. We explore and repeat these findings in Section 4.2.

1.4 Related and Incorporated Technologies

FastText is the most recent implementation of word2vec from Mikolov et al. [ 30 , 31 , 23 , 9 ]. Word2vec is a method which utilizes the skip-gram model to identify the relationships between words by analyzing word usage patterns. This process maps plain text words into a high-dimensional vector space for use in data mining applications. Similar words are often grouped together, and the distances between words can reveal relationships. For example, the distance between the words "Man" and "Woman" is approximately the same as the distance between "King" and "Queen". FastText improves upon this idea by leveraging sub-strings in long, rarely occurring words.
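
The analogy property mentioned above can be demonstrated with hand-made two-dimensional vectors (real embeddings are learned and high-dimensional; these toy coordinates just encode a "royalty" axis and a "gender" axis):

```python
def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def nearest(query, vocab):
    """Return the word whose vector is closest (squared Euclidean) to `query`."""
    return min(vocab, key=lambda w: sum((a - b) ** 2 for a, b in zip(vocab[w], query)))

# Toy vectors: first coordinate = royalty, second coordinate = gender.
vocab = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [5.0, 0.0],
    "queen": [5.0, 1.0],
}
# king - man + woman lands on queen.
query = add(sub(vocab["king"], vocab["man"]), vocab["woman"])
result = nearest(query, vocab)
```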

ToPMine , a project from El-Kishky et al., is focused on discovering multi-word phrases from a large corpus of text [ 20 ]. This project intelligently groups unigrams together to create n-gram phrases for later use in text mining algorithms. By using a bag-of-words topic model, ToPMine groups unigrams based on their co-occurrence rate as well as their topical similarity using a process they call Phrase LDA.

Latent Dirichlet Allocation [ 8 ] is the most common topic modeling process, and PLDA+ is a scalable implementation of this algorithm [ 20 , 28 ]. Developed by Zhiyuan Liu et al., PLDA+ quickly identifies groups of words and phrases which all relate to a similar concept. Although how best to interpret these results remains an open research question, simple qualitative analysis allows for “ballpark” estimations. For instance, it may take a medical researcher to wholly understand the topics generated from abstracts related to two keywords, but anyone can identify that all words related to a concept of interest occur in the same topic. Results like this show that LDA has distinguished the presence of a concept in a body of text.

2 Knowledge Network Construction

In order to discover hypotheses we construct a large weighted multi-layered network of biomedical objects extracted from NLM data sets. Using this network, we run shortest-centroid-path queries (see Section 3) whose results serve as an input for hypothesis mining. The wall clock time needed to complete this network construction pipeline is depicted in Figure 1 (see details in Section 4.4). Omitted from this figure is the time spent preprocessing the initial abstract text due to its embarrassingly parallel nature.

Figure 1: Running times of each network construction phase. All phases run on a single node described in Section 4.4. Not shown: initial text processing, which was handled by a large array of small nodes.

2.1 Data Sources

The NLM maintains multiple databases of medical information which are the main source of our data. These include MEDLINE [ 3 ], which contains the metadata of approximately 24.5 million medical publications dating back to the late 1800s. Most MEDLINE records include a paper’s title, authors, publication date, and abstract text.

In addition to MEDLINE, the NLM maintains UMLS [ 2 ], which in turn provides the metathesaurus as well as a semantic network. The metathesaurus contains two million keywords along with all known synonyms (referred to as “atoms”) used in medical text. For example, the keyword “RNA” has many different synonyms such as “Ribonucleinicum acidum”, “Ribonucleic Acid”, and “Gene Products, RNA” to name a few. These metathesaurus keywords form a network comprised of multi-typed edges. For example, an edge may represent a parent – child or a broader concept – narrower concept relationship. RNA has connections to terms such as “Nucleic Acids” and “DNA Synthesizers”. Lastly, each keyword holds a reference to an object in the semantic network. RNA is an instance of the “Nucleic Acid, Nucleoside, or Nucleotide” semantic type.

The UMLS semantic network is comprised of approximately 130 semantic types and is connected in a similar manner as the metathesaurus. For example, the semantic type “Drug Delivery Device” has an “is a” relationship with the “Medical Device” type, and has a “contains” relationship with the “Clinical Drug” type.

MEDLINE, the metathesaurus, and the semantic network are represented in our network as different layers. Articles which contain full text abstracts are represented as the abstract layer nodes 𝒜, keywords from the metathesaurus are represented as nodes in the keyword layer 𝒦, and items from the semantic network are represented as nodes in the semantic layer 𝒮.

2.2 Network Topology

We define a weighted undirected graph underlying our network 𝒩 as G = ( V , E ), where V = 𝒜 ∪ 𝒦 ∪ 𝒮. The construction of G was governed by two major goals. Firstly, the shortest path between two indirectly related keywords should likely contain a significant number of nodes in 𝒜. If instead this shortest path contained only 𝒦 − 𝒦 edges, we would limit ourselves to known information contained within the UMLS metathesaurus. Secondly, conceptual distance between topics should be represented as the distance between two nodes in 𝒩. This implies that we can determine the similarity between i, j ∈ V by the weight of their shortest path. If ij ∈ E , there would exist a previously known relationship between i and j . We are instead interested in connections between distant nodes, as these potentially represent unknown information. Below we describe the construction of each layer in 𝒩.

2.3 Abstract Layer 𝒜

When connecting abstracts (𝒜 − 𝒜 edges), we want to ensure that two nodes i, j ∈ 𝒜 with similar content are likely neighbors in the 𝒜 layer. In order to do this, we turned to the UMLS SPECIALIST NLP toolset [ 1 ] as well as ToPMine [ 16 ] and FastText [ 9 , 23 ]. Our process for constructing 𝒜 is summarized in Figure 2 .

Figure 2: MOLIERE network construction pipeline.

First, we extract all titles, abstracts, and associated document IDs (referred to as PMIDs within MEDLINE) from the raw MEDLINE files. We then process these combined titles and abstracts with the SPECIALIST NLP toolset to standardize spelling, strip stop words, convert to ASCII, and perform a number of other data cleaning processes. We then use ToPMine to generate meaningful n -grams and further clean the text. This process finds tokens that frequently appear together, such as newborn and infants , and combines them into a single token newborn_infants . Cleaning and combining tokens in this manner greatly increases the performance of FastText , the next tool in our pipeline.

When running ToPMine , we keep the minimum phrase frequency and the maximum number of words per phrase set to their default values. We also keep the topic modeling component disabled. On our available hardware, the MEDLINE data set can be processed in approximately thirteen hours without topic modeling, but does not finish within three days if topic modeling is enabled. Because the resulting phrases are of high quality even without the topic modeling component, we accept this quality vs. time trade-off. It is also important to note that we modify the version of ToPMine distributed by El-Kishky in [ 16 ] to allow phrases containing numbers, such as gene names like p53.

Next, FastText maps each token in our corpus to a vector υ ∈ ℝ^d, allowing us to fit a centroid per abstract i ∈ 𝒜. Using a sufficiently high-dimensional space ensures a good separation between vectors. In other words, each abstract i ∈ 𝒜 is represented in ℝ^d as c_i = (1/k) · ∑_{j=1}^{k} x_j, where the x_j are the FastText vectors of the k keywords in i .
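The centroid fitting step can be sketched as follows. The token names and random vectors below are hypothetical stand-ins for real FastText output:

```python
import numpy as np

# Sketch of per-abstract centroid computation: each token of an abstract
# has a FastText vector x_j in R^d, and the abstract is represented by the
# mean of its token vectors. Vectors here are random stand-ins.
rng = np.random.default_rng(0)
d = 500  # dimensionality used in the paper
abstract_tokens = ["stem_cells", "stroke", "therapy"]  # hypothetical tokens
embeddings = {tok: rng.normal(size=d) for tok in abstract_tokens}

def centroid(tokens, emb):
    # c_i = (1/k) * sum_{j=1..k} x_j over the k token vectors of abstract i
    vecs = np.stack([emb[t] for t in tokens])
    return vecs.mean(axis=0)

c = centroid(abstract_tokens, embeddings)
assert c.shape == (d,)
```

In the real pipeline this is computed once per abstract, producing one 500-dimensional point per MEDLINE record.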

We choose the skip-gram model to train FastText and reduce the minimum word count to zero. Because our data preprocessing and ToPMine have already stripped low-support words, we accept that any n-gram seen by FastText is important. Following examples presented in [ 30 , 31 , 23 ] and elsewhere, we set the dimensionality of our vector space d to 500. This is consistent with published examples of similar size, for example the Google News corpus processed in [ 30 ]. Lastly, we increase the word neighborhood and the number of possible sub-words from five to eight in order to increase data quality.

Finally, we used FLANN [ 32 ] to create a nearest-neighbor graph from all i ∈ 𝒜 in order to establish 𝒜 − 𝒜 edges in E . This requires that we presuppose a number of expected nearest neighbors per abstract, k . We initially set this tunable parameter to ten, which seemed appropriate. By studying the distances between connected abstracts, we observed that most abstracts had a range of very close and relatively far “nearest neighbors”. For our purposes in these initial experiments, we kept k = 10 and saw promising results. Due to time and resource limitations, we were unable to explore higher values of k in this study, but we are currently planning experiments where k = 100 and k = 1000. It is important to note that the resulting network will have ≈ k · (2.3 × 10^7) edges, so there is a considerable trade-off between quality and space/time complexity.
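A brute-force sketch of this step follows. FLANN performs approximate search at scale; the exact k-NN below, on toy centroids, only illustrates how 𝒜 − 𝒜 edges arise:

```python
import numpy as np

# Connect each abstract centroid to its k closest centroids under the L2
# metric, yielding weighted A-A edges. Data here are random toy centroids;
# the real system uses approximate search (FLANN) over ~24.5M points.
rng = np.random.default_rng(1)
centroids = rng.normal(size=(50, 8))  # 50 toy abstracts in R^8
k = 10                                # value used in the paper

def knn_edges(points, k):
    edges = []
    for i, p in enumerate(points):
        dist = np.linalg.norm(points - p, axis=1)  # L2 distances to all points
        dist[i] = np.inf                           # exclude self-loops
        for j in np.argsort(dist)[:k]:
            edges.append((i, int(j), float(dist[j])))
    return edges

edges = knn_edges(centroids, k)
assert len(edges) == 50 * k  # ~ k * |A| edges overall, as noted above
```

The quadratic scan here is exactly what becomes infeasible at MEDLINE scale, which is why an approximate index such as FLANN's k-tree is required.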

After experimenting with both L2 and normalized cosine distances, we observed that the L2 distance metric performs significantly better for establishing connections between centroids. Unfortunately, we cannot utilize the k-tree optimization in FLANN along with non-normalized cosine distance, making that metric computationally infeasible for a dataset of our size. This is because the k-tree optimization requires an agglomerative distance metric. Lastly, we scale edge weights to the [0, 1] interval in order to relate them to other edges within the network.

2.4 Keyword Layer 𝒦

The 𝒦 layer is imported from the UMLS metathesaurus. Each keyword is referenced by a UMLS CUI number. This layer links keywords which share already known connections; these known connections are the 𝒦 − 𝒦 edges. The metathesaurus connections link related words; for example, the keyword “Protein p53” C0080055 is related to “Tumor Suppressor Proteins” C0597611 and “Li-Fraumeni Syndrome” C0085390 among others. There exist 14 different types of connections between keywords representing relationships such as parent – child or broader concept – narrower concept . We assign each a weight in the [0, 1] interval corresponding to its relevance, and then scale all weights by a constant factor σ so that the average 𝒜 − 𝒜 edge is stronger than the average 𝒦 − 𝒦 edge. The result is that a path between two indirectly related concepts will more likely include a number of abstracts. We selected σ = 2, but more study is needed to determine the appropriate edge weights within the keyword layer.

2.5 𝒜 − 𝒦 Connections

In order to create edges between 𝒜 and 𝒦, we used a simple metric of term frequency-inverse document frequency (tf-idf). UMLS provides not only a list of keywords, but all known synonyms for each keyword. For example, the keyword Color C0009393 has the American spelling, the British spelling, and the pluralization of both defined as synonyms. Therefore we used the raw text abstracts and titles (before running the SPECIALIST NLP tools) to calculate tf-idf. In order to quickly count all occurrences of UMLS keywords across all synonyms, we implemented a simple parser. This was especially important because many keywords in UMLS are actually multi-word phrases such as “Clustered Regularly Interspaced Short Palindromic Repeats” (a.k.a. CRISPR) C3658200 .

In order to count these keywords, we construct a parse tree from the set of synonyms. Each node in the tree contains a word, a set of CUIs, and a set of children nodes, with the exception of the root which contains the null string. We build this tree by parsing each synonym word by word. For each word, we either create a new node in the tree, or traverse to an already existing child node. We store each synonym’s CUI in the last node in its parse path. Then, to parse a document, we simply traverse the parse tree. This can be done in parallel over the set of abstracts. For each word in an abstract, we move from the current tree node to a child representing the same word. If none exists, we return to the root node. At each step of this traversal, we record the CUIs present at each visited node. In this manner, we get a count of each CUI present in each abstract. Our next pass aggregates these counts to discover the total number of usages per keyword across all abstracts. We calculate tf-idf per keyword per abstract. Because our network’s weights represent distance, we take the inverse of tf-idf to find the weight for an 𝒜 − 𝒦 edge. This is done simply by dividing a CUI’s count across all abstracts by its count in a particular abstract. By calculating weights this way, abstracts which use a keyword more often will have a lower weight, and therefore, a shorter distance. We scale the edge weights to the [0, σ ] interval so that these edges are comparable to those within the 𝒜 and 𝒦 layers.
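A minimal sketch of the synonym parse tree follows, using a greedy longest-match variant of the traversal described above. The synonyms and CUIs are the examples from this section, but the code is an illustration, not the authors' implementation:

```python
from collections import defaultdict

def build_trie(synonyms):
    # synonyms: {cui: [synonym strings]}. Each trie node is a dict whose
    # keys are next words; the special "_cuis" key marks a complete synonym.
    root = {}
    for cui, names in synonyms.items():
        for name in names:
            node = root
            for word in name.lower().split():
                node = node.setdefault(word, {})
            node.setdefault("_cuis", set()).add(cui)
    return root

def count_cuis(trie, text):
    # Walk the abstract word by word, following trie children and recording
    # the longest synonym match; restart from the root on a mismatch.
    words = text.lower().split()
    counts = defaultdict(int)
    i = 0
    while i < len(words):
        node, j, last = trie, i, None
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if "_cuis" in node:
                last = (node["_cuis"], j)
        if last:
            for cui in last[0]:
                counts[cui] += 1
            i = last[1]
        else:
            i += 1
    return counts

synonyms = {
    "C0009393": ["color", "colors", "colour"],
    "C3658200": ["clustered regularly interspaced short palindromic repeats",
                 "crispr"],
}
trie = build_trie(synonyms)
counts = count_cuis(trie, "CRISPR systems alter color and colour of cells")
```

Because the trie is read-only after construction, this traversal parallelizes trivially over the set of abstracts, as noted above.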

2.6 Semantic Layer 𝒮

The UMLS supplies a companion network referred to as the semantic network. This network consists of semantic types, which are overarching concepts. These “types” function much like types in a programming language: each is a conceptual entity embodied by its instantiations. In the UMLS network, elements of 𝒦 are analogous to instantiations of semantic types. While there are over two million elements of 𝒦, there are approximately 130 elements in 𝒮. For example, the semantic type Disease or Syndrome T047 is defined as “A condition which alters or interferes with a normal process, state, or activity of an organism” [ 2 ]. There are thousands of keywords, such as “influenza” C0021400 , that are instances of this type.

The 𝒮 − 𝒮 edges are connected similarly to 𝒦 − 𝒦 edges. The overall structure is hierarchical, with “Event” T051 and “Entity” T071 being the most generalized semantic types. Cross-cutting connections are also present and can take on approximately fifty different forms. These cross-cutting relations also form a hierarchy of relationship types. For example, “produces” T144 is a more specific relation than its parent “brings about” T187 .

We initially included 𝒮 in our network by linking each keyword to its corresponding semantic type. Unfortunately, in our early results we found that many shortest paths traversed through 𝒮 rather than through 𝒜. For example, if we were interested in two diseases, it was possible for the shortest path to simply travel through the “Disease or Syndrome” T047 type. This ultimately degraded the performance of our hypothesis generation system. As a result we removed this layer, but further study may find that a careful choice of 𝒮 − 𝒮 and 𝒦 − 𝒮 connection weights could make 𝒮 more useful. This is further discussed in Section 5.

3 Query Process

The process of running a query within MOLIERE is summarized in Figure 3 . A query starts with the user selecting two nodes i, j ∈ V (typically, but not necessarily, i, j ∈ 𝒦). For example, a query searching for the relationship between “stem cells” and “strokes” would be input as the keyword identifiers C0038250 and C1263853 , respectively. This keyword-pair input keeps our query process simple; determining a larger set of keywords and abstracts which best represents a user’s search query is a future work direction.

Figure 3: MOLIERE query pipeline.

After receiving two query nodes i and j , we find a shortest path between them, ( ij ) s , using Dijkstra’s algorithm. These paths are typically between three and five nodes long and contain up to three abstracts (unless the nodes are truly unrelated; see Section 4.1). We observed that when ( ij ) s contains only two or three nodes in 𝒦, the ij relationship is clearly well studied, because the path was supplied solely by the UMLS layer 𝒦. We are more interested in paths containing abstracts, because these represent keyword pairs whose relationships are less well-defined. Still, the abstracts we find along these shortest paths alone are not likely to be sufficient to generate a hypothesis.
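A minimal version of the weighted shortest-path query can be written with Python's heapq in place of the C++ skew-heap implementation described in Section 5. The toy graph, its node names, and its weights are hypothetical:

```python
import heapq

# Dijkstra's algorithm over a weighted adjacency list, as used to find the
# shortest path (ij)_s between two query nodes.
def dijkstra(adj, src, dst):
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, node = [], dst
    while node != src:
        path.append(node)
        node = prev[node]
    path.append(src)
    return list(reversed(path)), dist[dst]

# Toy layers: k1, k2 are keywords; a1, a2 are abstracts. The direct K-K
# edge is longer (sigma-scaled), so the path prefers the abstract layer.
adj = {"k1": [("a1", 0.5), ("k2", 2.0)],
       "a1": [("k1", 0.5), ("a2", 0.25)],
       "a2": [("a1", 0.25), ("k2", 0.25)],
       "k2": [("k1", 2.0), ("a2", 0.25)]}
path, cost = dijkstra(adj, "k1", "k2")
print(path, cost)  # ['k1', 'a1', 'a2', 'k2'] 1.0
```

Note how the σ-scaled 𝒦 − 𝒦 edge (weight 2.0) loses to the route through abstracts (total 1.0), which is exactly the behavior the weighting scheme of Section 2.4 is designed to produce.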

3.1 Hypothesis Modeling

Broadening ( ij ) s consists of two main phases, the results of which are depicted in Figure 4 . First, we select all nodes S = ( ij ) s ∩ 𝒜. These abstracts along the path ( ij ) s represent papers which hold key information relating two unconnected keywords. We find a neighborhood around S using a weighted breadth-first traversal, selecting the closest 1,000 abstracts to S . We will call this set N . Because 𝒜 was constructed as a nearest neighbors graph, it is likely that the concepts contained in N will be similar to the concepts contained in S , which increases the likelihood that important concepts will be detected by PLDA+ later in the pipeline.
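The weighted traversal that gathers the closest abstracts around S can be sketched as a best-first expansion that keeps the first n nodes popped from a priority queue. The toy graph is hypothetical, and n_closest is 3 here rather than the 1,000 used in the paper:

```python
import heapq

# Best-first expansion from the source set S: nodes are popped in order of
# increasing distance to S, so the first n_closest popped (including the
# sources themselves) form the neighborhood N.
def closest_abstracts(adj, sources, n_closest):
    dist = {s: 0.0 for s in sources}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    found = []
    while heap and len(found) < n_closest:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        found.append(u)
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return found

adj = {"a1": [("a2", 0.2), ("a3", 0.9)],
       "a2": [("a1", 0.2), ("a3", 0.1)],
       "a3": [("a1", 0.9), ("a2", 0.1)]}
print(closest_abstracts(adj, ["a1"], 3))  # ['a1', 'a2', 'a3']
```

Note that a3 is reached through a2 (distance 0.3) rather than directly (0.9); the expansion ranks abstracts by their network distance to S, not by single-edge weight.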

Figure 4: Process of extending a path to a cloud of abstracts.

Next, we identify abstracts which contain information pertaining to the 𝒦 − 𝒦 connections present in ( ij ) s . We do so in order to identify abstracts which likely contain concepts a human reader could use to understand the known relationship between two connected keywords. We start by traversing ( ij ) s to find α, β ∈ 𝒦 such that α and β are adjacent in ( ij ) s . From there, we find a set of abstracts C = { c : cα ∈ E ∧ cβ ∈ E }. That is, C is the subset of abstracts containing both keywords α and β . Because ( ij ) s can have many edges between keywords, and because thousands of abstracts can contain the same two keywords, it is important to limit the size of C .

This process creates a set of around 1,300 𝒜-nodes. This set will typically contain around 15,000–20,000 words and is large enough for PLDA+ to find topics. We run PLDA+ and request 20 topics. We find this provides a sufficient spread in our resulting data sets. The trained model generated by PLDA+ is what is eventually returned by our query process.

For our experiments, we often must process tens of thousands of results and thus must train topic models quickly. This is most apparent when running a one-to-many query such as the drug repurposing example in Section 4.3. Additionally, the training corpus returned from a MOLIERE query is often only a few thousand documents in size. As a result, we set the number of topics and the number of iterations to relatively small values, 20 and 100 respectively. Because we store intermediary results, it is trivial to retrain a topic model if a preliminary result seems promising.

The process of analyzing a topic model and uncovering a human interpretable sentence to describe a hypothesis is still a pressing open problem. The process as stated here does have some strong benefits which are apparent in Section 4. These include the ability to find correlations between medical objects, such as between a drug and multiple genes. In Section 6 we explain our initial plans to improve the quality of results which can be deduced from these topic models.

4 Experiments

We conduct two major validation efforts to demonstrate our system’s potential for hypothesis generation. For each of these experiments we use the same set of parameters for our trained model and network weights. Our initial findings show our choices, detailed in Section 2, to be robust. We plan to refine these choices with methods described in Section 6.

We repeat an experiment done by Wang et al. in [ 49 ] wherein we discover the implicit connections between the drug Venlafaxine and the genes HTR1A and HTR2A. We also perform a large-scale study of Dead Box RNA Helicase 3 (DDX3) and its connection to cancer adhesion and metastasis. Each of these experiments is described in greater detail in the following sections. In this paper, we deliberately do not evaluate our experiments with extremely popular objects such as p53. These objects are so highly connected within 𝒦 that hypothesis generation involving them is easy for many different methods.

4.1 Network Profile

We conduct our experiments on a very large knowledge graph constructed according to Section 2. We initially created a network 𝒩 containing information dating up to and including 2016. This network consists of 24,556,689 nodes and 989,169,295 edges. The network’s largest connected component contains 99.8% of all nodes. The average degree of a node in 𝒩 is 79.65, and we observe a high clustering coefficient of 0.283. These metrics lead us to expect that the shortest path between two nodes will be very short. Our experiments agree, showing that most shortest paths are between three and six nodes long.

4.2 Venlafaxine to HTR1A

Wang et al. in [ 49 ] use a similar topic modeling approach, and find during one of their experiments that Venlafaxine C0078569 appears in the same topic as the HTR1A and HTR2A genes ( C1415803 and C1825553 respectively). When looking into these results, they find a stronger association between Venlafaxine and HTR1A. This finding is important because Venlafaxine is used to treat depressive disorder and anxiety, which HTR1A and HTR2A have been thought to affect, yet as of 2009 no abstract contained this link. As a result, this implicit connection is difficult to detect with many existing methods.

As a result of running two queries, Venlafaxine to HTR1A and Venlafaxine to HTR2A, we can corroborate the findings of Wang et al. in [ 49 ]. We find that neither pair of keywords is directly connected or connected through a single abstract. Nevertheless, phrases such as “long term antidepressant treatment,” “action antidepressants,” and “antidepressant drugs” are all prominent keywords in the HTR1A query. Meanwhile, the string “depress” occurs only four times, in unrelated phrases, within the HTR2A results. The distribution of depression-related keywords from both queries can be seen in Figure 5 .

Figure 5: Distribution of n-grams related to depression from the Venlafaxine queries.

Similarly, our results for HTR1A contain a single topic holding the phrases “anxiogenic,” “anxiety disorders,” “depression anxiety disorders,” and “anxiolytic response.” In contrast, our HTR2A results do not contain any phrases related to anxiety. The distribution of anxiety-related keywords from both queries can be seen in Figure 6 .

Figure 6: Distribution of n-grams related to anxiety from the Venlafaxine queries.

Our findings agree with those of Wang et al., who report a small association score of 0.34 between Venlafaxine and HTR1A, indicating a connection likely related to depressive disorder and anxiety. The association score between Venlafaxine and HTR2A, in contrast, is a much higher 4.0, indicating that the connection between these two keywords is much weaker.

4.3 Drug Repurposing and DDX3’s Anti-Tumor Applications

Many genes are active in multiple cellular processes and in many cases they are found to be active outside of the original area in which the gene was initially discovered. The prediction of new processes is especially important for repurposing existing drugs (or drug target genes) to a new application [ 5 , 33 , 4 ]. As an example, the drugs developed for the treatment of infectious diseases were recently repurposed for cancer treatment. Extending applications of existing drugs provides a tremendous opportunity for the development of cost-effective treatments for cancers and other life-threatening diseases.

To estimate the predictive value of our system for the discovery of new applications of small molecules, we select Dead Box RNA Helicase 3 (DDX3) C2604356 . DDX3 is a member of the DEAD-box RNA helicase family and was initially discovered to be a regulator of transcription and propagation of the Human Immunodeficiency Virus (HIV) as well as of ribosomal biogenesis. Initially, DDX3 was a target for the development of anti-viral therapy for AIDS treatment [ 25 , 29 ].

More recently, DDX3 activity was found to be involved in cancer development and progression, mainly through regulation of the Wnt signaling pathway [ 13 , 50 ] and the associated regulation of cell-cell and cell-matrix adhesion, tumor cell invasion, and metastasis [ 12 , 43 , 48 , 24 ]. Currently, DDX3 is an established target for anti-tumor drug development [ 10 , 37 , 11 ] and represents a case for repurposing targeted anti-viral drugs into the application area of anti-tumor therapy.

To test this hypothesis, we analyze the data available on and before 12/31/2009, when no published indication of links between DDX3 and Wnt signaling was available. We compare DDX3 to all UMLS keywords containing the text “signal transduction”, “transcription”, “adhesion”, “cancer”, “development”, “translation”, or “RNA” in their synonym list. This search results in 9,905 keywords, over which we query for relationships to DDX3. From this large set of results we manually analyze a subset of important pairs.

In our generated dataset, we found the following phrase groupings within topics: “substrate adhesion,” “RGD cell adhesion domain,” “cell adhesion factor,” and “focal adhesion kinase,” which are indicative of cell-matrix adhesion. The phrases “cell-cell adhesion,” “regulation of cell-cell adhesion,” and “cell-adhesion molecules” indicate the involvement of DDX3 in cell-cell adhesion regulation. The involvement of adhesion is associated with topics related to tumor dissemination: “Collaborative staging metastasis evaluation Cancer,” “metastasis adhesion protein, human,” and “metastasis associated in colon cancer 1” (selected from among other similar topics).

The results above suggest that through analysis of the ≤2009 dataset we can predict the involvement of DDX3 in tumor cell dissemination through effects on cell-cell and cell-matrix adhesion. Next, we analyzed whether it would be possible to gain insight into the mechanisms of DDX3-dependent regulation of Wnt signaling. As shown recently, DDX3’s involvement in Wnt signaling is based on its regulation of casein kinase epsilon, which affects phosphorylation of the Dishevelled protein. Although we cannot predict the exact mechanism of DDX3 from the ≤2009 dataset, the existence of multiple topics involving signal-transduction-associated kinases, such as “CELL ADHESION KINASE”, “activation by organism of defense-related host MAP kinase-mediated signal transduction pathway”, and “modulation of defense-related symbiont mitogen-activated protein kinase-mediated signal transduction pathway by organism”, suggests the ability of DDX3 to regulate kinase activities and kinase-regulated pathways.

4.4 Experimental Setup

We performed all experiments on a single node within Clemson’s Palmetto supercomputing cluster. To perform our experiments and construct our network, we use an HP DL580 containing four Intel Xeon x7542 chips. This 24 core node has 500 GB of memory and access to a large ZFS-based file system where we stored experimental data.

For the DDX3 queries, we initially searched for all ( ij ) s where i = DDX3 and j ∈ 𝒦. This resulted in 1,350,484 shortest paths with corresponding abstract clouds. We used PLDA+ to construct models for all of these paths. Discovering all ( ij ) s completed in almost 10 hours of CPU time, and training the respective models completed in slightly over 68 hours of CPU time. We ran PLDA+ in parallel, resulting in a wall time of only 12 hours. As mentioned previously, this large dataset was filtered to the 9,905 paths we are interested in.

We generate the results for the Venlafaxine experiments in one hour of CPU time, which is mostly spent loading our very large network and then running Dijkstra’s algorithm. After this, the two resulting PLDA+ models were trained in parallel within a minute.

5 Deployment Challenges

In the following section we detail the challenges which we have faced and are expecting to encounter while creating our system and deploying it to the research community.

Dynamic Information Updates

The process of creating our network is computationally expensive, and for the purposes of validation we must create multiple instances of our network representing different points in time. Initially we would have liked to create these multiple instances from scratch, starting from the MEDLINE archival distribution and rebuilding the network from there. Unfortunately, this proved infeasible because creating a single network is a time-consuming process. Instead, we filter our network by removing abstracts and keywords which were published after our selected date. Additionally, the act of adding information to our network, such as extending the 2016 network to create a 2017 network, is not straightforward. Ideally, adding a small number of abstracts or keywords should be a fast and dynamic process which only affects localized regions of the network. If this were so, our deployed system could take advantage of new ideas and connections as soon as they are published.

A deployed system could support dynamic updates with an amortized approach. Using previously created FastText and ToPMine models, new documents could be fitted into an existing network with suitably high performance. Of course, if a new document introduced a new keyword or phrase, we would be unable to detect it initially. After some threshold of new documents had been added to the network, we could then rerun the entire network construction process to ensure that new keywords, phrases, and concepts would be properly placed in the network.

Query Platform and Performance

Initially, we expected to use a graph database to make the query process easier. We surveyed a selection of graph databases and found that Neo4j [ 14 ] provides a powerful query language as well as a platform capable of holding our billion-edge network. Unfortunately, Neo4j does not easily support weighted shortest-path queries. Although some user suggestions did hint that it may be possible, the process requires leveraging edge labels and custom Java procedures in a way that did not seem scalable. In place of Neo4j, we implemented Dijkstra’s shortest-path algorithm in C++ using skew heaps as the internal priority queue. This implementation was chosen to minimize memory usage while maximizing speed and readability. Because we implemented Dijkstra’s algorithm ourselves, we can also combine the process of finding a shortest path with finding all neighboring abstracts for all keywords from a specific source. With only these high-level optimizations, we were able to generate over 1,350,000 shortest paths and abstract neighborhoods in under ten hours, though generating a single result takes slightly over one hour.

6 Lessons Learned and Open Problems

Specialized LDA

During the last two decades there have been a number of significant attempts to design automatic hypothesis generation systems [ 41 , 44 , 49 ]. However, most of these improve their performance by restricting either their information space or the size of their dictionary. For example, specialized versions of LDA such as Bio-LDA [ 49 ] uncover latent topics using a dictionary that gives priority to special terms. We find that such approaches are helpful when general language may significantly outweigh specialized language. However, phrase mining approaches that recover n -grams, such as [ 16 ], produce accurate results without limiting the dictionary.

Hypothesis Viability and Novelty Assessment

Intuitively, a strong connection between two concepts in 𝒩 means that there exists a significant amount of research covering a path between them. Similar observations are valid for LDA, i.e., latent topics are likely to describe well known facts. As a result, the most meaningful connections and interpretable topical inferences are discovered with latent keywords that are among the most well known concepts. However, real hypotheses are not necessarily described using the most latent keywords in such topic models. In many cases, the keywords required for a successful and interpretable hypothesis start to appear only among the 20–30 most latent topical keywords. Thus, a major related open problem is the process by which one should select a combination of keywords and topics in order to represent a viable hypothesis. This problem is also linked to the problem of assessing the viability of a generated hypothesis.

These problems, as well as the problem of hypothesis novelty assessment, can be partially addressed by using Dynamic Topic Modeling (DTM) [ 7 ]. Our preliminary experiments with scalable time-dependent clustered LDA [ 21 ], which significantly accelerates DTM, demonstrate a potential to discover dynamic topics in MEDLINE. Dynamic topics are typically more realistic than those discoverable in the static network, which significantly simplifies viability assessment and topic noise elimination.

Incorporating the Semantic Layer 𝒮

In Section 2.6 we describe the process by which we evaluated the UMLS semantic network and found that it worsened our shortest-path queries. Further work could improve the contribution that 𝒮 makes to our overall network, possibly allowing 𝒮 to define the overall structure of our knowledge graph. In order to do this, one would likely need to take into account the hierarchy of relationship types present in this network, as well as the relative relationship each element in 𝒦 has with its connection in 𝒮. Ultimately, these different relationships would need to inform a weighting scheme that balances the over-generalizations that 𝒮 introduces. For example, it may be useful to understand that two keywords are both diseases, but it is much less useful to understand that two keywords are both “entities”.

Learning the Models of Hypothesis Generation

There is surprisingly little research on the process of biomedical research itself and how that process evolves over time. We would like to model the process of discovery formation, taking into account the information context surrounding and preceding a discovery. We believe we could do so by reverse engineering existing discoveries in order to identify the factors that altered the steps in a scientist’s research pipeline. Several promising observations in this direction have been made by Foster et al. [18], who examined the question through Bourdieu’s field theory of science by developing a typology of research strategies on networks extracted from MEDLINE. However, instead of reverse engineering such models, they separate innovative steps from the more traditional steps in the research pipeline.

Dynamic Keyword Discovery

One limitation we encountered when performing our historical queries is the delay between the first major uses of a keyword and its appearance in the UMLS metathesaurus. Initially, we planned to study the relationship between “CRISPR” C3658200 and “genome editing” C4279981. To our surprise, many keywords related to this query did not exist in our historical networks between 2009 and 2012, despite their frequent usage in cutting-edge research during that time. To further confuse the issue, although the keyword “CRISPR” did not appear in the UMLS releases on or before 2012, keywords containing “CRISPR” as a substring, such as “CRISPR element metabolism” C1752766, do appear. These inconsistencies are contradictory and highlight the limitation of relying so strongly on keyword databases. Going forward, we plan to devise a way to extend a provided keyword network using semantic connections found within the MEDLINE document set. Projects like [42] have already shown that this method can work well in smaller domains. The challenge will be to extend it to perform well on the entire MEDLINE data set.
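The CRISPR inconsistency described above suggests a simple automated check: flag keywords that are absent from a release even though terms containing them as substrings are present. A minimal sketch, with a toy stand-in for a UMLS release:

```python
# Sketch: detect the inconsistency described above -- a keyword missing from
# a metathesaurus release even though terms containing it as a substring are
# present.  The term set is a toy stand-in for an actual UMLS release.
release_2012 = {
    "CRISPR element metabolism",   # present despite "CRISPR" itself missing
    "genome",
    "gene editing nuclease",
}

def substring_inconsistencies(missing_keyword, term_set):
    """Terms containing the keyword even though the keyword itself is absent."""
    if missing_keyword in term_set:
        return []          # no inconsistency: the keyword exists on its own
    return sorted(t for t in term_set
                  if missing_keyword.lower() in t.lower())

print(substring_inconsistencies("CRISPR", release_2012))
# -> ['CRISPR element metabolism']
```

Running such a scan over historical releases would quantify how often the keyword-database delay occurs.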

Improving Performance of Algorithms with Graph Reordering Techniques

Cache-friendly graph layouts are known to accelerate the path and abstract retrieval algorithms we apply. This type of acceleration is desirable to make our system more suitable for regular modern desktops, an important consideration since memory is not expected to be a major bottleneck after the network is constructed. We propose to rearrange the network nodes by minimizing objectives such as the minimum logarithmic or linear arrangements [36, 34]. On a mixture of 𝒦 − 𝒦, 𝒜 − 𝒦, and 𝒜 − 𝒜 edges we anticipate a reduction of at least 20% in the number of cache misses, according to [35].
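As a rough illustration of such a relabeling, the sketch below uses SciPy's reverse Cuthill-McKee ordering (a cheaper stand-in for the minimum logarithmic/linear arrangement objectives cited above, not the method the paper proposes) and measures matrix bandwidth as a locality proxy:

```python
# Sketch of a cache-friendly node relabeling.  Reverse Cuthill-McKee is a
# cheap stand-in for the arrangement objectives mentioned above: it clusters
# neighboring node ids, which shrinks the span of memory accesses.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

# Toy symmetric adjacency over 6 nodes (mixed K-K / A-K style edges).
rows = [0, 5, 1, 4, 2, 3, 0, 3]
cols = [5, 0, 4, 1, 3, 2, 3, 0]
A = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(6, 6))

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
A_new = A[perm][:, perm]          # adjacency under the new labeling

def bandwidth(m):
    """Max |i - j| over nonzeros: a proxy for locality of memory accesses."""
    coo = m.tocoo()
    return int(np.abs(coo.row - coo.col).max())

print(bandwidth(A), "->", bandwidth(A_new))
```

A smaller bandwidth after reordering means adjacent nodes sit closer together in memory, which is the effect the cache-miss estimate above relies on.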

Mass Evaluation

We note that evaluation remains a major open issue in the state of the art of hypothesis generation. While some works feature large-scale evaluation performed by many human experts, the majority, this work included, are restricted to a handful of promising results that justify the system. To better evaluate and compare hypothesis generation techniques, we must devise a common, large-scale suite of historical hypotheses. We are currently evaluating whether a ground-truth network, such as the drug side-effect network SIDER [26], could be a good source of such hypotheses. For example, if we identify a set of recently added connections within SIDER and predict a substantial percentage of those connections using MOLIERE, then we may be more certain of our performance.
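The SIDER-based evaluation sketched above amounts to ranking candidate connections and scoring them against recently added ground-truth edges. A minimal illustration with made-up pairs and scores:

```python
# Sketch of the proposed mass evaluation: rank candidate connections with a
# hypothesis generation system, then score the ranking against edges recently
# added to a ground-truth network such as SIDER.  All pairs here are made up.
def precision_at_k(ranked_pairs, new_edges, k):
    """Fraction of the top-k predicted pairs present in the ground truth."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in new_edges)
    return hits / k

# Hypothetical system output, best-scored first.
ranked = [("drugA", "nausea"), ("drugB", "rash"),
          ("drugA", "vertigo"), ("drugC", "fever")]
# Connections added to the ground-truth network after the training cutoff.
recently_added = {("drugA", "nausea"), ("drugC", "fever")}

print(precision_at_k(ranked, recently_added, 2))   # -> 0.5
```

Applying the same score to every compared system over the same held-out edge set is what would make cross-system evaluation commensurable.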

New Domains of Interest

We have considered other domains on which MOLIERE may perform well. These include generating hypotheses regarding economics, patents, narrative fiction, and social interactions. These are all domains where a hypothesis would involve finding new relationships between distinct entities. We contrast this with domains such as mathematics where the entity-relationship network is much less clear, and logical approaches from the field of automatic theorem proving are more applicable.

7 Conclusions

In this study we describe a deployed biomedical hypothesis generation system, MOLIERE, that can discover relationship hypotheses among biomedical objects. This system utilizes information present in MEDLINE and other NLM datasets. We validate MOLIERE on landmark discoveries using carefully filtered historical data. Unlike several other hypothesis generation systems, we do not restrict the information retrieval domain to a specific language or a subset of scientific papers, since such restrictions can lose an unpredictable amount of information. Instead, we use recent text mining techniques that allow us to work with the full heterogeneous data at scale. We demonstrate that MOLIERE successfully generates hypotheses and recommend using it to advance biomedical knowledge discovery. Going forward, we note a number of directions along which we can improve MOLIERE as well as many existing hypothesis generation systems.

Acknowledgments

We would like to thank Dr. Lihn Ngo for his help in using the Palmetto supercomputer which ran our experiments, and Cong Qiu for initial experiments with Neo4j.

Explainable Automatic Hypothesis Generation via High-order Graph Walks

Uchenna Akujuobi, Xiangliang Zhang, Sucheendra Palaniappan, Michael Spranger


Data-driven hypothesis generation in clinical research: what we learned from a human subject study


Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve it goes unrecognized. Without an impactful hypothesis, the significance of any research project can be questionable, regardless of the rigor or diligence applied in the other steps of the study, e.g., study design, data collection, and result analysis. In this perspective article, the authors first review the literature on scientific thinking, reasoning, medical reasoning, literature-based discovery, and a field study exploring scientific thinking and discovery. Over the years, research on scientific thinking has progressed considerably in cognitive science and its applied areas: education, medicine, and biomedical research. However, the review reveals a lack of original studies on hypothesis generation in clinical research. The authors then summarize their first human participant study exploring data-driven hypothesis generation by clinical researchers in a simulated setting. The results indicate that a secondary data analytical tool, VIADS (a visual interactive analytic tool for filtering, summarizing, and visualizing large health data sets coded with hierarchical terminologies), can shorten the average time participants need to generate a hypothesis and also requires fewer cognitive events per hypothesis. As a counterpoint, the exploration also indicates that the hypotheses thus generated receive significantly lower quality ratings for feasibility when VIADS is applied. Despite its small scale, the study confirmed the feasibility of conducting a human participant study to directly explore the hypothesis generation process in clinical research.
This study provides supporting evidence for a larger-scale study with a specifically designed tool to facilitate hypothesis generation among inexperienced clinical researchers. A larger study could provide generalizable evidence, which in turn could improve clinical research productivity and the overall clinical research enterprise.


Hypothesis Maker Online

Looking for a hypothesis maker? This online tool for students will help you formulate a beautiful hypothesis quickly, efficiently, and for free.



📄 Hypothesis Maker: How to Use It

Our hypothesis maker is a simple and efficient tool you can access online for free.

If you want to create a research hypothesis quickly, you should fill out the research details in the given fields on the hypothesis generator.

Below are the fields you should complete to generate your hypothesis:

  • Who or what is your research based on? For instance, the subject can be research group 1.
  • What does the subject (research group 1) do?
  • What does the subject affect? - This shows the predicted outcome, which is the object.
  • Who or what will be compared with research group 1? (research group 2).

Once you fill in the fields, click the ‘Make a hypothesis’ tab to get your results.

⚗️ What Is a Hypothesis in the Scientific Method?

A hypothesis is a statement describing an expectation or prediction of your research through observation.

It is akin to academic speculation and reasoning that anticipates the outcome of your scientific test. An effective hypothesis, therefore, should be crafted carefully and with precision.

A good hypothesis should have dependent and independent variables. These variables are the elements you will test in your research method; each can be a concept, an event, or an object, as long as it is observable.

You change the independent variables during the experiment and observe how the dependent variables respond.

In a nutshell, a hypothesis directs and organizes the research methods you will use, forming a large section of research paper writing.

Hypothesis vs. Theory

A hypothesis is a realistic expectation that researchers make before any investigation. It is formulated and tested to prove whether the statement is true. A theory, on the other hand, is a factual principle supported by evidence. Thus, a theory is more fact-backed than a hypothesis.

Another difference is that a hypothesis is presented as a single statement, while a theory can be an assortment of ideas. Hypotheses point toward future possibilities with a specific projection, but their results are uncertain. Theories are verified with indisputable results because of proper substantiation.

When it comes to data, a hypothesis relies on limited information, while a theory is established on an extensive data set tested under various conditions.

You should observe and test the stated assumption to establish its accuracy.

Since hypotheses have observable variables, their outcome is usually based on a specific occurrence. Conversely, theories are grounded on a general principle involving multiple experiments and research tests.

This general principle can apply to many specific cases.

The primary purpose of formulating a hypothesis is to present a tentative prediction for researchers to explore further through tests and observations. Theories, in their turn, aim to explain plausible occurrences in the form of a scientific study.

It helps to rely on several criteria to establish a good hypothesis. Below are the parameters you should use to analyze the quality of your hypothesis.

🧭 6 Steps to Making a Good Hypothesis

Writing a hypothesis becomes way simpler if you follow a tried-and-tested algorithm. Let’s explore how you can formulate a good hypothesis in a few steps:

Step #1: Ask Questions

The first step in hypothesis creation is asking real questions about the surrounding reality.

Why do things happen as they do? What are the causes of some occurrences?

Your curiosity will trigger great questions that you can use to formulate a stellar hypothesis. So, ensure you pick a research topic of interest to scrutinize the world’s phenomena, processes, and events.

Step #2: Do Initial Research

Carry out preliminary research and gather essential background information about your topic of choice.

The extent of the information you collect will depend on what you want to prove.

Your initial research can be complete with a few academic books or a simple Internet search for quick answers with relevant statistics.

Still, keep in mind that in this phase, it is too early to prove or disprove your hypothesis.

Step #3: Identify Your Variables

Now that you have a basic understanding of the topic, choose the dependent and independent variables.

Take note that independent variables are the ones you manipulate, while dependent variables are the ones you measure, so understand the limitations of your test before settling on a final hypothesis.

Step #4: Formulate Your Hypothesis

You can write your hypothesis as an ‘if-then’ expression. Presenting a hypothesis in this format is reliable since it describes the cause and effect you want to test.

For instance: If I study every day, then I will get good grades.

Step #5: Gather Relevant Data

Once you have identified your variables and formulated the hypothesis, you can start the experiment. Remember, the conclusion you reach will either support or rebut your initial assumption.

So, gather relevant information, whether for a simple or statistical hypothesis, because you need to back your statement.

Step #6: Record Your Findings

Finally, write down your conclusions in a research paper .

Outline in detail whether the test has proved or disproved your hypothesis.

Edit and proofread your work, using a plagiarism checker to ensure the authenticity of your text.



Updated: Oct 25th, 2023


Use our hypothesis maker whenever you need to formulate a hypothesis for your study. We offer a very simple tool where you just need to provide basic info about your variables, subjects, and predicted outcomes. The rest is on us. Get a perfect hypothesis in no time!


Uchenna Akujuobi

Michael Spranger

Sucheendra K Palaniappan*

Xiangliang Zhang*

* External authors

IEEE Transactions on Knowledge and Data Engineering


  • T-PAIR: Temporal node-pair embedding for automatic biomedical hypothesis generation

In this paper, we study an automatic hypothesis generation (HG) problem, which refers to the discovery of meaningful implicit connections between scientific terms, including but not limited to diseases, chemicals, drugs, and genes extracted from databases of biomedical publications. Most prior studies of this problem focused on static information about terms and largely ignored the temporal dynamics of scientific term relations. Even when the dynamics were considered in a few recent studies, those studies learned representations for the scientific terms rather than focusing on the term-pair relations. Since the HG problem is to predict term-pair connections, it is not enough to know which terms are connected; it is more important to know how the connections have been formed (as a dynamic process). We formulate this HG problem as future connectivity prediction in a dynamic attributed graph. The key is to capture the temporal evolution of node-pair (term-pair) relations. We propose an inductive edge (node-pair) embedding method named T-PAIR, utilizing both the graphical structure and node attributes to encode the temporal node-pair relationship. We demonstrate the efficiency of the proposed model on three real-world datasets, namely three graphs constructed from PubMed papers published until 2019 in Neurology, Immunotherapy, and Virology, respectively. Evaluations were conducted on predicting future term-pair relations between millions of seen terms (in the transductive setting), as well as on relations involving unseen terms (in the inductive setting). Experiment results and case study analyses show the effectiveness of the proposed model.
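The evaluation setup the abstract describes can be sketched as a temporal split of a timestamped edge list: edges observed up to a cutoff year form the training graph, and connections formed afterwards become the positive test pairs. The edges below are illustrative, not taken from PubMed:

```python
# Sketch of a temporal train/test split for future connectivity prediction.
# The timestamped edge list is a toy stand-in for a term co-occurrence graph.
edges = [
    ("geneX", "diseaseY", 2015),
    ("drugZ", "geneX", 2016),
    ("drugZ", "diseaseY", 2020),   # connection formed after the cutoff
]

def temporal_split(edge_list, cutoff_year):
    """Edges up to the cutoff train the model; later edges are test positives."""
    train = [(u, v) for u, v, y in edge_list if y <= cutoff_year]
    test = [(u, v) for u, v, y in edge_list if y > cutoff_year]
    return train, test

train, test = temporal_split(edges, 2019)
print(train)   # -> [('geneX', 'diseaseY'), ('drugZ', 'geneX')]
print(test)    # -> [('drugZ', 'diseaseY')]
```

Pairs whose endpoints both appear in the training graph correspond to the transductive setting; pairs involving a term unseen before the cutoff correspond to the inductive setting.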


  • Open access
  • Published: 29 August 2012

An automated framework for hypotheses generation using literature

  • Vida Abedi 1 , 2 ,
  • Ramin Zand 3 ,
  • Mohammed Yeasin 1 , 2 &
  • Fazle Elahi Faisal 1 , 2  

BioData Mining volume  5 , Article number:  13 ( 2012 ) Cite this article


In bio-medicine, exploratory studies and hypothesis generation often begin with researching existing literature to identify a set of factors and their associations with diseases, phenotypes, or biological processes. Many scientists are overwhelmed by the sheer volume of literature on a disease when they plan to generate a new hypothesis or study a biological phenomenon. The situation is even worse for junior investigators, who often find it difficult to formulate new hypotheses or, more importantly, corroborate whether their hypothesis is consistent with existing literature. It is a daunting task to stay abreast of so much being published and also remember all combinations of direct and indirect associations. Fortunately, there is a growing trend of using literature mining and knowledge discovery tools in biomedical research. However, there is still a large gap between the huge amount of effort and resources invested in disease research and the little effort spent harvesting the published knowledge. The proposed hypothesis generation framework (HGF) finds “crisp semantic associations” among entities of interest, a step towards bridging such gaps.

Methodology

The proposed HGF shares end goals similar to those of SWAN but is more holistic in nature; it was designed and implemented using scalable and efficient computational models of disease-disease interaction. The integration of mapping ontologies with latent semantic analysis is critical in capturing domain-specific direct and indirect “crisp” associations and making assertions about entities (such as: disease X is associated with a set of factors Z).

Pilot studies were performed using two diseases. A comparative analysis of the computed “associations” and “assertions” with curated expert knowledge was performed to validate the results. It was observed that the HGF is able to capture “crisp” direct and indirect associations, and provide knowledge discovery on demand.

Conclusions

The proposed framework is fast, efficient, and robust in generating new hypotheses to identify factors associated with a disease. A fully integrated Web service application is being developed for wide dissemination of the HGF. A large-scale study by domain experts and associated researchers is underway to validate the associations and assertions computed by the HGF.

Peer Review reports

The explosion of OMICS-based technologies, such as genomics, proteomics, and pharmaco-genomics, has generated a wave of information retrieval tools, such as SWAN [1], to mine heterogeneous, high-dimensional, and large databases, as well as complex biological networks. The general characteristics of such complex systems, including their robustness and dynamical properties, have been reported by many researchers (e.g., [2, 3]). Scalable and efficient knowledge discovery tools of this kind can further our understanding of complex biological systems. However, the effort and investment made to acquire knowledge about the complexities of biological systems is disproportionately large compared to the development of knowledge discovery tools that can effectively disseminate the acquired knowledge, generate and validate hypotheses, and untangle complex causal relationships. Despite a plethora of efforts in reverse-engineering complex systems to predict their response to perturbations, there is a lack of significant effort to create a higher-level abstraction of such complex biological systems using sources of information other than genetics data [2, 4]. A high-level view of complex systems would be very useful in generating new hypotheses and connecting seemingly unrelated entities. Such an abstraction could facilitate translational research and may prove vital in clinical studies by providing a valuable reference to clinicians, researchers, and other domain experts.

Disease networks can provide a high-level view of complex systems; however, the reported networks are mostly based on genetic and proteomic data [2, 4]. Such networks could also be constructed from literature data to incorporate a wider range of factors, such as side effects and risk factors. Generating disease models from literature data is a natural and efficient way to better understand and summarize the current knowledge about different high-level systems. A connection between two diseases can be formalized through risk factors, symptoms, treatment options, or other diseases, rather than only through shared disease genes. The relations between diseases can provide a systematic approach to identifying missing links and potential associations. It will also create new avenues for collaborations and interdisciplinary research.

To construct a disease network based on literature data, it is imperative to have a scalable and efficient literature-mining tool to explore the huge textual resources. Nevertheless, mining biological and medical literature is a very challenging task [5–7], further complicated by the difficulty of implementing relevant information extraction, also known as deep parsing, which is built on formal mathematical models. Deep parsing attempts to describe how text is generated in the human mind [5]. Deterministic and probabilistic context-free grammars are probably the most popular formal grammars [7]. Grammar-based information extraction techniques are computationally expensive, as they require evaluating alternative ways of generating the same sentence; they can therefore be more precise, but at the cost of reduced processing speed [5].

An alternative to grammar-based methods are factorization methods such as Latent Semantic Analysis (LSA) [8] and Non-negative Matrix Factorization (NMF) [9, 10]. Factorization methods rely on the bag-of-words concept and therefore have reduced computational complexity. LSA is a well-known information retrieval technique that has been applied to many areas in bioinformatics. Arguably, LSA captures semantic relations between various concepts based on their distance in the reduced eigen space [11]. It has the advantage of extracting both direct and indirect associations between entities. A commonly used distance measure in LSA is the cosine of the angle between the document and the query in the reduced eigen space.
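A minimal sketch of the LSA machinery just described, using scikit-learn's truncated SVD on a toy corpus (the documents, weighting scheme, and dimensionality are illustrative, not those of the paper):

```python
# Sketch: factor a term-document matrix with truncated SVD and rank documents
# against a query by cosine similarity in the reduced eigen space.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "magnesium deficiency and migraine headache",
    "stroke risk factors include hypertension",
    "hypertension treatment lowers stroke incidence",
]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)                 # document-term matrix
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)            # documents in the reduced space

query = svd.transform(vec.transform(["stroke"]))
sims = cosine_similarity(query, X_reduced)[0]
print(sims.argsort()[::-1])                 # documents ranked by association
```

Because the comparison happens in the reduced space, a document can score well against a query even without sharing its exact words, which is how indirect associations surface.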

Over the past two decades, medical text-mining has proved to be valuable in generating new exciting hypotheses. For instance, titles from MEDLINE were used to make connections between disconnected arguments: 1) the connection between migraine and magnesium deficiency [ 12 ] which has been verified experimentally; 2) between indomethacin and Alzheimer’s disease [ 12 ]; and finally 3) between Curcuma longa and retinal diseases [ 13 ]. Hypothesis generation in literature-mining relies on the fact that chance connections can emerge to be meaningful [ 7 ].

This paper designs and implements an efficient and scalable literature-mining framework to generate and validate plausible hypotheses about various entities, including (but not limited to) risk factors, environmental factors, lifestyle, diseases, and disease groups. The proposed hypothesis generation framework (HGF) is implemented on top of parameter optimized latent semantic analysis (POLSA) [14] and is suitable for capturing direct and indirect associations among concepts. Note that the overall performance and quality of results obtained through LSA-based systems is a function of the dictionary used. The concept of mapping ontologies was integrated with the POLSA to overcome such limitations and to provide crisp associations. In particular, the Medical Subject Headings (MeSH) are used to construct the dictionary. Such a dictionary allows a more efficient use of the LSA technique in finding semantically related entities in the biological and medical sciences. This framework can be used to generate customized disease-disease interaction networks, facilitate interdisciplinary collaborations between scientists and organizations, discover hidden knowledge, and spawn new research directions. In addition, the concept of statistical disease modeling was introduced to categorize concepts as strongly related, related, or not related.

The following section describes the proposed hypothesis generation framework and its evaluation. Two case studies were performed to showcase the potential and utility of the proposed method. Finally, the paper ends with a brief conclusion and discussions about the strengths and weaknesses of the method.

Results and discussion

Hypothesis generation framework (HGF)

The HGF has three major modules: Ontology Mapping, which generates data-driven domain-specific dictionaries; a parameter optimized latent semantic analysis (POLSA); and a Disease Model. The schematic diagram of the overall HGF framework is shown in Figure 1(A). The model is constructed using the POLSA framework, based on the selected documents and the dictionary (Figure 1(C)). Users can query the model, and the output is a ranked list of headings. These ranked headings are grouped into three sets (unknown factors, potential factors, or established factors) using the Disease Model module (Figure 1(C) and 1(D)). Analyzing the headings in the three sets facilitates hypothesis generation and information retrieval based on the user query.

Figure 1

Flow diagram of the hypothesis generation framework (HGF). A ) In a medical and biological setting, Ontology Mapping could use the Medical Subject Heading (MeSH) and generate a context specific dictionary, which is one of the parameters of the POLSA model. Associated factors are ranked based on a User Query which can be any word(s) in the dictionary. These factors are subsequently grouped into three different bins (unknown factors, potential factors or established factors) based on our Disease Model. B ) Ontology Mapping to create domain specific dictionary. C ) Parameter Optimized Latent Semantic Analysis Module. D ) Disease Model Module.

Ontology mapping

MeSH is used to generate the dictionary in the POLSA model. The mapping of MeSH ontology to create the dictionary for the POLSA significantly enhances the quality of results and provides a crisp association of semantically related entities in biological and medical science. All MeSH headings are reduced to single words to create the context specific and data driven dictionary (see Figure 1 B). For instance, “Reproductive and Urinary Physiological Phenomena” is a MeSH term and is reduced to five words in the dictionary (1. Reproductive, 2. and, 3. Urinary, 4. Physiological, and 5. Phenomena). In the filtering step, duplicates as well as stop words such as “and” or words containing fewer than three characters are removed. The final size of this dictionary is 19,165 words. Any dictionary word could be used as a query to the HGF. For instance, the disease “stroke” is a query in this study. The highly ranked factors with respect to a query-disease are considered factors associated with that disease. Cosine similarity measure is used as a metric in the HGF.
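The heading-to-word reduction and filtering described above can be sketched as follows; the heading list and stop-word set are toy stand-ins for the full MeSH vocabulary:

```python
# Sketch of the dictionary-construction step: split MeSH headings into single
# words, then filter duplicates, stop words, and words shorter than three
# characters.  STOP_WORDS and the heading list are illustrative.
STOP_WORDS = {"and", "the", "of"}

def build_dictionary(mesh_headings):
    """Reduce headings to a deduplicated, filtered, sorted word list."""
    words = set()
    for heading in mesh_headings:
        for w in heading.split():
            w = w.lower()
            if len(w) >= 3 and w not in STOP_WORDS:
                words.add(w)
    return sorted(words)

headings = ["Reproductive and Urinary Physiological Phenomena",
            "Urinary Tract Physiological Phenomena"]
print(build_dictionary(headings))
# -> ['phenomena', 'physiological', 'reproductive', 'tract', 'urinary']
```

Any word in the resulting list can then serve as a query term, exactly as “stroke” does in the study.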

POLSA module

In order to develop an effective literature-mining framework to model disease-disease interaction networks, generate plausible new hypotheses, and support knowledge-discovery by finding semantically related entities, a Parameter Optimized LSA (POLSA) [ 14 ] was re-designed and adopted in the proposed HGF framework.

In addition, a set of associated factors was selected to represent interaction between diseases. Ninety-six common associated factors (see Table 1 ) were selected through a literature review from numerous medical articles by two domain experts. As the first step, a set of articles was selected by querying the PubMed database using a series of diseases and factors. In the second step, the retrieved articles were manually reviewed by domain experts and entities that were associated with diseases or factors were selected. All articles considered for this analysis were peer reviewed articles. In addition, some common diseases such as diabetes and depression were also included in the set of 96 factors, as these are believed to be, in many instances, risk factors to other diseases. Therefore, the set of 96 associated factors represents a wide range of factors including generic factors such as depression and infection as well as specific factors such as vitamin E. As the final step, the set was further revised by an expert in the medical field. Using the improved POLSA technique [ 14 ], meaningful associations from the textual data in the PubMed database are extracted and mined. Furthermore, the factors are ranked based on their level of association to a given query.

Titles and abstracts from PubMed (for the past twenty years) for each of the 96 factors were downloaded to a local machine. On average there were 47,570 abstracts per factor; specific factors such as “maternal influenza” had fewer abstracts (a minimum of 160 abstracts/factor), while more generic factors such as “hormone” were associated with a greater number (a maximum of 557,554 abstracts/factor). The complete collection was then used to construct the knowledge space for the POLSA model. Using a query such as “Parkinson” or “stroke”, the 96 factors were then ranked by their relative level of association to the query. The distribution of associated factors with respect to a disease was modeled as a tri-modal distribution, i.e., a distribution with three modes: some factors are known to be associated with the disease and have high scores; some factors are known to be unassociated with the disease and have negative scores; and some factors may or may not be associated with the disease and have low similarity scores. Matlab was used to fit tri-modal distributions based on general Gaussian models to the two distributions obtained from the queries “stroke” and “Parkinson”. The model uses the following formulation to describe the tri-modal Gaussian distribution:

$$f(x) = \sum_{i=1}^{3} \alpha_i \exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$$

where α1, α2 and α3 are the scaling factors; μ1, μ2 and μ3 are the positions of the centers of the peaks; and σ1, σ2 and σ3 control the widths of the distributions. The goodness of fit was measured using an R-square score.
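A tri-modal Gaussian of this form can be fitted to a score histogram with SciPy's `curve_fit` (the paper used Matlab); the synthetic data, initial guesses, and R-square computation below are illustrative assumptions for this sketch:

```python
import numpy as np
from scipy.optimize import curve_fit

def tri_modal(x, a1, m1, s1, a2, m2, s2, a3, m3, s3):
    """Sum of three Gaussian components: a_i are the scaling factors,
    m_i the peak centers, s_i the widths."""
    return (a1 * np.exp(-(x - m1) ** 2 / (2 * s1 ** 2))
            + a2 * np.exp(-(x - m2) ** 2 / (2 * s2 ** 2))
            + a3 * np.exp(-(x - m3) ** 2 / (2 * s3 ** 2)))

# Synthetic histogram of cosine-similarity scores in [-1, 1] (assumed data).
x = np.linspace(-1, 1, 200)
y = tri_modal(x, 5, -0.4, 0.1, 8, 0.05, 0.1, 3, 0.5, 0.1)

# Rough initial guesses for the nine parameters.
p0 = [4, -0.5, 0.2, 7, 0.0, 0.2, 2, 0.4, 0.2]
params, _ = curve_fit(tri_modal, x, y, p0=p0)

# R-square goodness of fit, as in the paper.
residuals = y - tri_modal(x, *params)
r2 = 1 - np.sum(residuals ** 2) / np.sum((y - np.mean(y)) ** 2)
```

On real score histograms the fit quality and the recovered peak positions are what drive the threshold selection discussed below.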

  • Disease model

Using a disease model (see Figure 2), it was possible to map the mixture of three Gaussian distributions into easily understandable categories. The implicit assumption is that if the associated factors of a disease are well known, a large body of literature will be available to corroborate the existence of such associations. On the other hand, if the associated factors of a disease are not well documented, the factors are weakly associated with the disease, with few factors displaying a high level of association (Disease X versus Disease Y in Figure 2). The distribution of the association levels of factors (including risk factors) will therefore differ in the two scenarios: in the first case (Disease Y) the two dominating distributions are the factors that are associated and those that are not associated with the disease; in the second case (Disease X) the dominating distribution is that of the potential factors. In essence, if one accepts this assumption, then the distribution of associated factors follows a tri-modal distribution and it is intuitive to measure the level of association of different factors with respect to a given disease. Modeling a disease by a tri-modal distribution allows better identification of the three sets of factors: unknown associations, potential associations and established associations.

Figure 2

Model for the distribution of associated factors of a given disease. If associated factors – such as risk factors – of a disease are well known, as in the case of Disease Y, then the two dominating distributions are the factors that are associated and those that are not associated with the disease; if, on the other hand, the associated factors of a disease are not well documented (Disease X), then the dominating distribution is that of the potential factors.

Separating the three distributions allows implementation of a dynamic, data-driven threshold calculation: the parameters of the distributions can be used to set cut-off thresholds between established, potential, and unknown factors. This method is empirical and provides an intuitive way to evaluate the results; the scores could be further optimized heuristically given a large-scale, comprehensive ground-truth set. Furthermore, the factors most highly associated with the disease are the well-known ones; the hidden knowledge, on the other hand, resides in the region where the associations are positive yet weak.

Model evaluation

Two diseases, namely Ischemic Stroke (IS) and Parkinson’s Disease (PD), were used as queries to the hypothesis generation system. The distribution of associated factors is presented in Figure 4, and the results were compared with MedLink Neurology [ 15 ], a web resource used by clinicians; the comparison is summarized in Figure 3. In the case of IS, most of the associated factors are identified by both systems; however, a set of factors is identified only by the proposed approach. In the case of PD, a large number of factors are identified by both systems. However, a number of factors are identified only by the proposed HGF, and only a handful mentioned in MedLink Neurology have positive but low similarity scores in the hypothesis generation framework.

Figure 3

Number of factors identified by MedLink Neurology and by HGF for IS and PD. Association levels for IS measured by HGF are high (0.3 < cosine score) and possible (0.1 < cosine score < 0.3); association levels for PD measured by HGF are high (0.2 < cosine score), possible (0.1 < cosine score < 0.2) or low (0.05 < cosine score < 0.1).

Figure 4

Distribution of similarity score (dashed line) for risk factors associated with IS and PD. The frequency represents the number of factors at each cosine similarity level (−1 to +1). Tri-modal distribution models are represented by solid lines.

The tri-modal distribution model is used to group the associated factors into different levels. The cut-off values between association levels vary slightly depending on the distribution of the similarity scores. The ideal decision boundary could be found if a large number of ground-truth cases were available; here, the decision boundary is instead selected intuitively based on the shape of the distributions. For example, in the case of IS, factors are considered highly associated if their cosine score is greater than 0.3, possibly associated if their score is between 0.1 and 0.3, and possibly not associated if their score is lower than 0.1. In the case of PD, factors are considered highly associated if their cosine score is greater than 0.2, possibly associated if it is between 0.1 and 0.2, and associated at a low level if it is between 0.05 and 0.1; factors with scores lower than 0.05 are considered possibly not associated with Parkinson’s Disease.
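These disease-specific cut-offs can be expressed as a small scoring helper. A minimal sketch, assuming the cut-off values quoted above; the function and tuple names are illustrative:

```python
def categorize(score, cutoffs):
    """Map a cosine-similarity score to an association level using
    disease-specific cut-offs (high, possible, low)."""
    high, possible, low = cutoffs
    if score > high:
        return "high"
    if score > possible:
        return "possible"
    if score > low:
        return "low"
    return "possibly not associated"

# Cut-offs from the text; IS defines no separate "low" band,
# so its low cut-off coincides with the "possible" one.
IS_CUTOFFS = (0.3, 0.1, 0.1)
PD_CUTOFFS = (0.2, 0.1, 0.05)

print(categorize(0.25, PD_CUTOFFS))  # → high
print(categorize(0.15, IS_CUTOFFS))  # → possible
```

Because the cut-offs are parameters rather than constants, the same helper can serve any disease once its tri-modal fit has suggested the boundaries.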

In the case of IS, the distribution of known associated factors is shifted further to the right than for PD; hence the separation between the known and unknown factors is more pronounced. In addition, associations at both extremes (close to +1.0 and −1.0) are likely to be common knowledge, whereas hidden knowledge tends to be captured at similarity scores that are low yet positive. Nonetheless, it is not realistic to compare precise similarity scores in order to weight one factor against another, mainly because a systemic bias inherent to biological text data causes the scores of generic factors to underestimate their true values (data not shown); a direct comparison would therefore fail unless additional normalization steps were taken.

Figure 3 summarizes a comparative analysis of MedLink Neurology and HGF for IS and PD. Overall, in the case of IS, twelve factors were identified by both systems and six factors were identified only by the HGF. In the case of PD, twelve factors were identified by both systems, ten factors were identified only by the HGF, and five factors were identified only by MedLink Neurology; these five factors had a low association level in the HGF. They were either very generic or not exactly mapped in the set of 96 factors, so a direct comparison could not be made. Finally, this small-scale comparative analysis corroborates the hypothesis that a literature-based HGF can better predict the associated factors for diseases such as IS, whose risk and associated factors are well studied and documented. In both cases, MedLink Neurology and HGF predicted twelve common associated factors; however, in the case of PD ten new factors were predicted, compared to six in the case of IS.

De novo hypothesis generation can inform how we design experiments and select the parameters of a study. Interestingly, associations detected by the proposed framework can facilitate extraction of interesting observations and new trends in the field. For instance, it was found that PD could possibly be associated with immunological disorders, which is an intriguing observation. This analysis also facilitates interdisciplinary research and enhances interaction among scientists from sub-specialized fields. A manual review of the literature was performed to find evidence for some of the associations found only by the HGF; Table 2 summarizes these results.

There are three main limitations in the presented framework, for which we are currently seeking solutions. 1) Manual selection of the factors introduces bias into the dataset and also limits scalability. To alleviate this problem, the MeSH hierarchy will be used to generate the set of factors; MeSH comprises more than 25,000 subject headings organized in an eleven-level hierarchy. 2) In the set of 96 factors, some factors were very generic and some very specific; this systemic bias in the dataset caused the scores of generic factors to underestimate their true values and those of factors with limited information to be overestimated (data not shown). To partially address this technical difficulty, an improved method based on local LSA is being developed in our lab. Finally, 3) literature from only the past twenty years was not sufficient for the HGF; expansion of the literature is necessary, based on the observation that the association between head trauma and PD was significantly lower than expected.

Generating new hypotheses by mining a vast amount of raw, unstructured knowledge from the archived literature may help in identifying new research trends as well as promoting interdisciplinary studies. In addition, the presented framework is not limited to uncovering disease-disease interactions; any word from MeSH can be used to query the system, and its associated factors can be identified accordingly. Disease-disease interaction networks, interaction networks among chemical compounds, drug-drug interaction networks, or any specific type of interaction network can be constructed using the HGF; the common basis for all these networks is the knowledge embedded in the literature. Application of this framework is broad, as its usage is not limited to any specific domain. For instance, uncovering drug-drug interactions is valuable in drug development and drug administration, while uncovering disease-disease interactions is important in understanding disease mechanisms and advancing biology through integrated interdisciplinary research. Even though the framework is not limited to diseases, in this study two neurological diseases were used to test the system and demonstrate its power and applicability.

In addition to addressing the limitations of the framework, work is in progress to expand the HGF to allow users to generate disease networks based on a number of user-defined queries. Such customized networks can be valuable to a wide range of scientists by speeding up the identification of associated factors and the detection of disease-disease interactions. Disease networks based on genetics and proteomics data display many connections between individual disorders and disease categories [ 2 , 4 ]; as expected, no human disorder seems to have unique origins or to be independent of other disorders. Knowledge extraction from the medical literature can therefore be greatly beneficial and reliable for uncovering potential links between two disorders.

Authors’ information

VA is a Ph.D. candidate in Electrical and Computer Engineering at the University of Memphis; she has a B.A.Sc. in Computer Engineering and a B.Sc. in Biochemistry, in addition to an M.Sc. in Cellular Molecular Medicine and a second M.Sc. in Bioinformatics. Her research interests are interdisciplinary research in Medical Informatics and Systems Biology. VA’s research incorporates a systems approach to understanding gene regulatory networks, combining mathematical modeling and molecular biology wet-lab techniques. Her recent contributions are in medical informatics, where her broad understanding of interdisciplinary issues as well as deep knowledge of mathematics and experimental biology are fundamental in designing and performing experiments in translational research.

RZ is an M.D. in the Department of Neurology at the University of Tennessee. He also holds a Master of Public Health. His research interests include Vascular Neurology and Bioinformatics. Over the past few years, RZ has contributed to bridging the gap between clinical findings and the application of bioinformatics tools.

FEF is a PhD candidate in Electrical and Computer Engineering at The University of Memphis; he has a B.Sc. in Computer Science and Engineering, M.Sc. degree in Computer Science and Engineering and a second M.Sc. degree in Bioinformatics. His research interests are biological information retrieval and data mining. FEF possesses good knowledge in software design and development. He participated in software development of some national and international research projects, such as Codewitz Asia-Link Project of European Union.

MY is an Associate Professor in the Department of Electrical and Computer Engineering, an adjunct faculty member of the Biomedical Engineering and Bioinformatics Program, and an affiliated member of the Institute for Intelligent Systems (IIS) at The University of Memphis (U of M). He is a senior member of the IEEE. He has made significant contributions in the research and development of real-time computer vision solutions for academic research and commercial applications. He has been involved with several technological innovations, including classifying gender, age group, ethnicity and emotion, face detection, recognition of human activities in video, and speech-gesture enabled sophisticated natural human-computer interfaces. Some of his research on facial image analysis and hand gesture recognition has been used in developing several commercial products by VideoMining Inc.

Abbreviations

HGF: Hypothesis generation framework

IS: Ischemic stroke

LSA: Latent semantic analysis

MeSH: Medical subject heading

NMF: Non-negative matrix factorization

PD: Parkinson’s disease

POLSA: Parameter optimized latent semantic analysis

Gao Y, Kinoshita J, Wu E, Miller E, Lee R, Seaborne A, Cayzer S, Clark T: SWAN: A Distributed Knowledge Infrastructure for Alzheimer Disease Research. Journal of Web Semantics. 2006, 4 (3): 222-228. 10.1016/j.websem.2006.05.006.


Goh KI, Cusick ME, Valle D, Childs B, Vidal M, Barabási AL: The human disease network. Proc Natl Acad Sci USA. 2007, 104 (21): 8685-8690. 10.1073/pnas.0701361104.


Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner R, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science. 2001, 292 (5518): 929-934. 10.1126/science.292.5518.929.


Zhang X, Zhang R, Jiang Y, Sun P, Tang G, Wang X, Lv H, Li X: The expanded human disease network combining protein–protein interaction information. Eur J Hum Genet. 2011, 19 (7): 783-788. 10.1038/ejhg.2011.30.

Rzhetsky A, Seringhaus M, Gerstein M: Seeking a new biology through text mining. Cell. 2008, 134 (1): 9-13. 10.1016/j.cell.2008.06.029.

Hirschman L, Morgan AA, Yeh AS: Rutabaga by any other name: extracting biological names. J Biomed Inform. 2002, 35 (4): 247-259. 10.1016/S1532-0464(03)00014-5.

Wilbur WJ, Hazard GF, Divita G, Mork JG, Aronson AR, Browne AC: Analysis of biomedical text for chemical names: a comparison of three methods. Proc AMIA Symp. 1999, 176-180. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2232672/ .

Landauer TK, Dumais ST: A solution to plato’s problem: the latent semantic analysis theory of the acquisition, induction, and representation of knowledge. Psychol Rev. 1997, 104: 211-240.

Lee DD, Seung HS: Learning the parts of objects by non-negative matrix factorization. Nature. 1999, 401: 788-791. 10.1038/44565.

Paatero P, Tapper U: Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994, 5: 111-126. 10.1002/env.3170050203.

Berry MW, Browne M: Understanding Search Engines: Mathematical Modeling and Text Retrieval. 1999, Philadelphia, USA: SIAM


Swanson D, Smalheiser N: Assessing a gap in the biomedical literature: magnesium deficiency and neurologic disease. Neurosci Res Commun. 1994, 15: 1-9.

Srinivasan P, Libbus B: Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics. 2004, 20 (Suppl 1): i290-i296. 10.1093/bioinformatics/bth914.

Yeasin M, Malempati H, Homayouni R, Sorower MS: A systematic study on latent semantic analysis model parameters for mining biomedical literature. BMC Bioinformatics. 2009, 10 (Suppl 7): A6.

Medlink Neurology. [ http://www.medlink.com/medlinkcontent.asp ]

Catling LA, Abubakar I, Lake IR, Swift L, Hunter PR: A systematic review of analytical observational studies investigating the association between cardiovascular disease and drinking water hardness. J Water Health. 2008, 6 (4): 433-442. 10.2166/wh.2008.054.


Menown IA, Shand JA: Recent advances in cardiology. Future Cardiol. 2010, 6 (1): 11-17. 10.2217/fca.09.59.

Tafet GE, Idoyaga-Vargas VP, Abulafia DP, Calandria JM, Roffman SS, Chiovetta A, Shinitzky M: Correlation between cortisol level and serotonin uptake in patients with chronic stress and depression. Cogn Affect Behav Neurosci. 2001, 1 (4): 388-393. 10.3758/CABN.1.4.388.

Williams GP: The role of oestrogen in the pathogenesis of obesity, type 2 diabetes, breast cancer and prostate disease. Eur J Cancer Prev. 2010, 19 (4): 256-271. 10.1097/CEJ.0b013e328338f7d2.

Schürks M, Glynn RJ, Rist PM, Tzourio C, Kurth T: Effects of vitamin E on stroke subtypes: meta-analysis of randomised controlled trials. BMJ. 2010, 341: c5702-10.1136/bmj.c5702.


Benkler M, Agmon-Levin N, Shoenfeld Y: Parkinson’s disease, autoimmunity, and olfaction. Int J Neurosci. 2009, 119 (12): 2133-2143. 10.3109/00207450903178786.

Moscavitch SD, Szyper-Kravitz M, Shoenfeld Y: Autoimmune pathology accounts for common manifestations in a wide range of neuro-psychiatric disorders: the olfactory and immune system interrelationship. Clin Immunol. 2009, 130 (3): 235-243. 10.1016/j.clim.2008.10.010.

Faria AM, Weiner HL: Oral tolerance. Immunol Rev. 2005, 206: 232-259. 10.1111/j.0105-2896.2005.00280.x.

Teixeira G, Paschoal PO, de Oliveira VL, Pedruzzi MM, Campos SM, Andrade L, Nobrega A: Diet selection in immunologically manipulated mice. Immunobiology. 2008, 213 (1): 1-12. 10.1016/j.imbio.2007.08.001.

Schiffman SS, Sattely-Miller EA, Taylor EL, Graham BG, Landerman LR, Zervakis J, Campagna LK, Cohen HJ, Blackwell S, Garst JL: Combination of flavor enhancement and chemosensory education improves nutritional status in older cancer patients. J Nutr Health Aging. 2007, 11 (5): 439-454.


Murphy C, Davidson TM, Jellison W, Austin S, Mathews WC, Ellison DW, Schlotfeldt C: Sinonasal disease and olfactory impairment in HIV disease: endoscopic sinus surgery and outcome measures. Laryngoscope. 2000, 110 (10 Pt 1): 1707-1710.

Zucco GM, Ingegneri G: Olfactory deficits in HIV-infected patients with and without AIDS dementia complex. Physiol Behav. 2004, 80 (5): 669-674. 10.1016/j.physbeh.2003.12.001.

Tandeter H, Levy A, Gutman G, Shvartzman P: Subclinical thyroid disease in patients with Parkinson’s disease. Arch Gerontol Geriatr. 2001, 33 (3): 295-300. 10.1016/S0167-4943(01)00196-0.

Chinnakkaruppan A, Das S, Sarkar PK: Age related and hypothyroidism related changes on the stoichiometry of neurofilament subunits in the developing rat brain. Int J Dev Neurosci. 2009, 27 (3): 257-261. 10.1016/j.ijdevneu.2008.12.007.

García-Moreno JM, Chacón-Peña J: Hypothyroidism and Parkinson’s disease and the issue of diagnostic confusion. Mov Disord. 2003, 18 (9): 1058-1059. 10.1002/mds.10475.

Munhoz RP, Teive HA, Troiano AR, Hauck PR, Herdoiza Leiva MH, Graff H, Werneck LC: Parkinson’s disease and thyroid dysfunction. Parkinsonism Relat Disord. 2004, 10 (6): 381-383. 10.1016/j.parkreldis.2004.03.008.

Ferreira JJ, Neutel D, Mestre T, Coelho M, Rosa MM, Rascol O, Sampaio C: Skin cancer and Parkinson’s disease. Mov Disord. 2010, 25 (2): 139-148. 10.1002/mds.22855.


Acknowledgements

This work was supported by the Electrical and Computer Engineering Department and Bioinformatics Program at the University of Memphis, by the University of Tennessee Health Science Center (UTHSC), as well as by NSF grant NSF-IIS-0746790. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding institution.

Author information

Authors and affiliations.

Department of Electrical and Computer Engineering, The University of Memphis, Memphis, TN, 38152, USA

Vida Abedi, Mohammed Yeasin & Fazle Elahi Faisal

College of Arts and Sciences, Bioinformatics Program, The University of Memphis, Memphis, TN, 38152, USA

Department of Neurology, University of Tennessee Health Science Center, Memphis, TN, 38163, USA


Corresponding author

Correspondence to Mohammed Yeasin .

Additional information

Competing interests.

The authors declare that they have no competing interests.

Authors’ contributions

VA designed and carried out the experiments, participated in the development of the methods, analyzed the results and drafted the manuscript. RZ participated in the development of the methods, designed the validation experiments for the two test cases and reviewed the manuscript. FEF participated in the implementation of the algorithms. MY participated in the development of the methods, supervised the experiments and edited the manuscript. All authors have read, and approved the final version of the manuscript.


Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Cite this article.

Abedi, V., Zand, R., Yeasin, M. et al. An automated framework for hypotheses generation using literature. BioData Mining 5 , 13 (2012). https://doi.org/10.1186/1756-0381-5-13


Received : 30 March 2012

Accepted : 13 July 2012

Published : 29 August 2012

DOI : https://doi.org/10.1186/1756-0381-5-13


  • Disease network
  • Biological literature-mining
  • Hypothesis generation
  • Knowledge discovery
  • MeSH ontology

BioData Mining

ISSN: 1756-0381


Inference on Tables as Semi-structured Data

Currently, either crowdsourced or fully automatic methods are used to create training data for Natural Language Inference (NLI) tasks such as semi-structured table reasoning. In this paper, a realistic semi-automated data augmentation system for tabular inference is developed. Our methodology creates hypothesis templates that can be applied to similar tables, rather than manually creating a hypothesis for each table. Additionally, our paradigm constructs logical counterfactual tables based on premise paraphrasing and human-written logical restrictions. We found that our methodology can produce examples of tabular inference that resemble those made by humans, which could help with training-data augmentation, particularly under limited supervision.


Our framework includes four main components:

i. Hypothesis Template Creation

The row attributes (i.e., keys) for a particular category of tables (for example, movies) are mostly shared across all tables (e.g., Length, Producer, Director). To produce logical hypothesis sentences, it is advantageous to write key-based rules for specific table categories. We develop key-based rules for four reasoning types: temporal, numerical, spatial, and common-sense reasoning.
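Key-based templates of this kind can be sketched as format strings keyed by row attribute. A minimal sketch: the movie keys, template sentences, and the `generate_hypotheses` helper below are illustrative assumptions, not templates from the paper.

```python
# Hypothetical key-based templates for a "movie" table category;
# the keys and sentences are illustrative, not taken from the paper.
TEMPLATES = {
    "Length":   "The movie {title} runs for {Length}.",   # numerical
    "Release":  "{title} was released in {Release}.",     # temporal
    "Country":  "{title} was filmed in {Country}.",       # spatial
    "Director": "{Director} directed {title}.",           # common sense
}

def generate_hypotheses(table):
    """Fill each key's template with one table's values,
    skipping templates whose key the table does not contain."""
    return [tpl.format(**table)
            for key, tpl in TEMPLATES.items() if key in table]

table = {"title": "Inception", "Length": "148 minutes",
         "Release": "2010", "Director": "Christopher Nolan"}
hyps = generate_hypotheses(table)
for h in hyps:
    print(h)
```

Because the keys are shared across a category, the same template set applies to every movie table rather than requiring per-table authoring.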


ii. Rational Counterfactual Table Creation

We alter the original table in one or more of the following ways to create a counterfactual table: keeping a row unchanged, adding a new value to an existing key, replacing an existing key-value pair with counterfactual data, deleting a specific key-value pair from the table, adding missing new keys (i.e., a key from (n k)), or adding a missing key row to the table. To create a counterfactual table, a subset of these operations is chosen randomly for each row of an existing table with a predetermined probability p (a hyper-parameter). To guarantee logical consistency in the generated sentences, we impose crucial key-specific constraints when building these tables.
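The row-level sampling of operations can be sketched as follows. This is only a sketch under stated assumptions: it covers three of the operations (keep, replace, delete), and the sampling scheme, function name, and value pool are illustrative.

```python
import random

def make_counterfactual(table, value_pool, p=0.5, seed=0):
    """Randomly apply row operations to build a counterfactual table:
    keep a key-value pair, replace its value with counterfactual data,
    or delete it. Only a subset of the described operations is shown."""
    rng = random.Random(seed)  # seeded for reproducibility
    result = {}
    for key, value in table.items():
        if rng.random() > p:                 # keep the row unchanged
            result[key] = value
        elif rng.random() < 0.5:             # replace with a counterfactual value
            result[key] = rng.choice(value_pool.get(key, [value]))
        # else: delete this key-value pair from the table
    return result

table = {"Length": "148 minutes", "Release": "2010", "Director": "Nolan"}
cf = make_counterfactual(table, {"Director": ["Spielberg"]}, seed=0)
```

A real implementation would also enforce the key-specific constraints mentioned above (e.g., type-checking replacement values) before accepting a generated table.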


iii. Paraphrasing of Premise Tables

The lack of linguistic variety is a major issue with grammar-based data generation methods. As a result, we use both automated and human paraphrasing of premise tables to address the diversity issue. For each key in a given category, we write at least three to five simple paraphrased sentences of the key-specific template.


iv. Automatic Table Hypothesis Generation

Once created, the templates can be used to automatically fill in the blanks from the entries of the considered tables and generate logically rational hypothesis sentences. To generate contradictory sentences, we choose a value at random from the set of values the key takes across all tables to fill in the blanks; this substitution respects key-specific constraints, such as the key's value type. Furthermore, we ensure that each entail–contradict pair is created from a similar template with minimal token modification.
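The entail–contradict pairing can be sketched like this; the function name and the value pool are illustrative assumptions, and a real system would draw the pool from the other tables in the category:

```python
import random

def entail_contradict_pair(template, table, key, value_pool, seed=0):
    """Build an (entailed, contradictory) hypothesis pair from one
    template: the contradiction swaps in a random same-key value
    from other tables, keeping token modification minimal."""
    rng = random.Random(seed)
    entail = template.format(**table)
    wrong = rng.choice([v for v in value_pool[key] if v != table[key]])
    contradict = template.format(**{**table, key: wrong})
    return entail, contradict

entail, contradict = entail_contradict_pair(
    "{title} was released in {Release}.",
    {"title": "Inception", "Release": "2010"},
    "Release",
    {"Release": ["2010", "1999", "2005"]},
)
```

Because both sentences come from the same template, the pair differs only in the swapped value, which is exactly the minimal-token-modification property described above.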


Data Quality Evaluation


Experimental Results

We did our analysis in broadly 4 settings:

a. Evaluation

b. Standalone

c. Augmentation

d. Limited supervision

Knowledge + InfoTabS

You should check our NAACL 2021 paper, which enhances InfoTabS with extra knowledge.

We presented a semi-automatic data augmentation framework for tabular inference. We generate AutoTNLI using a template-based approach, and AutoTNLI was used for both TNLI evaluation and data augmentation. Our experiments show that AutoTNLI and, by extension, our framework are effective, particularly in adversarial settings. In the future, we hope to create more lexically diverse and robust datasets and to investigate whether the addition of neutrals can improve these datasets.

The following people have worked on the paper " Realistic Data Augmentation Framework for Enhancing Tabular Reasoning ":


Please cite our paper as below if you use AutoTNLI.

Acknowledgement

We thank members of the Utah NLP group for their valuable insights and suggestions at various stages of the project, and the EMNLP 2022 reviewers for their helpful comments. We thank Antara Bahursettiwar for her valuable feedback. Additionally, we appreciate the inputs provided by Vivek Srikumar and Ellen Riloff. Vivek Gupta acknowledges support from Bloomberg's Data Science Ph.D. Fellowship.

Hypothesis Generation Toolkit: Identify What to Test, like a Data Scientist

Forget automatic hypothesis generators. Grab custom Google Analytics reports, segments, and an all-in-one spreadsheet to focus on your most profitable optimization ideas.

Download FREE toolkit

Cover Resource

Testing ideas that never seem to get the conversion lifts you want?

Wish you had a data scientist on the team? HiPPO (highest-paid person’s opinion) derailing your optimization efforts?

A robust hypothesis based on data (and not on gut instinct) is the key to profitable testing. And with this toolkit, you can take your first, easy step towards data-driven decision-making. We’ve compiled actionable resources that will point you in the direction of ideas that will actually eliminate the conversion roadblocks your traffic faces.

6 Custom Segments & 4 Reports

Most marketers use Google Analytics. Your team does too. Supercharge the effectiveness of your Analytics account with 6 custom segments and 4 custom reports that you can set-up/import right away to define your best marketing strategies, your most effective content and your biggest acquisition opportunities.

ALL-IN-ONE GOOGLE DRIVE SPREADSHEET

Fancy tools and dashboards are great. But sometimes simple works just as well. We’ve created a spreadsheet that will help you gather your qualitative and quantitative insights in one place, prioritize ideas and frame a hypothesis that is worth investing your time in.

THIS GUIDE HAS BEEN COMPILED BY

Andra Baragan

Andra Baragan, Founder @ Ontrack Digital

Andra is an experienced conversion optimization specialist and a certified Data Analyst and Optimizer. She has worked with over 60 online businesses and has brought 6 figures in increased revenue for them

You don’t need a data scientist to craft a winning hypothesis. All you need is a solid start. Make one.

Start Your 15-Day Free Trial Right Now. No Credit Card Required


AI Hypothesis Generator

A Hypothesis Generator to help you come up with a boilerplate hypothesis for your test ideas. Generate a well-structured hypothesis in under 10 seconds!

1. Give us a brief about your hypothesis...

Hypotheses in A/B Testing

Hypotheses form an integral part of A/B Testing. They provide a clear path and expected outcome for the test, based on the initial conditions, such as the user interface and user experience, among others. A well-defined hypothesis is the foundation of any successful A/B test, guiding the direction of the test and serving as a benchmark against which the test’s results are evaluated.

What are the benefits?

The Automated Hypothesis Creator simplifies the first step in the A/B testing process and provides several benefits:

  • Quick and efficient hypothesis generation.
  • Saves time and resources, which can instead be invested in analysing the output of the A/B test.
  • Provides insightful and scientifically-backed predictions.
  • Outlines a clear picture for the A/B test, thus leading to more accurate outcomes.

How to Use it with A/B Testing?

To use the Automated Hypothesis Creator with A/B testing, follow these simple steps:

  • Begin by clearly formulating your query.
  • Use the text area in the tool to provide the necessary input data.
  • Click the “Create Hypothesis” button.
  • Wait a moment while the tool processes your request and generates a hypothesis.
  • Once the hypothesis is created, use it as a basis for your A/B test.
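
The tool's internals aren't published, but the kind of boilerplate it returns can be sketched as the "if/then/because" template commonly used in conversion optimization. The function and example inputs below are illustrative stand-ins, not the tool's actual API:

```python
def build_hypothesis(change, metric, rationale):
    """Fill the standard 'if/then/because' A/B testing template."""
    return (f"If we {change}, then {metric} will improve, "
            f"because {rationale}.")

# Hypothetical example inputs:
print(build_hypothesis(
    "shorten the checkout form to three fields",
    "checkout completion rate",
    "session recordings show users abandoning at the address step"))
```

Whatever generates the sentence, the value is in forcing each test idea to name a concrete change, a measurable metric, and an evidence-based rationale.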

Try other free tools:

  • A/B Test Headline Generator
  • Sample Size Calculator
  • A/B Test Duration Calculator
  • Statistical Significance Calculator
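
The sample-size and significance calculators listed above do standard two-proportion z-test arithmetic. Their exact formulas are not published, so the sketch below assumes the usual textbook approach, using only the Python standard library:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base, p_target, alpha=0.05, power=0.8):
    """Visitors needed per variant to detect a lift from p_base to p_target."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_b = NormalDist().inv_cdf(power)           # desired statistical power
    var = p_base * (1 - p_base) + p_target * (1 - p_target)
    return math.ceil((z_a + z_b) ** 2 * var / (p_target - p_base) ** 2)

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference in two conversion rates."""
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Detecting a lift from 10% to 12% conversion:
print(sample_size_per_group(0.10, 0.12))        # roughly 3839 per variant
print(round(p_value(100, 1000, 130, 1000), 4))  # below the 0.05 threshold
```

The takeaway is that even modest lifts at typical conversion rates require thousands of visitors per variant, which is why a duration calculator matters before launching a test.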

A/B testing platform for people who care about website performance

Mida is 10X faster than anything you have ever considered. Try it yourself.

Mida.so is a super lightweight A/B testing tool to help you experiment, analyze and implement conversion strategies in minutes.

IMAGES

  1. MOLIERE: Automatic Biomedical Hypothesis Generation System

  2. Workflow for automated hypothesis generation. a) General workflow

  3. Automating Hypothesis Generation

  4. MOLIERE: Automatic Biomedical Hypothesis Generation System

  5. How To Develop A Digital Product Through A Hypothesis Generation Design

  6. Steps in the hypothesis Generation

VIDEO

  1. AI in Hypothesis Generation

  2. The hypothesis of sixth-generation fighter aircraft (HD Enhanced Edition)

  3. 14. Benefits of the Scientific Hypothesis: #1-Antidote to Unconscious Bias

  4. Abiogenesis: What Is the Probability Life Arose from Inorganic Chemicals?

  5. The Autogen Assistant- Create Agents With a WebUI

  6. Inside AutoGen: Decoding the Core Mechanics of AI-Powered Group Chats

COMMENTS

  1. Hypothesis Maker

    How to use Hypothesis Maker. Visit the tool's page. Enter your research question into the provided field. Click the 'Generate' button to let the AI generate a hypothesis based on your research question. Review the generated hypothesis and adjust it as necessary to fit your research context and objectives. Copy and paste the hypothesis into your ...

  2. Automated Hypothesis Generation

    Automated hypothesis generation: when machine-learning systems produce ideas, not just test them. Testing ideas at scale. Fast. While algorithms are mostly used as tools to number-crunch and test-drive ideas, they have yet to be used to generate the ideas themselves. Let alone at scale. Rather than thinking up one idea at a time and testing it ...

  3. An automatic hypothesis generation for plausible linkage between

    Our hypothesis generation framework followed the closed discovery approach of Swanson's ABC model [22]. The closed discovery approach tried to identify B entities that connected the A entity to the C ...

  4. An AI Tool for Automated Research Question and Hypothesis Generation

    Generates a null hypothesis (H0) and an alternate hypothesis (H1) for each research question; Handles cases where either H0 or H1 is not present; Automatically generates missing H1 using the LLMChain if needed; Negates hypothesis statement if H0 is missing

  5. [2402.14424] Automating Psychological Hypothesis Generation with AI

    Leveraging the synergy between causal knowledge graphs and a large language model (LLM), our study introduces a groundbreaking approach for computational hypothesis generation in psychology. We analyzed 43,312 psychology articles using a LLM to extract causal relation pairs. This analysis produced a specialized causal graph for psychology. Applying link prediction algorithms, we generated 130 ...

  6. Artificial intelligence to automate the systematic review of scientific

    The automatic generation of content for the SLR report is a complex task not addressed until very recently. A summary of each selected primary study is a good starting point to write a SLR report. ... In addition, the authors see great potential on the application of AI for: (1) automatic hypothesis generation, (2) improvements on inclusion ...

  7. AGATHA

    Hypothesis generation systems address this challenge by mining the wealth of publicly available scientific information to predict plausible research directions. ... Justin Sybrandt, Michael Shtutman, and Ilya Safro. 2017. MOLIERE: Automatic Biomedical Hypothesis Generation System. In Proceedings of the 23rd ACM SIGKDD International Conference ...

  8. Hypothesis Generation from Literature for Advancing Biological

    Hypothesis Generation is a literature-based discovery approach that utilizes existing literature to automatically generate implicit biomedical associations and provide reasonable predictions for future research. Despite its potential, current hypothesis generation methods face challenges when applied to research on biological mechanisms.

  9. AGATHA: Automatic Graph-mining And Transformer based Hypothesis

    We present AGATHA, a deep-learning hypothesis generation system that can introduce data-driven insights earlier in the discovery process. Through a learned ranking criteria, this system quickly prioritizes plausible term-pairs among entity sets, allowing us to recommend new research directions. We massively validate our system with a temporal ...

  10. Explainable Automatic Hypothesis Generation via High-order Graph Walks

    This more transparent process encourages trust in the biomedical community for automatic hypothesis generation systems. We use a reinforcement learning strategy to formulate the HG problem as a guided node-pair embedding-based link prediction problem via a directed graph walk. Given nodes in a node-pair, the model starts a graph walk ...

  11. An automatic hypothesis generation for plausible linkage between

    Our hypothesis generation framework used evidence from scientific publications retrieved from PubMed to build a Xanthium compounds-diabetes knowledge base and generate hypotheses from it. First, we employed a dictionary-based tool to conduct the NER task and extracted bio-entities such as genes, compounds, phenotypes, biological processes, and ...

  12. MOLIERE: Automatic Biomedical Hypothesis Generation System

    1.1 Our Contribution. We introduce a deployed system, MOLIERE [], with the goal of generating more usable results than previously proposed hypothesis generation systems. We develop a novel method for constructing a large network of public knowledge and devise a query process which produces human-readable text highlighting the relationships present between nodes.

  13. Explainable Automatic Hypothesis Generation Via High-order Graph Walks

    In this paper, we study the automatic hypothesis generation (HG) problem, focusing on explainability. Given pairs of biomedical terms, we focus on link prediction to explain how the prediction was made. This more transparent process encourages trust in the biomedical community for automatic hypothesis generation systems. We

  14. Explainable Automatic Hypothesis Generation via High-order ...

    In this paper, we study the automatic hypothesis generation (HG) problem, focusing on explainability. Given pairs of biomedical terms, we focus on link prediction to explain how the prediction was made. This more transparent process encourages trust in the biomedical community for automatic hypothesis generation systems. We use a reinforcement learning strategy to formulate the HG problem as a ...

  15. Data-Driven Hypothesis Generation in Clinical Research: What We Learned

    Hypothesis generation is an early and critical step in any hypothesis-driven clinical research project. Because it is not yet a well-understood cognitive process, the need to improve the process goes unrecognized. ... Zhang X. T-PAIR: Temporal Node-Pair Embedding for Automatic Biomedical Hypothesis Generation. IEEE Transactions on Knowledge and ...

  16. Hypothesis Maker

    Our hypothesis maker is a simple and efficient tool you can access online for free. If you want to create a research hypothesis quickly, you should fill out the research details in the given fields on the hypothesis generator. Below are the fields you should complete to generate your hypothesis:

  17. T-PAIR: Temporal node-pair embedding for automatic biomedical ...

    In this paper, we study an automatic hypothesis generation (HG) problem, which refers to the discovery of meaningful implicit connections between scientific terms, including but not limited to diseases, chemicals, drugs, and genes extracted from databases of biomedical publications. Most prior studies of this problem focused on the use of ...

  18. An automated framework for hypotheses generation using literature

    The proposed hypothesis generation framework (HGF) finds "crisp semantic associations" among entities of interest - that is a step towards bridging such gaps. The proposed HGF shares similar end goals with SWAN but is more holistic in nature and was designed and implemented using scalable and efficient computational models of disease ...

  19. [PDF] Automated literature mining and hypothesis generation through a

    DOI: 10.1101/403667 Corpus ID: 92120855; Automated literature mining and hypothesis generation through a network of Medical Subject Headings @article{Wilson2018AutomatedLM, title={Automated literature mining and hypothesis generation through a network of Medical Subject Headings}, author={Stephen J. Wilson and Angela D. Wilkins and Matthew V. Holt and Byung-Kwon Choi and Daniel M. Konecki and ...

  20. AutoTNLI

    Automatic Table Hypothesis Generation . Once created, the templates can be used to automatically fill in the blanks from the entries of the considered tables and generate logically rational hypothesis sentences. To generate contradictory sentences, we choose a value at random from a set of key values shared by all tables to fill in the blanks. ...

  21. Hypothesis Generation Toolkit: Identify What to Test

    Hypothesis Generation Toolkit: Identify What to Test, like a Data Scientist. Forget automatic hypothesis generators. Grab custom Google Analytics reports, segments, and an all-in-one spreadsheet to focus on your most profitable optimization ideas. Download FREE toolkit.

  22. Hypothesis Generator For A/B Testing

    The Automated Hypothesis Creator simplifies the first step in the A/B testing process and provides several benefits: Quick and efficient hypothesis generation. Saves time and resources which can often be invested in analysing the output of the A/B test. Provides insightful and scientifically-backed predictions.

  23. An automatic hypothesis generation for plausible linkage between

    This framework for hypothesis generation employs a closed discovery approach from Swanson's ABC model that has proven very helpful in discovering biological linkages between bio entities to generate hypotheses of potential candidates for diabetes drug development using natural products. ... An automatic hypothesis generation for plausible ...
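
Several of the results above (the psychology causal-graph study, AGATHA, T-PAIR) describe the same pipeline: extract relation pairs from literature, build a graph, then score unlinked node pairs as candidate hypotheses. The general idea can be illustrated with a toy common-neighbours link-prediction sketch; the graph, construct names, and scorer below are illustrative stand-ins, not any paper's actual data or model:

```python
from itertools import combinations

# Toy directed causal graph: construct -> constructs it reportedly influences.
# Edges are invented for illustration, not extracted from any corpus.
graph = {
    "stress":   {"anxiety", "sleep_quality"},
    "anxiety":  {"sleep_quality", "memory"},
    "exercise": {"sleep_quality", "mood"},
    "mood":     {"memory"},
}

def common_neighbor_score(a, b):
    """Simple link-prediction score: number of shared downstream constructs."""
    return len(graph.get(a, set()) & graph.get(b, set()))

# Rank unlinked pairs as candidate hypotheses, highest score first.
candidates = [
    (a, b, common_neighbor_score(a, b))
    for a, b in combinations(list(graph), 2)
    if b not in graph.get(a, set()) and a not in graph.get(b, set())
]
candidates.sort(key=lambda t: -t[2])
print(candidates[0])  # e.g. ('stress', 'exercise', 1)
```

Production systems replace the common-neighbours score with learned embeddings or transformer-based rankers, but the shape of the problem (score unlinked pairs, surface the top ones as hypotheses) is the same.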