Browse NLP Research by Task

  • Representation learning: disentanglement, graph representation learning, network embedding, sentence embeddings
  • Classification: text, graph, audio, medical image, document, sentence, and emotion classification
  • Language modelling: long-range modeling, protein language models, sentence pair modeling
  • Question answering: open-ended, open-domain, and conversational question answering; answer selection; answer generation; machine reading comprehension; chart and embodied QA
  • Translation and image generation: image-to-image translation, image inpainting, text-to-image generation, conditional and zero-shot text-to-image generation, text-based image editing
  • Data augmentation: image and text augmentation
  • Machine translation: transliteration, bilingual lexicon induction, multimodal, unsupervised, and zero-shot machine translation
  • Text generation: dialogue generation, data-to-text generation, KG-to-text generation, story generation, paraphrase generation, text style transfer
  • Summarization: abstractive, extractive, multi-document, opinion, sentence, meeting, timeline, and query-focused summarization
  • Topic models: dynamic topic modeling, topic coverage
  • Image segmentation: 2D semantic segmentation, scene parsing, reflection removal, few-shot semantic segmentation
  • Visual question answering (VQA)
  • Named entity recognition: nested, Chinese, few-shot, and low-resource NER
  • Sentiment analysis: aspect-based sentiment analysis (ABSA), multimodal sentiment analysis, aspect sentiment triplet extraction, Twitter sentiment analysis
  • Few-shot learning: one-shot learning, cross-domain and unsupervised few-shot learning
  • Word embeddings: learning word embeddings, multilingual word embeddings, embeddings evaluation, contextualised word representations
  • Optical character recognition (OCR): handwriting recognition, handwritten digit recognition, irregular text recognition
  • Continual learning: class incremental learning, continual named entity recognition
  • Information retrieval: passage retrieval and re-ranking, cross-lingual information retrieval, table search and retrieval, document ranking, biomedical information retrieval
  • Relation extraction: relation classification, document-level relation extraction, joint entity and relation extraction, temporal relation extraction
  • Link prediction: inductive, dynamic, and anchor link prediction
  • Natural language inference: visual entailment, cross-lingual natural language inference
  • Emotion recognition: speech emotion recognition, emotion recognition in conversation, multimodal emotion recognition, emotion-cause pair extraction
  • Image captioning: 3D dense, controllable, aesthetic, and relational captioning
  • Semantic textual similarity: paraphrase identification, cross-lingual semantic textual similarity
  • Event extraction: event causality identification, zero-shot event extraction
  • Dialogue: dialogue state tracking, task-oriented dialogue systems, visual dialog, dialogue understanding and evaluation, conversational response selection and generation
  • Parsing: semantic parsing, AMR parsing, dependency parsing (transition-based, unsupervised, cross-lingual zero-shot), constituency parsing and grammar induction, discourse parsing, shallow syntax
  • Coreference and anaphora resolution: cross-document coreference resolution, bridging and abstract anaphora resolution
  • Text and lexical simplification
  • Sentence embedding: sentence compression, joint multilingual sentence representations, sentence embeddings for biomedical texts
  • Code generation: code translation, code documentation generation, code repair, source code summarization
  • Commonsense reasoning: physical and causal commonsense reasoning
  • Data integration: entity alignment, entity resolution, entity linking, entity typing, entity disambiguation, table annotation
  • Question generation and prompt engineering: visual prompting
  • Part-of-speech tagging and morphology: unsupervised POS tagging, morphological analysis, tagging, and inflection, lemmatization
  • Abuse and hate speech detection: hope speech detection, hate intensity prediction, fake news and rumour detection, sarcasm detection
  • Mathematical reasoning: math word problem solving, formal logic, geometry problem solving
  • Argument and opinion mining: aspect extraction, aspect-oriented opinion extraction
  • Word sense disambiguation: word sense induction
  • Semantic role labeling: predicate detection, textual analogy parsing
  • Slot filling: zero-shot slot filling
  • Grammatical error correction and detection: spelling correction, Chinese spell checking
  • Stance and intent detection: open intent detection and discovery, intent classification and recognition
  • Text-to-speech synthesis: prosody prediction, zero-shot multi-speaker TTS
  • Fact verification and hallucination evaluation
  • Document AI and document understanding
  • Cross-modal retrieval: image-text matching, text-to-video search, word alignment
  • Model and knowledge editing
  • Language-specific processing: Chinese, Thai, Japanese, and Vietnamese word segmentation; text diacritization for Arabic, Vietnamese, and many other languages
  • Text clustering: short text clustering, deep clustering
  • Keyphrase and keyword extraction, keyphrase generation
  • Authorship attribution and verification
  • Knowledge base population: KB-to-language generation, table-to-text generation
  • Plus many further specialized tasks, from taxonomy learning, negation and speculation scope resolution, and propaganda detection to meme classification, clickbait detection, and clinical information extraction

Natural Language Processing: Recently Published Documents

Towards Developing Uniform Lexicon Based Sorting Algorithm for Three Prominent Indo-Aryan Languages

Three Indic/Indo-Aryan languages, Bengali, Hindi, and Nepali, are explored here at the character level to find similarities and dissimilarities. Sharing the same root, Sanskrit, the Indic languages bear common characteristics, so computer and language scientists can take the opportunity to develop common Natural Language Processing (NLP) techniques and algorithms. With this concept in mind, we compare and analyze these three languages character by character. As an application of the hypothesis, we also developed a uniform sorting algorithm in two steps: first for the Bengali and Nepali languages only, and then extended to Hindi in the second step. Our thorough investigation with more than 30,000 words from each language suggests that the algorithm maintains total accuracy, as set by the local language authorities of the respective languages, with good efficiency.

Efficient Channel Attention Based Encoder–Decoder Approach for Image Captioning in Hindi

Image captioning refers to the process of generating a textual description of the objects and activities present in a given image. It connects two fields of artificial intelligence: computer vision and natural language processing, which deal with image understanding and language modeling, respectively. In the existing literature, most work on image captioning has been carried out for the English language. This article presents a novel method for image captioning in the Hindi language using an encoder–decoder based deep learning architecture with efficient channel attention. The key contribution of this work is the deployment of an efficient channel attention mechanism with Bahdanau attention and a gated recurrent unit for developing an image captioning model in Hindi. Color images usually consist of three channels: red, green, and blue. The channel attention mechanism focuses on an image's important channels while performing the convolution, which amounts to assigning higher importance to some channels than to others, and it has been shown to have great potential for improving the efficiency of deep convolutional neural networks (CNNs). The proposed encoder–decoder architecture utilizes the recently introduced ECA-Net CNN to integrate the channel attention mechanism. Hindi, India's official language, is the fourth most spoken language globally and is widely spoken in India and South Asia. By translating the well-known MSCOCO dataset from English to Hindi, a dataset for image captioning in Hindi was manually created. The efficiency of the proposed method is compared with other baselines in terms of Bilingual Evaluation Understudy (BLEU) scores, and the results illustrate that the proposed method outperforms the baselines, with improvements of 0.59%, 2.51%, 4.38%, and 3.30% in BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores, respectively, over the state of the art. The quality of the generated captions is further assessed manually in terms of adequacy and fluency to illustrate the method's efficacy.
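
To make the channel-attention idea concrete, here is a minimal PyTorch sketch of an ECA-style block; the kernel size and shapes are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (sketch): squeeze the spatial dimensions,
    then a cheap 1-D convolution across channels produces per-channel gates."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                              # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                         # squeeze: (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)       # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]  # reweight each channel
```

Each channel receives a gate in (0, 1), which is exactly the "higher importance to some channels" described above.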

Model Transformation Development Using Automated Requirements Analysis, Metamodel Matching, and Transformation by Example

In this article, we address how the production of model transformations (MT) can be accelerated by automating transformation synthesis from requirements, examples, and metamodels. We introduce a synthesis process based on metamodel matching, correspondence patterns between metamodels, and completeness and consistency analysis of matches. We describe how the limitations of metamodel matching can be addressed by combining matching with automated requirements analysis and model transformation by example (MTBE) techniques. We show that in practical examples a large percentage of the required transformation functionality can usually be constructed automatically, potentially reducing development effort, and we evaluate the efficiency of the synthesised transformations. Our novel contributions are: the concept of correspondence patterns between the metamodels of a transformation; requirements analysis of transformations using natural language processing (NLP) and machine learning (ML); symbolic MTBE using "predictive specification" to infer transformations from examples; and transformation generation in multiple MT languages and in Java, from an abstract intermediate language.

A Computational Look at Oral History Archives

Computational technologies have revolutionized the archival sciences field, prompting new approaches to process the extensive data in these collections. Automatic speech recognition and natural language processing create unique possibilities for the analysis of oral history (OH) interviews, where otherwise the transcription and analysis of the full recording would be too time-consuming. However, many oral historians note the loss of aural information when converting speech into text, pointing out the relevance of subjective cues for a full understanding of the interviewee's narrative. In this article, we explore various computational technologies for social signal processing and their potential application space in OH archives, as well as in neighboring domains where qualitative studies are a frequently used method. We also highlight the latest developments in key technologies for multimedia archiving practices, such as natural language processing and automatic speech recognition. We discuss the analysis of both visual cues (body language and facial expressions) and non-visual cues (paralinguistics, breathing, and heart rate), noting the specific challenges introduced by the characteristics of OH collections. We argue that applying social signal processing to OH archives will have a wider influence than solely on OH practices, bringing benefits to various fields from the humanities to computer science, as well as to the archival sciences. Looking at human emotions and somatic reactions in extensive interview collections would give scholars from multiple fields the opportunity to focus on feelings, mood, culture, and subjective experiences expressed in these interviews on a larger scale.

Which environmental features contribute to positive and negative perceptions of urban parks? A cross-cultural comparison using online reviews and Natural Language Processing methods

Natural Language Processing for Smart Construction: Current Status and Future Directions

Attention-Based Unsupervised Keyphrase Extraction and Phrase Graph for COVID-19 Medical Literature Retrieval

Searching, reading, and finding information in massive medical text collections is challenging. A typical biomedical search engine cannot feasibly navigate each article to find critical information or keyphrases, and few tools provide a visualization of the phrases relevant to a query. Yet there is a need to extract the keyphrases from each document for indexing and efficient search. Transformer-based neural networks such as BERT have been used for various natural language processing tasks, and their built-in self-attention mechanism can capture the associations between words and phrases in a sentence. This research investigates whether these self-attentions can be utilized to extract keyphrases from a document in an unsupervised manner and to identify relevancy between phrases, constructing a query relevancy phrase graph that visualizes the search corpus phrases by their relevancy and importance. A comparison with six baseline methods shows that the self-attention-based unsupervised keyphrase extraction works well on a medical literature dataset; the model can also be applied to other text data. The query relevancy graph model is applied to the COVID-19 literature dataset to demonstrate that the attention-based phrase graph can successfully identify the medical phrases relevant to the query terms.
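
As a rough illustration of the idea, the sketch below scores candidate phrase spans by how much self-attention their tokens receive from the rest of the sentence. The attention tensor and candidate spans are assumed inputs; a real system would add layer selection, candidate-phrase extraction, and filtering:

```python
import torch

def attention_phrase_scores(attentions, candidate_spans):
    """Score candidate keyphrase spans by the total self-attention their
    tokens receive, averaged over layers and heads (sketch).
    attentions: (layers, heads, T, T) attention weights from an encoder.
    candidate_spans: list of (start, end) token-index pairs."""
    received = attentions.mean(dim=(0, 1)).sum(dim=0)  # attention flowing into each token
    return {span: received[span[0]:span[1]].mean().item()
            for span in candidate_spans}
```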

Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing

Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general-domain corpora, such as newswire and the Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this article, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition. To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.

An ensemble approach for healthcare application and diagnosis using natural language processing

Machine Learning and Natural Language Processing Enable a Data-Oriented Experimental Design Approach for Producing Biochar and Hydrochar from Biomass

Top Recent NLP Research

Featured Post | Modeling | NLP & LLMs | Posted by Daniel Gutierrez, ODSC | October 1, 2021

Natural language processing (NLP), including conversational AI, is arguably one of the most exciting technology fields today. NLP is important because it works to resolve ambiguity in language and adds useful analytical structure to the data for a plethora of downstream applications such as speech recognition and text analytics. NLP helps computers communicate with humans in their own language and scales other language-centric tasks. For example, NLP makes it possible for computers to read text, listen to speech, interpret conversations, measure sentiment, and determine which segments are important. Even though budgets were hit hard by the pandemic, 53% of technical leaders said their NLP budget was at least 10% higher compared to 2019. In addition, many NLP breakthroughs are moving from research to production, with many of them coming from recent NLP research.

The last couple of years have been big for NLP, with a number of high-profile research efforts involving generative pre-training models (GPT), transfer learning, transformers (e.g., BERT, ELMo), multilingual NLP, training models with reinforcement learning, automating customer service with a new era of chatbots, NLP for social media monitoring, fake news detection, and much more.

In this article, I'll help get you up to speed with current NLP research efforts by curating a list of the top recent papers published at a variety of research destinations, including arXiv.org, the International Conference on Learning Representations (ICLR), the Stanford NLP Group, NeurIPS, and KDD. Enjoy!

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Increasing model size when pretraining natural language representations often results in improved performance on downstream tasks. However, at some point, further model increases become harder due to GPU/TPU memory limitations and longer training times. To address these problems, this paper presents two parameter-reduction techniques to lower memory consumption and increase the training speed of BERT. Comprehensive empirical evidence shows that the proposed methods lead to models that scale much better compared to the original BERT. The paper also uses a self-supervised loss that focuses on modeling inter-sentence coherence and shows that it consistently helps downstream tasks with multi-sentence inputs. As a result, the best model from this NLP research establishes new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks while having fewer parameters than BERT-large. The GitHub repo associated with this paper can be found HERE.
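
One of the two parameter-reduction techniques is factorized embedding parameterization (the other is cross-layer parameter sharing). A minimal PyTorch sketch, with illustrative sizes:

```python
import torch.nn as nn

class FactorizedEmbedding(nn.Module):
    """ALBERT-style factorized embedding (sketch): embed into a small size E,
    then project up to the hidden size H, so embedding parameters scale as
    V*E + E*H instead of V*H."""
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.project = nn.Linear(embed_dim, hidden_dim)

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))
```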

CogLTX: Applying BERT to Long Texts

BERT models are incapable of processing long texts due to quadratically increasing memory and time consumption. Attempts to address this problem, such as slicing the text with a sliding window or simplifying transformers, suffer from insufficient long-range attention or need customized CUDA kernels. The limited text length of BERT recalls the limited capacity (5 to 9 chunks) of human working memory: how, then, do human beings "Cognize Long TeXts"? Building on cognitive theory stemming from Baddeley, the CogLTX framework described in this NLP research paper identifies key sentences by training a judge model, concatenates them for reasoning, and enables multi-step reasoning via rehearsal and decay. Since relevance annotations are usually unavailable, it is proposed to use treatment experiments to create supervision. As a general algorithm, CogLTX outperforms or matches SOTA models on NewsQA, HotpotQA, multi-class, and multi-label long-text classification tasks, with memory overheads independent of the text length.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, this NLP research paper proposes a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, the new approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, the new approach trains a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by this approach substantially outperform the ones learned by BERT given the same model size, data, and compute.
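
A minimal sketch of one replaced-token-detection step, assuming `generator` and `discriminator` are token-level models and `mask` marks the positions to corrupt (the generator's own MLM loss is omitted here):

```python
import torch
import torch.nn.functional as F

def replaced_token_detection_loss(tokens, mask, generator, discriminator):
    """ELECTRA-style discriminator loss (sketch).
    tokens: (B, T) token ids; mask: (B, T) bool positions to corrupt."""
    gen_logits = generator(tokens)                       # (B, T, V)
    with torch.no_grad():                                # sample plausible replacements
        sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    corrupted = torch.where(mask, sampled, tokens)
    is_replaced = (corrupted != tokens).float()          # per-token binary targets
    disc_logits = discriminator(corrupted).squeeze(-1)   # (B, T) replaced/original scores
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```

Because the loss is computed over every input position rather than only the masked subset, each training example carries more signal, which is the source of the sample efficiency described above.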

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Large pre-trained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far only been investigated for extractive downstream tasks. This NLP research paper explores a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models which combine pre-trained parametric and non-parametric memory for language generation. RAG models are introduced where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. Two RAG formulations are compared: one conditions on the same retrieved passages across the whole generated sequence, while the other can use different passages per token.
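
In rough pseudocode, the recipe looks like the sketch below. `retriever` and `generator` are assumed components, and the final combination is simplified to keeping the answer from the best-scoring passage, whereas the paper properly marginalizes over all retrieved passages:

```python
def rag_generate(question, retriever, generator, k=5):
    """RAG-style generation (simplified sketch): retrieve k passages,
    condition the seq2seq generator on each, and keep the answer from the
    passage with the highest retrieval score."""
    passages = retriever(question, k)                  # [(passage_text, score), ...]
    candidates = [(generator(question, text), score)
                  for text, score in passages]
    return max(candidates, key=lambda c: c[1])[0]
```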

ConvBERT: Improving BERT with Span-based Dynamic Convolution

Pre-trained language models like BERT and its variants have recently achieved impressive performance in various natural language understanding tasks. However, BERT heavily relies on the global self-attention block and thus suffers from a large memory footprint and computation cost. Although all of its attention heads query the whole input sequence to generate the attention map from a global perspective, some heads only need to learn local dependencies, which means there is computational redundancy. This NLP research paper proposes a novel span-based dynamic convolution to replace these self-attention heads and directly model local dependencies. The novel convolution heads, together with the remaining self-attention heads, form a new mixed attention block that is more efficient at both global and local context learning. Equipping BERT with this mixed attention design yields the ConvBERT model. Experiments show that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training costs and fewer model parameters.

The Lottery Ticket Hypothesis for Pre-trained BERT Networks

In NLP, enormous pre-trained models like BERT have become the standard starting point for training on a range of downstream tasks, and similar trends are emerging in other areas of deep learning. In parallel, work on the lottery ticket hypothesis has shown that models for NLP and computer vision contain smaller matching subnetworks capable of training in isolation to full accuracy and transferring to other tasks. The work in this paper combines these observations to assess whether such trainable, transferable subnetworks exist in pre-trained BERT models. For a range of downstream tasks, matching subnetworks at 40% to 90% sparsity are found. These subnetworks are found at (pre-trained) initialization, a deviation from prior NLP research where they emerge only after some amount of training. Subnetworks found on the masked language modeling task (the same task used to pre-train the model) transfer universally; those found on other tasks transfer in a limited fashion if at all. As large-scale pre-training becomes an increasingly central paradigm in deep learning, the results demonstrate that the main lottery ticket observations remain relevant in this context. The GitHub repo associated with this paper can be found HERE.
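
Such subnetworks are typically found with magnitude pruning. A one-shot global-magnitude sketch follows; the lottery-ticket procedure itself prunes iteratively and rewinds the surviving weights to the pre-trained initialization:

```python
import torch

def magnitude_mask(model, sparsity=0.6):
    """Build a lottery-ticket-style mask (sketch): zero out the `sparsity`
    fraction of smallest-magnitude weights across all weight matrices."""
    flat = torch.cat([p.detach().abs().flatten()
                      for p in model.parameters() if p.dim() > 1])
    threshold = flat.kthvalue(int(sparsity * flat.numel())).values
    return {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters() if p.dim() > 1}
```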

BERT Loses Patience: Fast and Robust Inference with Early Exit

This NLP research paper proposes Patience-based Early Exit, a straightforward yet effective inference method that can be used as a plug-and-play technique to simultaneously improve the efficiency and robustness of a pretrained language model (PLM). To achieve this, the approach couples an internal classifier with each layer of a PLM and dynamically stops inference when the intermediate predictions of the internal classifiers do not change for a pre-defined number of steps. The approach improves inference efficiency as it allows the model to predict with fewer layers. Meanwhile, experimental results with an ALBERT model show that the method can improve the accuracy and robustness of the model by preventing it from overthinking and by exploiting multiple classifiers for prediction, yielding a better accuracy-speed trade-off compared to existing early exit methods.
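
The control flow is simple enough to sketch directly. Here `layers` and `classifiers` are assumed to be the PLM's transformer layers and their attached internal classifiers, and the stability check is done per batch for brevity (the method tracks each example):

```python
import torch

def patience_early_exit(hidden, layers, classifiers, patience=3):
    """Stop inference once the prediction has stayed unchanged across
    `patience` consecutive internal classifiers (sketch). hidden: (B, T, H)."""
    prev_pred, streak = None, 0
    for layer, clf in zip(layers, classifiers):
        hidden = layer(hidden)
        pred = clf(hidden.mean(dim=1)).argmax(dim=-1)  # pooled per-example labels
        if prev_pred is not None and torch.equal(pred, prev_pred):
            streak += 1
        else:
            streak = 0
        prev_pred = pred
        if streak >= patience:
            break  # prediction is stable: skip the remaining layers
    return prev_pred
```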

The Curious Case of Neural Text Degeneration

Despite considerable advancements with deep neural language models, the enigma of neural text degeneration persists when these models are tested as text generators. The counter-intuitive empirical observation is that even though using likelihood as a training objective leads to high-quality models for a broad range of language understanding tasks, using likelihood as a decoding objective leads to text that is bland and strangely repetitive. This NLP research paper reveals surprising distributional differences between human text and machine text. In addition, it finds that decoding strategies alone can dramatically affect the quality of machine text, even when generated from exactly the same neural language model. The findings motivate Nucleus Sampling, a simple but effective method to draw the best out of neural generation. By sampling text from the dynamic nucleus of the probability distribution, which allows for diversity while effectively truncating the less reliable tail of the distribution, the resulting text better matches the quality of human text, yielding enhanced diversity without sacrificing fluency and coherence.
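
Nucleus (top-p) sampling is easy to implement; a minimal sketch for a single decoding step, given the model's logits over the vocabulary:

```python
import torch

def nucleus_sample(logits, p=0.9):
    """Sample a token id from the smallest set of tokens whose cumulative
    probability exceeds p (Nucleus / top-p Sampling sketch)."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cutoff = int((torch.cumsum(sorted_probs, dim=-1) < p).sum()) + 1
    nucleus = sorted_probs[:cutoff] / sorted_probs[:cutoff].sum()  # renormalize
    choice = torch.multinomial(nucleus, num_samples=1)
    return sorted_idx[choice].item()
```

Because the cutoff adapts to the shape of the distribution at each step, confident steps sample from a few tokens while uncertain steps keep many, which is the "dynamic nucleus" described above.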

Encoding word order in complex embeddings

Sequential word order is important when processing text. Currently, neural networks (NNs) address this by modeling word position using position embeddings. The problem is that position embeddings capture the position of individual words, but not the ordered relationship (e.g., adjacency or precedence) between individual word positions. This NLP research paper presents a novel and principled solution for modeling both the global absolute positions of words and their order relationships. The solution generalizes word embeddings, previously defined as independent vectors, to continuous word functions over a variable (position). The benefit of continuous functions over variable positions is that word representations shift smoothly with increasing positions. Hence, word representations in different positions can correlate with each other in a continuous function. The general solution of these functions is extended to a complex-valued domain due to richer representations. CNN, RNN, and Transformer NNs are extended to complex-valued versions to incorporate complex embedding. 
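
The core construction can be sketched in a few lines: each word carries learned amplitude, frequency, and phase vectors, and its representation at position pos is amplitude * e^{i(frequency * pos + phase)}, so nearby positions yield smoothly related vectors. The parameter names here are illustrative, not the paper's notation:

```python
import torch

def complex_word_embedding(amplitude, frequency, phase, position):
    """Word-as-continuous-function embedding (sketch). amplitude, frequency,
    and phase are learned per-word vectors; position is a scalar index."""
    angle = frequency * position + phase
    return amplitude * torch.complex(torch.cos(angle), torch.sin(angle))
```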

Stanza: A Python Natural Language Processing Toolkit for Many Human Languages

This paper introduces Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic, fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. Stanza was trained on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and the authors show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. The GitHub repo associated with this NLP research paper, along with source code, documentation, and pretrained models for 66 languages, can be found HERE.
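
Based on the toolkit's documented interface, a typical pipeline looks like this (models are downloaded on first use):

```python
import stanza

stanza.download("en")  # fetch English models (one-time)
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse,ner")

doc = nlp("Stanza was built by the Stanford NLP Group.")
for sentence in doc.sentences:
    for word in sentence.words:
        print(word.text, word.lemma, word.upos, word.deprel)
    print([(ent.text, ent.type) for ent in sentence.ents])
```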

Mogrifier LSTM

Many advances in NLP have been based upon more expressive models for how inputs interact with the context in which they occur. Recurrent networks, which have enjoyed a modicum of success, still lack the generalization and systematicity ultimately required for modeling language. This NLP research paper proposes an extension to the venerable Long Short-Term Memory (LSTM) in the form of mutual gating of the current input and the previous output. This mechanism affords the modeling of a richer space of interactions between inputs and their context. Equivalently, the model can be viewed as making the transition function given by the LSTM context-dependent. 
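
The mutual gating itself is compact: before the usual LSTM update, the input and the previous state repeatedly gate each other in alternation. A sketch, with the number of rounds illustrative rather than the paper's tuned value:

```python
import torch
import torch.nn as nn

class Mogrifier(nn.Module):
    """Mogrifier gating (sketch): alternately modulate the input x and the
    previous hidden state h before feeding both to a standard LSTM cell."""
    def __init__(self, dim, rounds=4):
        super().__init__()
        self.linears = nn.ModuleList(nn.Linear(dim, dim, bias=False)
                                     for _ in range(rounds))

    def forward(self, x, h):
        for i, lin in enumerate(self.linears):
            if i % 2 == 0:
                x = 2 * torch.sigmoid(lin(h)) * x   # h gates x
            else:
                h = 2 * torch.sigmoid(lin(x)) * h   # x gates h
        return x, h
```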

DeFINE: Deep Factorized Input Token Embeddings for Neural Sequence Modeling

For sequence models with large vocabularies, a majority of network parameters lie in the input and output layers. This NLP research paper describes a new method, DeFINE, for learning deep token representations efficiently. The architecture uses a hierarchical structure with novel skip-connections which allows for the use of low dimensional input and output layers, reducing total parameters and training time while delivering similar or better performance versus existing methods. DeFINE can be incorporated easily in new or existing sequence models. Compared to state-of-the-art methods including adaptive input representations, this technique results in a 6% to 20% drop in perplexity. 

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. This paper proposes a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, it is applied to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the fine-tuning stage, it is able to improve the overall test score of the BERT-base model from 78.3 to 79.4, and that of the RoBERTa-large model from 88.5 to 88.8. The GitHub repo associated with this paper can be found HERE.
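
A rough sketch of the objective side, assuming `loss_fn(embeds, labels)` is a closure over the rest of the model; the real algorithm normalizes and projects the perturbation per example, which is simplified to a clamp here:

```python
import torch

def freelb_loss(embeds, labels, loss_fn, steps=3, alpha=0.1, eps=1.0):
    """FreeLB-style objective (sketch): take several gradient-ascent steps on
    an embedding perturbation delta, averaging the loss across steps."""
    delta = torch.zeros_like(embeds, requires_grad=True)
    total = 0.0
    for _ in range(steps):
        loss = loss_fn(embeds + delta, labels)
        total = total + loss / steps                       # accumulate "free" losses
        grad, = torch.autograd.grad(loss, delta, retain_graph=True)
        with torch.no_grad():
            delta += alpha * grad / (grad.norm() + 1e-12)  # ascent step on delta
            delta.clamp_(-eps, eps)                        # crude projection
    return total  # backpropagate this through the model as usual
```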

Dynabench: Rethinking Benchmarking in NLP

This paper introduces Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. It is argued that Dynabench addresses a critical need in the NLP community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. The paper reports on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and addresses potential objections to dynamic benchmarking as a new standard for the field.

Causal Effects of Linguistic Properties

This paper considers the problem of using observational data to estimate the causal effects of linguistic properties. For example, does writing a complaint politely lead to a faster response time? How much will a positive product review increase sales? The paper addresses two technical challenges before developing a practical method. First, it formalizes the causal quantity of interest as the effect of a writer's intent and establishes the assumptions necessary to identify this from observational data. Second, in practice only noisy proxies for the linguistic properties of interest are available, e.g., predictions from classifiers and lexicons. An estimator is proposed for this setting, with proof that its bias is bounded when an adjustment is performed for the text. Based on these results, TEXTCAUSE is introduced, an algorithm for estimating the causal effects of linguistic properties. The method leverages (1) distant supervision to improve the quality of noisy proxies, and (2) a pre-trained language model (BERT) to adjust for the text. It is shown that the proposed method outperforms related approaches when estimating the effect of Amazon review sentiment on semi-simulated sales figures.

LM-Critic: Language Models for Unsupervised Grammatical Error Correction

Training a model for grammatical error correction (GEC) requires a set of labeled ungrammatical/grammatical sentence pairs, but manually annotating such pairs can be expensive. Recently, the Break-It-Fix-It (BIFI) framework has demonstrated strong results on learning to repair a broken program without any labeled examples, but this relies on a perfect critic (e.g., a compiler) that returns whether an example is valid or not, which does not exist for the GEC task. This paper shows how to leverage a pretrained language model (LM) in defining an LM-Critic, which judges a sentence to be grammatical if the LM assigns it a higher probability than its local perturbations. The LM-Critic and BIFI are applied, along with a large set of unlabeled sentences, to bootstrap realistic ungrammatical/grammatical pairs for training a corrector.
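
The critic itself reduces to a single comparison. In the sketch below, `log_prob` (LM scoring) and `perturb` (generating the local edit neighbourhood) are assumed helpers:

```python
def lm_critic(sentence, log_prob, perturb):
    """LM-Critic (sketch): a sentence is judged grammatical iff the language
    model assigns it a higher probability than every local perturbation."""
    score = log_prob(sentence)
    return all(score > log_prob(neighbour) for neighbour in perturb(sentence))
```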

Generative Adversarial Transformers

This paper introduces the GANformer, a novel and efficient type of transformer, and explores it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image while maintaining linear computational efficiency, so it can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. The model's strength and robustness are demonstrated through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing that it achieves state-of-the-art results in terms of image quality and diversity while enjoying fast learning and better data efficiency. The GitHub repo associated with this paper can be found HERE.

Learn More About NLP and NLP Research at ODSC West 2021

At our upcoming event this November 16th-18th in San Francisco, ODSC West 2021 will feature a plethora of talks, workshops, and training sessions on NLP and NLP research. You can register now for 30% off all ticket types before the discount drops to 20% in a few weeks. Some highlighted sessions on NLP and NLP research include:

  • Transferable Representation in Natural Language Processing: Kai-Wei Chang, PhD | Director/Assistant Professor | UCLA NLP/UCLA CS
  • Build a Question Answering System using DistilBERT in Python: Jayeeta Putatunda | Data Scientist | MediaMath
  • Introduction to NLP and Topic Modeling: Zhenya Antić, PhD | NLP Consultant/Founder | Practical Linguistics Inc
  • NLP Fundamentals: Leonardo De Marchi | Lead Instructor | ideai.io

Sessions on Deep Learning and Deep Learning Research:

  • GANs: Theory and Practice, Image Synthesis With GANs Using TensorFlow: Ajay Baranwal | Center Director | Center for Deep Learning in Electronic Manufacturing, Inc
  • Machine Learning With Graphs: Going Beyond Tabular Data: Dr. Clair J. Sullivan | Data Science Advocate | Neo4j
  • Deep Dive into Reinforcement Learning with PPO using TF-Agents & TensorFlow 2.0: Oliver Zeigermann | Software Developer | embarc Software Consulting GmbH
  • Get Started with Time-Series Forecasting using the Google Cloud AI Platform: Karl Weinmeister | Developer Relations Engineering Manager | Google

Sessions on Machine Learning:

  • Towards More Energy-Efficient Neural Networks? Use Your Brain!: Olaf de Leeuw | Data Scientist | Dataworkz
  • Practical MLOps: Automation Journey: Evgenii Vinogradov, PhD | Head of DHW Development | YooMoney
  • Applications of Modern Survival Modeling with Python: Brian Kent, PhD | Data Scientist | Founder The Crosstab Kite
  • Using Change Detection Algorithms for Detecting Anomalous Behavior in Large Systems: Veena Mendiratta, PhD | Adjunct Faculty, Network Reliability, and Analytics Researcher | Northwestern University

Daniel Gutierrez, ODSC

Daniel D. Gutierrez is a practicing data scientist who’s been working with data long before the field came in vogue. As a technology journalist, he enjoys keeping a pulse on this fast-paced industry. Daniel is also an educator having taught data science, machine learning and R classes at the university level. He has authored four computer industry books on database and data science technology, including his most recent title, “Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R.” Daniel holds a BS in Mathematics and Computer Science from UCLA.

Natural Language Processing

Natural Language Processing (NLP) research at Google focuses on algorithms that apply at scale, across languages, and across domains. Our systems are used in numerous ways across Google, impacting user experience in search, mobile, apps, ads, translate and more.

Our work spans the range of traditional NLP tasks, with general-purpose syntax and semantic algorithms underpinning more specialized systems. We are particularly interested in algorithms that scale well and can be run efficiently in a highly distributed environment.

Our syntactic systems predict part-of-speech tags for each word in a given sentence, as well as morphological features such as gender and number. They also label relationships between words, such as subject, object, modification, and others. We focus on efficient algorithms that leverage large amounts of unlabeled data, and recently have incorporated neural net technology.

On the semantic side, we identify entities in free text, label them with types (such as person, location, or organization), cluster mentions of those entities within and across documents (coreference resolution), and resolve the entities to the Knowledge Graph.
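
As a concrete (third-party) illustration of these annotation layers, here is what part-of-speech tags, morphology, labeled dependencies, and typed entity mentions look like with the open-source spaCy library (our tooling choice for the example; it is not Google's internal stack):

```python
# Minimal illustration of syntactic and semantic annotations with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Sundar Pichai announced new translation features in Mountain View.")

# Syntax: part-of-speech tags, morphological features, and dependency labels.
for token in doc:
    print(token.text, token.pos_, token.morph, token.dep_, token.head.text)

# Semantics: entity mentions labeled with types such as PERSON or GPE.
for ent in doc.ents:
    print(ent.text, ent.label_)
```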

Recent work has focused on incorporating multiple sources of knowledge and information to aid with analysis of text, as well as applying frame semantics at the noun phrase, sentence, and document level.



GPT-3 & Beyond: 10 NLP Research Papers You Should Read

November 17, 2020 by Mariya Yao


NLP research advances in 2020 are still dominated by large pre-trained language models, and specifically transformers. Many interesting updates introduced this year have made the transformer architecture more efficient and applicable to long documents.

Another hot topic relates to the evaluation of NLP models in different applications. We still lack evaluation approaches that clearly show where a model fails and how to fix it.

Also, with the growing capabilities of language models such as GPT-3, conversational AI is enjoying a new wave of interest. Chatbots are improving, with several impressive bots like Meena and Blender introduced this year by top technology companies.

To help you stay up to date with the latest NLP research breakthroughs, we’ve curated and summarized the key research papers in natural language processing from 2020. The papers cover the leading language models, updates to the transformer architecture, novel evaluation approaches, and major advances in conversational AI.

Subscribe to our AI Research mailing list at the bottom of this article to be alerted when we release new summaries.

If you’d like to skip around, here are the papers we featured:

  • WinoGrande: An Adversarial Winograd Schema Challenge at Scale
  • Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
  • Reformer: The Efficient Transformer
  • Longformer: The Long-Document Transformer
  • ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
  • Language Models are Few-Shot Learners
  • Beyond Accuracy: Behavioral Testing of NLP models with CheckList
  • Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics
  • Towards a Human-like Open-Domain Chatbot
  • Recipes for Building an Open-Domain Chatbot

Best NLP Research Papers 2020

1. WinoGrande: An Adversarial Winograd Schema Challenge at Scale, by Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, Yejin Choi

Original Abstract

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. 

To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. 

Furthermore, we establish new state-of-the-art results on five related benchmarks – WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.

Our Summary 

The research group from the Allen Institute for Artificial Intelligence introduces WinoGrande , a new benchmark for commonsense reasoning. They build on the design of the famous Winograd Schema Challenge (WSC) benchmark but significantly increase the scale of the dataset to 44K problems and reduce systematic bias using a novel AfLite algorithm. The experiments demonstrate that state-of-the-art methods achieve up to 79.1% accuracy on WinoGrande, which is significantly below the human performance of 94%. Furthermore, the researchers show that WinoGrande is an effective resource for transfer learning, by using a RoBERTa model fine-tuned with WinoGrande to achieve new state-of-the-art results on WSC and four other related benchmarks.

NLP research paper - WinoGrande

What’s the core idea of this paper?

  • The authors claim that existing benchmarks for commonsense reasoning suffer from systematic bias and annotation artifacts, leading to overestimation of the true capabilities of machine intelligence on commonsense reasoning.
  • Crowdworkers were asked to write twin sentences that meet the WSC requirements and contain certain anchor words. This new requirement is aimed at improving the creativity of crowdworkers.
  • Collected problems were validated through a distinct set of three crowdworkers. Out of 77K collected questions, 53K were deemed valid.
  • The novel AfLite algorithm for systematic bias reduction generalizes human-detectable biases based on word occurrences to machine-detectable biases based on embedding occurrences (see the sketch after this list).
  • After applying the AfLite algorithm, the debiased WinoGrande dataset contains 44K samples. 
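
Since AfLite is central to the paper, here is a rough, hypothetical sketch of the idea in Python (our reconstruction under stated assumptions, not the authors' released implementation): an ensemble of weak linear probes is repeatedly fit on precomputed embeddings, and instances the probes classify too easily are filtered out as likely artifacts.

```python
# AfLite-style adversarial filtering sketch. Assumes precomputed sentence
# embeddings X (n, d) and labels y (n,); hyperparameter values are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def aflite(X, y, n_iters=10, n_ensemble=64, train_frac=0.1, cut_size=500, tau=0.75):
    """Iteratively remove instances that linear probes classify too easily."""
    keep = np.arange(len(y))
    for _ in range(n_iters):
        correct = np.zeros(len(keep))
        seen = np.zeros(len(keep))
        for _ in range(n_ensemble):
            # Fit a weak linear probe on a small random split of the embeddings.
            tr, te = train_test_split(np.arange(len(keep)), train_size=train_frac)
            clf = LogisticRegression(max_iter=1000).fit(X[keep[tr]], y[keep[tr]])
            correct[te] += clf.predict(X[keep[te]]) == y[keep[te]]
            seen[te] += 1
        predictability = correct / np.maximum(seen, 1)
        # Drop the most predictable instances above the threshold tau.
        order = np.argsort(-predictability)
        drop = order[predictability[order] > tau][:cut_size]
        if len(drop) == 0:
            break
        keep = np.delete(keep, drop)
    return keep  # indices of the debiased subset
```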

What’s the key achievement?

  • Demonstrating the difficulty of the new benchmark: Wino Knowledge Hunting (WKH) and ensemble LMs achieve only chance-level performance (50%), RoBERTa achieves 79.1% test-set accuracy, while human performance reaches 94% accuracy.
  • Establishing new state-of-the-art results on five related benchmarks by using WinoGrande as a resource for transfer learning:
  • 90.1% on WSC;
  • 93.1% on DPR;
  • 90.6% on COPA;
  • 85.6% on KnowRef; and
  • 97.1% on Winogender.

What does the AI community think?

  • The paper received the Outstanding Paper Award at AAAI 2020, one of the key conferences in artificial intelligence.

What are future research areas?

  • Exploring new algorithmic approaches for systematic bias reduction.
  • Debiasing other NLP benchmarks.

Where can you get implementation code?

  • The dataset can be downloaded from the WinoGrande project page.
  • The implementation code is available on GitHub.
  • And here is the WinoGrande leaderboard.


2. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts every language problem into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new “Colossal Clean Crawled Corpus”, we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our dataset, pre-trained models, and code.

The Google research team suggests a unified approach to transfer learning in NLP with the goal to set a new state of the art in the field. To this end, they propose treating each NLP problem as a “text-to-text” problem. Such a framework allows using the same model, objective, training procedure, and decoding process for different tasks, including summarization, sentiment analysis, question answering, and machine translation. The researchers call their model a Text-to-Text Transfer Transformer (T5) and train it on the large corpus of web-scraped data to get state-of-the-art results on a number of NLP tasks.
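
To make the text-to-text interface concrete, here is a minimal sketch using the Hugging Face port of T5 (the tooling choice is ours; the authors' own release lives in the google-research/text-to-text-transfer-transformer repository):

```python
# Sketch of T5's text-to-text interface with the "t5-small" checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is selected purely by the text prefix; the model, objective, and
# decoding procedure stay identical across tasks.
for prompt in [
    "translate English to German: The house is wonderful.",
    "summarize: NLP research in 2020 was dominated by large pre-trained transformers.",
]:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```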

T5 language model

  • Providing a comprehensive perspective on where the NLP field stands by exploring and comparing existing techniques.
  • The model understands which task should be performed thanks to the task-specific prefix added to the original input sentence (e.g., “translate English to German:”, “summarize:”).
  • Presenting and releasing a new dataset consisting of hundreds of gigabytes of clean web-scraped English text, the Colossal Clean Crawled Corpus (C4) .
  • Training a large (up to 11B parameters) model, called Text-to-Text Transfer Transformer (T5) on the C4 dataset.
  • Achieving state-of-the-art results on many benchmarks, including:
  • the GLUE score of 89.7 with substantially improved performance on CoLA, RTE, and WNLI tasks;
  • the Exact Match score of 90.06 on the SQuAD dataset;
  • the SuperGLUE score of 88.9, which is a very significant improvement over the previous state-of-the-art result (84.6) and very close to human performance (89.8);
  • the ROUGE-2-F score of 21.55 on the CNN/Daily Mail abstractive summarization task.
  • Researching the methods to achieve stronger performance with cheaper models.
  • Exploring more efficient knowledge extraction techniques.
  • Further investigating the language-agnostic models.

What are possible business applications?

  • Even though the introduced model has billions of parameters and can be too heavy to be applied in the business setting, the presented ideas can be used to improve the performance on different NLP tasks, including summarization, question answering, and sentiment analysis.
  • The pretrained models, together with the dataset and code, are released on GitHub.

3. Reformer: The Efficient Transformer, by Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O(L²) to O(L log L), where L is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of N times, where N is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.

The leading Transformer models have become so big that they can be realistically trained only in large research laboratories. To address this problem, the Google Research team introduces several techniques that improve the efficiency of Transformers. In particular, they suggest (1) using reversible layers to allow storing the activations only once instead of for each layer, and (2) using locality-sensitive hashing to avoid costly softmax computation in the case of full dot-product attention. Experiments on several text tasks demonstrate that the introduced Reformer model matches the performance of the full Transformer but runs much faster and with much better memory efficiency.
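
As a rough intuition for the hashing trick, here is a toy sketch (a deliberate simplification on our part; the real model shares query/key projections, uses multiple hash rounds, and lets chunks attend to their neighbors): vectors are bucketed with random rotations, sorted so that similar vectors become adjacent, and attention is computed only within small chunks.

```python
# Toy LSH-bucketed attention in the spirit of the Reformer.
import torch

def lsh_buckets(x, n_buckets, seed=0):
    """Assign each vector to a bucket via random-rotation (angular) LSH."""
    torch.manual_seed(seed)
    R = torch.randn(x.shape[-1], n_buckets // 2)
    h = x @ R
    return torch.argmax(torch.cat([h, -h], dim=-1), dim=-1)

def chunked_lsh_attention(qk, v, n_buckets=8, chunk=16):
    """Sort by bucket, then attend only within fixed-size chunks."""
    order = torch.argsort(lsh_buckets(qk, n_buckets))  # group similar vectors
    qk_s, v_s = qk[order], v[order]
    out = torch.zeros_like(v_s)
    for i in range(0, len(qk_s), chunk):
        q = qk_s[i:i + chunk]
        attn = torch.softmax(q @ q.T / q.shape[-1] ** 0.5, dim=-1)
        out[i:i + chunk] = attn @ v_s[i:i + chunk]
    return out[torch.argsort(order)]  # undo the sort

x = torch.randn(64, 32)  # 64 tokens, dim 32 (shared QK as in the Reformer)
v = torch.randn(64, 32)
print(chunked_lsh_attention(x, v).shape)  # torch.Size([64, 32])
```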

Reformer - NLP

  • The paper identifies the main sources of the Transformer’s high memory consumption:
  • the activations of every layer need to be stored for back-propagation;
  • the intermediate feed-forward layers account for a large fraction of memory use since their depth is often much larger than the depth of attention activations;
  • the complexity of attention on a sequence of length L is O(L²).
  • To address these issues, the Reformer model suggests:
  • using reversible layers to store only a single copy of activations;
  • splitting activations inside the feed-forward layers and processing them in chunks;
  • approximating attention computation based on locality-sensitive hashing.
  • The experiments demonstrate that performance is barely affected by:
  • switching to locality-sensitive hashing attention;
  • using reversible layers.
  • For example, on the newstest2014 task for machine translation from English to German, the Reformer base model gets a BLEU score of 27.6 compared to Vaswani et al.’s (2017) BLEU score of 27.3.
  • The paper was selected for oral presentation at ICLR 2020, the leading conference in deep learning.
  • Thanks to its efficiency on long sequences, the Reformer could be applied to generative tasks such as:
  • text generation;
  • visual content generation;
  • music generation;
  • time-series forecasting.
  • The official code implementation from Google is publicly available on GitHub.
  • The PyTorch implementation of Reformer is also available on GitHub.

4. Longformer: The Long-Document Transformer, by Iz Beltagy, Matthew E. Peters, Arman Cohan

Transformer-based models are unable to process long sequences due to their self-attention operation, which scales quadratically with the sequence length. To address this limitation, we introduce the Longformer with an attention mechanism that scales linearly with sequence length, making it easy to process documents of thousands of tokens or longer. Longformer’s attention mechanism is a drop-in replacement for the standard self-attention and combines a local windowed attention with a task motivated global attention. Following prior work on long-sequence transformers, we evaluate Longformer on character-level language modeling and achieve state-of-the-art results on text8 and enwik8. In contrast to most prior work, we also pretrain Longformer and finetune it on a variety of downstream tasks. Our pretrained Longformer consistently outperforms RoBERTa on long document tasks and sets new state-of-the-art results on WikiHop and TriviaQA.

Self-attention is one of the key factors behind the success of Transformer architecture. However, it also makes transformer-based models hard to apply to long documents. The existing techniques usually divide the long input into a number of chunks and then use complex architectures to combine information across these chunks. The research team from the Allen Institute for Artificial Intelligence introduces a more elegant solution to this problem. The suggested Longformer model employs an attention pattern that combines local windowed attention with task-motivated global attention. This attention mechanism scales linearly with the sequence length and enables processing of documents with thousands of tokens. The experiments demonstrate that Longformer achieves state-of-the-art results on character-level language modeling tasks, and when pre-trained, consistently outperforms RoBERTa on long-document tasks.
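
A toy dense version of the attention pattern is sketched below (for illustration only; the whole point of the real implementation is to avoid materializing the full L×L matrix, which is why the custom CUDA kernel mentioned in the bullets exists):

```python
# Toy Longformer-style attention mask: local sliding window + global tokens.
import torch

def longformer_mask(seq_len, window, global_idx):
    """Allow attention within a local window plus a few global positions."""
    i = torch.arange(seq_len)
    mask = (i[:, None] - i[None, :]).abs() <= window // 2  # sliding window
    mask[:, global_idx] = True  # everyone attends to global tokens
    mask[global_idx, :] = True  # global tokens attend everywhere
    return mask

def masked_attention(q, k, v, mask):
    scores = q @ k.T / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

L, d = 128, 32
q = k = v = torch.randn(L, d)
mask = longformer_mask(L, window=16, global_idx=[0])  # e.g., [CLS] is global
print(masked_attention(q, k, v, mask).shape)  # torch.Size([128, 32])
```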

Longformer - NLP

  • The computational requirements of self-attention grow quadratically with sequence length, making long sequences hard to process on current hardware.
  • The suggested attention pattern allows memory usage to scale linearly, and not quadratically, with the sequence length by combining:
  • a windowed local-context self-attention to build contextual representations;
  • an end-task-motivated global attention to encode inductive bias about the task and build full-sequence representations.
  • Since the implementation of the sliding window attention pattern requires a form of banded matrix multiplication that is not supported in existing deep learning libraries like PyTorch and TensorFlow, the authors also introduce a custom CUDA kernel for implementing these attention operations.
  • Achieving state-of-the-art results on character-level language modeling:
  • BPC of 1.10 on text8;
  • BPC of 1.00 on enwik8.
  • Consistently outperforming RoBERTa on long-document tasks after pre-training:
  • accuracy of 75.0 vs. 72.4 on WikiHop;
  • F1 score of 75.2 vs. 74.2 on TriviaQA;
  • joint F1 score of 64.4 vs. 63.5 on HotpotQA;
  • average F1 score of 78.6 vs. 78.4 on the OntoNotes coreference resolution task;
  • accuracy of 95.7 vs. 95.3 on the IMDB classification task;
  • F1 score of 94.0 vs. 87.4 on the Hyperpartisan classification task.
  • The performance gains are especially remarkable for the tasks that require a long context (i.e., WikiHop and Hyperpartisan).
  • Exploring other attention patterns that are more efficient due to dynamic adaptation to the input. 
  • Applying Longformer to other relevant long document tasks such as summarization.
  • Longformer’s ability to process long documents could benefit downstream tasks such as:
  • document classification;
  • question answering;
  • coreference resolution;
  • summarization;
  • semantic search.
  • The code implementation of Longformer is open-sourced on GitHub.

5. ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators, by Kevin Clark, Minh-Thang Luong, Quoc V. Le, Christopher D. Manning

Masked language modeling (MLM) pre-training methods such as BERT corrupt the input by replacing some tokens with [MASK] and then train a model to reconstruct the original tokens. While they produce good results when transferred to downstream NLP tasks, they generally require large amounts of compute to be effective. As an alternative, we propose a more sample-efficient pre-training task called replaced token detection. Instead of masking the input, our approach corrupts it by replacing some tokens with plausible alternatives sampled from a small generator network. Then, instead of training a model that predicts the original identities of the corrupted tokens, we train a discriminative model that predicts whether each token in the corrupted input was replaced by a generator sample or not. Thorough experiments demonstrate this new pre-training task is more efficient than MLM because the task is defined over all input tokens rather than just the small subset that was masked out. As a result, the contextual representations learned by our approach substantially outperform the ones learned by BERT given the same model size, data, and compute. The gains are particularly strong for small models; for example, we train a model on one GPU for 4 days that outperforms GPT (trained using 30× more compute) on the GLUE natural language understanding benchmark. Our approach also works well at scale, where it performs comparably to RoBERTa and XLNet while using less than 1/4 of their compute and outperforms them when using the same amount of compute.

The pre-training task for popular language models like BERT and XLNet involves masking a small subset of unlabeled input and then training the network to recover this original input. Even though it works quite well, this approach is not particularly data-efficient as it learns from only a small fraction of tokens (typically ~15%). As an alternative, the researchers from Stanford University and Google Brain propose a new pre-training task called replaced token detection . Instead of masking, they suggest replacing some tokens with plausible alternatives generated by a small language model. Then, the pre-trained discriminator is used to predict whether each token is an original or a replacement. As a result, the model learns from all input tokens instead of the small masked fraction, making it much more computationally efficient. The experiments confirm that the introduced approach leads to significantly faster training and higher accuracy on downstream NLP tasks.
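
The objective can be summarized in a few lines of schematic code (a toy with random tensors standing in for the generator and discriminator networks; real training ties the two models together and optimizes both losses jointly):

```python
# Schematic sketch of ELECTRA's replaced-token-detection objective.
import torch
import torch.nn.functional as F

vocab, L = 1000, 12
tokens = torch.randint(0, vocab, (L,))

# 1) Mask ~15% of positions and let a small generator propose replacements.
mask_pos = torch.rand(L) < 0.15
generator_logits = torch.randn(L, vocab)  # stand-in for a small MLM generator
samples = torch.multinomial(F.softmax(generator_logits, dim=-1), 1).squeeze(-1)
corrupted = torch.where(mask_pos, samples, tokens)

# 2) The discriminator is trained on *every* position: was this token replaced?
is_replaced = (corrupted != tokens).float()
disc_logits = torch.randn(L)  # stand-in for the discriminator's outputs
loss = F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
print(loss)
```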

ELECTRA - NLP

  • Pre-training methods that are based on masked language modeling are computationally inefficient as they use only a small fraction of tokens for learning.
  • In the proposed replaced token detection task:
  • some tokens are replaced by samples from a small generator network;
  • a model is pre-trained as a discriminator to distinguish between original and replaced tokens.
  • This pre-training task:
  • enables the model to learn from all input tokens instead of the small masked-out subset;
  • is not adversarial, despite the similarity to GAN, as the generator producing tokens for replacement is trained with maximum likelihood.
  • Demonstrating that the discriminative task of distinguishing between real data and challenging negative samples is more efficient than existing generative methods for language representation learning.
  • ELECTRA-Small gets a GLUE score of 79.9 and outperforms a comparably small BERT model with a score of 75.1 and a much larger GPT model with a score of 78.8.
  • An ELECTRA model that performs comparably to XLNet and RoBERTa uses only 25% of their pre-training compute.
  • ELECTRA-Large outscores the alternative state-of-the-art models on the GLUE and SQuAD benchmarks while still requiring less pre-training compute.
  • The paper was selected for presentation at ICLR 2020, the leading conference in deep learning.
  • Because of its computational efficiency, the ELECTRA approach can make the application of pre-trained text encoders more accessible to business practitioners.
  • The original TensorFlow implementation and pre-trained weights are released on GitHub.

6. Language Models are Few-Shot Learners, by Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions – something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10× more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3’s few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

The OpenAI research team draws attention to the fact that the need for a labeled dataset for every new language task limits the applicability of language models. Considering that there is a wide range of possible tasks and it’s often difficult to collect a large labeled training dataset, the researchers suggest an alternative solution, which is scaling up language models to improve task-agnostic few-shot performance. They test their solution by training a 175B-parameter autoregressive language model, called GPT-3 , and evaluating its performance on over two dozen NLP tasks. The evaluation under few-shot learning, one-shot learning, and zero-shot learning demonstrates that GPT-3 achieves promising results and even occasionally outperforms the state of the art achieved by fine-tuned models.
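
Since all tasks are specified purely via text interaction, few-shot evaluation boils down to prompt construction. Here is a minimal sketch (the prompt format and the query_model stand-in are our illustration, not OpenAI's API):

```python
# Few-shot prompting: task demonstrations are given in-context, with no
# gradient updates. query_model is a hypothetical stand-in for an LM API.
def build_few_shot_prompt(instruction, demos, query):
    lines = [instruction, ""]
    for x, y in demos:  # K demonstrations (typically 10 to 100 in the paper)
        lines.append(f"Q: {x}\nA: {y}\n")
    lines.append(f"Q: {query}\nA:")  # the model completes the answer
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Unscramble the letters into a word.",
    [("lpepa", "apple"), ("nanaab", "banana")],
    "rgaep",
)
print(prompt)  # feed to query_model(prompt); no fine-tuning involved
```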

GPT-3

  • The GPT-3 model uses the same model and architecture as GPT-2, including the modified initialization, pre-normalization, and reversible tokenization.
  • However, in contrast to GPT-2, it uses alternating dense and locally banded sparse attention patterns in the layers of the transformer, as in the Sparse Transformer.
  • GPT-3 is evaluated in three settings:
  • Few-shot learning, when the model is given a few demonstrations of the task (typically, 10 to 100) at inference time but with no weight updates allowed.
  • One-shot learning, when only one demonstration is allowed, together with a natural language description of the task.
  • Zero-shot learning, when no demonstrations are allowed and the model has access only to a natural language description of the task.
  • On the CoQA benchmark, 81.5 F1 in the zero-shot setting, 84.0 F1 in the one-shot setting, and 85.0 F1 in the few-shot setting, compared to the 90.7 F1 score achieved by fine-tuned SOTA.
  • On the TriviaQA benchmark, 64.3% accuracy in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting, surpassing the state of the art (68%) by 3.2%.
  • On the LAMBADA dataset, 76.2 % accuracy in the zero-shot setting, 72.5% in the one-shot setting, and 86.4% in the few-shot setting, surpassing the state of the art (68%) by 18%.
  • The news articles generated by the 175B-parameter GPT-3 model are hard to distinguish from real ones, according to human evaluations (with accuracy barely above the chance level at ~52%).
  • “The GPT-3 hype is way too much. It’s impressive (thanks for the nice compliments!) but it still has serious weaknesses and sometimes makes very silly mistakes. AI is going to change the world, but GPT-3 is just a very early glimpse. We have a lot still to figure out.” – Sam Altman, CEO and co-founder of OpenAI.
  • “I’m shocked how hard it is to generate text about Muslims from GPT-3 that has nothing to do with violence… or being killed…” – Abubakar Abid, CEO and founder of Gradio.
  • “No. GPT-3 fundamentally does not understand the world that it talks about. Increasing corpus further will allow it to generate a more credible pastiche but not fix its fundamental lack of comprehension of the world. Demos of GPT-4 will still require human cherry picking.” – Gary Marcus, CEO and founder of Robust.ai.
  • “Extrapolating the spectacular performance of GPT3 into the future suggests that the answer to life, the universe and everything is just 4.398 trillion parameters.” – Geoffrey Hinton, Turing Award winner.
  • Improving pre-training sample efficiency.
  • Exploring how few-shot learning works.
  • Distillation of large models down to a manageable size for real-world applications.
  • The model with 175B parameters is hard to apply to real business problems due to its impractical resource requirements, but if the researchers manage to distill this model down to a workable size, it could be applied to a wide range of language tasks, including question answering, dialog agents, and ad copy generation.
  • The code itself is not available, but some dataset statistics together with unconditional, unfiltered 2048-token samples from GPT-3 are released on GitHub.

7. Beyond Accuracy: Behavioral Testing of NLP models with CheckList, by Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

The authors point out the shortcomings of existing approaches to evaluating performance of NLP models. A single aggregate statistic, like accuracy, makes it difficult to estimate where the model is failing and how to fix it. The alternative evaluation approaches usually focus on individual tasks or specific capabilities. To address the lack of comprehensive evaluation approaches, the researchers introduce CheckList , a new evaluation methodology for testing of NLP models. The approach is inspired by principles of behavioral testing in software engineering. Basically, CheckList is a matrix of linguistic capabilities and test types that facilitates test ideation. Multiple user studies demonstrate that CheckList is very effective at discovering actionable bugs, even in extensively tested NLP models.
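
To make the test types concrete, here is a small hand-rolled example (the paper's open-source checklist library provides much richer templating and reporting; predict_sentiment is a deliberately naive stand-in for whatever model is under test):

```python
# Two CheckList-style behavioral tests for a sentiment model.
def predict_sentiment(text: str) -> str:
    return "positive" if "good" in text or "great" in text else "negative"

# Invariance test (INV): changing a name should not change the prediction.
names = ["John", "Maria", "Wei", "Aisha"]
template = "{name} said the food was great."
preds = {predict_sentiment(template.format(name=n)) for n in names}
assert len(preds) == 1, f"INV failure: prediction varies with name: {preds}"

# Directional expectation test (DIR): appending a negative clause should not
# make the prediction *more* positive.
base = predict_sentiment("The food was great.")
perturbed = predict_sentiment("The food was great, but the service was awful.")
print("DIR check:", base, "->", perturbed)
```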

CheckList

  • The primary approach to the evaluation of models’ generalization capabilities, which is accuracy on held-out data, may lead to performance overestimation, as the held-out data often contains the same biases as the training data. Moreover, this single aggregate statistic doesn’t help much in figuring out where the NLP model is failing and how to fix these bugs.
  • The alternative approaches are usually designed for evaluation of specific behaviors on individual tasks and thus, lack comprehensiveness.
  • CheckList provides users with a list of linguistic capabilities to be tested, like vocabulary, named entity recognition, and negation.
  • Then, to break down potential capability failures into specific behaviors, CheckList suggests different test types , such as prediction invariance or directional expectation tests in case of certain perturbations.
  • Potential tests are structured as a matrix, with capabilities as rows and test types as columns.
  • The suggested implementation of CheckList also introduces a variety of abstractions to help users generate large numbers of test cases easily.
  • Evaluation of state-of-the-art models with CheckList demonstrated that even though some NLP tasks are considered “solved” based on accuracy results, the behavioral testing highlights many areas for improvement.
  • The user studies demonstrate that CheckList:
  • helps to identify and test for capabilities not previously considered;
  • results in more thorough and comprehensive testing for previously considered capabilities;
  • helps to discover many more actionable bugs.
  • The paper received the Best Paper Award at ACL 2020, the leading conference in natural language processing.
  • CheckList can be used to create more exhaustive testing for a variety of NLP tasks.
  • Such comprehensive testing that helps in identifying many actionable bugs is likely to lead to more robust NLP systems.
  • The code for testing NLP models with CheckList is available on GitHub.

8. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, by Nitika Mathur, Timothy Baldwin, Trevor Cohn

Automatic metrics are fundamental for the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric’s efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.

The most recent Conference on Machine Translation (WMT) has revealed that, based on Pearson’s correlation coefficient, automatic metrics poorly match human evaluations of translation quality when comparing only a few best systems. Even negative correlations were exhibited in some instances. The research team from the University of Melbourne investigates this issue by studying the role of outlier systems, exploring how the correlation coefficient reflects different patterns of errors (type I vs. type II errors), and what magnitude of difference in the metric score corresponds to true improvements in translation quality as judged by humans. Their findings suggest that small BLEU differences (i.e., 1–2 points) have little meaning and other metrics, such as chrF, YiSi-1, and ESIM should be preferred over BLEU. However, only human evaluations can be a reliable basis for drawing important empirical conclusions.
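
The outlier effect is easy to reproduce with made-up numbers (illustrative values only, not the paper's data):

```python
# How one outlier system inflates the Pearson correlation between an
# automatic metric and human judgments.
from scipy.stats import pearsonr

human  = [68.0, 68.5, 69.0, 69.2, 69.5, 40.0]  # last system is an outlier
metric = [27.0, 26.5, 27.4, 27.1, 27.6, 12.0]  # e.g., BLEU scores

r_all, _ = pearsonr(human, metric)
r_top, _ = pearsonr(human[:-1], metric[:-1])   # drop the outlier
print(f"with outlier: r={r_all:.2f}, without: r={r_top:.2f}")
# The outlier makes the metric look far more reliable than it is for
# distinguishing the closely clustered top systems.
```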

Tangled up in BLEU

  • Automatic metrics are used as a proxy for human translation evaluation, which is considerably more expensive and time-consuming.
  • For example, the recent findings show that if the correlation between leading metrics and human evaluations is computed using a large set of translation systems, it is typically very high (i.e., 0.9). However, if only a few best systems are considered, the correlation reduces markedly and can even be negative in some cases.
  • The identified problem with Pearson’s correlation is due to the small sample size and not specific to comparing strong MT systems.
  • Outlier systems, whose quality is much higher or lower than the rest of the systems, have a disproportionate effect on the computed correlation and should be removed.
  • The same correlation coefficient can reflect different patterns of errors. Thus, a better approach for gaining insights into metric reliability is to visualize metric scores against human scores.
  • Small BLEU differences of 1-2 points correspond to true improvements in translation quality (as judged by humans) only in 50% of cases.
  • The authors recommend:
  • giving preference to such evaluation metrics as chrF, YiSi-1, and ESIM over BLEU and TER;
  • moving away from using small changes in evaluation metrics as the sole basis for drawing important empirical conclusions, and always ensuring support from human evaluations before claiming that one MT system significantly outperforms another one.
  • The paper received an Honorable Mention at ACL 2020, the leading conference in natural language processing. 
  • The implementation code, data, and additional analysis will be released on GitHub.

9. Towards a Human-like Open-Domain Chatbot, by Daniel Adiwardana, Minh-Thang Luong, David R. So, Jamie Hall, Noah Fiedel, Romal Thoppilan, Zi Yang, Apoorv Kulshreshtha, Gaurav Nemade, Yifeng Lu, Quoc V. Le

We present Meena, a multi-turn open-domain chatbot trained end-to-end on data mined and filtered from public domain social media conversations. This 2.6B parameter neural network is simply trained to minimize perplexity of the next token. We also propose a human evaluation metric called Sensibleness and Specificity Average (SSA), which captures key elements of a human-like multi-turn conversation. Our experiments show strong correlation between perplexity and SSA. The fact that the best perplexity end-to-end trained Meena scores high on SSA (72% on multi-turn evaluation) suggests that a human-level SSA of 86% is potentially within reach if we can better optimize perplexity. Additionally, the full version of Meena (with a filtering mechanism and tuned decoding) scores 79% SSA, 23% higher in absolute SSA than the existing chatbots we evaluated. 

In contrast to most modern conversational agents, which are highly specialized, the Google research team introduces a chatbot Meena that can chat about virtually anything. It’s built on a large neural network with 2.6B parameters trained on 341 GB of text. The researchers also propose a new human evaluation metric for open-domain chatbots, called Sensibleness and Specificity Average (SSA), which can capture important attributes for human conversation. They demonstrate that this metric correlates highly with perplexity, an automatic metric that is readily available. Thus, the Meena chatbot, which is trained to minimize perplexity, can conduct conversations that are more sensible and specific compared to other chatbots. Particularly, the experiments demonstrate that Meena outperforms existing state-of-the-art chatbots by a large margin in terms of the SSA score (79% vs. 56%) and is closing the gap with human performance (86%).
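
The SSA computation itself is simple; here is a sketch with illustrative labels (in the real evaluation, human raters judge every model response for sensibleness and specificity):

```python
# Sensibleness and Specificity Average (SSA): the mean of the two rates.
labels = [  # (sensible?, specific?) per response, from human raters
    (True, True), (True, False), (True, True), (False, False), (True, True),
]
sensibleness = sum(s for s, _ in labels) / len(labels)
specificity  = sum(p for _, p in labels) / len(labels)
ssa = (sensibleness + specificity) / 2
print(f"sensibleness={sensibleness:.0%}, specificity={specificity:.0%}, SSA={ssa:.0%}")
```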

Meena chatbot

  • Despite recent progress, open-domain chatbots still have significant weaknesses: their responses often do not make sense or are too vague or generic.
  • Meena is built on a seq2seq model with Evolved Transformer (ET) that includes 1 ET encoder block and 13 ET decoder blocks.
  • The model is trained on multi-turn conversations with the input sequence including all turns of the context (up to 7) and the output sequence being the response.
  • The proposed SSA metric captures two fundamental qualities of a human-like chatbot:
  • making sense;
  • being specific.
  • The research team discovered that the SSA metric shows a strong negative correlation (R² = 0.93) with perplexity, a readily available automatic metric that Meena is trained to minimize: the lower the perplexity, the higher the SSA score.
  • Proposing a simple human-evaluation metric for open-domain chatbots.
  • The best end-to-end trained Meena model outperforms existing state-of-the-art open-domain chatbots by a large margin, achieving an SSA score of 72% (vs. 56%).
  • Furthermore, the full version of Meena, with a filtering mechanism and tuned decoding, further advances the SSA score to 79%, which is not far from the 86% SSA achieved by the average human.
  • “Google’s “Meena” chatbot was trained on a full TPUv3 pod (2048 TPU cores) for 30 full days – that’s more than $1,400,000 of compute time to train this chatbot model.” – Elliot Turner, CEO and founder of Hyperia.
  • “So I was browsing the results for the new Google chatbot Meena, and they look pretty OK (if boring sometimes). However, every once in a while it enters ‘scary sociopath mode,’ which is, shall we say, sub-optimal” – Graham Neubig, Associate professor at Carnegie Mellon University.


  • Lowering the perplexity through improvements in algorithms, architectures, data, and compute.
  • Considering other aspects of conversations beyond sensibleness and specificity, such as, for example, personality and factuality.
  • Tackling safety and bias in the models.
  • Potential applications of open-domain chatbots like Meena include:
  • further humanizing computer interactions;
  • improving foreign language practice;
  • making interactive movie and videogame characters relatable.
  • Considering the challenges related to safety and bias in the models, the authors haven’t released the Meena model yet. However, they are still evaluating the risks and benefits and may decide otherwise in the coming months.

10. Recipes for Building an Open-Domain Chatbot, by Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Kurt Shuster, Eric M. Smith, Y-Lan Boureau, Jason Weston

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we show that other ingredients are important for a high-performing chatbot. Good conversation requires a number of skills that an expert conversationalist blends in a seamless way: providing engaging talking points and listening to their partners, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models are superior to existing approaches in multi-turn dialogue in terms of engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models. 

The Facebook AI Research team shows that with appropriate training data and generation strategy, large-scale models can learn many important conversational skills, such as engagingness, knowledge, empathy, and persona consistency. Thus, to build their state-of-the-art conversational agent, called BlenderBot , they leveraged a model with 9.4B parameters, trained it on a novel task called Blended Skill Talk , and deployed beam search with carefully selected hyperparameters as a generation strategy. Human evaluations demonstrate that BlenderBot outperforms Meena in pairwise comparison 75% to 25% in terms of engagingness and 65% to 35% in terms of humanness.
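
For readers who want to try the released model, here is a sketch using the Hugging Face port (an assumption on our side: the checkpoint below is the distilled 400M-parameter variant; the official release lives in Facebook's ParlAI framework):

```python
# Chatting with a released BlenderBot checkpoint via Hugging Face Transformers.
from transformers import BlenderbotForConditionalGeneration, BlenderbotTokenizer

name = "facebook/blenderbot-400M-distill"
tokenizer = BlenderbotTokenizer.from_pretrained(name)
model = BlenderbotForConditionalGeneration.from_pretrained(name)

inputs = tokenizer("Hello, how are you today?", return_tensors="pt")
# Beam search with a minimum-length constraint: the paper found that response
# length strongly affects perceived quality (too-short responses read as dull).
reply_ids = model.generate(**inputs, num_beams=10, min_length=20, max_length=60)
print(tokenizer.decode(reply_ids[0], skip_special_tokens=True))
```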

BlenderBot

  • Large scale. The largest model has 9.4 billion parameters and was trained on 1.5 billion training examples of extracted conversations.
  • Blended skills. The chatbot was trained on the Blended Skill Talk task to learn such skills as engaging use of personality, engaging use of knowledge, and display of empathy.
  • Beam search used for decoding. The researchers show that this generation strategy, deployed with carefully selected hyperparameters, gives strong results. In particular, it was demonstrated that the length of the agent’s utterances is very important for chatbot performance (i.e., responses that are too short are often considered dull, while responses that are too long make the chatbot appear to waffle and not listen).
  • In pairwise human evaluations, BlenderBot outperforms Meena:
  • 75% of the time in terms of engagingness;
  • 65% of the time in terms of humanness.
  • In an A/B comparison between human-to-human and human-to-BlenderBot conversations, the latter were preferred 49% of the time as more engaging.
  • The authors acknowledge the model’s remaining weaknesses:
  • a lack of in-depth knowledge if sufficiently interrogated;
  • a tendency to use simpler language;
  • a tendency to repeat oft-used phrases.
  • Further exploring unlikelihood training and retrieve-and-refine mechanisms as potential avenues for fixing these issues.
  • Facebook AI open-sourced BlenderBot by releasing code to fine-tune the conversational agent, the model weights, and code to evaluate it.

If you like these research summaries, you might be also interested in the following articles:

  • 2020’s Top AI & Machine Learning Research Papers
  • Novel Computer Vision Research Papers From 2020
  • AAAI 2021: Top Research Papers With Business Applications
  • ICLR 2021: Key Research Papers

Enjoy this article? Sign up for more AI research updates.

We’ll let you know when we release more summary articles like this one.



About Mariya Yao

Mariya is the co-author of Applied AI: A Handbook For Business Leaders and former CTO at Metamaven. She "translates" arcane technical concepts into actionable business advice for executives and designs lovable products people actually want to use. Follow her on Twitter at @thinkmariya to raise your AI IQ.




Title: Text Summarization Using Large Language Models: A Comparative Study of MPT-7b-instruct, Falcon-7b-instruct, and OpenAI ChatGPT Models

Abstract: Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Leveraging Large Language Models (LLMs) has shown remarkable promise in enhancing summarization techniques. This paper embarks on an exploration of text summarization with a diverse set of LLMs, including MPT-7b-instruct, falcon-7b-instruct, and OpenAI ChatGPT text-davinci-003 models. The experiment was performed with different hyperparameters and evaluated the generated summaries using widely accepted metrics such as the Bilingual Evaluation Understudy (BLEU) Score, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score, and Bidirectional Encoder Representations from Transformers (BERT) Score. According to the experiment, text-davinci-003 outperformed the others. This investigation involved two distinct datasets: CNN Daily Mail and XSum. Its primary objective was to provide a comprehensive understanding of the performance of Large Language Models (LLMs) when applied to different datasets. The assessment of these models' effectiveness contributes valuable insights to researchers and practitioners within the NLP domain. This work serves as a resource for those interested in harnessing the potential of LLMs for text summarization and lays the foundation for the development of advanced Generative AI applications aimed at addressing a wide spectrum of business challenges.
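
As a sketch of the kind of evaluation the abstract describes, the Hugging Face evaluate package can compute ROUGE and BERTScore (the library choice is ours; the paper does not specify its tooling):

```python
# Scoring generated summaries against references with ROUGE and BERTScore.
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

predictions = ["the cat sat on the mat all day"]
references = ["a cat was sitting on the mat for the whole day"]

print(rouge.compute(predictions=predictions, references=references))
print(bertscore.compute(predictions=predictions, references=references, lang="en"))
```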


Natural Language Processing (NLP) based Text Summarization - A Survey


Roberto Iriondo

The Best of NLP: February 2023's Top NLP Papers

Stay ahead of the game: get a sneak peek at the coolest natural language processing (NLP) research of February 2023. Our handpicked selection of the best NLP papers will keep you up to date on the latest advancements in language models, text generation, and summarization.

- For all you NLP enthusiasts out there, here is a list of awesome papers from February 2023 highlighted by C4AI’s research community.

This article’s title and TL;DR have been generated with Cohere. Get started with text generation

As NLP enthusiasts, we know that this technology is constantly pushing the boundaries of what's possible. That's why it's crucial to stay up-to-date with the latest breakthroughs and advancements. In this post, we've curated a selection of the top NLP papers for February 2023, covering a wide range of topics, including the most recent developments in language models, text generation, and summarization.

Our team at Cohere has done the heavy lifting by scouring the web and consulting with our research community to bring you the most current and relevant information on NLP research. We're thrilled about the progress that NLP has made in recent years, and we can't wait to see what the future holds. The advancements in this field are enabling us to do more with language than ever before, and this list of top NLP papers will keep you informed and prepared to take advantage of these developments.

At Cohere, our goal is to make NLP technology more accessible to developers and organizations. We believe that the democratization of NLP is key to unlocking its full potential. That's why we are always looking for new community members to join us on this journey. If you're passionate about NLP and want to be part of a community that is driving the future of this technology, we would love to have you. Don't hesitate to apply.


Top NLP Papers of February 2023 Highlighted by Our Research Discord Community

These papers were highlighted by C4AI research discord community members. Big thank you to Ujan#3046, bhavnicksm#8949, EIFY#4102, cvf#1006, MajorMelancholy#1836, cakiki#9145, hails#6601, Mike-RsrchRabbit#9843, and the rest of the Cohere For AI NLP research community for participating.


Toolformer: Language Models Can Teach Themselves to Use Tools

Authors: Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom

Let's talk about language models (LMs), which are pretty cool because they can solve new tasks with just a few examples or textual instructions. However, as amazing as they are, LMs sometimes struggle with basic functionality, like doing simple math or finding facts, where smaller models excel. But what if NLP folks could have the best of both worlds? Enter Toolformer!

Toolformer is a model that can teach itself to use external tools via simple APIs. It's trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. And get this - it does this in a self-supervised way, requiring nothing more than a handful of demonstrations for each API.

Toolformer incorporates a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. And the best part is that it achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities. So, with Toolformer, we're able to use the best of both worlds, making life a whole lot easier for us NLP, machine learning, AI, and software engineering enthusiasts.
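
A toy mock-up of the inline API-call format is shown below (our reading of the paper, which embeds calls in text roughly as “[Tool(args) -> result]”; the tools and parsing here are illustrative stand-ins, not the authors' code):

```python
# Executing inline tool calls embedded in text, Toolformer-style.
import datetime

def calculator(expr: str) -> str:
    return str(eval(expr, {"__builtins__": {}}))  # toy arithmetic only

def calendar(_: str = "") -> str:
    return datetime.date.today().isoformat()

TOOLS = {"Calculator": calculator, "Calendar": calendar}

def execute_call(text: str) -> str:
    """Turn '[Tool(args)]' into '[Tool(args) -> result]'."""
    name, args = text.rstrip(")]").lstrip("[").split("(", 1)
    return f"[{name}({args}) -> {TOOLS[name](args)}]"

print(execute_call("[Calculator(400 / 1400)]"))
print(execute_call("[Calendar()]"))
```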


SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Authors: Max Ryabinin, Tim Dettmers, Michael Diskin, Alexander Borzunov

In this paper, the authors tackle the challenge of training large deep learning models with billions of parameters, which is known to require specialized HPC clusters that come with a hefty price tag. To work around this limitation, they explore alternative setups for training these large models, such as using cheap "preemptible" instances or pooling resources from multiple regions.

The paper then analyzes the performance of existing model-parallel algorithms in these conditions and identifies configurations where training larger models becomes less communication-intensive. They introduce SWARM parallelism, a novel model-parallel training algorithm specifically designed for poorly connected, heterogeneous, and unreliable devices.

SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure, which is a significant improvement over existing large-scale training approaches. The authors empirically validate their findings and compare SWARM parallelism with existing methods.

To further demonstrate their approach, they combine their insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network. These promising results show that SWARM parallelism has the potential to revolutionize the way large models are trained, making it more accessible and cost-effective for researchers and practitioners alike.


Pretraining Language Models with Human Preferences

Authors: Tomasz Korbak, Kejian Shi, Angelica Chen, Rasika Bhalerao, Christopher L. Buckley, Jason Phang, Samuel R. Bowman, Ethan Perez

In this paper, the authors delve into the exciting world of language models (LMs) and how they can be trained to generate text that aligns with human preferences. LMs are conventionally pretrained to imitate internet text, which can lead to some undesirable outcomes. But what if LMs could be taught to generate text that's not only coherent and informative but also aligned with human preferences?

To explore this, the authors benchmarked five objectives for pretraining LMs with human feedback across three tasks. They studied how these objectives affect the balance between the alignment and capabilities of pretrained LMs. And what they found was a Pareto-optimal approach: conditional training.

Conditional training involves teaching the LM to learn the distribution over tokens conditional on their human preference scores, given by a reward model. And the results were impressive! Conditional training reduced the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt.
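
In its simplest form, conditional training amounts to tagging each training document with a control token derived from the reward-model score, for example (the threshold and tag names below are our illustration, not the paper's exact setup):

```python
# Conditional training sketch: the LM learns p(tokens | preference tag).
def tag_example(text: str, reward: float, threshold: float = 0.0) -> str:
    tag = "<|good|>" if reward >= threshold else "<|bad|>"
    return f"{tag} {text}"

corpus = [("thanks, happy to help!", 0.9), ("some toxic rant", -0.8)]
training_texts = [tag_example(text, reward) for text, reward in corpus]
print(training_texts)
# At generation time, you condition on "<|good|>" to sample preferred text.
```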

Moreover, conditional training maintained the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback resulted in much better preference satisfaction than the alternative of standard LM pretraining followed by finetuning with feedback.

Overall, the results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training. This is a huge step forward in ensuring that language models generate text that aligns with human preferences, and it's exciting to see where this technology will go in the future!


Multimodal Chain-of-Thought Reasoning in Language Models

Authors: Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, Alex Smola

In this paper, the authors introduce a groundbreaking new approach for large language models (LLMs) that combines text and vision to achieve even better reasoning performance. The new model, called Multimodal-CoT, builds on the chain-of-thought (CoT) approach to generate intermediate reasoning chains as the rationale to infer the answer. The big difference is that this time, the model incorporates both language and vision (images) modalities into a two-stage framework that separates rationale generation and answer inference.
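
The two-stage framework can be summarized schematically (our paraphrase; rationale_model and answer_model are hypothetical stand-ins for the paper's fine-tuned vision-and-language models):

```python
# Schematic of Multimodal-CoT's two-stage inference.
def multimodal_cot(question, image, rationale_model, answer_model):
    # Stage 1: generate an intermediate reasoning chain from text + image.
    rationale = rationale_model(question, image)
    # Stage 2: infer the answer conditioned on the generated rationale.
    return answer_model(question, image, rationale)

# Toy stand-ins so the sketch runs end to end.
rationale_model = lambda q, img: "The magnet attracts the iron filings toward it."
answer_model = lambda q, img, r: "(a) magnetic force"
print(multimodal_cot("Which force acts on the filings?", "diagram.png",
                     rationale_model, answer_model))
```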

The Multimodal-CoT model is designed to leverage better-generated rationales that are based on multimodal information, improving the accuracy of answer inference. The results speak for themselves: the model with just under 1 billion parameters outperforms the previous state-of-the-art LLM (GPT-3.5) by a whopping 16 percentage points (75.17% to 91.68% accuracy) on the ScienceQA benchmark and even surpasses human performance.

The code for Multimodal-CoT is publicly available on Amazon, so if you're interested in exploring this cutting-edge technology, it's just a click away. With this new model, the authors have taken an important step forward in the development of large language models and multimodal reasoning, paving the way for even more exciting advances in the field of AI and machine learning.


Poisoning Web-Scale Training Datasets is Practical

Authors: Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, Florian Tramèr

In this paper, the authors dive deep into the dangers of dataset poisoning attacks on deep learning models. These attacks inject malicious examples into a model's training data, which can have serious consequences for its behavior. The authors introduce two new and practical attacks that could poison ten popular datasets.

The first attack, split-view poisoning, takes advantage of the mutable nature of internet content: the data a dataset's annotator saw can differ from the data that subsequent clients download later, so an attacker can swap in malicious examples that go unnoticed. This attack is particularly insidious because it exploits invalid trust assumptions. Shockingly, the authors found they could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD.

The second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content, like Wikipedia. The attacker only needs a time-limited window, just before a snapshot is taken, to inject malicious examples into the dataset.

In light of these attacks, the authors notified the maintainers of each affected dataset and recommended several low-overhead defenses. These defenses will help mitigate the risks of dataset poisoning and protect deep learning models from malicious attacks.


Symbolic Discovery of Optimization Algorithms

Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

In this paper, the authors introduce a novel approach to algorithm discovery by framing it as program search. They apply this method to discover optimization algorithms for deep neural network training and demonstrate how it can bridge the generalization gap between proxy and target tasks.

Their approach utilizes efficient search techniques to explore an infinite and sparse program space. To simplify the process, they also introduce program selection and simplification strategies. The result of their method is the discovery of a new optimization algorithm, Lion (EvoLved Sign Momentum).

Compared to widely used optimizers such as Adam and Adafactor, Lion is more memory-efficient since it only keeps track of the momentum. It also differs from adaptive optimizers in that its update has the same magnitude for every parameter, because the update is computed through the sign operation.
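In NumPy, the update rule can be sketched as follows; this is paraphrased from the paper's pseudocode, with the default β values the paper reports.

```python
import numpy as np

def lion_step(param, grad, m, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
    # interpolate momentum and gradient, then keep only the sign
    update = np.sign(beta1 * m + (1 - beta1) * grad)
    param = param - lr * (update + wd * param)  # decoupled weight decay
    m = beta2 * m + (1 - beta2) * grad          # momentum is updated afterwards
    return param, m
```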

The authors test Lion on various models and tasks and show that it outperforms Adam in several areas, including image classification and diffusion models. Lion also typically requires a smaller learning rate than Adam due to the larger norm of the update produced by the sign function.

However, the authors also acknowledge the limitations of Lion and identify scenarios where its improvements are small or not statistically significant. They make the implementation of Lion publicly available for others to use and build upon.


The Wisdom of Hindsight Makes Language Models Better Instruction Followers

Authors: Tianjun Zhang, Fangchen Liu, Justin Wong, Pieter Abbeel, Joseph E. Gonzalez

In this paper, the authors delve into the complex world of reinforcement learning and its application in fine-tuning language models. Specifically, they explore the “Reinforcement Learning with Human Feedback (RLHF)” algorithm, which has demonstrated remarkable success in aligning GPT series models with instructions through human feedback.

However, the authors point out that the underlying RL algorithm is not a walk in the park and requires an additional training pipeline for reward and value networks. So, they propose an alternative approach: relabeling the original feedback and training the model for better alignment in a supervised manner. This algorithm doesn't require any additional parameters except for the original language model and maximally reuses the pretraining pipeline.

To accomplish this, the authors formulate the instruction alignment problem for language models as a goal-reaching problem in decision-making. They present a novel algorithm called Hindsight Instruction Relabeling (HIR), which aligns language models with instructions based on feedback that has been relabeled with hindsight.
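The relabeling idea at the heart of HIR can be sketched as follows; the relabeling rule here is a deliberately simple stand-in for the paper's feedback-based one.

```python
# hedged sketch of hindsight relabeling: a failed output is still a valid
# answer to *some* instruction, so relabel the pair and train on it with an
# ordinary supervised loss instead of an RL objective
def hindsight_relabel(instruction, output, success):
    if success:
        return instruction, output
    # toy relabeling rule; HIR derives the new instruction from the feedback
    return f"{instruction} Give a wrong answer.", output

data = [("What is 2 + 3?", "5", True), ("What is 2 + 3?", "6", False)]
relabeled = [hindsight_relabel(*example) for example in data]
```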

The resulting two-stage algorithm sheds light on a family of reward-free approaches that utilize the relabeled feedback as a substitute for reward. The authors evaluate the performance of HIR on 12 challenging BigBench reasoning tasks and show that it outperforms the baseline algorithms and is comparable to, or even surpasses, supervised fine-tuning.

In conclusion, the paper offers an intriguing new approach to fine-tuning language models that has the potential to reduce the complexity of the reinforcement learning algorithm and streamline the training process.


Hyena Hierarchy: Towards Larger Convolutional Language Models

Authors: Michael Poli, Stefano Massaroli, Eric Nguyen, Daniel Y. Fu, Tri Dao, Stephen Baccus, Yoshua Bengio, Stefano Ermon, Christopher Ré

In this paper, the authors introduce us to Hyena, a subquadratic replacement for the attention operator in Transformers. While attention has been the core building block of Transformers, it suffers from quadratic cost in sequence length, which makes it difficult to access large amounts of context. To bridge this gap, the authors propose Hyena, which is constructed by interleaving implicitly parametrized long convolutions and data-controlled gating.
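The subquadratic primitive underneath is a long convolution evaluated with the FFT. Here is a minimal NumPy sketch; the random filter stands in for Hyena's implicitly parametrized one.

```python
import numpy as np

def long_conv(u, k):
    # linear convolution via FFT: O(L log L) instead of attention's O(L^2)
    n = 2 * len(u)  # zero-pad to avoid circular wraparound
    return np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(k, n), n)[: len(u)]

L = 8192
u = np.random.randn(L)       # input sequence
k = np.random.randn(L)       # filter as long as the input (stand-in)
g = np.random.randn(L)       # gating signal derived from the data
y = g * long_conv(u, k)      # data-controlled gating is elementwise
```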

What's exciting about Hyena is that it can significantly improve accuracy in recall and reasoning tasks on sequences of thousands to hundreds of thousands of tokens. In fact, it improves accuracy by more than 50 points over operators relying on state spaces and other implicit and explicit methods. Not only that, but Hyena can match attention-based models, setting a new state-of-the-art for dense-attention-free architectures on language modeling in standard datasets (WikiText103 and The Pile).

In addition to its accuracy, Hyena can reduce training compute required at sequence length 2K by 20%. Its operators are also twice as fast as highly optimized attention at sequence length 8K, and 100x faster at sequence length 64K. This means that not only is Hyena powerful, but it's also efficient. Overall, Hyena presents a promising new approach to subquadratic methods in deep learning that could have wide-ranging implications for the field.


Crawling the Internal Knowledge-Base of Language Models

Authors: Roi Cohen, Mor Geva, Jonathan Berant, Amir Globerson

Language models are becoming increasingly sophisticated, and in the course of training they extract a significant body of factual knowledge from the vast amount of text they are trained on. This wealth of knowledge can then be used to enhance downstream NLP tasks. But how can this knowledge be represented in an interpretable way? That's where the authors' proposal comes in.

The authors present a novel approach to extract a knowledge-graph of facts from a given language model. They start by "crawling" the internal knowledge-base of the language model and expanding a knowledge-graph around a seed entity. The crawling procedure is broken down into sub-tasks, which are achieved through specially designed prompts that ensure high precision and recall rates.
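In pseudocode, the crawl is a breadth-first expansion driven by prompts. Here, ask_lm is a hypothetical helper that sends a prompt to the model and parses the completion into a list of strings; the prompt wording is illustrative, not the paper's.

```python
def crawl_knowledge_graph(seed, ask_lm, max_depth=2):
    graph, frontier = set(), [(seed, 0)]
    while frontier:
        entity, depth = frontier.pop(0)  # breadth-first expansion
        if depth >= max_depth:
            continue
        for relation in ask_lm(f"List relations of {entity}:"):
            for obj in ask_lm(f"Complete the fact: {entity} {relation}"):
                if (entity, relation, obj) not in graph:
                    graph.add((entity, relation, obj))
                    frontier.append((obj, depth + 1))
    return graph  # a set of (subject, relation, object) triples
```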

The authors evaluated their approach on graphs crawled from dozens of seed entities and found that it yielded graphs with high precision, ranging from 82% to 92%. The procedure also emitted a reasonable number of facts per entity, which is important for practical applications. This work is an important step towards building more interpretable language models that can provide a structured representation of the knowledge they acquire from text.


DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature

Authors: Eric Mitchell, Yoonho Lee, Alexander Khazatsky, Christopher D. Manning, Chelsea Finn

In this paper, the authors tackle the problem of detecting machine-generated text, which has become increasingly difficult with the advancement of large language models (LLMs). These models are so good at generating text that it's becoming harder to tell whether a piece of writing is human or machine-generated. For instance, students could use these models to complete their writing assignments, making it harder for instructors to assess their work.

To solve this issue, the authors propose a new approach called DetectGPT, which uses the curvature of the model's log probability function to identify whether a given passage was generated by the LLM in question. This new method doesn't require a separate classifier or a dataset of real or generated passages, and it doesn't explicitly watermark the generated text.
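The core test can be sketched in a few lines. Here, log_prob and perturb are hypothetical helpers; in the paper, perturbations come from mask-filling with T5.

```python
# hedged sketch of DetectGPT's perturbation discrepancy: model-generated text
# tends to sit near a local maximum of the model's log probability, so
# slightly rewritten variants score noticeably lower
def perturbation_discrepancy(text, log_prob, perturb, n_perturbations=20):
    original = log_prob(text)
    perturbed = [log_prob(perturb(text)) for _ in range(n_perturbations)]
    # a large positive discrepancy suggests the passage was machine-generated
    return original - sum(perturbed) / len(perturbed)
```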

To test the effectiveness of DetectGPT, the authors use it to detect fake news articles generated by the massive 20B parameter GPT-NeoX model. The results are impressive, with DetectGPT significantly outperforming existing zero-shot methods for detecting model samples. The strongest zero-shot baseline achieved a 0.81 AUROC, while DetectGPT achieved an impressive 0.95 AUROC.

If you're interested in this exciting new approach to detecting machine-generated text, check out the code, data, and other project information.

Final Thoughts

Are you ready to revolutionize the way you work with large volumes of text? Look no further than incorporating large language models into your workflow. This list of cutting-edge NLP research serves as your guide to unlocking the full potential of this powerful technology. But don't just take our word for it: experiment and tweak to find the perfect model for your specific needs. And the journey doesn't have to be a solitary one: join our Discord community to share your discoveries and collaborate with like-minded individuals. Ready to dive in? Try out our NLP API on the Cohere playground and start building the future of natural language processing today.

Ontology extension with NLP-based concept extraction for domain experts in catalytic sciences


Alexander S. Behr, Marc Völkenrath & Norbert Kockmann


Ontologies store semantic knowledge in a machine-readable way and represent domain knowledge in a controlled vocabulary. In this work, a workflow is set up to derive classes from a text dataset using natural language processing (NLP) methods. Furthermore, ontologies and thesauri are browsed for those classes, and corresponding existing textual definitions are extracted. A base ontology is selected to be extended with knowledge from catalysis science, while word similarity is used to introduce new classes to the ontology based on the class candidates. Relations are introduced to automatically reference them to already existing classes in the selected ontology. The workflow is conducted for a text dataset related to catalysis research on methanation of CO \(_2\) , with seven semantic artifacts assisting ontology extension by domain experts. Undefined concepts and unstructured relations can thus be introduced into existing ontologies more easily and automatically. Domain experts can then revise the resulting extended ontology by choosing the best fitting definition of a class and specifying suggested relations between concepts of catalyst research. A structured, NLP-supported extension of ontologies is thus made possible, facilitating a Findable, Accessible, Interoperable, Reusable (FAIR) data management workflow.


1 Introduction

In current research data management, the interconnection of the data produced and of its interpretation is essential for comprehensible deductions of new knowledge. Research data need to be FAIR (Findable, Accessible, Interoperable, and Reusable) by humans and machines in order to make proper use of data recorded in experiments, e.g., in electronic laboratory notebooks [ 1 , 2 ]. While a researcher can easily grasp and interpret semantics expressed in texts using their implicit knowledge [ 3 ], a machine cannot do this without a representation of such knowledge embedded. Here, ontologies are used to describe implicit knowledge in an explicit way, as they represent explicit specifications of conceptualizations [ 4 ]. Ontologies are informatic constructs used to represent relations among classes, such as catalyst or reactor .

As classification is an important concept of ontologies, the hierarchic sorting of the classes in turn represents the backbone of an ontology. While the connections between classes within an ontology are important for defining them, short definition sentences (definition strings) are used as class annotations. These help humans using the ontology to define and understand its classes properly. Ontologies are not the only source of definition strings for classes: thesauri also provide concepts with respective definition strings, such as the NCIT [ 5 ]. While thesauri do not necessarily have semantic relations between their concepts like ontologies do, they often contain more concepts and respective definition strings than ontologies.

For a domain expert who wants to represent domain knowledge in an ontology, the hurdle of including ontology classes in the correct form can be quite challenging and time consuming. Being experts in certain scientific fields, domain experts might also omit some knowledge because it is considered trivial. Extending an ontology for one's own needs is often tedious work [ 6 , 7 ]; thus, approaches that simplify the extension of ontologies and reduce the time consumed by domain experts are desired in order to raise acceptance of ontologies.

Since already existing ontologies do not necessarily contain all classes essential to describe the respective knowledge domain, an automated extension of ontologies is desirable. In addition, plenty of information in scientific research is presented in textual form, e.g., research papers by many domain experts. These research papers contain a large amount of domain-specific vocabulary. Techniques from Natural Language Processing (NLP), in turn, can help to automate the setup of ontologies based on the unstructured (natural) text contained in research papers [ 8 ]. For example, by using Part of Speech (POS) tagging, nouns can be sorted out automatically from a given text and afterward be brought to their nominative singular form by lemmatization.

While methods exist to extract ontologies from documents fully automatically, they usually produce ontologies that are not really useful for further reuse [ 9 ]. The ConTrOn (continuously trained ontology) project shows how user feedback can be integrated through a human-in-the-loop system  [ 10 , 11 ]. Here, a domain-specific ontology is augmented and extended automatically on the basis of textual data and external sources of knowledge such as Wikidata and WordNet [ 12 ]. While this approach represents a solution for integrating information from data sheets into ontologies, the extraction of knowledge and of relations between ontology classes from text is missing. In addition, classes and their definitions are compared only with Wikidata, although a comparison with other ontologies would also make sense, since other ontologies focus more on expert knowledge and might contain knowledge not represented in Wikidata.

The scope of this work is to use NLP techniques to extract vocabulary relevant to a domain of knowledge represented in a set of scientific papers. This vocabulary is then annotated with definitions derived from existing semantic artifacts (such as ontologies and thesauri) to help domain experts in later steps with selecting the classes that best fit the domain of knowledge. In addition, NLP is used to assist domain experts by automatically including suggested classes in an existing ontology and suggesting semantic relations between the classes based on text vectorization models of the texts. As classes should be defined only once to avoid ambiguities, already existing definitions of the added classes are included in the resulting extended ontology to later aid domain experts in selecting the definition that best fits the automatically added classes. Thus, the words necessary to describe a knowledge domain are included in a holistic, automated way into an ontology by incorporating knowledge from a variety of scientific papers on a certain topic of interest.

2 Methodological background

This section describes the text dataset and the semantic artifacts used later to apply the workflow. Furthermore, the vectorization with Word2Vec is explained, as its cosine similarity and its min_count parameter are the key criteria by which the later results are classified.

2.1 Text dataset

The dataset consists of scientific publications focusing on catalytic methanation reactions. In total, 25 research papers and three review papers are collected on research topics of methanation of CO \(_2\) . Besides continuous text, the dataset also contains other data, such as figures, diagrams, tables, and chemical formulas. In addition, the headers and footers of pages often contain text with no further domain-specific information. Thus, preprocessing of the scientific publications focuses on extracting tokens from the continuous text of the dataset and discarding data waste. The method of preprocessing is described further in Sect.  3.1 . The publications used as the text dataset in this work are presented in Table A1 in Appendix A.

2.2 Semantic artifacts

For the extension and annotation of ontologies, five ontologies and two thesauri are selected based on the set of ontologies deemed important to the catalysis research domain by the NFDI4Cat project [ 1 , 13 , 14 ]. The Allotrope Foundation Ontology (AFO) [ 15 ], Chemical Entities of Biological Interest (CHEBI) [ 16 ], and Chemical Methods Ontology (CHMO) [ 17 ] are closely related to the chemical domain and contain concepts related to chemical experiments in laboratories. In contrast, the BioAssay Ontology (BAO) [ 18 ] focuses on biological screening assays and their results. While the scope of the BAO might not intuitively fit the chosen text dataset, it contains certain concepts, such as chemical roles of substances (e.g., catalyst), which also play a role in the text dataset. Similarly, the scope of the Systems Biology Ontology (SBO) [ 19 ] is systems biology and computational modelling. Like the BAO, it is chosen because it contains relations regarding substances as well as general laboratory contexts that also occur in the text dataset.

In addition to these ontologies, two thesauri are used: the IUPAC Compendium of Chemical Terminology (IUPAC-Goldbook) [ 20 ] and the National Cancer Institute Thesaurus (NCIT) [ 5 ]. They cover vast numbers of chemical species and domain-specific words of the chemical domain of knowledge while also providing definition strings for the respective words. In order to be processed properly, all ontologies and the NCIT were used in the OWL file format with RDF/XML syntax; artifacts only available in other serializations, e.g., TTL, were converted to OWL (RDF/XML) using Protégé [ 21 ]. The IUPAC-Goldbook was used in the JSON file format as provided by its homepage [ 20 ]. The semantic artifacts discussed and used in this work are listed in Table  1 along with the number of classes or concepts they contain.

2.3 Vectorization with Word2Vec

After preprocessing, the data is used to derive the semantic similarity of the extracted tokens. For this, the Word2Vec algorithm implemented in the python module gensim is used [ 22 ]. It vectorizes words to learn relations between tokens and thus represents a statistical method. Using the preprocessed text as input, Word2Vec creates a vocabulary, vectorizing each word to a vector of user-defined length. A longer vector corresponds to a higher dimension of the vector space used for the vectorization, but also results in longer computational time, leading to a trade-off between computational time and expressivity of the vectors [ 23 ]. The similarity of two concepts can be calculated with the help of the cosine similarity, i.e., the cosine of the angle \(\varphi \) between two vectors \(\vec {a}\) and \(\vec {b}\) , using the equation

\[ \cos \varphi = \frac{\vec {a} \cdot \vec {b}}{\Vert \vec {a}\Vert \, \Vert \vec {b}\Vert }, \]

resulting in a value close to one for tokens close to each other and close to minus one for tokens far away from each other. Because this is a statistical method, the frequency of occurrence of tokens within the text corpus is important to consider. This is reflected in the Word2Vec parameter min_count , which sets the minimum number of occurrences a token must have in the text corpus to be considered by the model. The higher this number is set, the smaller the overall number of considered words gets; the model then focuses only on the most frequently occurring words. A lower min_count is more prone to include tokens that stem from, e.g., typing errors or that are of less relevance to the overall domain of knowledge represented in the text corpus.
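A minimal gensim sketch of this vectorization step; the vector size follows the setting used later in this work, and the token lists are placeholders.

```python
from gensim.models import Word2Vec

# each sentence is a list of preprocessed tokens (placeholders here)
sentences = [["catalyst", "activity", "methanation"],
             ["nickel", "catalyst", "selectivity"]]

# min_count=1 only because the toy corpus is tiny
model = Word2Vec(sentences, vector_size=300, min_count=1, workers=4)
print(model.wv.similarity("catalyst", "methanation"))  # cosine similarity
```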

3 Workflow

To obtain information from scientific papers, the text corpus first needs to be extracted and preprocessed to be viable in further steps. Part of Speech tagging (POS-tagging) is used to extract only nouns as candidates for new ontological classes. Searching for these extracted concepts (tokens) in already existing semantic artifacts (ontologies or thesauri) yields tokens annotated with definition strings and a link to the respective semantic artifact the definition was taken from. To extend an already existing ontology with concepts based on the found tokens, a Word2Vec model is trained that vectorizes the text data. This in turn allows tokens with high cosine similarity to the classes already contained in an ontology to be output and introduced as new classes in the ontology. In addition, relations are asserted to connect the already contained ontology classes to the classes created automatically based on Word2Vec. This overall workflow is depicted in Fig.  1 , with the start of the workflow denoted in red and the output of the workflow in green. The following sections explain the three main steps of this general workflow in more detail. First, the text extraction is explained, as the text corpus needs to be extracted and preprocessed to be useful in further steps. Then, POS-tagging and the search for the tokens in already existing ontologies take place to annotate the extracted tokens. In the final step, the extension of an ontology by new classes based on the text dataset is explained.

Figure 1: Overall workflow conducted in this work to extract tokens from text, supply them with definitions based on ontologies, and extend ontologies with new classes. The red box denotes the start of the workflow, while the output boxes are colored green (Color figure online)

3.1 Text extraction

Besides textual information, the text dataset also contains information that is either non-textual or meaningless. Non-textual information, such as figures, can be neglected to reduce the file size. Text fragments without further domain-specific information can also be deleted to obtain a more condensed text dataset.

Thus, all figures, tables, and diagrams that do not contain complete sentences are first removed by hand with Acrobat Reader  [ 24 ] and using the python module pdfminer  [ 25 ]. Annotations and tables containing text in bullet-point form are considered individually. Furthermore, lists such as references, tables of figures, and tables of nomenclature are removed, as these usually represent lists of individual words and symbols that do not reflect any context or relations. However, definition directories containing technical terms explained by short sentences are not removed, since they can contain relevant information. Subsequently, textual content that occurs repeatedly is removed, such as a DOI contained in the footer of each page or the journal name in the header of each page. These have no informative value and would negatively influence the creation of the model. Captions are also removed, since their information content is marginal and they often repeat without enriching the textual dataset (such as “Introduction” or “Conclusion”). The cleaned files of the dataset are then read in with python code such that each document is represented by a single string. The module SpaCy [ 26 ] is used to apply POS-tagging. This transforms the read-in string into a nested list, where each sentence is represented as a list entry in a separate list. Using interpunction and space characters as separators, tokens are extracted and lemmatized using the vocabulary en_core_web_sm . This categorizes each word contained in each sentence by its lexical category (e.g., noun, verb, number, ...).
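A minimal sketch of this pipeline, assuming pdfminer.six and spaCy with the en_core_web_sm model are installed; the file name is illustrative.

```python
from pdfminer.high_level import extract_text
import spacy

nlp = spacy.load("en_core_web_sm")

text = extract_text("paper01.pdf")        # one cleaned PDF -> a single string
doc = nlp(text)                           # sentence splitting plus POS-tagging
# nested list: one list of lemmatized tokens per sentence
sentences = [[tok.lemma_ for tok in sent] for sent in doc.sents]
# lexical category of each word, e.g. ("catalyst", "NOUN")
pos_tags = [(tok.text, tok.pos_) for tok in doc]
```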

3.2 Annotation of extracted tokens

As ontology classes are mostly nouns, only tokens with the categories “noun” and “proper noun” are retained from the dataset and used in further procedures. A search for those tokens in ontologies is performed to determine how many of the tokens are contained in each ontology as a class. The result helps to decide which ontology can be taken as the basis for further extension steps. Further help is provided by the extraction of class definitions contained as string values in the ontologies, enabling an easy determination of the best definition by domain experts in later steps.

To choose an ontology fitting the dataset and enrich it with the concepts gathered by preprocessing, existing definitions of the tokens contained in the ontologies should be known. Thus, python code is produced which loads ontologies from a local database using owlready2 [ 27 ]. Then, all class labels as well as their definition strings are read in from the ontologies and stored as key-value pairs in dictionaries. Nested dictionaries are used to store all classes and their definitions of a single ontology: the ontology name serves as key, and the dictionary containing class names and their definitions as value. The tokens found by text extraction, as discussed in Sect.  3.1 , are read in, and the dictionary is browsed for those tokens among the class names. Finally, the number of found tokens per ontology can be accessed. In addition, the tokens are stored in a table along with the respective definitions, each assigned to its source ontology, for later review by domain experts. The workflow of the code constructed for the annotation of extracted tokens is depicted in Fig.  2 . The red elements denote the needed input of the workflow, i.e., the ontology database and the tokens obtained by text extraction, while the output boxes are colored green.

Figure 2: Workflow of the code constructed for the annotation of extracted tokens. The red elements denote the input of the workflow, while the output boxes are colored green (Color figure online)
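A sketch of how such a nested dictionary could be built with owlready2; it assumes definition strings are stored in rdfs:comment, although some ontologies use dedicated annotation properties (e.g., IAO_0000115).

```python
from owlready2 import get_ontology

def class_definitions(path):
    onto = get_ontology(path).load()
    # map lowercased class label -> list of definition strings
    return {str((cls.label or [cls.name])[0]).lower():
                [str(c) for c in cls.comment]
            for cls in onto.classes()}

# nested dictionary: ontology name -> {class label: definitions}
paths = {"afo": "file://afo.owl", "chebi": "file://chebi.owl"}  # illustrative
database = {name: class_definitions(p) for name, p in paths.items()}
```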

3.3 Extension of an ontology by new classes based on text dataset

The Word2Vec model is trained on the textual data obtained by the methods discussed in Sect.  3.1 . Following [ 23 ], a vector size of 300 was set. While the Word2Vec model could be used for hierarchic clustering, the resulting clusters would not yield hierarchies in an ontological, semantic sense. This is due to the nature of the relations between tokens extracted by vectorization of concepts: the text clusters capture the semantic similarity of words important for the domain of knowledge, but no classification or hierarchical information is obtained from the Word2Vec model. Thus, hierarchical clustering with, e.g., dendrograms would not necessarily yield classifications (ontology classes and respective subclasses) of concepts. However, Word2Vec is able to return the tokens with the highest cosine similarity to an initial input concept.

To use this functionality of similar tokens, the output of the workflow presented in Sect.  3.2 is used. The workflow not only annotates tokens of a text dataset with definitions contained in ontologies, but can also be used to output which tokens are already contained in each investigated ontology.

Picking the ontology with the most common classes, these already contained classes are used as input for the Word2Vec model trained on the text dataset. The model is then used to retrieve the n tokens closest to the input word in terms of cosine similarity. This is accompanied by a threshold value, restricting the number of output tokens with regard to the minimal cosine similarity allowed. For example, a required minimal cosine similarity of 0.999 would only yield tokens very close to the input, while a minimal similarity of 0.8 would also include broader tokens, farther away in the vector space. As those tokens are the ones most similar to the already contained ontology class, the ontology class and the tokens retrieved in this way by Word2Vec are assumed to have some kind of semantic relationship.
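Continuing the gensim sketch from Sect. 2.3, this selection step could look as follows; the function and parameter names are illustrative.

```python
def suggest_candidates(model, class_label, topn=5, threshold=0.999):
    # the n closest tokens by cosine similarity, filtered by the threshold
    return [(token, sim)
            for token, sim in model.wv.most_similar(class_label, topn=topn)
            if sim >= threshold]
```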

If a token output by Word2Vec in this way is not already contained in the ontology, a new class reflecting the token has to be created. To have an overarching class for newly included classes not yet properly defined by semantic means, a class called w2vConcept is created as a subclass of the owl:Thing class. Tokens output by the Word2Vec model and not yet contained in the ontology are then created as classes and set to be subclasses of the automatically created class w2vConcept . This is done to help in the later revision of the automatically created classes, as they are easier to find in an ontology editor, e.g., Protégé, when listed as subclasses of the same class. Furthermore, this ensures that the integration of new classes does not disturb the semantic integrity of the ontology. The new classes are also connected via an automatically created relationship to the classes deemed similar by the Word2Vec model. This object property is called conceptually related to and is intended to ease the later definition of the exact relation between the two classes. To annotate the classes with missing definition strings, the workflow presented in Sect.  3.2 is used to search for definition strings of the newly created classes in the other semantic artifacts. The code cannot decide by itself which definition is more fitting when multiple definition strings are found; thus, each definition string obtained is listed in a separate rdfs:comment of the class along with a note on the source of the definition.
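A minimal owlready2 sketch of this extension step; the paths, the candidate map, and the definition string are placeholders, and the class-level property assertion relies on owlready2's class properties, which create existential restrictions by default.

```python
from owlready2 import get_ontology, Thing, ObjectProperty
import types

onto = get_ontology("file://afo.owl").load()  # ontology to extend (illustrative)

with onto:
    W2vConcept = types.new_class("w2vConcept", (Thing,))
    types.new_class("conceptually_related_to", (ObjectProperty,))

    # placeholder map: new token -> existing classes it was similar to
    candidates = {"flow": [onto.search_one(label="concentration"),
                           onto.search_one(label="rate")]}

    for token, similar_classes in candidates.items():
        NewClass = types.new_class(token, (W2vConcept,))
        NewClass.comment.append("Found in [NCIT]: <definition string>")
        for cls in similar_classes:
            cls.conceptually_related_to.append(NewClass)  # link existing -> new

onto.save("afo_extended.owl")
```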

After storing the resulting extended ontology, domain experts can go through the newly added classes, easily accept or reject them, and modify the conceptually related to relation into a more fitting one. This workflow of the code to extend an ontology automatically is depicted in Fig.  3 . The ontology used as input is denoted in red, while the extended ontology, which is the output of the workflow, is colored green.

Figure 3: Workflow of the code to extend an ontology by new classes based on the text dataset. The ontology used as input is denoted in red, while the extended ontology, which is the output of the workflow, is colored green (Color figure online)

4 Results and discussion

The textual data of 28 scientific texts is preprocessed and extracted according to Sect.  3.1 . This yields a dataset of 858,014 symbols overall, from which 4,170 noun tokens are identified for further use in the workflows proposed in Sect.  3 . Applying different min_count parameters in the range min_count \(=[1...25]\) yields different numbers of tokens, as shown in Fig.  4 . While higher min_count parameters yield fewer tokens, the tokens retained are deemed the more important ones, as they occur more often in the dataset.

Figure 4: Number of tokens obtained from the text dataset of 28 scientific papers for different min_count parameters

The resulting sets of tokens are then used as concept names to search for fitting classes in the seven semantic artifacts introduced in Sect.  2.2 . This yields, in an automated way, the number of tokens already contained as classes in the respective ontology as well as textual definitions of the classes. In addition, the count of classes already contained can be used to suggest the ontology most fitting to the respective text dataset.

Table  2 lists the resulting numbers of found classes in the semantic artifacts for six different min_count values in the range [1...100]. Each token only needs to be annotated with a textual definition at least once; thus, the overall sum of annotated tokens is calculated for each set of tokens. If a token has annotations from multiple semantic artifacts, it is counted in each respective row, while it is counted only once in the row for the sum of annotated tokens. Dividing the sum of annotated tokens by the overall number of tokens then yields the rate of annotated tokens. A high rate of annotated tokens is desired in order to reduce the later workload of revising the ontology, as coming up with definitions for classes is more difficult than agreeing on an already existing one. However, a high sum of annotated tokens is also desired, as integrating more classes into an ontology results in a higher expressivity of the latter.

While sets obtained by setting a low min_count contain more tokens than those with a higher min_count , the rate of annotated tokens rises with higher min_count parameters. This might also indicate a higher relevance of the tokens contained in the sets with high min_count parameters. In addition, the rate of annotated tokens for min_count \(=1\) is, at \( 28.25~\%\) , quite low compared to the other rates. This might be due to the inclusion of typing mistakes and non-domain-relevant tokens at lower min_count , as a single occurrence suffices for a token to be contained in the set. On the other hand, lower min_count parameters take into account more concepts not yet defined in the ontologies. These concepts in turn allow for the generation of more new class candidates for the respective ontologies. The ontologies themselves contain fewer of the tokens than the thesauri. However, the AFO is expected to be the ontology best fitting the dataset, as it has the highest number of annotated tokens while not having the highest number of classes compared to the other ontologies. This indicates an intersection of the topics represented in the text dataset and the AFO.

Plotting the rate of annotated tokens against the min_count parameters, as in Fig.  5 , the largest jump in the rate occurs between min_count \(=1\) and min_count \(=2\) .

Figure 5: Rate of annotated tokens for different min_count parameters

Taking into account the number of tokens found in each ontology, the AFO contains the most tokens for every min_count . Thus, the AFO is deemed the most fitting of the five ontologies for describing the knowledge domain contained in the text dataset and is accordingly chosen as the ontology to be extended by the method elucidated in Sect.  3.3 .

Word2Vec models are trained on the token sets obtained with min_count parameters in the range min_count \(=[1...25]\) . Then, class labels from the AFO that are also contained in the token set are used as input to determine the most similar words. As the similarity of the words is determined by the cosine similarity, thresholds can be set to confine the number of output words with regard to their similarity to the input word. A maximum of five output words per input word is set, and the threshold is varied in the range [0.8, ..., 0.999]. As some words are contained in multiple output sets for different input words, the number of unique tokens generated by Word2Vec is calculated by counting each word generated as a class candidate of the ontology only once. With the AFO as the ontology to be extended, Fig.  6 shows the number of unique tokens found for different min_count parameters and different cosine similarity thresholds.

Figure 6: Number of unique tokens output by Word2Vec for classes of the AFO, with different min_count parameters and cosine similarity thresholds varied between [0.8, ..., 0.999]

While the cosine similarity threshold has an impact on the number of unique tokens generated for low min_count , the effect seems to be mitigated for thresholds in the range [0.8, ..., 0.995] and min_count \(>5\) . Using different min_count values and a cosine similarity threshold of 0.999, the AFO is extended automatically by the new classes suggested by the Word2Vec model. The new classes are furthermore annotated with the respective textual definitions obtained from the classes and concepts of the other semantic artifacts presented in Sect.  2.2 . Object properties conceptually related to are asserted, pointing to the respective ontology classes already contained in the AFO before the extension.

Table  3 lists the resulting numbers of new classes inserted into the AFO, obtained by setting the cosine similarity threshold to 0.999 and applying different min_count parameters in the range [1, ..., 25]. In addition, the number of annotated new classes is listed, along with the number of textual definitions per source semantic artifact. Here, a min_count of 10 appears the most promising, as the number of new classes (91) and the number of annotated new classes (68) are highest. Thus, the AFO is extended by 91 classes created automatically based on the text dataset. Of these new classes, 68 are annotated based on the other semantic artifacts, achieving an annotation rate of \(68/91 = 74.73\%\) . Of these 68 annotated new classes, 6 are annotated based on BAO class definitions, 7 based on CHEBI, 3 based on CHMO, and 9 based on SBO classes. Furthermore, 28 classes are annotated based on IUPAC-Goldbook concepts and 58 based on the NCIT. The sum of these annotations is greater than 68, indicating multiple annotations for some new classes in the extended AFO.

The automatically added classes are concepts taken from the text dataset; thus, they may be used to describe the context represented in the 28 scientific texts. Furthermore, the semantic artifacts chosen in this publication all deal in some way with the domain of chemistry or are at least situated in natural sciences dealing with chemical substances. Thus, the annotation of the classes is assumed to be in the correct domain, as the source of the annotation is already situated close to the needed domain of knowledge. As the annotations often vary only in small details, the decision on the (re-)use of specific ontology classes should be made by domain experts.

Figure 7: Visualization of the class hierarchy of the new class flow in Protégé. The class flow and its relations conceptually related to to existing classes were created automatically by the workflow with min_count = 10 and cosine similarity threshold = 0.999. Solid blue arrows indicate the relation has subclass ; dashed orange arrows denote the relation conceptually related to

To provide an example of the resulting extension, Protégé is used to visualize the resulting ontology. Figure  7 shows the class hierarchy of the already contained AFO classes concentration and rate , using blue arrows for the hierarchical relation has subclass .

The new class flow is inserted by the workflow as a subclass of w2vConcept and is assigned the relation conceptually related to (denoted by dashed orange arrows), connecting it to the classes concentration and rate .

Furthermore, the new class flow is annotated with the textual definition of the concept flow found in the NCIT. The resulting annotations of the class flow are depicted in Fig.  8 . The first entry contains the label of the class, while the next two entries point to the word inputs that led to the generation of the class. The bottommost entry contains a textual definition found in the NCIT. The remark ‘Found in [NCIT]’ gives the link to the underlying class of the ontology, allowing for later reuse of the respective entity. As the new classes are generated automatically, an arbitrary number of such rdfs:comment entries can be assigned to a class, but only one rdfs:label is assigned.

Thus, an existing ontology can be extended automatically by concepts based on scientific texts. After extension of the ontology, an evaluation by domain experts should be conducted, as not every resulting definition and relation might be correct.

This in turn can be used for an automated, ontology-aligned annotation of research data: when a researcher uploads their research data and corresponding textual documentation to a database, the workflow presented in this work can be used to automatically choose the best fitting ontology and extend it. The extended ontology could then in turn be used to annotate the previously uploaded research data, linking data entries with relations as posed in the textual documentation.

Figure 8: Annotations of the new class flow visualized in Protégé for later review by domain experts

5 Summary and outlook

5.1 Conclusion

Ontologies are used to describe knowledge in an explicit and machine-readable way, while still being human-readable. Thus, they are used to model knowledge and semantic relations between data and concepts of scientific knowledge domains.

In this contribution, a method is set up to automatically make use of natural language processing (NLP) techniques to extract the concepts contained in a text dataset in order to extend existing ontologies with those concepts relevant to a domain of knowledge. A search for textual concept definitions from different sources, such as different ontologies and thesauri, allows for automated annotation of the concepts found. This also helps in picking the right ontology to be extended in the second part of the workflow, where an ontology is extended by new classes based on the text dataset. Different word vectorization models using Word2Vec are trained based on different minimum numbers of occurrences of the tokens within the preprocessed text dataset ( min_count ) and used to suggest new classes and relations between them. Finally, the classes are annotated with textual definitions based on other ontologies and thesauri, where possible.

This workflow allows for the automated extension of ontologies by classes contained as concepts in a text dataset. A text dataset of 28 papers on the topic of catalytic methanation of CO \(_2\) , five ontologies, and two thesauri are used as a proof of concept. While the use of a low min_count parameter results in a higher number of suggested new classes, it also allows the integration of concepts less important to the domain of knowledge, as the lower rates of annotated tokens suggest. Using a min_count parameter of 10, the Allotrope Foundation Ontology (AFO) is extended automatically by 91 new classes obtained from the text dataset. Of these classes, 68 are provided automatically with at least one textual definition based on the other semantic artifacts (i.e., the other ontologies and thesauri) provided.

This workflow can easily be adapted to other ontologies and text datasets in order to extend existing ontologies. Additionally, the database of semantic artifacts can be set up with a larger number of ontologies and thesauri. While this can be adjusted quickly, the use of other definition databases such as Wikidata would require some code adjustments.

5.2 Limitations and future work

The workflow only uses single-word tokens and thus is only able to search for and add single-word classes to the ontology. Detecting multi-word concepts with the presented workflow is not yet possible, but desirable, as ontology classes often consist of more than one word. In the future, manipulation of the applied POS-tagging is planned to mitigate this limitation. For example, neighboring noun tokens could be combined into one class, such as “flow rate”, as could pairs of neighboring adjectives and nouns, like “catalytic reaction”. Furthermore, more sophisticated methods such as named entity recognition (NER) [ 28 ] could be used. However, this method requires the pre-definition of categories. While such categories are readily available for general purposes, catalysis-related categories for NER are, to the best knowledge of the authors, yet to be implemented.

The second major limitation of the presented workflow is the missing refinement of the conceptually related to relation used to link existing and newly created classes. The relationships could not be refined further because the semantic relation of the concepts is not appropriately given by Word2Vec. For example, no distinction is made between a hierarchical relationship and an object property. This is also due to the fact that only nouns are included as classes in the ontology; verbs and adjectives, which would be the more fitting candidates for ontology properties and relations, are not considered. In future work, relationship extraction and entity linking, e.g., the Radboud Entity Linker  [ 29 ], could be used to develop more sophisticated relationship extraction. After extracting the relations, linking them to already existing ontology relationships is also in scope.

To evaluate the usefulness of the workflow, an evaluation by domain experts should be conducted to quantify the number of valuable classes and relations generated automatically by the workflow. Extending an ontology from textual input as shown in this work will also help domain experts in the future to automatically annotate research data when uploading a set of research data together with a corresponding paper to a research database.

6 Supplementary information

The code developed in this work is available in a GitHub repository here: https://github.com/TUDoAD/NLP-Based-Ontology-Extender .

The pre-processed pdf-files and the ontology files are available in a zenodo repository here: https://zenodo.org/record/7956870 .

Wulf C, Beller M, Boenisch T, Deutschmann O, Hanf S, Kockmann N, Kraehnert R, Oezaslan M, Palkovits S, Schimmler S, Schunk SA, Wagemann K, Linke D (2021) A unified research data infrastructure for catalysis research-challenges and concepts. ChemCatChem 13(14):3223–3236. https://doi.org/10.1002/cctc.202001974


Wilkinson MD, Dumontier M, Aalbersberg IJ, Appleton G, Axton M, Baak A, Blomberg N, Boiten J-W, da Silva Santos LB, Bourne PE, Bouwman J, Brookes AJ, Clark T, Crosas M, Dillo I, Dumon O, Edmunds S, Evelo CT, Finkers R, Gonzalez-Beltran A, Gray AJG, Groth P, Goble C, Grethe JS, Heringa J, ’t Hoen PAC, Hooft R, Kuhn T, Kok R, Kok J, Lusher SJ, Martone ME, Mons A, Packer AL, Persson B, Rocca-Serra P, Roos M, van Schaik R, Sansone S-A, Schultes E, Sengstag T, Slater T, Strawn G, Swertz MA, Thompson M, van der Lei J, van Mulligen E, Velterop J, Waagmeester A, Wittenburg P, Wolstencroft K, Zhao J, Mons B (2016) The fair guiding principles for scientific data management and stewardship. Sci Data 3(1):160018. https://doi.org/10.1038/sdata.2016.18

Strömert P, Hunold J, Castro A, Neumann S, Koepler O (2022) Ontologies4chem: the landscape of ontologies in chemistry. Pure Appl Chem 94(6):605–622. https://doi.org/10.1515/pac-2021-2007

Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220. https://doi.org/10.1006/knac.1993.1008

National Cancer Institute: National Cancer Institute Thesaurus. https://ncit.nci.nih.gov (2022)

Grühn J, Behr AS, Eroglu TH, Trögel V, Rosenthal K, Kockmann N (2022) From coiled flow inverter to stirred tank reactor—bioprocess development and ontology design. Chem Ing Tec 94(6):852–863. https://doi.org/10.1002/cite.202100177

Menke MJ, Behr AS, Rosenthal K, Linke D, Kockmann N, Bornscheuer UT, Dörr M (2022) Development of an ontology for biocatalysis. Chemie Ingenieur Technik 94(11):1827–1835. https://doi.org/10.1002/cite.202200066

Asim MN, Wasim M, Khan MUG, Mahmood W, Abbasi HM (2018) A survey of ontology learning techniques and applications. Database. https://doi.org/10.1093/database/bay101

Dal A, Maria J (2012) Simple method for ontology automatic extraction from documents. Int J Adv Comput Sci Appl. https://doi.org/10.14569/ijacsa.2012.031206

Opasjumruskit K, Peters D, Schindler S (2020) DSAT: Ontology-based information extraction on technical data sheets. ISWC 2020, 2–6, Nov. 2020. https://ceur-ws.org/Vol-2721/paper563.pdf

Opasjumruskit K, Böning S, Schindler S, Peters D (2022) OntoHuman: ontology-based information extraction tools with human-in-the-loop interaction. In: International conference on cooperative design, visualization and engineering. Springer, Berlin, pp 68–74

Opasjumruskit K (2020) NLP for ontology development-a use case in spacecraft design domain. https://elib.dlr.de/136233/

Horsch M, Petrenko T, Kushnarenko V, Schembera B, Wentzel B, Behr A, Kockmann N, Schimmler S, Bönisch T (2022) Interoperability and architecture requirements analysis and metadata standardization for a research data infrastructure in catalysis. In: Pozanenko A, Stupnikov S, Thalheim B, Mendez E, Kiselyova N (eds) Data analytics and management in data intensive domains. Springer, Cham, pp 166–177. https://doi.org/10.1007/978-3-031-12285-9_10


NFDI4Cat: Ontology collection of NFDI4Cat. https://nfdi4cat.org/en/services/ontology-collection (2022)

Allotrope Foundation: Allotrope Foundation Ontology. https://www.allotrope.org/ontologies (2022)

Hastings J, Owen G, Dekker A, Ennis M, Kale N, Muthukrishnan V, Turner S, Swainston N, Mendes P, Steinbeck C (2015) ChEBI in 2016: improved services and an expanding collection of metabolites. Nucleic Acids Res 44(D1):1214–9

Batchelor C (2022) Chemical methods ontology. http://purl.obolibrary.org/obo/chmo.owl

Abeyruwan S, Vempati UD, Küçük-McGinty H, Visser U, Koleti A, Mir A, Sakurai K, Chung C, Bittker JA, Clemons PA, Brudz S, Siripala A, Morales AJ, Romacker M, Twomey D, Bureeva S, Lemmon V, Schürer SC (2014) Evolving BioAssay ontology (BAO): modularization, integration and applications. J Biomed Semant. https://doi.org/10.1186/2041-1480-5-s1-s5

Nguen T, Karr J, Sheriff R (2022) Systems biology ontology. http://biomodels.net/SBO/

Gold V (ed.) (2019) The IUPAC compendium of chemical terminology. International Union of Pure and Applied Chemistry (IUPAC). https://doi.org/10.1351/goldbook

Musen MA (2015) The protégé project: a look back and a look forward. AI Matters 1(4):4–12. https://doi.org/10.1145/2757001.2757003

Řehůřek R, Sojka P (2010) Software framework for topic modelling with large corpora. pp 45–50. https://doi.org/10.13140/2.1.2393.1847

Pennington J, Socher R, Manning C.D (2014) Glove: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp 1532–1543. http://www.aclweb.org/anthology/D14-1162

Adobe Inc (2022) Adobe Acrobat Pro PDF-reader, version 22.003.20258. https://www.adobe.com/acrobat.html

Shinyama Y (2007) PDFMiner—Python PDF Parser. https://github.com/euske/pdfminer

Honnibal M, Montani I, Van Landeghem S, Boyd A (2020) spaCy: industrial-strength natural language processing in Python. https://doi.org/10.5281/zenodo.1212303

Lamy J-B (2017) Owlready: ontology-oriented programming in python with automatic classification and high level constructs for biomedical ontologies. Artif Intell Med 80:11–28. https://doi.org/10.1016/j.artmed.2017.07.002

Nadeau D, Sekine S (2007) Named entities: recognition, classification and use. Lingvist Investig 30(1):3–26. https://doi.org/10.1075/li.30.1.03nad

van Hulst JM, Hasibi F, Dercksen K, Balog K, de Vries AP (2020) Rel: an entity linker standing on the shoulders of giants. In: Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. SIGIR’20. ACM


Acknowledgements

The authors thank the Deutsche Forschungsgemeinschaft (DFG) within the Nationale Forschungsdateninfrastruktur (NFDI) initiative (Grant No.: NFDI/2-1-2021) for funding part of this work as well as the fruitful discussions with researchers in NFDI4Cat. A.S.B. thanks the networking program “Sustainable Chemical Synthesis 2.0” (SusChemSys 2.0) for the support and fruitful discussions across disciplines.

Open Access funding enabled and organized by Projekt DEAL. The Deutsche Forschungsgemeinschaft (Grant No.: NFDI/2-1-2021) funded part of this work.

Author information

Authors and affiliations

Laboratory of Equipment Design, Faculty of Biochemical and Chemical Engineering, TU-Dortmund University, Emil-Figge-Straße 68, 44139, Dortmund, North-Rhine-Westphalia, Germany

Alexander S. Behr, Marc Völkenrath & Norbert Kockmann


Corresponding author

Correspondence to Alexander S. Behr .

Ethics declarations

Conflict of interest

The authors declare no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: References of text dataset

See Table  4 .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Behr, A.S., Völkenrath, M. & Kockmann, N. Ontology extension with NLP-based concept extraction for domain experts in catalytic sciences. Knowl Inf Syst 65 , 5503–5522 (2023). https://doi.org/10.1007/s10115-023-01919-1


Received : 09 January 2023

Revised : 23 May 2023

Accepted : 21 June 2023

Published : 15 July 2023

Issue Date : December 2023

DOI : https://doi.org/10.1007/s10115-023-01919-1


  • Natural language processing
  • Automated ontology annotation
  • Information extraction
  • CO \(_2\) methanation
  • Catalytic conversion

Research Papers for NLP Beginners

Read research papers on neural models, word embedding, language modeling, and attention & transformers.


If you’re new to the world of data and have a particular interest in NLP (Natural Language Processing), you’re probably looking for resources to help grasp a better understanding. 

You have probably come across so many different research papers and are sitting there confused about which one to choose. Because let’s face it, they’re not short and they do consume a lot of brain power. So it would be smart to choose the right one that will benefit your path to mastering NLP. 

I have done some research and collected a few NLP research papers that come highly recommended for newcomers to NLP and for building overall NLP knowledge.

I have broken the list into sections so you can find exactly what you want.

Machine Learning and NLP

  Text Classification from Labeled and Unlabeled Documents using EM by Kamal Nigam et al., 1999

This paper shows how you can improve the accuracy of learned text classifiers by augmenting a small number of labeled training documents with a large pool of unlabeled documents, combining a naive Bayes classifier with Expectation-Maximization (EM).
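
To make the idea concrete, here is a minimal sketch of the EM-style loop using scikit-learn's naive Bayes (the classifier the paper builds on). The toy corpus is made up, and hard pseudo-labels stand in for the paper's probabilistic class memberships:

```python
# A minimal sketch of the EM idea, not the paper's exact algorithm:
# fit naive Bayes on the labeled documents, then alternate between
# labeling the unlabeled pool (E-step) and retraining on everything
# (M-step). Hard pseudo-labels are used here for brevity.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

labeled_docs = ["great acting, great film", "dull and terrible"]  # toy data
labels = np.array([1, 0])
unlabeled_docs = ["what a fantastic film", "terrible, do not watch"]

vec = CountVectorizer()
X_lab = vec.fit_transform(labeled_docs)
X_unlab = vec.transform(unlabeled_docs)

clf = MultinomialNB().fit(X_lab, labels)       # initial model: labeled data only
for _ in range(5):
    pseudo = clf.predict(X_unlab)              # E-step: label the unlabeled pool
    clf = MultinomialNB().fit(vstack([X_lab, X_unlab]),
                              np.concatenate([labels, pseudo]))  # M-step

print(clf.predict(vec.transform(["fantastic acting"])))
```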

  Beyond Accuracy: Behavioral Testing of NLP Models with CheckList by Marco Tulio Ribeiro et al., 2020

In this paper, you will learn about CheckList, a task-agnostic methodology for testing NLP models; it was motivated by the observation that the most commonly used evaluation approaches, such as held-out accuracy, overestimate the performance of NLP models.
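
As a flavor of what behavioral testing looks like, here is a toy Minimum Functionality Test in the CheckList spirit, written in plain Python rather than with the CheckList library itself; model_predict is a hypothetical stand-in for your own model:

```python
# A toy "minimum functionality test": templated inputs with expected
# labels, run against a stand-in model. Real CheckList generates far
# richer templates and also covers invariance and directional tests.
def model_predict(text: str) -> str:          # hypothetical model wrapper
    return "positive" if "love" in text else "negative"

template = "I {verb} this movie."
expectations = {"love": "positive", "adore": "positive", "hate": "negative"}

failures = []
for word, want in expectations.items():
    text = template.format(verb=word)
    got = model_predict(text)
    if got != want:
        failures.append((text, want, got))

for text, want, got in failures:
    print(f"FAIL: {text!r} expected {want}, got {got}")
print(f"failure rate: {len(failures)}/{len(expectations)}")
```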

Neural Models

  Natural Language Processing (almost) from Scratch by Ronan Collobert et al., 2011

In this paper, you will go through the foundations of NLP - as the title says, it is ALMOST from scratch. Topics include named entity recognition, semantic role labeling, network architectures, training, and more.

  Understanding LSTM Networks by Christopher Olah, 2015

Neural networks are a major part of NLP, so having a good understanding of them will benefit you in the long run. This piece focuses on LSTM networks, which are widely used.
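
If you want to poke at an LSTM while reading, a few lines of PyTorch show the interface behind the diagrams Olah walks through (the shapes here are arbitrary):

```python
# An LSTM carries a hidden state and a cell state across time steps;
# PyTorch returns the per-step outputs plus both final states.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 10, 8)                  # (batch, time steps, features)
output, (h_n, c_n) = lstm(x)               # per-step outputs + final states
print(output.shape)                        # torch.Size([4, 10, 16])
print(h_n.shape, c_n.shape)                # final hidden and cell states
```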

Word/Sentence Representation and Embedding

  Distributed Representations of Words and Phrases and their Compositionality by Tomas Mikolov et al., 2013

Written by Mikolov and his co-authors, who introduced the Skip-gram model for learning high-quality vector representations of words from large amounts of unstructured text, this paper presents several extensions of the original Skip-gram model, including subsampling of frequent words and negative sampling.
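
A quick way to experiment with Skip-gram and negative sampling is gensim; this sketch assumes gensim 4.x and uses a toy corpus:

```python
# Training Skip-gram with negative sampling on a toy corpus.
from gensim.models import Word2Vec

sentences = [["natural", "language", "processing", "is", "fun"],
             ["word", "vectors", "capture", "word", "meaning"]]
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1,
                 sg=1,          # sg=1 selects Skip-gram (sg=0 is CBOW)
                 negative=5)    # 5 negative samples per positive pair
print(model.wv.most_similar("language", topn=3))
```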

  Distributed Representations of Sentences and Documents by Quoc Le and Tomas Mikolov, 2014

After discussing two major weaknesses of bag-of-words models (they lose word order and ignore word semantics), the authors introduce Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents.
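
Paragraph Vector is implemented in gensim as Doc2Vec; here is a minimal sketch, again assuming gensim 4.x and toy documents:

```python
# Each tagged document gets its own trainable vector alongside the word
# vectors, and new text can be embedded afterwards with infer_vector.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(["a", "great", "film"], tags=["d0"]),
        TaggedDocument(["a", "dull", "film"], tags=["d1"])]
model = Doc2Vec(docs, vector_size=50, min_count=1, epochs=40)
print(model.infer_vector(["a", "good", "film"])[:5])  # fixed-length vector
```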

Language Modelling

  Language Models are Unsupervised Multitask Learners by Alec Radford et al., 2019

Natural language processing tasks are normally approached with supervised learning on task-specific datasets. This paper shows that a language model (GPT-2) begins to learn many of these tasks without any explicit supervision, making multitask learning through plain language modelling a promising framework for improving general performance in NLP.
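
You can see the paper's central claim for yourself by prompting GPT-2 with a task description and no fine-tuning. This sketch uses the Hugging Face pipeline; the weights download on first run, and the translation prompt is just an illustrative example, not from the paper:

```python
# Zero-shot task specification: the task lives entirely in the prompt.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "Translate English to French:\nsea otter => loutre de mer\ncheese =>"
out = generator(prompt, max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])
```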

  The Unreasonable Effectiveness of Recurrent Neural Networks by Andrej Karpathy, 2015

This piece goes back to the basics of recurrent neural networks and shows why they are so effective and flexible, with code examples and generated text samples to give you a better understanding.
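
The post's character-level models are easy to skeleton out yourself; here is a bare-bones, untrained char-RNN in PyTorch (this is not Karpathy's code, and the vocabulary size is made up):

```python
# A character-level RNN skeleton: embed characters, run them through a
# recurrent layer, and project to next-character logits.
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.RNN(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)   # next-character logits

    def forward(self, x, h=None):
        out, h = self.rnn(self.embed(x), h)
        return self.head(out), h

model = CharRNN(vocab_size=65)                      # e.g., a small charset
logits, h = model(torch.randint(0, 65, (1, 32)))    # (batch=1, 32 chars)
print(logits.shape)                                 # torch.Size([1, 32, 65])
```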

Attention & Transformers

  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding by Jacob Devlin et al., 2019

As you’re learning about machine learning, you have probably heard of BERT (Bidirectional Encoder Representations from Transformers). It is widely used and known for pre-training deep bidirectional representations from unlabeled text. In this paper, you will learn how BERT is pre-trained and how it can be fine-tuned for a wide range of tasks with just one additional output layer.
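
As a starting point for that fine-tuning recipe, this sketch loads BERT with a fresh classification head via the Hugging Face transformers library; the texts and labels are toy placeholders:

```python
# Load pre-trained BERT plus a new, randomly initialized classification
# head, then compute the loss and gradients for one toy batch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)     # fresh head on top of BERT

batch = tok(["a great read", "not worth it"], padding=True,
            return_tensors="pt")
out = model(**batch, labels=torch.tensor([1, 0]))
out.loss.backward()                        # gradients for one training step
```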

  Attention is All You Need by Ashish Vaswani et al., 2017

This paper introduces the Transformer, an architecture based solely on attention mechanisms, in contrast to the complex recurrent or convolutional neural networks that dominated sequence modelling before it. You will also learn how the Transformer generalizes well to other tasks and is often the better option.
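
The paper's core operation fits in a few lines; here it is, scaled dot-product attention written directly in PyTorch (single-head, no masking, arbitrary shapes):

```python
# softmax(Q K^T / sqrt(d_k)) V, the building block of the Transformer.
import math
import torch

def attention(q, k, v):
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 5, 64)          # (batch, sequence length, d_k)
print(attention(q, k, v).shape)            # torch.Size([2, 5, 64])
```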

  HuggingFace's Transformers: State-of-the-art Natural Language Processing by Thomas Wolf et al., 2020

Want to learn more about the Transformer, which has become the dominant architecture for natural language processing? In this paper, you will learn about the design of the transformers library and how it facilitates the distribution and use of pre-trained models.
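
The library's pipeline interface makes that distribution point tangible; this one-liner pulls down a default sentiment model on first use:

```python
# One line to load a pre-trained model and run it on raw text.
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("Reading these papers was well worth it."))
```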

Wrapping Up

As I said above, I don’t want to overwhelm you with too many research papers, so I have kept the list short.

If you know of any others that beginners may benefit from, please drop them in the comments so that readers can find them. Thank you!

Nisha Arya is a Data Scientist and freelance technical writer. She is particularly interested in providing Data Science career advice, tutorials, and theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence can benefit the longevity of human life. A keen learner, she seeks to broaden her tech knowledge and writing skills while helping guide others.

