language detection using nlp research paper

Language Detection Using Natural Language Processing


Language Identification Using Multinomial Naive Bayes Technique

Conference paper
First Online: 03 January 2024

Parul Mangla, Gurpreet Singh, Nitish Pathak & Sunil Chawla

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 786)

Included in the following conference series:

International Conference on Data Analytics & Management

Language detection is a significant task in natural language processing (NLP), with applications such as machine translation, text summarization, and sentiment analysis. In this paper, we propose using the Multinomial Naive Bayes (MNB) algorithm for language detection. MNB is widely used in NLP and is effective in various text classification tasks, including language detection. We trained the algorithm on a dataset of texts written in different languages. The dataset was preprocessed to extract features and remove stop words. The MNB algorithm was implemented using the scikit-learn library in Python. The algorithm was first trained on the training set and then evaluated on the testing set, and its performance was measured by accuracy. This paper is organized into five sections: Sect. 1 is the introduction, Sect. 2 covers the literature review and research gap, Sect. 3 presents the implementation and discussion, Sect. 4 reports the results, and Sect. 5 concludes with future work.
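The abstract's train-then-test workflow can be illustrated with a minimal sketch. The paper reports using scikit-learn; the version below swaps in a small standard-library implementation of Multinomial Naive Bayes so that it runs without dependencies. The class name `TinyMultinomialNB`, the character-bigram features, and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=2):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyMultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter(labels)        # label -> number of documents
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text))
        self.vocab = {g for counts in self.gram_counts.values() for g in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.gram_counts.items():
            # Log prior of the class plus log likelihood of each known n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams never seen in training are skipped
                    score += math.log((counts[gram] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training set (assumed data, far smaller than any real corpus).
train_texts = [
    "the quick brown fox jumps over the lazy dog",
    "this is a short sentence written in english",
    "el rápido zorro marrón salta sobre el perro perezoso",
    "esta es una frase corta escrita en español",
    "le renard brun rapide saute par-dessus le chien paresseux",
    "ceci est une phrase courte écrite en français",
]
train_labels = ["en", "en", "es", "es", "fr", "fr"]

# Toy testing set, kept separate from the training sentences.
test_texts = ["the dog is lazy", "el zorro es perezoso", "le chien est paresseux"]
test_labels = ["en", "es", "fr"]

model = TinyMultinomialNB().fit(train_texts, train_labels)
predictions = [model.predict(t) for t in test_texts]
accuracy = sum(p == y for p, y in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

Character n-grams are used here because they need no tokenizer and capture spelling patterns that differ between languages. With scikit-learn, a comparable pipeline would typically pair a count-based vectorizer with `MultinomialNB`, but the exact features and preprocessing the authors used are not stated in the abstract.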

Author information

Authors and affiliations.

Chitkara University Institute of Engineering and Technology, Chitkara University, Chandigarh, Punjab, India

Parul Mangla, Gurpreet Singh & Sunil Chawla

Bhagwan Parshuram Institute of Technology (BPIT), Guru Gobind Singh Indraprastha University (GGSIPU), New Delhi, India

Nitish Pathak

Corresponding author

Correspondence to Parul Mangla .

Editor information

Editors and affiliations.

Department of Information Technology, Bhagwan Parshuram Institute of Technology, New Delhi, Delhi, India

Abhishek Swaroop

Jan Wyżykowski University, Polkowice, Poland

Zdzislaw Polkowski

Polytechnic Institute of Portalegre, Portalegre, Portugal

Sérgio Duarte Correia

Centre for Communications Technology, London Metropolitan University, London, UK

About this paper

Cite this paper

Mangla, P., Singh, G., Pathak, N., Chawla, S. (2024). Language Identification Using Multinomial Naive Bayes Technique. In: Swaroop, A., Polkowski, Z., Correia, S.D., Virdee, B. (eds) Proceedings of Data Analytics and Management. ICDAM 2023. Lecture Notes in Networks and Systems, vol 786. Springer, Singapore. https://doi.org/10.1007/978-981-99-6547-2_24

DOI: https://doi.org/10.1007/978-981-99-6547-2_24

Published: 03 January 2024

Publisher Name: Springer, Singapore

Print ISBN: 978-981-99-6546-5

Online ISBN: 978-981-99-6547-2

eBook Packages: Intelligent Technologies and Robotics (R0)

Language Identification

123 papers with code • 6 benchmarks • 19 datasets

Language identification is the task of determining the language of a text.

Most implemented papers

The WiLI Benchmark Dataset for Written Language Identification

This paper describes the WiLI-2018 benchmark dataset for monolingual written natural language identification.

SpeechBrain: A General-Purpose Speech Toolkit

SpeechBrain is an open-source and all-in-one speech toolkit.

Scaling Speech Technology to 1,000+ Languages

Expanding the language coverage of speech technology has the potential to improve access to information for many more people.

GlotLID: Language Identification for Low-Resource Languages

cisnlp/glotlid • 24 Oct 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages.

Universal Dependency Parsing for Hindi-English Code-switching

irshadbhat/nsdp-cs • NAACL 2018

We present a treebank of Hindi-English code-switching tweets under Universal Dependencies scheme and propose a neural stacking model for parsing that efficiently leverages part-of-speech tag and syntactic tree annotations in the code-switching treebank and the preexisting Hindi and English treebanks.

Predicting the Type and Target of Offensive Posts in Social Media

idontflow/olid • NAACL 2019

In particular, we model the task hierarchically, identifying the type and the target of offensive messages in social media.

SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval)

We present the results and the main findings of SemEval-2019 Task 6 on Identifying and Categorizing Offensive Language in Social Media (OffensEval).

Word-level Embeddings for Cross-Task Transfer Learning in Speech Processing

Recent breakthroughs in deep learning often rely on representation learning and knowledge transfer.

Common Voice: A Massively-Multilingual Speech Corpus

To our knowledge this is the largest audio corpus in the public domain for speech recognition, both in terms of number of hours and number of languages.

VoxLingua107: a Dataset for Spoken Language Recognition

Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.

Computer Science > Computation and Language

Title: Efficient Methods for Natural Language Processing: A Survey

Abstract: Recent work in natural language processing (NLP) has yielded appealing results from scaling model parameters and training data; however, using only scale to improve performance means that resource consumption also grows. Such resources include data, time, storage, or energy, all of which are naturally limited and unevenly distributed. This motivates research into efficient methods that require fewer resources to achieve similar results. This survey synthesizes and relates current methods and findings in efficient NLP. We aim to provide both guidance for conducting NLP under limited resources, and point towards promising research directions for developing more efficient methods.


COMMENTS

Language Detection Using Natural Language Processing
NLP gives computers the ability to understand human language and respond correctly, performing language detection for us. The current paper provides a summary of developments in natural language processing, including analysis, establishment, various areas of rapid advancement in natural language processing research, development tools, and techniques.
(PDF) Language Detection Using Natural Language Processing
Natural Language Processing (NLP) is a technique for processing languages and transforming them into forms that the user can readily process or interpret. NLP is a method of computer ...
Natural language processing: state of the art, current trends and
Natural language processing (NLP) has recently gained much attention for representing and analyzing human language computationally. It has spread its applications in various fields such as machine translation, email spam detection, information extraction, summarization, medical, and question answering etc. In this paper, we first distinguish four phases by discussing different levels of NLP ...
PDF Language Identification from Text Documents
language is spoken by two geographically disconnected groups of people (e.g., Portuguese spoken in Portugal and Brazil). We experimented with both word and character n-grams. The character n-grams turned out to be particularly useful when differentiating between two languages using mostly distinct character sequences in their alphabet.
Machine-Generated Text Detection using Deep Learning
Our research focuses on the crucial challenge of discerning text produced by Large Language Models (LLMs) from human-generated text, which holds significance for various applications. With ongoing discussions about attaining a model with such functionality, we present supporting evidence regarding the feasibility of such models. We evaluated our models on multiple datasets, including Twitter ...
Application of Natural Language Processing (NLP) in Detecting and
According to the findings of this research work, NLP could help in the early detection of individuals who have suicide ideation and allow timely implementation of preventive measures. It is also found that passive surveillance via mobile applications, online activity, and social media is feasible and may help in the early diagnosis and ...
Language Identification
OffMix-3L: A Novel Code-Mixed Dataset in Bangla-English-Hindi for Offensive Language Identification. languagetechnologylab/offmix-3l • 27 Oct 2023. Code-mixing is a well-studied linguistic phenomenon when two or more languages are mixed in text or speech.
An Anatomization of Language Detection and Translation using NLP
An Anatomization of Language Detection and Translation using NLP Techniques. The issue with identifying language relates to process of ...
Language Identification
alumae/torch-xvectors-wav • 25 Nov 2020. Speech activity detection and speaker diarization are used to extract segments from the videos that contain speech.
A systematic review of hate speech automatic detection using natural
With the development in natural language processing (NLP) technology, much research has been done concerning automatic textual hate speech detection in recent years. A couple of renowned competitions (e.g., SemEval-2019 [191] and 2020 [192], GermEval-2018 [183]) have held various events to find a better solution for automated hate speech ...
Vision, status, and research topics of Natural Language Processing
Research status of NLP. The bibliographic data of NLP scientific papers were retrieved from the Web of Science (WoS) database based on the search query displayed in Table 1. The search terms were selected based on prior reviews on NLP (e.g., Kreimeyer et al., 2017, Pons et al., 2016). This search generated a total of 31,485 NLP papers.
Full article: Detection of Hate Speech using BERT and Hate Speech Word
Word Embedding. Word embedding (Bengio et al. 2003) is a prominent natural language processing (NLP) technique that seeks to convey the semantic meaning of a word. It provides a useful numerical description of the term based on its context. The words are represented by an N-dimensional dense vector that can be used in estimating the similarities between the words in a specific language ...
How to Detect and Translate Languages for NLP Project
First, you import the detect method from langdetect and then pass the text to the method. Output: "sw". The method detects that the text provided is in the Swahili language ('sw'). You can also find the probabilities for the top languages by using the detect_langs method. Output: [sw:0.9999971710531397]
(PDF) Natural Language Processing
Natural language processing is an integral area of computer science in which machine learning and computational linguistics are broadly used. This field is mainly concerned with making the h ...
[PDF] ABUSIVE LANGUAGE DECTECTION USING NLP
Many automated methods using machine learning, deep learning, and natural language processing (NLP) have been developed in the past due to the severe and frequent nature of this activity. This paper provides a thorough summary of the reducing techniques that the research in this field has suggested for detecting offensive content.
A Multi-Stance Detection Method by Fusing Sentiment Features
As a result, this paper develops a multi-stance detection model by fusing sentiment features. First, a five-category stance indicator system is built based on the LDA model, then sentiment features are extracted from the reviews using the sentiment lexicon, and finally, stance detection is implemented using a hybrid neural network model.
SENTIMENT ANALYSIS USING NATURAL LANGUAGE PROCESSING AND ...
SENTIMENT ANALYSIS USING NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING. April 2023. Shu Ju Cai Ji Yu Chu Li/Journal of Data Acquisition and Processing 38 (2):520-526. DOI: 10.5281/zenodo ...