Real Analysis: Recently Published Documents


Sets and Numbers

This chapter covers set theory. The topics include set algebra, relations, orderings and mappings, countability and sequences, real numbers, sequences and limits, and set classes including monotone classes, rings, fields, and sigma fields. The final section introduces the basic ideas of real analysis including Euclidean distance, sets of the real line, coverings, and compactness.

International Conference on Mathematical and Statistical Sciences (ICMSS) 2021, Study Program of Mathematics and Study Program of Statistics, Faculty of Mathematics and Natural Sciences, Universitas Lambung Mangkurat, Banjarbaru, Indonesia, 15-16 September 2021. The International Conference on Mathematical and Statistical Sciences (ICMSS) 2021 was organized through a collaboration between the Study Program of Mathematics and the Study Program of Statistics, Faculty of Mathematics and Natural Sciences, Universitas Lambung Mangkurat (ULM). The theme, “Mathematical and Statistical Sciences in Multidisciplinary Research”, reflects the conference’s aims: to acknowledge, learn, share, and transfer the results of scientific knowledge and research among academics and practitioners who have used or implemented mathematical and statistical sciences to solve real-world problems and improve the quality of life. The scope of the conference covers mathematical modeling, artificial intelligence, mathematical physics, algebra and its applications, statistics and its applications, computational fluid dynamics, data mining and its applications, dynamical nonlinear systems, mathematics education, financial mathematics, mathematical biology, numerical methods and analysis, operations research and optimization, and real analysis. On behalf of the committee, we would like to thank the Rector of Universitas Lambung Mangkurat, the Dean of the Faculty of Mathematics and Natural Sciences, the Coordinators of the Study Programs of Mathematics and Statistics, the advisory board, steering committee, all committee members, reviewers, presenters, and participants. We would also like to thank the Indonesian Mathematical Society (IndoMS), the Indonesian Algebra Society (IAS), and the Forum Pendidikan Tinggi Statistika (Forstat). Special thanks are also given to the Journal of Physics: Conference Series.
We, on behalf of the ICMSS 2021 committee, would like to thank all parties for their participation in supporting this publication. We hope to see you all at the next conference. Kind regards, Dr. Muhammad Ahsar Karim, Chair of the ICMSS 2021. The list of organizing committees, photographs, and the peer review statement are available in this PDF.

Roadmap to glory: scaffolding real analysis for deeper learning

Real analysis: an analysis of students' higher-order thinking skills in constructing binary representations of real numbers.

Higher-order thinking skills (HOTS) are needed to determine students' ability to construct an answer. In this study, the researchers analyzed the higher-order thinking skills of students in the Mathematics Education Study Program as they constructed one of the test answers, namely a binary representation of real numbers, in the Introduction to Real Analysis course. Fifty-two students taking the Introduction to Real Analysis course in the odd semester of 2020/2021 were the subjects of this research. Data were collected using a test and analyzed against the "create" (C6) indicator of higher-order thinking. The students' higher-order thinking skills were found to be in the "sufficient" category, meaning that most students were not yet able to construct and analyze information into the right strategy. The results of this study are expected to serve as a reference for lectures in which students are regularly given HOTS-oriented questions, both in exams and in practice exercises, to help develop higher-order thinking skills.
Keywords: Bloom's taxonomy-C6; higher-order thinking skills; binary representation

On two kinds of the reverse half-discrete Mulholland-type inequalities involving higher-order derivative function

Abstract: By means of weight functions, Hermite–Hadamard's inequality, and the techniques of real analysis, a new, more accurate reverse half-discrete Mulholland-type inequality involving one higher-order derivative function is given. The equivalent statements of the best possible constant factor related to a few parameters, the equivalent forms, and several particular inequalities are provided. Another kind of reverse is also considered.

Real Analysis, Harmonic Analysis and Applications

ε and δ real analysis, mathematical-analytical thinking skills: the impacts and interactions of the open-ended learning method and self-awareness (its application to bilingual test instruments).

Analytical thinking is the skill of unifying the initial process, planning solutions, producing solutions, and drawing conclusions or correct answers. This research aims to 1) determine whether there are differences in students' mathematical-analytical thinking skills between classes that use the open-ended learning method and classes that use the lecturing method, 2) find out whether there are differences in mathematical-analytical thinking skills between students with high, moderate, and low self-awareness, and 3) find out whether there is an interaction between the open-ended learning method and self-awareness with respect to students' mathematical-analytical thinking skills. This research employs a quasi-experimental, mixed-method design with a posttest control group. It was conducted on students who had studied the Real Analysis course. Based on the results of hypothesis testing, it was found that, first, there are differences in students' mathematical-analytical thinking skills between the class that uses the open-ended learning method and the class that uses the lecturing method. Second, there are differences in mathematical-analytical thinking skills between the high, moderate, and low self-awareness groups. Third, there is no interaction between the open-ended learning method and self-awareness with respect to students' mathematical-analytical thinking skills.

Equivalent Properties of Two Kinds of Hardy-Type Integral Inequalities

In this paper, using weight functions as well as employing various techniques from real analysis, we establish a few equivalent conditions of two kinds of Hardy-type integral inequalities with nonhomogeneous kernel. To prove our results, we also deduce a few equivalent conditions of two kinds of Hardy-type integral inequalities with a homogeneous kernel in the form of applications. We additionally consider operator expressions. Analytic inequalities of this nature and especially the techniques involved have far reaching applications in various areas in which symmetry plays a prominent role, including aspects of physics and engineering.


Machine Learning: Algorithms, Real-World Applications and Research Directions

  • Review Article
  • Published: 22 March 2021
  • Volume 2, article number 160 (2021)


  • Iqbal H. Sarker (ORCID: orcid.org/0000-0003-1740-5517)


In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyze these data and develop the corresponding smart and automated applications, knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key. Various types of machine learning algorithms exist in the area, such as supervised, unsupervised, semi-supervised, and reinforcement learning. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyze data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for both academia and industry professionals, as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.


Introduction

We live in the age of data, where everything around us is connected to a data source, and everything in our lives is digitally recorded [21, 103]. For instance, the current electronic world has a wealth of various kinds of data, such as Internet of Things (IoT) data, cybersecurity data, smart city data, business data, smartphone data, social media data, health data, COVID-19 data, and many more. The data can be structured, semi-structured, or unstructured, as discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques", and are increasing day by day. Extracting insights from these data can be used to build various intelligent applications in the relevant domains. For instance, to build a data-driven automated and intelligent cybersecurity system, the relevant cybersecurity data can be used [105]; to build personalized context-aware smart mobile applications, the relevant mobile data can be used [103]; and so on. Thus, data management tools and techniques capable of extracting insights or useful knowledge from data in a timely and intelligent way, on which real-world applications are based, are urgently needed.

Figure 1. The worldwide popularity score of various types of ML algorithms (supervised, unsupervised, semi-supervised, and reinforcement) on a scale of 0 (min) to 100 (max) over time, where the x-axis represents the timestamp and the y-axis the corresponding score

Artificial intelligence (AI), particularly machine learning (ML), has grown rapidly in recent years in the context of data analysis and computing, typically allowing applications to function in an intelligent manner [95]. ML usually provides systems with the ability to learn and improve from experience automatically without being explicitly programmed, and is generally regarded as one of the most popular technologies of the fourth industrial revolution (4IR or Industry 4.0) [103, 105]. "Industry 4.0" [114] is typically the ongoing automation of conventional manufacturing and industrial practices, including exploratory data processing, using new smart technologies such as machine learning automation. Thus, to intelligently analyze these data and to develop the corresponding real-world applications, machine learning algorithms are the key. The learning algorithms can be categorized into four major types: supervised, unsupervised, semi-supervised, and reinforcement learning [75], discussed briefly in Sect. "Types of Real-World Data and Machine Learning Techniques". The popularity of these approaches to learning is increasing day by day, as shown in Fig. 1, based on data collected from Google Trends [4] over the last five years. The x-axis of the figure indicates the dates, and the y-axis shows the corresponding popularity score in the range 0 (minimum) to 100 (maximum). According to Fig. 1, the popularity indication values for these learning types were low in 2015 and have been increasing ever since. These statistics motivate us to study machine learning in this paper, as it can play an important role in the real world through Industry 4.0 automation.

In general, the effectiveness and efficiency of a machine learning solution depend on the nature and characteristics of the data and the performance of the learning algorithms. In the area of machine learning algorithms, classification analysis, regression, data clustering, feature engineering and dimensionality reduction, association rule learning, and reinforcement learning techniques exist to effectively build data-driven systems [41, 125]. Besides, deep learning, which originated from the artificial neural network and is part of a wider family of machine learning approaches, can be used to intelligently analyze data [96]. Thus, selecting a learning algorithm that is suitable for the target application in a particular domain is challenging. The reason is that different learning algorithms serve different purposes, and even the outcomes of learning algorithms in the same category may vary depending on the data characteristics [106]. Thus, it is important to understand the principles of the various machine learning algorithms and their applicability in various real-world application areas, such as IoT systems, cybersecurity services, business and recommendation systems, smart cities, healthcare and COVID-19, context-aware systems, sustainable agriculture, and many more, which are explained briefly in Sect. "Applications of Machine Learning".

Based on the importance and potential of "machine learning" to analyze the data mentioned above, in this paper we provide a comprehensive view of the various types of machine learning algorithms that can be applied to enhance the intelligence and capabilities of an application. Thus, the key contribution of this study is explaining the principles and potential of different machine learning techniques, and their applicability in the various real-world application areas mentioned earlier. The purpose of this paper is, therefore, to provide a basic guide for those in academia and industry who want to study, research, and develop data-driven automated and intelligent systems in the relevant areas based on machine learning techniques.

The key contributions of this paper are listed as follows:

To define the scope of our study by taking into account the nature and characteristics of various types of real-world data and the capabilities of various learning techniques.

To provide a comprehensive view on machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

To discuss the applicability of machine learning-based solutions in various real-world application domains.

To highlight and summarize the potential research directions within the scope of our study for intelligent data analysis and services.

The rest of the paper is organized as follows. The next section presents the types of data and machine learning algorithms in a broader sense and defines the scope of our study. We briefly discuss and explain different machine learning algorithms in the subsequent section followed by which various real-world application areas based on machine learning algorithms are discussed and summarized. In the penultimate section, we highlight several research issues and potential future directions, and the final section concludes this paper.

Types of Real-World Data and Machine Learning Techniques

Machine learning algorithms typically consume and process data to learn the related patterns about individuals, business processes, transactions, events, and so on. In the following, we discuss various types of real-world data as well as categories of machine learning algorithms.

Types of Real-World Data

Usually, the availability of data is considered the key to constructing a machine learning model or data-driven real-world system [103, 105]. Data can take various forms, such as structured, semi-structured, or unstructured [41, 72]. Besides, "metadata" is another type, which typically represents data about data. In the following, we briefly discuss these types of data.

Structured: Structured data have a well-defined structure and conform to a data model following a standard order; they are highly organized, easily accessed, and readily used by an entity or a computer program. Structured data are typically stored in well-defined schemes such as relational databases, i.e., in a tabular format. For instance, names, dates, addresses, credit card numbers, stock information, and geolocation are examples of structured data.

Unstructured: Unstructured data, on the other hand, have no pre-defined format or organization, which makes them much more difficult to capture, process, and analyze; they mostly consist of text and multimedia material. For example, sensor data, emails, blog entries, wikis, word-processing documents, PDF files, audio files, videos, images, presentations, web pages, and many other types of business documents can be considered unstructured data.

Semi-structured: Semi-structured data are not stored in a relational database like the structured data mentioned above, but it does have certain organizational properties that make it easier to analyze. HTML, XML, JSON documents, NoSQL databases, etc., are some examples of semi-structured data.

Metadata: Metadata are not a normal form of data but "data about data". The primary difference between "data" and "metadata" is that data are simply the material that can classify, measure, or document something relative to an organization's data properties, whereas metadata describe the relevant information about those data, giving them more significance for data users. Basic examples of a document's metadata are its author, file size, creation date, and the keywords that describe it.

In the area of machine learning and data science, researchers use various widely used datasets for different purposes. These include, for example, cybersecurity datasets such as NSL-KDD [119], UNSW-NB15 [76], ISCX'12 [1], CIC-DDoS2019 [2], and Bot-IoT [59]; smartphone datasets such as phone call logs [84, 101], SMS logs [29], mobile application usage logs [117, 137], and mobile phone notification logs [73]; IoT data [16, 57, 62]; agriculture and e-commerce data [120, 138]; health data such as heart disease [92], diabetes mellitus [83, 134], and COVID-19 [43, 74]; and many more in various application domains. The data can be of the different types discussed above, which may vary from application to application in the real world. To analyze such data in a particular problem domain, and to extract insights or useful knowledge from the data for building real-world intelligent applications, different types of machine learning techniques can be used according to their learning capabilities, as discussed in the following.

Types of Machine Learning Techniques

Machine Learning algorithms are mainly divided into four categories: Supervised learning, Unsupervised learning, Semi-supervised learning, and Reinforcement learning [ 75 ], as shown in Fig. 2 . In the following, we briefly discuss each type of learning technique with the scope of their applicability to solve real-world problems.

Figure 2. Various types of machine learning techniques

Supervised: Supervised learning is typically the task of machine learning to learn a function that maps an input to an output based on sample input-output pairs [ 41 ]. It uses labeled training data and a collection of training examples to infer a function. Supervised learning is carried out when certain goals are identified to be accomplished from a certain set of inputs [ 105 ], i.e., a task-driven approach . The most common supervised tasks are “classification” that separates the data, and “regression” that fits the data. For instance, predicting the class label or sentiment of a piece of text, like a tweet or a product review, i.e., text classification, is an example of supervised learning.
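As a concrete illustration of this task-driven setup (not taken from the paper), the following minimal sketch learns a function from labeled input-output pairs with a nearest-centroid classifier; the data and class names are invented toy values.

```python
# Minimal supervised-learning sketch: fit a nearest-centroid classifier
# from labeled input-output pairs, then map new inputs to class labels.

def fit_centroids(points, labels):
    """Compute the mean (centroid) of the points in each class."""
    centroids = {}
    for label in set(labels):
        members = [p for p, y in zip(points, labels) if y == label]
        centroids[label] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids

def predict(centroids, point):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: dist2(centroids[label], point))

# Labeled training examples: two clusters on a line.
X = [(1.0,), (2.0,), (8.0,), (9.0,)]
y = ["low", "low", "high", "high"]

model = fit_centroids(X, y)
print(predict(model, (1.5,)))   # a point near the "low" cluster
print(predict(model, (8.5,)))   # a point near the "high" cluster
```

The labeled pairs play the role of the "training examples" above: the goal identified in advance (the class label) drives what is learned.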

Unsupervised: Unsupervised learning analyzes unlabeled datasets without the need for human interference, i.e., a data-driven process [ 41 ]. This is widely used for extracting generative features, identifying meaningful trends and structures, groupings in results, and exploratory purposes. The most common unsupervised learning tasks are clustering, density estimation, feature learning, dimensionality reduction, finding association rules, anomaly detection, etc.
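To make the data-driven character concrete, here is a minimal clustering sketch (an illustration, not from the paper): k-means on unlabeled 1-D data, with fixed initial centers so the run is deterministic.

```python
# Minimal unsupervised-learning sketch: k-means clustering (Lloyd's
# algorithm) on unlabeled 1-D data. No labels are used; structure in the
# data alone determines the groupings.

def kmeans_1d(data, centers, iterations=10):
    """Alternate assignment and center-update steps."""
    for _ in range(iterations):
        clusters = {c: [] for c in centers}
        # Assignment step: each point joins its nearest center.
        for x in data:
            nearest = min(centers, key=lambda c: abs(c - x))
            clusters[nearest].append(x)
        # Update step: each center moves to the mean of its cluster.
        centers = [sum(pts) / len(pts) if pts else c
                   for c, pts in clusters.items()]
    return sorted(centers)

data = [1.0, 1.5, 0.5, 9.0, 9.5, 8.5]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # recovered cluster means
```

The two returned centers summarize the meaningful groupings in the data without any human-provided labels.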

Semi-supervised: Semi-supervised learning can be defined as a hybridization of the above-mentioned supervised and unsupervised methods, as it operates on both labeled and unlabeled data [41, 105]. Thus, it falls between learning "without supervision" and learning "with supervision". In the real world, there are several contexts where labeled data are rare and unlabeled data are numerous, and semi-supervised learning is useful in those contexts [75]. The ultimate goal of a semi-supervised learning model is to produce better predictions than could be obtained from the labeled data alone. Some application areas where semi-supervised learning is used include machine translation, fraud detection, data labeling, and text classification.
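One common way to combine the two data sources is self-training, sketched below on invented toy data (this is an illustration of the idea, not a method from the paper): a simple nearest-labeled-point classifier pseudo-labels the unlabeled point it is most confident about, then retrains including it.

```python
# Minimal semi-supervised sketch (self-training): pseudo-label unlabeled
# points one at a time, starting with the one closest to any labeled point,
# and add them to the labeled set. "Confidence" here is just distance.

def nearest_label(labeled, x):
    """Classify x by its nearest labeled point (1-D)."""
    return min(labeled, key=lambda item: abs(item[0] - x))[1]

def self_train(labeled, unlabeled, rounds=3):
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Pseudo-label the unlabeled point closest to the labeled set.
        x = min(unlabeled, key=lambda u: min(abs(u - lx) for lx, _ in labeled))
        labeled.append((x, nearest_label(labeled, x)))
        unlabeled.remove(x)
    return labeled

labeled = [(0.0, "neg"), (10.0, "pos")]   # the scarce labeled data
unlabeled = [1.0, 2.0, 9.0]               # the plentiful unlabeled data
model = self_train(labeled, unlabeled)
print(nearest_label(model, 2.5))          # decided with help of pseudo-labels
```

The pseudo-labeled points enlarge the training set, which is exactly why the combined model can outperform one trained on the two original labels alone.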

Reinforcement: Reinforcement learning is a type of machine learning algorithm that enables software agents and machines to automatically evaluate the optimal behavior in a particular context or environment to improve their efficiency [52], i.e., an environment-driven approach. This type of learning is based on reward or penalty, and its ultimate goal is to use the insights obtained from interacting with the environment to take actions that increase the reward or minimize the risk [75]. It is a powerful tool for training AI models that can help increase automation or optimize the operational efficiency of sophisticated systems such as robotics, autonomous driving, manufacturing, and supply chain logistics; however, it is not preferable for solving basic or straightforward problems.
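The reward-driven loop can be sketched with tabular Q-learning on a tiny invented environment (a 5-cell corridor with a reward at the right end); the hyperparameters are illustrative, not from the paper.

```python
# Minimal reinforcement-learning sketch: tabular Q-learning. The agent earns
# a reward of 1 only by stepping onto the rightmost cell; everything it
# learns comes from interacting with the environment.
import random

random.seed(0)
N_STATES, ACTIONS = 5, [-1, +1]           # move left or right
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.2     # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy action selection balances exploration/exploitation.
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update toward reward + discounted best future value.
        best_next = max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
        s = s_next

# After training, the greedy policy moves right from every non-goal state.
policy = [max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)]
print(policy)
```

The learned Q-values propagate the reward backward through the states, so the greedy policy heads toward the goal even though only the final step is rewarded.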

Thus, to build effective models in various application areas, different types of machine learning techniques can play a significant role according to their learning capabilities, the nature of the data discussed earlier, and the target outcome. In Table 1, we summarize the various types of machine learning techniques with examples. In the following, we provide a comprehensive view of machine learning algorithms that can be applied to enhance the intelligence and capabilities of a data-driven application.

Machine Learning Tasks and Algorithms

In this section, we discuss various machine learning algorithms that include classification analysis, regression analysis, data clustering, association rule learning, feature engineering for dimensionality reduction, as well as deep learning methods. A general structure of a machine learning-based predictive model has been shown in Fig. 3 , where the model is trained from historical data in phase 1 and the outcome is generated in phase 2 for the new test data.

Figure 3. A general structure of a machine learning-based predictive model considering both the training and testing phases

Classification Analysis

Classification is regarded as a supervised learning method in machine learning; it refers to a problem of predictive modeling in which a class label is predicted for a given example [41]. Mathematically, it learns a function (f) from input variables (X) to output variables (Y), where the outputs are targets, labels, or categories. It can be carried out on structured or unstructured data to predict the class of given data points. For example, spam detection, with classes "spam" and "not spam", in email service providers is a classification problem. In the following, we summarize the common classification problems.

Binary classification: It refers to the classification tasks having two class labels such as “true and false” or “yes and no” [ 41 ]. In such binary classification tasks, one class could be the normal state, while the abnormal state could be another class. For instance, “cancer not detected” is the normal state of a task that involves a medical test, and “cancer detected” could be considered as the abnormal state. Similarly, “spam” and “not spam” in the above example of email service providers are considered as binary classification.

Multiclass classification: Traditionally, this refers to those classification tasks having more than two class labels [ 41 ]. The multiclass classification does not have the principle of normal and abnormal outcomes, unlike binary classification tasks. Instead, within a range of specified classes, examples are classified as belonging to one. For example, it can be a multiclass classification task to classify various types of network attacks in the NSL-KDD [ 119 ] dataset, where the attack categories are classified into four class labels, such as DoS (Denial of Service Attack), U2R (User to Root Attack), R2L (Root to Local Attack), and Probing Attack.

Multi-label classification: In machine learning, multi-label classification is an important consideration where an example is associated with several classes or labels. Thus, it is a generalization of multiclass classification, where the classes involved in the problem are hierarchically structured, and each example may simultaneously belong to more than one class in each hierarchical level, e.g., multi-level text classification. For instance, Google news can be presented under the categories of a “city name”, “technology”, or “latest news”, etc. Multi-label classification includes advanced machine learning algorithms that support predicting various mutually non-exclusive classes or labels, unlike traditional classification tasks where class labels are mutually exclusive [ 82 ].

Many classification algorithms have been proposed in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the most common and popular methods that are used widely in various application areas.

Naive Bayes (NB): The naive Bayes algorithm is based on Bayes' theorem with the assumption of independence between each pair of features [51]. It works well in many real-world situations, such as document or text classification and spam filtering, and can be used for both binary and multi-class categories. The NB classifier can be used to effectively classify noisy instances in the data and to construct a robust prediction model [94]. Its key benefit is that, compared to more sophisticated approaches, it needs only a small amount of training data to estimate the necessary parameters quickly [82]. However, its performance may suffer due to its strong assumption of feature independence. Gaussian, Multinomial, Complement, Bernoulli, and Categorical are the common variants of the NB classifier [82].
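As an illustration (toy data, not from the paper), a from-scratch Bernoulli-style naive Bayes for the spam-filtering example: each feature is a word's presence or absence, assumed independent given the class, with add-one (Laplace) smoothing to avoid zero probabilities.

```python
# Minimal Bernoulli naive Bayes sketch for spam filtering. Each document is
# a set of words; features are word presence/absence, assumed conditionally
# independent given the class. Log-probabilities avoid numeric underflow.
import math

def train_nb(docs, labels, vocab):
    model = {}
    for c in set(labels):
        members = [d for d, y in zip(docs, labels) if y == c]
        prior = len(members) / len(docs)
        # P(word present | class), with add-one smoothing.
        likelihood = {w: (sum(w in d for d in members) + 1) / (len(members) + 2)
                      for w in vocab}
        model[c] = (prior, likelihood)
    return model

def classify(model, doc, vocab):
    def log_posterior(c):
        prior, likelihood = model[c]
        score = math.log(prior)
        for w in vocab:
            p = likelihood[w]
            score += math.log(p if w in doc else 1 - p)
        return score
    return max(model, key=log_posterior)

vocab = {"free", "winner", "meeting", "report"}
docs = [{"free", "winner"}, {"free"}, {"meeting", "report"}, {"report"}]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels, vocab)
print(classify(model, {"free", "winner"}, vocab))   # classified as spam
print(classify(model, {"meeting"}, vocab))          # classified as ham
```

Note how little training data the model needs to estimate its parameters, which is the key benefit described above.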

Linear Discriminant Analysis (LDA): Linear discriminant analysis (LDA) is a linear decision boundary classifier created by fitting class-conditional densities to data and applying Bayes' rule [51, 82]. This method is also known as a generalization of Fisher's linear discriminant, which projects a given dataset into a lower-dimensional space, i.e., a dimensionality reduction that minimizes the complexity of the model or reduces the resulting model's computational cost. The standard LDA model fits each class with a Gaussian density, assuming that all classes share the same covariance matrix [82]. LDA is closely related to ANOVA (analysis of variance) and regression analysis, which seek to express one dependent variable as a linear combination of other features or measurements.

Logistic regression (LR): Another common probabilistic statistical model used to solve classification problems in machine learning is logistic regression (LR) [64]. Logistic regression typically uses a logistic function to estimate the probabilities, which is also referred to as the mathematically defined sigmoid function in Eq. 1. It works well when the dataset can be separated linearly, but it can overfit high-dimensional datasets; the regularization (L1 and L2) techniques [82] can be used to avoid over-fitting in such scenarios. The assumption of linearity between the dependent and independent variables is considered a major drawback of logistic regression. It can be used for both classification and regression problems, but it is more commonly used for classification.
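A minimal from-scratch sketch of the idea (toy 1-D data and illustrative hyperparameters, not from the paper): the sigmoid squashes a linear score into a probability, and gradient descent on the log-loss fits the weights.

```python
# Minimal logistic-regression sketch: sigmoid(z) = 1 / (1 + e^-z) maps the
# linear score w*x + b to a probability in (0, 1); per-sample gradient
# descent on the log-loss fits w and b.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(xs, ys, lr=0.1, epochs=1000):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(w * x + b)
            # Gradient of the log-loss with respect to w and b.
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

# 1-D toy data: label 1 when x is large (linearly separable).
xs = [0.0, 1.0, 2.0, 5.0, 6.0, 7.0]
ys = [0,   0,   0,   1,   1,   1]

w, b = train_logreg(xs, ys)
probs = [sigmoid(w * x + b) for x in xs]
print([round(p) for p in probs])   # predicted labels
```

Because the toy data are linearly separable, the fitted model classifies every training point correctly; on non-separable or high-dimensional data one would add the L1/L2 penalty mentioned above.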

K-nearest neighbors (KNN): K-Nearest Neighbors (KNN) [ 9 ] is an “instance-based learning” or non-generalizing learning, also known as a “lazy learning” algorithm. It does not focus on constructing a general internal model; instead, it stores all instances corresponding to training data in n -dimensional space. KNN uses data and classifies new data points based on similarity measures (e.g., Euclidean distance function) [ 82 ]. Classification is computed from a simple majority vote of the k nearest neighbors of each point. It is quite robust to noisy training data, and accuracy depends on the data quality. The biggest issue with KNN is to choose the optimal number of neighbors to be considered. KNN can be used both for classification as well as regression.
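The "lazy learning" character is easy to see in code (an illustrative sketch on invented 2-D data, not from the paper): no model is built up front; prediction is a majority vote among the k closest stored training points.

```python
# Minimal KNN sketch: store all training instances and classify a query by
# a simple majority vote among its k nearest neighbors (Euclidean distance).
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of ((x, y), label) pairs kept in memory ("lazy learning")."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    neighbors = sorted(train, key=lambda item: dist2(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(train, (1, 1)))   # nearest neighbors are all "A"
print(knn_predict(train, (5, 4)))   # nearest neighbors are all "B"
```

The choice of k, the biggest practical issue named above, is just the slice length here; varying it changes how local or smooth the decision boundary is.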

Support vector machine (SVM): In machine learning, another common technique that can be used for classification, regression, or other tasks is the support vector machine (SVM) [56]. In high- or infinite-dimensional space, a support vector machine constructs a hyper-plane or set of hyper-planes. Intuitively, the hyper-plane with the greatest distance from the nearest training data points of any class achieves a strong separation since, in general, the larger the margin, the lower the classifier's generalization error. SVM is effective in high-dimensional spaces and can behave differently based on different mathematical functions known as kernels. Linear, polynomial, radial basis function (RBF), sigmoid, etc. are popular kernel functions used in SVM classifiers [82]. However, when the dataset contains a lot of noise, such as overlapping target classes, SVM does not perform well.

Decision tree (DT): Decision tree (DT) [88] is a well-known non-parametric supervised learning method. DT learning methods are used for both classification and regression tasks [82]. ID3 [87], C4.5 [88], and CART [20] are well-known DT algorithms. Moreover, the recently proposed BehavDT [100] and IntrudTree [97] by Sarker et al. are effective in the relevant application domains, such as user behavior analytics and cybersecurity analytics, respectively. DT classifies instances by sorting them down the tree from the root to some leaf node, as shown in Fig. 4. An instance is classified by checking the attribute tested at each node, starting at the root of the tree, and then moving down the branch corresponding to the attribute's value. The most popular splitting criteria are "gini" for the Gini impurity and "entropy" for the information gain [82].
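The two splitting criteria named above, Gini impurity and entropy (the quantity behind information gain), can be computed directly from class proportions; the label lists below are hypothetical examples:

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy: -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

pure = ["A", "A", "A", "A"]        # a pure node: impurity 0, entropy 0
mixed = ["A", "A", "B", "B"]       # a 50/50 node: maximal impurity for two classes
```

A split is chosen to maximize the decrease of these measures (the information gain in the entropy case) between a parent node and its children.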

figure 4

An example of a decision tree structure

figure 5

An example of a random forest structure considering multiple decision trees

Random forest (RF): A random forest classifier [19] is well known as an ensemble classification technique used in machine learning and data science across various application areas. This method uses "parallel ensembling": it fits several decision tree classifiers in parallel on different sub-samples of the dataset, as shown in Fig. 5, and uses majority voting or averaging for the final result. It thus minimizes the over-fitting problem and increases prediction accuracy and control [82]. Therefore, an RF learning model with multiple decision trees is typically more accurate than a single decision tree-based model [106]. To build a series of decision trees with controlled variation, it combines bootstrap aggregation (bagging) [18] with random feature selection [11]. It is adaptable to both classification and regression problems and fits well for both categorical and continuous values.
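A minimal sketch of parallel ensembling with scikit-learn's `RandomForestClassifier`; the synthetic dataset and the choice of 100 trees are illustrative assumptions, not from the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# a hypothetical binary classification dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# 100 trees, each fit on a bootstrap sub-sample with random feature selection;
# predictions are made by majority vote over the trees
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)
acc = rf.score(X_te, y_te)
```

The `n_estimators` parameter controls the size of the ensemble; larger forests generally reduce variance at the cost of training time.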

Adaptive Boosting (AdaBoost): Adaptive Boosting (AdaBoost) is an ensemble learning process that employs an iterative approach to improve poor classifiers by learning from their errors. It was developed by Freund et al. [35] and is also known as "meta-learning". Unlike random forest, which uses parallel ensembling, AdaBoost uses "sequential ensembling": it creates a powerful classifier of high accuracy by combining many poorly performing classifiers. In that sense, AdaBoost is called an adaptive classifier, as it significantly improves the efficiency of the classifier, though in some instances it can trigger overfitting. AdaBoost is best used to boost the performance of decision trees, its typical base estimator [82], on binary classification problems; however, it is sensitive to noisy data and outliers.

Extreme gradient boosting (XGBoost): Gradient boosting, like random forests [19] above, is an ensemble learning algorithm that generates a final model from a series of individual models, typically decision trees. The gradient is used to minimize the loss function, similar to how neural networks [41] use gradient descent to optimize weights. Extreme Gradient Boosting (XGBoost) is a form of gradient boosting that takes more detailed approximations into account when determining the best model [82]. It computes second-order gradients of the loss function and applies advanced regularization (L1 and L2) [82], which reduces over-fitting and improves model generalization and performance. XGBoost is fast and can handle large datasets well.

Stochastic gradient descent (SGD): Stochastic gradient descent (SGD) [41] is an iterative method for optimizing an objective function with suitable smoothness properties, where 'stochastic' refers to the random selection of training samples at each step. This reduces the computational burden, particularly in high-dimensional optimization problems, allowing faster iterations in exchange for a lower convergence rate. A gradient is the slope of a function; it measures a variable's degree of change in response to changes in another variable. Mathematically, gradient descent operates on a convex cost function by computing partial derivatives with respect to its input parameters. Let \(\alpha\) be the learning rate and \(J_i\) the cost of the \(i\)th training example; then Eq. (4) gives the stochastic gradient descent weight update at the \(j\)th iteration. In large-scale and sparse machine learning, SGD has been successfully applied to problems often encountered in text classification and natural language processing [82]. However, SGD is sensitive to feature scaling and requires a range of hyperparameters, such as the regularization parameter and the number of iterations.
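A minimal sketch of the per-example weight update described above, applied to a one-variable linear model; the learning rate, step count, and noiseless toy data are hypothetical choices:

```python
import random

def sgd_fit_line(data, alpha=0.01, steps=5000, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on the squared error,
    using one randomly chosen training example per weight update."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)       # 'stochastic': a random training example
        err = (w * x + b) - y         # derivative of J_i = (prediction - y)^2 / 2
        w -= alpha * err * x          # w := w - alpha * dJ_i/dw
        b -= alpha * err              # b := b - alpha * dJ_i/db
    return w, b

data = [(x, 2.0 * x + 1.0) for x in range(10)]   # noiseless line y = 2x + 1
w, b = sgd_fit_line(data)
```

Because each update uses a single example rather than the full dataset, iterations are cheap, at the cost of noisier convergence, which is the trade-off the paragraph above describes.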

Rule-based classification: The term rule-based classification can be used to refer to any classification scheme that makes use of IF-THEN rules for class prediction. Several classification algorithms, such as Zero-R [125], One-R [47], decision trees [87, 88], DTNB [110], Ripple Down Rule learner (RIDOR) [125], and Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [126], have the ability to generate rules. The decision tree is one of the most common rule-based classification algorithms among these techniques because it has several advantages, such as being easy to interpret; the ability to handle high-dimensional data; simplicity and speed; good accuracy; and the capability to produce rules that are clear and understandable to humans [127, 128]. Decision tree-based rules also provide significant accuracy in a prediction model for unseen test cases [106]. Since the rules are easily interpretable, these rule-based classifiers are often used to produce descriptive models that can describe a system, including its entities and their relationships.

figure 6

Classification vs. regression. In classification the dotted line represents a linear boundary that separates the two classes; in regression, the dotted line models the linear relationship between the two variables

Regression Analysis

Regression analysis includes several machine learning methods that allow one to predict a continuous outcome variable (y) based on the value of one or more predictor variables (x) [41]. The most significant distinction between classification and regression is that classification predicts distinct class labels, while regression predicts a continuous quantity. Figure 6 shows an example of how classification differs from regression. Some overlap is often found between the two types of machine learning algorithms. Regression models are now widely used in a variety of fields, including financial forecasting, cost estimation, trend analysis, marketing, time-series estimation, drug response modeling, and many more. Some of the familiar types of regression algorithms are linear, polynomial, lasso, and ridge regression, which are explained briefly in the following.

Simple and multiple linear regression: This is one of the most popular ML modeling techniques as well as a well-known regression technique. In this technique, the dependent variable is continuous, the independent variable(s) can be continuous or discrete, and the form of the regression line is linear. Linear regression creates a relationship between the dependent variable (Y) and one or more independent variables (X), known as the regression line, using the best-fit straight line [41]. It is defined by the following equation: \(y = a + bx + e\),

where a is the intercept, b is the slope of the line, and e is the error term. This equation can be used to predict the value of the target variable from the given predictor variable(s). Multiple linear regression is an extension of simple linear regression that allows two or more predictor variables to model a response variable, y, as a linear function [41], defined in Eq. 6, whereas simple linear regression has only one independent variable, defined in Eq. 5.
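The best-fit line can be obtained in closed form from the least-squares estimates of the intercept a and slope b; the sample points below are hypothetical:

```python
def fit_simple_linear(xs, ys):
    """Least-squares estimates of intercept a and slope b in y = a + b*x + e."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # b = covariance(x, y) / variance(x); a = mean_y - b * mean_x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]      # exactly y = 1 + 2x, so e = 0 everywhere
a, b = fit_simple_linear(xs, ys)
```

With real (noisy) data the error term e would be non-zero, and a and b would minimize the sum of squared residuals rather than fit exactly.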

Polynomial regression: Polynomial regression is a form of regression analysis in which the relationship between the independent variable x and the dependent variable y is modeled not as a straight line but as an \(n\)th-degree polynomial in x [82]. The polynomial regression equation is derived from the linear regression (polynomial regression of degree 1) equation and is defined as: \(y = b_0 + b_1 x + b_2 x^2 + \cdots + b_n x^n\).

Here, y is the predicted/target output, \(b_0, b_1, \ldots, b_n\) are the regression coefficients, and x is the independent/input variable. In simple words, if the data are not distributed linearly but instead follow an \(n\)th-degree polynomial, then we use polynomial regression to obtain the desired output.
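A polynomial fit of degree n can be sketched with NumPy's `polyfit`; the quadratic toy data below are hypothetical:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x**2                  # quadratic data: y = b0 + b2*x^2 with b0=3, b2=2

# least-squares fit of a degree-2 polynomial; coefficients come highest-degree first
coeffs = np.polyfit(x, y, deg=2)      # approximately [2.0, 0.0, 3.0]
y_hat = np.polyval(coeffs, 5.0)       # predict at x = 5: 3 + 2*25 = 53
```

Internally this is still linear least squares: the model is linear in the coefficients \(b_0, \ldots, b_n\) even though it is non-linear in x.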

LASSO and ridge regression: LASSO and ridge regression are well known as powerful techniques typically used for building learning models in the presence of a large number of features, due to their capability to prevent over-fitting and reduce model complexity. The LASSO (least absolute shrinkage and selection operator) regression model uses the L1 regularization technique [82], which applies shrinkage by penalizing the absolute value of the magnitude of the coefficients (L1 penalty). As a result, LASSO can shrink coefficients exactly to zero. Thus, LASSO regression aims to find the subset of predictors that minimizes the prediction error for a quantitative response variable. Ridge regression, on the other hand, uses L2 regularization [82], which penalizes the squared magnitude of the coefficients (L2 penalty). Thus, ridge regression forces the weights to be small but never sets a coefficient to exactly zero, yielding a non-sparse solution. Overall, LASSO regression is useful for obtaining a subset of predictors by eliminating less important features, while ridge regression is useful when a dataset exhibits "multicollinearity", i.e., predictors that are correlated with other predictors.
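The contrasting behavior of the L1 and L2 penalties can be observed directly with scikit-learn; the synthetic data, in which only two of ten features matter, and the penalty strength `alpha=0.5` are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1]       # only the first two features are relevant

lasso = Lasso(alpha=0.5).fit(X, y)      # L1 penalty: drives weak coefficients to zero
ridge = Ridge(alpha=0.5).fit(X, y)      # L2 penalty: shrinks but does not zero out

n_zero_lasso = int(np.sum(lasso.coef_ == 0.0))   # the 8 irrelevant features
n_zero_ridge = int(np.sum(ridge.coef_ == 0.0))   # typically 0: small but non-zero
```

The sparse LASSO solution is exactly what makes it usable as a feature-selection device, while ridge merely dampens all coefficients.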

Cluster Analysis

Cluster analysis, also known as clustering, is an unsupervised machine learning technique for identifying and grouping related data points in large datasets without concern for a specific outcome. It groups a collection of objects in such a way that objects in the same category, called a cluster, are in some sense more similar to each other than objects in other groups [41]. It is often used as a data analysis technique to discover interesting trends or patterns in data, e.g., groups of consumers based on their behavior. Clustering can be used in a broad range of application areas, such as cybersecurity, e-commerce, mobile data processing, health analytics, user modeling, and behavioral analytics. In the following, we briefly discuss and summarize various types of clustering methods.

Partitioning methods: Based on the features and similarities in the data, this clustering approach categorizes the data into multiple groups or clusters. The number of clusters to produce is typically determined by data scientists or analysts, either dynamically or statically, depending on the nature of the target application. The most common partitioning-based clustering algorithms are K-means [69], K-Medoids [80], CLARA [55], etc.

Density-based methods: To identify distinct groups or clusters, these methods use the concept that a cluster in the data space is a contiguous region of high point density, separated from other such clusters by contiguous regions of low point density. Points that are not part of any cluster are considered noise. Typical density-based clustering algorithms are DBSCAN [32], OPTICS [12], etc. Density-based methods typically struggle with clusters of varying density and with high-dimensional data.

Hierarchical-based methods: Hierarchical clustering seeks to construct a hierarchy of clusters, i.e., a tree structure. Strategies for hierarchical clustering generally fall into two types: (i) agglomerative, a "bottom-up" approach in which each observation begins in its own cluster and pairs of clusters are merged as one moves up the hierarchy, and (ii) divisive, a "top-down" approach in which all observations begin in one cluster and splits are performed recursively as one moves down the hierarchy, as shown in Fig. 7. Our earlier proposed BOTS technique, Sarker et al. [102], is an example of a hierarchical, particularly bottom-up, clustering algorithm.

Grid-based methods: Grid-based clustering is especially suitable for massive datasets. The principle is first to summarize the dataset with a grid representation and then to combine grid cells to obtain clusters. STING [122], CLIQUE [6], etc. are standard grid-based clustering algorithms.

Model-based methods: There are mainly two types of model-based clustering algorithms: one that uses statistical learning, and the other based on a method of neural network learning [ 130 ]. For instance, GMM [ 89 ] is an example of a statistical learning method, and SOM [ 22 ] [ 96 ] is an example of a neural network learning method.

Constraint-based methods: Constraint-based clustering is a semi-supervised approach to data clustering that uses constraints to incorporate domain knowledge. Application- or user-oriented constraints are incorporated to perform the clustering. Typical algorithms of this kind are COP K-means [121], CMWK-Means [27], etc.

figure 7

A graphical interpretation of the widely-used hierarchical clustering (Bottom-up and top-down) technique

Many clustering algorithms with the ability to group data have been proposed in the machine learning and data science literature [41, 125]. In the following, we summarize the popular methods that are widely used in various application areas.

K-means clustering: K-means clustering [69] is a fast, robust, and simple algorithm that provides reliable results when the clusters in a dataset are well separated from each other. The data points are allocated to clusters in such a way that the sum of the squared distances between the data points and the centroids is as small as possible. In other words, the K-means algorithm identifies k centroids and then assigns each data point to the nearest centroid, keeping the clusters as compact as possible. Since it begins with a random selection of cluster centers, the results can be inconsistent. Since extreme values can easily affect a mean, the K-means algorithm is also sensitive to outliers. K-medoids clustering [91] is a variant of K-means that is more robust to noise and outliers.
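The alternating assignment and update steps of K-means can be sketched as follows; the toy 2-D points and the naive initialization from the first k points are illustrative simplifications (real implementations use random or k-means++ initialization):

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: assign points to the nearest centroid, then move centroids."""
    centroids = X[:k].copy()          # naive deterministic init: the first k points
    for _ in range(iters):
        # assignment step: index of the nearest centroid for each point
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = np.argmin(dists, axis=1)
        # update step: each centroid becomes the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

X = np.array([[0, 0], [0, 1], [1, 0],          # one tight blob near the origin
              [10, 10], [10, 11], [11, 10]],   # another far away
             dtype=float)
labels, centroids = kmeans(X, k=2)
```

Each iteration can only decrease the sum of squared distances, so the loop converges; the final clustering, however, depends on the initialization, which is the inconsistency noted above.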

Mean-shift clustering: Mean-shift clustering [37] is a nonparametric clustering technique that does not require prior knowledge of the number of clusters or constraints on cluster shape. Mean-shift clustering aims to discover "blobs" in a smooth distribution or density of samples [82]. It is a centroid-based algorithm that works by updating centroid candidates to be the mean of the points within a given region. To form the final set of centroids, these candidates are filtered in a post-processing stage to remove near-duplicates. Computer vision and image processing are example application domains. Mean shift has the disadvantage of being computationally expensive. Moreover, in cases of high dimension, where the number of clusters shifts abruptly, the mean-shift algorithm does not work well.

DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) [32] is a foundational density-based clustering algorithm, widely used in data mining and machine learning. It is a non-parametric density-based clustering technique for separating high-density clusters from low-density regions. DBSCAN's main idea is that a point belongs to a cluster if it is close to many points from that cluster. It can find clusters of various shapes and sizes in large volumes of data that are noisy and contain outliers. Unlike k-means, DBSCAN does not require a priori specification of the number of clusters in the data and can find arbitrarily shaped clusters. Although k-means is much faster than DBSCAN, DBSCAN is effective at finding high-density regions and outliers, i.e., it is robust to outliers.
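The behavior described above, i.e., clusters discovered without specifying k and outliers labeled as noise, can be observed with scikit-learn's DBSCAN; the toy points and the `eps` and `min_samples` values are hypothetical:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two dense blobs plus one far-away outlier
X = np.array([[0, 0], [0.2, 0], [0, 0.2],
              [5, 5], [5.2, 5], [5, 5.2],
              [20, 20]])

# eps: neighbourhood radius; min_samples: points needed to form a dense (core) region
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
labels = db.labels_                       # cluster ids; -1 marks noise/outlier points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```

Note that k was never specified: the two clusters emerge from the density structure, and the isolated point is reported as noise (label -1) rather than forced into a cluster.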

GMM clustering: Gaussian mixture models (GMMs), a distribution-based clustering approach, are often used for data clustering. A Gaussian mixture model is a probabilistic model in which all the data points are assumed to be generated by a mixture of a finite number of Gaussian distributions with unknown parameters [82]. To find the Gaussian parameters for each cluster, an optimization algorithm called expectation-maximization (EM) [82] can be used. EM is an iterative method that uses a statistical model to estimate the parameters. In contrast to k-means, Gaussian mixture models account for uncertainty and return the probability that a data point belongs to each of the k clusters. GMM clustering is more robust than k-means and works well even with non-linear data distributions.

Agglomerative hierarchical clustering: The most common method of hierarchical clustering used to group objects in clusters based on their similarity is agglomerative clustering. This technique uses a bottom-up approach, where each object is first treated as a singleton cluster by the algorithm. Following that, pairs of clusters are merged one by one until all clusters have been merged into a single large cluster containing all objects. The result is a dendrogram, which is a tree-based representation of the elements. Single linkage [ 115 ], Complete linkage [ 116 ], BOTS [ 102 ] etc. are some examples of such techniques. The main advantage of agglomerative hierarchical clustering over k-means is that the tree-structure hierarchy generated by agglomerative clustering is more informative than the unstructured collection of flat clusters returned by k-means, which can help to make better decisions in the relevant application areas.

Dimensionality Reduction and Feature Learning

In machine learning and data science, high-dimensional data processing is a challenging task for both researchers and application developers. Thus, dimensionality reduction, which is an unsupervised learning technique, is important because it leads to better human interpretation, lowers computational cost, and avoids over-fitting and redundancy by simplifying models. Both feature selection and feature extraction can be used for dimensionality reduction. The primary distinction between them is that "feature selection" keeps a subset of the original features [97], while "feature extraction" creates brand new ones [98]. In the following, we briefly discuss these techniques.

Feature selection: Feature selection, also known as variable or attribute selection, is the process of choosing a subset of unique features (variables, predictors) for use in building a machine learning or data science model. It decreases a model's complexity by eliminating irrelevant or less important features and allows for faster training of machine learning algorithms. A right and optimal subset of selected features in a problem domain can minimize the over-fitting problem by simplifying and generalizing the model, as well as increase the model's accuracy [97]. Thus, "feature selection" [66, 99] is considered one of the primary concepts in machine learning, greatly affecting the effectiveness and efficiency of the target model. The chi-squared test, analysis of variance (ANOVA) test, Pearson's correlation coefficient, and recursive feature elimination are some popular techniques that can be used for feature selection.

Feature extraction: In a machine learning-based model or system, feature extraction techniques usually provide a better understanding of the data, a way to improve prediction accuracy, and reduced computational cost or training time. The aim of "feature extraction" [66, 99] is to reduce the number of features in a dataset by generating new features from the existing ones and then discarding the originals. The majority of the information found in the original set of features can then be summarized by this new, reduced set. For instance, principal component analysis (PCA) is often used as a dimensionality-reduction technique to extract a lower-dimensional space, creating brand new components from the existing features in a dataset [98].

Many algorithms have been proposed to reduce data dimensions in the machine learning and data science literature [ 41 , 125 ]. In the following, we summarize the popular methods that are used widely in various application areas.

Variance threshold: A simple baseline approach to feature selection is the variance threshold [82]. It removes all features whose variance does not exceed a given threshold. By default, it eliminates all zero-variance features, i.e., features that have the same value in all samples. This feature selection algorithm looks only at the features (X), not the desired outputs (y), and can therefore be used for unsupervised learning.

Pearson correlation: Pearson's correlation is another method for understanding a feature's relation to the response variable and can be used for feature selection [99]. This method is also used for finding the association between features in a dataset. The resulting value lies in \([-1, 1]\), where \(-1\) means perfect negative correlation, \(+1\) means perfect positive correlation, and 0 means that the two variables do not have a linear correlation. If two random variables are denoted X and Y, then the correlation coefficient between X and Y is defined as [41]
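The coefficient can be computed directly from its definition as the covariance divided by the product of the standard deviations; the toy value lists are hypothetical:

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient between two variables."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])    # perfect positive linear relation
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])    # perfect negative linear relation
```

For feature selection, features whose correlation with the response is near zero would be candidates for removal.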

ANOVA: Analysis of variance (ANOVA) is a statistical tool used to test whether the mean values of two or more groups differ significantly from each other. ANOVA assumes a linear relationship between the variables and the target, and normally distributed variables. To statistically test the equality of means, the ANOVA method utilizes F-tests. For feature selection, the resulting 'ANOVA F value' [82] of this test can be used to omit features that are independent of the target variable.

Chi square: The chi-square (\({\chi }^2\)) statistic [82] measures the difference between the observed and expected frequencies of a series of events or variables. The value of \({\chi }^2\) depends on the magnitude of the difference between the observed and expected values, the degrees of freedom, and the sample size. The chi-square \({\chi }^2\) is commonly used for testing relationships between categorical variables. If \(O_i\) represents the observed value and \(E_i\) the expected value, then \({\chi }^2 = \sum _i \frac{(O_i - E_i)^2}{E_i}\).
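The statistic can be computed directly from the observed and expected counts; the counts below are hypothetical:

```python
def chi_square(observed, expected):
    """Chi-square statistic: sum of (O_i - E_i)^2 / E_i over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# e.g., 60 trials over 3 equally likely categories: expected 20 each
stat = chi_square(observed=[18, 22, 20], expected=[20, 20, 20])
# stat = (18-20)^2/20 + (22-20)^2/20 + 0 = 0.4
```

For feature selection, a larger statistic between a categorical feature and the class label indicates a stronger dependence and therefore a more useful feature.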

Recursive feature elimination (RFE): Recursive feature elimination (RFE) is a brute-force approach to feature selection. RFE [82] fits the model and repeatedly removes the weakest feature until the specified number of features is reached. Features are ranked by the model's coefficients or feature importances. By recursively removing a small number of features per iteration, RFE aims to eliminate dependencies and collinearity in the model.

Model-based selection: To reduce the dimensionality of the data, linear models penalized with L1 regularization can be used. Least absolute shrinkage and selection operator (LASSO) regression is a type of linear regression that shrinks some of the coefficients to exactly zero [82]; those features can then be removed from the model. Thus, the penalized LASSO regression method is often used in machine learning to select a subset of variables. The Extra Trees classifier [82] is an example of a tree-based estimator that can compute impurity-based feature importances, which can then be used to discard irrelevant features.

Principal component analysis (PCA): Principal component analysis (PCA) is a well-known unsupervised learning approach in machine learning and data science. PCA is a mathematical technique that transforms a set of correlated variables into a set of uncorrelated variables known as principal components [48, 81]. Figure 8 shows an example of the effect of PCA in different dimension spaces: Fig. 8a shows the original features in 3D space, and Fig. 8b shows the created principal components PC1 and PC2 projected onto a 2D plane and a 1D line with the principal component PC1, respectively. Thus, PCA can be used as a feature extraction technique that reduces the dimensionality of a dataset in order to build an effective machine learning model [98]. Technically, PCA identifies the eigenvectors of a covariance matrix with the highest eigenvalues and then uses those to project the data into a new subspace of equal or fewer dimensions [82].
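The eigendecomposition view of PCA described above can be sketched with NumPy; the synthetic, essentially one-dimensional dataset is an illustrative assumption:

```python
import numpy as np

def pca(X, n_components):
    """Project X onto the eigenvectors of its covariance matrix
    that have the largest eigenvalues (the principal components)."""
    Xc = X - X.mean(axis=0)                    # centre the data
    cov = np.cov(Xc, rowvar=False)             # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]          # sort descending by variance explained
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                     # the projected (transformed) data

rng = np.random.default_rng(0)
t = rng.normal(size=100)
# three correlated features: the data essentially vary along one direction
X = np.column_stack([t, 2 * t, 0.01 * rng.normal(size=100)])
Z = pca(X, n_components=1)                     # 3-D data compressed to 1-D
```

Because the first two columns are perfectly correlated, a single principal component captures almost all of the variance, which is exactly the compression effect illustrated in Fig. 8.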

figure 8

An example of a principal component analysis (PCA) and created principal components PC1 and PC2 in different dimension space

Association Rule Learning

Association rule learning is a rule-based machine learning approach for discovering interesting relationships, expressed as "IF-THEN" statements, between variables in large datasets [7]. One example is that "if a customer buys a computer or laptop (an item), s/he is likely to also buy anti-virus software (another item) at the same time". Association rules are employed today in many application areas, including IoT services, medical diagnosis, usage behavior analytics, web usage mining, smartphone applications, cybersecurity applications, and bioinformatics. In comparison to sequence mining, association rule learning does not usually take into account the order of items within or across transactions. A common way of measuring the usefulness of association rules is to use the 'support' and 'confidence' parameters introduced in [7].

In the data mining literature, many association rule learning methods have been proposed, such as logic dependent [ 34 ], frequent pattern based [ 8 , 49 , 68 ], and tree-based [ 42 ]. The most popular association rule learning algorithms are summarized below.

AIS and SETM: AIS is the first algorithm proposed by Agrawal et al. [7] for association rule mining. The AIS algorithm's main downside is that too many candidate itemsets are generated, requiring more space and wasting effort. The algorithm also requires too many passes over the entire dataset to produce the rules. Another approach, SETM [49], exhibits good performance and stable behavior in execution time; however, it suffers from the same flaw as the AIS algorithm.

Apriori: For generating association rules for a given dataset, Agrawal et al. [8] proposed the Apriori, Apriori-TID, and Apriori-Hybrid algorithms. These later algorithms outperform the AIS and SETM mentioned above thanks to the Apriori property of frequent itemsets [8]. The term 'Apriori' usually refers to having prior knowledge of frequent itemset properties. Apriori uses a "bottom-up" approach to generate candidate itemsets. To reduce the search space, Apriori uses the property that "all subsets of a frequent itemset must be frequent, and if an itemset is infrequent, then all its supersets must also be infrequent". Another approach, predictive Apriori [108], can also generate rules; however, it can produce unexpected results because it combines support and confidence into a single measure. Apriori [8] is the most widely applied technique in mining association rules.
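The 'support' and 'confidence' measures that Apriori-style algorithms rely on can be sketched directly; the toy transactions echo the computer/anti-virus example above and are hypothetical:

```python
transactions = [
    {"computer", "antivirus"},
    {"computer", "antivirus", "mouse"},
    {"computer", "mouse"},
    {"antivirus"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Confidence of rule A => C: support(A union C) / support(A)."""
    return support(antecedent | consequent) / support(antecedent)

s = support({"computer", "antivirus"})        # 2 of 4 transactions contain both
c = confidence({"computer"}, {"antivirus"})   # 2 of the 3 computer buyers
```

Apriori enumerates itemsets whose support exceeds a user-specified minimum, pruning any candidate with an infrequent subset, and then keeps the rules whose confidence exceeds the minimum confidence threshold.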

ECLAT: This technique was proposed by Zaki et al. [ 131 ] and stands for Equivalence Class Clustering and bottom-up Lattice Traversal. ECLAT uses a depth-first search to find frequent itemsets. In contrast to the Apriori [ 8 ] algorithm, which represents data in a horizontal pattern, it represents data vertically. Hence, the ECLAT algorithm is more efficient and scalable in the area of association rule learning. This algorithm is better suited for small and medium datasets whereas the Apriori algorithm is used for large datasets.

FP-Growth: Another common association rule learning technique, based on the frequent-pattern tree (FP-tree) proposed by Han et al. [42], is Frequent Pattern Growth, known as FP-Growth. The key difference from Apriori is that while generating rules, the Apriori algorithm [8] generates frequent candidate itemsets, whereas the FP-Growth algorithm [42] avoids candidate generation and instead builds a tree using a successful 'divide and conquer' strategy. Due to its sophistication, however, the FP-tree is challenging to use in an interactive mining environment [133]. Moreover, the FP-tree may not fit into memory for massive datasets, making big data challenging to process as well. Another solution, RARM (Rapid Association Rule Mining), was proposed by Das et al. [26] but faces a related FP-tree issue [133].

ABC-RuleMiner: ABC-RuleMiner is a rule-based machine learning method, recently proposed in our earlier paper by Sarker et al. [104], that discovers interesting non-redundant rules to provide real-world intelligent services. The algorithm effectively identifies redundancy in associations by taking into account the impact or precedence of the related contextual features, and discovers a set of non-redundant association rules. It first constructs an association generation tree (AGT) in a top-down manner and then extracts the association rules by traversing the tree. Thus, ABC-RuleMiner is more potent than traditional rule-based methods in terms of both non-redundant rule generation and intelligent decision-making, particularly in a context-aware smart computing environment where human or user preferences are involved.

Among the association rule learning techniques discussed above, Apriori [ 8 ] is the most widely used algorithm for discovering association rules from a given dataset [ 133 ]. The main strength of the association learning technique is its comprehensiveness, as it generates all associations that satisfy the user-specified constraints, such as minimum support and confidence value. The ABC-RuleMiner approach [ 104 ] discussed earlier could give significant results in terms of non-redundant rule generation and intelligent decision-making for the relevant application areas in the real world.

Reinforcement Learning

Reinforcement learning (RL) is a machine learning technique that allows an agent to learn by trial and error in an interactive environment, using feedback from its own actions and experiences. Unlike supervised learning, which is based on given sample data or examples, the RL method learns by interacting with the environment. The problem to be solved in RL is defined as a Markov decision process (MDP) [86], i.e., it is all about making decisions sequentially. An RL problem typically includes four elements: agent, environment, rewards, and policy.

RL can be split roughly into model-based and model-free techniques. Model-based RL is the process of inferring optimal behavior from a model of the environment by performing actions and observing the results, which include the next state and the immediate reward [85]. AlphaZero and AlphaGo [113] are examples of model-based approaches. On the other hand, a model-free approach does not use the transition probability distribution or the reward function associated with the MDP. Q-learning, Deep Q Network, Monte Carlo control, SARSA (State–Action–Reward–State–Action), etc. are some examples of model-free algorithms [52]. The explicit model of the environment, required for model-based RL but not for model-free RL, is the key difference between the two families. In the following, we discuss the popular RL algorithms.

Monte Carlo methods: Monte Carlo techniques, or Monte Carlo experiments, are a broad category of computational algorithms that rely on repeated random sampling to obtain numerical results [ 52 ]. The underlying concept is to use randomness to solve problems that are deterministic in principle. Optimization, numerical integration, and generating draws from a probability distribution are the three problem classes where Monte Carlo techniques are most commonly used.
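
As a concrete instance of the numerical-integration use case, the following sketch estimates pi by repeated random sampling; the `estimate_pi` helper and its parameters are purely illustrative:

```python
import random

def estimate_pi(n_samples: int, seed: int = 42) -> float:
    """Estimate pi by sampling points uniformly in the unit square
    and counting how many fall inside the quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # The quarter-circle area is pi/4, so scale the hit fraction by 4.
    return 4.0 * inside / n_samples

print(estimate_pi(100_000))  # close to 3.14159
```

The estimate converges at the usual Monte Carlo rate of O(1/sqrt(n)), regardless of the dimensionality of the problem, which is why such methods are attractive for high-dimensional integration.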

Q-learning: Q-learning is a model-free reinforcement learning algorithm that learns the quality of actions, telling an agent what action to take under what circumstances [ 52 ]. It does not need a model of the environment (hence the term “model-free”), and it can handle stochastic transitions and rewards without adaptation. The ‘Q’ in Q-learning usually stands for quality, as the algorithm estimates the maximum expected reward for a given action in a given state.
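
The Q-learning update can be illustrated on a toy problem. The sketch below learns Q-values for a five-state corridor in which the agent is rewarded for reaching the right-most state; the environment, constants, and episode count are invented for illustration:

```python
import random

# Tabular Q-learning on a toy 5-state corridor: the agent starts at
# state 0 and receives reward 1 for reaching state 4 (the terminal goal).
N_STATES = 5
ACTIONS = (0, 1)                 # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9          # learning rate and discount factor

Q = [[0.0, 0.0] for _ in range(N_STATES)]
rng = random.Random(0)

def step(state, action):
    """Deterministic transition; reward 1 only on entering the goal."""
    nxt = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)

for _ in range(500):             # episodes
    s = 0
    while s != N_STATES - 1:
        a = rng.choice(ACTIONS)  # random behavior (Q-learning is off-policy)
        s2, r = step(s, a)
        # Core update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2

# Learned values grow toward the goal: roughly 0.73, 0.81, 0.9, 1.0
print([round(Q[s][1], 2) for s in range(N_STATES - 1)])
```

Because Q-learning is off-policy, even the purely random behavior policy used here converges to the optimal Q-values, each discounted by gamma per step of distance from the goal.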

Deep Q-learning: Q-learning works well in reasonably simple settings. However, when the number of states and actions grows large, maintaining a table of Q-values becomes impractical, and deep learning can be used as a function approximator. The basic working step in deep Q-learning [ 52 ] is that the current state is fed into a neural network, which returns the Q-values of all possible actions as output.

Reinforcement learning, along with supervised and unsupervised learning, is one of the basic machine learning paradigms. RL can be used to solve numerous real-world problems in various fields, such as game theory, control theory, operations research, information theory, simulation-based optimization, manufacturing, supply chain logistics, multi-agent systems, swarm intelligence, aircraft control, robot motion control, and many more.

Artificial Neural Network and Deep Learning

Deep learning is part of a broader family of machine learning approaches based on artificial neural networks (ANNs) with representation learning. Deep learning provides a computational architecture that combines several processing layers, such as input, hidden, and output layers, to learn from data [ 41 ]. The main advantage of deep learning over traditional machine learning methods is its better performance in several cases, particularly when learning from large datasets [ 105 , 129 ]. Figure 9 shows the general performance of deep learning versus machine learning as the amount of data increases; however, it may vary depending on the data characteristics and experimental setup.

Figure 9: Machine learning and deep learning performance in general with the amount of data

The most common deep learning algorithms are the Multi-layer Perceptron (MLP), the Convolutional Neural Network (CNN, or ConvNet), and the Long Short-Term Memory Recurrent Neural Network (LSTM-RNN) [ 96 ]. In the following, we discuss various types of deep learning methods that can be used to build effective data-driven models for various purposes.

Figure 10: A structure of an artificial neural network model with multiple processing layers

MLP: The base architecture of deep learning, also known as the feed-forward artificial neural network, is the multilayer perceptron (MLP) [ 82 ]. A typical MLP is a fully connected network consisting of an input layer, one or more hidden layers, and an output layer, as shown in Fig. 10 . Each node in one layer connects to each node in the following layer with a certain weight. MLP utilizes “backpropagation” [ 41 ], the most fundamental building block of neural network training, to adjust the weight values internally while building the model. MLP is sensitive to feature scaling and exposes a variety of hyperparameters to be tuned, such as the number of hidden layers, neurons, and iterations, which can make model building computationally costly.
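
To illustrate how backpropagation adjusts the weights of an MLP, here is a hand-rolled sketch of a tiny 2-4-1 network trained on XOR. The layer sizes, learning rate, and epoch count are arbitrary choices for the example, not recommended settings, and in practice a library would compute these gradients automatically:

```python
import math, random

# Minimal MLP: 2 inputs, 4 hidden sigmoid units, 1 sigmoid output.
rng = random.Random(0)
X, Y = [(0, 0), (0, 1), (1, 0), (1, 1)], [0, 1, 1, 0]
H, LR = 4, 0.5
w1 = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(H)]
b1 = [0.0] * H
w2 = [rng.uniform(-1, 1) for _ in range(H)]
b2 = 0.0

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    h = [sig(w[0] * x[0] + w[1] * x[1] + b) for w, b in zip(w1, b1)]
    return h, sig(sum(wi * hi for wi, hi in zip(w2, h)) + b2)

def loss():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(X, Y))

loss_before = loss()
for _ in range(5000):                        # epochs of per-sample updates
    for x, y in zip(X, Y):
        h, out = forward(x)
        d_out = (out - y) * out * (1 - out)  # gradient at the output unit
        for j in range(H):                   # backpropagate to hidden layer
            d_h = d_out * w2[j] * h[j] * (1 - h[j])
            w2[j] -= LR * d_out * h[j]
            w1[j][0] -= LR * d_h * x[0]
            w1[j][1] -= LR * d_h * x[1]
            b1[j] -= LR * d_h
        b2 -= LR * d_out

print(loss_before, "->", loss())             # squared error drops with training
```

Each update moves every weight a small step down the gradient of the squared error, exactly the “adjust the weight values internally” step described above.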

CNN or ConvNet: The convolutional neural network (CNN) [ 65 ] enhances the design of the standard ANN with convolutional layers, pooling layers, and fully connected layers, as shown in Fig. 11 . Because it takes advantage of the two-dimensional (2D) structure of the input data, it is broadly used in areas such as image and video recognition, image processing and classification, medical image analysis, and natural language processing. While CNN has a greater computational burden, it has the advantage of automatically detecting important features without any manual intervention, and hence CNN is considered more powerful than the conventional ANN. A number of advanced CNN-based deep learning models are used in the field, such as AlexNet [ 60 ], Xception [ 24 ], Inception [ 118 ], Visual Geometry Group (VGG) [ 44 ], ResNet [ 45 ], etc.
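
The core operation of a convolutional layer can be shown in a few lines: a small kernel slides over a 2D input and computes local weighted sums. The `conv2d` function and the edge-detector kernel below are a minimal illustration (no padding, stride 1, single channel), not a production implementation:

```python
def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation), stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[i + u][j + v] * kernel[u][v]
                 for u in range(kh) for v in range(kw))
             for j in range(ow)]
            for i in range(oh)]

# A 4x4 input with a vertical dark-to-bright edge between columns 1 and 2
image = [[0, 0, 1, 1] for _ in range(4)]
kernel = [[1, -1], [1, -1]]   # responds to horizontal intensity changes

print(conv2d(image, kernel))  # [[0, -2, 0], [0, -2, 0], [0, -2, 0]]
```

The output is non-zero only where the kernel straddles the edge, which is the sense in which convolutional layers “automatically detect” local features; a trained CNN learns many such kernels from data rather than hand-coding them.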

LSTM-RNN: Long short-term memory (LSTM) is an artificial recurrent neural network (RNN) architecture used in the area of deep learning [ 38 ]. Unlike standard feed-forward neural networks, LSTM has feedback connections. LSTM networks are well-suited to analyzing and learning from sequential data, such as classifying, processing, and predicting observations in a time series, which differentiates them from conventional networks. Thus, LSTM can be used when the data are in a sequential format, such as time series or sentences, and it is commonly applied in time-series analysis, natural language processing, speech recognition, etc.

Figure 11: An example of a convolutional neural network (CNN or ConvNet) including multiple convolution and pooling layers

In addition to the most common deep learning methods discussed above, several other deep learning approaches [ 96 ] exist for various purposes. For instance, the self-organizing map (SOM) [ 58 ] uses unsupervised learning to represent high-dimensional data by a 2D grid map, thus achieving dimensionality reduction. The autoencoder (AE) [ 15 ] is another learning technique widely used for dimensionality reduction as well as feature extraction in unsupervised learning tasks. Restricted Boltzmann machines (RBMs) [ 46 ] can be used for dimensionality reduction, classification, regression, collaborative filtering, feature learning, and topic modeling. A deep belief network (DBN) is typically composed of simple, unsupervised networks such as restricted Boltzmann machines or autoencoders, together with a backpropagation neural network (BPNN) [ 123 ]. A generative adversarial network (GAN) [ 39 ] is a form of deep learning network that can generate data with characteristics close to the actual input data. Transfer learning, the re-use of a pre-trained model on a new problem, is currently very common because it can train deep neural networks with comparatively little data [ 124 ]. A brief discussion of these artificial neural network (ANN) and deep learning (DL) models is summarized in our earlier paper, Sarker et al. [ 96 ].

Overall, based on the learning techniques discussed above, we can conclude that various types of machine learning techniques, such as classification, regression, data clustering, feature selection and extraction, dimensionality reduction, association rule learning, reinforcement learning, and deep learning, can play a significant role for various purposes according to their capabilities. In the following section, we discuss several application areas based on machine learning algorithms.

Applications of Machine Learning

In the current age of the Fourth Industrial Revolution (4IR), machine learning has become popular in various application areas because of its ability to learn from past data and make intelligent decisions. In the following, we summarize and discuss ten popular application areas of machine learning technology.

Predictive analytics and intelligent decision-making: A major application field of machine learning is intelligent decision-making through data-driven predictive analytics [ 21 , 70 ]. The basis of predictive analytics is capturing and exploiting relationships between explanatory variables and predicted variables from previous events to predict an unknown outcome [ 41 ]. Examples include identifying suspects or criminals after a crime has been committed, or detecting credit card fraud as it happens. In e-commerce, machine learning algorithms can also assist retailers in better understanding consumer preferences and behavior, managing inventory, avoiding out-of-stock situations, and optimizing logistics and warehousing. Various machine learning algorithms such as decision trees, support vector machines, and artificial neural networks [ 106 , 125 ] are commonly used in this area. Since accurate predictions provide insight into the unknown, they can improve the decisions of industries, businesses, and almost any organization, including government agencies, e-commerce, telecommunications, banking and financial services, healthcare, sales and marketing, transportation, social networking, and many others.

Cybersecurity and threat intelligence: Cybersecurity is one of the most essential areas of Industry 4.0 [ 114 ]; it is the practice of protecting networks, systems, hardware, and data from digital attacks. Machine learning has become a crucial cybersecurity technology that constantly learns by analyzing data to identify patterns, better detect malware in encrypted traffic, find insider threats, predict where bad neighborhoods are online, keep people safe while browsing, and secure data in the cloud by uncovering suspicious activity. For instance, clustering techniques can be used to identify cyber-anomalies, policy violations, etc. Machine learning classification models that take into account the impact of security features are useful for detecting various types of cyber-attacks or intrusions [ 97 ]. Various deep learning-based security models can also be used on large-scale security datasets [ 96 , 129 ]. Moreover, security policy rules generated by association rule learning techniques can play a significant role in building a rule-based security system [ 105 ]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can enable cybersecurity professionals to be more proactive in efficiently preventing threats and cyber-attacks.

Internet of things (IoT) and smart cities: The Internet of Things (IoT) is another essential area of Industry 4.0 [ 114 ]; it turns everyday objects into smart objects by allowing them to transmit data and automate tasks without the need for human interaction. IoT is, therefore, considered the big frontier that can enhance almost all activities in our lives, such as smart governance, smart homes, education, communication, transportation, retail, agriculture, health care, business, and many more [ 70 ]. The smart city is one of IoT’s core fields of application, using technologies to enhance city services and residents’ living experiences [ 132 , 135 ]. As machine learning utilizes experience to recognize trends and create models that help predict future behavior and events, it has become a crucial technology for IoT applications [ 103 ]. For example, predicting traffic in smart cities, predicting parking availability, estimating citizens’ total energy usage for a particular period, and making context-aware and timely decisions for people are tasks that can be solved using machine learning techniques according to people’s current needs.

Traffic prediction and transportation: Transportation systems have become a crucial component of every country’s economic development. Nonetheless, several cities around the world are experiencing an excessive rise in traffic volume, resulting in serious issues such as delays, traffic congestion, higher fuel prices, increased CO\(_2\) pollution, accidents, emergencies, and a decline in modern society’s quality of life [ 40 ]. Thus, an intelligent transportation system that predicts future traffic is important and is an indispensable part of a smart city. Accurate traffic prediction based on machine and deep learning modeling can help to minimize these issues [ 17 , 30 , 31 ]. For example, based on travel history and trends of traveling through various routes, machine learning can assist transportation companies in predicting possible issues that may occur on specific routes and recommending that their customers take a different path. Ultimately, these learning-based data-driven models help improve traffic flow, increase the usage and efficiency of sustainable modes of transportation, and limit real-world disruption by modeling and visualizing future changes.

Healthcare and COVID-19 pandemic: Machine learning can help to solve diagnostic and prognostic problems in a variety of medical domains, such as disease prediction, medical knowledge extraction, detecting regularities in data, patient management, etc. [ 33 , 77 , 112 ]. Coronavirus disease (COVID-19) is an infectious disease caused by a newly discovered coronavirus, according to the World Health Organization (WHO) [ 3 ]. Recently, learning techniques have become popular in the battle against COVID-19 [ 61 , 63 ]. For the COVID-19 pandemic, learning techniques are used to classify patients at high risk, estimate their mortality rate, and detect other anomalies [ 61 ]. They can also be used to better understand the virus’s origin, predict COVID-19 outbreaks, and support disease diagnosis and treatment [ 14 , 50 ]. With the help of machine learning, researchers can forecast where and when COVID-19 is likely to spread, and notify those regions to make the required arrangements. Deep learning also provides exciting solutions to the problems of medical image processing and is seen as a crucial technique for potential applications, particularly for the COVID-19 pandemic [ 10 , 78 , 111 ]. Overall, machine and deep learning techniques can help to fight the COVID-19 pandemic and support intelligent clinical decision-making in the healthcare domain.

E-commerce and product recommendations: Product recommendation is one of the most well-known and widely used applications of machine learning, and one of the most prominent features of almost any e-commerce website today. Machine learning technology can assist businesses in analyzing their consumers’ purchasing histories and making customized product suggestions for their next purchase based on their behavior and preferences. E-commerce companies, for example, can easily position product suggestions and offers by analyzing browsing trends and the click-through rates of specific items. Using predictive modeling based on machine learning techniques, many online retailers, such as Amazon [ 71 ], can better manage inventory, prevent out-of-stock situations, and optimize logistics and warehousing. The future of sales and marketing is the ability to capture, evaluate, and use consumer data to provide a customized shopping experience. Furthermore, machine learning techniques enable companies to create packages and content tailored to the needs of their customers, allowing them to retain existing customers while attracting new ones.

NLP and sentiment analysis: Natural language processing (NLP) involves the reading and understanding of spoken or written language by a computer [ 79 , 103 ]. NLP thus helps computers to read a text, hear speech, interpret it, analyze sentiment, and decide which aspects are significant, and machine learning techniques can be used throughout. Virtual personal assistants, chatbots, speech recognition, document description, and language or machine translation are some examples of NLP-related tasks. Sentiment analysis [ 90 ] (also referred to as opinion mining or emotion AI) is an NLP sub-field that seeks to identify and extract public mood and views within a given text, through blogs, reviews, social media, forums, news, etc. For instance, businesses and brands use sentiment analysis to understand the social sentiment toward their brand, product, or service through social media platforms or the web as a whole. Overall, sentiment analysis is considered a machine learning task that analyzes texts for polarity, such as “positive”, “negative”, or “neutral”, along with more fine-grained emotions such as very happy, happy, sad, very sad, angry, interested, or not interested.
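
A minimal way to see polarity scoring is a lexicon-based sketch: count positive and negative words and compare. The word lists and the `polarity` helper below are invented for illustration; practical sentiment systems learn such weights from labeled data rather than using fixed lists:

```python
# Toy sentiment lexicons; illustrative only.
POSITIVE = {"good", "great", "happy", "love", "excellent"}
NEGATIVE = {"bad", "poor", "sad", "hate", "terrible"}

def polarity(text: str) -> str:
    """Label a text positive, negative, or neutral by word counts."""
    words = text.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("The service was great and the staff were excellent"))
```

This also shows the limits of pure word counting (negation such as “not good” is missed), which is precisely where learned classification models improve on the lexicon approach.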

Image, speech and pattern recognition: Image recognition [ 36 ] is a well-known and widespread example of machine learning in the real world, which can identify an object in a digital image. Labeling an x-ray as cancerous or not, character recognition, face detection in an image, and tagging suggestions on social media, e.g., Facebook, are common examples of image recognition. Speech recognition [ 23 ] is also very popular; it typically uses sound and language models, e.g., in Google Assistant, Cortana, Siri, Alexa, etc. [ 67 ], where machine learning methods are used. Pattern recognition [ 13 ] is defined as the automated recognition of patterns and regularities in data, e.g., image analysis. Several machine learning techniques, such as classification, feature selection, clustering, and sequence labeling methods, are used in this area.

Sustainable agriculture: Agriculture is essential to the survival of all human activities [ 109 ]. Sustainable agriculture practices help to improve agricultural productivity while also reducing negative impacts on the environment [ 5 , 25 , 109 ]. Sustainable agriculture supply chains are knowledge-intensive and based on information, skills, technologies, etc., where knowledge transfer encourages farmers to improve their decisions to adopt sustainable agriculture practices, utilizing the increasing amount of data captured by emerging technologies, e.g., the Internet of Things (IoT), mobile technologies and devices, etc. [ 5 , 53 , 54 ]. Machine learning can be applied in various phases of sustainable agriculture: in the pre-production phase, for the prediction of crop yield, soil properties, irrigation requirements, etc.; in the production phase, for weather prediction, disease detection, weed detection, soil nutrient management, livestock management, etc.; in the processing phase, for demand estimation, production planning, etc.; and in the distribution phase, for inventory management, consumer analysis, etc.

User behavior analytics and context-aware smartphone applications: Context-awareness is a system’s ability to capture knowledge about its surroundings at any moment and modify its behavior accordingly [ 28 , 93 ]. Context-aware computing uses software and hardware to automatically collect and interpret data for direct responses. The mobile app development environment has changed greatly with the power of AI, particularly machine learning techniques, through their ability to learn from contextual data [ 103 , 136 ]. Thus, developers of mobile apps can rely on machine learning to create smart apps that can understand human behavior and support and entertain users [ 107 , 137 , 140 ]. Machine learning techniques are applicable for building various personalized data-driven context-aware systems, such as smart interruption management, smart mobile recommendation, context-aware smart searching, and decision-making systems that intelligently assist mobile phone users in a pervasive computing environment. For example, context-aware association rules can be used to build an intelligent phone call application [ 104 ]. Clustering approaches are useful for capturing users’ diverse behavioral activities by taking into account time-series data [ 102 ]. Classification methods can be used to predict future events in various contexts [ 106 , 139 ]. Thus, the various learning techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can help to build context-aware, adaptive, and smart applications according to the preferences of mobile phone users.

In addition to these application areas, machine learning-based models can also be applied to several other domains, such as bioinformatics, cheminformatics, computer networks, DNA sequence classification, economics and banking, robotics, advanced engineering, and many more.

Challenges and Research Directions

Our study on machine learning algorithms for intelligent data analysis and applications opens several research issues in the area. Thus, in this section, we summarize and discuss the challenges faced and the potential research opportunities and future directions.

In general, the effectiveness and efficiency of a machine learning-based solution depend on the nature and characteristics of the data and the performance of the learning algorithms. Collecting data in relevant domains, such as cybersecurity, IoT, healthcare, and agriculture, discussed in Sect. “Applications of Machine Learning”, is not straightforward, although the current cyberspace enables the production of a huge amount of data at very high frequency. Thus, collecting useful data for the target machine learning-based applications, e.g., smart city applications, and managing it well are important for further analysis. Therefore, a more in-depth investigation of data collection methods is needed when working on real-world data. Moreover, historical data may contain many ambiguous values, missing values, outliers, and meaningless entries. The machine learning algorithms discussed in Sect. “Machine Learning Tasks and Algorithms” are highly impacted by the quality and availability of the training data, and consequently so is the resultant model. Thus, accurately cleaning and pre-processing the diverse data collected from diverse sources is a challenging task. Therefore, effectively modifying or enhancing existing pre-processing methods, or proposing new data preparation techniques, is required to use the learning algorithms effectively in the associated application domain.

To analyze the data and extract insights, there exist many machine learning algorithms, summarized in Sect. “Machine Learning Tasks and Algorithms”. Thus, selecting a learning algorithm suitable for the target application is challenging, because the outcome of different learning algorithms may vary depending on the data characteristics [ 106 ]. Selecting the wrong learning algorithm can produce unexpected outcomes, leading to wasted effort and reduced model effectiveness and accuracy. In terms of model building, the techniques discussed in Sect. “Machine Learning Tasks and Algorithms” can be used directly to solve many real-world issues in diverse domains, such as cybersecurity, smart cities, and healthcare, summarized in Sect. “Applications of Machine Learning”. However, hybrid learning models, e.g., ensembles of methods, modifications or enhancements of existing learning techniques, or the design of new learning methods, could be potential future work in the area.

Thus, the ultimate success of a machine learning-based solution and its corresponding applications mainly depends on both the data and the learning algorithms. If the data are poorly suited to learning, for example non-representative, of poor quality, containing irrelevant features, or insufficient in quantity for training, then the machine learning models may become useless or produce lower accuracy. Therefore, effectively processing the data and handling the diverse learning algorithms are both important for a machine learning-based solution and, eventually, for building intelligent applications.

In this paper, we have conducted a comprehensive overview of machine learning algorithms for intelligent data analysis and applications. According to our goal, we have briefly discussed how various types of machine learning methods can be used to solve various real-world issues. A successful machine learning model depends on both the data and the performance of the learning algorithms. Sophisticated learning algorithms must be trained on collected real-world data and knowledge related to the target application before the system can assist with intelligent decision-making. We also discussed several popular application areas based on machine learning techniques to highlight their applicability to various real-world issues. Finally, we summarized and discussed the challenges faced and the potential research opportunities and future directions in the area. The identified challenges create promising research opportunities in the field, which must be addressed with effective solutions in various application areas. Overall, we believe that our study on machine learning-based solutions opens a promising direction and can serve as a reference guide for potential research and applications for academia, industry professionals, and decision-makers, from a technical point of view.

Canadian Institute for Cybersecurity, University of New Brunswick, ISCX dataset. http://www.unb.ca/cic/datasets/index.html/ (accessed 20 October 2019).

CIC-DDoS2019 dataset [online]. Available: https://www.unb.ca/cic/datasets/ddos-2019.html/ (accessed 28 March 2020).

World Health Organization (WHO). http://www.who.int/ .

Google Trends. https://trends.google.com/trends/ , 2019.

Adnan N, Nordin SM, Rahman I, Noor A. The effects of knowledge transfer on farmers’ decision making toward sustainable agriculture practices. World J Sci Technol Sustain Dev. 2018.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P. Automatic subspace clustering of high dimensional data for data mining applications. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. 1998; 94–105

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: ACM SIGMOD Record. ACM. 1993;22: 207–216

Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the International Conference on Very Large Data Bases (VLDB), Santiago, Chile. 1994;1215:487–99.

Aha DW, Kibler D, Albert M. Instance-based learning algorithms. Mach Learn. 1991;6(1):37–66.


Alakus TB, Turkoglu I. Comparison of deep learning approaches to predict COVID-19 infection. Chaos Solitons Fractals. 2020;140.

Amit Y, Geman D. Shape quantization and recognition with randomized trees. Neural Comput. 1997;9(7):1545–88.

Ankerst M, Breunig MM, Kriegel H-P, Sander J. OPTICS: ordering points to identify the clustering structure. ACM SIGMOD Record. 1999;28(2):49–60.

Anzai Y. Pattern recognition and machine learning. Elsevier; 2012.


Ardabili SF, Mosavi A, Ghamisi P, Ferdinand F, Varkonyi-Koczy AR, Reuter U, Rabczuk T, Atkinson PM. COVID-19 outbreak prediction with machine learning. Algorithms. 2020;13(10):249.


Baldi P. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning, 2012; 37–49 .

Balducci F, Impedovo D, Pirlo G. Machine learning applications on agricultural datasets for smart farm enhancement. Machines. 2018;6(3):38.

Boukerche A, Wang J. Machine learning-based traffic prediction models for intelligent transportation systems. Comput Netw. 2020;181

Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.


Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. CRC Press; 1984.

Cao L. Data science: a comprehensive overview. ACM Comput Surv (CSUR). 2017;50(3):43.


Carpenter GA, Grossberg S. A massively parallel architecture for a self-organizing neural pattern recognition machine. Comput Vis Graph Image Process. 1987;37(1):54–115.

Chiu C-C, Sainath TN, Wu Y, Prabhavalkar R, Nguyen P, Chen Z, Kannan A, Weiss RJ, Rao K, Gonina E, et al. State-of-the-art speech recognition with sequence-to-sequence models. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018 pages 4774–4778. IEEE .

Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1251–1258, 2017.

Cobuloglu H, Büyüktahtakın IE. A stochastic multi-criteria decision analysis for sustainable biomass crop selection. Expert Syst Appl. 2015;42(15–16):6065–74.

Das A, Ng W-K, Woon Y-K. Rapid association rule mining. In: Proceedings of the tenth international conference on Information and knowledge management, pages 474–481. ACM, 2001.

de Amorim RC. Constrained clustering with Minkowski weighted k-means. In: 2012 IEEE 13th International Symposium on Computational Intelligence and Informatics (CINTI), pages 13–17. IEEE, 2012.

Dey AK. Understanding and using context. Person Ubiquit Comput. 2001;5(1):4–7.

Eagle N, Pentland AS. Reality mining: sensing complex social systems. Person Ubiquit Comput. 2006;10(4):255–68.

Essien A, Petrounias I, Sampaio P, Sampaio S. Improving urban traffic speed prediction using data source fusion and deep learning. In: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp). IEEE. 2019: 1–8. .

Essien A, Petrounias I, Sampaio P, Sampaio S. A deep-learning model for urban traffic flow prediction with traffic events mined from twitter. In: World Wide Web, 2020: 1–24 .

Ester M, Kriegel H-P, Sander J, Xiaowei X, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. KDD. 1996;96:226–31.

Fatima M, Pasha M, et al. Survey of machine learning algorithms for disease diagnostic. J Intell Learn Syst Appl. 2017;9(01):1.

Flach PA, Lachiche N. Confirmation-guided discovery of first-order rules with tertius. Mach Learn. 2001;42(1–2):61–95.

Freund Y, Schapire RE, et al. Experiments with a new boosting algorithm. In: ICML, Citeseer. 1996;96:148–156.

Fujiyoshi H, Hirakawa T, Yamashita T. Deep learning-based image recognition for autonomous driving. IATSS Res. 2019;43(4):244–52.

Fukunaga K, Hostetler L. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans Inform Theory. 1975;21(1):32–40.


Goodfellow I, Bengio Y, Courville A, Bengio Y. Deep learning. Cambridge: MIT Press; 2016.

Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y. Generative adversarial nets. In: Advances in neural information processing systems. 2014: 2672–2680.

Guerrero-Ibáñez J, Zeadally S, Contreras-Castillo J. Sensor technologies for intelligent transportation systems. Sensors. 2018;18(4):1212.

Han J, Pei J, Kamber M. Data mining: concepts and techniques. Amsterdam: Elsevier; 2011.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In: ACM SIGMOD Record, ACM. 2000;29:1–12.

Harmon SA, Sanford TH, Sheng X, Turkbey EB, Roth H, Ziyue X, Yang D, Myronenko A, Anderson V, Amalou A, et al. Artificial intelligence for the detection of COVID-19 pneumonia on chest CT using multinational datasets. Nat Commun. 2020;11(1):1–7.

He K, Zhang X, Ren S, Sun J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell. 2015;37(9):1904–16.

He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016: 770–778.

Hinton GE. A practical guide to training restricted Boltzmann machines. In: Neural networks: tricks of the trade. Springer; 2012. p. 599–619.

Holte RC. Very simple classification rules perform well on most commonly used datasets. Mach Learn. 1993;11(1):63–90.

Hotelling H. Analysis of a complex of statistical variables into principal components. J Edu Psychol. 1933;24(6):417.

Houtsma M, Swami A. Set-oriented mining for association rules in relational databases. In: Proceedings of the Eleventh International Conference on Data Engineering, IEEE. 1995:25–33.

Jamshidi M, Lalbakhsh A, Talla J, Peroutka Z, Hadjilooei F, Lalbakhsh P, Jamshidi M, La Spada L, Mirmozafari M, Dehghani M, et al. Artificial intelligence and covid-19: deep learning approaches for diagnosis and treatment. IEEE Access. 2020;8:109581–95.

John GH, Langley P. Estimating continuous distributions in bayesian classifiers. In: Proceedings of the Eleventh conference on Uncertainty in artificial intelligence, Morgan Kaufmann Publishers Inc. 1995; 338–345

Kaelbling LP, Littman ML, Moore AW. Reinforcement learning: a survey. J Artif Intell Res. 1996;4:237–85.

Kamble SS, Gunasekaran A, Gawankar SA. Sustainable industry 4.0 framework: a systematic literature review identifying the current trends and future perspectives. Process Saf Environ Protect. 2018;117:408–25.

Kamble SS, Gunasekaran A, Gawankar SA. Achieving sustainable performance in a data-driven agriculture supply chain: a review for research and applications. Int J Prod Econ. 2020;219:179–94.

Kaufman L, Rousseeuw PJ. Finding groups in data: an introduction to cluster analysis, vol. 344. John Wiley & Sons; 2009.

Keerthi SS, Shevade SK, Bhattacharyya C, Radha Krishna MK. Improvements to platt’s smo algorithm for svm classifier design. Neural Comput. 2001;13(3):637–49.

Khadse V, Mahalle PN, Biraris SV. An empirical comparison of supervised machine learning algorithms for internet of things data. In: 2018 Fourth International Conference on Computing Communication Control and Automation (ICCUBEA), IEEE. 2018; 1–6

Kohonen T. The self-organizing map. Proc IEEE. 1990;78(9):1464–80.

Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: bot-iot dataset. Fut Gen Comput Syst. 2019;100:779–96.

Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, 2012: 1097–1105

Kushwaha S, Bahl S, Bagha AK, Parmar KS, Javaid M, Haleem A, Singh RP. Significant applications of machine learning for covid-19 pandemic. J Ind Integr Manag. 2020;5(4).

Lade P, Ghosh R, Srinivasan S. Manufacturing analytics and industrial internet of things. IEEE Intell Syst. 2017;32(3):74–9.

Lalmuanawma S, Hussain J, Chhakchhuak L. Applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review. Chaos Sol Fract. 2020:110059 .

LeCessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc Ser C (Appl Stat). 1992;41(1):191–201.

LeCun Y, Bottou L, Bengio Y, Haffner P. Gradient-based learning applied to document recognition. Proc IEEE. 1998;86(11):2278–324.

Liu H, Motoda H. Feature extraction, construction and selection: A data mining perspective, vol. 453. Springer Science & Business Media; 1998.

López G, Quesada L, Guerrero LA. Alexa vs. siri vs. cortana vs. google assistant: a comparison of speech-based natural user interfaces. In: International Conference on Applied Human Factors and Ergonomics, Springer. 2017; 241–250.

Liu B, HsuW, Ma Y. Integrating classification and association rule mining. In: Proceedings of the fourth international conference on knowledge discovery and data mining, 1998.

MacQueen J, et al. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967;volume 1, pages 281–297. Oakland, CA, USA.

Mahdavinejad MS, Rezvan M, Barekatain M, Adibi P, Barnaghi P, Sheth AP. Machine learning for internet of things data analysis: a survey. Digit Commun Netw. 2018;4(3):161–75.

Marchand A, Marx P. Automated product recommendations with preference-based explanations. J Retail. 2020;96(3):328–43.

McCallum A. Information extraction: distilling structured data from unstructured text. Queue. 2005;3(9):48–57.

Mehrotra A, Hendley R, Musolesi M. Prefminer: mining user’s preferences for intelligent mobile notification management. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Heidelberg, Germany, 12–16 September, 2016; pp. 1223–1234. ACM, New York, USA. .

Mohamadou Y, Halidou A, Kapen PT. A review of mathematical modeling, artificial intelligence and datasets used in the study, prediction and management of covid-19. Appl Intell. 2020;50(11):3913–25.

Mohammed M, Khan MB, Bashier Mohammed BE. Machine learning: algorithms and applications. CRC Press; 2016.

Book   Google Scholar  

Moustafa N, Slay J. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In: 2015 military communications and information systems conference (MilCIS), 2015;pages 1–6. IEEE .

Nilashi M, Ibrahim OB, Ahmadi H, Shahmoradi L. An analytical method for diseases prediction using machine learning techniques. Comput Chem Eng. 2017;106:212–23.

Yujin O, Park S, Ye JC. Deep learning covid-19 features on cxr using limited training data sets. IEEE Trans Med Imaging. 2020;39(8):2688–700.

Otter DW, Medina JR , Kalita JK. A survey of the usages of deep learning for natural language processing. IEEE Trans Neural Netw Learn Syst. 2020.

Park H-S, Jun C-H. A simple and fast algorithm for k-medoids clustering. Expert Syst Appl. 2009;36(2):3336–41.

Liii Pearson K. on lines and planes of closest fit to systems of points in space. Lond Edinb Dublin Philos Mag J Sci. 1901;2(11):559–72.

Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.

MathSciNet   MATH   Google Scholar  

Perveen S, Shahbaz M, Keshavjee K, Guergachi A. Metabolic syndrome and development of diabetes mellitus: predictive modeling based on machine learning techniques. IEEE Access. 2018;7:1365–75.

Santi P, Ram D, Rob C, Nathan E. Behavior-based adaptive call predictor. ACM Trans Auton Adapt Syst. 2011;6(3):21:1–21:28.

Polydoros AS, Nalpantidis L. Survey of model-based reinforcement learning: applications on robotics. J Intell Robot Syst. 2017;86(2):153–73.

Puterman ML. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons; 2014.

Quinlan JR. Induction of decision trees. Mach Learn. 1986;1:81–106.

Quinlan JR. C4.5: programs for machine learning. Mach Learn. 1993.

Rasmussen C. The infinite gaussian mixture model. Adv Neural Inform Process Syst. 1999;12:554–60.

Ravi K, Ravi V. A survey on opinion mining and sentiment analysis: tasks, approaches and applications. Knowl Syst. 2015;89:14–46.

Rokach L. A survey of clustering algorithms. In: Data mining and knowledge discovery handbook, pages 269–298. Springer, 2010.

Safdar S, Zafar S, Zafar N, Khan NF. Machine learning based decision support systems (dss) for heart disease diagnosis: a review. Artif Intell Rev. 2018;50(4):597–623.

Sarker IH. Context-aware rule learning from smartphone data: survey, challenges and future directions. J Big Data. 2019;6(1):1–25.

Sarker IH. A machine learning based robust prediction model for real-life mobile phone data. Internet Things. 2019;5:180–93.

Sarker IH. Ai-driven cybersecurity: an overview, security intelligence modeling and research directions. SN Comput Sci. 2021.

Sarker IH. Deep cybersecurity: a comprehensive overview from neural network and deep learning perspective. SN Comput Sci. 2021.

Sarker IH, Abushark YB, Alsolami F, Khan A. Intrudtree: a machine learning based cyber security intrusion detection model. Symmetry. 2020;12(5):754.

Sarker IH, Abushark YB, Khan A. Contextpca: predicting context-aware smartphone apps usage based on machine learning techniques. Symmetry. 2020;12(4):499.

Sarker IH, Alqahtani H, Alsolami F, Khan A, Abushark YB, Siddiqui MK. Context pre-modeling: an empirical analysis for classification based user-centric context-aware predictive modeling. J Big Data. 2020;7(1):1–23.

Sarker IH, Alan C, Jun H, Khan AI, Abushark YB, Khaled S. Behavdt: a behavioral decision tree learning to build user-centric context-aware predictive model. Mob Netw Appl. 2019; 1–11.

Sarker IH, Colman A, Kabir MA, Han J. Phone call log as a context source to modeling individual user behavior. In: Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (Ubicomp): Adjunct, Germany, pages 630–634. ACM, 2016.

Sarker IH, Colman A, Kabir MA, Han J. Individualized time-series segmentation for mining mobile phone user behavior. Comput J Oxf Univ UK. 2018;61(3):349–68.

Sarker IH, Hoque MM, MdK Uddin, Tawfeeq A. Mobile data science and intelligent apps: concepts, ai-based modeling and research directions. Mob Netw Appl, pages 1–19, 2020.

Sarker IH, Kayes ASM. Abc-ruleminer: user behavioral rule-based machine learning method for context-aware intelligent services. J Netw Comput Appl. 2020; page 102762

Sarker IH, Kayes ASM, Badsha S, Alqahtani H, Watters P, Ng A. Cybersecurity data science: an overview from machine learning perspective. J Big Data. 2020;7(1):1–29.

Sarker IH, Watters P, Kayes ASM. Effectiveness analysis of machine learning classification models for predicting personalized context-aware smartphone usage. J Big Data. 2019;6(1):1–28.

Sarker IH, Salah K. Appspred: predicting context-aware smartphone apps using random forest learning. Internet Things. 2019;8:

Scheffer T. Finding association rules that trade support optimally against confidence. Intell Data Anal. 2005;9(4):381–95.

Sharma R, Kamble SS, Gunasekaran A, Kumar V, Kumar A. A systematic literature review on machine learning applications for sustainable agriculture supply chain performance. Comput Oper Res. 2020;119:

Shengli S, Ling CX. Hybrid cost-sensitive decision tree, knowledge discovery in databases. In: PKDD 2005, Proceedings of 9th European Conference on Principles and Practice of Knowledge Discovery in Databases. Lecture Notes in Computer Science, volume 3721, 2005.

Shorten C, Khoshgoftaar TM, Furht B. Deep learning applications for covid-19. J Big Data. 2021;8(1):1–54.

Gökhan S, Nevin Y. Data analysis in health and big data: a machine learning medical diagnosis model based on patients’ complaints. Commun Stat Theory Methods. 2019;1–10

Silver D, Huang A, Maddison CJ, Guez A, Sifre L, Van Den Driessche G, Schrittwieser J, Antonoglou I, Panneershelvam V, Lanctot M, et al. Mastering the game of go with deep neural networks and tree search. nature. 2016;529(7587):484–9.

Ślusarczyk B. Industry 4.0: Are we ready? Polish J Manag Stud. 17, 2018.

Sneath Peter HA. The application of computers to taxonomy. J Gen Microbiol. 1957;17(1).

Sorensen T. Method of establishing groups of equal amplitude in plant sociology based on similarity of species. Biol Skr. 1948; 5.

Srinivasan V, Moghaddam S, Mukherji A. Mobileminer: mining your frequent patterns on your phone. In: Proceedings of the International Joint Conference on Pervasive and Ubiquitous Computing, Seattle, WA, USA, 13-17 September, pp. 389–400. ACM, New York, USA. 2014.

Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015; pages 1–9.

Tavallaee M, Bagheri E, Lu W, Ghorbani AA. A detailed analysis of the kdd cup 99 data set. In. IEEE symposium on computational intelligence for security and defense applications. IEEE. 2009;2009:1–6.

Tsagkias M. Tracy HK, Surya K, Vanessa M, de Rijke M. Challenges and research opportunities in ecommerce search and recommendations. In: ACM SIGIR Forum. volume 54. NY, USA: ACM New York; 2021. p. 1–23.

Wagstaff K, Cardie C, Rogers S, Schrödl S, et al. Constrained k-means clustering with background knowledge. Icml. 2001;1:577–84.

Wang W, Yang J, Muntz R, et al. Sting: a statistical information grid approach to spatial data mining. VLDB. 1997;97:186–95.

Wei P, Li Y, Zhang Z, Tao H, Li Z, Liu D. An optimization method for intrusion detection classification model based on deep belief network. IEEE Access. 2019;7:87593–605.

Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big data. 2016;3(1):9.

Witten IH, Frank E. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2005.

Witten IH, Frank E, Trigg LE, Hall MA, Holmes G, Cunningham SJ. Weka: practical machine learning tools and techniques with java implementations. 1999.

Wu C-C, Yen-Liang C, Yi-Hung L, Xiang-Yu Y. Decision tree induction with a constrained number of leaf nodes. Appl Intell. 2016;45(3):673–85.

Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inform Syst. 2008;14(1):1–37.

Xin Y, Kong L, Liu Z, Chen Y, Li Y, Zhu H, Gao M, Hou H, Wang C. Machine learning and deep learning methods for cybersecurity. IEEE Access. 2018;6:35365–81.

Xu D, Yingjie T. A comprehensive survey of clustering algorithms. Ann Data Sci. 2015;2(2):165–93.

Zaki MJ. Scalable algorithms for association mining. IEEE Trans Knowl Data Eng. 2000;12(3):372–90.

Zanella A, Bui N, Castellani A, Vangelista L, Zorzi M. Internet of things for smart cities. IEEE Internet Things J. 2014;1(1):22–32.

Zhao Q, Bhowmick SS. Association rule mining: a survey. Singapore: Nanyang Technological University; 2003.

Zheng T, Xie W, Xu L, He X, Zhang Y, You M, Yang G, Chen Y. A machine learning-based framework to identify type 2 diabetes through electronic health records. Int J Med Inform. 2017;97:120–7.

Zheng Y, Rajasegarar S, Leckie C. Parking availability prediction for sensor-enabled car parks in smart cities. In: Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP), 2015 IEEE Tenth International Conference on. IEEE, 2015; pages 1–6.

Zhu H, Cao H, Chen E, Xiong H, Tian J. Exploiting enriched contextual information for mobile app classification. In: Proceedings of the 21st ACM international conference on Information and knowledge management. ACM, 2012; pages 1617–1621

Zhu H, Chen E, Xiong H, Kuifei Y, Cao H, Tian J. Mining mobile user preferences for personalized context-aware recommendation. ACM Trans Intell Syst Technol (TIST). 2014;5(4):58.

Zikang H, Yong Y, Guofeng Y, Xinyu Z. Sentiment analysis of agricultural product ecommerce review data based on deep learning. In: 2020 International Conference on Internet of Things and Intelligent Applications (ITIA), IEEE, 2020; pages 1–7

Zulkernain S, Madiraju P, Ahamed SI. A context aware interruption management system for mobile devices. In: Mobile Wireless Middleware, Operating Systems, and Applications. Springer. 2010; pages 221–234

Zulkernain S, Madiraju P, Ahamed S, Stamm K. A mobile intelligent interruption management system. J UCS. 2010;16(15):2060–80.

Download references

Author information

Authors and affiliations.

Swinburne University of Technology, Melbourne, VIC, 3122, Australia

Iqbal H. Sarker

Department of Computer Science and Engineering, Chittagong University of Engineering & Technology, 4349, Chattogram, Bangladesh

Corresponding author

Correspondence to Iqbal H. Sarker .

Ethics declarations

Conflict of interest.

The author declares no conflict of interest.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article is part of the topical collection “Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications” guest edited by Bhanu Prakash K N and M. Shivakumar.

About this article

Sarker, I.H. Machine Learning: Algorithms, Real-World Applications and Research Directions. SN Comput. Sci. 2, 160 (2021). https://doi.org/10.1007/s42979-021-00592-x

Received: 27 January 2021

Accepted: 12 March 2021

Published: 22 March 2021

  • Machine learning
  • Deep learning
  • Artificial intelligence
  • Data science
  • Data-driven decision-making
  • Predictive analytics
  • Intelligent applications
  • Open access
  • Published: 03 August 2021

Predictive analytics using Big Data for the real estate market during the COVID-19 pandemic

  • Andrius Grybauskas (ORCID: orcid.org/0000-0002-3246-645X),
  • Vaida Pilinkienė &
  • Alina Stundžienė

Journal of Big Data, volume 8, Article number: 105 (2021)

As the COVID-19 pandemic came unexpectedly, many real estate experts claimed that property values would fall as they did in the 2007 crash. This study asks which attributes of an apartment are most likely to influence a price revision during the pandemic. Findings in prior studies lack consensus, especially regarding the time-on-the-market variable, whose reported effect is omnidirectional. Drawing on the possibilities of Big Data, this study used a web-scraping algorithm to collect a total of 18,992 property listings in the city of Vilnius during the first wave of the COVID-19 pandemic. Fifteen different machine learning models were then applied to forecast apartment price revisions, and SHAP values were used for interpretability. The findings coincide with previous literature, affirming that real estate is quite resilient to pandemics, as the price drops were not as dramatic as first believed. Of the 15 models tested, extreme gradient boosting was the most accurate, although the differences were negligible. The retrieved SHAP values show that the time-on-the-market variable was by far the most dominant and consistent variable for price-revision forecasting, and that it exhibited an inverted U-shaped behaviour.

Introduction

The emergence of the COVID-19 pandemic and its detrimental consequences for the global financial system were unexpected and affected millions of people by plunging economic activity into a partial shutdown. Without exception, the virus reached Lithuania on February 29th; what seemed at first to be a minuscule obstacle, with only a few cases of sickness reported, escalated until, by March 16th, the government of Lithuania introduced quarantine measures shutting down almost all operations of the economy. The quarantine included restrictions and/or bans on travel, restaurants, bars, concerts, nightclub activities, hotels, sports clubs and tourism, and left other leisure activities heavily regulated as well. Under these circumstances, many experts publicly claimed that housing prices would fall and assumed a 2007-style mass housing sale discount by troubled asset owners. This led to the following questions. Which prices would fall? More precisely, which predictors are best for experts to follow to anticipate price changes? Is the year the house was built the best criterion for anticipating a discount? Will the heating type heavily influence the price revision, and should it thus be monitored closely? Is it reasonable to assume that time-on-the-market (TOM) affects the price change negatively? All these questions are extremely relevant for families, investors, entrepreneurs and even governments planning to grant financial support to harmed asset owners.

For the most part, a literature review is the starting point for data analysts in search of answers to complicated questions. Interestingly, real estate theory turned out to be unsettled in many respects. For instance, Johnson et al. [ 26 ], who carried out an in-depth review of previous studies addressing the price–TOM relationship, found that 29 studies had captured a positive relationship, 52 displayed a negative relationship, and 24 studies did not find any significant impact on the price. Other covariates in the literature also exhibited an omnidirectional response to real estate prices, making it hard to deduce which variables influence price revisions the most and in what direction.

Although Park and Bae [ 35 ], Borde et al. [ 10 ], Trawiński et al. [ 39 ], Čeh et al. [ 17 ], Baldominos et al. [ 5 ], De Nadai and Lepri [ 19 ], Pérez-Rave et al. [ 36 ] and Côrte-Real et al. [ 16 ] applied Big Data successfully in various fields, many of these papers focused on hedonic price-determination models. Further, the modelling of price change was in most cases a by-product of the models, in the sense that the dependent variable was not the price change but the final transaction or listing price. Although it is easy to miss some studies in the sea of real estate literature, the ones using price change as a dependent variable were carried out by Knight [ 28 ], Khezr [ 27 ] and Verbrugge et al. [ 40 ], but only probit and regression models were employed. Additionally, Pérez-Rave et al. [ 36 ] argued that the predictive power of hedonic regression is not mature and that it is better suited to inference, while admitting that machine learning (ML) models have drawbacks in explaining their predictions. However, with the recent introduction of Shapley additive explanations (SHAP) by Lundberg and Lee [ 31 ], a new dimension of knowledge can be obtained. For all of these reasons, this study uses ML methods to uncover the best predictors of an apartment price drop during the COVID-19 pandemic in Lithuania.
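SHAP values build on the game-theoretic Shapley value: a feature's attribution is its average marginal contribution over all coalitions of the other features. The sketch below is not the authors' implementation (in practice an optimised SHAP library is applied to the fitted model); it computes exact Shapley values by brute-force enumeration for a toy additive price model whose feature names and contributions are invented for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values for a payoff function `value` defined on
    feature coalitions, by enumerating every subset of the other features."""
    n = len(features)
    phi = {f: 0.0 for f in features}
    for f in features:
        others = [g for g in features if g != f]
        for k in range(n):
            for subset in combinations(others, k):
                s = len(subset)
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                phi[f] += weight * (value(set(subset) | {f}) - value(set(subset)))
    return phi

# Hypothetical additive predictor: its output is the sum of fixed
# per-feature contributions (in thousands of euros, made up here).
contrib = {"tom": -7.0, "location": 3.0, "sqm": 5.0}

def model_output(coalition):
    return sum(contrib[f] for f in coalition)

phi = shapley_values(list(contrib), model_output)
# For an additive model, each feature's Shapley value recovers its
# contribution exactly, e.g. phi["tom"] == -7.0.
```

For tree ensembles such as the extreme gradient boosting model found best here, the same quantity is computed in polynomial time by the TreeSHAP algorithm rather than by this exponential enumeration.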

The present work makes several worthwhile contributions to the existing literature. First, it provides foresight for households, entrepreneurs and investors related to the real estate sector by explaining which variables should be considered to anticipate price drops in the real estate market. Second, it brings further clarity to the TOM variable's behaviour using SHAP values. Third, it provides insight into which ML models were the most accurate for real estate predictive analytics. Fourth and finally, it contributes to the existing literature by examining feature importance during a pandemic.

The remainder of this paper is structured as follows. The “Literature review” section analyses the existing knowledge on covariates and their implications for predicting real estate prices. The “Methodology” section outlines data collection and the methodological steps taken in constructing the ML models. The “Research results” section presents the empirical results and model interpretations, and the “Conclusions” section presents the conclusions of the paper.

Literature review

Following the advice of Armstrong et al. [ 2 ], a review of prior knowledge must be carried out before constructing a formidable forecasting model. Years of causal inference can contribute important insights and help avoid the nonsensical relationships that models sometimes assign by chance; thus, to obtain a solid theoretical basis for the forecasting model, the literature review was conducted in three parts. First, a review of previously used variables and their effects on price was carried out, which guided the choice of candidate variables for the forecasting model. The second step examined studies that attempted to measure variable importance, highlighting the literature gap. Finally, the third step reviewed real estate and pandemic studies to gather any additional insight that could be helpful for model explanation or construction.

The variable review

The first variable on the list was the most intriguing and widely discussed covariate among real estate scholars: the so-called TOM variable. The best summary of this variable's effect is the study by Benefield et al. [ 9 ]: out of 197 price-equation estimations, 73 instances reported an insignificant, 24 a positive and 100 a negative TOM relationship with the real estate price. These findings stem from two long-established theories: the search theory formulated by Yinger [ 41 ] and the sale clearance theory of Lazear [ 29 ].

The former theory states that the longer a property is on the market (listed on a real estate website), the higher the probability of discovering a buyer willing to pay the highest price. This notion makes intuitive sense, as not all buyers constantly refresh websites and spot every single property in the sea of listings. Since full-time work and other personal matters consume most of any individual's time, a longer TOM does not necessarily increase the likelihood of a price drop but instead helps to find a buyer willing to pay the highest price.

In contrast, the Lazear [ 29 ] clearance model states that a high TOM value simply indicates a lack of buyer interest; thus, to make the property more attractive, the price needs to be reduced. Authors who sympathise with this theory argue that with longer TOM values a certain stigma attaches to the property, as if it were not valuable or something were inherently wrong with it. The most recent papers, by An et al. [ 3 ] and He et al. [ 25 ], further attempted to explain the TOM phenomenon. An et al. [ 3 ] claimed that the TOM effect on price depends solely on market conditions: in times of high growth, a longer TOM should help find the best buyer, but in an economic downturn, higher TOM values will negatively affect the selling price. He et al. [ 25 ] argued that the TOM relationship is non-linear and has an inverted U-shaped component, meaning that up to a certain point the TOM variable raises the chance of finding the best buyer, but after the turning point the TOM effect becomes negative.
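The inverted U-shape described by He et al. can be made concrete with a toy quadratic specification; the coefficients below are illustrative only, not estimates from any of the cited studies. With a TOM effect f(t) = b1·t + b2·t² and b2 < 0, the effect peaks at t* = -b1/(2·b2), after which extra days on the market reduce the expected price.

```python
# Illustrative inverted-U effect of time-on-the-market (TOM) on price.
# The coefficients are made up for demonstration; He et al. estimate the
# shape empirically.
b1, b2 = 0.8, -0.01          # hypothetical linear and quadratic TOM terms

def tom_effect(t):
    """Quadratic TOM effect: rises up to the turning point, then falls."""
    return b1 * t + b2 * t ** 2

turning_point = -b1 / (2 * b2)   # days on market at the peak effect
# → 40.0: before ~40 days a longer TOM helps the price, afterwards it hurts.
```

A linear model forced onto such data would average the rising and falling branches, which may help explain the contradictory signs reported across the 197 estimations summarised above.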

Two points regarding the TOM variable must be considered. First, most studies fitted linear models, which constrain the dynamics of the TOM variable. Second, researchers have used different local-market datasets, so different geographical locations may well yield different results. Either way, given the many differing conclusions, it is difficult to grasp the magnitude or direction of the TOM effect from earlier studies alone. Nonetheless, many papers consider TOM an important factor influencing real estate prices; the variable is therefore essential in the forecasting model.

The empirical findings of Huang and Palmquist [ 24 ], Knight [ 28 ], Anglin et al. [ 1 ], Herrin [ 23 ], Johnson et al. [ 26 ], Benefield et al. [ 6 ] and Verbrugge et al. [ 40 ] suggested that the initial price setting, or the degree of overpricing, can affect the price change. The idea is that asset owners set an initial price too high relative to similar properties on the market and eventually have to reduce it. This relates to information asymmetry and is acknowledged by many authors; thus, the price variable should also be included.

Another variable worth discussing is location. In research papers by Rosiers et al. [ 37 ], Owusu-Edusei et al. [ 34 ], Benefield et al. [ 6 ], Khezr [ 27 ], Verbrugge et al. [ 40 ], Baldominos et al. [ 5 ], Du et al. [ 18 ], Bogin et al. [ 11 ], Metzner and Kindt [ 32 ] and Oust et al. [ 33 ], location was found to affect the price of an asset significantly in one direction or the other. The intuition behind this covariate is simply that some areas of a city have better infrastructure, or perhaps higher traffic and crime rates, so prices are higher or lower in certain zones. Income segregation across city zones also persists, since wealthier people tend to live in more expensive neighbourhoods. Hence, different areas can be expected to react differently to shocks. Some authors, such as Huang and Palmquist [ 24 ] and Park and Bae [ 35 ], even included distances to schools or shops. Families tend to look for a “full package”: the price of a building is only part of the equation. A house might be cheaper in one zone, but if the nearest school is far away, the constant driving back and forth every month will incur additional expenses, and the initial gain from a lower apartment price will evaporate in the long run. As a result, the location variable helps to control for important factors that can affect a price change.

The sheer extent of the real estate literature makes it impossible to review all variables; nevertheless, a pattern of repeating covariates was detected across most studies. These included heating type, building type, asymmetric information, agencies, year built, proximity to shops, universities, schools and train stations, size in square metres, and the numbers of rooms, floors, garages and pools, along with other individual housing characteristics, although little was said about the significance or predictive power of each variable.

Studies that measured variable importance

In addition to the conflicting evidence on how variables affect price, it was difficult to extract from previous studies any findings on the importance, or so-called predictive power, of each variable. Knowing that the TOM variable influences price changes means very little if the magnitude of the effect is minuscule. Unfortunately, only a handful of papers have investigated this issue. The papers that attempted to estimate the probability of a price change were written by Knight [ 28 ], Khezr [ 27 ] and Verbrugge et al. [ 40 ]. However, while Verbrugge et al. [ 40 ] noted that the initial rent price, TOM and location were the most important variables in predicting rent price changes, the authors regrettably did not analyse the sales price. Further, the empirical model of Khezr [ 27 ] did not provide any ranked importance but indicated that a longer TOM and thin markets increased the likelihood of price drops. Knight [ 28 ] proposed that the biggest revisions were due to higher vacancy, mark-up and seller motivation; being within a certain price range also decreased the probability of a price change. However, these authors employed only probit or regression models, which did not address the non-linearity within the TOM or other variables. Moreover, recent advances in machine learning have not been tested. This leaves many unanswered questions and a literature gap.

Pandemic impact on the variable importance

Regarding variable importance during pandemics, a handful of recent studies recorded that the location variable can strongly influence price revisions. Liu and Su [ 30 ] discovered that during COVID-19, housing demand shifted away from high-population-density areas. Similarly, Gupta et al. [ 22 ] showed that house prices and rents declined in city centres during the COVID-19 period. People are expected to flee crowded areas, as virus infections are more likely to occur there. Even in the London cholera outbreak analysed by Ambrus et al. [ 4 ], it was reported that ten years after the outbreak, real estate prices in the affected part of London were still significantly lower, since a single neighbourhood had a constantly recurring disease rate, attaching a certain stigma to that zone. Likewise, a study by Francke and Korevaary [ 20 ] analysed the plague outbreak in Amsterdam and the cholera spread in Paris. Both pandemics significantly raised mortality rates and diminished consumer confidence, consequently affecting the real estate market. The authors found an annual decline of about 5% in housing prices and around 2% in rent prices; it was also established that certain infected neighbourhoods lost value owing to renters' risk perceptions, but these losses quickly reversed once the disease disappeared. Therefore, the location or city-centre variable is an important predictor.

Other studies focused more on real estate price analysis. Wong [ 42 ] recorded a small 1.5% housing price decrease during the SARS outbreak in Hong Kong. Additionally, a recent study by Giudice et al. [ 21 ] constructed a forecasting model to evaluate the COVID-19 influence on real estate price changes in Italy. The authors employed the Lotka–Volterra estimation (a “prey–predator” model) and concluded that housing prices are expected to drop by 4.16% in the short run and by 6.49% in the mid run. Following the logic of An et al. [ 3 ], the TOM variable effect should be negative since pandemics put the economy into a recession, but it could also exhibit other functional forms, as mentioned by He et al. [ 25 ].

Regrettably, the previously mentioned studies on epidemics did not yield insights into how the TOM or other variables changed and what predictive power they held during the pandemics. The location variable effect on price revision might exist, but the magnitude might be small. Also, the authors only tested regression models without trying other machine learning methods. For this reason, further empirical research is needed.

Methodology

The methodology of this paper comprised three steps: (1) data mining, (2) data cleaning and preparation and (3) machine learning methods. For better understanding, the entire research framework is depicted in Fig.  1 .

figure 1

Research framework

Data mining

Recently, it has become common to use web scraping for data collection. Simply put, web scraping extracts structured data from websites in an automated way and has been used by authors such as Borde et al. [ 10 ], Pérez-Rave et al. [ 36 ] and Berawi et al. [ 7 ]. In this paper, the Python programming language, with the BeautifulSoup and Selenium packages, was used to write an algorithm that collected the desired variables for apartment listings, covering both sell and rent operations, in the capital city of Vilnius. The data were collected monthly from May to August 2020, a total of 4 months, and the dataset for each month was saved independently. This period covers two important aspects: the beginning of the coronavirus outbreak, including the quarantine period, and the quarantine release period. With quarantine restrictions tightening and loosening, it is interesting to test whether the variables had different impacts on the forecasting model.
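In spirit, the scraping step can be sketched with the standard library alone (the study itself used BeautifulSoup and Selenium). The listing markup, class names and field set below are invented for illustration and do not reflect the actual portal's structure:

```python
from html.parser import HTMLParser

# Hypothetical listing markup; the real portal's structure and class
# names are assumptions, not the ones used in the study.
SAMPLE_HTML = """
<div class="listing"><span class="price">125000</span><span class="rooms">3</span></div>
<div class="listing"><span class="price">89900</span><span class="rooms">2</span></div>
"""

class ListingParser(HTMLParser):
    """Collects {price, rooms} records from <span class="..."> fields."""
    def __init__(self):
        super().__init__()
        self.field = None    # class of the <span> currently open
        self.rows = []       # finished listings
        self.current = {}    # listing being built

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div" and attrs.get("class") == "listing":
            self.current = {}
        elif tag == "span":
            self.field = attrs.get("class")

    def handle_data(self, data):
        if self.field in ("price", "rooms"):
            self.current[self.field] = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None
        elif tag == "div" and self.current:
            self.rows.append(self.current)
            self.current = {}

parser = ListingParser()
parser.feed(SAMPLE_HTML)
print(parser.rows)  # [{'price': 125000, 'rooms': 3}, {'price': 89900, 'rooms': 2}]
```

In a real run, each month's parsed rows would then be written to a separate file, matching the monthly datasets described above.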

Data cleaning and processing

After the extensive data collection and cleaning procedures, a total of 18,992 apartment listings were gathered over the four-month period, with at most 16 features: zone (the city zone the apartment is located in), address, listing price, number of rooms, apartment size, floor, number of floors, change in the list price, year built, distance to the shop, distance to the kindergarten, distance to the school, built type (whether the apartment is made of bricks, etc.), heating type, vacancy and price change date. Some features, like heating type, had more than 40 levels and were reorganised into 13 levels. It is worth mentioning that the size of the collected dataset was very close to the population size, as the retrieved data represented the majority of all existing apartment listings in Vilnius.

Afterwards, the price drops of the property listings in Vilnius were analysed and compared to previous authors' work on pandemics. Additionally, since many authors have found the TOM variable to be a significant predictor of price drops, a heatmap of TOM values across the Vilnius city boroughs was created for all four months and both sell and rent operations. From the heatmap, one can also observe whether vacancies were more prominent in the city centre than in other zones: darker colours show higher vacancy values and brighter colours indicate smaller ones. Additional variable distribution visualisations of the rent and sell operations are depicted in Appendices 1 and 2 .

Before applying supervised learning, data preparation and feature selection processes were initiated. First, the target variable (indicating whether a price change occurred or not) was composed into a dummy variable for each month, as follows:

\[ {y}_{i}={I}_{A}\left({x}_{i}\right)=\left\{\begin{array}{ll}1,& \text{if a price change occurred}\\ 0,& \text{otherwise}\end{array}\right. \]

where I is an indicator function over the event space A that sets the dummy variable y to 1 if a price change occurred and to 0 if it did not. Similarly, the location variable was composed into a dummy variable: apartments located in the city centre were assigned a value of 1 and those outside it a value of 0. Furthermore, to avoid noise and the curse of dimensionality, this study employed target encoding for the heating and built type variables. The formula for the target encoding has the following form:

\[ {\varphi }_{j}=\frac{\sum_{i=1}^{N}{y}_{i}\,I\left({x}_{i}=j\right)}{\sum_{i=1}^{N}I\left({x}_{i}=j\right)} \]

where N marks the number of collected data points ( \({x}_{i}\) , \({y}_{i}\) ), x marks the input variable, y marks the target variable, j marks the level and I is the indicator function that maps each level of x into a feature \(\mathrm{\varphi }\) . Additionally, variables like the number of rooms, the number of floors in the building and the floor on which the apartment is located were encoded ordinally to preserve their rank order.
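A minimal sketch of the target encoding described above, using invented heating-type data (the study applied it to the heating and built type variables of the Vilnius listings):

```python
# Toy data; values are invented for illustration only.
heating = ["central", "gas", "central", "electric", "gas", "central"]
price_changed = [1, 0, 0, 1, 1, 0]   # target dummy: 1 if the list price changed

# Target encoding: replace each level j of the categorical variable with
# the mean of the target over the observations having that level.
levels = set(heating)
phi = {
    j: sum(y for x, y in zip(heating, price_changed) if x == j)
       / sum(1 for x in heating if x == j)
    for j in levels
}
encoded = [phi[x] for x in heating]
print(phi["gas"])  # 0.5 -> half of the "gas" apartments changed price
```

This collapses a 13-level category into a single numeric feature, which is why it helps against the curse of dimensionality.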

Machine learning methods

The ML process had two distinct stages, as shown in Fig.  1 . In the first stage, the dataset was split into 70% training and 30% test sets, and the most consistent ML algorithm (MCMLA) across the months was sought on the training set to ensure a uniform interpretation when using SHAP values, as different algorithms might exhibit different variable effects. Thus, for all four months, the following 15 algorithms were applied: CatBoost Classifier, Extreme Gradient Boosting (XGB), Light Gradient Boosting Machine, Random Forest Classifier, Extra Trees Classifier, Gradient Boosting Classifier, Linear Discriminant Analysis, Logistic Regression, Ridge Classifier, Naive Bayes, Ada Boost Classifier, K-Neighbors Classifier, Decision Tree Classifier, Quadratic Discriminant Analysis and SVM with a linear kernel (due to the abundance of algorithms, their formulae are not shown; they are standard in Python libraries). Furthermore, for each algorithm, the SMOTE synthetic minority oversampling algorithm was deployed on the training set during the stratified cross-validation. As described by Chawla et al. [ 14 ], SMOTE takes each minority sample, finds its five nearest minority neighbours under the Euclidean distance metric and interpolates between the sample and a neighbour to generate new synthetic samples. This was done for each month separately and addressed the class imbalance problem.
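The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is a toy two-feature version with invented points; the study used a library implementation inside the cross-validation loop:

```python
import random

def smote_2d(minority, n_new, k=5, seed=0):
    """Generate synthetic minority points by interpolating between a
    sample and one of its k nearest minority neighbours (Euclidean),
    following the idea of Chawla et al.'s SMOTE."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2,
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position on the segment from x to nb
        synthetic.append((x[0] + gap * (nb[0] - x[0]),
                          x[1] + gap * (nb[1] - x[1])))
    return synthetic

# Invented minority-class points (e.g. listings whose price changed).
minority = [(1.0, 1.0), (1.2, 0.9), (2.0, 1.5), (0.8, 1.1)]
new_points = smote_2d(minority, n_new=3)
print(len(new_points))  # 3
```

Because every synthetic point lies on a segment between two existing minority points, the oversampled class stays inside the region the minority already occupies.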

Subsequently, the 15 models' results for the four months and both sell and rent operations were evaluated on seven different criteria: accuracy, area under the curve (AUC), recall, precision, F1-score, Kappa and the Matthews correlation coefficient (MCC). As described by Brownlee [ 8 ], using these criteria, one can objectively choose the best model for the task at hand. In this paper, the most attention was paid to the accuracy, F1-score and precision ratios, since this study dealt with an imbalanced dataset containing many negatives. In all cases, the higher the ratios, the better. The formula for accuracy was as follows:

\[ \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \]

which gives the general model accuracy, as all samples appear in the denominator. Meanwhile, the formula for precision uses only the true positives and false positives in the denominator and has the following form:

\[ \mathrm{Precision}=\frac{TP}{TP+FP} \]

As discussed by Buckland and Gey [ 12 ] and Chawla [ 15 ], there is usually a trade-off between precision and recall: as one goes up, the other goes down; thus, depending on the goal, one or the other metric can be maximised. Additionally, another measure combines the trade-off between precision and recall and yields a single metric for a classifier in the presence of rare cases. It is called the \(F_{1}\) metric:

\[ F_{1}=\frac{2\cdot \mathrm{Precision}\cdot \mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}} \]

In conclusion, the accuracy, precision and \(F_{1}\) metrics were the most important when deciding the MCMLA. Furthermore, since this paper analysed the sell and rent operations independently for each month, all models' metric scores were combined and averaged. One thing to consider is that machine learning processes are stochastic, meaning that across iterations the models can change accuracy positions [ 8 , 38 ]. This is especially true when SMOTE oversampling or stratified cross-validation, which splits the data into different sets, is used. To make this paper replicable, the random seed was fixed.
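As a quick illustration of the metrics emphasised above, computed from an invented confusion matrix:

```python
# Toy confusion-matrix counts (invented for illustration).
tp, fp, tn, fn = 30, 10, 50, 10

accuracy  = (tp + tn) / (tp + tn + fp + fn)   # all samples in the denominator
precision = tp / (tp + fp)                    # only predicted positives
recall    = tp / (tp + fn)                    # only actual positives
f1        = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.8 0.75 0.75 0.75
```

With many true negatives, accuracy can stay high even when the positive class is handled poorly, which is why precision and F1 carried more weight here.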

In the second ML stage, the tuning and application of the MCMLA began. The XGB algorithm yielded the most consistent scores and was therefore chosen as the MCMLA. In the tuning process, stratified cross-validation with the SMOTE algorithm was used again, and to achieve better precision scores, the hyperparameters of the XGB algorithm were tuned using a grid search. For the sell operations, the tuned XGB algorithm used a maximum depth of 8 and a learning rate of 0.491; for the rent operations, a maximum depth of 8 and a learning rate of 0.41. Furthermore, to highlight the functional form of the variable effects when analysing SHAP values, the SMOTE oversampling method was applied to the whole dataset, and the tuned XGB model was fitted once more, independently for each month, on this oversampled dataset.
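The grid search step can be sketched as follows. The parameter grid and the scoring function below are placeholders, not the study's actual search space or its cross-validated precision objective:

```python
import itertools

# Hypothetical grid over two XGB hyperparameters.
grid = {
    "max_depth":     [4, 6, 8],
    "learning_rate": [0.1, 0.41, 0.491],
}

def score(params):
    # Placeholder objective; in the study this would be the mean
    # stratified-CV precision of an XGB model with these parameters.
    return -abs(params["max_depth"] - 8) - abs(params["learning_rate"] - 0.491)

# Exhaustively evaluate every combination and keep the best-scoring one.
best = max(
    (dict(zip(grid, values)) for values in itertools.product(*grid.values())),
    key=score,
)
print(best)  # {'max_depth': 8, 'learning_rate': 0.491}
```

Exhaustive search is feasible here because the grid is tiny; each candidate would normally be scored with the same SMOTE-augmented cross-validation described above.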

Last, the recent adaptation of SHAP values in supervised learning has opened the door to explainable artificial intelligence. Lundberg and Lee [ 31 ] and Molnar [ 13 ] described the principle of SHAP values as the average marginal impact of a feature value across all possible coalitions. The following formula was originally used in game theory to compute Shapley values:

\[ {\upphi }_{i}\left(v\right)=\sum_{S\subseteq N\setminus \{i\}}\frac{\left|S\right|!\left(\left|N\right|-\left|S\right|-1\right)!}{\left|N\right|!}\left(v\left(S\cup \{i\}\right)-v\left(S\right)\right) \]

where v represents the characteristic function, S represents a coalition, N the set of all features, i the feature being assessed and \({\upphi }_{i}\) the feature contribution. In this study, positive SHAP values pushed the prediction towards a price change occurring, and negative values pushed it away. Furthermore, to understand the general predictive power of a variable, its SHAP values were averaged in absolute terms; this number shows the predictive power the variable achieved on average relative to all other variables. The higher the mean absolute SHAP value, the higher the predictive power. Thus, this paper presents both individual SHAP values and the averaged SHAP values.
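The averaging over coalitions can be illustrated exactly on a toy three-player game; the characteristic function v below is invented to keep the example self-contained:

```python
from itertools import combinations
from math import factorial

# Toy 3-player game: v maps each coalition to an invented payoff.
players = (0, 1, 2)
v = {frozenset(): 0, frozenset({0}): 10, frozenset({1}): 20, frozenset({2}): 30,
     frozenset({0, 1}): 40, frozenset({0, 2}): 50, frozenset({1, 2}): 60,
     frozenset({0, 1, 2}): 90}

def shapley(i):
    """Exact Shapley value of player i: the weighted average of i's
    marginal contribution v(S u {i}) - v(S) over all coalitions S."""
    n = len(players)
    others = [p for p in players if p != i]
    total = 0.0
    for r in range(n):
        for S in combinations(others, r):
            S = frozenset(S)
            weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
            total += weight * (v[S | {i}] - v[S])
    return total

values = [shapley(i) for i in players]
print([round(x, 6) for x in values])  # [20.0, 30.0, 40.0]
```

Note the efficiency property: the three values sum to v(N) = 90, mirroring how SHAP values decompose a model's prediction into per-feature contributions.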

Research results

In accordance with previous studies on pandemics and real estate, this paper found a noticeable yet moderate apartment price response during the COVID-19 pandemic. Within the 4-month period from May to August, only 17.2% and 10.7% of listings, on average, displayed a negative price revision in rent and sell operations, respectively, meaning that the majority of listings were unaffected. The price revisions for rent operations occurred after 23 days on average, while for sell operations they occurred after approximately 63 days. Investors and brokers should pay close attention to these values, since apartment listings exceeding them tend to have a higher chance of price revision. Most price adjustments aggregated in a thin left-tailed distribution with a 4-month average price drop of − 7.20% for rent and − 4.2% for sell operations (the distribution of the price change is depicted in Appendices 1 and 2 ). Compared to Giudice et al.'s [ 21 ] forecasting model, which predicted a 4.16% drop in the short run, and Francke and Korevaar's [ 20 ] estimations, which recorded a 5% drop in sale prices and a 2% drop in rent prices in the case of cholera, the COVID-19 price drop in Vilnius was similar.

When analysing the price dynamics within the four months, a pattern was observed in which the size of the apartment price revision tended to shrink each month, beginning in May with the largest decrease and ending in August with the smallest, for both sell and rent operations. Likewise, the median prices for rent and sell operations mostly dipped in May and June, while median prices started to rise in August. Although the causal COVID-19 impact was not measured, the number of coronavirus cases was larger in May than in August, exactly when the biggest price dip occurred and the quarantine, which ended on July 16th, was still ongoing. After the quarantine was lifted, only a few instances of viral infection were recorded; hence, businesses returned to their normal activities. The descriptive statistics for all variables and all months are presented in Tables 1 , 2 , 3 and 4 for sell operations, in Tables 4 , 5 , 6 , 7 and 8 for rent operations and also in Appendices 1 and 2 .

Noticeable differences can also be observed in the vacancy rates (the TOM variable), which are depicted in Appendices 1 and 2 and Fig.  2 . For rent operations, the average TOM increased from 21 to 24 days, while for sell operations it rose from 31 to 45 days. Under the Lazear [ 29 ] clearance model, these higher TOM values would indicate that the market was in decline, as fewer commitments to buy or rent were observed. With rising economic uncertainty, burdensome real estate transactions were delayed; thus, to keep their assets attractive, owners had to reduce prices or endure higher vacancies. The Yinger [ 41 ] theory, on the other hand, would argue that market participants were enduring higher TOM values to maximise their selling prices. Some believed that, due to the viral spread, more crowded and denser city zones, like the old town or the new town, would endure the highest vacancies because people would start moving out to suburban areas. The collected data did not validate this notion. From May to August, the vacancy growth rates for the old town and the new town were around 33% for sell operations, and 11.7% and 18.8%, respectively, for rent operations, while other regions underwent vacancy growth of up to 70% or 80%. Despite this, the city centre accounted for an average of 34.7% of all price revisions in rent operations and almost 19.1% in sell operations.

figure 2

Vilnius city vacancy maps

Finally, the 15 unique algorithms were deployed, amounting to a total of 120 machine learning models across the four months and both sell and rent operations. Table 9 shows the average metrics for all months and both operation types, arranged according to the F1 column from smallest to largest. As observed, extreme gradient boosting marginally outperformed the other algorithms on the F1 and accuracy metrics. For the accuracy measure, the difference between the first and second algorithms was 0.002, while for the F1 metric it was 0.022. As discussed in the “ Methodology ” section, a trade-off can be seen between precision and recall: models with higher precision had lower recall, and although the CatBoost model had a precision higher by 0.001, the XGB had significantly better overall quality when looking at the F1 metric.

After selecting the algorithm, eight individual models based on XGB were developed to dissect the feature importance using SHAP values (the results are depicted in Table 10 ). Some limitations must be noted regarding the choice of variables. Since the COVID-19 outbreak hit unexpectedly, the number of variables gathered for the first two months (May and June) was smaller than for the last two. Nevertheless, this study incorporated more variables in later months to see whether the model interpretation changed.

When scrutinizing the feature importance scores, a clear dominant factor was observed in both sell and rent operations over the entire four-month (May to August) period. According to the SHAP scores, the TOM variable was the single most important feature in explaining whether any price change would occur. The TOM variable had an average 4-month SHAP value of 2.11 for sell operations and 1.24 for rent operations. While adding more variables to the models changed the TOM SHAP score, it consistently remained the largest influence on price revisions. For the sell operations, the year built and the initial price setup served as the second and third largest contributors, whereas other variables were far less useful for dissecting the change, especially when more variables were added; in the rent case, the predictive models relied heavily on the TOM and initial price variables, with minimal effect from the remaining covariates.

Furthermore, it was relevant to take a closer look at the TOM variable, since it demonstrated powerful capabilities for predicting future changes. The results are depicted in Fig.  3 , which shows individual SHAP values. Similar to He et al.'s [ 25 ] discoveries, Fig.  3 ’s explanation incorporates both Lazear's [ 29 ] and Yinger's [ 41 ] theories. While an apartment had been listed only for a short duration, the TOM variable had a negative effect on the price change variable, meaning that it was not rational to expect a price change at the beginning of the listing; that is why low TOM values in Fig.  3 have negative SHAP values. Interestingly, two smooth transition points occurred later on. The first transition occurred after around 25 days for the rent operations and after around 45 days for the sell operations. From this point on, the TOM variable began to push price revisions to occur (SHAP values became positive), but as the number of days increased, an inverted-U behaviour emerged, eventually leading to a second transition point after which the TOM variable's push towards a price revision diminished. The second transition point lay between 90–120 days for rent operations and 200–250 days for sell operations. It could be that asset owners have a pre-determined limit on how much loss they are able to bear. These findings coincide with the Lazear clearance model, which proposed that with an increase in TOM, properties begin to lose their attractiveness and eventually a price revision occurs; nevertheless, they also incorporate Yinger's theory, stating that with longer waiting times, the chance of a buyer ready to pay the highest price increases. Additionally, the findings confirm He et al.'s [ 25 ] notion that the relationship between price and TOM is not linear but rather an inverted U-shape, although the right-hand side of the TOM curve in Fig.  3 is less defined.
Thus, entrepreneurs should base their investment strategies not on the highest TOM values, but on the range between the two transition points where the inflection occurs.

figure 3

The TOM variable effect

Conclusions

The COVID-19 pandemic has dramatically affected many economic operations, and within these circumstances, real estate experts have claimed that real estate prices might fall. However, this study raised the question of which apartment attributes or variables are most likely to influence price revisions during a pandemic. In the previous literature, particular variable effects were unclear on many occasions, especially for the TOM variable, whose reported effect varied from extremely significant to not significant at all. Furthermore, many scholars focused on hedonic price determination models, while pandemic-related studies mostly employed price change analysis. Thus, a niche for new research was discovered.

With the rise of Big Data, this study was able to create a custom web-scraping algorithm and collect property listings in the city of Vilnius during the first wave of COVID-19. Subsequently, 15 different ML models were applied to forecast apartment price revisions, and each model was evaluated against particular criteria to identify the most accurate algorithm. Furthermore, the recent development of SHAP values allowed this study to dissect each variable's predictive power.

The findings in this study coincide with previous findings that real estate is quite resilient to pandemics, as the price drops were not as dramatic as anticipated. The four-month average price drop reached only − 7.20% for rent and − 4.2% for sell operations. However, an increase in apartment vacancies was recorded in most Vilnius boroughs, suggesting a worsening situation for the real estate market. Of the 15 models tested, the XGB was the most precise, although the difference was negligible: about 0.002 on the accuracy criterion and 0.022 on the F1 metric. The retrieved SHAP values showed that the TOM variable was by far the most dominant and consistent variable for price revision forecasting; second in line was the initial price setup. Additionally, the TOM variable exhibited the inverted U-shaped behaviour previously discovered by other authors, implying that there are two transition points, one at around 25 and 45 days and the other between 90–120 and 200–250 days for rent and sell operations, respectively.

From a social impact perspective, this study gives investors, households and other market participants guidance on how to evaluate real estate market conditions and anticipate price revisions. For one, growing TOM values in the boroughs could indicate either emerging problems in the market that can lead to recessions or an oversupply of properties. Governments should therefore closely monitor TOM values, as they consistently provide useful information in real time, rather than waiting for monthly housing price indexes to appear. Secondly, although many variables have been found to significantly affect price changes in prior studies, their effects in this study were found to be minuscule or inconsistent, except for the TOM variable. Therefore, households and investors should carefully consider TOM values when making future investments, as lower TOM values might indicate higher property resilience to market disruptions.

Availability of data and materials

The datasets generated and analysed during the current study are available from the corresponding author on reasonable request.

Abbreviations

TOM: Days an apartment is listed on the market (time on the market)

XGB: Extreme gradient boosting

ML: Machine learning

MCMLA: Most consistent machine learning algorithm

Anglin PM, Rutherford R, Springer TM. The trade-off between the selling price of residential properties and time-on-the-market: the impact of price setting. J Real Estate Finan Econ. 2003;26(1):95–111.


Armstrong JS, Green KC, Graefe A. Golden rule of forecasting: be conservative. J Bus Res. 2015;68(8):1717–31. https://doi.org/10.1016/j.jbusres.2015.03.031 .

An Z, Cheng P, Lin Z, Liu Y. How do market conditions impact the price-TOM relationship? Evidence from real estate owned (REO) sales. J Hous Econ. 2013;22(3):250–63. https://doi.org/10.1016/j.jhe.2013.07.003 .

Ambrus A, Field E, Gonzalez R. Loss in the time of cholera: long-run impact of a disease epidemic on the urban landscape. Am Econ Rev. 2020;110(2):475–525.

Baldominos A, Blanco I, Moreno JA, Iturrarte R, Bernárdez O, Afonso C. Identifying real estate opportunities using machine learning. Appl Sci. 2018;8(11):1–23.

Benefield JD, Cain CL, Johnson KH. On the relationship between property price, time-on-market, and photo depictions in a multiple listing service. J Real Estate Finan Econ. 2009;43(3):401–22.

Berawi MA, Miraj P, Saroji G, et al. Impact of rail transit station proximity to commercial property prices: utilizing big data in urban real estate. J Big Data. 2020;7:71. https://doi.org/10.1186/s40537-020-00348-z .

Brownlee J. Machine learning mastery with Python. 2020; Ebook.

Benefield J, Cain C, Johnson K. A review of literature utilizing simultaneous modelling techniques for property price and time-on-market. J Real Estate Lit. 2014;22(2):149–75.

Borde S, Rane A, Shende G, Shetty S. Real estate investment advising using machine learning. Int Res J Eng Tech (IRJET). 2017;4(3):1821–5.


Bogin A, Doerner W, Larson W. Local house price dynamics: new indices and stylized facts. Real Estate Econ. 2018. https://doi.org/10.1111/1540-6229.12233 .

Buckland M, Gey F. The relationship between recall and precision. JASIST. 1994;45(1):12–9. https://doi.org/10.1002/(sici)1097-4571(199401)45:1%3c12::aid-asi2%3e3.0.co;2-l .

Molnar C. Interpretable machine learning: a guide for making black box models explainable. 2019. https://christophm.github.io/interpretable-ml-book/ . Accessed 20 Dec 2020.

Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16:321–57.


Chawla NV. Data mining for imbalanced datasets: an overview. In: Maimon O, Rokach L, editors. Data mining and knowledge discovery handbook. Springer; 2009. https://doi.org/10.1007/978-0-387-09823-4_45 .

Côrte-Real N, Ruivo P, Oliveira T, Popovič A. Unlocking the drivers of big data analytics value in firms. J Bus Res. 2019;97:160–73. https://doi.org/10.1016/j.jbusres.2018.12.072 .

Čeh M, Kilibarda M, Lisec A, Bajat B. Estimating the performance of random forest versus multiple regression for predicting prices of the apartments. J Geoinf. 2018;7(5):168.

Du Q, Wu C, Ye X, Ren F, Lin Y. Evaluating the effects of landscape on housing prices in urban China. Tijdsch Econ Soc Geogr. 2018;109(4):525–41.

De Nadai M, Lepri B. The economic value of neighbourhoods: predicting real estate prices from the urban environment. In: 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). 2018. https://doi.org/10.1109/dsaa.2018.00043 .

Francke M, Korevaar M. Housing markets in a pandemic: evidence from historical outbreaks. 2020. Available at SSRN: https://ssrn.com/abstract=3566909 or https://doi.org/10.2139/ssrn.3566909 . Accessed 21 Dec 2020.

Giudice VD, De Paola P, Giudice FPD. COVID-19 infects real estate markets: short and mid-run effects on housing prices in Campania Region (Italy). Soc Sci. 2020;9(7):114. https://doi.org/10.3390/socsci9070114 .

Gupta A, Mittal V, Peeters J, Van Nieuwerburgh S. Flattening the curve: pandemic-induced revaluation of urban real estate. 2021. Available at SSRN: https://ssrn.com/abstract=3780012 or https://doi.org/10.2139/ssrn.3780012 .

Herrin WE, Knight JR, Sirmans CF. Price cutting behavior in residential markets. J Hous Econ. 2004;13(3):195–207. https://doi.org/10.1016/j.jhe.2004.07.002 .

Huang J, Palmquist RB. Environmental conditions, reservation prices, and time on the market for housing. J Real Estate Finan Econ. 2001;22(2/3):203–19.

He X, Lin Z, Liu Y, Seiler MJ. Search benefit in housing markets: an inverted u-shaped price and TOM relation. Real Estate Econ. 2017. https://doi.org/10.1111/1540-6229.12221 .

Johnson K, Benefield J, Wiley J. The probability of sale for residential real estate. J Hous Res. 2007;16:131–42.

Khezr P. Time on the market and price change: the case of Sydney housing market. Appl Econ. 2014;47(5):485–98. https://doi.org/10.1080/00036846.2014.972549 .

Knight JR. Listing price, time on market, and ultimate selling price: causes and effects of listing price changes. Real Estate Econ. 2002;30(2):213–37. https://doi.org/10.1111/1540-6229.00038 .

Lazear E. Retail pricing and clearance sales. Am Econ Rev. 1986;76:14–32.

Liu S, Su Y. The impact of the COVID-19 pandemic on the demand for density: evidence from the US housing market. SSRN Electron J. 2021. https://doi.org/10.2139/ssrn.3661052 .

Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: 31st Conference on Neural Information Processing Systems. 2017. Available at: https://dl.acm.org/doi/10.5555/3295222.3295230 . Accessed 21 Dec 2020.

Metzner S, Kindt A. Determination of the parameters of automated valuation models for the hedonic property valuation of residential properties. Int J Hous Mark Anal. 2018;11(1):73–100. https://doi.org/10.1108/ijhma-02-2017-0018 .

Oust A, Hansen SN, Pettrem TR. Combining property price predictions from repeat sales and spatially enhanced hedonic regressions. J Real Estate Finan Econ. 2020;2019(61):183–207.

Owusu-Edusei K, Espey M, Lin H. Does close count? School proximity, school quality, and residential property values. J Agric Appl Econ. 2007;39(01):211–21. https://doi.org/10.1017/s1074070800022859 .

Park B, Bae JK. Using machine learning algorithms for housing price prediction: the case of Fairfax County, Virginia housing data. Expert Syst Appl. 2015;42(6):2928–34. https://doi.org/10.1016/j.eswa.2014.11.040 .

Pérez-Rave JI, Correa-Morales JC, González-Echavarría F. A machine learning approach to big data regression analysis of real estate prices for inferential and predictive purposes. J Prop Res. 2019;36:59–96. https://doi.org/10.1080/09599916.2019.1587489 .

Rosiers FD, Lagana A, Theriault M. Size and proximity effects of primary schools on surrounding house values. J Prop Res. 2001;18(2):149–68. https://doi.org/10.1080/09599910110039905 .

Sabuncu MR. Intelligence plays dice: stochasticity is essential for machine learning. arXiv preprint. 2020. Available at: https://arxiv.org/abs/2008.07496 .

Trawiński B, Telec Z, Krasnoborski J, Piwowarczyk M, Talaga M, Lasota T, Sawiłow E. Comparison of expert algorithms with machine learning models for real estate appraisal. In: 2017 IEEE International Conference on INnovations in Intelligent SysTems and Applications. 2017. p. 51–4.

Verbrugge R, Dorfman A, Johnson W, Marsh F, Poole R, Shoemaker O. Determinants of differential rent changes: mean reversion versus the usual suspects. Real Estate Econ. 2016;45(3):591–627. https://doi.org/10.1111/1540-6229.12145 .

Yinger J. A search model of real estate broker behavior. Am Econ Rev. 1981;71:591–605.

Wong G. Has SARS infected the property market? Evidence from Hong Kong. J Urban Econ. 2008;63(1):74–95. https://doi.org/10.1016/j.jue.2006.12.007 .


Acknowledgements

Not applicable.

Funding is provided by the Kaunas University of Technology.

Author information

Authors and affiliations.

School of Economics and Business, Kaunas University of Technology, K. Donelaičio g. 73, 44249, Kaunas, Lithuania

Andrius Grybauskas, Vaida Pilinkienė & Alina Stundžienė


Contributions

All the authors discussed and designed the experiments and contributed to the writing of the paper. All the authors read and approved the final manuscript.

Corresponding author

Correspondence to Andrius Grybauskas .

Ethics declarations

Ethics approval and consent to participate, consent for publication, competing interests.

The authors declare that they have no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

See Fig. 4 .

figure 4

Summary statistics of sell variables

See Fig. 5 .

figure 5

Summary statistics of rent variables

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Grybauskas, A., Pilinkienė, V. & Stundžienė, A. Predictive analytics using Big Data for the real estate market during the COVID-19 pandemic. J Big Data 8, 105 (2021). https://doi.org/10.1186/s40537-021-00476-0

Download citation

Received : 26 February 2021

Accepted : 24 May 2021

Published : 03 August 2021

DOI: https://doi.org/10.1186/s40537-021-00476-0




Curr Genomics, v.8(4); 2007 Jun

Real-Time PCR: Revolutionizing Detection and Expression Analysis of Genes

Department of Studies in Applied Botany and Biotechnology, University of Mysore, Manasagangotri, Mysore 570006, India

KR Kottapalli

Plant Genome Research Unit, National Institute of Agrobiological Sciences, 2-1-2 Kannondai, Tsukuba 305-8602, Ibaraki, Japan

Human Stress Signal Research Center (HSS), National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba West, 16-1 Onogawa, Tsukuba 305-8569, Ibaraki, Japan

Research Laboratory for Agricultural Biotechnology and Biochemistry (RLABB), GPO Box 8207, Kathmandu, Nepal

Plant Protection Institute, Hungarian Academy of Sciences, Budapest, Hungary

KS Rangappa

Department of Studies in Chemistry, University of Mysore, Manasagangotri, Mysore 570006, India

The invention of polymerase chain reaction (PCR) technology by Kary Mullis in 1984 gave birth to real-time PCR. Real-time PCR, the detection and expression analysis of gene(s) in real time, has revolutionized 21st-century biological science through its tremendous applications in quantitative genotyping, detection of inter- and intra-organism genetic variation, early diagnosis of disease, and forensics, to name a few. We comprehensively review various aspects of real-time PCR, including technological refinements and applications in all scientific fields, ranging from medical and environmental issues to plant science.

The invention of the polymerase chain reaction (PCR) by Kary Mullis in 1984 was considered a revolution in science. Real-time PCR, hereafter abbreviated RT PCR, is becoming a common tool for detecting and quantifying the expression profiles of selected genes. The technology to detect PCR products in real time, i.e., during the reaction, has been available for the past 10 years, but has seen a dramatic increase in use over the past 2 years. A search using the keywords real-time and PCR yielded 7 publications in 1995, 357 in 2000, and 2291 and 4398 publications in 2003 and 2005, respectively. At the time of this writing, there were 3316 publications in 2006. The overwhelming majority of current publications in the field of genomics deal with various aspects of the application of these methods in medicine, with the search for new techniques providing higher precision, and with the elucidation of the principal biochemical and biophysical processes underlying the phenotypic expression of cell regulation. A series of RT PCR machines has also been developed for routine analysis (Table 1) [1].

Real-Time Cyclers Available in the Market and their Characteristics
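The quantitative claims above rest on PCR's exponential kinetics: after n cycles, an input of N0 template copies grows to roughly N0·(1+E)^n, where E is the amplification efficiency. A minimal numeric sketch (not from the review; the copy numbers, threshold, and `cycles_to_threshold` helper are illustrative):

```python
import math

def cycles_to_threshold(n0, threshold, efficiency=1.0):
    """Idealized PCR: copies after n cycles = n0 * (1 + E)^n.
    Returns the (fractional) cycle at which `threshold` copies are reached."""
    if n0 <= 0:
        raise ValueError("need at least one template copy")
    return math.log(threshold / n0, 1 + efficiency)

# With perfect efficiency (E = 1.0) the product doubles each cycle, so a
# 10-fold drop in starting template costs log2(10) ~ 3.32 extra cycles --
# the familiar spacing of a 10-fold dilution series.
ct_a = cycles_to_threshold(1e4, 1e10)
ct_b = cycles_to_threshold(1e3, 1e10)
print(round(ct_b - ct_a, 2))  # 3.32
```

Real-time instruments report exactly this threshold-crossing cycle (Ct), which is what makes quantification possible.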

The advancements in bioscience during the last century have helped build a comprehensive understanding of the interacting networks of gene modules that coordinately carry out integrated cellular functions in a somewhat isolated fashion, i.e., the molecular mechanism of the phenotypic expression of a genotype. The function of a major part of the genome is still unknown, and our knowledge of the relationships between enzymes, signaling substances, and various small molecules is still rather limited. To fully understand the regulation of metabolism and to alter it successfully, more information is required on gene expression and on the recognition of DNA by proteins, transcription factors, drugs, and other small molecules.

Gene expression profiling has been widely used to address the relationship between ecologically influenced or disease phenotypes and cellular expression patterns. PCR-based detection technologies utilizing species-specific primers are proving indispensable as research tools, providing enhanced information on the biology of plant/microbe interactions, with special regard to the ecology, aetiology and epidemiology of plant pathogenic microorganisms.

In general, laboratory experience with nested PCR for diagnosing the presence of microbial DNA in extracts from a diverse range of plant matrices (including soils) shows improved sensitivity and robustness, particularly in the presence of enzyme inhibitors. To meet consumer and regulatory demands, several PCR-based methods have been developed and commercialized to detect and quantify mRNA in various organisms. Most of them are based on the internal transcribed spacer regions within the nuclear ribosomal gene clusters, as these are particularly attractive loci for the design of PCR-based detection assays. These clusters are readily accessible using universal primers and are typically present in high copy number in the cell, whilst often exhibiting sufficient inter-specific sequence divergence for the design of species-specific primers. The limit of detection is usually a few alien molecules, even in the presence of very high levels of background DNA. The high sensitivity and specificity of RT PCR make it the first choice of scientists interested in detecting the dynamics of gene expression in plant/microbe associations (Table 2).

Obligate Pathogen Detection Using Real-Time PCR
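As a toy illustration of why such loci support species-specific detection, a primer designed against one species' sequence should match it exactly while mismatching close relatives. Both the sequences and the primer below are invented for this sketch:

```python
# Invented ITS-like fragments: near-identical sequences from three
# hypothetical related species, differing by a few bases.
its = {
    "species_A": "ACGTTGCAGGCTTACGATCGTTAGC",
    "species_B": "ACGTTGCAGGCTAACGATCGTTAGC",  # one-base difference
    "species_C": "ACGTTGCAGGTTTACGATCGATAGC",
}
primer = "GGCTTACGATCG"  # designed against the species_A variant

# An exact-substring scan stands in for a perfect-match primer anneal.
hits = [name for name, seq in its.items() if primer in seq]
print(hits)  # ['species_A']
```

A real design would also check melting temperature, secondary structure, and tolerated mismatches, but the principle of exploiting inter-specific divergence is the same.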

RT PCR allows quantitative genotyping and the detection of single nucleotide polymorphisms, allelic discrimination, and genetic variations even when only a small proportion of the sample carries the mutation. The use of multiplex PCR systems, with combined probes and primers targeted to sequences specific to each partner of a plant/microbe association, is becoming more important than standard PCR, which is proving insufficient for such living systems.

Multiplex RT PCR is suitable for multiple gene identification based on the use of fluorochromes and the analysis of the melting curves of the amplified products. This multiplex approach showed high sensitivity in duplex reactions and is a useful alternative to RT PCR based on sequence-specific probes, e.g., TaqMan chemistry (Table 3).

Multiplexing Using Real-Time PCR
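Melting-curve discrimination works because amplicons of different base composition melt at different temperatures. A rough sketch using the Wallace rule, a textbook approximation for short oligonucleotides (the sequences here are invented):

```python
def wallace_tm(seq):
    """Wallace rule: Tm (deg C) ~ 2*(A+T) + 4*(G+C), a rough estimate
    valid only for short oligonucleotides."""
    s = seq.upper()
    at = s.count("A") + s.count("T")
    gc = s.count("G") + s.count("C")
    return 2 * at + 4 * gc

# Two products of equal length but different GC content melt well apart,
# which is what lets melting-curve analysis tell co-amplified targets apart.
print(wallace_tm("ATATGCATTTAT"))  # 28 (AT-rich)
print(wallace_tm("GCGCGCATGCGC"))  # 44 (GC-rich)
```

Real instruments fit the derivative of fluorescence against temperature; this only illustrates why distinct compositions yield resolvable peaks.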

Although RT PCR is a powerful technique for the absolute comparison of all transcripts within the investigated tissue, it has a few problems, as it depends critically on the correct use of calibration and reference materials. Successful and routine application of PCR diagnostics to tissues of a plant/microbe consortium is often limited by the lack of quality template, due to inefficient RNA extraction methodologies and to the presence of high levels of unidentified, co-precipitated PCR-inhibitory compounds, presumably plant polyphenolics and polysaccharides (Table 4).

PCR Inhibitory Compounds

Sampling procedures are of great importance for the validation of analytical methods. The largest single source of error in the analysis of plant/microbe associations is the sampling procedure (Fig. 1). Sampling risks can be managed by choosing an appropriate sample size for analysis. The extraction and purification of nucleic acids is a crucial step in the preparation of samples for PCR. Current methods for gene expression studies typically begin with a template preparation step in which nucleic acids are freed of bound proteins and are then purified. Many protocols for nucleic acid purification, reverse transcription of RNA and/or amplification of DNA require repeated transfers from tube to tube and other manipulations during which material may be lost.

Fig. 1. Sampling procedures are of great importance for the validation of analytical methods.
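The sampling argument can be made quantitative with a standard back-of-the-envelope model (not from the review): if a fraction p of the sampled units carries the pathogen, the chance that a random sample of n units contains at least one positive is 1 − (1 − p)^n.

```python
def detection_probability(prevalence, n_samples):
    """P(at least one positive unit in a random sample of size n)
    when a fraction `prevalence` of the population is infected."""
    return 1.0 - (1.0 - prevalence) ** n_samples

def samples_needed(prevalence, confidence=0.95):
    """Smallest sample size whose detection probability >= `confidence`."""
    n = 1
    while detection_probability(prevalence, n) < confidence:
        n += 1
    return n

# At 1 % prevalence, roughly 300 random samples are needed for 95 %
# confidence of catching one positive -- which is why sample size, not
# assay sensitivity, often dominates the error budget.
print(samples_needed(0.01))  # 299
```

This idealized model assumes independent random sampling; clustered infections in the field would require larger samples.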

Of the range of protocols reported for the extraction of DNA/RNA from plant material, most are complicated and time-consuming in application. The protocols should be examined case by case and adopted judiciously for a particular plant species. In this respect, major variations exist at this step compared to samples of mammalian origin. Isolation of RNA is particularly challenging because this molecule is sensitive to elevated temperatures and is degraded by RNases, which therefore have to be inactivated immediately upon cell lysis. The design of species- or race-specific primers from inter-specific universal internal transcribed spacer primers is also needed.

There are numerous commercially available kits for PCR. The data output from certain RT PCR machines gives an immediate appreciation of the kinetics of the PCR occurring within the tube and, in addition, gives an instantaneous visual representation of the amount of PCR product present after each cycle. Following a single RT PCR, the data extracted give the type of information that was previously only inferable from multiple conventional PCRs. Detailed information about the protocols and the output generated is available from the respective companies' websites.

In this review, we highlight some of the general criteria and essential methodological components of PCR technologies, for rapid functional genomics. Examples are provided to illustrate the utility of results of plant pathology studies and validation of targets for mammalian studies.

APPLICATIONS

Medical Science

Nucleic acid amplification techniques have revolutionized diagnostics. Current technologies that allow the detection of amplification in real time are fast becoming clinical standards, particularly in a personalized diagnostic context [2]. On the way to personalized medicine, we may stepwise improve the chances of choosing the right drug for a patient by categorizing patients into genetically definable classes that have similar drug responses (as, for example, any population group carrying a particular set of genes) [3]. Adverse drug reactions (ADRs) are a significant cause of morbidity and mortality. The majority of these cases can be related to alterations in the expression of a clinical phenotype that is strongly influenced by environmental variables [4]. The application of RT PCR combined with other molecular techniques has made possible the monitoring of both therapeutic intervention and individual responses to drugs. However, it is wise to expect that, even after we have reached the goal of establishing personalized medicine, we will not have eliminated all uncertainties [5]. The needs of clinical applications of molecular methods initiated important developments in diagnostics, stimulating progress in other branches of science. The introduction of these new methods into fields of human practice induced a rapid expansion of molecular approaches.

Cancer arises from the accumulation of inherited polymorphisms (SNPs) and mutations and/or sporadic somatic (i.e., non-germline) polymorphisms in cell cycle, DNA repair, and growth signaling genes [6]. Despite advances in diagnostic imaging technology, surgical management, and therapeutic modalities, cancer remains a major cause of mortality worldwide. Early detection of cancer and its progression is difficult due to its complex, multifactorial nature and heterogeneity [7]. A reliable method to monitor the progress of cancer therapeutic agents can be of immense use. RT PCR, currently the most sensitive method to quantify specific DNA, makes it possible to detect even a single molecule, and diagnostics become feasible with lower amounts of complex biological material compared to traditional methods [8, 9]. Its use has been well documented in cancer research [10, 11, 12]. Most of the commonly occurring cancers have been detected by measuring marker gene expression or by using probes. The sensitivity of single-marker assays is not high enough for clinical applications [13]. Adopting a multigene panel for the most common malignant diseases (carcinoma of the bladder, breast cancer, colorectal cancer, endometrial carcinoma) significantly increased the accuracy of diagnosis, which is extremely important, as each of them has an excellent prognosis if diagnosed at an early stage [14]. The use of new technologies and methodological developments started intensively with diseases of complicated diagnosis (Table 5). During the first five years after the introduction of RT PCR, six of ten applications were made for detecting leukemias. Numerous kits are now marketed for clinical tests, and these developments have promoted the use of RT PCR in other fields of human practice.

Time Course of Developments in Application of Real-Time PCR Used for Cancer Diagnosis

The majority of research using RT PCR has been directed at detecting or quantifying viruses in virus-infected human specimens. Various studies have provided protocols for detecting and quantifying viruses, especially those related to human diseases [15]. Detection of HSV1 and HSV2 was achieved using TaqMan probes and was in many ways an alternative to conventional nested PCR assays [16]. More recently, detection, quantification and differentiation between HSV1 and HSV2 genotypes were achieved using primers and probes (LightCycler) targeting the HSV DNA polymerase gene [17]. Furthermore, genital herpes, which is the most common sexually transmitted disease (STD) around the world, accounts for 20% of the STDs in the United States alone [18]. RT PCR detection of HSV in genital and dermal specimens has also been well documented [19–25]. RT PCR showed superior sensitivity in detecting varicella-zoster virus in dermal specimens compared to cell culture assays [21, 26, 27]. Further, RT PCR has been standardized for studying the interactions between virus and host, which in turn can provide a reliable means to study the efficacy of antiviral compounds or to determine chronic conditions [28, 29]. Immunodeficient patients tend to harbor several co-infections; in such cases, detection of multiple pathogens is essential for therapy (Table 6). RT PCR multiplex assays have been developed for viral genotype differentiation [17, 30].

Application of Real-Time PCR for Virus Diagnosis

Bacteriology

Traditionally, initial antibiotic therapy was based on identifying the Gram stain classification. The high variability in identifying bacterial pathogens by mere observation was corrected by the use of conventional PCR-based methods; later, this was further accelerated by the use of RT PCR. Fluorescence hybridization probes allowed the fast detection of low amounts of bacterial DNA and a correct Gram stain classification [31]. RT PCR has been shown to be advantageous over other techniques (immunoassays or culture methods) for detecting bacteria irrespective of the type of clinical specimen, especially those which are difficult to culture or slow growing. A quicker confirmation of the pathogen facilitates early prescription of appropriate antibiotics. Published accounts indicate that RT PCR was faster, with sensitivity greater than, or in some cases equal to, that of conventional methods.

Identification of mycobacterial infections previously lacked specificity and sensitivity on certain occasions when conventional methods were employed [32]. Mycobacterium species of common interest so far detected and quantified by RT PCR include Mycobacterium tuberculosis, M. avium, M. bovis, M. bovis BCG, M. abscessus, M. chelonae and M. ulcerans [33–40]. Further, the detection of antitubercular-resistant isolates, usually done by the broth dilution method, has been replaced by RT PCR targeting the mutant genes for isoniazid (katG), rifampin (rpoB) and ethambutol (embB) resistance, from culture or clinical specimens [41–45].

Bacteria represent potential agents for biological warfare. Some RT PCR assays (LightCycler) have allowed the use of autoclaved samples for the immediate detection of Bacillus species causing anthrax [46, 47]. However, clinical studies are required to determine the usefulness of these tests for the rapid identification of this pathogen directly from human specimens.

The major fungi causing infections in humans are Aspergillus species (A. fumigatus, A. flavus, A. niger, A. nidulans, A. terreus, A. versicolor), Candida species (C. albicans, C. dubliniensis), and Pneumocystis jiroveci. The conventional methods developed for the detection of these infectious fungi are culturing, histopathology/phenotypic assays/biochemical tests/microscopy, conventional PCR, nucleic acid probes, CFU quantification, broth dilution, and staining followed by microscopic observation. These methods are often slow. RT PCR for detecting and quantifying the same fungi proved faster on many occasions, irrespective of the clinical specimen [48–53]. Quantitative or qualitative RT PCR assays have also been developed for other fungi such as Coccidioides sp., Conidiobolus sp., Cryptococcus sp., Histoplasma sp., Pneumocystis sp., Paracoccidioides sp., and Stachybotrys sp. [54–61].

Molecular biology (and particularly PCR) has been increasingly used for the diagnosis of parasitic protozoa of medical interest [62]. RT PCR and other technical improvements in the past decade permit precise quantification and routine use in diagnosis, facilitating the study of parasitic populations, although the use of this method for malaria remains limited due to high cost [62]. RT PCR assays for clinical application have been described for detecting amoebic dysentery [63], Chagas disease [64], cutaneous and visceral leishmaniasis [65], giardiasis [66], Cyclospora cayetanensis [66] causing prolonged gastroenteritis [67], and toxoplasmosis in the amniotic fluid of pregnant women [68] and in immuno-compromised patients [69]. Protozoans cause several diseases that are endemic in large parts of the world. Further genome sequencing efforts are needed, as many parasitologists work on organisms whose genomes have been only partially sequenced and where little, if any, annotation is available [70].

Animal models have served investigators for decades in understanding several biological functions of humans, including disease diagnosis, and in taking appropriate measures for therapy. The development of quantitative reverse transcription-PCR techniques, such as real-time RT-PCR (RT RT-PCR), approaches the theoretical limits of per-reaction sensitivity, allowing further increases in the sensitivity of pathogen measurements [71, 72]. Infection of domestic cats with the feline immunodeficiency virus (FIV) results in a fatal immunodeficiency disease similar to that caused by human immunodeficiency virus 1 (HIV-1) in humans. This has helped the progress of in-depth research on this morphologically and genetically similar virus, especially in the development of candidate vaccines. Highly sensitive detection and quantification assays have been developed by RT PCR methods for this virus [71, 73]. Simian immunodeficiency virus (SIV) detection was earlier done by a branched-chain DNA assay that was quite expensive but of low sensitivity (1500 viral RNA copies/ml). Leutenegger and coworkers developed a TaqMan RT RT-PCR assay that could detect with higher sensitivity (50 viral RNA copies/ml) [74]. Feline coronavirus (FCoV) is known to be prevalent in cat populations and causes a fatal infectious disease. Control measures include detection and separation of infected populations, or vaccination. Reliable TaqMan probes for absolute quantification were designed to detect important laboratory and field strains of FCoV by Gut and co-workers [75]. Further, tick-borne zoonotic pathogens are well known in many areas all over the world [76]. Clinical diagnosis of tick-borne diseases is difficult due to unusual clinical signs. Early diagnosis and treatment are necessary to prevent fatal infections and chronic damage to various tissues. A series of new projects in this area has yielded detection and quantification methods for important tick-borne pathogens [77–79].

Other studies on various aspects of veterinary science have been performed using RT PCR; the effects of viral infections on neural stem cell viability [80], detection of several viruses [81–83], innate immune responses to virus infection [84], factors influencing viral replication [85], gene expression profiling during infection [86], and characterization of viruses [87] are a few to mention.

Insects tend to harbor Corynebacterium pseudotuberculosis and are responsible for the spread of the disease in dairy farms [88]. An investigation into the identification of insect vectors spreading Corynebacterium pseudotuberculosis by a TaqMan PCR assay (PLD gene) supported the hypothesis that this pathogen may be vectored to horses by Haematobia irritans, Stomoxys calcitrans, and Musca domestica. The organism can be identified in up to 20% of houseflies in the vicinity of diseased horses [89].

The prevalence, clinical manifestations, and risk factors for infection with all three feline hemoplasma species were studied by Willi and co-workers [90]. Diagnosis, quantification, and follow-up of hemoplasma infection in cats were performed using three newly designed, sensitive RT PCR assays. The efficacy of the drug marbofloxacin was studied in cats against Candidatus Mycoplasma haemominutum, revealing a decreased copy number of the pathogen; no correlation was evident with Candidatus Mycoplasma haemominutum in chronic FIV infection [91, 92].

Food Microbiology and Safety

Mycotoxins are major food contaminants and have become a great concern worldwide due to their many ill effects [93]. Rapid, cost-effective, and automated diagnosis of food-borne pathogens throughout the food chain therefore continues to be a major concern for industry and public health. An international expert group of the European Committee for Standardization has been established to describe protocols for the diagnostic detection of food-borne pathogens by PCR [94]. A standardized PCR-based method for the detection of food-borne pathogens should optimally fulfill various criteria, such as analytical and diagnostic accuracy, high detection probability, high robustness (including an internal amplification control [IAC]), low carryover contamination, and acceptance through easily accessible and user-friendly protocols for its application and interpretation [95]. RT PCR has the potential to meet all these criteria by combining amplification and detection in a one-step, closed-tube reaction. High-throughput identification of Fusarium at the genus level, or distinguishing its species, has been published [96, 97]. Salmonella is one of the most common causes of food-borne disease outbreaks due to its widespread occurrence, and several sources are known to harbor this pathogen [98]. A duplex real-time SYBR Green LightCycler PCR (LC-PCR) assay was developed by Fukushima and co-workers for 17 food/water-borne bacterial pathogens from stools [99, 100]. The pathogens examined were enteroinvasive Escherichia coli, enteropathogenic E. coli, enterohemorrhagic E. coli, enterotoxigenic E. coli, enteroaggregative E. coli, Salmonella spp., Shigella spp., Yersinia enterocolitica, Yersinia pseudotuberculosis, Campylobacter jejuni, Vibrio cholerae, Vibrio parahaemolyticus, Vibrio vulnificus, Aeromonas spp., Staphylococcus aureus, Clostridium perfringens, Bacillus cereus, Plesiomonas shigelloides and Providencia alcalifaciens. Further, detection assays for Clostridium botulinum applicable to both purified DNA and crude DNA extracted from cultures and enrichment broths, as well as to DNA extracted directly from clinical and food specimens, were developed [101]. Similarly, RT PCR has been used to quantify the food-borne pathogen Listeria monocytogenes by first incorporating an IAC [102].

Food-borne viral infections are among the leading diseases in humans worldwide. Currently, over two billion people have evidence of previous Hepatitis B virus infection, and 350 million have become chronic carriers of the virus [103]. Successful detection of this virus from serum and plasma by RT PCR has been developed. This method is useful for monitoring the efficacy of Hepatitis B virus therapy and for screening human populations in endemic areas. Other important food-borne viruses quantified by this technique are Rotavirus [104] and gastroenteritis virus [105]. However, detection or quantification of these viruses directly from various types of food samples remains a difficult task.

Forensic Science

Advanced technologies for DNA analysis using short tandem repeat (STR) sequences have brought about a revolution in forensic investigations. One of the most common methods used is PCR, which allows accurate genotype information to be obtained from samples. The forensic community previously relied on the slot blot technique, which is time-consuming and labor-intensive. RT PCR has become a well-recognized tool in forensic investigations. Improved amplification and quantification of human mtDNA was accomplished by monitoring the hypervariable region (HV1) using fluorogenic probes, and the same study was also extended to sex discrimination. A duplex RT qPCR assay was developed for quantifying human nuclear and mitochondrial DNA in forensic samples; this method was also efficient for highly degraded samples [106]. Repetitive Alu sequence-based RT PCR detection has been developed and has proved advantageous compared with other methods, with detection limits as low as 1 pg [107]. MGB Eclipse primers and probes, as well as a QSY 7-labeled primer PCR method, have been designed for Alu sequences [108, 109]. Similarly, RT PCR assays to quantify total genomic DNA and identify males from forensic samples with high efficiency have been standardized [110]. Recently, human DNA quantifier and qualifier kits have been developed and validated; their efficiency was either comparable or superior to available methods [111]. Forensic samples are often contaminated with PCR inhibitors, and DNA extraction methods fail to exclude the contaminants. A computational method that allows analysts to identify problematic samples with statistical reliability was standardized by using tannic acid and comparing the amplification efficiencies of unknown template DNA samples with clean standards [112]. Further, methods have also been standardized for assessing DNA degradation in forensic samples [113].
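Quantification in assays like these conventionally rests on a standard curve of threshold cycle (Ct) against log10 input, whose slope yields the amplification efficiency via E = 10^(−1/slope) − 1. A minimal sketch with made-up calibration points (the review gives no numbers; the dilution series and Ct values are illustrative):

```python
# Made-up calibration data: a 10-fold dilution series and its Ct values.
# Spaced exactly 3.3 cycles apart, i.e. a slope of -3.3 (near-ideal assay).
log10_qty = [5.0, 4.0, 3.0, 2.0, 1.0]
ct = [18.1, 21.4, 24.7, 28.0, 31.3]

def linfit(xs, ys):
    """Ordinary least squares for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

slope, intercept = linfit(log10_qty, ct)
efficiency = 10 ** (-1 / slope) - 1           # ~1.0 means perfect doubling
unknown_qty = 10 ** ((26.0 - intercept) / slope)  # invert curve for Ct = 26

print(round(slope, 2), round(efficiency, 2))  # -3.3 1.01
```

An observed Ct for an unknown sample is then converted to a quantity by inverting the fitted line, as in the `unknown_qty` step above.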

Environmental Issues

RT PCR is a convenient method for detecting the mobility of genetic elements. The worldwide increase in environmental pollution is pressing us to find new methods for the elimination of undesirable chemicals. The application of microorganisms for the biodegradation of synthetic compounds (xenobiotics) is an attractive and simple method. Unfortunately, the majority of these pollutants are chemically stable and resistant to microbial attack. The isolation of new strains, or the adaptation of existing ones, for the decomposition of xenobiotics will probably increase the efficacy of microbiological degradation of pollutants in the near future. The widespread application of combined techniques, using microbiological decomposition together with chemical or physical treatments to enhance its efficacy, can also be expected. The cloning and expression in Escherichia coli of an 'azoreductase' from various species have been reported (Table 7). The exoenzymes of white-rot fungi have also been objects of genetic engineering. The laccase of various filamentous fungi was successfully transferred into yeast. These manipulations enhanced the capacity of microorganisms to decompose polyaromatic compounds (PACs).

Improvement of Deteriorative Activity of Organisms by Interspecific Transfer of Genetic Elements

The expression of oxidases from higher plants augmented the catabolic potential of microbes [114], and in turn microbial genes strengthened the tolerance of higher plants to Poly R-487 [115, 116]. Plants tolerant to PACs may be useful in phytoremediation because they can provide a rhizosphere suitable for colonization by microbes that are efficient degraders of aromatic structures. Moreover, plant-derived compounds can induce the production of fungal redox enzymes. The C-hydroxylation of aromatic rings by mammalian monooxygenases facilitates subsequent microbial degradation. Human cytochrome P450 enzymes are now routinely expressed as recombinant proteins in many different systems [117, 118], and the capacity of such recombinants to catabolize PACs has been tested. It is clear that the complexity of the associations involved in complete degradation must increase with the increasing complexity of the chemical structure of the xenobiotic. Genetically engineered microorganisms can accomplish the degradation of xenobiotics that persist under normal natural conditions. In natural habitats, complex microbial/macrobial communities carry out biodegradation. Within them, individual organisms may interact through the inter-specific transfer of metabolites. This co-metabolic potential may be complementary, so that extensive biodegradation, or even mineralization, of xenobiotics can occur [119]. In this respect, the treatment of industrial and municipal effluents in constructed wetlands with multi-site catabolic potential is a promising possibility. Mobilizing specific genes encoding nonspecific, multifunctional degradative sequences may decisively increase the degradative potential of a natural syntrophic community against synthetic pollutants and persistent natural toxins. The use of recombinants that harbor deteriorative determinants from other species can substantially enhance the capacity of remediation technologies. However, the widespread use of genetically modified organisms requires continuous surveillance of gene transmission, and for that RT PCR is a plausible and rapid method.

Validation of Microarray Results

RT PCR has been employed to study gene expression patterns during several stresses, leading to the activation of genes related to signal transduction, biosynthesis, and metabolism. The nitrogen deprivation response in Arabidopsis was analyzed by profiling transcription factors using Affymetrix ATH1 arrays and an RT RT-PCR platform [1, 120]. The results revealed a large number of differentially expressed putative regulator genes. In this study, the MapMan visualization software was used to identify coordinated, system-wide changes in metabolism and other cellular processes. Similarly, Czechowski and co-workers profiled over 1,400 Arabidopsis transcription factors and revealed 36 root- and 52 shoot-specific genes [121]. Further gene expression studies have been made in the direction of stress signaling during biotic and abiotic stress conditions in plants [122–127]. Standardization of housekeeping genes for such studies has been carried out in potato: among the seven common genes tested, ef1alpha was the most stable during biotic and abiotic stress [128]. Furthermore, the data obtained by microarray analysis are on a few occasions questioned, and confirmation is achieved by RT PCR (or conventional PCR in some instances). The expression levels observed in microarrays are generally higher than those measured by RT PCR [129]. In general, studies made so far reveal a good relationship between these two techniques, and for this reason RT PCR is considered a confirmatory tool for microarray results [130].
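Normalization against a stable reference gene such as ef1alpha is commonly done with the Livak 2^−ΔΔCt method; the review does not spell this out, so the scheme below is a standard sketch and the Ct values are invented:

```python
def relative_expression(ct_target_treated, ct_ref_treated,
                        ct_target_control, ct_ref_control):
    """Livak 2^-ddCt: fold change of a target gene between treated and
    control samples, normalized to a reference (housekeeping) gene.
    Assumes near-100% amplification efficiency for both genes."""
    d_ct_treated = ct_target_treated - ct_ref_treated
    d_ct_control = ct_target_control - ct_ref_control
    return 2 ** -(d_ct_treated - d_ct_control)

# Invented Ct values: the target crosses threshold 2 cycles earlier under
# stress while the ef1alpha-like reference stays put -> 4-fold induction.
fold = relative_expression(22.0, 18.0, 24.0, 18.0)
print(fold)  # 4.0
```

The efficiency assumption is exactly why a stable reference gene (as validated for potato above) matters: an unstable reference shifts the ΔCt and silently biases the fold change.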

Plant-Microbe Interaction

Host plants and their associated microbes form a special consortium in which the parasite is an alien element. Early diagnosis of pathogens enables rapid and suitable measures for limiting epidemics and selecting appropriate controls. Molecular diagnostics is a rapidly growing area in plant pathology, especially for detection and quantification of commercially important crop pathogens. Adoption of the RT PCR technique is of growing interest due to its rapidity and sensitivity, as well as its ability to detect minute amounts of a pathogen's DNA in infected plant tissues and insect vectors [131]. Simultaneous detection of several pathogens can be achieved by multiplex PCR. The technique has aided detection of pathogens associated with serious diseases like Fusarium head blight, a prerequisite for reducing incidence through understanding of its epidemiology [97]. Several reports are available on detection and/or quantification of plant pathogens (Table 8). Published literature covers quantification of pathogens [132–133], determination of symbiotic microbes and pathogens [134], detection/quantification of seed-borne pathogens [135], host resistance screening [136] and distinguishing between pathogen pathovars [137–138] using RT PCR.
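Pathogen quantification by real-time PCR typically reads unknowns off a standard curve of Ct against log10 template amount built from a dilution series. The sketch below fits such a curve by least squares; the copy numbers and Ct values are invented for illustration.

```python
def fit_standard_curve(log10_copies, cts):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    n = len(cts)
    mx = sum(log10_copies) / n
    my = sum(cts) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(log10_copies, cts))
             / sum((x - mx) ** 2 for x in log10_copies))
    return slope, my - slope * mx

def copies_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate template copies in an unknown."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical 10-fold dilution series from 1e6 down to 1e2 copies.
# A slope near -3.32 corresponds to ~100% amplification efficiency.
logs = [6, 5, 4, 3, 2]
cts = [15.0, 18.3, 21.6, 24.9, 28.2]
slope, intercept = fit_standard_curve(logs, cts)
print(copies_from_ct(20.0, slope, intercept))  # roughly 3e4 copies
```

The same arithmetic underlies pathogen load estimates in infected tissue; multiplex assays simply run one such curve per probe.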

Plant Pathogens/Pests Determined by Quantitative Real-Time PCR

Species Identification

In plants, the large number of members within each gene family complicates a clear understanding of the function of each member. Plant molecular biologists prefer RT PCR to other methods, and the number of findings is increasing rapidly. Northern blotting determination of genes expressed at low levels is difficult, and closely related genes may cross-hybridize [139]. Both unique and redundant functions within multigene families have been identified [140–142]. Expression analysis of all 33 genes encoding cell-wall enzymes in Arabidopsis thaliana using RT PCR revealed that most members exhibit distinct expression profiles, alongside redundant expression patterns of some genes [143]. Similarly, an expression profile of the shaggy-like kinase multigene family during plant development has been made using this technique [144]. Further, transformants carrying a high number of copies show lower or unstable expression of the inserted gene, so primary transformants are analyzed for the copy number of randomly inserted genes. A study using duplex RT PCR has been described for determining transgene copy number in transformed plants, with a high degree of correlation with Southern blot analysis [145]. Likewise, many studies report detection of copy number using RT PCR in various crops [146–147].
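Duplex real-time PCR copy-number determination of the kind cited above is commonly based on ΔΔCt-style normalization: the transgene Ct is normalized to an endogenous reference gene, then scaled against a calibrator line of known copy number. A rough sketch, assuming 100% amplification efficiency and hypothetical Ct values:

```python
def transgene_copies(ct_transgene, ct_ref,
                     ct_transgene_cal, ct_ref_cal, cal_copies=1):
    """Estimate transgene copy number from duplex real-time PCR.

    Normalizes the transgene Ct to an endogenous reference gene
    amplified in the same tube, then scales by a calibrator line
    of known copy number (2^-ddCt). Assumes equal efficiencies.
    """
    d_sample = ct_transgene - ct_ref
    d_cal = ct_transgene_cal - ct_ref_cal
    return cal_copies * 2 ** (d_cal - d_sample)

# Hypothetical values: the sample's transgene amplifies one cycle
# earlier (relative to the reference) than a single-copy calibrator,
# suggesting two integrated copies.
print(transgene_copies(21.0, 20.0, 22.0, 20.0, cal_copies=1))  # -> 2.0
```

Non-integer estimates are rounded to the nearest whole copy in practice, which is one reason results are cross-checked against Southern blots.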

CONCLUSIONS

RT PCR is becoming a common tool for detecting and quantifying expression profiles of desired genes. This review indicates that the technology to detect PCR products in real time, i.e., during the reaction, has seen a dramatic leap in use and application over the past few years. PCR-based detection technologies utilizing species-specific primers are proving indispensable as research tools, providing enhanced information on the biology of plant-microbe interactions, with special regard to the ecology, aetiology and epidemiology of plant pathogenic micro-organisms. RT PCR allows quantitative genotyping, detection of single nucleotide polymorphisms, allelic discrimination and assessment of genetic variation. Multiplex PCR systems using combined probes and primers targeted to sequences specific to the counterpartners of plant/microbe associations are becoming more important than standard PCR, which is proving insufficient for such living systems. Application of RT PCR combined with other molecular techniques has made possible the monitoring of both therapeutic intervention and individual responses to drugs. Developments in bioinformatics have helped us understand how the genome gives rise to different cell types, how it contributes to basic and specialized functions in those cells, and how it contributes to the ways cells interact with the environment; RT PCR is a valuable methodological tool in clarifying such problems. The needs of clinical application of molecular methods initiated important developments in diagnostics, stimulating progress in other branches of science, and the introduction of these new methods in other fields induced a rapid expansion of molecular approaches.

Plants and animals use small RNAs (microRNAs [miRNAs] and siRNAs) as guides for post-transcriptional and epigenetic regulation. MicroRNAs (miRNAs) were initially considered a biological sideshow, the oddly interesting regulators of developmental-timing genes in Caenorhabditis elegans . But in the past few years, studies have shown that miRNAs are a considerable part of the transcriptional output of plant and animal genomes, and that they play important regulatory functions in widespread biological activities. Accordingly, miRNAs are now recognized as an additional layer of post-transcriptional control that must be accounted for if we are to understand the complexity of gene expression and the regulatory potential of the genome. Owing to this impressive progress in understanding the genomics and functions of miRNAs, this is an ideal time to examine the available evidence and see where this rapidly growing field is going. The small RNA repertoire in plants is complex, and little is known about their functions, which poses new challenges [148].

Research has focused on approaches to detect the presence of miRNAs and their impact on genomes, and the roles they play in regulating biological functions have been explored. Studies generally follow a progressive logic from discovery to target prediction to function to a systems perspective and finally to an organism perspective.

Plant and animal genomes have been shaped by miRNAs, as seen in the substantial number of conserved miRNAs that have accumulated through selection and in the presence of miRNA target sites in genes of diverse functions. However, the true number of miRNAs and targets remains difficult to estimate. In plants, miRNAs and trans-acting (ta-) siRNAs form through distinct biogenesis pathways, although both interact with target transcripts and guide cleavage [149]. Developments in bioinformatics have called for a correct definition of a 'true' miRNA and raised the implications this definition will have for future studies. Approaches to predicting miRNA targets consider the case for combinatorial control of target expression by multiple miRNAs acting synergistically. Fundamental goals of investigations into genome function are to understand how the genome gives rise to different cell types, how it contributes to basic and specialized functions in those cells, and how it contributes to the ways cells interact with the environment; RT PCR is a valuable methodological tool in clarifying such problems. A systems approach is needed to conceptualize a network of interacting miRNAs and targets, and it might be supposed that miRNAs act to canalize developmental gene expression programs through ontogeny in both unicellular and multicellular organisms. The topology of this network resembles that mapped previously in yeast, reinforcing the idea that similar networks may underlie the genetic basis of complex human disease. The recent breakthrough discovery by Rigoutsos and co-workers of self-similar, repetitive elements (which they call "pyknons") throughout coding as well as non-coding "junk" DNA raises the question of how these findings relate to the fractality of DNA, and opens questions on fractal hierarchies in the complex organization of genes and non-genes [150].
These unexpected findings suggest functional connections between the coding and noncoding parts of the human genome. Some recent data provide evidence for roles of miRNAs encoded in pathogen and host cells in influencing the cell-type specificity of their interaction. From an organismal perspective, miRNAs and other endogenous regulatory RNAs in plants may have diverse biological roles in the realization of both developmental programs and stress responses. There are several instances of polymorphisms influencing human disease progression, but no definitive answer has yet been obtained, and no comparable data are available for plant-microbe interactions. Most heritable traits, including disease susceptibility, are affected by interactions between multiple genes. However, we understand little about how genes interact, because very few of the possible genetic interactions have been explored experimentally.

Genome-wide association approaches can map the genetic determinants of the transcriptome in established host/parasite complexes and in microbial populations associated with plants. The concept that genes and non-genes comprise fractal sets, determining ensuing fractal hierarchies of complexity in biological processes, undoubtedly helps in analyzing the enormous datasets obtained by RT PCR on functional gene expression. Although algorithms for discovering generic motifs in sequential data represent an extremely valuable tool for data analysis, the emergence of an informatics market creates difficulties, as large-scale patent applications withdraw these new methods from open scientific discussion [151]. Nevertheless, one can assume that applying this approach to plant-microbe interactions will accelerate the evolution of our understanding and initiate the elaboration of a new theory of plant pathology. The organization of microbial consortia and their functional interactions with macrobial partners can also be evaluated in their whole complexity on the basis of this new concept.

Genes might also serve as therapeutic agents. The use of alien toxin genes as well as detoxifying enzyme-coding genes has led to promising economic results in plant cultivation. Sequencing of the genomes of a number of model organisms provides a strong framework to achieve this goal. Several methods, including gene expression profiling and protein interaction mapping, are being used on a large scale and constitute useful entry points for identifying pathways involved in disease mechanisms. The time required to clarify these processes can be shortened by applying RT PCR.

Methods relying on the genetic manipulation of well-characterized, simple models of host/parasite systems (HPS) to reconstruct disease-associated pathways can pinpoint biologically valid therapeutic targets on the basis of function-based datasets generated in vivo . HPSs are strongly complementary to well-established complex models, and multiple ways exist to integrate these results into the early stage of the drug discovery process.


  • Open access
  • Published: 27 May 2024

Research on domain ontology construction based on the content features of online rumors

  • Jianbo Zhao 1 ,
  • Huailiang Liu 1 ,
  • Weili Zhang 1 ,
  • Tong Sun 1 ,
  • Qiuyi Chen 1 ,
  • Yuehai Wang 2 ,
  • Jiale Cheng 2 ,
  • Yan Zhuang 1 ,
  • Xiaojin Zhang 1 ,
  • Shanzhuang Zhang 1 ,
  • Bowei Li 3 &
  • Ruiyu Ding 2  

Scientific Reports volume 14, Article number: 12134 (2024)

  • Computational neuroscience
  • Computer science
  • Data acquisition
  • Data integration
  • Data mining
  • Data processing
  • Human behaviour
  • Information technology
  • Literature mining
  • Machine learning
  • Scientific data

Online rumors are widespread and difficult to identify, bringing serious harm to society and individuals. To effectively detect and govern online rumors, it is necessary to conduct in-depth semantic analysis and understand the content features of rumors. This paper proposes a TFI domain ontology construction method, which aims to achieve semantic parsing and reasoning over rumor text content. Starting from the term layer, the frame layer, and the instance layer, and based on reuse of top-level ontologies, extraction of content features from core literature, and discovery of new concepts in a real corpus, it obtains the core classes of the rumor domain ontology (five parent classes and 88 subclasses) and defines their concept hierarchy. Object properties and data properties are designed to describe relationships between entities or their features, and the instance layer is created from real rumor datasets. The OWL language is used to encode the ontology, Protégé is used to visualize it, and SWRL rules and the Pellet reasoner are used to mine and verify the implicit knowledge of the ontology and judge the category of rumor texts. The resulting rumor domain ontology has high consistency and reliability.

Introduction

Online rumors are false information spread through online media; their content is wide-ranging 1 and they are hard to identify 2,3 . Online rumors can mislead the public, disrupt social order, damage personal and collective reputations, and pose a great challenge to the governance of internet information content. Therefore, to effectively detect and govern online rumors, it is necessary to conduct in-depth semantic analysis and understanding of the content features of rumor texts.

Research on the content features of online rumors focuses on the lexical, syntactic and semantic features of rumor texts 4 , syntactic structure and functional features 5 , source features 5,6 , rhetorical methods 7 , narrative structure 6,7,8 , language style 6,9,10 , corroborative means 10,11 and emotional features 10,12,13,14,15,16,17,18 . Most existing research on rumor content features mines features under a single domain topic type and neglects the influence relationships among multiple features. Therefore, this paper proposes to build an online rumor domain ontology to realize fine-grained hierarchical modeling of the relationships between rumor content features and credible verification of their effectiveness. A domain ontology is a systematic description of the objective existence in a specific discipline 19 . The main construction methods include the TOVE method 20 , skeleton method 21 , IDEF-5 method 22,23 , Methontology 24,25 and seven-step method 26,27 , among which the seven-step method is currently the most mature and widely used 28 ; it is systematic and broadly applicable 29 , but it provides no quantitative indicators or methods for assessing ontology quality and effect. Construction technology can be divided into construction based on thesaurus conversion, construction based on reuse of existing ontologies, and semi-automatic and automatic construction based on ontology engineering methods 30 . Thesaurus conversion and ontology reuse save construction time and cost and improve ontology reusability and interoperability, but differences in structure, semantics and scenario often remain.
Semi-automatic and automatic construction technologies based on ontology engineering methods apply artificial intelligence to extract ontology elements and structures from data sources automatically, with high efficiency and low cost, but their quality and accuracy are difficult to guarantee. Traditional domain ontology construction methods lack effective quality evaluation support, and construction technologies lack effective integration. Therefore, this paper proposes an improved TFI online rumor domain ontology construction method based on the seven-step method. Starting from the term layer, the frame layer and the instance layer, it integrates top-down reuse of top-level ontologies and core literature content features with bottom-up semi-automatic construction based on an N-gram new-word discovery algorithm and RoBERTa-Kmeans clustering, defines the fine-grained features of online rumor content, and models them hierarchically. SWRL rules and the Pellet reasoner are used to mine the implicit knowledge of the ontology and to evaluate and verify its validity and consistency.

The structure of this paper is as follows: Sect “ Related work ” reviews research on rumor content features and domain ontology construction; Sect “ Research method ” presents the TFI construction method; Sect “ Domain ontology construction ” builds the term layer, the frame layer and the instance layer of the domain ontology; Sect “ Ontology reasoning and validation ” mines and verifies the implicit knowledge of the ontology based on SWRL rules and the Pellet reasoner; Sect “ Discussion ” points out the research limitations and future research directions; Sect “ Conclusion ” summarizes the research content and contributions of this paper.

Related Work

Content features of online rumors.

The content features of online rumors refer to the adaptive description of vocabulary, syntax and semantics in rumor texts. Fu et al. 5 made a linguistic analysis of COVID-19 online rumors from the perspectives of pragmatics, discourse analysis and syntax, and concluded that the source of information, the specific place and time of the event, the length of the title and statement, and the emotions aroused are important characteristics for judging the authenticity of rumors; Zhang et al. 6 summarized the narrative theme, narrative characteristics, topic characteristics, language style and source characteristics of new-media rumors; Li et al. 7 found that rumor headlines rely on authoritative endorsement and fear appeals, make extensive use of news-style and numerical headlines, and mostly build topics with programmed, fixed structures; Yu et al. 8 analyzed in detail the content distribution, narrative structure, topic scene construction and title characteristics of rumors; Mourao et al. 9 found that the language style of rumors differs significantly from that of real texts, with rumors tending toward simpler, more emotional and more radical discourse strategies; Zhou et al. 10 analyzed rumor texts under six categories, such as content type, focus object and corroboration means, and found that epidemic rumors were mostly "infectious" topics, with narrative expression the most common, strong fear, and a preference for exaggerated, polarized discourse; Huang et al. 11 conducted an empirical study of WeChat rumors and found that the "confirmation" means of rumors include data corroboration and specific information, hot events and authoritative release; Butt et al. 12 analyzed the psycholinguistic features of rumors and extracted four feature sets from a rumor dataset: LIWC, readability, SenticNet and emotions; Zhou et al. 13 analyzed the semantic features of fake-news content in theme and emotion, finding that fake and real news differ in theme features and that the overall, negative and anger emotions of fake news are higher; Tan et al. 14 divided the content characteristics of rumors into content characteristics with emotional tendency and social characteristics that affect credibility; Damstra et al. 15 identified consistent indicators of intentionally deceptive news content, including negative emotions causing anger or fear, lengthy sensational headlines, and use of informal language or swearing; Lai et al. 16 proposed that emotional rumors induce similar positive and negative emotions in the audience through emotional contagion; Yuan et al. 17 found that multimedia evidence and topic shaping are important means of creating rumors, which mostly convey negative emotions of fear and anger, and that the provision of information sources is related to the popularity and duration of rumors; Ruan et al. 18 analyzed the content types, emotional types and discourse focus of Weibo rumor samples, finding that social-life rumors had the highest proportion and that the emotional types were mainly hostile and fearful, focused on the general public and personnel of party, government and military institutions.

The forms and contents of online rumors are increasingly diverse and complex. Existing research on rumor content features mostly mines characteristics under specific topics, cannot cover the various types of rumor topics, and lacks fine-grained hierarchical modeling of the relationships between features and credible verification of their effectiveness.

Domain ontology construction

A domain ontology is a unified definition, standardized organization and visual representation of the knowledge concepts of a specific domain 31,32 , and an important source of information for knowledge-based systems 19,33 . Theoretical methods include the TOVE method 20 , skeleton method 21 , IDEF-5 method 22,23 , Methontology 24,25 and seven-step method 26,27 . The TOVE method transforms informal descriptions into a formal ontology and suits fields that need precise knowledge, but it is complex and time-consuming, requires high-level domain knowledge, and is hard to extend and maintain. The skeleton method forms an ontology skeleton by defining the concepts and relationships of goals, activities, resources, organizations and environment; it can be adjusted as needed and suits fields that need multi-perspective, multi-level knowledge, but it lacks formal semantics and reasoning ability. Based on this method, Ran et al. 34 constructed an ontology of idioms and allusions. The IDEF-5 method uses a chart language and a detailed description language to construct ontologies, formalizing and visualizing objective knowledge; it suits fields with multi-source data and multi-party participation but lacks a unified ontology representation language. Based on this method, Li et al. 35 constructed a business-process activity ontology for military equipment maintenance support, and Song et al. 36 established an air-defense and anti-missile operation process ontology. Methontology is close to software engineering: it systematically develops ontologies through specification, knowledge acquisition, conceptualization, integration, implementation, evaluation and documentation, and suits fields that need multi-technology, multi-ontology integration, but it is complicated and tedious and requires substantial resources and time 37 . Based on this method, Yang et al. 38 completed an emergency-plan ontology, Duan et al. 39 established an ontology of high-resolution images of rural residents, and Chen et al. 40 constructed the corpus ontology of Jiangui. The seven-step method is currently the most mature and widely used 28 . It constructs an ontology systematically by determining its purpose, scope, terms, structure, attributes, limitations and examples 29 , but it provides no quantitative indicators or methods for assessing ontology quality and effect. Based on this method, Zhu et al. 41 constructed a disease ontology of asthma, Li et al. 42 constructed ontologies of military events, weapons and equipment, and the battlefield environment, and Zhang et al. 43 constructed an ontology for the stroke nursing field and verified the construction results by expert consultation.

Domain ontology construction technology includes thesaurus conversion, reuse of existing ontologies, and semi-automatic and automatic construction based on ontology engineering methods 30 . Construction based on thesaurus transformation takes an existing thesaurus as the knowledge source and transforms its concepts, terms and relationships into the entities and relationships of a domain ontology through defined rules and methods, saving construction time and cost and improving ontology quality and reusability; however, the structural and semantic differences between thesaurus and ontology must be resolved, with adjustment and optimization for the characteristics of different fields and application scenarios. Wu et al. 44 constructed an ontology of the natural gas market from a natural-gas-market thesaurus and a mapping of subject words to the ontology, and Li et al. 45 constructed a medical-domain ontology from a Chinese medical thesaurus. Construction based on reuse of existing ontologies generates new domain ontologies from existing ontologies or knowledge resources through modification, expansion, merging and mapping, saving time and cost and improving the consistency and interoperability of ontologies, but semantic differences and conflicts between ontologies must also be resolved. Chen et al. 46 reused the top-level framework of the scientific evidence source information ontology (SEPIO) and the traditional Chinese medicine language system (TCMLS) to construct an ontology of clinical trials of traditional Chinese medicine, and Xiao et al. 47 constructed a COVID-19 domain ontology by extracting existing ontologies and COVID-19-related knowledge from diagnosis and treatment guidelines.
Semi-automatic and automatic construction based on ontology engineering methods uses natural language processing, machine learning and other technologies to extract ontology elements and structures from data sources, enabling large-scale, fast and low-cost domain ontology construction 48 ; however, technical difficulties remain, the quality and accuracy of knowledge extraction cannot be well guaranteed, and the quality and consistency of different knowledge sources must be considered. Su et al. 48 used regular templates and a clustering algorithm to construct a port-machinery ontology, Zheng et al. 49 realized automatic construction of a mobile-phone ontology with LDA and other models, Dong et al. 50 realized automatic ontology construction for human-machine ternary data fusion in the manufacturing field, Linli et al. 51 proposed an ontology learning algorithm based on hypergraphs, and Zhai et al. 52 performed ontology learning through part-of-speech tagging, dependency syntax analysis and pattern matching.

At present, domain ontology construction methods are hard to extend, lack effective quality evaluation support and effective integration of construction technologies, are often divorced from practice and thus cannot guide it, and rely on subjective ontology verification. Aiming at these problems in research on the content characteristics of online rumors and in domain ontology construction, this paper proposes an improved TFI online rumor domain ontology construction method based on the seven-step method. It combines top-down reuse of existing ontologies with bottom-up semi-automatic construction, and establishes the rumor domain ontology from the term layer, frame layer and instance layer based on top-level ontology reuse, core literature content feature extraction and new-concept discovery in a real corpus. Using Protégé as a visualization tool, implicit knowledge mining is carried out by constructing SWRL rules to verify the semantic parsing ability and consistency of the domain ontology.
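The RoBERTa-Kmeans step mentioned above clusters embedding vectors of candidate terms so that each cluster suggests a concept. As a toy stand-in (the paper's actual embeddings and cluster count are not reproduced here), the sketch below runs a minimal Lloyd's k-means with farthest-point initialization on random vectors standing in for RoBERTa embeddings:

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Minimal Lloyd's k-means with farthest-point initialization.

    A toy stand-in for the RoBERTa-Kmeans step: cluster term
    embeddings so each cluster suggests one candidate concept.
    """
    # Farthest-point init: start from X[0], then repeatedly pick the
    # point farthest from all chosen centers.
    centers = [X[0]]
    for _ in range(k - 1):
        dists = np.min([((X - c) ** 2).sum(-1) for c in centers], axis=0)
        centers.append(X[np.argmax(dists)])
    centers = np.array(centers)
    for _ in range(iters):
        # Assign each vector to its nearest center.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Recompute centers; keep the old center if a cluster empties.
        centers = np.array([X[labels == j].mean(0) if (labels == j).any()
                            else centers[j] for j in range(k)])
    return labels

# Two well-separated blobs stand in for embeddings of terms from
# two different rumor concepts.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (5, 4)),   # blob 1: one "concept"
               rng.normal(5, 0.1, (5, 4))])  # blob 2: another
print(kmeans(X, k=2))  # -> [0 0 0 0 0 1 1 1 1 1]
```

Real pipelines would feed sentence or term embeddings from a pretrained encoder instead of random vectors and choose k by a quality criterion; the clustering logic itself is unchanged.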

Research method

This paper proposes a TFI online rumor domain ontology construction method based on the improvement of the seven-step method, which includes the term layer, the frame layer and the instance layer construction.

Term layer construction

Determine the domain and scope: the purpose of constructing the rumor domain ontology is to support the credible detection and governance of online rumors, and the domain and scope of the ontology are determined by answering questions.

Three-dimensional term set construction: investigate top-level ontologies and related core literature, and semi-automatically complete top-down mapping of reusable top-level ontologies and extraction of rumor content feature concepts; establish authoritative real rumor datasets and automatically complete bottom-up discovery of new domain concepts; on this basis, determine the term set of the domain ontology.
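The bottom-up step relies on an N-gram new-word discovery algorithm. A common formulation scores candidate adjacent pairs by frequency and internal cohesion (pointwise mutual information); pairs that co-occur far more often than chance predicts are candidate new terms. The toy corpus and thresholds below are illustrative, not taken from the paper:

```python
import math
from collections import Counter

def discover_bigrams(tokens, min_count=2, min_pmi=1.0):
    """Score adjacent token pairs by count and PMI.

    PMI = log2( P(a,b) / (P(a) * P(b)) ), estimated from counts;
    frequent, cohesive pairs are returned as new-term candidates.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    out = []
    for (a, b), c in bigrams.items():
        if c < min_count:
            continue
        pmi = math.log2(c * n / (unigrams[a] * unigrams[b]))
        if pmi >= min_pmi:
            out.append(((a, b), c, round(pmi, 2)))
    return sorted(out, key=lambda t: -t[2])

toy = ("online rumor spreads fast , an online rumor misleads , "
       "fast food is fine").split()
print(discover_bigrams(toy))  # -> [(('online', 'rumor'), 2, 2.81)]
```

Production new-word discovery usually adds a boundary-entropy filter and extends beyond bigrams, but the cohesion idea is the same.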

Frame layer construction

Define core classes and hierarchical relationships: combine the concepts of the three-dimensional rumor term set, based on the data distribution of the rumor dataset, define the parent class, summarize the subclasses, design hierarchical relationships and explain the content of each class.

Define core properties and facets of properties: in order to achieve deep semantic parsing of rumor text contents, define object properties, data properties and property facets for each category in the ontology.

Instance layer construction

Create instances: analyze the real rumor dataset, extract instance data, and add them to the corresponding concepts in the ontology.

Encode and visualize ontology: use OWL language to encode ontology, and use Protégé to visualize ontology, so that ontology can be understood and operated by computer.

Ontology verification: use SWRL rules and pellet reasoner to mine implicit knowledge of ontology, and verify its semantic parsing ability and consistency.
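An SWRL rule pairs antecedent conditions over properties with a consequent class, and the reasoner fires every rule whose antecedent matches an instance. As a rough illustration only (the rule bodies and property names below are invented, not the paper's actual rules), a tiny forward-chaining sketch shows the kind of classification such rules perform:

```python
# Toy SWRL-style rules: antecedent = property/value tests on an
# instance; consequent = an inferred class. Property names and rule
# contents are hypothetical illustrations, not the paper's rules.
RULES = [
    ({"topic": "disease", "emotion": "fear"}, "HealthScareRumor"),
    ({"topic": "politics"}, "PoliticalRumor"),
]

def classify(instance):
    """Fire every rule whose antecedent matches; return inferred classes."""
    inferred = set()
    for antecedent, consequent in RULES:
        if all(instance.get(k) == v for k, v in antecedent.items()):
            inferred.add(consequent)
    return inferred

rumor = {"topic": "disease", "emotion": "fear", "source": "anonymous"}
print(classify(rumor))  # -> {'HealthScareRumor'}
```

A real setup would express these as SWRL atoms over OWL object and data properties and let Pellet also check that the inferred classes are consistent with the asserted hierarchy.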

Ethical statements

This article does not contain any studies with human participants performed by any of the authors.

Determine the professional domain and scope of the ontology description

This paper determines the domain and scope of the online rumor domain ontology by answering the following four questions:

(1) What is the domain covered by the ontology?

The “Rumor Domain Ontology” constructed in this paper only considers content features, not user features and propagation features; the data covers six rumor types of politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, and others involved in China’s mainstream internet rumor-refuting websites.

(2) What is the purpose of the ontology?

To perform fine-grained hierarchical modeling of the relationships among the features of multi-domain online rumor contents, realize semantic parsing and credibility reasoning verification of rumor texts, and guide fine-grained rumor detection and governance. It can also be used as a guiding framework and constraint condition for online rumor knowledge graph construction.

(3) What kind of questions should the information in the ontology provide answers for?

To provide answers for questions such as the fine-grained rumor types of rumor instances, the valid features of rumor types, etc.

(4) Who will use the ontology in the future?

Practitioners of online rumor detection and governance, and builders of online rumor knowledge graphs.

Three-dimensional term set construction

Domain concepts reused from top-level ontologies.

As mature and authoritative common ontologies, top-level ontologies can be shared and reused widely, providing reference and support for domain ontology construction. The online rumor domain ontology established in this paper focuses on content characteristics, mainly the themes, events, and emotions of rumor texts. Reusing terminology concepts from existing top-level ontologies unifies and standardizes the terms in the term set; at the same time, the top-level concepts and their subclass structures guide the framework of the domain ontology and reduce the difficulty and cost of construction. After screening, the reusable top-level ontologies are SUMO, SenticNet, and ERE.

SUMO ontology: a public upper-level knowledge ontology containing general concepts and relations for describing knowledge across domains. The partial reusable SUMO top-level concepts and subclasses selected in this paper are shown in Table 1; they support the sub-concept design of text themes in the rumor domain ontology.

SenticNet: a knowledge base for concept-based sentiment analysis, containing semantic, emotional, and polarity information for natural language concepts. The partial reusable SenticNet top-level concepts and subclasses selected in this paper are shown in Table 2; they support the sub-concept design of text emotions in the rumor domain ontology.

Entities, relations, and events (ERE): a knowledge base of events and entity relations. The partial reusable ERE top-level concepts and subclasses selected in this paper are shown in Table 3 , which provides support for the sub-concept design of text elements in the rumor domain ontology.

Extracting domain concepts based on core literature content features

Core literature in the domain is an important source of feature concepts. Using ‘rumor detection’ as the search term, this paper retrieves 274 papers from the WOS core literature database and 257 from CNKI. The content features of rumor texts in the literature sample are extracted, duplicate content features are removed, core content features are screened, and synonymous concepts from different papers are given canonical names, yielding the domain concepts shown in Table 4. Among them, text theme, text element, text style, text feature, and text rhetoric are classified as text features; emotion category, emotional appeal, and rumor motive as emotional characteristics; source credibility, evidence credibility, and testimony method as information credibility characteristics; social context is implicit.

Extracting domain concepts based on new concept discovery

Taking China’s mainstream rumor-refuting websites as data sources, this paper builds a general rumor dataset and proposes a domain new-concept discovery algorithm. The algorithm discovers domain-specific new words in the dataset and adds them to the word segmentation dictionary to improve segmentation accuracy; the new words are then clustered by rumor type, yielding a concept subclass dictionary based on the real rumor dataset, which provides a realistic basis and data support for the conceptual design of each subclass in the domain ontology.

Building a general rumor dataset

The rumor dataset constructed in this paper contains 12,472 texts: 6236 rumors and 6236 non-rumors. The data come from China’s mainstream internet rumor-refuting websites: 1032 from the internet rumor exposure platform of the China internet joint rumor-refuting platform, 270 from the today’s rumor-refuting column of the same platform, 1852 from the Tencent news Jiaozhen platform, 1744 from the Baidu rumor-refuting platform, 7036 from the science rumor-refuting platform, and 538 from the Weibo community management center.

Eight researchers were invited to annotate the dataset with labels (rumor, non-rumor) and categories (politics and military, disease prevention and treatment, social life, science and technology, nutrition and health, others). Because annotation is manual and subjective, annotation standards were formulated beforehand to ensure validity and consistency, covering the screening method, trigger words, and sentence-break identification for rumor information; the screening method and trigger words for rumor categories were explained and exemplified to reduce differences in understanding among researchers. The researchers were then trained on these standards to familiarize them with the labeling specifications and improve their labeling ability and efficiency.

Multi-person cross-labeling was adopted: each piece of data was independently labeled by at least two researchers, and conflicting results were jointly decided by the annotators to increase the reliability and accuracy of labeling. After labeling, multi-person cross-validation was used to evaluate the results: each piece of data was independently verified by at least two researchers who did not participate in labeling, and conflicting results were jointly decided by at least five researchers to ensure consistent evaluation. Examples of the results are shown in Table 5.
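The cross-labeling policy described above can be sketched in a few lines; the function name and the return convention below are illustrative assumptions, not the authors' annotation tooling.

```python
def adjudicate(labels, quorum=2):
    """Resolve independent annotations for one data item.

    labels: labels from at least `quorum` researchers.
    Returns the unanimous label, or None to signal escalation to a
    joint decision by the panel (as in the paper's workflow).
    """
    if len(labels) < quorum:
        raise ValueError(f"each item needs at least {quorum} annotations")
    if len(set(labels)) == 1:   # all annotators agree
        return labels[0]
    return None                  # conflict -> jointly decided

# agreement passes through; conflicts are escalated
assert adjudicate(["rumor", "rumor"]) == "rumor"
assert adjudicate(["rumor", "non-rumor"]) is None
```

The same routine applies to category labels; in the verification round the paper raises the joint-decision quorum for conflicts to at least five researchers.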

N-gram word granularity rumor text new word discovery algorithm

Existing neologism discovery algorithms mostly operate at the granularity of single Chinese characters; for long words, their time complexity is high and their accuracy low. Their practical usefulness is limited, since most of the newly discovered words already appear in general-domain dictionaries. To address these problems, this paper proposes an online rumor new word discovery algorithm based on N-gram word granularity, as shown in Fig. 1.

figure 1

Flowchart of domain new word discovery algorithm.

First, obtain the corpus to be processed \(c=\{s_{1},s_{2},\ldots,s_{n_{c}}\}\) and perform the first preprocessing on it, which includes sentence segmentation, Chinese word segmentation, and punctuation removal, obtaining the first corpus \(c^{p}=\{s_{1}^{p},s_{2}^{p},\ldots,s_{n_{c}}^{p}\}\); here \(s_{i}\) denotes the \(i\)-th sentence of the corpus to be processed, \(n_{c}\) the number of sentences, and \(s_{i}^{p}\) the \(i\)-th sentence of the first corpus. Perform an N-gram operation (\(n=2\sim 5\)) on each sentence of the first corpus to obtain candidate words; count the frequency of each candidate word in the first corpus and remove the candidates whose frequency is below the first threshold, obtaining the first class of candidate word set. Then calculate the cohesion of each candidate word \(w=w_{1}w_{2}\ldots w_{n}\) in the first class of candidate word set according to the following formula:

\(\mathrm{Cohesion}(w)=\min\limits_{1\le j<n}\dfrac{P(w)}{P(w_{1}\ldots w_{j})\,P(w_{j+1}\ldots w_{n})}\qquad(1)\)

In the formula, \(P(\cdot)\) denotes word frequency. The candidates are then filtered according to the second threshold corresponding to each N-gram length, obtaining the second class of candidate word set. After the new words in the second class of candidate word set are loaded into the LTP dictionary, the second preprocessing is performed on the corpus to be processed \(c=\{s_{1},s_{2},\ldots,s_{n_{c}}\}\), obtaining the second corpus \(c^{p'}=\{s_{1}^{p'},s_{2}^{p'},\ldots,s_{n_{c}}^{p'}\}\); the second preprocessing includes sentence segmentation, Chinese word segmentation, and stop word removal. After the vector representation of each word in the second corpus is obtained, the vector representation of each new word in the second class of candidate word set is determined; the new words are clustered with the K-means algorithm according to their vector representations, and each new word is assigned to the corresponding domain based on the clustering results and preset classification rules. Examples of the discovered new words are shown in Table 6:
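The candidate-generation and cohesion steps can be sketched in plain Python. The cohesion score here assumes the standard internal-cohesion form, P(w) divided by the probabilities of its weakest binary split, since the text only states that P(·) is word frequency; the thresholds and the toy corpus are illustrative.

```python
from collections import Counter

def ngram_candidates(sentences, nmin=2, nmax=5, min_freq=2):
    """Collect word-granularity N-gram candidates and score their cohesion.

    sentences: lists of words (the segmented first corpus).
    Returns {candidate word: cohesion score} after the frequency filter.
    """
    counts = Counter()
    for words in sentences:
        for n in range(1, nmax + 1):          # unigrams are needed for cohesion
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    total = sum(c for g, c in counts.items() if len(g) == 1)

    def p(gram):                               # relative frequency P(.)
        return counts[gram] / total

    out = {}
    for gram, c in counts.items():
        if nmin <= len(gram) <= nmax and c >= min_freq:
            # cohesion against the candidate's weakest binary split
            out["".join(gram)] = min(
                p(gram) / (p(gram[:j]) * p(gram[j:]))
                for j in range(1, len(gram)))
    return out

# toy segmented corpus: "新冠 疫苗" recurs and coheres into one new word
cand = ngram_candidates([["新冠", "疫苗"], ["新冠", "疫苗"], ["接种", "疫苗"]])
assert list(cand) == ["新冠疫苗"]
```

Candidates passing the per-n cohesion thresholds would then be loaded into the LTP dictionary before the second segmentation pass.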

RoBERTa-Kmeans rumor text concepts extraction algorithm

After the new words obtained by new word discovery are added to the LTP dictionary, LTP word segmentation becomes more accurate. The five types of rumor texts established in this paper are segmented with the updated LTP dictionary; after stop word removal, the words are fed into the RoBERTa word embedding layer to obtain word vectors, which are clustered with k-means by rumor type to obtain the concept subclass dictionary. The main process is as follows:

(1) Word embedding layer

The RoBERTa model uses the Transformer encoder for computation; each module contains a multi-head attention mechanism, residual connections with layer normalization, and a feed-forward neural network. The rumor texts, after accurate word segmentation, are represented through one-hot encoding to obtain word vectors, and position encodings represent the relative or absolute position of each word in the sequence. The word embedding vectors produced by superimposing the two are used as the input X. The multi-head attention mechanism uses multiple independent attention modules to process the input information in parallel, as shown in formula (2):

\(\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\dfrac{QK^{T}}{\sqrt{d_{k}}}\right)V\qquad(2)\)

where \(\{Q,K,V\}\) are the input matrices and \(d_{k}\) is the dimension of the input matrix. The hidden vectors obtained after this calculation are combined with the input through residual connection and layer normalization, and then passed through the two fully connected layers of the feed-forward neural network, as shown in formula (3):

\(\mathrm{FFN}(X)=\max(0,\,XW_{e}+b_{e})\,W_{0}'+b_{0}'\qquad(3)\)

where \(\{W_{e},W_{0}'\}\) are the weight matrices of the two fully connected layers and \(\{b_{e},b_{0}'\}\) are their bias terms.

After this calculation, bidirectional associations between word embedding vectors are established, enabling the model to learn the semantic features of each word embedding vector in different contexts. Through fine-tuning, the learned knowledge is transferred to the downstream clustering task.
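The attention computation of formula (2) can be checked with a tiny pure-Python sketch; the 2×2 matrices are toy stand-ins for one head's Q, K, V (real RoBERTa applies learned projections across many heads).

```python
import math

def softmax(row):
    m = max(row)                      # subtract max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, i.e. formula (2)."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, V)

# toy single-head example
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, out = attention(Q, K, V)
assert all(abs(sum(row) - 1.0) < 1e-9 for row in weights)
```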

(2) K-means clustering

Randomly select k initial points to form k clusters, and iterate until the loss function of the clustering result is minimized. The loss function is defined as the sum of squared errors between each sample point and its cluster center, as shown in formula (4):

\(L=\sum_{i=1}^{N}\left\|x_{i}-u_{a_{i}}\right\|^{2}\qquad(4)\)

where \(x_{i}\) is the \(i\)-th sample, \(a_{i}\) is the cluster to which \(x_{i}\) belongs, \(u_{a_{i}}\) is the corresponding cluster center, and \(N\) is the total number of samples.
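The k-means step, and the grouping of topic words into concept subclasses, can be illustrated with a self-contained sketch: toy 2-D vectors stand in for the RoBERTa embeddings, and a deterministic farthest-point initialization replaces the random seeding described above so the example is reproducible.

```python
def dist2(a, b):
    """Squared Euclidean distance (the per-sample term of formula (4))."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain Lloyd's algorithm with farthest-point initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:                 # seed with mutually distant points
        far = max(vectors, key=lambda v: min(dist2(v, c) for c in centers))
        centers.append(list(far))
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):     # assignment step
            assign[i] = min(range(k), key=lambda c: dist2(v, centers[c]))
        for c in range(k):                  # update step: new cluster means
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assign

# topic words with toy "embeddings"; real vectors come from RoBERTa
words = ["vaccine", "drug", "flood", "earthquake"]
vecs = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
subclasses = {}
for w, a in zip(words, kmeans(vecs, 2)):
    subclasses.setdefault(a, []).append(w)
assert sorted(sorted(v) for v in subclasses.values()) == [["drug", "vaccine"], ["earthquake", "flood"]]
```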

After the RoBERTa-Kmeans calculation, the resulting concept subclasses are manually screened: duplicate items are merged and invalid items deleted, finally yielding 79 rumor concept subclasses, including 14 for politics and military, 23 for disease prevention and treatment, 15 for social life, 13 for science and technology, and 14 for nutrition and health. Some statistics are shown in Table 7.

Each concept subclass is obtained by clustering several topic words. For example, the topic words that constitute the subclasses of body part, epidemic prevention and control, chemical drugs, etc. under the disease prevention and treatment topic are shown in Table 8 .

(3) Determining the terminology set

This paper constructs a three-dimensional rumor domain ontology terminology set based on the above three methods, and unifies the naming of the terms. Some of the terms are shown in Table 9 .

Framework layer construction

Define core classes and their hierarchy; define parent classes.

This paper aims at fine-grained hierarchical modeling of the relationships between the content characteristics of multi-domain online rumors. The top-level parent classes therefore need to cover the rumor category and the main content characteristics used to design sub-category rumors. The main content characteristics are the clustering results of the domain concepts extracted from the content features of the core literature, namely rumor text feature, rumor emotional characteristic, rumor credibility, and social context. The five top-level parent classes are as follows:

Rumor type: the specific classification of rumors under different subject categories. Rumor text feature: the common features of rumor texts in theme, style, rhetoric, etc. Rumor emotional characteristic: the emotional elements of rumor texts, the motive of the publisher, and the emotional changes the publisher hopes to trigger in the receiver. Rumor credibility: the authority of the information source, the credibility of the evidence material provided by the publisher, and the effectiveness of the testimony method. Social context: the relevant issues and events in society when the rumor is published.

Induce subclasses and design hierarchical relationships

Under the top-level parent classes, following the top-level concepts and subclass structures of SUMO, SenticNet, and ERE, together with the rumor text features of each category extracted from the real rumor dataset, this paper summarizes 88 subclasses and designs their hierarchical relationships, as shown in Fig. 2, including:

(1) Rumor text feature

figure 2

Diagram of the core classes and hierarchy of the rumor domain ontology.

① Text theme6,8,13,18,53: the theme or topic that the rumor text content involves. Based on the self-built rumor dataset, it is divided into: politics and military54, involving political figures, political policies, political relations, political activities, military actions, military events, strategic objectives, political and military commentary, etc.; nutrition and health55, involving the relationship between human health and nutrition, the nutritional components and value of food, plans and advice for healthy eating, health problems and habits, etc.; disease prevention and treatment10, involving the definition of diseases, vaccines, treatment, prevention, data, etc.; social life56, involving social issues, the social environment, social values, cultural activities, social media, the education system, etc.; science and technology57, involving scientific research, scientific discoveries, technological innovation, technological applications, technology enterprises, etc.; and other categories.

② Text element15: the structured information of the rumor text contents. It is divided into: character (political character, public character, etc.); geographical position (city, region, area, etc.); event (historical event, current event, crisis event, policy event, etc.); action (protection, prevention and control, exercise, fighting, crime, eating, breeding, health preservation, rest, education, sports, social, cultural, ideological, business, economic, transportation, etc.); material (food, products such as medicine, health products, and cosmetics, the materials they contain, and their relationship with human health); effect (nutrition, health, harm, natural disaster, man-made disaster, guarantee, prevention, treatment, etc.); institution (government, enterprise, school, hospital, army, police, social group, etc.); and nature (weather, astronomy, environment, agriculture, disease, etc.).

③ Text style7,10: the discourse style of the rumor text contents, which prefers exaggerated and emotional expression. It is divided into: gossip style, creating conflict or entertainment effects; curious style, satisfying people’s curiosity and desire for stimulation; critical style, exploiting receivers’ stereotypes or preconceptions; lyrical style, creating resonance and influencing emotion; didactic style, influencing receivers’ thought and behavior from an authoritative perspective; and plain style, concise and objective, arousing resonance.

④ Text feature7,58: special language devices in the rumor text contents that increase the transmission and influence of the rumor. It is divided into: extensive punctuation, reminding or attracting receivers’ attention; many mood words, enhancing emotional color and persuasiveness; many emoji, conveying attitude; and induced forwarding, using the @ symbol and similar devices to induce receivers to forward.

⑤ Text rhetoric15: common rhetorical devices in rumor contents, such as metaphor, hyperbole, repetition, and personification.

(2) Rumor emotional characteristic

① Emotion category17,59,60: the emotional tendency and intensity expressed in the rumor texts. It is divided into: positive emotion (happiness, praise, etc.); negative emotion (fear10, anger, sadness, anxiety61, dissatisfaction, depression, etc.); and neutral emotion (no preference, plain, objective, etc.).

② Emotional appeal16,62,63: the emotional changes the online rumor disseminator hopes to trigger in the receiver. It is divided into: “joy” (happy, pleasant, satisfied), prompting receivers to spread or believe rumors conducive to social harmony; “love” (love, appreciation, admiration), prompting receivers to spread or believe rumors conducive to the interests of certain people or groups; “anger” (angry, annoyed, dissatisfied), prompting receivers to spread or believe rumors that are anti-social or intensify conflicts; “fear” (fearful, afraid, nervous), prompting receivers to spread or believe rumors whose bad effects are deliberately exaggerated; “repugnance” (disgusted, nauseated), prompting receivers to spread or believe rumors detrimental to social harmony; and “surprise” (surprised, shocked, amazed), prompting receivers to spread or believe rumors that are exaggerated or fabricated to deliberately attract traffic.

③ Rumor motive17,64,65,66: the purposes and needs of the rumor publisher in publishing rumors and of the receiver in forwarding them, such as: profit-driven, seeking fame and fortune by deceiving receivers; emotional catharsis, venting to relieve dissatisfaction; creating panic, causing social unrest and riots and disrupting social order; entertainment, fooling receivers for stimulation; and information verification, digging out the truth of events.

(3) Rumor credibility

① Source credibility7,17: the degree of trustworthiness of the information source. For example, official institutions and authoritative experts and scholars in the field have high credibility; well-known encyclopedias and large-scale civil organizations have medium credibility; and small-scale civil organizations, personal hearsay, and personal experience have low credibility.

② Evidence credibility61: the credibility of the proof material provided by the publisher. For example: scientific basis, grounded in scientific theory or method; data support, backed by definite research or investigation results; temporal background, with clear time, place, character, event, and other elements related to the information content; and common sense of life, in line with widely recognized facts and scientific common sense.

③ Testimony method10,11,17: the method used to support or refute a point of view, such as: multimedia material, expressing or fabricating content details through pictures, videos, and audio; authority endorsement, citing policy documents, research papers, etc. of authoritative institutions or persons; and social identity, invoking the identity of social relation groups.

(4) Social context

① Social issue67: undesirable phenomena or difficulties in society, such as poverty, pollution, corruption, crime, declining government credibility68, etc.

② Public attention63: events or topics that arouse widespread attention or discussion in society, such as sports events, technological innovation, food safety, religious beliefs, Myanmar fraud, nuclear wastewater discharge, etc.

③ Emergency (public sentiment)69: major or urgent events that occur suddenly in society, such as earthquakes, floods, public safety incidents, malignant infectious disease outbreaks, etc.

(5) Rumor type

① Political and military rumor:

Political image rumor: rumors related to images closely connected to politics and military, such as countries, political figures, institutions, symbols, etc. These include positive political image smear rumor, negative political image whitewash rumor, political image fabrication and distortion rumor, etc.

Political event rumor: rumors about military and political events, such as international relations, security cooperation, military strategy, judicial trial, etc. These include positive political event smear rumor, negative political event whitewash rumor, political event fabrication and distortion rumor, etc.

② Nutrition and health rumor:

Food product rumor: rumors related to food, products (food, medicine, health products, cosmetics, etc.), the materials they contain and their association with human health. These include positive effect of food product rumor, negative effect of food product rumor, food product knowledge rumor, etc.

Living habit rumor: rumors related to habitual actions in life and their association with human health. These include positive effect of living habit rumor, negative effect of living habit rumor, living habit knowledge rumor, etc.

③ Disease prevention and treatment rumor:

Disease management rumor: rumors related to disease management and control methods that maintain and promote individual and group health. These include positive prevention and treatment rumor, negative aggravating disease rumor, disease management knowledge rumor, etc.

Disease confirmed transmission rumor: rumors about the confirmation, transmission, and immunity of epidemic diseases at the social level in terms of causes, processes, results, etc. These include local confirmed cases rumor, celebrity confirmed cases rumor, transmission mechanism rumor, etc.

Disease notification and advice rumor: rumors that fabricate or distort the statements of authorized institutions or experts in the field, and provide false policies or suggestions related to diseases. These include institutional notification rumor, expert advice rumor, etc.

④ Social life rumor:

Public figure public opinion rumor: rumors related to public figures’ opinions, actions, private lives, etc. These include positive public figure smear rumor, negative public figure whitewash rumor, public figure life exposure rumor, etc.

Social life event rumor: rumors related to events, actions, and impacts on people's social life. These include positive event sharing rumor, negative event exposure rumor, neutral event knowledge rumor, etc.

Disaster occurrence rumor: rumors related to natural disasters or man-made disasters and their subsequent developments. These include natural disaster occurrence rumor, man-made disaster occurrence rumor, etc.

⑤ Science and technology rumor:

Scientific knowledge rumor: rumors related to natural science or social science theories and knowledge. These include scientific theory rumor, scientific concept rumor, etc.

Science and technology application rumor: rumors related to the research and development and practical application of science and technology and related products. These include scientific and technological product rumor, scientific and technological information rumor, etc.

⑥ Other rumor: rumors that do not contain elements from the above categories.

Definition of core properties and facets of properties

Properties in the ontology describe the relationships between entities or the characteristics of entities. Object properties connect two entities and describe their interactions; data properties represent characteristics of entities, usually as values of some data type. Based on the self-built rumor dataset, this paper designs object properties, data properties, and property facets for the parent classes and subclasses of the rumor domain ontology.

Object properties

A partial set of object properties is shown in Table 10 .

Data properties

The partial data property set is shown in Table 11.
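Declarations of the kind listed in Tables 10 and 11 serialize naturally to OWL/RDF-XML. The helpers below are a minimal sketch; the property names follow the paper's naming (HasTheme, PolarityValue), while the helper functions themselves are illustrative, not the paper's tooling.

```python
def owl_object_property(name, domain, range_):
    """Emit an RDF/XML owl:ObjectProperty declaration with domain and range."""
    return (f'<owl:ObjectProperty rdf:ID="{name}">\n'
            f'  <rdfs:domain rdf:resource="#{domain}"/>\n'
            f'  <rdfs:range rdf:resource="#{range_}"/>\n'
            f'</owl:ObjectProperty>')

def owl_data_property(name, domain, xsd_type):
    """Emit an RDF/XML owl:DatatypeProperty declaration typed by an XSD datatype."""
    return (f'<owl:DatatypeProperty rdf:ID="{name}">\n'
            f'  <rdfs:domain rdf:resource="#{domain}"/>\n'
            f'  <rdfs:range rdf:resource="http://www.w3.org/2001/XMLSchema#{xsd_type}"/>\n'
            f'</owl:DatatypeProperty>')

print(owl_object_property("HasTheme", "RumorType", "TextTheme"))
print(owl_data_property("PolarityValue", "EmotionCategory", "float"))
```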

Creating instances

Based on the defined core classes and properties, this paper creates instances according to the real rumor dataset. An example is shown in Table 12 .

This paper selects the online rumor “Lin Chi-ling was abused by her husband Kuroki Meisa, the tears of betrayal, the shadow of gambling, all shrouded her head. Even if she tried to divorce, she could not get a solution…” as an example and draws a structure diagram of the rumor domain ontology instance, as shown in Fig. 3. The instance shows the seven major features of the rumor text: text theme, text element, text style, emotion category, emotional appeal, rumor motive, and rumor credibility, together with the related subclass instances, laying a foundation for building a multi-source rumor domain knowledge graph.

figure 3

Schematic example of the rumor domain ontology.

Encoding ontology and visualization

Encoding ontology.

This paper uses the OWL language to encode the rumor domain ontology, so as to accurately describe entities, concepts, and their relationships and to facilitate knowledge reasoning and semantic understanding. Classes in the rumor domain ontology are represented by the OWL class construct “Class”, and the hierarchical relationship is represented by “subClassOf”. For example, the OWL code creating the rumor emotional characteristic class and its subclasses is shown in Fig. 4:

figure 4

Partial OWL codes of the rumor domain ontology.

The ontology is formalized and stored as an OWL code file, providing support for reasoning.
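In the spirit of the Fig. 4 excerpt, the class-and-subclass encoding can be generated with a small helper; the class names follow the Protégé naming used in this paper, and the generator itself is an illustrative sketch rather than the actual serialization pipeline.

```python
def owl_class(name, parent=None):
    """Emit an owl:Class declaration, as a subclass when a parent is given."""
    if parent is None:
        return f'<owl:Class rdf:ID="{name}"/>'
    return (f'<owl:Class rdf:ID="{name}">\n'
            f'  <rdfs:subClassOf rdf:resource="#{parent}"/>\n'
            f'</owl:Class>')

# the parent class of Fig. 4 and three of its subclasses
parts = [owl_class("RumorEmotionalFeatures")]
for sub in ("EmotionCategory", "EmotionalAppeal", "RumorMotive"):
    parts.append(owl_class(sub, parent="RumorEmotionalFeatures"))
print("\n".join(parts))
```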

Ontology visualization

This paper uses Protégé 5.5 to visualize the rumor domain ontology, showing the hierarchical structure and relationships of the ontology parent classes and their subclasses. Due to space limitations, only the parent class “RumorEmotionalFeatures” and its subclasses are shown, as in Fig. 5.

figure 5

Ontology parent class “RumorEmotionalFeatures” and its subclasses.

Ontology reasoning and validation

SWRL reasoning rule construction.

SWRL is an ontology-based rule language that can define Horn-like rules to enhance the reasoning and expressive ability of an ontology. This paper uses SWRL rules to handle the conflict relationships between classes and between classes and instances in the rumor domain ontology, and uses the Pellet reasoner to deeply mine the implicit semantic relationships between classes and instances, verifying the semantic parsing ability and consistency of the rumor domain ontology.

This paper summarizes the object-property features of various types of online rumors based on the self-built rumor dataset, maps the real rumor texts to the rumor domain ontology, constructs typical SWRL reasoning rules for judging 32 typical rumor types, as shown in Table 13, and imports them into the Protégé rule library, as shown in Fig. 6. Here x, n, e, z, i, t, v, l, etc. are instances of rumor type, text theme, emotion category, effect, institution, event, action, geographical position, etc. in the ontology; HasTheme, HasEmotion, HasElement, HasSource, HasMood, and HasSupport are object properties; polarity value is a data property.

figure 6

Partial SWRL rules for the rumor domain ontology.
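A Horn rule of the kind listed in Table 13 can be imitated conceptually by a toy forward check in plain Python; this only illustrates how the antecedent properties combine and is in no way a substitute for Pellet. The instance name r1 and the simplified antecedents are hypothetical.

```python
# (subject, property, object) triples parsed from a hypothetical rumor instance r1
facts = {
    ("r1", "HasTheme", "DiseasePreventionAndTreatment"),
    ("r1", "HasElement", "Product"),
    ("r1", "HasElement", "PositiveEffect"),
    ("r1", "HasEmotion", "PositiveEmotion"),
}

def rule_positive_prevention(rumor, facts):
    """Simplified analogue of one Horn rule: if the instance has the disease
    prevention and treatment theme, a product element, a positive-effect
    element, and a positive emotion, classify it as a positive prevention
    and treatment of disease rumor."""
    needed = (("HasTheme", "DiseasePreventionAndTreatment"),
              ("HasElement", "Product"),
              ("HasElement", "PositiveEffect"),
              ("HasEmotion", "PositiveEmotion"))
    return all((rumor, p, o) in facts for p, o in needed)

assert rule_positive_prevention("r1", facts)       # all antecedents hold
assert not rule_positive_prevention("r2", facts)   # unknown instance fails
```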

Implicit knowledge mining and verification based on pellet reasoner

This paper extracts corresponding instances from the rumor dataset, imports the rumor domain ontology and the SWRL rule descriptions into the Pellet reasoner in Protégé, performs implicit knowledge mining on the rumor domain ontology, judges the rumor type of each instance, and verifies the semantic parsing ability and consistency of the ontology.

Rumors about positive prevention and treatment of disease are mainly based on the theme of disease prevention and treatment. They usually contain a product to be sold (drugs, vaccines, equipment, etc.) and disease names, claim positive effects (prevention, cure, relief, etc.) on certain diseases or symptoms, and arouse positive emotions such as surprise and happiness among patients and their families, thereby achieving the purpose of selling products. Because the text and emotional features of this kind of rumor are relatively clear, this paper takes the rumor text “Hong Kong MDX Medical Group released the ‘DCV Cancer Vaccine’, which can prevent more than 12 kinds of cancers, including prostate cancer, breast cancer and lung cancer.” as an example to verify the semantic parsing ability of the rumor domain ontology. The parsing result of this instance is shown in Fig. 7: the text theme is cancer prevention under disease prevention and treatment; the text style is the plain narrative style; the text elements include product (DCV cancer vaccine), positive effect (prevention), and disease names (prostate cancer, breast cancer, lung cancer); the emotion category is positive emotion; the emotional appeals are joy, love, and surprise; the motive for releasing the rumor is profit-driven product selling; the information source is the Hong Kong MDX Medical Group; and pictures and celebrity endorsements are used as the testimony method. The Pellet reasoner is then applied to the parsed instance with the SWRL rules, and it mines out the specific rumor type of this instance: positive prevention and treatment of disease rumor. Similar instance parsing and reasoning verification were conducted for other types of rumor texts, and the results show that the ontology has high consistency and reliability.

figure 7

Implicit relationship between rumor instance parsing results and pellet reasoner mining.

Comparison and evaluation of ontology performance

In this paper, the constructed ontology is compared with representative rumor index systems in the field. Four experts were invited to make a comprehensive evaluation based on a self-built index system70,71,72, assessing performance on the indicators of reliability, coverage, and operability. According to the ranking order given by the experts, 1-4 points are assigned, with first place on each indicator receiving four points. The average value given by the experts is taken as the single-indicator score of each subject, and the sum of its indicator scores as the subject's final score.

As can be seen from Table 14, the rumor domain ontology constructed in this paper builds its term set in three ways: reusing existing ontologies, extracting content features from the core literature, and discovering new concepts from the real rumor dataset; its structure has been verified by SWRL rule reasoning with the Pellet reasoner, giving it high reliability. The ontology covers six kinds of Chinese online rumors, including the grammatical, semantic, pragmatic, and social characteristics of rumor text features, emotional characteristics, rumor credibility, and social context, giving it high coverage. It is encoded according to the OWL language specification and displayed visually in Protégé, which facilitates further extension and reuse by scholars and gives it high operability.

The TFI domain ontology construction method proposed in this paper comprises a terminology layer, a framework layer, and an instance layer. Compared with traditional methods, the terminology layer is built with a three-dimensional data set construction method: top-level ontologies and related core literature are surveyed, reusable top-level ontologies are mapped from the top down, and the rumor content features studied in the existing literature are extracted as concepts. Based on mainstream Chinese internet rumor-debunking websites, an authoritative real rumor data set is established, and a new word discovery algorithm combining N-gram statistics with RoBERTa-Kmeans clustering automatically discovers new domain concepts from the bottom up, so the terminology set of the domain ontology is determined more comprehensively and efficiently. In the framework layer, the parent classes of rumor content features are selected from the clustering of the content features of the core literature, namely rumor text features, emotional features, credibility features, and social background features; the rumor category features are defined from the emotional features and the entity categories of the real rumor data set. The subcategories of rumor content features combine the concepts of the three-dimensional term set with the concept distribution of the real rumor data set, defining subcategory concepts and hierarchical relationships close to real needs and realizing fine-grained hierarchical modeling of the relationships among multi-domain network rumor content features.
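The bottom-up new-word discovery step can be illustrated with a minimal word-level N-gram frequency filter in pure Python (an illustrative sketch only: the example documents are invented, and the paper's actual pipeline works on Chinese text and additionally clusters RoBERTa embeddings of the candidates with K-means):

```python
from collections import Counter

def ngram_candidates(docs, n=2, min_freq=2):
    """Count word n-grams across documents and keep those that recur
    at least min_freq times as candidate domain terms.  This is only
    the frequency filter; the full pipeline would also score cohesion
    and cluster embeddings of the surviving candidates."""
    counts = Counter()
    for tokens in docs:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return [" ".join(g) for g, c in counts.most_common() if c >= min_freq]

docs = [
    "hong kong mdx group released dcv cancer vaccine".split(),
    "dcv cancer vaccine can prevent cancer".split(),
]
print(ngram_candidates(docs, n=3, min_freq=2))
```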
In the instance layer, the OWL language is used to encode the rumor domain ontology, and the SWRL rule language and the Pellet reasoner are used to handle conflicts, mine tacit knowledge, and judge the fine-grained categories of rumor texts, realizing an effective quality evaluation of the ontology. As a result, the rumor domain ontology constructed in this paper has high consistency and reliability and can effectively parse and reason over different types of rumor texts, which enriches the knowledge system of this field and lays a solid foundation for subsequent credible rumor detection and governance.

However, this study has the following limitations and deficiencies:

(1) The rumor domain ontology constructed in this paper considers only content characteristics, not user characteristics and communication characteristics. User and communication characteristics are important factors in the emergence and spread of online rumors and can be used to analyze the motivation and influence of rumors. Because these factors are not included in the rumor feature system, the expressive and reasoning ability of the ontology may be limited, and the complexity and multidimensional nature of online rumors may not be fully reflected.

(2) In this paper, mainstream Chinese internet rumor-debunking websites are taken as the data source for ontology instantiation. The data covers five rumor categories: political and military, disease prevention, social life, science and technology, and nutrition and health, so the data range is limited. Moreover, these sources are mainly official or authoritative rumor-debunking websites, whose data volume and update frequency may not be sufficient to reflect the diversity and variability of online rumors, so the timeliness and comprehensiveness of the rumor data cannot be fully guaranteed.

(3) The SWRL reasoning rules used in this paper are written manually, so they may not cover all reasoning scenarios, and the degree of automation needs to be improved. The Pellet engine used here is an OWL-DL-based reasoner, which may face computational complexity problems and lacks more advanced reasoning capabilities.

The following aspects can be considered for optimization and improvement in the future:

(1) Future work will introduce user characteristics into the rumor ontology to analyze the factors behind producing and accepting rumors, such as social attributes, psychological state, knowledge level, beliefs and attitudes, and behavioral intentions. It will also introduce communication characteristics to analyze the propagation dynamics of each type of rumor, such as propagation path, speed, range, period, and effect. Incorporating these factors into the rumor feature system will increase the breadth and depth of the rumor domain ontology and provide more credible clues and grounds for the detection, intervention, and prevention of rumors.

(2) Future work will expand the data sources, collecting original rumor data directly from social media, news media, authoritative rumor-debunking institutions, and other channels to build a rumor data set with comprehensive types, diverse expressions, and rich features. The latest rumor data will be crawled from these sources regularly to keep the data set current. This will strengthen the expressive ability of the ontology's instance layer and provide full data support and verification for the effective application of the ontology.

(3) Future work will introduce large language models such as GPT, LLaMA, and ChatGLM, and explore algorithms and techniques for automatically generating ontology inference rules based on the rumor ontology and dynamic prompts, so as to realize more effective and intelligent ontology evaluation and complex reasoning.

This paper proposed a TFI method for constructing a network rumor domain ontology. Based on the concept distribution of the three-dimensional term set and the real rumor data set, the main features of network rumors were defined, namely text features, emotional features, credibility features, social background features, and category features, and the relationships among these multi-domain features were modeled in a fine-grained hierarchy of five parent classes and 88 subcategories. At the instance layer, 32 judgment and reasoning rules for typical rumor categories were constructed, and the SWRL rule language and Pellet reasoner were applied for conflict handling and tacit knowledge mining, realizing semantic parsing and reasoning over rumor text content. This demonstrates the ontology's effectiveness in handling the complex, fuzzy, and uncertain information in online rumors and provides a new perspective and tool for their interpretable analysis and processing.

Data availability

The datasets generated during the current study are available from the corresponding author upon reasonable request.

Jiang, S. The production scene and content characteristics of scientific rumors. Youth J. https://doi.org/10.15997/j.cnki.qnjz.2020.33.011 (2020).

Jin, X. & Zhao, Y. Analysis of internet rumors from the perspective of co-governance—Practice of rumor governance on wechat platform. News and Writing. 6 , 41–44 (2017).

Bai, S. Research on the causes and countermeasures of internet rumors. Press https://doi.org/10.15897/j.cnki.cn51-1046/g2.2010.04.035 (2010).

Garg, S. & Sharma, D. K. Linguistic features based framework for automatic fake news detection. Comput. Ind. Eng. 172 , 108432 (2022).

Zhao, J., Fu, C. & Kang, X. Content characteristics predict the putative authenticity of COVID-19 rumors. Front. Public Health 10 , 920103 (2022).

Zhang, Z., Shu, K. & He, L. The theme and characteristics of wechat rumors. News and Writing. 1 , 60–64 (2016).

Li, B. & Yu, G. Research on the discourse space and communication field of internet rumors in the post-truth era—Based on the analysis of 4160 rumors in wechat circle of friends. Journalism Research. 2 , 103–112 (2018).

Yu, G. Text structure and expression characteristics of internet rumors—Analysis of 6000+ rumors based on tencent big data screening and identification. News and Writing. 2 , 53–59 (2018).

Mourão, R. R. & Robertson, C. T. Fake news as discursive integration: An analysis of sites that publish false, misleading, hyperpartisan and sensational information. J. Stud. 20 , 2077–2095 (2019).

Zhou, G. Analysis on the content characteristics and strategies of epidemic rumors—Based on Sina’s “novel coronavirus epidemic rumors list”. Sci. Popul. https://doi.org/10.19293/j.cnki.1673-8357.2021.05.002 (2021).

Huang, Y. An analysis of the internal logic and methods of rumor “confirmation”—An empirical study based on 60 rumors spread on wechat. J. Party Sch. Tianjin Munic. Comm. CPC 20 , 7 (2018).

Butt, S. et al . What goes on inside rumour and non-rumour tweets and their reactions: A psycholinguistic analyses. Comput. Hum. Behav. 135 , 107345 (2022).

Zhou, L., Tao, J. & Zhang, D. Does fake news in different languages tell the same story? An analysis of multi-level thematic and emotional characteristics of news about COVID-19. Inf. Syst. Front. 25 , 493–512. https://doi.org/10.1007/s10796-022-10329-7 (2023).

Tan, L. et al . Research status of deep learning methods for rumor detection. Multimed. Tools Appl. 82 , 2941–2982 (2023).

Damstra, A. et al. What does fake look like? A review of the literature on intentional deception in the news and on social media. J. Stud. 22 , 1947–1963. https://doi.org/10.1080/1461670X.2021.1979423 (2021).

Lai, S. & Tang, X. Research on the influence of information emotionality on the spread of online rumors. J. Inf. 35 , 116–121 (2016).

Yuan, H. & Xie, Y. Research on the rumor maker of internet rumors about public events—Based on the content analysis of 118 influential Internet rumors about public events. Journalist https://doi.org/10.16057/j.cnki.31-1171/g2.2015.05.008 (2015).

Ruan, Z. & Yin, L. Types and discourse focus of weibo rumors—Based on the content analysis of 307 weibo rumors. Contemporary Communication. 4 , 77–78+84 (2014).

Zhang, W. & Zhu, Q. Research on the Construction Method of Domain Ontology. Books and Information. 5 , 16–19+40 (2011).

Tham, K.D., Fox, M.S. & Gruninger, M. A cost ontology for enterprise modelling. In Proceedings of 3rd IEEE Workshop on Enabling Technologies: Infrastructure for Collaborative Enterprises. IEEE , 197–210. https://doi.org/10.1109/ENABL.1994.330502 (1994).

Uschold, M. & Gruninger, M. Ontologies: Principles, methods and applications. Knowl. Eng. Rev. 11 , 93–136 (1996).

Menzel, C. P., Mayer, R. J. & Painter, M. K. IDEF5 ontology description capture method: Concepts and formal foundations (Armstrong Laboratory, Air Force Materiel Command, Wright-Patterson Air Force, 1992).

Song, Z., Zhu, F. & Zhang, D. Research on air and missile defense domain ontology development based on IDEF5 and OWL. Journal of Projectiles, Rockets, Missiles and Guidance. 30, 176–178 (2010).

Fernández-López, M., Gómez-Pérez, A. & Juristo, N. Methontology: From ontological art towards ontological engineering. AAAI-97 Spring Symposium Series . https://oa.upm.es/5484/ (1997).

Sawsaa, A. & Lu, J. Building information science ontology (OIS) with methontology and protégé. J. Internet Technol. Secur. Trans. 1 , 100–109 (2012).

Yue, L. & Liu, W. Comparative study on the construction methods of domain ontology at home and abroad. Inf. Stud. Theory Appl. 39 , 119–125. https://doi.org/10.16353/j.cnki.1000-7490.2016.08.024 (2016).

Noy, N.F. & McGuinness, D.L. Ontology development 101: A guide to creating your first ontology. Stanford knowledge systems laboratory technical report. KSL-01–05 (2001).

Luo, Y. et al . vim: Research on OWL-based vocabulary ontology construction method for units of measurement. Electronics 12 , 3783 (2023).

Al-Aswadi, F. N., Chan, H. Y. & Gan, K. H. Automatic ontology construction from text: A review from shallow to deep learning trend. Artif. Intell. Rev. 53 , 3901–3928 (2020).

Chen, X. & Mao, T. Ontology construction of documentary heritage—Taking China archives documentary heritage list as an example. Libr. Trib. 43 , 120–131 (2023).

Zhao, X. & Li, T. Research on the ontology construction of archives oriented to digital humanism—Taking Wanli tea ceremony archives as an example. Inf. Stud. Theory Appl. 45 , 154–161. https://doi.org/10.16353/j.cnki.1000-7490.2022.08.021 (2022).

Huang, X. et al . Construction of special knowledge base of government website pages based on domain ontology—Taking “COVID-19 vaccine science popularization” as an example. Libr. Inf. Serv. 66 , 35–46. https://doi.org/10.13266/j.issn.0252-3116.2022.17.004 (2022).

Jindal, R., Seeja, K. & Jain, S. Construction of domain ontology utilizing formal concept analysis and social media analytics. Int. J. Cogn. Comput. Eng. 1 , 62–69 (2020).

Ran, J. et al . Research on ontology construction of idioms and allusions based on OWL. Comput. Technol. Dev. 20 , 63–66 (2010).

Li, L. et al . Research on business process modeling of army equipment maintenance support based on IDEF5. Technol. Innov. Appl. 11 , 80–82 (2021).

Song, Z. et al . Ontology modeling of air defense and anti-missile operation process based on IDEF5/OWL. J. Missiles Guid. 30 , 176–178 (2010).

Li, A., Xu, Y. & Chi, Y. Summary of ontology construction and application. Inf. Stud. Theory Appl 46 , 189–195. https://doi.org/10.16353/j.cnki.1000-7490.2023.11.024 (2023).

Yang, J., Song, C. & Jin, L. Ontology construction of emergency plan based on methontology method. J. Saf. Environ. 18 , 1427–1431. https://doi.org/10.13637/j.issn.1009-6094.2018.04.033 (2018).

Duan, L. & Li, H. Ontology modeling method of high-resolution image rural residential area supported by OIA technology. Modern Agricultural Science and Technology. 2 , 338–340 (2016).

Chen, Y. & Jiang, H. Construction of fire inspection knowledge map based on GIS geospatial relationship. J. Subtrop. Resour. Environ. 18 , 109–118. https://doi.org/10.19687/j.cnki.1673-7105.2023.03.014 (2023).

Zhu, L. et al. Construction of TCM asthma domain ontology. Chin. J. Exp. Tradit. Med. Formulae 23 , 222–226. https://doi.org/10.13422/j.cnki.syfjx.2017150222 (2017).

Li, H. et al . Domain ontology construction and relational reasoning. J. Inf. Eng. Univ. 24 , 321–327 (2023).

Zhang, Y. et al. Construction of ontology of stroke nursing field based on corpus. Chin. Nurs. Res. 36 , 4186–4190 (2022).

Wu, M. et al. Ontology construction of natural gas market knowledge map. Pet. New Energy 34 , 71–76 (2022).

Li, X. et al . Research on ontology construction based on thesaurus and its semantic relationship. Inf. Sci. 36 , 83–87 (2018).

Chen, Q. et al . Construction of knowledge ontology of clinical trial literature of traditional Chinese medicine. Chin. J. Exp. Tradit. Med. Formulae 29 , 190–197. https://doi.org/10.13422/j.cnki.syfjx.20231115 (2023).

Xiao, Y. et al. Construction and application of novel coronavirus domain ontology. Mil. Med. 46 , 263–268 (2022).

Su, N. et al . Automatic construction method of domain-limited ontology. Lifting the Transport Machinery. 8 , 49–57 (2023).

Zheng, S. et al . Ontology construction method for user-generated content. Inf. Sci. 37 , 43–47. https://doi.org/10.13833/j.issn.1007-7634.2019.11.007 (2019).

Dong, J., Wang, J. & Wang, Z. Ontology automatic construction method for human-machine-object ternary data fusion in manufacturing field. Control Decis. 37 , 1251–1257. https://doi.org/10.13195/j.kzyjc.2020.1298 (2022).

Zhu, L., Hua, G. & Gao, W. Mapping ontology vertices to a line using hypergraph framework. Int. J. Cogn. Comput. Eng. 1 , 1–8 (2020).

Zhai, Y. & Wang, F. Research on the construction method of Chinese domain ontology based on text mining. Inf. Sci. 33 , 3–10. https://doi.org/10.13833/j.cnki.is.2015.06.001 (2015).

Duan, Z. Generation mechanism of internet rumors and countermeasures. Guizhou Soc. Sci. https://doi.org/10.13713/j.cnki.cssci.2016.04.014 (2016).

Du, Z. & Zhi, S. The harm and governance of network political rumors. Academic Journal of Zhongzhou. 4 , 161–165 (2019).

Song, X. et al . Research on influencing factors of health rumor sharing willingness based on MOA theory. J. China Soc. Sci. Tech. Inf. 39 , 511–520 (2020).

Jiang, S. Research on the characteristics, causes and countermeasures of social rumors dissemination in china in recent years. Red Flag Manuscript . 16 , 4 (2011).

Huang, J., Wang, G. & Zhong, S. Research on the propagation law and function mode of sci-tech rumors. Journal of Information. 34 , 156–160 (2015).

Liu, Y. et al . A survey of rumor recognition in social media. Chin. J. Comput. 41 , 1536–1558 (2018).

Wei, D. et al. Public emotions and rumors spread during the covid-19 epidemic in China: Web-based correlation study. J. Med. Internet Res. 22 , e21933 (2020).

Runxi, Z. & Di, Z. A model and simulation of the emotional contagion of netizens in the process of rumor refutation. Sci. Rep. https://doi.org/10.1038/s41598-019-50770-4 (2019).

Tang, X. & Lai, S. Research on the forwarding of network health rumors in public health security incidents—Interaction between perceived risk and information credibility. J. Inf. 40 , 101–107 (2021).

Nicolas, P., Dominik, B. & Stefan, F. Emotions in online rumor diffusion. EPJ Data Sci. https://doi.org/10.1140/epjds/s13688-021-00307-5 (2021).

Deng, G. & Tang, G. Research on the spread of network rumors and its social impact. Seeker https://doi.org/10.16059/j.cnki.cn43-1008/c.2005.10.031 (2005).

Ji, Y. Research on the communication motivation of wechat rumors. Youth J. https://doi.org/10.15997/j.cnki.qnjz.2019.17.006 (2019).

Yuan, G. Analysis on the causes and motives of internet rumors in emergencies—Taking social media as an example. Media. 21 , 80–83 (2016).

Zhao, N., Li, Y. & Zhang, J. A review of the research on influencing factors and motivation mechanism of rumor spread. J. Psychol. Sci. 36 , 965–970. https://doi.org/10.16719/j.cnki.1671-6981.2013.04.015 (2013).

Hu, H. On the formation mechanism of social rumors from the perspective of “rumors and salt storm”. J. Henan Univ. 52 , 63–68 (2012).

Yue, Y. et al. Trust in government buffers the negative effect of rumor exposure on people’s emotions. Curr. Psychol. 42 , 23917–23930 (2023).

Wang, C. & Hou, X. Analysis of rumor discourse in major emergencies. J. Commun. 19 , 34–38 (2012).

Xu, L. Research progress of ontology evaluation. J. China Soc. Scie. Tech. Inf. 35 , 772–784 (2016).

Lantow, B. & Sandkuhl, K. An analysis of applicability using quality metrics for ontologies on ontology design patterns. Intell. Syst. Acc. Financ. Manag. 22 , 81–99 (2015).

Pak, J. & Zhou, L. A framework for ontology evaluation. In Exploring the Grand Challenges for Next Generation E-Business: 8th Workshop on E-Business, WEB 2009, Phoenix, AZ, USA, December 15, 2009, Revised Selected Papers 8, 10–18. https://doi.org/10.1007/978-3-642-17449-0_2 (Springer Berlin Heidelberg, 2011).

Acknowledgements

This study was financially supported by Xi'an Major Scientific and Technological Achievements Transformation and Industrialization Project (20KYPT0003-10).

This work was supported by Xi’an Municipal Bureau of Science and Technology, 20KYPT0003-10.

Author information

Authors and affiliations.

School of Economics and Management, Xidian University, 266 Xifeng Road, Xi’an, 710071, China

Jianbo Zhao, Huailiang Liu, Weili Zhang, Tong Sun, Qiuyi Chen, Yan Zhuang, Xiaojin Zhang & Shanzhuang Zhang

School of Artificial Intelligence, Xidian University, 266 Xifeng Road, Xi’an, 710071, China

Yuehai Wang, Jiale Cheng & Ruiyu Ding

School of Telecommunications Engineering, Xidian University, 266 Xifeng Road, Xi’an, 710071, China


Contributions

H.L. formulated the overall research strategy and guided the work. J.Z. kept the original data on which the paper was based and verified that the charts and conclusions accurately reflected the collected data. J.Z., W.Z. and T.S. wrote the main manuscript text. W.Z., Y.W. and Q.C. collected and organized the data. J.C., Y.Z. and X.Z. prepared Figs. 1–7; S.Z., B.L. and R.D. prepared Tables 1–14. All authors reviewed the manuscript.

Corresponding author

Correspondence to Jianbo Zhao .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Zhao, J., Liu, H., Zhang, W. et al. Research on domain ontology construction based on the content features of online rumors. Sci Rep 14 , 12134 (2024). https://doi.org/10.1038/s41598-024-62459-4


Received : 07 December 2023

Accepted : 16 May 2024

Published : 27 May 2024

DOI : https://doi.org/10.1038/s41598-024-62459-4


  • Rumor content features
  • Domain ontology
  • Top-level ontology reuse
  • New concept discovery
