Journal of Statistical Distributions and Applications


A generalization to the log-inverse Weibull distribution and its applications in cancer research

In this paper we consider a generalization of a log-transformed version of the inverse Weibull distribution. Several theoretical properties of the distribution are studied in detail including expressions for i...


Approximations of conditional probability density functions in Lebesgue spaces via mixture of experts models

Mixture of experts (MoE) models are widely applied for conditional probability density estimation problems. We demonstrate the richness of the class of MoE models by proving denseness results in Lebesgue space...

Structural properties of generalised Planck distributions

A family of generalised Planck (GP) laws is defined and its structural properties explored. Sometimes subject to parameter restrictions, a GP law is a randomly scaled gamma law; it arises as the equilibrium la...

New class of Lindley distributions: properties and applications

A new generalized class of Lindley distribution is introduced in this paper. This new class is called the T-Lindley{Y} class of distributions, and it is generated by using the quantile functions of uniform, expon...

Tolerance intervals in statistical software and robustness under model misspecification

A tolerance interval is a statistical interval that covers at least 100ρ% of the population of interest with 100(1−α)% confidence, where ρ and α are pre-specified values in (0, 1). In many scientific fields, su...

Combining assumptions and graphical network into gene expression data analysis

Analyzing gene expression data rigorously requires taking assumptions into consideration but also relies on using information about network relations that exist among genes. Combining these different elements ...

A comparison of zero-inflated and hurdle models for modeling zero-inflated count data

Counts data with excessive zeros are frequently encountered in practice. For example, the number of health services visits often includes many zeros representing the patients with no utilization during a follo...

A general stochastic model for bivariate episodes driven by a gamma sequence

We propose a new stochastic model describing the joint distribution of (X, N), where N is a counting variable while X is the sum of N independent gamma random variables. We present the main properties of this gene...

A flexible multivariate model for high-dimensional correlated count data

We propose a flexible multivariate stochastic model for over-dispersed count data. Our methodology is built upon mixed Poisson random vectors (Y1,…,Yd), where the {Yi} are conditionally independent Poisson random...

Generalized fiducial inference on the mean of zero-inflated Poisson and Poisson hurdle models

Zero-inflated and hurdle models are widely applied to count data possessing excess zeros, where they can simultaneously model the process from how the zeros were generated and potentially help mitigate the eff...

Multivariate distributions of correlated binary variables generated by pair-copulas

Correlated binary data are prevalent in a wide range of scientific disciplines, including healthcare and medicine. The generalized estimating equations (GEEs) and the multivariate probit (MP) model are two of ...

On two extensions of the canonical Feller–Spitzer distribution

We introduce two extensions of the canonical Feller–Spitzer distribution from the class of Bessel densities, which comprise two distinct stochastically decreasing one-parameter families of positive absolutely ...

A new trivariate model for stochastic episodes

We study the joint distribution of stochastic events described by (X, Y, N), where N has a 1-inflated (or deflated) geometric distribution and X, Y are the sum and the maximum of N exponential random variables. Mod...

A flexible univariate moving average time-series model for dispersed count data

Al-Osh and Alzaid ( 1988 ) consider a Poisson moving average (PMA) model to describe the relation among integer-valued time series data; this model, however, is constrained by the underlying equi-dispersion assumpt...

Spatio-temporal analysis of flood data from South Carolina

To investigate the relationship between flood gage height and precipitation in South Carolina from 2012 to 2016, we built a conditional autoregressive (CAR) model using a Bayesian hierarchical framework. This ...

Affine-transformation invariant clustering models

We develop a cluster process which is invariant with respect to unknown affine transformations of the feature space without knowing the number of clusters in advance. Specifically, our proposed method can iden...

Distributions associated with simultaneous multiple hypothesis testing

We develop the distribution for the number of hypotheses found to be statistically significant using the rule from Simes (Biometrika 73: 751–754, 1986) for controlling the family-wise error rate (FWER). We fin...

New families of bivariate copulas via unit weibull distortion

This paper introduces a new family of bivariate copulas constructed using a unit Weibull distortion. Existing copulas play the role of the base or initial copulas that are transformed or distorted into a new f...

Generalized logistic distribution and its regression model

A new generalized asymmetric logistic distribution is defined. In some cases, existing three parameter distributions provide poor fit to heavy tailed data sets. The proposed new distribution consists of only t...

The spherical-Dirichlet distribution

Today, data mining and gene expressions are at the forefront of modern data analysis. Here we introduce a novel probability distribution that is applicable in these fields. This paper develops the proposed sph...

Item fit statistics for Rasch analysis: can we trust them?

To compare fit statistics for the Rasch model based on estimates of unconditional or conditional response probabilities.

Exact distributions of statistics for making inferences on mixed models under the default covariance structure

At this juncture when mixed models are heavily employed in applications ranging from clinical research to business analytics, the purpose of this article is to extend the exact distributional result of Wald (A...

A new discrete pareto type (IV) model: theory, properties and applications

Discrete analogue of a continuous distribution (especially in the univariate domain) is not new in the literature. The work of discretizing continuous distributions began with the paper by Nakagawa and Osaki (197...

Density deconvolution for generalized skew-symmetric distributions

The density deconvolution problem is considered for random variables assumed to belong to the generalized skew-symmetric (GSS) family of distributions. The approach is semiparametric in that the symmetric comp...

The unifed distribution

We introduce a new distribution with support on (0,1) called unifed. It can be used as the response distribution for a GLM and it is suitable for data aggregation. We make a comparison to the beta regression. ...

On Burr III Marshal Olkin family: development, properties, characterizations and applications

In this paper, a flexible family of distributions with unimodal, bimodal, increasing, increasing and decreasing, inverted bathtub and modified bathtub hazard rate called Burr III-Marshal Olkin-G (BIIIMO-G) fam...

The linearly decreasing stress Weibull (LDSWeibull): a new Weibull-like distribution

Motivated by an engineering pullout test applied to a steel strip embedded in earth, we show how the resulting linearly decreasing force leads naturally to a new distribution, if the force under constant stress i...

Meta analysis of binary data with excessive zeros in two-arm trials

We present a novel Bayesian approach to random effects meta analysis of binary data with excessive zeros in two-arm trials. We discuss the development of likelihood accounting for excessive zeros, the prior, a...

On (p1,…,pk)-spherical distributions

The class of (p1,…,pk)-spherical probability laws and a method of simulating random vectors following such distributions are introduced using a new stochastic vector representation. A dynamic geometric disintegra...

A new class of survival distribution for degradation processes subject to shocks

Many systems experience gradual degradation while simultaneously being exposed to a stream of random shocks of varying magnitudes that eventually cause failure when a shock exceeds the residual strength of the...

A new extended normal regression model: simulations and applications

Various applications in natural science require models more accurate than well-known distributions. In this context, several generators of distributions have been recently proposed. We introduce a new four-par...

Multiclass analysis and prediction with network structured covariates

Technological advances associated with data acquisition are leading to the production of complex structured data sets. The recent development on classification with multiclass responses makes it possible to in...

High-dimensional star-shaped distributions

Stochastic representations of star-shaped distributed random vectors having heavy or light tail density generating function g are studied for increasing dimensions along with corresponding geometric measure repre...

A unified complex noncentral Wishart type distribution inspired by massive MIMO systems

The eigenvalue distributions from a complex noncentral Wishart matrix S = X^H X have been the subject of interest in various real-world applications, where X is assumed to be complex matrix variate normally distribute...

Particle swarm based algorithms for finding locally and Bayesian D-optimal designs

When a model-based approach is appropriate, an optimal design can guide how to collect data judiciously for making reliable inference at minimal cost. However, finding optimal designs for a statistical model w...

Admissible Bernoulli correlations

A multivariate symmetric Bernoulli distribution has marginals that are uniform over the pair {0,1}. Consider the problem of sampling from this distribution given a prescribed correlation between each pair of v...

On p-generalized elliptical random processes

We introduce rank-k-continuous axis-aligned p-generalized elliptically contoured distributions and study their properties such as stochastic representations, moments, and density-like representations. Applying th...

Parameters of stochastic models for electroencephalogram data as biomarkers for child’s neurodevelopment after cerebral malaria

The objective of this study was to test statistical features from the electroencephalogram (EEG) recordings as predictors of neurodevelopment and cognition of Ugandan children after coma due to cerebral malari...

A new generalization of generalized half-normal distribution: properties and regression models

In this paper, a new extension of the generalized half-normal distribution is introduced and studied. We assess the performance of the maximum likelihood estimators of the parameters of the new distribution vi...

Analytical properties of generalized Gaussian distributions

The family of Generalized Gaussian (GG) distributions has received considerable attention from the engineering community, due to the flexible parametric form of its probability density function, in modeling ma...

A new Weibull- X family of distributions: properties, characterizations and applications

We propose a new family of univariate distributions generated from the Weibull random variable, called a new Weibull-X family of distributions. Two special sub-models of the proposed family are presented and t...

The transmuted geometric-quadratic hazard rate distribution: development, properties, characterizations and applications

We propose a five parameter transmuted geometric quadratic hazard rate (TG-QHR) distribution derived from mixture of quadratic hazard rate (QHR), geometric and transmuted distributions via the application of t...

A nonparametric approach for quantile regression

Quantile regression estimates conditional quantiles and has wide applications in the real world. Estimating high conditional quantiles is an important problem. The regular quantile regression (QR) method often...

Mean and variance of ratios of proportions from categories of a multinomial distribution

Ratio distribution is a probability distribution representing the ratio of two random variables, each usually having a known distribution. Currently, there are results when the random variables in the ratio fo...

The power-Cauchy negative-binomial: properties and regression

We propose and study a new compounded model to extend the half-Cauchy and power-Cauchy distributions, which offers more flexibility in modeling lifetime data. The proposed model is analytically tractable and c...

Families of distributions arising from the quantile of generalized lambda distribution

In this paper, the class of T-R{generalized lambda} families of distributions based on the quantile of generalized lambda distribution has been proposed using the T-R{Y} framework. In the development of the T-R{

Risk ratios and Scanlan’s HRX

Risk ratios are distribution function tail ratios and are widely used in health disparities research. Let A and D denote advantaged and disadvantaged populations with cdfs F ...

Joint distribution of k -tuple statistics in zero-one sequences of Markov-dependent trials

We consider a sequence of n, n ≥ 3, zero (0) - one (1) Markov-dependent trials. We focus on k-tuples of 1s; i.e. runs of 1s of length at least equal to a fixed integer number k, 1 ≤ k ≤ n. The statistics denoting the n...

Quantile regression for overdispersed count data: a hierarchical method

Generalized Poisson regression is commonly applied to overdispersed count data, and focused on modelling the conditional mean of the response. However, conditional mean regression models may be sensitive to re...

Describing the Flexibility of the Generalized Gamma and Related Distributions

The generalized gamma (GG) distribution is a widely used, flexible tool for parametric survival analysis. Many alternatives and extensions to this family have been proposed. This paper characterizes the flexib...

  • ISSN: 2195-5832 (electronic)



Statistics articles within Scientific Reports

Article 04 May 2024 | Open Access

Estimating neutrosophic finite median employing robust measures of the auxiliary variable

  • Saadia Masood
  • , Bareera Ibrar
  •  &  Zabihullah Movaheedi

Article 01 May 2024 | Open Access

Zika emergence, persistence, and transmission rate in Colombia: a nationwide application of a space-time Markov switching model

  • Laís Picinini Freitas
  • , Dirk Douwes-Schultz
  •  &  Kate Zinszer

Article 29 April 2024 | Open Access

Exploring drivers of overnight stays and same-day visits in the tourism sector

  • Francesco Scotti
  • , Andrea Flori
  •  &  Giovanni Azzone

A support vector machine based drought index for regional drought analysis

  • Mohammed A Alshahrani
  • , Muhammad Laiq
  •  &  Muhammad Nabi

Article 25 April 2024 | Open Access

Joint Bayesian estimation of cell dependence and gene associations in spatially resolved transcriptomic data

  • Arhit Chakrabarti
  •  &  Bani K. Mallick

Estimating SARS-CoV-2 infection probabilities with serological data and a Bayesian mixture model

  • Benjamin Glemain
  • , Xavier de Lamballerie
  •  &  Fabrice Carrat

Article 24 April 2024 | Open Access

Applications of nature-inspired metaheuristic algorithms for tackling optimization problems across disciplines

  • Elvis Han Cui
  • , Zizhao Zhang
  •  &  Weng Kee Wong

Article 23 April 2024 | Open Access

Variable parameters memory-type control charts for simultaneous monitoring of the mean and variability of multivariate multiple linear regression profiles

  • Hamed Sabahno
  •  &  Marie Eriksson

Article 22 April 2024 | Open Access

Modeling health and well-being measures using ZIP code spatial neighborhood patterns

  • , Michael LaValley
  •  &  Shariq Mohammed

Article 20 April 2024 | Open Access

Sequence based model using deep neural network and hybrid features for identification of 5-hydroxymethylcytosine modification

  • Salman Khan
  • , Islam Uddin
  •  &  Dost Muhammad Khan

Article 19 April 2024 | Open Access

Identification of CT radiomic features robust to acquisition and segmentation variations for improved prediction of radiotherapy-treated lung cancer patient recurrence

  • Thomas Louis
  • , François Lucia
  •  &  Roland Hustinx

Explainable prediction of node labels in multilayer networks: a case study of turnover prediction in organizations

  • László Gadár
  •  &  János Abonyi

Article 18 April 2024 | Open Access

The quasi-xgamma frailty model with survival analysis under heterogeneity problem, validation testing, and risk analysis for emergency care data

  • Hamami Loubna
  • , Hafida Goual
  •  &  Haitham M. Yousof

Memory type Bayesian adaptive max-EWMA control chart for weibull processes

  • Abdullah A. Zaagan
  • , Imad Khan
  •  &  Bakhtiyar Ahmad

Article 17 April 2024 | Open Access

Improved data quality and statistical power of trial-level event-related potentials with Bayesian random-shift Gaussian processes

  • Dustin Pluta
  • , Beniamino Hadj-Amar
  •  &  Marina Vannucci

Article 16 April 2024 | Open Access

Comparison and evaluation of overcoring and hydraulic fracturing stress measurements

  • , Meifeng Cai
  •  &  Mostafa Gorjian

Predictors of divorce and duration of marriage among first marriage women in Dejne administrative town

  • Nigusie Gashaye Shita
  •  &  Liknaw Bewket Zeleke

Article 12 April 2024 | Open Access

Determinants of multimodal fake review generation in China’s E-commerce platforms

  • Chunnian Liu
  •  &  Lan Yi

Article 11 April 2024 | Open Access

New ridge parameter estimators for the quasi-Poisson ridge regression model

  • Aamir Shahzad
  • , Muhammad Amin
  •  &  Muhammad Faisal

A bicoherence approach to analyze multi-dimensional cross-frequency coupling in EEG/MEG data

  • Alessio Basti
  • , Guido Nolte
  •  &  Laura Marzetti

Article 10 April 2024 | Open Access

Response times are affected by mispredictions in a stochastic game

  • Paulo Roberto Cabral-Passos
  • , Antonio Galves
  •  &  Claudia D. Vargas

The effect of city reputation on Chinese corporate risk-taking

  •  &  Haifeng Jiang

Article 06 April 2024 | Open Access

Improvement in variance estimation using transformed auxiliary variable under simple random sampling

  • , Syed Muhammad Asim
  •  &  Soofia Iftikhar

Article 28 March 2024 | Open Access

Fatty liver classification via risk controlled neural networks trained on grouped ultrasound image data

  • Tso-Jung Yen
  • , Chih-Ting Yang
  •  &  Hsin-Chou Yang

Article 27 March 2024 | Open Access

A new unit distribution: properties, estimation, and regression analysis

  • Kadir Karakaya
  • , C. S. Rajitha
  •  &  Ahmed M. Gemeay

Article 26 March 2024 | Open Access

GeneAI 3.0: powerful, novel, generalized hybrid and ensemble deep learning frameworks for miRNA species classification of stationary patterns from nucleotides

  • Jaskaran Singh
  • , Narendra N. Khanna
  •  &  Jasjit S. Suri

On topological indices and entropy measures of beryllonitrene network via logarithmic regression model

  • , Muhammad Kamran Siddiqui
  •  &  Fikre Bogale Petros

Article 22 March 2024 | Open Access

Measuring the similarity of charts in graphical statistics

  • Krzysztof Górnisiewicz
  • , Zbigniew Palka
  •  &  Waldemar Ratajczak

Article 21 March 2024 | Open Access

Risk prediction and interaction analysis using polygenic risk score of type 2 diabetes in a Korean population

  • Minsun Song
  • , Soo Heon Kwak
  •  &  Jihyun Kim

A longitudinal causal graph analysis investigating modifiable risk factors and obesity in a European cohort of children and adolescents

  • Ronja Foraita
  • , Janine Witte
  •  &  Vanessa Didelez

Article 19 March 2024 | Open Access

A novel group decision making method based on CoCoSo and interval-valued Q-rung orthopair fuzzy sets

  • , Hongwu Qin
  •  &  Xiuqin Ma

Impact of using virtual avatars in educational videos on user experience

  • Ruyuan Zhang
  •  &  Qun Wu

A generalisation of the method of regression calibration and comparison with Bayesian and frequentist model averaging methods

  • Mark P. Little
  • , Nobuyuki Hamada
  •  &  Lydia B. Zablotska

Article 18 March 2024 | Open Access

Monitoring gamma type-I censored data using an exponentially weighted moving average control chart based on deep learning networks

  • Pei-Hsi Lee
  •  &  Shih-Lung Liao

Article 15 March 2024 | Open Access

Statistical detection of selfish mining in proof-of-work blockchain systems

  • Sheng-Nan Li
  • , Carlo Campajola
  •  &  Claudio J. Tessone

Article 13 March 2024 | Open Access

Evaluation metrics and statistical tests for machine learning

  • Oona Rainio
  • , Jarmo Teuho
  •  &  Riku Klén

PARSEG: a computationally efficient approach for statistical validation of botanical seeds’ images

  • Luca Frigau
  • , Claudio Conversano
  •  &  Jaromír Antoch

Application of analysis of variance to determine important features of signals for diagnostic classifiers of displacement pumps

  • Jarosław Konieczny
  • , Waldemar Łatas
  •  &  Jerzy Stojek

Article 12 March 2024 | Open Access

Prediction and detection of side effects severity following COVID-19 and influenza vaccinations: utilizing smartwatches and smartphones

  • , Margaret L. Brandeau
  •  &  Dan Yamin

Article 08 March 2024 | Open Access

Evaluating the lifetime performance index of omega distribution based on progressive type-II censored samples

  • N. M. Kilany
  •  &  Lobna H. El-Refai

Article 07 March 2024 | Open Access

Development of risk models of incident hypertension using machine learning on the HUNT study data

  • Filip Emil Schjerven
  • , Emma Maria Lovisa Ingeström
  •  &  Frank Lindseth

Article 06 March 2024 | Open Access

Online trend estimation and detection of trend deviations in sub-sewershed time series of SARS-CoV-2 RNA measured in wastewater

  • Katherine B. Ensor
  • , Julia C. Schedler
  •  &  Loren Hopkins

Article 05 March 2024 | Open Access

Machine learning and XAI approaches highlight the strong connection between O3 and NO2 pollutants and Alzheimer’s disease

  • Alessandro Fania
  • , Alfonso Monaco
  •  &  Roberto Bellotti

Article 04 March 2024 | Open Access

High-precision regressors for particle physics

  • Fady Bishara
  • , Ayan Paul
  •  &  Jennifer Dy

Applying explainable artificial intelligence methods to models for diagnosing personal traits and cognitive abilities by social network data

  • Anastasia S. Panfilova
  •  &  Denis Yu. Turdakov

Application of machine learning with large-scale data for an effective vaccination against classical swine fever for wild boar in Japan

  • Satoshi Ito
  • , Cecilia Aguilar-Vega
  •  &  José Manuel Sánchez-Vizcaíno

Article 01 March 2024 | Open Access

Evaluating public opinions: informing public health policy adaptations in China amid the COVID-19 pandemic

  • Chenyang Wang
  • , Xinzhi Wang
  •  &  Hui Zhang

Article 26 February 2024 | Open Access

Quality assessment and community detection methods for anonymized mobility data in the Italian Covid context

  • Jules Morand
  • , Shoichi Yip
  •  &  Luca Tubiana

Article 23 February 2024 | Open Access

The disparate impacts of college admissions policies on Asian American applicants

  • Joshua Grossman
  • , Sabina Tomkins
  •  &  Sharad Goel

Article 22 February 2024 | Open Access

Spectrum analysis of digital UPWM signals generated from random modulating signals

  • Konstantinos Kaleris
  • , Emmanouil Psarakis
  •  &  John Mourjopoulos



Data analytics using statistical methods and machine learning: a case study of power transfer units

  • Application
  • Open access
  • Published: 30 March 2021
  • Volume 114, pages 1859–1870 (2021)


  • Sharmin Sultana Sheuly (ORCID: orcid.org/0000-0003-0883-0044),
  • Shaibal Barua,
  • Shahina Begum,
  • Mobyen Uddin Ahmed,
  • Ekrem Güclü &
  • Michael Osbakk


Sensors can produce large amounts of data related to products, design, and materials; however, it is important to use the right data for the right purposes. Therefore, detailed analysis of data accumulated from different sensors in production and assembly manufacturing lines is necessary to minimize faulty products and understand the production process. Additionally, manufacturing companies must select the analytical techniques most suitable for their needs. This paper presents a data analytics approach to extract useful information, such as important measurements for the dimensions of a shim, a small part for aligning shafts, from the manufacturing data of a power transfer unit (PTU). This paper also identifies the best techniques and analytical approaches within the following six individual areas: (1) identifying measurements associated with faults; (2) identifying measurements associated with shim dimensions; (3) identifying associations between station codes; (4) predicting shim dimensions; (5) identifying duplicate samples in faulty data; and (6) identifying error distributions associated with measurements. These areas are analysed using two approaches: (a) statistical analysis and (b) machine learning (ML)-based analysis. The results show (a) the relative importance of measurements with regard to the faulty unit and shim dimensions, (b) the error distribution of measurements, and (c) the reproduction rate of faulty units. Both statistical analysis and ML-based analysis show that the measurement ‘PTU housing measurement’ is the most important measurement with regard to the shim dimensions, and that certain fault-reporting stations are correlated with one another. ML is shown to be the most suitable technique in three areas (e.g. identifying measurements associated with faults), while statistical analysis is sufficient for the other three areas (e.g. identifying measurements associated with shim dimensions) because they do not require a complex analytical model. This study provides a clearer understanding of assembly line production and identifies highly correlated and significant measurements of a faulty unit.


1 Introduction

Today, with the rise of advanced sensor technology through the Internet of Things (IoT), a large amount of data, commonly known as big data, is collected through cyber physical systems (CPSs) [1, 2, 3]. However, only a small portion of the available data is actually used for any purpose today. Proper usage of data enables smart manufacturing through improved decision-making using a data analytics approach based on historical and real-time data for fault detection, fault prognosis, production cost estimation, and more [4, 5]. Traditional routine-based maintenance in industry can be transformed into big data-assisted predictive maintenance: machine health monitoring can be conducted by predicting health status based on real-time and historical data [6], and ML technology can be used for predictive maintenance, as in [6, 7, 8]. Thus, data-driven ML techniques have created a new dimension in the manufacturing industry.

The application of ML in the manufacturing industry is a recent development [9, 10]. Several techniques for integrating ML into manufacturing have emerged in the last few decades. ML methods such as decision trees, Bayesian networks, k-nearest neighbours (kNNs), and neural networks are currently used in the manufacturing industry for tool condition monitoring. Tool wear-sensitive features are defined and extracted [11], and ML-aided tool wear monitoring or tool condition monitoring can be helpful in the manufacturing industry [12, 13]. This trend has been applied in the semiconductor industry as well, where faulty wafers can be detected with the help of ML techniques such as Gaussian density estimation, Gaussian mixture models, the Parzen-window method, k-means clustering, support vector machines (SVM), and principal component analysis (PCA) [14]. Fault detection and fault classification are essential parts of process monitoring in photovoltaic (PV) arrays and can be performed with the help of ML algorithms [15, 16, 17]. ML-aided automated fault detection and diagnosis have been successful in many cases [18]. To reduce the need for human expertise in fault detection, convolutional neural networks have been shown to outperform traditional systems for rotating machinery [19]. Images of partially printed objects in 3-D printing are used for automated process monitoring, with the object classified as ‘defective’ or ‘good’ with the help of SVM [20]. Another application of ML in process monitoring is monitoring surface roughness in additive manufacturing, where temperature and vibration data are fed into an ensemble learning algorithm to predict roughness [21]. Data analytics aims to gain knowledge from raw data or derived data (i.e. results received from ML algorithms) [22]. Today, manufacturing systems are less dependent on human knowledge and rely more on advanced techniques such as deep learning to extract knowledge from raw data.

ML technology has only recently been applied in manufacturing; before ML, statistical analysis was the primary method used in the industry. Statistical methods help to correlate, organize, and interpret data [23], and statistical analysis reveals the underlying patterns in a data set; for example, correlation indicates a relationship between two variables. Currently, manufacturing systems are becoming more complex, making it challenging to detect and isolate faults. The Gaussian mixture model for finding probabilistic correlation is one method used for anomaly detection [24]. Another statistical method that can be used for fault detection is canonical correlation analysis (CCA), which has been applied during alumina evaporation [25]. Based on the correlation coefficient of the voltage curves, fault detection can be performed on short circuits [26]. Fault diagnosis under fluctuating workloads (i.e., in large-scale cloud computing environments) can be performed with the help of canonical correlation analysis between workloads and performance matrices [27].

As discussed above, statistics played an important role in process control before the emergence of ML and other technologies. However, most companies are still not fully using their data to create new knowledge. Additionally, most companies face a challenge in their choice of data analytics techniques: whether to adhere to traditional statistical analysis or to use the most current ML techniques. This study attempts to solve these problems by extracting useful knowledge from raw data and investigating which method (ML or statistical analysis) is best suited for different areas. To our knowledge, no study has investigated which data analytics methods are best suited for power transfer units (PTUs).

Consider the following example: a local company manufactures power transfer units (PTUs) for vehicles and uses different IoT-based sensors to measure different dimensions associated with the PTUs. The primary PTU housing shown in Fig. 1 is supported with three shims. Approximately 6.8% of PTUs are reported to be faulty, resulting in economic loss. The data collected from the assembly line were analysed to extract useful knowledge and identify the best method for data analytics.

Figure 1: Main housing of PTU

In this case, the influence of different measurements (e.g. ‘PTU housing measurement’) on the shim dimensions is investigated. Both statistical analysis methods (e.g. correlation) and ML algorithms (e.g. linear regression (LR), support vector regression (SVR), and random forest regression (RFR)) have been used to identify the most significant measurements associated with the shim. Furthermore, the data can be used to identify measurements that are highly responsible for a faulty unit. In this study, associations between station codes and shim dimension prediction are also investigated, along with the reproduction rate of the faulty unit and the error distribution of measurements. Statistical analysis and ML-based analysis are compared to identify the method best suited to each of the areas mentioned above.

2 Data collection and analysis

2.1 Power transfer unit

PTUs transfer power from the front of a vehicle to the back. This action is performed with the help of two cogwheels or gears. The efficiency of the PTU depends on the position of these two gears; misplaced gears result in vibrations and noise. Thus, to align these two gears, shims are used. Figure 2 shows a PTU in efficient driveline (ED) mode.

Figure 2: PTU in efficient driveline mode

2.2 Dataset

The dataset investigated in this study was obtained from a manufacturing company’s logistics in-production system database and consists of various measurements performed on an assembly line that manufactures PTUs. In total, 151,342 units were constructed, 6,488 of which were marked as ‘faulty’ by the operator due to mismatches in measurements or incorrect shim dimensions. Forty-two measurements were recorded for each unit, including mounting distances from the housing of the gear and gear heights. Each unit has a serial number and production time. The data were collected at several PTU stations, each of which has a station code. The faulty samples were marked in red, and the STATION fields of the nonfaulty samples were left empty.

Explanations of the different stations are listed in Table 1 . The data used in this study were gathered from an IoT platform that connects all the sensors via the internet.

2.3 Data analytics

Several data analytics areas that have been investigated in this study are shown in Fig. 3 .

Figure 3: Different data analytics areas

In this study, Area A identifies an association of different measurements with faults (i.e. which of 42 measurements are highly correlated with faulty units). Area B concerns the identification of the most important measurement associated with shim dimensions. Area C identifies a correlation between the stations, as each faulty unit has a station code. Area D predicts shim dimensions, and Area E identifies duplicate samples within the faulty data sets. Finally, Area F identifies error distributions associated with the measurements.

3 Overview of the approach

The step-by-step approach to data analytics is shown in Fig. 4 . The methods used in this study include domain knowledge; problem formulation; data and data pre-processing; data analytics involving statistical data analysis; ML-based data analysis; evaluation of the approach; new knowledge; and the best technique as an outcome. Initially, domain knowledge, data, requirements, and ideas are accumulated from the manufacturing company’s assembly line. Typically, the problem is formulated based on the requirements; in this study, the problem is formulated to explore the data, gather more knowledge about the assembly line, and find the best method of analysis. Additionally, domain knowledge is extracted and stored separately to evaluate the outcome of the approach.

Figure 4: Stages of the proposed method

Because data were collected in a raw format, data pre-processing (i.e. populating missing values, identifying outliers, etc.) was performed. In this stage, the values representing NaN (not a number) and null values were replaced with zeros, and missing values were identified and populated via imputation. Furthermore, data exploration was performed to identify irregular cardinality and outliers in the dataset. None of the measurements had a cardinality of 1 or a low cardinality. Therefore, irregular cardinality was absent in the dataset. To identify outliers, the distributions of measurements as well as minimums and maximums were observed. However, the dataset did not contain any outliers. Finally, all measurements were normalized to a range of 0 to 1. Then, the dataset was divided into training (containing 80% of data) and test (the other 20%) datasets to apply ML-based analysis.
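
As a rough illustration of these pre-processing steps (the paper does not publish its code, and the file name and the measurement/fault column names below are assumptions), a pandas/scikit-learn sketch could look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical file and column names; the real dataset has 42 measurements.
df = pd.read_csv("ptu_assembly_data.csv")
measurement_cols = [c for c in df.columns if c.startswith("measurement_")]

# Replace NaN and null entries with zeros, as described above.
df[measurement_cols] = df[measurement_cols].fillna(0)

# Normalise all measurements to the range [0, 1].
df[measurement_cols] = MinMaxScaler().fit_transform(df[measurement_cols])

# Divide into training (80%) and test (20%) datasets for the ML-based analysis.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
```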

In this study, data analytics was performed in two phases: (1) Phase 1 performed statistical analysis to investigate different data distributions and correlations between different station codes as well as measurements associated with shim dimensions to identify correlations within the PTU domain; and (2) Phase 2 performed ML-based data analysis to identify the most relevant measurements and optimize the number of measurements. The results of these two steps were analysed and evaluated to create new useful knowledge about the manufacturing company’s assembly line. Additionally, a comparison between Phase 1 and Phase 2 was performed to identify the most suitable methods for individual areas.

Statistical data analysis ( Phase 1 ) was performed to explore the data and describe the various characteristics of the dataset. The goal of this Phase is to identify the distribution of faulty items considering the different ranges of measurement values, correlation between different measurements of the shim dimension, and correlation between error rates and assembly stations. Statistical analysis provides insights into the dataset, such as an overall understanding of the assembly line, the importance of different measurements, and the effects of faulty measurements on different stations in the assembly line. To identify the relationships between different measurements and the number of errors, the target measurements were divided into 100 bins. For each bin, the number of errors was summed, and the distribution of the errors was explored with histograms.
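
A minimal sketch of this binning step, continuing the pre-processing sketch above (‘ptu_housing’ and the 0/1 fault flag ‘is_faulty’ are assumed column names):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Divide the target measurement into 100 bins and sum the errors per bin.
bins = pd.cut(train_df["ptu_housing"], bins=100)
errors_per_bin = train_df.groupby(bins, observed=False)["is_faulty"].sum()

# Explore the error distribution with a histogram-style plot.
errors_per_bin.plot(kind="bar")
plt.xlabel("PTU housing measurement (100 bins)")
plt.ylabel("Number of faulty units")
plt.tight_layout()
plt.show()
```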

According to expert opinions, faults in the dataset are associated with one of the important measurements called the ‘PTU housing measurement’. A correlation analysis that indicates the degree to which two random measurements were linearly connected was used to see how faults from different stations were associated with station codes for ‘PTU housing measurement’. To estimate the correlation, station codes for ‘PTU housing measurement’ were first listed, and a matrix was created, which was then used to calculate the cross-correlation of the accumulated station codes. The correlation showed that certain stations were highly correlated. Additionally, certain faulty samples were found to be repeated in the dataset. Therefore, duplicate values corresponding to an item’s serial number were identified, and the frequency of faulty samples for measurement ‘PTU housing measurement’ was estimated for each station code.
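
Sketched in pandas under the same assumed column names (the paper’s own computation was done in MATLAB), the station cross-correlation and the duplicate check might look like:

```python
import pandas as pd

# Faulty units only; 'station_code' is an assumed column name.
faulty = df[df["is_faulty"] == 1]

# Unit-by-station indicator matrix: 1 if a unit was flagged at that station.
indicator = pd.crosstab(faulty["serial_number"], faulty["station_code"]).clip(upper=1)

# Cross-correlation of station codes; values near 1 mark stations whose
# faults tend to occur together.
station_corr = indicator.corr()

# Frequency of repeated faulty samples (same serial number) per station code.
repeats = faulty[faulty.duplicated("serial_number", keep=False)]
repeat_freq = repeats.groupby("station_code")["serial_number"].nunique()
```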

The objectives of the ML-based analysis ( Phase 2 ) were to classify PTU faults, predict shim dimensions, and identify the relationships between station codes. Classifying faults helps to identify the most relevant measurements, and in the future, fault classification may help to predict the values that must be adapted to obtain an accurate unit. All faulty and nonfaulty units were labelled 1 and 0, respectively. The hyperparameters of the ML models were optimized with the goal of comparing the performance of the models with and without the default parameters. In most cases, the options for hyperparameter optimization were left at their default values, and the creation of models with the default optimization options took an average of 12 hours. Due to this long optimization process and the good performance of the default hyperparameter optimization options (discussed in Section 4), the default option values were not changed. For the same reason, not all eligible hyperparameters were optimized (except for RFR); RFR was optimized because the values predicted by the RFR model deviated from the real values.

Two support vector machine (SVM) classifiers were trained to classify the faulty units using the training dataset. Then, the coefficient values of the measurements obtained from the SVM classifier were used to rank the measurements, and the most relevant measurements were compared to the suggestions of experts. One of the classifiers had default hyperparameters, and another had optimized hyperparameters. The default hyperparameters associated with the classifier are box constraint=1, kernel scale=1, kernel function= ‘linear’, and standardized data=0. The second classifier was built using automatic hyperparameter optimization. The hyperparameter optimization option was set to ‘auto’, which indicates that the hyperparameters ‘BoxConstraint’ and ‘KernelScale’ will be optimized instead of all eligible parameters. Options for optimization were set to default values except ‘AcquisitionFunctionName’, which was set to ‘expected improvement plus’ to enable reproducibility. After 30 iterations, a hyperparameter-optimized model (support vector classifier) was created. The best feasible ‘BoxConstraint’ value is 837.56, and the ‘KernelScale’ value is 133.58.
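
The classifiers above were built with MATLAB-style SVM tooling and Bayesian hyperparameter optimization; an approximate scikit-learn sketch of the default linear classifier and the coefficient-based ranking (variable names continue the sketches above) is:

```python
import numpy as np
from sklearn.svm import SVC

# Training matrices from the pre-processing sketch; 1 = faulty, 0 = nonfaulty.
X_train = train_df[measurement_cols].to_numpy()
y_train = train_df["is_faulty"].to_numpy()

# Linear SVM; C=1.0 loosely mirrors the default BoxConstraint=1 described above.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X_train, y_train)

# Rank the measurements by the magnitude of their linear coefficients;
# the largest magnitudes mark the measurements most relevant to faults.
coefs = clf.coef_.ravel()
for idx in np.argsort(-np.abs(coefs))[:18]:  # 18 most relevant measurements
    print(measurement_cols[idx], round(coefs[idx], 4))
```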

Furthermore, several ML algorithms (LR, SVR, and RFR) were trained to identify the correlations of ‘Gear (Pinion) height’, ‘PTU housing measurement’, and ‘Manual adjustment’ with the ‘shim dimension’ and to predict the shim dimension. With the LR algorithm, only one model was trained because no hyperparameters are involved in fitting the input data points. It is assumed that the relation between input and output follows the formula y = bx + c.

In SVR, two models were trained: one with default hyperparameters and one with optimized hyperparameters. The default-hyperparameter SVR was trained with a linear kernel, and the hyperparameters were set to default values (lambda=8.259×10⁻⁶, learner=SVM, regularization=ridge(L2)). Conversely, for the optimized model, the parameters to be optimized were set to ‘auto’ to optimize three hyperparameters: BoxConstraint, KernelScale, and Epsilon. The options for optimization were set to default values. After 30 iterations, a hyperparameter-optimized regression model was created. The values of the optimized hyperparameters are BoxConstraint=0.022683, KernelScale=0.013568, and Epsilon=0.00022608.

In RFR, three models were trained: one with default hyperparameters, one with four hyperparameters optimized, and one with all hyperparameters optimized. The default RFR was trained using a bagged ensemble of 200 regression trees, with the hyperparameters set as follows: number of ensemble learning cycles=200, learn rate=1, method=‘bag’, and number of predictors to select at random for each split=all. In the four-hyperparameter-optimized RFR model, the parameters to be optimized were set to ‘auto’ to optimize four hyperparameters: Method, NumLearningCycles, LearnRate, and MinLeafSize. The options for optimization were set to default values. After 30 iterations, a four-hyperparameter-optimized RFR model was created. The values of the optimized hyperparameters are Method=‘LSBoost’, NumLearningCycles=85, LearnRate=0.050891, and MinLeafSize=1. In the third model, all eligible parameters were optimized. The values of all optimized hyperparameters are Method=‘Bag’, NumLearningCycles=16, LearnRate=NaN, MinLeafSize=4, MaxNumSplits=60006, and NumVariablesToSample=2. These models were then evaluated using the test dataset.
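
The regression models above were trained in MATLAB; a rough scikit-learn analogue with default-style hyperparameters (the predictor column names are assumptions) is:

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

predictors = ["gear_pinion_height", "ptu_housing", "manual_adjustment"]
X_train, y_train = train_df[predictors], train_df["shim_dimension"]
X_test, y_test = test_df[predictors], test_df["shim_dimension"]

models = {
    "LR": LinearRegression(),                        # y = bx + c
    "SVR": SVR(kernel="linear"),                     # default-style linear SVR
    "RFR": RandomForestRegressor(n_estimators=200),  # bagged ensemble of 200 trees
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "test R^2:", model.score(X_test, y_test))
```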

To identify the relationships between different stations, 10 rules were mined using an Apriori algorithm on the Weka platform. General association rules were mined instead of class association rules by setting ‘car’ to false. The rules were ranked based on the values of ‘confidence’, and the minimum metric score was 0.9. Upper bound for minimum support was 1.0.
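
The rules themselves were mined with Weka’s Apriori implementation; purely as an illustration, the same kind of general association-rule mining can be sketched in Python with the mlxtend library (the one-hot encoding of per-station fault conditions and the starting support threshold of 0.1 are assumptions for the sketch):

```python
from mlxtend.frequent_patterns import apriori, association_rules

# station_onehot: assumed boolean DataFrame with one column per
# station/fault condition (True when the condition holds for a unit).
frequent = apriori(station_onehot, min_support=0.1, use_colnames=True)

# Keep general rules with confidence >= 0.9 and rank them by confidence.
rules = association_rules(frequent, metric="confidence", min_threshold=0.9)
top10 = rules.sort_values("confidence", ascending=False).head(10)
print(top10[["antecedents", "consequents", "confidence",
             "lift", "leverage", "conviction"]])
```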

4 Results and discussion

The goal of this evaluation was to gather new, useful knowledge about the assembly line using the proposed data analytics method and identify the best techniques for individual areas. In this study, an exploratory validation approach is used to find the best ML model.

In Fig. 3 , different areas of data analytics are described, and an evaluation is presented based on these different areas.

Experts from the manufacturing company provided a set of the most relevant measurements corresponding to faults. In Phase 1 , the objective was to find the correlation coefficients between each of the 42 measurements and STATION. However, this method was found to be time-consuming. The MATLAB command ‘corrplot’ for finding correlations resulted in a 42×42 matrix that was difficult to interpret. Another method of implementing Phase 1 analysis is analysis of variance (ANOVA), where p -values are used to select the most informative measurements [ 28 ]. The authors in [ 28 ] discarded measurements depending on the p -value. However, this work does not use the ANOVA method because the dataset was not normally distributed in certain cases.

Implementation of Phase 1 analysis could also be accomplished by following the methods used by Andrew and Srinivas [ 29 ]. The authors deleted one measurement at a time to find the most important measurements; however, this method is time-consuming. Due to these problems, we did not consider Phase 1 to be a suitable analysis method.

In the next step, we found a different set of relevant measurements in Phase 2 (ML algorithms). Two SVM classifiers were created: one with default hyperparameter values and another with optimized hyperparameters. Both classifiers ranked the measurements identically by relevance, and the relevant measurements identified by the two SVM classifiers are shown in Table 2. However, a large amount of overlap was observed between the measurements provided by the experts and the measurements identified using the ML algorithm SVM. SVM classification was used to classify the samples into two groups, ‘faulty’ and ‘nonfaulty’, and the linear coefficients associated with the predictors (measurements) were then compared; the 18 most relevant measurements are listed. A comparison between the list of 18 measurements provided by the manufacturer and those uncovered using SVM showed that the lists agree. After discussion with the experts, it was confirmed that whenever a fault takes place, technicians can check the measurements in Table 2 for possible faults.

The classification results on the test dataset for the classifiers with default and optimized hyperparameters are shown in Table 3. Based on these measurements, the classifiers are useful: none of the samples were incorrectly classified as faulty or nonfaulty, and both classifiers achieved 100% accuracy, specificity, and sensitivity. The motivation for creating a hyperparameter-optimized model was to check whether performance changed.

Phase 1 analysis is also shown to be unsuitable for Area A: as the number of measurements increases, the difficulty of implementing Phase 1 grows sharply. Thus, Phase 2 is best suited for this area, considering implementation time and difficulty.

For Area B, both Phase 1 and Phase 2 analyses were implemented. Three measurements, ‘Gear (Pinion) height’, ‘PTU housing measurement’, and ‘Manual adjustment’, were analysed for correlations with the shim dimension. In Phase 1, the correlation coefficients of these measurements with the shim dimension were calculated and are shown in Table 4.

As shown in Table 4 , ‘PTU housing measurement’ has the highest correlation with the shim dimension, and this result also aligns with the experts’ opinions.

In Phase 2 , the relative importance (i.e. linear coefficients of measurements associated with shim dimension) was found by the ML algorithms LR, SVR, and RFR with default hyperparameters and optimized hyperparameters (Table 5 ). These ML algorithms predicted the shim dimension with the help of regression models.

From the table, it can be concluded that if there is any fault in the shim dimension, it is highly probable that ‘PTU housing measurement’ has a problem; a technician can check this measurement for probable adjustment. Both the default and optimized hyperparameter models provided the same result, except for the default-hyperparameter RFR model, in which ‘Gear (Pinion) height’ had the highest importance with regard to the shim dimension. However, this result does not align with the results of the remaining models. Because the hyperparameter-optimized SVR and LR have higher accuracies (Table 8), we considered ‘PTU housing measurement’ to be the most important measurement. Additionally, a comparison between the default and optimized hyperparameter SVR models showed that the overall relative importance of the predictors, i.e. the effect of the predictors on the shim dimension, is lower in the optimized model than in the default model.

Although both Phase 1 and Phase 2 analyses were implemented in this area, Phase 1 was easier to use than Phase 2, which involved the creation of regression models with hyperparameter tuning. Additionally, knowledge of ML is required to implement Phase 2 analysis; the application of ML is not necessary when the target problem can easily be solved with traditional mathematics or statistics. Therefore, for this area, Phase 1 is the most suitable method of analysis.

For Area C, the correlation (Phase 1) between different station codes for the ‘PTU housing measurement’ was calculated, and the most highly correlated station codes are shown in Table 6 (i.e. faults with a correlation coefficient higher than 0.80). The remaining station codes appeared random because their correlation coefficients were comparatively low; they are thus not listed in Table 6.

In the Phase 2 analysis, association rules were mined using the Weka platform, and the results are shown in Table 7. All of the rules have confidence levels higher than 90%. For example, the first row can be interpreted as follows: if Station 114 does not have any fault, then there is a 100% chance that Station 140 will not have any fault, corresponding to a confidence of 1. A lift value greater than 1 indicates that the rule body and rule head occur together more often than expected. Additionally, if the conviction value is 1, the rule body and rule head are independent; a conviction value other than 1 indicates a better rule. A high leverage value indicates a higher probability of the rule head and rule body occurring together. All of these measures, as shown in Table 7, indicate that the rules are reliable.

However, the stations that have a high correlation according to Phase 2 do not align with the results of Phase 1 . Manual checking of the stations suggests that Phase 2 is more accurate. Statistical analysis only measured the correlation by the number of faults and ignored the relationship when a fault was absent. ML considered the relationships between stations according to both faults and non-faults. Therefore, for this area of analysis, Phase 2 is most suitable.

To use Phase 1 in Area D, we reviewed 50 peer-reviewed papers published in 2019–2020 and selected certain statistical techniques. For example, we attempted to use spatial statistics [ 30 ]; however, this method has basic applications in feature extraction, not prediction. Similarly, Cox proportional hazards regression [ 31 ] was used to predict the next occurrence of an event; however, predicting the shim dimension was not possible with this algorithm. The accelerated failure time (AFT) model was also considered. However, this model uses the same method as the Cox proportional hazards regression. Logistic regression was considered as a statistical method in one study [ 32 ]; however, logistic regression is a classifier that cannot be used for regression. Thus, we could not find any other statistical techniques that could be implemented in Area D. For this reason, Phase 1 was not implemented in Area D.

In Phase 2 , both the LR and SVR (default and optimized hyperparameter) algorithms predicted the shim dimension with an accuracy near 100%. A small deviation was observed in the predicted value from the real value in the case of RFR (both default and optimized hyperparameter) compared to LR and SVR. All eligible hyperparameters were optimized in one of the RFR models; however, the deviation was also the same for that model. Figure 5 shows the parity plot for the shim-dimension prediction using the test dataset and the optimized hyperparameter RFR algorithm. These deviated values were within 10% of the real values.

Figure 5: (Area D) Parity plot for optimized random forest regression

Table 8 lists the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and mean square error (MSE) values of the regression models (default and optimized hyperparameters). In the hyperparameter-optimized models, the R², RMSE, MAE, and MSE values were marginally improved compared to the default-hyperparameter models; for the RFR model, however, there was no improvement. In Table 8, a lower RMSE value indicates a better fit, meaning that the observed data points are near the model’s predicted values, and R² values at or near 1 indicate that the models can significantly predict the shim dimension.
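
For reference, the four reported metrics can be computed on the held-out test set along these lines (continuing the regression sketch above):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print("R^2 :", r2_score(y_test, y_pred))
print("RMSE:", np.sqrt(mse))
print("MAE :", mean_absolute_error(y_test, y_pred))
print("MSE :", mse)
```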

Additionally, the MAE and MSE of the models are near zero, indicating that the models can predict almost without error. However, the dataset against which the results are compared was labelled by technicians and may therefore contain labelling errors; consequently, there may be faults in the model.

Table 9 shows the estimated coefficients of the linear regression model, where ‘Gear (Pinion) height’, ‘PTU housing measurement’, and ‘Manual adjustment’ are the predictors. The term ‘Estimate’ indicates the relative importance (coefficient value) of the predictors in the model. The predictor ‘PTU housing measurement’ is the most important of the three predictors.

‘SE’ is the standard deviation of the estimate and indicates the standard error of the coefficients, which represents the model’s ability to estimate coefficient values. A lower SE indicates a better estimate. In Table 9 , the SE is small, meaning that the model accurately estimated the values of the coefficients.

‘tStat’ is the ratio of a coefficient estimate to its standard error and is used to determine whether the null hypothesis should be accepted or rejected. Here, the null hypothesis is that there is no relationship between the input and the output. The higher the tStat value, the more significant the estimate is in the regression model; because the tStat values are high, the null hypothesis can be rejected.

The ‘P-value’ in the linear regression analysis indicates whether the null hypothesis can be rejected: a low p-value means the null hypothesis is rejected and that the input is strongly related to the output. In Table 9, all p-values are 0, indicating that the predictors are highly correlated with the response.
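
The quantities reported in Table 9 (Estimate, SE, tStat, p-value) are the standard output of an ordinary least squares fit; for instance, a statsmodels sketch (predictor names as assumed in the earlier sketches) would reproduce the same table structure:

```python
import statsmodels.api as sm

# X_train: the three predictors; y_train: the shim dimension (sketches above).
ols = sm.OLS(y_train, sm.add_constant(X_train)).fit()

# The summary lists 'coef' (Estimate), 'std err' (SE), 't' (tStat),
# and 'P>|t|' (p-value) for each predictor.
print(ols.summary())
```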

For Area D, Phase 2 is the most suitable method because Phase 1 could not be implemented.

The ‘Serial number’ column was checked for duplicate instances of a PTU unit; a duplicate instance arises when a fault is present, because the faulty item is repaired and re-enters the line under the same ‘Serial number’. In Phase 1 , analysis was performed on faults with station codes 90 and 110. A total of 3,930 items with station codes 90 and 110 were found to be faulty. Of these 3,930 faulty items, only 360 items with the same ‘Serial number’ were repaired. According to discussions with experts in this field, PTUs with faults can be assigned new ‘Serial numbers’ or be scrapped.
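As an illustration of this duplicate check, hypothetically assuming the production log is a table with 'serial_number' and 'station_code' columns (assumed names, not the actual schema), the count of repaired units could be obtained like this:

```python
# Sketch: count faulty units (station codes 90 and 110) whose serial
# number appears more than once, i.e., units that were repaired.
import pandas as pd

def count_repaired_units(log: pd.DataFrame) -> int:
    faulty = log[log["station_code"].isin([90, 110])]
    repeated = faulty[faulty.duplicated(subset="serial_number", keep=False)]
    return repeated["serial_number"].nunique()
```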

Phase 2 was not implemented in this area because it is not necessary to use ML to find duplicate instances within a given set of numbers; traditional statistics are sufficient for this purpose. ML is necessary in the following cases [ 33 ]:

  • A task that is too complex for a human to solve
  • A task requiring large amounts of memory
  • A task requiring adaptivity

Therefore, for Area E, Phase 1 is the best-suited method.

Phase 1 was implemented to find the error distribution. The relationship between faults and measurements follows a Gaussian distribution, except for ‘housing measurement from loading house/measuring house’, which shows a large bar at 59. We assumed that these values were genuinely equal to 59 rather than the result of a programming error, and after double-checking, the data were confirmed to be correct. The error distribution of the ‘PTU housing measurement’ is shown in Fig. 6 . At a threshold of 103.58 the error rate is high, whereas below a threshold of 103.68 the error rate decreases.

Figure 6: Error distribution of ‘PTU housing measurement’
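As a sketch of this Phase 1 analysis, assuming a table with a 'measurement' column and a Boolean 'is_fault' column (both hypothetical names), the Gaussian fit and the error rate at candidate thresholds could be examined as follows:

```python
# Sketch: fit a Gaussian to a measurement and report the fault rate
# below candidate thresholds (e.g., the 103.58 and 103.68 values
# discussed above). Column names are assumptions.
import pandas as pd
from scipy import stats

def error_rate_by_threshold(df: pd.DataFrame, thresholds) -> None:
    mu, sigma = stats.norm.fit(df["measurement"])
    print(f"Fitted Gaussian: mu = {mu:.3f}, sigma = {sigma:.3f}")
    for t in thresholds:
        subset = df[df["measurement"] <= t]
        rate = subset["is_fault"].mean() if len(subset) else float("nan")
        print(f"threshold {t}: error rate = {rate:.3f}")

# Example usage:
# error_rate_by_threshold(df, [103.58, 103.68])
```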

Phase 2 was not implemented for the same reason stated in Area E; therefore, Phase 1 is the most suitable method for Area F.

5 Conclusions

Concerning the various areas described in Fig. 3 , the outcomes of the proposed intelligent data analytics with regard to power transfer units are as follows:

Area A: Out of 42 measurements, the experts from the manufacturing company identified the 18 most relevant. In this study, we used two SVM classifiers to find the most relevant measurements, which are listed in Table 2 . There is substantial overlap between the measurements provided by the experts and those identified by the ML algorithm. Phase 1 is not well suited to this area; Phase 2 is required.

Area B: Both statistical analysis and ML-based analysis have shown that ‘PTU housing measurement’ is the most important measurement for the shim dimension. Phase 1 is the method best suited for this area.

Area C: Certain station codes were highly correlated. Phase 2 is the most suitable method for this area because Phase 1 produced incomplete results.

Area D: ML algorithms predicted the shim dimension accurately. The manufacturing company’s technicians manually selected a shim dimension whenever there was a mismatch; these manual selections were frequently, but not always, correct. The dataset used in this study to train the ML models to predict the shim dimension (Area D) therefore contains some erroneous values. In the future, the prediction of shim dimensions can be improved by classifying them with an ML algorithm instead of depending on the knowledge of technicians to create the labelled datasets. Phase 1 could not be implemented in this area; thus, Phase 2 is the most suitable method for this area.

Area E: Not all faulty units were repaired, as determined by counting duplicate instances. Phase 1 was more effective in this area than Phase 2 .

Area F: The relationship between fault and measurements follows a Gaussian distribution. Phase 1 is thus the most suitable method for this area.

Thus, this study contributes knowledge about a manufacturing company’s assembly line and presents a comparative study of the suitability of various analytical methods in the aforementioned six areas. The proposed methods allow assembly line technicians to check the important measurements identified by ML (Area A) when there is a fault in a PTU, instead of checking all 42 measurements. Additionally, in the case of shim dimensions (Area B), a technician can check ‘PTU housing measurement’ for mismatches. The identification of relationships between station codes (Area C) can help the manufacturing company find patterns and causes of failures. The prediction of the shim dimension (Area D) will help technicians choose shims when there is a mismatch, and the shim-dimension prediction system can be deployed in the cloud. Knowing the rate at which faulty units recur (Area E), technicians can try to reduce it. According to discussions with experts at the manufacturing company, the error distribution of ‘PTU housing measurement’ (Area F, Fig. 6 ) was expected to follow an exponential distribution. However, the distribution found in this study is Gaussian; this discrepancy will be investigated in future research.

The performance of the hyperparameter-optimized RFR model was not higher than that of the default-hyperparameter model; this topic will also be investigated in future research.

We attempted to find the most suitable method of analysis for six areas of interest. Based on the various analyses, neither statistics alone nor ML alone is suitable for all six areas. Statistics was found to be most suitable for Areas B, E, and F, while ML was the most suitable technique for Areas A, C, and D, because ML is appropriate when a problem is too complex for statistics to solve or requires adaptability. None of the problems solved in Areas B, E, and F was too complex or required adaptability, while the problems solved in Areas A, C, and D were complex and benefitted from the advantages of ML.

Availability of data and materials

The data used in this work cannot be uploaded to a repository due to the confidentiality requirements of the manufacturing company.

https://www.gkn.com/

Lasi H, Fettke P, Kemper H-G, Feld T, Hoffmann M (2014) Industry 4.0. Bus Inf Syst Eng 6(4):239–242


Lee J, Bagheri B, Kao H-A (2015) A cyber-physical systems architecture for industry 4.0-based manufacturing systems. Manuf Lett 3:18–23

Nagorny K, Lima-Monteiro P, Barata J, Colombo AW (2017) Big data analysis in smart manufacturing: A review. Int J Commun Netw Syst Sci 10(3):31–58


Tao F, Qi Q, Liu A, Kusiak A (2018) Data-driven smart manufacturing. J Manuf Syst 48:157–169

Gao W, Zhu Y (2017) A cloud computing fault detection method based on deep learning. J Comput Commun 5(12):24–34

Wan J, Tang S, Li D, Wang S, Liu C, Abbas H, Vasilakos AV (2017) A Manufacturing Big Data Solution for Active Preventive Maintenance. IEEE Trans Ind Inf 13(4):2039–2047. https://doi.org/10.1109/TII.2017.2670505

Susto GA, Schirru A, Pampuri S, McLoone S, Beghi A (2015) Machine Learning for Predictive Maintenance: A Multiple Classifier Approach. IEEE Trans Ind Inf 11(3):812–820. https://doi.org/10.1109/TII.2014.2349359

Prytz R (2014) Machine learning methods for vehicle predictive maintenance using off-board and on-board data. Halmstad University Press

Monostori L, Márkus A, Van Brussel H, Westkämpfer E (1996) Machine learning approaches to manufacturing. CIRP Ann 45(2):675–712

Sakao T, Funk P, Matschewsky J, Bengtsson M, Ahmed MU (2021) AI-LCE: Adaptive and Intelligent Life Cycle Engineering by applying digitalization and AI methods – an emerging paradigm shift in Life Cycle Engineering. Paper presented at the 28th CIRP Conference on Life Cycle Engineering (CIRP LCE 2021)

Kilundu B, Dehombreux P, Chiementin X (2011) Tool wear monitoring by machine learning techniques and singular spectrum analysis. Mech Syst Signal Process 25(1):400–415

de Farias A, de Almeida SLR, Delijaicov S, Seriacopi V, Bordinassi EC (2020) Simple machine learning allied with data-driven methods for monitoring tool wear in machining processes. Int J Adv Manuf Technol 109(9):2491–2501

Serin G, Sener B, Ozbayoglu A, Unver H (2020) Review of tool condition monitoring in machining and opportunities for deep learning. Int J Adv Manuf Technol 1–22

Kim D, Kang P, Cho S, Lee H-j, Doh S (2012) Machine learning-based novelty detection for faulty wafer detection in semiconductor manufacturing. Expert Syst Appl 39(4):4075–4083

Zhao Y, Yang L, Lehman B, de Palma J-F, Mosesian J, Lyons R (2012) Decision tree-based fault detection and classification in solar photovoltaic arrays. In: 2012 Twenty-Seventh Annual IEEE Applied Power Electronics Conference and Exposition (APEC). IEEE, pp 93–99

Omran WA, Kazerani M, Salama MM (2010) A clustering-based method for quantifying the effects of large on-grid PV systems. IEEE Trans Power Deliv 25(4):2617–2625

Zhao Y, Ball R, Mosesian J, de Palma J-F, Lehman B (2014) Graph-based semi-supervised learning for fault detection and classification in solar photovoltaic arrays. IEEE Trans Power Electron 30(5):2848–2858

Han H, Gu B, Wang T, Li Z (2011) Important sensors for chiller fault detection and diagnosis (FDD) from the perspective of feature selection and machine learning. Int J Refrig 34(2):586–599

Janssens O, Slavkovikj V, Vervisch B, Stockman K, Loccufier M, Verstockt S, Van de Walle R, Van Hoecke S (2016) Convolutional neural network based fault detection for rotating machinery. J Sound Vib 377:331–345

Delli U, Chang S (2018) Automated process monitoring in 3D printing using supervised machine learning. Procedia Manuf 26:865–870

Li Z, Zhang Z, Shi J, Wu D (2019) Prediction of surface roughness in extrusion-based additive manufacturing with machine learning. Robot Comput Integr Manuf 57:488–495

Xu K, Li Y, Liu C, Liu X, Hao X, Gao J, Maropoulos PG (2020) Advanced Data Collection and Analysis in Data-Driven Manufacturing Process. Chin J Mech Eng 33(1):1–21

Daniel EC, Onyedika IC, Christian OI, Benjamin AU (2014) Statistical Analysis of Processing Data for a Manufacturing Industry (A Case Study of Stephens Bread Industry). Conference proceedings

Guo Z, Jiang G, Chen H, Yoshihira K (2006) Tracking probabilistic correlation of monitoring data for fault detection in complex systems. In: International Conference on Dependable Systems and Networks (DSN’06). IEEE, pp 259–268

Chen Z, Ding SX, Zhang K, Li Z, Hu Z (2016) Canonical correlation analysis-based fault detection methods with application to alumina evaporation process. Control Eng Pract 46:51–58

Xia B, Shang Y, Nguyen T, Mi C (2017) A correlation based fault detection method for short circuits in battery packs. J Power Sources 337:1–10

Wang T, Zhang W, Wei J, Zhong H (2015) Fault detection for cloud computing systems with correlation analysis. In: 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM). IEEE, pp 652–658

Huang Z, Chen H, Hsu C-J, Chen W-H, Wu S (2004) Credit rating analysis with support vector machines and neural networks: a market comparative study. Decis Support Syst 37(4):543–558

Sung AH, Mukkamala S (2003) Identifying important features for intrusion detection using support vector machines and neural networks. In: 2003 Symposium on Applications and the Internet. Proceedings, 2003. IEEE, pp 209–216

Cameron B, Tasan C (2019) Microstructural damage sensitivity prediction using spatial statistics. Sci Rep 9(1):1–6

Senders JT, Staples P, Mehrtash A, Cote DJ, Taphoorn MJ, Reardon DA, Gormley WB, Smith TR, Broekman ML, Arnaout O (2020) An online calculator for the prediction of survival in glioblastoma patients using classical statistics and machine learning. Neurosurgery 86(2):E184–E192

Barnett-Itzhaki Z, Elbaz M, Butterman R, Amar D, Amitay M, Racowsky C, Orvieto R, Hauser R, Baccarelli AA, Machtinger R (2020) Machine learning vs. classic statistics for the prediction of IVF outcomes. J Assist Reprod Genet 37(10):2405–2412

Shalev-Shwartz S, Ben-David S (2014) Understanding machine learning: From theory to algorithms. Cambridge University Press


Acknowledgements

The authors would like to thank the students Simon Svensson, Kristoffer Lindve, Henrik Särnblad, and Gustav Radbrandt for their initial work on the data during a course project held at Mälardalen University, Sweden. Many thanks to GKN (footnote 2) for the data and domain knowledge.

Open access funding provided by Mälardalen University. The study was conducted through the project AUTOMAD, which is funded by the XPRES framework, and the project DIGICOGS, which is financed by Vinnova (Vinnova Diarienr: 2019-0532) and the innovation programme Process Industrial IT and Automation (PiiA) at Mälardalen University.

Author information

Authors and affiliations

Mälardalen University, Högskoleplan 1, 72220 Västerås, Sweden

Sharmin Sultana Sheuly, Shaibal Barua, Shahina Begum & Mobyen Uddin Ahmed

GKN ePowertrain, Volvogatan 6, 73136, Köping, Sweden

Ekrem Güclü & Michael Osbakk


Contributions

The corresponding author, Sharmin Sultana Sheuly, was responsible for writing the paper, developing the MATLAB code, comparing ML and statistical analysis, and finding the most suitable method. Shaibal Barua and Shahina Begum were responsible for identifying and discussing the most important analysis areas. Mobyen Uddin Ahmed worked on identifying the most relevant measurements corresponding to faults and on shim dimension prediction. Ekrem Güclü and Michael Osbakk provided the data and domain knowledge for evaluating the analysis results.

Corresponding author

Correspondence to Sharmin Sultana Sheuly .

Ethics declarations

Consent to participate

Not applicable.

Consent to publish

All the authors signed the consent to publish this work.

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Sheuly, S.S., Barua, S., Begum, S. et al. Data analytics using statistical methods and machine learning: a case study of power transfer units. Int J Adv Manuf Technol 114 , 1859–1870 (2021). https://doi.org/10.1007/s00170-021-06979-7


Received : 05 October 2020

Accepted : 22 March 2021

Published : 30 March 2021

Issue Date : May 2021

DOI : https://doi.org/10.1007/s00170-021-06979-7


  • Data analytics
  • Statistical analysis
  • Machine Learning
  • Power transfer unit
  • Predictive maintenance
  • Fault detection
  • Advanced manufacturing

Data Analysis: Recently Published Documents


Introduce a Survival Model with Spatial Skew Gaussian Random Effects and its Application in Covid-19 Data Analysis

Futuristic prediction of missing value imputation methods using extended ANN

Missing data is a universal complication in most research fields, introducing uncertainty into data analysis. It can arise for many reasons, such as mishandled samples, failure to collect an observation, measurement errors, deletion of aberrant values, or simple gaps in the study. The nutrition field is no exception to the problem of missing data. Most frequently, this difficulty is addressed by imputing means or medians from the existing datasets, an approach that needs improvement. The paper proposes a hybrid scheme of MICE and ANN, known as extended ANN, to search for and analyze missing values and perform imputation in the given dataset. The proposed mechanism efficiently analyzes blank entries and fills them by examining neighboring records, thereby improving the accuracy of the dataset. To validate the proposed scheme, the extended ANN is compared against various recent algorithms and mechanisms in terms of efficiency and accuracy of results.

Applications of multivariate data analysis in shelf life studies of edible vegetal oils – a review of the past few years

Hypothesis formalization: empirical findings, software limitations, and design implications.

Data analysis requires translating higher level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization . In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation and discuss implications for future tools.

The Complexity and Expressive Power of Limit Datalog

Motivated by applications in declarative data analysis, in this article, we study Datalog_Z, an extension of Datalog with stratified negation and arithmetic functions over integers. This language is known to be undecidable, so we present the fragment of limit Datalog_Z programs, which is powerful enough to naturally capture many important data analysis tasks. In limit Datalog_Z, all intensional predicates with a numeric argument are limit predicates that keep maximal or minimal bounds on numeric values. We show that reasoning in limit Datalog_Z is decidable if a linearity condition restricting the use of multiplication is satisfied. In particular, limit-linear Datalog_Z is complete for Δ^EXP_2 and captures Δ^P_2 over ordered datasets in the sense of descriptive complexity. We also provide a comprehensive study of several fragments of limit-linear Datalog_Z. We show that semi-positive limit-linear programs (i.e., programs where negation is allowed only in front of extensional atoms) capture coNP over ordered datasets; furthermore, reasoning becomes coNEXP-complete in combined and coNP-complete in data complexity, where the lower bounds hold already for negation-free programs. In order to satisfy the requirements of data-intensive applications, we also propose an additional stability requirement, which causes the complexity of reasoning to drop to EXP in combined and to P in data complexity, thus obtaining the same bounds as for usual Datalog. Finally, we compare our formalisms with the languages underpinning existing Datalog-based approaches for data analysis and show that core fragments of these languages can be encoded as limit programs; this allows us to transfer decidability and complexity upper bounds from limit programs to other formalisms. Therefore, our article provides a unified logical framework for declarative data analysis which can be used as a basis for understanding the impact on expressive power and computational complexity of the key constructs available in existing languages.

An empirical study on cross-border e-commerce talent cultivation, based on skill gap theory and big data analysis

To resolve the dilemma between the increasing demand for cross-border e-commerce talent and students’ incompatible skill levels, industry-university-research cooperation, an essential pillar of the interdisciplinary talent cultivation model adopted by colleges and universities, creates synergy among the relevant parties and builds a bridge between knowledge and practice. Nevertheless, industry-university-research cooperation has developed only recently in the cross-border e-commerce field and faces several problems, such as unstable collaboration relationships and vague training plans.

The Effects of Cross-border e-Commerce Platforms on Transnational Digital Entrepreneurship

This research examines the important concept of transnational digital entrepreneurship (TDE). The paper integrates the host- and home-country entrepreneurial ecosystems with the digital ecosystem into the framework of the transnational digital entrepreneurial ecosystem. The authors argue that cross-border e-commerce platforms provide critical foundations in the digital entrepreneurial ecosystem, and entrepreneurs who rely on this ecosystem are defined as transnational digital entrepreneurs. Interview data from twelve Chinese immigrant entrepreneurs living in Australia and New Zealand were analyzed as case studies. The results of the data analysis reveal that cross-border entrepreneurs do in fact rely on the framework of the transnational digital ecosystem. Cross-border e-commerce platforms not only play a bridging role between home- and host-country ecosystems but also provide the entrepreneurial capital promised by the digital ecosystem.

Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis With Limited Computational Resources

A trajectory evaluator by sub-tracks for detecting VOT-based anomalous trajectories

With the popularization of visual object tracking (VOT), more and more trajectory data are obtained and have begun to gain widespread attention in fields such as mobile robots and intelligent video surveillance. How to clean the anomalous trajectories hidden in massive data has become one of the research hotspots. Anomalous trajectories should be detected and cleaned before the trajectory data can be used effectively. In this article, a Trajectory Evaluator by Sub-tracks (TES) for detecting VOT-based anomalous trajectories is proposed. A Feature of Anomalousness is defined and used as the eigenvector of a classifier to filter tracklet anomalous trajectories and identity-switch anomalous trajectories; it comprises a Feature of Anomalous Pose and a Feature of Anomalous Sub-tracks (FAS). In comparative experiments, TES achieves better results across different scenes than state-of-the-art methods. Moreover, FAS performs better than point flow, least-squares fitting, and Chebyshev polynomial fitting. These results verify that TES is more accurate and effective and is conducive to sub-track trajectory data analysis.


StatAnalytica

Top 99+ Trending Statistics Research Topics for Students


As a statistics student, finding the best statistics research topic can be quite challenging. But not anymore: find the best statistics research topics below!

Statistics is a tough subject because it involves lots of formulas and equations, so students need to spend time understanding these concepts. And when it comes to finding the best research topic for a statistics project, students are often looking for someone to help them.

In this blog, we will share with you the most interesting and trending statistics research topics of 2023. They will not just help you stand out in your class but also help you explore more about the world.

If you face any problem regarding statistics, then don’t worry. You can get the best statistics assignment help from one of our experts.

As you know, it is always best to work on topics that interest you. That is why we have collected the most interesting research topics for college and high school students. Here in this blog post, we share a list of 99+ great statistics research topics.

Why Do We Need to Have Good Statistics Research Topics?


Having a good research topic will not just help you score good grades; it will also allow you to finish your project quickly. Whenever we work on something interesting, our productivity automatically improves, so you can achieve the best results with minimal time and effort.

What Are Some Interesting Research Topics?

Which statistics research topics count as interesting varies from student to student, but here are some key topics that appeal to almost every student:

  • Literacy rate in a city.
  • Abortion and pregnancy rate in the USA.
  • Eating disorders in the citizens.
  • Parents’ role in a student’s self-esteem and confidence.
  • Uses of AI, from daily life to business corporations.

Top 99+ Trending Statistics Research Topics For 2023

Here in this section, we will tell you more than 99 trending statistics research topics:

Sports Statistics Research Topics

  • Statistical analysis of leg and head injuries in football.
  • Statistical analysis of shoulder and knee injuries in MotoGP.
  • A deep statistical evaluation of doping tests in sports over the past decade.
  • Statistical observations on the performance of athletes in the last Olympics.
  • The role and effect of sports in a student’s life.

Psychology Research Topics for Statistics

  • A deep statistical analysis of the effect of obesity on the mental health of high school and college students.
  • A statistical evaluation of the reasons for suicide among students and adults.
  • Statistical analysis of the effect of divorce on children in a country.
  • The psychological effects of the gender gap on women in specific areas of a country.
  • Statistical analysis of the causes of online bullying in students’ lives.
  • PTSD and descriptive tendencies in psychology.
  • The function of researchers in statistical testing and probability.
  • Acceptable significance and probability thresholds in clinical psychology.
  • The utilization of hypotheses and the role of p < 0.05 for improved comprehension.
  • What types of statistical data are typically rejected in psychology?
  • The application of basic statistical principles and reasoning in psychological analysis.
  • The role of correlation when several psychological concepts are at risk.
  • Using real case study learning and modeling to generate statistical reports.
  • Naturalistic observation as a research sample in psychology.
  • How should descriptive statistics be used to represent behavioral data sets?

Applied Statistics Research Topics

  • Does education have a deep impact on the financial success of an individual?
  • Is investment in digital technology yielding a meaningful return for corporations?
  • The gap in financial wealth between rich and poor in the USA.
  • A statistical approach to identifying the effects of high-frequency trading in financial markets.
  • Statistical analysis to determine the impact of multi-agent models in financial markets.

Personalized Medicine Statistics Research Topics

  • Statistical analysis of the effect of methamphetamine on substance abusers.
  • Deep research on the impact of COVID-19 vaccines on the Omicron variant.
  • Finding the best cancer treatment approach: orthodox therapies vs. alternative therapies.
  • Statistical analysis to identify the role of genes in a child’s overall immunity.
  • What factors help patients survive the coronavirus?

Experimental Design Statistics Research Topics

  • Public vs. private education: which is better for students and has the better financial return?
  • Psychology vs. physiology: which makes it harder for a person to quit an addiction?
  • The effect of breast milk vs. packaged milk on an infant’s overall development.
  • Who causes more accidents: male alcoholics or female alcoholics?
  • What causes students, in most cases, not to tell their parents about cyberbullying?

Easy Statistics Research Topics

  • Application of statistics in the world of data science
  • Statistics for finance: how statistics helps companies grow their finances.
  • Advantages and disadvantages of radar charts.
  • Child marriages in Southeast Asian and African countries.
  • Discussion of ANOVA and correlation.
  • What statistical methods are most effective for active sports?
  • When measuring the correctness of college tests, a ranking statistical approach is used.
  • Statistics play an important role in Data Mining operations.
  • The practical application of heat estimation in engineering fields.
  • In the field of speech recognition, statistical analysis is used.
  • Estimating probiotics: how much time is necessary for an accurate statistical sample?
  • How will the United States population grow in the next twenty years?
  • The legislation and statistical reports deal with contentious issues.
  • The application of empirical entropy approaches with online grammar checking.
  • Transparency in statistical methodology and the reporting system of the United States Census Bureau.

Statistical Research Topics for High School

  • Uses of statistics in chemometrics
  • Statistics in business analytics and business intelligence
  • Importance of statistics in physics.
  • Deep discussion about multivariate statistics
  • Uses of Statistics in machine learning

Survey Topics for Statistics

  • Gather the data of the most qualified professionals in a specific area.
  • Survey the time students spend watching TV or Netflix.
  • Survey the fully vaccinated population of the USA.
  • Gather information on the effect of a government survey on the life of citizens
  • Survey to identify the English speakers in the world.

Statistics Research Paper Topics for Graduates

  • A deep discussion of Bayes’ theorem
  • Discuss the Bayesian hierarchical models
  • Analysis of the process of Japanese restaurants. 
  • Deep analysis of Lévy’s continuity theorem
  • Analysis of the principle of maximum entropy

AP Statistics Topics

  • Discuss the importance of econometrics
  • Analyze the pros and cons of the probit model
  • Types of probability models and their uses
  • A deep discussion of orthostochastic matrices
  • Ways to compute an adjacency matrix quickly

Good Statistics Research Topics 

  • National income and the regulation of cryptocurrency.
  • The benefits and drawbacks of regression analysis.
  • How can estimation methods be used to correct statistical differences?
  • Mathematical prediction models vs observation tactics.
  • In sociology research, there is bias in quantitative data analysis.
  • Inferential analytical approaches vs. descriptive statistics.
  • How reliable are AI-based methods in statistical analysis?
  • Internet news reporting and its fluctuations: statistical reports.
  • The importance of estimation in modeled statistics and artificial sampling.

Business Statistics Topics

  • Role of statistics in business in 2023
  • Importance of business statistics and analytics
  • What is the role of central tendency and dispersion in statistics?
  • Best process of sampling business data.
  • Importance of statistics in big data.
  • The characteristics of business data sampling: the benefits and drawbacks of software solutions.
  • How may two different business tasks be tackled concurrently using linear regression analysis?
  • In economic data relations, index numbers, random probability, and correctness are all important.
  • The advantages of a dataset approach to statistics in programming statistics.
  • Commercial statistics: how should the data be prepared for maximum accuracy?

Statistical Research Topics for College Students

  • Evaluate John Tukey’s contributions to statistics.
  • The role of statistics in improving ADHD treatment.
  • The uses and timeline of probability in statistics.
  • A deep analysis of Gertrude Cox’s experimental designs in statistics.
  • Discuss Florence Nightingale’s contributions to statistics.
  • What sorts of music do college students prefer?
  • The Main Effect of Different Subjects on Student Performance.
  • The Importance of Analytics in Statistics Research.
  • The Influence of a Better Student in Class.
  • Do extracurricular activities help in the transformation of personalities?
  • Backbenchers’ Impact on Class Performance.
  • Medication’s Importance in Class Performance.
  • Are e-books better than traditional books?
  • Choosing aspects of a subject in college

How To Write Good Statistics Research Topics?

So, the main question that arises here is how you can write good statistics research topics. The trick is to understand the methodology used to collect and interpret statistical data. If you are trying to pick a topic for your statistics project, think it through before going any further.

Doing so will teach you which data types will be researched and ensure the sample is chosen correctly. A basic outline for choosing the correct topic and structuring the paper is as follows:

  • Introduction of the problem
  • Explanation and choice of methodology
  • The statistical research itself (body)
  • Sample deviations and variables
  • Statistical interpretation (conclusion)

Note:   Always include the sources from which you obtained the statistics data.

Top 3 Tips to Choose Good Statistics Research Topics

It can be quite easy for some students to pick a good statistics research topic without the help of an essay writer, but we know that this is not the case for every student. That is why we mention below some of the best tips to help you choose good statistics research topics for your next project. Whether you are in a hurry or have plenty of time to explore, these tips will help you in every scenario.

1. Narrow down your research topic

We all start with many topics when we are not sure about our specific interests or niche. The first step in picking a good research topic, for college or school students alike, is to narrow it down.

To do this, first categorize the subject matter and then pick a specific category that matches your interest. After that, brainstorm the topic’s content and how you can make the points catchy, focused, directional, clear, and specific.

2. Choose a topic that gives you curiosity

After categorizing the statistics research topics, it is time to pick one from a category. Don’t pick the most common topic, because it will help neither your grades nor your knowledge. Instead, choose one about which you have little information or which you are keen to explore.

In a statistics research paper, you can always explore something beyond your coursework. Doing so will keep you energized while working on the project, and you will be glad to gain information you wanted but never had the chance to pursue.

It will also make your professor happy to see your work, and ultimately it will have a positive effect on your grades.

3. Choose a manageable topic

Once you have decided on a topic, make sure it is manageable. If you pick a deep statistics research topic with a massive amount of information, your limited time and resources may not be enough to complete the project.

You would then struggle at the last moment and most probably not finish on time. Therefore, spend enough time exploring the topic and get a good idea of the time and resources the project will require.

Statistics research topics are massive in number, because statistical operations can be performed on anything from our psychology to our fitness, so there are always more topics to explore. If you find the choice challenging, you can take the help of our statistics experts . They will help you pick the most interesting and trending statistics research topics for your projects.

With this help, you can save your precious time and invest it in something else. You can also come up with a plethora of topics of your choice, and we will help you pick the best one among them. Apart from that, if you are working on a project and are not sure whether the topic excites you, we can also help you clear all your doubts about the statistics research topic.

Frequently Asked Questions

Q1. What are some good topics for a statistics project?

Have a look at some good topics for statistics projects: 1. Research the average height and physique of basketball players. 2. Birth and death rates in a specific city or country. 3. A study of obesity rates among children and adults in the USA. 4. The growth rate of China over the past few years. 5. Major causes of injury in football.

Q2. What are the topics in statistics?

Statistics has lots of topics, and it is hard to cover all of them in a short answer. But here are the major ones: conditional probability, variance, random variables, probability distributions, common discrete distributions, and many more.

Q3. What are the top 10 research topics?

Here are the top 10 research topics that you can try in 2023:

1. Plant Science 2. Mental health 3. Nutritional Immunology 4. Mood disorders 5. Aging brains 6. Infectious disease 7. Music therapy 8. Political misinformation 9. Canine Connection 10. Sustainable agriculture



Computer Science > Computer Vision and Pattern Recognition

Title: Enhanced Visual Question Answering: A Comparative Analysis and Textual Feature Extraction via Convolutions

Abstract: Visual Question Answering (VQA) has emerged as a highly engaging field in recent years, attracting increasing research efforts aiming to enhance VQA accuracy through the deployment of advanced models such as Transformers. Despite this growing interest, there has been limited exploration into the comparative analysis and impact of textual modalities within VQA, particularly in terms of model complexity and its effect on performance. In this work, we conduct a comprehensive comparison between complex textual models that leverage long dependency mechanisms and simpler models focusing on local textual features within a well-established VQA framework. Our findings reveal that employing complex textual encoders is not invariably the optimal approach for the VQA-v2 dataset. Motivated by this insight, we introduce an improved model, ConvGRU, which incorporates convolutional layers to enhance the representation of question text. Tested on the VQA-v2 dataset, ConvGRU achieves better performance without substantially increasing parameter complexity.


National Center for Science and Engineering Statistics

Working Paper


Assessment of the FY 2016 Survey of Nonprofit Research Activities to Determine Whether Data Meet Current Statistical Standards for Publication

Working papers are intended to report exploratory results of research and analysis undertaken by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF). Any opinions, findings, conclusions, or recommendations expressed in this working paper do not necessarily reflect the views of NSF. This working paper has been released to inform interested parties of ongoing research or activities and to encourage further discussion of the topic.

This working paper describes an assessment of the data in the FY 2016 Survey of Nonprofit Research Activities to identify estimates that would meet the NCSES quality criteria for official statistics. Please see the corresponding InfoBrief (https://ncses.nsf.gov/pubs/nsf22337/) and data tables (https://ncses.nsf.gov/pubs/nsf22338/) for the estimates that meet the criteria for NCSES official statistics.

The Survey of Nonprofit Research Activities (NPRA Survey) collects information on activities related to research and development that are performed or funded by nonprofits in the United States. The NPRA Survey is part of the data collection portfolio directed by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF). The FY 2016 NPRA Survey was conducted in 2018 with a sample of nonprofit organizations in the United States. The overall response rate was 48% unweighted and 61% weighted. Due to the low response rate, particularly for certain subgroups such as hospitals (35% unweighted and 45% weighted), not all of the NPRA Survey data met NCSES’s criteria for official statistics. NCSES decided to undertake additional assessment to determine the subset of NPRA data that would meet the current NCSES statistical standards required for official release. This working paper identifies which data from the FY 2016 NPRA Survey met NCSES’s statistical standards. It summarizes the steps taken to conduct this additional assessment and includes detailed information on the data quality comparisons.

Introduction

The National Center for Science and Engineering Statistics (NCSES) conducted the Survey of Nonprofit Research Activities (NPRA Survey) to collect information on activities related to research and development that are performed or funded by nonprofits in the United States. (An organization is considered a nonprofit if it is categorized by the Internal Revenue Service as a 501(c)(3) public charity, a 501(c)(3) private foundation, or another exempt organization, e.g., a 501(c)(4), 501(c)(5), or 501(c)(6).) A pilot survey that collected FY 2015 data was conducted from September 2016 through February 2017; the pilot survey data were provided by the respondents for testing purposes only and were not published. A full implementation of the survey was conducted in 2018 and collected FY 2016 data.

The survey obtained a 48% unweighted response rate overall (61% weighted). However, response rates varied across groups, with the lowest response rate from hospitals (35% unweighted and 45% weighted). Due to the high nonresponse rate, not all of the NPRA Survey data meet NCSES’s criteria for official statistics as outlined in the NCSES statistical standards for information products (released in September 2020). At the conclusion of the FY 2016 survey, NCSES decided to proceed with discussing the results in a working paper (Britt R, Jankowski J; NCSES. 2021. FY 2016 Nonprofit Research Activities Survey: Summary of Methodology, Assessment of Quality, and Synopsis of Results. Working Paper NCSES 21-202. Alexandria, VA: National Science Foundation, National Center for Science and Engineering Statistics. Available at https://www.nsf.gov/statistics/2021/ncses21202/), which gave caveats that not all data met the criteria for official statistics. At the same time, NCSES decided to undertake additional assessment to determine the subset of NPRA Survey data that would meet the current NCSES statistical standards required for official release. This document reflects the results of that additional assessment.

The data quality of the NPRA Survey was assessed based on the Federal Committee on Statistical Methodology (FCSM) Framework for Data Quality (Federal Committee on Statistical Methodology. 2020. A Framework for Data Quality. FCSM 20-04. Available at https://nces.ed.gov/fcsm/pdf/FCSM.20.04_A_Framework_for_Data_Quality.pdf). The framework states, “Data quality is the degree to which data capture the desired information using appropriate methodology in a manner that sustains public trust.” Therefore, NCSES’s assessment of the NPRA Survey data is guided by the three FCSM data quality dimensions: utility, objectivity, and integrity.

Utility refers to the extent to which information is well-targeted to identified and anticipated needs; it reflects the usefulness of the information to the intended users. Objectivity refers to whether information is accurate, reliable, and unbiased, and is presented in an accurate, clear and interpretable, and unbiased manner. Integrity refers to the maintenance of rigorous scientific standards and the protection of information from manipulation or influence as well as unauthorized access or revision. (p. 3)

The nonprofit sector is one of four major sectors of the economy (i.e., business, government, higher education, and nonprofit organizations) that perform or fund R&D. (Here, the nonprofit sector includes nonprofit organizations other than government or academia; R&D performed by nonprofits that receive federal funds is reported in the Survey of Federal Funds for Research and Development, and R&D performed by higher education nonprofits is reported in the Higher Education Research and Development Survey.) Historically, NCSES has combined nonprofit sector data with data from the other three sectors to estimate total national R&D expenditures, which are presented in the annual report National Patterns of R&D Resources . The other three sectors are surveyed annually; however, prior to fielding the pilot NPRA Survey, NCSES had last collected R&D data from nonprofit organizations in 1997. That mail survey was based on a sample of 1,131 nonprofit organizations that were prescreened as performing or funding R&D worth at least $250,000 in 1996. Since the 1997 survey, the National Patterns of R&D Resources has relied on statistical modeling based on the results of the 1996–97 Survey of Research and Development Funding and Performance by Nonprofit Organizations, supplemented by information from the Survey of Federal Funds for Research and Development, to continue estimation of the nonprofit sector’s R&D expenditures.

The primary objective of the NPRA Survey is to fill in data gaps in the National Patterns of R&D Resources in such a way that the data are compatible with the data collected on other sectors of the U.S. economy and are appropriate for international comparisons. The results of the FY 2016 NPRA Survey provide the first estimates of nonprofit R&D activity in the United States since the late 1990s, as well as a better understanding of the scope and nature of R&D in the nonprofit sector.

From a Framework for Data Quality utility perspective, specifically the relevance and timeliness dimensions, the NPRA Survey will improve NCSES’s estimate of total R&D from the nonprofit sector for publication in the National Patterns of R&D Resources . Conducted in 2018, with a FY 2016 reference year, the NPRA Survey provides more current data than the existing source (1997 with annual adjustments). Moreover, the growth of the nonprofit sector highlights the relevancy of these data to accurately measure the share of R&D from the nonprofit sector.

Summary of NPRA Survey Methodology

A sample of nonprofit organizations was selected from the Internal Revenue Service (IRS) Exempt Organizations Business Master File Extract. Organizations filing Form 990 (public charities) and Form 990-PF (private foundations) were eligible. Organizations were stratified based on their estimated likelihood of performing or funding research. The stratification included a set of organizations “known” to perform or fund research since they were identified as performers or funders in auxiliary data sources, including past survey data from NCSES (2010–13 Survey of Federal Science and Engineering Support to Universities, Colleges, and Nonprofit Institutions, 2009 Survey of Science and Engineering Research Facilities, and 1996 and 1997 Survey of Research and Development Funding and Performance by Nonprofit Organizations), association and society memberships (Association of Independent Research Institutes, Consortium of Social Science Associations, Science Philanthropy Alliance, Health Research Alliance), and other sources (affiliates of the Higher Education Research and Development Survey, Grant Station Funder Database, and sources discovered through cognitive interviews). These organizations were selected with certainty.

The NPRA Survey staff attempted to contact U.S.-based nonprofit organizations using a two-phase approach to obtain information about whether the organization performed or funded research. (Throughout this document, the term “research” is synonymous with “research and development” or “R&D.”) Some organizations’ performer and funder status were known based on auxiliary data sources (see previous paragraph). Organizations in the sample with an unknown performer or funder status were sent a screener card in phase 1, which began in February 2018.

Phase 2, which began in April 2018, included the phase 1 organizations that reported either performing or funding R&D during phase 1 (including those that did not respond) and organizations with a known performer or funder status through sources external to the survey. Performer and funder status (i.e., whether an organization performed research, funded research, or both in FY 2016) was then confirmed in phase 2, and organizations that either performed or funded research were asked to complete additional questions about their research activities. Performer or funder status could have been obtained in either data collection phase 1 or phase 2, but research activity questionnaires were only completed in phase 2.

Overall response rates were calculated by using best practices established by the American Association for Public Opinion Research (https://www.aapor.org/Standards-Ethics/Standard-Definitions-(1).aspx). Organizations that reported no R&D activity in phase 1 or phase 2 were considered complete surveys and were included in the numerator of response rate calculations. Organizations that reported performing or funding R&D in phase 1 or phase 2 and completed some or all of the full questionnaire, with a minimum of the total amounts answered (Q9 and Q16), were considered complete or partial surveys and were included in the numerator of response rate calculations. Organizations that reported performing or funding in phase 1 but did not complete the phase 2 questionnaire were not included in the numerator of response rate calculations.

Imputation was conducted for organizations that reported (in phase 1 or phase 2) that they performed or funded research but did not provide the amounts spent in the full questionnaire; these organizations were considered nonrespondents. The imputation included substituted values using auxiliary data about the amounts spent performing or funding research (including information reported in annual reports and IRS filings), information from the pilot survey, and model-based imputations. The imputed amount represented about 30% of the overall total amount of R&D performance, with 20% from auxiliary data and the pilot and 10% from the imputation model. Nonresponse weights were used to account for organizations that did not respond about their performing or funding status in either phase 1 or phase 2; these organizations were also considered nonrespondents. After reviewing nonresponse adjustment alternatives using total expenses and total organizations, the nonresponse adjustment was ultimately based on a ratio estimator using total expenses, given its correlation with the survey outcomes of total R&D performance (0.49) and funding (0.27).
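As a schematic illustration only, not NCSES's production weighting code, a ratio-type nonresponse adjustment that uses total expenses as the auxiliary variable could be computed as follows (column names are assumptions):

```python
# Sketch: scale respondents' base weights so the weighted total of the
# auxiliary variable (total expenses) matches the full eligible sample.
import pandas as pd

def ratio_adjusted_weights(df: pd.DataFrame) -> pd.Series:
    # Assumed columns: 'base_weight', 'total_expenses', 'responded' (bool).
    total_all = (df["base_weight"] * df["total_expenses"]).sum()
    resp = df["responded"]
    total_resp = (df.loc[resp, "base_weight"] * df.loc[resp, "total_expenses"]).sum()
    factor = total_all / total_resp            # ratio adjustment factor
    adjusted = pd.Series(0.0, index=df.index)  # nonrespondents get zero weight
    adjusted[resp] = df.loc[resp, "base_weight"] * factor
    return adjusted
```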

NPRA Survey Adherence to NCSES Statistical Standards

NCSES has a set of statistical standards for the release of “official statistics,” specifically

Standard 9.2: The statistical quality of official statistics must undergo rigorous program review and statistical review and must be approved by the chief statistician for release.

Guideline 9.2a: The reliability of official statistics must meet the following quality criteria:

  • Top line estimates have a coefficient of variation (CV) < 5%.
  • The estimated CV for the majority of the key estimates is ≤ 30%.

Guideline 9.2b: The indicators of accuracy of official statistics must meet the following quality criteria:

  • Unit response rates >60%.
  • Item response rates >70%.
  • Coverage ratios for population groups associated with key estimates are >70%.
  • Above thresholds may not apply if nonresponse bias analyses are at an acceptable level.

The NCSES standards focus on aspects of accuracy, including response rates, data missingness, and frame coverage, as well as precision. These elements align with the accuracy and reliability dimension of the objectivity domain. Demonstrating that the NPRA Survey produces accurate and reliable estimates of total R&D performing and funding is the primary focus of this assessment. Therefore, we summarize the metrics as compared with the standards in Table 1 and follow with a more detailed description for each standard.

Table 1: FY 2016 NPRA Survey adherence to standards

CV = coefficient of variation; NPRA Survey = Survey of Nonprofit Research Activities.

a Does not include imputation variance. Amount of performance and funding was imputed for organizations that confirmed that they fund or perform research but did not provide the amount. No imputation was conducted for total expenses, performance status, or funding status.
b Includes imputations using substitutions based on auxiliary data.


National Center for Science and Engineering Statistics, Survey of Nonprofit Research Activities, FY 2016.

CV. The total number of responding organizations is 3,254. When evaluating total expenses, the CV is 0.5%, well below the 5% CV standard. However, the sample was designed to optimize total expenses, so other estimates are expected to have higher variability. The CVs for the proportion of performing organizations and the proportion of funding organizations are both 12%. The high CVs are largely due to the low proportions of nonprofits that reported performing and funding R&D, 6% (±1.4%) and 4% (±1.0%), respectively. However, the confidence intervals are small for both of these estimates. The CVs for the mean performance and mean funding are both 8%. These CVs, based on the subset of organizations that indicated they perform or fund research, meet the CV standard of 30% for the majority of key estimates but not the CV standard of 5% for top line estimates. Because imputation is a separate criterion, these CV estimates are based only on sampling variance. Table 2 includes additional CVs for total performance and funding amounts by source, R&D type, and field, as well as the mean number of full-time equivalents. All but nine of the estimates meet the standard of a 30% CV for key estimates.
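As a quick plausibility check, the margins of error above can be converted to standard errors and divided by the estimates. This assumes the ±1.4% and ±1.0% figures are 95% margins of error, which the text does not state explicitly.

```python
# Back-of-the-envelope check of the reported CVs, assuming the margins of
# error above are 95% margins (an assumption; the confidence level is not
# stated in the text).
z = 1.96
for label, p, moe in [("performing", 0.06, 0.014), ("funding", 0.04, 0.010)]:
    se = moe / z
    print(f"{label}: SE = {se:.4f}, CV = {se / p:.0%}")
# performing: SE = 0.0071, CV = 12%
# funding:    SE = 0.0051, CV = 13%  (vs. the reported 12%; the difference
#                                     is consistent with rounding)
```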

Total amount spent on R&D by nonprofits that performed or funded R&D and coefficient of variation of FY 2016 NPRA Survey, by source, R&D type, and field

NPRA Survey = Survey of Nonprofit Research Activities.

a Does not include imputation variance.

Details for full-time equivalent do not add to total because of missing data.

Coverage ratio. Based on analysis conducted during sampling, 85% of performers and 88% of funders are estimated to be covered by the frame, well above the NCSES threshold of 70% coverage. The frame coverage estimates are based on an analysis of the percentage of "likely" performers and "likely" funders on the frame after exclusions based on type of organization (e.g., educational institutions or churches) and after size truncation based on total expenses for public charities or total assets for private foundations. The exclusions and truncation were conducted to increase the efficiency of finding R&D performers and funders. Because the exclusions focused on organization types that are unlikely to perform or fund R&D or that are small in terms of overall expenses or assets, the undercoverage most likely has little impact on the estimated totals of R&D performance and funding.

Unit response rate. The unweighted response rate (48%) falls short of the 60% standard. The weighted response rate (61%), which uses the sampling weights for all 6,373 sample organizations (6,071 identified as eligible), does meet the standard. The difference between the unweighted and weighted response rates reflects the fact that the average weight for responders (mean weight = 23.6) is higher than that for nonresponders (14.2). Unweighted and weighted response rates provide distinct measures of survey quality. Unweighted response rates are an indicator of success with data collection operations. Weighted response rates are most useful when the sampling fractions vary across strata in a sample design and when the interest is population-level inference about the amount of information captured from the population sampled.

Both unweighted and weighted response rates are important metrics for evaluating the quality of the data. Although the weighted response rate meets the standard, the unweighted response rate does not. Therefore, additional analysis measuring the impact of nonresponse is required.
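The two rates can be computed side by side, as in the sketch below; the closing comment checks that the reported mean weights are consistent with the 48% and 61% figures.

```python
import numpy as np

def response_rates(weights: np.ndarray, responded: np.ndarray) -> tuple[float, float]:
    """Unweighted rate: share of eligible cases that responded.
    Weighted rate: share of the weighted population those cases represent."""
    unweighted = responded.mean()
    weighted = weights[responded].sum() / weights.sum()
    return unweighted, weighted

# Consistency check using the figures above: a 48% unweighted rate with mean
# weights of 23.6 (responders) and 14.2 (nonresponders) implies a weighted
# rate of 0.48*23.6 / (0.48*23.6 + 0.52*14.2) ≈ 0.61, matching the reported 61%.
```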

Item response rate. The imputation rates for performing status (28%) and funding status (17%) are consistent with the minimum threshold for item response, as is the imputation rate for funding amount (26%). However, the performing amount falls just below the standard, with 32% of responses imputed (289 of 912), an item response rate of 68%.

Nonresponse bias. The response rate and item response rate evaluations both include metrics that do not meet the NCSES quality standards. The standards state that the nonresponse thresholds may be exceeded if the estimated nonresponse bias is acceptable. The nonresponse bias of a respondent mean is a function of the weighted response rate and the difference between the respondent and nonrespondent means:

$$\mathrm{bias}(\bar{y}_r) = (1 - RR_w)(\bar{y}_r - \bar{y}_{nr}),$$

where $RR_w$ is the weighted response rate, $\bar{y}_r$ is the weighted respondent mean, and $\bar{y}_{nr}$ is the weighted nonrespondent mean; that is, the nonresponse rate (1 minus the response rate) multiplied by the difference in means between respondents and nonrespondents.
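A direct implementation of this expression, with a consistency check against the expense figures discussed next (treating the means as $ millions, an assumed unit), might look as follows.

```python
def nonresponse_bias(rr_w: float, resp_mean: float, nonresp_mean: float) -> float:
    """bias(y_bar_r) = (1 - RR_w) * (y_bar_r - y_bar_nr)."""
    return (1.0 - rr_w) * (resp_mean - nonresp_mean)

# Consistency check with the expense figures reported below (units assumed
# to be $ millions): a respondent mean of 9, an overall mean of 13, and
# RR_w = 0.61 imply a nonrespondent mean of 9 + 4/(1 - 0.61) ≈ 19.3.
print(nonresponse_bias(0.61, 9.0, 9.0 + 4 / (1 - 0.61)))   # ≈ -4.0
```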

The estimated nonresponse bias for total expenses was a mean bias of −4 (overall mean = 13, respondent mean = 9), a relative bias of −0.451 (table 3). This result suggests that organizations with higher expenses were less likely to respond than those with lower expenses. This is confirmed by the response rates for deciles based on total expenses presented in table 4. The lowest response rates occur in deciles 9 and 10, the organizations with the highest expenses. Over 50% of the organizations in these deciles are hospitals, the organization type with the lowest response rate. When organizations in deciles 9 and 10 are removed, the relative bias decreases to −0.112 (data not shown), demonstrating that the largest organizations contribute the majority of the bias in the total.

Estimated bias of expenses, assets, and revenues in the FY 2016 NPRA Survey

a The sample analyzed in this table excludes 302 cases identified as being out of sample. In addition, 31 organizations had missing expenses and revenues; 32 had missing assets.

The imputation and nonresponse adjustments were informed by the nonresponse bias analysis. The imputations were prioritized based on previous data from the pilot survey as well as administrative data for the largest organizations. These substitutions reduced the relative bias of total expenses to −0.336. Further, the imputation model predicting the amount of performance and funding based on total expenses reduced the nonresponse bias for total expenses to −0.169.

To address the higher nonresponse among hospitals and larger organizations, the nonresponse adjustment cells included hospitals and a flag identifying the largest 500 organizations in terms of expenses. The nonresponse adjustment was based on a ratio adjustment using total expenses. Therefore, the bias is 0 when evaluating weighted nonresponse bias in terms of total expenses. Further, if successful, the nonresponse adjustment based on expenses will also reduce the nonresponse bias for correlated measures such as assets and revenues. The relative biases in assets (−0.403) and revenues (−0.414) are similarly high based on the full and partially completed surveys. After weighting, the nonresponse bias is reduced to +0.046 for total assets and +0.014 for total revenues.
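A simplified sketch of a cell-level ratio adjustment of this kind is shown below. The actual adjustment cells, as described above, incorporated a hospital indicator and a flag for the 500 largest organizations; this illustration takes the cell labels as given.

```python
import numpy as np

def ratio_nonresponse_adjust(weights, expenses, responded, cells):
    """Within each adjustment cell, scale respondent weights so the weighted
    respondent total of expenses reproduces the weighted total over all
    eligible cases; the post-adjustment bias of total expenses is then zero
    by construction."""
    weights = np.asarray(weights, dtype=float)
    expenses = np.asarray(expenses, dtype=float)
    responded = np.asarray(responded, dtype=bool)
    cells = np.asarray(cells)
    adjusted = weights.copy()
    for cell in np.unique(cells):
        in_cell = cells == cell
        resp = in_cell & responded
        factor = (weights[in_cell] * expenses[in_cell]).sum() / (
            weights[resp] * expenses[resp]
        ).sum()
        adjusted[resp] *= factor
    adjusted[~responded] = 0.0   # nonrespondents receive zero weight
    return adjusted
```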

Because hospitals and health organizations tended to have lower response rates, an indicator distinguishing hospitals from non-hospitals was included when defining the nonresponse adjustment cells. The result was low relative bias for health and medical organizations when measuring total expenses (−0.04), total assets (−0.03), and total revenues (−0.04) (data not shown).

In summary, although the nonresponse bias is high, the information available on the nonrespondents allows for imputation and nonresponse adjustments that address the observed biases. Assuming R&D performance and funding expenditures are correlated with the size of the organization as measured by expenses, the risk of nonresponse bias is reduced. This is the case when measuring assets and revenues, which were not used in the weighting but are both correlated with expenses. We cannot directly measure the reduction in bias for total performing and total funding because we do not have those data for nonrespondents. However, we can evaluate the correlation of expenses with R&D performance and funding using the survey respondents. Table 4 provides the mean R&D performance and R&D funding for each expense decile; both increase as the deciles of total expenses increase. The log-log imputation models used for the NPRA Survey are further evidence of a relationship between expenses and R&D performance and funding: total expenses was a significant predictor in both models, with coefficients of 0.54 (standard error = 0.08) for total performing and 0.47 (standard error = 0.09) for total funding.
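A log-log model of the kind referenced above can be sketched as follows. The function and variable names are illustrative, and the retransformation caveat is an added note rather than a documented feature of the survey's model.

```python
import numpy as np
import statsmodels.api as sm

def fit_log_log(expenses: np.ndarray, rd_amount: np.ndarray):
    """Regress log(R&D amount) on log(total expenses) among respondents,
    as in the log-log imputation models described above."""
    X = sm.add_constant(np.log(expenses))
    return sm.OLS(np.log(rd_amount), X).fit()

def predict_amount(model, expenses: np.ndarray) -> np.ndarray:
    # Naive back-transform from the log scale; a production imputation would
    # also correct for retransformation bias (e.g., a smearing estimator).
    return np.exp(model.predict(sm.add_constant(np.log(expenses))))
```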

Response rate of nonprofits to the FY 2016 NPRA Survey, by expense decile, and percentage of hospitals in each expense decile

Decile 1 contains the organizations with the lowest total expenses, and decile 10 contains those with the highest.

Recommendations for Data Tables to Publish as Official Statistics

The final domain of the Framework for Data Quality is the integrity domain, which focuses on the data producer and the unbiased development of data that instills confidence in its accuracy. This assessment and the rigorous review of these data before publication speak directly to the scientific integrity and credibility dimensions:

Scientific integrity refers to an environment that ensures adherence to scientific standards, use of established scientific methods to produce and disseminate objective data products, and protection of these products from inappropriate political influence.

Credibility characterizes the confidence that users place in data products based simply on the qualifications and past performance of the data producer.

The NPRA Survey was designed to provide timely and relevant data to meet a long-standing need for information on R&D expenditures within the nonprofit sector. As noted in this report, low response rates called into question the quality of the data. The quality assessment included a thorough review of the error sources, including sampling error, response rates, frame coverage error, item nonresponse, and nonresponse bias. Further, the assessment reviewed the post-survey adjustments (e.g., weighting and imputation) designed to mitigate the risk of bias. This process of additional assessment identified which data from the NPRA Survey meet the NCSES statistical standards and illuminated the aspects of the data that should not be published.

Based on this additional assessment, we recommend publishing a set of high-level summary data tables, shown in appendix A, as NCSES official statistics. The tables cover all categories in table 2 except field, due to the high CVs for several fields. Variance estimates are based on successive difference replication using 80 replicates. The CVs in the appendix tables include both the sampling variance and imputation variance. All estimates are under the CV standard of 30%.
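For readers unfamiliar with successive difference replication, the CV computation from a set of replicate estimates can be sketched as below. This assumes the standard Fay-Train successive difference variance form; construction of the replicate weights themselves is beyond the scope of this sketch.

```python
import numpy as np

def sdr_cv(theta_full: float, theta_reps: np.ndarray) -> float:
    """CV from successive difference replication:
    V = (4/R) * sum_r (theta_r - theta_full)^2, here with R = 80 replicates,
    and CV = sqrt(V) / estimate."""
    variance = 4.0 / theta_reps.size * np.sum((theta_reps - theta_full) ** 2)
    return np.sqrt(variance) / abs(theta_full)
```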

As part of the post-survey evaluation, we assessed the stability of the variance estimates by comparing estimates based on 80 replicates with estimates based on 160 replicates. The CV for total R&D performance was 13.2% with 80 replicates and 13.6% with 160 replicates. The CV for total R&D funding was 18.0% with 80 replicates and 16.1% with 160 replicates. For the estimates in appendix A, three estimates exceeded the 30% CV standard when using 160 replicates instead of 80. These included the following:

Appendix table 1-A. Total R&D expenditures sourced from nonprofits (CV = 31%)

Appendix table 7-A. Total funds for R&D from internal funds (CV = 31%)

Appendix table 9-A. Nonfederal funds for R&D for other nonprofit organizations (CV = 34%)

All other estimates have a CV under 30%.

Specifically, this subset of FY 2016 NPRA data will be published via a Data Release InfoBrief and a set of data tables ( https://ncses.nsf.gov/pubs/nsf22337/ and https://ncses.nsf.gov/pubs/nsf22338/ ). Technical notes published with the data tables will summarize the survey methodology and the data limitations. The technical notes will reference this working paper, which provides additional survey information, including details about the initial assessment of the NPRA Survey estimates.

1 An organization is considered a nonprofit if it is categorized by the Internal Revenue Service as a 501(c)(3) public charity, a 501(c)(3) private foundation, or another exempt organization—e.g., a 501(c)(4), 501(c)(5), or 501(c)(6).

2 The pilot survey data were provided by the respondents for testing purposes only and were not published.

3 Britt R, Jankowski J; National Center for Science and Engineering Statistics (NCSES). 2021. FY 2016 Nonprofit Research Activities Survey: Summary of Methodology, Assessment of Quality, and Synopsis of Results. Working Paper NCSES 21-202. Alexandria, VA: National Science Foundation, National Center for Science and Engineering Statistics. Available at https://www.nsf.gov/statistics/2021/ncses21202/ .

4 Federal Committee on Statistical Methodology. 2020. A Framework for Data Quality. FCSM 20-04. Available at https://nces.ed.gov/fcsm/pdf/FCSM.20.04_A_Framework_for_Data_Quality.pdf .

5 The nonprofit sector includes nonprofit organizations other than government or academia. R&D performed by nonprofits that receive federal funds is reported on in the Survey of Federal Funds for Research and Development. R&D performed by higher education nonprofits is reported on in the Higher Education Research and Development Survey.

6 Throughout this document, the term “research” is synonymous with “research and development” or “R&D.”

7 For more information on lessons learned regarding data collection operations, see the Working Paper, FY 2016 Nonprofit Research Activities Survey: Summary of Methodology, Assessment of Quality, and Synopsis of Results at https://www.nsf.gov/statistics/2021/ncses21202/#chp5 .

Appendix A. Data Tables with Relative Standard Errors

Suggested citation.

Britt R, Mamon S, ZuWallack R; National Center for Science and Engineering Statistics (NCSES). 2022. Assessment of the FY 2016 Survey of Nonprofit Research Activities to Determine Whether Data Meet Current Statistical Standards for Publication . Working Paper NCSES 22-212. Alexandria, VA: National Science Foundation. Available at https://ncses.nsf.gov/pubs/ncses22212/ .

Report Authors

Ronda Britt Survey Manager, NCSES

Sherri Mamon ICF, under contract to NCSES

Randal ZuWallack ICF, under contract to NCSES

National Center for Science and Engineering Statistics Directorate for Social, Behavioral and Economic Sciences National Science Foundation 2415 Eisenhower Avenue, Suite W14200 Alexandria, VA 22314 Tel: (703) 292-8780 FIRS: (800) 877-8339 TDD: (800) 281-8749 E-mail: [email protected]

Source Data & Analysis

InfoBrief (NSF 22-337) and Data Tables (NSF 22-338)

