
  • Open access
  • Published: 13 March 2024

Evaluation metrics and statistical tests for machine learning

  • Oona Rainio   ORCID: orcid.org/0000-0002-7775-7656 1 ,
  • Jarmo Teuho   ORCID: orcid.org/0000-0001-9401-0725 1 &
  • Riku Klén   ORCID: orcid.org/0000-0002-0982-8360 1  

Scientific Reports volume 14, Article number: 6086 (2024)


  • Computer science

An Author Correction to this article was published on 08 July 2024

This article has been updated

Research on different machine learning (ML) techniques has become incredibly popular during the past few decades. However, for some researchers not familiar with statistics, it might be difficult to understand how to evaluate the performance of ML models and compare them with each other. Here, we introduce the most common evaluation metrics used for the typical supervised ML tasks including binary, multi-class, and multi-label classification, regression, image segmentation, object detection, and information retrieval. We explain how to choose a suitable statistical test for comparing models, how to obtain enough values of the metric for testing, and how to perform the test and interpret its results. We also present a few practical examples of comparing convolutional neural networks used to classify X-rays of different lung infections and to detect cancer tumors in positron emission tomography images.


Introduction

Due to advances in technology and access to huge amounts of digitized data, the number of different applications using machine learning (ML) has increased dramatically during the past few decades 1 . Whereas ML techniques initially included only statistical methods and simple algorithms 2 , ML is currently used for different purposes across the fields of engineering, medicine, public health, finance, politics, and natural sciences, both in academia and industry 3 . However, because of this immense interdisciplinary interest, some of the new ML researchers might not have a good grasp of basic statistical concepts. This prompts a need for ongoing education about the proper use of statistics and appropriate metrics for evaluating the performance of ML algorithms.

When new ML models are created, it is necessary to compare their performance to the already existing ones 4 . Evaluation serves two purposes: methods that do not perform well can be discarded, and the ones that seem promising can be further optimized. Also, especially in medicine, it is often useful to know whether an ML model outperforms an educated professional or not 5 , 6 , 7 . In supervised ML, we first divide our data into training and test sets, use the training data for training and validation of the model, predict all the instances of the test data, and compare the obtained predictions to the corresponding ground-truth values of the test set 8 . In this way, we can estimate whether the predictions of a new ML model are better than the predictions of a human or existing models in our test set.

Despite the complexity of the final applications, ML models typically consist of relatively simple sub-tasks, such as binary or multi-class classification and regression. In addition, a special image processing ML technique called a convolutional neural network (CNN) can be used to perform image segmentation 9 and object detectors are used to find desired targets in images or video footage 10 . Depending on the task in question, there are certain choices of evaluation metrics that can be used to assess the performance of supervised ML models 11 . There are also established statistical testing practices, especially for metrics used in binary classification 8 , 12 . Nonetheless, the misuse of certain well-known tests, such as the paired t-test, is common 4 , and the required assumptions of the tests are often ignored 11 .

Our aim here is to introduce the most common metrics for binary and multi-class classification, regression, image segmentation, and object detection. We explain the basics of statistical testing and what tests should be used in different situations related to supervised ML. At the end, we also give three examples about comparing the performance of CNNs for classifying X-rays related to lung infections and performing image segmentation for positron emission tomography (PET) images.

Different machine learning tasks

Binary classification

In a binary classification task, the instances of data are typically predicted to be either positive or negative so that a positive label is interpreted as presence of illness, abnormality, or some other deviation while a negative instance does not differ from the baseline in this respect. Each predicted binary label has therefore four possible designations: a true positive (TP) is a correctly predicted positive outcome, a true negative (TN) is a correctly predicted negative outcome, a false positive (FP) is a negative instance predicted to be positive, and a false negative (FN) is a positive instance predicted to be negative 13 . A confusion matrix, here a \(2\times 2\) -matrix containing the counts of TP, TN, FP, and FN observations like Table  1 , can be used to compute several metrics for the evaluation of the binary classifier.

The most commonly used evaluation metrics for binary classification are accuracy, sensitivity, specificity, and precision, which express the percentage of correctly classified instances in the set of all the instances, the truly positive instances, the truly negative instances, or the instances classified as positive, respectively. Sensitivity is commonly referred to as recall 14 . They have the formulas

\[ \mathrm{Acc.}=\frac{\textrm{TP}+\textrm{TN}}{\textrm{TP}+\textrm{TN}+\textrm{FP}+\textrm{FN}},\quad \mathrm{Sen.}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FN}},\quad \mathrm{Spe.}=\frac{\textrm{TN}}{\textrm{TN}+\textrm{FP}},\quad \mathrm{Pre.}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}}, \qquad (1) \]

where TP, TN, FP, and FN refer to the numbers of the predictions with these designations 13 , 14 , 15 , 16 . Especially in diagnostics, sensitivity or recall is also known as true positive rate 14 , specificity as true negative rate 16 , and precision as positive predictive value 17 . With the exception of accuracy, the aforementioned metrics are often used as pairs, such as precision and recall or sensitivity and specificity. It is noteworthy that sensitivity and specificity reveal more about the model than accuracy especially if the number of real positive and negative instances is very imbalanced.
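
For illustration, a minimal Python sketch of these four metrics computed from hypothetical confusion-matrix counts could look as follows:

```python
# Minimal sketch: the basic binary classification metrics from
# hypothetical confusion-matrix counts.
def binary_metrics(tp, tn, fp, fn):
    return {
        "accuracy":    (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall, true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision":   tp / (tp + fp),   # positive predictive value
    }

print(binary_metrics(tp=45, tn=40, fp=10, fn=5))   # hypothetical counts
```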

There are also several other evaluation metrics that, like accuracy, depend on all the values of the confusion matrix: Youden’s index 18 , defined as \(\mathrm{Sen.}+\mathrm{Spe.}-1\) 15 , gives an equal weight to the accuracies within the positive and the negative instances, regardless of their numbers. The F1-score, defined as

\[ \mathrm{F1}=\frac{2\cdot \mathrm{Pre.}\cdot \mathrm{Sen.}}{\mathrm{Pre.}+\mathrm{Sen.}} \]

is a harmonic mean of precision and recall 19 . Cohen’s kappa ( \(\kappa\) ), defined as

\[ \kappa =\frac{\mathrm{Acc.}-p_e}{1-p_e}, \qquad (2) \]

expresses how well the binary classifier performs compared to the randomized accuracy \(p_e\) 19 . It was originally introduced as a measurement for the degree of agreement between two observers in psychology 20 but it can be applied to measure the agreement between the predicted and the real classes. Furthermore, Matthews’ correlation coefficient (MCC), defined as

\[ \mathrm{MCC}=\frac{\textrm{TP}\cdot \textrm{TN}-\textrm{FP}\cdot \textrm{FN}}{\sqrt{(\textrm{TP}+\textrm{FP})(\textrm{TP}+\textrm{FN})(\textrm{TN}+\textrm{FP})(\textrm{TN}+\textrm{FN})}}, \qquad (3) \]

measures the correlation between the real and the predicted values of the instances 21 . This definition of MCC follows directly from that of Pearson’s correlation coefficient 22 .

To compute the values of the metrics above, the predictions of the test set by the model must be converted with some threshold if they are not already binary labels. The value of this threshold is often the default choice of 0.5 or the cut-point that gives the highest accuracy or Youden’s index for the predictions of the training set. The threshold should always be chosen based on the predictions of the training set only, because using the threshold that maximizes the accuracy of the predictions of the test set produces unrealistically good results.

However, if the numeric predictions before their conversion into binary are available, we can consider the receiver operating characteristic (ROC) curve. It is obtained by plotting sensitivity against the false positive rate (equal to 1 minus specificity) at all possible threshold values. As can be seen from Fig.  1 , a ROC curve is always a monotonically increasing function inside the unit square tied to the points (0, 0) and (1, 1), and the closer the ROC curve is to the point (0, 1), the better the predictions are 23 . The area under the ROC curve (AUC) is another possible evaluation metric with values in [0, 1] but, unlike the metrics above, its value does not depend on the choice of the threshold at all.
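
As a minimal sketch, the ROC curve and its AUC can be computed from continuous prediction scores, here with scikit-learn (a package not used elsewhere in this work) and hypothetical labels and scores:

```python
# Sketch (assumes scikit-learn is available): ROC curve and AUC from
# continuous prediction scores and the true binary labels.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])                    # hypothetical ground truth
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])   # model outputs in (0, 1)

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # sensitivity vs. 1 - specificity
auc = roc_auc_score(y_true, y_score)                # threshold-independent summary
print(auc)
```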

Figure 1

ROC curves computed from the binary predictions of a test set containing 300 chest X-rays with COVID-19 and 300 X-rays from healthy patients by the modified U-Net (in blue) and InceptionV3 (in gray), accompanied by a straight line equal to the theoretical ROC curve of a random binary classifier. The x -axis here uses specificity instead of the false positive rate but, since its values range from 1 to 0, the end result is a typical ROC plot, not its reflection. The AUC values are 0.845 for the modified U-Net and 0.821 for InceptionV3. The values of other evaluation metrics are in Table  4 .

Alternatively, if we have n predictions \(q_i\in (0,1]\) for binary labels \(p_i\in \{0,1\}\) , we can also compute their cross-entropy loss, defined as

\[ L=-\frac{1}{n}\sum ^n_{i=1}\big (p_i\log (q_i)+(1-p_i)\log (1-q_i)\big ). \]

The cross-entropy loss is often used for training ML models as its values decrease as the differences between the predictions and the real binary labels diminish 24 .

Multi-class classification

If the classification task is separating n instances between \(k\ge 3\) different classes, we can present the results of the classifier by using a \(k\times k\) confusion matrix as in Table  2 . Its element \(n_{ij}\) at the intersection of the i th row and the j th column for \(i,j=1,\ldots ,k\) is the number of instances from the i th class classified into the j th class. The evaluation of this matrix uses the same metrics that we introduced for binary classification.

Firstly, there are two simple ways to obtain the values of all the evaluation metrics except AUC introduced in the previous section. We need to create a unique \(2\times 2\) confusion matrix for each of the k classes: for the i th class, \(i=1,\ldots ,k\) ,

\[ \textrm{TP}_i=n_{ii},\qquad \textrm{FN}_i=\sum _{j\ne i}n_{ij},\qquad \textrm{FP}_i=\sum _{j\ne i}n_{ji},\qquad \textrm{TN}_i=n-\textrm{TP}_i-\textrm{FN}_i-\textrm{FP}_i, \]

where n is the total number of instances.

In a process called macro-averaging, we calculate the value of the metric separately for each class \(i=1,\ldots ,k\) by using the numbers \(\textrm{TP}_i\) , \(\textrm{TN}_i\) , \(\textrm{FN}_i\) , and \(\textrm{FP}_i\) defined as above and then consider the mean value of the k resulting values of the metric. Alternatively, in micro-averaging, we compute the value of the evaluation metric from the sums \(\sum ^k_{i=1}\textrm{TP}_i\) , \(\sum ^k_{i=1}\textrm{TN}_i\) , \(\sum ^k_{i=1}\textrm{FN}_i\) , and \(\sum ^k_{i=1}\textrm{FP}_i\) . Out of these procedures, macro-averaging gives equal weight to each class regardless of their size whereas micro-averaging gives equal weight to each instance and is therefore easily dominated by larger classes 25 . However, if each class contains equally many instances, as in the situation of Table  2 , both micro- and macro-averaging yield the same values for accuracy, sensitivity, specificity, and Youden’s index.
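
A minimal sketch of macro- and micro-averaged precision computed from a hypothetical \(3\times 3\) confusion matrix:

```python
# Sketch: macro- vs. micro-averaged precision from a hypothetical k x k
# confusion matrix whose element [i, j] counts instances of class i
# classified into class j.
import numpy as np

cm = np.array([[50,  5,  5],
               [10, 40, 10],
               [ 0, 10, 70]])

tp = np.diag(cm)
fp = cm.sum(axis=0) - tp     # predicted as class i but truly another class
fn = cm.sum(axis=1) - tp     # truly class i but predicted as another class

macro_precision = np.mean(tp / (tp + fp))           # equal weight to every class
micro_precision = tp.sum() / (tp.sum() + fp.sum())  # equal weight to every instance
print(macro_precision, micro_precision)
```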

Cohen’s \(\kappa\) and MCC also have their own definitions specially designed for multi-class classification: Cohen’s \(\kappa\) can be written as

\[ \kappa =\frac{n\sum ^k_{i=1}n_{ii}-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}}{n^2-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}}, \]

where \(n_{i\cdot }=\sum ^k_{j=1}n_{ij}\) , \(n_{\cdot i}=\sum ^k_{j=1}n_{ji}\) , and \(n=\sum ^k_{i=1}\sum ^k_{j=1}n_{ij}\) 26 . Similarly, MCC can be computed from a general \(k\times k\) confusion matrix with the formula 27

\[ \textrm{MCC}=\frac{n\sum ^k_{i=1}n_{ii}-\sum ^k_{i=1}n_{i\cdot }n_{\cdot i}}{\sqrt{n^2-\sum ^k_{i=1}n_{i\cdot }^2}\,\sqrt{n^2-\sum ^k_{i=1}n_{\cdot i}^2}}. \]

In the special case \(k=2\) , we obtain the same formulas for Cohen’s \(\kappa\) and MCC as in ( 2 ) and ( 3 ) 22 .

Multi-label classification

Multi-label classification is a generalized version of multi-class classification with nonexclusive class labels. Instead of dividing the data instances between several classes, the aim is to find all the class labels that apply out of \(k\ge 2\) possible labels. For each of the n instances, the model returns a binary vector \(y^{(i)}\) , \(i=1,\ldots ,n\) , whose j th element is 1 if the j th label is present and otherwise 0 for all \(j=1,\ldots ,k\) . A possible metric for evaluation is the Hamming loss, defined as

\[ \textrm{HL}=\frac{1}{nk}\sum ^n_{i=1}\sum ^k_{j=1}\big |x^{(i)}_j-y^{(i)}_j\big |, \]

where \(x^{(i)}_j\) is the real value of the j th element in the binary vector of the i th data instance and \(y^{(i)}_j\) is the corresponding predicted value. The smaller the Hamming loss is, the better the model is. Alternatively, we can compute for instance the micro- or macro-average accuracy, precision, or recall for the vectors \(y^{(i)}\) , \(i=1,\ldots ,n\) 28 .
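
A minimal sketch of the Hamming loss for hypothetical multi-label predictions:

```python
# Sketch: Hamming loss for hypothetical multi-label predictions, where rows are
# instances and columns are the k possible labels (1 = label present).
import numpy as np

x = np.array([[1, 0, 1, 0],     # ground-truth label vectors
              [0, 1, 0, 0],
              [1, 1, 0, 1]])
y = np.array([[1, 0, 0, 0],     # predicted label vectors
              [0, 1, 0, 1],
              [1, 1, 0, 1]])

hamming_loss = np.mean(x != y)  # fraction of label positions predicted wrongly
print(hamming_loss)             # 2 wrong positions out of 12 -> about 0.167
```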

Regression

In a regression problem, a model is used to predict instances whose values are real numbers rather than categorical. This is the case when predicting, for instance, height, stock prices, voter turnout, or rainfall amount. Here, we denote the real value of the i th instance in a test set of n instances by \(x_i\) and its predicted value by \(y_i\) for \(i=1,\ldots ,n\) .

One way to evaluate the model is to measure correlation between the real and the predicted values 12 . The most well-known method for this is Pearson’s correlation coefficient, defined as

\[ r=\frac{\sum ^n_{i=1}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum ^n_{i=1}(x_i-\overline{x})^2}\,\sqrt{\sum ^n_{i=1}(y_i-\overline{y})^2}}, \]

where \(\overline{x}\) and \(\overline{y}\) denote the mean values of the vectors \((x_1,\ldots ,x_n)\) and \((y_1,\ldots ,y_n)\) , respectively 29 . However, Pearson’s correlation coefficient is designed for measuring correlation between variables whose marginal distributions are assumed to be normal. Because of this, Spearman’s correlation coefficient \(r_s\) might be a better evaluation metric when the real values \(x_i\) are not even approximately normally distributed. Spearman’s correlation coefficient is obtained by first converting the observations \(x_i\) and \(y_i\) , \(i=1,\ldots ,n\) , into their ranks and then computing Pearson’s correlation coefficient of these ranks 29 .

Another way to evaluate the model is to use some error measurement, such as the mean absolute error (MAE) \(\frac{1}{n}\sum ^n_{i=1}|x_i-y_i|\) or the mean squared error (MSE) \(\frac{1}{n}\sum ^n_{i=1}(x_i-y_i)^2\) 12 . The difference between MSE and MAE is that MSE punishes more for large errors 12 . Naturally, the smaller the error measurement is, the better the model performs.
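
A minimal sketch of these regression metrics for hypothetical real and predicted values, using scipy.stats for the correlation coefficients:

```python
# Sketch: correlation and error metrics for a hypothetical regression test set.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.5, 3.2, 4.8, 6.1])   # real values
y = np.array([1.2, 2.1, 3.6, 4.5, 5.8])   # predicted values

pearson_r, _  = stats.pearsonr(x, y)      # assumes roughly normal marginals
spearman_r, _ = stats.spearmanr(x, y)     # rank-based alternative
mae = np.mean(np.abs(x - y))
mse = np.mean((x - y) ** 2)
print(pearson_r, spearman_r, mae, mse)
```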

Image segmentation

Image segmentation is the process of dividing images into regions of pixels or, in the case of three-dimensional (3D) images, voxels, so that different objects and their boundaries can be located. In practice, this means converting an image into a segmentation mask, a matrix of the same size as the image, whose every element tells the class of the corresponding point in the image. In binary image segmentation, the desired output is a binary mask with positive elements coded as 1s and negative elements as 0s, but we can also perform multi-class image segmentation, called semantic segmentation, by using more integers to signify different classes. An example of binary tumor segmentation can be seen in Fig.  2 .

Figure 2

The binary tumor mask predicted by U-Net CNN with maximum dimensionality of 128 (in blue) and the ground-truth tumor mask drawn by a physician (in white) for one transaxial slice from a PET image of a head and neck cancer patient. The image is 128  \(\times\)  128 pixels and the predicted segmentation mask contains 181 TP pixels, 16156 TN pixels, 17 FP pixels, and 30 FN pixels. This gives us Dice of 0.885, IoU of 0.794, and overall pixel accuracy of 0.997.

One possible evaluation metric for image segmentation masks is accuracy. In the case of binary segmentation, we could simply count the numbers of TP, TN, FN, and FP pixels and calculate the accuracy as in ( 1 ). However, the issue with this approach is that the number of positive pixels is typically very small compared to the number of negative pixels: For instance, if we try to perform tumor segmentation for medical images of the body, the positive targets, while incredibly important, have minimal volume compared to the background and they might not even be present in some images. Because of this, the value of accuracy can be very high even in the cases where the model does not find the positive object, as long as the majority of the negative pixels are classified correctly.

Consequently, the results of binary segmentation are often evaluated with a metric that ignores the TN points. Instead, we concentrate on evaluating the similarity of the predicted positive segment given by a CNN and the ground-truth positive segment annotated by a human. For this purpose, we can use the Sørensen–Dice similarity coefficient 30 , 31 , also known as the Dice score, defined for two sets X and Y as

\[ D(X,Y)=\frac{2|X\cap Y|}{|X|+|Y|}, \qquad (5) \]

where | S | denotes the number of pixels or voxels in the set S 32 . This definition can be equivalently written as

\[ D=\frac{2\,\textrm{TP}}{2\,\textrm{TP}+\textrm{FP}+\textrm{FN}} \]

by using the elements of the confusion matrix from the binary predictions of the points 32 . A very similar alternative to the Dice score is the Jaccard similarity coefficient 33 , which is also known as the Jaccard index or Intersection over Union (IoU), and defined as

\[ \textrm{IoU}(X,Y)=\frac{|X\cap Y|}{|X\cup Y|} \qquad (6) \]

for the sets X and Y , and

\[ \textrm{IoU}=\frac{\textrm{TP}}{\textrm{TP}+\textrm{FP}+\textrm{FN}} \]

for the elements of the confusion matrix 32 . The equality \(\textrm{IoU}=D/(2-D)\) holds trivially between the IoU and the Dice score 32 .
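
A minimal sketch of the Dice score and IoU for two hypothetical binary masks:

```python
# Sketch: Dice score and IoU for two hypothetical binary segmentation masks
# (True = positive pixel, False = negative pixel).
import numpy as np

pred = np.array([[0, 1, 1],
                 [0, 1, 0],
                 [0, 0, 0]], dtype=bool)   # predicted mask
gt   = np.array([[0, 1, 1],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=bool)   # ground-truth mask

intersection = np.logical_and(pred, gt).sum()
dice = 2 * intersection / (pred.sum() + gt.sum())
iou  = intersection / np.logical_or(pred, gt).sum()
print(dice, iou, np.isclose(iou, dice / (2 - dice)))   # the identity IoU = D / (2 - D)
```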

There are also metrics specially designed for 3D segmentation, as this is a common task for medical tomography images. The surface of the point set X , denoted by \(\partial X\) , is the set of all voxels in X for which at least one of the 18 or 26 neighbouring voxels does not belong to X . As an alternative to the typical Dice score, the surface Dice similarity coefficient (SDSC) can be computed by replacing X and Y with their surfaces \(\partial X\) and \(\partial Y\) in ( 5 ). Let d ( x ,  y ) be the Euclidean distance between two voxels x and y , and define \(d(x,Y)=\min _{y\in \partial Y}d(x,y)\) for the set Y . The average symmetric surface distance (ASD) between the sets X and Y can now be defined as

\[ \textrm{ASD}(X,Y)=\frac{1}{|\partial X|+|\partial Y|}\left( \sum _{x\in \partial X}d(x,Y)+\sum _{y\in \partial Y}d(y,X)\right) . \]

The Hausdorff distance is \(\textrm{hd}(X,Y)=\max _{x\in X}d(x,Y)\) and its symmetric version, also known as the maximum symmetric surface distance, is \(\textrm{HD}(X,Y)=\max \{\textrm{hd}(X,Y),\textrm{hd}(Y,X)\}\) . The symmetric volume difference (SVD) is a Dice-based error metric defined as \(\textrm{SVD}=1-D\) and the volumetric overlap error (VOE) is the corresponding error measure derived from IoU, \(\textrm{VOE}=1-\textrm{IoU}\) . The model performance is considered better with smaller surface distances and error terms 34 .

The results of multi-class semantic segmentation are typically evaluated by using mean Dice or IoU values, either as the mean of all within-class scores in a single image or as the class-specific means over several images. The similarity of two semantic segmentation masks, or any two images, can also be evaluated with the structural similarity index measure (SSIM). If u and v are two image matrices with means \(\overline{u}\) and \(\overline{v}\) , variances \(s_u\) and \(s_v\) , and covariance \(s_{u,v}\) , then we have

\[ \textrm{SSIM}(u,v)=\frac{(2\overline{u}\,\overline{v}+c_1)(2s_{u,v}+c_2)}{(\overline{u}^2+\overline{v}^2+c_1)(s_u+s_v+c_2)} \]

for constants \(c_1\) and \(c_2\) depending on the pixel values 35 . The SSIM is typically computed by using the formula above within several kernels or windows of the images. The values of SSIM are interpreted like those of correlation: 1 for perfect similarity, 0 for no association, and \(-1\) for perfect opposites.
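
A minimal sketch of the SSIM formula above evaluated over whole images, i.e., a single global window, whereas library implementations typically average the formula over local windows; the constants used here are the conventional choices for pixel values in [0, 1]:

```python
# Sketch: global (single-window) SSIM following the formula in the text.
import numpy as np

def global_ssim(u, v, c1=1e-4, c2=9e-4):      # conventional constants for values in [0, 1]
    mu_u, mu_v = u.mean(), v.mean()
    var_u, var_v = u.var(), v.var()
    cov_uv = np.mean((u - mu_u) * (v - mu_v))
    return (((2 * mu_u * mu_v + c1) * (2 * cov_uv + c2))
            / ((mu_u**2 + mu_v**2 + c1) * (var_u + var_v + c2)))

rng = np.random.default_rng(0)
a = rng.random((128, 128))
print(global_ssim(a, a))          # identical images -> 1.0
print(global_ssim(a, 1.0 - a))    # inverted image -> close to -1
```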

Object detection

Another similar task related to image processing is object detection, in which we find bounding boxes around each object in the image and classify them into different classes. A good object detector is capable of finding all the objects in an image without producing any false observations, placing the bounding boxes as close to their correct locations as possible, and also classifying all the found objects correctly. Due to the diversity of these subtasks, the evaluation of object detectors is slightly more complicated than it is for the other models introduced.

To evaluate the results of object detection, we must start by counting how many objects of a specific class were found. This quickly leads to the question of how to decide how close a predicted bounding box needs to be to a ground-truth box so that we can interpret the object as found. The common criterion here is the IoU defined as in ( 6 ): The prediction is only considered a match of a ground-truth box if the IoU value of the two boxes exceeds a certain threshold value, often 0.5. If there are several predicted boxes producing a high enough IoU with the same ground-truth box, only the best one in terms of IoU is considered a match to the ground-truth box while all the others are FP observations. Namely, FP is here the number of predicted boxes without a matching ground-truth box, TP is the number of predictions that match a ground-truth box of the same class, and FN is the number of ground-truth boxes without a matching prediction 10 .

With the TP, FP, and FN numbers of the specific class, we can compute precision and recall as in ( 1 ). Since an object detector outputs a confidence for every bounding box expressing how confident the model is about the prediction, we can remove the predictions below a threshold of confidence. Changing this threshold affects the TP, FP, and FN numbers and therefore also precision and recall. The precision-recall curve (PRC) can be obtained by plotting precision against recall at all possible thresholds of confidence. After that, we can compute average precision (AP) as the area under the PRC. The whole model is evaluated by computing mean average precision (mAP) as the mean value of the APs over all the different classes. We often consider mAP@0.5, which is computed by using the IoU threshold 0.5 to define a match, but just as well we could compute the mAP at stricter thresholds such as mAP@0.75 or mAP@0.9, or mAP@[0.5:0.95], which is the mean value of mAP@0.5, mAP@0.55, \(\ldots\) , mAP@0.95. The metric mAP@0.75 is more strict than mAP@0.5 given that it requires greater overlap for the potential matches and is therefore suitable for situations where the predicted bounding box locations need to be very exact 10 .
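
A minimal sketch of the IoU between two hypothetical axis-aligned bounding boxes and the match decision at the 0.5 threshold:

```python
# Sketch: IoU between two axis-aligned bounding boxes given as
# (x_min, y_min, x_max, y_max), and a match decision at the 0.5 threshold.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

predicted    = (10, 10, 50, 50)     # hypothetical predicted box
ground_truth = (15, 12, 55, 48)     # hypothetical ground-truth box
iou = box_iou(predicted, ground_truth)
print(iou, "match" if iou >= 0.5 else "no match (counts as FP and FN)")
```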

Information retrieval

Information search and retrieval is a significant task in ML research. The ability to retrieve only relevant results from large image- or text-based databases is crucial for these databases to be actually useful. Search engines and other information retrieval models can be evaluated by using precision and recall to describe the percentage of relevant retrieved documents among either the search results or all the relevant documents. If we have K results \(d_1,\ldots ,d_K\) ordered by estimated relevance from the database D and each document d is either relevant ( \(\textrm{rel}(d)=1\) ) or not ( \(\textrm{rel}(d)=0\) ), we can compute the precision of the first k retrieved documents as P@ \(k=\sum ^k_{i=1}\textrm{rel}(d_i)/k\) , for \(k=1,\ldots ,K\) , and then define AP as 36

\[ \textrm{AP}=\frac{\sum ^K_{k=1}\text {P@}k\cdot \textrm{rel}(d_k)}{\sum ^K_{k=1}\textrm{rel}(d_k)}. \]

The mAP is obtained as a mean value of AP across different topics or search queries 36 . If the results have more classes than just relevant and non-relevant, the discounted cumulative gain (DCG) of the k first results can be defined as

\[ \textrm{DCG@}k=\sum ^k_{i=1}\frac{G(i)}{\log _2(i+1)}, \]

where G ( i ) is a numerical value representing the gain of the i th result 37 . For instance, the values 10, 7, 3, 0.5, and 0 are often used for perfect, excellent, good, fair, and bad results, respectively 37 . If there are several search queries to be evaluated, the mean DCG can be used.
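
A minimal sketch of P@k, AP, and DCG for one hypothetical ranked result list, following the conventions above (AP normalized by the number of relevant retrieved documents and a \(\log _2(i+1)\) discount):

```python
# Sketch: precision at k, average precision, and DCG for one hypothetical ranked
# result list; rel[i] is 1 for a relevant document and 0 otherwise, and the gains
# follow the 10/7/3/0.5/0 grading mentioned in the text.
import numpy as np

rel   = np.array([1, 0, 1, 1, 0])            # relevance of results d_1, ..., d_5
gains = np.array([10, 0, 7, 3, 0.5])         # graded gains G(i) for DCG

p_at_k = np.cumsum(rel) / np.arange(1, len(rel) + 1)            # P@1, ..., P@K
ap     = np.sum(p_at_k * rel) / rel.sum()                       # average precision
dcg    = np.sum(gains / np.log2(np.arange(2, len(gains) + 2)))  # log2(i + 1) discount
print(p_at_k, ap, dcg)
```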

Statistical tests

The motivation behind statistical tests is often to find out whether there is a significant difference between two different populations with respect to some specific property. We can collect smaller data sets from the populations and use them to compute values of the numeric quantity representing the feature of interest. Since there is nearly always at least a slight difference between these values, the relevant question is whether this difference is great enough to be considered actual evidence of an underlying dissimilarity between the populations or whether it is just a result of random variation.

The process of statistical testing is relatively simple: We formulate a null hypothesis \(H_0\) according to which there is no real difference, choose some level of significance \(\alpha \in (0,1)\) , and define a suitable test statistic Z with a known probability distribution \(P(Z|H_0)\) under the null hypothesis. We then use this distribution to compute the probability of obtaining a value of the statistic Z at least as extreme as the value z already observed. If the resulting probability \(p=2\min \{P(Z\le z|H_0),P(Z\ge z|H_0)\}\) , called the p value, is less than \(\alpha\) , then the null hypothesis is rejected and the difference is considered statistically significant. We make a type I error when rejecting a true null hypothesis, and a type II error when accepting a false null hypothesis. We can control the probability of a type I error as it is equal to \(\alpha\) . We could also use \(\alpha\) to compute the critical values of the statistic for accepting or rejecting the null hypothesis instead of using a p value. However, in this paper, all the mentioned test functions in Python 38 and R 39 return a p value. We use \(\alpha =0.05\) as the level of significance in our examples.

When comparing the performance of two or more models, it is often necessary to perform the tests multiple times, depending on the evaluation metric and the statistical test used. For instance, while we can compute the Dice score of every predicted segmentation mask in the test set, we only obtain one value of accuracy from the predictions of the whole test set after binary classification and, similarly, one value of MSE after regression. If we want to compare regression models, we can test squared errors instead of their mean and, in the case of binary classification, there are tests that are based on the predictions of a single test set. In other cases, we have to evaluate our models on several test sets to obtain enough values of other evaluation metrics for statistical testing. The required values of an evaluation metric for a certain statistical test are summarized in the flowchart of Fig.  3 .

While the test sets should ideally come from fully different data sets, sometimes our only option is to use a resampling procedure to create multiple test sets from the same data. In practice, we must re-initialize, train, and test the models several times and save the values of the evaluation metrics from the predictions of the test set on each iteration round. We should use the same training and test sets for all the models on the same iteration round but vary them between the rounds because, otherwise, our conclusions about a potential difference between the models might be misled by some unknown factor in these specific data sets. Researchers commonly use k -fold cross-validation here, in which the data is divided into k similarly sized folds and, during k iteration rounds, each fold is the test set exactly once while the other \(k-1\) folds form the training data 12 . Alternatively, we can perform repeated cross-validation that has a few re-runs of each potential test set 12 . However, it should be taken into account that resampling methods do not produce independent values of the evaluation metrics and might lead to underestimating the variance of the test statistic, causing biased results 12 .
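
A minimal sketch of collecting paired metric values with repeated cross-validation, where the two "models" are simple stand-ins for classifiers that would be re-initialized and trained on each round:

```python
# Sketch: paired accuracy values from repeated 5-fold cross-validation.
import numpy as np

def majority_class(X_tr, y_tr, X_te, y_te):
    return np.mean(y_te == np.round(y_tr.mean()))          # trivial baseline "model"

def threshold_model(X_tr, y_tr, X_te, y_te):
    cut = X_tr[:, 0].mean()                                 # toy "trained" threshold
    return np.mean((X_te[:, 0] > cut).astype(int) == y_te)

def repeated_cv(X, y, models, n_folds=5, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in models}
    for _ in range(n_repeats):
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
            for name, fit_and_score in models.items():      # same split for every model
                scores[name].append(fit_and_score(X[train_idx], y[train_idx],
                                                  X[test_idx], y[test_idx]))
    return scores                                            # n_folds * n_repeats paired values

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
scores = repeated_cv(X, y, {"A": majority_class, "B": threshold_model})
print(len(scores["A"]), np.mean(scores["A"]), np.mean(scores["B"]))
```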

Figure 3

The possible tasks for a model, their evaluation metrics, the values of the evaluation metric that must be computed for each model before statistical testing, the potential questions a statistical test could answer in the situation, and the suitable test.

Testing for a significant difference in any evaluation metric

Regardless of whether the values of the evaluation metric come from a single test set or several test sets on different iteration rounds, the values of the metric for the two models are based on the same instances and are therefore paired. Many researchers therefore check which of the models gives a higher mean and then use a paired t-test to test if the difference in the mean is significant 4 . The null hypothesis of the paired t-test is that the mean of the differences in the matched pairs is equal to 0 40 , and this test can be performed with the function ttest_rel in the package scipy.stats 41 in Python or t.test(x,y,paired=TRUE) in the base package stats in R. There are also newer variations of the t-test that are specially designed for repeated cross-validation 11 . However, the t-test is not recommended for this situation because it is strongly affected by outliers 4 and not valid when resampled test sets are used 12 .
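
A minimal sketch of the paired t-test on hypothetical paired accuracy values (shown for completeness, although the Wilcoxon signed-rank test below is preferable):

```python
# Sketch: paired t-test on hypothetical paired metric values from the same test sets.
from scipy.stats import ttest_rel

acc_model_a = [0.81, 0.79, 0.84, 0.80, 0.83]   # hypothetical paired values
acc_model_b = [0.78, 0.76, 0.82, 0.79, 0.80]

stat, p_value = ttest_rel(acc_model_a, acc_model_b)
print(stat, p_value)
```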

Another possible test is the sign test. If two models are evaluated by using N test sets and there is no difference between them, then each of them should produce a better value of the evaluation metric N /2 times 4 . Thus, the number of times the first model is better than the second follows a binomial distribution and, for larger N , approximately a normal distribution with mean N /2 and standard deviation \(\sqrt{N}/2\) 11 . We can therefore apply the sign test to check whether one of the models outperforms the other with respect to the chosen evaluation metric in a statistically significant way. However, the sign test has very weak power for detecting significant differences 4 .

The best alternative for this situation is the Wilcoxon signed-rank test 4 . It is a non-parametric test for the null hypothesis that the median of the differences in the matched pairs is equal to 0 42 . This test has the test statistic

\[ T=\min \{T^+,T^-\},\quad \text {where}\quad T^+=\sum _{d_i>0}\textrm{rank}(|d_i|),\quad T^-=\sum _{d_i<0}\textrm{rank}(|d_i|), \]

and \(\textrm{rank}(|d_i|)\) , \(i=1,\ldots ,n\) , denote the ranks of the differences \(d_i\) in the n matched pairs when ordered by their absolute values 43 . The T -statistic can be examined directly by using its own critical values or, for large values of n , by utilizing the statistic

\[ z=\frac{T-\frac{1}{4}n(n+1)}{\sqrt{\frac{1}{24}n(n+1)(2n+1)}}, \]

which follows the normal distribution under the null hypothesis 4 . The Wilcoxon signed-rank test can be performed with wilcoxon in scipy.stats in Python or wilcox.test(x,y, paired=TRUE) in stats in R.
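
The same comparison with the Wilcoxon signed-rank test, as a minimal sketch on the same hypothetical values:

```python
# Sketch: Wilcoxon signed-rank test on hypothetical paired metric values.
from scipy.stats import wilcoxon

acc_model_a = [0.81, 0.79, 0.84, 0.80, 0.83]   # hypothetical paired values
acc_model_b = [0.78, 0.76, 0.82, 0.79, 0.80]

stat, p_value = wilcoxon(acc_model_a, acc_model_b)
print(stat, p_value)
```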

Test for comparing several models

As explained above, we can use Wilcoxon signed-rank test to estimate whether the differences between two models are significant with respect to any evaluation metric, but this test is not ideal when comparing several models. Namely, while we can repeat Wilcoxon tests between each pair of models, the risk of type I error increases with multiple comparisons. Adjusting the level of significance by Bonferroni correction has been suggested as a solution 44 but it is overly radical 4 .

Instead, a better approach in a situation where we have K models evaluated on J data sets is to perform Friedman’s test 4 . The average rank of the k th model, \(k=1,\ldots ,K\) , is \(\overline{R}_k=\sum ^J_{j=1}r^j_k/J\) where \(r^j_k\) is the rank of the j th value of the evaluation metric for the k th model 4 . The test statistic can now be written as

\[ \chi ^2_F=\frac{12J}{K(K+1)}\left( \sum ^K_{k=1}\overline{R}_k^2-\frac{K(K+1)^2}{4}\right) \]

or, as noted by Iman and Davenport 45 , as 4

\[ F_{ID}=\frac{(J-1)\chi ^2_F}{J(K-1)-\chi ^2_F}. \]

Out of the two statistics, \(\chi ^2_F\) is overly conservative and \(F_{ID}\) is therefore recommended 4 . Under the null hypothesis, \(\chi ^2_F\) follows the \(\chi ^2\) -distribution with \(K-1\) degrees of freedom and \(F_{ID}\) follows the F -distribution with \(K-1\) and \((K-1)(J-1)\) degrees of freedom 4 . Friedman’s test can be performed with friedmanchisquare in scipy.stats in Python or friedman.test in stats in R, but both of these functions are based on the statistic \(\chi ^2_F\) and therefore are not reliable for small values of J . However, if J is small, we can use a few separate Wilcoxon signed-rank tests instead.
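
A minimal sketch of Friedman's test for K = 3 models evaluated on J = 6 hypothetical data sets:

```python
# Sketch: Friedman's test; each argument is one model's values of the chosen metric.
from scipy.stats import friedmanchisquare

model_a = [0.85, 0.80, 0.78, 0.91, 0.88, 0.83]   # hypothetical metric values
model_b = [0.82, 0.79, 0.75, 0.89, 0.84, 0.80]
model_c = [0.84, 0.81, 0.76, 0.90, 0.85, 0.82]

stat, p_value = friedmanchisquare(model_a, model_b, model_c)
print(stat, p_value)   # based on the chi-squared form of the statistic
```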

Tests for binary classification of a single test set

There are also tests for the comparison of two classifiers which only require their predictions from a single iteration round. McNemar’s test is a common non-parametric test that only requires two numbers and is typically used to compare either the sensitivity or the specificity of two classifiers 46 . To find out whether there is a significant difference in the sensitivity of the classifiers, let b be the number of positive instances in the test set misclassified as FN by the first classifier but not by the second classifier, and let c similarly be the number of positive instances misclassified as FN by the second classifier but not by the first classifier. To study specificity, count the numbers b and c by using FP misclassifications among the negative instances. Comparing accuracy by counting errors among both positive and negative sets is not recommended 47 . If there is no significant difference in the performance of the two classifiers, the test statistic

\[ \chi ^2=\frac{(|b-c|-1)^2}{b+c} \]

follows the \(\chi ^2\) -distribution with 1 degree of freedom for \(b+c\ge 20\) and a binomial distribution otherwise 11 . This test can be performed with mcnemar in statsmodels.stats.contingency_tables 48 in Python or mcnemar.test in stats in R.
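
A minimal sketch of McNemar's test with the statsmodels function named above, using a hypothetical table of the two classifiers' correct and wrong predictions on the positive instances:

```python
# Sketch: McNemar's test. Rows index whether classifier 1 was correct or wrong on
# the positive instances, columns whether classifier 2 was correct or wrong, so
# the off-diagonal counts correspond to c = 15 and b = 32 in the notation above.
from statsmodels.stats.contingency_tables import mcnemar

table = [[220, 15],   # classifier 1 correct: classifier 2 correct / wrong
         [ 32, 33]]   # classifier 1 wrong:   classifier 2 correct / wrong

result = mcnemar(table, exact=False, correction=True)   # chi-squared version, b + c >= 20
print(result.statistic, result.pvalue)
```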

We can also use the DeLong test to see whether there is a statistically significant difference between the AUCs of two binary classifiers. Namely, DeLong et al. 49 noticed that the Mann-Whitney statistic can be used as an estimate of an AUC and the theory of generalized U-statistics can be applied to compare two AUCs. The Mann-Whitney two-sample statistic for the AUC can be written as

\[ \hat{\theta }=\frac{1}{mn}\sum ^m_{i=1}\sum ^n_{j=1}\left( \mathbb {1}(Y_{i1}>Y_{j0})+\frac{1}{2}\,\mathbb {1}(Y_{i1}=Y_{j0})\right) , \]

where \(\mathbb {1}(\cdot )\) is the indicator function, m is the number of truly positive instances, n is the number of truly negative instances, \(Y_{i1}\) is the numeric prediction of the i th positive instance before it was converted into binary and, similarly, \(Y_{j0}\) is the numeric prediction of the j th negative instance 50 . Let \(\hat{\theta }_1\) be the estimate above for the AUC of the first classifier and \(\hat{\theta }_2\) the same for the second classifier. The DeLong test estimates their variances and covariance (see e.g. 51 for the exact formulas) and then uses the statistic

\[ z=\frac{\hat{\theta }_1-\hat{\theta }_2}{\sqrt{\widehat{\textrm{Var}}(\hat{\theta }_1)+\widehat{\textrm{Var}}(\hat{\theta }_2)-2\,\widehat{\textrm{Cov}}(\hat{\theta }_1,\hat{\theta }_2)}}, \]

which follows the normal distribution under the null hypothesis due to the properties of the known U-statistic 51 . The DeLong test can be performed with roc.test(x,y,method= ’delong’) in the package pROC 52 in R.

Tests for comparing variance

Another important factor when comparing the performance of models is the amount of variance they produce. A model that consistently obtains high values of some evaluation metric is better than a model whose performance varies greatly between different iteration rounds. However, it must be taken into careful consideration how the multiple values of the evaluation metric are obtained before considering their variance. For instance, if we use repeated cross-validation, we will not obtain a realistic estimate of how the performance of a model would vary over different data sets.

We can use the F-test of equality of variances to test the null hypothesis according to which two populations have equal variances. The test statistic is \(F=S^2_1/S^2_2\) where \(S^2_1\) and \(S^2_2\) are the sample variances of the values produced by the two models for the evaluation metric, and this F-statistic follows the F-distribution with \(n-1\) and \(n-1\) degrees of freedom under the null hypothesis 53 .

However, the use of the F-test is not recommended for non-normally distributed values, and this is often the case when comparing evaluation metrics: For instance, if the model has a median accuracy of 90% but a high amount of variation between different test sets, it is likely that the distribution of accuracy is left-skewed as the accuracy is limited to [0, 1] by its definition. The normality can be tested here with the Shapiro–Wilk test 54 ( shapiro in the package scipy.stats in Python and shapiro.test in the package stats in R). If the data is not normally distributed, possible alternatives for the F-test include Bartlett’s test 55 ( bartlett in scipy.stats in Python and bartlett.test in stats in R) and Levene’s test 56 ( levene in scipy.stats in Python and leveneTest in the package car 57 in R).
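
A minimal sketch of these normality and variance tests on hypothetical Dice values of two models:

```python
# Sketch: testing normality and comparing variances of two hypothetical sets of
# metric values with the tests named above.
from scipy.stats import shapiro, bartlett, levene

dice_model_a = [0.81, 0.84, 0.79, 0.90, 0.62, 0.88, 0.85, 0.77]   # hypothetical values
dice_model_b = [0.80, 0.82, 0.78, 0.86, 0.70, 0.84, 0.83, 0.79]

print(shapiro(dice_model_a))                   # normality of one sample
print(bartlett(dice_model_a, dice_model_b))    # variance comparison, assumes normality
print(levene(dice_model_a, dice_model_b))      # more robust to non-normality
```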

Comparison to a human

In ML research, it is often of interest whether a specific ML model performs better than a human. Especially in the medical field, it is useful to estimate how much the tumor masks predicted by a CNN differ from those drawn by a physician by taking into account how much difference there would be if the same masks were drawn by two different physicians. For this purpose, we can use statistical testing to compare the results of an ML model and a human in terms of a relevant evaluation metric just as we would compare the performance of two models. However, there might be some cases where this comparison is not possible: A human is not able to go through very large amounts of data, at least not fast, and, while we can always re-initialize the model between different rounds of repeated cross-validation, a human will not forget their earlier decisions. Because of this, statistical comparison between an ML model and a human is often limited to using McNemar’s test or the DeLong test to compare classifications in a single test set, or the Wilcoxon signed-rank test to compare segmentation masks in terms of Dice and IoU values for a reasonable number of images.

Software requirements

The CNNs were coded in Python (version: 3.9.9) 38 with packages TensorFlow (version: 2.7.0) 58 and Keras (version: 2.7.0) 59 . Most of the tests were performed in Python with scipy (version: 1.7.3) 41 or statsmodels (version: 0.14.0) 48 . The DeLong test was performed and Fig.  1 was plotted with pROC (version: 1.18.5) 52 in R (version: 3.4.1) 39 . The images of the third data set had been studied with Carimas (version: 2.10) 60 , which was also used to draw their binary masks.

We use three data sets consisting of two-dimensional grayscale images converted into the size of 128  \(\times\)  128 pixels. The first data set contains 3000 chest X-rays of COVID-19 patients and 3000 chest X-rays of healthy patients chosen from COVID-19 Radiography Database 61 , 62 . The second data set has 700 chest X-rays of healthy patients and 700 chest X-rays of COVID-19 patients from COVID-19 Radiography Database, 700 chest X-rays of patients with pneumonia from Chest X-Ray Images (Pneumonia) 63 , and 700 chest X-rays of tuberculosis patients from Tuberculosis (TB) Chest X-ray Database 64 . The third data set has a total of 962 two-dimensional transaxial image slices from the PET images of 89 head and neck squamous cell carcinoma patients. The patients were imaged with \(^{18}\) F-fluorodeoxyglucose tracer in Turku PET Centre, Turku, Finland, during years 2014–2022. More details about the imaging can be found in 65 , 66 . Each of the slices has also a ground-truth binary segmentation mask showing pixels depicting cancerous tissue as positive and the rest as negative, and they were chosen so that they have at least 6 positive pixels. All the cancer patients were at least 18 years of age, gave informed consent to the research use of their data, and the research from their data was approved by Ethics Committee of the Hospital District of Southwest Finland. All research was performed in accordance with the Declaration of Helsinki.

Convolutional neural networks

In both binary and multi-class classification, we use a CNN that has the U-Net architecture by Ronneberger et al. 67 modified for classification 65 and a ready-built CNN called InceptionV3 available in Keras. For binary segmentation, we use two U-Nets, the shallower of which has 64 as the maximum dimensionality of a Conv2D layer and the deeper of which has 128. They were also used in 66 , 68 . We use stochastic gradient descent as the optimizer for the classification CNNs and Adam for the segmentation CNNs. The classification CNNs are trained for 10 epochs and the segmentation CNNs for 50. The learning rate is 0.001 and, during training, 30% of the training data is used for validation. After training the CNNs for binary classification, we predict both the training and test sets and use the threshold giving the maximal Youden’s index in the training set as the threshold for converting the numeric predictions of the test set into binary labels. We similarly convert the output after binary segmentation by using the threshold that produces the highest median Dice in the training set. For multi-class classification, we obtain class labels directly by using the maximum elements of the one-hot encoded output.

Our experiments

We first compare the performance of the modified U-Net and InceptionV3 in binary classification by using our first data set of COVID-19 and negative X-rays with fivefold cross-validation. We compute all the possible evaluation metrics from our single test set and use McNemar’s test for sensitivity and specificity and DeLong test for AUC. Then we compare the modified U-Net and InceptionV3 in multi-class classification with repeated fivefold cross-validation (5 re-runs of each test set). We save the values of micro- and macro-average evaluation metrics after each round and use the Wilcoxon signed-rank test to estimate whether the differences in the resulting 25 values of each metric are significant or not. Even though the paired t-test should not be used for this, we perform it to see if its p values would be different from those of the Wilcoxon test. Finally, we divide our third data set patient-wise into train and test sets so that the test set has 191 slices (19.9% of the total data), and compare the two U-Nets for binary segmentation. We use the Shapiro–Wilk test to test the normality of Dice and IoU values of different segmentation masks, t-test and Wilcoxon test to estimate their differences, and F-test, Bartlett’s test and Levene’s test to check if there are significant differences in variances.

The results of the binary classification task are summarized in the contingency table of Table  3 and the resulting values of the evaluation metrics are in Table  4 . According to McNemar’s tests computed from Table  3 separately for sensitivity among the COVID-19 patients and specificity among the negative patients, the modified U-Net produced significantly higher sensitivity ( p value < 5.07e−5) but significantly lower specificity ( p value < 0.0207). The ROC curves of the modified U-Net and InceptionV3 can be seen in Fig.  1 and, according to the DeLong test, there is no significant difference in their AUCs ( p value = 0.137).

The median values of the evaluation metrics are in Table  5 for the multi-class classification task. According to t-tests and Wilcoxon tests, the modified U-Net is significantly better than InceptionV3, regardless of which metric is used. The p value of the t-test for macro-average F1-score is 6.47e−4 and less than 2.38e−5 for all the other metrics and, similarly, the p value of the Wilcoxon test for macro-average F1-score is 0.00116 and less than 6.37e−5 for all the other metrics.

The median and standard deviation of Dice and IoU values computed for the two U-Nets in the segmentation task are in Table  6 , as are the p values of Shapiro–Wilk tests, t-tests, Wilcoxon tests, F-tests, Bartlett’s tests, and Levene’s tests. Based on these p values, neither Dice nor IoU values are normally distributed, the deeper U-Net is significantly better in terms of both Dice and IoU values, and, while the deeper U-Net had higher standard deviation, this difference is only significant according to Levene’s test performed for the IoU values.

In our first experiment, we used both McNemar’s test and the DeLong test to study two CNNs used for binary classification. Our results show that the choice of the threshold was not ideal for the modified U-Net, as we obtained high sensitivity at the cost of specificity. This also reveals one issue with McNemar’s test: It does not tell us which classifier is better if one of them has a significantly higher sensitivity but a significantly lower specificity. We would need to use some other thresholds to convert the output of the CNN into binary labels and then repeat McNemar’s tests in order to find out whether the significant differences are caused by specific threshold choices or not. In this respect, the DeLong test is more useful as its results do not depend on the threshold choices. However, to obtain more trustworthy results, it would still be necessary to use cross-validation and compare the AUCs of different test sets with the Wilcoxon signed-rank test.

In our second and third experiments, we used the t-test for comparing the values of the evaluation metrics, even though it is not recommended for this, especially not when combined with repeated cross-validation. Its p values were relatively close to those of the Wilcoxon tests and, regardless of which test was used, we obtained the same conclusions about the significant differences. Since the misuse of the t-test is rather common, as noted by Demšar 4 , it is good to know that the results obtained in earlier research are not necessarily wrong. Similarly, even though the F-test is not designed for non-normally distributed data, its p values were very close to those of Bartlett’s tests. However, both the t-test and the F-test are sensitive to the error caused by potential outliers, so their use can lead to incorrect results.

It should be noted here that the aim of our experiments was to give examples of the use of the evaluation metrics and the related tests. To find out how often the t-test or some other test produces false conclusions when improperly used, more research is needed. Similarly, one possible topic for future research is how the number of test sets affects the trustworthiness of the conclusions.

In this paper, we introduced several evaluation metrics for common ML tasks, including binary and multi-class classification, regression, image segmentation, and object detection. Statistical testing can be used to estimate whether the different values of these metrics between two or more models are caused by actual differences between the models. The choice of the exact test depends on the task of the models, the evaluation metric used, and the number of test sets available. As some metrics produce only one value from a single test set and there might be only one data set, some type of resampling, such as repeated cross-validation, is often necessary. Because of this, well-known tests such as the paired t-test underestimate the variance and do not produce reliable results. Instead, the use of non-parametric tests such as the Wilcoxon signed-rank test or Friedman’s test is recommended.

Data availability

The X-ray data sets analyzed during the current study are available in the repositories: COVID-19 Radiography Database 61 , 62 https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database , Chest X-Ray Images (Pneumonia) 63 https://www.kaggle.com/datasets/paultimothymooney/chest-xray-pneumonia , and Tuberculosis (TB) Chest X-ray Database 64 https://www.kaggle.com/datasets/tawsifurrahman/tuberculosis-tb-chest-xray-dataset .

Code availability

Available at github.com/rklen/statistical_tests_for_CNNs.

Change history

08 July 2024

A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-66611-y

Jordan, M. I. & Mitchell, T. M. Machine learning: Trends, perspectives, and prospects. Science 349 (6245), 255–260 (2015).


Fradkov, A. L. Early history of machine learning. IFAC-PapersOnLine 53 (2), 1385–1390 (2020).


Bertolini, M., Mezzogori, D., Neroni, M. & Zammori, F. Machine Learning for industrial applications: A comprehensive literature review. Expert Syst. Appl. 175 , 114820 (2021).

Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7 , 1–30 (2006).


Angeline, R., Kanna, S.N., Menon, N.G., Ashwath, B.: Identifying malignancy of lung cancer using deep learning concepts. In Artificial Intelligence in Healthcare (eds. Garg, L., Basterrech, S., Banerjee, C., Sharma, T.K.) 35–46 https://doi.org/10.1007/978-981-16-6265-2_3 (Advanced Technologies and Societal Change, Springer, 2022).

Debats, O. A., Litjens, G. J. & Huisman, H. J. Lymph node detection in MR Lymphography: False positive reduction using multi-view convolutional neural networks. PeerJ 7 , e8052 (2019).


Madabhushi, A., Feldman, M., Metaxas, D., Chute, D., Tomaszeweski, J. Optimal feature combination for automated segmentation of prostatic adenocarcinoma from high resolution MRI. In Proceedings of the 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (IEEE Cat. No. 03CH37439) 614–617, Vol. 1. IEEE (2003).

Raschka, S. Model evaluation, model selection, and algorithm selection in machine learning. arXiv:1811.12808 (2018).

Li, Z., Liu, F., Yang, W., Peng, S. & Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. 33 (12), 6999 (2021).


Planche, B. & Andres, E. Hands-On Computer Vision with TensorFlow 2: Leverage Deep Learning to Create Powerful Image Processing Apps with TensorFlow 2.0 and Keras (Packt Publishing, 2019).

Santafe, G., Inza, I. & Lozano, J. A. Dealing with the evaluation of supervised classification algorithms. Artif. Intell. Rev. 44 , 467–508 (2015).

Tohka, J. & Van Gils, M. Evaluation of machine learning algorithms for health and wellness applications: a tutorial. Comput. Biol. Med. 132 , 104324 (2021).


Zhu, W., Zeng, N. & Wang, N. Sensitivity, specificity, accuracy, associated confidence interval and ROC analysis with practical SAS implementations. In NESUG proceedings: health care and life sciences, Baltimore, Maryland 67, vol. 19 (2010).

Dehmer, M. & Basak, S. C. Statistical and Machine Learning Approaches for Network Analysis (Wiley, 2012).

Šimundić, A. M. Measures of diagnostic accuracy: Basic definitions. EJIFCC 19 (4), 203–211 (2009).

Small Casler, K. & Gawlik, K. (eds) Laboratory Screening and Diagnostic Evaluation: An Evidence-Based Approach (Springer, 2022).

Cox, D. J. & Vladescu, J. C. Statistics for Applied Behavior Analysis Practitioners and Researchers (Academic Press, 2023).

Youden, W. J. Index for rating diagnostic tests. Cancer 3 (1), 32–35 (1950).


Emmert-Streib, F., Moutari, S. & Dehmer, M. Elements of Data Science, Machine Learning, and Artificial Intelligence Using R (Springer, 2023).

Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20 (1), 37–46 (1960).

Lantz, B. Machine Learning with R: Learn Techniques for Building and Improving Machine Learning Models, from Data Preparation to Model Tuning, Evaluation, and Working with Big Data (Packt Publishing, 2023).

Boughorbel, S., Jarray, F. & El-Anbari, M. Optimal classifier for imbalanced data using Matthews Correlation Coefficient metric. PLoS ONE 12 (6), e0177678 (2017).

Pepe, M., Longton, G. & Janes, H. Estimation and comparison of receiver operating characteristic curves. Stata J. 9 , 1 (2009).

Martinez, M., & Stiefelhagen, R. Taming the cross entropy loss. In Pattern Recognition: 40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings 628–637, Vol. 40. Springer (2019).

Manning, C. & Schutze, H. Foundations of Statistical Natural Language Processing (MIT Press, 1999).

Tallón-Ballesteros, A. J., Riquelme, J. C. Data mining methods applied to a digital forensics task for supervised machine learning. In Computational Intelligence in Digital Forensics: Forensic Investigation and Applications 413–428 (2014).

Yilmaz, A. E. & Demirhan, H. Weighted kappa measures for ordinal multi-class classification performance. Appl. Soft Comput. 134 , 110020 (2023).

Zhang, M. L. & Zhou, Z. H. A review on multi-label learning algorithms. IEEE Trans. Knowl. Data Eng. 26 (8), 1819–1837 (2013).

Xiao, C., Ye, J., Esteves, R. M. & Rong, C. Using Spearman’s correlation coefficients for exploratory data analysis on big dataset. Concurr. Comput. Pract. Exp. 28 , 3866–3878 (2016).

Dice, L. R. Measures of the amount of ecologic association between species. Ecology 26 (3), 297–302 (1945).

Sørensen, T. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. K. Dan. Vidensk. Selsk. 5 (4), 1–34 (1948).

Sarkar, M. & Sahoo, P. K. Intelligent image segmentation methods using deep convolutional neural network. In Biomedical Signal and Image Processing with Artificial Intelligence 309–335 (Springer, 2022).

Jaccard, P. The Distribution of the Flora in the Alpine Zone.1. New Phytol. 11 (2), 37–50 (1912).

Voiculescu, I., & Yeghiazaryan, V. (2015). An Overview of Current Evaluation Methods Used in Medical Image Segmentation .

Brunet, D., Vrscay, E. R. & Wang, Z. On the mathematical properties of the structural similarity index. IEEE Trans. Image Process. 21 (4), 1488–1499 (2011).


Cormack, G. V., & Lynam, T. R. Statistical precision of information retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval 533–540 (2006).

Dupret, G. & Piwowarski, B. Model based comparison of discounted cumulative gain and average precision. J. Discrete Algorithms 18 , 49–62 (2013).

van Rossum, G. & Drake, F. L. Python 3 Reference Manual (CreateSpace, 2009).

R Core Team. R: A Language and Environment for Statistical Computing (R Foundation of Statistical Computing, 2021).

Jekel, J. F. Epidemiology, Biostatistics, and Preventive Medicine (Elsevier Health Sciences, 2007).

Virtanen, P. et al. SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nat. Methods 17 (3), 261–272 (2020).


Lang, T. A. & Secic, M. How to Report Statistics in Medicine: Annotated Guidelines for Authors, Editors, and Reviewers (ACP Press, Berlin, 2006).

Corder, G. W. & Foreman, D. I. Nonparametric Statistics for Non-statisticians (Wiley, 2009).

Salzberg, S. L. On comparing classifiers: Pitfalls to avoid and a recommended approach. Data Min. Knowl. Discov. 1 , 317–328 (1997).

Iman, R. L. & Davenport, J. M. Approximations of the critical region of the Friedman statistic. Commun. Stat. 9 , 571–595 (1980).

Kim, S. & Lee, W. Does McNemar’s test compare the sensitivities and specificities of two diagnostic tests?. Stat. Methods Med. Res. 26 (1), 142–154 (2017).


Trajman, A. & Luiz, R. R. McNemar chi2 test revisited: Comparing sensitivity and specificity of diagnostic examinations. Scand. J. Clin. Lab Invest. 68 (1), 77–80 (2008).

Seabold, S., & Perktold, J. Statsmodels: Econometric and statistical modeling with python. In Proceedings of the 9th Python in Science Conference (2010).

DeLong, E. R., DeLong, D. M. & Clarke-Pearson, D. L. Comparing the areas under two or more correlated receiver operating characteristic curves: A nonparametric approach. Biometrics 44 (3), 837–45 (1988).

Qin, G. & Hotilovac, L. Comparison of non-parametric confidence intervals for the area under the ROC curve of a continuous-scale diagnostic test. Stat. Methods Med. Res. 17 (2), 207–221 (2008).

Nakas, C. T., Bantis, L. E. & Gatsonis, C. A. ROC Analysis for Classification and Prediction in Practice (CRC Press, 2023).

Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinform. 12 , 77 (2011).

Bethea, R. M., Duran, B. S. & Boullion, T. L. Statistical Methods for Engineers and Scientists (Taylor & Francis, 1995).

Shapiro, S. S. & Wilk, M. B. An analysis of variance test for normality (complete samples). Biometrika 52 (3–4), 591–611 (1965).

Bartlett, M. S. Properties of sufficiency and statistical tests. Proc. R. Stat. Soc. Ser. A 160 , 268–282 (1937).


Levene, H. Robust tests for equality of variances. In Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling (eds Olkin, I., Hotelling, H. et al. ) 278–292 (Stanford University Press, 1960).

Fox, J. & Weisberg, S. An R Companion to Applied Regression 3rd edn. (Sage, 2019).

Abadi, M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems (2015).

Chollet, F. et al. Keras. GitHub (2015).

Rainio, O. et al. Carimas: An extensive medical imaging data processing tool for research. J. Digit. Imaging 36 (4), 1885 (2023).

Chowdhury, M. E. H. et al. Can AI help in screening Viral and COVID-19 pneumonia?. IEEE Access 2020 (8), 132665–132676 (2020).

Rahman, T. et al. Exploring the effect of image enhancement techniques on COVID-19 detection using chest X-ray images. Comput. Biol. Med. 132 , 104319 (2021).

Kermany, D. S. et al. Identifying medical diagnoses and treatable diseases by image-based deep learning. Cell 172 (5), 1122-1131.e9 (2018).

Rahman, T. et al. Reliable tuberculosis detection using chest X-ray with deep learning, segmentation and visualization. IEEE Access 8 , 191586–191601 (2020).

Hellström, H. et al. Classification of head and neck cancer from PET images using convolutional neural networks. Sci. Rep. 13 , 10528 (2023).

Liedes, J. et al. Automatic segmentation of head and neck cancer from PET-MRI data using deep learning. J. Med. Biol. Eng. https://doi.org/10.1007/s40846-023-00818-8 (2023).

Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2015. MICCAI 2015 Vol. 9351 (eds Navab, N. et al. ) 234–241 (Springer, 2015).

Rainio, O. et al. New method of using a convolutional neural network for 2D intraprostatic tumor segmentation from PET images. Res. Biomed. Eng. https://doi.org/10.1007/s42600-023-00314-7 (2023) ( to appear ).

Acknowledgements

We are grateful to the referees for their suggestions.

The first author was financially supported by the Finnish Cultural Foundation and Jenny and Antti Wihuri Foundation. The second author was supported by the Finnish Cultural Foundation (Maire and Aimo Mäkinen Foundation).

Author information

Authors and Affiliations

Turku PET Centre, University of Turku and Turku University Hospital, Turku, Finland

Oona Rainio, Jarmo Teuho & Riku Klén

Corresponding author

Correspondence to Oona Rainio.

Ethics declarations

Competing interests

On the behalf of all authors, the corresponding author states that there is no conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this Article was revised: The original version of this Article contained an error in an equation in the Different machine learning tasks section, under the subheading ‘Multi-class classification’. Full information regarding the correction made can be found in the correction for this Article.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Cite this article

Rainio, O., Teuho, J. & Klén, R. Evaluation metrics and statistical tests for machine learning. Sci Rep 14 , 6086 (2024). https://doi.org/10.1038/s41598-024-56706-x

Download citation

Received : 13 December 2023

Accepted : 09 March 2024

Published : 13 March 2024

DOI : https://doi.org/10.1038/s41598-024-56706-x

Keywords

  • Evaluation metrics
  • Machine learning
  • Medical images
  • Statistical testing

This article is cited by

Fuchs’ endothelial corneal dystrophy evaluation using a high-resolution wavefront sensor.

  • Carolina Belda-Para
  • Gonzalo Velarde-Rodríguez
  • José M. Rodríguez-Ramos

Scientific Reports (2024)

Investigation of emergency department abandonment rates using machine learning algorithms in a single centre study

  • Marta Rosaria Marino
  • Teresa Angela Trunfio
  • Giovanni Improta

Preterm birth risk stratification through longitudinal heart rate and HRV monitoring in daily life

  • Mohammad Feli
  • Amir M. Rahmani

Comparison of thresholds for a convolutional neural network classifying medical images

  • Oona Rainio
  • Jonne Tamminen

International Journal of Data Science and Analytics (2024)

Real-time invasive sea lamprey detection using machine learning classifier models on embedded systems

  • Ian González-Afanador
  • Claudia Chen
  • Nelson Sepúlveda

Neural Computing and Applications (2024)

The role of metrics in peer assessments

Liv Langfeldt, Ingvild Reymert, Dag W Aksnes, The role of metrics in peer assessments, Research Evaluation , Volume 30, Issue 1, January 2021, Pages 112–126, https://doi.org/10.1093/reseval/rvaa032

Abstract

Metrics on scientific publications and their citations are easily accessible and are often referred to in assessments of research and researchers. This paper addresses whether metrics are considered a legitimate and integral part of such assessments. Based on an extensive questionnaire survey in three countries, the opinions of researchers are analysed. We provide comparisons across academic fields (cardiology, economics, and physics) and contexts for assessing research (identifying the best research in their field, assessing grant proposals and assessing candidates for positions). A minority of the researchers responding to the survey reported that metrics were reasons for considering something to be the best research. Still, a large majority in all the studied fields indicated that metrics were important or partly important in their review of grant proposals and assessments of candidates for academic positions. In these contexts, the citation impact of the publications and, particularly, the number of publications were emphasized. These findings hold across all the fields analysed; still, the economists relied more on productivity measures than the cardiologists and the physicists. Moreover, reviewers with high scores on bibliometric indicators seemed to adhere to metrics in their assessments more frequently than other reviewers. Hence, when planning and using peer review, one should be aware that reviewers—in particular reviewers who score high on metrics—find metrics to be a good proxy for the future success of projects and candidates, and rely on metrics in their evaluation procedures despite the concerns in scientific communities on the use and misuse of publication metrics.

Research organizations, funding agencies, national authorities and other organizations rely on peer assessments in their research evaluations. Peer assessments, in turn, may (partly) rely on metrics on scientific publications and their citations. In recent decades, such bibliometric indicators have become more easily accessible and have been used more in the evaluation of research. This raises the question of how such metrics impact what is perceived as good research, i.e. the notions of research quality. This paper addresses whether metrics are considered a legitimate and integral part of the assessment of research, explores the role of metrics in different review contexts and fields of research, and discusses implications for research evaluation and policy.

The use of metrics has a long history, dating back more than 100 years (De Bellis 2009). With the creation of the Science Citation Index by Eugene Garfield in 1961, new possibilities for quantitative studies of scientific publishing emerged, including analyses of how often the articles had been referred to or cited in subsequent scientific literature. Initially, the potential of bibliometrics within science policy was only seen by a few individuals (Martin 1996). Later, research evaluation became an important area of application of bibliometric analyses. Today, indicators or metrics are applied for a variety of purposes and have permeated many aspects of the research system (Abbott et al. 2010; Aksnes, Langfeldt and Wouters 2019). For example, metrics have long been provided to peer reviewers in research evaluations, such as in national research assessments and institutional reviews (Lewison, Cottrell and Dixon 1999; Wilsdon et al. 2015). Nowadays, individual applicants may be requested to provide standardized Curriculum Vitae (CVs) that include citation rates when applying for grants 1, and metrics may also play an important role in hiring and promotion processes (Stephan, Veugelers and Wang 2017).

The use of bibliometric indicators has been more common in the natural and medical sciences than in the social sciences and humanities ( Moed 2005 ). This may be due to the fact that the latter areas are less covered by standard bibliometric databases like Web of Science or Scopus ( Aksnes and Sivertsen 2019 ). They also have different communication practices with more publications in books and in national languages, and a slow accumulation of citations (although there is large heterogeneity at the level of disciplines). However, studies have shown that—even in the social sciences—it has become a common practice for researchers to include metrics in their CVs, applications for promotions, and grant applications ( Haddow and Hammarfelt 2019 ).

There are different types of metrics and a large variety of indicators (for an overview, see e.g. Ball 2017). By metrics, in this paper we refer to publication-based indicators, of which three types are investigated: productivity/number of publications, scientific impact/citations and the impact factor of journals where the publications appear. The most basic is the number of publications, which typically is regarded as an indirect measure of the volume of knowledge production. Citations and citation indicators, on the other hand, are commonly applied as proxies for the impact (or influence) of the research, one of the constituents of the concept of scientific quality (Aksnes, Langfeldt and Wouters 2019). One of the most popular and well-known bibliometric indicators is the journal impact factor (JIF), which is a measure of how frequently the average article in a journal has been cited. The impact factor is often regarded as an indicator of the significance and prestige of a journal (Glänzel and Moed 2002). To what extent bibliometric measures can be used as proxies for these dimensions of research activities is, however, a matter of debate. Particularly, this issue has been addressed with respect to citation indicators, and many studies have, over the years, been carried out in order to assess their validity and appropriateness as performance measures (Aksnes, Langfeldt and Wouters 2019).

The use of metrics has always been controversial and is a key debate in research evaluation practices ( Wilsdon et al. 2015 ). There are many examples of their misuse, and potentially negative impacts of metrics upon the research system have received increased attention ( Weingart 2005 ; de Rijcke et al. 2016 ). General concerns about metrics being used when assessing individual researchers are expressed in key documents, such as the Leiden Manifesto ( Hicks et al. 2015 ), which contains 10 principles for the appropriate measurement of research performance, as well as the San Francisco Declaration on Research Assessment 2 (DORA), which intends to prevent the practice of using the journal impact factor as a surrogate measure of the quality of individual articles.

Despite the large amount of attention devoted to these issues, there are few empirical studies investigating researchers' use of metrics in different evaluation processes and to what extent their own position, age, gender, and bibliometric performance affect this use. For example, publication metrics are not part of the criteria appearing in a recent review of studies of the criteria used to assess grant applications (Hug and Aeschbach 2020). The present study analyses the use of metrics when assessing the past achievements of applicants for positions and grants. Based on a questionnaire survey, different types of metrics are addressed: journal impact factors, citation indicators, and indicators on number of publications. To enable the exploration of possible diversity in the use of metrics, this study covers three main fields: cardiology, economics and physics, in three countries (the Netherlands, Norway, and Sweden). These fields are different in terms of how knowledge production is organized and valued (Whitley 1984), and in the way they relate to metrics. Moreover, there are notable differences between these countries when it comes to the role of metrics in national research policy. As an introduction, we therefore give some brief background information on these issues.

Economics is a field wherein journal rankings have long traditions and are highly influential. Such rankings play a role, for example, in evaluating the performance of economics departments and in hiring processes ( Kalaitzidakis, Mamuneas and Stengos 2011 ; Gibson, Anderson and Tressler 2014 ). Many rankings exist ( Bornmann, Butz and Wohlrabe 2018 ). In particular, much importance is attached to publishing in the so-called ‘Top Five’ journals of economics ( Hylmö 2018) , and a study by Heckman and Moktan (2018) showed that publishing in these journals greatly increases the probability of the author(s) receiving tenure and promotion.

In medicine, the journal impact factor has, over a long time, been a very popular indicator and has been used for purposes such as those described above, as well as for ranking lists delineating where scientists ought to submit their publications. There are many reports on this issue, covering medicine more generally ( Brown 2007 ; Sousa and Hendriks 2007 ; Allen 2010 ; Hammarfelt and Rushforth 2017 ) and cardiology more specifically ( van der Wall 2012 ; Coats and Shewan 2015 ; Loomba and Anderson 2018 ). According to van der Wall (2012) , publishing in journals with an impact factor below five is considered a signal of ‘mediocre scientific quality’ in some institutions and departments.

In physics, on the other hand, the use of impact factors appears to be less prevalent compared with medicine, although there is a journal hierarchy whereby certain journals, such as Physical Review Letters , are considered to be among the most prestigious ( Bollen, Rodriguez and Van De Sompel 2006 ). Moreover, there are some very large journals, such as the Physical Review series, and several physics journals are among the world’s largest journals in terms of publication counts.

The three academic fields also have different publication profiles, which may be expected to influence the respondents’ views on metrics. The average number of publications per researcher is generally higher in medicine and the natural sciences when compared to the humanities and the social sciences. A study by Piro, Aksnes and Rorstad (2013) found that, in economics, researchers (on average) published 4.4 publications during a four-year period, compared with 5.3 for clinical medicine and 9.5 for physics. However, the average for physics is highly influenced by individuals having extremely high publication output due to their participation in articles with hyper-authorship (articles with several hundred authors, Cronin 2001 ). Such papers appear in high energy physics, particularly when related to the European Organization for Nuclear Research (CERN). According to Birnholtz (2008) , hyper-authorship makes it difficult to identify the roles of individual contributors, which may undermine authorship as the traditional currency of science with respect to performance assessments and career advancement.

This study includes data from multiple countries and also, at the national level, there are differences which might influence the respondents’ views on metrics. In Norway, there is a performance-based funding model whereby bibliometric indicators are applied for the allocation of funding across institutions. The system allocating funding to Norwegian universities is based on (among other things) publication indicators where publication channels are divided into quality levels ( Sivertsen 2017 ). In Sweden, governmental institutional funding has previously been granted partly based on bibliometric indicators on publications and citations in Web of Science ( Hammarfelt 2018 ). While these systems are designed to work on an overall national level, they are sometimes applied at lower levels as well, such as faculties, departments, and individual researchers. This is documented in an evaluation of the Norwegian model ( Aagaard 2015 ). In Sweden, several universities have applied the Norwegian publication indicator to allocate resources within institutions ( Hammarfelt 2018 ). In the Netherlands, institutional funding is not linked to bibliometric measurement systems ( Wilsdon et al. 2015 ; Jonkers and Zacharewicz 2016 ), but there are still research assessments (organized every sixth year). Here, evaluations are made by expert panels, which may use qualitative as well as quantitative indicators to assess research groups or programmes ( Wilsdon et al. 2015 ). 3 In such evaluations, panels consisting of a few members are often asked to assess the research of several hundred individuals, wherein the total research output may encompass more than a thousand publications.

As for the use of bibliometrics for the kind of assessments addressed in this article (assessments of research funding applications and hiring processes), there is no systematic overview of practices across organizations or countries. Moreover, reviewers are based across organizations and countries, and their propensity to use metrics in assessments may or may not be shaped by the use of metrics in national systems for performance-based funding and research assessments. In sum, how this may vary across countries is not obvious.

More generally, there are at least three separate reasons why peer reviewers may opt to use metrics as (part of) their basis for assessments of grant proposals or of candidates for academic positions. Evaluation processes involve categorization—that is, examining the characteristics of the entities to be assessed and locating them in one or more hierarchies (Lamont 2012), and metrics may thus be helpful in several ways. First, metrics are easily accessible, and they ease the review task in terms of the time and effort required (Espeland and Sauder 2007: 17). Rather than spending time reading the applicants' publications, a reviewer may get an impression by looking up bibliometric indicators (citation counts, h-index, journal impact factor or similar). Second, metrics may be used because the reviewers find them to be good—or fair—proxies for research quality or research performance. They may, for example, find that in their field the best research is published in the highest-ranking journals (as these tend to have the strictest review processes), or that highly cited papers are those that prove most important for the development of the field (by introducing new and valuable knowledge), whereas non-cited papers seldom prove to have any significance. They may also find that comparing applicants based on such indicators provides a more objective, fair and reliable basis for assessments compared to peer assessments that are not informed by such indicators. 4 Finally, the use of metrics may be explicitly encouraged by those organizing the review. It may be part of review criteria and guidelines, and the organizer may provide the metrics to be used. 5

Similar types of reasons for introducing metrics (availability; good/fair proxies; encouraged from outside) may motivate research and funding organizations. At the organizational level, metrics provide easily accessible information about applicants, they may be perceived as highly relevant and impartial, having the potential to reduce biases in peer assessments, and may also be encouraged by national authorities. Moreover, successful sister organizations using metrics may serve as role models. 6

Concerning the reasons for reviewers’ individual use of metrics, the first and the last types of reasons are obviously present both in grant reviews and reviews of candidates for academic positions: metrics are easily available and at least some funding agencies and research organizations encourage their use. The second type of reason, that metrics are perceived as being good proxies for research quality or research performance, is more uncertain and may vary substantially by field. Moreover, as peer reviewers have discretionary power and the basis of their judgements is not monitored, it may be a necessary condition that the reviewers perceive metrics to be an adequate basis for assessments. If they find metrics to be good proxies, they can be expected to use them, regardless of whether they are encouraged in the guidelines and/or provided to them. Conversely, if they perceive metrics to be an inadequate tool for evaluation, they may disregard guidelines encouraging their use and/or the metrics provided to them.

Against this background, this study addresses two main research questions:

To what extent are metrics part of researchers’ notion of good research?

To what extent are metrics used in reviews of research proposals and in reviews of candidates for academic positions?

The first question was investigated by asking the respondents to characterize the best research in their field, and whether high journal impact factors and many citations are among these characteristics. To answer the second question, we studied the respondents’ emphases for assessments of research proposals and candidates for academic positions. This issue was investigated for two types of indicators: publication productivity and citation impact. We aim to understand why some researchers are more apt to rely on metrics in their assessments, and explore how the use of metrics varies between field of research and other background characteristics.

Based on previous studies, we expect views on metrics to be diverse, both within fields and within countries ( Aksnes and Rip 2009 ; Wilsdon et al. 2015 ; Söderlind and Geschwind 2020 ) 7 . In a survey to researchers who reviewed grant proposals for the Research Council of Norway (RCN) (including reviewers in all fields of research, most of them from European countries apart from Norway), some commented that they would like the RCN to provide standardized metrics to the reviewers, while others stated that the RCN should try to minimize the weight put on metrics ( Langfeldt and Scordato 2016 ).

This paper draws on data from a web survey which explored varying notions and conditions of good research. The survey was filled out by researchers in physics, economics and cardiology in the Netherlands, Norway and Sweden. The three fields belong to different parts of science (the social sciences, the natural sciences and the medical sciences), and as noted above, they differ in publication profiles and in the use of metrics.

3.1 Sampling and response rates

The invited survey sample included all researchers active in the aforementioned three fields at the most relevant universities in the three countries, as defined by Web of Science data and journal classification. For this, a three-step sampling strategy was used: in step one, we used journal categories to identify institutions with a minimum number of articles in the relevant journal categories in the period 2011–2016 (Web of Science (WoS) categories: ‘Physics’, ‘Astronomy & Astrophysics’; ‘Economics’; ‘Cardiac & cardiovascular systems’). In step two, the websites of these institutions were searched for relevant organizational units to include in the survey, and we generated lists of personnel in relevant academic positions (including staff members, post-docs and researchers—not including PhD students, adjunct positions, guest researchers or administrative and technical personnel). Some departments also had research groups in other disciplines than the one selected. In these cases, we removed the personnel found in the non-relevant groups. In step three, we added people (at the selected institutions) with a minimum number of WoS publications in the field, regardless of which department/unit they were affiliated with. For economics, a limit of at least five WoS publications (in 2011–2016) was used. In the case of cardiology and physics, where the publication frequency (and co-authorship) is higher, a minimum of 10 publications was used. 8 In this way, we combined two sampling strategies in order to obtain a comprehensive sample: based on the organizations' websites, we identified the full scope of researchers within a department/division (step two), and based on WoS categories, we identified those who publish in the field (step three).

The web survey yielded viable samples of researchers for each of the three fields; in total, there were 1621 replies 9 (a 32.7% response rate among those invited to the survey). The response rates varied substantially by country: 49.1% in Norway, 38.7% in Sweden, and 19.9% in the Netherlands. Response rates also varied somewhat by field (25.8% in cardiology, 31.5% in economics, 37.1% in physics), and the Dutch cardiologists in particular were less likely to reply: only 12.8% of them did (see Table 1). 10 These biases were controlled for with weights in the bivariate analyses; see Section 3.4.

Table 1. Response rates by field and country

Country        Cardiology         Economics          Physics            Total
               % replied    n     % replied    n     % replied    n     % replied    n
Netherlands    12.8         725   20.9         745   24.3         1010  19.9         2480
Norway         47.4         378   52.2         224   49.0         433   49.1         1035
Sweden         27.8         601   42.0         305   42.3         1526  38.7         2432
Total          25.8         1704  31.5         1274  37.1         2969  32.7         5947

3.2 Dependent variables in the analyses

In the survey, the respondents were asked why they considered something to be the best research in their field and what was important for their assessments of grant proposals and candidates for academic positions. The two latter questions were only posed to respondents who indicated that they had conducted such reviews in the last 12 months. 11 Reply categories included various qualitative aspects and characteristics of good research as well as bibliometric indicators and open category answers (see survey questions in Supplementary Appendix ).

The two kinds of assessments analysed in this paper—review of grant proposals and of candidates for academic positions—are performed in different settings. Research funding agencies and universities typically provide the contexts for these assessments. Within both types of organizations, the reviewers are normally provided with guidelines outlining the criteria and procedures for the review and are asked to compose a written review explaining their conclusions. Both types of assessments often include panel meetings in which the reviewers conclude on the ratings and/or ranking of the candidates/proposals.

The concerns and relevance of metrics in the reviews may vary greatly. When reviewing candidates for positions in research organizations, the reviewers are involved in facilitating or impeding the career of someone who might be their future colleague, and they often decide the composition of competencies and research interests at their own—or at a collaborating institution. This work may involve the reading and assessment of a considerable number of candidates’ publications or simply assessing the publication lists based on metrics. Reviewer tasks for funding agencies may vary from assessing proposals for small individual grants to assessing those for long-term funding for large groups/centres, and from a few proposals close to their own field of research to many proposals assigned to a multi-disciplinary group of reviewers. The proposals may address a specific thematic call or a call open to all research questions, and the applicants’ project descriptions and competencies are to be assessed accordingly.

In the survey we asked respondents about what they emphasized the last time they reviewed grant proposals, and what they emphasized the last time they reviewed a candidate for a position. They were also asked to indicate the type of grant/position in question. Metrics may be perceived as being less relevant as a basis for assessing junior applicants, i.e. applicants with a more limited track record. Hence, in this analysis, we distinguished between different types of positions and grants: recruiting to a junior or senior position; reviewing proposals for a research project, fellowship or large grant/centre, either to open calls or to targeted calls.

3.3 Control variables

Research quality notions and assessments may differ between fields and countries, and may be influenced by the respondent's age, gender and academic position. Hence, in the analyses, we controlled for these variables, as well as for the type of grant or academic position being assessed. Table 2 provides details on the control variables. Note that all three fields studied are male-dominated. Even though the response rate among the female respondents was somewhat higher than among the male respondents, the obtained sample consists of 23% female respondents and 77% male respondents. 12

Table 2. Descriptive statistics

Variable/value                               Count value   % value   n
Age: 39 and younger                          404           28        1435
Age: 40 to 49 years old                      369           26        1435
Age: 50 to 59 years old                      302           21        1435
Age: 60 years and older                      360           25        1435
Gender: Female                               325           23        1432
Gender: Male                                 1107          77        1432
Position: Assistant Professor                463           29        1611
Position: Associate Professor                391           24        1611
Position: Leader                             77            5         1611
Position: Other                              195           12        1611
Position: Professor                          485           30        1611
Recruiting Juniors                           552           71        774
Recruiting Seniors                           222           29        774
Grant specification: Open Call               450           70        639
Grant specification: Target Research Call    189           30        639
Grant type: Fellowship                       83            13        643
Grant type: Large Grants/Centre              78            12        643
Grant type: Research Project                 482           75        643

Respondents' bibliometric performance            Mean    St. Dev.   Min     Max      n
Number of publications                           27.86   63.835     0.250   781.00   1355
Log of number of publications                    2.17    1.598      −1.39   6.66     1355
Have cited publications (dummy MNCS)             0.96    0.202      0.00    1.00     1355
MNCS                                             1.46    2.262      0.00    30.84    1355
Log of MNCS                                      −0.03   0.913      −2.30   3.43     1297
MNJS                                             1.34    0.985      0.10    18.88    1355
Log of MNJS                                      0.15    0.517      −2.30   2.94     1355
Having publications in top percentile (dummy)    0.61    0.488      0.00    1.00     1355
Share of publications in top percentile          13.94   20.796     0.00    100.00   1355
Log of share of publications in top percentile   2.73    0.933      0.00    4.61     828

The n is smaller for reviews of grant proposals and candidates for positions, as these questions were posed only to those who reported having participated in such reviews in the last 12 months.

The log of MNCS and the log of the share of publications in the top percentile include only respondents with scores above 0 on the respective indicator.

In addition, we examined the relation between the respondents’ publication outputs and their replies. The data on the respondents’ publication output was collected from the Web of Science database (WoS) covering the 2011–2017 period, and included articles, reviews and letters published in journals indexed in WoS. 13 Three types of indicators were calculated. First, the number of publications per respondent during the period. Second, their mean normalized citation score (MNCS). Here, the citation numbers of each publication were normalized by subject field, article type and year, and then averages were calculated for the total publication output of each respondent. Third, their mean normalized journal score (MNJS) was determined, which involved similar calculations for the journals. The latter indicator is an expression of the average normalized citation impact of the journals in which the respondents have published their work, and high scores indicate that the respondents have published in a high-impact journal. On both indicators, 1.00 corresponds to the world average. As an additional citation indicator, the proportion of articles that are among the 10% most cited articles in their fields has been included (the share of publications in the top percentile can be found in Table 2 ).
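The normalization behind these indicators can be made concrete with a short sketch. The Python snippet below uses pandas with a toy publication table and hypothetical column names (author, field, doc_type, year, cites); it illustrates the general logic of a mean normalized citation score and a top-10% share, not the exact computation used for the survey data, where the baselines are derived from the full Web of Science database rather than from the sample itself.

```python
import pandas as pd

# Toy publication-level data; in practice the baselines below would be computed
# over the entire bibliographic database, not only the respondents' publications.
pubs = pd.DataFrame({
    "author":   ["A", "A", "B", "B", "B"],
    "field":    ["ECON", "ECON", "PHYS", "PHYS", "PHYS"],
    "doc_type": ["article", "article", "article", "article", "letter"],
    "year":     [2012, 2014, 2013, 2013, 2016],
    "cites":    [10, 2, 50, 4, 7],
})

# Expected citation rate per subject field, document type and publication year.
baseline = pubs.groupby(["field", "doc_type", "year"])["cites"].transform("mean")
pubs["norm_cites"] = pubs["cites"] / baseline

# MNCS: mean of the normalized citation scores over each author's publications.
mncs = pubs.groupby("author")["norm_cites"].mean()

# Share of an author's publications among the 10% most cited in their reference set.
cutoff = pubs.groupby(["field", "doc_type", "year"])["cites"].transform(
    lambda c: c.quantile(0.9))
share_top10 = (
    pubs.assign(top10=pubs["cites"] >= cutoff)
        .groupby("author")["top10"]
        .mean() * 100
)

# The MNJS is computed analogously, with journal-level expected citation rates
# in place of the per-publication baseline.
print(mncs)
print(share_top10)
```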

We included these metrics in binary logistic regression analyses, investigating the relation between the respondents' bibliometric performance and their emphases on metrics when characterizing the best research in their fields, assessing grant proposals and assessing candidates for positions. Equations are attached in the note. 14 Apart from the factors included in the model, respondents' institutional affiliation and their specific research fields may influence their emphases in the assessment of research. Institutional affiliation has been shown to influence researchers' evaluations, at least in recruitment (Musselin 2010), and there may be large differences within research fields regarding notions of research quality and use of metrics (Lamont 2009; Hylmö 2018). Due to low numbers of respondents per institution, and insufficient data on subfields, we have not been able to control for these factors.

The bibliometric variables were skewedly distributed among the respondents, and thus the binary logistic regression analyses were conducted with log-transformed bibliometric variables, which ANOVA and AIC tests showed improved our models. We settled on models with the log-transformed variables when displaying field differences. Still, for graphic illustration of the results, the original variables are used to ease interpretation for the reader. Table 2 displays the distribution of the original and log-transformed metrics variables.

It should be noted that the Web of Science database does not equally cover each field’s publication output. Generally, physics and cardiology are very well encompassed, while the coverage of economics is somewhat less so, due to different publication practices ( Aksnes and Sivertsen 2019 ). In addition, not all respondents had been active researchers during the entire 2011–2017 period, and for 16% of the respondents in the sample no publications were identified in the database. The latter individuals were not included in the bibliometric analyses. Despite these limitations, the data provides interesting information on the bibliometric performance of the researchers at an overall level.

3.4 Methods

We used the programming software ‘R’ when analysing the data and ‘RMarkdown’ for visualization. The RMarkdown file can be provided upon request.

Weighted results: As sample sizes vary by field and country, the bivariate analyses were weighted so that each field in each country contributed equally to the totals (the weights are presented in Table 3). In the regression analyses, both field and country were controlled for, and the weights were not applied. 15

Table 3. Weights by field and country

Field         Sweden   Norway   The Netherlands
Cardiology    1.453    1.185    2.771
Economics     1.476    1.715    1.354
Physics       0.332    1.059    0.866
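How such weights can be constructed and applied may be illustrated with a minimal sketch. The DataFrame, column names and the equal-cell-contribution rule below are assumptions made for illustration rather than a description of the authors' exact procedure, but they show how weights of the kind reported in Table 3 arise and how they enter a weighted percentage.

```python
import pandas as pd

# Hypothetical respondent-level replies (1 = metrics emphasized, 0 = not).
df = pd.DataFrame({
    "country": ["Sweden"] * 5 + ["Norway"] * 2 + ["Netherlands"],
    "field":   ["Physics"] * 4 + ["Economics", "Cardiology", "Cardiology", "Cardiology"],
    "metrics_important": [1, 0, 1, 1, 0, 1, 0, 1],
})

# Give every field-by-country cell the same effective size: a cell's weight is
# (average cell size) / (its own size), so over-represented cells are down-weighted.
n_cells = df.groupby(["country", "field"]).ngroups
target_size = len(df) / n_cells
cell_size = df.groupby(["country", "field"])["field"].transform("size")
df["weight"] = target_size / cell_size

# Weighted share of respondents emphasizing metrics.
weighted_share = (df["metrics_important"] * df["weight"]).sum() / df["weight"].sum()
print(f"{100 * weighted_share:.1f}% (weighted)")
```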

Analyses: Binary logistic regression models were applied, with the stated characteristics of the best research, the emphases when assessing grant proposals and the emphases when assessing candidates for positions as dependent variables, while respondent characteristics (field, country, gender, age, academic position, and bibliometric performance) and type of proposal/position under review were included as control variables. To estimate whether the independent variables contributed significant explanation of the variation in the dependent variable, we applied ANOVA tests (Agresti 2013). We further conducted AIC and BIC tests to detect which independent variables best explained the dependent variable (Agresti 2013) and applied the variance inflation factor test (VIF test) to check for multicollinearity (Lin 2008). Finally, we checked for interaction effects between the independent variables. In the analyses, we used Sweden and economics as baseline categories; Sweden because it was the largest group and economics because it eased the interpretation of field differences (economics was the most deviant category). We also conducted the analyses with the other countries and fields as baseline categories to validate the presented results. 16
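As a concrete, deliberately simplified sketch of this modelling step, the snippet below fits a binary logistic regression with treatment-coded field and country dummies (economics and Sweden as baselines) and a log-transformed publication count, and then inspects variance inflation factors and AIC/BIC. The data are simulated and all variable names are hypothetical; the specification is an illustration of the approach, not the authors' exact model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated respondent-level data with hypothetical variable names.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "metrics_important": rng.integers(0, 2, n),   # 1 = metrics rated highly important
    "field": rng.choice(["Economics", "Physics", "Cardiology"], n),
    "country": rng.choice(["Sweden", "Norway", "Netherlands"], n),
    "log_n_pubs": np.log(rng.integers(1, 200, n)),
})

# Binary logistic regression with economics and Sweden as baseline categories.
model = smf.logit(
    "metrics_important ~ C(field, Treatment('Economics'))"
    " + C(country, Treatment('Sweden')) + log_n_pubs",
    data=df,
).fit(disp=False)
print(model.summary())
print("AIC:", model.aic, "BIC:", model.bic)

# Variance inflation factors for the design matrix (multicollinearity check).
X = pd.DataFrame(model.model.exog, columns=model.model.exog_names)
vif = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(pd.Series(vif, index=X.columns))
```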

We display the results from the binary logistic regression analyses in dot-and-whiskers plots with the fields’ logit coefficients. In the graphs, economics is the baseline category (dotted line), and the likelihood of belonging to physics or cardiology is marked with standard errors. Hence, the graphs do not show potentially significant differences between physics and cardiology. In the (rare) cases wherein these differences are significant, this is commented on in the text. In addition, we illustrate results by calculating changes in probabilities on the dependent variables produced by the independent variables for selected subgroups. The full regression models are in the Supplementary Appendix Tables A1–A11 .
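The step from logit coefficients to the subgroup probabilities quoted in Section 4 is the inverse-logit transformation. A minimal sketch with made-up coefficient values (the fitted estimates are in the Supplementary Appendix tables):

```python
import numpy as np

def inverse_logit(x: float) -> float:
    """Map a linear predictor (log-odds) onto a probability."""
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative values only, not the coefficients estimated in the article.
intercept = -1.5          # reference group, e.g. a Swedish economics professor
beta_physics = -0.45      # field dummy relative to the economics baseline
beta_cardiology = -0.90

p_econ = inverse_logit(intercept)
p_phys = inverse_logit(intercept + beta_physics)
p_card = inverse_logit(intercept + beta_cardiology)
print(f"economics {p_econ:.2f}, physics {p_phys:.2f}, cardiology {p_card:.2f}")
```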

4.1 Characteristics of the best research

As characteristics of the best research in one's field, impact factors and citations were among the less important aspects. In total, 22% of the respondents indicated journal impact factor and/or citation rates as reasons for considering something the best research in their field 17, whereas the most frequent reason was that the research had solved key questions in their field (67%, see Table 4). Notably, respondents could select multiple replies, and very few selected journal impact factor and/or citations as their only reasons for considering any research to be the best. 18

Table 4. Reasons for considering something the best research in their field (percent, weighted results). Values: Cardiology / Economics / Physics / Total.

Has answered/solved key questions/challenges in the field   70 / 62 / 70 / 67
Has changed the way research is done in the field (e.g. methodological breakthrough)   35 / 57 / 47 / 47
Has changed the key theoretical framework of the field   31 / 32 / 38 / 33
Has been a centre of discussion in the research field   30 / 29 / 34 / 31
Has benefitted society (e.g. appl. in industry, new clinical practices, informed public policy)   38 / 26 / 18 / 27
Has enabled researchers in the field to produce more reliable or precise research results   21 / 24 / 25 / 23
Was published in a journal with a high impact factor   18 / 21 / 13 / 17
Has attracted many citations   11 / 20 / 14 / 15
Has drawn much attention in the larger society   14 / 13 / 11 / 12
Is what all students/prospective researchers need to read   4 / 10 / 7 / 7
Other, please specify   2 / 2 / 2 / 2
Cannot say   2 / 3 / 1 / 2
n   405.25 / 405.25 / 405.25 / 1621

The binary logistic regression analysis indicates field-dependent reasons for considering something to be ‘the best research’, as illustrated in Figure 1 (see Supplementary Appendix Tables A1–A3 for full regression models). Economists were significantly more inclined than physicists to indicate journal impact factor as a reason for considering something the best research, but differences between the economists and cardiologists, or between the cardiologists and physicists, were not statistically significant. Moreover, the economists were more inclined than both the physicists and cardiologists to indicate many citations as a reason for considering any research to be the best. Interpreting the results, the regression coefficients imply that, for Swedish economists, the probability of answering ‘high impact factor’ was 18%, while the probability for Swedish physicists was 14%. Similarly, the probability for Swedish economists to answer ‘citation impact’ was 18%, while it was 10% for cardiologists and 13% for physicists in the same country.

Figure 1. Journal impact factor and citations as reasons for considering something the best research in the field. Field coefficients from binary logistic regression analyses (Dot-And-Whiskers Plots). Economics as the baseline category, represented by the dotted line.

The ANOVA-analyses revealed country-dependent replies, but no dependence on the other control variables appeared ( Supplementary Appendix Tables A1–A3 ). Interestingly, respondents in Norway were more inclined to indicate metrics as reasons for considering something to be the best research. Hence, country-related differences in adherence to metrics should be further explored.

4.2 Grant proposals

Whereas quantitative indicators appeared to have moderate importance in the identification of the best research in the field, 45% of those who had reviewed grant proposals replied that the number of publications/productivity was ‘highly important’ in their assessments of the best proposal, and 23% found citations ‘highly important’ in their assessments. These metrics were also relatively important compared to several other aspects ( Table 5 ). They still appear far below the ‘research question’ (94%) and the ‘methods/research plan’ (85%), which came up as the most important in the assessments. Nonetheless, including those who replied ‘somewhat important’ (48% for number of publications and 59% for citation impact), the great majority replied that such metrics impacted their assessments of which proposal was the best ( Supplementary Appendix Table A12 ).

Table 5. Aspects identified as "highly important" in grant assessments (percent, weighted results). Values: Cardiology / Economics / Physics / Total (Total n).

Project description: research question/problem selection   98 / 87 / 94 / 94   (n = 678)
Project description: methods/research plan   90 / 82 / 81 / 85   (n = 670)
Track record of the research team: important prior contributions in the relevant research field (assessed independently of citation scores and source of publication)   46 / 35 / 55 / 46   (n = 673)
Track record of the research team: number of publications/productivity   41 / 50 / 44 / 45   (n = 674)
The research environment: resources and facilities for performing the proposed research   59 / 15 / 41 / 41   (n = 671)
Track record of the research team: citation impact of past publications   18 / 29 / 25 / 23   (n = 676)
Track record of the research team: experience with risk-taking research   19 / 14 / 21 / 19   (n = 670)
Communication/dissemination plan for scientific publications   9 / 10 / 9 / 9   (n = 673)
Other, please specify   10 / 8 / 9 / 9   (n = 664)
Communication/dissemination plan addressing user groups outside academia   6 / 6 / 5 / 6   (n = 666)

The binary logistic regression analysis shows that emphases on metrics were field-dependent, as shown in Figure 2 (full regression models are shown in Supplementary Appendix Tables A4–A7 ). Compared to cardiologists, the economists (dotted line) were significantly more inclined to identify the number of publications and citation impact as ‘highly important’. Conversely, the physicists were significantly more inclined to emphasize important research contributions (assessed independently of metrics) than were the economists. The analysis uncovered less difference between physicists and cardiologists, but still, the physicists were significantly more inclined than the cardiologists to emphasize citations.

Figure 2. Aspects identified as ‘highly important’ in assessing grant proposals. Field coefficients from binary logistic regression analyses (Dot-And-Whiskers Plots). Economics appears as the baseline category, represented by the dotted line.

The regression coefficients imply that, for Swedish economics professors with an average share of top publications and an average number of publications, the probability of identifying the number of publications and/or citations as ‘highly important’ in their assessment of proposals to open calls is 50%, whereas the probability for the corresponding groups of physicists and cardiologists is substantially lower (39% for physicists and 29% for cardiologists). Conversely, the economists in this group (professors with average bibliometric scores) were less inclined to emphasize ‘important prior research contributions assessed independently of metrics’ (59% for physicists, 49% for cardiologists, and 42% for economists).

Furthermore, the regression analyses indicated insignificant country-related effects, but significant effects of the respondents' academic positions and the type of grants being reviewed. The probability of identifying citations or number of publications as highly important was lower when assessing project grants than when assessing fellowships or large grants/centres. Moreover, the probability of identifying the number of publications as highly important was lower when reviewing proposals to open calls rather than targeted calls (Supplementary Appendix Tables A4–A7). The replies also depended on the respondents' bibliometric performance, as discussed in detail below.

4.3 Candidates for positions

Similar results to those for assessing grant proposals appear for the assessment of candidates for positions: quantitative measures appear more important than when identifying the best research in the field. Forty-two percent answered that the number of publications/productivity was ‘highly important’ in their assessments of candidates (Table 6). Citation impact appears to be less important (19% replied that this was highly important). Notably, research contributions assessed independently of citation scores and publication source appear more important than number of publications/productivity in cardiology (47% highly important) and physics (61% highly important). In economics, on the other hand, there is a higher percentage who find the number of publications/productivity to be highly important (54%) and a lower percentage who find that contributions assessed independently of metrics are highly important (45%).

Table 6. Aspects identified as ‘highly important’ when assessing candidates for positions (percent, weighted results). Values: Cardiology / Economics / Physics / Total (Total n).

The potential for future achievements   83 / 90 / 87 / 87   (n = 823)
Matching field/expertise to the needs of the group/unit/project   70 / 65 / 72 / 69   (n = 816)
General impression from interview with candidate   74 / 52 / 69 / 65   (n = 813)
Communication and language skills   61 / 40 / 53 / 51   (n = 821)
Research achievements: important prior research contributions (assessed independently of citation scores and source of publication)   47 / 45 / 61 / 51   (n = 810)
Research achievements: number of publications/productivity   34 / 54 / 39 / 42   (n = 817)
Ability to compete for research grants   43 / 21 / 33 / 32   (n = 814)
Standing of the unit/group where the candidate is/has been working/trained   24 / 20 / 16 / 20   (n = 815)
Research achievements: citation impact of past publications   17 / 19 / 21 / 19   (n = 808)
Teaching experience/achievements (including supervision of students)   18 / 18 / 16 / 18   (n = 816)
Ensure diversity in the group/department (e.g. gender, ethnicity, age)   13 / 11 / 17 / 14   (n = 816)
Other, please specify   12 / 5 / 23 / 13   (n = 95)
Experience/achievements from work outside science   10 / 4 / 4 / 6   (n = 815)
Experience in interacting with the public/users/industry   8 / 2 / 3 / 4   (n = 805)

When the respondents were asked to identify the most important among the aspects they had identified as highly important, the candidate's ‘potential for future achievements’ and ‘expertise matching the group/unit/project’ prevail as the two most important aspects in all three fields, indicating that these are of generally high importance regardless of field. The third most important aspect, however, varied greatly between the fields: whereas cardiology appears with ‘general impression from interview with candidate’ and physics with ‘important prior research contributions (assessed independently of citation scores and source of publication)’, in economics ‘number of publications/productivity’ appears as the third most important aspect (Supplementary Appendix Figure A1).

Binary logistic regression analysis confirms statistically significant differences between fields (documented in Supplementary Appendix Tables A8–A11 and illustrated in Figure 3 below). Economists were more inclined than both cardiologists and physicists to identify the number of publications as ‘highly important’ when assessing candidates for positions (the difference between the two latter fields was not significant). Moreover, physicists, more frequently than economists, answered that prior research contributions had been ‘highly important’. The regression coefficients for a reference group of Swedish professors with average scores on the bibliometric indicators (number of publications and share of top cited publications) who assess candidates for senior positions, show that the probability of stating that the ‘number of publications’ is ‘highly important’ was 83% in economics, 68% in physics and 57% in cardiology. In contrast, the probability in this group of answering that prior contributions were ‘highly important’ was 84% in physics, 76% in cardiology, and 72% in economics.

Figure 3. Aspects identified as ‘highly important’ when assessing candidates for positions. Field coefficients from binary logistic regression analyses (Dot-And-Whiskers Plots). Economics as the baseline category, represented by the dotted line.

In sum, the results indicate some field differences in line with the different publication and authorship patterns noted in Section 2. Economics, the field with the lower average number of publications per author and lower average number of co-authors, relies more frequently on number of publications/productivity when assessing candidates for positions. Conversely, higher numbers of co-authors and publications appear in physics and cardiology compared to economics ( Piro, Aksnes and Rorstad 2013 ), and this may be a reason for less emphasis on number of publications in the former fields. In these fields, it may be far less straightforward to reach conclusions based on the length of individual researchers’ publication lists.

The ANOVA tests showed that, in some of the models, the respondents' gender, age and country had a significant effect on the respondents' emphases. The country-related effects were mostly insignificant, but being Dutch instead of Swedish decreased the probability of identifying publication numbers as highly important. Professors were less inclined than those in other kinds of positions to see the number of publications as ‘highly important’, while there was no significant effect on emphasis on citations. Furthermore, both quantitative measures and important prior contributions were more often seen as important when recruiting to senior rather than to junior positions.

4.4 The effects of the respondents’ bibliometric performance

Looking further into the results, a key question is whether the respondents’ emphases on publication metrics corresponded with their own bibliometric performance. For example, one might expect researchers with many publications to put more emphasis on this dimension in their assessments. Therefore, we compared the respondents’ answers with their own scores on the relevant bibliometric indicators.

The regression analyses showed no effect of the respondents’ bibliometric performance on their reasons for considering something the best research. However, in practice, when assessing grant proposals and candidates for positions, their own performance was positively related to their use of metrics. When assessing grant proposals, the probability of identifying the number of publications and citation impact as ‘highly important’ increased with the respondents’ number of publications, with whether they had top percentile publications, and with their share of top percentile publications. 19 Figure 4 displays this relationship and shows how the probability of identifying citation impact and/or number of publications as highly important in assessments of grant proposals increases with Swedish economics professors’ own number of publications. 20 Respondents with high bibliometric performance scores more frequently considered such indicators important in their assessments. On the other hand, the respondents’ bibliometric performance did not affect whether they found prior contributions ‘highly important’ (Supplementary Appendix Table A7).

Figure 4. Assessment of grant proposals: the probability of identifying the number of publications and/or citation impact as ‘highly important’, by the respondents’ number of publications. The rug at the x-axis marks the number of observations.

Similarly, the use of metrics in the assessment of candidates for academic positions depended on the respondents’ bibliometric performance, but less so than for grant assessments. The respondents’ own number of publications did not significantly affect the probability of identifying candidates’ citation impact or number of publications as ‘highly important’, although the log-transformed number of publications had a significant positive effect. Moreover, a respondent having top cited publications increased the probability of identifying candidates’ citations as ‘highly important’. For Swedish economics professors recruiting for senior positions, having top cited publications increased this probability from 28% to 40%. However, neither the respondents’ MNCS, MNJS nor share of top cited publications had a significant effect on the use of metrics in the assessment of the candidates. Hence, the effect of the respondents’ own bibliometric performance was weaker in these assessments than in the assessment of grant proposals.

Moreover, the respondents’ MNCS (log-transformed) and MNJS increased the probability of identifying prior research contributions as highly important when assessing candidates for positions, but not when assessing grant proposals.

4.5 Divergent opinions and perspectives

As to whether metrics are considered a legitimate and integral part of research assessment, the results indicate conflicting views as well as differences between review contexts and types of metrics. A large majority of respondents refer to metrics in their reviews and seem to find them a legitimate and ordinary basis for review. This applies particularly to the number of publications/productivity in the review of grant proposals; only six percent replied that this was not important (Supplementary Appendix Table A12). Still, a substantial proportion (33%) indicated that citation scores were not important in assessments of candidates for academic positions (Supplementary Appendix Table A13).

The free text replies concerning the main positive characteristics of the best proposals illustrate the divergent opinions and perspectives. Some grant reviewers emphasized that metrics were not important (illustrated by #1 and #2 in Table 7 ). Others emphasized publication and citation rates as key characteristics of the best proposal, or simply publications in top international journals (#3 and #4 in Table 7 ).

Table 7. Free text replies—what was important for your assessment of:

The best grant proposal

1. ‘I evaluate research according to the value of proposed research. For the highest scores, there has to be an outstanding problem to address. There has to be a realistic plan outlining how this is possible. […] I don't pay much attention to the rate of publications, but rather to the lasting impact of these. I also do not care much about citations, as this varies profoundly between different topics. Rather, impact must be assessed based on an actual understanding of previous research’ (Research project grant, physicist in Sweden who selected ‘not important’ both on number of publications and citations.)

2. ‘Solves relevant questions. Science at excellent level considering modern perspectives in research evaluation (NOT publication and citations numbers as primary component).’ (Fellowship grant, cardiologist in Norway who selected ‘not important’ both on number of publications and citations.)

3. ‘Concrete yet ambitious proposal, novel methodology, included a Plan B if the risky plan A failed, collaboration with industry to obtain interesting field data, productive researcher with high H-index.’ (Research project grant, social sciences/other in the Netherlands.)

4. ‘Important research problem, High-quality candidate(s), Track record in terms of publications in top international journals (e.g. Nature, Science), assessed independently of citation records.’ (Large grant/centre, physicist in the Netherlands.)

The best candidate for a position

5. ‘The candidate was expected to do research, to teach, and in particular to build a Research Group within [subfield]. Communication skills, Cultural competence and networking ability are crucial, in addition to number and “weight” of publications.’ (Junior/early career position, cardiologist in Norway.)

6. ‘Much of the selection is based on the impact factors of the journals a candidate has published in, and potentially the network of the candidate (I do not necessarily believe these are the best criteria per se, but they are generally used).’ (Junior/early career position, economist in the Netherlands.)

7. ‘It's a mix. Citation impact without productivity indicates a few very highly cited papers, which is not what I mean. It should be a combination of high productivity of high-quality papers that also have attracted citations. So number of high-quality papers, overall citations, h-index, and where the work was published all matter.’ (Senior/tenure position, physicist in Sweden.)


Several of those who had reviewed candidates for positions seemed to regard publications in major/top journals as a basic or objective criterion, and then added other important characteristics that would suit the particular research group or the tasks of the position (#5 in Table 7). Others indicated that the ranks of the journals the candidate had published in, or a combination of relevant metrics, were important in the selection process. Still, views on the adequacy of such criteria varied (#6 and #7 in Table 7).

In this paper we have explored whether metrics are part of researchers’ notion of good research, and whether metrics are used when reviewing research. Concerning the first issue, only a minority of the respondents reported metrics as a reason for considering something to be the best research. Thus, the empirical support for such an association is generally weak. On the second question, we find strong supportive evidence as a large majority indicated that metrics were important or partly important in their review of grant proposals and assessments of candidates for academic positions.

Notably, drawing conclusions on researchers’ notions of research quality is difficult. Research quality is a multidimensional concept; what are seen as the key characteristics of good research may differ greatly between contexts and fields (Langfeldt et al. 2020). Metrics such as citations, publication counts or journal impact factors may be perceived as relating to different characteristics of research quality. According to bibliometric studies, for example, citations reflect (to some extent) the scientific value and impact of research, but not its originality, plausibility/soundness or societal value (Aksnes, Langfeldt and Wouters 2019). Our data indicate that the respondents distinguish between quantitative indicators as proxies for success when assessing the potential of future projects or candidates for positions and what they hold to be the characteristics of good research. A large majority of the respondents reported metrics as highly or somewhat important in their reviews of grant proposals and of candidates for positions, whereas about one-fifth of them indicated that one of their reasons for concluding on what was the best research in their field was that it was published in a journal with a high impact factor or that it had attracted many citations. Hence, for one-fifth of the researchers in the survey, metrics seem to serve as a judgement device when identifying good research within their own field. This does not necessarily imply that they hold metrics, as such, to be characteristics of good research. Very few respondents indicated the journal impact factor or high citation rates as sole indicators of the best research in their field, and there is little indication that respondents view quantitative indicators as a sufficient basis for concluding on eminent science. Nevertheless, some have suggested that publishing in high-impact journals has become an independent measure of scientific quality (Wouters 1999; Rushforth and de Rijcke 2015).

Moreover, the analysis indicates significant field differences in the use of publication metrics: The economists were more inclined to indicate journal impact factor and many citations as reasons for concluding that something is the best research in their field, and they were more inclined to emphasize the applicants’ number of publications when assessing grant proposals and candidates for positions. Physicists and cardiologists, on the other hand, were less inclined to emphasize metrics and more inclined to emphasize prior research contributions assessed independently of metrics. These differences go along with differences in how research is organized and valued in these fields. In economics, research is mostly performed by individuals and organized around a theoretical core and key journals of high importance for individual reputation ( Whitley 1984 ; Hammarfelt and Rushforth 2017 ; Hylmö 2018 ). Herein, high reliance on metrics may be explained by the combination of an explicit journal hierarchy and organization of research that makes it easier to attribute research performance to individuals. Physics consists of highly collaborative fields, some with hyper-authorship ( Birnholtz 2008 ), and using publication metrics to attribute research performance to individual researchers is more difficult. Similarly, cardiology is a field within medical research with specialized tasks and skills, highly dependent on collaboration, resources and facilities for performing research ( Whitley 1984 ), which may explain the lower emphasis on publication metrics than in economics, as well as far stronger emphases on research resources and facilities when assessing grant proposals. Notably, there is also much variation in replies within the fields: for example, a substantial proportion of the physicists and the cardiologists indicate the applicant’s number of publications as highly important when assessing grant proposals and candidates for academic positions, whereas others find it unimportant or somewhat important. In sum, this points to the importance of understanding how epistemic and organizational differences—both between and within research fields—generate different bases for assessing research and research performance, and thereby different use of metrics.

Despite our comparative point of departure, along with the inclusion of countries with different use of metrics in national research funding, we found only limited country-specific differences. The lack of country-related differences indicates that notions of research quality are more connected to general field differences than to national context (Lamont 2009; Musselin 2010). Still, even if our sample of three countries in the northern corner of Europe represents variety in research funding and research evaluation, a larger sample of more diverse countries might have exposed greater differences in the use of metrics in peer assessments.

The findings have policy relevance for multiple aspects of research evaluation. Below, we discuss implications relating to (1) how research agendas and research activity adapt to research evaluations, (2) policies for restraining the (mis)use of metrics in research evaluation, and (3) the design and organization of research evaluations.

First, an emphasis on metrics may affect research activity and research agendas. Researchers, at least young and non-tenured ones, cannot disregard what gives acclaim in the academic career system and what is needed for attracting research funding. They need to take into consideration what kind of research will help them qualify for grants and positions (Müller and de Rijcke 2017). Notably, in our data, economists seem to put less emphasis (compared with the other groups) on expertise matching the needs of the research group/unit, and they seem to be more apt to use metrics (Supplementary Appendix Figure A1). This may imply that, rather than explicit decisions being made about the kind of researchers to employ (their topics and methods), the researchers who are able to do the kind of research that is most easily published in (top) economics journals are hired (Lee, Pham and Gu 2013). Hence, the ways in which researchers adapt to metrics come up as a key topic for studies in research evaluation and, more generally, for research policy.

Second, despite increasing concerns in the scientific communities about the use and misuse of research metrics (Wilsdon et al. 2015), the results herein indicate that researchers rely on the three types of metrics addressed in the survey: journal impact factors, number of publications and citation impact. Close to one-fifth of the respondents reported a high impact factor as a reason for something being the best research in their field. As discussed in the introduction, journal impact factors and journal rankings have been widely used, particularly in medicine and economics, for assessing scientific performance. With the launch of the DORA declaration in 2012, the problems with this practice have received more attention. 21 As a response, the policies and practices of many funding organizations, scientific societies, institutions and journal publishers have changed, according to Schmid (2017). Nevertheless, others report that journal impact factors are still used for purposes that conflict with the DORA declaration (Bonnell 2016). Notably, the DORA declaration has led to an increased focus on other ways to assess research, including the development of alternative paper-based metrics (Schmid 2017). Indicators of the number of publications and citation impact do not have the same problems as those associated with the journal impact factor. Nevertheless, it is well known that these indicators also have various limitations and shortcomings as performance measures, particularly when applied at micro levels (Wildgaard, Schneider and Larsen 2014), and our survey indicates extensive use of these indicators at micro levels when reviewing grant proposals or candidates for academic positions. Moreover, the field differences found in the survey point to a need for a better understanding of why and how metrics are used in different fields, as well as a need to consider field-adjusted policies for the use of metrics in research evaluation.

Finally, there are implications regarding the design and organization of research evaluation. Publication-based metrics seem to be perceived as good proxies for research quality and performance, at least for the majority of the researchers in the fields studied. This may be because they trust the review processes of the scholarly journals and publishers in their field, and metrics make sense as a proxy for quality. From this perspective, the editors and reviewers of the major journals end up high on the list of those controlling the gatekeeping criteria, not only for scientific publishing, but also for academic positions and research grants. At the end of the ‘review chain’, we will often find the criteria, review processes and publication policy of the major journals in the field. Hence, the researchers complying with the topics, perspectives/methods and formats of these journals can be expected to have the highest chances of success in competitions for grants and academic positions. Still, the above analysis indicates deviant views among reviewers on the use of metrics in research evaluation. So even if certain topics, perspectives or methods dominate a field, the outcome of review processes may vary by the panel members’ views on metrics. Consequently, when it comes to the ‘luck of the reviewer draw’ ( Cole, Cole and Simon 1981 ), not only the panel members’ scholarly profile and competences, but also their preferences for metrics may be decisive. This implies that in order to provide fair and well-grounded review processes, there is a need for insight into how panels use metrics in their assessments and to encourage explicit discussions about the use of metrics 22 . If the role of metrics is not openly discussed in review panels, nor understood by those organizing the reviews and acting upon them, we risk concealed review criteria.

When applying for advanced grants from the European Research Council (ERC), applicants have been asked to provide a ten-year track record including publications in leading journals and ‘indicating the number of citations (excluding self-citations) they have attracted (if applicable)’. https://erc.europa.eu/sites/default/files/document/file/ERC_Work_Programme_2015.pdf . We find this formulation in the ERC work programmes for 2008–2016. For 2017, 2018, and 2019 the wording is: ‘(properly referenced, field relevant bibliometric indicators may also be included)’. http://ec.europa.eu/research/participants/data/ref/h2020/wp/2018-2020/erc/h2020-wp19-erc_en.pdf .

http://www.ascb.org/dora/

Likewise, in Norway, the research council regularly conducts peer evaluations of disciplines and subjects as well as institutes and programmes, and bibliometric indicators are used as one source of information whenever relevant ( Sivertsen 2017 ).

Such perceptions may in turn be formed by/rooted in extensive use of e.g. journal rankings or citation measures in the field ( Espeland and Sauder 2007 : 16).

The use of metrics in peer review is also part of the more general story about how information technology impacts our evaluative practices ( Lamont 2012 : last section).

For example, the Research Council of Norway requires applicants to use a CV template that includes citation counts for applicants for regular researcher projects in all research fields. Up to 2018, the RCN template was named after its role model ‘ERC track record description’.

A majority of those who provided input to the ‘Metric Tide’ report were sceptical to the role of metrics in research management while a significant minority were more supportive of the use of metrics ( Wilsdon et al. 2015 : viii).

The minimum number of publications (5 for economics and 10 for cardiology and physics) was selected based on analyses of individual publication output during the 2011–2016 period. By applying these thresholds we aimed at including the more active researchers within the fields and leaving peripheral researchers out. A higher number was applied for cardiology and physics because of the higher publication frequencies (and co-authorship) in these fields.

The survey is part of a larger research project and was launched in five countries in 2017–2018. The present analysis is based on replies from higher education institutions in three of these countries (1,621 replies). The full survey included 2,587 replies, and is also comprised of replies from economics and physics in Denmark and the UK, as well as replies from researchers affiliated with independent research institutes. Replies from Denmark and the UK are excluded from the present analyses, as cardiology was not sampled in these countries. We checked for the impacts of excluding the UK and Danish samples by conducting the analyses on Economics and Physics in all five countries, and did not find any significantly deviant results. Moreover, replies from independent research institutes are excluded as they constitute a small sample (in total 111 replies in the three countries) and research settings which may differ substantially from those at higher education institutions.

Table 1 shows response rates by field as identified in the sampling process, whereas our analyses are based on field as identified by survey responses. Respondents who replied other fields of research, rather than ‘Cardiac/cardiovascular systems/diseases’, ‘Economic’ or ‘Physics’ are not included in the analysis. Hence, the analyses are based on a smaller sample (1,621 respondents) than that which prevails in Table 1 (1,942 respondents).

Consequently, the analyses are based on the full sample for the first question, and different subsamples for the two latter questions. We checked for impacts of sample variation by additional analyses of those included in both subsamples (451 respondents stated that they had reviewed both grant proposals and candidates for positions the last 12 months). These analyses did not give deviant results. Hence, differences between the two review settings appearing from our data are not due to different subsamples.

We have data on gender for 92% of the invited respondents. Of these, 39% of the female and 35% of male respondents replied. Of those without information on gender, we have replies from 4%.

We have excluded minor contributions such as editorials, meeting abstracts, and corrections. As letters usually do not represent full scientific contributions, they are weighted as 0.25 of an article; this is in accordance with principles often applied by the Centre for Science and Technology Studies (CWTS) of Leiden University (for further discussion, see van Leeuwen, van der Wurff and de Craen 2007 ).

Y1^a = X1 Country + X2 Field + X3 Bibliometrics^b + X4 Age + X5 Gender + X6 Position + e

Y2^a = X1 Country + X2 Field + X3 Bibliometrics^b + X4 Age + X5 Gender + X6 Position + X7 Call + X8 Type + e

Y3^a = X1 Country + X2 Field + X3 Bibliometrics^b + X4 Age + X5 Gender + X6 Position + X7 Vacancy + e

Dependent variables: why they considered something to be the best research in their field (Y1), what was important for their assessment of grant proposals (Y2), and what was important for their assessment of candidates for positions (Y3).

^a = type of assessment criteria

^b = type of bibliometrics

e = error term
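For readers who want to see how a specification of this form translates into code, the sketch below fits one binary logistic model with statsmodels. The data are simulated and the variable set is simplified (a single bibliometric indicator stands in for the full set), so this is an illustrative sketch of the model family under stated assumptions, not a reproduction of the study’s analysis.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500

# Simulated stand-in for the survey data; the names mirror the model above,
# but the values are random and purely illustrative.
df = pd.DataFrame({
    "highly_important": rng.integers(0, 2, n),                  # binary outcome (Y)
    "country": rng.choice(["NL", "NO", "SE"], n),
    "field": rng.choice(["cardiology", "economics", "physics"], n),
    "n_publications": rng.poisson(30, n),                       # bibliometric indicator
    "age": rng.integers(30, 70, n),
    "gender": rng.choice(["female", "male"], n),
    "position": rng.choice(["professor", "other"], n),
})

# Binary logistic regression: log-odds of rating a criterion 'highly important'.
model = smf.logit(
    "highly_important ~ C(country) + C(field) + np.log1p(n_publications)"
    " + age + C(gender) + C(position)",
    data=df,
).fit(disp=0)

print(model.summary())
print(model.predict(df.head(1)))   # predicted probability for one respondent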

As an extra control, the regression analyses were run with weights. Results were not altered.

We also conducted ordinal logistic regressions for the best suited models with assessment of grant proposals and candidates for academic positions as dependent variables. These models confirmed the results of the binary logistic models, with the exception that the respondents’ share of top percentile publications did not have a significant effect on their emphases on numbers of publications when assessing grant proposals. Likewise, the respondents’ fields of research did not have a significant effect on their emphases on citation impact when assessing grant proposals. Still, BIC-tests indicated that the binary logistic models were better suited to describe the data, and as these results are easier to communicate, we chose to keep them.

Of these, 17% replied high impact factor, 15% many citations ( Table 4 ).

In total, 17 respondents selected journal impact factor and citations as the only reasons, five selected only journal impact factor and four selected only citations.

Their MNCS did not affect their use of metrics, but the log-transformed MNCS variable showed increased use of metrics with increasing (log of) MNCS ( Supplementary Appendix Tables A4–A6 ).

As mentioned, the respondents’ number of publications was very skewed. The black line at the x-axis (the rug) shows that most respondents had between 0 and 100 publications.

Here it was declared that journal-based metrics, such as journal impact factors, should not be used as ‘a surrogate measure of the quality of individual research articles, to assess an individual scientist’s contributions, or in hiring, promotion, or funding decisions’ ( http://www.ascb.org/dora/ ). Currently, more than 1,800 organizations and 15,500 individuals have signed the declaration.

A study of grant panels at the UK National Institute for Health Research indicated that their panel members primarily use the metrics provided to them in their individual assessments in advance of the panel meeting, and less in the panel discussion (Gunashekar, Wooding and Guthrie 2017).

Supplementary data are available at Research Evaluation Journal online.

The research was funded by the Research Council of Norway, grant number 256223 (the R-QUEST centre). The multinational survey analysed in the paper was a joint effort of the R-QUEST team. Thed van Leeuwen took an important role in the sampling and in providing the bibliometric indicators. We are thankful to Thed van Leeuwen, Anders Hylmö, Thomas Franssen and the rest of the R-QUEST team for input and comments to the paper.

Aagaard K. ( 2015 ) ‘ How Incentives Trickle down: Local Use of a National Bibliometric Indicator System ’, Science and Public Policy , 42 : 725 – 37 .

Abbott A. , Cyranoski D. , Jones N. , Maher B. , Schiermeier Q. , Van Noorden R. ( 2010 ) ‘ Do Metrics Matter? ’, Nature , 465 : 860 – 2 .

Agresti A. ( 2013 ) Categorical Data Analysis , 3rd edn. New Jersey : John Wiley & Sons .

Aksnes D. W. , Langfeldt L. , Wouters P. ( 2019 ) ‘ Citations, Citation Indicators, and Research Quality: An Overview of Basic Concepts and Theories ’, SAGE Open , 9 : 1 – 17 .

Aksnes D. W. , Rip A. ( 2009 ) ‘ Researcherś Perceptions of Citations ’, Research Policy , 38 : 895 – 905 .

Aksnes D. W. , Sivertsen G. ( 2019 ) ‘ A Criteria-Based Assessment of the Coverage of Scopus and Web of Science ’, Journal of Data and Information Science , 4 : 1 – 21 .

Allen M. A. ( 2010 ) ‘ On the Current Obsession with Publication Statistics ’, ScienceAsia , 36 : 1 – 5 .

Ball R. ( 2017 ) An Introduction to Bibliometrics. New Developments and Trends . Cambridge, MA : Chandos Publishing .

Birnholtz J. ( 2008 ) ‘ When Authorship Isn’t Enough: Lessons from CERN on the Implications of Formal and Informal Credit Attribution Mechanisms in Collaborative Research ’, The Journal of Electronic Publishing , 11 .

Bollen J. , Rodriguez M. A. , Van De Sompel H. ( 2006 ) ‘ Journal Status ’, Scientometrics , 69 : 669 – 87 .

Bonnell A. G. ( 2016 ) ‘ Tide or Tsunami? The Impact of Metrics on Scholarly Research ’, Australian Universities’ Review , 58 : 54 – 61 .

Bornmann L. , Butz A. , Wohlrabe K. ( 2018 ) ‘ What Are the Top Five Journals in Economics? A New Meta-Ranking ’, Applied Economics , 50 : 659 – 75 .

Brown H. ( 2007 ) ‘ How Impact Factors Changed Medical Publishing - and Science ’, British Medical Journal , 334 : 561 – 4 .

Coats A. J. S. , Shewan L. G. ( 2015 ) ‘ Impact Factor: Vagaries, Inconsistencies and Illogicalities; Should It Be Abandoned? ’, International Journal of Cardiology , 201 : 454 – 6 .

Cole S. , Cole R. , Simon G. A. ( 1981 ) ‘ Chance and Consensus in Peer Review ’, Science , 214 : 881 – 6 .

Cronin B. ( 2001 ) ‘ Hyperauthorship: A Postmodern Perversion or Evidence of a Structural Shift in Scholarly Communication Practices? ’, Journal of the American Society for Information Science and Technology , 52 : 558 – 69 .

De Bellis N. ( 2009 ) Bibliometrics and Citation Analysis: From the Science Citation Index to Cybermetrics . Landham, MD : Scarecrow Press .

de Rijcke S. , Wouters P. F. , Rushforth A. D. , Franssen T. P. , Hammarfelt B. ( 2016 ) ‘ Evaluation Practices and Effects of Indicator Use – A Literature Review ’, Research Evaluation , 25 : 161 – 9 .

Espeland W. N. , Sauder M. ( 2007 ) ‘ Rankings and Reactivity: How Public Measures Recreate Social Words ’, American Journal of Sociology , 113 : 1 – 40 .

Gibson J. , Anderson D. L. , Tressler J. ( 2014 ) ‘ Which Journal Rankings Best Explain Academic Salaries? Evidence from the University of California ’, Economic Inquiry , 52 : 1322 – 40 .

Glänzel W. , Moed H. F. ( 2002 ) ‘ Journal Impact Measures in Bibliometric Research ’, Scientometrics , 53 : 171 – 93 .

Gunashekar S. , Wooding S. , Guthrie S. ( 2017 ) ‘ How Do NIHR Peer Review Panels Use Bibliometric Information to Support Their Decisions? ’, Scientometrics , 112 : 1813 – 35 .

Haddow G. , Hammarfelt B. ( 2019 ) ‘ Quality, Impact, and Quantification: Indicators and Metrics Use by Social Scientists ’, Journal of the Association for Information Science and Technology , 70 : 16 – 26 .

Hammarfelt B. ( 2018 ) ‘ Taking Comfort in Points: The Appeal of the Norwegian Model in Sweden ’, Journal of Data and Information Science , 3 : 85 – 95 .

Hammarfelt B. , Rushforth A. D. ( 2017 ) ‘ Indicators as Judgment Devices: An Empirical Study of Citizen Bibliometrics in Research Evaluation ’, Research Evaluation , 26 : 169 – 80 .

Heckman J. J. , Moktan S. ( 2018 ) Publishing and Promotion in Economics: The Tyranny of the Top Five. NBER Working Paper No. 25093 . Institute for New Economic Thinking .

Hicks D. , Wouters P. , Waltman L. , de Rijcke S. , Rafols I. ( 2015 ) ‘ The Leiden Manifesto for Research Metrics ’, Nature , 520 : 429 – 31 .

Hug S. , Aeschbach M. ( 2020 ) ‘ Criteria for Assessing Grant Applications: A Systematic Review ’, Palgrave Communications , 6 : 1 – 15 .

Hylmö A. ( 2018 ) ‘Disciplined Reasoning: Styles of Reasoning and the Mainstream-Heterodoxy Divide in Swedish Economics’, Doctoral thesis, Lund University, Department of Sociology.

Jonkers K. , Zacharewicz T. ( 2016 ) Research Performance Based Funding Systems: A Comparative Assessment . Luxembourg: Publications Office of the European Union .

Kalaitzidakis P. , Mamuneas T. P. , Stengos T. ( 2011 ) ‘ An Updated Ranking of Academic Journals in Economics ’, Canadian Journal of Economics-Revue Canadienne D Economique , 44 : 1525 – 38 .

Lamont M. ( 2009 ) How Professor Think: Inside the Curious World of Academic Judgment , Cambridge, MA : Harvard University Press .

Lamont M. ( 2012 ) ‘ Toward a Comparative Sociology of Valuation and Evaluation ’, Annual Review of Sociology , 38 : 201 – 21 .

Langfeldt L. , Nedeva M. , Sörlin S. , Thomas D. A. ( 2020 ) ‘ Co-Exiting Notions of Research Quality: A Framework to Study Context-Specific Understandings of Good Research ’, Minerva , 58 : 115 – 37 .

Langfeldt L. , Scordato L. ( 2016 ) Efficiency and Flexibility in Research Funding. A Comparative Study of Funding Instruments and Review Criteria. NIFU Report 9/2016 . Oslo : NIFU Nordic Institute for Studies Innovation, Research and Education .

Lee F. S. , Pham X. , Gu G. ( 2013 ) ‘ The UK Research Assessment Exercise and the Narrowing of UK Economics ’, Cambridge Journal of Economics , 37 : 693 – 717 .

Lewison G. , Cottrell R. , Dixon D. ( 1999 ) ‘ Bibliometric Indicators to Assist the Peer Review Process in Grant Decisions ’, Research Evaluation , 8 : 47 – 52 .

Lin F. ( 2008 ) ‘ Solving Multicollinearity in the Process of Fitting Regression Model Using the Nested Estimate Procedure ’, Quality and Quantity , 42 : 417 – 26 .

Loomba R. S. , Anderson R. H. ( 2018 ) ‘ Are we Allowing Impact Factor to Have Too Much Impact: The Need to Reassess the Process of Academic Advancement in Pediatric Cardiology? ’, Congenital Heart Disease , 13 : 163 – 6 .

Martin B. R. ( 1996 ) ‘ The Use of Multiple Indicators in the Assessment of Basic Research ’, Scientometrics , 36 : 343 – 62 .

Moed H. F. (2005) Citation Analysis in Research Evaluation. Dordrecht: Springer.

Müller R. , de Rijcke S. ( 2017 ) ‘ Thinking with Indicators. Exploring the Epistemic Impacts of Academic Performance Indicators in the Life Sciences ’, Research Evaluation , 26 : 157 – 68 .

Musselin C. ( 2010 ) The Market for Academics , New York : Routledge .

Piro F. N. , Aksnes D. W. , Rorstad K. ( 2013 ) ‘ A Macro Analysis of Productivity Differences Across Fields: Challenges in the Measurement of Scientific Publishing ’, Journal of the American Society for Information Science and Technology , 64 : 307 – 20 .

Rushforth A. , de Rijcke S. ( 2015 ) ‘ Accounting for Impact? The Journal Impact Factor and the Making of Biomedical Research in the Netherlands ’, Minerva , 53 : 117 – 39 .

Schmid S. L. ( 2017 ) ‘ Five Years post-DORA: Promoting Best Practices for Research Assessment ’, Molecular Biology of the Cell , 28 : 2941 – 4 .

Sivertsen G. ( 2017 ) ‘ Unique, but Still Best Practice? the Research Excellence Framework (REF) from an International Perspective ’, Palgrave Communications , 3 : 17078 .

Söderlind J. , Geschwind L. ( 2020 ) ‘ Disciplinary Differences in Academics’ Perceptions of Performance Measurement at Nordic Universities ’, Higher Education Governance & Policy , 1 : 18 – 31 .

Sousa C. A. A. , Hendriks P. H. J. ( 2007 ) ‘ That Obscure Object of Desire: The Management of Academic Knowledge ’, Minerva , 45 : 259 – 74 .

Stephan P. , Veugelers R. , Wang J. ( 2017 ) ‘ Blinkered by Bibliometrics ’, Nature , 544 : 411 – 2 .

van der Wall E. E. ( 2012 ) ‘ Journal Impact Factor: Holy Grail? ’, Netherlands Heart Journal , 20 : 385 – 6 .

van Leeuwen T. N. , van der Wurff L. J. , de Craen A. J. M. ( 2007 ) ‘ Classification of ‘Research Letters’ in General Medical Journals and Its Consequences in Bibliometric Research Evaluation Processes ’, Research Evaluation , 16 : 59 – 63 .

Weingart P. ( 2005 ) ‘ Impact of Bibliometrics upon the Science System: Inadvertent Consequences? ’, Scientometrics , 62 : 117 – 31 .

Whitley R. 1984 . The Intellectual and Social Organization of the Sciences . Oxford : Clarendon Press .

Wildgaard L. , Schneider J. W. , Larsen B. ( 2014 ) ‘ A Review of the Characteristics of 108 Author-Level Bibliometric Indicators ’, Scientometrics , 101 : 125 – 58 .

Wilsdon J., et al.  ( 2015 ) The Metric Tide: Report of the Independent Review of the Role of Metrics in Research Assessment and Management . HEFCE . DOI: 10.13140/RG.2.1.4929.1363, https://responsiblemetrics.org/the-metric-tide/ .

Wouters P. ( 1999 ) ‘ Beyond the Holy Grail: From Citation Theory to Indicator Theories ’, Scientometrics , 44 : 561 – 80 .

Understanding research metrics

Introduction

What are research metrics?

Research metrics are quantitative tools used to help assess the quality and impact of research outputs. Metrics are available for use at the journal, article, and even researcher level. However, any one metric only tells a part of the story and each metric also has its limitations. Therefore, a single metric should never be considered in isolation.

For a long time, the only tool for assessing journal performance was the Impact Factor – more on that in a moment. Now there are a range of different research metrics available, from the Impact Factor to altmetrics, h -index, and more.

But what do they all mean? How is each metric calculated? Which research metrics are the most relevant to your journal? And how can you use these tools to monitor your journal’s performance?

Keep reading for a more in-depth look at the range of different metrics available.

Supporting the responsible use of research metrics

We firmly believe that researchers should be assessed on the quality and broad impact of their work. While research metrics can help support this process, they should not be used as a quick substitute for proper review. The quality of an individual research article should always be assessed on its own merits rather than on the metrics of the journal in which it was published. Find out more .

Using metrics to promote your journal

Journal metrics can be a useful tool for researchers when they’re choosing where to submit their research. You may therefore be asked by prospective authors about your journal’s metrics. You might also want to highlight certain metrics when you’re talking about the journal, to illustrate its reach or impact.

If you do, we advise that you always quote at least two different metrics, to give researchers a richer view of journal performance. Please also accompany this quantitative data with qualitative information that will help researchers assess the suitability of the journal for their research, such as its aims & scope.

Our researcher guide to understanding journal metrics explains in more detail how authors can use metrics as part of the process of choosing a journal.

How to use metrics to monitor your journal

Metrics can help you assess your journal’s standing in the community, raise its profile, and support growth in high-quality submissions. But only if you know how to interpret and apply them.

Journal metrics on Taylor & Francis Online

Most journals on Taylor & Francis Online display a range of metrics, to help give a rounded view of a journal’s performance, reach, and impact. These metrics include usage, citation metrics, speed (review and production turnaround times), and acceptance rate.

Read the guide to Taylor & Francis Online journal metrics for more details about how they’re calculated and the advice given to researchers about their use.

How to identify the right metrics for your journal

To monitor your journal’s performance, first you need to identify which research metrics are the most appropriate. To do this, think about your journal and its objectives.

It may help to structure this thinking around some key questions:

Who is your target audience?

For journals with a practitioner focus, academic citations may be less valuable than mentions in policy documents (as reported by Altmetric). If your journal is for a purely academic audience, traditional citation metrics like Impact Factor are more relevant. If your journal has a regional focus, then geographical usage might be important to you.

What are you trying to achieve?

If your objective is to publish more high-quality, high-impact authors, consider analyzing the h -indices of authors in recent volumes to assess whether you’re achieving this. If your aim is to raise your journal’s profile within the wider community, it makes sense to consider altmetrics in your analysis. Perhaps your goal is to generate more citations from high-profile journals within your field – so looking at Eigenfactor rather than Impact Factor would be helpful.

What subject area are you working in?

The relevancy of different research metrics varies hugely between disciplines. Is Impact Factor appropriate, or would the 5-year Impact Factor be more representative of citation patterns in your field? Which metrics are your competitors using? It might be more useful to think about your journal’s ranking within its subject area , rather than considering specific metrics in isolation.

What business model does your journal use?

For journals following a traditional subscription model, usage can be particularly crucial. It’s a key consideration for librarians when it comes to renewals.

How to interpret research metrics

It’s tempting to reach for simple numbers and extrapolate meaning, but be careful about reading too closely into metrics. The best strategy is to see metrics as generating questions , rather than answers .

Metrics simply tell us “what”. What is the number of views of the work? What is the number of downloads from the journal? What is the number of citations?

To interpret your metrics effectively, think less about “what” and use your metrics as a starting point to delve deeper into “who”, “how”, and “why”:

  • Who is reading the journal? Where are they based, what is their role, how are they accessing it?
  • Who are the key authors in your subject area? Where are they publishing now?
  • How are users responding to your content? Are they citing it in journals, mentioning it in policy documents, talking about it on Twitter?
  • How is your subject area developing? What are the hot topics, emerging fields, and key conversations?
  • Why was a specific article successful? What made the media pick up on it, what prompted citations from other journals, who was talking about it?

It’s easy to damage the overall picture of your research metrics by focusing too much on one specific metric. For example, if you wanted to boost your Impact Factor by publishing more highly-cited articles, you might be disregarding low-cited articles used extensively by your readers. Therefore, if you chose to publish only highly-cited content for a higher Impact Factor, you could lose the value of your journal for a particular segment of your readership.

Generally, the content most used by practitioners, educators, or students (who don’t traditionally publish) is not going to improve your Impact Factor, but will probably add value in other ways to your community.

Fundamentally, it’s important to consider a range of research metrics when monitoring your journal’s performance. It can be tempting to concentrate on one metric, like the Impact Factor, but citations are not the be-all and end-all.

Think about each research metric as a single tile in a mosaic: you need to piece them all together to see the bigger picture of journal performance.

Journal metrics: citations

Impact Factor

What is the Impact Factor?

The Impact Factor is probably the most well-known metric for assessing journal performance. Designed to help librarians with collection management in the 1960s, it has since become a common proxy for journal quality.

The Impact Factor is a simple research metric: it’s the average number of citations received by articles in a journal within a two-year window.

The Web of Science Journal Citation Reports (JCR) publishes the official results annually, based on this calculation:

Number of citations received in one year to content published in Journal X during the two previous years, divided by the total number of articles and reviews published in Journal X within the previous two years.

For example, the 2022 Impact Factors (released in 2023) used the following calculation:

Number of citations received in 2022 to content published in Journal X during 2020 and 2021 , divided by the total number of articles and reviews published in Journal X in 2020 and 2021 .
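As a concrete illustration, the calculation is simply a ratio of citations to citable items. The sketch below uses invented figures for a hypothetical journal.

def citation_average(citations_in_census_year, citable_items_in_window):
    """Citations-per-item ratio used by the Impact Factor family of metrics.

    citations_in_census_year: citations received in the census year to content
        published in the target window (two years for the classic Impact Factor).
    citable_items_in_window: articles and reviews published in that window.
    """
    return citations_in_census_year / citable_items_in_window

# Hypothetical 2022 Impact Factor: 630 citations received in 2022 to 2020-2021
# content, 420 articles and reviews published in 2020-2021.
print(round(citation_average(630, 420), 3))   # 1.5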

How can I get an Impact Factor for my journal?

Only journals selected to feature in the Science Citation Index Expanded (SCIE) and Social Sciences Citation Index (SSCI) receive an official Impact Factor.

To be eligible for coverage in these Web of Science indices, journals must meet a wide range of criteria. You can find out more about the journal selection process on the Clarivate website.

For many journals, the first step to receiving an Impact Factor is to feature in the Emerging Sources Citation Index (ESCI). For more information on the ESCI, read our introduction here .

What are the disadvantages of the Impact Factor?

  • The Impact Factor is an arithmetic mean and doesn’t adjust for the distribution of citations .

This means that one highly-cited article can have a major positive effect on the Impact Factor, skewing the result for the two years. Most journals have a highly-skewed citation distribution, with a handful of highly-cited articles and many low- or zero-cited articles (see the sketch after this list).

  • The JCR doesn’t distinguish between citations made to articles, reviews, or editorials.

So that the Impact Factor doesn’t penalize journals that publish rarely-cited content like book reviews, editorials, or news items, these content types are not counted in the denominator of the calculation (the total number of publications within the two-year period). However, citations to this kind of content are still counted.

This creates two main problems. Firstly, the classification of content is not always clear-cut, so content such as extended abstracts or author commentaries falls into an unpredictable gray area. Secondly, if such articles are cited, they increase the Impact Factor without any offset in the denominator of the equation.

  • The Impact Factor only considers the number of citations, not the nature or quality.

An article may be highly cited for many reasons, both positive and negative. A high Impact Factor only shows that the research in a given journal is being cited. It doesn’t indicate the context or the quality of the publication citing the research.

  • You can’t compare Impact Factors like-for-like across different subject areas.

Different subject areas have different citation patterns, which reflects in their Impact Factors. Research in subject areas with typically higher Impact Factors (cell biology or general medicine, for example) is not better or worse than research in subject areas with typically lower Impact Factors (such as mathematics or history).

The difference in Impact Factor is simply a reflection of differing citation patterns, database coverage, and dominance of journals between the disciplines. Some subjects generally have longer reference lists and publish more articles, so there’s a larger pool of citations.

  • Impact Factors can show significant variation year-on-year, especially in smaller journals.

Because Impact Factors are average values, they vary year-on-year due to random fluctuations. This change is related to the journal size (the number of articles published per year): the smaller the journal, the larger the expected fluctuation.
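The effect of a skewed citation distribution on an average is easy to demonstrate. In the hypothetical example below, a single highly-cited article pulls the mean far above what a typical article in the journal achieves.

import statistics

# Hypothetical citation counts for ten articles published in a two-year window.
citations = [250, 12, 8, 5, 3, 2, 1, 1, 0, 0]

print(statistics.mean(citations))     # 28.2 -> what an Impact-Factor-style average reports
print(statistics.median(citations))   # 2.5  -> closer to the 'typical' article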

What is the 5-year Impact Factor?

The 5-year Impact Factor is a modified version of the Impact Factor, using five years’ data rather than two. A journal must be covered by the JCR for five years or from Volume 1 before receiving a 5-year Impact Factor.

The 5-year Impact Factor calculation is:

Number of citations in one year to content published in Journal X during the previous five years, divided by the total number of articles and reviews published in Journal X within the previous five years.

The 5-year Impact Factor is more useful for subject areas where it takes longer for work to be cited, or where research has more longevity. It offers more stability for smaller titles as there are a larger number of articles and citations included in the calculation. However, it still suffers from many of the same issues as the traditional Impact Factor.
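Under the same hypothetical setup as the earlier Impact Factor sketch, only the citation window changes; the figures below are again invented.

# Hypothetical 5-year Impact Factor: 1,900 citations in the census year to content
# published in the previous five years, 1,150 articles and reviews in those years.
print(round(1900 / 1150, 3))   # 1.652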

Eigenfactor

In 2007, the Web of Science JCR grew to include Eigenfactors and Article Influence Scores (see below). Unlike the Impact Factor, these metrics don’t follow a simple calculation. Instead, they borrow their methodology from network theory.

What is an Eigenfactor?

The Eigenfactor measures the influence of a journal based on whether it’s cited within other reputable journals over five years. A citation from a highly-cited journal is worth more than from a journal with few citations.

To adjust for subject areas, the citations are also weighted by the length of the reference list that they’re from. The Eigenfactor is calculated using an algorithm to rank the influence of journals according to the citations they receive. A five-year window is used, and journal self-citations are not included.

This score doesn’t take journal size into account. That means larger journals tend to have larger Eigenfactors as they receive more citations overall. Eigenfactors also tend to be very small numbers as scores are scaled so that the sum of all journal Eigenfactors in the JCR adds up to 100.

Very roughly, the Eigenfactor calculation is:

Number of citations in one year to content published in Journal X in the previous five years (weighted), divided by the total number of articles published in Journal X within the previous five years.
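The network idea behind the Eigenfactor can be sketched with a toy citation matrix and power iteration: each journal’s score is fed by the journals citing it, weighted by those journals’ own scores. This is a deliberately simplified sketch with made-up numbers; the official algorithm additionally weights by article output, handles journals that cite nothing in the window, and applies a damping factor.

import numpy as np

# Hypothetical 5-year citation matrix: C[i, j] = citations from journal j to journal i.
# The diagonal is zero because journal self-citations are excluded.
C = np.array([
    [0, 40, 10],
    [30, 0, 5],
    [20, 15, 0],
], dtype=float)

P = C / C.sum(axis=0)          # column-stochastic: each journal's outgoing citations sum to 1

pi = np.full(3, 1 / 3)         # start from equal influence
for _ in range(100):           # power iteration towards the leading eigenvector
    pi = P @ pi

eigenfactor_like = 100 * pi / pi.sum()   # scale so that the scores sum to 100, as in the JCR
print(np.round(eigenfactor_like, 1))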

Article Influence Score

What is an Article Influence Score?

The Article Influence Score is a measure of the average influence of a journal’s articles in the first five years after publication. A score greater than 1.00 shows above-average levels of influence.

The Article Influence Score calculation is:

(0.01 x Eigenfactor of Journal X ) divided by (number of articles published in Journal X over five years, divided by the number of articles published in all journals over five years).

These are then normalized so that the average journal in the JCR has a score of 1.

Like 5-year Impact Factors, journals don’t receive an Article Influence Score unless they have been covered by the JCR for at least five years, or from Volume 1.
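Numerically, the score relates a journal’s Eigenfactor to its share of all indexed articles. The sketch below uses invented figures and omits the final normalization that sets the JCR average to 1.

def article_influence_score(eigenfactor, journal_articles_5y, all_articles_5y):
    """(0.01 x Eigenfactor) divided by the journal's share of all articles over five years."""
    article_share = journal_articles_5y / all_articles_5y
    return (0.01 * eigenfactor) / article_share

# Hypothetical journal: Eigenfactor 0.02, 800 articles out of 10,000,000 indexed articles.
print(round(article_influence_score(0.02, 800, 10_000_000), 2))   # 2.5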

What is CiteScore?

CiteScore is the ratio of citations to research published. It’s currently available for journals and book series which are indexed in Scopus.

The CiteScore calculation only considers content that is typically peer reviewed; such as articles, reviews, conference papers, book chapters, and data papers.

The CiteScore calculation is:

Number of all citations recorded in Scopus in one year to content published in Journal X in the last four years, divided by the total number of items published in Journal X in the previous four years.
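Following the calculation above, the sketch below is a plain ratio with invented figures for a hypothetical journal.

def citescore(citations_to_last_four_years, items_published_last_four_years):
    """Citations-per-item ratio over a four-year window, as described above."""
    return citations_to_last_four_years / items_published_last_four_years

# Hypothetical journal: 2,100 citations to content from the last four years,
# 700 peer-reviewed items published in those years.
print(round(citescore(2100, 700), 1))   # 3.0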

What are the differences between CiteScore and Impact Factor?

  • CiteScore is based on the Scopus database rather than Web of Science. This means the number of citations and journal coverage in certain subject areas is notably higher.
  • CiteScore uses a four-year citation window, whereas Impact Factor uses a two-year citation window.
  • CiteScore covers all subject areas, whereas the Impact Factor is only available for journals indexed in the SCIE and SSCI.

CiteScore suffers from some of the same problems as the Impact Factor; namely, that it isn’t comparable across disciplines and that it is a mean calculated from a skewed distribution.

SNIP - Source Normalized Impact per Paper

SNIP is a journal-level metric which attempts to correct for subject-specific characteristics, simplifying cross-discipline comparisons between journals. It measures citations received against citations expected for the subject field, using Scopus data. SNIP is published twice a year and looks at a three-year period.

The SNIP calculation is:

Journal citation count per paper, divided by citation potential in the field.

SNIP normalizes its sources to allow for cross-disciplinary comparison. In practice, this means that a citation from a publication with a long reference list has a lower value.

SNIP only considers citations to specific content types (articles, reviews, and conference papers), and does not count citations from publications that Scopus classifies as “non-citing sources”. These include trade journals, and many Arts & Humanities titles.
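As a toy illustration with invented figures, a SNIP above 1 means a journal is cited more than would be expected for its field.

def snip(citations_per_paper, citation_potential_in_field):
    """SNIP: a journal's citations per paper relative to the citation potential of its field."""
    return citations_per_paper / citation_potential_in_field

# Hypothetical journal: 4.2 citations per paper in a field with a citation potential of 3.0.
print(round(snip(4.2, 3.0), 2))   # 1.4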

SJR - Scimago Journal Rank

The SJR aims to capture the effect of subject field, quality, and reputation of a journal on citations. It calculates the prestige of a journal by considering the value of the sources that cite it, rather than counting all citations equally.

Each citation received by a journal is assigned a weight based on the SJR of the citing journal. So, a citation from a journal with a high SJR value is worth more than a citation from a journal with a low SJR value.

The SJR calculation is:

Average number of (weighted) citations in a given year to Journal X , divided by the number of articles published in Journal X in the previous three years.

As with SNIP and CiteScore, SJR is calculated using Scopus data.
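The weighting idea can be sketched as follows with invented figures. The real SJR repeats this weighting iteratively over the whole Scopus citation network (a PageRank-style computation) until the scores converge; the sketch shows a single step only.

# Hypothetical citing journals: (SJR weight of the citing journal, citations given to Journal X).
citing_journals = {
    "Journal A": (2.5, 40),
    "Journal B": (0.8, 120),
    "Journal C": (1.2, 15),
}

weighted_citations = sum(weight * count for weight, count in citing_journals.values())
articles_previous_three_years = 150      # hypothetical output of Journal X

print(round(weighted_citations / articles_previous_three_years, 2))   # 1.43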

Journal metrics: usage, speed, and acceptance rate

As we’ve explained above, citations aren’t the only way to monitor the performance of your journal. The following metrics, which are available for many journals on Taylor & Francis Online, will help give you and your readers a more rounded view.

What does it measure? A journal’s usage is the number of times articles are viewed/downloaded. Gives a quick impression of the journal’s size and reach.

How is it calculated? The figure shown on Taylor & Francis Online is the total number of times articles in the journal were viewed by users in the previous calendar year, rounded to the nearest thousand. This includes all of the different formats available on Taylor & Francis Online, including HTML, PDF, and EPUB. Usage data for each journal is updated annually in February.

How can I access my journal’s usage data?

You can easily access article-level usage data via the “Metrics” tab on Taylor & Francis Online . We also provide more detailed annual usage reports to our journal editors. Find out more about the COUNTER compliant data we report with our brief introduction to Project COUNTER .

There are other online platforms which provide journal access, including aggregator services such as JSTOR and EBSCO. Of course, some readers still prefer print over online, so it’s important you consider these sources when building a broader picture of usage.

We’ve set out the limitations of this metric in our guide for researchers.

Speed metrics

The following speed metrics, which are available for many journals on Taylor & Francis Online, indicate how long different stages of the publishing process might take. The speed metrics published on Taylor & Francis Online are for the previous full calendar year and are updated in February.

All of these metrics have limitations, which authors should consider when using them to choose a journal. These limitations are set out in our researcher guide to understanding journal metrics .

Speed from submission to first decision

What does it measure? This metric indicates how long after submission it may take before you receive a decision about your article.

How is it calculated? This is the median number of days from submission to first decision for all manuscripts which received a first decision in the previous calendar year.

Speed from submission to first post-review decision

What does it measure? This metric only considers those articles that are sent out for peer review by experts in the field. It indicates how long it may take before you receive a decision on your peer reviewed article.

How is it calculated? This is the median number of days from submission to decision for all peer reviewed articles which received a first decision in the previous calendar year.

Speed from acceptance to online publication

What does it measure? This metric tells you about the journal’s production speed, indicating how long you are likely to wait to see your article published online once the journal’s editor has accepted it.

How is it calculated? On Taylor & Francis Online this figure is the median number of days from acceptance to online publication of the Version of Record, for articles published in the previous calendar year.

Acceptance rate

A journal’s acceptance rate is an indication of the number of submissions it receives for every article that’s eventually published.

How is it calculated? This figure represents the articles accepted by the journal for publication in the previous calendar year as a percentage of all papers receiving a final decision in that calendar year. It includes all article types submitted to the journal, including those that are rejected without being peer reviewed (desk rejects).

The acceptance rates published on Taylor & Francis Online are for the previous full calendar year and are updated in February.

Article metrics

Altmetric Attention Score

The Altmetric Attention Score  tracks a wide range of online sources to capture the conversations happening around academic research.

How is the Altmetric Attention Score calculated?

Altmetric monitors each online mention of a piece of research and weights the mentions based on volume, sources, and authors. A mention in an international newspaper contributes to a higher score than a tweet about the research, for example.

Example of Altmetric donut

The Altmetric Attention Score is presented within a colorful donut. Each color indicates a different source of online attention (ranging from traditional media outlets to social media, blogs, online reference managers, academic forums, patents, policy documents, the Open Syllabus Project, and more). A strong Altmetric Score will feature both a high number in the center, and a wide range of colors in the donut.

Discover the different ways you can make Altmetric data work for you by reading this introduction from Altmetric’s Head of Marketing, Cat Chimes.

What are the advantages of the Altmetric Attention Score?

  • Receive instant, trackable feedback

Altmetric starts tracking online mentions of academic research from the moment it’s published. That means there’s no need to wait for citations to come in to get feedback on a piece of research.

  • Get a holistic view of attention, impact and influence

The data Altmetric gathers provides a more all-encompassing, nuanced view of the attention, impact, and influence of a piece of research than traditional citation-based metrics. Digging deeper into the Altmetric Attention Score can reveal not only the nature and volume of online mentions, but also who’s talking about the research, where in the world these conversations are happening, and which online platforms they’re using.

What are the disadvantages of the Altmetric Attention Score?

  • Biases in the data which Altmetric collects

There’s a tendency to focus on English-speaking sources (there’s some great thinking around this by Juan Pablo Alperin ). There’s also a bias towards Science, Technology and Medicine (STM) topics, although that’s partly a reflection of the activity happening online around research.

  • Limited to tracking online attention

The Altmetric Attention Score was built to track digital conversations. This means that attention from sources with little direct online presence (like a concert, or a sculpture) are not included. Even for online conversations, Altmetric can only track mentions when the source either references the article’s Digital Object Identifier (DOI) or uses two pieces of information (i.e. article title and author name).

Author metrics

What is the h-index?

The h-index is an author-level research metric, first introduced by Hirsch in 2005. It attempts to measure both the productivity of a researcher and the citation impact of their publications.

The basic h-index calculation is:

The largest number h such that the researcher has published h articles that have each been cited at least h times.

For example, if you’ve published at least 10 papers that have each been cited 10 times or more, your h-index is at least 10.
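
As a minimal sketch (the citation counts below are made up for illustration), the h-index can be computed from a list of per-paper citation counts:

```python
# A minimal sketch of the h-index: the largest h such that the author has
# h papers with at least h citations each. The citation counts are hypothetical.
def h_index(citations):
    citations = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(citations, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

print(h_index([25, 12, 10, 10, 8, 5, 4, 3, 1, 0]))  # 5
```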

What are the advantages of the h-index?

  • Results aren’t skewed

The main advantage of the h-index is that it isn’t skewed upwards by a small number of highly-cited papers. It also isn’t skewed downwards by a long tail of poorly-cited work.

The h-index rewards researchers whose work is consistently well cited. That said, a handful of well-placed citations can have a major effect.

What are the disadvantages of the h-index?

  • Results can be inconsistent

Although the basic calculation of the h-index is clearly defined, it can still be calculated using different databases or time-frames, giving different results. Normally, the larger the database, the higher the h-index calculated from it. Therefore, an h-index taken from Google Scholar will nearly always be higher than one from Web of Science, Scopus, or PubMed. (It’s worth noting here that, as Google Scholar is an uncurated dataset, it may contain duplicate records of the same article.)

  • Results can be skewed by self-citations

Although some self-citation is legitimate, authors can cite their own work to improve their h-index.

  • Results aren’t comparable across disciplines

The h-index varies widely by subject, so a mediocre h-index in the life sciences will still be higher than a very good h-index in the social sciences. We can’t benchmark h-indices because they are rarely calculated consistently for large populations of researchers using the same method.

  • Results can’t be compared between researchers

The h-index of a researcher with a long publication history, including review articles, cannot be fairly compared with that of a post-doctoral researcher in the same field, nor with that of a senior researcher from another field. Researchers who have published several review articles will normally have much higher citation counts than other researchers.


How to Identify Right Performance Evaluation Metrics In Machine Learning Based Dissertation

Introduction

Every machine learning pipeline includes performance measurements. They tell you whether you are making progress and give you a number to track. A metric is required for every machine learning model, whether it is a simple linear regression or a state-of-the-art (SOTA) method like BERT.

Machine learning tasks, and their performance measurements, can broadly be divided into regression and classification. For both problem types there are many metrics to choose from, but we’ll go through the most common ones and the information they give about model performance. It’s critical to understand how your model interprets your data!

Loss functions are not the same as metrics. Loss functions measure a model’s performance during training: they are usually differentiable in the model’s parameters and are used to train a machine learning model (using some form of optimization, such as gradient descent). Metrics are used to track and quantify a model’s performance (during training and testing) and do not have to be differentiable. If a performance measure is differentiable, it may also be used as a loss function (possibly with additional regularization), as is the case for MSE.



Regression metrics

The output of regression models is continuous. As a result, we need a measure based on computing some form of distance between predicted and actual values. To evaluate regression models, we’ll go through the following metrics in depth:

  • Mean Absolute Error (MAE)

The Mean Absolute Error (MAE) is the average of the absolute differences between the ground-truth and predicted values. There are a few essential points to consider about MAE (a minimal sketch follows the list below):

  • Because it does not exaggerate errors, it is more robust to outliers than MSE.
  • It tells us how far the predictions deviate from the actual values. However, because MAE uses the absolute value of the residual, it does not tell us the direction of the error, i.e. whether we are under- or over-predicting the data.
  • The error is easy to interpret, since it is on the same scale as the target variable.
  • In contrast to MSE, which is differentiable, MAE is not differentiable at zero.
  • This measure, like MSE, is straightforward to apply.
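
As a minimal sketch (the values are made up for illustration), MAE can be computed directly with NumPy; scikit-learn's mean_absolute_error returns the same result:

```python
# A minimal sketch of MAE on hypothetical values.
import numpy as np

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(np.abs(y_true - y_pred))

print(mae([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # 0.666...
```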


  • Mean Squared Error (MSE):

The Mean Squared Error (MSE) is arguably the most commonly used regression metric. It simply calculates the average of the squared differences between the target value and the regression model’s predicted value. A few essential features of MSE (a sketch covering MSE and RMSE follows the RMSE list below):

  • Because it is differentiable, it can be optimized more easily.
  • It penalizes even minor errors by squaring them, which can overestimate how poorly the model performs.
  • The squaring factor (scale) must be considered when interpreting errors.
  • Due to the squaring effect, it is more sensitive to outliers than other measures.
  • Root Mean Squared Error (RMSE)

The Root Mean Squared Error (RMSE) is the square root of the average squared difference between the target value and the value predicted by the regression model. It corrects a few of MSE’s flaws.

A few essential points of RMSE:

  • It retains MSE’s differentiability.
  • Taking the square root tempers the exaggeration of errors introduced by MSE’s squaring.
  • Because the scale is now the same as that of the target variable, error interpretation is simple.
  • Because the errors are back on the original scale, outliers distort the metric less severely than with MSE.
  • Its application is similar to MSE.
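
A minimal sketch of MSE and RMSE on the same hypothetical values, assuming NumPy arrays of equal length (scikit-learn's mean_squared_error is the library equivalent of the first function):

```python
# A minimal sketch of MSE and RMSE on hypothetical values.
import numpy as np

def mse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

y_true, y_pred = [3.0, 5.0, 2.5], [2.5, 5.0, 4.0]
print(mse(y_true, y_pred))   # 0.833...
print(rmse(y_true, y_pred))  # 0.912...
```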


  • R² Coefficient of determination

The R² coefficient of determination is a post-hoc measure, meaning it is determined after other metrics have been calculated. The purpose of computing this coefficient is to answer the question: “How much (what percentage) of the total variance in Y (the target) is explained by the variation in X (the regression line)?” It is computed from sums of squared errors.

A few thoughts on the R² results (a sketch follows the list below):

  • If the regression line’s sum of squared errors is small, R² will be close to 1 (ideal), indicating that the regression captured nearly all of the variance in the target variable.
  • In contrast, if the regression line’s sum of squared errors is high, R² will be close to 0, indicating that the regression failed to capture much of the variance in the target variable.
  • The range of R² appears to be (0, 1), but it is really (-∞, 1], since the ratio of the squared error of the regression line to the squared error of the mean can exceed 1 if the squared error of the regression line is sufficiently high (greater than the squared error of the mean).
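
A minimal sketch of R² computed from the sums of squared errors described above (scikit-learn's r2_score is the library equivalent); the values are hypothetical:

```python
# A minimal sketch of the R² coefficient of determination.
import numpy as np

def r2(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)           # squared error of the regression line
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # squared error of the mean
    return 1.0 - ss_res / ss_tot

print(r2([3.0, 5.0, 2.5], [2.5, 5.0, 4.0]))  # ~0.286
```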


  • Adjusted R²

The R² measure has a flaw: it can deceive the researcher into assuming that the model is improving when the score rises while, in fact, no learning is taking place. This can occur when a model overfits the data; in such an instance, the explained variance may approach 100% even though nothing useful has been learned. Adjusted R² corrects for this by adjusting R² for the number of independent variables. Adjusted R² is usually lower than R², since it accounts for the growing number of predictors and only indicates improvement when there really is one. A sketch is given below.
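
A minimal sketch of the standard adjustment, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of independent variables; the numbers used here are hypothetical:

```python
# A minimal sketch of adjusted R² using the standard correction.
def adjusted_r2(r2_value, n_samples, n_predictors):
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical values: R² = 0.90, 50 samples, 5 predictors.
print(adjusted_r2(0.90, 50, 5))  # ~0.889
```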

Classification metrics

Classification problems are among the most widely studied in machine learning, and almost every production and industrial context has use cases for them: speech recognition, facial recognition, text categorization, and so on.

Because classification algorithms produce discrete output, we need measures that compare discrete classes in some way. Classification metrics assess a model’s performance and tell you whether the classification is good or bad, but each one does so in a different way.

So, in order to assess Classification models, we’ll go through the following measures in depth:

  • Accuracy

The easiest measure to use and apply is classification accuracy, which is defined as the number of correct predictions divided by the total number of predictions, multiplied by 100. We can compute it by manually looping over the ground-truth and predicted values, or we can use the scikit-learn module.
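
A minimal sketch with made-up labels; sklearn.metrics.accuracy_score returns the same value as a fraction rather than a percentage:

```python
# A minimal sketch of classification accuracy on hypothetical labels.
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))        # 0.75
print(100 * accuracy_score(y_true, y_pred))  # 75.0, i.e. accuracy as a percentage
```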

  • Confusion Matrix (not a metric but fundamental to others)

The confusion matrix is a tabular representation of the ground-truth labels versus the model’s predictions. Each row of the confusion matrix represents the instances in a predicted class, whereas each column represents the instances in an actual class (some libraries, such as scikit-learn, use the opposite convention). The confusion matrix isn’t strictly a performance metric, but it serves as a foundation on which other metrics are built. To interpret it in hypothesis-testing terms, we also need to establish what the null hypothesis is assumed to be.
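
A minimal sketch of a binary confusion matrix with scikit-learn, reusing the made-up labels from the accuracy example (label 1 is the positive class, e.g. “cancer”):

```python
# A minimal sketch of a binary confusion matrix on hypothetical labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# With labels=[0, 1], scikit-learn puts the actual classes in the rows and the
# predicted classes in the columns:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred, labels=[0, 1]))  # [[3 1], [1 3]]
```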

  • Precision and Recall

Precision focuses on Type-I errors (false positives, FP). We make a Type-I error when we reject a valid null hypothesis (H0); in the cancer example, with H0 being “the patient is non-cancerous”, this means mistakenly labeling a non-cancerous patient as cancerous. Precision is the proportion of predicted positives that are truly positive, so a precision score of 1 indicates that your model produced no false positives: every patient it labeled as cancerous really has cancer. What precision cannot detect is Type-II error, or false negatives, which occur when a cancerous patient is mistakenly diagnosed as non-cancerous. A low precision score (e.g. 0.5) indicates that your classifier produces a significant number of false positives, which might be due to an imbalanced class or poorly adjusted model hyperparameters.

Recall is the proportion of genuine positives that are recovered out of all positives in the ground truth. Recall focuses on Type-II errors (false negatives, FN). We make a Type-II error when we accept a false null hypothesis (H0); in this situation, that means mislabeling cancerous patients as non-cancerous. A recall of 1 indicates that your model did not miss any genuine positives, i.e. it correctly identified every cancer patient. What recall cannot detect is Type-I error, or false positives, which occur when a non-cancerous patient is mistakenly diagnosed as malignant. A low recall score (e.g. 0.5) indicates that your classifier produces a lot of false negatives, which might be caused by an imbalanced class or untuned model hyperparameters. To avoid excessive FP/FN in an imbalanced-class problem, you should prepare your data ahead of time using over-/under-sampling or use a focal loss.

  • F1-score

Precision and recall are combined in the F1-score; in fact, the F1 score is the harmonic mean of the two. A high F1 score therefore denotes both high precision and high recall. It offers a good balance of precision and recall, and it performs well on tasks with imbalanced classes.

A low F1 score, on its own, tells you (nearly) nothing; it merely indicates poor performance at some point. Low recall means we failed to identify a large portion of the positive cases in the test set. Low precision means that many of the cases we labeled as positive were not actually positive.

However, a low F1 does not indicate which of the two is the problem. A high F1 indicates that we are likely to have both good precision and good recall for a significant portion of the decisions (which is informative); with a low F1, it is unclear whether the issue is low recall or low precision. Is the F1 score merely a gimmick? No, it is frequently used and regarded as a good metric for arriving at a decision, but only with a few adjustments. When you combine the FPR (false positive rate) with F1, you can keep Type-I errors in check and figure out what is responsible for a poor F1 score.
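
A minimal sketch of precision, recall, and the F1-score with scikit-learn, continuing the same made-up labels (label 1 is the positive class):

```python
# A minimal sketch of precision, recall, and F1 on hypothetical labels.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75
print(f1_score(y_true, y_pred))         # harmonic mean of the two = 0.75
```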

  • AU-ROC (Area under Receiver operating characteristics curve)

The AU-ROC is also known as the AUC-ROC score or curve. It is built from the true positive rate (TPR) and the false positive rate (FPR). TPR (recall) is the fraction of positive data points that are correctly classified as positive, out of all positive data points; the higher the TPR, the fewer positive data points we overlook. FPR (fall-out) is the fraction of negative data points that are wrongly classified as positive, out of all negative data points; the higher the FPR, the more negative data points we misclassify.

To merge the FPR and the TPR into a single metric, we first compute the two measures at many different thresholds of (for example) a logistic regression, and then plot them against each other on a single graph. The resulting curve is the ROC curve, and the measure we use is the area under that curve, which we call the AUROC.

A no-skill classifier is one that cannot distinguish between the classes and always predicts a random or constant class. On a ROC plot, a no-skill classifier corresponds to the diagonal line, with an AUROC of 0.5. (For a precision-recall curve, the no-skill baseline is instead a horizontal line at the proportion of positive cases in the dataset, which is 0.5 for a well-balanced dataset.) The area under the ROC curve represents the likelihood that a randomly chosen positive example is ranked higher than a randomly chosen negative example (i.e., is assigned a higher probability of being positive). As a result, a high AUROC implies that a randomly picked positive example is very likely to be ranked above a randomly picked negative one. A high AUROC also indicates that your algorithm is good at ranking the test data, with the majority of negative instances at one end of the scale and the majority of positive cases at the other.

When your problem has a large class imbalance, ROC curves aren’t a good choice; the reason is not obvious, but it can be deduced from the formulae, and a precision-recall curve is usually more informative in that case. You can still use ROC curves after rebalancing the dataset or using focal-loss techniques. Beyond academic study and comparing different classifiers, the AUROC value on its own is of limited practical use.
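
A minimal sketch of the ROC curve and AUROC with scikit-learn, using made-up predicted probabilities:

```python
# A minimal sketch of the ROC curve and AUROC on hypothetical scores.
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]  # predicted probabilities of class 1

fpr, tpr, thresholds = roc_curve(y_true, y_scores)  # points that define the ROC curve
print(roc_auc_score(y_true, y_scores))              # area under that curve
```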

I hope you now see the value of performance measures in model evaluation and are aware of a few small techniques for interpreting your model. One thing to keep in mind is that these metrics can be tweaked to fit your specific use case. Take, for instance, the weighted F1-score: it calculates the metric for each label and averages them, weighted by support (the number of true instances for each label). Another example is weighted accuracy, or balanced accuracy in technical terms; balanced accuracy is used in binary and multiclass classification problems to cope with imbalanced datasets and is defined as the average recall obtained in each class.




  • Corpus ID: 262153445

Evaluation Metrics for NLG and TTS in Task-Oriented Dialog PhD. Thesis Proposal

  • PhD. Thesis Proposal, Ondřej Plátek
  • Published 2023
  • Computer Science, Linguistics


CHM 7010 - Turning Research into a Thesis: Source Evaluation/Metrics

  • Science Citation Index Expanded
  • Source Evaluation/Metrics
  • Thesis/Technical Writing

Ethical Guidelines to Publication of Chemical Research

  • Ethical Guidelines to Publication of Chemical Research From the American Chemical Society, but all reputable journals and publications should follow similar guidelines.
  • ACS Top 10 Tips for Ethical Authorship A one-page infographic.

Identifying Predatory or Questionable Publishers and Journals

  • Grand Valley State Libraries' Open Access Quality Indicators Guidelines to help you evaluate open access publications.
  • Laine, C., & Winker, M. A. (2017). Identifying predatory or pseudo-journals. Biochemia Medica, 27(2), 285–291. http://doi.org/10.11613/BM.2017.031
  • Cabells Scholarly Analytics Find publishing opportunities in many academic disciplines. Each entry includes submission guidelines, circulation data, review method, and contact information. Other available data include journal acceptance rate, time to publication, percentage of invited articles, and frequency of issue. Includes Cabells Predatory Reports, a list of publishers that, at last review, met Cabells’ defined criteria for deceptive practices.
  • Cabells Predatory Report Criteria

Predatory Publishers

A predatory publisher is a publisher who produces low quality academic journals.  These journals are rarely peer-reviewed, and often charge the author a publication fee.  The publisher works hard to dupe authors into publishing by emulating well-known publishers, lying about their credentials, and soliciting submissions with spam emails.  Learn how to evaluate and identify predatory publishers using the criteria below:

  • Little or no peer review
  • Editors with no or fake credentials
  • No editors listed at all
  • False indexing claims in Web of Science, Directory of Open Access Journals, etc.
  • Rapid acceptance (within 2 weeks)
  • Call for papers email, a pre-acceptance email, and other "scammy" email
  • Empty or dead links
  • Spelling and grammar errors
  • Impact Factor can only be assigned by one company, Clarivate Analytics (formerly Thomson Reuters and Institute for Scientific Information).  An easy way to check is by searching the Impact Factor in InCites Journal Citations Report.
  • Journal Citation Reports

About Journal Citation Reports

CHM 6900 Library Guide about Predatory Journals

  • More about Predatory Journals - CHM 6900 Course Guide

More Reasons to Evaluate What You Find

  • Halevi, G., Moed, H., & Bar-Ilan, J. (2017). Review: Suitability of Google Scholar as a source of scientific information and as a source of data for scientific evaluation—review of the literature. Journal of Informetrics, 11 (3), 823-834.
  • Leetaru, K. (2016, Dec 16). How academia, Google Scholar, and predatory publishers help feed academic fake news. Forbes .
  • López-Cózar, E., Robinson-Garcia, N., & Torres-Salinas, D. (2013). The Google Scholar experiment: How to index false papers and manipulate bibliometric indicators. Journal of the Association for Information Science and Technology, 65 (3), 446-454.
  • Pettit, Emma. (August 1, 2018). These professors don’t work for a predatory publisher. It keeps claiming they do. The Chronicle of Higher Education.

About Impact Factors and Other Journal Metrics

  • Measuring Your Impact: Impact Factor, Citation Analysis, and other Metrics: Measuring Your Impact Overview of h-index, Eigenfactor, Impact Factor (IF), Journal Citation Reports, Citation Analysis, and other tools from the University of Illinois at Chicago.

Using Google Wisely

Google can be a great place to start your search for free information. Google Scholar helps you find scholarly information that may or may not be free. Wright State University Libraries pays for you to have access to many of the fee-based articles that you find in Google Scholar.

Whether using Google or Google Scholar, be sure to evaluate what you find.

  • Google Advanced Search
  • Google Scholar
  • Google Scholar Advanced Search

Advantages and Limitations of Google Scholar

  • Google Scholar Help -- Search Tips, Overview, Coverage, etc. From Google Scholar.
  • Advantages and Limitations of Google Scholar A quick comparison from East Carolina University.
  • Harzing, A., & Alakangas, S. (2016). Google Scholar, Scopus and the Web of Science: a longitudinal and cross-disciplinary comparison. Scientometrics, 106 (2), 787-804.
  • Beckmann, M., Wehrden, H., & Palmer, M. (2012). Where you search is what you get: literature mining - Google Scholar versus Web of Science using a data set from a literature search in vegetation science. Journal of Vegetation Science, 23 (6), 1197-1
  • Bramer, W. M., Giustini, D., Kramer, B. M., & Anderson, P. (2013). The comparative recall of Google Scholar versus PubMed in identical searches for biomedical systematic reviews: a review of searches used in systematic reviews. Systematic Reviews, 2, 115.


More Information Resources for...

Scientific communication resource.

  • ACS Guide to Scholarly Communication by editors: Gregory M. Banik, Grace Baysinger, Prashant V. Kamat, Norbert J. Pienta Call Number: online ISBN: 9780841235830 Publication Date: 2020
  • ACS Style Quick Guide Examples of references for commonly cited types of sources. Note to the Reader: This is an open access chapter published under an ACS AuthorChoice License, which permits copying and redistribution of the chapter or any adaptations for non-commercial purposes. In all of the examples shown here, the source type is online unless noted otherwise.


Synthetic Data and Its Evaluation Metrics for Machine Learning

  • Conference paper
  • First Online: 02 March 2023


  • A. Kiran   ORCID: orcid.org/0000-0002-4574-6688 7 &
  • S. Saravana Kumar   ORCID: orcid.org/0000-0001-5679-2367 7  

Part of the book series: Smart Innovation, Systems and Technologies ((SIST,volume 324))

529 Accesses

3 Citations

Artificial Intelligence (AI) has become the key driving force in industrial automation. Machine learning (ML) and deep learning (DL) can be considered the components of AI that rely on data for model training. Data generation has increased due to the Internet, connected devices, mobile devices and social networking, which in turn have also given rise to cybercrime and cyber theft. To prevent these and preserve the identity of individuals in public data, governments and policymakers have put stringent privacy-preserving laws in place. The economics of data collection, the quality of data in the public domain, and data bias have made data accessibility and usage a challenge for AI/ML training, whether for research or industrial purposes. This has forced researchers to look into alternatives. Synthetic data offers a promising solution to overcome these data challenges. The last few years have seen many studies conducted to verify the utility and privacy-protection capability of synthetic data; however, all of these have been exploratory. This paper focuses on various methods of synthetic data generation and their validation metrics. It opens up a few questions that need further study before we can conclude that synthetic data offers a universal solution for AI and ML.



Author information

Authors and affiliations.

Department of CSE, SOET, CMR University, Bangalore, India

A. Kiran & S. Saravana Kumar


Corresponding author

Correspondence to A. Kiran .

Editor information

Editors and affiliations.

Khon Kaen University, Khon Kaen, Thailand

Chakchai So-In

National Institute of Technology, Raipur, Chhattisgarh, India

Narendra D. Londhe

Nirma University, Ahmedabad, Gujarat, India

Nityesh Bhatt

Estonian Business School, Tallinn, Estonia

Meelis Kitsing


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper.

Kiran, A., Kumar, S.S. (2023). Synthetic Data and Its Evaluation Metrics for Machine Learning. In: So-In, C., Londhe, N.D., Bhatt, N., Kitsing, M. (eds) Information Systems for Intelligent Systems . Smart Innovation, Systems and Technologies, vol 324. Springer, Singapore. https://doi.org/10.1007/978-981-19-7447-2_43

Download citation

DOI : https://doi.org/10.1007/978-981-19-7447-2_43

Published : 02 March 2023

Publisher Name : Springer, Singapore

Print ISBN : 978-981-19-7446-5

Online ISBN : 978-981-19-7447-2

eBook Packages : Intelligent Technologies and Robotics (R0)


Evaluation Metrics for Retrieval-Augmented Generation (RAG) Systems

Retrieval-Augmented Generation (RAG) systems represent a significant leap forward in the realm of Generative AI, seamlessly integrating the capabilities of information retrieval and text generation. Unlike traditional models like GPT, which predict the next word based solely on previous context, RAG systems enhance responses by tapping into a vast reservoir of data, ensuring that the generated text is not only contextually appropriate but also richly informative. This makes them particularly valuable in fields such as customer support and content creation.

This article delves into the essential evaluation metrics that underline the effectiveness of RAG systems.

Table of Content

  • Overview of RAG Systems
  • Importance of Evaluation Metrics in RAG Systems
  • Key Metrics for RAG System Evaluation: 1. Hit Rate, 2. Mean Reciprocal Rank (MRR), 3. Relevancy

Overview of RAG Systems

RAG systems are all about improving automated responses by combining information retrieval with language generation. The retrieval part digs up relevant info from a database, and then the generation model takes that info to craft smart, context-aware answers. This combo means RAG systems can give you responses that are not just spot-on but also packed with the right context, which is perfect for tackling more complicated questions.

Basically, a RAG system works by looking at what a user asks and picking out important words or phrases. It then digs through a huge dataset to find the most relevant documents or sections. After that, it takes the information it found and feeds it to a language model, which puts everything together to create a clear and natural-sounding response. This way, the answer is on point and easy to understand.

Evaluation metrics are super important for figuring out how well RAG systems are doing. They give us a consistent way to check how effectively a system pulls up relevant info and gives accurate answers. With these metrics, developers and researchers can spot where things can be better, compare different models, and make sure the system is hitting the performance goals they want.

Hit Rate is a way to check how often a RAG system gives answers that are pretty close to what you were looking for. It’s a key measure of how accurate and reliable the system is, especially when you really need precise information. A higher Hit Rate means the system is doing a good job of finding and generating responses that meet what users want.

Calculation Method (a sketch of one possible implementation follows the list below):

  • responses : A list containing the system-generated answers.
  • ground_truth : A list of the expected correct answers.
  • The function iterates through pairs of responses and ground truth values.
  • It counts the number of instances where the system-generated response matches the expected answer exactly.
  • The Hit Rate is then calculated as the ratio of these correct matches to the total number of ground truth answers, providing a straightforward measure of the system’s accuracy.
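
The article's original code block did not survive extraction, so the following is a minimal sketch reconstructed from the description above; the names hit_rate, responses, and ground_truth and the example values are assumptions:

```python
# A minimal sketch of the Hit Rate calculation described above.
def hit_rate(responses, ground_truth):
    """Fraction of ground-truth answers matched exactly by the system's responses."""
    hits = sum(1 for resp, truth in zip(responses, ground_truth) if resp == truth)
    return hits / len(ground_truth)

# Hypothetical data: only the first response matches its expected answer exactly.
responses = ["Paris", "42", "blue"]
ground_truth = ["Paris", "41", "red"]
print(hit_rate(responses, ground_truth))  # 0.3333333333333333
```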

The Hit Rate of “0.3333333333333333” suggests that approximately 33.33% of the responses generated by the RAG system were exactly what was expected or matched the ground truth data perfectly.

Mean Reciprocal Rank (MRR) is another key metric for checking how well RAG systems perform. Basically, it looks at how quickly the system can pull up the right answer, showing how good it is at finding useful info. MRR is especially handy when the order of answers matters, since it tells us how close the correct answers are to the top of the list. The higher the MRR, the better the system is at putting the right answers up front.

Calculation Method (a sketch of one possible implementation follows the list below):

  • responses : A list of responses generated by the RAG system.
  • ground_truth : A list containing the correct or expected answers.
  • Initialize an Empty List : Start with an empty list to hold the reciprocal values of the ranks where the correct answers are found.
  • Iterate Through Ground Truth : For each correct answer in the ground_truth list, find its rank in the responses list.
  • Calculate Reciprocal Rank : For each correct answer found, compute the reciprocal of its rank (1 divided by the rank position).
  • Append Reciprocal Rank : Add each of these reciprocal values to the previously initialized list.
  • Compute Average MRR : Finally, calculate the average of these reciprocal values to obtain the MRR score.
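
As with Hit Rate, the original code block is missing, so here is a minimal sketch matching the steps above; the names and the example lists are assumptions. With the hypothetical lists below, the correct answers sit at ranks 1, 2, and 3 of the ranked responses, giving (1 + 1/2 + 1/3) / 3 ≈ 0.611:

```python
# A minimal sketch of Mean Reciprocal Rank (MRR) as described above.
def mean_reciprocal_rank(responses, ground_truth):
    """Average reciprocal rank at which each correct answer appears in the responses."""
    reciprocal_ranks = []
    for truth in ground_truth:
        if truth in responses:
            rank = responses.index(truth) + 1   # ranks are 1-based
            reciprocal_ranks.append(1.0 / rank)
        else:
            reciprocal_ranks.append(0.0)        # the correct answer was never retrieved
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

responses = ["Paris", "Berlin", "Madrid"]     # ranked system outputs (hypothetical)
ground_truth = ["Paris", "Berlin", "Madrid"]  # expected answers (hypothetical)
print(mean_reciprocal_rank(responses, ground_truth))  # 0.611...
```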

The Mean Reciprocal Rank (MRR) of “0.611111111111111” gives an insightful look into the efficacy of a Retrieval-Augmented Generation (RAG) system in ranking relevant responses.

Relevancy is super important when it comes to checking how well a RAG system’s answers line up with what the user is asking. It’s like figuring out if the system is giving back info that really matches what the user is looking for. If the relevancy score is high, that means the system is doing a great job of getting what the user needs, which is key for providing helpful and on-point info.

  • query : The user’s query as a string, which the responses should address.
  • Count Relevant Responses : Iterate through the list of responses and count how many of them contain the user’s query as a substring. This involves checking each response to see if the query string appears within it.
  • Calculate Relevancy Score: The relevancy score is then computed as the ratio of relevant responses to the total number of responses, providing a direct measure of how well the system’s outputs match the query’s intent (a sketch of one possible implementation follows this list).
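
The original code block is again missing, so this is a minimal sketch based on the description; relevancy, responses, query, and the example strings are assumptions chosen to reproduce the score of 1.0 discussed below:

```python
# A minimal sketch of the Relevancy calculation described above.
def relevancy(responses, query):
    """Fraction of responses that contain the query string as a substring."""
    relevant = sum(1 for resp in responses if query in resp)
    return relevant / len(responses)

responses = [
    "GeeksforGeeks is a computer science portal.",
    "You can practice coding problems on GeeksforGeeks.",
    "GeeksforGeeks also publishes tutorials on machine learning.",
]
print(relevancy(responses, "GeeksforGeeks"))  # 1.0
```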

Because every response here is related to ‘GeeksforGeeks’, which is the user’s query, we get a score of 1, meaning all responses are relevant to the user’s query.

A Relevancy score of “1.0” for a Retrieval-Augmented Generation (RAG) system is highly significant, indicating optimal performance in terms of how well the generated responses align with the user’s query.

Evaluation metrics are indispensable in the development and refinement of RAG systems. They enable developers to gauge the efficacy of their models, fine-tune configurations, and ensure the systems deliver high-quality, relevant, and accurate responses. Metrics such as Hit Rate, MRR, and Relevancy not only illustrate the performance of RAG systems but also guide improvements, ensuring these systems continue to evolve and better serve user needs. By understanding and applying these metrics, developers can significantly enhance the utility and effectiveness of RAG systems in various applications.


Thesis Gold Statistics

Total Valuation

Thesis Gold has a market cap or net worth of 118.43 million. The enterprise value is 117.18 million.

Market Cap 118.43M
Enterprise Value 117.18M

Important Dates

The next estimated earnings date is Tuesday, October 29, 2024.

Earnings Date Oct 29, 2024
Ex-Dividend Date n/a

Share Statistics

Shares Outstanding n/a
Shares Change (YoY) +79.01%
Shares Change (QoQ) +0.08%
Owned by Insiders (%) n/a
Owned by Institutions (%) n/a
Float 193.47M

Valuation Ratios

PE Ratio n/a
Forward PE n/a
PS Ratio n/a
PB Ratio n/a
P/FCF Ratio n/a
PEG Ratio n/a

Enterprise Valuation

EV / Earnings -353.94
EV / Sales n/a
EV / EBITDA n/a
EV / EBIT n/a
EV / FCF -5.00

Financial Position

The company has a current ratio of 4.43, with a Debt / Equity ratio of 0.00.

Current Ratio 4.43
Quick Ratio 3.76
Debt / Equity 0.00
Debt / EBITDA n/a
Debt / FCF -0.00
Interest Coverage n/a

Financial Efficiency

Return on equity (ROE) is -0.33% and return on invested capital (ROIC) is -2.39%.

Return on Equity (ROE) -0.33%
Return on Assets (ROA) -2.13%
Return on Capital (ROIC) -2.39%
Revenue Per Employee n/a
Profits Per Employee -331,082
Employee Count 1
Asset Turnover n/a
Inventory Turnover n/a
Income Tax n/a
Effective Tax Rate n/a

Stock Price Statistics

The stock price has increased by +21.59% in the last 52 weeks. The beta is 1.62, so Thesis Gold's price volatility has been higher than the market average.

Beta (5Y) 1.62
52-Week Price Change +21.59%
50-Day Moving Average 0.50
200-Day Moving Average 0.45
Relative Strength Index (RSI) 66.73
Average Volume (20 Days) 19,863

Short Selling Information

Short Interest n/a
Short Previous Month n/a
Short % of Shares Out n/a
Short % of Float n/a
Short Ratio (days to cover) n/a

Income Statement

Revenue n/a
Gross Profit n/a
Operating Income -3.85M
Pretax Income 459,198
Net Income -331,082
EBITDA n/a
EBIT -3.85M
Earnings Per Share (EPS) -0.00

Balance Sheet

The company has 1.25 million in cash and 4,942 in debt, giving a net cash position of 1.25 million.

Cash & Cash Equivalents 1.25M
Total Debt 4,942
Net Cash 1.25M
Net Cash Per Share n/a
Equity (Book Value) 118.52M
Book Value Per Share 0.68
Working Capital 4.47M

In the last 12 months, operating cash flow was 782,125 and capital expenditures -24.20 million, giving a free cash flow of -23.42 million.

Operating Cash Flow 782,125
Capital Expenditures -24.20M
Free Cash Flow -23.42M
FCF Per Share n/a
Gross Margin n/a
Operating Margin n/a
Pretax Margin n/a
Profit Margin n/a
EBITDA Margin n/a
EBIT Margin n/a
FCF Margin n/a

Dividends & Yields

Thesis Gold does not appear to pay any dividends at this time.

Dividend Per Share n/a
Dividend Yield n/a
Dividend Growth (YoY) n/a
Years of Dividend Growth n/a
Payout Ratio n/a
Buyback Yield -79.01%
Shareholder Yield -79.01%
Earnings Yield -0.35%
FCF Yield n/a

Stock Splits

The last stock split was on August 25, 2023. It was a reverse split with a ratio of 0.3846154.

Last Split Date Aug 25, 2023
Split Type Reverse
Split Ratio 0.3846154

Evaluation of Automated Driving System Safety Metrics With Logged Vehicle Trajectory Data


Index Terms

Computer systems organization

Embedded and cyber-physical systems

Embedded systems

Computing methodologies

Machine learning

Learning paradigms

Unsupervised learning

Anomaly detection

Modeling and simulation

Simulation types and techniques

Real-time simulation

Information systems

Information systems applications

Decision support systems

Data analytics

Spatial-temporal systems

Data streaming

Mathematics of computing

Probability and statistics

Probabilistic inference problems

Software and its engineering

Software organization and properties

Extra-functional properties

Recommendations

Metrics Design for Safety Assessment

Context: In the safety domain, safety assessment is used to show that safety-critical systems meet the required safety objectives. This process is also referred to as safety assurance and certification. During this procedure, safety standards are used ...

Implications of Safety Definition for Automated Driving

The developments of advanced sensing, communication and vehicle technologies in the past two decades have significantly changed the technological composition of road vehicles. It has become an expectation that automated vehicle technologies, including ...

A Class of Model Predictive Safety Performance Metrics for Driving Behavior Evaluation

This paper introduces a class of model predictive safety performance metrics for driving behavior evaluation. Through formulating the interactions of various traffic agents in the dynamic system context, a class of operational safety performance metrics ...


Help | Advanced Search

Computer Science > Machine Learning

Title: Are Heterophily-Specific GNNs and Homophily Metrics Really Effective? Evaluation Pitfalls and New Benchmarks

Abstract: Over the past decade, Graph Neural Networks (GNNs) have achieved great success on machine learning tasks with relational data. However, recent studies have found that heterophily can cause significant performance degradation of GNNs, especially on node-level tasks. Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs, and various homophily metrics have been designed to help people recognize these malignant datasets. Nevertheless, there still exist multiple pitfalls that severely hinder the proper evaluation of new models and metrics. In this paper, we point out the three most serious pitfalls: 1) a lack of hyperparameter tuning; 2) insufficient model evaluation on the truly challenging heterophilic datasets; 3) a missing quantitative evaluation benchmark for homophily metrics on synthetic graphs. To overcome these challenges, we first train and fine-tune baseline models on the $27$ most widely used benchmark datasets, categorize them into three distinct groups (malignant, benign, and ambiguous heterophilic datasets), and identify the truly challenging subsets of tasks. To the best of our knowledge, we are the first to propose such a taxonomy. Then, we re-evaluate $10$ heterophily-specific state-of-the-art (SOTA) GNNs with fine-tuned hyperparameters on the different groups of heterophilic datasets. Based on the model performance, we reassess their effectiveness in addressing the heterophily challenge. Finally, we evaluate $11$ popular homophily metrics on synthetic graphs generated with three different approaches. To compare the metrics rigorously, we propose the first quantitative evaluation method, based on the Fréchet distance.
Comments: arXiv admin note: substantial text overlap with
Subjects: Machine Learning (cs.LG)
Cite as: [cs.LG]
  arXiv-issued DOI via DataCite (pending registration)
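The quantitative benchmark described in the abstract above is based on the Fréchet distance. The paper's own procedure is not reproduced here; the following Python sketch only illustrates the standard discrete Fréchet distance between two polygonal curves (the Eiter-Mannila dynamic program), applied to two hypothetical (homophily level, accuracy) curves:

    import numpy as np

    def discrete_frechet(P, Q):
        # Discrete Fréchet distance between two curves given as sequences of points.
        P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
        n, m = len(P), len(Q)
        d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # pairwise distances
        ca = np.empty((n, m))                                       # coupling table
        ca[0, 0] = d[0, 0]
        for j in range(1, m):
            ca[0, j] = max(ca[0, j - 1], d[0, j])
        for i in range(1, n):
            ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
            for j in range(1, m):
                ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
        return ca[-1, -1]

    # Hypothetical curves: homophily level vs. model accuracy on synthetic graphs.
    curve_a = [(0.1, 0.52), (0.3, 0.61), (0.5, 0.74), (0.7, 0.86), (0.9, 0.93)]
    curve_b = [(0.1, 0.55), (0.3, 0.60), (0.5, 0.70), (0.7, 0.88), (0.9, 0.95)]
    print(discrete_frechet(curve_a, curve_b))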

COMMENTS

  1. Evaluation metrics and statistical tests for machine learning

    The most commonly used evaluation metrics for binary classification are accuracy, sensitivity, specificity, and precision, which express the percentage of correctly classified instances in the set ... (a minimal sketch of computing these four metrics appears after this list).

  2. Full article: Metrics for evaluating the performance of machine

    Two decisions need to be made to successfully use CV to judge model performance: first, how to organise the train/test-set split, and second, which metrics to use to judge performance. In this paper we focus on the role of performance metrics. We present 48 metrics that could potentially be used for this task.

  3. A Review of Evaluation Metrics in Machine Learning Algorithms

    The results of these evaluation metrics will determine whether the classifier has performed optimally or whether further refinement of the classifier is required. This review paper focused on highlighting the various evaluation metrics being applied in machine learning algorithms. Identified challenges and issues are also discussed.

  4. role of metrics in peer assessments

    This paper addresses whether metrics are considered a legitimate and integral part of the assessment of research, explores the role of metrics in different review contexts and fields of research, and discusses implications for research evaluation and policy. The use of metrics has a long history, dating back more than 100 years (De Bellis 2009).

  5. Understanding research metrics

    Read our guide to research metrics and learn how to monitor your journal's performance, including CiteScore, impact factor, and h-index.

  6. (PDF) Development, validation, and usage of metrics to evaluate

    Objectives: To develop, validate, and use evaluation instruments to assess the quality of clinical hypotheses generated using secondary data analytic tools. Materials and Methods: The development ...

  7. Evaluating Research Impact: A Comprehensive Overview of Metrics and

    The purpose of this research paper is to analyze and compare the various research metrics and online databases used to evaluate the impact and quality of scientific publications. The study focuses on the most widely used research metrics, such as the h-index, the...

  8. PDF A Tutorial on Evaluation Metrics used in Natural Language Generation

    Cutting-edge: This tutorial will follow the growth of automatic evaluation metrics over the years, starting with the initial metrics that are still popularly used today, and building up to the more recent metrics. Substantial emphasis will be given to the recent trends and emerging directions of research on this topic.

  9. Evaluation Metrics for Unsupervised Learning Algorithms

    Internal evaluation methods are commonly classified according to the type of clustering algorithm they are used with. For partitional algorithms, metrics based on the proximity matrix, as well as metrics of cohesion and separation, such as the silhouette coefficient, are often used.

  10. Evaluation Metrics In Machine Learning Based Dissertation

    PhD Assistance offers machine learning-based dissertations; to evaluate them, we will need to measure performance by computing some type of distance between anticipated and actual values.

  11. PDF A Survey of Accuracy Evaluation Metrics of Recommendation Tasks

    In some cases, applying incorrect evaluation metrics may result in selecting an inappropriate algorithm. We demonstrate this by experimenting with a wide collection of data sets, comparing a number of algorithms using various evaluation metrics, showing that the metrics rank the algorithms differently.

  12. Evaluation Metrics and Evaluation

    This chapter describes the metrics for the evaluation of information retrieval and natural language processing systems, the annotation techniques and evaluation metrics, and ...

  13. PDF GUIDELINE FOR MASTER'S THESIS EVALUATION

    The master's thesis is an independent research project completed by the student. The supervisor shall evaluate all parts of the complete thesis submitted for evaluation, including the title page. As applicable, other factors such as the independent contribution of the student and his/her ability to stay on the agreed schedule may be considered in the evaluation process.

  14. Evaluation Metrics for NLG and TTS in Task-Oriented Dialog PhD. Thesis

    This thesis proposal explores the evaluation of Task-oriented Dialogue (ToD) systems and Text-to-Speech Synthesis (TTS) using automatic metrics and proposes a series of experiments aimed at resolving the identified limitations and enhancing the evaluation process for D2T and ToD NLG.

  15. PDF Master Thesis Using Machine Learning Methods for Evaluating the ...

    Therefore, this thesis focuses on the evaluation of translation quality, specifically concerning technical documentation, and answers two central questions: How can the translation quality of technical documents be evaluated, given the ... translation evaluation metrics in the context of a knowledge discovery process ...

  16. PDF Chapter 6 Evaluation Metrics and Evaluation

    6.3 Metrics. Evaluation, in this case quantitative evaluation, can have many different purposes. There may also be different limitations on the amount of data used for training and for evaluation. In some cases, high recall is considered a priority over high precision, and in some cases it is the opposite.

  17. PDF A Re-examination of Chatbot Evaluation Metrics

    The objective of the thesis is to understand the characteristics of two types of automated metrics: trained-metric and untrained-metric, and identify the most suitable metrics for dialog evaluation. Moreover, experiments have been conducted to study the weaknesses of word-overlap metrics in morphology-rich language and solutions for that problem.

  18. We Need to Talk About Classification Evaluation Metrics in NLP

    We Need to Talk About Classification Evaluation Metrics in NLP. In Natural Language Processing (NLP) classification tasks such as topic categorisation and sentiment analysis, model generalizability is generally measured with standard metrics such as Accuracy, F-Measure, or AUC-ROC. The diversity of metrics, and the arbitrariness of their ...

  19. Comparative evaluation of Large Language Models using key metrics and

    The evaluation focused on metrics such as bias, safety, accuracy, cost, robustness, and latency. Additionally, adaptability, covering critical features like language translation and internet access, was independently researched since the Langsmith tool does not evaluate this metric. This ensures a holistic assessment of the LLM's capabilities.

  20. Classification Model Evaluation Metrics

    We have described all 16 metrics, which are used to evaluate classification models, listed their characteristics, mutual differences, and the parameter that evaluates each of these metrics.

  21. On Search Engine Evaluation Metrics

    Part II is where this thesis stops criticizing others and gets creative. In Chapter 7, the concept of relevance, so very central for evaluation, is discussed. After that, I present a framework for web search meta-evaluation, that is, an evaluation of the evaluation metrics themselves, in ...

  22. CHM 7010

    Provides links to relevant tutorials for database searching and use of Mendeley, as well as links to resources about technical writing, identifying questionable journals and publishers, finding journal metrics, and article evaluation criteria.

  23. Synthetic Data and Its Evaluation Metrics for Machine Learning

    Synthetic data generation is a statistical process through which statistical information is extracted from the actual data and released for public usage. Fully synthetic data creation follows the approach introduced by Rubin, whereas partially synthetic data creation was a concept suggested by Little in 1993.

  24. Evaluation Metrics for Retrieval-Augmented Generation (RAG) Systems

    Evaluation metrics are super important for figuring out how well RAG systems are doing. They give us a consistent way to check how effectively a system pulls up relevant info and gives accurate answers. With these metrics, developers and researchers can spot where things can be better, compare different models, and make sure the system is ...

  25. Thesis Gold Inc. (THSGF) Statistics & Valuation Metrics

    Detailed statistics for Thesis Gold Inc. (THSGF) stock, including valuation metrics, financial numbers, share information and more.

  26. Evaluation of Automated Driving System Safety Metrics With Logged

    The proposed evaluation framework is important for researchers, practitioners, and regulators to characterize different metrics, and to select appropriate metrics for different applications. Moreover, by conducting failure analysis on moments when a safety metric fails, we can identify its potential weaknesses, which can be valuable for ...

  27. Quality and performance evaluation metrics of websites

    This paper provides a systematic literature review to present a broad overview of the primary studies on evaluating the quality of websites since 2010. The motivation is the identification of ...

  28. Are Heterophily-Specific GNNs and Homophily Metrics Really Effective

    Numerous heterophilic benchmark datasets have been put forward to validate the efficacy of heterophily-specific GNNs and various homophily metrics have been designed to help people recognize these malignant datasets. Nevertheless, there still exist multiple pitfalls that severely hinder the proper evaluation of new models and metrics.
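The binary-classification metrics named in entry 1 above (accuracy, sensitivity, specificity, and precision) can be computed directly from the four confusion-matrix counts. A minimal, generic Python sketch with hypothetical counts (not code or data from any of the works listed):

    # Binary-classification metrics from confusion-matrix counts (hypothetical values).
    TP, TN, FP, FN = 40, 45, 5, 10

    accuracy    = (TP + TN) / (TP + TN + FP + FN)   # share of all predictions that are correct
    sensitivity = TP / (TP + FN)                    # recall on the positive class
    specificity = TN / (TN + FP)                    # recall on the negative class
    precision   = TP / (TP + FP)                    # share of positive predictions that are correct

    print(accuracy, sensitivity, specificity, precision)   # 0.85 0.8 0.9 0.888...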