University of Edinburgh Research Explorer


ParaMetric: An Automatic Evaluation Metric for Paraphrasing

  • School of Informatics

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract / Description of output

We present ParaMetric, an automatic evaluation metric for data-driven approaches to paraphrasing. ParaMetric provides an objective measure of quality using a collection of multiple translations whose paraphrases have been manually annotated. ParaMetric calculates precision and recall scores by comparing the paraphrases discovered by automatic paraphrasing techniques against gold standard alignments of words and phrases within equivalent sentences. We report scores for several established paraphrasing techniques.

Access to document

  • 2008_Callison-Burch_Cohn_ET AL_ParaMetric An Automatic Evaluation Metric for Paraphrasing, final published version, 267 KB
  • http://www.aclweb.org/anthology/C08-1013

Authors: Chris Callison-Burch, Trevor Cohn, Mirella Lapata

Type: Conference contribution

Published in: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Publisher: Association for Computational Linguistics
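The abstract describes precision and recall computed by comparing automatically discovered paraphrase pairs against gold-standard alignment pairs. A minimal sketch of that comparison over sets of phrase pairs (toy data and hypothetical helper names, not the authors' implementation):

```python
# Sketch: precision/recall of system paraphrase pairs against gold-standard
# pairs for one group of equivalent sentences (toy data, not the ParaMetric code).

def precision_recall(system_pairs, gold_pairs):
    # Treat pairs as unordered, so ("couch", "sofa") matches ("sofa", "couch").
    normalize = lambda pairs: {tuple(sorted(p)) for p in pairs}
    sys_set, gold_set = normalize(system_pairs), normalize(gold_pairs)
    overlap = len(sys_set & gold_set)
    precision = overlap / len(sys_set) if sys_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    return precision, recall

system = {("couch", "sofa"), ("big", "large"), ("big", "red")}
gold = {("sofa", "couch"), ("big", "large"), ("quick", "fast")}
print(precision_recall(system, gold))  # -> (0.666..., 0.666...)
```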



On the Evaluation Metrics for Paraphrase Generation

17 Feb 2022 · Lingfeng Shen, Lemao Liu, Haiyun Jiang, Shuming Shi

In this paper we revisit automatic metrics for paraphrase evaluation and obtain two findings that disobey conventional wisdom: (1) Reference-free metrics achieve better performance than their reference-based counterparts. (2) Most commonly used metrics do not align well with human annotation. Underlying reasons behind the above findings are explored through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of reference-based and reference-free metrics and explicitly models lexical divergence. Experimental results demonstrate that ParaScore significantly outperforms existing metrics.


MT Evaluation in Many Languages via Zero-Shot Paraphrasing

thompsonb/prism


Prism is an automatic MT metric which uses a sequence-to-sequence paraphraser to score MT system outputs conditioned on their respective human references. Prism uses a multilingual NMT model as a zero-shot paraphraser, which negates the need for synthetic paraphrase data and results in a single model which works in many languages.

Prism outperforms or statistically ties with all metrics submitted to the WMT 2019 metrics shared task in segment-level human correlation.

We provide a large, pre-trained multilingual NMT model which we use as a multilingual paraphraser, but the model may also be of use to the research community beyond MT metrics. We provide examples of using the model for both multilingual translation and paraphrase generation.

Prism scores raw, untokenized text; all preprocessing is applied internally. This document describes how to install and use Prism.

Installation

Prism requires a version of Fairseq compatible with the provided pretrained model. We recommend starting with a clean environment:
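(A minimal sketch of creating a clean environment; the Python version here is an assumption, and compatible Fairseq/torch versions come from the repository's requirements in the next step.)

```bash
# Sketch: fresh conda environment (Python version is an assumption).
conda create -n prism python=3.7 -y
conda activate prism
```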

For reasonable speeds, we recommend running on a machine with a GPU and the CUDA version compatible with the version of fairseq/torch installed above. Prism will run on a GPU if available; to run on CPU instead, set CUDA_VISIBLE_DEVICES to an empty string.

Download the Prism code and install requirements, including Fairseq:
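(A sketch of the usual clone-and-install steps; the repository path matches the project name above, and the requirements file is assumed to pin compatible Fairseq/torch versions.)

```bash
git clone https://github.com/thompsonb/prism.git
cd prism
pip install -r requirements.txt   # pulls in Fairseq and the other dependencies
```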

Download Model
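(A sketch of fetching and unpacking the pretrained model; the m39v1 name matches the identifier printed below, but the download URL is an assumption, so check the repository for the authoritative link.)

```bash
wget http://data.statmt.org/prism/m39v1.tar   # assumed URL
tar xf m39v1.tar                              # creates the m39v1/ model directory
```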

Metric Usage: Command Line

Create test candidate/reference files:
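(A sketch with two candidate sentences and two matching references; the sentences are assumed to mirror the repository's example, so different text will give different scores below.)

```bash
echo -e "Hi world.\nThis is a Test." > cand.en
echo -e "Hello world.\nThis is a test." > ref.en
```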

To obtain system-level metric scores, run:
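(A sketch of the invocation, assuming the model was unpacked to m39v1/.)

```bash
./prism.py --cand cand.en --ref ref.en --lang en --model-dir m39v1/
```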

Here, "ref.en" is the (untokenized) human reference, and "cand.en" is the (untokenized) system output. This command will print some logging information to STDERR, including a model/version identifier, and print the system-level score (negative, higher is better) to STDOUT:

Prism identifier: {'version': '0.1', 'model': 'm39v1', 'seg_scores': 'avg_log_prob', 'sys_scores': 'avg_log_prob', 'log_base': 2}
-1.0184667

Candidates can also be piped into prism.py:
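(For example, assuming the same files and model directory as above:)

```bash
cat cand.en | ./prism.py --ref ref.en --lang en --model-dir m39v1/
```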

To score output using the source instead of the reference (i.e., quality estimation as a metric), use the --src flag. Note that --lang still specifies the target/reference language:
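(A sketch, where src.fr is a hypothetical French source file:)

```bash
./prism.py --cand cand.en --src src.fr --lang en --model-dir m39v1/
```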

Prism also has access to all WMT test sets via the sacreBLEU API. These can be specified as arguments to --src and --ref, for a hypothetical system output $cand, as follows:
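(The test-set argument format below is an assumption; confirm the exact syntax with the help command further down.)

```bash
./prism.py --cand $cand --ref sacrebleu:wmt19:de-en --model-dir m39v1/
```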

which will cause it to use the English reference from the WMT19 German--English test set. (Since the language is known, no --lang is needed).

To see all options, including segment-level scoring, run:
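(Assuming the standard help flag:)

```bash
./prism.py --help
```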

Metric Usage: Python Module

All functionality is also available in Python, for example:
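(A sketch following the interface suggested by the output below; the constructor and score arguments are assumptions based on the repository's documented example.)

```python
from prism import Prism

prism = Prism(model_dir='m39v1/', lang='en')
print('Prism identifier:', prism.identifier())

cand = ['Hi world.', 'This is a Test.']
ref = ['Hello world.', 'This is a test.']
src = ['Bonjour le monde.', "C'est un test."]

# Reference-based metric scores
print('System-level metric:', prism.score(cand=cand, ref=ref))
print('Segment-level metric:', prism.score(cand=cand, ref=ref, segment_scores=True))

# Source-based quality estimation (QE) as a metric
print('System-level QE-as-metric:', prism.score(cand=cand, src=src))
print('Segment-level QE-as-metric:', prism.score(cand=cand, src=src, segment_scores=True))
```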

which should produce:

Prism identifier: {'version': '0.1', 'model': 'm39v1', 'seg_scores': 'avg_log_prob', 'sys_scores': 'avg_log_prob', 'log_base': 2}
System-level metric: -1.0184666
Segment-level metric: [-1.4878583 -0.5490748]
System-level QE-as-metric: -1.8306842
Segment-level QE-as-metric: [-2.462842 -1.1985264]

Multilingual Translation

The Prism model is simply a multilingual NMT model, and can be used for translation -- see the multilingual translation README.

Paraphrase Generation

Attempting to generate paraphrases from the Prism model via naive beam search (e.g. "translate" from French to French) results in trivial copies most of the time. However, we provide a simple algorithm to discourage copying and enable paraphrase generation in many languages -- see the paraphrase generation README.

Supported Languages

Albanian (sq), Arabic (ar), Bengali (bn), Bulgarian (bg), Catalan; Valencian (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Esperanto (eo), Estonian (et), Finnish (fi), French (fr), German (de), Greek, Modern (el), Hebrew (modern) (he), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Latvian (lv), Lithuanian (lt), Macedonian (mk), Norwegian (no), Polish (pl), Portuguese (pt), Romanian, Moldavan (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovene (sl), Spanish; Castilian (es), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi)

Data Filtering

The data filtering scripts used to train the Prism model can be found here.

Publications

If you use the Prism metric and/or the provided multilingual NMT model, please cite our EMNLP paper:

If you use the paraphrase generation algorithm, please also cite our WMT paper:


TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate

  • Published: 15 December 2009
  • Volume 23, pages 117–127 (2009)


  • Matthew G. Snover 1 ,
  • Nitin Madnani 1 ,
  • Bonnie Dorr 1 &
  • Richard Schwartz 2  


This paper describes a new evaluation metric, TER-Plus (TERp) for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy.
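As a rough illustration of the idea (not the authors' implementation), the sketch below computes a word-level edit rate in which substitutions between stem matches or synonyms are discounted; real TERp additionally handles phrase-level paraphrase substitutions, block shifts, and automatically optimized edit costs.

```python
# Toy TER-like edit rate with graded substitution costs (hypothetical cost
# values; omits TERp's paraphrase table, block shifts, and tuned weights).

SYNONYMS = {frozenset(("fast", "quick"))}   # stand-in for WordNet lookups

def stem(word):                             # crude stemmer stand-in
    return word[:-1] if word.endswith("s") else word

def sub_cost(h, r):
    if h == r:
        return 0.0
    if stem(h) == stem(r):
        return 0.2                          # stem match is cheap
    if frozenset((h, r)) in SYNONYMS:
        return 0.3                          # synonym match is cheap
    return 1.0                              # full substitution

def edit_rate(hyp, ref):
    """Levenshtein-style distance with graded substitutions, normalized by |ref|."""
    m, n = len(hyp), len(ref)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,                                   # deletion
                          d[i][j - 1] + 1.0,                                   # insertion
                          d[i - 1][j - 1] + sub_cost(hyp[i - 1], ref[j - 1]))  # substitution
    return d[m][n] / n

print(edit_rate("the car is quick".split(), "the cars are fast".split()))  # ~0.375
```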


Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL 2005 workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, pp 228–231

Bannard C, Callison-Burch C (2005) Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL 2005). Ann Arbor, Michigan, pp 597–604

Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press. http://www.cogsci.princeton.edu/wn Accessed 7 Sep 2000

Kauchak D, Barzilay R (2006) Paraphrasing for automatic evaluation. In: Proceedings of the human language technology conference of the North American chapter of the ACL, pp 455–462

Lavie A, Sagae K, Jayaraman S (2004) The significance of recall in automatic metrics for MT evaluation. In: Proceedings of the 6th conference of the association for machine translation in the Americas, pp 134–143

Leusch G, Ueffing N, Ney H (2006) CDER: efficient MT evaluation using block movements. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, pp 241–248

Lita LV, Rogati M, Lavie A (2005) BLANC: learning evaluation metrics for MT. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP). Vancouver, BC, pp 740–747

Lopresti D, Tomkins A (1997) Block edit models for approximate string matching. Theor Comput Sci 181(1): 159–179


Madnani N, Resnik P, Dorr BJ, Schwartz R (2008) Are multiple reference translations necessary? Investigating the value of paraphrased reference translations in parameter optimization. In: Proceedings of the eighth conference of the association for machine translation in the Americas, pp 143–152

Niessen S, Och F, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In: Proceedings of the 2nd international conference on language resources and evaluation, pp 39–45

Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318

Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137


Przybocki M, Peterson K, Bronsart S (2008) Official results of the NIST 2008 “Metrics for MAchine TRanslation” Challenge (MetricsMATR08). http://nist.gov/speech/tests/metricsmatr/2008/results/

Rosti A-V, Matsoukas S, Schwartz R (2007) Improved word-level system combination for machine translation. In: Proceedings of the 45th annual meeting of the association of computational linguistics. Prague, Czech Republic, pp 312–319

Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of association for machine translation in the Americas, pp 223–231

Snover M, Madnani N, Dorr B, Schwartz R (2009) Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In: Proceedings of the fourth workshop on statistical machine translation. Association for Computational Linguistics, Athens, Greece, pp 259–268

Zhou L, Lin C-Y, Hovy E (2006) Re-evaluating machine translation results with paraphrase support. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 77–84

Author information

Authors and Affiliations

Laboratory for Computational Linguistics and Information Processing, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA

Matthew G. Snover, Nitin Madnani & Bonnie Dorr

BBN Technologies, Cambridge, MA, USA

Richard Schwartz


Corresponding author

Correspondence to Matthew G. Snover.


About this article

Snover, M.G., Madnani, N., Dorr, B. et al. TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. Machine Translation 23 , 117–127 (2009). https://doi.org/10.1007/s10590-009-9062-9


Received: 15 May 2009

Accepted: 16 November 2009

Published: 15 December 2009

Issue Date: September 2009

DOI: https://doi.org/10.1007/s10590-009-9062-9

Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

Keywords

  • Machine translation evaluation
  • Paraphrasing

Computer Science > Computation and Language

Title: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System

Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
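The second component described above amounts to pairwise reward-model training on accepted/rejected report pairs. A minimal hedged sketch of that ranking objective (generic PyTorch with a Bradley-Terry-style loss; the reward head, feature dimension, and toy data are assumptions, not the authors' code):

```python
# Sketch: pairwise reward training on (accepted, rejected) report pairs.
# The reward head scores a pre-computed report encoding; not the MRScore model.
import torch
import torch.nn as nn

class RewardHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):          # feats: (batch, dim) report encodings
        return self.score(feats).squeeze(-1)

def pairwise_loss(r_accepted, r_rejected):
    # Encourage the accepted report to receive a higher reward than the rejected one.
    return -torch.nn.functional.logsigmoid(r_accepted - r_rejected).mean()

model = RewardHead()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Toy batch of encodings standing in for LLM features of paired reports.
acc_feats, rej_feats = torch.randn(8, 768), torch.randn(8, 768)
loss = pairwise_loss(model(acc_feats), model(rej_feats))
loss.backward()
opt.step()
print(float(loss))
```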

