ParaMetric: An Automatic Evaluation Metric for Paraphrasing
- School of Informatics
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
- http://www.aclweb.org/anthology/C08-1013
T1 - ParaMetric: An Automatic Evaluation Metric for Paraphrasing
AU - Callison-Burch, Chris
AU - Cohn, Trevor
AU - Lapata, Mirella
N2 - We present ParaMetric, an automatic evaluation metric for data-driven approaches to paraphrasing. ParaMetric provides an objective measure of quality using a collection of multiple translations whose paraphrases have been manually annotated. ParaMetric calculates precision and recall scores by comparing the paraphrases discovered by automatic paraphrasing techniques against gold standard alignments of words and phrases within equivalent sentences. We report scores for several established paraphrasing techniques.
M3 - Conference contribution
BT - Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
PB - Association for Computational Linguistics
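The precision/recall comparison described in the abstract can be sketched in a few lines. The phrase-pair reduction below is a simplification: real ParaMetric compares word and phrase alignments within equivalent sentences, whereas here each method's output is reduced to a set of (phrase, paraphrase) pairs.

```python
# A minimal sketch of precision/recall over paraphrase pairs, comparing a
# system's discovered pairs against gold-standard pairs.

def parametric_scores(system_pairs, gold_pairs):
    """Return (precision, recall) of system paraphrase pairs vs. gold pairs."""
    system, gold = set(system_pairs), set(gold_pairs)
    matched = system & gold
    precision = len(matched) / len(system) if system else 0.0
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall

gold = {("couch", "sofa"), ("big", "large"), ("quick", "fast")}
system = {("couch", "sofa"), ("big", "huge")}
p, r = parametric_scores(system, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.33
```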
- MACHINE TRANSLATION
- PARAPHRASE GENERATION
On the Evaluation Metrics for Paraphrase Generation
17 Feb 2022 · Lingfeng Shen, Lemao Liu, Haiyun Jiang, Shuming Shi
In this paper we revisit automatic metrics for paraphrase evaluation and obtain two findings that disobey conventional wisdom: (1) Reference-free metrics achieve better performance than their reference-based counterparts. (2) Most commonly used metrics do not align well with human annotation. Underlying reasons behind the above findings are explored through additional experiments and in-depth analyses. Based on the experiments and analyses, we propose ParaScore, a new evaluation metric for paraphrase generation. It possesses the merits of reference-based and reference-free metrics and explicitly models lexical divergence. Experimental results demonstrate that ParaScore significantly outperforms existing metrics.
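As the abstract states, ParaScore combines the merits of reference-based and reference-free metrics and explicitly models lexical divergence. A toy sketch of that shape follows, using `difflib` as a stand-in for the learned similarity model; the weight and the divergence term are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative ParaScore-style metric: semantic similarity plus an explicit
# lexical-divergence bonus. difflib stands in for the embedding-based
# similarity used in the paper.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def para_score(source: str, candidate: str, reference: str,
               weight: float = 0.1) -> float:
    # take the better of candidate-vs-source and candidate-vs-reference
    sem = max(similarity(candidate, source), similarity(candidate, reference))
    # reward surface change relative to the source (illustrative term)
    divergence = 1.0 - similarity(candidate, source)
    return sem + weight * divergence

src = "the cat sat on the mat"
good = "a cat was sitting on the mat"  # meaning kept, wording changed
copy = "the cat sat on the mat"        # trivial copy
print(para_score(src, good, src))
print(para_score(src, copy, src))
```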
MT Evaluation in Many Languages via Zero-Shot Paraphrasing
thompsonb/prism
Prism is an automatic MT metric which uses a sequence-to-sequence paraphraser to score MT system outputs conditioned on their respective human references. Prism uses a multilingual NMT model as a zero-shot paraphraser, which negates the need for synthetic paraphrase data and results in a single model which works in many languages.
Prism outperforms or statistically ties with all metrics submitted to the WMT 2019 metrics shared task in segment-level human correlation.
We provide a large, pre-trained multilingual NMT model which we use as a multilingual paraphraser, but the model may also be of use to the research community beyond MT metrics. We provide examples of using the model for both multilingual translation and paraphrase generation.
Prism scores raw, untokenized text; all preprocessing is applied internally. This document describes how to install and use Prism.
Installation
Prism requires a version of Fairseq compatible with the provided pretrained model. We recommend starting with a clean environment:
For reasonable speeds, we recommend running on a machine with a GPU and the CUDA version compatible with the version of fairseq/torch installed above. Prism will run on a GPU if available; to run on CPU instead, set CUDA_VISIBLE_DEVICES to an empty string.
Download the Prism code and install requirements, including Fairseq:
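One way to set this up, assuming a Unix shell. The repository URL matches the page above; exact package pins live in the repository's requirements file.

```shell
# Clean environment (venv shown; conda works equally well).
python3 -m venv prism-env
. prism-env/bin/activate

# Fetch the code and install requirements, including a compatible Fairseq.
git clone https://github.com/thompsonb/prism.git
cd prism
pip install -r requirements.txt
```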
Download Model
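The pretrained model is the m39v1 checkpoint named in the metric's identifier output below. The download URL here is reproduced from memory of the published instructions and should be verified against the repository README.

```shell
# Download and unpack the pretrained multilingual NMT model (m39v1);
# verify the URL against the repository README before relying on it.
wget http://data.statmt.org/prism/m39v1.tar
tar xf m39v1.tar
```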
Metric Usage: Command Line
Create test candidate/reference files:
To obtain system-level metric scores, run:
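For instance (the `--cand` flag name is an assumption; `--ref` and `--lang` appear in the surrounding text, so check `./prism.py --help` for the exact interface):

```shell
# Create toy candidate/reference files, one segment per line.
printf 'Hi world.\nThis is a Test.\n' > cand.en
printf 'Hello world.\nThis is a test.\n' > ref.en

# Score the candidates against the references; run from the repository
# checkout. The fallback message covers environments without prism.py.
./prism.py --cand cand.en --ref ref.en --lang en \
  || echo "prism.py not found here; run from the repository checkout"
```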
Here, "ref.en" is the (untokenized) human reference, and "cand.en" is the (untokenized) system output. This command will print some logging information to STDERR, including a model/version identifier, and print the system-level score (negative, higher is better) to STDOUT:
Prism identifier: {'version': '0.1', 'model': 'm39v1', 'seg_scores': 'avg_log_prob', 'sys_scores': 'avg_log_prob', 'log_base': 2}
-1.0184667
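The identifier reports that both segment- and system-level scores are average log-probabilities with log base 2. Assuming the system score is the mean of per-segment average token log-probabilities (an assumption consistent with those fields), a toy computation looks like this:

```python
import math

# Toy illustration of 'avg_log_prob' aggregation in base 2: each segment's
# score is the average log2-probability of its tokens under the paraphraser,
# and the system score averages the segment scores. Token probabilities here
# are made up.
def avg_log_prob(token_probs, base=2):
    return sum(math.log(p, base) for p in token_probs) / len(token_probs)

seg1 = avg_log_prob([0.5, 0.25])  # (-1 + -2) / 2 = -1.5
seg2 = avg_log_prob([0.5, 0.5])   # -1.0
system = (seg1 + seg2) / 2
print(system)  # -1.25
```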
Candidates can also be piped into prism.py:
To score output using the source instead of the reference (i.e., quality estimation as a metric), use the --src flag. Note that --lang still specifies the target/reference language:
Prism also has access to all WMT test sets via the sacreBLEU API. These can be specified as arguments to --src and --ref, for a hypothetical system output $cand, as follows:
which will cause it to use the English reference from the WMT19 German--English test set. (Since the language is known, no --lang is needed).
To see all options, including segment-level scoring, run:
Metric Usage: Python Module
All functionality is also available in Python, for example:
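A hedged sketch of that usage follows. The `Prism` class name, the `model_dir`/`lang` constructor arguments, and the `score()` keywords are assumptions inferred from the output shown below, not verified against the repository; the sketch is guarded so it runs even where the package is absent.

```python
# Sketch of Python usage for the metric; names below are assumptions
# inferred from the printed output, not a verified API reference.
cand = ['Hi world.', 'This is a Test.']
ref = ['Hello world.', 'This is a test.']
src = ['Bonjour le monde.', "C'est un test."]

try:
    from prism import Prism
    prism = Prism(model_dir='m39v1/', lang='en')
    print('System-level metric:', prism.score(cand=cand, ref=ref))
    print('Segment-level metric:',
          prism.score(cand=cand, ref=ref, segment_scores=True))
    print('System-level QE-as-metric:', prism.score(cand=cand, src=src))
    prism_available = True
except ImportError:
    prism_available = False
    print('prism package not installed; see Installation above')
```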
This should produce:
Prism identifier: {'version': '0.1', 'model': 'm39v1', 'seg_scores': 'avg_log_prob', 'sys_scores': 'avg_log_prob', 'log_base': 2}
System-level metric: -1.0184666
Segment-level metric: [-1.4878583 -0.5490748]
System-level QE-as-metric: -1.8306842
Segment-level QE-as-metric: [-2.462842 -1.1985264]
Multilingual Translation
The Prism model is simply a multilingual NMT model, and can be used for translation -- see the multilingual translation README.
Paraphrase Generation
Attempting to generate paraphrases from the Prism model via naive beam search (e.g. "translate" from French to French) results in trivial copies most of the time. However, we provide a simple algorithm to discourage copying and enable paraphrase generation in many languages -- see the paraphrase generation README.
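The idea of discouraging copies can be illustrated by re-ranking hypotheses with a penalty on n-gram overlap against the input. The actual Prism algorithm applies its penalty inside beam search; the penalty form and weight below are illustrative assumptions.

```python
# Toy copy-discouraging re-ranker: model score minus a penalty proportional
# to bigram overlap with the source sentence.
def ngram_overlap(hyp, src, n=2):
    grams = lambda toks: {tuple(toks[i:i+n]) for i in range(len(toks) - n + 1)}
    h, s = grams(hyp.split()), grams(src.split())
    return len(h & s) / len(h) if h else 0.0

def rerank(src, hypotheses, alpha=2.0):
    """hypotheses: list of (text, model_log_prob); returns best-first list."""
    scored = [(lp - alpha * ngram_overlap(text, src), text)
              for text, lp in hypotheses]
    return [t for _, t in sorted(scored, reverse=True)]

src = "the cat sat on the mat"
hyps = [("the cat sat on the mat", -0.1),        # trivial copy, high prob
        ("a cat was sitting on the mat", -0.8)]  # genuine paraphrase
print(rerank(src, hyps)[0])  # a cat was sitting on the mat
```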
Supported Languages
Albanian (sq), Arabic (ar), Bengali (bn), Bulgarian (bg), Catalan; Valencian (ca), Chinese (zh), Croatian (hr), Czech (cs), Danish (da), Dutch (nl), English (en), Esperanto (eo), Estonian (et), Finnish (fi), French (fr), German (de), Greek, Modern (el), Hebrew (modern) (he), Hungarian (hu), Indonesian (id), Italian (it), Japanese (ja), Kazakh (kk), Latvian (lv), Lithuanian (lt), Macedonian (mk), Norwegian (no), Polish (pl), Portuguese (pt), Romanian, Moldavan (ro), Russian (ru), Serbian (sr), Slovak (sk), Slovene (sl), Spanish; Castilian (es), Swedish (sv), Turkish (tr), Ukrainian (uk), Vietnamese (vi)
Data Filtering
The data filtering scripts used to train the Prism model can be found here .
Publications
If you use the Prism metric and/or the provided multilingual NMT model, please cite our EMNLP paper:
If you use the paraphrase generation algorithm, please also cite our WMT paper:
TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate
- Published: 15 December 2009
- Volume 23 , pages 117–127, ( 2009 )
- Matthew G. Snover 1 ,
- Nitin Madnani 1 ,
- Bonnie Dorr 1 &
- Richard Schwartz 2
This paper describes a new evaluation metric, TER-Plus (TERp), for automatic evaluation of machine translation (MT). TERp is an extension of Translation Edit Rate (TER). It builds on the success of TER as an evaluation metric and alignment tool and addresses several of its weaknesses through the use of paraphrases, stemming, and synonyms, as well as edit costs that can be automatically optimized to correlate better with various types of human judgments. We present a correlation study comparing TERp to BLEU, METEOR and TER, and illustrate that TERp can better evaluate translation adequacy.
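The tunable edit costs can be illustrated with a simplified word-level edit distance in which substituting known synonyms is cheaper than an arbitrary substitution. Real TERp also handles shifts, stemming, and paraphrase tables; the cost values and synonym table below are illustrative.

```python
# Simplified TERp-style edit distance with tunable costs: synonym
# substitutions (toy table) cost less than arbitrary substitutions.
SYNONYMS = {frozenset({"quick", "fast"}), frozenset({"big", "large"})}
COST = {"ins": 1.0, "del": 1.0, "sub": 1.0, "syn": 0.2}

def terp_like(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0.0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        d[i][0] = i * COST["del"]
    for j in range(1, len(r) + 1):
        d[0][j] = j * COST["ins"]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            if h[i-1] == r[j-1]:
                sub = 0.0
            elif frozenset({h[i-1], r[j-1]}) in SYNONYMS:
                sub = COST["syn"]
            else:
                sub = COST["sub"]
            d[i][j] = min(d[i-1][j] + COST["del"],
                          d[i][j-1] + COST["ins"],
                          d[i-1][j-1] + sub)
    return d[len(h)][len(r)] / len(r)  # normalize by reference length

print(terp_like("a quick dog", "a fast dog"))  # 0.2 / 3, i.e. about 0.067
```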
Banerjee S, Lavie A (2005) METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL 2005 workshop on intrinsic and extrinsic evaluation measures for MT and/or summarization, pp 228–231
Bannard C, Callison-Burch C (2005) Paraphrasing with bilingual parallel corpora. In: Proceedings of the 43rd annual meeting of the association for computational linguistics (ACL 2005). Ann Arbor, Michigan, pp 597–604
Fellbaum C (1998) WordNet: an electronic lexical database. MIT Press. http://www.cogsci.princeton.edu/wn Accessed 7 Sep 2000
Kauchak D, Barzilay R (2006) Paraphrasing for automatic evaluation. In: Proceedings of the human language technology conference of the North American chapter of the ACL, pp 455–462
Lavie A, Sagae K, Jayaraman S (2004) The significance of recall in automatic metrics for MT evaluation. In: Proceedings of the 6th conference of the association for machine translation in the Americas, pp 134–143
Leusch G, Ueffing N, Ney H (2006) CDER: efficient MT evaluation using block movements. In: Proceedings of the 11th conference of the European chapter of the association for computational linguistics, pp 241–248
Lita LV, Rogati M, Lavie A (2005) BLANC: learning evaluation metrics for MT. In: Proceedings of human language technology conference and conference on empirical methods in natural language processing (HLT/EMNLP). Vancouver, BC, pp 740–747
Lopresti D, Tomkins A (1997) Block edit models for approximate string matching. Theor Comput Sci 181(1): 159–179
Madnani N, Resnik P, Dorr BJ, Schwartz R (2008) Are multiple reference translations necessary? Investigating the value of paraphrased reference translations in parameter optimization. In: Proceedings of the eighth conference of the association for machine translation in the Americas, pp 143–152
Niessen S, Och F, Leusch G, Ney H (2000) An evaluation tool for machine translation: fast evaluation for MT research. In: Proceedings of the 2nd international conference on language resources and evaluation, pp 39–45
Papineni K, Roukos S, Ward T, Zhu W-J (2002) Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting of the association for computational linguistics, pp 311–318
Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137
Przybocki M, Peterson K, Bronsart S (2008) Official results of the NIST 2008 “Metrics for MAchine TRanslation” Challenge (MetricsMATR08). http://nist.gov/speech/tests/metricsmatr/2008/results/
Rosti A-V, Matsoukas S, Schwartz R (2007) Improved word-level system combination for machine translation. In: Proceedings of the 45th annual meeting of the association of computational linguistics. Prague, Czech Republic, pp 312–319
Snover M, Dorr B, Schwartz R, Micciulla L, Makhoul J (2006) A study of translation edit rate with targeted human annotation. In: Proceedings of association for machine translation in the Americas, pp 223–231
Snover M, Madnani N, Dorr B, Schwartz R (2009) Fluency, adequacy, or HTER? Exploring different human judgments with a tunable MT metric. In: Proceedings of the fourth workshop on statistical machine translation. Association for Computational Linguistics, Athens, Greece, pp 259–268
Zhou L, Lin C-Y, Hovy E (2006) Re-evaluating machine translation results with paraphrase support. In: Proceedings of the 2006 conference on empirical methods in natural language processing (EMNLP 2006), pp 77–84
Author information
Authors and affiliations.
Laboratory for Computational Linguistics and Information Processing, Institute for Advanced Computer Studies, University of Maryland, College Park, MD, USA
Matthew G. Snover, Nitin Madnani & Bonnie Dorr
BBN Technologies, Cambridge, MA, USA
Richard Schwartz
Corresponding author
Correspondence to Matthew G. Snover .
About this article
Snover, M.G., Madnani, N., Dorr, B. et al. TER-Plus: paraphrase, semantic, and alignment enhancements to Translation Edit Rate. Machine Translation 23 , 117–127 (2009). https://doi.org/10.1007/s10590-009-9062-9
Received : 15 May 2009
Accepted : 16 November 2009
Published : 15 December 2009
Issue Date : September 2009
DOI : https://doi.org/10.1007/s10590-009-9062-9
- Machine translation evaluation
- Paraphrasing
Computer Science > Computation and Language
Title: MRScore: Evaluating Radiology Report Generation with LLM-based Reward System
Abstract: In recent years, automated radiology report generation has experienced significant growth. This paper introduces MRScore, an automatic evaluation metric tailored for radiology report generation by leveraging Large Language Models (LLMs). Conventional NLG (natural language generation) metrics like BLEU are inadequate for accurately assessing the generated radiology reports, as systematically demonstrated by our observations within this paper. To address this challenge, we collaborated with radiologists to develop a framework that guides LLMs for radiology report evaluation, ensuring alignment with human analysis. Our framework includes two key components: i) utilizing GPT to generate large amounts of training data, i.e., reports with different qualities, and ii) pairing GPT-generated reports as accepted and rejected samples and training LLMs to produce MRScore as the model reward. Our experiments demonstrate MRScore's higher correlation with human judgments and superior performance in model selection compared to traditional metrics. Our code and datasets will be available on GitHub.
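The accepted/rejected pairing described in component (ii) suggests a standard pairwise reward objective, in which the reward model is trained so accepted reports outscore rejected ones. The Bradley-Terry-style loss below is an assumption about the training signal, not the paper's confirmed objective, and the reward values are made up.

```python
import math

# Pairwise reward loss: -log(sigmoid(r_accepted - r_rejected)). The loss is
# small when the accepted report already outscores the rejected one, and
# large otherwise.
def pairwise_reward_loss(r_accepted: float, r_rejected: float) -> float:
    return -math.log(1.0 / (1.0 + math.exp(-(r_accepted - r_rejected))))

print(pairwise_reward_loss(2.0, 0.5))  # small loss: correct ranking
print(pairwise_reward_loss(0.5, 2.0))  # large loss: wrong ranking
```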