Fine-grained AMR Evaluation



To compare two alternative AMR (Abstract Meaning Representation; http://amr.isi.edu) graphs representing a sentence, previous work focused on a single metric, Smatch (http://amr.isi.edu/evaluation.html), which assesses the overall similarity between them.

AMR is a complex meaning representation that embeds many traditional NLP problems, such as word sense disambiguation, named entity recognition, coreference resolution, and semantic role labeling. AMR parsers (automatic tools that convert natural language sentences into AMR graphs) therefore need to deal with all of them. When comparing two AMR parsers to decide which one works better on a specific dataset or domain, we need to take into account that a parser might do well on one subtask and worse on another.

In our EACL paper (https://arxiv.org/abs/1608.06111) we proposed a set of metrics that can be used in addition to the traditional Smatch score:

  • Unlabeled Smatch. Smatch score computed on the predicted graphs after removing all edge labels
  • No WSD. Smatch score computed while ignoring PropBank senses (e.g., duck-01 vs duck-02)
  • Named Entities. F-score on named entity recognition (:name roles)
  • Wikification. F-score on wikification (:wiki roles)
  • Negations. F-score on negation detection (:polarity roles)
  • Concepts. F-score on the concept identification task
  • Reentrancy. Smatch computed on reentrant edges only
  • SRL. Smatch computed on :ARG-i roles only
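
As a rough illustration of how the F-score based metrics and the "No WSD" idea behave, the following Python sketch compares two lists of concepts with and without PropBank sense suffixes. It is not the code from the repository: the function names (strip_sense, f_score), the regular expression for sense suffixes, and the example concept lists are all assumptions made for this example. The actual No WSD metric is a Smatch variant computed on whole graphs, whereas this sketch only applies the same sense-stripping idea to flat concept lists.

```python
import re
from collections import Counter

# Assumed format of a PropBank sense suffix: a hyphen followed by digits
# at the end of the concept name, as in "duck-01".
SENSE_SUFFIX = re.compile(r"-\d+$")


def strip_sense(concept):
    """Remove a trailing sense number, e.g. 'duck-01' becomes 'duck'."""
    return SENSE_SUFFIX.sub("", concept)


def f_score(predicted, gold):
    """Return (precision, recall, F1) over two multisets of items."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # Multiset intersection: each item matches at most as often as it
    # occurs in both lists.
    matched = sum((pred_counts & gold_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted concepts for a single sentence.
gold = ["duck-01", "boy", "ball"]
pred = ["duck-02", "boy", "girl"]

# Concept F-score with senses: the wrong sense counts as an error.
print(f_score(pred, gold))

# Same comparison after ignoring senses: duck-02 now matches duck-01.
print(f_score([strip_sense(c) for c in pred],
              [strip_sense(c) for c in gold]))
```

In this toy example the score rises once senses are ignored, which is the kind of difference the fine-grained metrics are meant to expose: a parser can identify the right predicate but pick the wrong sense.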

Source code and usage instructions are available at https://github.com/mdtux89/amr-evaluation

References

@inproceedings{damonte-17,
   title={An Incremental Parser for Abstract Meaning Representation},
   author={Marco Damonte and Shay B. Cohen and Giorgio Satta},
   booktitle={Proceedings of {EACL}},
   year={2017}
}