
A taste of ACL2020: 6 new Datasets & Benchmarks

Datasets and benchmarks are at the core of progress in Natural Language Understanding (NLU): in leaderboard-driven research, progress is upper-bounded by the quality of our evaluations. Machine Learning datasets used to last (models didn't reach human performance on MNIST until more than a decade after it was introduced), but the latest NLU benchmarks are becoming obsolete faster than we expected, highlighting the importance of finding better ones.

The sheer number of papers on the topic is quite astounding, so at Zeta-Alpha we have curated this selection of the most interesting works at ACL2020 that will influence how progress is measured in the field.

In this paper, which is already making waves at 20+ citations, the authors eloquently make the case for why static NLU benchmarks become obsolete so fast: models often exploit spurious statistical patterns¹ that were undetectable during the data collection phase.

They introduce a dataset for Natural Language Inference (NLI), where, given a premise and a hypothesis, one must determine whether the relation is entailment, contradiction, or neutral. The catch is that they also introduce a framework to iterate on the dataset based on feedback from a trained model, with humans in the loop writing adversarial examples, with the aim of creating a dataset on which the model fails. Each round of annotation, shown in the figure below, consists of:

  • Annotate a dataset and train a model on it.
  • Have annotators write new adversarial hypotheses for a given context and test them against the trained model.
  • If the model classifies a sample correctly, add it to the training set.
  • If the model fails and a second human agrees with the annotation, add the sample to the dev, test, or training sets.
Diagram of Adversarial human-in-the-loop data collection. Source: Adversarial NLI: A New Benchmark for Natural Language Understanding

The authors call this process HAMLET (Human-And-Model-in-the-Loop Enabled Training), and in the paper they showcase the creation of a dataset in 3 rounds, where annotators are incentivized to come up with hypotheses on which the models will fail. This results in an increasingly challenging dataset and, as a side result, they reach state-of-the-art performance on some variants of the MNLI dataset. While they speculate that this benchmark will not saturate soon thanks to how it was collected, they emphasize that even if it does, new rounds can be added to overcome it.
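The routing logic behind a HAMLET-style round can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the model and the human verifier are simulated stand-ins, and all names are hypothetical.

```python
LABELS = ["entailment", "contradiction", "neutral"]

def model_predict(premise, hypothesis):
    """Dummy stand-in for a trained NLI model: a deterministic
    pseudo-prediction derived from the input lengths."""
    return LABELS[(len(premise) + len(hypothesis)) % 3]

def collection_round(examples, verify):
    """One HAMLET-style round: route each annotated (premise,
    hypothesis, gold label) triple by whether the current model
    already handles it."""
    train, adversarial = [], []
    for premise, hypothesis, gold in examples:
        if model_predict(premise, hypothesis) == gold:
            # Model succeeds: the sample still joins the training set.
            train.append((premise, hypothesis, gold))
        elif verify(premise, hypothesis, gold):
            # Model fails and a second human agrees: a verified
            # adversarial sample, usable for dev/test/training splits.
            adversarial.append((premise, hypothesis, gold))
    return train, adversarial
```

In the real pipeline, `verify` would be a second human annotator confirming the label, and the loop would repeat over rounds with the model retrained on the growing dataset.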

The main inconvenience of dynamic datasets like this is the difficulty of standardization, which is what enables apples-to-apples comparisons across works. While adversarial human-in-the-loop collection is not a new idea, this clean instance has the potential to inspire future iterations and perhaps overcome the standardization obstacles, so that dynamic datasets become the norm in the near future.

This paper presents a full-fledged language benchmark, inspired by the success of the GLUE benchmark, consisting of 7 tasks that not only include labels but also human-annotated ‘rationales’ for them: Evidence Inference, BoolQ (boolean QA), Movie Reviews, FEVER (fact extraction and verification), MultiRC (reading comprehension), Commonsense Explanations (CoS-E), and e-SNLI (language entailment).

The authors propose an Area Under the Precision-Recall Curve (AUPRC) metric for evaluating the agreement between model and human-annotated rationales, but they are aware that this evaluation is hard to make objective, which is why they explicitly call for more research in this direction.
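To make the metric concrete, here is a minimal, dependency-free sketch of AUPRC over token-level rationale scores: tokens are ranked by the model's importance score and compared against binary human annotations (1 = the token is part of the human rationale). This is an illustrative step-wise computation, not the benchmark's official scorer.

```python
def auprc(scores, labels):
    """Area under the precision-recall curve for token-level
    rationale scores against binary human rationale annotations.
    Uses step-wise interpolation over the ranked tokens."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    total_pos = sum(labels)
    tp, area, prev_recall = 0, 0.0, 0.0
    for rank, (_, label) in enumerate(pairs, start=1):
        tp += label
        precision = tp / rank
        recall = tp / total_pos
        # Accumulate precision over each step of recall gained.
        area += precision * (recall - prev_recall)
        prev_recall = recall
    return area
```

A model that ranks every human-rationale token above every other token scores 1.0; random rankings score close to the fraction of positive tokens.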

The proposed benchmark is an early step towards an ambitious vision on more explainable comprehensive evaluation of Language Models.

Examples of tasks within the ERASER benchmark. Source: ERASER: A Benchmark to Evaluate Rationalized NLP Models

Sentiment Analysis has long been a fundamental task in NLP, but on some of the most widely used datasets, such as SST2 with its binary positive/negative labels, models now surpass human performance, rendering them obsolete for measuring meaningful progress.

GoEmotions is a dataset of 58k manually annotated samples from comments on popular English subreddits. It is very fine-grained, with 27 emotion labels (or neutral). The data collection process adheres to high standards: full manual review, length filtering, sentiment balancing, subreddit balancing, and masking of proper names and religion terms. Early baseline tests with a BERT model indicate that there's a lot of room for improvement and that current state-of-the-art NLU models fail to understand emotion at this granularity, making it a challenging new sentiment benchmark to focus on.
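Since GoEmotions allows multiple emotion labels per comment, evaluation is a multi-label problem, typically reported as macro-averaged F1 over the label set. The sketch below shows that computation from scratch; it is illustrative only (the labels and predictions here are made up), not the paper's evaluation code.

```python
def macro_f1(gold, pred, labels):
    """Macro-averaged F1 over a fixed emotion label set, a common
    headline metric for multi-label classification. `gold` and
    `pred` are lists of label sets, one per comment."""
    f1s = []
    for label in labels:
        tp = sum(label in g and label in p for g, p in zip(gold, pred))
        fp = sum(label not in g and label in p for g, p in zip(gold, pred))
        fn = sum(label in g and label not in p for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights all 27 emotions equally, which matters here because the class distribution is heavily skewed toward a few frequent emotions.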

Class statistics for the GoEmotions dataset, along with a heatmap of correlations of emotions and their clustering into coarser kinds of emotion (positive, negative and ambiguous). Source: GoEmotions: A Dataset of Fine-Grained Emotions

These kinds of datasets are extremely valuable in this day and age, helping us make sense of the complex, large-scale social dynamics that play out on the internet.

On a similar note, also at ACL2020, iSarcasm: A Dataset of Intended Sarcasm focuses on the distinction between intended and perceived sarcasm, so that we can overcome the current bias of models toward detecting only its more obvious forms. This dataset, more modest in size at 4.4k samples, also stresses the importance of the topic as a means of understanding social interactions in the context of social media.

The Sentence Cloze task consists of filling sentence-sized gaps in a text from a set of candidates. Similar sentence-level tasks are often used as self-supervision signals for language model pre-training (e.g. BERT's next sentence prediction); however, the task is often too simple, because models can rely on spurious surface patterns when the self-supervised sentence candidates are not challenging enough.

In this work, the authors introduce distractor sentences: human-curated sentences, designed by English teachers, that make the task require non-local, discourse-level aspects of language. Current models only achieve an accuracy of 72%, whereas humans reach around 87%, showing considerable room for improvement.

Humans don’t learn language in isolation, so why should we expect machines to? Multimodal machine learning explores the idea of leveraging different modes of data, such as vision and language, to build better models of the world.

“[…] a language understanding system should be able to classify images depicting recess and remorse, not just cats, dogs and bridges.”

After this provocative depiction of most current multimodal vision-and-language datasets, this work builds the BabelPic dataset, focusing on non-concrete concepts as a step toward widening the coverage of multimodal semantic understanding. The dataset is built by combining the WordNet² and BabelNet³ lexical knowledge bases.

Examples of non-concrete samples from BabelPic. Source: Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts

After many filtering tricks, heuristics, and manual validation, the resulting ‘gold’ dataset has 2.7k synsets (synonym sets) and 15k matched images, plus an extended ‘silver’ set with 10k synsets, generated by a vision-language model using the natural language definitions in WordNet.

As Adversarial NLI also points out, many reading comprehension tasks rely on annotation artifacts and other biases in existing datasets, enabling completion of the task without any real understanding. To mitigate this, Naoya Inoue et al. present a task that requires not only finding the correct answer in a Reading Comprehension task, but also providing the adequate supporting facts.

Example of question answering including rationales; the example is from HotpotQA⁴. Sources: R4C: A Benchmark for Evaluating RC Systems to Get the Right Answer for the Right Reason & HotpotQA⁴

The resulting annotated dataset totals 7.1k training and 6.6k development samples, sampled from the HotpotQA⁴ dataset, with rationales for each answer included in the annotations. Evaluation of the task involves scoring answers and assessing the correctness of the ‘rationale alignments’ against the ground truth.
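One simple way to score supporting facts, shown here as an illustrative stand-in for R4C's rationale-alignment evaluation (the paper defines its own, more structured alignment metric), is a set-overlap F1 between predicted and gold supporting facts, each identified by a (document, sentence index) pair. All identifiers below are hypothetical.

```python
def rationale_f1(predicted, gold):
    """Set-overlap F1 between predicted and gold supporting facts,
    e.g. (document_title, sentence_index) pairs. A simplified proxy
    for rationale-alignment scoring in rationale-annotated RC."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)
```

A full evaluation would combine this with answer correctness, so that a system is only rewarded for getting the right answer for the right reason.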