When a Bear and a Cat Beget a Monkey: Testing Domain-Adaptation Capabilities of Disambiguation Models
Introducing WiC-TSV-de, a German dataset for Target Sense Verification.
(TL;DR at the end)
Being able to distinguish the intended sense of a word with multiple meanings is an important prerequisite for many applications and therefore a longstanding task in the Natural Language Processing (NLP) research community. When applied in domain-specific and enterprise settings, models for this disambiguation task often need to satisfy a number of additional requirements in order to be of practical use.
The amount of available training data in these use cases is often not sufficient to train a domain-specific model from scratch. It is therefore common practice to pre-train models on general-domain data before fine-tuning them on the specific domain. As general-domain data is plentifully available, it can be used to teach the models general aspects of language (syntax, grammar, semantics, etc.). The much smaller domain-specific dataset then only needs to adapt the model to the specific domain.
When using such approaches, two aspects have to be taken into account. First, the task formulation on top of which the model was trained needs to be flexible enough to fit the use case and the corresponding data sources at hand. And second, the pre-trained model must have generalised the task in such a way that domain adaptation is as easy as possible.
In this post, we will take a deeper look at
- a flexible task formulation for the disambiguation task
- a general-domain German dataset for training models for disambiguation
- a challenging test set to evaluate disambiguation and domain-adaptation capabilities of disambiguation models
Disclaimer: This post is based on a recent publication at LREC 2022 titled “WiC-TSV-de: German Word-in-Context Target-Sense-Verification Dataset and Cross-Lingual Transfer Analysis”, published by myself, Artem Revenko, and Narayani Blaschke.
A Flexible Disambiguation Task Formulation
Traditionally, disambiguation is tackled with the Word Sense Disambiguation (WSD) task formulation, where we try to find out which sense of a word is used in a given sentence. To this end, the target word is compared to all senses in a given sense inventory in order to pick the most suitable one.
Despite the wide adoption of this task formulation, it comes with several disadvantages:
As we try to find the most suitable sense, systems always have to model the entire sense inventory. This not only reduces the flexibility of the model but also assumes that all senses are available within the sense inventory, which, especially for domain-specific meanings, is not always the case (ever tried to collect all companies owned by Alphabet?). Additionally, in a domain-specific setting or a specific use case, we are most likely interested in a rather small number of specific senses as opposed to all possible senses.
Therefore, in previous work, we introduced a new task formulation for disambiguation called Target Sense Verification (TSV). Given a word in context on the one side and a target sense on the other, the aim is to verify whether the word in context is used in the target sense or not.
The advantage of this formulation is that we are independent of sense inventories, so we can create more flexible models. Also, as the task is formulated as a binary classification task, we no longer have to assume that all senses are available; we only need the ones we are actually interested in.
Generally, models trained on the TSV formulation are much easier to reuse for a given use case, as the requirements for how the senses of interest are provided are minimal.
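As a minimal sketch, TSV reduces to a single binary decision over a context/sense pair. The function below is a deliberately naive lexical-overlap baseline of my own (not from the paper), just to make the interface concrete; a real system would use a trained classifier in its place:

```python
def verify_target_sense(context: str, target_word: str, hypernyms: list) -> bool:
    """Target Sense Verification as a binary decision.

    Naive illustrative baseline: predict a match whenever one of the
    target sense's hypernyms occurs verbatim in the context.
    """
    context_words = set(context.lower().split())
    return any(h.lower() in context_words for h in hypernyms)


# The computer-science sense of "fork" (hypernym: "process"):
print(verify_target_sense("The parent process spawns a fork", "fork", ["process"]))  # True
print(verify_target_sense("She ate with a fork", "fork", ["process"]))               # False
```

Note that the model only ever sees the one sense it is asked about, which is exactly what makes the formulation independent of a full sense inventory.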
For more information on the advantages of the TSV task formulation, I recommend taking a look at this post.
A German Dataset for TSV
The newly introduced WiC-TSV-de dataset provides the opportunity to train and evaluate disambiguation models based on the Target Sense Verification formulation in German. It consists of over 4000 instances, split into balanced training, development and test sets.
Each instance consists of a context containing the target word, as well as the target sense, described by its hypernyms and a definition.
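A single instance can thus be thought of as a small record. The field names below are illustrative, not the dataset's actual column names, and the example is an invented English one based on the "fork" discussion later in this post:

```python
from dataclasses import dataclass


@dataclass
class TSVInstance:
    context: str       # sentence containing the target word
    target_word: str   # the word to be disambiguated
    hypernyms: list    # hypernyms describing the target sense
    definition: str    # definition of the target sense
    label: bool        # True if the word in context matches the target sense


# Invented example: the cutlery sense of "fork", used in a matching context.
example = TSVInstance(
    context="The waiter placed a fork next to the plate.",
    target_word="fork",
    hypernyms=["cutlery", "utensil"],
    definition="a pronged utensil used for eating",
    label=True,
)
```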
While the training and development sets consist only of general-domain instances, the test set also contains domain-specific subsets originating from four different domains:
- Gastronomy (FOOD)
- Hunting (HUNT)
- Medicine (MED)
- and Zoology (ZOO)
Having these domain-specific test sets, we can not only test the general performance of a model but also its ability to transfer the learned knowledge into specific domains, without further training.
Difficult Test Instances
Having a range of domain-specific examples is a great start for evaluating whether a model is suitable for out-of-the-box domain adaptation; however, it is also important that these examples are challenging. To create difficult domain-specific test sets, different strategies were applied in the WiC-TSV-de dataset.
In the following section, a few examples from the dataset are shown. Although English translations are provided, it is of course hard to convey the ambiguity of specific terms in another language. I will do my best to explain why these are difficult examples, but I apologise in advance to all non-German speakers if I fail to fully do so.
Strategy 1: Commonality
The four domains differ in their characteristics. On the one hand, their target senses vary in their commonality: some of the target senses correspond to the most common sense, while others might only be known by domain experts.
To give an example of what is meant by this: the most common sense of the English term fork is the one describing cutlery. We could use this sense in the gastronomy domain to have a very common target sense. On the other hand, if we used the term fork with its sense from the computer science domain (i.e., a process that creates a copy of itself), we would be using a much less common target sense.
Aside from using less common target senses, we applied further strategies to collect hard testing instances.
Strategy 2a: In-Domain Ambiguities
This strategy concerns in-domain ambiguity, i.e., words that have two different meanings within the same domain. An example is the word Bär, which refers to a bear (like a grizzly bear), but in the hunting domain also to a male marmot. Looking at a concrete example:
Here we can see a context that translates roughly to:
“I would like to add that the cats and the one-year-old marmots scream much more than the older bears”.
A helpful piece of background knowledge here is that female marmots are called cats in hunting language. (And baby marmots are called… you guessed it, monkeys.)
Obviously, in-domain ambiguities are very hard to disambiguate, as the model cannot simply rely on the domain of the entire context but really has to focus on the target word and sense.
Strategy 2b: Mixed Contexts
In-domain ambiguities are a great way to create challenging instances; however, their number is of course limited. Where in-domain ambiguities cannot be exploited, mixed contexts can be used instead.
Mixing contexts means that an in-domain sense is used in an out-of-domain context. For example, if we say,
the virus gets its name from the distinctive corona of sugar proteins that project from its surface
although the context talks about the coronavirus, the target sense for corona does not refer to the virus but to the arrangement of its proteins.
An example from the dataset could be taken from the gastronomy domain:
Here, the context talks about our visual system, and the target word is Apfel, i.e., apple, the fruit of the apple tree. In German, however, the eyeball literally translates to “eye apple” (Augapfel), so in this case the context talks about the visual system, but the Apfel within it does not refer to the “eye apple”.
Strategy 2c: Trigger Words
Another strategy for creating hard instances is to use trigger words, where not the entire context but only specific words draw the connection to a target sense from a different domain. For example, when we say,
the Jaguar was run over by a sports car
where Jaguar in this context refers to the animal, and sports car establishes a link to the sense of the luxury vehicle company. While this example might be easy, trigger-word examples can also be harder. For example, let’s take a look at this instance from the zoology domain.
The context is about the Citroen 2CV, which in German-speaking countries is most commonly referred to as Ente, i.e., duck, like the bird, which is our target sense in this case. Further, it is important to know that the German word for moulting not only refers to the shedding of old feathers in birds but can also be used as a synonym for transforming. With this in mind, the context of this example literally translates to,
But over time the duck moulted into a beloved cult car.
Distribution of difficult instances over domains
The strategies for difficult instances were applied to different degrees in the different domains.
To quantify the target sense commonality (strategy 1) of the instances in the different domains, we compared the target sense to the first sense provided in the crowd-sourced online dictionary Wiktionary, and also checked whether the sense was present at all in Wiktionary. Furthermore, we checked whether the target sense was present in the German expert-curated dictionary Duden (Duden does not provide a commonality ordering).
As we can see, in the Zoology and Medical domains, a large number of instances use the first sense of the target word as the target sense. For Gastronomy, while the target senses are in most cases present in the dictionary, not even 1/4 of them correspond to the first sense of the target word. And in the Hunting domain, not even 2/3 of the target senses are common in general, and almost none of them correspond to the most common sense.
When looking at the distribution of instances originating from the different strategies for creating hard testing instances (strategy 2), we can see that the Medical domain contains the fewest hard examples, only around 11%, while Hunting again contains the most: every third instance is a hard example.
So how do these difficult examples influence the performance of disambiguation models?
To investigate this, let’s take a look at an evaluation.
Evaluation on WiC-TSV-de
The basis of the following analysis is a BERT-based classifier called HyperBERT, which was trained on the full training set provided by the WiC-TSV-de dataset. The overall architecture can be seen below.
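The paper describes the exact architecture; as a rough sketch, a BERT-style TSV classifier typically encodes the context and the sense description (definition plus hypernyms) as one sentence pair and classifies the resulting [CLS] representation. Building such a paired input string might look like the following, where the separator layout and helper name are my own assumptions, not necessarily HyperBERT's:

```python
def build_pair_input(context: str, definition: str, hypernyms: list,
                     cls: str = "[CLS]", sep: str = "[SEP]") -> str:
    # Segment A: the context containing the target word.
    # Segment B: the target sense, described by its definition and hypernyms.
    sense_description = f"{definition} ; {', '.join(hypernyms)}"
    return f"{cls} {context} {sep} {sense_description} {sep}"


print(build_pair_input("She ate with a fork",
                       "a pronged utensil used for eating",
                       ["cutlery", "utensil"]))
```

A binary classification head on top of this encoding then predicts whether the word in context matches the described target sense.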
In order to investigate the relationship between the performance on specific subsets and the aforementioned strategies for creating challenging instances, we calculated the corresponding Pearson correlation coefficients.
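The Pearson coefficient itself is straightforward to compute. The subset scores below are placeholders for illustration, not the paper's numbers:

```python
def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


# Placeholder values: per-domain accuracy vs. share of hard instances.
accuracy      = [0.80, 0.75, 0.70, 0.60]
hard_fraction = [0.11, 0.20, 0.25, 0.33]
print(round(pearson_r(accuracy, hard_fraction), 2))  # → -0.98
```

A coefficient near -1 means performance drops almost linearly as the share of hard instances grows, which is the kind of relationship the analysis below examines.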
Target Sense Commonality
For the target sense commonality, we calculated the correlation between the prediction performance and the percentage of instances in which the target sense was
- listed as first sense in the German Wiktionary
- listed as any sense in the German Wiktionary
- listed as any sense in the German expert-curated dictionary Duden
Analysing the influence of the target sense commonality, we can see that, although the Pearson correlation coefficient is always positive, the overall correlation is not very high. Interestingly, however, the correlation between the performance and Duden is much higher than between the performance and Wiktionary. This could indicate that expert-curated sources such as Duden simply reflect the true commonality of senses better than crowd-sourced resources such as Wiktionary.
In-Domain Ambiguities, Mixed Contexts and Trigger Words
The analysis of the influence of the different strategies for retrieving hard contexts paints a much clearer picture.
Here we can see that all strategies clearly correlate negatively with the performance, with Pearson correlation coefficients ranging from -0.71 to -0.92: the more hard instances we have, the worse the performance of the model gets. We can also see that the correlation increases when we combine different strategies.
This analysis indicates that the introduced strategies for collecting examples indeed lead to hard instances.
TL;DR
Based on a binary formulation of the disambiguation task (i.e., Target Sense Verification), the German WiC-TSV-de dataset was introduced. The test set of this dataset contains four domain-specific subsets, making it suitable for evaluating both the disambiguation and domain-adaptation capabilities of disambiguation models. Furthermore, different strategies for collecting hard test instances were applied, making this dataset highly challenging. More insights, e.g., on the performance of the HyperBERT model on the different subsets, can be found in the paper.