DaN+: Danish Nested Named Entities and Lexical Normalization

Implemented By:

Barbara Plank
bapl@itu.dk
IT University of Copenhagen

Kristian Nørgaard Jensen
krnj@itu.dk
IT University of Copenhagen

Rob van der Goot
robv@itu.dk
IT University of Copenhagen

Description:

This paper introduces DaN+, a new multi-domain corpus and annotation guidelines for Danish nested named entities (NEs) and lexical normalization, to support research on cross-lingual, cross-domain learning for a less-resourced language. We empirically assess three strategies for modeling the two-layer Named Entity Recognition (NER) task. We compare transfer capabilities from German with in-language annotation from scratch. We examine language-specific versus multilingual BERT, and study the effect of lexical normalization on NER. Our results show that

1) the most robust strategy is multi-task learning, which is rivaled by multi-label decoding,
2) BERT-based NER models are sensitive to domain shifts, and
3) in-language BERT and lexical normalization are the most beneficial on the least canonical data.

Our results also show that an out-of-domain setup remains challenging, while performance on news plateaus quickly. This highlights the importance of cross-domain evaluation of cross-lingual transfer.
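
To make the two-layer setup more concrete, the sketch below shows how one sentence with nested entities could be encoded for different modeling strategies. It is only an illustration: the sentence, the BIO tags, and all function names are hypothetical and not taken from the corpus or the authors' code. The description names multi-task learning and multi-label decoding; the merged-label variant is an assumption added here for contrast.

```python
# Hypothetical sentence with two annotation layers (illustrative BIO tags,
# not taken from the DaN+ corpus).
tokens = ["Københavns", "Universitet", "ligger", "i", "Danmark"]
outer = ["B-ORG", "I-ORG", "O", "O", "B-LOC"]  # first (outer) layer
inner = ["B-LOC", "O", "O", "O", "O"]          # second (nested) layer


def as_multi_task(outer, inner):
    """Keep the layers as two separate tag sequences, one per task."""
    return {"layer1": list(outer), "layer2": list(inner)}


def as_merged_labels(outer, inner, sep="|"):
    """Concatenate both layers into a single label per token.

    Assumed variant for contrast; not named in the description above.
    """
    return [f"{o}{sep}{i}" for o, i in zip(outer, inner)]


def as_multi_label(outer, inner):
    """Represent each token as the set of its non-"O" labels."""
    return [{t for t in (o, i) if t != "O"} for o, i in zip(outer, inner)]


if __name__ == "__main__":
    merged = as_merged_labels(outer, inner)
    multi = as_multi_label(outer, inner)
    for tok, m, s in zip(tokens, merged, multi):
        print(tok, m, sorted(s))
```

In the multi-task view a model predicts each layer with its own output, while in the multi-label view a single output may assign several labels to the same token; the merged view grows the label space with every observed combination of layers.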

This work uses MaChAmp, which is based on AllenNLP.

Tags:
- named entity recognition
- named entity detection
- lexical normalization
- domain adaptation
- Danish
AllenNLP Version: 1.1
Languages: Unknown
Datasets:
- DaN+
Submitted On Apr 1, 2021