What Matters for Neural Cross-Lingual Named Entity Recognition: An Empirical Analysis
Xiaolei Huang, Jonathan May, and Nanyun Peng, in 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short, 2019.
Bib Entry
@inproceedings{huang2019matters,
title = {What Matters for Neural Cross-Lingual Named Entity Recognition: An Empirical Analysis},
author = {Huang, Xiaolei and May, Jonathan and Peng, Nanyun},
booktitle = {2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), short},
year = {2019}
}
Related Publications
Learning A Unified Named Entity Tagger From Multiple Partially Annotated Corpora For Efficient Adaptation
Xiao Huang, Li Dong, Elizabeth Boschee, and Nanyun Peng, in The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL), 2019.
Abstract: Named entity recognition (NER) identifies typed entity mentions in raw text. While the task is well-established, there is no universally used tagset: often, datasets are annotated for use in downstream applications and accordingly cover only a small set of entity types relevant to a particular task. For instance, in the biomedical domain, one corpus might annotate genes, another chemicals, and another diseases, despite the texts in each corpus containing references to all three types of entities. In this paper, we propose a deep structured model to integrate these “partially annotated” datasets to jointly identify all entity types appearing in the training corpora. By leveraging multiple datasets, the model can learn robust input representations; by building a joint structured model, it avoids potential conflicts caused by combining several models’ predictions at test time. Experiments show that the proposed model significantly outperforms strong multi-task learning baselines when training on multiple, partially annotated datasets and testing on datasets that contain tags from more than one of the training corpora.
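The abstract above hinges on training a single structured tagger from corpora that each annotate only some entity types. One standard way to do this (a minimal brute-force sketch, not necessarily the paper's exact model; the names `partial_crf_nll` and `seq_score` are hypothetical) is a CRF-style loss that marginalizes over all label sequences consistent with the partial annotation:

```python
import itertools
import math

def seq_score(emit, trans, ys):
    """Score of one label sequence: emissions plus first-order transitions."""
    s = emit[0][ys[0]]
    for t in range(1, len(ys)):
        s += trans[ys[t - 1]][ys[t]] + emit[t][ys[t]]
    return s

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def partial_crf_nll(emit, trans, partial):
    """Negative log-likelihood that marginalizes over unannotated positions.

    emit:    T x K emission scores; trans: K x K transition scores.
    partial: length-T list; an int fixes the label, None leaves it latent.
    Brute-force enumeration for illustration only -- a real CRF would use
    O(T * K^2) dynamic programming instead.
    """
    T, K = len(emit), len(emit[0])
    all_seqs = list(itertools.product(range(K), repeat=T))
    log_z = log_sum_exp([seq_score(emit, trans, ys) for ys in all_seqs])
    compatible = [ys for ys in all_seqs
                  if all(p is None or p == y for p, y in zip(partial, ys))]
    log_num = log_sum_exp([seq_score(emit, trans, ys) for ys in compatible])
    return log_z - log_num

emit = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
trans = [[0.5, -0.5], [-0.5, 0.5]]
print(partial_crf_nll(emit, trans, [0, None, None]))  # only position 0 annotated
```

With no annotations at all the constrained set equals the full set and the loss is zero; the more positions a corpus annotates, the tighter the constraint and the larger the loss signal, which is what lets one model pool corpora with different tagsets.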
@inproceedings{huang2019learning,
title = {Learning A Unified Named Entity Tagger From Multiple Partially Annotated Corpora For Efficient Adaptation},
author = {Huang, Xiao and Dong, Li and Boschee, Elizabeth and Peng, Nanyun},
booktitle = {The 2019 SIGNLL Conference on Computational Natural Language Learning (CoNLL)},
year = {2019}
}
Multi-task multi-domain representation learning for sequence tagging
Nanyun Peng and Mark Dredze, in Proceedings of the 2nd Workshop on Representation Learning for NLP, 2017.
@inproceedings{peng2017multi,
title = {Multi-task multi-domain representation learning for sequence tagging},
author = {Peng, Nanyun and Dredze, Mark},
booktitle = {Proceedings of the 2nd Workshop on Representation Learning for NLP},
year = {2017}
}
A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition
Dingquan Wang, Nanyun Peng, and Kevin Duh, in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2017.
@inproceedings{wang2017multi,
title = {A multi-task learning approach to adapting bilingual word embeddings for cross-lingual named entity recognition},
author = {Wang, Dingquan and Peng, Nanyun and Duh, Kevin},
booktitle = {Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
pages = {383--388},
year = {2017}
}
Improving named entity recognition for chinese social media with word segmentation representation learning
Nanyun Peng and Mark Dredze, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, 2016.
Abstract: Named entity recognition, and other information extraction tasks, frequently use linguistic features such as part-of-speech tags or chunkings. For languages where word boundaries are not readily identified in text, word segmentation is a key first step to generating features for an NER system. While using word boundary tags as features is helpful, the signals that aid in identifying these boundaries may provide richer information for an NER system. New state-of-the-art word segmentation systems use neural models to learn representations for predicting word boundaries. We show that these same representations, jointly trained with an NER system, yield significant improvements in NER for Chinese social media. In our experiments, jointly training NER and word segmentation with an LSTM-CRF model yields nearly 5% absolute improvement over previously published results.
@inproceedings{peng2016improving,
title = {Improving named entity recognition for chinese social media with word segmentation representation learning},
author = {Peng, Nanyun and Dredze, Mark},
booktitle = {Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics},
year = {2016}
}
An Empirical Study of Chinese Name Matching and Applications
Nanyun Peng, Mo Yu, and Mark Dredze, in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL), 2015.
@inproceedings{peng2015empirical,
title = {An Empirical Study of Chinese Name Matching and Applications},
author = {Peng, Nanyun and Yu, Mo and Dredze, Mark},
booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL)},
year = {2015}
}
Named entity recognition for chinese social media with jointly trained embeddings
Nanyun Peng and Mark Dredze, in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015.
@inproceedings{peng2015named,
title = {Named entity recognition for chinese social media with jointly trained embeddings},
author = {Peng, Nanyun and Dredze, Mark},
booktitle = {Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing},
pages = {548--554},
year = {2015}
}