Named Entity Recognition as Graph Classification

Motivation

Injecting real-world information (typically contained in Knowledge Graphs) and human expertise into an end-to-end training pipeline for Natural Language Processing models is an open challenge. In this preliminary work, we propose to approach the task of Named Entity Recognition, which is traditionally viewed as a Sequence Labeling problem, as a Graph Classification problem, where every word is represented as a node in a graph. This allows to embed contextual information as well as other external knowledge relevant to each token, such as gazetteer mentions, morphological form, and linguistic tags. We experiment with a variety of graph modeling techniques to represent words, their contexts, and external knowledge, and we evaluate our approach on the standard CoNLL-2003 dataset. We obtained promising results when integrating external knowledge through the use of graph representation in comparison to the dominant end-to-end training paradigm.

In one figure

The GraphNER pipeline

NER as graph classification: instead of the traditional sequence tagging model (left), we propose to treat each word in a sentence as a graph where the word toclassify is linked to the words from its context, as well as other task related featuressuch as grammatical properties (in green), gazetteers mentions (in yellow) and task-specific hand-written features (in blue). The graph is turned into a fixed-length vector representation which is then passed to a classifier to predict the word label.

What external knowledge?

Grammatical tags

We use the Part of Speech tags (POS) e.g. ‘Noun’, ‘Verb’, ‘Adjective’, as well as the shallow parsing tags (chunking) e.g. ‘Verbal Phrase’, ‘Subordinated Clause’ etc. We use the manual annotations provided with the CoNLL dataset, however, it is possible to use off-the-shelf models to annontate data as state-of-the-art POS tagging approaches are are very close to human performance (especially on English).

Letter case

The presence of uppercase letters usually signify that a word refers to an entity. We thus add the following tags: ‘Capitalized’ if the word starts with a capital letter, ‘All Caps’ if the word is made of only uppercase letters, and ‘Acronym’ if the word is a succession of uppercase letters and periods.

Gazetteers

We generate lists of words that are related to potential entity types such as “Person First Name” and “Capital”.

To do so, we selected 14 classes from the Wikidata Knowledge Base that correspond to some of the entity classes (see table below) and used the public SPARQL endpoint to generate the gazetteers. For each of these classes, we query Wiki-data for all entities belonging to that class, a direct subclass of it, or a sub-class of a subclass of it (going any further made the queries much slower and yielded diminishing returns). For each entity, we get all English labels associated with it (from the properties rdfs:label, skos:altLabel) and keep the labels containing only one word. When creating the input graph, if the central node’s word appear in one of the gazetteers, we attach it to the appropriate tag, i.e. if the central node stands for the word “Ford”, it will be attached tothe nodes ‘family name’ (wd:Q11247279), ‘car brand’ (wd:Q44294), ‘male given name’ (wd:Q21021650) and so on. Because of the high number of subclasses for each category, we only keep the top category information for each label when using these gazetteers for our experiments. The code to generate these gazetteers can be found here.

Class QID Subclasses # Instances # One-word Labels
Artist Q483501 350 436 60
Brand Q431289 42 8194 3558
Capital Q5119 15 602 183
City Q515 3528 33101 8681
Country Q6256 51 699 197
Demonym Q217438 6 620 538
Family name Q101352 122 376094 315683
Geolocation Q2221906 190 10584664 276607
Georegion Q82794 978 6164118 568681
Given name Q202444 56 74182 60472
Name Q82799 308 542138 9504
Organization Q43229 3528 2906668 218091
Product Q2424752 3838 722076 29241
Town Q3957 39 44858 23983

Graph representations

The literature on graph representations is extremely diverse and rich. For our experiments, we chose a representative algorithm from each big "family": a shallow neural auto-encoder, Node2Vec (node embeddings), TransE (entity embeddings), and a Graph Convolution Network inspired by this architecture (figure). We also train a two-layers neural network on a simple binary embedding of graph nodes as a baseline.


The challenge of representing graphs does not end there, as we can materialize the idea expressed so far in multiple ways:

  • What constitutes the nodes of the graph and what can be modeled as a feature of the said nodes?
  • How to connect these nodes? Should everything be connected to the central node or should the connection reflect the order in the sentence? Should these relations be semantic, i.e. of different types?
  • Should we account for the entire context of the word or just limit it to a fixed-size window, and if so, what should be this window size?
  • What is the direction of information propagation through the graph?


Graph representations

Several potential representations of word graphs: (a) every word in the vocabulary and every potential tag are nodes that are directly linked to the central node (b) the context nodes are connected in the same order as they appear in the sentence, and the relations to the node are explicitly differentiated (as seen by the color of the edges) (c) the same representation but with the tags added as node features to the central node, not as nodes themselves, i.e. only words are modeled as nodes in this representation.

Results

This table shows the empirical results obtained by each method on the CoNLL-2003 dataset. The entries marked with “+” represent the models with external knowledge added to the words and their context.

Method Dev m-F1Dev M-F1Test m-F1Test M-F1
Auto-encoder91.067.390.363.2
Auto-encoder+91.571.791.570.4
Node2Vec93.381.690.068.3
Node2Vec+94.182.191.172.6
TransE91.875.091.770.0
TransE+93.678.891.974.5
GCN96.186.392.978.8
GCN+96.588.894.181.0
LUKE [*] 94.3

We observe a significant decrease in performance for all models between the evaluation and test sets (with a varying intensity depending on thechoice of the model) that is probably due to the fact that the test set contains a lot of out-of-vocabulary words that do not appear in the training set. Thus, they lack a node representation that we can feed to the network in inference time. We also see that adding the external knowledge consistently improve theperformance of the graph models on both Micro-F1 and Macro-F1 for all models considered. Finally, while the performance on the test set for all graph-onlymode ls is still behind LUKE, the best performing state of the art NER model on ConLL-2003, we observe that these models are significantly smaller and thus faster to train (in matters of minutes once the graph embeddings are gener-ated), when using a simple 2-layers feed-forward neural as a classifier.

These preliminary results show promising directions for additional investigations and improvements, of which we note:

  • Adding linguistic representations as features to the nodes (e.g. word embeddings)
  • Experimenting with other graph models
  • Pretraining the graph representations on a larger corpus
  • Adding attention among the different word/node representations