This article is a how-to guide for training a custom entity recognition model. Installation problems, version conflicts, and operating-system issues, although common with this kind of analysis, are out of scope for this article.

Table of contents:

  1. Business use case for entity recognition
  2. Overview of CRF
  3. Annotating training data
  4. Python Code for deploying CRF
  5. References

Business use case for entity recognition
Let us suppose that you are part of an analytics team in an insurance company whose claims team receives thousands of emails each day from customers regarding their claims. The claims operations team goes through each email and updates an online form with the claim details before acting on it. You are asked to work with the IT team to automate the pre-population of the online form. For this task, the analytics team needs to build a custom entity recognition algorithm.

To identify entities in text, one must be able to identify the pattern in which they occur. For example, to identify the claim number, we can look at the words around it, such as “my id is” or “my number is”. Let us examine a few approaches to identifying such patterns.

  1. Regular expressions: Regular expressions (RegEx) describe patterns that a finite-state automaton can match. They work well for entities that follow a fixed structure, such as email IDs and phone numbers. The downside is that one needs to know in advance every phrase that can occur before the claim number. This is a hand-written, rule-based approach rather than a learning approach.
  2. Hidden Markov Model (HMM): This is a sequence modelling algorithm that learns the pattern from data. Although an HMM considers the future observations around the entities when learning the pattern, it assumes that the features are independent of each other. It is better than regular expressions because we do not need to spell out the exact set of word(s), but in terms of performance it is not known to be the best for entity recognition.
  3. MaxEnt Markov Model (MEMM): This is also a sequence modelling algorithm. It does not assume that the features are independent of each other, but it does not consider future observations when learning the pattern. In terms of performance, it is not known to be the best for entity recognition either.
  4. Conditional Random Fields (CRF): This is also a sequence modelling algorithm. It allows the features to depend on each other and also considers future observations when learning a pattern, so it combines the strengths of HMM and MEMM. In terms of performance, it is generally regarded as the strongest of these approaches for entity recognition.
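
To make the limitation of the first approach concrete, here is a minimal sketch of a RegEx-based extractor. The trigger phrases and the `extract_claim_number` helper are assumptions made purely for illustration; they are not part of the pipeline built later in this article.

```python
import re

# Hypothetical trigger phrases; every new phrasing must be added by hand.
CLAIM_PATTERN = re.compile(
    r"\bmy (?:id|number) is\s+([A-Za-z0-9]+)",
    flags=re.IGNORECASE,
)

def extract_claim_number(text):
    """Return the token that follows a known trigger phrase, or None."""
    match = CLAIM_PATTERN.search(text)
    return match.group(1) if match else None

# Matches, because the email uses one of the listed phrases.
print(extract_claim_number("My id is abc123 and I claimed on 1st january 2018."))
# Misses, because this phrasing was never listed in the pattern.
print(extract_claim_number("The reference for my claim: abc123."))
```

The second call returns nothing even though a human reader spots the claim number immediately, which is the gap the learning-based approaches are meant to close.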

Overview of CRF
In the CRF formula, Y is the hidden state (for example, the part of speech or, in our case, the entity tag) and X is the observed variable (in our example, the words of the email: the entity itself and the words around it).
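
For reference, the linear-chain CRF is commonly written as follows. This is the standard form from Sutton and McCallum (reference 1). The lowercase y and x stand for the Y and X described above, and the symbols w_k (weights), f_k (feature functions), K (number of features) and T (sequence length) follow that convention, which is an assumption about the notation intended here.

```latex
p(y \mid x) = \frac{1}{Z(x)}
  \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, f_k(y_{t-1}, y_t, x, t) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} w_k \, f_k(y'_{t-1}, y'_t, x, t) \right)
```

Here y = (y_1, ..., y_T) is one candidate sequence of hidden states, y_0 is a fixed start state, and the sum in Z(x) runs over every possible state sequence y'.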

Broadly speaking, there are 2 components to the CRF formula.

  1. Normalization: You may have observed that there are no probabilities on the right side of the equation, where we have the weights and features. However, the output is expected to be a probability, hence the need for normalization. The normalization constant Z(x) is the sum of the unnormalized scores of all possible state sequences, so that the resulting probabilities add up to 1. The references at the end of this article explain in detail the forward-backward recursions used to compute this value.
  2. Weights and Features: This component can be thought of as the logistic regression formula, with weights and their corresponding features. The weights are estimated by maximum likelihood estimation, and the features are defined by us.
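
The two components can be sketched on a toy example. The states, the three hand-made features and their weights below are all assumptions chosen for illustration (a trained model would learn the weights). The sketch computes Z(x) twice, once by enumerating every state sequence and once with the forward recursion, and then normalizes the scores into probabilities.

```python
import itertools
import math

STATES = ["NA", "claim_number"]

def score(prev_state, state, word):
    """Unnormalized log-score: a weighted sum of toy features."""
    s = 0.0
    if state == "claim_number" and any(ch.isdigit() for ch in word):
        s += 2.0   # toy weight: tokens with digits look like claim numbers
    if state == "NA" and word.isalpha():
        s += 1.0   # toy weight: purely alphabetic tokens are usually 'NA'
    if prev_state == "claim_number" and state == "claim_number":
        s -= 1.0   # toy weight: two claim numbers in a row are unlikely
    return s

def sequence_score(states, words):
    """Sum the per-position scores along one state sequence."""
    total, prev = 0.0, None
    for state, word in zip(states, words):
        total += score(prev, state, word)
        prev = state
    return total

def partition_brute_force(words):
    """Z(x): sum of exp(score) over every possible state sequence."""
    return sum(
        math.exp(sequence_score(seq, words))
        for seq in itertools.product(STATES, repeat=len(words))
    )

def partition_forward(words):
    """Z(x) via the forward recursion, linear in the sequence length."""
    alpha = {s: math.exp(score(None, s, words[0])) for s in STATES}
    for word in words[1:]:
        alpha = {
            s: sum(alpha[p] * math.exp(score(p, s, word)) for p in STATES)
            for s in STATES
        }
    return sum(alpha.values())

words = ["my", "id", "is", "abc123"]
z = partition_brute_force(words)
probs = {
    seq: math.exp(sequence_score(seq, words)) / z
    for seq in itertools.product(STATES, repeat=len(words))
}
best = max(probs, key=probs.get)
print("most probable sequence:", best)
print("probabilities sum to:", round(sum(probs.values()), 6))
print("forward Z matches brute force:", math.isclose(z, partition_forward(words)))
```

Enumeration grows exponentially with the length of the email, which is why practical implementations rely on the forward-backward recursions instead.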

Annotating training data
Now that you are familiar with the CRF model, let us build the training data. The first step is annotation, the process of tagging word(s) with the corresponding entity tag. For simplicity, let us suppose that we need only 2 entities to populate the online form, namely claimant name and claim number. The following is a sample email as received. Such emails need to be annotated so that the CRF model can be trained. The annotated text needs to be in an XML format. Although you may choose to annotate the documents in your own way, we walk you through the use of GATE to do the same.

Email received:

Hi, I am writing this email to claim my insurance amount. My id is abc123 and I claimed on 1st january 2018. I did not receive any acknowledgement. Please help. Thanks, randomperson

Annotated Email:

Hi, I am writing this email to claim my insurance amount. My id is <claim_number>abc123</claim_number> and I claimed on 1st january 2018. I did not receive any acknowledgement. Please help. Thanks, randomperson
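
To see what the model eventually receives, the annotated text can be turned into (token, label) pairs. The following is a minimal sketch using only the standard library. It assumes the email is wrapped in a `<document>` root element (the same root used when exporting from GATE later in this article) and uses plain whitespace splitting in place of a real tokenizer; `labelled_tokens` is a hypothetical helper name.

```python
from xml.etree import ElementTree

# Shortened version of the annotated email, wrapped in a <document> root.
annotated = (
    "<document>Hi, my id is <claim_number>abc123</claim_number> "
    "and I claimed on 1st january 2018. Thanks, randomperson</document>"
)

def labelled_tokens(xml_text):
    """Return (token, label) pairs; untagged text gets the label 'NA'."""
    root = ElementTree.fromstring(xml_text)
    # Text before the first annotation
    pairs = [(tok, "NA") for tok in (root.text or "").split()]
    for child in root:
        # Text inside an annotation takes the tag name as its label
        pairs += [(tok, child.tag) for tok in (child.text or "").split()]
        # Text after the annotation, up to the next one
        pairs += [(tok, "NA") for tok in (child.tail or "").split()]
    return pairs

for token, label in labelled_tokens(annotated):
    print(token, label)
```

Every token outside an annotation is labelled 'NA', which is the same convention the training code further below uses.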

Annotations using GATE: Let us understand how to use the General Architecture for Text Engineering (GATE). Follow the steps below to install GATE and annotate the emails.

  1. Download the latest version from this link: https://gate.ac.uk/download/#latest
  2. Install the GATE platform by running the downloaded installer and following the installation steps.
  3. After installation, run the application executable file.
  4. Once the application opens, load the emails one at a time into the language resources by right-clicking on “Language Resources” > New > GATE Document. Give each email a name, set the encoding to “utf-8” so that we have no issues in Python, and navigate to the email that needs to be annotated by clicking on the icon in the sourceUrl section.
  5. Open one email at a time and start the annotation exercise. There are 2 options for building annotations: (a) load an annotation XML into GATE and use it, or (b) create annotations on the fly and use them. In this article, we demonstrate option (b).
  6. Click on the email in the Language Resources section to open it. Click on “Annotation Sets”, then select the word or words and hold the cursor over the selection for a couple of seconds. A pop-up window for annotation comes up; type the annotation in place of “NEW” and hit enter. A new annotation is created. Repeat this exercise for all the annotations in each email.
  7. Once all the training emails are annotated, create a corpus for ease of use by navigating to Language Resources > New > GATE Corpus.
  8. Give the new corpus a name for your own reference, click on the navigation icon, and add each email that was loaded into the Language Resources.
  9. Save the corpus as inline XML in a folder on your machine by right-clicking on the corpus and navigating to “Inline XML(.xml)”.
  10. In the next pop-up window, select the annotation types that are pre-populated and remove them. Manually type your own annotations and add them in place of the pre-populated ones. Set the “includeFeatures” option to false by clicking on it, and type “document” into the rootElement box. Once all these changes are made, save the file to a folder on your machine by clicking on the “Save To” icon.
  11. The above process saves all the annotated emails in one folder.
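
Before training, it can be worth confirming that every exported file parses and contains the expected tags. The sketch below is one way to do that; `check_annotated_folder` is a hypothetical helper, and it assumes the export settings described above (a `document` root element and the two tag names used in this article). The demonstration writes a small file to a temporary folder so that it runs without the real data.

```python
import glob
import os
import tempfile
from xml.etree import ElementTree

def check_annotated_folder(folder, expected_tags=("claim_number", "claimant")):
    """Return {file name: {tag: count}} for each inline XML file in folder."""
    report = {}
    for path in sorted(glob.glob(os.path.join(folder, "*.xml"))):
        root = ElementTree.parse(path).getroot()
        if root.tag != "document":
            raise ValueError("%s: root element is <%s>, expected <document>"
                             % (path, root.tag))
        # Count how often each expected annotation occurs in this email
        counts = {tag: len(root.findall(".//" + tag)) for tag in expected_tags}
        report[os.path.basename(path)] = counts
    return report

# Demonstration on a throwaway folder holding one annotated email
with tempfile.TemporaryDirectory() as folder:
    with open(os.path.join(folder, "email1.xml"), "w", encoding="utf-8") as f:
        f.write("<document>My id is <claim_number>abc123</claim_number></document>")
    print(check_annotated_folder(folder))
```

A file whose counts are all zero was probably exported without its annotations and should be re-saved from GATE.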

Python Code for deploying CRF
The following steps will help in installing the python-crfsuite package (imported in Python as pycrfsuite).
• First install the package. For a pip installation, the command is “pip install python-crfsuite”, and for a conda installation, the command is “conda install -c conda-forge python-crfsuite”
• If the above installation doesn’t work, download the relevant python-crfsuite build from https://anaconda.org/conda-forge/python-crfsuite/files. For example, for a 64-bit Windows machine with Python 2.7, one can download win-64/python-crfsuite-0.9.2-py27_vc9_0.tar.bz2
• Extract the pycrfsuite and python_crfsuite-0.9.2-py2.7.egg-info files and place them in the folder where the rest of the packages are present. For example, if Anaconda is used, these files can be placed in the anaconda>lib>site-packages folder.
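
A quick way to confirm that the installation succeeded is to ask Python for the installed version. This is a minimal sketch that assumes Python 3.8 or later (for `importlib.metadata`); `installed_version` is a hypothetical helper name.

```python
from importlib import metadata

def installed_version(package="python-crfsuite"):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

version = installed_version()
if version is None:
    print("python-crfsuite is not installed in this environment")
else:
    print("python-crfsuite", version)
```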

#invoke libraries
from bs4 import BeautifulSoup as bs
from bs4.element import Tag
import codecs
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.model_selection import train_test_split
import pycrfsuite
import os, os.path, sys
import glob
from xml.etree import ElementTree
import numpy as np
from sklearn.metrics import classification_report

#-------------------------------------------
# Build Functions
#-------------------------------------------
#this function reads every annotated XML file in a folder and
#concatenates the contents into one string
def append_annotations(files):
    xml_files = glob.glob(os.path.join(files, "*.xml"))
    new_data = ""
    for xml_file in xml_files:
        data = ElementTree.parse(xml_file).getroot()
        #encoding="unicode" returns str rather than bytes on Python 3
        new_data += ElementTree.tostring(data, encoding="unicode")
    return new_data

#this function removes special characters and punctuations
def remov_punct(withpunct):
    #the backslash is escaped so that Python does not warn about '\,'
    punctuations = '''!()-[]{};:'"\\,<>./?@#$%^&*_~'''
    without_punct = ""
    for char in withpunct:
        if char not in punctuations:
            without_punct = without_punct + char
    return without_punct

#-------------------------------------------
# Functions end
#-------------------------------------------

#import annotated data
files_path = "D:/Annotated/"

allxmlfiles = append_annotations(files_path)
soup = bs(allxmlfiles, "html5lib")

#identify the tagged element
docs = []
for d in soup.find_all("document"):
    sents = [] #reset per document so tokens do not leak across emails
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            #plain text between annotations: label every token 'NA'
            withoutpunct = remov_punct(str(wrd))
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token,'NA'))
        else:
            #annotated span: label every token with the tag name
            withoutpunct = remov_punct(wrd.get_text())
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token,wrd.name))
        sents = sents + tags #puts all the tokens of a document in one list
    docs.append(sents) #appends all the individual documents into one list
        
#Generate features. These are the default features that NER algorithm uses in nltk. One can modify it for customization
data = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]    
    tagged = nltk.pos_tag(tokens)    
    data.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])

def word2features(doc, i):
    word = doc[i][0]
    postag = doc[i][1]

    # Common features for all words
    features = [
        'bias',
        'word.lower=' + word.lower(),
        'word[-3:]=' + word[-3:],
        'word[-2:]=' + word[-2:],
        'word.isupper=%s' % word.isupper(),
        'word.istitle=%s' % word.istitle(),
        'word.isdigit=%s' % word.isdigit(),
        'postag=' + postag
    ]

    # Features for words that are not
    # at the beginning of a document
    if i > 0:
        word1 = doc[i-1][0]
        postag1 = doc[i-1][1]
        features.extend([
            '-1:word.lower=' + word1.lower(),
            '-1:word.istitle=%s' % word1.istitle(),
            '-1:word.isupper=%s' % word1.isupper(),
            '-1:word.isdigit=%s' % word1.isdigit(),
            '-1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'beginning of a document'
        features.append('BOS')

    # Features for words that are not
    # at the end of a document
    if i < len(doc)-1:
        word1 = doc[i+1][0]
        postag1 = doc[i+1][1]
        features.extend([
            '+1:word.lower=' + word1.lower(),
            '+1:word.istitle=%s' % word1.istitle(),
            '+1:word.isupper=%s' % word1.isupper(),
            '+1:word.isdigit=%s' % word1.isdigit(),
            '+1:postag=' + postag1
        ])
    else:
        # Indicate that it is the 'end of a document'
        features.append('EOS')

    return features

# A function for extracting features in documents
def extract_features(doc):
    return [word2features(doc, i) for i in range(len(doc))]

def get_labels(doc):
    return [label for (token, postag, label) in doc]

X = [extract_features(doc) for doc in data]
y = [get_labels(doc) for doc in data]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


#train the CRF model
trainer = pycrfsuite.Trainer(verbose=True)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)

# Set parameters of model
trainer.set_params({
    
    'c1': 0.1,    
    'c2': 0.01,     
    'max_iterations': 200,    
    'feature.possible_transitions': True
})

# Provide a file name as a parameter to the train function, such that
# the model will be saved to the file when training is finished
trainer.train('crf.model')


#Test the model
tagger = pycrfsuite.Tagger()
tagger.open('crf.model')
y_pred = [tagger.tag(xseq) for xseq in X_test]

# Let's take a look at the first sample in the testing set
i = 0
for pred, word in zip(y_pred[i], [feats[1].split("=", 1)[1] for feats in X_test[i]]):
    print("%s (%s)" % (word, pred))

#Check model performance
# Create a mapping of labels to indices; each label needs its own index
labels = {"NA": 0, "claim_number": 1, "claimant": 2}

# Convert the sequences of tags into a 1-dimensional array
predictions = np.array([labels[tag] for row in y_pred for tag in row])
truths = np.array([labels[tag] for row in y_test for tag in row])

# Print out the classification report; target_names follow the index order
print(classification_report(
    truths, predictions,
    labels=[0, 1, 2],
    target_names=["NA", "claim_number", "claimant"]))

#-------------------------------------------------
#predict new data
#-------------------------------------------------
# Read new data
with codecs.open("D:/SampleEmail6.xml", "r", "utf-8") as infile:
    soup_test = bs(infile, "html5lib")


docs = []
for d in soup_test.find_all("document"):
    sents = [] #reset per document so tokens do not leak across emails
    for wrd in d.contents:
        tags = []
        if wrd.name is None:
            #plain text between annotations: label every token 'NA'
            withoutpunct = remov_punct(str(wrd))
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token,'NA'))
        else:
            #annotated span: label every token with the tag name
            withoutpunct = remov_punct(wrd.get_text())
            temp = word_tokenize(withoutpunct)
            for token in temp:
                tags.append((token,wrd.name))
        sents = sents + tags # puts all the tokens of a document in one list
    docs.append(sents) #appends all the individual documents into one list


data_test = []
for i, doc in enumerate(docs):
    tokens = [t for t, label in doc]    
    tagged = nltk.pos_tag(tokens)    
    data_test.append([(w, pos, label) for (w, label), (word, pos) in zip(doc, tagged)])


data_test_feats = [extract_features(doc) for doc in data_test]
tagger.open('crf.model')
newdata_pred = [tagger.tag(xseq) for xseq in data_test_feats]

# Let's take a look at the predictions for the first new document
i = 0
for pred, word in zip(newdata_pred[i], [feats[1].split("=", 1)[1] for feats in data_test_feats[i]]):
    print("%s (%s)" % (word, pred))



By now, you should understand how to annotate training data, how to use Python to train a CRF model, and finally how to identify entities in new text. Although the code provides a basic set of helpful features, you may come up with your own set of features to improve the accuracy of the model.
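
As one possible starting point for custom features, the sketch below adds a word-shape feature, a digit indicator and a length feature. The `word_shape` and `extra_features` helpers are hypothetical and were not part of the code above; whether they improve accuracy depends on your data.

```python
def word_shape(word):
    """Map characters to X (upper), x (lower), d (digit); keep the rest."""
    shape = []
    for ch in word:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)

def extra_features(word):
    """Hypothetical additions to the feature list built in word2features."""
    return [
        "word.shape=" + word_shape(word),
        "word.has_digit=%s" % any(ch.isdigit() for ch in word),
        "word.length=%d" % len(word),
    ]

print(extra_features("abc123"))
```

To use them, one could call `features.extend(extra_features(word))` inside `word2features` after the common features are built. Appending at the end keeps 'word.lower' at index 1, which the printing loops above rely on.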

To summarize, here are the key points that we have covered in the article.
• Entities are parts of text that are of interest for the business problem at hand
• The sequence of words or tokens matters in identifying entities
• Pattern recognition approaches such as Regular Expressions, or graph-based models such as the Hidden Markov Model and the Maximum Entropy Markov Model, can help in identifying entities. However, Conditional Random Fields (CRF) is a popular and arguably better candidate for entity recognition problems
• CRF is an undirected graph-based model that considers the words that occur not only before the entity but also after it
• The training data can be annotated by using GATE architecture
• The Python code provided helps in training a CRF model and extracting entities from text
• In conclusion, this article should give you a good starting point for your business problem

References

  1. An Introduction to Conditional Random Fields by Charles Sutton & Andrew McCallum.
  2. Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing by Alexander M. Rush (based on joint work with Michael Collins, Tommi Jaakkola, Terry Koo, David Sontag).
  3. Performing Sequence Labelling using CRF in Python by Albert Au Yeung.
  4. Using GATE as an Annotation Tool by Tom Kenter, Diana Maynard.