Download Android App

Alternate Blog View: Timeslide Sidebar Magazine

Saturday, June 7, 2014

Collocation extraction using NLTK

A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful.

Collocations are important for a number of applications: natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided), computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry), parsing (so that preference can be given to parses with natural collocations), and corpus linguistic research.

2 Collocation algorithms:

a. The simplest method for finding collocations in a text corpus is counting. Pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be “phrases”.

b. Mean and variance based methods work by looking at the pattern of varying distance between two words. 

We also want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing. The options are t-test, chi-square, PMI, likelihood ratios, Poisson-Stirling measure and Jaccard index.

Likelihood ratios is an effective approach to hypothesis testing for extracting collocations. You can read more about it here.

Now, to some code to see how collocations can be extracted. Collocation processing modules are available in Apache Mahout, NLTK and Stanford NLP. In this article, we will use NLTK library written in Python to extract bigram collocations from a set of documents.

@author: Nishant

import nltk
from nltk.collocations import *

from com.nishant import DataLoader

if __name__ == '__main__':

    tokens = []
    stopwords = nltk.corpus.stopwords.words('english')

    '''Load multiple documents and return it as a list. 
       Provide your implementation here'''
    data = DataLoader.load("data.xml")

    print 'Total documents loaded: %d' % (data.len)

    ''' Apply stop words '''
    for d in data:
        data_tokens = nltk.wordpunct_tokenize(d)
        filtered_tokens = [w for w in data_tokens if w.lower() not in stopwords]
        tokens = tokens + filtered_tokens

    print 'Total tokens loaded: %d' % (tokens.len)
    print 'Calculating Collocations'

    ''' Extract only bigrams within a window of 5. 
        Implementation for trigram also available. NLTK
        has utility functions for frequency, ngram and word filter''' 
    bigram_measures = nltk.collocations.BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens, 5)

    ''' Return the 1000 n-grams with the highest likelihood ratio.
        You can also use PMI or other methods and compare results '''
    print 'Printing Top 1000 Collocations''
    print finder.nbest(bigram_measures.likelihood_ratio, 1000)

Output >>
('side', 'neck'), ('stomach', 'pain'), ('doctor', 'suggested'), ('loss', 'appetite') .....

NLTK documentation for collocations are available here.
Must read on this topic -