Term Frequency - Inverse Document Frequency 101




Let us expose here a basic and beautiful Information Retrieval concept such as tf-idf. In order to do so, we will use Python to define a basic in-memory “search engine” that will allow us to add documents and search for them. The search results will contain the relevant documents together with the tf-idf value.

from math import log
import re

class IrIndex:
    """An in-memory inverted index"""
    
    pattern = re.compile("^\s+|\s*,*\s*|\s+$")
    
    def __init__(self):
        self.index = {}
        self.documents = []
        self.tf = {}
    
    def index_document(self, document):
        ## split
        terms = [word for word in self.pattern.split(document)]
        
        ## add to documents
        self.documents.append(document)
        document_pos = len(self.documents)-1
        
        ## add posts to index, updating tf
        for term in terms:
            if term not in self.index:
                self.index[term] = []
                self.tf[term] = []
            self.index[term].append(document_pos)
            self.tf[term].append(terms.count(term))
        
    
    def tf_idf(self, term):
        ## get tf for each document and calculate idf
        if term in self.tf:
            res = []
            for tf, post in zip(self.tf[term], self.index[term]):
                idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
                res.append((tf * idf, self.documents[post]))
            return res 
        else:
            return []

We create now our empty index.

index = IrIndex()

Add some documents…

index.index_document("Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France")
index.index_document("Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France")
index.index_document("Bruno Clair Clos Saint Jaques 2001, Bourgogne, France")
index.index_document("Bruno Clair Clos Saint Jaques 2002, Bourgogne, France")
index.index_document("Bruno Clair Clos Saint Jaques 2005, Bourgogne, France")
index.index_document("Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France")
index.index_document("Chateau Margaux 1982, Bordeaux, France")
index.index_document("Chateau Margaux 1996, Bordeaux, France")
index.index_document("Chateau Latour 1982, Bordeaux, France")
index.index_document("Domaine Raveneau Le Clos 2001, Bourgogne, France")

Let’s try some terms. First, we search for a term that doesn’t exists in any of our documents.

index.tf_idf("hello")

[]

Next, let’s try a term that appears in few a documents.

index.tf_idf("Bordeaux")

[(1.2039728043259361, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1996, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Latour 1982, Bordeaux, France')]

Now a term with higher idf. That is, ‘Margaux’ is less common or more specific than ‘Bordeaux’. We see higher scores in general.

index.tf_idf("Margaux")

[(1.6094379124341003, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.6094379124341003, 'Chateau Margaux 1996, Bordeaux, France')]

Followed by a term with tf higher than 1 for one of the documents. Now we search for ‘Bourgogne’, a more common term in our index, so we have lower idf. That means lower scores in general but higher in cases with higher tf among them.

index.tf_idf("Bourgogne")

[(0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France'),
 (0.44628710262841953,
  'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France'),
 (0.44628710262841953,
  'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France'),
 (0.22314355131420976, 'Domaine Raveneau Le Clos 2001, Bourgogne, France')]

Multi-term search (overlap score measure)

In order to do multi-term search, we will sum the tf-idf for each term hit per document. This is called overlap score measure. In order to do so, we need a new tf_idf method for our index.

def overlap_score_measure(self, terms):
        res = []
        hits = {}
        # sum tf-idfs for each hitting document
        hitting_terms = [term for term in self.pattern.split(terms) if term in self.tf]
        for term in hitting_terms: # for each term having at least on hit...
            for tf, post in zip(self.tf[term], self.index[term]): # store the tf-idf in hits for later sum
                if post not in hits:
                    hits[post] = []
                idf = log( float( len(self.documents) ) / float( len(self.tf[term]) ) )
                hits[post].append(tf * idf)
        # sum hits for each post
        for post in hits.iterkeys():
            tfidf = sum(hits[post])
            res.append((tfidf, self.documents[post]))
            
        return res 
    
    
    IrIndex.overlap_score_measure = overlap_score_measure

First, let’s check that works the same for single term queries.

index.overlap_score_measure("hello")

[]
index.overlap_score_measure("Bordeaux")

[(1.2039728043259361, 'Chateau Latour 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1996, Bordeaux, France')]
index.overlap_score_measure("Margaux")

[(1.6094379124341003, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.6094379124341003, 'Chateau Margaux 1996, Bordeaux, France')]
index.overlap_score_measure("Bourgogne")

[(0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France'),
 (0.8925742052568391,
  'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France'),
 (0.22314355131420976, 'Domaine Raveneau Le Clos 2001, Bourgogne, France')]

We try now with a multi term search where one of the terms doesn’t hit any documents at all.

index.overlap_score_measure("hello Bordeaux")

[(1.2039728043259361, 'Chateau Latour 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1996, Bordeaux, France')]

Multi-term, with disjoint results.

index.overlap_score_measure("Bourgogne Bordeaux")

[(0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France'),
 (0.22314355131420976,
  'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France'),
 (0.8925742052568391,
  'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France'),
 (1.2039728043259361, 'Chateau Margaux 1982, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Margaux 1996, Bordeaux, France'),
 (1.2039728043259361, 'Chateau Latour 1982, Bordeaux, France'),
 (0.22314355131420976, 'Domaine Raveneau Le Clos 2001, Bourgogne, France')]

And finally, a multi-term where some results have more than one term hitting. We see how the score is increased.

index.overlap_score_measure("Margaux Bordeaux")

[(1.2039728043259361, 'Chateau Latour 1982, Bordeaux, France'),
 (2.8134107167600364, 'Chateau Margaux 1982, Bordeaux, France'),
 (2.8134107167600364, 'Chateau Margaux 1996, Bordeaux, France')]

And with this we conclude our introduction to the concept of Term Frequency - Inverse Document Frequency. You can find all the code for this notebook on GitHub where you can fork it and make it yours to experiment.