Let us expose here a basic and beautiful Information Retrieval concept such as
tf-idf . In order to do so, we
will use Python to define a basic in-memory “search engine” that will allow us to add documents
and search for them. The search results will contain the relevant documents together
with the tf-idf value.
from math import log
import re
class IrIndex :
"""An in-memory inverted index"""
pattern = re . compile ( "^ \ s+| \ s*,* \ s*| \ s+$" )
def __init__ ( self ):
self . index = {}
self . documents = []
self . tf = {}
def index_document ( self , document ):
## split
terms = [ word for word in self . pattern . split ( document )]
## add to documents
self . documents . append ( document )
document_pos = len ( self . documents ) - 1
## add posts to index, updating tf
for term in terms :
if term not in self . index :
self . index [ term ] = []
self . tf [ term ] = []
self . index [ term ] . append ( document_pos )
self . tf [ term ] . append ( terms . count ( term ))
def tf_idf ( self , term ):
## get tf for each document and calculate idf
if term in self . tf :
res = []
for tf , post in zip ( self . tf [ term ], self . index [ term ]):
idf = log ( float ( len ( self . documents ) ) / float ( len ( self . tf [ term ]) ) )
res . append (( tf * idf , self . documents [ post ]))
return res
else :
return []
We create now our empty index.
index = IrIndex ()
Add some documents…
index . index_document ( "Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France" )
index . index_document ( "Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France" )
index . index_document ( "Bruno Clair Clos Saint Jaques 2001, Bourgogne, France" )
index . index_document ( "Bruno Clair Clos Saint Jaques 2002, Bourgogne, France" )
index . index_document ( "Bruno Clair Clos Saint Jaques 2005, Bourgogne, France" )
index . index_document ( "Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France" )
index . index_document ( "Chateau Margaux 1982, Bordeaux, France" )
index . index_document ( "Chateau Margaux 1996, Bordeaux, France" )
index . index_document ( "Chateau Latour 1982, Bordeaux, France" )
index . index_document ( "Domaine Raveneau Le Clos 2001, Bourgogne, France" )
Let’s try some terms. First, we search for a term that doesn’t exists in any of
our documents.
index . tf_idf ( "hello" )
[]
Next, let’s try a term that appears in few a documents.
index . tf_idf ( "Bordeaux" )
[( 1.2039728043259361 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1996, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Latour 1982, Bordeaux, France' )]
Now a term with higher idf. That is, ‘Margaux’ is less common or more
specific than ‘Bordeaux’. We see higher scores in general.
index . tf_idf ( "Margaux" )
[( 1.6094379124341003 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.6094379124341003 , 'Chateau Margaux 1996, Bordeaux, France' )]
Followed by a term with tf higher than 1 for one of the documents. Now we search for
‘Bourgogne’, a more common term in our index, so we have lower idf . That
means lower scores in general but higher in cases with higher tf among them.
index . tf_idf ( "Bourgogne" )
[( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France' ),
( 0.44628710262841953 ,
'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France' ),
( 0.44628710262841953 ,
'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France' ),
( 0.22314355131420976 , 'Domaine Raveneau Le Clos 2001, Bourgogne, France' )]
Multi-term search (overlap score measure)
In order to do multi-term search, we will sum the tf-idf for each term hit per
document. This is called overlap score measure . In order to do so, we need a new tf_idf
method for our index.
def overlap_score_measure ( self , terms ):
res = []
hits = {}
# sum tf-idfs for each hitting document
hitting_terms = [ term for term in self . pattern . split ( terms ) if term in self . tf ]
for term in hitting_terms : # for each term having at least on hit...
for tf , post in zip ( self . tf [ term ], self . index [ term ]): # store the tf-idf in hits for later sum
if post not in hits :
hits [ post ] = []
idf = log ( float ( len ( self . documents ) ) / float ( len ( self . tf [ term ]) ) )
hits [ post ] . append ( tf * idf )
# sum hits for each post
for post in hits . iterkeys ():
tfidf = sum ( hits [ post ])
res . append (( tfidf , self . documents [ post ]))
return res
IrIndex . overlap_score_measure = overlap_score_measure
First, let’s check that works the same for single term queries.
index . overlap_score_measure ( "hello" )
[]
index . overlap_score_measure ( "Bordeaux" )
[( 1.2039728043259361 , 'Chateau Latour 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1996, Bordeaux, France' )]
index . overlap_score_measure ( "Margaux" )
[( 1.6094379124341003 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.6094379124341003 , 'Chateau Margaux 1996, Bordeaux, France' )]
index . overlap_score_measure ( "Bourgogne" )
[( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France' ),
( 0.8925742052568391 ,
'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France' ),
( 0.22314355131420976 , 'Domaine Raveneau Le Clos 2001, Bourgogne, France' )]
We try now with a multi term search where one of the terms doesn’t hit any
documents at all.
index . overlap_score_measure ( "hello Bordeaux" )
[( 1.2039728043259361 , 'Chateau Latour 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1996, Bordeaux, France' )]
Multi-term, with disjoint results.
index . overlap_score_measure ( "Bourgogne Bordeaux" )
[( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Chambertin Clos de Beze 2005, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2001, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2002, Bourgogne, France' ),
( 0.22314355131420976 ,
'Bruno Clair Clos Saint Jaques 2005, Bourgogne, France' ),
( 0.8925742052568391 ,
'Coche-Dury Bourgogne Chardonay 2005, Bourgogne, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Margaux 1996, Bordeaux, France' ),
( 1.2039728043259361 , 'Chateau Latour 1982, Bordeaux, France' ),
( 0.22314355131420976 , 'Domaine Raveneau Le Clos 2001, Bourgogne, France' )]
And finally, a multi-term where some results have more than one term hitting. We
see how the score is increased.
index . overlap_score_measure ( "Margaux Bordeaux" )
[( 1.2039728043259361 , 'Chateau Latour 1982, Bordeaux, France' ),
( 2.8134107167600364 , 'Chateau Margaux 1982, Bordeaux, France' ),
( 2.8134107167600364 , 'Chateau Margaux 1996, Bordeaux, France' )]
And with this we conclude our introduction to the concept of Term Frequency -
Inverse Document Frequency. You can find all the code for this notebook
on GitHub
where you can fork it and make it yours to experiment.