3

I have code for an invertedIndex as follows. However I'm not too satisfied with it and was wondering how it can be made more compact and pythonic

class invertedIndex(object):


  def __init__(self,docs):
     self.docs,self.termList,self.docLists=docs,[],[]

     for index,doc in enumerate(docs):

        for term in doc.split(" "):
            if term in self.termList:
                i=self.termList.index(term)
                if index not in self.docLists[i]:
                    self.docLists[i].append(index)

            else:
                self.termList.append(term)
                self.docLists.append([index])  

  def search(self,term):
        try:
            i=self.termList.index(term)
            return self.docLists[i]
        except:
            return "No results"





docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print invertedIndex.search("sales")

2 Answers 2

5

Store the doc indicies in a Python set and use a dict to reference the "doc set" for each term.

from collections import defaultdict

class invertedIndex(object):

  def __init__(self,docs):
      self.docSets = defaultdict(set)
      for index, doc in enumerate(docs):
          for term in doc.split():
              self.docSets[term].add(index)

  def search(self,term):
        return self.docSets[term]

docs=["new home sales top forecasts june june june",
                     "home sales rise in july june",
                     "increase in home sales in july",
                     "july new home sales rise"]

i=invertedIndex(docs)
print i.search("sales") # outputs: set([0, 1, 2, 3])

set works a bit like a list, but is unordered and cannot contain duplicate entries.

defaultdict is basically a dict, which has a default type when no data is available (in this case an empty set).

Sign up to request clarification or add additional context in comments.

1 Comment

The defaultdict(set) is a neat way
2

This solution is almost identical to @Peter Gibson's. In this version, the index is the data, there's no delegated docSets object involved. This makes the code slightly shorter and clearer.

The code also preserves the original order of the documents... which is sort of a bug, I like Peter's set() implementation better.

Note also that reference to an non existent term, like ix['garbage'], implicitly modifies the index. If the only API is search, this is fine, but this case is worth noting.

source

class InvertedIndex(dict):
    def __init__(self, docs):
        self.docs = docs

        for doc_index,doc in enumerate(docs):
            for term in doc.split(" "):
                self[term].append(doc_index)

    def __missing__(self, term):
        # operate like defaultdict(list)
        self[term] = []
        return self[term]

    def search(self, term):
        return self.get(term) or 'No results'


docs=["new home sales top forecasts june june june",
      "home sales rise in july june",
      "increase in home sales in july",
      "july new home sales rise",
      'beer',
      ]

ix = InvertedIndex(docs)
print ix.__dict__
print
print 'sales:',ix.search("sales")
print 'whiskey:', ix.search('whiskey')
print 'beer:', ix.search('beer')

print '\nTEST OF KEY SETTING'
print ix['garbage']
print 'garbage' in ix
print ix.search('garbage')

output

{'docs': ['new home sales top forecasts june june june', 'home sales rise in july june', 'increase in home sales in july', 'july new home sales rise', 'beer']}

sales: [0, 1, 2, 3]
whiskey: No results
beer: [4]

TEST OF KEY SETTING
[]
True
No results

5 Comments

subclassing dict is a nice solution - you get things like __str__, __repr__, __eq__ etc for free
Thanks both of you guys. I'm trying next to write the Vector Space Model as concise as possible and will use these approaches
I have a follow up question for shavenwarthog. With this approach how do I restrict setting of keys from outside this class. For eg If someone did ix["garbage"], it would add an empty list to the ix object as { "garbage"=[]}. This is not desirable
@sarathjoseph, hmm, yes. I understand that this isn't desirable, but what is the overall use case? I thought the API was ix.search('garbage'), which provides "No results" as expected. I've updated the code and output to illustrate.
The search behaves just as expected.It shouldn't be too much of a problem and I might just get things working. But is there a way to keep the ix object free of other keys but the terms.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.