simple in memory positional inverted index in python

Question

I'm trying to make a simple positional index, but I'm having some problems getting the correct output.

Given a list of strings (sentences), I want to use each string's position in the list as its document ID, then iterate over the words in the sentence and use each word's index in the sentence as its position. The dictionary of words should then be updated with a tuple of the doc ID and the word's position in the doc.

Code:

main func -

def doc_pos_index(alist):
    inv_index= {}
    words = [word for line in alist for word in line.split(" ")]

    for word in words:
        if word not in inv_index:
            inv_index[word]=[]

    for item, index in enumerate(alist): # find item and it's index in list
        for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2)) # if word in index update it's list with tuple of doc index and position

    return inv_index

example list:

doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]

desired output:

{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)], 
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
 etc...}

Current output:

{'Delivered': [],
'necessary': [], 
'dejection': [], 
'do': [],
'objection': [], 
'prevailed': [], 
'mr': [], 
'hello': []}

FYI, I do know about the collections library and NLTK, but I'm mainly doing this for learning/practice.

You've got the order of what enumerate yields backwards. You want for index, item in enumerate(alist):
– PM 2Ring, Commented Oct 20, 2017 at 17:50

mshsayem · Accepted Answer · 2017-10-20 17:50:49Z

1

Check this:

>>> result = {}
>>> for doc_id, doc in enumerate(doc_list):
...     for word_pos, word in enumerate(doc.split()):
...         result.setdefault(word, []).append((doc_id, word_pos))
...
>>> result
{'Delivered': [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 'necessary': [(0, 3), (1, 3), (2, 3), (3, 3), (4, 3)], 'dejection': [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)], 'do': [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)], 'objection': [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)], 'prevailed': [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)], 'mr': [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]}
>>>

answered Oct 20, 2017 at 17:50

mshsayem


2 Comments

arm93 Over a year ago

Thanks, the doc.split solved it. As someone else pointed out, I was misunderstanding enumerate. Btw, I've never seen that setdefault before, how does that work?

mshsayem Over a year ago

The setdefault method checks for a key in the dictionary. If the key exists, it returns the stored value; otherwise it sets the key to the provided value and returns that.
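
A small sketch of that behaviour, in case it helps. The names index and postings are made up for the demo and are not from the answer:

```python
# Minimal illustration of dict.setdefault; variable names are hypothetical.
index = {}

# Key is missing: setdefault stores the default ([]) and returns that same list.
postings = index.setdefault("hello", [])
postings.append((0, 0))

# Key is present: the default is ignored and the stored list is returned.
index.setdefault("hello", []).append((1, 0))

print(index)  # {'hello': [(0, 0), (1, 0)]}
```

Because the returned list is the one stored in the dictionary, appending to it updates the index in place.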

Iguananaut · Accepted Answer · 2017-10-20 17:53:44Z

1

You seem to be confused about what enumerate does. The first item returned by enumerate() is the index, and the second item is the value. You seem to have it reversed.

You are further confused with your second use of enumerate():

for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index

First of all, you don't need to do alist[item]. You already have the value of that line in the index variable (again, you are perhaps confused because you have the variable names backwards). Second, you seem to think that enumerate() will split a line into individual words. It won't. Instead it will just iterate over every character in the string (I'm confused why you thought this, since you demonstrated earlier that you know how to split a string on spaces. Interesting, though).
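
To make both points concrete, here is a rough sketch. It reuses the question's function name doc_pos_index but is only one possible correction; it also assumes str.split() with no argument is acceptable, where the question used split(" "):

```python
line = "hello Delivered dejection"

# enumerate() over a string yields (index, character) pairs...
print(list(enumerate(line))[:3])      # [(0, 'h'), (1, 'e'), (2, 'l')]

# ...so split the line first to get (position, word) pairs.
print(list(enumerate(line.split())))  # [(0, 'hello'), (1, 'Delivered'), (2, 'dejection')]


def doc_pos_index(alist):
    """Sketch of a corrected version of the question's function."""
    inv_index = {}
    for doc_id, doc in enumerate(alist):           # index first, then value
        for pos, word in enumerate(doc.split()):   # word position within the line
            if word not in inv_index:
                inv_index[word] = []
            inv_index[word].append((doc_id, pos))
    return inv_index


print(doc_pos_index(["hello mr", "mr hello"]))
# {'hello': [(0, 0), (1, 1)], 'mr': [(0, 1), (1, 0)]}
```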

As an additional tip, you don't need to do this:

for word in words:
    if word not in inv_index:
        inv_index[word]=[]

First of all, since you're just initializing a dict you don't need the if statement. Just

for word in words:
    inv_index[word] = []

will do. If the word is already in the dictionary this will make an unnecessary assignment, true, but it's still an O(1) operation so there's no harm. However, you don't even need to do this. Instead you can use collections.defaultdict:

from collections import defaultdict
inv_index = defaultdict(list)

Then you can just do inv_index[word].append(...). If word is not already in inv_index, it will be added with its value initialized to an empty list. Otherwise the call just appends to the existing list.
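
Put together, a defaultdict version might look roughly like this. It is a sketch under the same assumptions as the question (one string per document, words separated by whitespace); converting back to a plain dict at the end is my own choice, not a requirement:

```python
from collections import defaultdict


def doc_pos_index(alist):
    # Missing keys are created on first access with an empty list as the value.
    inv_index = defaultdict(list)
    for doc_id, line in enumerate(alist):
        for pos, word in enumerate(line.split()):
            inv_index[word].append((doc_id, pos))
    # Return a plain dict so that later lookups cannot add keys by accident.
    return dict(inv_index)


index = doc_pos_index(["do mr", "mr do"])
print(index)  # {'do': [(0, 0), (1, 1)], 'mr': [(0, 1), (1, 0)]}
```

The conversion matters because a defaultdict creates an entry even when you only read a missing key with inv_index[word], which can quietly grow the index during queries.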

answered Oct 20, 2017 at 17:53

Iguananaut


2 Comments

arm93 Over a year ago

Thanks for pointing that out, tbh I put alist[items] by mistake but I was defo confused by enumerate. Even though it makes perfect sense now you've said it, for some reason I thought it would enable iteration over individual words!

arm93 Over a year ago

Also thanks for the extra tips. I know about defaultdict, but when I'm practicing something kinda new I like reinventing the wheel a little so I know 100% what's going on in the program. However, I didn't know defaultdict adds the word if it's not already in there.

user19262983 · Accepted Answer · 2022-06-03 09:25:10Z

-1

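# And the algorithm for the following: {term: [df, tf, {doc1: [tf, [offsets], doc2...}]]
# Note: the loop below only builds {term: [DF, ListOfDOCIDs]}.

from TextProcessing import *  # the answerer's own module (not shown); presumably provides tokenization()

InvertedIndex = {}

for i in range(len(listaDocumentos)):
    docTokens = tokenization(listaDocumentos[i], NLTK=True)
    for token in docTokens:
        if token in InvertedIndex:
            if i in InvertedIndex[token][1]:
                pass  # document i is already recorded for this token
            else:
                InvertedIndex[token][0] += 1       # one more document contains the token
                InvertedIndex[token][1].append(i)
        else:
            DF = 1
            ListOfDOCIDs = [i]
            InvertedIndex[token] = [DF, ListOfDOCIDs]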
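
For readers without the TextProcessing module, here is a self-contained sketch of a positional layout close to the one named in the header comment. Everything in it is an assumption for illustration: the function name build_index is hypothetical, plain whitespace splitting stands in for the tokenization helper, and the layout is simplified to {term: [df, {doc_id: [tf, [offsets]]}]}:

```python
def build_index(docs):
    """Map each term to [df, {doc_id: [tf, [offsets]]}] (assumed layout)."""
    index = {}
    for doc_id, doc in enumerate(docs):
        # Assumption: whitespace splitting stands in for a real tokenizer.
        for offset, token in enumerate(doc.split()):
            entry = index.setdefault(token, [0, {}])
            postings = entry[1]
            if doc_id not in postings:
                entry[0] += 1               # first sighting in this document: bump df
                postings[doc_id] = [0, []]
            postings[doc_id][0] += 1        # term frequency within this document
            postings[doc_id][1].append(offset)
    return index


index = build_index(["a b a", "b c"])
print(index["a"])  # [1, {0: [2, [0, 2]]}]
print(index["b"])  # [2, {0: [1, [1]], 1: [1, [0]]}]
```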


edited Jun 3, 2022 at 9:25

answered Jun 3, 2022 at 9:19

user19262983


1 Comment

Community Over a year ago

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
