0

I'm trying to make a simple positional index but I'm having some problems getting the correct output.

Given a list of strings (sentences), I want to use each string's position in the list as its document id, then iterate over the words in the sentence and use each word's index in the sentence as its position. Then I update a dictionary of words with a tuple of the doc id and the word's position in the doc.

Code:

main func -

def doc_pos_index(alist):
    inv_index= {}
    words = [word for line in alist for word in line.split(" ")]

    for word in words:
        if word not in inv_index:
            inv_index[word]=[]

    for item, index in enumerate(alist): # find item and it's index in list
        for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index
            if item2 in inv_index:
                inv_index[i].append(tuple(index, index2)) # if word in index update it's list with tuple of doc index and position

    return inv_index 

example list:

doc_list= [
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed',
'hello Delivered dejection necessary objection do mr prevailed'
]

desired output:

{'Delivered': [(0,1),(1,1),(2,1),(3,1),(4,1)],
'necessary': [(0,3),(1,3),(2,3),(3,3),(4,3)], 
'dejection': [(0,2),(1,2),(2,2),(3,2),(4,2)],
 etc...}

Current output:

{'Delivered': [],
'necessary': [], 
'dejection': [], 
'do': [],
'objection': [], 
'prevailed': [], 
'mr': [], 
'hello': []}

As an FYI, I do know about the collections library and NLTK, but I'm mainly doing this for learning/practice reasons.

1
  • You've got the order of what enumerate yields backwards. You want for index, item in enumerate(alist): Commented Oct 20, 2017 at 17:50

3 Answers

1

Check this:

>>> result = {}
>>> for doc_id, doc in enumerate(doc_list):
...     for word_pos, word in enumerate(doc.split()):
...         result.setdefault(word, []).append((doc_id, word_pos))
...
>>> result
{'Delivered': [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)], 'necessary': [(0, 3), (1, 3), (2, 3), (3, 3), (4, 3)], 'dejection': [(0, 2), (1, 2), (2, 2), (3, 2), (4, 2)], 'do': [(0, 5), (1, 5), (2, 5), (3, 5), (4, 5)], 'objection': [(0, 4), (1, 4), (2, 4), (3, 4), (4, 4)], 'prevailed': [(0, 7), (1, 7), (2, 7), (3, 7), (4, 7)], 'mr': [(0, 6), (1, 6), (2, 6), (3, 6), (4, 6)], 'hello': [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0)]}
>>> 

2 Comments

Thanks, the doc.split solved it. As someone else pointed out, I was misunderstanding enumerate. Btw, I've never seen that setdefault before, how does that work?
The setdefault method checks for a key in the dictionary; if the key exists it returns its value, otherwise it sets the key to the provided value and returns that.
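
To make that concrete, here is a small sketch of how setdefault behaves on a plain dict. The name postings and the sample word are made up for illustration; only dict.setdefault itself comes from the answer above.

```python
# Hypothetical example data; dict.setdefault is the only point of interest.
postings = {}

# Key is missing: setdefault stores the default ([]) under the key and
# returns it, so the append lands in the list that now lives in the dict.
postings.setdefault("hello", []).append((0, 0))

# Key is present: the existing list is returned and the default is ignored.
postings.setdefault("hello", []).append((1, 0))

print(postings)  # {'hello': [(0, 0), (1, 0)]}
```

Because the returned list is the same object that is stored in the dict, appending to it updates the index in place.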
1

You seem to be confused about what enumerate does. The first item returned by enumerate() is the index, and the second item is the value. You seem to have it reversed.
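
A quick way to check the order is to look at what enumerate() actually yields. This is only an illustrative sketch; the list docs is made-up sample data.

```python
# Hypothetical sample data for illustration.
docs = ["hello world", "goodbye world"]

# enumerate yields (index, value) pairs, in that order.
pairs = list(enumerate(docs))
print(pairs)  # [(0, 'hello world'), (1, 'goodbye world')]

# So the loop variables should be unpacked as index first, value second.
for doc_id, doc in enumerate(docs):
    print(doc_id, doc)
```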

You are further confused with your second use of enumerate():

for item2, index2 in enumerate(alist[item]): # for words in string find word and it's index

First of all, you don't need to do alist[item]. You already have the value of that line in the index variable (again, you are perhaps confused since you have the variable names backwards). Second, you seem to think that enumerate() will split a line into individual words. It won't. Instead it will just iterate over every character in the string (I'm confused why you thought this, since you demonstrated earlier that you know how to split a string on spaces; interesting, though).
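
The difference is easy to see side by side. In this sketch the variable line is a shortened, made-up version of one of the example sentences.

```python
line = "hello Delivered dejection"

# Enumerating the string itself walks over single characters.
chars = list(enumerate(line))[:3]
print(chars)  # [(0, 'h'), (1, 'e'), (2, 'l')]

# Splitting first gives one (position, word) pair per word.
words = list(enumerate(line.split()))
print(words)  # [(0, 'hello'), (1, 'Delivered'), (2, 'dejection')]
```

This is why the membership test in the original loop never matched: it was comparing against things that are not words in the dictionary.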

As an additional tip, you don't need to do this:

for word in words:
    if word not in inv_index:
        inv_index[word]=[]

First of all, since you're just initializing a dict you don't need the if statement. Just

for word in words:
    inv_index[word] = []

will do. If the word is already in the dictionary this will make an unnecessary assignment, true, but it's still an O(1) operation so there's no harm. However, you don't even need to do this. Instead you can use collections.defaultdict:

from collections import defaultdict
inv_index = defaultdict(list)

Then you can just do inv_index[word].append(...). If word is not already in inv_index, the lookup will add it and initialize its value to an empty list. Otherwise it will just append to the existing list.
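
Putting the pieces together, one possible version of the function from the question might look like the sketch below. It keeps the name doc_pos_index from the question; the parameter name docs, the shortened sample sentences, and the final conversion back to a plain dict are my own choices, not something the question requires.

```python
from collections import defaultdict


def doc_pos_index(docs):
    """Map each word to a list of (doc_id, position) tuples."""
    inv_index = defaultdict(list)
    for doc_id, doc in enumerate(docs):            # index first, then value
        for pos, word in enumerate(doc.split()):   # split into words first
            inv_index[word].append((doc_id, pos))
    # Convert to a plain dict so later lookups of unknown words
    # raise KeyError instead of silently adding empty lists.
    return dict(inv_index)


index = doc_pos_index(["hello Delivered dejection",
                       "hello Delivered dejection"])
print(index["Delivered"])  # [(0, 1), (1, 1)]
```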

2 Comments

Thanks for pointing that out, tbh I put alist[items] by mistake but I was defo confused by enumerate. Even though it makes perfect sense now you've said it, for some reason I thought it would enable iteration over individual words!
Also thanks for the extra tips. I know about defaultdict, but when I'm practicing something kinda new I like reinventing the wheel a little so I know 100% what's going on in the program. However, I didn't know defaultdict adds the word if it's not already in there.
-1

#And the algorithm for the following: {term: [df, tf, {doc1: [tf, [offsets], doc2...}]]

InvertedIndex = {}

# TextProcessing is the answerer's own module (not shown here);
# tokenization() presumably comes from it.
from TextProcessing import *

for i in range(len(listaDocumentos)):
    docTokens = tokenization(listaDocumentos[i], NLTK=True)
    for token in docTokens:
        if token in InvertedIndex:
            # Count each document only once per term.
            if i not in InvertedIndex[token][1]:
                InvertedIndex[token][0] += 1
                InvertedIndex[token][1].append(i)
        else:
            DF = 1
            ListOfDOCIDs = [i]
            InvertedIndex[token] = [DF, ListOfDOCIDs]

Output

1 Comment

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center.
