I'm trying to create an algorithm which goes through a list of strings, joins strings together if they meet certain criteria, then skips ahead by the number of strings it joined, to avoid double counting sections of the same joined string.
I understand that i = i + x or i += x inside a for loop doesn't change the amount each loop iterates by, so I'm looking for an alternative way to skip a variable number of iterations.
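To make that behaviour concrete, here is a minimal sketch (the names `seen_for` and `seen_while` and the skip amount of 2 are made up for the example). In a `for` loop over `range()`, `i` is rebound from the range on every pass, so changing it in the body has no lasting effect; a `while` loop with a manually managed index is one alternative that does allow skipping.

```python
# Reassigning i inside a for loop: the next value still comes from range().
seen_for = []
for i in range(6):
    seen_for.append(i)
    i += 2  # overwritten at the start of the next iteration

# A while loop keeps whatever value is assigned to the index.
seen_while = []
i = 0
while i < 6:
    seen_while.append(i)
    i += 2  # hypothetical skip amount, chosen for illustration

print(seen_for)    # [0, 1, 2, 3, 4, 5]
print(seen_while)  # [0, 2, 4]
```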
Background: I'm trying to create a named entity recognition algorithm for use on news articles. I tokenise the text ('Prime Minister Jacinda Ardern is from New Zealand') into ('Prime','Minister','Jacinda','Ardern','is'...) and run the NLTK POS tagger over it, giving ...(('Jacinda','NNP'),('Ardern','NNP'),('is','VBZ')..., then combine words when the words that follow are also 'NNP' (proper nouns).
The goal is to count 'Prime Minister Jacinda Ardern' as 1 string as opposed to 4, then to advance the loop by that many words so that the next strings are not 'Minister Jacinda Ardern' and then 'Jacinda Ardern'.
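As an illustration of that goal only, the sketch below merges runs of consecutive 'NNP' words using `itertools.groupby` from the standard library. The `tagged` list is hand-written sample data in the (word, tag) format, and `merge_proper_nouns` is a hypothetical helper name; this is one possible approach under those assumptions, not necessarily the best fit for a real article.

```python
from itertools import groupby

# Hand-written sample in (word, tag) format; the tags are assumed, not tagger output.
tagged = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Jacinda', 'NNP'),
          ('Ardern', 'NNP'), ('is', 'VBZ'), ('from', 'IN'),
          ('New', 'NNP'), ('Zealand', 'NNP')]

def merge_proper_nouns(tagged_words):
    """Join each run of consecutive 'NNP' words into a single string."""
    entities = []
    # groupby yields one group per run of adjacent items sharing the same key.
    for is_nnp, run in groupby(tagged_words, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            entities.append(' '.join(word for word, _tag in run))
    return entities

print(merge_proper_nouns(tagged))
# ['Prime Minister Jacinda Ardern', 'New Zealand']
```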
Context:
'text' is a list of (word, tag) pairs created by tokenising and then POS tagging my article, in the format: [...('She', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('roughly', 'RB'), ('25-minute', 'JJ'), ('meeting', 'NN')...]
'NNP' = proper noun or the names of places/people/organisations etc.
for (i) in range(len(text)):
    print(i)
    # initialising wordcounter as a variable
    wordcounter = 0
    # if text[i] is a Proper Noun, make namedEnt = the word.
    # then increase wordcounter by 1
    if text[i][1] == 'NNP':
        namedEnt = text[i][0]
        wordcounter += 1
        # while the next word in text is also a Proper Noun,
        # increase wordcounter by 1. Initialise J as = 1
        while text[i + wordcounter][1] == 'NNP':
            wordcounter += 1
            j = 1
            # While J is less than wordcounter, join text[i+j] to
            # namedEnt. Increase J by 1. When that is no longer
            # the case append namedEnt to a namedEntity list
            while j < wordcounter:
                namedEnt = ' '.join([namedEnt, text[i+j][0]])
                j += 1
        InitialNamedEntity.append(namedEnt)
    i += wordcounter
If I print(i) at the start of the loop, it goes up by 1 at a time. When I print the Counter of the NamedEntity list made up of the namedEnts, the results are as follows:
(...'New Zealand': 7, 'Zealand': 7, 'United': 4, 'Prime Minister Minister Jacinda Minister Jacinda Ardern': 3...)
So I'm not only getting double counts, as in 'New Zealand' and 'Zealand', but I'm also getting wacky results like 'Prime Minister Minister Jacinda Minister Jacinda Ardern'.
The results I would like would be ('New Zealand':7, 'United States':4,'Prime Minister Jacinda Ardern':3)
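For reference, here is a minimal sketch of the kind of index-skipping loop that would produce counts in that shape, assuming a `while` loop is acceptable in place of `for`. The function name `extract_entities` and the `sample` data are hypothetical. The bounds check `j < n` is an addition: without one, indexing past the end of the list raises an IndexError when the last token is 'NNP'.

```python
from collections import Counter

def extract_entities(tagged_words):
    """Collect runs of consecutive 'NNP' words, skipping past each run once joined."""
    entities = []
    n = len(tagged_words)
    i = 0
    while i < n:
        if tagged_words[i][1] == 'NNP':
            # Find the end of the run of proper nouns starting at i.
            j = i + 1
            while j < n and tagged_words[j][1] == 'NNP':
                j += 1
            # Join the whole run once, then jump the index past it.
            entities.append(' '.join(word for word, _tag in tagged_words[i:j]))
            i = j
        else:
            i += 1
    return entities

# Hand-written sample data; the tags are assumed for illustration.
sample = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Jacinda', 'NNP'),
          ('Ardern', 'NNP'), ('is', 'VBZ'), ('from', 'IN'),
          ('New', 'NNP'), ('Zealand', 'NNP')]

counts = Counter(extract_entities(sample))
print(counts)
```

On this sample each of 'Prime Minister Jacinda Ardern' and 'New Zealand' is counted once, with no partial strings such as 'Zealand'.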
Any help would be greatly appreciated. Cheers