1

I'm trying to create an algorithm that goes through a list of strings, joins strings together if they meet certain criteria, then skips ahead by the number of strings it joined, to avoid double-counting parts of the same joined string.

I understand that i = i + x or i += x inside a for loop doesn't change how far each iteration advances, so I'm looking for an alternative way to skip a variable number of iterations.

Background: I'm trying to create a named entity recognition algorithm for use on news articles. I tokenise the text ('Prime Minister Jacinda Ardern is from New Zealand') into ('Prime','Minister','Jacinda','Ardern','is'...) and run the NLTK POS tagger over it, giving: ...(('Jacinda','NNP'),('Ardern','NNP'),('is','VBZ')... I then combine words when subsequent words are also 'NNP' (proper nouns).

The goal is to count 'Prime Minister Jacinda Ardern' as 1 string rather than 4, then to skip ahead by that many words so that the next strings are not 'Minister Jacinda Ardern' and then 'Jacinda Ardern'.

Context: 'text' is a list of (word, tag) pairs created by tokenising and then POS tagging my article, in the format: [...('She', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('roughly', 'RB'), ('25-minute', 'JJ'), ('meeting', 'NN')...] 'NNP' = proper noun, i.e. the names of places, people, organisations etc.

for (i) in range(len(text)):

    print(i)

    #initialising wordcounter as a variable
    wordcounter = 0

    # if text[i] is a Proper Noun, make namedEnt = the word. 
    # then increase wordcounter by 1
    if text[i][1] == 'NNP':
        namedEnt = text[i][0]
        wordcounter +=1

        # while the next word in text is also a Proper Noun,
        # increase wordcounter by 1. Initialise J as = 1
        while text[i + wordcounter][1] == 'NNP':
            wordcounter +=1
            j = 1


            # While J is less than wordcounter, join text[i+j] to 
            # namedEnt. Increase J by 1. When that is no longer
            # the case append namedEnt to a namedEntity list
            while j < wordcounter:
                namedEnt = ' '.join([namedEnt,text[i+j][0]])
                j += 1
            InitialNamedEntity.append(namedEnt)

        i += wordcounter

If I print(i) at the start of the loop, it goes up by 1 at a time. When I print the Counter of the NamedEntity list made up of namedEnts, the results are as follows: (...'New Zealand': 7, 'Zealand': 7, 'United': 4, 'Prime Minister Minister Jacinda Minister Jacinda Ardern': 3...)

So I'm not only getting double counts, as in 'New Zealand' and 'Zealand', but I'm also getting wacky results like 'Prime Minister Minister Jacinda Minister Jacinda Ardern'.

The results I would like are ('New Zealand':7, 'United States':4,'Prime Minister Jacinda Ardern':3)

Any help would be greatly appreciated. Cheers

    Just use a while loop here Commented Oct 21, 2019 at 3:27

3 Answers

1

Don't use a for loop if you need to adjust how i is incremented, as it always sets it to the next value in the range. Use a while loop:

i = 0
while i < len(text):
    ...
    i += wordcounter

1

range() creates an iterable object. The for...in construct obtains an iterator from it and calls next on that iterator; each call returns the next value in the sequence. So your i variable is not a counter that the loop increments, it's just the latest value produced by the iterator. Modifying i has no effect on the loop, because it is overwritten when the next value is retrieved from the sequence.
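As a rough illustration of the paragraph above, a for loop can be spelled out by hand with iter() and next(). This is only a sketch of the behaviour, not how the interpreter is literally implemented; the names values, seen and seen_for are made up for the example.

```python
# Hand-written equivalent of: for i in range(3): ...
values = iter(range(3))      # the loop asks the iterable for an iterator once
seen = []
while True:
    try:
        i = next(values)     # i is rebound to the next value on every pass
    except StopIteration:
        break                # the iterator is exhausted, so the loop ends
    seen.append(i)
    i += 10                  # changes only the name i, not the iterator's state

# The same body in a real for loop behaves identically.
seen_for = []
for i in range(3):
    seen_for.append(i)
    i += 10

print(seen)      # [0, 1, 2]
print(seen_for)  # [0, 1, 2]
```

In both versions the assignment to i is discarded at the top of the next pass, which is why adding to i cannot skip iterations.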

This is very different from a loop like for (int i = 0; i < 5; i++) {} in C, where there is no concept of a sequence; that just checks whether i is less than five before executing the block.

Compare it to this:

for i in {2,-1,-4}:
  print(i)
  i = i + 2

Perhaps here it is more obvious that setting i will have no effect.

But you can write that C-like construct in Python too, as follows:

i = 0
while i < 6:
  print(i)
  if i == 2:
    i = i + 2
  else:
    i = i + 1

This prints

0
1
2
4
5

See how it didn't output 3? When it got to i == 2, it added 2 so it skipped over 3. You can do something similar in your code.

(these examples were Python 3)

2 Comments

I think you mean "constructor". And thank you, iterable is the correct term. I'll edit my answer. For any wanting to read more on range, here is the documentation: docs.python.org/3/library/stdtypes.html#typesseq
Specifically, it's a generator.
0

Thanks for the help everyone. I used the while loop shown by Barmar:

i = 0
while i < len(text):
    ...
    i += wordcounter

and at the end used an if else statement:

if wordcounter > 0:
    i += wordcounter
else:
    i += 1
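Putting the while loop and the if/else step together, a minimal sketch of the whole merge might look like the following. The function name merge_proper_nouns, the tag parameter and the sample data are assumptions made for illustration, and the bounds check in the inner loop is an addition that keeps the scan from running past the end of the list.

```python
def merge_proper_nouns(tagged, tag="NNP"):
    """Join each run of consecutive `tag` tokens into a single string."""
    entities = []
    i = 0
    while i < len(tagged):
        # Count how many consecutive tokens starting at i carry the tag.
        # The bounds check stops the scan at the end of the list.
        wordcounter = 0
        while i + wordcounter < len(tagged) and tagged[i + wordcounter][1] == tag:
            wordcounter += 1

        if wordcounter > 0:
            # Join the whole run once, then jump past it.
            entities.append(" ".join(word for word, _ in tagged[i:i + wordcounter]))
            i += wordcounter
        else:
            # Not a proper noun: step by one so the loop always makes progress.
            i += 1
    return entities


# Hypothetical sample in the same (word, tag) format as the question.
text = [
    ("Prime", "NNP"), ("Minister", "NNP"), ("Jacinda", "NNP"), ("Ardern", "NNP"),
    ("is", "VBZ"), ("from", "IN"), ("New", "NNP"), ("Zealand", "NNP"),
]
named_entities = merge_proper_nouns(text)
print(named_entities)  # ['Prime Minister Jacinda Ardern', 'New Zealand']
```

Joining the slice once per run, instead of re-joining inside the counting loop, is what avoids results like 'Prime Minister Minister Jacinda Minister Jacinda Ardern'. The resulting list can be passed to collections.Counter as in the question.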
