text processing in python

Question

I have this code that reads a file of 5000 words, one word per line. parsing is my variable; in this case it equals "economist". If a word from the data file occurs within parsing, it is appended to the output list.

The problem: why are the words 'on' and 'no' appended repeatedly? This happens with some other phrases too, but not with all of them. The words 'on' and 'no' appear only once each in the data file.

Using set removes the repeats, but some words legitimately occur more than once in the phrase, so I lose them.

My code for reading the file into data:

# Read one word per line; the with-block closes the file when done
with open("words.txt", 'r') as f:
    data = [line.strip() for line in f]

output = []
for each in data:
    if parsing != "" and each in parsing:
        output.append(each)

Samples:

phrase = economist
sortedout = ['economist', 'on', 'no', 'on', 'no', 'no', 'no', 'no']

and

phrase = timesonline  # with this one 'in' gets repeated and not no
sortedout = ['online', 'online', 'time', 'line', 'line', 'son', 'in', 'on', 'so', 'me', 'in', 'on', 'so', 'in']

It is a HackerRank challenge. Here is the Data File, which is supposed to be on their local drive, and the Challenge.

When I do this [d for d in data if d == "on" ] it returns more than one 'on' and it should not.

The little bit of code you posted looks fine. It will be difficult to investigate further without a minimal reproducible example.
– Kevin, Commented Jan 5, 2016 at 13:55
What is in your data? A list with all the words in the document? Seems so, and it seems you have the other words, in that order, in the text.
– jsbueno, Commented Jan 5, 2016 at 14:03
For the record, you could simplify the code to: output = [d for d in data if d in parsing] if parsing else [], which reduces it to a filtering list comprehension and avoids all the work when parsing is empty (your parsing != "" test makes the loop do nothing in that case anyway). Or, to avoid cramming everything onto one line: output = [] then if parsing: output.extend(d for d in data if d in parsing). By testing just parsing, rather than parsing != "" or parsing != [], you can switch the type of parsing without needing to change the test; empty sequences are falsy, non-empty ones are truthy.
– ShadowRanger, Commented Jan 5, 2016 at 14:23

SiHa · Accepted Answer · 2016-01-05 18:46:12Z

1

You are checking whether a string is in another string:

if parsing != "" and each in parsing:

...so if parsing is equal to economist, then your statement evaluates to True for economist, no and on, because these are all substrings of 'economist'.

>>> 'on' in 'economist'
True

If you want to match entire strings, you can check the item against a list of strings:

>>> 'on' in ['economist']
False

So, re-writing your code (using a list with more than one element, for clarity):

>>> data = ['economist', 'blah', 'on', 'engineer', 'no', 'gin', 'economist']
>>>
>>> parsing = ['economist', 'engineer']
>>> output = []
>>> for each in data:
...     if parsing != [] and each in parsing:
...         output.append(each)
...
>>> print(output)
['economist', 'engineer', 'economist']
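
The same exact-match filter can also be written as a small function. This is only a sketch: filter_exact is a name I made up for illustration, and it assumes the targets are passed as a list (or other collection) of whole words, not as a single string.

```python
def filter_exact(data, targets):
    """Return the items of data that exactly equal one of targets, in order."""
    # Assumption: targets is a collection of whole words, not a single string.
    # A set gives fast exact-match lookups; no substring matching happens here.
    wanted = set(targets)
    return [word for word in data if word in wanted]


data = ['economist', 'blah', 'on', 'engineer', 'no', 'gin', 'economist']
output = filter_exact(data, ['economist', 'engineer'])
print(output)  # ['economist', 'engineer', 'economist']
```

Note that 'on' and 'no' are no longer picked up, while the duplicate 'economist' in data is still kept, because nothing here removes repeats.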

Edit:
I agree that the text in the challenge you link to in the comments implies that the words in the list are unique, but they are not. I've just done a very simple manual text search and counted two occurrences of on, five of no, and one of economist, just like your results.

Tip: If your code isn't generating the expected results from your source data, check that your assumptions about the source data are correct :)
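
One way to check that assumption in code, instead of searching by hand, is to count the entries. This is a minimal sketch: the sample list below is a hypothetical stand-in for the real word list (it is not taken from the challenge file), and it relies on dicts keeping insertion order, which holds on Python 3.7+.

```python
from collections import Counter

# Hypothetical stand-in for the list read from words.txt
sample = ['economist', 'on', 'no', 'on', 'no', 'time']

# Count every word and keep only those that occur more than once
counts = Counter(sample)
duplicates = {word: n for word, n in counts.items() if n > 1}
print(duplicates)  # {'on': 2, 'no': 2}

# Drop the repeats from the word list itself, keeping first-seen order
unique_words = list(dict.fromkeys(sample))
print(unique_words)  # ['economist', 'on', 'no', 'time']
```

De-duplicating the word list, rather than the output, should avoid the problem you had with set: repeats that come from the data file disappear, and nothing is done to the phrase itself.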

edited Jan 5, 2016 at 18:46

answered Jan 5, 2016 at 14:15

SiHa


6 Comments

kevbuntu Over a year ago

My data = ['market', 'watch'] and parsing = ['marketwatch']; I am trying to make output = ['market', 'watch']. Of course, data is 5000 words, and I am trying to parse joined words into single words. Thanks for all the help.

kevbuntu Over a year ago

Also, I understand why it finds 'on' and 'no', but why more than once?

SiHa Over a year ago

Can you post a link to the data file? (Edit it into your question, rather than posting it as a comment here.) Also the code that reads the file into data.

kevbuntu Over a year ago

It is a HackerRank challenge. The data file is s3.amazonaws.com/hr-testcases/479/assets/words.txt, which is supposed to be on their local drive, and the challenge is hackerrank.com/challenges/url-hashtag-segmentation. When I do this [d for d in data if d == "on" ] it returns more than one 'on' and it should not.

kevbuntu Over a year ago

Thanks. Then I guess their file is corrupted, and I am not losing my mind.
