Boolean search text file in Python

Question

I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:

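import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # A header such as "1 of 32 DOCUMENTS" starts a new article
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            # Skip the empty chunk that precedes the very first header
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)

# Store the last article, which the loop itself never appends
if current:
    sections.append("".join(current))

print(len(sections))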

So now the articles are stored in the list sections.

The next thing I want to do is split the articles into 2 groups. Articles containing the words economy OR economic AND uncertainty OR uncertain AND tax OR policy should be identified with the number 1.

Articles containing the words economy OR economic AND uncertain OR uncertainty AND regulation OR spending should be identified with the number 2. This is what I have tried so far:

for i in range(len(sections)):
group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])

Nevertheless, it does not seem to work. Any ideas why?

Describe what the expected output of "identify them with the number 'x'" looks like to you.
– OneCricketeer, Commented Jan 26, 2016 at 11:19
Well, creating a group with all the articles that fulfil certain criteria, for example group1 = sections[1,3,7,9] and group2 = sections[2,4,10,27].
– Economist_Ayahuasca, Commented Jan 26, 2016 at 11:36
Okay, I was thinking more of a dictionary: {"1": [1,3,7,9], "2": [2,4,10,27]}
– OneCricketeer, Commented Jan 26, 2016 at 11:47
Either works. As I said, I am new to this and I do not know which one might be more straightforward :)
– Economist_Ayahuasca, Commented Jan 26, 2016 at 11:53
@AndresAzqueta you should read the Regular Expression HOWTO and try the regular expressions on texts using regex101. The latter regular expression does not even compile.
– Antti Haapala, Commented Jan 26, 2016 at 17:27
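
As a rough illustration of why the two patterns in the question fail, here is a minimal sketch on made-up strings (not the real articles). Square brackets define a character class, meaning a set of single characters, not a list of alternative words. A pattern that begins with + also has nothing to repeat, so it is rejected when compiled. The doubled comma in the two calls is a separate Python syntax error.

```python
import re

# [tax|policy] is a character class: it matches ONE character out of
# t, a, x, |, p, o, l, i, c, y, so unrelated text matches too.
char_class_hit = re.search(r"[tax|policy]", "xyz")
print(char_class_hit.group())

# A non-capturing group with | expresses "tax OR policy" as whole alternatives.
group_hit = re.search(r"(?:tax|policy)", "xyz")
print(group_hit)

# A leading + quantifier has nothing to repeat, so compilation fails.
try:
    re.compile(r"+[economic|economy]")
    leading_plus_compiles = True
except re.error:
    leading_plus_compiles = False
print(leading_plus_compiles)
```

The character class happily matches the single letter "x", while the grouped alternation finds nothing in the same string.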

glibdud · Accepted Answer · 2016-01-26 13:34:20Z

2

It's a bit wordy, but you can get away without using regular expressions here, for example:

# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
    ('uncertainty' in s or 'uncertain' in s) and
    ('tax' in s or 'policy' in s)):
    do_stuff()

answered Jan 26, 2016 at 13:34


5 Comments

Antti Haapala Over a year ago

this does not consider word boundaries at all

glibdud Over a year ago

@AnttiHaapala Correct.

Economist_Ayahuasca Over a year ago

what do you mean it does not consider word boundaries at all?

glibdud Over a year ago

@AndresAzqueta This solution would match not only if the section contains "tax", but also, for example, "ataxia". In other words, it's not matching whole words, but just checking to make sure those particular sequences of characters exist somewhere in the section. If that's an important distinction for you, you'll need to look further at regexes.

Economist_Ayahuasca Over a year ago

great, thanks for the tip. I will check regex and implement a few changes to deal with the problem. Cheers,
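
The word-boundary point from the comments above can be checked with a small sketch. The sample sentence is invented for illustration; it only contrasts a substring test with a \b-anchored regex.

```python
import re

text = "The patient was diagnosed with ataxia."

# Substring test: "tax" occurs inside "ataxia", so this is True.
substring_match = "tax" in text.lower()

# \b anchors the pattern at word boundaries, so only the standalone
# word "tax" would count; this sentence does not contain it.
whole_word_match = re.search(r"\btax\b", text, re.I) is not None

print(substring_match, whole_word_match)
```

Whether the looser substring behaviour matters depends on the articles; for short keywords such as "tax" false positives are more likely than for "uncertainty".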

Antti Haapala · Accepted Answer · 2016-01-26 20:19:59Z

2

It is possible to write this as a single regular expression, but it is a bit tricky. For each AND you'd use a zero-width lookahead assertion (?=...), and for each OR you'd use a branch (|). Also, we'd have to use \b for a word boundary. We'd use re.match instead of re.search, since every lookahead already scans ahead from the start of the text and only needs to be tried there once.

belongs_to_group1 = bool(re.match(
     r'(?=.*\b(?:economic|economy)\b)'
     r'(?=.*\b(?:uncertain|uncertainty)\b)'
     r'(?=.*\b(?:tax|policy)\b)',
     # re.S lets "." cross line breaks, since an article spans many lines
     text, re.I | re.S))

The result is not very readable.
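
One way to keep the lookahead approach manageable is to generate the pattern from the word groups. The following is only a sketch: build_all_groups_pattern is a hypothetical helper name, and the sample strings are invented. It adds re.S so that . can cross line breaks, which matters when an article spans several lines.

```python
import re

def build_all_groups_pattern(groups):
    """Compile a pattern that matches when every group has a whole-word hit."""
    lookaheads = []
    for words in groups:
        alternatives = "|".join(re.escape(w) for w in words)
        # One lookahead per AND term; the alternation inside is the OR.
        lookaheads.append(r"(?=.*\b(?:%s)\b)" % alternatives)
    # re.I ignores case; re.S lets "." cross newlines in multi-line articles.
    return re.compile("".join(lookaheads), re.I | re.S)

group1 = build_all_groups_pattern([
    ("economic", "economy"),
    ("uncertain", "uncertainty"),
    ("tax", "policy"),
])

sample_hit = "Economic outlook:\nuncertainty remains over tax reform."
sample_miss = "The economy grew, and the tax code was unchanged."

print(bool(group1.match(sample_hit)), bool(group1.match(sample_miss)))
```

The second sample fails because no word from the uncertain/uncertainty group appears, even though the other two groups are satisfied.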

A more fruitful approach would be to find all the words and put them into a set:

words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
    and ('economic' in words or 'economy' in words)
    and ('tax' in words or 'policy' in words))
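
Putting the set-based test together with the dictionary output discussed in the question's comments could look roughly like this. It is a sketch under assumptions: GROUP_RULES and classify_sections are names made up here, the sample articles are stand-ins for the real sections list, and integer keys are used instead of the string keys suggested in the comments.

```python
import re

# Each rule is a list of OR-sets that are ANDed together, i.e.
# (economy OR economic) AND (uncertain OR uncertainty) AND (last pair).
GROUP_RULES = {
    1: [{"economy", "economic"}, {"uncertain", "uncertainty"}, {"tax", "policy"}],
    2: [{"economy", "economic"}, {"uncertain", "uncertainty"}, {"regulation", "spending"}],
}

def classify_sections(sections):
    """Map each group number to the indexes of the sections matching its rule."""
    groups = {number: [] for number in GROUP_RULES}
    for index, text in enumerate(sections):
        words = set(re.findall(r"\w+", text.lower()))
        for number, rule in GROUP_RULES.items():
            # Every OR-set must share at least one word with the article.
            if all(words & alternatives for alternatives in rule):
                groups[number].append(index)
    return groups

# Stand-in articles; the real list would come from splitting the file.
sample_sections = [
    "Economic uncertainty over tax policy persists.",
    "The economy faces uncertain spending plans.",
    "A report on local football results.",
    "Economy: uncertainty about regulation and tax.",
]
result = classify_sections(sample_sections)
print(result)
```

Note that an article can land in both groups, as the last sample does; whether that is acceptable is a decision for the asker.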

edited Jan 26, 2016 at 20:19

answered Jan 26, 2016 at 17:39


2 Comments

OneCricketeer Over a year ago

Could you shorten-up uncertain|uncertainty to uncertain(?:ty)?? And economic|economy to econom(?:ic|y)

Antti Haapala Over a year ago

I did, but I reverted that because it made it even less readable

user3088440 · Accepted Answer · 2016-01-26 14:10:55Z

-1

You can use re.search to find those words. Then you can use if statements and Python's and and or operators for the logic, and store groups one and two as two lists holding the section index numbers.

One thing you might want to note is that your logic may need brackets.

By

economy OR economic AND uncertainty OR uncertain AND tax OR policy

I assume you mean

(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)

which is different to (for example)

economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy

EDIT1: Python will evaluate your statement without brackets from left to right, i.e.:

( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)

Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)

EDIT2: As pointed out in the comments, EDIT1 is incorrect. You would still need brackets to get case 1 (every pair bracketed); without them you get case 2 instead, where each AND is grouped first. Case 3, the left-to-right reading in EDIT1, is a load of rubbish.

edited Jan 26, 2016 at 14:10

answered Jan 26, 2016 at 12:07


5 Comments

Antti Haapala Over a year ago

Python absolutely will not evaluate and and or left to right like that. Instead ands are always evaluated first, and ors afterwards

user3088440 Over a year ago

@antti Huh, that's interesting. I'd assumed it would be evaluated like one would mathematically. Does and being evaluated first mean that "economy OR economic AND uncertainty OR uncertain" becomes "economy OR (economic AND uncertainty) OR uncertain", or does it become "(economy OR economic) AND (uncertainty OR uncertain)"?

bereal Over a year ago

@user3088440 and has higher precedence than or, in Python, most other languages, and in math.

AdmiralWen Over a year ago

@user3088440: It becomes your first case. Without any brackets, and's come before or's. This IS mathematically, by the way, as multiplication/division comes before addition/subtraction.

user3088440 Over a year ago

Huh, had no idea, I've just always bracketed anything ambiguous. Good to know!
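
The precedence rule settled in the comments above can be verified by brute force. This sketch uses plain booleans as stand-ins for "the word occurs in the article"; the function names are made up for illustration.

```python
from itertools import product

def unbracketed(economy, economic, uncertainty, uncertain, tax, policy):
    # No brackets: Python applies its own operator precedence.
    return economy or economic and uncertainty or uncertain and tax or policy

def and_first(economy, economic, uncertainty, uncertain, tax, policy):
    # Explicit version of "and binds tighter than or".
    return economy or (economic and uncertainty) or (uncertain and tax) or policy

def intended(economy, economic, uncertainty, uncertain, tax, policy):
    # What the question presumably means.
    return (economy or economic) and (uncertainty or uncertain) and (tax or policy)

# Try all 64 combinations of the six flags.
combos = list(product([False, True], repeat=6))
same_as_and_first = all(unbracketed(*c) == and_first(*c) for c in combos)
same_as_intended = all(unbracketed(*c) == intended(*c) for c in combos)
print(same_as_and_first, same_as_intended)
```

The unbracketed expression agrees with the and-first grouping on every combination and disagrees with the intended grouping on some, for instance when only policy is true.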
