3

I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:

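import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        # A header such as "1 of 32 DOCUMENTS" starts a new article
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            if current:
                sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
# Store the final article, which has no header after it
if current:
    sections.append("".join(current))

print(len(sections))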

So now the articles are stored in the list sections.

The next thing I want to do is split the articles into 2 groups. Articles containing the words economy OR economic AND uncertainty OR uncertain AND tax OR policy should be identified with the number 1.

Articles containing the words economy OR economic AND uncertain OR uncertainty AND regulation OR spending should be identified with the number 2. This is what I have tried so far:

for i in range(len(sections)):
group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])

Nevertheless, it does not seem to work. Any ideas why?

9
  • Describe what the expected output of "identify them with the number 'x'" looks like to you. Commented Jan 26, 2016 at 11:19
  • Well, creating a group with all the articles that fulfil certain criteria: for example group1 = sections[1,3,7,9] and group2 = sections[2,4,10,27]. Commented Jan 26, 2016 at 11:36
  • Okay, I was thinking more of a dictionary {"1": [1,3,7,9], "2": [2,4,10,27]} Commented Jan 26, 2016 at 11:47
  • Either works. As I said, I am new to this and I do not know which one might be more straightforward :) Commented Jan 26, 2016 at 11:53
  • 1
    @AndresAzqueta you should read the Regular Expression HOWTO and try the regular expressions on texts using regex101. The latter regular expression does not even compile. Commented Jan 26, 2016 at 17:27
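
To make the last comment concrete, here is a small sketch of the two problems in the question's pattern: the leading + has nothing to repeat, and square brackets define a character class rather than a choice between words. The sample strings are made up for illustration.

```python
import re

# The question's pattern starts with "+", which has nothing to repeat,
# so compiling it raises re.error.
try:
    re.compile(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]")
    compiled = True
except re.error:
    compiled = False

# Square brackets define a character class: [tax|policy] matches ONE
# character out of t, a, x, |, p, o, l, i, c, y, not a whole word.
single_char = re.search(r"[tax|policy]", "zzz y zzz").group()

# A group with "|" expresses a choice between whole words.
whole_word = re.search(r"(?:tax|policy)", "fiscal policy").group()

print(compiled, repr(single_char), repr(whole_word))
```

So even after dropping the leading +, the bracketed version would only ever test for single characters.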

3 Answers

2

It's a bit wordy, but you can get away without using regular expressions here, for example:

# Take a lowercase copy for comparisons
s = sections[i].lower()
if (('economic' in s or 'economy' in s) and
    ('uncertainty' in s or 'uncertain' in s) and
    ('tax' in s or 'policy' in s)):
    do_stuff()

5 Comments

this does not consider word boundaries at all
@AnttiHaapala Correct.
what do you mean it does not consider word boundaries at all?
@AndresAzqueta This solution would match not only if the section contains "tax", but also, for example, "ataxia". In other words, it's not matching whole words, but just checking to make sure those particular sequences of characters exist somewhere in the section. If that's an important distinction for you, you'll need to look further at regexes.
great, thanks for the tip. I will check regex and implement a few changes to deal with the problem. Cheers,
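
The "ataxia" point from the comments can be checked with a short sketch. The sample sentence is invented; the only assumption is that a whole-word match is what is wanted.

```python
import re

text = "The patient was diagnosed with ataxia."

# Substring test: "tax" is found inside "ataxia".
substring_hit = "tax" in text.lower()

# Whole-word test: \b requires a word boundary on both sides of "tax".
word_hit = re.search(r"\btax\b", text, re.I) is not None

print(substring_hit, word_hit)
```

The substring test reports a hit here while the \b version does not, which is the distinction being discussed.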
2

It is possible to write this as a single regular expression, but it is a bit tricky. For each AND you'd use a zero-width lookahead assertion (?=...), and for each OR you'd use an alternation with |. We'd also need \b to match word boundaries, and re.match instead of re.search, because the lookaheads only need to be tried once, at the start of the text.

belongs_to_group1 = bool(re.match(
     r'(?=.*\b(?:economic|economy)\b)'
     r'(?=.*\b(?:uncertain|uncertainty)\b)'
     r'(?=.*\b(?:tax|policy)\b)',
     text, re.I | re.S))  # re.S lets . match newlines, so .* spans the whole article

So it is not very readable.
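
One detail worth checking with a lookahead pattern like this: . does not match a newline unless re.S (re.DOTALL) is passed, and an article read from a file spans many lines. A rough sketch of the difference follows; the helper name in_group1 and the sample article are made up for illustration.

```python
import re

PATTERN = (r'(?=.*\b(?:economic|economy)\b)'
           r'(?=.*\b(?:uncertain|uncertainty)\b)'
           r'(?=.*\b(?:tax|policy)\b)')

# With re.S, ".*" can cross line breaks; without it, each lookahead
# only inspects the first line of the text.
WITH_DOTALL = re.compile(PATTERN, re.I | re.S)
WITHOUT_DOTALL = re.compile(PATTERN, re.I)

def in_group1(text):
    # Hypothetical helper: True if all three word pairs are represented.
    return WITH_DOTALL.match(text) is not None

article = ("The economy slowed.\n"
           "Analysts are uncertain.\n"
           "A new tax was proposed.\n")

multi_line_ok = in_group1(article)
first_line_only = WITHOUT_DOTALL.match(article) is not None
print(multi_line_ok, first_line_only)
```

If the three words sit on different lines, only the re.S version finds them.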

A more fruitful approach would be to find all the words and put them into a set:

words = set(re.findall(r'\w+', text.lower()))
belongs_to_group1 = (('uncertainty' in words or 'uncertain' in words)
    and ('economic' in words or 'economy' in words)
    and ('tax' in words or 'policy' in words))
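
Extending the set idea to every article and both groups could look roughly like this. It is only a sketch: the function names, the integer dictionary keys, and the sample sections are assumptions, and the real sections list would come from the question's file-reading loop.

```python
import re

def word_set(text):
    # All lowercase "words" (runs of letters, digits, underscores) in the text.
    return set(re.findall(r'\w+', text.lower()))

def has_any(words, *candidates):
    # True if at least one candidate is present as a whole word.
    return any(c in words for c in candidates)

def classify(sections):
    # Map group number -> list of indices into sections.
    groups = {1: [], 2: []}
    for i, text in enumerate(sections):
        words = word_set(text)
        # Both groups require the economy and uncertainty terms.
        if not (has_any(words, 'economy', 'economic')
                and has_any(words, 'uncertainty', 'uncertain')):
            continue
        if has_any(words, 'tax', 'policy'):
            groups[1].append(i)
        if has_any(words, 'regulation', 'spending'):
            groups[2].append(i)
    return groups

# Made-up stand-in for the real list of articles.
sections = [
    "Economic uncertainty over tax policy.",
    "The economy is uncertain; spending fell.",
    "Ataxia is a neurological sign.",
    "Uncertain economy: new regulation and a tax cut.",
]
groups = classify(sections)
print(groups)
```

Note that in this sketch an article can land in both groups, as the last sample does. If the groups should be exclusive, the second if would need to become an elif or similar.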

2 Comments

Could you shorten-up uncertain|uncertainty to uncertain(?:ty)?? And economic|economy to econom(?:ic|y)
I did, but I reverted that because it made it even less readable
-1

You can use re.search to find those words. Then you can use if statements with Python's and and or operators for the logic, and store groups one and two as two lists holding the section index numbers.

One thing you might want to note is that your logic may need brackets.

By

economy OR economic AND uncertainty OR uncertain AND tax OR policy

I assume you mean

(economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy)

which is different to (for example)

economy OR (economic AND uncertainty) OR (uncertain AND tax) OR policy

EDIT1: Python will evaluate your statement without brackets from left to right, i.e.:

( ( ( ( (economy OR economic) AND uncertainty) OR uncertain) AND tax) OR policy)

Which I imagine is not what you want (e.g. the above evaluates true if it includes the word policy but none of the others)

EDIT2: As pointed out in the comments, EDIT1 is incorrect. You would still need brackets to get case 1, though; without them you get case 2 instead (and case 3 is a load of rubbish).

5 Comments

Python absolutely will not evaluate and and or left to right like that. Instead ands are always evaluated first, and ors afterwards
@antti Huh, that's interesting. I'd assumed it would be evaluated like one would mathematically. Does and being evaluated first mean that "economy OR economic AND uncertainty OR uncertain" becomes "economy OR (economic AND uncertainty) OR uncertain", or does it become "(economy OR economic) AND (uncertainty OR uncertain)"?
@user3088440 and has higher precedence than or, in Python, most other languages, and in math.
@user3088440: It becomes your first case. Without any brackets, and's come before or's. This IS mathematically, by the way, as multiplication/division comes before addition/subtraction.
Huh, had no idea, I've just always bracketed anything ambiguous. Good to know!
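
For anyone who wants to verify the precedence rule from these comments, a brute-force check over all truth values is short. The function names are made up for this sketch.

```python
from itertools import product

def unbracketed(a, b, c, d):
    return a or b and c or d

def and_first(a, b, c, d):
    # "and" binds tighter than "or".
    return a or (b and c) or d

def or_first(a, b, c, d):
    # The grouping the question actually wants.
    return (a or b) and (c or d)

combos = list(product([False, True], repeat=4))

# The unbracketed form agrees with and_first on every input...
same_as_and_first = all(unbracketed(*v) == and_first(*v) for v in combos)

# ...but disagrees with or_first on some inputs.
differs_from_or_first = [v for v in combos
                         if unbracketed(*v) != or_first(*v)]
print(same_as_and_first, len(differs_from_or_first))
```

So the brackets around each OR pair are needed to get the intended grouping.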
