I have a text file with 32 articles. Each article starts with the expression: <Number> of 32 DOCUMENTS, for example: 1 of 32 DOCUMENTS, 2 of 32 DOCUMENTS, etc. In order to find each article I have used the following code:
import re

sections = []
current = []
with open("Aberdeen2005.txt") as f:
    for line in f:
        if re.search(r"(?i)\d+ of \d+ DOCUMENTS", line):
            sections.append("".join(current))
            current = [line]
        else:
            current.append(line)
print(len(sections))
So now the articles are stored in the list sections.
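One thing worth checking in the loop above: it appends the text collected before each header, so the first entry of sections is whatever precedes the first header (possibly an empty string) and the last article is never appended. Below is a minimal sketch of one way to avoid that, assuming each article runs from its header line to the next header. The function name split_articles and the sample lines are made up for illustration.

```python
import re

# Same header pattern as in the question, compiled once.
HEADER = re.compile(r"(?i)\d+ of \d+ DOCUMENTS")

def split_articles(lines):
    """Group lines into articles, each starting at a header line."""
    sections = []
    current = None  # stays None until the first header is seen
    for line in lines:
        if HEADER.search(line):
            if current is not None:
                sections.append("".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
    if current is not None:
        sections.append("".join(current))  # flush the final article
    return sections

# Hypothetical sample input standing in for the real file.
sample = [
    "preamble\n",
    "1 of 2 DOCUMENTS\n",
    "first body\n",
    "2 of 2 DOCUMENTS\n",
    "second body\n",
]
articles = split_articles(sample)
print(len(articles))
```

With a real file, the same function could be called as split_articles(f) inside the with block, since a file object yields its lines one at a time.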
The next thing I want to do is split the articles into two groups. Articles containing the words (economy OR economic) AND (uncertainty OR uncertain) AND (tax OR policy) should be labelled with the number 1.
Articles containing the words (economy OR economic) AND (uncertain OR uncertainty) AND (regulation OR spending) should be labelled with the number 2. This is what I have tried so far:
for i in range(len(sections)):
    group1 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[tax|policy]", , sections[i])
    group2 = re.search(r"+[economic|economy].+[uncertainty|uncertain].+[regulation|spending]", , sections[i])
Nevertheless, it does not seem to work. Any ideas why?
{"1": [1,3,7,9], "2": [2,4,10,27]}
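Several things in the attempt would stop it from working as intended. The doubled comma in the re.search call is a syntax error. The leading + has nothing to repeat, so the pattern fails to compile. Square brackets define a character class, so [economic|economy] matches a single character rather than either word; alternation needs a group such as (?:economic|economy). Finally, . does not match a newline by default, and chaining the parts with .+ forces the words to appear in that order. The following is a minimal sketch of one alternative, assuming the three conditions may be satisfied anywhere in the article, in any order, ignoring case, and as whole words only (so "taxes" would not count as "tax"). The names has_any and classify and the demo texts are made up for illustration, and articles are numbered from 1 on the assumption that sections holds only articles.

```python
import re

def has_any(text, words):
    """Return True if any of the words occurs in text as a whole word."""
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def classify(sections):
    """Map group labels to the 1-based numbers of matching articles."""
    groups = {"1": [], "2": []}
    for number, text in enumerate(sections, start=1):
        # Both groups share the first two conditions.
        base = (has_any(text, ["economy", "economic"])
                and has_any(text, ["uncertainty", "uncertain"]))
        if base and has_any(text, ["tax", "policy"]):
            groups["1"].append(number)
        if base and has_any(text, ["regulation", "spending"]):
            groups["2"].append(number)
    return groups

# Hypothetical articles standing in for the real sections list.
demo = [
    "The economy faces uncertainty over tax changes.",
    "Economic outlook is uncertain.\nNew regulation is expected.",
    "A story about football.",
]
print(classify(demo))  # {'1': [1], '2': [2]}
```

Note that in this sketch an article satisfying both sets of conditions is listed in both groups; if the groups should be exclusive, the second if could become an elif.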