I am reading a file from the web row by row, and each row comes back as a list containing a single string. That string holds three columns separated by this pattern: +++$+++.

This is my code:

from contextlib import closing
import codecs
import csv

import requests

with closing(requests.get(url, stream=True)) as r:
    reader = csv.reader(codecs.iterdecode(r.iter_lines(), 'latin-1'))
    for i, row in enumerate(reader):
        if i < 5:
            t = row[0].split('(\s\+{3}\$\+{3}\s)+')
            print(t)

I tried to split each row with this call in Python 3.6, but I can't get it to work. Any suggestions are much appreciated.

The first five rows look like this:

['m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html']
['m1 +++$+++ 1492: conquest of paradise +++$+++ http://www.hundland.org/scripts/1492-ConquestOfParadise.txt']
['m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html']
['m3 +++$+++ 2001: a space odyssey +++$+++ http://www.scifiscripts.com/scripts/2001.txt']
['m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt']

This is the call with my regular expression:

row[0].split('(\s\+{3}\$\+{3}\s)+')

Each row has only one element, row[0].

When I print the result, the row has not been split.

Comments

  • .split() on a string isn't a regex match at all - it's literally looking for the string (\s\+{3}\$\+{3}\s)+! You want re.split(r'(\s\+{3}\$\+{3}\s)+', row[0]) instead. Commented Jul 15, 2018 at 23:27
  • Or use row[0].split(" +++$+++ "), since nothing you're doing here appears to benefit from the power of regular expressions. Commented Jul 15, 2018 at 23:29
  • Also remove the parentheses (the capturing group) from the re.split pattern so that the +++$+++ separators are not returned. Commented Jul 15, 2018 at 23:32
  • Thanks, @jasonharper, for the clarification. I've learned this one now. Commented Jul 16, 2018 at 3:40
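
The difference described in these comments can be checked with a minimal sketch. It assumes one sample row copied from the question; the variable names are illustrative only.

```python
import re

# One sample row from the question (the string inside row[0])
row = "m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html"
pattern = r"(\s\+{3}\$\+{3}\s)+"

# str.split treats its argument as a literal separator, so the regex
# text is never found and the row comes back unsplit
literal_attempt = row.split(pattern)

# re.split interprets the pattern; because of the capturing group,
# the matched separators are included in the result
with_group = re.split(pattern, row)

# Without the capturing group only the three columns are returned
without_group = re.split(r"\s\+{3}\$\+{3}\s", row)

print(literal_attempt)
print(with_group)
print(without_group)
```

The first print shows a one-element list, the second shows the columns interleaved with the separators, and the third shows just the three columns.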

2 Answers

1

Doing

row[0].split(' +++$+++ ')

should give you exactly what you wanted without regex.
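
As a rough sketch of how this could be applied to every line, assuming the lines are already decoded strings: the helper name parse_line and the column names movie_id, title and url are hypothetical and not part of the question. Since the separator is not a comma, csv.reader is probably not needed here; with its default settings it would also split any line that contains a comma into several fields.

```python
SEPARATOR = " +++$+++ "


def parse_line(line):
    """Split one raw line into its three columns (hypothetical names)."""
    movie_id, title, url = line.split(SEPARATOR)
    return movie_id, title, url


# Two sample lines copied from the question, used instead of a web request
sample_lines = [
    "m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html",
    "m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt",
]

parsed = [parse_line(line) for line in sample_lines]
for columns in parsed:
    print(columns)
```

The tuple unpacking raises a ValueError if a line does not contain exactly three columns, which makes malformed lines easy to spot.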

0

If you don't want to use split(), a looser alternative is re.findall, which returns each row as a tuple of its three columns. Maybe this can help.

Input

import re

# Sample rows copied from the question, kept as one multi-line string
text = '''['m0 +++$+++ 10 things i hate about you +++$+++ http://www.dailyscript.com/scripts/10Things.html']
['m1 +++$+++ 1492: conquest of paradise +++$+++ http://www.hundland.org/scripts/1492-ConquestOfParadise.txt']
['m2 +++$+++ 15 minutes +++$+++ http://www.dailyscript.com/scripts/15minutes.html']
['m3 +++$+++ 2001: a space odyssey +++$+++ http://www.scifiscripts.com/scripts/2001.txt']
['m4 +++$+++ 48 hrs. +++$+++ http://www.awesomefilm.com/script/48hours.txt']'''
# Capture the three columns between the [' and '] wrappers;
# the +++$+++ separators are matched but not captured
pattern = r"\['(.+?)\s+\+{3}\$\+{3}\s+(.+?)\s+\+{3}\$\+{3}\s+(.+?)'\]"
output = re.findall(pattern, text)
print(output)

Output:

[('m0', '10 things i hate about you', 'http://www.dailyscript.com/scripts/10Things.html'), ('m1', '1492: conquest of paradise', 'http://www.hundland.org/scripts/1492-ConquestOfParadise.txt'), ('m2', '15 minutes', 'http://www.dailyscript.com/scripts/15minutes.html'), ('m3', '2001: a space odyssey', 'http://www.scifiscripts.com/scripts/2001.txt'), ('m4', '48 hrs.', 'http://www.awesomefilm.com/script/48hours.txt')]   

I'm also experimenting with a regex that uses alternation, but so far I can't get it to work. I'll post it later if I manage; hopefully the above helps in the meantime.

1 Comment

Thanks, @Inquisitor01 I got a good one from jasonharper. Appreciate it.
