Regex passes in Rubular but not in Python

Question

import re
import urllib.request
file_txt = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/data/1408597/0000930413-12-003922.txt")
pattern_item4= re.compile("(Item\\n*\s*4.*)Item\\n*\s*5")
print(re.search(pattern_item4,bytes.decode(f)))
#Returns None

This regex returns what I want in Rubular, but it doesn't do what I expect in Python. Could anyone help me a bit with this? The intention of the regex is to extract everything between Item 4 and Item 5.

Thank you


\\n* has no effect. It should be [\n]* (or [\\n]*, depending on how you pass the string).
– Jack, Commented Jul 11, 2012 at 23:29
Thanks, Jack. This trick doesn't work either. I tried both of your suggestions, but no luck.
– zsljulius, Commented Jul 11, 2012 at 23:39
Have you checked my answer, and checked that you actually have data in file_txt? Also, where does the f come from in bytes.decode(f)?
– Jon Clements, Commented Jul 11, 2012 at 23:40
@zsljulius: If you post the exact part that you want to extract, maybe we can work out a regular expression.
– Jack, Commented Jul 11, 2012 at 23:42
Hey Jon. The file was transferred from the SEC's FTP server. It is in txt format, but it reads more like an XML file. urllib.request.urlopen gives me a file-like object; if I just do file_txt.read(), I can't apply re.search to the result directly. This is why I used bytes.decode(f) to turn it into a string. I also tried str(f), but str(f) somehow truncates what I need, so I finally resorted to bytes.decode(f) to get the raw string.
– zsljulius, Commented Jul 11, 2012 at 23:45
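
As a side note on the last comment, here is a minimal sketch of why str() and decoding behave differently on bytes. The bytes literal is a hypothetical stand-in for what file_txt.read() returns, and the UTF-8 encoding is an assumption; the real filing may use a different one.

```python
# Hypothetical stand-in for the bytes returned by file_txt.read().
raw = b"Item\n4. Purpose\nItem\n5. Interest\n"

# str() on bytes produces the printable representation: it starts with b'
# and every newline becomes the two characters backslash and n.
as_repr = str(raw)

# decode() produces the actual text, with real newline characters.
# "utf-8" is an assumption about the file's encoding.
as_text = raw.decode("utf-8")

print(as_repr.startswith("b'"))  # the representation keeps the b'...' wrapper
print("\n" in as_repr)           # no real newline survives in the representation
print("\n" in as_text)           # the decoded text has real newlines
```

A pattern that looks for whitespace or newlines after Item can only succeed on the decoded text, since the representation contains no real newlines.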

jfs · Accepted Answer · 2012-07-12 00:06:40Z

1

You need the re.DOTALL flag; otherwise . doesn't match a newline. To match Item at the end of a line, you could use $ with the re.MULTILINE flag:

pattern = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)
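
A small sketch of the effect of the two flags, run against a made-up sample string rather than the real filing (the sample text and variable names are illustrative assumptions):

```python
import re

# Hypothetical sample laid out like the filing: "Item" alone at the end of a line.
sample = "Item\n4. Purpose of Transaction\nSome details.\nItem\n5. Interest in Securities\n"

# Without flags: $ only matches at the very end of the text, and . stops at newlines.
without_flags = re.compile(r"(Item$\s*4.*)Item$\s*5")

# With re.S (DOTALL) and re.M (MULTILINE): . crosses newlines, $ matches at each line end.
with_flags = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)

print(without_flags.search(sample))  # no match is found

match = with_flags.search(sample)
print(repr(match.group(1)))          # the Item 4 section, up to the next "Item"
```

Note that .* is greedy, so on a text with several "Item ... 5" occurrences the capture would run up to the last one.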

answered Jul 12, 2012 at 0:06


3 Comments

Alan Moore Over a year ago

You don't need that $. All it does is force the \s* to match a linefeed, so you could write it as \n\s*. But I'm pretty sure any whitespace character will do, which is why I used \s+ in my answer.

zsljulius Over a year ago

Great! It works! I didn't even know that the dot does not match a newline by default! You saved my day!

Alan Moore Over a year ago

That works if you assume all of the desired matches will include a linefeed immediately after Item, but that looks like an accident of formatting to me. The . after the number seems like a more reliable indicator.

Falmarri · Accepted Answer · 2012-07-11 23:27:12Z

1

Try using raw strings:

re.compile(r"(Item\\n*\s*4.*)Item\\n*\s*5")

I would guess it has to do with your escaping of \n. But it's impossible to tell without knowing exactly what it is you're expecting that to match.
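
To make the escaping question concrete, here is a rough comparison of how the three spellings of the newline part behave. The two test strings are invented for illustration.

```python
import re

newline_text = "Item\n4"     # contains a real newline character
backslash_text = "Item\\n4"  # contains a backslash followed by the letter n

# Ordinary string: Python turns \\ into one backslash, so the regex engine
# receives \n* and matches zero or more newlines.
escaped = re.compile("Item\\n*4")

# Raw string with a single backslash: the engine receives the same \n*.
raw_single = re.compile(r"Item\n*4")

# Raw string with a double backslash: the engine receives \\n*, which means
# a literal backslash followed by zero or more letters n.
raw_double = re.compile(r"Item\\n*4")

for name, pattern in [("escaped", escaped), ("raw_single", raw_single), ("raw_double", raw_double)]:
    print(name,
          bool(pattern.search(newline_text)),
          bool(pattern.search(backslash_text)))
```

So "\\n" in an ordinary string and r"\n" in a raw string are equivalent as patterns, while r"\\n" looks for a literal backslash.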

answered Jul 11, 2012 at 23:27


2 Comments

Joran Beasley Over a year ago

I would agree that it's the \n escape ... but there's no way to be sure

zsljulius Over a year ago

Thanks for your reply. Unfortunately, the raw string trick doesn't work. I guess \\n is the correct way to get '\n' literally, right?

Alan Moore · Accepted Answer · 2012-07-12 01:00:14Z

0

Knowing where the newlines are doesn't help you locate the matches, so there's no need to match \n specifically; it's just another whitespace character. Try this:

r"(?s)Item\s+4\..*?(?=Item\s+5\.)"

(?s) enables the . to match newlines, so .*? consumes everything until the lookahead - (?=Item\s+5\.) - spots the beginning of the next "Item" entry. If you wanted to iterate over all the Items, you could replace the 4 and 5 with \d+.
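
A sketch of both uses on an invented sample: the pattern as given, and the \d+ generalisation for iterating over every Item. The sample text, the capture group around the number, and the \Z alternative (so the last Item is not dropped) are assumptions added for illustration.

```python
import re

# Hypothetical sample with three consecutive Item sections.
sample = (
    "Item 3. Source of Funds\nCash.\n"
    "Item 4. Purpose of Transaction\nInvestment.\n"
    "Item 5. Interest in Securities\nNone.\n"
)

# The pattern from the answer: lazily consume text until "Item 5." is next.
item4 = re.search(r"(?s)Item\s+4\..*?(?=Item\s+5\.)", sample)
print(repr(item4.group(0)))

# Generalised form: capture the item number and stop before the next
# "Item <number>." or, as an added assumption, at the end of the text.
all_items = re.compile(r"(?s)Item\s+(\d+)\..*?(?=Item\s+\d+\.|\Z)")
sections = {int(m.group(1)): m.group(0) for m in all_items.finditer(sample)}

for number, body in sections.items():
    print(number, repr(body))
```

Because the lookahead does not consume anything, each match ends exactly where the next one begins, which is what lets finditer walk through the sections one after another.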

edited Jul 12, 2012 at 1:00

answered Jul 12, 2012 at 0:48

