1
import re
import urllib.request
file_txt = urllib.request.urlopen("ftp://ftp.sec.gov/edgar/data/1408597/0000930413-12-003922.txt")
pattern_item4= re.compile("(Item\\n*\s*4.*)Item\\n*\s*5")
print(re.search(pattern_item4,bytes.decode(f)))
#Returns None

This regex returns what I want in Rubular, but it doesn't behave as expected in Python. Could anyone help me a bit with this? The intention of the regex is to extract everything between Item 4 and Item 5.

Thank you


5
  • \\n* has no effect. It must be [\n]* (or [\\n]*, depending on how you pass this string). Commented Jul 11, 2012 at 23:29
  • Thanks, Jack. This trick doesn't work either. I tried both your suggestions but no luck. Commented Jul 11, 2012 at 23:39
  • Have you checked my answer and checked that you actually have data in file_txt? Also where does the f come from in bytes.decode(f) ? Commented Jul 11, 2012 at 23:40
  • @zsljulius: If you post the exact part that you want to extract, maybe we can work out a regular expression. Commented Jul 11, 2012 at 23:42
  • Hey Jon. The file was transferred from the SEC's FTP server. It is in txt format, although it reads more like an XML file. urllib.request.urlopen gives me a file-like object; if I just do file_txt.read(), I can't apply re.search to the result directly. This is why I used bytes.decode(f) to turn it into a string. I also tried str(f), but that somehow truncates what I need. So I finally resorted to bytes.decode(f) to get the raw string. Commented Jul 11, 2012 at 23:45
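The decoding step discussed in these comments can be sketched as follows. This is only an illustration: io.BytesIO stands in for the object returned by urllib.request.urlopen, the sample bytes are invented, and UTF-8 is an assumed encoding.

```python
import io

# Stand-in for the file-like object returned by urllib.request.urlopen();
# the sample bytes are invented for illustration.
file_txt = io.BytesIO(b"Item\n 4. Controls\nsome text\nItem\n 5. Other\n")

raw = file_txt.read()        # bytes, which a str regex pattern cannot search
text = raw.decode("utf-8")   # str containing real newline characters (UTF-8 assumed)
as_repr = str(raw)           # the repr of the bytes object, not the decoded text

print(type(raw).__name__, type(text).__name__)
print("\n" in text)      # the decoded text contains newlines
print("\n" in as_repr)   # the repr contains backslash + n instead
```

Calling str() on a bytes object returns its printable representation, which is likely why it did not give usable text here.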

3 Answers

1

You need the re.DOTALL flag; otherwise . doesn't match a newline. To match Item at the end of a line, you could use $ with the re.MULTILINE flag:

pattern = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)
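A small check of the effect of re.DOTALL, using an invented sample in place of the real filing (the layout, with a line break after "Item", is an assumption based on the question):

```python
import re

# Invented sample text; the real filing is much longer.
text = (
    "Item\n 4. Controls and Procedures\n"
    "line one\nline two\n"
    "Item\n 5. Other Information\n"
)

# Without DOTALL, .* stops at the first newline, so "Item 5" is never reached.
without_flag = re.search(r"(Item\s*4.*)Item\s*5", text)

# With DOTALL, .* can run across lines up to the last "Item 5".
with_flag = re.search(r"(Item\s*4.*)Item\s*5", text, re.DOTALL)

# The pattern from this answer, with both flags.
answer_pattern = re.compile(r"(Item$\s*4.*)Item$\s*5", re.S | re.M)
answer_match = answer_pattern.search(text)

print(without_flag)
print(with_flag.group(1))
print(answer_match.group(1) == with_flag.group(1))
```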

3 Comments

You don't need that $. All it does is force the \s* to match a linefeed, so you could write it as \n\s*. But I'm pretty sure any whitespace character will do, which is why I used \s+ in my answer.
Great! It works! I didn't even know that the dot does not match a newline by default! You saved my day!
That works if you assume all of the desired matches will include a linefeed immediately after Item, but that looks like an accident of formatting to me. The . after the number seems like a more reliable indicator.
1

Try using raw strings

re.compile (r"(Item\\n*\s*4.*)Item\\n*\s*5")

I would guess it has to do with your escaping of \n. But it's impossible to tell without knowing exactly what it is you're expecting that to match.
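To see how the escaping interacts with raw strings, here is a minimal check on an invented two-line sample. It compares the three spellings that come up in this thread:

```python
import re

text = "Item\n4"  # invented sample: "Item", a newline, then "4"

# Non-raw "\\n" is the two characters backslash + n,
# which the regex engine interprets as a newline.
plain_double = re.search("Item\\n*4", text)

# Raw r"\n" is the same two characters, so it behaves identically.
raw_single = re.search(r"Item\n*4", text)

# Raw r"\\n" is backslash, backslash, n: the regex then looks for
# a literal backslash followed by zero or more letters n.
raw_double = re.search(r"Item\\n*4", text)

print(bool(plain_double), bool(raw_single), bool(raw_double))
```

In other words, when switching to a raw string the doubled backslash has to become a single one, or the pattern no longer matches a newline.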

2 Comments

I would agree that it's the \n escape... but there's no way to be sure.
Thanks for your reply. Unfortunately, the raw string trick doesn't work. I guess \\n is the correct way to get '\n' literally, right?
0

Knowing where the newlines are doesn't help you locate the matches, so there's no need to match \n specifically; it's just another whitespace character. Try this:

r"(?s)Item\s+4\..*?(?=Item\s+5\.)"

(?s) enables the . to match newlines, so .*? consumes everything until the lookahead, (?=Item\s+5\.), spots the beginning of the next "Item" entry. If you wanted to iterate over all the Items, you could replace the 4 and 5 with \d+.
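As a rough sketch of both uses, on an invented sample text: the first search applies the pattern above as written, and the second generalizes it with \d+ and re.finditer. The \Z alternative in the lookahead is an addition not in the answer, included so the last section is also captured.

```python
import re

# Invented sample text; headings may or may not break after "Item".
text = (
    "Item 3. Defaults\nnone\n"
    "Item\n 4. Controls\nline one\nline two\n"
    "Item 5. Other Information\nend\n"
)

# The pattern from this answer: lazy match up to the next "Item 5." heading.
item4 = re.search(r"(?s)Item\s+4\..*?(?=Item\s+5\.)", text)
print(item4.group(0))

# Generalized form: every Item section, each ending at the next heading
# or at the end of the text (the \Z branch is an assumption added here).
sections = [
    m.group(0)
    for m in re.finditer(r"(?s)Item\s+\d+\..*?(?=Item\s+\d+\.|\Z)", text)
]
print(len(sections))
```

Because the lookahead does not consume the next heading, each match ends right before the following "Item", which is what lets finditer pick up consecutive sections.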
