Python re.findall regex problems

Question

I'm trying to find some very specific data in a string. The problem is I'm not finding all of the data with the current regex I'm using. Here is some sample data:

[img:2gcfa9cc]http&#58;//img823&#46;imageshack&#46;us/img823/3295/pokaijumonlogo&#46;jpg[/img:2gcfa9cc]

Making these little guys into Kaiju monsters.  Again, I know nothing about them, other then which ones I thought would make for cool possible Kaiju (of the original 150) so here's Day 01

[b:2gcfa9cc][size=150:2gcfa9cc]BULBASAUR[/size:2gcfa9cc][/b:2gcfa9cc]
[i:2gcfa9cc]Feb 01[/i:2gcfa9cc]
[ddf2k12:2gcfa9cc]http&#58;//img853&#46;imageshack&#46;us/img853/2185/dailydrawfeb2012day01&#46;jpg[/ddf2k12:2gcfa9cc]

Setting myself up with the same &quot;parameters&quot; as last year

I may be breaking my own Challenge rules right now but...well I started this last night and I couldn't just leave 'em out in the cold all unfinished 'n' shit.  

Obligatory Skyrim drawing.

[ddf2k12:2ytorpmj]http&#58;//4&#46;bp&#46;blogspot&#46;com/-UIUSNXvnHz4/TynYf1BZ9oI/AAAAAAAAAl4/pRLHVP0Ny3U/s1600/01_cheatingcheaterwarmup1&#46;jpg[/ddf2k12:2ytorpmj]

What I'm trying to get is the data between the ddf2k12 tags and the img tags. I've only worked on the ddf2k12 tags thus far (I figure the latter will be the former with img instead of ddf2k12) and out of the 1586 tags I should have found, I'm only getting 5. Here's my regex:

ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[[ddf2k12]+\:[A-Za-z0-9]+\])'
ddf2k12_find = re.findall(ddf2k12_regex, post)

Obviously there's something wrong with my regex, but after banging my head against a wall I can't sort it out, so any help is appreciated. Thanks.

bukzor · Accepted Answer · 2012-03-03 19:00:36Z

3

You will do yourself a big favor by breaking that big regex down into parts and using composition. This seems to work correctly, and it's more obvious how to debug it.

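import re

# Build the pattern from small, named pieces; raw strings keep the backslashes intact.
start_tag = r'(\[{tagname}:[^\]]+\])'
end_tag = start_tag.replace(r'\[', r'\[/', 1)  # same as start_tag, but matches "[/"
content = r'((?:.|\n)*?)'  # The ?: indicates a non-capturing group.
tag = start_tag + content + end_tag

ddf_tag = tag.format(tagname='ddf2k12')

# post is the forum text from the question.
for match in re.findall(ddf_tag, post):
    print(match)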
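
A minimal, self-contained sketch of the same composition idea, reused for both the ddf2k12 and img tags. The sample_post string and the extract_tag_contents helper are made up for illustration and are not part of the original answer; only the tag body is captured here.

```python
import re

# Pattern pieces in the spirit of the answer above; {tagname} is filled in per tag.
START_TAG = r'\[{tagname}:[^\]]+\]'
END_TAG = r'\[/{tagname}:[^\]]+\]'
CONTENT = r'((?:.|\n)*?)'  # lazy, and also crosses newlines


def extract_tag_contents(tagname, text):
    """Return the text found between [tagname:id] and [/tagname:id]."""
    pattern = (START_TAG + CONTENT + END_TAG).format(tagname=re.escape(tagname))
    return re.findall(pattern, text)


# Hypothetical sample input, shortened from the question's data.
sample_post = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[b:2gcfa9cc]BULBASAUR[/b:2gcfa9cc]\n"
    "[ddf2k12:2gcfa9cc]http://example.com/day01.jpg[/ddf2k12:2gcfa9cc]\n"
    "[ddf2k12:2ytorpmj]http://example.com/\nwarmup.jpg[/ddf2k12:2ytorpmj]\n"
)

ddf_urls = extract_tag_contents('ddf2k12', sample_post)
img_urls = extract_tag_contents('img', sample_post)
print(ddf_urls)
print(img_urls)
```

Because the pattern has a single capturing group, re.findall returns a flat list of strings rather than a list of tuples.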

edited Mar 3, 2012 at 19:00

answered Mar 3, 2012 at 18:33

bukzor


2 Comments

tchrist Over a year ago

Much better! I still wonder about dot getting stuck at newlines, though.

John P Over a year ago

YES! Thanks so much that worked perfectly. And yes making it much more legible was definitely a better move. I prefer to use the %s tag to concatenate strings, but that was the only thing I did differently. Thanks again!

senderle · Answer · 2012-03-03 18:16:31Z

2

Two things. First, you're missing the / in the closing ddf2k12 tag.

>>> ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[/[ddf2k12]+\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http&#58;//img853&#46;imageshack&#46;us/img853/2185/dailydrawfeb2012day01&#46;jpg', '[/ddf2k12:2gcfa9cc]')]

So now it works. But you're putting the ddf2k12 characters in square brackets, which turns them into a character class: [ddf2k12]+ matches any run of the characters 1, 2, d, f or k, so it accepts tags other than ddf2k12.

>>> silly_s = '[dddd:a]a[/ffff:a]'
>>> re.findall(ddf2k12_regex, silly_s)
[('[dddd:a]', 'a', '[/ffff:a]')]

So you need to match the exact tag instead; to do so, remove those outer brackets:

>>> ddf2k12_regex = '(\[ddf2k12\:[A-Za-z0-9]+\])(.*?)(\[/ddf2k12\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http&#58;//img853&#46;imageshack&#46;us/img853/2185/dailydrawfeb2012day01&#46;jpg', '[/ddf2k12:2gcfa9cc]')]
>>> re.findall(ddf2k12_regex, silly_s)
[]
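
The difference between a character class and a literal sequence can be checked in isolation. This is a small sketch; the candidate strings are invented for the demonstration.

```python
import re

# A character class matches ONE character from the set; with + it matches
# any run of those characters, in any order.
char_class = re.compile(r'[ddf2k12]+')
# Without the brackets, the characters must appear exactly in this order.
literal = re.compile(r'ddf2k12')

# Hypothetical tag names to test against both patterns.
candidates = ['ddf2k12', 'dddd', 'k2f1', 'img']
class_hits = [s for s in candidates if char_class.fullmatch(s)]
literal_hits = [s for s in candidates if literal.fullmatch(s)]
print(class_hits)    # ['ddf2k12', 'dddd', 'k2f1']
print(literal_hits)  # ['ddf2k12']
```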

answered Mar 3, 2012 at 18:16

senderle


3 Comments

senderle Over a year ago

Also, it doesn't matter in this case, but don't forget to make your regex strings raw strings by prepending r to them.

tchrist Over a year ago

I think I would like to see such a long pattern in a r'''...''' string across multiple lines in (?x) mode with extra white space with each chunk of it on its own line and maybe with comments too, in order to make it easier to read and maintain. I get a bit claustrophobic when it’s all scrunched together, risking sending the reader into punctuation shock. 😉
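
The layout suggested in this comment could look roughly like the sketch below. It uses the re.VERBOSE flag, which is equivalent to putting (?x) at the start of the pattern; the name DDF_TAG and the sample string are assumptions made for the example.

```python
import re

# Verbose-mode rewrite of the corrected pattern from the answer above.
# Whitespace and "# ..." comments inside the pattern are ignored by the engine.
DDF_TAG = re.compile(r'''
    ( \[ ddf2k12 : [A-Za-z0-9]+ \] )    # opening tag with its id
    ( .*? )                             # tag body, as short as possible
    ( \[ / ddf2k12 : [A-Za-z0-9]+ \] )  # closing tag with its id
''', re.VERBOSE)

# Hypothetical one-line input in the format shown in the question.
sample = "[ddf2k12:2gcfa9cc]http://example.com/day01.jpg[/ddf2k12:2gcfa9cc]"
matches = DDF_TAG.findall(sample)
print(matches)
```

With three capturing groups, each element of matches is a tuple of (opening tag, body, closing tag), the same shape as the output shown in the answer.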

John P Over a year ago

This did work, and thank you for the help. I chose the previous answer because they broke the pattern up further, which, while it wasn't part of my question, will make it easier to re-use for the img tag as well. +1'd though.

neizod · Answer · 2012-03-03 18:24:46Z

0

Grouping text together is done with (sometext), not [sometext]; square brackets define a character class. I assume the ddf2k12 tag name should appear exactly once inside your [...], so drop the + and you no longer need a (...) around it either.

\[ddf2k12:[a-zA-Z0-9]+\](.*?)\[/ddf2k12:[a-zA-Z0-9]+\]

This would do the job pretty well. Note that the return value is the text captured by (.*?). If you also want the tag name, you can wrap ddf2k12 in (...). The combined version that also handles your img tag would then look like this.

\[(ddf2k12|img):[a-zA-Z0-9]+\](.*?)\[/(ddf2k12|img):[a-zA-Z0-9]+\]

edited Mar 3, 2012 at 18:24

answered Mar 3, 2012 at 18:19

neizod


1 Comment

tchrist Over a year ago

Should there be a (?s) there for the dot to also match newlines? This might be a good place to use a multiline match with (?x) so that you can include comments. I’m a little bothered to see things repeated in the regex: both (ddf2k12|img) and [a-zA-Z0-9]+ occur twice, which risks getting out of sync because you’ve violated the DRY (“don’t repeat yourself”) rule. I think you should be able to use named groups here to make this more self-documenting, and more maintainable from a code-safety point of view.
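
One way to act on the named-group suggestion is sketched below. It assumes the closing tag always repeats the opening tag's name and id, which holds for the sample data in the question but is not guaranteed in general; the group names tag, uid and body are invented for the example.

```python
import re

# Named groups plus backreferences, so the tag name and id are spelled once.
TAG = re.compile(
    r'\[(?P<tag>ddf2k12|img):(?P<uid>[a-zA-Z0-9]+)\]'  # opening tag
    r'(?P<body>.*?)'                                   # contents
    r'\[/(?P=tag):(?P=uid)\]',                         # same tag name and id
    re.DOTALL,                                         # let the dot cross newlines
)

# Hypothetical input in the format shown in the question.
sample = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[ddf2k12:2ytorpmj]http://example.com/warmup.jpg[/ddf2k12:2ytorpmj]"
)
found = [(m.group('tag'), m.group('body')) for m in TAG.finditer(sample)]
print(found)
```

Using finditer with m.group('name') avoids having to remember the positional order of the groups that findall would return.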

Pushpak Dagade · Answer · 2012-03-03 18:47:20Z

0

This worked for me -

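import re

post = "[the data you want to be searched for using regex]"
# Inside [...] a dot is a literal ".", so [\n.] only matches newlines and dots; (?:.|\n) matches any character.
ddf2k12_regex = re.compile(r"\[ddf2k12(?P<data>(?:.|\n)*?)\[/ddf2k12")
ddf2k12_find = ddf2k12_regex.findall(post)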

edited Mar 3, 2012 at 18:47

answered Mar 3, 2012 at 18:16

Pushpak Dagade


1 Comment

tchrist Over a year ago

I found it confusing that you’re using data not only as the variable name but also for the named group as well; wouldn’t it be better to choose different identifiers? Also, I wonder whether you want (?s) or re.DOTALL in there so that the dot can cross newline boundaries.

Dan Gerhardsson · Answer · 2012-03-03 19:15:34Z

0

The problem is that you are using a character set where you shouldn't. Try the following regex instead:

pattern = r'\[ddf2k12:\w+?\](.*?)\[/ddf2k12:\w+?\]'

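\w is equivalent to [a-zA-Z0-9_] when matching is ASCII-only (the default in Python 2; in Python 3, \w in a str pattern also matches Unicode word characters unless re.ASCII is set).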

Note that the semantics of \w and of the dot, as in (.*?), can be changed by using the DOTALL, LOCALE and UNICODE flags, or by adding (?s), (?L) or (?u) to the regex.
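
The effect of DOTALL on the dot can be seen with a short sketch. The text value below is a made-up input whose tag body contains a line break; the pattern is the one given above.

```python
import re

pattern = r'\[ddf2k12:\w+?\](.*?)\[/ddf2k12:\w+?\]'
# Hypothetical input whose tag body spans two lines.
text = "[ddf2k12:abc123]line one\nline two[/ddf2k12:abc123]"

without_flag = re.findall(pattern, text)             # dot stops at the newline
with_flag = re.findall(pattern, text, re.DOTALL)     # dot also matches "\n"
with_inline = re.findall('(?s)' + pattern, text)     # same flag, written inline
print(without_flag)  # []
print(with_flag)     # ['line one\nline two']
```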

edited Mar 3, 2012 at 19:15

answered Mar 3, 2012 at 18:16

Dan Gerhardsson


2 Comments

tchrist Over a year ago

Do you think you might want (?s) in there so that the dot can cross newline boundaries? Also, \w can include Unicode with (?u), or locale stuff with (?L). I now use the Unicode flavor almost always, and no longer ever use the locale flavor of it myself.

Dan Gerhardsson Over a year ago

@tchrist: That's a good point. I guess you might want to do that.
