I'm trying to find some very specific data in a string. The problem is I'm not finding all of the data with the current regex I'm using. Here is some sample data:

[img:2gcfa9cc]http://img823.imageshack.us/img823/3295/pokaijumonlogo.jpg[/img:2gcfa9cc]

Making these little guys into Kaiju monsters.  Again, I know nothing about them, other then which ones I thought would make for cool possible Kaiju (of the original 150) so here's Day 01

[b:2gcfa9cc][size=150:2gcfa9cc]BULBASAUR[/size:2gcfa9cc][/b:2gcfa9cc]
[i:2gcfa9cc]Feb 01[/i:2gcfa9cc]
[ddf2k12:2gcfa9cc]http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg[/ddf2k12:2gcfa9cc]

Setting myself up with the same "parameters" as last year

I may be breaking my own Challenge rules right now but...well I started this last night and I couldn't just leave 'em out in the cold all unfinished 'n' shit.  

Obligatory Skyrim drawing.

[ddf2k12:2ytorpmj]http://4.bp.blogspot.com/-UIUSNXvnHz4/TynYf1BZ9oI/AAAAAAAAAl4/pRLHVP0Ny3U/s1600/01_cheatingcheaterwarmup1.jpg[/ddf2k12:2ytorpmj]

What I'm trying to get is the data between the ddf2k12 tags and the img tags. I've only worked on the ddf2k12 tags thus far (I figure the latter will be the former with img instead of ddf2k12) and out of the 1586 tags I should have found, I'm only getting 5. Here's my regex:

ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[[ddf2k12]+\:[A-Za-z0-9]+\])'
ddf2k12_find = re.findall(ddf2k12_regex, post)

Obviously there's something wrong with my regex, but after banging my head against a wall I can't sort it out, so any help is appreciated. Thanks.

5 Answers

3

You will do yourself a big favor by breaking that big regex down into parts and using composition. This seems to work correctly, and it's more obvious how to debug it.

import re

# Raw strings keep the backslashes intact for the regex engine.
start_tag = r'(\[{tagname}:[^\]]+\])'
end_tag = start_tag.replace(r'\[', r'\[/', 1)  # same pattern with a leading slash
content = r'((?:.|\n)*?)'  # The ?: indicates a non-capturing group.
tag = start_tag + content + end_tag

ddf_tag = tag.format(tagname='ddf2k12')

for match in re.findall(ddf_tag, post):
    print(match)
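
Since the tag name is just a parameter, the same template can be reused for the img tag from the question. Here is a minimal sketch of that idea; the build_tag_pattern helper and the shortened sample_post string are made up for illustration and are not part of the original post data.

```python
import re

def build_tag_pattern(tagname):
    """Hypothetical helper: compile a pattern for [tagname:id]...[/tagname:id]."""
    start_tag = r'(\[{tagname}:[^\]]+\])'
    end_tag = start_tag.replace(r'\[', r'\[/', 1)
    content = r'((?:.|\n)*?)'  # lazy body that may span several lines
    pattern = (start_tag + content + end_tag).format(tagname=re.escape(tagname))
    return re.compile(pattern)

# Made-up sample, loosely modelled on the post in the question.
sample_post = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[i:2gcfa9cc]Feb 01[/i:2gcfa9cc]\n"
    "[ddf2k12:2gcfa9cc]line one\nline two[/ddf2k12:2gcfa9cc]\n"
)

ddf_matches = build_tag_pattern('ddf2k12').findall(sample_post)
img_matches = build_tag_pattern('img').findall(sample_post)

for opening, body, closing in ddf_matches + img_matches:
    print(opening, repr(body), closing)
```

Each match is a 3-tuple of (opening tag, body, closing tag), so the middle element is the data you are after.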
2 Comments

Much better! I still wonder about dot getting stuck at newlines, though.
YES! Thanks so much that worked perfectly. And yes making it much more legible was definitely a better move. I prefer to use the %s tag to concatenate strings, but that was the only thing I did differently. Thanks again!
2

Two things. First, you're missing the / in the closing ddf2k12 tag.

>>> ddf2k12_regex = '(\[[ddf2k12]+\:[A-Za-z0-9]+\])(.*?)(\[/[ddf2k12]+\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg', '[/ddf2k12:2gcfa9cc]')]

So now it works. But because you put ddf2k12 inside square brackets, it is a character class: [ddf2k12]+ matches any tag whose name consists only of the characters 1, 2, d, f or k.

>>> silly_s = '[dddd:a]a[/ffff:a]'
>>> re.findall(ddf2k12_regex, silly_s)
[('[dddd:a]', 'a', '[/ffff:a]')]

So you need to match the exact tag instead; to do so, remove those outer brackets:

>>> ddf2k12_regex = '(\[ddf2k12\:[A-Za-z0-9]+\])(.*?)(\[/ddf2k12\:[A-Za-z0-9]+\])'
>>> re.findall(ddf2k12_regex, post)
[('[ddf2k12:2gcfa9cc]', 'http://img853.imageshack.us/img853/2185/dailydrawfeb2012day01.jpg', '[/ddf2k12:2gcfa9cc]')]
>>> re.findall(ddf2k12_regex, silly_s)
[]

3 Comments

Also, it doesn't matter in this case, but don't forget to make your regex strings raw strings by prepending r to them.
I think I would like to see such a long pattern in a r'''...''' string across multiple lines in (?x) mode with extra white space with each chunk of it on its own line and maybe with comments too, in order to make it easier to read and maintain. I get a bit claustrophobic when it’s all scrunched together, risking sending the reader into punctuation shock. 😉
This did work, and thank you for the help. I chose the previous answer because they broke it up further, which, while it wasn't part of my question, will make it easier to re-use for the img tag as well. +1'd though.
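
For what it's worth, the verbose-mode layout suggested in the comments might look something like the sketch below. It assumes the same exact-tag pattern as the answer, with (?s) added so the dot can cross newlines; the DDF_TAG name and the sample strings are only illustrative.

```python
import re

# (?x) ignores unescaped whitespace and allows # comments inside the pattern;
# (?s) lets "." match newlines as well.
DDF_TAG = re.compile(r'''(?xs)
    ( \[ ddf2k12 : [A-Za-z0-9]+ \] )     # opening tag
    ( .*? )                              # tag body, as short as possible
    ( \[ / ddf2k12 : [A-Za-z0-9]+ \] )   # closing tag, note the slash
''')

# Made-up inputs for illustration.
sample = "[ddf2k12:2gcfa9cc]http://example.com/day01.jpg[/ddf2k12:2gcfa9cc]"
silly_s = "[dddd:a]a[/ffff:a]"

sample_matches = DDF_TAG.findall(sample)
silly_matches = DDF_TAG.findall(silly_s)
print(sample_matches)
print(silly_matches)
```

The behaviour should be the same as the one-line version; only the layout changes, so each chunk can be read and commented on its own line.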
0

Text is grouped with (sometext), not [sometext]; square brackets define a character class. I assume the ddf2k12 tag name appears only once inside your [...], so drop the + and you no longer need a (...) around it.

\[ddf2k12:[a-zA-Z0-9]+\](.*?)\[/ddf2k12:[a-zA-Z0-9]+\]

That would do the job pretty well. Note that the return value is the text captured by (.*?). If you also want the tag name, wrap ddf2k12 in (...). The combined version that also handles your img tag would then look like this:

\[(ddf2k12|img):[a-zA-Z0-9]+\](.*?)\[/(ddf2k12|img):[a-zA-Z0-9]+\]

1 Comment

Should there be a (?s) there for the dot to also match newlines? This might be a good place to use a multiline pattern with (?x) so that you can include comments. I’m a little bothered to see things repeated in the regex: both (ddf2k12|img) and [a-zA-Z0-9]+ occur twice, which risks getting out of sync because you’ve violated the DRY (“don’t repeat yourself”) rule. I think you should be able to use named groups here to make this more self-documenting, and more maintainable from a code-safety point of view.
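
One way to act on the named-group suggestion is sketched below. It assumes that a closing tag always repeats both the tag name and the id of its opening tag, which holds for the sample in the question but is an assumption about the rest of the data. The group names (tag, uid, body) and the sample string are made up for illustration.

```python
import re

TAG = re.compile(r'''(?xs)
    \[ (?P<tag> ddf2k12 | img ) : (?P<uid> [a-zA-Z0-9]+ ) \]   # opening tag
    (?P<body> .*? )                                            # lazy body
    \[ / (?P=tag) : (?P=uid) \]                                # must repeat tag and id
''')

# Made-up sample, loosely modelled on the post in the question.
sample_post = (
    "[img:2gcfa9cc]http://example.com/logo.jpg[/img:2gcfa9cc]\n"
    "[b:2gcfa9cc]BULBASAUR[/b:2gcfa9cc]\n"
    "[ddf2k12:2ytorpmj]http://example.com/day01.jpg[/ddf2k12:2ytorpmj]"
)

results = [(m.group('tag'), m.group('body')) for m in TAG.finditer(sample_post)]
print(results)
```

The backreferences (?P=tag) and (?P=uid) mean the alternation and the id pattern are each written once, so the opening and closing halves cannot drift apart.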
0

This worked for me -

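import re

post = "[the data you want to be searched for using regex]"
# Inside a character class "." is a literal dot, so [\n.] matches only newlines and dots; (?:.|\n) matches any character.
ddf2k12_regex = re.compile(r"\[ddf2k12(?P<data>(?:.|\n)*?)\[/ddf2k12")
ddf2k12_find = ddf2k12_regex.findall(post)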

1 Comment

I found it confusing that you’re using data not only as the variable name but also for the named group as well; wouldn’t it be better to choose different identifiers? Also, I wonder whether you want (?s) or re.DOTALL in there so that the dot can cross newline boundaries.
0

The problem is that you are using a character set where you shouldn't. Try the following regex instead:

pattern = r'\[ddf2k12:\w+?\](.*?)\[/ddf2k12:\w+?\]'

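\w is equivalent to [a-zA-Z0-9_] for ASCII matching; in Python 3, str patterns also match Unicode word characters by default unless re.ASCII is set.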

Note that the semantics of \w and of the dot, as in (.*?), can be changed by using the DOTALL, LOCALE and UNICODE flags, or by adding (?s), (?L) or (?u) to the regex.
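
To make the DOTALL point concrete, here is a small sketch using the pattern above on a made-up string whose tag body spans two lines (the text value is an assumption for illustration, not taken from the real post):

```python
import re

pattern = r'\[ddf2k12:\w+?\](.*?)\[/ddf2k12:\w+?\]'
text = "[ddf2k12:2gcfa9cc]first line\nsecond line[/ddf2k12:2gcfa9cc]"

# Without DOTALL the dot stops at the newline, so nothing matches.
without_dotall = re.findall(pattern, text)

# Either the flag argument or the inline (?s) prefix lets the dot cross newlines.
with_dotall = re.findall(pattern, text, re.DOTALL)
inline_flag = re.findall('(?s)' + pattern, text)

print(without_dotall)
print(with_dotall)
print(inline_flag)
```

If your tag bodies never contain newlines, the plain pattern is enough; the flag only matters for multi-line content.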

2 Comments

Do you think you might want (?s) in there so that the dot can cross newline boundaries? Also, \w can include Unicode with (?u), or locale stuff with (?L). I now use the Unicode flavor almost always, and no longer ever use the locale flavor of it myself.
@tchrist: That's a good point. I guess you might want to do that.
