9

I have a regular expression to match strings like:

--D2CBA65440D

--77094A27E09

--77094A27E

--770

--77094A27E09--

Basically, it matches a hexadecimal string that is surrounded by one or more line breaks or whitespace characters, has the prefix --, and may or may not have -- as a suffix.

I use the following Python code, and it works fine most of the time:

hexaPattern = "\s--[0-9a-fA-F]+[--]?\s"
hex = re.search(hexaPattern, part)
if hex:
   print "found a match"

This works for all of the above, but it doesn't match --77094A27E09 in this block:

<div id="arrow2" class="headerLinksImg" style="display:block

--77094A27E09

;">

But it matches the same string in:

<input type="checkbox" name="checkbox" id="checkboxKG3" class

--77094A27E09

Content-T="checkboxKG" value="KG3" />

What am I doing wrong?

4
  • try trimming down the html on either side until you find the character that's causing the problem Commented Apr 22, 2012 at 17:48
  • I get a match for that block: rubular.com/r/wfqgEPHObB Commented Apr 22, 2012 at 17:48
  • 2
    Note that [--]? will match zero or one dash, not two dashes. I think you meant (--)? Commented Apr 22, 2012 at 17:50
  • I meant two dashes ... but [--]? worked Commented Apr 22, 2012 at 18:01

4 Answers

12
import re
hexaPattern = re.compile(r'\s--([0-9a-fA-F]+)(?:--)?\s')
m = re.search(hexaPattern, part)
if m:
   print "found a match:", m.group(1)

This pre-compiles the pattern for speed. It uses an r'' (raw string) so the backslashes are sure to be passed through correctly. It adds parentheses to make a capturing group so you can extract your hex string after the match; it also adds a non-capturing group, (?:--), around the second -- string.

Because you used square brackets around the second "--", you got a character class. In a character class, a '-' is usually used for a range, as in [a-z], but [--] has nothing to range over, so it simply matches a single '-' character. The problem is: because you have the ? after it, it matches only zero or one '-' character, and you need it to be able to match two.
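To see the difference concretely, here is a minimal sketch comparing the character class with the non-capturing group. The sample string and the variable names are made up for illustration; they are not taken from the question's data.

```python
import re

# Hypothetical sample: a hex token with the optional "--" suffix.
sample = " --77094A27E09-- "

# [--]? can consume at most one hyphen, so the second hyphen blocks the final \s.
with_class = re.search(r'\s--([0-9a-fA-F]+)[--]?\s', sample)

# (?:--)? consumes both hyphens as a unit (or none at all).
with_group = re.search(r'\s--([0-9a-fA-F]+)(?:--)?\s', sample)

print(with_class)           # None
print(with_group.group(1))  # 77094A27E09
```

On this input the character-class version finds nothing, while the grouped version captures the hex string.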


1 Comment

It's worth mentioning that compiling only pays off if you're using a lot of different patterns. According to the Python re docs, the most recently used patterns are cached, so if you're using just a few patterns, compiling them won't gain you much.
4

Try this: hexaPattern = r"^--[0-9a-fA-F]+(--)?\s"

The fixes I inserted are:

r at the beginning, so that the backslashes are not interpreted as escape sequences by the string literal

^ at the beginning to match the start of the string

then -- in parentheses instead of square brackets (the brackets seem like a mistake)
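As a small illustration of why the r prefix matters: \s happens to survive in a normal string because it is not a recognised string escape (newer Python versions emit a warning for such sequences), but other regex escapes such as \b do not survive. The variable names below are just for this sketch.

```python
# "\b" in a normal string literal is the backspace character (one character).
plain = "\b"

# In a raw string the backslash is kept, giving the two characters \ and b,
# which the re module then reads as a word boundary.
raw = r"\b"

print(len(plain), len(raw))  # 1 2
print(r"\s" == "\\s")        # True
```

Using raw strings for every pattern avoids having to remember which escapes are safe.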

1 Comment

You don't want to match from the beginning. OP's hex values are embedded in a longer string of html.
0

Others have pointed out problems with your regex, namely the [--], which just matches one single hyphen in an unconventional way. Either way, it's not what you want.

I would also suggest that having \s at both the beginning and end of the regex will cause problems under certain circumstances, because it matches spaces, tabs, and newlines. So you could end up with a case where your file has --77094A27E09\n--D2CBA65440D and the second value, --D2CBA65440D, won't match because the newline before it was consumed by the \s at the end of the previous match.
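A minimal sketch of that failure mode, using a made-up two-line input. The lookaround variant is not from the question or the other answers; it is one possible way, shown here as an assumption, to keep the "surrounded by whitespace" requirement without consuming the whitespace.

```python
import re

# Hypothetical input: two tokens separated by a single newline.
text = "\n--77094A27E09\n--D2CBA65440D\n"

# \s on both sides: the trailing \s eats the newline the next token needs.
consuming = re.findall(r'\s--([0-9a-fA-F]+)(?:--)?\s', text)

# Lookarounds assert "no non-whitespace neighbour" without consuming anything.
lookaround = re.findall(r'(?<!\S)--([0-9a-fA-F]+)(?:--)?(?!\S)', text)

print(consuming)   # ['77094A27E09']
print(lookaround)  # ['77094A27E09', 'D2CBA65440D']
```

The lookaround form also succeeds at the very start or end of the string, where there is no whitespace character to match.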

Also, you seem to be checking each line in the file individually, which you don't really need to do. You can use re.findall to get all the matches in one fell swoop.

And finally, the -- at the beginning of the string seems to be your real marker, not the \s at the beginning or end. So why not just use --([0-9a-fA-F]+)(?:--)? with a group around the hex number? When the pattern has exactly one capturing group, findall returns only that group's text, which is what you want. Then you can do this (read the whole HTML file into one string, and check for all matches):

text = """
<input type="checkbox" name="checkbox" id="checkboxKG3" class
--D2CBA65440D
<a>    --77094A27E09--  </a>
  hello world  --77094A27E
--770--
    --77094A27E09
Content-T="checkboxKG" value="KG3" />
"""
import re
hexapattern = r'--([0-9a-fA-F]+)(?:--)?'
print re.findall(hexapattern, text)
# Output: ['D2CBA65440D', '77094A27E09', '77094A27E', '770', '77094A27E09']

Which I think is what you want.


-2

I used the following:

pattern = re.compile(r'(\n--)([0-9A-F]+)(--)?', re.I | re.S | re.M)

and it worked fine. Thanks to all your contributions.

1 Comment

Just FYI, this wouldn't match the pattern if it's at the start of a buffer. Using ^ as Israel mentioned (together with re.M, which you already pass) would work to find it at the start of any line.
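A rough sketch of that difference, with a made-up buffer that begins directly with a token. The patterns are simplified from the one above and the names are only illustrative.

```python
import re

# Hypothetical buffer that starts directly with a token (no leading newline).
text = "--770\nsome other line\n--77094A27E09--\n"

# Requiring a literal \n before the marker misses the token at offset 0.
newline_anchored = re.findall(r'\n--([0-9A-F]+)(?:--)?', text, re.I)

# With re.M, ^ matches at the start of the buffer and after every newline.
line_anchored = re.findall(r'^--([0-9A-F]+)(?:--)?', text, re.I | re.M)

print(newline_anchored)  # ['77094A27E09']
print(line_anchored)     # ['770', '77094A27E09']
```

Note that ^ only helps when the marker sits at the very beginning of a line; a token preceded by spaces or other text on the same line would still be skipped.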
