Regular expression for hexadecimal string in python not working

Question

I have a regular expression to match strings like:

--D2CBA65440D

--77094A27E09

--77094A27E

--770

--77094A27E09--

Basically, it matches a hexadecimal string surrounded by one or more line breaks or whitespace characters, with the prefix -- and an optional -- suffix.

I use the following Python code, and it works fine most of the time:

hexaPattern = "\s--[0-9a-fA-F]+[--]?\s"
hex = re.search(hexaPattern, part)
if hex:
    print("found a match")

This works for all of the above, but it doesn't match --77094A27E09 in this block:

<div id="arrow2" class="headerLinksImg" style="display:block

--77094A27E09

;">

but it matches the same string in:

<input type="checkbox" name="checkbox" id="checkboxKG3" class

--77094A27E09

Content-T="checkboxKG" value="KG3" />

What am I doing wrong?

Try trimming down the HTML on either side until you find the character that's causing the problem.
– Shep, Commented Apr 22, 2012 at 17:48
Note that [--]? will match zero or one dash, not two dashes. I think you meant (--)?
– Hamish, Commented Apr 22, 2012 at 17:50

steveha · Accepted Answer · 2012-04-22 17:54:08Z

12

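import re

# Group 1 captures the hex digits; (?:--)? is the optional, non-captured suffix
hexaPattern = re.compile(r'\s--([0-9a-fA-F]+)(?:--)?\s')
m = hexaPattern.search(part)  # part is the text being searched
if m:
    print("found a match:", m.group(1))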

This pre-compiles the pattern for speed. It uses an r'' (raw string) so the backslashes are sure to be passed through correctly. It adds parentheses to make a capturing group ("match group") so you can extract your hex string after the match, and it wraps the second -- in a non-capturing group.

Because you used square brackets around the second "--", you got a "character class". In a character class, a '-' is usually used for a range, as in [a-z], but in [--] there is nothing to form a range with, so the class simply matches a single '-' character. The problem is that, with the ? after it, it can only match zero or one '-' character, and you need it to be able to match two.
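To see the difference concretely, here is a minimal sketch comparing the character-class version with the non-capturing-group version. The sample string and the variable names are made up for illustration; only the two patterns come from the discussion above.

```python
import re

# Hypothetical sample: a hex token that carries the optional "--" suffix
sample = "\n--77094A27E09--\n"

# [--]? is a character class, so it consumes at most ONE hyphen
broken = re.compile(r'\s--([0-9a-fA-F]+)[--]?\s')
# (?:--)? is a non-capturing group, so it consumes the whole "--" or nothing
fixed = re.compile(r'\s--([0-9a-fA-F]+)(?:--)?\s')

broken_match = broken.search(sample)
fixed_match = fixed.search(sample)

print(broken_match)          # None: a hyphen is left over, so the final \s fails
print(fixed_match.group(1))  # 77094A27E09
```

With the character class, one hyphen of the suffix is always left unconsumed, so the trailing \s can never match a token that ends in --.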

answered Apr 22, 2012 at 17:54

steveha


1 Comment

andersonvom Over a year ago

It's worth mentioning that you only need to compile the pattern yourself if you're using a lot of different patterns. According to the Python re docs, recently used patterns are cached in compiled form, so if you're using just a few patterns, compiling them explicitly won't gain you much.

Israel Unterman · Accepted Answer · 2012-04-22 17:49:28Z

4

Try this: hexaPattern = r"^--[0-9a-fA-F]+(--)?\s"

The fixes I inserted are:

r at the beginning, so that the backslashes are not interpreted as string escapes before the regex engine sees them

^ at the beginning to match the start of the string

then -- in parentheses instead of square brackets (the brackets seem like a mistake)
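As a side note on the first fix, here is a small sketch of why the r prefix matters. In an ordinary string literal, Python itself interprets some backslash sequences before the regex engine sees them; \b is one such case (it becomes a backspace character). The sample text and variable names are made up for illustration.

```python
import re

sample = "a cat sat"

# Raw string: the regex engine receives backslash + 'b', i.e. a word boundary
raw_pattern = r'\bcat\b'
# Ordinary string: Python turns \b into a backspace character (0x08) first
plain_pattern = '\bcat\b'

raw_match = re.search(raw_pattern, sample)
plain_match = re.search(plain_pattern, sample)

print(len(raw_pattern), len(plain_pattern))  # 7 5
print(raw_match is not None)                 # True
print(plain_match is not None)               # False
```

A sequence such as \s is not a recognized Python string escape, so it happens to reach the regex engine unchanged even without the r prefix, although newer Python versions warn about it. Using raw strings for every pattern avoids having to remember which sequences are affected.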

answered Apr 22, 2012 at 17:49

Israel Unterman


1 Comment

Joel Cornett Over a year ago

You don't want to match from the beginning. OP's hex values are embedded in a longer string of html.

Ben · Accepted Answer · 2019-01-27 11:24:37Z

Others have pointed out the problem with your regex, namely the [--], which matches one single hyphen in an unconventional way. Either way, it's not what you want.

I would also suggest that having \s at both the beginning and end of the regex will cause problems under certain circumstances, because it matches spaces, tabs, and newlines. So you could end up with a case where your file has --77094A27E09\n--D2CBA65440D and the second value, --D2CBA65440D, won't match because the newline was already consumed by the \s at the end of the previous match.
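The consumed-newline case can be reproduced with a short sketch. The sample text is made up, and the lookahead variant (?=\s) is an assumption added here for comparison; it is not something proposed in the thread.

```python
import re

# Two tokens separated by a single newline (sample data for illustration)
text = "\n--77094A27E09\n--D2CBA65440D\n"

# The trailing \s consumes the newline that the next match needs as its leading \s
consuming = re.findall(r'\s--([0-9a-fA-F]+)(?:--)?\s', text)

# Assumption: a lookahead checks for whitespace without consuming it
lookahead = re.findall(r'\s--([0-9a-fA-F]+)(?:--)?(?=\s)', text)

print(consuming)  # ['77094A27E09']
print(lookahead)  # ['77094A27E09', 'D2CBA65440D']
```

Because matches cannot overlap, a character consumed by one match is not available to the next one; a lookahead only inspects the following character and leaves it in place.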

Also, you seem to be checking each line in the file individually, which you don't really need to do. You can use re.findall to get all the matches in one fell swoop.

And finally, the -- prefix seems to be your real marker, not the \s at the beginning or end. So why not just use --([0-9a-fA-F]+)(?:--)? with a group around the hex number? When the pattern contains a group, findall returns only the group contents, which is what you want. Then you can do this (read the whole HTML file into one string, and check for all matches):

import re

text = """
<input type="checkbox" name="checkbox" id="checkboxKG3" class
--D2CBA65440D
<a>    --77094A27E09--  </a>
  hello world  --77094A27E
--770--
    --77094A27E09
Content-T="checkboxKG" value="KG3" />
"""
# Group 1 captures the hex digits; the trailing -- is optional and not captured
hexapattern = r'--([0-9a-fA-F]+)(?:--)?'
print(re.findall(hexapattern, text))
# ['D2CBA65440D', '77094A27E09', '77094A27E', '770', '77094A27E09']

Which I think is what you want.

Darth Plagueis · Accepted Answer · 2012-04-25 21:27:42Z

-2

I used the following:

pattern = re.compile(r'(\n--)([0-9A-F]+)(--)?', re.I | re.S | re.M)

and it worked fine. Thanks for all your contributions.

answered Apr 25, 2012 at 21:27

Darth Plagueis


1 Comment

Kenneth Wilke Over a year ago

Just FYI, this wouldn't match the pattern if it's at the start of a buffer. Using ^ as Israel mentioned would work to find it at the start of any line.
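A short sketch of that difference, assuming a buffer whose first token has no newline in front of it. The sample text and the single-group patterns are illustrative simplifications of the pattern above. Note that ^ only matches at the start of every line when re.M (re.MULTILINE) is set; without it, ^ matches only at the start of the string.

```python
import re

# Sample buffer whose first token sits at the very start (no preceding newline)
text = "--770\n--77094A27E\n"

# Requires a literal newline before "--", so the first token is skipped
newline_based = re.findall(r'\n--([0-9A-F]+)(?:--)?', text, re.I)

# With re.M, ^ matches at the start of the string and after every newline
caret_based = re.findall(r'^--([0-9A-F]+)(?:--)?', text, re.I | re.M)

print(newline_based)  # ['77094A27E']
print(caret_based)    # ['770', '77094A27E']
```

Both variants still require the -- to be the first thing on its line, so tokens preceded by spaces or other text on the same line would need a different anchor.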
