3

I am a beginner with regex, so I keep practicing by solving all the exercises I can find. In one of them, I need to extract all the Hex color codes from HTML source code, using regex and Python. According to the exercise, the rules for spotting a Hex code are:

  1. It starts with #
  2. It has 3 or 6 digits
  3. Each digit is in the range of 0-F (the string is case insensitive)

The sample input is this:

#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}

The desired output is:

#FfFdF8
#aef
#f9f9f9
#fff
#ABC
#fff

#BED and #Cab are to be omitted, because they are not Hex colors.

I tried this code, to solve the problem:

import re

text = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
} """

r = re.compile(r'#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}')
a = r.findall(text)
print(a)

Obtained output:

['#BED', '#FfF', '#aef', '#f9f', '#fff', '#Cab', '#ABC', '#fff']

It mostly works, except that it doesn't catch the 6-digit codes (it returns only their first three digits) and it doesn't exclude the two tags that are not actually Hex color codes.

What am I doing wrong? I looked at other attempts, but they didn't provide the correct answer. I am using Python 3.7.4 and the latest version of PyCharm.

4
  • #BED and #CAB are valid hex colors. Commented Oct 20, 2019 at 12:17
  • I know, but in this exercise they are bookmarks of what their name are. I am not proficient in HTML. Commented Oct 20, 2019 at 12:18
  • @dgw yes, but #BED and #CAB are not colors in that example. Commented Oct 20, 2019 at 12:19
  • But the regex cannot distinguish that. So the regex will show these as well and that will not be an error. Commented Oct 20, 2019 at 15:57

3 Answers 3

5

First, put the 6-digit alternative before the 3-digit one; otherwise the 3-digit alternative matches the first half of a 6-digit code and the full code is never matched. Second, since you want to match only values inside CSS declarations, and not selectors, add a lookahead for ;, ,, or ):

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])

https://regex101.com/r/BtZaoV/2
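
As a quick check, the pattern above can be run in Python against the sample from the question. This is only a minimal sketch: the names css, HEX_COLOR, colors, three_first and six_first are made up for illustration, and re.IGNORECASE is used in place of the inline (?i) flag.

```python
import re

css = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}
"""

# 6-digit alternative first; the lookahead requires ; , or ) right after the code.
HEX_COLOR = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', re.IGNORECASE)

colors = HEX_COLOR.findall(css)
print(colors)

# Why the order of the alternatives matters (no lookahead here, to isolate the effect):
three_first = re.findall(r'#(?:[0-9a-f]{3}|[0-9a-f]{6})', '#FfFdF8', re.IGNORECASE)
six_first = re.findall(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})', '#FfFdF8', re.IGNORECASE)
print(three_first, six_first)
```

The non-capturing group (?:...) also keeps the # attached to both alternatives, so findall returns the whole code, including the #.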

If you also need to exclude combined selectors, e.g. #BED, foo {, you could instead look ahead for a run of non-{ characters followed by }:

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})

https://regex101.com/r/BtZaoV/3
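
To see the difference between the two lookaheads, here is a small sketch in Python. The stylesheet with the combined selector is invented for this example, and the variable names are hypothetical.

```python
import re

# Hypothetical stylesheet with a combined selector: #BED is followed by a comma.
css = """
#BED, #Cab
{
    color: #FfFdF8;
    border: 2px dashed #fff;
}
"""

# Lookahead for ; , or ) directly after the code.
punct_lookahead = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', re.IGNORECASE)
# Lookahead for a closing } that is reached without crossing an opening {.
brace_lookahead = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})', re.IGNORECASE)

with_punct = punct_lookahead.findall(css)
with_brace = brace_lookahead.findall(css)

print(with_punct)  # the selector #BED slips through because of the comma
print(with_brace)  # only codes inside the { ... } block remain
```

The second lookahead works because a code inside a declaration block can reach the closing } without passing another {, while a selector always meets the opening { first.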

Use the case-insensitive flag to keep things DRY. (You could also use (?:[0-9a-f]{3}){1,2} to avoid repeating the character set, but that'll make the pattern harder to read, IMO.)
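
For comparison, a short sketch of the three spellings in Python. The test string is one line taken from the question's sample, and the variable names are hypothetical. The quantifier {1,2} is greedy, so it tries two groups of three digits before settling for one, which gives the same preference for 6-digit codes as listing the 6-digit alternative first.

```python
import re

declaration = 'color: #FfFdF8; background-color:#aef;'

# Explicit alternation, flag passed as an argument.
explicit = re.findall(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', declaration, re.IGNORECASE)
# Compact form: one or two groups of three hex digits.
compact = re.findall(r'#(?:[0-9a-f]{3}){1,2}(?=[;,)])', declaration, re.IGNORECASE)
# Same compact form with the inline (?i) flag at the start of the pattern.
inline_flag = re.findall(r'(?i)#(?:[0-9a-f]{3}){1,2}(?=[;,)])', declaration)

print(explicit, compact, inline_flag)
```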

7 Comments

#BED, p { } multiple css selector?
I think the second one is better, +1.
yes I don't think there is nested {} in css (+1).
Fails to match when there's a media query, because of (?=[^{]*}).
@CodeManiac (?=[^{]*}|[^{]*?@media) I think this will fix the problem.
2

You can try

#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))

The idea is to try the 6-character alternative first and fall back to the 3-character one only if it fails. To make sure it doesn't match #BED or similar, we also need to match what terminates a hex color code, so we use a lookahead with alternation: either a ; immediately after the code, or a ) that is reached without crossing a (.
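
A minimal Python sketch of this pattern on the sample from the question. The variable names are made up for illustration, and the last line is an extra case that is not in the question, added only to show a limit of this lookahead.

```python
import re

css = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}
"""

# Lookahead: either ; right after the code, or a ) reached without crossing a (.
pattern = re.compile(r'#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))')

found = pattern.findall(css)
print(found)

# Assumed extra case: a color followed by another token before the ;
# is not matched, because neither branch of the lookahead succeeds.
missed = pattern.findall('border: #fff solid;')
print(missed)
```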

[Image: diagram of the regex]

1 Comment

Awesome diagram!
0

You may use

r = re.compile(r'#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)', re.M)

Sample Python code:

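import re

# (?!$) rejects a code that sits at the end of a line, which is where the
# selectors #BED and #Cab appear in the sample. re.MULTILINE makes $ match
# at every line end, not only at the end of the string.
regex = r"#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)"
test_str = ("#BED\n"
    "{\n"
    "    color: #FfFdF8; background-color:#aef;\n"
    "    font-size: 123px;\n"
    "    background: -webkit-linear-gradient(top, #f9f9f9, #fff);\n"
    "}\n"
    "#Cab\n"
    "{\n"
    "    background-color: #ABC;\n"
    "    border: 2px dashed #fff;\n"
    "}")
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)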

3 Comments

As stated in the question #BED and #Cab are to be omitted, because they are not Hex colors. This pattern will match both.
This is a step in the right direction, though. I think it can be fixed by replacing \b with (?!$): #[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$), to avoid matching at the end of the string (or of a line, if the re.MULTILINE option is used).
@WiktorStribiżew Agreed, but the suggested pattern will not cover the case of multiple CSS selectors, e.g. #BED, xyz
