How can I Correctly Parse a Hex Color Code in Python using Regex?

Question

I am a beginner with regex, so I practice by solving every exercise I can find. In one of them, I need to extract all the hex color codes from HTML source code, using regex and Python. According to the exercise, the rules for spotting a hex code are:

It starts with #
It has 3 or 6 digits
Each digit is in the range of 0-F (the string is case insensitive)

The sample input is this:

#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}

The desired output is:

#FfFdF8
#aef
#f9f9f9
#fff
#ABC
#fff

#BED and #Cab are to be omitted, because they are not Hex colors.

I tried this code, to solve the problem:

import re

text = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
} """

r = re.compile(r'#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}')
a = r.findall(text)
print(a)

Obtained output:

['#BED', '#FfF', '#aef', '#f9f', '#fff', '#Cab', '#ABC', '#fff']

It almost works, except that it truncates the 6-digit codes to 3 digits and it doesn't exclude the two selectors (#BED and #Cab) that are not actually hex color codes.

What am I doing wrong? I looked at other attempts, but none of them gave the correct answer. I am using Python 3.7.4 and the latest version of PyCharm.

I know, but in this exercise they are bookmarks of what their name are. I am not proficient in HTML.
– Bogdan Doicin, Commented Oct 20, 2019 at 12:18
@dgw yes, but #BED and #CAB are not colors in that example.
– Joan Lara, Commented Oct 20, 2019 at 12:19
But the regex cannot distinguish that. So the regex will show these as well and that will not be an error.
– dgw, Commented Oct 20, 2019 at 15:57

CertainPerformance · Accepted Answer · 2019-10-20 12:10:55Z

5

First, match the 6-digit codes before the 3-digit ones; otherwise the 3-digit alternative matches the first half of a 6-digit code and the full code is never matched. Second, since you want to match only values in CSS declarations, and not selectors, add a lookahead for ;, ,, or ):

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])

https://regex101.com/r/BtZaoV/2
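To see the difference in Python, here is a minimal sketch that runs the question's pattern and the pattern above against the sample input. The variable names (CSS, broken, fixed) are only illustrative. Note that in the original attempt the | splits the whole pattern, so its second alternative is six hex digits with no leading #.

```python
import re

# Sample input from the question.
CSS = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}
"""

# Original attempt: the alternatives are "# + 3 hex digits" OR
# "6 hex digits without #", so 6-digit codes are cut to 3 digits.
broken = re.compile(r'#[0-9A-Fa-f]{3}|[0-9A-Fa-f]{6}')

# Grouped alternation (6 digits tried first) plus a lookahead that
# requires ; , or ) right after the code, which rules out the selectors.
fixed = re.compile(r'(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])')

broken_matches = broken.findall(CSS)
fixed_matches = fixed.findall(CSS)

print(broken_matches)
print(fixed_matches)
```

Because the group is non-capturing, findall returns the whole match, including the #.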

If you also need to exclude combined selectors, e.g. #BED, foo {, you could look ahead for non-{ characters followed by }:

(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})

https://regex101.com/r/BtZaoV/3
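A small sketch of why the second lookahead helps, assuming a made-up input with a combined selector (the css string and variable names below are hypothetical, not from the exercise):

```python
import re

# Hypothetical input: "#BED, foo" is a combined selector, so "#BED"
# is directly followed by a comma.
css = (
    "#BED, foo {\n"
    "    color: #FfFdF8;\n"
    "}\n"
    "#Cab {\n"
    "    border: 2px dashed #fff;\n"
    "}\n"
)

# First pattern: accepts anything followed by ; , or )
by_terminator = re.compile(r'(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])')

# Second pattern: accepts a code only if a } is reached before any {,
# i.e. the code sits inside a declaration block.
inside_block = re.compile(r'(?i)#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[^{]*})')

terminator_matches = by_terminator.findall(css)
block_matches = inside_block.findall(css)

print(terminator_matches)  # the selector #BED slips through here
print(block_matches)
```

The first pattern picks up #BED because of the comma after it; the second one rejects it because a { comes before the next }.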

Use the case-insensitive flag to keep things DRY. (You could also use (?:[0-9a-f]{3}){1,2} to avoid repeating the character set, but that makes the pattern harder to read, IMO.)
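As a rough illustration, the flag can also be passed as re.IGNORECASE instead of the inline (?i), and the compact quantified form should behave the same as the explicit alternation, since {1,2} is greedy and tries two groups of three first. The test line below is made up for the comparison:

```python
import re

# Explicit alternation: 6 digits first, then 3.
explicit = re.compile(r'#(?:[0-9a-f]{6}|[0-9a-f]{3})(?=[;,)])', re.IGNORECASE)

# Compact form: one or two groups of three hex digits (greedy, so two
# groups are tried before one).
compact = re.compile(r'#(?:[0-9a-f]{3}){1,2}(?=[;,)])', re.IGNORECASE)

# Illustrative line, not taken from the exercise.
line = "color: #FfFdF8; background: linear-gradient(top, #f9f9f9, #fff);"

explicit_matches = explicit.findall(line)
compact_matches = compact.findall(line)

print(explicit_matches)
print(compact_matches)
```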

edited Oct 20, 2019 at 12:10

answered Oct 20, 2019 at 12:04

CertainPerformance


7 Comments

Charif DZ Over a year ago

#BED, p { } multiple css selector?

Toto Over a year ago

I think the second one is better, +1.

Charif DZ Over a year ago

yes I don't think there is nested {} in css (+1).

Code Maniac Over a year ago

Fails to match when there's a media query, because of the (?=[^{]*}) lookahead.

Charif DZ Over a year ago

@CodeManiac (?=[^{]*}|[^{]*?@media) I think this will fix the problem.


Code Maniac · Accepted Answer · 2019-10-20 12:31:33Z

2

You can try

#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))

The idea is to try the 6-character match first and fall back to the 3-character match only if that fails. To make sure it doesn't match #BED or similar, we also need to match the termination of the hex color code, so we use a lookahead with alternation: either a ; follows immediately, or a ) is reached before any (.

Regex Demo
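A quick check of this pattern in Python against the sample input from the question (a sketch only; the variable names are illustrative, and the result depends on where parentheses happen to appear in the input):

```python
import re

# Sample input from the question.
text = """
#BED
{
    color: #FfFdF8; background-color:#aef;
    font-size: 123px;
    background: -webkit-linear-gradient(top, #f9f9f9, #fff);
}
#Cab
{
    background-color: #ABC;
    border: 2px dashed #fff;
}
"""

# Lookahead: a ; right after the code, or a ) before any (.
pattern = re.compile(r'#(?:[0-9A-Fa-f]{6}|[0-9A-Fa-f]{3})(?=;|[^(]*\))')

colors = pattern.findall(text)
print(colors)
```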

edited Oct 20, 2019 at 12:31

answered Oct 20, 2019 at 12:19

Code Maniac


1 Comment

PatrickT Over a year ago

Awesome diagram!

Ryszard Czech · Accepted Answer · 2019-11-14 15:28:03Z

0

You may use

r = re.compile(r'#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)', re.M)

See proof

Sample Python code:

import re
regex = r"#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)"
test_str = ("#BED\n"
    "{\n"
    "    color: #FfFdF8; background-color:#aef;\n"
    "    font-size: 123px;\n"
    "    background: -webkit-linear-gradient(top, #f9f9f9, #fff);\n"
    "}\n"
    "#Cab\n"
    "{\n"
    "    background-color: #ABC;\n"
    "    border: 2px dashed #fff;\n"
    "}")
matches = re.findall(regex, test_str, re.MULTILINE)
print(matches)

edited Nov 14, 2019 at 15:28

answered Oct 20, 2019 at 12:00

Ryszard Czech


3 Comments

The fourth bird Over a year ago

As stated in the question #BED and #Cab are to be omitted, because they are not Hex colors. This pattern will match both.

Wiktor Stribiżew Over a year ago

This is a step in the right direction, though. I think it can be fixed by replacing \b with (?!$): #[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$) to avoid matching at the end of the string (or of a line if the re.MULTILINE option is used), see demo.

Code Maniac Over a year ago

@WiktorStribiżew Agreed, but the suggested pattern will not cover the case of multiple CSS selectors, e.g. #BED, xyz
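The limitation raised in this last comment can be reproduced with a short sketch. The combined-selector input below is hypothetical; the pattern is the one from the answer, used with re.MULTILINE:

```python
import re

# Pattern from the answer: reject a code that sits at the end of a line.
regex = re.compile(r'#[0-9A-Fa-f]{3}(?:[0-9A-Fa-f]{3})?(?!$)', re.MULTILINE)

# A selector alone on its line is rejected, because $ matches right after it.
alone = "#BED\n{\n    color: #FfFdF8;\n}"

# Hypothetical combined selector: "#BED" is followed by a comma, not by
# the end of the line, so the negative lookahead does not reject it.
combined = "#BED, xyz\n{\n    color: #FfFdF8;\n}"

alone_matches = regex.findall(alone)
combined_matches = regex.findall(combined)

print(alone_matches)
print(combined_matches)
```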
