
I'm reading CSS files from disk as strings.

My goal is to extract HTML classes paired with a specific data attribute like this:

.foo[data-my-attr] 

The data attribute is unique enough so that I don't have to bother about traversing the CSS AST. I can simply use a regex like this:

(\.\S+)+\[data-my-attr\]

This already works, but \S+ is obviously a bad way to match an HTML class in a selector: it also swallows combinators, pseudo-classes, pseudo-elements, etc.
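
To make the over-matching concrete, here is a small sketch that runs the regex above against two sample selectors. The variable names are only for illustration.

```javascript
// The regex from the question, unchanged.
const naive = /(\.\S+)+\[data-my-attr\]/;

// A plain class followed by the attribute: group 1 is the class.
const simple = '.foo[data-my-attr]'.match(naive);
console.log(simple[1]); // .foo

// A child combinator: \S+ runs straight across ">span".
const combined = '.foo>span[data-my-attr]'.match(naive);
console.log(combined[1]); // .foo>span
```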

I tried building a whitelist version of the regex, e.g. (\w|-)+, but the HTML5 spec for class names is very permissive, so I will inevitably either miss valid characters or include invalid ones.

What regex can be used to extract HTML5 classes from a CSS selector string?

I'm using Node, i.e. the JavaScript flavor of regexes.

UPD1

Some examples:

  • .foo[data-my-attr] -- should match .foo
  • .foo>span[data-my-attr] -- should not match
  • .I_f%⌘ing_♥_HTML5[data-my-attr] -- should match .I_f%⌘ing_♥_HTML5

This question exists because I'm unable to think of every possible valid HTML5 class. I need a regex based on the surprisingly vague HTML5 class spec:

3.2.5.7 The class attribute

The attribute, if specified, must have a value that is a set of space-separated tokens representing the various classes that the element belongs to.

The classes that an HTML element has assigned to it consists of all the classes returned when the value of the class attribute is split on spaces. (Duplicates are ignored.)

There are no additional restrictions on the tokens authors can use in the class attribute, but authors are encouraged to use values that describe the nature of the content, rather than values that describe the desired presentation of the content.

Obviously, a class shouldn't contain spaces and characters like +>:()[]=~ because they are part of CSS selector syntax...
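
One way to read that last sentence is as a deny-list: a class token is a dot followed by anything that is not whitespace or one of those syntax characters. Below is a minimal sketch of that idea, not a spec-compliant parser. The helper name `classesBeforeAttr` is made up, and adding `.` to the excluded set (so that chained classes like `.a.b` are not merged into one token) is my own assumption.

```javascript
// Deny-list: whitespace, the characters +>:()[]=~ from the sentence above,
// and "." (an assumption, so ".a.b" is not read as a single class).
const classBeforeAttr = /(\.[^\s+>:()\[\]=~.]+)\[data-my-attr\]/g;

// Hypothetical helper: returns every class that directly precedes [data-my-attr].
function classesBeforeAttr(selector) {
  return Array.from(selector.matchAll(classBeforeAttr), match => match[1]);
}

console.log(classesBeforeAttr('.foo[data-my-attr]'));              // [ '.foo' ]
console.log(classesBeforeAttr('.foo>span[data-my-attr]'));         // []
console.log(classesBeforeAttr('.I_f%⌘ing_♥_HTML5[data-my-attr]')); // [ '.I_f%⌘ing_♥_HTML5' ]
```

This covers the three examples listed under UPD1, but it still accepts characters that have their own meaning in CSS (for instance `#`, `,` or `{`), so it should be treated as a starting point only.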

  • Whoever is voting to close the question, please explain in the comments what can be fixed to make this question valid. Commented Nov 25, 2017 at 11:45
  • Will this stackoverflow.com/a/6329126/1156518 regex extended with your specific attribute work for you? Commented Nov 25, 2017 at 11:55
  • @DmitryDruganov No, it's valid for HTML4, but will omit many HTML5-valid classes, such as #%LV-||_⌘⌥♥{©♤₩¤☆€~¥}. Commented Nov 25, 2017 at 12:11
  • Note that # can't be in a class name, since it's a selector for ids. Same thing about curly brackets. Commented Nov 25, 2017 at 13:22
  • You're working from the wrong spec. The relevant spec is not the HTML5 spec, but the Selectors spec, and in particular the selectors_group production. Commented Nov 25, 2017 at 14:23

2 Answers

2

You shouldn't use a regular expression.

A much more solid alternative is PostCSS (and its parser). It gives you a full AST (abstract syntax tree) of the whole stylesheet, from which you can easily extract the part you are looking for.

const postcss = require('postcss');
const Tokenizer = require('css-selector-tokenizer');

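// Collects every selector that contains the [data-my-attr] attribute.
const output = [];

// Note: postcss.plugin() is the plugin API of PostCSS 7 and earlier (deprecated in PostCSS 8).
const postcssAttributes = postcss.plugin('postcss-attributes', function() {
  return function(css) {
    css.walkRules(function(rule) {
      // rule.selectors splits a comma-separated selector list into single selectors.
      rule.selectors.forEach(selector => {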
        const tokenized = Tokenizer.parse(selector);
        if (
          tokenized.nodes.some(({ nodes }) =>
            nodes.some(
              node =>
                node.type === 'attribute' && node.content === 'data-my-attr'
            )
          )
        ) {
          output.push(selector);
        }
      });
    });
  };
});

const css = `
    .foo[data-my-attr] {
        color: red;
    }
    .foo[something] {
        color: red;
    }
`;

postcss([postcssAttributes])
  .process(css)
  .then(result => console.log(output));

// logs: [ '.foo[data-my-attr]' ]

This will log all the matching selectors.


2 Comments

Thank you for your example. I've been considering using a CSS AST and decided against it for two reasons: 1. It will make my build times longer. 2. It does not solve the problem of extracting HTML classes from compound selectors, which will still require regexes.
My example does support compound selectors
0

The regex to match an HTML5 class in a selector string is:

/\.-?(?:[_a-z]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))(?:[_a-z0-9-]|[\240-\377]|(?:(?:\\[0-9a-f]{1,6}(\r\n|[ \t\r\n\f])?)|\\[^\r\n\f0-9a-f]))*/

Credit: @KOBA789

Thanks to Alohci for pointing me in the right direction.
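
The pattern above follows the identifier productions of the CSS 2.1 lexical grammar (nonascii, unicode, escape, nmstart, nmchar). As a rough sketch of how it might be used in Node, the snippet below rebuilds it from named parts and appends the attribute from the question. Several things here are my assumptions rather than part of the original regex: the `i` flag (without it, `[_a-z]` does not match uppercase letters), widening the non-ASCII range beyond U+00FF so that characters like ⌘ and ♥ match, the capturing group, and the helper name `extractClasses`.

```javascript
// Building blocks named after the CSS 2.1 lexical grammar.
// Assumption: "nonascii" is widened from [\240-\377] to any non-ASCII code unit.
const nonascii = String.raw`[^\x00-\x7f]`;
const unicodeEscape = String.raw`\\[0-9a-f]{1,6}(?:\r\n|[ \t\r\n\f])?`;
const escapeSeq = String.raw`(?:${unicodeEscape}|\\[^\r\n\f0-9a-f])`;
const nmstart = String.raw`(?:[_a-z]|${nonascii}|${escapeSeq})`;
const nmchar = String.raw`(?:[_a-z0-9-]|${nonascii}|${escapeSeq})`;
const ident = String.raw`-?${nmstart}${nmchar}*`;

// Group 1 is the class (with its dot) that directly precedes [data-my-attr].
const classWithAttr = new RegExp(String.raw`(\.${ident})\[data-my-attr\]`, 'gi');

// Hypothetical helper: collects group 1 of every match.
function extractClasses(selector) {
  return Array.from(selector.matchAll(classWithAttr), match => match[1]);
}

console.log(extractClasses('.foo[data-my-attr]'));      // [ '.foo' ]
console.log(extractClasses('.foo>span[data-my-attr]')); // []

// In selector syntax "%" has to be escaped, so the stylesheet would contain "\%".
console.log(extractClasses('.I_f\\%⌘ing_♥_HTML5[data-my-attr]'));
```

Note that the class is returned in its escaped form, exactly as written in the stylesheet. An unescaped `%` is not part of an identifier in this grammar, so `.I_f%⌘ing_♥_HTML5[data-my-attr]` yields no match.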

6 Comments

Really? What about #notaclass:after { content:".notaclasstoo { whatever you want"; }
@CasimiretHippolyte Your example is not a valid selector.
What is invalid?
Your code sample is a CSS rule, the question is about a CSS selector.
Yes, it's a CSS rule, but how can you be sure to extract a CSS selector, even with a pattern that describes all possible selectors or the one you want, from a string that contains quoted parts? Inside quoted parts, you can also have something that matches your pattern and that isn't a selector.