I'm trying to analyze HTML code and extract all CSS classes and IDs from the source. I need to extract whatever is between the two quotation marks when they are preceded by either class or id:

id="<extract this>"

class="<extract this>"
  • Use an HTML parser. Don't use regular expressions. Commented Apr 24, 2014 at 18:07
  • This is the compulsory comment reminding you that you should be using an XML/HTML parser not regex for HTML. Commented Apr 24, 2014 at 18:07
  • Whatever programming language you are using, be sure to use a parser and not regex. Commented Apr 24, 2014 at 18:07
  • Thank you for your suggestions, but if I wanted to use an HTML parser, I would have asked about that instead. I simply need to extract the classes and IDs from a page, that's all. I'm organizing stylesheets, so I want a list of the classes and IDs used in the plain HTML source before it gets compiled and jQuery Mobile blows it up with its own custom classes. Commented Apr 24, 2014 at 18:10
  • Might be related to: stackoverflow.com/a/1732454/464257 Commented Apr 24, 2014 at 18:14

3 Answers

/(?:id|class)="([^"]*)"/gi

replacement expression: $1

This regex in English: match either "id" or "class", then an equals sign and an opening quote, then capture everything that is not a quote, up to the closing quote. Do this globally and case-insensitively.
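
As a rough sketch of how the pattern could be applied in JavaScript: the sample `html` string and the `extractValues` helper name are made up for illustration and are not part of the original answer. Note that the full match always includes the attribute name and the quotes; only capture group 1 holds the bare value.

```javascript
// Hypothetical sample input; any HTML source string would do.
const html = '<div id="main" class="wide dark"><p CLASS="note">hi</p></div>';

// Same pattern as above: group 1 holds the text between the quotes.
const pattern = /(?:id|class)="([^"]*)"/gi;

// matchAll yields one match object per hit: m[0] is the whole match
// (e.g. id="main"), m[1] is only the captured value (e.g. main).
function extractValues(source) {
  return Array.from(source.matchAll(pattern), (m) => m[1]);
}

console.log(extractValues(html)); // [ 'main', 'wide dark', 'note' ]
```

A class attribute with several classes comes back as one string ("wide dark" here), so it would still need to be split on whitespace to get individual class names.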

2 Comments

nice @Tim! those regexes... they'll get you every time.
I entered this on regexr.com, along with an HTML page, and it matches the entire "id='id'" instead of just id. Can you verify? cl.ly/image/18363j1w1g1V

Since you prefer using a regular expression, here is one way to do it.

\b(?:id|class)\s*=\s*"([^"]*)"

Regular expression:

\b             # the boundary between a word char (\w) and a non-word char
(?:            # group, but do not capture:
  id           #   'id'
 |             #  OR
  class        #   'class'
)              # end of grouping
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
=              # '='
\s*            # whitespace (\n, \r, \t, \f, and " ") (0 or more times)
"              # '"'
(              # group and capture to \1:
  [^"]*        #   any character except: '"' (0 or more times)
)              # end of \1
"              # '"'

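To illustrate what the \b and \s* parts buy you, here is a small JavaScript sketch comparing the pattern above with a variant that has neither. The `sample` string, the `strict` variant, and the `values` helper are invented for this comparison.

```javascript
// Hypothetical sample: spaces around "=", a line break before "=",
// and an attribute ("grid") that merely ends in "id".
const sample = '<div class = "box" grid="3" id\n="top">';

const loose = /\b(?:id|class)\s*=\s*"([^"]*)"/g; // pattern from this answer
const strict = /(?:id|class)="([^"]*)"/g;        // no \b, no \s*

// Collect capture group 1 from every match of re in text.
const values = (re, text) => Array.from(text.matchAll(re), (m) => m[1]);

console.log(values(loose, sample));  // [ 'box', 'top' ]
console.log(values(strict, sample)); // [ '3' ]
```

The variant without \b picks up the value of grid because "id" occurs inside that attribute name, and it misses both real attributes because of the whitespace around "=". Note that \b alone does not exclude names such as data-id, since "-" is not a word character.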

You may want to try this:

<?php

$css = <<< EOF
id="<extract this>"
class="<extract this>"id="<extract this2>"
class="<extract this3>"id="<extract this4>"
class="<extract this5>"id="<extract this6>"
class="<extract this7>"id="<extract this8>"
class="<extract this9>"
EOF;

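// Capture the quoted value after every id= or class= attribute.
// $classes[1] holds the contents of capture group 1 for each match.
preg_match_all('/(?:id|class)="(.*?)"/sim', $css, $classes, PREG_PATTERN_ORDER);
foreach ($classes[1] as $value) {
    echo $value."\n";
}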
    /*
    <extract this>
    <extract this>
    <extract this2>
    <extract this3>
    <extract this4>
    <extract this5>
    <extract this6>
    <extract this7>
    <extract this8>
    <extract this9>
    */
?>

DEMO:
http://ideone.com/Nr9FPt

3 Comments

Exactly what I wanted. I just threw my giant HTML page into the CSS variable, ran it, and it neatly printed every ID and class on that HTML page. Thank you!
Tuga, what does the /sim mean?
s modifier (single line): dot matches newline characters. i modifier (case insensitive): ignores case of [a-zA-Z]. m modifier (multi-line): causes ^ and $ to match the beginning/end of each line, not only the beginning/end of the string.
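
A quick way to see the three modifiers in action is sketched below in JavaScript, whose s, i and m flags behave the same way for these particular cases. The `text` input and the `checks` object are made up for the demonstration.

```javascript
// Hypothetical input: an uppercase attribute name and a value that spans a line break.
const text = 'ID="a\nb"\nclass="c"';

const checks = {
  // s: without it "." stops at "\n"; with it the lazy ".*?" can cross the line break.
  dotWithoutS: /id="(.*?)"/i.test(text),  // false
  dotWithS: /id="(.*?)"/is.test(text),    // true
  // i: "ID" only matches the lowercase pattern when case is ignored.
  caseWithoutI: /id=/.test(text),         // false
  caseWithI: /id=/i.test(text),           // true
  // m: "^" matches at the start of every line, not only at the start of the string.
  anchorWithoutM: /^class/.test(text),    // false
  anchorWithM: /^class/m.test(text),      // true
};

console.log(checks);
```

Since the pattern in the answer contains neither ^ nor $, the m modifier has no effect on it, and s only matters when an attribute value itself contains a line break.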
