Matching HTML attributes with regex in PHP

Question

I'm trying to write an expression that searches a page such as how2bypass.co.cc and returns the contents of the "action" attribute of the "form" tag, plus the contents of the "name" and "type" attributes of any input tags. I can't use an HTML parser because my ultimate goal is to automatically detect whether a given page is a web proxy, and once sites catch on to that, they'll probably start doing silly things like writing the entire document with JavaScript to stop me from parsing it.

I'm using the code

    preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches);

which works fine for the action attribute, but once I put a " after type\= the code stops working. Why is this? It works fine once, but not twice?

Jason McCreary · Accepted Answer · 2011-05-28 00:19:38Z

Regular expression quantifiers are greedy by default: .* matches as much as it possibly can.

If you inspect the page source, the following is probably matching from the first <input to the last type=, and capturing everything in between.

`<input.*type\=`
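
To make the greedy behavior concrete, here is a minimal sketch. The `$html` string is hypothetical sample markup, not the page from the question; it only shows how `.*` and `.*?` differ on the same input.

```php
<?php
// Hypothetical sample markup with two inputs (not the real page).
$html = '<form action="/go"><input name="u" type="text"><input name="p" type="password"></form>';

// Greedy: .* first runs to the end of the string, then backtracks
// until it finds the LAST "type=" it can match.
preg_match('/<input.*type=/i', $html, $greedy);

// Lazy: .*? expands one character at a time and stops at the FIRST "type=".
preg_match('/<input.*?type=/i', $html, $lazy);

echo $greedy[0], "\n"; // spans both inputs
echo $lazy[0], "\n";   // stays inside the first input
```

With the greedy version, the match starts at the first `<input` and ends at the second input's `type=`, which is why adding anything after `type\=` changes what the rest of the pattern has to match.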

You're not going to be able to capture the form and all of its inputs with your current expression, because not every input is preceded by the form markup. You need to approach it in one of the following ways:

1. Capture the entire form markup, <form>...</form>, and then use a second regex to match all the inputs within that capture.
2. Adjust your current expression to be non-greedy (.*?) and allow for multiple captures of the input markup.
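
The non-greedy approach might look roughly like the sketch below. It assumes double-quoted attributes with `name` appearing before `type` in each input, which may not hold for the real page; `$html` is hypothetical sample markup.

```php
<?php
// Hypothetical sample markup; assumes double quotes and name before type.
$html = '<form action="/go" method="post">'
      . '<input name="u" type="text"><input name="p" type="password">'
      . '</form>';

// The form's action, matched on its own with a lazy quantifier.
preg_match('/<form.*?action="(.*?)"/is', $html, $form);

// One match per input; PREG_SET_ORDER groups the captures by match.
preg_match_all('/<input.*?name="(.*?)".*?type="(.*?)"/is', $html, $inputs, PREG_SET_ORDER);

echo $form[1], "\n";
foreach ($inputs as $input) {
    echo $input[1], ' => ', $input[2], "\n";
}
```

Note that `.*?` can still cross tag boundaries when an input lacks one of the attributes, so this is only as reliable as the markup is regular.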


1 Comment

some guy Over a year ago

Thanks, I didn't realize .* would do that. However, my original problem remains. Putting quotes in breaks the expression, and I don't understand why. To clarify: why does /<form.*?action=/i work, but /<form.*?action="/i not return anything? If I can't figure this out I'll just go the route of capturing the entire form markup and doing it piece by piece. Also, the page I'm testing this with is the one I mentioned, how2bypass.co.cc

mario · Accepted Answer · 2011-05-28 00:18:59Z

Without seeing the page you want to extract from, I can only guess at a few things:

- The type= attribute might not have double quotes, since type=text is valid too. It might use single quotes instead, or have whitespace around the =.
- The .* placeholders can fail when there are newlines between or within the tags, because . does not match a newline by default. Using the /s regex flag is advisable.
- It's usually more reliable to use negated character classes like [^<>]* or [^"]* instead of .* anyway.
- You don't need to escape the equals sign (\=).
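
A sketch of what these points could look like in practice follows. The helper `attrValue()` and the sample markup are hypothetical; the pattern accepts double-quoted, single-quoted, and unquoted values, allows whitespace around `=`, and uses negated character classes, which match newlines without needing the /s flag.

```php
<?php
// Hypothetical helper: value of $attr inside one tag string, or null if absent.
function attrValue(string $tag, string $attr): ?string
{
    // (?| ... ) is a branch-reset group, so the value is always in group 1,
    // whichever quoting style matched.
    $pattern = '/\s' . preg_quote($attr, '/')
             . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/i';
    return preg_match($pattern, $tag, $m) ? $m[1] : null;
}

// Hypothetical markup mixing quote styles, spacing, and a newline inside a tag.
$html = <<<'HTML'
<form action='/go'>
<input name = "u"
       type=text>
<input type='password' name=p>
</form>
HTML;

// [^<>]* keeps each match inside a single tag, newlines included.
preg_match('/<form\b[^<>]*>/i', $html, $formTag);
preg_match_all('/<input\b[^<>]*>/i', $html, $inputTags);

$action = attrValue($formTag[0], 'action');
$inputs = [];
foreach ($inputTags[0] as $tag) {
    $inputs[] = ['name' => attrValue($tag, 'name'), 'type' => attrValue($tag, 'type')];
}

print_r($action);
print_r($inputs);
```

Reading each attribute separately also removes any dependence on the order in which `name` and `type` appear.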

You should probably also split it up: use one regex to extract the <form>...</form> block, then search for the <input> tags within it.
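
The split approach can be sketched as below. The markup is hypothetical, and for brevity the attribute patterns assume double quotes; the point is only the two-step structure, where inputs outside the form block are never considered.

```php
<?php
// Hypothetical page fragment: one input outside the form, two inside it.
$html = '<input name="outside" type="text">'
      . '<form action="/browse.php" method="post">'
      . '<input name="u" type="text"><input name="go" type="submit">'
      . '</form>';

$result = null;

// Step 1: isolate the first <form>...</form> block (/s lets . match newlines).
if (preg_match('/<form\b[^>]*>.*?<\/form>/is', $html, $form)) {
    $block = $form[0];

    // Step 2a: read the action from the opening form tag.
    preg_match('/<form\b[^>]*\baction="([^"]*)"/i', $block, $action);

    // Step 2b: collect every input tag inside the block only.
    preg_match_all('/<input\b[^>]*>/i', $block, $tags);

    $inputs = [];
    foreach ($tags[0] as $tag) {
        preg_match('/\bname="([^"]*)"/i', $tag, $n);
        preg_match('/\btype="([^"]*)"/i', $tag, $t);
        $inputs[] = ['name' => $n[1] ?? null, 'type' => $t[1] ?? null];
    }

    $result = ['action' => $action[1] ?? null, 'inputs' => $inputs];
}

print_r($result);
```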
