
I'm trying to make an expression that will search through a page like how2bypass.co.cc and return the contents of the "action" attribute in the "form" tag, and the contents of the "name" and "type" attributes in any input tags. I can't use an html parser because my ultimate goal is to automatically detect if a given page is a web proxy, and once sites catch on that I'm doing that they're probably going to start doing silly things like writing the entire document with javascript to stop me from parsing it.

I'm using the code

    preg_match_all('/<form.*action\="(.*?)".*>[^<]*<input.*type\=/i', $pageContents, $inputMatches);

which works fine for the action attribute, but once I put a " after type\= the pattern stops matching. Why is this? Why does the quote work the first time but not the second?

2 Answers


Regular expression quantifiers such as .* are greedy by default: they match as much text as they can.

If you inspect the page source, the following is probably matching from the first <input to the last type=, and capturing everything in between.

`<input.*type\=`

You're not going to be able to capture the form and all inputs with your current expression because not every input is prefixed with the form markup. You need to approach it one of the following ways:

  • Capture the entire form markup, <form>...</form>, and then run a second regex to match all the inputs within that capture.
  • Adjust your current expression to be non-greedy, .*?, and allow for multiple captures of input markup.
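
To make the greedy behaviour concrete, here is a minimal sketch comparing `.*` with `.*?`. The `$html` string is made-up sample markup, not the source of the page in the question.

```php
<?php
// Hypothetical sample markup (an assumption for illustration only).
$html = '<form action="/go.php"><input name="u" type="text"><input name="b" type="submit"></form>';

// Greedy: .* grabs as much as possible, so the match runs from the
// first <input all the way to the LAST type= in the string.
preg_match('/<input.*type=/i', $html, $greedy);

// Non-greedy: .*? stops at the first type= after each <input,
// so preg_match_all finds one match per input tag.
preg_match_all('/<input.*?type=/i', $html, $lazy);

echo $greedy[0], "\n";
print_r($lazy[0]);
```

With this sample, the greedy pattern produces a single match spanning both input tags, while the non-greedy one produces a separate match for each tag.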

1 Comment

Thanks, I didn't realize .* would do that. However, my original problem remains. Putting quotes in breaks the expression, and I don't understand why. To clarify: why does /<form.*?action=/i work, but /<form.*?action="/i not return anything? If I can't figure this out I'll just go the route of capturing the entire form markup and doing it piece by piece. Also, the page I'm testing this with is the one I mentioned, how2bypass.co.cc

Without seeing the target page that you want to extract from, there are only a few things to guess:

  • The type= attribute might not have double quotes, as type=text is valid too. Or it might have single quotes instead, or some whitespace around the =.
  • The .* placeholders might fail if there are newlines between or within the tags. Using the /s regex flag is advisable.
  • And it's usually more reliable to use negated character classes like [^<>]* or [^"] instead of .* anyway.
  • You don't need to escape the equals sign; a plain = matches the same thing as \=.
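
The points above can be combined into one tolerant pattern. This is only a sketch: the `$sample` markup is invented, and the pattern assumes attribute values are either double-quoted, single-quoted, or unquoted without whitespace. It uses negated character classes instead of `.*` (these already match across newlines, so the /s flag is not needed here) and a PCRE branch-reset group `(?|...)` so the value always lands in capture group 1.

```php
<?php
// Hypothetical sample inputs covering the cases listed above:
// unquoted value, single quotes with spaces around =, and a tag split over lines.
$sample = "<input type=text name=a>\n"
        . "<input name='b' type = 'password'>\n"
        . "<INPUT\n  name=\"c\" TYPE=\"submit\">";

// Nowdoc keeps the pattern free of PHP string escaping.
$pattern = <<<'RE'
/<input\b[^<>]*?\btype\s*=\s*(?|"([^"]*)"|'([^']*)'|([^\s"'<>]+))/i
RE;

preg_match_all($pattern, $sample, $m);

// $m[1] holds the type value of each input, whatever quoting was used.
print_r($m[1]);
```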

It may also help to split the work in two: use one regex to extract the <form>...</form> block, then search for the <input> tags within it.
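
A rough sketch of that two-step approach follows. The sample `$pageContents` and the helper `attr()` are hypothetical and exist only for this illustration; real pages may need more forgiving patterns.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$pageContents = <<<'HTML'
<html><body>
<form method="post" action="includes/process.php">
  <input name="u" type="text">
  <input type="submit" name="go">
</form>
<input name="outside" type="hidden">
</body></html>
HTML;

// Hypothetical helper: return the value of one attribute inside a single tag,
// accepting double-quoted, single-quoted, or unquoted values.
function attr(string $tag, string $name): ?string
{
    $re = '/(?<![\w-])' . preg_quote($name, '/')
        . '\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'<>]+))/i';
    return preg_match($re, $tag, $m) ? $m[1] : null;
}

$result = [];

// Step 1: capture each <form>...</form> block (/s lets . cross newlines).
preg_match_all('/<form\b[^>]*>.*?<\/form>/is', $pageContents, $forms);

foreach ($forms[0] as $form) {
    // The opening tag carries the action attribute.
    preg_match('/<form\b[^>]*>/i', $form, $open);

    // Step 2: find every <input> tag inside this form only.
    preg_match_all('/<input\b[^>]*>/i', $form, $inputs);

    $fields = [];
    foreach ($inputs[0] as $input) {
        $fields[] = ['name' => attr($input, 'name'), 'type' => attr($input, 'type')];
    }
    $result[] = ['action' => attr($open[0], 'action'), 'inputs' => $fields];
}

print_r($result);
```

Because the inputs are searched only inside each captured form, the stray hidden input after `</form>` in the sample is ignored.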
