I have an HTML page that I fetch using libcurl (cURL inside PHP).
The page has one <form>, and I need to extract the names and values of its <input> elements. I would like to do that with a regex.
I'm using a regex because I think that's the easiest way. If you think I shouldn't use a regex but something like XPath instead, please say how.

If anything about what I'm trying to do is unclear, feel free to ask.

Here's the PHP code (complete):

<?php



/***** DISABLED BY NETWORK TRAFFIC REASONS... USING LOCAL CACHE

$curl = curl_init();
$url = 'https://secure.optimus.pt/Particulares/Kanguru/Login/';
$useragent = 'Mozilla/5.0 (Windows; U; Windows NT 6.0; pt-PT; rv:1.9.1) Gecko/20090624 Firefox/3.5';
curl_setopt($curl,CURLOPT_URL,$url);
curl_setopt($curl,CURLOPT_USERAGENT,$useragent);
curl_setopt($curl,CURLOPT_SSL_VERIFYPEER,true);
curl_setopt($curl,CURLOPT_SSL_VERIFYHOST,2);
curl_setopt($curl,CURLOPT_CAINFO,getcwd()."\optimus_secure.crt");
curl_setopt($curl,CURLOPT_RETURNTRANSFER,true);
$contents = curl_exec($curl);
*/

$contents = file_get_contents('local_secure.html');
preg_match('%<form name="aspnetForm" .*? action="(.*?)" .*?>(.*?)</form>%s',$contents,$matches);
//echo '<pre>'.htmlentities($contents).'</pre>';
//array_shift($matches);
echo '<pre>---------';
foreach($matches as $match)
    echo '$match:::::: '.htmlentities($match)."\r\n\r\n";
echo '</pre>';

echo '<pre>__________';
preg_match_all('/<input type=".*?" name="(.*?)" value="(.*?)" \/>/', $matches[0], $matches2);
print_r($matches2);
echo '</pre>';

?>

Of course, the <pre> tags and all that output are just for debugging.

Also, here's the source code of the HTML page (the part that matters):

<form name="aspnetForm" method="post" action="../Login?OptimusChannelID=D5774383-A407-42E9-A0AD-4838C97AB162&amp;OptimusContentID=&amp;OptimusSelectedSiteID=B33E7D52-8738-4756-A25D-B907D1823B71&amp;OptimusSelectedAreaID=AF8E0BDF-17E3-4438-9FA9-D53A13A508D8&amp;OptimusSelectedLocalID=D5774383-A407-42E9-A0AD-4838C97AB162" onsubmit="javascript:return WebForm_OnSubmit();" id="aspnetForm">
<div>
<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />
<input type="hidden" name="__EVENTARGUMENT" id="__EVENTARGUMENT" value="" />
<input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="/wEPDwUKLTc4MzE4NTQyNQ9kFgJmD2QWBgIID2QWAgIBD2QWBGYPZBYCAgMPDxYEHhRWYWxpZGF0aW9uRXhwcmVzc2lvbgUCLioeB0VuYWJsZWRoZGQCAQ9kFgICBQ8PFgIeBFRleHQFKk8gY2FtcG8gRW1haWwgJmVhY3V0ZTsgb2JyaWdhdCZvYWN1dGU7cmlvIWRkAgkPZBYCAgEPFgIfAmVkAgoPDxYCHgdWaXNpYmxlaGRkGAEFHl9fQ29udHJvbHNSZXF1aXJlUG9zdEJhY2tLZXlfXxYCBSVjdGwwMCRNYWluQ29udGVudFBsYWNlSG9sZGVyJEltZ0xvZ2luBSxjdGwwMCRNYWluQ29udGVudFBsYWNlSG9sZGVyJGltZ0J0blJlY3VwZXJhcorZDETv8JCxlvTojv3w53/dbo9m" />
</div>
<script type="text/javascript">....</script>
<script src="..." type="text/javascript"></script>
<script src="..." type="text/javascript"></script>
<script type="text/javascript">...</script>
<div class="row_container">
<div class="titulo_barra rosa laranja_empresas">
LOGIN<br/>
</div>
<div class="PanelLogin">
<div class="Mensagem">
<div class="texto">
Para aceder, por favor, fa&ccedil;a login. 
</div>
</div>
<div id="ctl00_MainContentPlaceHolder_PanelLogin" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_MainContentPlaceHolder_ImgLogin')">
<div class="Mensagem">
<div>                           
<label for="telem">
User<span id="ctl00_MainContentPlaceHolder_UsernameValidator" style="color:Red;display:none;"></span>
</label>
<input name="ctl00$MainContentPlaceHolder$TxtUsername" type="text" id="ctl00_MainContentPlaceHolder_TxtUsername" class="text" maxlength="255" />
<label style="padding-left: 10px" for="password">
Password
<span id="ctl00_MainContentPlaceHolder_RequiredPasswordValidator" style="color:Red;display:none;"></span><span id="ctl00_MainContentPlaceHolder_UsernameRegexValidator" style="color:Red;display:none;"></span> </label>
<input name="ctl00$MainContentPlaceHolder$TxtPassword" type="password" id="ctl00_MainContentPlaceHolder_TxtPassword" class="text" maxlength="5" />
<input type="hidden" name="fromssl" value="" />
<input type="image" name="ctl00$MainContentPlaceHolder$ImgLogin" id="ctl00_MainContentPlaceHolder_ImgLogin" src="/img/btn_password.gif" alt="Login" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$MainContentPlaceHolder$ImgLogin&quot;, &quot;&quot;, true, &quot;&quot;, &quot;&quot;, false, false))" style="border-width:0px;position: absolute; padding-left: 5px " /><br />                          
</div>
<div id="login_error_box">
<div id="ctl00_MainContentPlaceHolder_ValidationSummary1" class="error" style="color:#FF6000;display:none;">
</div>
</div>
</div>
</div>
</div>
<div class="titulo_barra rosa laranja_empresas">
RECUPERA&Ccedil;&Atilde;O DE PASSWORD
</div>
<div class="PanelLogin">
<div class="Mensagem">
<div class="texto">
Para recuperar a sua password introduza o seu e-mail. Se pretender recuperar o seu username utilize o link abaixo
</div>
</div>
<div id="ctl00_MainContentPlaceHolder_Panel1" onkeypress="javascript:return WebForm_FireDefaultButton(event, 'ctl00_MainContentPlaceHolder_imgBtnRecuperar')">
<div class="Mensagem">
<div id="Div1">
<label for="telem">
Email</label>
<input name="ctl00$MainContentPlaceHolder$txtEmailHabitual" type="text" id="ctl00_MainContentPlaceHolder_txtEmailHabitual" class="text" maxlength="255" />
<input type="image" name="ctl00$MainContentPlaceHolder$imgBtnRecuperar" id="ctl00_MainContentPlaceHolder_imgBtnRecuperar" class="img rosa azul_empresas" src="/img/bot_recuperar.gif" alt="Recuperar" onclick="javascript:WebForm_DoPostBackWithOptions(new WebForm_PostBackOptions(&quot;ctl00$MainContentPlaceHolder$imgBtnRecuperar&quot;, &quot;&quot;, true, &quot;email&quot;, &quot;&quot;, false, false))" style="border-width:0px;margin-top: -2px; position: absolute;" />
<br />
<span id="ctl00_MainContentPlaceHolder_EmailValidator" class="error" style="color:Red;display:none;">O campo Email &eacute; obrigat&oacute;rio!</span>
<span id="ctl00_MainContentPlaceHolder_EmailRegularExpressionValidator" style="color:Red;display:none;"> Formato do Email inválido.</span>
</div> 
<div class="Mensagem" CssClass="error" DisplayMode="SingleParagraph" ForeColor="#FF6000">
</div>
<a id="ctl00_MainContentPlaceHolder_lnkRecuser" href="javascript:__doPostBack('ctl00$MainContentPlaceHolder$lnkRecuser','')">
<div align="left"  style="color:#FF7000" class="footerButtonsOrange">Recuperar username</div>
</a>
</div>
</div>
</div>
</div>
<script type="text/javascript">...</script>
<script type="text/javascript">...</script>
<script type="text/javascript">...</script>
<div>
<input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="/wEWBwKQ08lZAqmxyPwLAvCnm8wMAt/Wt8sGAv2svvMEAtCB5oUIAr6ar9wLz+9apOkY23Vs+vCYNJuK2ug3Gm0=" />
</div>
<script type="text/javascript">...</script>
</form>

Also, sorry for the low readability of the source code. If you want, I can try to indent it better.

Thank you,
Pedro Cunha

EDIT: Thank you all for your help. All the answers worked flawlessly; however, I chose VolkerK's response because, in an HTML page, elements may be nested, and I know (from the little I know about XPath) that // matches elements at any depth.
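
To make the point about nesting concrete, here is a minimal sketch comparing `/` (direct children only) with `//` (descendants at any depth). The markup and the `$names` helper are made up for illustration and are not taken from the real page.

```php
<?php
// Hypothetical markup: one input nested in a <div>, one directly under <form>.
$html = '<form name="aspnetForm"><div><input name="nested" value="1" /></div>'
      . '<input name="direct" value="2" /></form>';

$doc = new DOMDocument;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Illustrative helper: collect the name attribute of every node in a result list.
$names = function (DOMNodeList $nodes): array {
    $out = [];
    foreach ($nodes as $node) {
        $out[] = $node->getAttribute('name');
    }
    return $out;
};

// "/" selects only inputs that are direct children of the form.
$childrenOnly = $names($xpath->query('//form/input'));
// "//" selects inputs anywhere below the form, however deeply nested.
$anyDepth = $names($xpath->query('//form//input'));

print_r($childrenOnly);
print_r($anyDepth);
```

With this markup, `//form/input` finds only the input named `direct`, while `//form//input` finds both.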

3 Answers

If you think I shouldn't use regex, but something like xpath, say how.
That would be something like

<?php
$doc = new DOMDocument;
// loadHTML() returns false when the markup cannot be loaded at all
if ( !$doc->loadHTML($contents) ) {
  echo 'something went wrong';
}
else {
  $xpath = new DOMXPath($doc);
  // every <input> at any depth inside the form named "aspnetForm"
  foreach($xpath->query('//form[@name="aspnetForm"]//input') as $eInput) {
      echo 'name=', $eInput->getAttribute('name'), ' value=', $eInput->getAttribute('value'), "\n";
  }
}

If you get annoying warning messages, you might want to use @$doc->loadHTML($contents); maybe in conjunction with libxml_use_internal_errors() and libxml_get_errors().
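
As a rough sketch of the libxml route (one possible arrangement, not the only one): the snippet below turns on internal error collection, loads the markup, and then reads the collected errors instead of letting PHP print warnings. The `$contents` string here is a small, deliberately sloppy stand-in for the real page, and `$fields` is a name introduced only for this example.

```php
<?php
// Hypothetical, deliberately sloppy markup standing in for the fetched page.
$contents = '<form name="aspnetForm"><div>'
          . '<input type="hidden" name="fromssl" value="" /></form>';

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument;
$loaded = $doc->loadHTML($contents);

// Read whatever the parser complained about, then empty the error buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with message and line properties.
    echo 'libxml: ', trim($error->message), ' (line ', $error->line, ")\n";
}

// Same XPath query as above, collected into a name => value array.
$fields = [];
if ($loaded) {
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//form[@name="aspnetForm"]//input') as $eInput) {
        $fields[$eInput->getAttribute('name')] = $eInput->getAttribute('value');
    }
}
print_r($fields);
```

Compared with the @ operator, this keeps the parser's complaints available for logging rather than discarding them.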

1 Comment

Solved my problem 10 years later

How about this --> http://simplehtmldom.sourceforge.net/

* An HTML DOM parser written in PHP 5+ that lets you manipulate HTML in a very easy way!
* Requires PHP 5+.
* Supports invalid HTML.
* Find tags on an HTML page with selectors just like jQuery.
* Extract contents from HTML in a single line.

// Load the library (simple_html_dom.php comes with the download above)
include 'simple_html_dom.php';

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach ($html->find('img') as $element)
    echo $element->src . '<br>';

// Find all links
foreach ($html->find('a') as $element)
    echo $element->href . '<br>';

Good luck.

1 Comment

"Good luck." Thanks, I'll need it. :)

OK. Since you asked: You should not try to parse non-regular languages with regular expressions. A simple heuristic is: if the language seems "nested", it is not regular.
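
A small sketch of how fragile the regex approach can be, even before nesting comes into play. It runs the input pattern from the question against three <input> tags adapted from the question's page (the third is shortened), whose attributes appear in different orders.

```php
<?php
// Three <input> tags adapted from the question, one per line.
$html = '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />' . "\n"
      . '<input type="hidden" name="fromssl" value="" />' . "\n"
      . '<input name="ctl00$MainContentPlaceHolder$TxtUsername" type="text" class="text" />';

// The pattern from the question: it expects type, name, value in exactly
// that order, with nothing else in between.
preg_match_all('/<input type=".*?" name="(.*?)" value="(.*?)" \/>/', $html, $m);

// $m[1] holds the captured "names".
print_r($m[1]);
```

Only two of the three tags match. The third is skipped because it starts with name rather than type, and for the first tag the captured "name" is `__EVENTTARGET" id="__EVENTTARGET`, because the lazy group swallows the id attribute on its way to value.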

One simple way might be something along the following lines:

$htmldoc = new DOMDocument;
$htmldoc->loadHTMLFile("local_secure.html");
$forms = $htmldoc->getElementsByTagName("form");
// item(0) is the first <form> on the page
$inputs = $forms->item(0)->getElementsByTagName("input");

// do_something_with() is a placeholder for your own handling
foreach ($inputs as $input) {
    do_something_with($input->getAttribute("name"));
    do_something_with($input->getAttribute("value"));
}

Add error checks to your liking. Further documentation: http://www.php.net/book.dom
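
One way the error checks might look, as a sketch only: the function name `first_form_inputs` and the sample markup are invented for this example, and it loads from a string with loadHTML() rather than from local_secure.html so that it is self-contained.

```php
<?php
// Hypothetical stand-in for the contents of local_secure.html.
$html = '<html><body><form name="aspnetForm" action="../Login">'
      . '<input type="hidden" name="__EVENTTARGET" id="__EVENTTARGET" value="" />'
      . '<div><input name="user" type="text" value="pedro" /></div>'
      . '</form></body></html>';

/**
 * Returns name => value for every named <input> in the first <form>,
 * or null when the markup is empty, cannot be loaded, or has no form.
 */
function first_form_inputs(string $html): ?array
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $htmldoc = new DOMDocument;
    // @ silences parser warnings about sloppy markup.
    if (!@$htmldoc->loadHTML($html)) {
        return null;
    }
    $forms = $htmldoc->getElementsByTagName('form');
    if ($forms->length === 0) {
        return null; // no <form> on the page
    }
    $fields = [];
    foreach ($forms->item(0)->getElementsByTagName('input') as $input) {
        if (!$input->hasAttribute('name')) {
            continue; // unnamed inputs are not submitted with the form
        }
        $fields[$input->getAttribute('name')] = $input->getAttribute('value');
    }
    return $fields;
}

print_r(first_form_inputs($html));
```

Returning an associative array makes it straightforward to pass the fields on, for example to http_build_query() when preparing a POST body.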
