1

I have several unordered lists. Each list item contains a link. How can I extract the URL and link text from each list item to insert into a database?

<ul id="1">
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
</ul>

<ul id="2">
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
</ul>

<ul id="3">
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
    <li><a href="someplace.com">Text</a></li>
</ul>

I know regex should be avoided. I already have PDO set up. The ul id number goes into the categoryID column of the MySQL table.

The only thing that seems to make sense would be something like a while-loop with another loop inside to get the URLs and text, and then increment the id afterward. I just don't know how to start it. Should the URL and text go into an array?

9
  • Are you trying to insert them after the page is loaded and also is this dynamic content? Commented May 30, 2013 at 13:24
  • Why should regex be avoided? Commented May 30, 2013 at 13:26
  • I'm suspecting you will want to make use of DOMDocument or another PHP DOM library. Commented May 30, 2013 at 13:26
  • Nah. I only need to have this run once, just to get the existing lists I have inserted. Commented May 30, 2013 at 13:26
  • I think what is being asked is if you have these lists before you've sent them to the browser, or if you are trying to get them after they have been sent to the browser. Commented May 30, 2013 at 13:28

4 Answers

3

Assuming your HTML is stored in the string $content, you could use PHP DOM to extract the various list items without having to resort to regex.

// loadHTML() is an instance method, so create the document first
$dom = new DOMDocument();
$dom->loadHTML($content);
$lists = $dom->getElementsByTagName('ul');
foreach ($lists as $list) {
  // the ul id is the value that goes into categoryID
  $id = $list->getAttribute('id');
  $links = $list->getElementsByTagName('a');
  foreach ($links as $link) {
    $href = $link->getAttribute('href');
    $text = $link->nodeValue;
    // insert $id, $text and $href into the database here
  }
}
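
To fill in the "insert into the database here" step, here is a minimal sketch that first collects the rows into an array and then writes them with one PDO prepared statement. The function names `extractLinks` and `insertLinks`, the table name `links`, and the columns `url` and `link_text` are assumptions for illustration; only `categoryID` comes from the question.

```php
<?php
/**
 * Collect one row per link: the id of the enclosing <ul>, the href,
 * and the link text.
 */
function extractLinks(string $html): array
{
    $dom = new DOMDocument();
    // Keep libxml from emitting warnings about imperfect markup.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $rows = [];
    foreach ($dom->getElementsByTagName('ul') as $list) {
        $categoryId = (int) $list->getAttribute('id');
        foreach ($list->getElementsByTagName('a') as $link) {
            $rows[] = [
                'categoryID' => $categoryId,
                'url'        => $link->getAttribute('href'),
                'link_text'  => trim($link->textContent),
            ];
        }
    }
    return $rows;
}

/**
 * Insert the rows through a single prepared statement.
 * Table `links` and columns `url`, `link_text` are hypothetical names.
 */
function insertLinks(PDO $pdo, array $rows): void
{
    $stmt = $pdo->prepare(
        'INSERT INTO links (categoryID, url, link_text)
         VALUES (:categoryID, :url, :link_text)'
    );
    foreach ($rows as $row) {
        // The array keys match the named placeholders above.
        $stmt->execute($row);
    }
}
```

With `$content` and an existing `$pdo` connection, the whole job would then be `insertLinks($pdo, extractLinks($content));`. Binding the values through placeholders also avoids having to escape the URLs and link text by hand.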

1

You could use regex just fine:

// PREG_SET_ORDER groups each match as array(full match, href, link text)
preg_match_all('/<a href="(.*?)"[^>]*>(.*?)<\/a>/i', $string, $matches, PREG_SET_ORDER);

$insert = array();

foreach($matches as $val)
{
    /* DON'T FORGET TO ESCAPE YOUR DATA, or bind it through a PDO prepared statement */
    $url = $val[1];
    $name = $val[2];

    $insert[] = 'INSERT INTO tableName (url, name) VALUES ("' . $url . '", "' . $name . '")';
}

print_r($insert);

3 Comments

The only problem with regex is that it becomes hard to pick and choose which URLs you are capturing in a huge block. What if there are URLs spread between the <UL> tags? Using the DOM gives you the ability to apply better filtering logic and improves readability at, perhaps, the cost of some speed.
The above regex only grabs the URLs from within the href attribute of <a> tags. It can quite easily be modified to look for <a> tags within <li> tags as well. I guess it's a matter of preference.
The question as to why there is hesitancy in using regex was never answered. It's completely useful and viable for long-term support. Besides, if this is a one-off use, using something like Perl would be a better solution because of its inherent wide support of regex features.

0

I recommend you try SimpleHTMLDom, a PHP library I use for processing XML-like documents.

You could do something like this:

require_once("/path/to/simplehtmldom/library");
$parsed_data = array();
//we next need to create a dom object; use whichever case applies
//case 1: the HTML is in a string
$dom_object = str_get_html($html_string);
//case 2: it's at a particular url
//$dom_object = file_get_html("http://www.site-with-the-content.com");
//now we have our object
$links = $dom_object->find("ul li a");
//finds all the <a> tags on the page inside <ul>; you could filter
//using classes or ids, as with jQuery, if you like
foreach($links as $link){
    $parsed_data[] = array(
        "link"=>$link->href,
        //plaintext is the text only; innertext would include any nested HTML
        "text"=>$link->plaintext
    );
}
//You can now go through your array of parsed content and insert into your DB

Hope this helps :)

SimpleHTMLDom Sourceforge project

0

Here is a jQuery version that extracts the values, if you are trying to get them after they have been sent to the browser:

var data = $("ul");
// a plain object, because the keys below are strings rather than array indexes
var values = {};
$.each(data, function (i) {
    values[i] = $(this).attr("id");
    $.each($(this).find("li"), function (j) {
        values[i + "-" + j + "link"] = $(this).find("a").attr("href");
        values[i + "-" + j + "text"] = $(this).find("a").text();
    });
});
console.log(values);

Now send this object to your PHP file via an AJAX call.
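
On the PHP side, the receiving script could turn that flat key/value map back into one row per link. This is only a sketch under assumptions: the function name `rowsFromPayload` and the row keys `url` and `link_text` are hypothetical, and it assumes the browser posts the map as a JSON object via `JSON.stringify` (a plain object rather than an `Array`, since `JSON.stringify` skips non-index properties of arrays).

```php
<?php
/**
 * Turn the flat map built by the jQuery snippet into one row per link.
 * Expected keys: "0" => ul id, "0-0link" => href, "0-0text" => link text, ...
 */
function rowsFromPayload(string $json): array
{
    $values = json_decode($json, true);
    if (!is_array($values)) {
        return []; // not valid JSON, or not an object
    }

    $rows = [];
    foreach ($values as $key => $href) {
        // Only the "<list>-<item>link" keys start a row.
        if (!preg_match('/^(\d+)-(\d+)link$/', (string) $key, $m)) {
            continue;
        }
        $rows[] = [
            // The ul id was stored under the bare list index.
            'categoryID' => (int) ($values[$m[1]] ?? 0),
            'url'        => (string) $href,
            'link_text'  => (string) ($values["{$m[1]}-{$m[2]}text"] ?? ''),
        ];
    }
    return $rows;
}
```

In the receiving script, the JSON request body could be read with `file_get_contents('php://input')` and passed to this function before inserting the rows with PDO.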

Hope it makes sense
