
I want to extract the content of every element on a page that has an itemprop attribute. The page contains different HTML tags carrying this attribute, and I want the text between each opening and closing tag.

For a heading:

<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>

Table data from td tag:

<td itemprop="productID">AP3963893</td>

The itemprop attribute is common to both. I need the text between these tags, i.e. Whirlpool Direct Drive Washer Motor Coupling and AP3963893, using a regexp.

Below is my first attempt (which is currently not working):

preg_match_all(
    '/<div class=\"pdct\-inf\">(.*?)<\/div>/s',
    $producturl,
    $posts    
);

My code:

<?php
    define('CSV_PATH','csvfiles/');
    $csv_file = CSV_PATH . "producturl.csv"; // Name of your producturl file
    $csvfile = fopen($csv_file, 'r');
    $csv_fileoutput = CSV_PATH . "productscraping.csv"; // Name of your product page data file
    $csvfileoutput = fopen($csv_fileoutput, 'a');

    $websitename = "http://www.appliancepartspros.com";

    while($data = fgetcsv($csvfile)) 
    {
        $producturl = $websitename . trim($data[1]);

        preg_match_all(
            '/<.*itemprop=\".*\".*>(.*?)<\/.*>/s',
            $producturl,
            $posts    
        );
        print_r($posts);
    }

2 Answers

1

Firstly, never ever use RegEx to parse HTML. Secondly, you can achieve this using jQuery quite simply by using the attribute selector:

var nameItemprop = $('[itemprop="name"]').text(); // = 'Whirlpool Direct Drive Washer Motor Coupling'
var productIdItemprop = $('[itemprop="productID"]').text(); // = 'AP3963893'

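Note, however, that inventing your own non-standard attributes is invalid HTML; custom data should ideally live in data-* attributes. (itemprop itself is a standard attribute defined by the HTML microdata specification, so the markup in the question is valid as written; the following only illustrates the data-* form.)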

<h1 data-itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>
<td data-itemprop="productID">AP3963893</td>

var nameItemprop = $('[data-itemprop="name"]').text();
var productIdItemprop = $('[data-itemprop="productID"]').text();

Finally, if several elements ever share the same itemprop value, you need to loop through them to get the text of each individual element.
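Since the question is about PHP, the same attribute-based lookup and loop can be sketched server-side with the built-in DOM extension instead of a regexp. This is only a minimal sketch: the helper name extractItemprops is hypothetical, and it assumes the page's HTML is already available as a string.

```php
<?php
// Hypothetical helper (not from the original post): collect the text of every
// element that carries an itemprop attribute, grouped by the attribute value.
function extractItemprops(string $html): array
{
    $doc = new DOMDocument();
    // Real-world markup is rarely perfect, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    // '//*[@itemprop]' selects any element, at any depth, that has the attribute.
    foreach ($xpath->query('//*[@itemprop]') as $node) {
        // Several elements may share one itemprop value, so keep a list per value.
        $result[$node->getAttribute('itemprop')][] = trim($node->textContent);
    }
    return $result;
}

// Sample markup taken from the question.
$html = '<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>'
      . '<table><tr><td itemprop="productID">AP3963893</td></tr></table>';
print_r(extractItemprops($html));
```

Each key of the returned array is an itemprop value and each entry is a list, so repeated properties are preserved rather than overwritten.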


3 Comments

Can you please suggest a PHP example using regexp? There are a lot of product URLs, which are stored in a CSV file.
Sorry, I don't know PHP. As you tagged jQuery, I used that.
I've updated my question with my code. Please have a look.
0

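As already mentioned, you shouldn't use RegExp to parse HTML, but if you insist on doing it, here's a pattern that works when the input string holds a single element (on a full page the greedy .* parts will match far more than one tag):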

$producturl = '<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>';

if (preg_match_all(
   '/<.*itemprop=\".*\".*>(.*?)<\/.*>/s',
   $producturl,
   $posts    
)) {
    print_r($posts);
}

This creates the following output:

Array
(
    [0] => Array
        (
            [0] => <h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>
        )
    [1] => Array
        (
            [0] => Whirlpool Direct Drive Washer Motor Coupling
        )
)

3 Comments

It returns an empty array. I have updated my code in the question, please have a look.
I've added a complete example so you can copy, paste and execute it. That works for me.
You are passing HTML content in the $producturl variable, but in my case it holds an absolute URL.
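As the last comment points out, the pattern is being run against the URL string itself rather than the HTML of the page, which is why nothing matches. A minimal sketch of one way around this follows: download the page first, then match against its contents. The function names matchItemprops and scrapeItemprops are hypothetical, the tightened non-greedy pattern is a suggested replacement rather than the one from the answer, and fetching with file_get_contents assumes allow_url_fopen is enabled.

```php
<?php
// Hypothetical helper: regex-based extraction on an HTML string.
// Assumption: matched elements do not contain nested elements of the same tag name.
function matchItemprops(string $html): array
{
    // [^>]* keeps each match inside one opening tag; \1 closes the same tag name.
    $pattern = '/<(\w+)\b[^>]*\bitemprop="([^"]*)"[^>]*>(.*?)<\/\1>/s';
    $found = [];
    if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            // $m[2] is the itemprop value, $m[3] the inner HTML of the element.
            $found[$m[2]][] = trim(strip_tags($m[3]));
        }
    }
    return $found;
}

// Hypothetical helper: fetch the page behind a URL, then extract from its HTML.
function scrapeItemprops(string $url): array
{
    $html = @file_get_contents($url); // returns false if the download fails
    return $html === false ? [] : matchItemprops($html);
}

// Offline demonstration with the markup from the question.
$sample = '<h1 itemprop="name" class="h2">Whirlpool Direct Drive Washer Motor Coupling</h1>'
        . '<td itemprop="productID">AP3963893</td>';
print_r(matchItemprops($sample));
```

Inside the CSV loop from the question, the preg_match_all call could then be replaced by something like $posts = scrapeItemprops($producturl);. A regexp remains fragile against real pages, so this is best treated as a stopgap.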
