2

I'm reading the source code of an online shop's website, and on each product page I need to find a JSON string that lists the product SKUs and their quantities.

Here are 2 samples:

'{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":10,"sku-SV023435_PU_M":11}'

The sample above shows 3 SKUs.

'{"sku-11430_B_S":"20","sku-11430_B_M":"17","sku-11430_B_L":"30","sku-11430_B_XS":"13","sku-11430_BL_S":"7","sku-11430_BL_M":"17","sku-11430_BL_L":"4","sku-11430_BL_XS":"16","sku-11430_O_S":"8","sku-11430_O_M":"6","sku-11430_O_L":"22","sku-11430_O_XS":"20","sku-11430_LBL_S":"27","sku-11430_LBL_M":"25","sku-11430_LBL_L":"22","sku-11430_LBL_XS":"10","sku-11430_Y_S":"24","sku-11430_Y_M":36,"sku-11430_Y_L":"20","sku-11430_Y_XS":"6","sku-11430_RR_S":"4","sku-11430_RR_M":"35","sku-11430_RR_L":"47","sku-11430_RR_XS":"6"}',

The sample above shows many more SKUs.

The number of SKUs in the JSON string can range from one to infinity.

Now I need a regex pattern to extract this JSON string from each page. Once I have it, I can easily use json_decode().

Update: I found another problem; sorry that my question was incomplete. There is another, similar JSON string whose keys also start with sku-. Please have a look at the source code of the link below and you will understand. The only difference is that its values are alphanumeric, while the values in the one we need are numeric. Also, please note that our final goal is to extract the SKUs with their quantities, so maybe you have a more straightforward solution.

Source

@chris85

Second update:

Here is another strange issue, which is a bit off topic.

When I fetch the URL content using the code below, there is no JSON string in the source!

$html = file_get_contents("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");

But when I open the URL in my browser, the JSON is there! I'm really confused about this :(

2
  • Is sku-11430_Y_M a typo? The quantity isn't in quotes.. Commented Jun 7, 2015 at 17:16
  • I've removed my answer, perhaps @Phil_1984_ will help you. Good luck. Commented Jun 7, 2015 at 19:49

3 Answers

1

Trying to extract specific values from JSON directly with a regexp is almost always a bad idea because of the way JSON is encoded. The better approach is to use a regexp to capture the whole JSON string, then decode it with the PHP function json_decode.

The issue with the missing data is due to a missing required cookie. See my comments in the code below.

<?php

function getHtmlFromDresslinkUrl($url)
{
    $ch = curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,true);

    //You must send the currency cookie to the website for it to return the json you want to scrape
    curl_setopt($ch, CURLOPT_HTTPHEADER, array(
        'Cookie: currencies_code=USD;',
    ));

    $output=curl_exec($ch);

    curl_close($ch);
    return $output;
}

$html = getHtmlFromDresslinkUrl("http://www.dresslink.com/womens-candy-color-basic-coat-slim-suit-jacket-blazer-p-8131.html");

//Capture only the arguments of this one JS function call
if (preg_match("/DL\.items_list\.initItemAttr\((.+)\);/", $html, $matches)) {
    $arguments = $matches[1];

    //Split on the argument separator.
    //This is fragile (it breaks if an argument contains ", "), but it seems to work for this page.
    $args_array = explode(", ", $arguments);

    //You need the 5th argument (index 4)
    $fifth_arg = isset($args_array[4]) ? $args_array[4] : '';

    //Strip the surrounding single quotes
    $fifth_arg = trim($fifth_arg, "'");

    //Decode into an associative array; json_decode returns null on invalid JSON
    $qty_data = json_decode($fifth_arg, true);

    //Then you can work with the PHP array
    if (is_array($qty_data)) {
        foreach ($qty_data as $name => $qtty) {
            echo "Found " . $qtty . " of " . $name . "<br />";
        }
    }
}

?>
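If you would rather keep file_get_contents from the question, the same cookie can probably be sent through a stream context. This is only a sketch: buildDresslinkContext is a hypothetical helper name, and it assumes the currencies_code cookie is the only thing the site needs.

```php
<?php

// Hypothetical helper: builds a stream context that sends the currency cookie
// with the request, mirroring the CURLOPT_HTTPHEADER setting used above.
function buildDresslinkContext($currency = 'USD')
{
    return stream_context_create(array(
        'http' => array(
            'method' => 'GET',
            'header' => "Cookie: currencies_code=" . $currency . "\r\n",
        ),
    ));
}

$context = buildDresslinkContext();

// Usage (performs a network request, so it is left commented out here):
// $html = file_get_contents($url, false, $context);
```

Passing the context as the third argument of file_get_contents makes PHP's HTTP wrapper include the header, so the rest of the parsing code can stay unchanged.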

Special thanks to @chris85 for making me read the question again. Sorry but I couldn't undo my downvote.


1 Comment

God bless you @Phil_1984_, and thanks to chris85 as well. I really appreciate your efforts; sorry that I cannot vote.
0

You will want to use preg_match_all() to perform the regex matching operation (documentation here).

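The following should do it for you. It will match each substring beginning with "sku-" and ending with the digits of the quantity, whether or not the quantity is wrapped in quotes. The matches are stored in $matches.

preg_match_all('/sku-[^"]+":"?[0-9]+/', $input, $matches);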

Working example here.

Alternatively, if you want to extract the entire string, you can use:

preg_match_all('/\{"sku-.*?\}/', $input, $matches);

This will grab everything between the opening and closing braces of each such object.

Working example here.

Please note that $input denotes the input string.
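To turn the matches into something usable, capture groups can collect each SKU and its quantity in one pass. The sketch below is an illustration only: extractSkuQuantities is a hypothetical helper, and it assumes quantities are plain digits that may or may not be quoted, as in the question's samples. Values that are not purely numeric are skipped.

```php
<?php

// Hypothetical helper: returns an array of SKU => integer quantity.
function extractSkuQuantities($input)
{
    $result = array();

    // Group 1: the SKU key. Group 2: the digits of the quantity.
    // The lookahead requires a "," or "}" right after the value, so
    // alphanumeric values such as "12ab" or "red" do not match.
    $pattern = '/"(sku-[^"]+)":"?([0-9]+)"?(?=[,}])/';

    if (preg_match_all($pattern, $input, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $result[$m[1]] = (int) $m[2];
        }
    }

    return $result;
}

$input = '{"sku-SV023435_B_M":7,"sku-SV023435_BL_M":"10"}';
print_r(extractSkuQuantities($input));
```

This still parses JSON with a regex, so it only suits input shaped like the samples; decoding with json_decode is the more robust route.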

2 Comments

Kindly make a demo; it's not working for me :( @grill. My required language is PHP.
I found another problem; sorry that my question was incomplete. There is another, similar JSON string whose keys also start with sku-. Please have a look at the source code of the link below and you will understand. The only difference is that its values are alphanumeric, while the values in the one we need are numeric. dresslink.com/… Also, please note that our final goal is to extract the SKUs with their quantities, so maybe you have a more straightforward solution. @grill
0

A simple /'(\{"[^\}]+\})'/ will match all these JSON strings. Demo: https://regex101.com/r/wD5bO4/2

The first capture group of each match will contain the JSON string for json_decode:

preg_match_all ("/'(\{\"[^\}]+\})'/", $html, $matches, PREG_SET_ORDER);

$html is the HTML to be parsed. With the PREG_SET_ORDER flag, the JSON will be in $matches[0][1], $matches[1][1], $matches[2][1] etc.
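Since the page contains more than one sku- JSON string, one way to pick the right one is to decode every candidate and keep the first whose values are all numeric. This is a rough sketch under that assumption; findQuantityJson is a hypothetical helper name, not something from the site or from PHP.

```php
<?php

// Hypothetical helper: returns SKU => integer quantity for the first quoted
// JSON object whose keys start with "sku-" and whose values are all numeric,
// or null when no such object is found.
function findQuantityJson($html)
{
    preg_match_all("/'(\{\"[^\}]+\})'/", $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $match) {
        $data = json_decode($match[1], true);
        if (!is_array($data) || count($data) === 0) {
            continue; // not valid JSON, or an empty object
        }

        $allNumeric = true;
        foreach ($data as $key => $value) {
            if (strpos($key, 'sku-') !== 0 || !is_numeric($value)) {
                $allNumeric = false;
                break;
            }
        }

        if ($allNumeric) {
            // Normalize mixed values such as "20" and 36 to integers.
            return array_map('intval', $data);
        }
    }

    return null;
}
```

Checking the decoded values rather than the raw text keeps the regex simple and leaves the numeric-versus-alphanumeric decision to PHP.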

2 Comments

g is not a modifier in PHP. php.net/manual/en/reference.pcre.pattern.modifiers.php This throws Warning: preg_match_all(): Unknown modifier 'g' for me.
Thanks for the hint @chris85. preg_match_all already matches all occurrences, so there is no need for g as in JavaScript.
