function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/25.0.1");
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_COOKIE, 'long cookie here');
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}

The original URL I'm feeding it is http://example.com/i-123.html, but if I open it in a browser, I get redirected to https://example.com/item-description-123.html (which is why I added CURLOPT_FOLLOWLOCATION).

However, the output of this function is binary data.

1f8b 0800 0000 0000 0003 ed7d e976 db38
f2ef e7f8 2930 9ac9 d86e 9b92 b868 f3a2
3e5e 9374 67fb c7ee 74f7 e4e6 f880 2428
31a6 4835 172f 3dd3 8f74 3fde 17b8 f7c5
6e15 008a 8ba8 2db1 3ce9 25a7 dba4 4810
......
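(For reference, the leading bytes `1f 8b` are the gzip magic number, which is a strong hint the response body is gzip-compressed rather than corrupted. A minimal sketch of detecting and decompressing such a body in PHP; the helper name `maybe_gunzip` is hypothetical, and `gzdecode()` requires the zlib extension:)

```php
<?php
// Decompress a response body if it starts with the gzip magic bytes (1f 8b).
// Helper name is hypothetical; gzdecode()/gzencode() come from ext-zlib.
function maybe_gunzip($body) {
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $body; // not gzipped (or decoding failed): return unchanged
}

// Round-trip demonstration with locally gzipped data:
$html = "<html><body>item 123</body></html>";
var_dump(maybe_gunzip(gzencode($html)) === $html); // bool(true)
```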

How do I fix this? I tried adding

curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);

(copied from somewhere). It didn't work.

file_get_contents() gives me the same output.

  • How do you print the output? How did you get that column data on your screen? Commented Feb 2, 2015 at 18:09
  • Via terminal: $ php parser.php > output Commented Feb 2, 2015 at 18:10
  • If PHP echoes binary data, it is just displayed as broken characters. I don't get how you got those columns on your screen. Commented Feb 2, 2015 at 18:13
  • Well, the command written above doesn't echo the output in the terminal, but saves it into a file. When you open the file with a text editor, you see what I posted. Commented Feb 2, 2015 at 18:14
  • Try switching your text editor to UTF-8 text mode instead of binary. Commented Feb 2, 2015 at 18:21

1 Answer


Well, the solution was pathetic...

Using wget -S http://example.com I found out that the content is compressed (gzipped). Using gunzip I successfully extracted the HTML.

I also added this to my original PHP script:

curl_setopt($ch, CURLOPT_ENCODING, "");

And it worked like a charm.
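Put together, the corrected function might look like the sketch below (the asker's code with the CURLOPT_ENCODING fix applied; passing an empty string tells cURL to advertise every encoding it supports and to decompress the response transparently). Requires ext-curl:

```php
<?php
// Fetch a URL, letting cURL negotiate and decode gzip/deflate transparently.
function curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the http -> https redirect
    curl_setopt($ch, CURLOPT_ENCODING, "");         // "" = accept all supported encodings, auto-decode
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
```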


5 Comments

Very interesting. A thing to remember. Glad you found the answer!
Wow. I thought the site was using some sort of trickery to return garbage and prevent me from scraping it. Thanks!
Or add the --compressed option for automatic gunzipping.
As a side note: I was running curl from the CLI. Adding --compressed as an option meant it then correctly downloaded as HTML. This answer pushed me in the right direction :)
This also worked for me. On Windows, I piped the output to download.gzip and then extracted it with 7zip.
