I'm trying to scrape data from the website Squawka.com. For example, to scrape data from http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match I use this code:

    HttpClient client = new DefaultHttpClient();
    String url = "http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match";
    String urlEncode = "http://www.squawka.com/teams/chelsea/stats"
            + URLEncoder.encode("#", "UTF-8") + "performance-score"
            + URLEncoder.encode("#", "UTF-8") + "english-barclays-premier-league"
            + URLEncoder.encode("#", "UTF-8") + "season-2014/2015"
            + URLEncoder.encode("#", "UTF-8") + "126"
            + URLEncoder.encode("#", "UTF-8") + "all-matches"
            + URLEncoder.encode("#", "UTF-8") + "1-7"
            + URLEncoder.encode("#", "UTF-8") + "by-match";
    HttpGet get = new HttpGet(urlEncode);
    HttpResponse response = client.execute(get);
    HttpEntity entity = response.getEntity();
    String content = EntityUtils.toString(entity);
    System.out.println(content);

As you can see, the URL contains the hash sign #, which is illegal there: passing the variable url to HttpGet gave me an IllegalArgumentException. So I decided to encode the URL using URLEncoder, which gives my second variable, urlEncode. But with this variable, the request goes to a different URL, namely

http://www.squawka.com/teams/chelsea/stats%23performance-score%23english-barclays-premier-league%23season-2014/2015%23126%23all-matches%231-7%23by-match

which returns other data.

So my question is: how should I change my code to get the data from the right URL (the variable url)?

Thanks in advance.

  • You can simply use url.encode and then url.decode to get the normal server URL. Commented Oct 14, 2014 at 13:55

1 Answer

Everything beyond the # is the fragment identifier. It's not sent to the server as part of the request. In this case it would be used by the JavaScript on the page to perform extra filtering.
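As a rough illustration of that split, here is a minimal sketch using only the standard library. The class and method names (FragmentDemo, withoutFragment) are made up for this example, and it uses plain string handling rather than any particular HTTP client's behaviour. It also shows why the URLEncoder attempt ends up at a different resource: the encoded "%23" is no longer a fragment delimiter, so it becomes part of the path that is sent to the server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FragmentDemo {
    // Hypothetical helper: returns the part of the URL before the first '#',
    // which is the part that is actually sent in the HTTP request.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String url = "http://www.squawka.com/teams/chelsea/stats"
                + "#performance-score#english-barclays-premier-league"
                + "#season-2014/2015#126#all-matches#1-7#by-match";

        // The request target: everything up to, but not including, the first '#'.
        System.out.println(withoutFragment(url));

        // Encoding '#' turns it into "%23", an ordinary path character,
        // so the server is asked for a different path.
        System.out.println(URLEncoder.encode("#", "UTF-8"));
    }
}
```

Running main should print http://www.squawka.com/teams/chelsea/stats followed by %23.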

When fetching the page programmatically, you just need to fetch http://www.squawka.com/teams/chelsea/stats. That will get the same data down to the browser as the original link, but you'll then need to work out what the JavaScript would have done with the fragment identifier in order to get to the right data within the page (possibly making more requests).
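To see what the JavaScript has to work with, it can help to pull the fragment apart into its pieces. The sketch below is only that: the class and method names (FragmentParts, fragmentParts) are invented for the example, and treating each '#'-separated piece as one filter value is an assumption about how the site's script reads the fragment, not something confirmed here.

```java
import java.util.Arrays;

public class FragmentParts {
    // Hypothetical helper: returns the '#'-separated pieces that follow the
    // first '#' in the URL, or an empty array if there is no fragment.
    static String[] fragmentParts(String url) {
        int hash = url.indexOf('#');
        if (hash < 0 || hash == url.length() - 1) {
            return new String[0];
        }
        return url.substring(hash + 1).split("#");
    }

    public static void main(String[] args) {
        String url = "http://www.squawka.com/teams/chelsea/stats"
                + "#performance-score#english-barclays-premier-league"
                + "#season-2014/2015#126#all-matches#1-7#by-match";

        // These values never reach the server; any filtering based on them
        // would have to be reproduced by your own code.
        System.out.println(Arrays.toString(fragmentParts(url)));
    }
}
```

For the URL in the question this yields seven pieces, from performance-score through by-match. What each one means to the page is something you would still have to work out from the site itself.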

2 Comments

Thanks for your answer, do you have any tips on how to work out what the Javascript would have done with the fragment identifier?
@PaterMark: I'd start by looking at the requests it makes, e.g. in the developer console in Chrome. You might want to check whether the data source is happy for you to scrape their data first though...
