Java HttpClient getting response with illegal character in url

Question

I'm trying to scrape data from the website Squawka.com. For example, to scrape data from http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match I use this code:

    HttpClient client = new DefaultHttpClient();
    String url = "http://www.squawka.com/teams/chelsea/stats#performance-score#english-barclays-premier-league#season-2014/2015#126#all-matches#1-7#by-match";
    String urlEncode = "http://www.squawka.com/teams/chelsea/stats"
            + URLEncoder.encode("#", "UTF-8") + "performance-score"
            + URLEncoder.encode("#", "UTF-8") + "english-barclays-premier-league"
            + URLEncoder.encode("#", "UTF-8") + "season-2014/2015"
            + URLEncoder.encode("#", "UTF-8") + "126"
            + URLEncoder.encode("#", "UTF-8") + "all-matches"
            + URLEncoder.encode("#", "UTF-8") + "1-7"
            + URLEncoder.encode("#", "UTF-8") + "by-match";
    HttpGet get = new HttpGet(urlEncode);
    HttpResponse response = client.execute(get);
    HttpEntity entity = response.getEntity();
    String content = EntityUtils.toString(entity);
    System.out.println(content);

As you can see, the hash sign # is illegal in the URL (it gave me an IllegalArgumentException), so I encoded the URL with URLEncoder; the result is my second variable, urlEncode. But with this variable, the client requests a different URL, namely

http://www.squawka.com/teams/chelsea/stats%23performance-score%23english-barclays-premier-league%23season-2014/2015%23126%23all-matches%231-7%23by-match

which returns other data.

So my question is: how should I change my code in order to get the data from the right URL (the variable String url)?

Thanks in advance.

You can simply use url.encode and then url.decode to get the normal server URL. – Gurkan İlleez, Commented Oct 14, 2014 at 13:55

Jon Skeet · Accepted Answer · 2014-10-14 13:46:06Z


Everything beyond the # is the fragment identifier. It's not sent to the server as part of the request. In this case it would be used by the JavaScript on the page to perform extra filtering.

When fetching the page programmatically, you just need to fetch http://www.squawka.com/teams/chelsea/stats, which will get the same data the browser receives for the original link. You'll then need to work out what the JavaScript would have done with the fragment identifier in order to get to the right data within the page (possibly making more requests).
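As a rough sketch of that first step, the URL can be cut at the first # before it is handed to HttpGet. The class and method names below (FragmentUrls, withoutFragment, fragmentOf) are made up for illustration and are not part of HttpClient. Plain string handling is used because this URL contains several # characters:

```java
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class FragmentUrls {

    // Returns the part of the URL that is actually sent to the server.
    static String withoutFragment(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Returns everything after the first '#', or "" if there is none.
    static String fragmentOf(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? "" : url.substring(hash + 1);
    }

    public static void main(String[] args) {
        String url = "http://www.squawka.com/teams/chelsea/stats"
                + "#performance-score#english-barclays-premier-league"
                + "#season-2014/2015#126#all-matches#1-7#by-match";

        // prints http://www.squawka.com/teams/chelsea/stats
        System.out.println(withoutFragment(url));

        // The individual filter values the page's JavaScript would see.
        System.out.println(Arrays.toString(fragmentOf(url).split("#")));
    }
}
```

The result of withoutFragment(url) is what you would pass to new HttpGet(...) in the question's code. The fragment parts are only useful for working out which follow-up requests the page itself makes.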


2 Comments

PaterMark Over a year ago

Thanks for your answer. Do you have any tips on how to work out what the JavaScript would have done with the fragment identifier?

Jon Skeet Over a year ago

@PaterMark: I'd start by looking at the requests it makes, e.g. in the developer console in Chrome. You might want to check whether the data source is happy for you to scrape their data first though...
