7

I have a list of URLs that I need to verify are valid. I've written a program in Java that uses Apache's HttpClient to check each link. I had to implement my own redirect strategy because the redirect URLs contain invalid characters (such as { and }) that the default strategy doesn't handle. It works fine in the majority of cases, except for two:

  1. Escaped Characters in the path or query params, which should not be encoded further. Example:

    String url = "http://www.example.com/chapter1/%3Fref%3Dsomething%26term%3D?ref=xyz"
    

    If I use a URI object, it chokes on the "{" character.

    URI myUri = new URI(url); // ==> This will fail.
    

    If I run:

    URI myUri = new URI(UriUtils.encodeHttpUrl(url));
    

    it encodes the %3F to %253F. However when I follow the link using Chrome or Fiddler, I do not see %3F getting escaped again. How do I protect from over-encoding the path or query params?

  2. The last query param in the URL has a valid URL as well. Eg.

    String url = "www.example.com/Chapter1/?param1=xyz&param2=http://www.google.com/?abc=1"
    

My current encoding strategy splits up the query params and then calls URLEncoder.encode on the query params. This however causes the last param to be encoded as well (which is not the case when I follow it in Fiddler or Chrome).

I've tried a number of things (using UriUtils, special cases for URLs as the last param, and other hacks) but nothing seems ideal. What's the best way to solve this?

4 Answers

4

How do I protect from over-encoding the path or query params?

You cannot "protect from over-encoding". You either encode, or you do not. You should always know, for any given string, whether it is encoded or not. You should only encode strings which are not yet encoded, and you should never encode strings which are already encoded.

So is this string encoded or not?

%3Fref%3Dsomething%26term%3D{keyword}

It seems to me like this is bad input: clearly this is not encoded because it contains invalid characters ('{' and '}'). Yet it also seems not to be an unencoded string, because it contains '%xx' sequences. So it's partly-encoded.

There is no programmatic "solution" once a string is in this form -- you simply need to avoid getting a string into such a form in the first place. You may be able to construct an algorithm which "fixes" this string, by carefully looking for parts looking like a "%" followed by two hex digits, and leaving them alone. But this will break on subtle cases.

Consider an unencoded string "42%23", which is supposed to be a literal representation of the mathematical expression "42 mod 23". When I put this into a URI, I expect it to encode as "42%2523" so it decodes as "42%23", but the above algorithm will break and encode it as "42%23" which will then decode as "42#". So there is no way to fix the above string. Encoding "%3F" to "%253F" is exactly what a URI encoder should be doing.
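The "42%23" case can be reproduced with a short sketch. Here `lenientEncode` is a hypothetical helper (not a library method) standing in for the "leave anything that looks like %XX alone" algorithm, and `URLEncoder` is assumed as the strict encoder for a single component:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class EncodingAmbiguity {

    // Strict encoder: every '%' in the input is treated as a literal character.
    static String strictEncode(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if s has '%' at index i followed by two hex digits.
    static boolean looksEscaped(String s, int i) {
        return s.charAt(i) == '%'
                && i + 2 < s.length()
                && Character.digit(s.charAt(i + 1), 16) >= 0
                && Character.digit(s.charAt(i + 2), 16) >= 0;
    }

    // Hypothetical "lenient" encoder: copies %XX sequences through untouched
    // and strictly encodes everything else.
    static String lenientEncode(String raw) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < raw.length()) {
            if (looksEscaped(raw, i)) {
                out.append(raw, i, i + 3);
                i += 3;
            } else {
                int cp = raw.codePointAt(i);
                out.append(strictEncode(new String(Character.toChars(cp))));
                i += Character.charCount(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "42%23"; // meant literally: "42 mod 23"
        System.out.println(decode(strictEncode(raw)));  // 42%23
        System.out.println(decode(lenientEncode(raw))); // 42#
    }
}
```

The strict encoder round-trips the input, while the lenient one silently changes its meaning, which is why guessing at "already encoded" parts cannot be made reliable.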

Note: Having said this, browsers often allow you to get away with typing bad characters into URIs and they automatically encode them. That's not very robust so it shouldn't be used unless you are trying to be very forgiving of user input. In that case, you can do a "best effort" by first decoding the URI and then re-encoding it. In this case, if I wanted to type "42%23" I would have to manually type in "42%2523".

As for question 2:

This however causes the last param to be encoded as well

Similarly, this is exactly what you want. If a URI appears as a parameter inside another URI, it should be percent-encoded. Otherwise, how can you tell where one URI finishes and the other continues? I believe the above URI is actually valid (since ':', '/', '&' and '=' are reserved characters, not forbidden, and therefore they are allowed as long as they do not create ambiguity). But it is much safer to have a URI-inside-a-URI escaped.
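A minimal sketch of escaping a URL that is passed as a query parameter. The class and method names are made up for illustration, the `http://` scheme on the outer URL is an assumption (the question's example omits it), and the inner URL is given an extra `&def=2` to show where ambiguity would otherwise arise:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class NestedUrlParam {

    // Encodes a single query parameter value (not a whole URL).
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Builds the outer URL from values that are known to be unencoded.
    static String buildUrl(String base, String param1, String param2) {
        return base + "?param1=" + encodeValue(param1)
                + "&param2=" + encodeValue(param2);
    }

    public static void main(String[] args) {
        String inner = "http://www.google.com/?abc=1&def=2";
        // Unescaped, "&def=2" would be read as a third parameter of the outer URL.
        System.out.println(
                buildUrl("http://www.example.com/Chapter1/", "xyz", inner));
    }
}
```

Once the value is escaped, the server decoding `param2` gets the inner URL back intact, including its own query string.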


3 Comments

@mgiuca - thanks for the detailed answer. I don't control the input and am trying to duplicate the behavior of a browser as much as possible. I fixed the sample URL in Q1. The issue with the approach you recommend is that when I encode it, it goes down a redirect path 10 levels deep that is incorrect, and when I track it via Fiddler or Chrome, I see that I've encoded a character or a parameter that the browser hasn't. For Q2, I guess my question should have been what the best approach for encoding query params is, since URLEncoder.encode on the query param works fine except when there is a URL in the last param.
What do you mean "except when there is a URL in the last param"? URLEncoder.encode("http://www.google.com/?abc=1") gives "http%3A%2F%2Fwww.google.com%2F%3Fabc%3D1", which is correct. You shouldn't be putting a URL in as a query parameter without first encoding it, or weird behaviour will happen in corner cases.
I had a bug in my overall scheme. This answer helped me step back and analyze it again.
4

I really don't know, but you can try decoding it first, so that the %3F turns back into what it was, and then encoding it again.

So:

String decoded = URLDecoder.decode(url, "UTF-8");
url = URLEncoder.encode(decoded, "UTF-8");
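Note that `URLEncoder.encode` also escapes ':', '/' and '?', so running this pair of calls over a complete URL produces something that is no longer a usable URL. A hedged sketch of the same decode-then-encode idea applied to a single path segment or query value instead (`normalize` is a hypothetical helper name, and it assumes a literal '%' never appears unescaped in the input):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

class ReencodeValue {

    // Best effort: decode whatever is already escaped, then encode everything once.
    // Assumes the input is one component, not a full URL.
    static String normalize(String component) {
        try {
            String decoded = URLDecoder.decode(component, "UTF-8");
            return URLEncoder.encode(decoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("%3Fref%3Dsomething")); // already escaped, unchanged
        System.out.println(normalize("{keyword}"));          // braces get escaped
        // Applied to a whole URL, the scheme and path separators are escaped too:
        System.out.println(normalize("http://www.example.com/a?b=c"));
    }
}
```

`URLDecoder.decode` throws `IllegalArgumentException` on a '%' that is not followed by two hex digits, and it turns '+' into a space, so this remains a workaround rather than a general fix.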

1 Comment

I had a problem where I was supposed to work with an encoded String in which %3F had been mistakenly encoded to %253F (i.e. '?' had been encoded to %3F and then encoded again to %253F). "Undoing" the encoding by first decoding a couple of times provided a nice workaround. So the answer here helped.
1

The correct way to encode an unencoded URL string is via URI.toASCIIString().

Of course it is up to you to decide whether the URL is already encoded or not.
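As a rough sketch of that approach, assuming the URL parts are available separately and are all unencoded, the multi-argument `java.net.URI` constructor quotes illegal characters and `toASCIIString()` returns the result as US-ASCII. The class and method names here are made up for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

class AsciiUri {

    // Builds an encoded URL from unencoded parts.
    // The multi-argument constructor always quotes '%', so the parts
    // must not be percent-encoded already.
    static String encode(String scheme, String authority, String path, String query) {
        try {
            return new URI(scheme, authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(
                encode("http", "www.example.com", "/chapter 1/{keyword}", "ref=x y"));
        // An already-escaped part is escaped again, as a strict encoder should:
        System.out.println(encode("http", "www.example.com", "/%3Fref", null));
    }
}
```

Characters that are legal in a query, such as '&' and '=', are left alone by this constructor, so individual parameter values that may contain them still need to be encoded separately.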


-2

Have you tried using the URLEncoder?

    URLEncoder.encode(URLString, "UTF-8")

Besides that, your only option is to encode each URL that is being used as a parameter separately, and then build the full URL manually. This is a pretty tricky case.

10 Comments

URLEncoder isn't any use for encoding URLs, curiously enough. It is for encoding URL arguments.
@EJP There's no such thing as "encoding URLs", only encoding URL arguments. As I said in my answer, once you have a URL, you can't encode it -- it's already either encoded, or you've missed your chance. You need to encode parts of the URL before constructing it. URLEncoder is good for encoding the only thing it is useful to encode.
@mgiuca You are again mistaken. There most certainly is such a thing as encoding URLs. That's what %20 is for, for example: encoding a space. See RFC 2396, and the Javadoc for java.net.URI.
@EJP "again"? I assure you I am quite familiar with RFC 3986 (which obsoletes 2396) (I wrote urllib.parse.quote/unquote in Python 3). I'm not disputing that %20 is used to encode octets in URLs. I said there is no such thing as encoding URLs, only URL arguments. RFC never mentions encoding URLs, only encoding octets. It says "the conflicting data must be percent-encoded before the URI is formed" (emphasis mine). java.net.URI(String) expects an already-encoded URI -- only the multi-argument constructor performs encoding.
@mgiuca so what is your name for the process of adding %-encoded hex strings to URLs in place of the out of band characters?