
I am using Webharvest to download a file from a website and save it under its original name.

The Java code that I am working with is:

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

HttpClient client = new HttpClient();

BufferedReader br = null;
StringBuffer result = new StringBuffer();
String attachName;

// attachmentLink comes from the surrounding script (not shown)
GetMethod method = new GetMethod(attachmentLink.toString());

int returnCode = client.executeMethod(method);
// getResponseHeaders (plural) returns Header[]; getResponseHeader returns a single Header
Header[] headers = method.getResponseHeaders("Content-Disposition");
attachName = headers[0].getValue();
// Round trip through the platform default charset
attachName = new String(attachName.getBytes());

The result in webharvest is:

attachment; filename="Resoluci�n sobre Mesas de Contrataci�n.pdf"

I can't get the letter ó to come through correctly.

After I got the value of the header Content-Disposition into variable attachName, I also tried to decode it, but with no luck:

String attachNamef = URLEncoder.encode(attachName, "ISO-8859-1");
attachNamef = URLDecoder.decode(attachNamef, "UTF-8");

I was able to determine that the response charset is ISO-8859-1, using:

method.getResponseCharSet()

P.S. When I inspect the headers in Firefox with Firebug, the Content-Disposition value is fine:

attachment; filename="Resolución sobre Mesas de Contratación.pdf"

  • Note that the response charset refers to the payload, not the header fields. Also note that you seem to be using a very obsolete version of the HTTP components. Finally, the server response is invalid; non-ASCII characters are not allowed here; see RFC 6266. Commented Jan 16, 2017 at 18:31
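
To illustrate the RFC 6266 point: a conforming server would put the non-ASCII name in an extended filename* parameter (percent-encoded, with the charset stated) next to an ASCII filename fallback. Below is a minimal sketch of reading such a header in plain Java. The class name ContentDispositionFilename and the method extract are made up for illustration, and the regular expressions only cover the common cases. It does not help with the server in the question, which sends raw non-ASCII bytes instead.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient or Webharvest.
class ContentDispositionFilename {
    // Extended form: filename*=charset'language'percent-encoded-value
    private static final Pattern EXTENDED = Pattern.compile(
            "filename\\*\\s*=\\s*([^'\\s]+)'[^']*'([^;\\s]+)", Pattern.CASE_INSENSITIVE);
    // Plain form: filename="quoted ASCII value"
    private static final Pattern PLAIN = Pattern.compile(
            "filename\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the file name, preferring filename* over filename; null if neither is present.
    static String extract(String headerValue) {
        Matcher ext = EXTENDED.matcher(headerValue);
        if (ext.find()) {
            String charset = ext.group(1);
            // URLDecoder treats '+' as a space, but in this header '+' is a literal character.
            String encoded = ext.group(2).replace("+", "%2B");
            try {
                return URLDecoder.decode(encoded, charset);
            } catch (UnsupportedEncodingException e) {
                throw new IllegalArgumentException("Unsupported charset: " + charset, e);
            }
        }
        Matcher plain = PLAIN.matcher(headerValue);
        return plain.find() ? plain.group(1) : null;
    }

    public static void main(String[] args) {
        String header = "attachment; filename=\"Resolucion.pdf\"; "
                + "filename*=UTF-8''Resoluci%C3%B3n%20sobre%20Mesas.pdf";
        System.out.println(extract(header));
    }
}
```

Preferring filename* when both parameters are present follows the recommendation in RFC 6266; the plain filename is only used as a fallback.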

1 Answer


Apache HttpClient doesn't support non-ASCII characters in HTTP headers. From the documentation:

The headers of a HTTP request or response must be in US-ASCII format. It is not possible to use non US-ASCII characters in the header of a request or response. Generally this is not an issue however, because the HTTP headers are designed to facilitate the transfer of data rather than to actually transfer the data itself. One exception however are cookies. Since cookies are transferred as HTTP Headers they are confined to the US-ASCII character set. See the Cookie Guide for more information.
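
Here is a small plain-Java sketch of what this restriction probably means for the header in the question. It assumes the client decodes header bytes as US-ASCII, which is an inference from the quoted documentation and not verified against the HttpClient source. The class HeaderCharsetDemo and its method names are hypothetical. A byte outside ASCII becomes the replacement character U+FFFD during decoding, so re-encoding the resulting String afterwards cannot bring the ó back, while decoding the same bytes as ISO-8859-1 keeps it.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it only uses the JDK, not HttpClient.
class HeaderCharsetDemo {
    // Decodes raw header bytes as US-ASCII; bytes above 0x7F are replaced with U+FFFD.
    static String decodeAsAscii(byte[] raw) {
        return new String(raw, StandardCharsets.US_ASCII);
    }

    // Decodes the same bytes as ISO-8859-1, which maps every byte to a character.
    static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Bytes as a server using ISO-8859-1 would put them on the wire (\u00F3 is ó).
        byte[] raw = "Resoluci\u00F3n.pdf".getBytes(StandardCharsets.ISO_8859_1);

        String lossy = decodeAsAscii(raw);
        String intact = decodeAsLatin1(raw);

        System.out.println("US-ASCII decode contains U+FFFD: " + (lossy.indexOf('\uFFFD') >= 0));
        System.out.println("ISO-8859-1 decode keeps the name: " + intact.equals("Resoluci\u00F3n.pdf"));

        // Re-encoding the lossy String only yields '?', the original byte is gone.
        byte[] reencoded = lossy.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Re-encoded byte at index 8 is '?': " + (reencoded[8] == (byte) '?'));
    }
}
```

If that assumption holds, any fix has to happen before or during header decoding, not on the String returned by getValue().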
