0

I am writing a java project using the HTMLCleaner library and save the output as a XML file this is the code that I wrote :

URL urlSB = new URL("http://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "google.com");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());

// serialize to xml file
new PrettyXmlSerializer(props).writeToFile(
        tagNodeRoot , "cleaned.xml", "utf-8"
);

The problem is that after running the project, cleaned.xml file is empty.

1 Answer 1

1

The problem is that the page you are trying to access is configured to redirect to HTTPS. This does, for whatever reason, not work, and so the input stream is empty. If you change the URL to HTTPS, it's working fine:

URL urlSB = new URL("https://www.groupon.com/browse/chicago?z=skip");
URLConnection urlConnection = urlSB.openConnection();
urlConnection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:5.0) Gecko/20100101 Firefox/25.0");
urlConnection.connect();
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setNamespacesAware(false);
TagNode tagNodeRoot = cleaner.clean(urlConnection.getInputStream());
new PrettyXmlSerializer(props).writeToFile(tagNodeRoot, "cleaned.xml", "utf-8");
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.