Android: Extracting html source

Question

I am trying to extract the HTML source of a website. I have researched a bit, and many solutions point to using HttpClient and HttpContext, but the problem is that I cannot simply fetch the source from a URL. The website is login-based, and no matter who you are logged in as, it displays the same URL (but, of course, the information to be extracted differs per user). Therefore, I was wondering if there was a way to get the source directly from, perhaps, a WebView or something of the sort. In summary, I cannot go through the URL because it is the same for everyone and basically redirects to a generic log-in page.

Sorry if I am missing something; I am new to this. Thank you for the help in advance.

EDIT:

I have found a URL that differs per user, but there is another, related problem. Using jsoup, I can do Jsoup.connect("http://www.stackoverflow.com/").get().html(); (with the URL replaced by the one I'm trying to access), and this does in fact get the HTML source. The problem is that a user/password-protected website asks for log-in information instead of returning the page. I need to be able to enter the username and password once, store the result in some temporary place (cookies/cache?), and have jsoup reuse it so that the site stops asking for credentials each time I request the source of a URL. I still cannot find a way around this...

Steve2955 · Accepted Answer · 2016-12-31 18:21:15Z

1

Well, if I understood correctly (let me know if I did not):

If the site is user/password protected, shouldn't you issue an HTTP POST (that is what a browser does, for example) and read the response to that POST? Something like this:

http://www.informit.com/guides/content.aspx?g=java&seqNum=44

EDIT: Here is a sample

I have a page that looks like this (it is oversimplified, but nevertheless here it is):

<form action="../../j_spring_security_check" method="post" >
        <input id="j_username" name="j_username" type="text" />
            <input id="j_password" name="j_password" type="password"/>
                    <input type="image" class="submit" id="login" name="login" />
</form>

If this were a web page in a browser, you would have to provide the username/password to get the actual content "after" this login page. What the browser really issues here is an HTTP POST (I bet it's the same in your case).
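
The form's action is a relative path, so the address the POST goes to depends on the page that served the form. Here is a minimal sketch of resolving it with java.net.URI. The page URL in the usage note is a made-up example (only the action value comes from the form above), and LoginFormTarget is just an illustrative name:

```java
import java.net.URI;

// Hypothetical helper, not part of any library.
final class LoginFormTarget {

    // Resolves a form's (possibly relative) action attribute against
    // the URL of the page that contains the form.
    static String resolveAction(String pageUrl, String action) {
        return URI.create(pageUrl).resolve(action).toString();
    }
}
```

For a form served from http://localhost:20000/moika/moika/pages/login/index.html (assumed), resolveAction with the action "../../j_spring_security_check" returns http://localhost:20000/moika/moika/j_spring_security_check.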

Now to get the same functionality in a programmatic way...

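You will need the Apache HTTP client library (you could probably do without it, but this is the easy way). Here is the Maven dependency for it. You are doing this for Android, right? From what I've read, an Apache HTTP client is the default on Android, although the classes bundled there live in the newer org.apache.http packages rather than the org.apache.commons.httpclient package (3.1) used below, so the 3.1 jar has to be added to the project for this sample to compile.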

<dependency>
    <groupId>commons-httpclient</groupId>
    <artifactId>commons-httpclient</artifactId>
    <version>3.1</version>
</dependency>

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

public class HttpPost {
    public static void main(String[] args) {

        // Reuse one HttpClient: it keeps the session cookies between requests.
        HttpClient httpClient = new HttpClient();

        // Same target and field names as the login form above.
        PostMethod postMethod = new PostMethod("http://localhost:20000/moika/moika/j_spring_security_check");
        postMethod.addParameter("j_username", "ACTUAL_USER");
        postMethod.addParameter("j_password", "ACTUAL_PASSWORD");

        try {
            int status = httpClient.executeMethod(postMethod);
            System.out.println("STATUS-->" + status);

            if (status == 302) {
                // The redirect after a POST is not followed automatically,
                // so request the Location target by hand.
                Header header = postMethod.getResponseHeader("location");
                String location = header.getValue();
                System.out.println("HEADER_VALUE-->" + location);
                GetMethod getMethod = new GetMethod(location);
                try {
                    httpClient.executeMethod(getMethod);
                    String content = getMethod.getResponseBodyAsString();
                    System.out.println("CONTENT-->" + content);
                } finally {
                    getMethod.releaseConnection();
                }
            } else {
                // No redirect: the POST response itself is the page.
                String contentInCaseOfNoRedirect = postMethod.getResponseBodyAsString();
                System.out.println("CONTENT-->" + contentInCaseOfNoRedirect);
            }

        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            postMethod.releaseConnection();
        }
    }
}

This might look a bit weird, but my server answers the login with a redirect (302), and HttpClient does not follow a redirect after a POST on its own (this seems to go back to the RFC), thus the small workaround.

If you do not perform any re-directs on the server side, then you could ignore the part where I check for 302.
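
For comparison, roughly the same flow can be sketched with only the classes in java.net. This also covers the "enter the credentials once" part of the question: a default CookieManager stores the session cookie from the login response and sends it along with later requests. This is a sketch under assumptions, not a drop-in solution. The names SessionFetcher, encodeForm, readAll, get and loginAndFetch are made up for illustration, the server is assumed to use a cookie-based session with form fields like the ones above, pages are assumed to be UTF-8, and error handling is minimal.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical helper class; all names here are illustrative only.
final class SessionFetcher {

    // Builds an application/x-www-form-urlencoded body from the form fields.
    static String encodeForm(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&');
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
        return body.toString();
    }

    // Reads a whole stream into a String, assuming the page is UTF-8.
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString("UTF-8");
    }

    // Plain GET; the default CookieHandler attaches any stored session cookie.
    static String get(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return readAll(in);
        } finally {
            conn.disconnect();
        }
    }

    // POSTs the login form once, then returns the page behind the login.
    static String loginAndFetch(String loginUrl, Map<String, String> fields) throws IOException {
        if (CookieHandler.getDefault() == null) {
            CookieHandler.setDefault(new CookieManager()); // in-memory cookie store
        }
        HttpURLConnection post = (HttpURLConnection) URI.create(loginUrl).toURL().openConnection();
        try {
            post.setInstanceFollowRedirects(false); // handle the 302 by hand, like the sample above
            post.setRequestMethod("POST");
            post.setDoOutput(true);
            post.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = post.getOutputStream()) {
                out.write(encodeForm(fields).getBytes(StandardCharsets.UTF_8));
            }
            if (post.getResponseCode() == HttpURLConnection.HTTP_MOVED_TEMP) {
                String location = post.getHeaderField("Location");
                if (location == null) {
                    throw new IOException("302 without a Location header");
                }
                // Location may be relative, so resolve it against the login URL.
                return get(URI.create(loginUrl).resolve(location).toString());
            }
            try (InputStream in = post.getInputStream()) {
                return readAll(in); // no redirect: the POST response is the page
            }
        } finally {
            post.disconnect();
        }
    }
}
```

After loginAndFetch has run once with the j_username and j_password fields, later SessionFetcher.get(otherUrl) calls in the same process should reuse the stored cookie. On Android, these calls have to run off the main thread.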

See what works for you.

Cheers, Eugene.

edited Dec 31, 2016 at 18:21

Steve2955

answered Dec 28, 2011 at 7:58

Eugene

5 Comments

Kgrover Over a year ago

Your method seems on the right track, but confuses me. Is it possible for you to provide some sample code on how to get the html source of a website, given the situation?

Kgrover Over a year ago

I will experiment with it and let you know. Thanks for the response!

Kgrover Over a year ago

This method is very confusing for me; sorry, I have no experience in this topic. I have edited my question; can you take a look?

Kgrover Over a year ago

Many of these classes are not showing up in my Java IDE (e.g. PostMethod). Could you suggest a reason?

Eugene Over a year ago

Well, if the classes are not showing up, it's because Eclipse can't see them on the classpath. Are you using Maven to build your project or not? If not, you should consult the Eclipse docs on how to add libraries to your classpath (Right Click on the Project --> Build Path....). jsoup is something I have not used, so I can't really suggest anything there.

Sunil Kumar Sahoo · Accepted Answer · 2011-12-28 08:01:28Z

0

See http://docs.oracle.com/javase/tutorial/networking/urls/readingWriting.html

or check the sample code

How to read the content of a URL

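import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

// ReadUrl is just a wrapper class name so the snippet runs as-is.
public class ReadUrl {
    public static void main(String[] args) {
        try {
            URL oracle = new URL("http://www.w3schools.com/html/html_tables.asp");
            URLConnection yc = oracle.openConnection();
            // Open the response stream once and read it line by line.
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(yc.getInputStream()));
            String inputLine;
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
            }
            in.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}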

answered Dec 28, 2011 at 8:01

Sunil Kumar Sahoo

2 Comments

Eugene Over a year ago

I might be slow because of the morning mood, but how does your answer solve this part: "it displays the same URL (but, of course, the information to be extracted is different based on the user)"? What you have presented is just plain reading of the contents of a URL; IMHO you didn't answer the question at all.

Kgrover Over a year ago

Yes, Eugene, I agree completely. I know how to read contents of a plain URL, but the situation here is different.
