1

I am trying to extract the HTML source of a website. I have researched a bit, and many solutions point to using HTTPClient and HTTPContext, but the problem is that I do not have a usable URL to get this source from. The website is login-based, and no matter who you are logged in as, it displays the same URL (but, of course, the information to be extracted differs per user). Therefore, I was wondering if there was a way to get the source directly from, perhaps, a webview or something of the sort. In summary, I cannot fetch by URL because the URL is the same for everyone and basically redirects to a generic log-in page.

Sorry if I am missing something; I am new to this. Thank you for the help in advance.

EDIT:

I have found a URL that is different per user, but there is a(nother) related problem. Using jsoup, I can do Jsoup.connect("http://www.stackoverflow.com/").get().html(); (with the URL replaced with what I'm trying to access) and this does in fact get the HTML source, but when I try to access a username/password-protected website, I get the log-in page instead. I need to be able to enter the username and password once, store the resulting session somewhere temporary (cookies/cache?), and have jsoup reuse it so that the site stops asking for login credentials each time I request the source of a URL. I still cannot find a way to get around this...

2 Answers

1

Well, if I understood correctly (let me know if I did not):

If the site is username/password protected, shouldn't you issue an HTTP POST (that is what a browser does, for example) and read the response to that POST? Something like this:

http://www.informit.com/guides/content.aspx?g=java&seqNum=44

EDIT: Here is a sample

I have a page that looks like this (it is oversimplified, but nevertheless here it is):

<form action="../../j_spring_security_check" method="post" >
        <input id="j_username" name="j_username" type="text" />
            <input id="j_password" name="j_password" type="password"/>
                    <input type="image" class="submit" id="login" name="login" />
</form>

If this were a web page in a browser, you would have to provide the username/password to get the actual content "after" this login page. What you really issue here is an HTTP POST (I bet it's the same in your case).
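To make the POST concrete: a browser submits the form above as a URL-encoded body made of the inputs' name attributes and the values you typed. Here is a minimal sketch of building such a body with only the JDK. The class name FormBody and the sample password are made up for illustration; the field names are the ones from the form above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of any library.
class FormBody {
    // Builds an application/x-www-form-urlencoded body from name/value pairs.
    static String encode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        try {
            for (Map.Entry<String, String> field : fields.entrySet()) {
                if (body.length() > 0) {
                    body.append('&'); // pairs are separated by '&'
                }
                body.append(URLEncoder.encode(field.getKey(), "UTF-8"))
                    .append('=')
                    .append(URLEncoder.encode(field.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("j_username", "ACTUAL_USER");
        fields.put("j_password", "p@ss word"); // sample value with characters that need escaping
        // prints j_username=ACTUAL_USER&j_password=p%40ss+word
        System.out.println(encode(fields));
    }
}
```

A LinkedHashMap is used so the fields come out in the order they were added, which makes the result easy to compare with what the browser's network inspector shows.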

Now to get the same functionality in a programmatic way...

You will need the Apache HTTP client library (you could probably do without it, but this is the easy way). Here is the Maven dependency for it. You are doing this for Android, right? From what I've read, Apache HTTP client is the default in Android.

<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
</dependency>

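import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;

public class HttpPost {
    public static void main(String[] args) {

        // Reuse one HttpClient: it keeps the session cookies between requests.
        HttpClient httpClient = new HttpClient();
        PostMethod postMethod = new PostMethod("http://localhost:20000/moika/moika/j_spring_security_check");
        // Parameter names must match the "name" attributes of the form inputs.
        postMethod.addParameter("j_username", "ACTUAL_USER");
        postMethod.addParameter("j_password", "ACTUAL_PASSWORD");

        try {
            int status = httpClient.executeMethod(postMethod);
            System.out.println("STATUS-->" + status);

            if (status == HttpStatus.SC_MOVED_TEMPORARILY) { // 302
                // Redirects after a POST are not followed automatically,
                // so request the redirect target by hand.
                Header header = postMethod.getResponseHeader("location");
                String location = header.getValue();
                System.out.println("HEADER_VALUE-->" + location);
                GetMethod getMethod = new GetMethod(location);
                try {
                    httpClient.executeMethod(getMethod);
                    String content = getMethod.getResponseBodyAsString();
                    System.out.println("CONTENT-->" + content);
                } finally {
                    getMethod.releaseConnection();
                }
            } else {
                // No redirect: the response to the POST is the page itself.
                String contentInCaseOfNoRedirect = postMethod.getResponseBodyAsString();
                System.out.println("CONTENT-->" + contentInCaseOfNoRedirect);
            }

        } catch (Exception exception) {
            exception.printStackTrace();
        } finally {
            postMethod.releaseConnection();
        }
    }
}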

This might look a bit weird, but my server answers the login with a redirect (302), and HttpClient does not follow redirects for POST requests automatically (the HTTP RFC restricts that), hence the small workaround.

If you do not perform any redirects on the server side, you can ignore the part where I check for 302.
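If pulling in the Apache library is a problem, roughly the same login-then-fetch flow can be sketched with JDK classes only: java.net.CookieManager stores the session cookie set by the login response, and later HttpURLConnection requests send it back. Treat this as a sketch under assumptions rather than a drop-in solution: the class name SessionFetch and the URLs passed on the command line are hypothetical, the form body must already be URL-encoded, and it assumes the site tracks the login with an ordinary cookie.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class SessionFetch {
    // Install a process-wide cookie store once. After this, HttpURLConnection
    // saves Set-Cookie headers and sends the cookies back on later requests.
    static void enableCookies() {
        CookieHandler.setDefault(new CookieManager(null, CookiePolicy.ACCEPT_ALL));
    }

    // POSTs an already URL-encoded form body and returns the HTTP status code.
    static int login(String loginUrl, String formBody) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(loginUrl).openConnection();
        try {
            conn.setInstanceFollowRedirects(false); // report a 302 instead of following it
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
            try (OutputStream out = conn.getOutputStream()) {
                out.write(formBody.getBytes(StandardCharsets.UTF_8));
            }
            return conn.getResponseCode(); // reading the response also stores the cookies
        } finally {
            conn.disconnect();
        }
    }

    // GETs a page (with any stored cookies attached) and returns its body as text.
    static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        try (InputStream in = conn.getInputStream()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
            return buffer.toString("UTF-8");
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 3) {
            System.err.println("usage: SessionFetch <loginUrl> <encodedFormBody> <pageUrl>");
            return;
        }
        enableCookies();
        System.out.println("STATUS-->" + login(args[0], args[1]));
        System.out.println("CONTENT-->" + fetch(args[2]));
    }
}
```

You log in once, and every later fetch in the same process reuses the stored cookie, which is the "enter the password once" behaviour asked about in the question.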

See what works for you.

Cheers, Eugene.


5 Comments

Your method seems on the right track, but confuses me. Is it possible for you to provide some sample code on how to get the html source of a website, given the situation?
I will experiment with it and let you know. Thanks for the response!
This method is very confusing for me; sorry, I have no experience in this topic. I have edited my question; can you take a look?
Many of these classes are not showing up in my Java IDE (e.g. PostMethod)... Could you suggest a reason?
Well, if the classes are not showing up, it's because Eclipse can't see them. Are you using Maven to build your project or not? If not, you should consult the Eclipse docs on how to add the library to your classpath (Right Click on the Project --> Build Path....). jsoup is something I have not used, so I can't really suggest anything there.
0

See http://docs.oracle.com/javase/tutorial/networking/urls/readingWriting.html

or check the sample code:

How to read the content of a URL

// Needs java.io.BufferedReader, java.io.InputStreamReader,
// java.net.URL and java.net.URLConnection.
try {
        URL oracle = new URL("http://www.w3schools.com/html/html_tables.asp");
        URLConnection yc = oracle.openConnection();
        // Open the response stream once and read it line by line.
        BufferedReader in = new BufferedReader(
                new InputStreamReader(
                yc.getInputStream()));
        String inputLine;
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
        }
        in.close();

} catch (Exception ex) {
        ex.printStackTrace();
}

2 Comments

I might be slow because of the morning mood, but how did you solve this part in the answer: "it displays the same URL (but, of course, the information to be extracted is different based on the user)"? What you have presented is just plain reading of the contents of a URL. IMHO you didn't answer the question at all.
Yes, Eugene, I agree completely. I know how to read contents of a plain URL, but the situation here is different.
