
There is a basic HTML page which I would like to screen scrape. I have no idea where to start with this, so any help would be much appreciated. To access the page, one piece of input is required, such as an ID. What I would like to do is:

 1. Go to the web page.
 2. Input the ID.
 3. Screen scrape the data that is displayed (I have checked the source; it is all simple HTML).
 4. The rest (organising, string manipulation, etc.) I can do myself.

If anyone can give me some info or a starting point, I would be grateful :)

  • First step: Acquire HTML Parser Commented Feb 15, 2014 at 19:39
  • That does not make anything clearer to me. Commented Feb 15, 2014 at 19:39

2 Answers


Here is some information on where to start from:

Step #1 - Download and use the following JAR files in your project:

  • selenium-java-2.xx.0.jar
  • selenium-server-standalone-2.xx.0.jar

    At present, xx is 39.

Step #2 - Emulate a client browser in order to access the web-page, using the following example class:

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.firefox.FirefoxDriver;

public class MyClass
{
    private WebDriver webDriver = null;

    public void open() throws Exception
    {
        // Start a Firefox instance controlled by Selenium
        webDriver = new FirefoxDriver();
    }

    public void close() throws Exception
    {
        // Shut the browser down when you are finished
        webDriver.quit();
    }

    public void doStuff(String url) throws Exception
    {
        // Navigate to the page
        webDriver.get(url);
        // Use 'webDriver' in order to access the web-page, for example:
        WebElement inputBox = webDriver.findElement(By.id("someInputBox"));
        WebElement inputBtn = webDriver.findElement(By.id("someInputBtn"));
        inputBox.sendKeys("myUserId");
        inputBtn.click();
        String pageSource = webDriver.getPageSource();
        // ...
    }
}
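A typical call sequence for the class above might look like this (the URL is just a placeholder, and the element IDs in doStuff would need to match your page):

MyClass scraper = new MyClass();
scraper.open();                              // starts the Firefox browser
scraper.doStuff("http://example.com/page");  // fills in the ID and grabs the page source
scraper.close();                             // shuts the browser down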

1 Comment

Keep in mind that Selenium is a lot of overhead to add to a program if all you really want to do is scrape the data.
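For illustration, if the page really is plain HTML, a minimal sketch without Selenium, using only Jsoup, might look like this (the URL and the "id" form-parameter name are assumptions; adjust them to the real form):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

// Fetch and parse the page in one step, submitting the required ID as a form parameter
Document doc = Jsoup.connect("http://example.com/page")
        .data("id", "myUserId")
        .post();                 // or .get() if the ID is passed in the URL instead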

There are multiple things you're going to need to put together here to get this done. First, you're going to need to fetch the HTML. The way I normally do this is with Apache's HttpClient. A quick start guide is here: HttpClient; it does a better job of describing how to use HttpClient than I ever could. Their documentation is pretty good.

That will allow you to get the data back, something like this:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpPost;
import org.apache.http.impl.client.DefaultHttpClient;

HttpClient client = new DefaultHttpClient();
HttpPost post = new HttpPost(URL);
//
// here you can do things like add parameters used when connecting to the remote site
//
HttpResponse response = client.execute(post);
BufferedReader rd = new BufferedReader(new InputStreamReader(response.getEntity().getContent()));

From there you can read the response line by line into a String (for example via a StringBuilder) and do just about anything with it.
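A minimal sketch for collecting the response into a String, which is what the Jsoup examples below expect as HTML (add your own error handling and resource cleanup):

// Read the response body from 'rd' into a single String
StringBuilder sb = new StringBuilder();
String line;
while ((line = rd.readLine()) != null) {
    sb.append(line).append("\n");
}
String HTML = sb.toString();   // used as the 'HTML' variable in the Jsoup examples below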

In order to actually parse and "scrape" the data, I recommend using Jsoup. It will allow you to do a lot of things with the HTML, treating it very much like a DOM.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document document = Jsoup.parse(HTML);
// OR
Document doc = Jsoup.parseBodyFragment(HTML);
Elements elements = doc.select("#SOME_ID");
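From the selected Elements you can then pull out text or attribute values, for example (this assumes an additional import of org.jsoup.nodes.Element):

for (Element element : elements) {
    System.out.println(element.text());        // the element's visible text
    System.out.println(element.attr("href"));  // an attribute value, if present
}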

