
I need to get the content of some web pages, such as "http://www.ncbi.nlm.nih.gov/nuccore/NM_007002", for my project. The problem is that I have to open the page in a browser and save it to get the full content (if I use the URL and BufferedReader classes, I get the "frame" of the page but not the text I need). My professor told me to use Selenium to open and download the pages I need and then read and parse the relevant information.

Unfortunately, I can't find an example of Java code that opens and saves a web page. Can anyone explain to me how to do this?

I want to SAVE the page to my computer, not copy the source and save it to a file. Not all of the information appears in the source! It's hidden.


2 Answers

3

In Selenium you can do this:

import org.openqa.selenium.safari.SafariDriver;

SafariDriver driver = new SafariDriver(); // any driver works, e.g. ChromeDriver or FirefoxDriver
driver.get("your link");
String pageSource = driver.getPageSource(); // now you have the page source
// You can save pageSource to a file or do whatever you want with it.

See the getPageSource documentation for details.

If you want the data from a specific element, for example the body, you can do this:

String bodyText = driver.findElement(By.tagName("body")).getText(); // visible text of the body element
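
To write the retrieved source to disk, something like the following may work. It is a minimal sketch using only the standard library; `PageSaver`, its methods, and the file name `saved_page.html` are hypothetical names chosen for illustration, and the string in `main` is a stand-in for whatever `driver.getPageSource()` returns.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not part of Selenium): stores page markup on disk.
final class PageSaver {

    // Writes the markup to the target file as UTF-8, replacing any existing file.
    static void save(String html, Path target) {
        try {
            Files.write(target, html.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads a previously saved page back into a String.
    static String load(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by driver.getPageSource().
        String pageSource = "<html><body>exon 1</body></html>";
        Path target = Paths.get("saved_page.html"); // assumed output location
        save(pageSource, target);
        System.out.println("Saved " + load(target).length() + " characters to " + target.toAbsolutePath());
    }
}
```

In a real run you would pass the value of `driver.getPageSource()` to `PageSaver.save` once the browser has finished loading the page.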

5 Comments

This is not what I need. I need to save the page to my computer; only then is the information I need available.
@yalush: If you want to save the page to your computer, why can't you do that with File?
Because File saves the text of the page, and I need the page itself, just as when I use "Save as...". I need it because some of the information on the page is hidden and appears in the source only when I save the page to my computer.
Do you also want the images and other resources that are present in the web page?
I only need the text in the middle of the page (the one with the information about the gene and exons)
1

Keep in mind that Selenium is meant for browser automation, that is, for interacting with pages automatically. If the source is really all you need, you can use jsoup, a solid Java HTML parser. In two lines of code, you should have your source:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

try {
    Document doc = Jsoup.connect("http://www.ncbi.nlm.nih.gov/nuccore/NM_007002").userAgent("Mozilla/5.0").timeout(30000).get();
    System.out.println(doc.toString());
} catch (IOException e) {
    e.printStackTrace();
}

1 Comment

You can open the page source and see the problem for yourself. The word "exon" appears many times on the page, but only once in the source. If I read only the source, I can't get all the information I need.
