
I'm trying to write a Java application that can scrape information off websites. I've done some googling and managed to build a very simple scraper, but it's not enough. My scraper fails to pick up some of the information on this website, especially the part I actually want to scrape.


        // Attempt 1: collect the raw href attribute of every anchor.
        Elements links = htmlDocument.select("a");
        for (Element link : links) {
            this.links.add(link.attr("href"));
        }

        // Attempt 2: select only anchors that actually have an href,
        // and resolve each one to an absolute URL.
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for (Element link : linksOnPage) {
            this.links.add(link.absUrl("href"));
        }

I've tried both snippets, but I can't find that link anywhere in the Elements object. I believe the information I want is the result of a search, so by the time my program connects to that URL the information is gone. How can I solve this? I want the program to scrape the results of that search every time it starts. One way to check what's going on is sketched below.
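Since jsoup only parses the HTML the server sends and never executes JavaScript, a quick check is to fetch the page and search the raw markup for a string you can see in the browser. A minimal sketch, assuming the search keyword CAD from the URL below should appear somewhere in the results:

    // Fetch the page and inspect the raw server response.
    Document doc = Jsoup.connect(url).userAgent(USER_AGENT).get();
    // If this prints false, the results are not in the server's HTML at all:
    // either JavaScript adds them later, or the server refused the search.
    System.out.println(doc.html().contains("CAD"));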

Here is the link to the web site

So my questions are:

1. How do I scrape that link into my code's Elements object? What am I doing wrong?

2. Is there any way to pick out that link and follow only that link (not all hyperlinks)? (See the sketch right after this list.)
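For question 2: select() accepts CSS selectors, so instead of looping over every a[href] you can target just the link you want and follow it. The selector below is hypothetical, since the page's actual markup isn't shown here; adjust it to the real structure:

    // Hypothetical selector: match an anchor whose href contains a marker
    // from the target link. selectFirst() returns null if nothing matches.
    Element target = htmlDocument.selectFirst("a[href*=empInfoSrch]");
    if (target != null) {
        String nextUrl = target.absUrl("href");  // resolve to an absolute URL
        Document nextPage = Jsoup.connect(nextUrl)
                .userAgent(USER_AGENT)
                .get();                          // follow only that one link
        // ...scrape nextPage here
    }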

This is the URL I'm connecting to:

    final Document doc = Jsoup.connect("http://www.work.go.kr/empInfo/empInfoSrch/list/dtlEmpSrchList.do?pageIndex=2&pageUnit=10&len=0&tot=0&relYn=N&totalEmpCount=0&jobsCount=0&mainSubYn=N&region=41000&lastIndex=1&siteClcd=all&firstIndex=1&pageSize=10&recordCountPerPage=10&rowNo=0&softMatchingPossibleYn=N&benefitSrchAndOr=O&keyword=CAD&charSet=EUC-KR&startPos=0&collectionName=tb_workinfo&softMatchingMinRate=+66&softMatchingMaxRate=100&empTpGbcd=1&onlyTitleSrchYn=N&onlyContentSrchYn=N&serialversionuid=3990642507954558837&resultCnt=10&sortOrderBy=DESC&sortField=DATE").userAgent(USER_AGENT).get();
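The query string already carries the search parameters (keyword=CAD, charSet=EUC-KR, region, paging, and so on), so in principle a plain GET should return the results. Some sites, though, only serve results when the request carries browser-like context such as a Referer header or cookies from an earlier visit. Whether work.go.kr checks for this is an assumption; the sketch below shows how jsoup can supply that context (searchUrl stands for the long URL above):

    // Sketch, assuming the site wants a browser-like request context.
    // Compare the headers your browser sends (network tab) and mirror them.
    Connection.Response resp = Jsoup.connect(searchUrl)
            .userAgent(USER_AGENT)
            .referrer("http://www.work.go.kr/")  // assumed referrer value
            .timeout(10 * 1000)
            .execute();
    Document doc = resp.parse();                 // parse the body into a Document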


    try
    {
        // get() throws an IOException on HTTP error statuses, so reaching
        // the lines below normally means the request itself succeeded.
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        Document htmlDocument = connection.get();
        this.htmlDocument = htmlDocument;

        // Dump the raw HTML so we can check whether the search results
        // are actually present in the server's response.
        System.out.println(htmlDocument.html());

        if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code
        {
            System.out.println("\n**Visiting** Received web page at " + url);
        }
        if(!connection.response().contentType().contains("text/html"))
        {
            System.out.println("**Failure** Retrieved something other than HTML");
            return false;
        }

        // Collect every hyperlink on the page as an absolute URL.
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for(Element link : linksOnPage)
        {
            this.links.add(link.absUrl("href"));
            System.out.println(link.absUrl("href"));
        }
        return true;
    }
    catch(IOException ioe)
    {
        // We were not successful in our HTTP request.
        return false;
    }

This is the entire code I use for scraping; I took it from this site.
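For context, here is a minimal class the snippet above could live in. USER_AGENT, url, links, and htmlDocument are the names the snippet already uses; the class name and the user-agent value are assumptions:

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.List;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class Scraper {
        // Assumed value: any common browser user-agent string works here.
        private static final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

        private final List<String> links = new LinkedList<>();
        private Document htmlDocument;

        // Condensed version of the try/catch block from the question.
        public boolean crawl(String url) {
            try {
                Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
                this.htmlDocument = connection.get();
                if (!connection.response().contentType().contains("text/html")) {
                    return false;
                }
                for (Element link : htmlDocument.select("a[href]")) {
                    links.add(link.absUrl("href"));
                }
                return true;
            } catch (IOException ioe) {
                return false;
            }
        }
    }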

  • Maybe this link is generated with JavaScript? Try this: stackoverflow.com/documentation/jsoup/4632/… Commented Feb 7, 2017 at 10:12
  • I found out that when my code connects to the web page showing the search results, the page responds with empty results, while the browser (Chrome) shows the right results, even with the same URL. I printed the text of the Document, and where the search results should be, the page says "cannot find the page you requested". So, can anybody help with this? Commented Feb 7, 2017 at 14:44
  • Have you tried setting the User-Agent when connecting to the URL? stackoverflow.com/questions/10187603/useragent-in-jsoup Commented Feb 7, 2017 at 15:17
  • Post more of your code: how do you fetch the HTML? Commented Feb 7, 2017 at 17:16
  • I edited my question with my code. How do I check which user agent is right for my browser? And if the user agent were wrong, shouldn't I get no response at all? I'm confused. Or is it possible that the website I'm trying to scrape is blocking me for security reasons? Commented Feb 8, 2017 at 0:19

1 Answer


I found the issue, but I couldn't resolve it. I was trying to scrape information from a web page showing the results of a specific search. The issue is that the website somehow isn't letting me connect from my Java application using jsoup, probably to protect its content. That's why the elements I needed were missing: they were never in the response at all. The website offers a paid open API, so I decided to use other websites instead.
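For anyone who hits the same wall: before concluding that a site blocks jsoup, it can help to look at what the server actually returns. A diagnostic sketch (the referrer value is an assumption; copy the real one from your browser's network tab):

    // ignoreHttpErrors(true) stops jsoup from throwing on 4xx/5xx responses,
    // so the status code and body can be inspected directly.
    Connection.Response resp = Jsoup.connect(searchUrl)
            .userAgent(USER_AGENT)
            .referrer("http://www.work.go.kr/")  // assumed; mirror the browser
            .ignoreHttpErrors(true)
            .execute();
    System.out.println("Status: " + resp.statusCode());
    // Print the first part of the body to see any error or block message.
    String body = resp.body();
    System.out.println(body.substring(0, Math.min(500, body.length())));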
