
I'm trying to write a Java application that can scrape information off websites. I've done some googling and managed to build a very simple scraper, but it's not enough. My scraper fails to pick up some of the information on this website, especially the part I actually want to scrape.


        // Attempt 1: collect the raw href attribute of every anchor.
        Elements links = htmlDocument.select("a");
        for (Element link : links) {
            this.links.add(link.attr("href"));
        }

        // Attempt 2: select only anchors that actually have an href,
        // and resolve each one to an absolute URL.
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for (Element link : linksOnPage) {
            this.links.add(link.absUrl("href"));
        }

I've tried both snippets, but I can't find that link anywhere in the Elements object. I believe the information I want is the result of a search, so by the time my program connects to that URL the information is gone. How can I solve this? I want the program to scrape the results of that search every time it starts. One way to check what's going on is sketched below.
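Since jsoup only parses the HTML the server sends and never executes JavaScript, a quick check is to fetch the page and search the raw markup for a string you can see in the browser. A minimal sketch, assuming the search keyword CAD from the URL below should appear somewhere in the results:

    // Fetch the page and inspect the raw server response.
    Document doc = Jsoup.connect(url).userAgent(USER_AGENT).get();
    // If this prints false, the results are not in the server's HTML at all:
    // either JavaScript adds them later, or the server refused the search.
    System.out.println(doc.html().contains("CAD"));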

Here is the link to the web site

So my questions are:

1. How do I scrape that link into my code's Elements object? What am I doing wrong?

2. Is there any way to pick out that link and follow only that link (not all hyperlinks)? (See the sketch right after this list.)
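For question 2: select() accepts CSS selectors, so instead of looping over every a[href] you can target just the link you want and follow it. The selector below is hypothetical, since the page's actual markup isn't shown here; adjust it to the real structure:

    // Hypothetical selector: match an anchor whose href contains a marker
    // from the target link. selectFirst() returns null if nothing matches.
    Element target = htmlDocument.selectFirst("a[href*=empInfoSrch]");
    if (target != null) {
        String nextUrl = target.absUrl("href");  // resolve to an absolute URL
        Document nextPage = Jsoup.connect(nextUrl)
                .userAgent(USER_AGENT)
                .get();                          // follow only that one link
        // ...scrape nextPage here
    }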

This is the URL I'm connecting to:

    final Document doc = Jsoup.connect("http://www.work.go.kr/empInfo/empInfoSrch/list/dtlEmpSrchList.do?pageIndex=2&pageUnit=10&len=0&tot=0&relYn=N&totalEmpCount=0&jobsCount=0&mainSubYn=N&region=41000&lastIndex=1&siteClcd=all&firstIndex=1&pageSize=10&recordCountPerPage=10&rowNo=0&softMatchingPossibleYn=N&benefitSrchAndOr=O&keyword=CAD&charSet=EUC-KR&startPos=0&collectionName=tb_workinfo&softMatchingMinRate=+66&softMatchingMaxRate=100&empTpGbcd=1&onlyTitleSrchYn=N&onlyContentSrchYn=N&serialversionuid=3990642507954558837&resultCnt=10&sortOrderBy=DESC&sortField=DATE").userAgent(USER_AGENT).get();
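The query string already carries the search parameters (keyword=CAD, charSet=EUC-KR, region, paging, and so on), so in principle a plain GET should return the results. Some sites, though, only serve results when the request carries browser-like context such as a Referer header or cookies from an earlier visit. Whether work.go.kr checks for this is an assumption; the sketch below shows how jsoup can supply that context (searchUrl stands for the long URL above):

    // Sketch, assuming the site wants a browser-like request context.
    // Compare the headers your browser sends (network tab) and mirror them.
    Connection.Response resp = Jsoup.connect(searchUrl)
            .userAgent(USER_AGENT)
            .referrer("http://www.work.go.kr/")  // assumed referrer value
            .timeout(10 * 1000)
            .execute();
    Document doc = resp.parse();                 // parse the body into a Document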


    try
    {
        // get() throws an IOException on HTTP error statuses, so reaching
        // the lines below normally means the request itself succeeded.
        Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
        Document htmlDocument = connection.get();
        this.htmlDocument = htmlDocument;

        // Dump the raw HTML so we can check whether the search results
        // are actually present in the server's response.
        System.out.println(htmlDocument.html());

        if(connection.response().statusCode() == 200) // 200 is the HTTP OK status code
        {
            System.out.println("\n**Visiting** Received web page at " + url);
        }
        if(!connection.response().contentType().contains("text/html"))
        {
            System.out.println("**Failure** Retrieved something other than HTML");
            return false;
        }

        // Collect every hyperlink on the page as an absolute URL.
        Elements linksOnPage = htmlDocument.select("a[href]");
        System.out.println("Found (" + linksOnPage.size() + ") links");
        for(Element link : linksOnPage)
        {
            this.links.add(link.absUrl("href"));
            System.out.println(link.absUrl("href"));
        }
        return true;
    }
    catch(IOException ioe)
    {
        // We were not successful in our HTTP request.
        return false;
    }

This is the entire code I use for scraping; I took it from this site.
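For context, here is a minimal class the snippet above could live in. USER_AGENT, url, links, and htmlDocument are the names the snippet already uses; the class name and the user-agent value are assumptions:

    import java.io.IOException;
    import java.util.LinkedList;
    import java.util.List;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class Scraper {
        // Assumed value: any common browser user-agent string works here.
        private static final String USER_AGENT =
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";

        private final List<String> links = new LinkedList<>();
        private Document htmlDocument;

        // Condensed version of the try/catch block from the question.
        public boolean crawl(String url) {
            try {
                Connection connection = Jsoup.connect(url).userAgent(USER_AGENT);
                this.htmlDocument = connection.get();
                if (!connection.response().contentType().contains("text/html")) {
                    return false;
                }
                for (Element link : htmlDocument.select("a[href]")) {
                    links.add(link.absUrl("href"));
                }
                return true;
            } catch (IOException ioe) {
                return false;
            }
        }
    }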

  • Maybe this link is generated with JavaScript? Try this: stackoverflow.com/documentation/jsoup/4632/… Commented Feb 7, 2017 at 10:12
  • I found out that when my code connects to the web page showing the search results, the page responds with empty results, while the browser (Chrome) shows the right results, even with the same URL. I printed the text of the Document, and where the search results should be, the page says "cannot find the page you requested". So, can anybody help with this? Commented Feb 7, 2017 at 14:44
  • Have you tried setting the User-Agent when connecting to the URL? stackoverflow.com/questions/10187603/useragent-in-jsoup Commented Feb 7, 2017 at 15:17
  • Post more of your code: how do you fetch the HTML? Commented Feb 7, 2017 at 17:16
  • I edited my question with my code. How do I check which user agent is right for my browser? And if the user agent were wrong, shouldn't I get no response at all? I'm confused. Or is it possible that the website I'm trying to scrape is blocking me for security reasons? Commented Feb 8, 2017 at 0:19

1 Answer


I found the issue, but I couldn't resolve it. I was trying to scrape information from a web page showing the results of a specific search. The issue is that the website somehow isn't letting me connect from my Java application using jsoup, probably to protect its content. That's why the elements I needed were missing: they were never in the response at all. The website offers a paid open API, so I decided to use other websites instead.
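For anyone who hits the same wall: before concluding that a site blocks jsoup, it can help to look at what the server actually returns. A diagnostic sketch (the referrer value is an assumption; copy the real one from your browser's network tab):

    // ignoreHttpErrors(true) stops jsoup from throwing on 4xx/5xx responses,
    // so the status code and body can be inspected directly.
    Connection.Response resp = Jsoup.connect(searchUrl)
            .userAgent(USER_AGENT)
            .referrer("http://www.work.go.kr/")  // assumed; mirror the browser
            .ignoreHttpErrors(true)
            .execute();
    System.out.println("Status: " + resp.statusCode());
    // Print the first part of the body to see any error or block message.
    String body = resp.body();
    System.out.println(body.substring(0, Math.min(500, body.length())));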
