I want to parse data (company name, location, job description, ...) out of this HTML using jsoup (Java). I get stuck when trying to iterate over the job listings.

The HTML extract below is one of many "joblisting" divs that I want to iterate over and extract data from. I just can't work out how to iterate over these specific div elements. Sorry for the beginner question, but maybe someone who already knows which function to use can help me. Is it select?

<div class="between_listings"><!-- local.spacer --></div>

<div id="joblisting-2944914" class="joblisting listing-even listing-even company-98028 " itemscope itemtype="http://schema.org/JobPosting">


<div class="company_logo" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">
     <a href="/stellenangebote-des-unternehmens--Delivery-Hero-Holding-GmbH--98028.html" title="Jobs Delivery Hero Holding GmbH" itemprop="url">
       <img src="/upload_de/logo/D/logoDelivery-Hero-Holding-GmbH-98028DE.gif" alt="Logo Delivery Hero Holding GmbH" itemprop="image" width="160" height="80" />
     </a>
</div>


<div class="job_info">


<div class="h3 job_title">
   <a id="jobtitle-2944914" href="/stellenangebote--Junior-Business-Intelligence-Analyst-CRM-m-f-Berlin-Delivery-Hero-Holding-GmbH--2944914-inline.html?ssaPOP=204&ssaPOR=203" title="Arbeiten bei Delivery Hero Holding GmbH" itemprop="url">
      <span itemprop="title">Junior Business Intelligence Analyst / CRM (m/f)</span>
   </a>
</div>

<div class="h3 company_name" itemprop="hiringOrganization" itemscope itemtype="http://schema.org/Organization">

    <span itemprop="name">Delivery Hero Holding GmbH</span>

</div>

</div>




<div class="job_location_date">

    <div class="job_location target-location">
         <div class="job_location_info" itemprop="jobLocation" itemscope itemtype="http://schema.org/Place">


            <div class="h3 locality" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
                  <span itemprop="addressLocality"> Berlin</span>
            </div>


            <span class="location_actions">
                <a href="javaScript:PopUp('http://www.stepstone.de/5/standort.html?OfferId=2944914&ssaPOP=203&ssaPOR=203','resultList',800,520,1)" class="action_showlistingonmap showlabel" title="Google Maps" itemprop="maps">
                   <span class="location-icon"><!-- --></span>
                   <span class="location-label">Google Maps</span>
                </a>
            </span>

          </div>
       </div>

       <div class="job_date_added" itemprop="datePosted"><time datetime="2014-07-04">04.07.14</time></div>
</div>


<div class="job_actions">


</div>

</div>
<div class="between_listings"><!-- local.spacer --></div>

Java code:

File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt");
// Load file into extraction1       
Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/");                          
Elements jobListingElements = ParseResult.select(".joblisting");        
for (Element jobListingElement: jobListingElements) {         
    jobListingElement.select(".companyName span[itemprop=\"name\"]");         
    // other element properties         
    System.out.println(jobListingElements);
}

Thank you!

  • Welcome to SO. Can you please include the code you tried in the question? Commented Jul 18, 2014 at 8:24
  • Sorry, I can't get the formatting right. Thanks for your warm welcome. File input = new File("C:/Talend/workspace/WEBCRAWLER/output/keywords_SOA.txt"); // Load file into extraction1 Document ParseResult = Jsoup.parse(input, "UTF-8", "http://example.com/"); Elements jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { jobListingElement.select(".companyName span[itemprop=\"name\"]"); // other element properties System.out.println(jobListingElements); Commented Jul 18, 2014 at 9:12
  • And I don't get why this doesn't work. Elements jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { Elements e1 = jobListingElement.select(".companyName span[itemprop=\"name\"]"); // other element properties System.out.println(e1.text()); Commented Jul 18, 2014 at 9:29

1 Answer

So you already have your jsoup Document, right? Then it is fairly easy, provided the CSS class joblisting does not appear anywhere else.

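import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Jsoup.parse(File, String) throws IOException, so the calling method must declare or handle it.
Document document = Jsoup.parse(new File("d:/bla.html"), "utf-8");
// Select every job listing div, then run the inner selectors relative to each one.
Elements elements = document.select(".joblisting");
for (Element element : elements) {
    Elements jobTitleElement = element.select(".job_title span");
    Elements companyNameElement = element.select(".company_name span[itemprop=name]");
    String companyName = companyNameElement.text();
    String jobTitle = jobTitleElement.text();

    System.out.println(companyName);
    System.out.println(jobTitle);
}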

I don't know why the attribute selector [itemprop*=\"name\"] does not find the span (further reading: http://jsoup.org/cookbook/extracting-data/selector-syntax ).

Got it: use span[itemprop=name], without any quotes or escapes. Other attributes or values should work the same way if you need a more specific selection.
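To make the per-listing extraction concrete, here is a minimal, self-contained sketch that pulls the company name, job title, location and posting date out of each listing. It rests on some assumptions: the class name JobListingExtractor, the method extract and the constant SAMPLE_HTML are made up for illustration, and the sample markup is a trimmed-down copy of the HTML in the question. Note also that the question's selector uses .companyName, while the class in the HTML is company_name; that mismatch alone would give an empty result, whatever the attribute selector looks like.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JobListingExtractor {

    // Assumption: a shortened version of the markup from the question, inlined so the example runs without a file.
    static final String SAMPLE_HTML =
        "<div class=\"between_listings\"></div>"
      + "<div id=\"joblisting-2944914\" class=\"joblisting listing-even company-98028\">"
      + "<div class=\"job_info\">"
      + "<div class=\"h3 job_title\"><a href=\"#\"><span itemprop=\"title\">"
      + "Junior Business Intelligence Analyst / CRM (m/f)</span></a></div>"
      + "<div class=\"h3 company_name\"><span itemprop=\"name\">Delivery Hero Holding GmbH</span></div>"
      + "</div>"
      + "<div class=\"job_location_date\">"
      + "<div class=\"h3 locality\"><span itemprop=\"addressLocality\"> Berlin</span></div>"
      + "<div class=\"job_date_added\" itemprop=\"datePosted\">"
      + "<time datetime=\"2014-07-04\">04.07.14</time></div>"
      + "</div>"
      + "</div>";

    /** Returns one String[] {company, title, location, datePosted} per div with class "joblisting". */
    static List<String[]> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String[]> rows = new ArrayList<>();
        for (Element listing : document.select("div.joblisting")) {
            // Each select() runs relative to the current listing, so values from different jobs never mix.
            String company = listing.select(".company_name span[itemprop=name]").text().trim();
            String title = listing.select(".job_title span[itemprop=title]").text().trim();
            String location = listing.select("span[itemprop=addressLocality]").text().trim();
            // The machine-readable date lives in the datetime attribute, not in the visible text.
            String datePosted = listing.select(".job_date_added time").attr("datetime");
            rows.add(new String[] {company, title, location, datePosted});
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : extract(SAMPLE_HTML)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

To run this against the real page, swap Jsoup.parse(html) for the Jsoup.parse(File, String) call shown above. Returning the values as rows, rather than printing them inside the loop, should make it easier to hand them to a database insert later.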

5 Comments

Thank you so much for your fast answer! I apologize in advance for being slow to learn, but let me explain my goal.
Maybe you can help me with a specific example that produces output, so I can see the way to go. Thank you in advance! You are helping me a lot!
It seems to have swallowed my explanation. I want to extract specific variables (e.g. Company_Name, Job Title, etc.) and import them into an SQL DB (later). Can you give me an example of how to use your code? For instance, I don't understand the span[itemprop=\"name\"] part. I tried this, which looks stupid to me =) File input = new Elements jobListingElements = ParseResult.select(".joblisting"); for (Element jobListingElement: jobListingElements) { jobListingElement.select(".companyName span[itemprop=\"name\"]"); System.out.println(jobListingElements);
Updated my answer: don't use the attribute selector. I wrote the answer without trying it. Somehow jsoup does not find the span with the attributes.
Works like a charm, thank you. Now I have to figure out how to write it to my local MS-SQL Server. I guess I'll be back in a few hours, haha. I love this page already. I can't give you an upvote yet, though.
