How to parse data from HTML using Regex?

Question

I want to scrape the job title, location, and job description from Indeed (1st page only) using Regex, and store the results to a data frame. Here is the link: https://www.indeed.com/jobs?q=data+scientist&l=California

I have completed the task using BeautifulSoup, and it worked fine:

from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd

url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
htmlfile = urlopen(url)
soup = BS(htmlfile,'html.parser')

companies = []
locations = []
summaries = []

company = soup.findAll('span', attrs={'class':'company'})
for c in company:
    companies.append(c.text.replace("\n",""))
  
location = soup.findAll(class_ = 'location accessible-contrast-color-location')
for l in location:
    locations.append(l.text)
        
summary = soup.findAll('div', attrs={'class':'summary'})
for s in summary:
    summaries.append(s.text.replace("\n",""))

jobs_df = pd.DataFrame({'Company':companies, 'Location':locations, 'Summary':summaries})
jobs_df

Result from BS:

   Company                    Location           Summary
0  Cisco Careers              San Jose, CA       Work on massive structured, unstru...
1  AllyO                      Palo Alto, CA      Extensive knowledge of scientific ...
2  Driven Brands              Benicia, CA 94510  Develop scalable statistical, mach...
3  eBay Inc.                  San Jose, CA       These problems require deep analys...
4  Disney Streaming Services  San Francisco, CA  Deep knowledge of machine learning...
5  Trimark Associates, Inc.   Sacramento, CA     The primary focus is in applying d...

But when I tried to use the same tags in Regex it failed.

import urllib.request, urllib.parse, urllib.error
import re
import pandas as pd

url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
text = urllib.request.urlopen(url).read().decode()

companies = []
locations = []
summaries = []

company = re.findall('<span class="company">(.*?)</span>', text)
for c in company:
    companies.append(str(c))
  
location = re.findall('<div class="location accessible-contrast-color-location">(.*?)</div>', text)
for l in location:
    locations.append(str(l))
        
summary = re.findall('<div class="summary">(.*?)</div>', text)
for s in summary:
    summaries.append(str(s))

print(companies)
print(locations)
print(summaries)

I got an error saying the lengths of the lists don't match, so I checked the individual lists. It turned out that two of them came back empty. This is what the code above prints:

[]
['Palo Alto, CA', 'Sunnyvale, CA', 'San Francisco, CA', 'South San Francisco, CA 94080', 'Pleasanton, CA 94566', 'Aliso Viejo, CA', 'Sacramento, CA', 'Benicia, CA 94510', 'San Bruno, CA']
[]

What did I do wrong?

It's hard to tell without seeing the actual HTML. But one thing: the regex character . will not match a newline unless the flag re.DOTALL (or re.S) is used. So, if the content of the <span> or <div> tags is split across multiple lines, you will not match anything. Try adding , flags=re.S to your findall call.
– Booboo, Commented Sep 23, 2019 at 11:51

Bhawan · Accepted Answer · 2019-09-23 14:24:17Z

By default, . matches any character except a newline, and the HTML here contains newlines inside the tags you are matching.
So you need to pass re.DOTALL as the flags argument to re.findall, like below:

company = re.findall('<span class="company">(.*?)</span>', text, flags=re.DOTALL)
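To see the effect of the flag in isolation, here is a minimal sketch. The sample string is a made-up fragment for illustration, not Indeed's actual markup:

```python
import re

# Hypothetical fragment: the span's content sits on its own lines.
sample = '<span class="company">\n    Acme Corp\n</span>'

pattern = '<span class="company">(.*?)</span>'

# Without re.DOTALL, "." cannot cross the newline, so nothing matches.
without_flag = re.findall(pattern, sample)

# With re.DOTALL, "." also matches "\n", so the whole span body is captured.
with_flag = re.findall(pattern, sample, flags=re.DOTALL)

print(without_flag)  # []
print(with_flag)     # ['\n    Acme Corp\n']
```

The captured text still carries the surrounding whitespace, which is why a strip() is needed afterwards.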

This alone will not give you just the names. The capture group returns everything inside the selected span element, including any nested tags. So you need a second regex to pick out only the part you want.

for c in company:
    # Keep only the text inside the anchor tag, discarding the tag and its attributes.
    # A span that contains no <a> element yields no match and is skipped here.
    name = re.findall('<a[^>]*>(.*?)</a>', c, flags=re.DOTALL)
    for n in name:
        # Clean up by stripping the surrounding newlines and spaces.
        companies.append(n.strip())

print(companies)

Output:

['Driven Brands', 'Southern California Edison', 'Paypal', "Children's Hospital Los Angeles", 'Cisco Careers', 'University of California, Santa Cruz', 'Beyond Limits', 'Shutterfly', 'Walmart', 'Trimark Associates, Inc.']

For location and summary, the matched elements contain no further HTML tags, only text.

So re.DOTALL plus stripping the text will do the job.
There is no need for a second for loop or a second findall.
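A minimal sketch of that simpler case might look like the following. The sample markup and the extract helper are assumptions for illustration; only the class names are taken from the question:

```python
import re

# Hypothetical markup modelled on the class names used in the question.
sample = """
<div class="location accessible-contrast-color-location">San Jose, CA</div>
<div class="summary">
    Work on massive structured and unstructured data.
</div>
"""

def extract(pattern, text):
    """Return every captured group, matched across newlines and stripped."""
    return [m.strip() for m in re.findall(pattern, text, flags=re.DOTALL)]

locations = extract(
    '<div class="location accessible-contrast-color-location">(.*?)</div>',
    sample,
)
summaries = extract('<div class="summary">(.*?)</div>', sample)

print(locations)  # ['San Jose, CA']
print(summaries)  # ['Work on massive structured and unstructured data.']
```

Using one helper for both fields keeps the flag and the strip() in a single place, so the two lists are cleaned the same way.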

chitown88 · Accepted Answer · 2019-09-23 13:55:26Z

By default, . matches any character except line terminators. The content you are trying to get sits on its own lines, separated by \n. So you need to match anything, including line terminators.

You'll want to do: company = re.findall('<span class="company">(.*?)</span>', text, re.DOTALL)

But this will also require a little cleanup afterwards.
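One possible shape for that cleanup, as a rough sketch: strip any leftover tags, decode HTML entities, and collapse whitespace. The clean function and the raw string are hypothetical, and the tag-stripping regex is deliberately crude, so it would not cope with every kind of markup:

```python
import html
import re

def clean(fragment):
    """Drop tags, decode entities, and collapse runs of whitespace."""
    no_tags = re.sub(r'<[^>]+>', '', fragment)
    decoded = html.unescape(no_tags)
    return ' '.join(decoded.split())

# Hypothetical captured group, as re.findall with re.DOTALL might return it.
raw = '\n  <a href="/cmp/acme">\n    Acme &amp; Sons\n  </a>\n'

print(clean(raw))  # Acme & Sons
```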
