I'm scraping a website with Python and am having trouble extracting the dates into a new Date column of my DataFrame using regex.

The code below uses BeautifulSoup to scrape the event data and the event links:

import pandas as pd
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')

event = []
links = []

# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)

df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]

# ---Links---
for a in soup.find_all('a', href=True): 
    if a.text: 
        links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']

# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)

The date appears at the beginning of each event row, for example: (May 26-29Augmented World ExpoSan...). The dates come in the formats below, and I have included my regex for each (which I believe is correct).

Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29:  [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}

Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}

When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.

df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()

Any help would be greatly appreciated! Thank you.

2 Answers

You may use

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)

See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).

Details

  • [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
  • ' ' (a literal space) - matches one space (replace with \s to match any whitespace)
  • [0-9]{1,2} - one or two digits
  • (?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of
    • - - hyphen
    • (?:[A-Z][a-z]* )? - an optional sequence of
      • [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
      • ' ' (a literal space) - matches one space (replace with \s to match any whitespace)
    • [0-9]{1,2} - one or two digits

The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location and (?!\d) fails the match if there is a digit immediately after.
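
As a quick check of the patterns above, here is a minimal sketch using only Python's built-in re module. Only the first sample string is quoted from the question; the other two are made up for illustration, and the names original, strict_reg and samples are assumptions introduced here. The sketch also shows that the question's pattern does match, but its only capturing group, (\ ), wraps the space. Since str.extract returns the captured group rather than the whole match, that is presumably why the Date column looked empty.

```python
import re

# Pattern from the question: the only capturing group wraps the space.
original = r'[A-Z][a-z]*(\ )[0-9]{1,2}'

# Pattern from the answer: one capturing group around the whole date.
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'

# Whole-word variant from the answer, with the lookbehind and lookahead added.
strict_reg = r'(?<![A-Za-z])' + date_reg + r'(?!\d)'

# The first sample is quoted in the question; the other two are hypothetical.
samples = [
    'May 26-29Augmented World ExpoSan...',
    'May 27Example Event',
    'May 28-Jun 2Example Event',
]

# The question's pattern matches, but group 1 is just the space.
space_only = re.search(original, samples[0]).group(1)

# The answer's pattern captures the full date in group 1.
dates = [re.search(date_reg, s).group(1) for s in samples]

# The strict variant rejects a date glued to a preceding letter
# or one whose day number is followed by another digit.
glued = re.search(strict_reg, 'xMay 27')
long_number = re.search(strict_reg, 'May 123')

print(repr(space_only))
print(dates)
print(glued, long_number)
```

In pandas, the same date_reg string can be passed to str.extract unchanged, because it contains exactly one capturing group.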

4 Comments

Thank you for explaining! This makes a lot of sense now. That site looks like a great learning tool. Thanks a lot.
I was able to create the new date column, however, the original date is still present in the event column. How would I go about deleting that? I was assuming extract did that.
@LeslieTate If you want to remove the dates from the Event column, use str.replace on it: df['Event'] = df['Event'].str.replace(date_reg, '', regex=True). In pandas 2.0 and later, regex=True is required because the pattern is otherwise treated as a literal string. str.extract only finds a match; it does not remove anything.
Ahh, that's great to know. Thanks again for your help :)
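
To illustrate the extract-then-replace approach discussed in the comments, here is a small sketch on a hand-built DataFrame. The rows are illustrative strings modeled on the question's example, not scraped data. Anchoring the pattern with ^ is an assumption added here so that only a date at the very start of the string is removed. regex=True is passed explicitly so the pattern is treated as a regular expression regardless of the pandas version's default.

```python
import pandas as pd

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'

# Illustrative rows modeled on the question's example (not scraped data).
df = pd.DataFrame({'Event': ['May 26-29Augmented World Expo',
                             'May 28-Jun 2Example Event']})

# str.extract copies the captured date into a new column; Event is unchanged.
df['Date'] = df['Event'].str.extract(date_reg, expand=False)

# str.replace removes the leading date; '^' (an addition here) anchors at the start.
df['Event'] = df['Event'].str.replace('^' + date_reg, '', regex=True)

print(df)
```

Running the extract before the replace matters: once the date is stripped from Event, there is nothing left for str.extract to find.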

This script:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.techmeme.com/events'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []
for row in soup.select('.rhov a'):
    date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
    data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})

df = pd.DataFrame(data)
print(df)

will create this dataframe:

          Date                                           Event          Place                                               Link
0    May 26-29                NOW VIRTUAL:Augmented World Expo    Santa Clara      https://www.techmeme.com/gotos/www.awexr.com/
1       May 27                               Earnings: HPQ,BOX                 https://www.techmeme.com/gotos/finance.yahoo.c...
2       May 28                              Earnings: CRM, VMW                 https://www.techmeme.com/gotos/finance.yahoo.c...
3    May 28-29         CANCELED:WeAreDevelopers World Congress         Berlin  https://www.techmeme.com/gotos/www.wearedevelo...
4        Jun 2                                    Earnings: ZM                 https://www.techmeme.com/gotos/finance.yahoo.c...
..         ...                                             ...            ...                                                ...
140   Dec 7-10                         NEW DATE:GOTO Amsterdam      Amsterdam         https://www.techmeme.com/gotos/gotoams.nl/
141   Dec 8-10                 Microsoft Azure + AI Conference      Las Vegas  https://www.techmeme.com/gotos/azureaiconf.com...
142   Dec 9-10           NEW DATE:Paris Blockchain Week Summit          Paris  https://www.techmeme.com/gotos/www.pbwsummit.com/
143  Dec 13-16                          NEW DATE:KNOW Identity      Las Vegas  https://www.techmeme.com/gotos/www.knowidentit...
144  Dec 15-16  NEW DATE, NEW LOCATION:Fortune Brainstorm Tech  San Francisco  https://www.techmeme.com/gotos/fortuneconferen...

[145 rows x 4 columns]

2 Comments

This is a much easier solution! It shows how this can be done in far fewer lines of code. However, I don't see the links included in the df.
@LeslieTate I updated my answer to include Link as well.
