Python Regex Extract Date to New Column in Dataframe

Question

I'm scraping a website using Python and I'm having trouble extracting the dates into a new Date column of my dataframe with regex.

The code below is using BeautifulSoup to scrape event data and the event links:

import pandas as pd
import bs4 as bs
import urllib.request

source = urllib.request.urlopen('https://www.techmeme.com/events').read()
soup = bs.BeautifulSoup(source,'html.parser')

event = []
links = []

# ---Event Data---
for a in soup.find_all('a'):
    event.append(a.text)

df_event = pd.DataFrame(event)
df_event.columns = ['Event']
df_event = df_event.iloc[1:]

# ---Links---
for a in soup.find_all('a', href=True): 
    if a.text: 
        links.append(a['href'])
df_link = pd.DataFrame(links)
df_link.columns = ['Links']

# ---Combines dfs---
df = pd.concat([df_event.reset_index(drop=True),df_link.reset_index(drop=True)],sort=False, axis=1)

The date appears at the beginning of each event data row. Example: (May 26-29Augmented World ExpoSan...). The dates come in the formats below, and I have included my regex for each (which I believe is correct).

Different Date Formats:
May 27: [A-Z][a-z]*(\ )[0-9]{1,2}
May 26-29:  [A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}
May 28-Jun 2: [A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}

Combined
[A-Z][a-z]*(\ )[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[0-9]{1,2}|[A-Z][a-z]*(\ )[0-9]{1,2}-[A-Z][a-z]*(\ )[0-9]{1,2}

When I try to create a new column and extract the dates using Regex, I just receive an empty df['Date'] column.

df['Date'] = df['Event'].str.extract(r'[A-Z][a-z]*(\ )[0-9]{1,2}')
df.head()

Any help would be greatly appreciated! Thank you.

Is the information provided in this question enough for you? stackoverflow.com/a/62009216/12239523. I think it is similar, but I may be wrong.
– Sebastian, Commented May 26, 2020 at 19:11

Wiktor Stribiżew · Accepted Answer · 2020-05-26 19:10:15Z

5

You may use

date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
df['Date'] = df['Event'].str.extract(date_reg, expand=False)

See the regex demo. If you want to match as whole words and numbers, you may use (?<![A-Za-z])([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)(?!\d).

Details

[A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
" " - a space (replace with \s to match any whitespace)
[0-9]{1,2} - one or two digits
(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})? - an optional sequence of
- "-" - a hyphen
- (?:[A-Z][a-z]* )? - an optional sequence of
  - [A-Z][a-z]* - an uppercase letter and then 0 or more lowercase letters
  - " " - a space (replace with \s to match any whitespace)
- [0-9]{1,2} - one or two digits

The (?<![A-Za-z]) construct is a lookbehind that fails the match if there is a letter immediately before the current location and (?!\d) fails the match if there is a digit immediately after.
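As a rough check of the breakdown above, the sketch below runs both patterns from this answer with the standard `re` module. It also suggests why the attempt in the question looked empty: `str.extract` returns the contents of capture groups, and the only group in that pattern is `(\ )`, so the captured text is a single space. The sample strings and the helper `first_group` are hypothetical, made up for illustration; only the patterns come from the post.

```python
import re

# Pattern from the answer: "Month day", optionally followed by "-day" or "-Month day".
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'
# Stricter variant from the answer, with the lookbehind and lookahead guards.
strict_reg = r'(?<![A-Za-z])' + date_reg + r'(?!\d)'
# Pattern from the question: its only capture group is the escaped space.
question_reg = r'[A-Z][a-z]*(\ )[0-9]{1,2}'

# Hypothetical rows modelled on the "May 26-29Augmented World Expo..." example.
samples = [
    'May 27Earnings: HPQ,BOX',
    'May 26-29Augmented World ExpoSanta Clara',
    'May 28-Jun 2Some Hypothetical Event',
]

def first_group(pattern, text):
    """Return capture group 1 of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(1) if match else None

dates = [first_group(date_reg, s) for s in samples]
print(dates)

# The question's pattern matches, but group 1 holds only a single space.
space_only = first_group(question_reg, samples[0])
print(repr(space_only))

# The guards reject a number longer than two digits; the plain pattern does not.
loose = first_group(date_reg, 'Room 1234')
strict = first_group(strict_reg, 'Room 1234')
print(loose, strict)
```

Wrapping the whole date in one capturing group and making the inner groups non-capturing, as the answer does, is what lets `str.extract(..., expand=False)` return a single Series of full date strings.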

answered May 26, 2020 at 19:10


4 Comments

Leslie Tate Over a year ago

Thank you for explaining! This makes a lot of sense now. That site looks like a great learning tool. Thanks a lot.

Leslie Tate Over a year ago

I was able to create the new date column, however, the original date is still present in the event column. How would I go about deleting that? I was assuming extract did that.

Wiktor Stribiżew Over a year ago

@LeslieTate If you want to remove the dates from the Event column, use str.replace on it: df['Event'] = df['Event'].str.replace(date_reg, '', regex=True). Pass regex=True explicitly, because pandas 2.0 and later treat the pattern as a literal string by default. str.extract only finds a match; it does not remove anything.

Leslie Tate Over a year ago

Ahh, that's great to know. Thanks again for your help :)
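The extract-then-remove step discussed in these comments can be sketched on a small hand-made dataframe. The two rows are hypothetical stand-ins for the scraped data. Anchoring the removal with `^` is an assumption added here, not part of the original comment: it limits the replacement to the leading date, so an event name that happens to contain something like "Summit 20" is left alone.

```python
import pandas as pd

# Pattern from the accepted answer.
date_reg = r'([A-Z][a-z]* [0-9]{1,2}(?:-(?:[A-Z][a-z]* )?[0-9]{1,2})?)'

# Hypothetical rows standing in for the scraped 'Event' column.
df = pd.DataFrame({'Event': [
    'May 27Earnings: HPQ,BOX',
    'May 26-29Augmented World ExpoSanta Clara',
]})

# str.extract copies the first match into a new column; 'Event' is unchanged.
df['Date'] = df['Event'].str.extract(date_reg, expand=False)

# str.replace does the removal. '^' (an assumption, see above) restricts it
# to a date at the very start of the string.
df['Event'] = df['Event'].str.replace('^' + date_reg, '', regex=True)

print(df)
```

Without the anchor, `str.replace` would remove every match in each string, which may or may not be what is wanted for a given dataset.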

Andrej Kesely · Accepted Answer · 2020-05-26 20:05:32Z

2

This script:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = 'https://www.techmeme.com/events'

soup = BeautifulSoup(requests.get(url).content, 'html.parser')

data = []
for row in soup.select('.rhov a'):
    date, event, place = map(lambda x: x.get_text(strip=True), row.find_all('div', recursive=False))
    data.append({'Date': date, 'Event': event, 'Place': place, 'Link': 'https://www.techmeme.com' + row['href']})

df = pd.DataFrame(data)
print(df)

will create this dataframe:

          Date                                           Event          Place                                               Link
0    May 26-29                NOW VIRTUAL:Augmented World Expo    Santa Clara      https://www.techmeme.com/gotos/www.awexr.com/
1       May 27                               Earnings: HPQ,BOX                 https://www.techmeme.com/gotos/finance.yahoo.c...
2       May 28                              Earnings: CRM, VMW                 https://www.techmeme.com/gotos/finance.yahoo.c...
3    May 28-29         CANCELED:WeAreDevelopers World Congress         Berlin  https://www.techmeme.com/gotos/www.wearedevelo...
4        Jun 2                                    Earnings: ZM                 https://www.techmeme.com/gotos/finance.yahoo.c...
..         ...                                             ...            ...                                                ...
140   Dec 7-10                         NEW DATE:GOTO Amsterdam      Amsterdam         https://www.techmeme.com/gotos/gotoams.nl/
141   Dec 8-10                 Microsoft Azure + AI Conference      Las Vegas  https://www.techmeme.com/gotos/azureaiconf.com...
142   Dec 9-10           NEW DATE:Paris Blockchain Week Summit          Paris  https://www.techmeme.com/gotos/www.pbwsummit.com/
143  Dec 13-16                          NEW DATE:KNOW Identity      Las Vegas  https://www.techmeme.com/gotos/www.knowidentit...
144  Dec 15-16  NEW DATE, NEW LOCATION:Fortune Brainstorm Tech  San Francisco  https://www.techmeme.com/gotos/fortuneconferen...

[145 rows x 4 columns]

edited May 26, 2020 at 20:05

answered May 26, 2020 at 19:13


2 Comments

Leslie Tate Over a year ago

This is a much easier solution! I guess it shows how this can be done in far fewer lines of code. However, I don't see the links included in the df.

Andrej Kesely Over a year ago

@LeslieTate I updated my answer to include Link as well.
