Text Scraping using Python: Regex

Question

I have dynamic text that looks something like this:

my_text = "address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  
           email [email protected] , sdasd [email protected] - [email protected]"

The text starts with 'address'. As soon as we see 'address', we need to scrape everything from there until 'landline', 'mobile', or 'cell' appears. From there, we want to scrape all the phone text (without altering the spaces in between): we start from the first occurrence of 'landline', 'mobile', or 'cell' and stop as soon as 'email' appears. Finally, we scrape the email part (again without altering the spaces in between).

'landline'/'mobile'/'cell' can appear in any order, and some of them may not appear at all. For example, the text could also look like this:

my_text = "address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"

There's a little more engineering that needs to be done to form arrays of subtext contained in address, phones and email text. Subtexts of addresses are always separated with commas (,). Subtexts of emails can be separated with commas (,) or hyphens (-).

My output should be a JSON dictionary which looks something like this:

resultant_dict = {
                      addresses: [
                                  { address: "ae fae daq ad" }
                                , { address: "1231 asdas" }
                               ]
                    , phones: [
                                  { number: "213121233 -123", kind: "landline" }
                                , { number: "513121233", kind: "mobile" }
                                , { number: "(132) -142-3127", kind: "cell" }
                             ]
                    , emails: [
                                  { email: "[email protected]", connector: "" }
                                , { email: "sdasd [email protected]", connector: "," }
                                , { email: "[email protected]", connector: "-" }
                              ]
}

I am trying to achieve this thing using regular expressions or any other way in Python. I can't figure out how to write this as I am a novice programmer.

Cody Bouche · Accepted Answer · 2015-09-04 21:42:40Z

1

This will work as long as there are no spaces in your emails:

import re
my_text = 'address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  email [email protected] , [email protected] - [email protected]'

split_words = ['address', 'landline', 'mobile', 'cell', 'email']
resultant_dict = {'addresses': [], 'phones': [], 'emails': []}

for sw in split_words:

    # Skip keywords that do not occur in the input at all.
    if sw not in my_text:
        continue

    # Take the text after the keyword; list() keeps the result indexable on Python 3.
    text = list(filter(None, my_text.split(sw)))
    text = text[0].strip() if len(text) < 2 else text[1].strip()
    # Find the next keyword (if any) so the text can be cut off there.
    next_split = [x for x in text.split() if x in split_words]

    if next_split:
        text = text.split(next_split[0])[0].strip()

    if sw in ['address']:
        text = text.split(',')
        for t in text:
            resultant_dict['addresses'].append({'address': t.strip()})

    elif sw in ['landline', 'mobile', 'cell']:
        resultant_dict['phones'].append({'number': text, 'kind': sw})

    elif sw in ['email']:

        connectors = [',', '-']
        emails = re.split('|'.join(connectors), text)
        # Whitespace-separated tokens, used to look up the connector before each email.
        text = text.split()

        for email in emails:

            email = email.strip()
            connector = ''
            index = text.index(email) if email in text else 0

            if index > 0:
                connector = text[index - 1]

            resultant_dict['emails'].append({'email': email, 'connector': connector})

print(resultant_dict)
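
The same keyword-splitting idea can also be sketched with a single re.split() call. This is not part of the original answer: it assumes the five keywords from the question, the function name split_sections is made up for illustration, and the email addresses in the sample are placeholders. Because the pattern contains a capturing group, re.split() returns the keywords along with the text between them, so sections can appear in any order and missing keywords simply do not show up.

```python
import re

# Keywords taken from the question; adjust as needed.
KEYWORDS = ('address', 'landline', 'mobile', 'cell', 'email')


def split_sections(text):
    """Return (keyword, body) pairs in the order they appear in text."""
    # With a capturing group, re.split() keeps the delimiters, giving
    # [prefix, keyword, body, keyword, body, ...].
    pattern = r'\b(' + '|'.join(KEYWORDS) + r')\b'
    parts = re.split(pattern, text)
    # Keywords sit at the odd indexes; each body follows its keyword.
    # strip() trims only the ends, so spacing inside a body is untouched.
    return [(parts[i], parts[i + 1].strip())
            for i in range(1, len(parts) - 1, 2)]


# Placeholder input modelled on the question's text.
sample = ('address ae fae, 1231 asdas  cell (132) -142-3127 '
          'landline 213121233 -123   email a@example.com , b@example.com')

for kind, body in split_sections(sample):
    print(kind, repr(body))
```

Each body could then be post-processed as in the loop above (splitting addresses on commas, and so on). One caveat of this sketch: a keyword that occurs as a standalone word inside a section, for example inside an address, would also be treated as a section start.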

edited Sep 4, 2015 at 21:42

answered Sep 4, 2015 at 21:35

Cody Bouche

2 Comments

user3422637 Over a year ago

Ok. This works great. But I want to retain the spaces, for a reason. I will try to edit the code accordingly and update.

user3422637 Over a year ago

If you come up with a quick tweak to include the spaces, you could add it too :)

Jerry101 · Accepted Answer · 2015-09-04 21:34:06Z

1

This is not a good job for regular expressions since the components you want to parse out of the input can appear in any order and any number.

Consider using a lexing and parsing library such as the pyPEG parsing expression grammar.

Another approach would use str.split() or re.split() to split the input text into tokens. Then scan through those tokens looking for your keywords like 'address', 'cell', and ',', accumulating the following tokens until the next keyword. This approach lets split() do the first part of the tokenizing work, leaving you to do the rest of the lexical work (by recognizing keywords) and the parsing work manually.

The manual approach is more instructive but more verbose and less flexible. It goes like this:

text = """address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"""

class Scraper:
    def __init__(self):
        self.current = []
        self.current_type = None

    def emit(self):
        if self.current:
            # TODO: Add the new item to a dictionary.
            # Later, translate the dictionary to JSON format.
            print(self.current_type, self.current)

    def scrape(self, input_text):
        tokens = input_text.split()
        for token in tokens:
            if token in ('address', 'cell', 'landline', 'mobile', 'email'):
                self.emit()
                self.current = []
                self.current_type = token
            else:
                self.current.append(token)
        self.emit()

s = Scraper()
s.scrape(text)

This emits:

address ['ae', 'fae', 'daq', 'ad,', '1231', 'asdas']
cell ['(132)', '-142-3127']
landline ['213121233', '-123']
email ['[email protected]', ',', 'sdasd', '[email protected]', '-', '[email protected]']

You'll want to use re.split() to make it split 'ad,' into ['ad', ','], add code to handle tokens like ',', and use a library such as the standard json module to convert the dictionary to JSON format.
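
The remaining steps (separating trailing commas, handling ',' tokens, and converting to JSON) might look roughly like this. It is a sketch, not part of the original answer: it uses re.findall() instead of re.split() for tokenizing, the helper names tokenize, scrape_sections, split_on, and to_dict are made up for illustration, and the email addresses are placeholders. It also assumes that a '-' connector between emails is surrounded by whitespace, and it rejoins tokens with single spaces, so runs of spaces are collapsed.

```python
import json
import re

KEYWORDS = {'address', 'landline', 'mobile', 'cell', 'email'}


def tokenize(text):
    # A comma is always its own token; any other run of characters that
    # contains neither whitespace nor a comma is one token.
    return re.findall(r',|[^\s,]+', text)


def scrape_sections(text):
    """Group tokens under the most recent keyword, in input order."""
    sections = []
    for token in tokenize(text):
        if token in KEYWORDS:
            sections.append((token, []))
        elif sections:
            sections[-1][1].append(token)
    return sections


def split_on(tokens, separators):
    """Split tokens at separator tokens; return (connector, text) pairs."""
    chunks = [('', [])]
    for token in tokens:
        if token in separators:
            chunks.append((token, []))
        else:
            chunks[-1][1].append(token)
    return [(sep, ' '.join(words)) for sep, words in chunks if words]


def to_dict(sections):
    result = {'addresses': [], 'phones': [], 'emails': []}
    for kind, tokens in sections:
        if kind == 'address':
            for _, part in split_on(tokens, {','}):
                result['addresses'].append({'address': part})
        elif kind == 'email':
            for sep, part in split_on(tokens, {',', '-'}):
                result['emails'].append({'email': part, 'connector': sep})
        else:  # landline, mobile, or cell
            result['phones'].append({'number': ' '.join(tokens), 'kind': kind})
    return result


# Placeholder input modelled on the question's second example.
sample = ('address ae fae daq ad, 1231 asdas  cell (132) -142-3127 '
          'landline 213121233 -123  '
          'email a@example.com , sdasd b@example.com - c@example.com')

result = to_dict(scrape_sections(sample))
print(json.dumps(result, indent=2))
```

Keeping tokenizing, grouping, and output construction in separate functions follows the same separation as scrape() and emit() above, so each step can be changed without touching the others.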

edited Sep 4, 2015 at 21:34

answered Sep 4, 2015 at 20:49

Jerry101

2 Comments

user3422637 Over a year ago

Thanks. Can you provide a working solution using anything other than regex?

Jerry101 Over a year ago

Done. Note: The more intricate the program gets to handle cases like 'ad,', the better pyPEG becomes in comparison. The more intricate this answer gets, the less instructive it'll be for you and other readers. Also note how the input-parsing code scrape() is separate from the output-constructing code emit(). That modularity makes it easier to understand, debug, and modify.
