I have dynamic text that looks something like this:

my_text = "address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  
           email [email protected] , sdasd [email protected] - [email protected]"

The text starts with 'address'. As soon as we see 'address', we need to scrape everything from there until 'landline', 'mobile', or 'cell' appears. From there on, we want to scrape all the phone text (without altering the spaces in between): we start from the first occurrence of 'landline', 'mobile', or 'cell' and stop as soon as 'email' appears. Finally, we scrape the email part (again without altering the spaces in between).

'landline'/'mobile'/'cell' can appear in any order and sometimes some may not appear. For example, the text could have looked like this as well.

my_text = "address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"

There's a little more engineering that needs to be done to form arrays of subtext contained in address, phones and email text. Subtexts of addresses are always separated with commas (,). Subtexts of emails can be separated with commas (,) or hyphens (-).

My output should be a JSON dictionary which looks something like this:

resultant_dict = {
                      addresses: [
                                  { address: "ae fae daq ad" }
                                , { address: "1231 asdas" }
                               ]
                    , phones: [
                                  { number: "213121233 -123", kind: "landline" }
                                , { number: "513121233", kind: "mobile" }
                                , { number: "(132) -142-3127", kind: "cell" }
                             ]
                    , emails: [
                                  { email: "[email protected]", connector: "" }
                                , { email: "sdasd [email protected]", connector: "," }
                                , { email: "[email protected]", connector: "-" }
                              ]
}

I am trying to achieve this thing using regular expressions or any other way in Python. I can't figure out how to write this as I am a novice programmer.

2 Answers

This will work as long as there are no spaces in your emails:

import re
my_text = 'address ae fae daq ad, 1231 asdas  landline 213121233 -123    mobile 513121233 cell (132) -142-3127  email [email protected] , [email protected] - [email protected]'

split_words = ['address', 'landline', 'mobile', 'cell', 'email']
resultant_dict = {'addresses': [], 'phones': [], 'emails': []}

for sw in split_words:

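    # list() is needed on Python 3, where filter() returns an iterator
    # that supports neither len() nor indexing.
    text = list(filter(None, my_text.split(sw)))
    text = text[0].strip() if len(text) < 2 else text[1].strip()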
    next_split = [x.strip() for x in text.split() if x.strip() in split_words]

    if next_split:
        text = text.split(next_split[0])[0].strip()

    if sw in ['address']:
        text = text.split(',')
        for t in text:
            resultant_dict['addresses'].append({'address': t.strip()})

    elif sw in ['landline', 'mobile', 'cell']:
        resultant_dict['phones'].append({'number': text, 'kind': sw})

    elif sw in ['email']:

        connectors = [',', '-']
        emails = re.split('|'.join(connectors), text)
        # Materialise the tokens as a list so that .index() works on Python 3.
        text = list(filter(None, [x.strip() for x in text.split()]))

        for email in emails:

            email = email.strip()
            connector = ''
            index = text.index(email) if email in text else 0

            if index > 0:
                connector = text[index - 1]

            resultant_dict['emails'].append({'email': email, 'connector': connector})

print(resultant_dict)

2 Comments

Ok. This works great. But I want to retain the spaces, for a reason. I will try to edit the code accordingly and update.
If you come up with a quick tweak to include the spaces, you could add it too :)

This is not a good job for regular expressions since the components you want to parse out of the input can appear in any order and any number.

Consider using a lexing and parsing library such as pyPEG, which implements parsing expression grammars.

Another approach would use str.split() or re.split() to split the input text into tokens. Then scan through those tokens looking for your keywords, such as 'address' and 'cell', and for separators such as ',', accumulating the following tokens until the next keyword. This approach lets split() do the first part of the tokenizing work, leaving you to do the rest of the lexical work (by recognizing keywords) and the parsing work manually.
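As a rough illustration of that tokenizing step, here is a minimal sketch that uses re.split() with a capturing group so each comma comes out as its own token. The names KEYWORDS, tokenize and group_by_keyword are made up for this example, and the sample string is a shortened version of the question's input:

```python
import re

# Keywords taken from the question; the constant name is an assumption.
KEYWORDS = {"address", "landline", "mobile", "cell", "email"}

def tokenize(text):
    """Split on whitespace, keeping every comma as a separate token."""
    # The capturing group makes re.split() return the commas it splits on.
    # Whitespace matches leave None (and sometimes '') behind, so drop those.
    return [part for part in re.split(r"\s+|(,)", text) if part]

def group_by_keyword(tokens):
    """Collect the tokens that follow each keyword, in input order."""
    groups = []
    for token in tokens:
        if token in KEYWORDS:
            groups.append((token, []))
        elif groups:  # ignore anything before the first keyword
            groups[-1][1].append(token)
    return groups

sample = "address ae fae daq ad, 1231 asdas  cell (132) -142-3127"
print(group_by_keyword(tokenize(sample)))
```

Because the keywords are matched as whole tokens, they can appear in any order and any of them can be missing.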

The manual approach is more instructive but more verbose and less flexible. It goes like this:

text = """address ae fae daq ad, 1231 asdas  
           cell (132) -142-3127 landline 213121233 -123     
           email [email protected] , sdasd [email protected] - [email protected]"""

class Scraper:
    def __init__(self):
        self.current = []
        self.current_type = None

    def emit(self):
        if self.current:
            # TODO: Add the new item to a dictionary.
            # Later, translate the dictionary to JSON format.
            print(self.current_type, self.current)

    def scrape(self, input_text):
        tokens = input_text.split()
        for token in tokens:
            # 'mobile' is one of the keywords listed in the question as well.
            if token in ('address', 'cell', 'landline', 'mobile', 'email'):
                self.emit()
                self.current = []
                self.current_type = token
            else:
                self.current.append(token)
        self.emit()

s = Scraper()
s.scrape(text)

This emits:

address ['ae', 'fae', 'daq', 'ad,', '1231', 'asdas']
cell ['(132)', '-142-3127']
landline ['213121233', '-123']
email ['[email protected]', ',', 'sdasd', '[email protected]', '-', '[email protected]']

You'll want to use re.split() to make it split 'ad,' into ['ad', ','], add code to handle separator tokens like ',', and use a library to convert the dictionary to JSON format.
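One possible way to fill in those remaining pieces is sketched below. It is only a sketch under a few assumptions: the helper names (tokenize, split_on, add_item, parse) are invented here, the example.com addresses are placeholders because the real ones are not visible in the question, and words are rejoined with single spaces, so runs of whitespace in the input are collapsed rather than preserved. The json module from the standard library does the final conversion.

```python
import json
import re

KEYWORDS = ("address", "landline", "mobile", "cell", "email")
PHONE_KINDS = ("landline", "mobile", "cell")

def tokenize(text):
    # Split on whitespace and keep commas as their own tokens.
    return [part for part in re.split(r"\s+|(,)", text) if part]

def split_on(tokens, separators):
    """Cut a token list at separator tokens; return (connector, text) pairs."""
    chunks = [("", [])]
    for token in tokens:
        if token in separators:
            chunks.append((token, []))
        else:
            chunks[-1][1].append(token)
    # Skip empty chunks, e.g. those caused by a trailing separator.
    return [(sep, " ".join(words)) for sep, words in chunks if words]

def add_item(result, kind, tokens):
    """Store the tokens collected for one keyword in the result dictionary."""
    if kind == "address":
        for _, part in split_on(tokens, {","}):
            result["addresses"].append({"address": part})
    elif kind in PHONE_KINDS and tokens:
        result["phones"].append({"number": " ".join(tokens), "kind": kind})
    elif kind == "email":
        for connector, part in split_on(tokens, {",", "-"}):
            result["emails"].append({"email": part, "connector": connector})

def parse(text):
    result = {"addresses": [], "phones": [], "emails": []}
    kind, current = None, []
    for token in tokenize(text):
        if token in KEYWORDS:
            add_item(result, kind, current)
            kind, current = token, []
        else:
            current.append(token)
    add_item(result, kind, current)  # flush the last group
    return result

sample = ("address ae fae daq ad, 1231 asdas cell (132) -142-3127 "
          "landline 213121233 -123 "
          "email a@example.com , sdasd b@example.com - c@example.com")
print(json.dumps(parse(sample), indent=2))
```

Separators are compared against whole tokens, so a hyphen inside a phone number such as '-142-3127' is left alone; only a free-standing '-' or ',' in the email section is treated as a connector.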

2 Comments

Thanks. Can you provide a working solution using anything other than regex?
Done. Note: The more intricate the program gets to handle cases like 'ad,', the better pyPEG becomes in comparison. The more intricate this answer gets, the less instructive it'll be for you and other readers. Also note how the input-parsing code scrape() is separate from the output-constructing code emit(). That modularity makes it easier to understand, debug, and modify.
