
I use the following Python class to preprocess a string before passing it to a machine learning classification model that predicts its sentiment.

I use regex for most of the transformations, along with some libraries such as emoji and tweet-preprocessor. The code works fine, but I believe it is slow.

Do you have any suggestions on how to improve its speed?

Example of usage:

string = "I am very happy with @easyjet #happy customer 🙂. Second sentence"
preprocessor = TextPreprocessing()
result = preprocessor.text_preprocessor(string)

The result will be: ["i am very happy with happy smiling face", "second sentence", "i am very happy with happy smiling face second sentence"]

import re
import preprocessor as p   # this is the tweet-preprocessor library
import emoji
import numpy as np
import pandas as pd
from unidecode import unidecode

class TextPreprocessing:
    def __init__(self):
        p.set_options(p.OPT.MENTION, p.OPT.URL)

    # remove punctuation
    def _punctuation(self, val):
        val = re.sub(r'[^\w\s]', ' ', val)
        val = re.sub(r'_', ' ', val)
        return val

    # collapse whitespace runs into single spaces
    def _whitespace(self, val):
        return " ".join(val.split())

    #remove numbers
    def _removenumbers(self, val):
        val = re.sub('[0-9]+', '', val)
        return val

    # transliterate non-ASCII characters to their closest ASCII equivalents
    def _remove_unicode(self, val):
        # unidecode already returns an ASCII-only str,
        # so no encode/decode round-trip is needed
        return unidecode(val)

    # split string into sentences
    def _split_to_sentences(self, body_text):
        sentences = re.split(
            r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s", body_text)
        return sentences

    # cleaning function that combines all of the above
    def _clean_text(self, val):
        val = val.lower()
        val = self._removenumbers(val)
        val = p.clean(val)
        val = ' '.join(self._punctuation(emoji.demojize(val)).split())
        val = self._remove_unicode(val)
        val = self._whitespace(val)
        return val

    def text_preprocessor(self, body_text):
        body_text_df = pd.DataFrame({"body_text": body_text}, index=[1])
        sentence_split_df = body_text_df.copy()
        sentence_split_df["body_text"] = sentence_split_df["body_text"].apply(
            self._split_to_sentences)

        # explode the list of sentences into one row per sentence
        lst_col = "body_text"
        sentence_split_df = pd.DataFrame(
            {
                col: np.repeat(
                    sentence_split_df[col].values, sentence_split_df[lst_col].str.len(
                    )
                )
                for col in sentence_split_df.columns.drop(lst_col)
            }
        ).assign(**{lst_col: np.concatenate(sentence_split_df[lst_col].values)})[
            sentence_split_df.columns
        ]

        # append the full, unsplit text as the last row
        final_df = (
            pd.concat([sentence_split_df, body_text_df])
            .reset_index()
            .drop(columns=["index"])
        )

        final_df["body_text"] = final_df["body_text"].apply(self._clean_text)

        return final_df["body_text"]

This question might be relevant to all those Data Scientists who want to move their NLP models into production.

1 Answer


Since I cannot comment, I will try to answer your question (to some extent):

  1. You should clarify how you measure the execution-time improvement. Use timeit and its repeat functionality for that:
import timeit
from functools import partial
...
if __name__ == "__main__":
    # http://25.io/toau/audio/sample.txt
    with open("sample.txt") as f:
        text = f.read()
        tp = TextPreprocessing()
        print(min(timeit.Timer(partial(tp.text_preprocessor, text)).repeat(repeat=10, number=1)))

You can also use timeit on specific methods to check for bottlenecks.
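
For example, to time only the cleaning step in isolation (reusing the text and tp objects from the snippet above):

print(min(timeit.Timer(partial(tp._clean_text, text)).repeat(repeat=10, number=1)))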

  2. Sadly, I could not run your original code sample due to the undefined np. in L58 and L64, so I could not test my assumptions. You also did not provide sample data.

  3. Some general thoughts: the regex patterns in _punctuation and _removenumbers are re-parsed (or at least looked up in re's internal cache) on every call; compile them once with re.compile and reuse the compiled pattern objects, as in the sketch below.
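
A minimal sketch of the idea, assuming the two helpers keep their current behaviour (the attribute names here are only illustrative):

import re

class TextPreprocessing:
    def __init__(self):
        # compile once; the compiled patterns are reused on every call
        self._punct_pattern = re.compile(r'[^\w\s]|_')  # punctuation and underscores
        self._num_pattern = re.compile(r'[0-9]+')

    # remove punctuation (underscores included) in a single pass
    def _punctuation(self, val):
        return self._punct_pattern.sub(' ', val)

    # remove numbers
    def _removenumbers(self, val):
        return self._num_pattern.sub('', val)

Note that re caches compiled patterns internally, so the measurable gain is mostly the avoided cache lookup; profile with the timeit setup from point 1 before and after.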


Comments

I have updated my code to include np and added an example. Could you give an example of the compile?
I am happy to accept your answer if you add a few more details, e.g. which code smells are you referring to? An example of the compile using my code would be nice.
Okay, if the link to the documentation is not enough: pattern = re.compile(r"_"); pattern.sub("", "asdf_asdf")
I thought the code smells were covered by the links posted.
