
I am trying to learn python and do text analysis using NLTK at the same time.

I am using python to scrub text before text analysis.

Given the sentence: "The target IP was: 127.1.1.100."

I want to tokenize it into:

["The", "target", "IP", "was", ":","127.1.1.100","."]

It is important I retain all the punctuation so as to reconstruct the source doc, but I need leading/trailing punctuation separated so I can do text analysis on the individual words. I wrote the following Python code, which works fine but seems kinda kludgy.

punct = ['.', ',', ':', ';', '!', '[', ']', '(', ')', '{', '}']

def split_punctuation(sentence) -> list:
    sentwords = sentence.split(" ")
    for i, word in enumerate(sentwords):
        # split off one trailing punctuation character
        if word_ends_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[:-1])
            sentwords.insert(i + 1, word[-1])
            word = word[:-1]
        # split off one leading punctuation character
        if word_starts_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[0:1])
            sentwords.insert(i + 1, word[1:])
            word = word[1:]
    return sentwords

def word_starts_with_punct(w) -> bool:
    for p in punct:
        if w.startswith(p):
            return True
    return False

def word_ends_with_punct(w) -> bool:
    for p in punct:
        if w.endswith(p):
            return True
    return False
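For comparison, the same leading/trailing split can be written without the pop/insert index juggling by using str.strip (this is an alternative sketch, not the poster's code; the names PUNCT and split_punctuation_alt are made up here). Unlike the loop above, it also peels off multiple punctuation characters on the same side:

```python
# Punctuation to separate off word boundaries (same set as the list above)
PUNCT = '.,:;![](){}'

def split_punctuation_alt(sentence: str) -> list:
    tokens = []
    for word in sentence.split(" "):
        core = word.strip(PUNCT)       # word with boundary punctuation removed
        if not core:                   # the "word" was pure punctuation
            tokens.append(word)
            continue
        start = word.find(core)
        leading, trailing = word[:start], word[start + len(core):]
        # each boundary punctuation character becomes its own token
        tokens.extend(list(leading) + [core] + list(trailing))
    return tokens
```

For example, split_punctuation_alt("The target IP was: 127.1.1.100.") keeps the dotted IP intact while separating the colon and the final period.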

So looking on SO I found a regex by Wiktor Stribiżew that does what I want, kinda:

re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip()

I was able to figure out what's going on, but in this form it separates ALL punctuation, even in the middle of words. For example, it converted today's date from 6/28/2019 to "6 / 28 / 2019".
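A quick check of that behaviour: the character class contains "/" and ".", so punctuation inside a token gets padded with spaces too.

```python
import re

# The regex from above, applied to a date: every run of chars in the class
# is surrounded with spaces, even mid-token.
padded = re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', "6/28/2019").strip()
# padded == "6 / 28 / 2019"
```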

So I modified it to use anchors at the beginning/end, but it seems I have to run it twice: once for leading punctuation and again for trailing. That seems rather inefficient, and I was hoping somebody could show me the correct way to accomplish this. The code below is the regex version:

def sep_punct_by_regex(sent) -> list:
    words = sent.split(" ")
    new_words = []
    for w in words:
        tmp1 = re.sub(r'^[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+', r' \g<0> ', w).strip()
        tmp2 = re.sub(r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+$', r' \g<0> ', tmp1).strip()
        for x in tmp2.split(" "):
            new_words.append(x)
    return new_words

Note the ^ in tmp1 and the $ in tmp2. This works as-is, but the goal is to learn while building, so how would I modify the regex for a single pass? I tried the obvious (the ^ up front and the $ at the end in one pattern), but it doesn't work.
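For what it's worth, one way to get a single pass is to combine the two anchored patterns with alternation (|), so one re.sub call substitutes both the leading and the trailing punctuation run. A sketch under that assumption (the names PUNCT_RUN and sep_punct_single_pass are invented here, not from the thread):

```python
import re

# Same character class as in tmp1/tmp2 above; the alternation
# '^...|...$' matches a punctuation run anchored at the start OR the end.
PUNCT_RUN = r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+'

def sep_punct_single_pass(sent: str) -> list:
    new_words = []
    for w in sent.split(" "):
        padded = re.sub(rf'^{PUNCT_RUN}|{PUNCT_RUN}$', r' \g<0> ', w).strip()
        new_words.extend(padded.split(" "))
    return new_words
```

This should behave the same as running tmp1 then tmp2, because re.sub applies every non-overlapping match in one scan, and the two anchored alternatives can never overlap within a word.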

  • You've got 3 layers there: letters, some punctuation, and numbers plus some punctuation. I don't think you can tokenize this easily. You'd have to make an extreme set of rules for this. Commented Jun 28, 2019 at 20:17

1 Answer

You may use

re.findall(r'\b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b|[^\W_]+|(?:[^\w\s]|_)+', s)

See the regex demo

To remove the punctuation at both ends of a string and strip whitespace, use

re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()

So, it will look like

def tokenize(s: str) -> list:
    s = re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()
    octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
    return re.findall(r'\b{0}(?:\.{0}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+'.format(octet), s)

Details

  • \b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b - an IPv4 regex pattern
  • | - or
  • [^\W_]+ - one or more letters or digits
  • | - or
  • (?:[^\w\s]|_)+ - one or more chars that are either not word/whitespace chars, or are _.
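As a quick check, running the pattern over the example sentence from the question (the variable names octet and tokens are arbitrary):

```python
import re

s = "The target IP was: 127.1.1.100."
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# IP branch first, then letters/digits, then punctuation runs
tokens = re.findall(rf'\b{octet}(?:\.{octet}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+', s)
# tokens == ['The', 'target', 'IP', 'was', ':', '127.1.1.100', '.']
```

Because the IPv4 alternative is listed first, the dotted address is consumed as one match before the punctuation branch gets a chance to split it on the dots.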

7 Comments

Guess I wasn't clear on my requirements, sorry. I need to check every word in doc, and remove any leading/trailing punctuation. Not just IP.
Still no joy. Input sentence: "Malware Analysis Report (MAR) - 10135536-F " and what I expect is ["Malware", "Analysis", "Report", "(", "Mar", ")", "-", "10135536-F"] what I get is: ['Malware', 'Analysis', 'Report', '(MAR)', '-', '10135536-F']. In particular, the () around the MAR should be separate tokens. Thanks for looking!
@GeoffWillis In particular, my code above returns (, Mar, ) as separate tokens. What code are you testing? Are you sure of your requirements? Probably ideone.com/0B3maI is a better solution, but I am not sure now.
Sorry to be so dense, but I don't really understand what you posted entirely. The requirement is to walk a document sentence by sentence, then word by word. For every word, ensure there is no leading/trailing punctuation. I used the IP as an example, but also the (MAR) -> ["(", "MAR", ")"]. I've posted my code, with what I THINK you wanted me to substitute for my regex. Note the two commented-out regexes are yours, and work great, but I was trying to get down to one pass. [link] github.com/GeoffWillis/TaterPy/blob/master/term_freq_vector/…
[link] github.com/GeoffWillis/TaterPy/blob/master/term_freq_vector/… Don't know why this 404's from here; cut-paste in a browser works fine...
