
I am trying to learn python and do text analysis using NLTK at the same time.

I am using python to scrub text before text analysis.

Given the sentence: "The target IP was: 127.1.1.100."

I want to tokenize it into:

["The", "target", "IP", "was", ":","127.1.1.100","."]

It is important I retain all the punctuation so as to reconstruct the source doc, but I need leading/trailing punctuation separated so I can do text analysis on the individual words. I wrote the following Python code, which works fine but seems kinda kludgy.

punct = ['.', ',', ':', ';', '!', '[', ']', '(', ')', '{', '}']

def split_punctuation(sentence) -> list:
    sentwords = sentence.split(" ")
    for i, word in enumerate(sentwords):
        # split off one trailing punctuation character
        if word_ends_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[:-1])
            sentwords.insert(i + 1, word[-1])
            word = word[:-1]
        # split off one leading punctuation character
        if word_starts_with_punct(word) and len(word) > 1:
            sentwords.pop(i)
            sentwords.insert(i, word[0:1])
            sentwords.insert(i + 1, word[1:])
            word = word[1:]
    return sentwords

def word_starts_with_punct(w) -> bool:
    for p in punct:
        if w.startswith(p):
            return True
    return False

def word_ends_with_punct(w) -> bool:
    for p in punct:
        if w.endswith(p):
            return True
    return False
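For comparison, the same leading/trailing split can be written without the pop/insert index juggling by using str.strip (this is an alternative sketch, not the poster's code; the names PUNCT and split_punctuation_alt are made up here). Unlike the loop above, it also peels off multiple punctuation characters on the same side:

```python
# Punctuation to separate off word boundaries (same set as the list above)
PUNCT = '.,:;![](){}'

def split_punctuation_alt(sentence: str) -> list:
    tokens = []
    for word in sentence.split(" "):
        core = word.strip(PUNCT)       # word with boundary punctuation removed
        if not core:                   # the "word" was pure punctuation
            tokens.append(word)
            continue
        start = word.find(core)
        leading, trailing = word[:start], word[start + len(core):]
        # each boundary punctuation character becomes its own token
        tokens.extend(list(leading) + [core] + list(trailing))
    return tokens
```

For example, split_punctuation_alt("The target IP was: 127.1.1.100.") keeps the dotted IP intact while separating the colon and the final period.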

So looking on SO I found a regex by Wiktor Stribiżew that does what I want, kinda:

re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', my_text).strip()

I was able to figure out what's going on, but in this form it separates ALL punctuation, even in the middle of words. For example, it converted today's date from 6/28/2019 to "6 / 28 / 2019".
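A quick check of that behaviour: the character class contains "/" and ".", so punctuation inside a token gets padded with spaces too.

```python
import re

# The regex from above, applied to a date: every run of chars in the class
# is surrounded with spaces, even mid-token.
padded = re.sub(r'[]!"$%&\'()*+,./:;=#@?[\\^_`{|}~-]+', r' \g<0> ', "6/28/2019").strip()
# padded == "6 / 28 / 2019"
```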

So I modified it to use anchors at the beginning/end, but it seems I have to run it twice: once for leading punctuation and again for trailing. That seems rather inefficient, and I was hoping somebody could show me the correct way to accomplish this. The code below is the regex version:

def sep_punct_by_regex(sent) -> list:
    words = sent.split(" ")
    new_words = []
    for w in words:
        tmp1 = re.sub(r'^[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+', r' \g<0> ', w).strip()
        tmp2 = re.sub(r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+$', r' \g<0> ', tmp1).strip()
        for x in tmp2.split(" "):
            new_words.append(x)
    return new_words

Note the ^ in tmp1 and the $ in tmp2. This works as-is, but the goal is to learn while building, so how would I modify the regex for a single pass? I tried the obvious (the ^ up front and the $ at the end in one pattern), but it doesn't work.
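For what it's worth, one way to get a single pass is to combine the two anchored patterns with alternation (|), so one re.sub call substitutes both the leading and the trailing punctuation run. A sketch under that assumption (the names PUNCT_RUN and sep_punct_single_pass are invented here, not from the thread):

```python
import re

# Same character class as in tmp1/tmp2 above; the alternation
# '^...|...$' matches a punctuation run anchored at the start OR the end.
PUNCT_RUN = r'[]!"$/%&\'()*+,.:;=#@?[\\^_`{|}~-]+'

def sep_punct_single_pass(sent: str) -> list:
    new_words = []
    for w in sent.split(" "):
        padded = re.sub(rf'^{PUNCT_RUN}|{PUNCT_RUN}$', r' \g<0> ', w).strip()
        new_words.extend(padded.split(" "))
    return new_words
```

This should behave the same as running tmp1 then tmp2, because re.sub applies every non-overlapping match in one scan, and the two anchored alternatives can never overlap within a word.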

  • You've got 3 layers there: letters, some punctuation, and numbers plus some punctuation. I don't think you can tokenize this easily. You'd have to make an extreme set of rules for this. Commented Jun 28, 2019 at 20:17

1 Answer

You may use

re.findall(r'\b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b|[^\W_]+|(?:[^\w\s]|_)+', s)

See the regex demo

To remove the punctuation at both ends of a string and strip whitespace, use

re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()

So, it will look like

def tokenize(s: str) -> list:
    s = re.sub(r'^[\W_]+|[\W_]+$', '', s).strip()
    octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
    return re.findall(r'\b{0}(?:\.{0}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+'.format(octet), s)

Details

  • \b(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])(?:\.(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])){3}\b - an IPv4 regex pattern
  • | - or
  • [^\W_]+ - one or more letters or digits
  • | - or
  • (?:[^\w\s]|_)+ - one or more chars that are either not word/whitespace chars, or are _.
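As a quick check, running the pattern over the example sentence from the question (the variable names octet and tokens are arbitrary):

```python
import re

s = "The target IP was: 127.1.1.100."
octet = r'(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])'
# IP branch first, then letters/digits, then punctuation runs
tokens = re.findall(rf'\b{octet}(?:\.{octet}){{3}}\b|[^\W_]+|(?:[^\w\s]|_)+', s)
# tokens == ['The', 'target', 'IP', 'was', ':', '127.1.1.100', '.']
```

Because the IPv4 alternative is listed first, the dotted address is consumed as one match before the punctuation branch gets a chance to split it on the dots.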

7 Comments

Guess I wasn't clear on my requirements, sorry. I need to check every word in doc, and remove any leading/trailing punctuation. Not just IP.
Still no joy. Input sentence: "Malware Analysis Report (MAR) - 10135536-F " and what I expect is ["Malware", "Analysis", "Report", "(", "Mar", ")", "-", "10135536-F"] what I get is: ['Malware', 'Analysis', 'Report', '(MAR)', '-', '10135536-F']. In particular, the () around the MAR should be separate tokens. Thanks for looking!
@GeoffWillis In particular, my code above returns (, Mar, ) as separate tokens. What code are you testing? Are you sure of your requirements? Probably ideone.com/0B3maI is a better solution, but I am not sure now.
Sorry to be so dense, but I don't really understand what you posted entirely. The requirement is to walk a document sentence by sentence, then word by word. For every word, ensure there is no leading/trailing punctuation. I used the IP as an example, but also the (MAR) -> ["(", "MAR", ")"]. I've posted my code, with what I THINK you wanted me to substitute for my regex. Note the two commented-out regexes are yours, and work great, but I was trying to get down to one pass. [link] github.com/GeoffWillis/TaterPy/blob/master/term_freq_vector/…
[link] github.com/GeoffWillis/TaterPy/blob/master/term_freq_vector/… Don't know why this 404's from here; cut-paste in a browser works fine...
