I've tried several ways of removing the extra punctuation from the string.

import string

class NLP:

    def __init__(self,sentence):

        self.sentence  = sentence.lower()

        self.tokenList = []


    # problem: the punctuation is still included in the word
    def tokenize(self, sentence):

        for word in sentence.split():
            self.tokenList.append(word)

            for i in string.punctuation:
                if(i in word):
                    word.strip(i)
                    self.tokenList.append(i)

A quick explanation of the code: it is supposed to split each word and punctuation mark apart and store them in a list. But when punctuation is next to a word, it stays attached to the word. Below is an example where a comma remains grouped with the word 'hello':

['hello,' , ',' , 'my' , 'name' , 'is' , 'freddy']
      #^
     #there's the problem
  • Are the types of punctuation marks known beforehand? Do you know, for example, that there are only fullstops and commas? Commented Jan 11, 2015 at 2:35
  • If you mean in the sentence, then no. The input can be anything the user types. If you mean the characters the program should recognize, then yes, they are the ones in string.punctuation. Commented Jan 11, 2015 at 2:45

2 Answers

2

A Python string is immutable. Therefore, word.strip(i) does not "change word in place" as you seem to assume; rather, it returns a copy of word, modified by the .strip(i) operation -- which removes only from the ends of the string, so that's not what you want either (unless you know the punctuation occurs in the word in a peculiar order).
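To make both points concrete, here is a small sketch using the asker's 'hello,' example. The variable names are only for illustration.

```python
word = "hello,"

# str.strip returns a new string; the original is left untouched.
word.strip(",")
unchanged = word                  # still "hello,"

# Binding the return value to a name keeps the stripped copy.
stripped = word.strip(",")        # "hello"

# strip only trims the ends, so punctuation inside the word survives.
interior = "don't,".strip("',")   # "don't"

print(unchanged, stripped, interior)
```

The apostrophe in "don't" is kept even though it is in the set of characters passed to strip, because strip stops at the first character from each end that is not in that set.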

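def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany:
                continue
            # str.replace returns a new string, so rebind word to keep the result
            word = word.replace(i, '')
            punc.extend(howmany * [i])
        # skip the empty string left over when the word was only punctuation
        if word:
            self.tokenList.append(word)
        self.tokenList.extend(punc)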

This assumes it's OK to have all the punctuation, one per item, after the cleaned-up word, independently of where within the word the punctuation appeared.

For example, should the sentence be (here), the list would be ['here', '(', ')'].

If there are stricter constraints on the ordering of things in the list, please edit your Q to express them clearly -- ideally with examples of desired input and output, too!
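In case the punctuation does need to stay where it appeared, one possible sketch is to scan the sentence character by character. This is only an illustration under that assumption, not something the question asked for; the function name tokenize_in_order is hypothetical, and it is written as a plain function rather than a method of the NLP class.

```python
import string

def tokenize_in_order(sentence):
    """Split into words and single punctuation marks, keeping their order."""
    tokens = []
    current = []  # characters of the word currently being built
    for ch in sentence:
        if ch in string.punctuation or ch.isspace():
            # A separator ends the current word, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
            # Punctuation becomes its own token; whitespace is dropped.
            if ch in string.punctuation:
                tokens.append(ch)
        else:
            current.append(ch)
    # Flush the last word if the sentence did not end with a separator.
    if current:
        tokens.append("".join(current))
    return tokens

print(tokenize_in_order("(here)"))
print(tokenize_in_order("hello, my name is freddy"))
```

With this approach "(here)" yields the opening parenthesis before the word and the closing one after it, at the cost of also splitting words such as "don't" at the apostrophe.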


1

I'd suggest a different approach:

import string
import itertools

def tokenize(s):
    tokens = []
    for k,v in itertools.groupby(s, lambda c: c in string.punctuation):
        tokens.extend("".join(v).split())
    return tokens

A test:

>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
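One behavior worth knowing about the groupby approach: consecutive punctuation characters fall into the same group, so they come out as a single token. The sketch below repeats the function above under the name tokenize_runs and adds a hypothetical variant, tokenize_single, that emits one token per punctuation character. Both names are made up for this comparison.

```python
import itertools
import string

def tokenize_runs(s):
    """Same logic as the answer above: a run of punctuation is one token."""
    tokens = []
    for _, group in itertools.groupby(s, lambda c: c in string.punctuation):
        tokens.extend("".join(group).split())
    return tokens

def tokenize_single(s):
    """Hypothetical variant: each punctuation character is its own token."""
    tokens = []
    for is_punct, group in itertools.groupby(s, lambda c: c in string.punctuation):
        if is_punct:
            tokens.extend(group)  # one token per character in the run
        else:
            tokens.extend("".join(group).split())
    return tokens

print(tokenize_runs("wait... (really)?"))
# ['wait', '...', '(', 'really', ')?']
print(tokenize_single("wait... (really)?"))
# ['wait', '.', '.', '.', '(', 'really', ')', '?']
```

Which of the two is preferable depends on whether something like an ellipsis should count as one token or three.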

2 Comments

@Freddy-FazBear I'm glad it works for you. Note that in your code, you're not assigning the result of word.strip(i) to anything. That's why the character is never removed from the string.
oh, never realized that. Thanks :)
