Python Not Removing Char From String

Question

I've tried multiple times, in multiple ways, to remove the extra punctuation from the string.

import string

class NLP:

    def __init__(self,sentence):

        self.sentence  = sentence.lower()

        self.tokenList = []


    # problem: the punctuation is still included in the word
    def tokenize(self, sentence):

        for word in sentence.split():
            self.tokenList.append(word)

            for i in string.punctuation:
                if(i in word):
                    word.strip(i)
                    self.tokenList.append(i)

A quick explanation of the code: it is supposed to split the sentence into words and punctuation marks and store each of them in a list. But when punctuation is next to a word, it stays with the word. Below is an example where a comma remains attached to the word 'hello':

['hello,' , ',' , 'my' , 'name' , 'is' , 'freddy']
      #^
     #there's the problem

Are the types of punctuation marks known beforehand? Do you know, for example, that there are only full stops and commas?
– VHarisop, Commented Jan 11, 2015 at 2:35
If you mean in the sentence, then no: the input can be anything the user types. If you mean the program identifying the character, then yes, it is anything in string.punctuation.
– Freddy-FazBear, Commented Jan 11, 2015 at 2:45

Alex Martelli · Accepted Answer · 2015-01-11 03:01:58Z

A Python string is immutable. Therefore, word.strip(i) does not "change word in place" as you seem to assume; rather, it returns a copy of word, modified by the .strip(i) operation. That operation removes characters only from the ends of the string, so it's not what you want either (unless you know the punctuation occurs in the word in a peculiar order).
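To make both points concrete, here is a minimal sketch. The variable names are only illustrative and are not part of the original code.

```python
word = "hello,"

# strip() returns a new string; calling it without using the result changes nothing.
word.strip(",")
print(word)                 # hello,

# Rebinding the name is what keeps the change.
stripped = word.strip(",")
print(stripped)             # hello

# strip() only trims the ends, so punctuation inside the word survives.
inner = "it's".strip("'")
print(inner)                # it's

# replace() removes every occurrence, wherever it sits.
replaced = "it's".replace("'", "")
print(replaced)             # its
```

Since interior punctuation is the remaining problem once the result is rebound, the rewrite of tokenize below uses replace() rather than strip():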

def tokenize(self, sentence):
    for word in sentence.split():
        punc = []
        for i in string.punctuation:
            howmany = word.count(i)
            if not howmany: continue
            word = word.replace(i, '')
            punc.extend(howmany*[i])
        self.tokenList.append(word)
        self.tokenList.extend(punc)

This assumes it's OK to have all the punctuation, one per item, after the cleaned-up word, independently of where within the word the punctuation appeared.

For example, should the sentence be (here), the list would be ['here', '(', ')'].
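For trying this out without the NLP class, the same logic can be sketched as a standalone function. This is an adaptation, not the code above verbatim: it returns a list instead of appending to self.tokenList, it lowercases the input as the question's __init__ does, and it skips a word that becomes empty once its punctuation is removed (an assumption on my part; the method above would append an empty string in that case).

```python
import string

def tokenize(sentence):
    """Split on whitespace, then list each word followed by its punctuation."""
    tokens = []
    for word in sentence.lower().split():
        punc = []
        for ch in string.punctuation:
            count = word.count(ch)
            if not count:
                continue
            word = word.replace(ch, "")  # rebind: replace() returns a new string
            punc.extend([ch] * count)
        if word:  # assumption: drop tokens that consisted only of punctuation
            tokens.append(word)
        tokens.extend(punc)
    return tokens

print(tokenize("Hello, my name is Freddy"))
# ['hello', ',', 'my', 'name', 'is', 'freddy']
print(tokenize("(here)"))
# ['here', '(', ')']
```

Note that the punctuation comes out in the order it appears in string.punctuation, not the order it appears in the word.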

If there are stricter constraints on the ordering of things in the list, please edit your Q to express them clearly -- ideally with examples of desired input and output, too!

jme · Accepted Answer · 2015-01-11 02:53:13Z


I'd suggest a different approach:

import string
import itertools

def tokenize(s):
    tokens = []
    for k,v in itertools.groupby(s, lambda c: c in string.punctuation):
        tokens.extend("".join(v).split())
    return tokens

A test:

>>> tokenize("this is, a test, you know")
['this', 'is', ',', 'a', 'test', ',', 'you', 'know']
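To see why this works, it may help to look at the runs that groupby produces before they are split on whitespace. The helper name punctuation_runs is hypothetical and exists only for this illustration.

```python
import itertools
import string

def punctuation_runs(s):
    """Return (is_punctuation, text) pairs, one per consecutive run of characters."""
    return [
        (is_punct, "".join(chars))
        for is_punct, chars in itertools.groupby(s, lambda c: c in string.punctuation)
    ]

print(punctuation_runs("is, a test"))
# [(False, 'is'), (True, ','), (False, ' a test')]
print(punctuation_runs("wait... what?"))
# [(False, 'wait'), (True, '...'), (False, ' what'), (True, '?')]
```

Each non-punctuation run still contains spaces, which is why tokenize calls split() on it. One consequence of grouping by runs is that adjacent punctuation characters, such as '...', stay together as a single token.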

2 Comments

jme Over a year ago

@Freddy-FazBear I'm glad it works for you. Note that in your code, you're not assigning the result of word.strip(i) to anything. That's why the character is never removed from the string.

Freddy-FazBear Over a year ago

oh, never realized that. Thanks :)
