4

I have a log file that is full of tweets. Each tweet is on its own line so that I can iterate through the file easily.

An example tweet would be like this:

@ sample This is a sample string $ 1.00 # sample

I want to clean this up a bit by removing the white space between each special character and the alphanumeric character that follows it, i.e. the spaces in "@ s", "$ 1", and "# s".

So that it would look like this:

@sample This is a sample string $1.00 #sample

I'm trying to use regular expressions to match these instances because the text around them varies, but I am unsure how to go about it.

I've been using re.sub() and re.search() to find the instances, but am struggling to figure out how to only remove the white space while leaving the string intact.

Here is the code I have so far:

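#!/usr/bin/python

import csv
import re

# The output file is opened here, but nothing is written to it yet.
f = open('output.csv', 'w')

with open('retweet.csv', newline='') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        if not row:
            continue  # skip blank lines
        a = row[0]
        # Raw string so the regex escapes reach the re module unchanged.
        matchObj = re.search(r"\W\s\w", a)
        # re.search() returns None when the row has no match.
        if matchObj:
            print(matchObj.group())

f.close()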

Thanks for any help!

3 Answers

5

Something like this using re.sub:

>>> import re
>>> strs = "@ sample This is a sample string $ 1.00 # sample"
>>> re.sub(r'([@#$])(\s+)([a-z0-9])', r'\1\3', strs, flags=re.I)
'@sample This is a sample string $1.00 #sample'
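To run the same substitution over every tweet, one option is to compile the pattern once and call it per line. This is only a minimal sketch: the helper name `tighten_sigils` and the sample list are hypothetical stand-ins for the lines read from the log file.

```python
import re

# Same three-group pattern as above, compiled once.
# re.I lets [a-z] match uppercase letters as well.
SIGIL_GAP = re.compile(r'([@#$])(\s+)([a-z0-9])', flags=re.I)

def tighten_sigils(line):
    """Drop the whitespace between @, # or $ and the alphanumeric after it."""
    # Keep group 1 (the symbol) and group 3 (the character), drop group 2.
    return SIGIL_GAP.sub(r'\1\3', line)

# Hypothetical input; in practice these would be the lines of the log file.
tweets = [
    "@ sample This is a sample string $ 1.00 # sample",
    "no special characters here",
]
cleaned = [tighten_sigils(t) for t in tweets]
for line in cleaned:
    print(line)
```

Lines without a match come back unchanged, so no separate check with re.search() is needed first.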

4 Comments

This worked great, thank you very much! Would you mind explaining the r'\1\3' and flags=re.I?
@Josh \1 and \3 refer to captured groups 1 and 3; we dropped \2 because you didn't want any spaces. re.I makes the match case-insensitive.
You could also have done r'([@#$])(?\s+)([a-z0-9])' to make the second group non-capturing (notice the "?" before \s+). In this case, you would have replaced with r'\1\2' instead.
@SethMMorton I think the correct syntax to make a group non-capturing is ?: rather than ?, i.e. (?:\s+).
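For reference, here is a sketch of the non-capturing variant discussed in these comments. With (?:...) the whitespace group gets no number, so the replacement becomes r'\1\2'. The second call is a different technique, lookarounds, shown only as an alternative that needs no groups at all; the variable names are illustrative.

```python
import re

strs = "@ sample This is a sample string $ 1.00 # sample"

# (?:\s+) groups the whitespace without numbering it,
# so the alphanumeric character is group 2 instead of group 3.
non_capturing = re.sub(r'([@#$])(?:\s+)([a-z0-9])', r'\1\2', strs, flags=re.I)

# Lookbehind/lookahead alternative: only the whitespace itself is matched,
# so it can be replaced with an empty string.
lookaround = re.sub(r'(?<=[@#$])\s+(?=[a-z0-9])', '', strs, flags=re.I)

print(non_capturing)  # @sample This is a sample string $1.00 #sample
print(lookaround)     # @sample This is a sample string $1.00 #sample
```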
1
>>> re.sub("([@$#]) ", r"\1", "@ sample This is a sample string $ 1.00 # sample")
'@sample This is a sample string $1.00 #sample'
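One thing to keep in mind with this shorter pattern: it matches exactly one literal space and does not look at what follows it, so runs of spaces or tabs are only partly removed. A small sketch with made-up input (the string `messy` is not from the question) shows the difference from a \s+ version:

```python
import re

# Hypothetical input: two spaces after "@", a tab after "$".
messy = "@  sample costs $\t1.00"

# Literal single space: removes one space after "@" and ignores the tab.
single_space = re.sub("([@$#]) ", r"\1", messy)

# \s+ removes any run of whitespace after the symbol.
any_whitespace = re.sub(r"([@$#])\s+", r"\1", messy)

print(repr(single_space))
print(repr(any_whitespace))
```

If the log file only ever contains a single space after the symbol, the shorter pattern is enough.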


0

This seemed to work pretty well:

print(re.sub(r'([@$])\s+', r'\1', '@ blah $ 1'))

