Search and replace using regular expressions in Python

Question

I have a log file that is full of tweets. Each tweet is on its own line so that I can iterate through the file easily.

An example tweet would be like this:

@ sample This is a sample string $ 1.00 # sample

I want to clean this up by removing the whitespace between each special character and the alphanumeric character that follows it, as in "@ s", "$ 1", and "# s".

So that it would look like this:

@sample This is a sample string $1.00 #sample

I'm trying to use regular expressions to match these instances because the text around them varies, but I am unsure how to go about it.

I've been using re.sub() and re.search() to find the instances, but I'm struggling to figure out how to remove only the whitespace while leaving the rest of the string intact.

Here is the code I have so far:

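#!/usr/bin/python

import csv
import re

# opened for the cleaned output; nothing is written to it yet
f = open('output.csv', 'w')

# newline='' is what the csv module expects for files opened in text mode
with open('retweet.csv', newline='') as inputfile:
    read = csv.reader(inputfile, delimiter=',')
    for row in read:
        a = row[0]
        # a non-word character, one whitespace character, then a word character
        matchObj = re.search(r"\W\s\w", a)
        if matchObj:  # re.search returns None when nothing matches
            print(matchObj.group())

f.close()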

Thanks for any help!

Ashwini Chaudhary · Accepted Answer · 2013-10-23 18:20:08Z

5

You can do this with re.sub by capturing the symbol and the following character and leaving the whitespace out of the replacement:

>>> import re
>>> strs = "@ sample This is a sample string $ 1.00 # sample"
>>> re.sub(r'([@#$])(\s+)([a-z0-9])', r'\1\3', strs, flags=re.I)
'@sample This is a sample string $1.00 #sample'
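To apply the same substitution to every tweet, the pattern can be compiled once and reused per line. The following is only a sketch: the helper name clean_tweet and the in-memory list standing in for the log file are assumptions for illustration, not part of the original answer.

```python
import re

# Same pattern as in the answer above, compiled once so it can be reused.
SYMBOL_GAP = re.compile(r'([@#$])(\s+)([a-z0-9])', flags=re.I)

def clean_tweet(line):
    """Drop the whitespace between @, # or $ and the alphanumeric character after it."""
    # Keep group 1 (the symbol) and group 3 (the next character); omit group 2 (the whitespace).
    return SYMBOL_GAP.sub(r'\1\3', line)

# Hypothetical stand-in for the lines read from the log file.
tweets = [
    "@ sample This is a sample string $ 1.00 # sample",
    "no symbols here",
]
cleaned = [clean_tweet(t) for t in tweets]
print(cleaned)
```

Lines without a matching symbol pass through unchanged, so the helper can be called on every row without checking for a match first.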


4 Comments

Josh Over a year ago

This worked great, thank you very much! Would you mind explaining the r'\1\3' and flags=re.I?

Ashwini Chaudhary Over a year ago

@Josh \1 and \3 refer to captured groups 1 and 3; \2 is left out of the replacement because you didn't want the spaces. re.I makes the match case-insensitive.
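As a small illustration of that explanation, the snippet below inspects the three groups for the first match and shows what re.I changes. The variable names are chosen here for the example only.

```python
import re

strs = "@ sample This is a sample string $ 1.00 # sample"
pattern = re.compile(r'([@#$])(\s+)([a-z0-9])', flags=re.I)

# Groups of the first match: the symbol, the whitespace, the following character.
first = pattern.search(strs)
print(first.groups())  # ('@', ' ', 's')

# With re.I, [a-z0-9] also matches uppercase letters.
with_flag = pattern.sub(r'\1\3', "# Sample")
# Without re.I, the uppercase 'S' does not match, so nothing is replaced.
without_flag = re.sub(r'([@#$])(\s+)([a-z0-9])', r'\1\3', "# Sample")
print(with_flag, '|', without_flag)
```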

SethMMorton Over a year ago

You could also have done r'([@#$])(?\s+)([a-z0-9])' to make the second group non-capturing (notice the "?" before \s+). In this case, you would have replaced with r'\1\2' instead.

Ashwini Chaudhary Over a year ago

@SethMMorton I think the correct syntax to make a group non-capturing is ?:.
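A quick check of the two spellings discussed in these comments, as a sketch: with (?:\s+) the whitespace group is non-capturing, so the character after it becomes group 2 and the replacement is r'\1\2'. The (?\s+) form is rejected by the re module.

```python
import re

strs = "@ sample This is a sample string $ 1.00 # sample"

# (?:...) groups without capturing, so only two numbered groups remain.
result = re.sub(r'([@#$])(?:\s+)([a-z0-9])', r'\1\2', strs, flags=re.I)
print(result)

# "(?" followed by "\s" is not a valid group extension and fails to compile.
error_raised = False
try:
    re.compile(r'([@#$])(?\s+)([a-z0-9])')
except re.error:
    error_raised = True
print(error_raised)
```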

damienfrancois · Answer · 2013-10-23 18:25:35Z

1

>>> import re
>>> re.sub(r"([@$#]) ", r"\1", "@ sample This is a sample string $ 1.00 # sample")
'@sample This is a sample string $1.00 #sample'
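This pattern matches exactly one literal space after the symbol. The sketch below compares it with a \s+ variant on a made-up input containing a double space and a tab; the input string is an assumption for illustration, not taken from the question.

```python
import re

# Hypothetical input: two spaces after '@', a tab after '$'.
s = "@  sample $\t1.00"

# Literal single space: removes only one space after '@' and ignores the tab.
single_space = re.sub(r"([@$#]) ", r"\1", s)

# \s+ removes any run of whitespace (spaces, tabs, ...) after the symbol.
any_whitespace = re.sub(r"([@$#])\s+", r"\1", s)

print(repr(single_space))
print(repr(any_whitespace))
```

If the log only ever contains a single space after each symbol, the shorter pattern is enough; otherwise \s+ is the more tolerant choice.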


stonefury · Answer · answered 2013-10-23 18:23, edited by Ashwini Chaudhary 2013-10-24 11:09:32Z

0

This seemed to work pretty nicely:

print(re.sub(r'([@$])\s+', r'\1', '@ blah $ 1'))
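Note that the character class in this answer is [@$], so a '#' followed by whitespace is left as it is. The sketch below runs it on the tweet from the question and then with '#' added to the class; the variable names are assumptions for illustration.

```python
import re

tweet = "@ sample This is a sample string $ 1.00 # sample"

# As written in the answer: only '@' and '$' are handled.
without_hash = re.sub(r'([@$])\s+', r'\1', tweet)

# Adding '#' to the character class covers all three symbols from the question.
with_hash = re.sub(r'([@$#])\s+', r'\1', tweet)

print(without_hash)
print(with_hash)
```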
