How to replace a pattern using regex in python?

Question

I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be changed to the class they belong to. I noticed that in the dataset, each new class in the list is denoted by a ‘###’. So I can split the data set into blocks by ‘###’ and count the instances of ###. Then use regex to look for the names, and replace them by the count of ###.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

This doesn’t seem to do the job - no replacements are made.

Possible duplicate of Python string.replace regular expression
– Fermat's Little Student, Commented Mar 25, 2017 at 19:53
If you're actually using curly quotes, that's not valid Python syntax. Are you programming in Word or something?
– jonrsharpe, Commented Mar 25, 2017 at 19:54
what does this mean? Oh no sorry, I copied my code from a text file to here. Silly
– Python_Newbie_2, Commented Mar 25, 2017 at 19:55
Hi will - the answers on the posts you suggested are not particularly helpful for me
– Python_Newbie_2, Commented Mar 25, 2017 at 19:59

DahliaSR · Accepted Answer · 2017-03-26 02:22:05Z

1

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

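The error happens because match is a list, not a string. When I inspect this list in PDB, it's empty. That isn't a scoping problem: once the first for-loop finishes, match only holds the result for the last line it processed (### in the sample data), and that line has no name to capture. The second loop then reuses that one stale value for every line. So let's combine the two loops: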

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: the return type of re.findall is a list of strings. str.replace(...) expects a single string as its first argument.
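To see the type mismatch in isolation, here is a small sketch using one sample line from the question. The variable names (no_match, raised, fixed) are only for illustration.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Female  Name=Alice.1;'

# re.findall returns a list of the captured groups, even for a single hit.
match = re.findall(pattern, line)
print(match)

# A line without a name gives an empty list rather than None.
no_match = re.findall(pattern, '###')
print(no_match)

# Passing the list itself to str.replace raises TypeError.
try:
    line.replace(match, 'Class_1')
    raised = False
except TypeError:
    raised = True

# Passing one element of the list works.
fixed = line.replace(match[0], 'Class_1')
print(fixed)
```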

You could cheat, and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))) -- but that presumes that you're sure you're going to find a regular expression match on every line that isn't ###. A more resilient way would be to check to see that you have the match before you try to call str.replace() on it.

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

On line 11, you mistook the variable name. It's triple_hash_count, not hash_count.
This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to the file, not just print it.
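One way to write the result out instead of printing it is sketched below. The helper names relabel and rewrite_file, and the idea of writing to a separate destination path, are my own assumptions and not part of the original code. It keeps the pattern from the question, so a trailing '.1' or '.2' is still swallowed by the replacement.

```python
import re

PATTERN = r'Name=(.*?)[;/]'   # pattern from the question
PREFIX = 'Class_'

def relabel(lines):
    """Return a new list of lines with each name replaced by its class label."""
    count = 1
    out = []
    for line in lines:
        if line == '###':
            # A '###' line closes the current class.
            count += 1
            out.append(line)
            continue
        match = re.findall(PATTERN, line)
        # Leave lines without a name untouched.
        out.append(line.replace(match[0], PREFIX + str(count)) if match else line)
    return out

def rewrite_file(src, dst):
    """Read src, relabel its lines, and write them to dst (hypothetical paths)."""
    with open(src) as f:
        lines = [raw.strip() for raw in f]
    with open(dst, 'w') as f:
        f.write('\n'.join(relabel(lines)) + '\n')
```

Writing to a second file rather than overwriting the input keeps the original data around in case the pattern needs adjusting.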

edited Mar 26, 2017 at 2:22

DahliaSR

776 bronze badges

answered Mar 25, 2017 at 20:24

Matthew Cole

5675 silver badges25 bronze badges

4 Comments

DahliaSR Over a year ago

This solution also replaces the '.1' and so on

Paul Back Over a year ago

Your answer is correct, but OP's regex needed a tweak to address the lines with the trailing '.1', '.2', etc

Matthew Cole Over a year ago

@PaulBack I noticed that you put that change in your post. But I'd recommend pattern = r'Name=([^\.\d;]*)' so that it doesn't ingest the period between the name and the uniqueness counter.

Paul Back Over a year ago

Nice catch. I made the change.

Paul Back · Accepted Answer · 2017-03-25 20:54:41Z

1

The problem is rooted in the use of a second loop (as well as a mis-named variable). This will work.

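import re

# Read the file and strip surrounding whitespace from every line.
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]

# Capture the name up to (but not including) a '.', a digit, or ';'.
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    if line == '###':
        # A '###' line closes the current class.
        triple_hash_count += 1
        print(line)
    else:
        match = re.findall(pattern, line)
        print(line.replace(match[0], prefix + str(triple_hash_count)))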

edited Mar 25, 2017 at 20:54

answered Mar 25, 2017 at 20:32

Paul Back

1,31916 silver badges25 bronze badges

1 Comment

DahliaSR Over a year ago

the \d and * in the regex [^\.\d;]* are not required! This: r'=(.*?)[\.;]' does all it needs
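To compare the patterns discussed in this thread side by side, here is a short sketch run on two sample lines from the question. The dictionary keys are just labels I made up, and 'Class_1' is hard-coded for brevity.

```python
import re

lines = ['Male    Name=Tony;', 'Female  Name=Alice.1;']

patterns = {
    'question': r'Name=(.*?)[;/]',       # captures 'Alice.1', so the suffix is lost
    'char_class': r'Name=([^\.\d;]*)',   # stops before '.', a digit, or ';'
    'lazy': r'=(.*?)[\.;]',              # lazy match up to the first '.' or ';'
}

results = {}
for label, pattern in patterns.items():
    results[label] = [
        line.replace(re.findall(pattern, line)[0], 'Class_1')
        for line in lines
    ]
    print(label, results[label])
```

On these two lines the second and third patterns give the same result and keep the '.1' suffix; the pattern from the question does not.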

Jan · Accepted Answer · 2017-03-25 21:52:12Z

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner but this is not very readable):

import re
hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

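# `string` is assumed to hold the full text of the input file.
# The pattern consumes the trailing ';', so the replacement puts it back.
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(hashrx.split(string))])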
print(new_string)

What it does:

First, it looks for ### in a single line with the anchors ^ and $ in MULTILINE mode.
Second, it looks for a possible number after the Name, capturing it into group 1 (but made optional as not all of your names have it).
Third, it splits your string by ### and iterates over it with enumerate(), thus having a counter for the numbers to be inserted.
Lastly, it joins the resulting list by ### again.
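The four steps above can be tried on a small in-memory sample, as in the sketch below. The sample text is a shortened version of the data in the question, and the helper relabel is a hypothetical name. It uses a replacement function rather than a \1 template, which is a substitution on my part: the function writes the trailing ';' back explicitly (the pattern consumes it) and handles names without a numeric suffix.

```python
import re

# Shortened sample from the question; in practice this would be the file's contents.
text = """Male    Name=Tony;
Female  Name=Alice.1;
###
Female  Name=Alex.2;
Male    Name=James;
###
"""

hashrx = re.compile(r'^###$', re.MULTILINE)   # '###' on a line by itself
namerx = re.compile(r'Name=\w+(\.\d+)?;')     # optional '.N' suffix in group 1

def relabel(part, class_no):
    """Replace every name in one block with its class label."""
    def repl(m):
        suffix = m.group(1) or ''             # group 1 is None when there is no suffix
        return 'Name=Class_{}{};'.format(class_no, suffix)
    return namerx.sub(repl, part)

# Split into blocks, relabel each with its 1-based index, and join again.
parts = hashrx.split(text)
new_text = '###'.join(relabel(part, idx + 1) for idx, part in enumerate(parts))
print(new_text)
```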

As a one-liner (though not advisable):

# The pattern consumes the trailing ';', so the replacement puts it back.
new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

Demo

A demo says more than a thousand words.
