I have a dataset that looks like this:

Male    Name=Tony;  
Female  Name=Alice.1; 
Female  Name=Alice.2;
Male    Name=Ben; 
Male    Name=Shankar; 
Male    Name=Bala; 
Female  Name=Nina; 
###
Female  Name=Alex.1; 
Female  Name=Alex.2;
Male    Name=James; 
Male    Name=Graham; 
Female  Name=Smith;  
###
Female  Name=Xing;
Female  Name=Flora;
Male    Name=Steve.1;
Male    Name=Steve.2; 
Female  Name=Zac;  
###

I want to change the list so it looks like this:

Male    Name=Class_1;
Female  Name=Class_1.1;
Female  Name=Class_1.2;
Male    Name=Class_1;
Male    Name=Class_1;
Male    Name=Class_1; 
Female  Name=Class_1;
###
Female  Name=Class_2.1; 
Female  Name=Class_2.2; 
Male    Name=Class_2; 
Male    Name=Class_2; 
Female  Name=Class_2;  
###
Female  Name=Class_3; 
Female  Name=Class_3; 
Male    Name=Class_3.1; 
Male    Name=Class_3.2; 
Female  Name=Class_3;
###

Each name has to be replaced by the class it belongs to. In the dataset, the end of each class is marked by a ‘###’ line, so I can split the dataset into blocks on ‘###’ and count the ### lines seen so far. Then I can use a regex to find the names and replace each one with that count.

My code looks like this:

blocks = [b.strip() for b in open('/file', 'r').readlines()]
pattern = r'Name=(.*?)[;/]'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    match = re.findall(pattern, line)
    print match

for line in blocks:
    if line == '###':
        triple_hash_count += 1
        print line 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count))) 

This doesn’t seem to do the job - no replacements are made.

  • Possible duplicate of Python string.replace regular expression Commented Mar 25, 2017 at 19:53
  • If you're actually using curly quotes, that's not valid Python syntax. Are you programming in Word or something? Commented Mar 25, 2017 at 19:54
  • what does this mean? Oh no sorry, I copied my code from a text file to here. Silly Commented Mar 25, 2017 at 19:55
  • Hi will - the answers on the posts you suggested are not particularly helpful for me Commented Mar 25, 2017 at 19:59
  • Close your file! Commented Mar 25, 2017 at 22:05

3 Answers

When running the code you provided, I got the following traceback output:

print(line.replace(match, prefix + str(triple_hash_count))) 
TypeError: Can't convert 'list' object to str implicitly

The error happens because match is a list, not a string. When I inspect this list in PDB, it's empty. That comes from having two for-loops: by the time the second loop runs, match only holds the result for the last line of the file, which is ### and has no Name= field. So let's combine them as such:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else: 
        print(line.replace(match, prefix + str(triple_hash_count)))

Now you're getting content in match, but there's still a problem: the return type of re.findall is a list of strings. str.replace(...) expects a single string as its first argument.
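To see the type mismatch in isolation, here is a small sketch using the question's pattern on one of its sample lines. The variable names are only for illustration.

```python
import re

pattern = r'Name=(.*?)[;/]'
line = 'Female  Name=Alice.1;'

match = re.findall(pattern, line)       # a list of captured strings
no_match = re.findall(pattern, '###')   # no Name= field, so an empty list

try:
    line.replace(match, 'Class_1')      # passing the list itself fails
    error = None
except TypeError as exc:
    error = exc

replaced = line.replace(match[0], 'Class_1')  # passing one string works
print(match, no_match, replaced)
```

Note that with this pattern the captured string includes the '.1', so the suffix disappears in the replacement.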

You could cheat, and change the offending line to print(line.replace(match[0], prefix + str(triple_hash_count))) -- but that presumes that you're sure you're going to find a regular expression match on every line that isn't ###. A more resilient way would be to check to see that you have the match before you try to call str.replace() on it.

The final code looks like this:

for line in blocks:
    match = re.findall(pattern, line)
    print(match)

    if line == '###':
        triple_hash_count += 1
        print(line) 
    else:
        if match: 
            print(line.replace(match[0], prefix + str(triple_hash_count)))
        else:
            print(line)

Two more things:

  1. On line 11, you mistook the variable name. It's triple_hash_count, not hash_count.
  2. This code won't actually change the text file provided as input on line 1. You need to write the result of line.replace(match[0], prefix + str(triple_hash_count)) back to a file, not just print it.
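
For the second point, something along these lines might do it. This is only a sketch: relabel and relabel_file are names I made up, the output goes to a separate file rather than over the input, and I've swapped in a pattern that captures only the bare name so a '.1' suffix survives.

```python
import re

# Assumption: a name runs from 'Name=' up to the first '.' or ';'.
NAME_RX = re.compile(r'(Name=)[^.;]+')


def relabel(lines, prefix='Class_'):
    """Return the lines with each name replaced by its class label."""
    count = 1
    out = []
    for line in lines:
        line = line.strip()
        if line == '###':
            # A '###' line closes the current class.
            count += 1
            out.append(line)
        else:
            # \g<1> re-inserts 'Name='; lines without a name pass through.
            out.append(NAME_RX.sub(r'\g<1>' + prefix + str(count), line))
    return out


def relabel_file(src_path, dst_path):
    """Read src_path, relabel it, and write the result to dst_path."""
    with open(src_path) as src:
        new_lines = relabel(src)
    with open(dst_path, 'w') as dst:
        dst.write('\n'.join(new_lines) + '\n')


# Quick in-memory check with lines shaped like the question's data.
demo = relabel(['Male    Name=Tony;  ', 'Female  Name=Alice.1; ',
                '###', 'Female  Name=Alex.2;'])
print('\n'.join(demo))
```

You would then call relabel_file with your own input and output paths.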

4 Comments

This solution also replaces the '.1' and so on
Your answer is correct, but OP's regex needed a tweak to address the lines with the trailing '.1', '.2', etc
@PaulBack I noticed that you put that change in your post. But I'd recommend pattern = r'Name=([^\.\d;]*)' so that it doesn't ingest the period between the name and the uniqueness counter.
Nice catch. I made the change.

The problem is rooted in the use of a second loop (as well as a mis-named variable). This will work.

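import re

# Read the file and strip each line; 'with' closes the file afterwards.
with open('/file', 'r') as f:
    blocks = [b.strip() for b in f]
pattern = r'Name=([^\.\d;]*)'
prefix = 'Class_'
triple_hash_count = 1

for line in blocks:
    if line == '###':
        # A '###' line ends the current class.
        triple_hash_count += 1
        print(line)
    else:
        # Assumes every non-### line contains a Name= field.
        match = re.findall(pattern, line)
        # Only the bare name is replaced, so a '.1' or '.2' suffix is kept.
        print(line.replace(match[0], prefix + str(triple_hash_count)))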
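
To illustrate why the pattern stops at the period, here is a quick comparison of the question's pattern and the one used above on a suffixed name. It's just a sketch and the variable names are mine.

```python
import re

line = 'Female  Name=Alice.1;'

# The question's pattern captures everything up to the ';'.
full_name = re.findall(r'Name=(.*?)[;/]', line)[0]
# This pattern stops at the first period, digit, or ';'.
bare_name = re.findall(r'Name=([^\.\d;]*)', line)[0]

suffix_lost = line.replace(full_name, 'Class_1')
suffix_kept = line.replace(bare_name, 'Class_1')
print(full_name, bare_name)
print(suffix_lost)
print(suffix_kept)
```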

1 Comment

the \d and * in the regex [^\.\d;]* are not required! This: r'=(.*?)[\.;]' does all it needs

While you already have your answer, you can do it in just a couple of lines with regular expressions (it could even be a one-liner but this is not very readable):

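import re

# Read the whole file into one string so it can be split on '###'.
with open('/file', 'r') as f:
    string = f.read()

hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# The pattern consumes the trailing ';', so the replacement puts it back.
new_string = '###'.join([namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(hashrx.split(string))])
print(new_string)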

What it does:

  1. First, it looks for ### in a single line with the anchors ^ and $ in MULTILINE mode.
  2. Second, it looks for a possible number after the Name, capturing it into group 1 (but made optional as not all of your names have it).
  3. Third, it splits your string by ### and iterates over it with enumerate(), thus having a counter for the numbers to be inserted.
  4. Lastly, it joins the resulting list by ### again.
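
The steps above can be traced on a tiny made-up sample in the same shape as your data. This sketch assumes Python 3.5 or later, where an optional group that did not match is substituted as an empty string. Note the ';' at the end of the replacement, since the pattern consumes it.

```python
import re

hashrx = re.compile(r'^###$', re.MULTILINE)
namerx = re.compile(r'Name=\w+(\.\d+)?;')

# Made-up sample, not the full dataset from the question.
sample = (
    "Male    Name=Tony;\n"
    "Female  Name=Alice.1;\n"
    "###\n"
    "Female  Name=Alex.2;\n"
    "###\n"
)

# Split into blocks; the text after the last '###' becomes its own part.
parts = hashrx.split(sample)

# Number each block and keep any '.N' suffix through group 1.
renamed = [namerx.sub(r"Name=Class_{}\1;".format(idx + 1), part)
           for idx, part in enumerate(parts)]

# Stitch the blocks back together with the separator.
result = '###'.join(renamed)
print(result)
```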

As a one-liner (though not advisable):

new_string = '###'.join(
                [re.sub(r'Name=\w+(\.\d+)?;', r"Name=Class_{}\1;".format(idx + 1), part)
                for idx, part in enumerate(re.split(r'^###$', string, flags=re.MULTILINE))])

