3

I have a list of strings holding the line-by-line history of a file. I need to split each element of the list into several columns and save the result to a CSV file.

The columns I need are commit_id, filename, committer, date, time, line_number, code. Suppose this is my list:

my_list = [
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   1) /**',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   2)  *',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   3)  * Licensed to the Apache Software Foundation (ASF) under one',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   4)  * or more contributor license agreements.',
     ...
     'b5cf8748198 master/ActiveMasterManager.java              (Michael Stack      2012-09-27 05:40:09 +0000 160)           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
     ]

The desired CSV output:

commit_id   | filename                         | committer     | date       | time     | line_number | code 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 1           | /**
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 2           | *
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 3           | * Licensed to the Apache Software Foundation (ASF) under one
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 4           | * or more contributor license agreements.
........
b5cf8748198 | master/ActiveMasterManager.java  | Michael Stack | 2012-09-27 | 05:40:09 | 160         | if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {

I tried using this code:

import csv
import re

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

with open('somefile.csv', 'w+', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['commit_id', 'filename', 'committer', 'date', 'time', 'line_number', 'code'])
    for line in my_list:
        writer.writerow([field.strip() for field in pattern.match(line).groups()])

In general, the code works. But for line number 160, the line_number column gets the number from the trailing -1) and the code column contains only {.

Is there something missing in the regex?

3 Answers

1

The main problem with your pattern is the greedy .+. If you replace it with the lazy .*?, you will fix not only the line-number issue but also the trailing whitespace captured after the committer name:

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

https://regex101.com/r/f7zjpA/2
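
As a quick check of the difference, here is a minimal sketch that runs both patterns against the last sample line from the question. The variable names (`greedy`, `lazy`, `greedy_fields`, `lazy_fields`) are only for illustration. The greedy .+ runs to the end of the line and backtracks to the last place where a number is followed by ), which is the 1 in -1).

```python
import re

# Last sample line from the question (line 160).
line = (
    'b5cf8748198 master/ActiveMasterManager.java              '
    '(Michael Stack      2012-09-27 05:40:09 +0000 160)           '
    'if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
)

# Pattern from the question: greedy ".+" between time and line_number.
greedy = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d)'
    r'.+(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)'
)

# Pattern from this answer: lazy ".*?" in both places.
lazy = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d)'
    r'.*?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)'
)

greedy_fields = greedy.match(line).groupdict()
lazy_fields = lazy.match(line).groupdict()

# Greedy: stops at the last "<number>)" in the line.
print(greedy_fields['line_number'], repr(greedy_fields['code']))
# prints: 1 '{'

# Lazy: stops at the first "<number>)" after the time.
print(lazy_fields['line_number'], repr(lazy_fields['code']))
# prints: 160 'if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
```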

EDIT:

You didn't mention that you want to keep the indentation, and your code didn't suggest it either. The whitespace before the code is removed not only because of the regex pattern. There are two causes:

  • in the regex pattern you used \s+ before the code group, which consumes all the whitespace/indentation. If you want to keep it, replace \s+ with \s, which matches only the first whitespace character instead of all of them:

    pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)')
    
  • in the for loop you use field.strip(), which removes all whitespace at the beginning and the end of the string. Modifying the pattern and replacing:

    writer.writerow([field.strip() for field in pattern.match(line).groups()])
    

    with:

    writer.writerow(pattern.match(line).groups())
    

    will keep the indentation where it belongs.
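
Putting both changes together, a minimal sketch could look like this. The helper name `write_history_csv`, the `HEADER` constant, and the choice to skip lines that do not match are my assumptions, not part of the question. It writes to any file-like object, so an in-memory buffer is used here to read the rows back.

```python
import csv
import io
import re

# Pattern from this answer with a single \s before the code group, so only
# the separator space is consumed and the indentation stays in the capture.
pattern = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d)'
    r'.*?(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)'
)

HEADER = ['commit_id', 'filename', 'committer', 'date', 'time',
          'line_number', 'code']


def write_history_csv(lines, out):
    """Write one CSV row per matching line to the file-like object `out`."""
    writer = csv.writer(out)
    writer.writerow(HEADER)
    for line in lines:
        match = pattern.match(line)
        if match is None:
            continue  # assumption: skip lines the pattern cannot parse
        writer.writerow(match.groups())  # no strip(), indentation is kept


my_list = [
    'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   1) /**',
    'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   2)  *',
    'b5cf8748198 master/ActiveMasterManager.java              (Michael Stack      2012-09-27 05:40:09 +0000 160)           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {',
]

# Write to an in-memory buffer and parse it back to inspect the rows.
buffer = io.StringIO(newline='')
write_history_csv(my_list, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline='')))

print(rows[3][5], repr(rows[3][6]))  # line number and indented code
```

To write a real file instead, pass an open file: `with open('somefile.csv', 'w', newline='') as f: write_history_csv(my_list, f)`. One thing to be aware of: the code group `[^"]*` stops at the first double quote, so a source line containing a string literal would be cut off there; `.*` is one possible replacement if that matters for your data.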

3 Comments

But, if the value in the column 'code' contains indentation, it doesn't work. The regex doesn't keep the indentation at the beginning of the code.
Check the edit. Explanation for that was a bit too long for the comment.
Oh, I understand now. It's because of the loop. Thank you for your solution.
1

I fixed the regex. This should work:

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

I added a question mark to switch to lazy matching: ".+" => ".+?"

https://regex101.com/r/GQGLvy/1
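
Note that this pattern keeps the greedy .+ in the committer group, so that group also captures the padding after the name; the field.strip() call in the question's loop is what removes it. A small check, using shortened patterns that are only for illustration:

```python
import re

line = 'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   1) /**'

# Greedy committer group: takes everything up to the last whitespace
# character before the date, including the padding after the name.
greedy_name = re.search(
    r'\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)', line
).group('committer')

# Lazy committer group: stops at the end of the name.
lazy_name = re.search(
    r'\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)', line
).group('committer')

print(repr(greedy_name))          # name followed by trailing spaces
print(repr(greedy_name.strip()))  # 'Michael Stack'
print(repr(lazy_name))            # 'Michael Stack'
```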

0

Not exactly what you are looking for, but this can be useful.

import re

for row in my_list:
    print([x.strip() for x in re.split(r"(?![)])\s+(?![(])", row)])

out:

['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '1)', '/**']
['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '2)', '*']
...
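
One thing to keep in mind with this approach: because it splits on whitespace, the code column is broken into separate tokens as well, and the committer name is not kept together. A quick look at the third sample line (the `tokens` name is only for illustration):

```python
import re

line = 'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   3)  * Licensed to the Apache Software Foundation (ASF) under one'

# Same split as above, applied to a line whose code contains spaces.
tokens = [x.strip() for x in re.split(r"(?![)])\s+(?![(])", line)]

print(len(tokens))
print(tokens)
```

So the tokens would still need to be regrouped into the seven columns afterwards.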

1 Comment

(?!...) is a negative lookahead. Please read: docs.python.org/3/howto/regex.html
