Regex extract element in a list using python

Question

I have a list in which each element is one line of a file's history. I need to split each element into several columns and save the result to a CSV file.

The columns I need are commit_id, filename, committer, date, time, line_number, and code. Suppose this is my list:

my_list = [
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   1) /**',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   2)  *',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   3)  * Licensed to the Apache Software Foundation (ASF) under one',
     'f5213095324 master/ActiveMasterManager.java              (Michael Stack      2010-08-31 23:51:44 +0000   4)  * or more contributor license agreements.',
     ...
     'b5cf8748198 master/ActiveMasterManager.java              (Michael Stack      2012-09-27 05:40:09 +0000 160)           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
     ]

The desired CSV output:

commit_id   | filename                         | committer     | date       | time     | line_number | code 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 1           | /**
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 2           | *
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 3           | * Licensed to the Apache Software Foundation (ASF) under one
f5213095324 | master/ActiveMasterManager.java  | Michael Stack | 2010-08-31 | 23:51:44 | 4           | * or more contributor license agreements.
........
b5cf8748198 | master/ActiveMasterManager.java  | Michael Stack | 2012-09-27 | 05:40:09 | 160         | if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {

I tried using this code:

import csv
import re

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

with open('somefile.csv', 'w+', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['commit_id', 'filename', 'committer', 'date', 'time', 'line_number', 'code'])
    for line in my_list:
        writer.writerow([field.strip() for field in pattern.match(line).groups()])

In general, the code works. But for line number 160, the line_number column takes its value from the -1) inside the code instead of 160, and the code column contains only {.

Is there something missing in the regex?

Michał Machnicki · Accepted Answer · 2018-03-16 09:01:47Z

1

The main problem with your pattern is the use of the greedy .+. If you replace it with the lazy .*?, you fix not only the line number but also the trailing whitespace captured after the committer name:

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

https://regex101.com/r/f7zjpA/2
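
To see the difference on the problematic line, here is a minimal sketch that runs both variants against the last sample line from the question. The names HEAD, TAIL, greedy and lazy are only illustrative; the pattern pieces are the ones from the answer, split in two so that only the part between the time and the line number changes.

```python
import re

# The last sample line from the question (line number 160).
line = ('b5cf8748198 master/ActiveMasterManager.java              '
        '(Michael Stack      2012-09-27 05:40:09 +0000 160)'
        '           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {')

# Illustrative split of the pattern: everything up to the time field,
# and everything from the line number onwards.
HEAD = (r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
        r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d)')
TAIL = r'(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)'

greedy = re.compile(HEAD + r'.+' + TAIL)   # as in the question
lazy = re.compile(HEAD + r'.*?' + TAIL)    # as in this answer

greedy_match = greedy.match(line)
lazy_match = lazy.match(line)

# Greedy: .+ runs to the end of the line and backtracks to the LAST "digits)".
print(greedy_match.group('line_number'), repr(greedy_match.group('code')))
# → 1 '{'

# Lazy: .*? stops at the FIRST "digits)" after the time field.
print(lazy_match.group('line_number'), repr(lazy_match.group('code')))
# → 160 'if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {'
```

The greedy version only goes wrong on lines whose code contains digits directly followed by a closing parenthesis, which is why the first few rows looked fine.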

EDIT:

You didn't mention that you wanted to keep the indentation, and your code didn't suggest that you did. The whitespace before the code is removed for two reasons, not just because of the regex pattern.

First, in the regex pattern you used \s+ before the code group, which consumes all of the leading whitespace. To keep the indentation, replace \s+ with \s, which matches only the first whitespace character instead of all of them:

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)')

Second, in the for loop you call field.strip(), which removes all whitespace at the beginning and end of each string. Modifying the pattern as above and replacing:
```
writer.writerow([field.strip() for field in pattern.match(line).groups()])
```
with:
```
writer.writerow(pattern.match(line).groups())
```
will keep the indentation where it belongs.
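
Putting both changes together, a minimal end-to-end sketch could look like the following. It is an illustration under some assumptions, not a definitive implementation: parse_lines is a hypothetical helper, skipping lines that do not match is my own choice, and an in-memory io.StringIO buffer stands in for the somefile.csv file from the question.

```python
import csv
import io
import re

# Pattern from the edit above: lazy quantifiers, single \s before the code group.
PATTERN = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.*?)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).*?'
    r'(?P<line_number>\b\d+\b)\)\s(?P<code>[^"]*)'
)

# Two of the sample lines from the question.
my_list = [
    'f5213095324 master/ActiveMasterManager.java              '
    '(Michael Stack      2010-08-31 23:51:44 +0000   2)  *',
    'b5cf8748198 master/ActiveMasterManager.java              '
    '(Michael Stack      2012-09-27 05:40:09 +0000 160)'
    '           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {',
]


def parse_lines(lines):
    """Hypothetical helper: yield one tuple of fields per matching line."""
    for line in lines:
        match = PATTERN.match(line)
        if match is None:
            # Assumption: lines that do not fit the pattern are skipped
            # instead of raising AttributeError on None.groups().
            continue
        yield match.groups()  # no strip(), so indentation in code survives


rows = list(parse_lines(my_list))

# In-memory stand-in for open('somefile.csv', 'w', newline='').
buffer = io.StringIO(newline='')
writer = csv.writer(buffer)
writer.writerow(['commit_id', 'filename', 'committer', 'date', 'time',
                 'line_number', 'code'])
writer.writerows(rows)

print(buffer.getvalue())
```

Because the code column can contain commas, csv.writer quotes those fields automatically, so the file still has seven columns per row.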

edited Mar 16, 2018 at 9:01

answered Mar 14, 2018 at 10:22

Michał Machnicki


3 Comments

YusufUMS Over a year ago

But if the value in the 'code' column is indented, it doesn't work. The regex doesn't keep the indentation at the beginning of the code.

Michał Machnicki Over a year ago

Check the edit. Explanation for that was a bit too long for the comment.

YusufUMS Over a year ago

Oh, I understand now. It's because of the loop. Thank you for your solution.

Chertkov Pavel · Accepted Answer · 2018-03-14 09:33:57Z

1

I fixed the regex. This should work:

pattern = re.compile(r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+?(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)')

I added a question mark after the second .+ to make it lazy: ".+" => ".+?"

https://regex101.com/r/GQGLvy/1
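
One detail worth noting, sketched below on the last sample line from the question: this pattern still uses a greedy .+ for the committer group, so that group keeps the padding spaces before the date. The field.strip() call in the question's loop removes them, so the CSV output is unaffected as long as that call stays in place.

```python
import re

# The last sample line from the question (line number 160).
line = ('b5cf8748198 master/ActiveMasterManager.java              '
        '(Michael Stack      2012-09-27 05:40:09 +0000 160)'
        '           if (ZKUtil.checkExists(this.watcher, backupZNode) != -1) {')

# Pattern from this answer: only the .+ after the time field is made lazy.
pattern = re.compile(
    r'(?P<commit_id>\w+)\s+(?P<filename>[^\s]+)\s+\((?P<committer>.+)\s+'
    r'(?P<date>\d{4}-\d\d-\d\d)\s+(?P<time>\d\d:\d\d:\d\d).+?'
    r'(?P<line_number>\b\d+\b)\)\s+(?P<code>[^"]*)'
)

match = pattern.match(line)
committer = match.group('committer')

print(repr(committer))             # name followed by trailing padding spaces
print(repr(committer.strip()))     # → 'Michael Stack'
print(match.group('line_number'))  # → 160
```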

edited Mar 14, 2018 at 9:33

answered Mar 14, 2018 at 9:28

Chertkov Pavel


Rahul · Accepted Answer · 2018-03-14 09:38:46Z

0

Not exactly what you are looking for, but this may be useful.

import re

for row in my_list:
    print([x.strip() for x in re.split(r"(?![)])\s+(?![(])", row)])

out:

['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '1)', '/**']
['f5213095324', 'master/ActiveMasterManager.java', '(Michael', 'Stack', '2010-08-31', '23:51:44', '+0000', '2)', '*']
...
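
A caveat, shown here as a small sketch on the third sample line from the question: because the split happens on every run of whitespace, a code column that contains spaces is broken into several pieces, so the number of fields varies from row to row. The variable names row and fields are only illustrative.

```python
import re

# Third sample line from the question; its code column contains spaces.
row = ('f5213095324 master/ActiveMasterManager.java              '
       '(Michael Stack      2010-08-31 23:51:44 +0000   3)'
       '  * Licensed to the Apache Software Foundation (ASF) under one')

fields = [x.strip() for x in re.split(r"(?![)])\s+(?![(])", row)]

# The first eight fields line up with the output above;
# everything after them is the code column split into separate words.
print(fields[:8])
print(fields[8:])
```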

answered Mar 14, 2018 at 9:38

Rahul


1 Comment

Rahul Over a year ago

(?![)]) is a negative lookahead. Please read: docs.python.org/3/howto/regex.html
