Python: regex findall

Question

I am using Python regex to extract certain values from a given string. This is my string:

mystring.txt

sometext
somemore    text here

some  other text

              course: course1
Id              Name                marks
____________________________________________________
1               student1            65
2               student2            75
3               MyName              69
4               student4            43

              course: course2
Id              Name                marks
____________________________________________________
1               student1            84
2               student2            73
8               student7            99
4               student4            32

              course: course4
Id              Name                marks
____________________________________________________
1               student1            97
3               MyName              60
8               student6            82

and I need to extract the course name and corresponding marks for a particular student. For example, I need the course and marks for MyName from the above string.

I tried:

re.findall(".*?course: (\w+).*?MyName\s+(\d+).*?",buff,re.DOTALL)

This works only if MyName is present under every course; it fails when MyName is missing from some courses, as in my example string.

Here I get the output: [('course1', '69'), ('course2', '60')]

but what I actually want to achieve is: [('course1', '69'), ('course4', '60')]

What would be the correct regex for this?

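#!/usr/bin/python
import re

# Read the whole file into one string; the with block closes the file.
with open("mystring.txt", "r") as buffer_fp:
    buff = buffer_fp.read()

# This is the attempt that pairs course2 with the marks from course4.
print(re.findall(".*?course: (\w+).*?MyName\s+(\d+).*?", buff, re.DOTALL))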

vks · Accepted Answer · 2015-06-03 06:40:38Z

5

.*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?
                ^^^^^^^^^^^^^^^^^^^^

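You can try this; see the demo. It uses a lookahead-based (tempered) quantifier, (?:(?!\bcourse\b).)*, which refuses to cross the next standalone word course, so MyName is matched only if it appears under the course that was just captured.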

https://regex101.com/r/pG1kU1/26
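A minimal sketch of how the pattern above might be used from Python, assuming the input looks like the question's mystring.txt. The names SAMPLE, PATTERN and marks_for_myname are made up for this illustration, and the sample is an abbreviated copy of the question's data. The pattern is written as a raw string so that \b reaches the regex engine as a word boundary.

```python
import re

# Abbreviated stand-in for the question's mystring.txt (illustration only).
SAMPLE = """\
              course: course1
Id              Name                marks
____________________________________________________
1               student1            65
3               MyName              69

              course: course2
Id              Name                marks
____________________________________________________
1               student1            84
8               student7            99

              course: course4
Id              Name                marks
____________________________________________________
1               student1            97
3               MyName              60
"""

# The answer's pattern, as a raw string and with DOTALL so "." spans newlines.
# (?:(?!\bcourse\b).)* consumes one character at a time, but only while the
# standalone word "course" does not start at the current position.
PATTERN = re.compile(
    r".*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?",
    re.DOTALL,
)


def marks_for_myname(text):
    """Return (course, marks) pairs for the courses that list MyName."""
    return PATTERN.findall(text)


print(marks_for_myname(SAMPLE))
# [('course1', '69'), ('course4', '60')]
```

Note that course2 is skipped: the tempered part cannot run past the next "course:" heading, so no MyName row is reachable from it.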

edited Jun 3, 2015 at 6:40

answered Jun 3, 2015 at 6:31

vks


5 Comments

Deepa Over a year ago

What are the g and s flags? I understand s is equivalent to re.DOTALL. I thought g was for findall, but using this regex in Python code gives a different output.

Deepa Over a year ago

But re.findall(".*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?",buff,re.DOTALL) outputs : [('course1', '60')] :(

vks Over a year ago

@Deepa print re.findall(r".*?course: (\w+)(?:(?!\bcourse\b).)*MyName\s+(\d+).*?",x,flags=re.DOTALL) I have tried this code and it's working for me.

Deepa Over a year ago

Oops, sorry, I had been trying without the r prefix. I don't get what difference that makes. :) The r prefix is for not translating escapes, right? I wonder why that affected the output.

vks Over a year ago

@Deepa In a normal string, Python interprets \b as its own escape (the backspace character), but we want it to be a regex word boundary, so we have to use the r prefix.
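A small illustration of the point in the comment above (not part of the original thread; the variable names are made up). In a normal string literal Python converts \b to a single backspace character before the regex engine ever sees it, while a raw string keeps the backslash so the engine reads it as a word boundary.

```python
import re

text = "course: course1 ... course: course2"

plain = "\bcourse\b"   # Python turns each \b into the backspace character
raw = r"\bcourse\b"    # backslashes survive; the regex engine sees word boundaries

print(len(plain), len(raw))     # 8 10
print(re.findall(plain, text))  # []
print(re.findall(raw, text))    # ['course', 'course']
```

With the plain string the lookahead (?!\bcourse\b) looks for literal backspace characters, which never occur in the text, so it no longer stops at the next course heading. That is consistent with the [('course1', '60')] output reported earlier in the comments.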

Béla · Answer · 2015-06-03 06:47:14Z

2

I suspect this is impossible to do in a single regular expression. They are not all-powerful.

Even if you find a way, don't do this. Your non-working regex is already close to unreadable; a working solution is likely to be even more so. You can most likely do this in just a few lines of meaningful code. Pseudocode solution:

for each line in buff:
    if it is a course line:
        set the course variable
    if it is a MyName line:
        add (course, marks) to the list of matches

Note that this could (and probably should) involve regexes in each of those if blocks. It's not a case of choosing between the hammer and the screwdriver to the exclusion of the other, but rather using them both for what they do best.
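One way the pseudocode above could be fleshed out, as a sketch rather than a definitive implementation. It assumes the layout from the question: a "course: NAME" heading followed by rows of Id, Name and marks. The names COURSE_RE, marks_by_course and SAMPLE are hypothetical, and the sample is an abbreviated copy of the question's data.

```python
import re

# Matches a heading such as "              course: course1".
COURSE_RE = re.compile(r"^\s*course:\s*(\w+)")


def marks_by_course(text, student):
    """Return (course, marks) pairs for each course that lists the student."""
    # Matches a row such as "3               MyName              69".
    row_re = re.compile(r"^\s*\d+\s+" + re.escape(student) + r"\s+(\d+)\s*$")
    matches = []
    course = None  # most recent course heading seen so far
    for line in text.splitlines():
        heading = COURSE_RE.match(line)
        if heading:
            course = heading.group(1)
            continue
        row = row_re.match(line)
        if row and course is not None:
            matches.append((course, row.group(1)))
    return matches


# Abbreviated stand-in for the question's mystring.txt (illustration only).
SAMPLE = """\
sometext
              course: course1
Id              Name                marks
1               student1            65
3               MyName              69

              course: course2
Id              Name                marks
1               student1            84

              course: course4
Id              Name                marks
3               MyName              60
"""

print(marks_by_course(SAMPLE, "MyName"))
# [('course1', '69'), ('course4', '60')]
```

Each regex here handles a single line, so a change in the row format only touches row_re and a change in the heading only touches COURSE_RE.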

answered Jun 3, 2015 at 6:47

Béla


3 Comments

vks Over a year ago

Guess you underestimated regex :)

Béla Over a year ago

@vks I guess I did. But respectfully, your solution proves my point. That regex is illegible garbage - good luck to the OP if their requirements ever change and they need to try to fix it. It reads more like Perl than Python.

vks Over a year ago

it's illegible garbage for the one who can't understand :)
