Regular expression in Python doesn't work

Question

I'm working on an exercise in the book Python for Informatics that asks me to write a program simulating the UNIX grep command. However, my code doesn't work. I have simplified it here so that it only counts how many lines start with the word 'From'. I'm quite confused and hope you can shed some light on it.

from urllib.request import urlopen
import re

fhand = urlopen('http://www.py4inf.com/code/mbox-short.txt')
sumFind = 0

for line in fhand:
    line = str(line) #convert from byte to string for re operation
    if re.search('^From',line) is not None:
        sumFind+=1

print(f'There are {sumFind} lines that match.')

The output of the script is

There are 0 lines that match.

The input text is at http://www.py4inf.com/code/mbox-short.txt

Thanks a lot for your time.

Jean-François Fabre · Accepted Answer · 2018-02-21 15:36:58Z


The mistake is converting bytes to a string with str, which returns the repr of the bytes object rather than its decoded text:

>>> str(b'foo')
"b'foo'"

You would have needed

line = line.decode()
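The difference between the two conversions can be sketched as follows. The sample line is a made-up stand-in for what urlopen yields, not a line taken from the actual file.

```python
import re

# Hypothetical sample line, shaped like one bytes line read from the response.
raw = b'From someone@example.com Sat Jan  5 09:14:16 2008\n'

as_repr = str(raw)      # the repr of the bytes object: it starts with b'
as_text = raw.decode()  # the decoded text (UTF-8 by default)

# '^From' cannot match the repr, because its first characters are b'
repr_matches = re.search('^From', as_repr) is not None
text_matches = re.search('^From', as_text) is not None

print(as_repr[:6], repr_matches, text_matches)
```

This is why the original loop counted 0 lines: every converted line began with b' instead of From.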

But the best way is to pass a bytes pattern to re.search, which is supported:

for line in fhand:
    if re.search(b'^From',line) is not None:
        sumFind+=1

Now I get 54 matches.
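To try the bytes-pattern loop without a network connection, here is a minimal sketch that uses io.BytesIO as a stand-in for the urlopen response. The sample lines are invented for illustration, so the count differs from the 54 reported for the real file.

```python
import io
import re

# Hypothetical in-memory data standing in for the downloaded file.
fhand = io.BytesIO(
    b'From alice@example.com Sat Jan  5 09:14:16 2008\n'
    b'Return-Path: <postmaster@example.com>\n'
    b'From: alice@example.com\n'
    b'Subject: hello\n'
    b'From bob@example.com Fri Jan  4 18:10:48 2008\n'
)

pattern = re.compile(b'^From')  # bytes pattern, matching the bytes lines
sum_find = 0
for line in fhand:              # iterating yields bytes lines, as with urlopen
    if pattern.search(line) is not None:
        sum_find += 1

print(f'There are {sum_find} lines that match.')
```

Note that '^From' also matches header lines beginning with "From:", so both kinds of line are counted.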

Note that you could simplify the whole loop to:

sum_find = sum(bool(re.match(b'From',line)) for line in fhand)

re.match only matches at the start of the string, so the ^ needed with re.search is unnecessary.
No explicit loop is needed: sum counts how many times re.match returns a truthy value (converted explicitly to bool so that it sums 0s and 1s).

Or even simpler, without a regex:

sum_find = sum(line.startswith(b"From") for line in fhand)
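As a quick check that these one-liners agree with each other, the sketch below runs them over the same invented sample data (io.BytesIO and the sample lines are assumptions for illustration, not part of the original exercise).

```python
import io
import re

# Hypothetical sample data; every line contains "From" somewhere.
data = (
    b'From alice@example.com Sat Jan  5 09:14:16 2008\n'
    b'Subject: Message From the list\n'
    b'From: alice@example.com\n'
    b'X-Note: not From the start\n'
)

def lines():
    """Return a fresh file-like object over the sample bytes."""
    return io.BytesIO(data)

with_match = sum(bool(re.match(b'From', line)) for line in lines())
with_search = sum(bool(re.search(b'^From', line)) for line in lines())
with_prefix = sum(line.startswith(b'From') for line in lines())

# Without the ^ anchor, re.search finds "From" anywhere in the line.
anywhere = sum(bool(re.search(b'From', line)) for line in lines())

print(with_match, with_search, with_prefix, anywhere)
```

The three anchored variants count only the lines that begin with From, while the unanchored search counts every line containing it.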


3 Comments

Maxmoe Over a year ago

Thanks a lot! But the shell reports:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    if re.match(b'^From',line):
  File "D:\Python3\lib\re.py", line 172, in match
    return _compile(pattern, flags).match(string)
TypeError: cannot use a bytes pattern on a string-like object

Jean-François Fabre Over a year ago

Either decode the bytes or use a bytes pattern. Really, drop the loop and use sum.
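The TypeError in the traceback comes from mixing a bytes pattern with a str line. A minimal sketch of the failing and working combinations, using a made-up sample line:

```python
import re

# Hypothetical sample line in both forms.
text_line = 'From alice@example.com Sat Jan  5 09:14:16 2008\n'
bytes_line = text_line.encode()

# Mixing a bytes pattern with a str raises TypeError.
try:
    re.match(b'From', text_line)
    message = None
except TypeError as exc:
    message = str(exc)

# Matching types work: str pattern on str, bytes pattern on bytes.
str_ok = re.match('From', text_line) is not None
bytes_ok = re.match(b'From', bytes_line) is not None

print(message, str_ok, bytes_ok)
```

So the pattern and the line must be of the same type: decode the line and use a str pattern, or leave the line as bytes and use a bytes pattern.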

Maxmoe Over a year ago

I adopted the bool approach and it works. Thank you so much!

Tom · Accepted Answer · 2018-02-21 16:08:46Z


Your issue is that urllib returns bytes rather than strings when reading the text file from the URL.

You can either:

Use a bytes pattern in your regex search: re.search(b'^From', line).
Use the requests module to download the file as a string and split it into lines:

import requests

txt = requests.get('http://www.py4inf.com/code/mbox-short.txt').text.split('\n')

for line in txt: ...
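Once the text has been decoded, the count itself needs no regex. Below is a minimal sketch; count_prefixed is a hypothetical helper name, and the sample string stands in for the decoded text that requests would return.

```python
def count_prefixed(text, prefix='From'):
    """Count the lines of an already-decoded text that start with prefix."""
    # splitlines handles both \n and \r\n and yields no trailing empty string
    return sum(line.startswith(prefix) for line in text.splitlines())

# Hypothetical sample standing in for the downloaded, decoded text.
sample_text = (
    'From alice@example.com Sat Jan  5 09:14:16 2008\n'
    'Subject: hello\n'
    'From: alice@example.com\n'
)

total = count_prefixed(sample_text)
print(f'There are {total} lines that match.')
```

Using splitlines() instead of split('\n') is a design choice here: split('\n') leaves an empty final element when the text ends with a newline, which is harmless for this count but easy to trip over elsewhere.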
