Using Python to Access Web Data with Regular Expression is not working

Question

I am taking the Python for Everybody course on Coursera and have just learned how to retrieve a file from the web with Python.

I am trying to extract the email address from each line that starts with From:, but I get no output.

The file does contain lines starting with From:. The same extraction works when I read a local copy with ordinary file handling, and fails only when I read the file from the server, so I suspect the problem has to do with whitespace.

Any help would be appreciated; I am stuck.

import socket
import re
dic = dict()
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    mysock.connect(('data.pr4e.org', 80))
except:
    print("Can't find the server.\nCheck your internet Connection")
cmd = 'GET http://data.pr4e.org/mbox-short.txt HTTP/1.0\r\n\r\n'.encode()
try:
    mysock.send(cmd)
except:
    print("Connection Lost:\nCheck your Internet Connection")
while True:
    data = mysock.recv(512)
    if len(data)  < 1:
        break
    data = data.decode()
    data = data.rstrip()
    k = re.findall('^From:.(\S+@\S+)', data)
    if (len(k)) > 0:
        print(k)

The file can be downloaded from http://data.pr4e.org/mbox-short.txt

You recognize it's some whitespace problem, but you haven't included the text/file you're trying to match against. Have you tried a regex debugging tool, e.g. debuggex.com? Or could you include the text you're trying to match against?
– Macattack, Commented Jun 2, 2020 at 22:34
Brother, I have just added the link from where you can download the file: data.pr4e.org/mbox-short.txt
– DeathNet123, Commented Jun 2, 2020 at 22:43
Your regex expects a line starting with From: but there's no : in the file, it's just From
– Macattack, Commented Jun 2, 2020 at 23:12
No, there are lines starting with From: in the file, 27 of them to be precise:

count = 0
fhand = open("test.txt")  # change the file name to match your saved copy
for line in fhand:
    if line.startswith("From:"):
        count = count + 1
print(count)

– DeathNet123, Commented Jun 7, 2020 at 22:37

Wiktor Stribiżew · Accepted Answer · 2020-06-02 23:05:30Z

3

You may get the emails using

k = re.findall(r'(?m)^From:\s*(\S+@\S+)', data)

See the regex demo.

Details

(?m)^ - start of a line
From: - a literal string
\s* - zero or more whitespace characters
(\S+@\S+) - Capturing group 1 (the output of re.findall will only contain this value): one or more non-whitespace chars, @ and one or more non-whitespace chars.
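
To see why the flag matters, here is a small offline sketch. The sample text and the addresses in it are made up for illustration and only mimic the shape of mbox-short.txt; no network access is involved. Without (?m), ^ matches only at the very start of the string, so a multi-line chunk returned by recv() yields nothing unless it happens to begin with From:. The second half simulates a chunk boundary that falls in the middle of a From: line, which is an assumption about how recv(512) might split the data rather than something observed in the question.

```python
import re

# Hypothetical sample resembling part of the mbox file
sample = (
    "From alice@example.com Sat Jan  5 09:14:16 2008\n"
    "Return-Path: <postmaster@example.org>\n"
    "From: alice@example.com\n"
    "Subject: test\n"
    "From: bob@example.org\n"
)

# Original pattern: ^ anchors only at the start of the whole string
without_flag = re.findall(r'^From:.(\S+@\S+)', sample)
# With (?m), ^ anchors at the start of every line
with_flag = re.findall(r'(?m)^From:\s*(\S+@\S+)', sample)
print(without_flag)
print(with_flag)

# Simulate two recv() chunks, with the split landing inside a From: line
data = sample.encode()
cut = data.index(b"From: alice") + 9  # chunk 1 ends with "From: ali"
chunks = [data[:cut], data[cut:]]

pattern = r'(?m)^From:\s*(\S+@\S+)'

# Matching each chunk separately misses the address that was split
per_chunk = []
for chunk in chunks:
    per_chunk.extend(re.findall(pattern, chunk.decode()))

# Joining all chunks first and matching once finds every address
joined = re.findall(pattern, b"".join(chunks).decode())
print(per_chunk)
print(joined)
```

If the split-line case is a concern, collecting every chunk in a list and decoding the joined bytes once, after the loop, avoids it at the cost of holding the whole response in memory.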

DeathNet123 · Accepted Answer · 2020-06-14 00:31:55Z

-1

I found a better way to do this. The urllib.request library makes it easier, because the response can be read line by line.

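import urllib.request, urllib.parse, urllib.error
import re

fhand = urllib.request.urlopen('http://data.pr4e.org/mbox-short.txt')
for line in fhand:
    # urlopen yields bytes, so decode each line before applying a str pattern
    k = re.findall(r'(?m)^From:\s*(\S+@\S+)', line.decode())
    # findall returns one address per matching line, so test for > 0, not > 1
    if len(k) > 0:
        print(k)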
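
The line-by-line idea can be tried without a network connection. The following is a minimal sketch under stated assumptions: extract_senders is a hypothetical helper name, and the sample lines are invented bytes objects standing in for what iterating over the urlopen response would produce.

```python
import re

# Each input is a single line, so no multiline flag is needed here
FROM_RE = re.compile(r'^From:\s*(\S+@\S+)')

def extract_senders(lines):
    """Collect addresses from an iterable of bytes lines (hypothetical helper)."""
    senders = []
    for raw in lines:
        line = raw.decode()  # bytes -> str; assumes UTF-8 compatible text
        match = FROM_RE.match(line)
        if match:
            senders.append(match.group(1))
    return senders

# Made-up lines shaped like the mbox file
sample_lines = [
    b"From carol@example.com Sat Jan  5 09:14:16 2008\n",
    b"From: carol@example.com\n",
    b"Subject: hello\n",
    b"From: dave@example.org\n",
]
senders = extract_senders(sample_lines)
print(senders)
```

In real use, the list would be replaced by the object returned by urllib.request.urlopen, which can be iterated the same way.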
