2

I am taking the Python for Everybody course on Coursera and have just learned how to access a file on the web with Python.

I am trying to extract the email address from each line that starts with From:, but I get no output.

The file does contain lines starting with From:, because the same extraction works when I read a local copy with ordinary file handling. It fails only when I read the file from the server, so I suspect the problem has to do with whitespace.

Any help would be appreciated; I am stuck.

import socket
import re
dic = dict()
mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    mysock.connect(('data.pr4e.org', 80))
except:
    print("Can't find the server.\nCheck your internet Connection")
cmd = 'GET http://data.pr4e.org/mbox-short.txt HTTP/1.0\r\n\r\n'.encode()
try:
    mysock.send(cmd)
except:
    print("Connection Lost:\nCheck your Internet Connection")
while True:
    data = mysock.recv(512)
    if len(data)  < 1:
        break
    data = data.decode()
    data = data.rstrip()
    k = re.findall('^From:.(\S+@\S+)', data)
    if (len(k)) > 0:
        print(k)

The file can be downloaded from http://data.pr4e.org/mbox-short.txt

  • You recognize it's some whitespace problem, but you haven't included the text/file you're trying to match against. Have you tried a regex debugging tool, e.g. debuggex.com? Or could you include the text you're trying to match? Commented Jun 2, 2020 at 22:34
  • I have just added the link where you can download the file: data.pr4e.org/mbox-short.txt Commented Jun 2, 2020 at 22:43
  • Your regex expects From: at the start of the line, but there's no : in the file, it's just From. Commented Jun 2, 2020 at 23:12
  • No, there are lines that start with From: (27 of them, to be precise). A quick check:
        count = 0
        fhand = open("test.txt")  # change to the name you saved the file under
        for line in fhand:
            if line.startswith("From:"):
                count = count + 1
        print(count)
    Commented Jun 7, 2020 at 22:37

2 Answers

3

You can extract the emails with

k = re.findall(r'(?m)^From:\s*(\S+@\S+)', data)

See the regex demo.

Details

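  • (?m)^ - start of a line (the (?m) flag makes ^ match at the start of every line; without it, ^ matches only at the very start of the string, which is why the original pattern found nothing in a multi-line chunk)
  • From: - a literal string
  • \s* - zero or more whitespace characters
  • (\S+@\S+) - Capturing group 1 (the output of re.findall will only contain this value): one or more non-whitespace chars, @ and one or more non-whitespace chars.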
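
One caveat with the socket approach: recv(512) returns arbitrary 512-byte chunks, not lines, so an address can be cut in half at a chunk boundary even with (?m). A minimal sketch of one way around that is to join all chunks first and run the regex once. The helper name extract_senders and the sample chunks below are made up for illustration; they stand in for the successive results of mysock.recv(512).

```python
import re

# (?m) makes ^ match at the start of every line, not just the start of the string.
FROM_RE = re.compile(r'(?m)^From:\s*(\S+@\S+)')


def extract_senders(chunks):
    """Join raw byte chunks, decode once, then collect every From: address."""
    text = b''.join(chunks).decode()
    return FROM_RE.findall(text)


# Made-up chunks standing in for successive mysock.recv(512) results;
# the second address is deliberately split across a chunk boundary.
sample_chunks = [
    b'HTTP/1.1 200 OK\r\n\r\nFrom: alice@example.org\r\nSubject: hi\r\nFrom: bo',
    b'b@example.org\r\n',
]

senders = extract_senders(sample_chunks)
print(senders)
```

In the original loop this would mean appending each data to a list inside the while loop and calling the helper once after the loop ends, instead of matching inside the loop.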

-1

I found a better way to do this: the urllib.request library handles the connection and the HTTP headers, so the file can be read line by line.

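import urllib.request, urllib.parse, urllib.error
import re

fhand = urllib.request.urlopen('http://data.pr4e.org/mbox-short.txt')
for line in fhand:
    # urlopen yields bytes, so decode each line before matching a str pattern
    k = re.findall(r'(?m)^From:\s*(\S+@\S+)', line.decode())
    # findall returns one address per matching line, so test for non-empty
    if len(k) > 0:
        print(k)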
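
Two details matter when matching line by line like this: iterating over the object returned by urlopen yields bytes, which must be decoded before a str pattern can be applied, and each line holds at most one address, so the result should be tested for being non-empty. The following is a rough offline sketch of that loop; io.BytesIO stands in for the real response, and the helper name senders_from_lines and the sample contents are assumptions made for illustration.

```python
import io
import re

# No (?m) needed here: each line is matched on its own, so ^ is the line start.
FROM_RE = re.compile(r'^From:\s*(\S+@\S+)')


def senders_from_lines(byte_lines):
    """Decode each bytes line and collect the address from From: lines."""
    found = []
    for raw in byte_lines:
        match = FROM_RE.match(raw.decode())
        if match:
            found.append(match.group(1))
    return found


# io.BytesIO stands in for the object returned by urllib.request.urlopen;
# the contents are made up for illustration.
fake_response = io.BytesIO(
    b'From stephen@example.org Sat Jan  5 09:14:16 2008\n'
    b'From: stephen@example.org\n'
    b'Subject: test\n'
)

result = senders_from_lines(fake_response)
print(result)
```

The first sample line starts with From followed by a space rather than a colon, so it is skipped; only the From: line contributes an address.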
