After 5 hours of trying, it's time to get some help. I sifted through all the Stack Overflow questions related to this but couldn't find the answer.

The code is a Gmail parser. It works for most emails, but some emails cause a UnicodeDecodeError. The problem is "raw_email.decode('utf-8')", but changing it (see the comments in the code) causes a different problem further down.

# Source: https://stackoverflow.com/questions/7314942/python-imaplib-to-get-gmail-inbox-subjects-titles-and-sender-name

import datetime
import time
import email
import imaplib
import mailbox
from vars import *
import re                   # to remove links from str
import string


EMAIL_ACCOUNT = 'gmail_login'
PASSWORD = 'gmail_psswd'

mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(EMAIL_ACCOUNT, PASSWORD)
mail.list()
mail.select('inbox')
result, data = mail.uid('search', None, "ALL") # (ALL/UNSEEN)

id_list = data[0].split()
email_rev = reversed(id_list)             # Returns a type list.reverseiterator, which is not list
email_list = list(email_rev)
i = len(email_list)

todays_date = time.strftime("%m/%d/%Y")

for x in range(i):
    latest_email_uid = email_list[x]
    result, email_data = mail.uid('fetch', latest_email_uid, '(RFC822)')
    raw_email = email_data[0][1]                                 # Returns a byte
    raw_email_str = raw_email.decode('utf-8')                    # Returns a str
    #raw_email_str = base64.b64decode(raw_email_str1)      # Tried this but didn't work.
    #raw_email_str = raw_email.decode('utf-8', errors='ignore')  # Tried this but caused a TypeError down where var subject is created because something there is expecting a str or byte-like 
    email_message = email.message_from_string(raw_email_str)

    date_tuple = email.utils.parsedate_tz(email_message['Date'])           
    date_short = f'{date_tuple[1]}/{date_tuple[2]}/{date_tuple[0]}'

    # Header Details
    if date_short == '12/23/2019':
        #if date_tuple:
        #    local_date = datetime.datetime.fromtimestamp(email.utils.mktime_tz(date_tuple))
        #    local_message_date = "%s" %(str(local_date.strftime("%a, %d %b %Y %H:%M:%S")))
        email_from = str(email.header.make_header(email.header.decode_header(email_message['From'])))
        subject = str(email.header.make_header(email.header.decode_header(email_message['Subject'])))
        #print(subject)
        if email_from.find('[email protected]') != -1:
            print('yay')

        # Body details
        if email_from.find('[email protected]') != -1 and subject.find('Payment Summary') != -1:
            for part in email_message.walk():
                if part.get_content_type() == "text/plain":
                    body = part.get_payload(decode=True)
                    body = body.decode("utf-8")             # Convert byte to str
                    body = body.replace("\r\n", " ")
                    text = re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', body)           # removes url links
                    text2 = text.translate(str.maketrans('', '', string.punctuation))
                    body_list = re.sub("[^\w]", " ",  text2).split()

                    print(body_list)
                    print(date_short)

                else:
                    continue
  • To make your life easier you might want to have a look at imapclient.readthedocs.io/en/2.1.0. It deals with most of the low-level stuff and is quite easy to use. Your code above is never going to work reliably unless you implement all the edge cases of the mail and IMAP RFCs (including different encodings on various message parts and such). Commented Dec 27, 2019 at 23:33
  • Thanks @jerch! Amazing resource and it works! However, it doesn't show how to extract the body of the email, which is what I want to parse. Did I miss it somewhere? Commented Dec 29, 2019 at 1:34
  • Right, imapclient stops helping at the message itself (for a simple reason: parsing a large attachment would be expensive and is probably unwanted). To reliably parse a raw mail message, please refer to stdlib modules like email.message and email.parser (docs.python.org/3/library/email.html). Sadly, a mail message can be complicated (due to the parts logic with different encodings and MIME types), so you will have to work through the docs to cover those aspects. Commented Dec 29, 2019 at 11:48
  • Thank you @Jerch, would you please be able to elaborate a bit more about what the code would look like? I've tried various forms of smtplib, imaplib, imapclient, and email libraries without success. I got real close using imaplib but couldn't figure out the problem. Any guidance would help! Commented Dec 31, 2019 at 9:30
  • Try a high-level library: pypi.org/project/imap-tools. Everything is already parsed. Commented Sep 22, 2020 at 8:52

3 Answers

1

Here is an example of how to retrieve and read mail parts with imapclient and the email.* modules from the Python standard library:

from imapclient import IMAPClient
import email
from email import policy


def walk_parts(part, level=0):
    # print the content type, indented by nesting depth
    print(' ' * 4 * level + part.get_content_type())
    # do something with part content (applies encoding by default)
    # part.get_content()
    if part.is_multipart():
        for subpart in part.get_payload():
            walk_parts(subpart, level + 1)


# context manager ensures the session is cleaned up
with IMAPClient(host="your_mail_host") as client:
    client.login('user', 'password')

    # select some folder
    client.select_folder('INBOX')

    # do something with folder, e.g. search & grab unseen mails
    messages = client.search('UNSEEN')
    for uid, message_data in client.fetch(messages, 'RFC822').items():
        email_message = email.message_from_bytes(
            message_data[b'RFC822'], policy=policy.default)
        print(uid, email_message.get('From'), email_message.get('Subject'))

    # alternatively search for specific mails
    msgs = client.search(['SUBJECT', 'some subject'])

    #
    # do something with a specific mail:
    #

    # fetch a single mail with UID 12345
    raw_mails = client.fetch([12345], 'RFC822')

    # parse the mail (very expensive for big mails with attachments!)
    mail = email.message_from_bytes(
        raw_mails[12345][b'RFC822'], policy=policy.default)

    # Now you have a python object representation of the mail and can dig
    # into it. Since a mail can be composed of several subparts we have
    # to walk the subparts.

    # walk all parts at once
    for part in mail.walk():
        # do something with that part
        print(part.get_content_type())
    # or recurse yourself into sub parts until you find the interesting part
    walk_parts(mail)

See the docs for email.message.EmailMessage. They cover everything you need to dig into a mail message.
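Since the question is about getting at the body text, here is a minimal sketch of that step using only the standard library. It assumes the message was parsed with policy=policy.default, so that get_body() and get_content() are available. The sample message, its addresses and the helper name extract_plain_text are made up for illustration; in real code the bytes would come from the IMAP fetch.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

# Build a small multipart/alternative message to stand in for the raw
# RFC822 bytes an IMAP fetch would return (hypothetical content).
sample = EmailMessage()
sample['From'] = 'sender@example.com'
sample['To'] = 'recipient@example.com'
sample['Subject'] = 'Payment Summary'
sample.set_content('Total paid: 42 EUR\n')
sample.add_alternative('<p>Total paid: <b>42 EUR</b></p>', subtype='html')
raw_bytes = sample.as_bytes()


def extract_plain_text(raw):
    """Return the text/plain body of a raw message, or None if there is none."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body() looks for the best matching body part, skipping attachments
    body = msg.get_body(preferencelist=('plain',))
    if body is None:
        return None
    # get_content() undoes the transfer encoding and decodes with the
    # charset declared on the part
    return body.get_content()


text = extract_plain_text(raw_bytes)
print(text)
```

This relies on the declared charset being correct. If it is not, get_content() can still produce wrong characters or raise.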

2 Comments

This could still fail for many real-world messages where the sender declared the wrong charset. Historically, many clients declared "us-ascii" but then sent some undeclared 8-bit encoding anyway; these days, many probably claim "utf-8" but then actually use something else.
True, but that's always the case: if something claims to be XY but is Z, you have a bigger problem (one that needs more involved recovery strategies and cannot be blueprinted this easily).
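One possible recovery strategy for a mis-declared charset, as a rough sketch only: try the declared charset first, then a short list of guesses, and finally fall back to a decode that cannot fail. The helper name decode_part_text and the choice and order of fallback charsets are assumptions for illustration, not a standard recipe. The result can still contain wrong characters if a guess decodes without error but is not the real encoding.

```python
import email


def decode_part_text(part, fallbacks=('utf-8', 'cp1252')):
    """Decode a non-multipart text part, tolerating a wrong declared charset."""
    payload = part.get_payload(decode=True)   # bytes, transfer encoding undone
    if payload is None:                       # multipart containers have no payload
        return ''
    declared = part.get_content_charset()     # e.g. 'us-ascii', or None
    candidates = ([declared] if declared else []) + list(fallbacks)
    for charset in candidates:
        try:
            return payload.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # wrong charset, or a charset name Python does not know
            continue
    # last resort: latin-1 maps every byte to a character, so this never raises
    return payload.decode('latin-1')


# Hypothetical message that claims us-ascii but contains an 8-bit byte
raw = (
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Caf\xe9\r\n"
)
text = decode_part_text(email.message_from_bytes(raw))
print(text)
```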
1

Use 'ISO 8859-1' instead of 'utf-8' when decoding the raw bytes.
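A short illustration of what this changes, with a made-up sample string: ISO 8859-1 assigns a character to every possible byte value, so decoding with it never raises UnicodeDecodeError. The catch is that bytes which were really UTF-8 come out as mojibake rather than as an error.

```python
# Every byte value 0..255 decodes under ISO 8859-1, so no exception is possible
all_bytes = bytes(range(256))
decoded_all = all_bytes.decode('iso-8859-1')

# Hypothetical payload that is actually UTF-8
raw = 'Zahlung für Café'.encode('utf-8')

as_utf8 = raw.decode('utf-8')          # the intended text
as_latin1 = raw.decode('iso-8859-1')   # no error, but non-ASCII characters are garbled

print(as_utf8)
print(as_latin1)
```

So this silences the exception, but the parsed text is only correct when the message really is Latin-1 (or pure ASCII).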

0

I had the same issue, and after a lot of research I realized that I simply needed to use the message_from_bytes function from email rather than message_from_string.

So in your code, simply replace:

raw_email_str = raw_email.decode('utf-8')
email_message = email.message_from_string(raw_email_str)

with

email_message = email.message_from_bytes(raw_email)

That should work like a charm :)
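To see why this helps, here is a small self-contained check with a made-up raw message whose body is Latin-1 rather than UTF-8. Decoding the whole message as UTF-8 fails, while message_from_bytes parses the bytes and leaves charset handling to each part. Passing policy=policy.default is an addition on my side so that get_content() is available; it is not required for the fix itself.

```python
import email
from email import policy

# Hypothetical raw message: ASCII headers, Latin-1 encoded 8-bit body
raw_email = (
    b"From: sender@example.com\r\n"
    b"Subject: Payment Summary\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Caf\xe9 total: 42\r\n"
)

# The approach from the question: decode everything as UTF-8 up front
try:
    raw_email.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Parsing the bytes directly works; the part's declared charset is applied
# only when its content is read
email_message = email.message_from_bytes(raw_email, policy=policy.default)
subject = email_message['Subject']
body = email_message.get_content()

print(decode_failed, subject, body.strip())
```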
