The answers above rely on your text closely matching your examples. This code is a little more flexible: it matches any number of emails in your text. I did not document it thoroughly, but...
harvest_emails takes a string of newline-separated records, each comma-separated as in your examples (date, message_string, identifier), and returns a generator that yields a 3-tuple (date, comma-separated emails, identifier). It pulls any number of emails from each message and matches any address of the form x@x.com | x@x.org | x@x.net, where x is any non-empty run of non-whitespace characters.
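import re

def harvest_emails(target):
    """Take a string, split it into lines, and yield each line as a tuple:
    (datecode, comma-separated emails, identifier)
    """
    for line in target.splitlines():
        t = line.split(",")
        # Rejoin the message fields with a space so adjacent fields cannot fuse.
        message = " ".join(t[1:-1]).strip()
        # Any run of non-whitespace around "@", ending in .com, .org or .net.
        emails = re.findall(r"\S+@\S+\.(?:com|org|net)", message, re.I)
        yield (t[0].strip(), ",".join(emails), t[-1].strip())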
Example usage:
>>> messages = """04:34:03 +0000 2013,Email me for tickets email me at [email protected],1708824644
Tue Dec 17 04:33:58 +0000 2013,@musclepotential ok, man. you can email [email protected],25016561
Tue Dec 17 04:34:03 +0000 2013, [email protected], [email protected],1708824644
Tue Dec 17 04:33:58 +0000 2013, [email protected],25016561"""
>>> data = list()
>>> for line in harvest_emails(messages):
...     d = dict()
...     d["date"], d["emails"], d["id"] = line[0], line[1].split(','), line[2]
...     data.append(d)
...
>>> for value in data:
...     print(value)
...
{'emails': ['[email protected]'], 'date': '04:34:03 +0000 2013', 'id': '1708824644'}
{'emails': ['[email protected]'], 'date': 'Tue Dec 17 04:33:58 +0000 2013', 'id': '25016561'}
{'emails': ['[email protected]', '[email protected]'], 'date': 'Tue Dec 17 04:34:03 +0000 2013', 'id': '1708824644'}
{'emails': ['[email protected]'], 'date': 'Tue Dec 17 04:33:58 +0000 2013', 'id': '25016561'}
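The line handling can also be sketched without rejoining the middle fields: peel the date off the left and the identifier off the right, and whatever remains is the message, commas included. This is only an illustration; split_record is a made-up helper name and the sample line uses a placeholder address.

```python
def split_record(line):
    """Split 'date,message,identifier' where the message may contain commas.

    split_record is a hypothetical helper, not part of the code above.
    """
    # Everything before the first comma is the date.
    date, rest = line.split(",", 1)
    # Everything after the last comma is the identifier.
    message, identifier = rest.rsplit(",", 1)
    return date.strip(), message.strip(), identifier.strip()


# Placeholder record; the address is invented for illustration.
sample = "Tue Dec 17 04:33:58 +0000 2013,@someone ok, man. you can email user@example.com,25016561"
print(split_record(sample))
```

A line with fewer than two commas raises ValueError here, so malformed input would need its own handling.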
\w+ isn't good enough. What about [email protected]?
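The difference between \w+ and \S+ is easy to check. A minimal sketch, assuming a made-up address with a dot in the local part (first.last@example.com is a placeholder, not taken from the question): \w+ stops at the dot and returns a truncated address, while \S+ keeps the whole token.

```python
import re

# Placeholder text; the address is invented for illustration.
text = "write to first.last@example.com today"

# \w matches only letters, digits and underscore, so the local part is cut at the dot.
word_pattern = re.compile(r"\w+@\w+\.(?:com|org|net)", re.I)
# \S matches any non-whitespace character, so dots and hyphens survive.
nonspace_pattern = re.compile(r"\S+@\S+\.(?:com|org|net)", re.I)

print(word_pattern.findall(text))      # ['last@example.com']
print(nonspace_pattern.findall(text))  # ['first.last@example.com']
```

The trade-off is that \S+ also picks up punctuation glued to the address, such as a leading parenthesis, so stripping surrounding punctuation afterwards may be worthwhile.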