Separating a date from a string in Python

Question

Given a string with a date in an unknown format and other text, how can I separate the two?

>>> import dateutil.parser as dparser
>>> dparser.parse("monkey 2010-07-10 love banana", fuzzy=True)
datetime.datetime(2010, 7, 10, 0, 0)

This, from "Extracting date from a string in Python", is a step in the right direction, but what I want is the non-date text as well, for example:

date = 2010-07-10
str_a = 'monkey', str_b = 'love banana'

If the date string didn't have spaces in it, I could split the string and test each substring, but how about 'monkey Feb 20, 2015 loves 2014 bananas'? 2014 and 2015 would both "pass" parse(), but only one of them is part of a date.

EDIT: There doesn't seem to be any reasonable way to deal with 'monkey Feb 20, 2015 loves 2014 bananas'. That leaves 'monkey Feb 20, 2015 loves bananas', 'monkey 2/20/2015 loves bananas', 'monkey 20 Feb 2015 loves 2014 bananas', and other variants as things parse() can deal with.

Why is 2015 a year in your example while 2014 is not? The phrase is nonsense either way.
– jfs, Commented Feb 21, 2015 at 13:05
Fair point. Feb 20, 2015 is clearly a date, while 2014 is ambiguous. If you run it through parse(..., fuzzy=True), it treats 2014 as hours and minutes. I'll edit the question.
– foosion, Commented Feb 21, 2015 at 13:16
I'd start by trying a date parse at each offset. If just one works, use that. If two or more offsets work, you have a new problem.
– Skaperen, Commented Feb 21, 2015 at 13:23
@Skaperen Split on spaces and consider any block that "passes" parse() as a date? Or do you mean something else? BTW, for Feb 20, 2015 each offset would pass, but the parts that work would be contiguous.
– foosion, Commented Feb 21, 2015 at 13:49
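
The "try each offset, prefer contiguous tokens" idea from the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: it uses the standard library's datetime.strptime with a hand-picked list of formats instead of dateutil's parse(), and the names CANDIDATE_FORMATS and split_date are hypothetical, not from the question or from any library.

```python
from datetime import datetime

# Hypothetical list of formats to try; extend it for the inputs you expect.
# %b and %B are locale-dependent; English month names are assumed here.
CANDIDATE_FORMATS = [
    "%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%B %d, %Y", "%d %b %Y", "%d %B %Y",
]

def split_date(text, formats=CANDIDATE_FORMATS, max_tokens=3):
    """Return (date, text_before, text_after), or (None, text, '') if nothing parses."""
    tokens = text.split()
    # Try the longest windows of contiguous tokens first, so that
    # 'Feb 20, 2015' is considered before any single token.
    for size in range(min(max_tokens, len(tokens)), 0, -1):
        for start in range(len(tokens) - size + 1):
            candidate = " ".join(tokens[start:start + size])
            for fmt in formats:
                try:
                    parsed = datetime.strptime(candidate, fmt)
                except ValueError:
                    continue  # this format does not fit this window
                before = " ".join(tokens[:start])
                after = " ".join(tokens[start + size:])
                return parsed.date(), before, after
    return None, text, ""

print(split_date('monkey 2010-07-10 love banana'))
print(split_date('monkey Feb 20, 2015 loves 2014 bananas'))
```

Because none of the assumed formats is a bare year, the lone 2014 stays in the non-date text here. With a fuzzy parser that accepts almost anything, the same windowing would need an extra rule for choosing between competing matches.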

Kasravnd · Accepted Answer · 2015-02-21 14:51:48Z

1

You can use a regex to extract the words, and to get rid of month names you can check that each word is not in calendar.month_abbr or calendar.month_name:

>>> import calendar
>>> import re
>>> def word_find(s):
...     # Keep alphabetic words that are not full or abbreviated month names.
...     months = set(calendar.month_name) | set(calendar.month_abbr)
...     return [w for w in re.findall(r'[a-zA-Z]+', s) if w.capitalize() not in months]

Demo:

>>> s1='monkey Feb 20, 2015 loves 2014 bananas'
>>> s2='monkey Feb 20, 2015 loves bananas'
>>> s3='monkey 2/20/2015 loves bananas'
>>> s4='monkey 20 Feb 2015 loves 2014 bananas'
>>> print(word_find(s1))
['monkey', 'loves', 'bananas']
>>> print(word_find(s2))
['monkey', 'loves', 'bananas']
>>> print(word_find(s3))
['monkey', 'loves', 'bananas']
>>> print(word_find(s4))
['monkey', 'loves', 'bananas']

And this:

>>> s5='monkey 20 January 2015 loves 2014 bananas'
>>> print(word_find(s5))
['monkey', 'loves', 'bananas']

edited Feb 21, 2015 at 14:51

answered Feb 21, 2015 at 12:32

Kasravnd


8 Comments

foosion Over a year ago

Consider Feb 20, 2015 or 20 February 2015. I could keep a list of all full and abbreviated month names, but that's tedious (and is "may" a date or not?), especially when parse() can already recognize dates.

foosion Over a year ago

Kasra that's worked on everything I've tried so far.

foosion Over a year ago

Kasra, sorry, was doing a bit more testing, then got distracted trying to again figure out why @ shows up in your response to me and I can't get it in my response to you.

Kasravnd Over a year ago

@foosion ;) It's OK! Because this is my answer, if you leave a comment here I'll get a notification anyway, so there is no need to @!

foosion Over a year ago

If I understand correctly, I see @ for your answers to me, but others don't.


jfs · Accepted Answer · 2015-02-21 13:34:14Z

0

To find dates and times in natural-language text and return their positions in the input, which in turn lets you extract the non-date text:

 #!/usr/bin/env python
 import parsedatetime # $ pip install parsedatetime

 cal = parsedatetime.Calendar()
 for text in ['monkey 2010-07-10 love banana',
              'monkey Feb 20, 2015 loves 2014 bananas']:
     indices = [0]
     for parsed_datetime, type, start, end, matched_text in cal.nlp(text) or []:
         indices.extend((start, end))
         print([parsed_datetime, matched_text])
     indices.append(len(text))
     print([text[i:j] for i, j in zip(indices[::2], indices[1::2])])

Output

[datetime.datetime(2015, 2, 21, 20, 10), '2010']
['monkey ', '-07-10 love banana']
[datetime.datetime(2015, 2, 20, 0, 0), ' Feb 20, 2015']
[datetime.datetime(2015, 2, 21, 20, 14), '2014']
['monkey', ' loves ', ' bananas']

Note: parsedatetime failed to recognize 2010-07-10 as a date in the first string. 2010 (in the first string) and 2014 (in the second) are recognized as times (20:10 and 20:14).
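
The slicing step on its own can be tried without installing parsedatetime. The sketch below only illustrates the indices/zip trick from the code above; the helper name split_around_spans is hypothetical, and the spans are looked up by hand with str.index rather than produced by cal.nlp().

```python
def split_around_spans(text, spans):
    """Return the pieces of `text` that lie outside the (start, end) spans.

    `spans` is assumed to be sorted and non-overlapping.
    """
    indices = [0]
    for start, end in spans:
        indices.extend((start, end))
    indices.append(len(text))
    # Pair up boundaries: (0, start1), (end1, start2), ..., (endN, len(text)).
    return [text[i:j] for i, j in zip(indices[::2], indices[1::2])]

text = 'monkey Feb 20, 2015 loves 2014 bananas'
# Stand-in for the matches a date finder would report (assumed, not computed
# by a real parser): locate each matched substring and record its span.
matches = [' Feb 20, 2015', '2014']
spans = [(text.index(m), text.index(m) + len(m)) for m in matches]

print(split_around_spans(text, spans))
# ['monkey', ' loves ', ' bananas']
```

Any tool that reports start and end offsets for its matches could feed the same helper, which keeps the text-splitting logic independent of the date parser used.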

answered Feb 21, 2015 at 13:34

jfs


2 Comments

foosion Over a year ago

Doesn't 'failed to recognize' mean parsedatetime is not as good at recognizing valid date strings as dateutil.parser.parse?

jfs Over a year ago

@foosion: it depends on the input. It may be better at parsing human-readable date/time strings e.g., cal.nlp('tomorrow') works but dateutil.parser.parse('tomorrow', fuzzy=True) returns the default (wrong date).
