Given a string with a date in an unknown format and other text, how can I separate the two?

>>> import dateutil.parser as dparser
>>> dparser.parse("monkey 2010-07-10 love banana", fuzzy=True)
datetime.datetime(2010, 7, 10, 0, 0)

This example, from "Extracting date from a string in Python", is a step in the right direction, but what I want is the non-date text, for example:

date = 2010-07-10
str_a = 'monkey', str_b = 'love banana'

If the date string didn't have spaces in it, I could split the string and test each substring, but how about 'monkey Feb 20, 2015 loves 2014 bananas'? 2014 and 2015 would both "pass" parse(), but only one of them is part of a date.

EDIT: There doesn't seem to be any reasonable way to deal with 'monkey Feb 20, 2015 loves 2014 bananas'. That leaves 'monkey Feb 20, 2015 loves bananas', 'monkey 2/20/2015 loves bananas', 'monkey 20 Feb 2015 loves 2014 bananas' and other variants as things parse() can deal with.

  • Why is 2015 a year in your example while 2014 is not? The phrase is nonsense either way. Commented Feb 21, 2015 at 13:05
  • Fair point. Feb 20, 2015 is clearly a date, while 2014 is ambiguous. If you run it through parse(...,fuzzy=True), it considers 2014 hours and minutes. I'll edit the question. Commented Feb 21, 2015 at 13:16
  • Perhaps I should examine the source for parse(). Commented Feb 21, 2015 at 13:21
  • I'd start by trying a date parse at each offset. If just one works, then use that. If two or more offsets work, then you have a new problem. Commented Feb 21, 2015 at 13:23
  • @Skaperen split on spaces and consider any block that "passes" parse() as a date? Or do you mean something else? BTW, for Feb 20, 2015 each offset would pass, but the parts that work would be contiguous. Commented Feb 21, 2015 at 13:49
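
The "try a parse at each offset" idea from the comments could be sketched roughly as follows. This is only an illustration, not how dateutil works: it swaps parse() for the standard library's datetime.strptime with a fixed list of formats, and FORMATS, try_parse, split_on_date and max_tokens are all hypothetical names made up for the example. It prefers the longest run of contiguous tokens that parses, so 'Feb 20, 2015' is picked before a lone '2014' is even considered.

```python
from datetime import datetime

# Hypothetical format list (an assumption): extend it for the styles you expect.
FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%b %d, %Y", "%B %d, %Y", "%d %b %Y", "%d %B %Y"]


def try_parse(chunk):
    """Return a datetime if chunk matches one of FORMATS exactly, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(chunk, fmt)
        except ValueError:
            continue
    return None


def split_on_date(text, max_tokens=3):
    """Return (date, text_before, text_after) for the longest token run that parses."""
    tokens = text.split()
    # Try longer windows first so a full date wins over any shorter fragment.
    for size in range(min(max_tokens, len(tokens)), 0, -1):
        for start in range(len(tokens) - size + 1):
            parsed = try_parse(" ".join(tokens[start:start + size]))
            if parsed is not None:
                before = " ".join(tokens[:start])
                after = " ".join(tokens[start + size:])
                return parsed, before, after
    # Nothing parsed: report no date and hand the text back unchanged.
    return None, text, ""


print(split_on_date("monkey 2010-07-10 love banana"))
print(split_on_date("monkey Feb 20, 2015 loves 2014 bananas"))
```

Because only whole formats are accepted, a bare '2014' never counts as a date here; the price is that any date style missing from the format list goes unrecognized.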

2 Answers

You can use a regex to extract the words, and to get rid of month names you can check that each word is not in calendar.month_abbr or calendar.month_name:

>>> import calendar
>>> import re
>>> def word_find(s):
...     # keep alphabetic words that are neither full nor abbreviated month names
...     return [i for i in re.findall(r'[a-zA-Z]+', s) if i.capitalize() not in calendar.month_name and i.capitalize() not in calendar.month_abbr]

Demo:

>>> s1='monkey Feb 20, 2015 loves 2014 bananas'
>>> s2='monkey Feb 20, 2015 loves bananas'
>>> s3='monkey 2/20/2015 loves bananas'
>>> s4='monkey 20 Feb 2015 loves 2014 bananas'
>>> print(word_find(s1))
['monkey', 'loves', 'bananas']
>>> print(word_find(s2))
['monkey', 'loves', 'bananas']
>>> print(word_find(s3))
['monkey', 'loves', 'bananas']
>>> print(word_find(s4))
['monkey', 'loves', 'bananas']

And with a full month name:

>>> s5='monkey 20 January 2015 loves 2014 bananas'
>>> print(word_find(s5))
['monkey', 'loves', 'bananas']

8 Comments

Consider Feb 20, 2015 or 20 February 2015. I could keep a list of all full and abbreviated month names, but that's tedious (and is "may" part of a date or not?), especially when parse() can already recognize dates.
Kasra that's worked on everything I've tried so far.
Kasra, sorry, was doing a bit more testing, then got distracted trying to again figure out why @ shows up in your response to me and I can't get it in my response to you.
@foosion ;) It's OK! Because this is my answer, I'll get a notification anyway if you leave a comment here, so there is no need for the @!
If I understand correctly, I see @ for your answers to me, but others don't.
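
The "may" concern raised in the comments above can be seen with a small self-contained check. This is a sketch that restates the answer's filter using a set of month names (the months variable is my addition); it shows that an ordinary word which happens to match a month name gets dropped as well.

```python
import calendar
import re


def word_find(s):
    """Same filter as in the answer: keep alphabetic words that are not month names."""
    # Full and abbreviated month names; both sequences also contain '' at index 0.
    months = set(calendar.month_name) | set(calendar.month_abbr)
    return [w for w in re.findall(r'[a-zA-Z]+', s) if w.capitalize() not in months]


# 'may' is used as a verb here, but it is filtered out like a month name.
print(word_find('monkey may love bananas'))  # ['monkey', 'love', 'bananas']
```

So the filter works when month names only ever appear inside dates, but it cannot tell the month "May" from the verb "may".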

To find date/time expressions in natural-language text and get their positions in the input, which in turn lets you extract the non-date text:

 #!/usr/bin/env python
 import parsedatetime # $ pip install parsedatetime

 cal = parsedatetime.Calendar()
 for text in ['monkey 2010-07-10 love banana',
              'monkey Feb 20, 2015 loves 2014 bananas']:
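     indices = [0]  # boundaries of the non-date pieces, starting at the beginning
     # `or []` guards against nlp() returning nothing when no date/time is found
     for parsed_datetime, flags, start, end, matched_text in cal.nlp(text) or []:
         indices.extend((start, end))
         print([parsed_datetime, matched_text])
     indices.append(len(text))
     # pair up the boundaries: the text before, between and after the matches
     print([text[i:j] for i, j in zip(indices[::2], indices[1::2])])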

Output

[datetime.datetime(2015, 2, 21, 20, 10), '2010']
['monkey ', '-07-10 love banana']
[datetime.datetime(2015, 2, 20, 0, 0), ' Feb 20, 2015']
[datetime.datetime(2015, 2, 21, 20, 14), '2014']
['monkey', ' loves ', ' bananas']

Note: parsedatetime failed to recognize 2010-07-10 as a date in the first string. 2010 and 2014 are recognized as times (20:10 and 20:14) in both strings.
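
The index bookkeeping in the loop can also be pulled out into a small helper that works with any list of (start, end) positions, whichever parser produced them. The sketch below uses only the standard library; text_outside_spans is a hypothetical name, and the spans are written out by hand to mirror the matches reported above for the second string rather than obtained by calling parsedatetime. Unlike the zip-based slicing, it skips empty pieces and tolerates unsorted or overlapping spans.

```python
def text_outside_spans(text, spans):
    """Return the pieces of text not covered by any (start, end) span."""
    pieces = []
    cursor = 0  # end of the region consumed so far
    for start, end in sorted(spans):
        if start > cursor:
            pieces.append(text[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(text):
        pieces.append(text[cursor:])
    return pieces


text = 'monkey Feb 20, 2015 loves 2014 bananas'
# Hand-written spans (an assumption): ' Feb 20, 2015' is text[6:19], '2014' is text[26:30].
spans = [(6, 19), (26, 30)]
print(text_outside_spans(text, spans))  # ['monkey', ' loves ', ' bananas']
```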

2 Comments

Doesn't 'failed to recognize' mean parsedatetime is not as good at recognizing valid date strings as dateutil.parser.parse?
@foosion: it depends on the input. It may be better at parsing human-readable date/time strings e.g., cal.nlp('tomorrow') works but dateutil.parser.parse('tomorrow', fuzzy=True) returns the default (wrong date).
