6

I have a string that contains datetimes, and I am trying to split it at each datetime occurrence:

data="2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"

What I am doing:

out=re.split('^(2[0-3]|[01]?[0-9]):([0-5]?[0-9]):([0-5]?[0-9])$',data)

What I get:

["2018-03-14 06:08:18, he went on 2018-03-15 06:08:18, lets play"]

What I want:

["2018-03-14 06:08:18, he went on","2018-03-15 06:08:18, lets play"]
4
  • What is the Python version? Commented Jul 18, 2018 at 7:13
  • The Python version is 3.6.3. Commented Jul 18, 2018 at 7:14
  • Can there be cases when there is no whitespace between the items? Can we assume we want to split on at least 1 whitespace character followed by a date? Commented Jul 18, 2018 at 7:15
  • Well, I meant to suggest something like r'\s+(?=(?:(?:20)?[01]?[0-9])-(?:1[0-2]|0?[0-9])-(?:[0-2]?[0-9]|3[01]))' with split. Commented Jul 18, 2018 at 7:20

2 Answers

6

You want to split on one or more whitespace characters that are followed by a date-like pattern, so you may use

re.split(r'\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)', s)

See the regex demo

Details

  • \s+ - 1+ whitespace chars
  • (?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b) - a positive lookahead that makes sure that, immediately to the right of the current location, there are
    • \d{2}(?:\d{2})? - 2 or 4 digits
    • - - a hyphen
    • \d{1,2} - 1 or 2 digits
    • -\d{1,2} - again a hyphen and 1 or 2 digits
    • \b - a word boundary (if not necessary, remove it, or replace with (?!\d) in case you may have dates glued to letters or other text)

Python demo:

import re
rex = r"\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)"
s = "2018-03-14 06:08:18, he went on 2018-03-15 06:08:18, lets play"
print(re.split(rex, s))
# => ['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']
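The demo string above has no newline, while the string in the question does. A minimal sketch of the same pattern applied to the question's data, assuming the separator is always some mix of spaces and newlines; the name `DATE_BOUNDARY` is only illustrative:

```python
import re

# \s+ consumes the run of whitespace (space and newline alike); the lookahead
# only checks that a date-like token follows, without consuming it.
DATE_BOUNDARY = re.compile(r"\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)")

data = "2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"
parts = DATE_BOUNDARY.split(data)
print(parts)
# => ['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']
```

Because `\s` also matches `\n`, the trailing space and the newline are removed together, and no flag such as `re.DOTALL` is needed.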

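NOTE: If there may be no whitespace before the date, in Python 3.7 and newer you may use r"\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)" with re.split. The * quantifier makes the whitespace optional, so the pattern can produce zero-length matches, which re.split only splits on since Python 3.7. The (?<!\d) lookbehind keeps the pattern from also matching inside a four-digit year (between 20 and 18-03-14), and the result may contain empty strings (for example, when the input starts with a date) that need to be filtered out. For older versions, you will need to use a solution like the one @blhsing suggests, or install the PyPI regex module and use r"(?V1)\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)" with regex.split.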
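A hedged sketch of the zero-width variant for Python 3.7 and newer. The helper name `split_on_dates` is hypothetical, and the `(?<!\d)` guard is this sketch's own assumption: without it, the lookahead can also succeed in the middle of a four-digit year (before `18-03-14` in `2018-03-14`), which would split the year in two.

```python
import re

# Assumption: (?<!\d) is a guard added for this sketch so the lookahead
# cannot fire between "20" and "18-03-14" inside a four-digit year.
ZERO_WIDTH_DATE_BOUNDARY = re.compile(
    r"\s*(?<!\d)(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)"
)

def split_on_dates(text):
    """Split text before every date-like token, with or without whitespace."""
    # Zero-length matches can yield empty strings (e.g. at index 0), so drop them.
    return [part for part in ZERO_WIDTH_DATE_BOUNDARY.split(text) if part]

glued = "2018-03-14 06:08:18, he went on2018-03-15 06:08:18, lets play"
spaced = "2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"
print(split_on_dates(glued))
print(split_on_dates(spaced))
```

Filtering out empty parts is a design choice for this sketch; if empty fields are meaningful in your data, keep them and handle the leading one explicitly.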


4

re.split is meant for cases where you have a certain delimiter pattern. Use re.findall with a lookahead pattern instead:

import re
data="2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"
d = r'\d{4}-\d?\d-\d?\d (?:2[0-3]|[01]?[0-9]):[0-5]?[0-9]:[0-5]?[0-9]'
print(re.findall(r'{0}.*?(?=\s*{0}|$)'.format(d), data, re.DOTALL))

This outputs:

['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']

8 Comments

Note that a lazy dot with a lookahead might be too resource consuming since the lookahead pattern is checked after each char after the subpattern before the lazy dot. If the requirement is to split with 1 or more whitespaces that are followed with something like a date, re.split(r'\s+(?=\d{2}(?:\d{2})?-\d{1,2}-\d{1,2}\b)', s) might be a better choice.
@blhsing It returns only the last occurrence in my actual data.
@pyd I see. In case you have a '\n' in the string you just need to add an re.DOTALL flag to findall. I've updated my answer accordingly then.
Thank you for the answer @blhsing
@pyd You're welcome. In fact, if there's always a '\n' before each date/time, you might as well use str.split('\n') to get what you want.
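A minimal sketch of that last suggestion, assuming every entry really does start on its own line. The `strip()` call is an addition here, since a plain `split('\n')` would keep the trailing space after "he went on":

```python
data = "2018-03-14 06:08:18, he went on \n2018-03-15 06:08:18, lets play"

# Split on newlines, trim surrounding whitespace, and skip blank lines.
entries = [line.strip() for line in data.split("\n") if line.strip()]
print(entries)
# => ['2018-03-14 06:08:18, he went on', '2018-03-15 06:08:18, lets play']
```

This only works when a message never contains a newline of its own; otherwise the regex-based split is the safer choice.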