Split string with regex not working

Question

I'm trying to split a big file with a regex. The problem is that I want to keep the delimiter in the text after the split, so I tried adding ?= at the beginning of the regex, but then it doesn't split at all. I tried the modified regex in Sublime, and it works there.

Text is like this:

Aug 07, 2014 01:01:01 PM
some text
Aug 07, 2014 02:02:02 PM

So: a date, then some text, then another date. I want to split the text with a regex that recognizes that date.

First version of the regex, which works perfectly for my purpose:

\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].

Code in Python is this:

allparts = re.compile(r'\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].').split(alltext)

After adding ?=, it looks like this:

allparts2 =re.compile(r'(?=\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

What am I doing wrong in the second version?

What about : (?=\w{3} \d{2}, \d{4}, [\d:]+ (?=AM|PM))

– Daneo, Commented May 14, 2014 at 12:14
I can't find anything with that.

– ante003, Commented May 14, 2014 at 12:21
What do your allparts / allparts2 return in each case?

– M'vy, Commented May 14, 2014 at 14:19

AdamK · Accepted Answer · 2014-05-14 13:30:15Z

1

Sorry, my first answer was wrong :) Try not adding ?=; just put the pattern in parentheses, like this:

allparts2 =re.compile(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

Then try it without compile...

allparts2 = re.split('(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM " 

allparts2 = re.split('(?=\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)
print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split('(?:\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', ' some text ', ' another text ', ' ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split('(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

Just to compare different forms.
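
The lookahead result above was produced with Python 2.7. As far as I know, re.split() has also split on empty matches since Python 3.7, so the (?=...) form may behave differently on newer interpreters. A minimal sketch of a version-independent alternative is to find where each date starts and slice the string there. The helper name split_keep_delimiter is hypothetical, and the pattern uses (?:AM|PM) instead of [AM|PM]. as an assumption about what the date format should match:

```python
import re

# Same sample text as in the examples above.
alltext = ("Aug 07, 2014 01:01:01 PM some text "
           "Aug 07, 2014 02:02:02 PM another text "
           "Aug 07, 2014 03:03:03 AM ")

# Date pattern, with the AM/PM part written as a non-capturing alternation.
date_re = re.compile(r'\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM)')


def split_keep_delimiter(pattern, text):
    """Hypothetical helper: cut text at the start of every match."""
    starts = [m.start() for m in pattern.finditer(text)]
    if not starts:
        return [text]
    chunks = []
    if starts[0] > 0:
        chunks.append(text[:starts[0]])  # text before the first date
    bounds = starts + [len(text)]
    for begin, end in zip(bounds, bounds[1:]):
        chunks.append(text[begin:end])   # each chunk starts with its date
    return chunks


parts = split_keep_delimiter(date_re, alltext)
print(parts)
```

Each element of parts starts with its date, and joining the elements gives back the original string.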

edited May 14, 2014 at 13:30

answered May 14, 2014 at 12:06

AdamK

3 Comments

ante003 Over a year ago

I don't know where the problem is. I tried your code standalone, and it works fine. But when I copy the regex into my script, it doesn't work there. Could it be because my string is data from a file? I opened the file and saved everything returned by the read() method.

AdamK Over a year ago

It would also help if you print out the string which contains the data from file, to see what exactly is in it.

ante003 Over a year ago

It is just data = openFile.read(). But I solved my problem without this by simply writing my own split method. Your method worked on another file, though, so I will use that regex there. Thanks.

Daneo · Accepted Answer · 2014-05-14 12:39:11Z

0

Although I am unfamiliar with the Python flavour, Pythex gives me the following results, which I assume are correct:

See the result

Even if they are not, there are several things in your regex that are unnecessary and/or incorrect, to my knowledge.

A comma does not need to be escaped.
Alternation is not written with square brackets, as in [cond1|cond2] (that is a character class, which matches a single character), but with parentheses: (cond1|cond2).
The \s you use matches any whitespace character (a space, a tab, a carriage return, ...). That is fine if you want to allow all of those; if you only expect a space, a literal space is enough.

Lastly, the ?= you are adding makes the group a lookahead, which checks for the text without consuming it. ?: makes the group match and consume the text, but does not turn it into a capture group.

Try this regex : (?:\w{3} \d{2}, \d{4}, [\d:]+ (?:AM|PM))
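
To illustrate the point about square brackets versus parentheses, here is a small sketch. The sample strings are made up for the demonstration and are not from the question's file:

```python
import re

# [AM|PM] is a character class: it matches ONE character out of A, M, | or P.
# The trailing '.' then matches any single character after it.
char_class = re.compile(r'\d{2}\s[AM|PM].')

# (?:AM|PM) is an alternation: it matches the two-letter string AM or PM.
alternation = re.compile(r'\d{2}\s(?:AM|PM)')

sample = "01:01:01 PM"
bogus = "01:01:01 MA"   # made-up input that is not a valid AM/PM marker

class_on_sample = bool(char_class.search(sample))
class_on_bogus = bool(char_class.search(bogus))
alt_on_sample = bool(alternation.search(sample))
alt_on_bogus = bool(alternation.search(bogus))

print(class_on_sample, class_on_bogus)  # the character class accepts both
print(alt_on_sample, alt_on_bogus)      # the alternation accepts only "PM"
```

The character-class version happens to work on well-formed dates, but it also accepts strings such as "MA", which is why the alternation is the safer form.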

edited May 14, 2014 at 12:39

answered May 14, 2014 at 12:29

Daneo

1 Comment

ante003 Over a year ago

Nothing. Still, the delimiter is missing after split.

M'vy · Accepted Answer · 2014-05-14 15:51:40Z

0

It seems that python's re.split() doesn't split on zero-length matches.

However, the manual says

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

...

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string.

So you can use:

allparts2 = re.compile(r'(\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s(?:AM|PM))').split(alltext)

Here the whole matching expression is wrapped in a capturing group (note also the non-capturing group at the end). The result is:

['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

You can then create your files by grouping allparts2[1] with allparts2[2], and so on (indices 2n+1 and 2n+2).
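
The pairing of indices 2n+1 and 2n+2 could be sketched as follows, assuming the same sample text used in the other answer. The name records is only illustrative:

```python
import re

alltext = ("Aug 07, 2014 01:01:01 PM some text "
           "Aug 07, 2014 02:02:02 PM another text "
           "Aug 07, 2014 03:03:03 AM ")

# Capturing group around the whole date, so split() keeps the dates.
pattern = re.compile(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM))')
allparts = pattern.split(alltext)

# allparts[0] is whatever precedes the first date (empty here). After that
# the list alternates: date, text, date, text, ...
dates = allparts[1::2]    # indices 1, 3, 5, ... (2n+1)
bodies = allparts[2::2]   # indices 2, 4, 6, ... (2n+2)

# Glue every date back onto the text that follows it.
records = [date + body for date, body in zip(dates, bodies)]
print(records)
```

Each entry in records is one date together with the text up to the next date, which can then be written to its own file.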

edited May 14, 2014 at 15:51

answered May 14, 2014 at 12:28

M'vy

3 Comments

ante003 Over a year ago

Sorry, I added a ',' after the year by mistake. It should be without the ','.

ante003 Over a year ago

And still, this finds all the lines, but the split deletes them.

M'vy Over a year ago

Here is another shot at it. I removed the correction since it was just a typo. I kept the corrected alternation and removed the trailing '.'.
