I'm trying to split a big file with a regex. The problem is that I want to keep the delimiter in the text after the split. I tried adding ?= at the beginning of the regex, but then it doesn't split at all. I tried the modified regex in Sublime, and it works there.

Text is like this:

Aug 07, 2014 01:01:01 PM
some text
Aug 07, 2014 02:02:02 PM


So: a date, then some text, then another date. I want to split the text with a regex that recognizes the date.

First version of the regex, which works perfectly for my purpose:

\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].

Code in Python is this:

allparts = re.compile(r'\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].').split(alltext)

After adding ?=, it looks like this:

allparts2 = re.compile(r'(?=\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

What am I doing wrong in the second version?

3
  • What about : (?=\w{3} \d{2}, \d{4}, [\d:]+ (?=AM|PM)) Commented May 14, 2014 at 12:14
  • I can't find anything with that. Commented May 14, 2014 at 12:21
  • What do your allparts / allparts2 return in each case? Commented May 14, 2014 at 14:19

3 Answers

1

Sorry, my first answer was wrong :) Instead of adding ?=, just wrap the pattern in parentheses (a capturing group), like this:

allparts2 = re.compile(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)').split(alltext)

Then try it without compile...

allparts2 = re.split(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM " 

allparts2 = re.split(r'(?=\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)
print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split(r'(?:\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', ' some text ', ' another text ', ' ']

When using:

#!/usr/local/bin/python2.7
import re

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text Aug 07, 2014 03:03:03 AM "


allparts2 = re.split(r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s[AM|PM].)', alltext)

print(allparts2)

Result was:

Executing the program....
$python2.7 main.py
['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

Just to compare different forms.
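
If the goal is chunks that each begin with their own date, one version-independent option is to locate the match positions with re.finditer and slice the string yourself. This is only a sketch, not something from the question: the helper name split_keep_delimiter is made up for illustration, and the pattern uses (?:AM|PM) in place of the original [AM|PM]. tail.

```python
import re

# Assumed pattern: same shape as in the question, but with a real
# alternation (?:AM|PM) instead of the character class [AM|PM].
DATE = r'\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM)'

alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text "


def split_keep_delimiter(text):
    """Hypothetical helper: cut text before each date, keeping the date."""
    starts = [m.start() for m in re.finditer(DATE, text)]
    if not starts:
        return [text]
    bounds = starts + [len(text)]
    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    # Any text before the first date becomes its own leading chunk.
    if starts[0] > 0:
        chunks.insert(0, text[:starts[0]])
    return chunks


parts = split_keep_delimiter(alltext)
print(parts)
```

Because nothing is discarded, joining the chunks gives back the original string, which is an easy sanity check on real file data.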

3 Comments

I don't know where the problem is. I tried your code standalone and it works fine, but when I copy the regex into my script, it doesn't work there. Could it be because my string is data read from a file? I opened the file and stored everything returned by the read() method.
It would also help if you print out the string which contains the data from file, to see what exactly is in it.
It is just data = openFile.read(). But I solved my problem without this by simply writing my own split method. Your method worked on another file, though, so I will use that regex there. Thanks.
0

Although I am unfamiliar with the Python flavour, Pythex gives me the following results, which I assume are correct:

See the result

Even if these are not, there are several things in your regex which are unnecessary and/or incorrect, to my knowledge.

  • A comma does not need to be escaped
  • An alternation is not written with square brackets, [cond1|cond2], which define a character class, but with parentheses: (cond1|cond2)
  • The \s is optional: \s matches any whitespace character, which is correct if you want to match e.g. a space, a tab, or a carriage return; if only a space can occur, a literal space is enough
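
To make the second point concrete, here is a small sketch comparing the two forms. The variable names are mine, chosen only for illustration.

```python
import re

# [AM|PM] is a character class: exactly ONE character out of A, M, | and P.
char_class = re.compile(r'[AM|PM]')
# (?:AM|PM) is an alternation: the two-character string AM or PM.
alternation = re.compile(r'(?:AM|PM)')
# The tail used in the question: one class character plus any character.
loose_tail = re.compile(r'[AM|PM].')

class_accepts_pipe = char_class.fullmatch('|') is not None
class_accepts_pm = char_class.fullmatch('PM') is not None
alt_accepts_pm = alternation.fullmatch('PM') is not None
alt_accepts_pipe = alternation.fullmatch('|') is not None
tail_accepts_pm = loose_tail.fullmatch('PM') is not None
tail_accepts_junk = loose_tail.fullmatch('|x') is not None

print(class_accepts_pipe, class_accepts_pm)
print(alt_accepts_pm, alt_accepts_pipe)
print(tail_accepts_pm, tail_accepts_junk)
```

So `[AM|PM].` does match "AM" and "PM", which is why the original pattern appears to work, but it also accepts strings that are not a valid AM/PM marker.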

Lastly, the ?= you are adding turns the group into a lookahead, which only checks that the text follows and consumes nothing. ?: makes the group match and consume the text, but does not turn it into a capture group.
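
The difference is easy to see from the span of a match. A minimal sketch, using a short made-up sample string rather than the data from the question:

```python
import re

text = "some text Aug 07"

# Non-capturing group: matches and consumes "Aug", creates no capture group.
consuming = re.search(r'(?:Aug)', text)
# Lookahead: succeeds where "Aug" follows, but the match itself is empty.
lookahead = re.search(r'(?=Aug)', text)

print(consuming.span(), repr(consuming.group()), consuming.groups())
print(lookahead.span(), repr(lookahead.group()), lookahead.groups())
```

Both matches start at the same position, but the lookahead match has zero length, which is exactly the kind of match a split has to handle specially.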

Try this regex : (?:\w{3} \d{2}, \d{4}, [\d:]+ (?:AM|PM))

1 Comment

Nothing. Still, the delimiter is missing after split.
0

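It seems that Python's re.split() doesn't split on zero-length matches, which is all a lookahead produces. (That was the behaviour in Python 2 and in Python 3 before 3.7; support for splitting on empty matches was added in 3.7.)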

However, the manual says

If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.

...

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string.

So you can use:

allparts2 = re.compile(r'(\w{3}\s\d{2}\,\s\d{4}\s\d{1,2}\:\d{2}\:\d{2}\s(?:AM|PM))').split(alltext)

Here the whole matching expression is wrapped in a capturing group (also notice the non-capturing group at the end). The result is:

['', 'Aug 07, 2014 01:01:01 PM', ' some text ', 'Aug 07, 2014 02:02:02 PM', ' another text ', 'Aug 07, 2014 03:03:03 AM', ' ']

You can then create your files by pairing allparts2[1] with allparts2[2], allparts2[3] with allparts2[4], and so on (indices 2n+1 and 2n+2).
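
The pairing step could look roughly like this. It is a sketch under the assumption that each entry should be the date followed by its text; the names DATE and entries are illustrative, not from the question.

```python
import re

# The whole date pattern is one capturing group, so re.split keeps the dates.
DATE = r'(\w{3}\s\d{2},\s\d{4}\s\d{1,2}:\d{2}:\d{2}\s(?:AM|PM))'
alltext = "Aug 07, 2014 01:01:01 PM some text Aug 07, 2014 02:02:02 PM another text "

allparts = re.split(DATE, alltext)

# allparts[0] is whatever precedes the first date (empty here). After it,
# dates sit at odd indices and the text following each date at even indices.
entries = [date + body for date, body in zip(allparts[1::2], allparts[2::2])]
print(entries)
```

Each item in entries can then be written to its own file, and any text in allparts[0] can be handled separately if the input does not start with a date.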

3 Comments

Sorry, I added a ',' after the year by mistake. It should be without the ','.
And still, this finds all the date lines, but the split deletes them.
Here is another shot at it. I removed the correction since it was just a typo. I still left the corrected alternation and removed the trailing '.' from the pattern.
