Python Regex Simple Split - Empty at first index

Question

I have a string that looks like:

test = '20170125NBCNightlyNews'

I am trying to split it into two parts: the digits and the name. The format will always be [date][show]. The date has no separators and is digits only, in YYYYMMDD order (I don't think that matters).

I am trying to use re. I have a working version:

re.split(r'(\d+)', test)

Simple enough, this gives me the values I need in a list.

['', '20170125', 'NBCNightlyNews']

However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.

I also tried telling it to match the beginning of the string as well, and got the same results.

>>> re.split(r'(^\d+)', test)
['', '20170125', 'NBCNightlyNews']
>>> re.split(r'^(\d+)', test)
['', '20170125', 'NBCNightlyNews']

Does anyone have any input as to why this is there / how I can avoid the empty string?

I did. I was confused as to why it would create a group; I understand better now from ssc's answer. Thank you.
– Busturdust, commented Jan 31, 2017 at 19:31

TemporalWolf · Accepted Answer · 2017-01-31 19:33:51Z

4

Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse re.split to parse this data:

test[:8], test[8:]

Will split your strings just fine.
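A minimal sketch of the slicing approach wrapped in a helper, assuming the date is always the first eight characters. The names `split_date_show` and `date_len` are hypothetical and only serve this illustration:

```python
def split_date_show(name, date_len=8):
    """Split '[date][show]' into (date, show) by position.

    `split_date_show` and `date_len` are hypothetical names for this sketch.
    """
    date, show = name[:date_len], name[date_len:]
    # Guard against input that does not follow the assumed format.
    if len(date) != date_len or not date.isdigit():
        raise ValueError(f'expected {date_len} leading digits in {name!r}')
    return date, show


print(split_date_show('20170125NBCNightlyNews'))  # ('20170125', 'NBCNightlyNews')
```

The check is optional; plain slicing never fails, so without it a malformed name would silently produce a wrong date.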


2 Comments

Busturdust Over a year ago

You are right. This is seemingly my best approach. I've been looking for chances to practice and use re, but I guess I overthought this one a little bit.

TemporalWolf Over a year ago

@Busturdust It's a really common approach, just gotta remember KISS

ssc-hrep3 · Accepted Answer · 2017-01-31 19:30:51Z

3

What you are actually doing with re.split('(^\d+)', test) is splitting your test string on a run of one or more digits at the start of the string.

So, if you have

test = '20170125NBCNightlyNews'

This is happening:

 20170125 NBCNightlyNews
 ^^^^^^^^

The string is split into three parts: everything before the number, the number itself, and everything after the number.

Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.

re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']

re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
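A minimal sketch of the same behaviour on the original string, assuming Python 3 and raw-string patterns (the variable names are illustrative only). It compares a capturing separator with a non-capturing one:

```python
import re

test = '20170125NBCNightlyNews'

# The separator (a run of digits) matches at index 0, so the piece
# "before" it is the empty string.
with_group = re.split(r'(\d+)', test)

# Without the capturing group the separator itself is dropped,
# but the empty leading piece is still there.
without_group = re.split(r'\d+', test)

print(with_group)     # ['', '20170125', 'NBCNightlyNews']
print(without_group)  # ['', 'NBCNightlyNews']
```

In both cases the empty string comes from the position of the match, not from the group; the group only decides whether the separator is kept in the result.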

edited Jan 31, 2017 at 19:30

answered Jan 31, 2017 at 19:29


2 Comments

Busturdust Over a year ago

This made sense to me, thank you. The way the docs are written, I thought maybe there was something like an invisible "digit" before the string, but the expression "before, the number, and after" helped me understand.

ssc-hrep3 Over a year ago

@Busturdust I've added an example with a whitespace character, maybe it is even more clear with that :)

anubhava · Accepted Answer · 2017-01-31 19:29:03Z

2

You're getting an empty result at the beginning because your input string starts with digits and you're splitting it on digits only. Hence you get an empty string, which is what comes before the first run of digits.

To avoid that you can use filter:

>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']
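The same idea can be sketched as a list comprehension, which behaves identically on Python 2 and Python 3 (in Python 3, `filter` returns a lazy iterator rather than a list). The variable name `parts` is illustrative only:

```python
import re

test = '20170125NBCNightlyNews'

# Keep only the non-empty pieces produced by the split.
parts = [piece for piece in re.split(r'(\d+)', test) if piece]
print(parts)  # ['20170125', 'NBCNightlyNews']
```

Note that this drops every empty piece, so the positions of separators in the result are no longer predictable.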


1 Comment

TemporalWolf Over a year ago

Fixing the output from abusing the re.split function would not be my recommended fix.

Gabriel Reiser · Accepted Answer · 2017-01-31 19:34:38Z

2

Why re.split when you can just match and get the groups?...

import re

test = '20170125NBCNightlyNews'
# Raw string so that \d and \w reach the regex engine unchanged.
pattern = re.compile(r'(\d+)(\w+)')

result = pattern.match(test)
result.group(1)  # for the date part
result.group(2)  # for the show name

I realize now the intention was to parse the text, not fix the regex usage. I'm with the others: you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.
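If a regex is used anyway, one thing to keep in mind is that a failed match returns None rather than raising. A minimal sketch with named groups, assuming an eight-digit date; the names `PATTERN` and `parse` are hypothetical:

```python
import re

# Hypothetical pattern for this sketch: exactly eight digits, then the rest.
PATTERN = re.compile(r'(?P<date>\d{8})(?P<show>.+)')


def parse(name):
    match = PATTERN.fullmatch(name)
    if match is None:
        # The input does not follow the [date][show] format.
        return None
    return match.group('date'), match.group('show')


print(parse('20170125NBCNightlyNews'))  # ('20170125', 'NBCNightlyNews')
print(parse('NBCNightlyNews'))          # None
```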


masual · Accepted Answer · 2017-01-31 19:34:59Z

2

From the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.

So if you have:

test = 'test20170125NBCNightlyNews'

The indexes would remain unaffected:

>>> re.split(r'(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
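A minimal sketch of what "the same relative indices" means in practice, assuming a pattern with exactly one capturing group. The helper name `split_parts` is hypothetical:

```python
import re


def split_parts(text):
    """Return (separators, between) for a split on runs of digits."""
    parts = re.split(r'(\d+)', text)
    # With one capturing group, captured separators sit at odd indices
    # and the text between them at even indices.
    return parts[1::2], parts[0::2]


print(split_parts('20170125NBCNightlyNews'))
# (['20170125'], ['', 'NBCNightlyNews'])
print(split_parts('test20170125NBCNightlyNews'))
# (['20170125'], ['test', 'NBCNightlyNews'])
```

Because the leading empty string is kept, the date is at index 1 of the split result whether or not anything precedes it.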


fyrescyon · Accepted Answer · 2017-01-31 19:42:05Z

1

If the date is always 8 digits long, I would access the substrings directly (without using regex):

>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']

If the length of the date might vary, I would use:

>>> s = re.search(r'^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
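One detail worth noting about the variable-length pattern: `\d*` also matches zero digits, so the search succeeds even when there is no date. A minimal sketch of the difference, using a made-up input `no_date` without leading digits:

```python
import re

no_date = 'NBCNightlyNews'

# \d* may match zero digits, so the search succeeds with an empty date.
loose = re.search(r'^(\d*)(.*)$', no_date)
print(loose.groups())  # ('', 'NBCNightlyNews')

# \d+ requires at least one digit, so a missing date yields None.
strict = re.search(r'^(\d+)(.*)$', no_date)
print(strict)  # None
```

Which variant fits depends on whether a missing date should be treated as an empty value or as an error.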

