3

I have a String that looks like

test = '20170125NBCNightlyNews'

I am trying to split it into two parts: the digits and the name. The format will always be [date][show]; the date has no separators and consists only of digits in YYYYMMDD order (I don't think that matters).

I am trying to use re, and I have a working version:

re.split('(\d+)',test)

Simple enough, this gives me the values I need in a list.

['', '20170125', 'NBCNightlyNews']

However, as you will note, there is an empty string in the first position. I could theoretically just ignore it, but I want to learn why it is there in the first place, and if/how I can avoid it.

I also tried telling it to match the beginning of the string as well, and got the same results.

>>> re.split('(^\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>> re.split('^(\d+)',test)
['', '20170125', 'NBCNightlyNews']
>>>

Does anyone have any input as to why this is there / how I can avoid the empty string?

2
  • Did you try reading the docs? They explain this. Commented Jan 31, 2017 at 19:26
  • I did. I was confused as to why it would create a group; I understand better now thanks to ssc's answer. Thank you. Commented Jan 31, 2017 at 19:31

6 Answers

4

Other answers have explained why what you're doing does what it does, but if you have a constant format for the date, there is no reason to abuse re.split to parse this data:

test[:8], test[8:]

Will split your strings just fine.

2 Comments

You are right. This is seemingly my best approach. I've been looking for chances to practice and use re, but I guess I overthought this one a little bit.
@Busturdust It's a really common approach, just gotta remember KISS
3

What you are actually doing with re.split('(^\d+)', test) is splitting your test string at every match of the pattern, that is, at a run of one or more digits (here anchored to the start of the string).

So, if you have

test = '20170125NBCNightlyNews'

This is happening:

 20170125 NBCNightlyNews
 ^^^^^^^^

The string is split into three parts, everything before the number, the number itself and everything after the number.


Maybe it is easier to understand if you have a sentence of words, separated by a whitespace character.

re.split(' ', 'this is a house')
=> ['this', 'is', 'a', 'house']

re.split(' ', ' is a house')
=> ['', 'is', 'a', 'house']
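
The same thing can be checked for the digit case from the question. This is just a minimal sketch, assuming Python 3; the variable names parts and no_group are only for illustration:

```python
import re

test = '20170125NBCNightlyNews'

# The separator (\d+) matches at index 0, so the piece "before" it is empty.
# Because the separator is in a capturing group, it is kept in the result.
parts = re.split(r'(\d+)', test)
print(parts)     # ['', '20170125', 'NBCNightlyNews']

# Without the capturing group the separator is dropped, but the empty
# piece in front of it is still there.
no_group = re.split(r'\d+', test)
print(no_group)  # ['', 'NBCNightlyNews']
```

So the empty string comes from the split position, not from the group.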

2 Comments

This made sense to me, thank you. The way the docs are written, I thought maybe there was some kind of invisible "digit" before the string, but the expression "before, the number, and after" helped me understand.
@Busturdust I've added an example with a whitespace character, maybe it is even more clear with that :)
2

You're getting an empty result at the beginning because your input string starts with digits and you're splitting it by digits only. Hence you get an empty string, which is whatever comes before the first set of digits.

To avoid that you can use filter:

>>> print(list(filter(None, re.split(r'(\d+)', test))))
['20170125', 'NBCNightlyNews']

1 Comment

Fixing the output from abusing the re.split function would not be my recommended fix.
2

Why use re.split when you can just match and get the groups?

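import re

test = '20170125NBCNightlyNews'
# One group for the leading digits, one for the word characters that follow.
pattern = re.compile(r'(\d+)(\w+)')

result = pattern.match(test)  # match() only matches at the start of the string
result.group(1)  # the date part: '20170125'
result.group(2)  # the show name: 'NBCNightlyNews'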

I realize now the intention was to parse the text, not fix the regex usage. I'm with the others, you shouldn't use regex for this simple task when you already know the format won't change and the date is fixed size and will always be first. Just use string indexing.

2

From the documentation:

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string. That way, separator components are always found at the same relative indices within the result list.

So if you have:

test = 'test20170125NBCNightlyNews'

The indexes would remain unaffected:

>>> re.split('(\d+)', test)
['test', '20170125', 'NBCNightlyNews']
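
To make the "same relative indices" part of the documentation more visible, here is a small sketch, assuming Python 3. The input with two digit runs is made up for the example and is not from the question:

```python
import re

# Hypothetical input with two digit runs, so the index pattern shows up.
parts = re.split(r'(\d+)', '20170125NBC7News')
print(parts)       # ['', '20170125', 'NBC', '7', 'News']

separators = parts[1::2]  # odd indices: the captured digit runs
between = parts[0::2]     # even indices: the text around them
print(separators)  # ['20170125', '7']
print(between)     # ['', 'NBC', 'News']
```

Because of the leading empty string, the captured separators always sit at the odd indices, whether or not the string starts with a match.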

1

If the date is always 8 digits long, I would access the substrings directly (without using regex):

>>> [test[:8], test[8:]]
['20170125', 'NBCNightlyNews']

If the length of the date might vary, I would use:

>>> s = re.search('^(\d*)(.*)$', test)
>>> [s.group(1), s.group(2)]
['20170125', 'NBCNightlyNews']
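
If the input might not start with digits at all, the result of the match should probably be checked before the groups are used. A rough sketch, assuming Python 3; split_date_show is a hypothetical helper name, and it uses \d+ instead of \d* so that a missing date counts as no match:

```python
import re

def split_date_show(text):
    """Return (date, show), or None if text does not start with digits."""
    # re.match only matches at the start of the string, so no ^ is needed.
    m = re.match(r'(?P<date>\d+)(?P<show>.*)', text)
    if m is None:
        return None
    return m.group('date'), m.group('show')

print(split_date_show('20170125NBCNightlyNews'))  # ('20170125', 'NBCNightlyNews')
print(split_date_show('NBCNightlyNews'))          # None
```

The named groups are optional; group(1) and group(2) would work the same way.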
