Python regex split without empty string

Question

I have the following file names that exhibit this pattern:

000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...

I want to extract the two timestamps that sit between the second underscore '_' and '.txt'. So I used the following Python regex split:

time_info = re.split('^[0-9]+_[LU]_|-|\.txt$', f)

But this gives me two extra empty strings in the returned list:

time_info=['', '20111007T084734', '20111008T023142', '']

How do I get only the two timestamps? That is, I want:

time_info=['20111007T084734', '20111008T023142']

aksh1618 · Accepted Answer · 2019-03-09 00:34:27Z

28

I'm no Python expert but maybe you could just remove the empty strings from your list?

str_list = re.split('^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)
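The empty strings appear because the pattern matches at the very start and the very end of the file name, so re.split() also reports the (empty) text before the first separator and after the last one. A minimal sketch of the filtering approach, assuming Python 3 (where filter() is lazy and needs list() around it); the raw-string prefix is an addition here so that \. is not treated as an invalid string escape:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# The pattern matches at position 0 and again at the end of the string,
# so re.split() emits an empty string on each side of the result.
parts = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)

# filter(None, ...) keeps only truthy items, which drops the empty strings.
time_info = list(filter(None, parts))

print(parts)      # ['', '20111007T084734', '20111008T023142', '']
print(time_info)  # ['20111007T084734', '20111008T023142']
```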

edited Mar 9, 2019 at 0:34

aksh1618

answered May 30, 2013 at 16:06

Elliot Bonneville

3 Comments

tonga Over a year ago

This works. Thanks. I wonder if there is any one-pass solution using re.split() function.

FraggaMuffin Over a year ago

@tonga there is, but it's less pretty: time_info = [x for x in re.split('^[0-9]+_[LU]_|-|\.txt$', f) if x]

Joshua M Over a year ago

Since filter() returns a filter object, you need to use list() afterwards: time_info = list(filter(None, str_list))

JAB · Accepted Answer · 2013-05-30 16:22:16Z

23

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.

>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')

You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
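As a rough illustration of the named-group variant, the sketch below uses the group names start and end, which are placeholders chosen here rather than names from the answer. It also guards against re.search() returning None, which happens when a file name does not fit the pattern:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# 'start' and 'end' are illustrative group names, not from the original answer.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)

if match is not None:  # re.search() returns None when nothing matches
    time_info = match.groupdict()
    print(time_info)
    # {'start': '20111007T084734', 'end': '20111008T023142'}
```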

edited May 30, 2013 at 16:22

answered May 30, 2013 at 16:12

JAB

4 Comments

Elazar Over a year ago

It's a shame split doesn't have a "no empty strings" option.

JAB Over a year ago

@Elazar Not really, it's just a matter of how re.split() is implemented and what its intended purpose is. In cases like this, it makes more sense to build a pattern for the desired data than to build one to match everything that isn't desired. (Though str.split() actually does drop empty strings when the separator is unspecified or None.)
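The difference mentioned in the parenthetical can be seen with a small sketch; the sample string is made up for illustration:

```python
import re

text = '  a  b  '  # hypothetical input with leading and trailing spaces

# No separator given: runs of whitespace are collapsed and no
# empty strings are produced.
no_sep = text.split()

# Explicit separator: every single space splits, so empty strings remain.
explicit_sep = text.split(' ')

# re.split() keeps the empty strings at both ends as well.
regex_split = re.split(r'\s+', text)

print(no_sep)        # ['a', 'b']
print(explicit_sep)  # ['', '', 'a', '', 'b', '', '']
print(regex_split)   # ['', 'a', 'b', '']
```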

Elazar Over a year ago

The way re.split() is implemented should have nothing to do with its external behavior.

JAB Over a year ago

Nowhere in the Python documentation does it say that re.split() must function exactly like str.split() in how it handles empty strings. The only explicit, non-example mention of empty strings in the result is that captured separators at the start or end will be accompanied by an empty string to ensure consistency for relative indexing.

PipperChip · Accepted Answer · 2020-05-05 14:46:08Z

4

Since this came up on Google, and for completeness: try using re.findall as an alternative!

This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.

Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue: your list of matches suddenly has zero-length strings in it and you don't want that.
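The answer gives no code, so here is one possible sketch of the re.findall idea applied to the file names from the question. Both patterns are assumptions about the timestamp shape (8 digits, a literal T, 6 digits) and are not taken from the answer:

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Describe the wanted data directly: 8 digits, 'T', then 6 digits.
time_info = re.findall(r'\d{8}T\d{6}', f)

# Lookaround variant: a run of digits/'T' that is preceded by '_' or '-'
# and followed by '-' or '.'; the delimiters themselves are not captured.
time_info_lookaround = re.findall(r'(?<=[_-])[0-9T]+(?=[-.])', f)

print(time_info)             # ['20111007T084734', '20111008T023142']
print(time_info_lookaround)  # ['20111007T084734', '20111008T023142']
```

Because findall returns only the matched text, no empty strings need to be filtered out afterwards.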

answered May 5, 2020 at 14:46

PipperChip

1 Comment

Stuart Axon Over a year ago

This is almost certainly the right answer; I think the user wants to split by regex and doesn't necessarily care whether they use the 'split' API.

Ashwini Chaudhary · Accepted Answer · 2013-05-30 16:10:05Z

3

If the timestamps are always after the second _ then you can use str.split and str.strip:

>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs.strip(".txt").split("_",2)[-1].split("-")
['20111007T084734', '20111008T023142']
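One caveat worth knowing: str.strip() treats its argument as a set of characters to remove from both ends, not as a suffix. It happens to work for these file names, but a name that begins or ends with any of '.', 't' or 'x' would be trimmed too far. A hedged sketch of a safer variant, using a made-up file name to show the pitfall:

```python
strs = "000014_L_20111007T084734-20111008T023142.txt"

# str.strip() removes characters, not a suffix: this hypothetical name
# loses its leading "txt" along with the extension.
over_stripped = "txt_L_a-b.txt".strip(".txt")

# Slicing off the suffix explicitly avoids that surprise.
suffix = ".txt"
stem = strs[:-len(suffix)] if strs.endswith(suffix) else strs
time_info = stem.split("_", 2)[-1].split("-")

print(over_stripped)  # _L_a-b
print(time_info)      # ['20111007T084734', '20111008T023142']
```

On Python 3.9 and later, strs.removesuffix(".txt") does the same suffix removal in one call.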

answered May 30, 2013 at 16:10

Ashwini Chaudhary

3 Comments

Elazar Over a year ago

I love doing these things without REs. I don't know why.

tonga Over a year ago

@Ashwini: Thanks. This works. But how can I do this with regex split?

JAB Over a year ago

@Elazar I suspect because regular expressions can be quite cryptic if they're done wrongly or are too complex and have no comments. Sometimes a string manipulation done with an RE can be easier to understand when built up as a series of function calls. (In this case, though, a series of split()/strip()/element access operations is clunkier than using an RE would be.)

Elazar · Accepted Answer · 2013-05-30 16:17:09Z

1

>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']

or, somewhat more general:

>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']

edited May 30, 2013 at 16:17

answered May 30, 2013 at 16:10

Elazar
