35

I have the following file names that exhibit this pattern:

000014_L_20111007T084734-20111008T023142.txt
000014_U_20111007T084734-20111008T023142.txt
...

I want to extract the middle two time stamp parts after the second underscore '_' and before '.txt'. So I used the following Python regex string split:

time_info = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)

But this gives me two extra empty strings in the returned list:

time_info=['', '20111007T084734', '20111008T023142', '']

How do I get only the two timestamps? That is, I want:

time_info=['20111007T084734', '20111008T023142']

5 Answers

28

I'm no Python expert but maybe you could just remove the empty strings from your list?

str_list = re.split(r'^[0-9]+_[LU]_|-|\.txt$', f)
time_info = filter(None, str_list)

3 Comments

This works. Thanks. I wonder if there is any one-pass solution using re.split() function.
@tonga there is, but it's less pretty: time_info = [x for x in re.split(r'^[0-9]+_[LU]_|-|\.txt$', f) if x]
In Python 3, filter() returns a lazy filter object rather than a list, so wrap it in list(): time_info = list(filter(None, str_list))
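For anyone wondering where the empty strings come from, here is a minimal sketch using the filename from the question. The pattern treats the leading prefix and the trailing '.txt' as separators, so re.split() also reports the (empty) text before the first separator and after the last one. The variable names parts and pattern are only for illustration.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'
pattern = r'^[0-9]+_[LU]_|-|\.txt$'

# The prefix and the suffix both act as separators, so the pieces
# before the first and after the last separator are empty strings.
parts = re.split(pattern, f)

# Empty strings are falsy, so a truthiness filter drops them.
time_info = [p for p in parts if p]

print(parts)
print(time_info)
```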
23

Don't use re.split(), use the groups() method of regex Match/SRE_Match objects.

>>> f = '000014_L_20111007T084734-20111008T023142.txt'
>>> time_info = re.search(r'[LU]_(\w+)-(\w+)\.', f).groups()
>>> time_info
('20111007T084734', '20111008T023142')

You can even name the capturing groups and retrieve them in a dict, though you use groupdict() rather than groups() for that. (The regex pattern for such a case would be something like r'[LU]_(?P<groupA>\w+)-(?P<groupB>\w+)\.')
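A short sketch of the named-group variant, assuming the same filename as above. The group names start and end are made up for this example, and the None check is there because re.search() returns None when nothing matches.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# 'start' and 'end' are illustrative group names, not anything required.
match = re.search(r'[LU]_(?P<start>\w+)-(?P<end>\w+)\.', f)

# re.search() returns None on failure, so guard before calling groupdict().
time_info = match.groupdict() if match else {}

print(time_info)
```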

4 Comments

It's a shame split doesn't have a "no empty strings" option.
@Elazar Not really, it's just a matter of how re.split() is implemented and what its intended purpose is. In cases like this, it makes more sense to build a pattern for the desired data than to build one to match everything that isn't desired. (Though str.split() actually does drop empty strings when the separator is unspecified or None.)
The way re.split() is implemented should have nothing to do with its external behavior.
Nowhere in the Python documentation does it say that re.split() must function exactly like str.split() in how it handles empty strings. The only explicit, non-example mention of empty strings in the result is that captured separators at the start or end will be accompanied by an empty string to ensure consistency for relative indexing.
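A quick illustration of the behaviours mentioned in these comments. The sample strings are invented for the demonstration.

```python
import re

# With no separator, str.split() collapses whitespace runs and
# never returns empty strings.
no_sep = ' a  b '.split()

# With an explicit separator, str.split() keeps the empty strings.
explicit_sep = ' a  b '.split(' ')

# With a capturing group, re.split() returns the separators too; a match
# at the very start still produces a leading empty string.
captured = re.split(r'(-)', '-a-b')

print(no_sep, explicit_sep, captured)
```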
4

Since this came up on Google, and for completeness: try re.findall as an alternative!

This does require a little re-thinking, but it still returns a list of matches like split does. This makes it a nice drop-in replacement for some existing code and gets rid of the unwanted text. Pair it with lookaheads and/or lookbehinds and you get very similar behavior.

Yes, this is a bit of a "you're asking the wrong question" answer and doesn't use re.split(). It does solve the underlying issue, though: your list of matches suddenly has zero-length strings in it and you don't want that.
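As a rough sketch of what that could look like for the filename in the question: the first pattern assumes every timestamp is 8 digits, a 'T', then 6 digits; the second uses a lookbehind and lookahead and assumes the timestamps contain only digits and 'T'. Both assumptions come from the sample names, not from any spec.

```python
import re

f = '000014_L_20111007T084734-20111008T023142.txt'

# Describe the wanted data directly (assumed shape: YYYYMMDD 'T' HHMMSS).
by_shape = re.findall(r'\d{8}T\d{6}', f)

# Lookaround variant: runs of digits/'T' that follow '_' or '-'
# and are followed by '-' or '.'.
by_lookaround = re.findall(r'(?<=[_-])[0-9T]+(?=[-.])', f)

print(by_shape)
print(by_lookaround)
```

Note that \w would not work in the lookaround version, because \w also matches the underscore.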

1 Comment

This is almost certainly the right answer. I think the user wants to split by regex and doesn't necessarily care whether they use the 'split' API.
3

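If the timestamps are always after the second _ then you can use str.split once the extension is sliced off. (str.strip(".txt") would also work for these names, but only because they start and end with digits: strip removes any of the characters '.', 't' and 'x' from both ends rather than the literal suffix.)

>>> strs = "000014_L_20111007T084734-20111008T023142.txt"
>>> strs[:-len(".txt")].split("_", 2)[-1].split("-")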
['20111007T084734', '20111008T023142']

3 Comments

I love doing these things without REs. I don't know why.
@Ashwini: Thanks. This works. But how can I do this with regex split?
@Elazar I suspect because regular expressions can be quite cryptic if they're done wrongly or are too complex and have no comments. Sometimes a string manipulation done with an RE can be easier to understand when built up as a series of function calls. (In this case, though, a series of split()/strip()/element access operations is clunkier than using an RE would be.)
1
>>> f='000014_L_20111007T084734-20111008T023142.txt'
>>> f[9:-4].split('-')
['20111007T084734', '20111008T023142']

or, somewhat more general:

>>> f[f.rfind('_')+1:-4].split('-')
['20111007T084734', '20111008T023142']
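If the extension might not always be four characters long, one possible variation (a sketch, not part of the answer above) is to let os.path.splitext() remove it instead of hard-coding -4:

```python
import os

f = '000014_L_20111007T084734-20111008T023142.txt'

# splitext() separates the extension whatever its length.
stem, ext = os.path.splitext(f)

# Everything after the last underscore is the timestamp pair.
time_info = stem.rsplit('_', 1)[-1].split('-')

print(time_info)
```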
