5

I have files that follow a specific naming format, which looks something like this:

test_0800_20180102_filepath.csv
anotherone_0800_20180101_hello.csv

The numbers in the middle represent timestamps, so I would like to extract that information. I know that there is a specific pattern which will always be _time_date_, so essentially I want the part of the string that lies between the first and third underscores. I found some examples and somewhat similar problems, but I am new to Python and I am having trouble adapting them.

This is what I have implemented thus far:

datetime = re.search(r"\d+_(\d+)_", "test_0800_20180102_filepath.csv")

But the result I get is only the date part:

20180102

But what I actually need is:

0800_20180102
2
  • What have you tried and where did you get stuck? Commented Jan 10, 2018 at 9:50
  • I have tried various things but nothing has really worked up to now. The reason why I did not add any minimal example, is that I know it must be something extremely simple with someone that possesses some experience! Commented Jan 10, 2018 at 9:53

3 Answers

5

That's quite simple:

import re

match = re.search(r"_((\d+)_(\d+))_", your_string)

print(match.group(1))  # time_date, e.g. 0800_20180102
print(match.group(2))  # time, e.g. 0800
print(match.group(3))  # date, e.g. 20180102

Note that for such tasks, capture groups (the parentheses inside the regex) are really helpful: they let you access specific substrings of a larger match without writing a separate pattern for each one, which can be much more ambiguous than matching the larger pattern once.

Groups are numbered from 1 to n, where n is the number of groups in the pattern, and group 0 is the whole match. The numbers are assigned from left to right, by the position of each opening parenthesis in your pattern.
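To make the numbering concrete, here is a small sketch that runs the pattern on the first filename from the question. The variable names are only illustrative.

```python
import re

filename = "test_0800_20180102_filepath.csv"
match = re.search(r"_((\d+)_(\d+))_", filename)

# Group 0 is the whole match, including the surrounding underscores.
whole = match.group(0)
# Group 1 opens first, so it is the outer group holding time and date.
time_date = match.group(1)
# Groups 2 and 3 are the inner groups, numbered left to right.
time_part = match.group(2)
date_part = match.group(3)

print(whole)           # _0800_20180102_
print(time_date)       # 0800_20180102
print(match.groups())  # ('0800_20180102', '0800', '20180102')
```

Note that match.groups() returns only the numbered groups, so group 0 is not part of that tuple.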

On a side note, if you have control over the file naming, use Unix timestamps so that a single number defines both date and time unambiguously.
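If the existing names cannot be changed, the extracted string can still be converted afterwards. The sketch below assumes that the time part is HHMM, the date part is YYYYMMDD, and that both are in UTC; none of this is stated in the question, so adjust the format string and time zone to your data.

```python
from datetime import datetime, timezone

time_date = "0800_20180102"

# Assumption: HHMM_YYYYMMDD, interpreted as UTC.
parsed = datetime.strptime(time_date, "%H%M_%Y%m%d").replace(tzinfo=timezone.utc)

# Seconds since 1970-01-01 00:00:00 UTC.
unix_ts = int(parsed.timestamp())

print(parsed.isoformat())  # 2018-01-02T08:00:00+00:00
print(unix_ts)             # 1514880000
```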

1 Comment

This is exactly where I have got up to :) but this actually extracts only the date part, not the time part! I need both of them.
1

The key here is that you want everything between the first and the third underscores on each line, so there is no need to worry about designing a regex to match your time and date pattern.

with open('myfile.txt', 'r') as f:
    for line in f:
        x = '_'.join(line.split('_')[1:3])
        print(x)

The problem with your implementation is that you are only capturing the date part of your pattern. If you want to stick with a regex solution then simply move your parentheses to capture the entire pattern you want:

re.search(r"(\d+_\d+)_", "test_0800_20180102_filepath.csv").group(1)

gives:

'0800_20180102'
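One thing to watch out for: re.search returns None when nothing matches, so calling .group(1) directly raises an AttributeError for a filename that does not follow the pattern. A minimal sketch of a guard is shown here; the helper name extract_time_date and the third filename are made up for illustration.

```python
import re

def extract_time_date(filename):
    """Return the 'time_date' part of filename, or None if it is missing."""
    match = re.search(r"(\d+_\d+)_", filename)
    return match.group(1) if match else None

names = [
    "test_0800_20180102_filepath.csv",
    "anotherone_0800_20180101_hello.csv",
    "no_timestamp_here.csv",  # hypothetical name without a timestamp
]
results = [extract_time_date(name) for name in names]
print(results)  # ['0800_20180102', '0800_20180101', None]
```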

-1

This is very easy to do with .split():

parts = filename.split("_")
time = parts[1]
date = parts[2]
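A slightly more defensive variant of the same idea might look like the sketch below. It splits at most three times, so underscores in the trailing part of the name stay together, and it fails loudly when the name has too few parts. The function name split_time_date is hypothetical.

```python
def split_time_date(filename):
    """Return (time, date) from a name shaped like prefix_time_date_rest."""
    # maxsplit=3 yields at most four parts: prefix, time, date, rest.
    parts = filename.split("_", 3)
    if len(parts) < 4:
        raise ValueError(f"unexpected filename format: {filename!r}")
    _, time_part, date_part, _ = parts
    return time_part, date_part

time_part, date_part = split_time_date("anotherone_0800_20180101_hello.csv")
print(time_part, date_part)  # 0800 20180101
```

Unlike the regex, this does not check that the two parts are numeric, so it relies on the naming convention being followed.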
