This data describes the files in a specific folder that is expected to grow over time, so there will be many files with a similar name pattern, although the file names are never exactly the same. The code below finds the file names that match a given pattern and, if there are several matches, selects the latest one based on the Last_modified date. In this example the result is filename1.

Sample data frame:

import pandas as pd

d = {
    'file_name': ['finding_finding_april_040119_1012', 'finding_finding_april_040119_1111',
                  'question_answer_april_040119_0915', 'question_answer_april_040119_0945',
                  'review_rational_040119_0805'],
    'No_of_records': [23, 32, 45, 42, 28],
    'size_in_MB': [10, 15, 8, 12, 10],
    'Last_modified': ['2019-04-01 05:00:15+00:00', '2019-04-01 05:00:20+00:00',
                      '2019-04-01 07:00:15+00:00', '2019-04-01 07:15:15+00:00',
                      '2019-04-01 05:00:15+00:00'],
}
df = pd.DataFrame(data=d)
df['Last_modified'] = pd.to_datetime(df['Last_modified'])

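This is what the table looks like:

file_name                          No_of_records  size_in_MB  Last_modified
finding_finding_april_040119_1012  23             10          2019-04-01 05:00:15+00:00
finding_finding_april_040119_1111  32             15          2019-04-01 05:00:20+00:00
question_answer_april_040119_0915  45             8           2019-04-01 07:00:15+00:00
question_answer_april_040119_0945  42             12          2019-04-01 07:15:15+00:00
review_rational_040119_0805        28             10          2019-04-01 05:00:15+00:00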

Code I am using:

mask1 = df['file_name'].str.contains("finding_finding_april")
df2 = df.loc[mask1]
mask2 = (df2['Last_modified'] == df2['Last_modified'].max())
df3 = df2.loc[mask2]
filename1 = df3['file_name'].iloc[0]  # select by column name; position 2 depends on the column order

The conditions mask1 and mask2 cannot be combined as mask1 & mask2, because mask2 is computed on df2, the result of filtering with mask1. The code works as it is, but I think there should be a better way of writing it.

  1. Is there a way to improve the code using a nested for loop or a list comprehension?
  2. If I have a list of patterns like the following, how can I loop through the list to create filename1, filename2, and so on, without running the code separately for each pattern?

    list = ['finding_finding_april', 'question_answer_april', 'review_rational_april' ... ...]

I know how to loop through a list and do something simple, but I am not sure what to do in this situation.

  • do you mean df.loc[mask1&mask2,'size_in_MB'] ?? Commented Apr 2, 2019 at 16:19
  • can you provide a dataset example for case 2? and expected output? Commented Apr 2, 2019 at 16:28
  • @anky_91, as I mentioned, I can't combine mask1 & mask2. mask2 works on the result I get after filtering with mask1. Case 2 applies to this same example too. Instead of running df['file_name'].str.contains("finding_finding_april") separately each time I want to match a pattern, I want to execute the whole process over a list of patterns. Commented Apr 2, 2019 at 16:33

1 Answer

You can iterate through the list of patterns and append each result to a list of file names, like the following:

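patterns = ['finding_finding_april', 'question_answer_april', 'review_rational_april']  # renamed so the built-in list is not shadowed
filename = []  # collects the latest matching file name for each pattern
for pattern in patterns:
    mask1 = df['file_name'].str.contains(pattern)
    df2 = df.loc[mask1]
    if df2.empty:
        filename.append(None)  # no file matches this pattern
        continue
    # remaining steps from the question: keep the most recently modified row
    mask2 = df2['Last_modified'] == df2['Last_modified'].max()
    df3 = df2.loc[mask2]
    filename.append(df3['file_name'].iloc[0])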
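
The body of the loop can probably also be collapsed into a single lookup. The following is a minimal sketch rather than part of the original answer: it assumes the sample data from the question (trimmed to the two columns it needs), and `latest_match` is a hypothetical helper name. It uses `Series.idxmax` on the filtered `Last_modified` column to find the newest matching row, and returns `None` when nothing matches. Note that it passes `regex=False`, so the pattern is treated as a plain substring; drop that argument if the patterns are meant to be regular expressions.

```python
import pandas as pd

# Sample data from the question, trimmed to the columns used here
df = pd.DataFrame({
    'file_name': ['finding_finding_april_040119_1012', 'finding_finding_april_040119_1111',
                  'question_answer_april_040119_0915', 'question_answer_april_040119_0945',
                  'review_rational_040119_0805'],
    'Last_modified': ['2019-04-01 05:00:15+00:00', '2019-04-01 05:00:20+00:00',
                      '2019-04-01 07:00:15+00:00', '2019-04-01 07:15:15+00:00',
                      '2019-04-01 05:00:15+00:00'],
})
df['Last_modified'] = pd.to_datetime(df['Last_modified'])


def latest_match(frame, pattern):
    """Return the most recently modified file_name containing `pattern`, or None."""
    # Keep only the rows whose file name contains the pattern (plain substring match)
    matches = frame.loc[frame['file_name'].str.contains(pattern, regex=False)]
    if matches.empty:
        return None
    # idxmax gives the index label of the newest row among the matches
    return matches.loc[matches['Last_modified'].idxmax(), 'file_name']


filename1 = latest_match(df, 'finding_finding_april')
print(filename1)  # finding_finding_april_040119_1111
```

Because the filter and the maximum are computed on the same subset, this avoids the intermediate df2 and df3 and the problem of combining mask1 with mask2.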

1 Comment

Thanks a lot. I have to get each file name as filename1 = filename[0], filename2 = filename[1], etc. Can we also create filename1, filename2, ... within the code?
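
Numbered variables such as filename1 and filename2 are awkward to create dynamically; a dictionary keyed by pattern, or tuple unpacking, usually gives the same access with less machinery. The following is a minimal sketch under that assumption, using the question's sample data trimmed to two columns. `latest_by_pattern` is an illustrative name, the patterns are matched as plain substrings (`regex=False`), and `None` marks a pattern with no matching file, which is the case for 'review_rational_april' in the sample data.

```python
import pandas as pd

# Sample data from the question, trimmed to the columns used here
df = pd.DataFrame({
    'file_name': ['finding_finding_april_040119_1012', 'finding_finding_april_040119_1111',
                  'question_answer_april_040119_0915', 'question_answer_april_040119_0945',
                  'review_rational_040119_0805'],
    'Last_modified': pd.to_datetime(['2019-04-01 05:00:15+00:00', '2019-04-01 05:00:20+00:00',
                                     '2019-04-01 07:00:15+00:00', '2019-04-01 07:15:15+00:00',
                                     '2019-04-01 05:00:15+00:00']),
})

patterns = ['finding_finding_april', 'question_answer_april', 'review_rational_april']

# Map each pattern to the newest matching file name (None if nothing matches)
latest_by_pattern = {}
for pattern in patterns:
    matches = df[df['file_name'].str.contains(pattern, regex=False)]
    latest_by_pattern[pattern] = (
        None if matches.empty
        else matches.loc[matches['Last_modified'].idxmax(), 'file_name']
    )

# If separate names are still wanted, unpack them in pattern order
filename1, filename2, filename3 = (latest_by_pattern[p] for p in patterns)
```

Looking results up as `latest_by_pattern['question_answer_april']` keeps working when the list of patterns grows, whereas numbered variables have to be extended by hand.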
