How to read and index dynamically generated files in Python

Question

How can we read and index dynamically generated files from a source folder in Python and, on each code refresh, append only the newly added or unread files to the index?

An automation tool continuously puts files (say, xlsx) into the source folder. A Python program then reads all the files present in the folder and plots a graph from them. To optimize performance, we plan not to re-read every file each time the code or application is refreshed, but only to append the unread files to the index.

An index could be a local variable or table that holds information about the input files, for example which files have already been loaded, so that the system knows which ones to read now and which have already been read. The idea is to read each file only once, not all the files after every refresh.

An index could be a local variable or table that holds information about the input files, for example which files have already been loaded, so that the system knows which ones to read now and which have already been read. The idea is to read each file only once, not all the files after every refresh.
– Botham Ruosh, Commented Sep 18, 2018 at 6:27
You can maintain a log file containing the names of the files that have already been processed, and read this log file before processing a file to check whether it has already been processed.
– Ram Mourya, Commented Sep 18, 2018 at 6:30
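
The log-file idea from the comment above could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper names (load_processed, find_unprocessed, mark_processed) and the log file location are assumptions. The sketch also assumes the log is kept outside the source folder, so it is never listed as an input file.

```python
import os


def load_processed(log_path):
    """Return the set of file names already recorded in the log."""
    if not os.path.exists(log_path):
        return set()  # first run: nothing has been processed yet
    with open(log_path, encoding="utf-8") as log:
        return {line.rstrip("\n") for line in log if line.strip()}


def find_unprocessed(source_dir, log_path):
    """List files in source_dir that are not yet recorded in the log."""
    processed = load_processed(log_path)
    return sorted(
        name
        for name in os.listdir(source_dir)
        if os.path.isfile(os.path.join(source_dir, name))
        and name not in processed
    )


def mark_processed(log_path, names):
    """Append newly processed file names to the log, one per line."""
    with open(log_path, "a", encoding="utf-8") as log:
        for name in names:
            log.write(name + "\n")
```

On each refresh you would call find_unprocessed(), read only the files it returns, and then pass them to mark_processed(). Because the log lives on disk, the index survives a restart of the program.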

Akshay Gujar · Accepted Answer · 2018-09-18 06:56:12Z

The following code returns the list of new file names and keeps an index of the files already seen.

These variables are used:

bag_of_files : list of file names that have already been processed
curr_files : list of file names currently in the source folder
new_files : list of file names you are interested in, i.e. those not yet processed

Run this code the first time, when bag_of_files is empty.

import os

curr_dir = "D:/2018/Address Matching/Data/Statewise Output/"
bag_of_files = []  # Comment out this line after the first run
curr_files = os.listdir(curr_dir)  # everything currently in the folder
new_files = []
for file in curr_files:
    # Keep only the files that have not been seen before
    if file not in bag_of_files:
        new_files.append(file)
        bag_of_files.append(file)

new_files

Output:

['AP Output.csv',
'Delhi Output.csv',
'Gujrat Output.csv',
'Haryana Output.csv',
'Jharkhand Output V1.csv',
'Jharkhand Output V1.xlsx',
'Jharkhand Output.csv',
'Karnataka Output.csv']

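On every later run, use the following code. The only difference is that the bag_of_files = [] line is commented out, so the previous contents of bag_of_files are reused; this requires bag_of_files to still be in memory from the earlier run. Before each run, I added some new files to the same folder.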

curr_dir = "D:/2018/Address Matching/Data/Statewise Output/"
# bag_of_files = []  # Left commented out so the earlier index is kept
curr_files = os.listdir(curr_dir)  # everything currently in the folder
new_files = []
for file in curr_files:
    # Keep only the files that have not been seen before
    if file not in bag_of_files:
        new_files.append(file)
        bag_of_files.append(file)
new_files

Output:

['Maharashtra Output.csv',
 'MP Output.csv',
 'Punjab Output.csv',
 'Rajsthan Output.csv']

Run it again :)

Output:

['Bihar Output.csv',
 'Tamilnadu Output.csv',
 'Telangana Output.csv',
 'WB Output.csv']
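
If you would rather not comment a line in and out between runs, the same loop could be wrapped in a function. This is a minimal sketch, not part of the answer's code: find_new_files and its suffix parameter are hypothetical names, and a set is swapped in for the list so that membership checks stay fast as the index grows.

```python
import os


def find_new_files(directory, seen, suffix=None):
    """Return names in `directory` that are not in `seen`, and add them to `seen`.

    `seen` is a set owned by the caller; `suffix` optionally restricts the
    result to one file type, for example ".xlsx".
    """
    new_files = []
    for name in sorted(os.listdir(directory)):
        if suffix is not None and not name.endswith(suffix):
            continue  # skip files of other types
        if name not in seen:
            new_files.append(name)
            seen.add(name)
    return new_files
```

Create seen = set() once and pass the same object on every refresh: the first call returns everything in the folder, later calls return only the files added since.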

Kingsley · Accepted Answer · 2018-09-18 05:27:10Z

To keep the answer simple, you could use os.listdir() to monitor the directory contents. Then, to watch for modifications to files that the program has already indexed, check their modification time with os.stat().
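
As a rough sketch of that approach, the index can map each file name to the modification time seen on the last scan. The function name scan_changes and the dictionary layout are assumptions made for illustration, not something specified in this answer.

```python
import os


def scan_changes(directory, index):
    """Return files that are new or modified since the previous scan.

    `index` maps file name -> modification time from the last scan and is
    updated in place.
    """
    changed = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # ignore sub-directories
        mtime = os.stat(path).st_mtime
        # A missing entry means a new file; a different time means it changed.
        if index.get(name) != mtime:
            changed.append(name)
            index[name] = mtime
    return changed
```

Start with index = {} and call scan_changes() on every refresh. An unchanged file is skipped, while a file that the automation tool rewrites is picked up again.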
