1

How can we read and index dynamically generated files from a source folder in Python, and append only the newly added or unread files to the index when the code is refreshed?

An automation tool continuously puts files (say, xlsx) into the source folder. A Python program then reads all the files in the folder and plots a graph from them. To improve performance, we do not want to re-read every file each time the code/application is refreshed; instead, we want to append only the unread files to the index.

An index could be a local variable or table that holds information about the input files, such as which files have already been loaded, so that the system knows which ones still need to be read. The idea is to read each file only once, not every file after every refresh.

  • What do you mean by "index"? Commented Sep 18, 2018 at 5:29
  • An index could be a local variable or table that holds information about the input files, such as which files have already been loaded, so that the system knows which ones still need to be read. The idea is to read each file only once, not every file after every refresh. Commented Sep 18, 2018 at 6:27
  • You can maintain a log file containing the names of the files that have already been processed, and read this log before processing a file to check whether it has already been handled. Commented Sep 18, 2018 at 6:30
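The log-file idea from the last comment could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names (`load_processed`, `mark_processed`, `process_new_files`), the one-name-per-line log format, and the `handler` callback are all hypothetical, and the log is assumed to live outside the source folder so it is not picked up as an input file.

```python
import os


def load_processed(log_path):
    """Return the set of file names recorded in the log (empty if no log exists yet)."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as log:
        return {line.rstrip("\n") for line in log if line.strip()}


def mark_processed(log_path, file_name):
    """Append one file name to the log so later runs skip it."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(file_name + "\n")


def process_new_files(source_dir, log_path, handler):
    """Call handler(path) for each file not yet in the log; return the new names."""
    processed = load_processed(log_path)
    newly_processed = []
    for name in sorted(os.listdir(source_dir)):
        path = os.path.join(source_dir, name)
        # Skip sub-directories and anything already recorded in the log
        if name in processed or not os.path.isfile(path):
            continue
        handler(path)                    # e.g. read the file and update the plot data
        mark_processed(log_path, name)   # record only after the handler succeeded
        newly_processed.append(name)
    return newly_processed
```

Because the log is written to disk, the record of processed files survives a restart of the program, which an in-memory list does not.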

2 Answers

1

The following code builds the list of new file names and adds them to the index.

These variables are used:

  • bag_of_files : list of file names that have already been processed
  • curr_files : list of file names currently in the source folder
  • new_files : list of file names that have not been processed yet, i.e. the ones you are interested in

Run this code the first time, when bag_of_files is still empty.

import os

curr_dir = "D:/2018/Address Matching/Data/Statewise Output/"
bag_of_files = []  # Comment out this line after the first run
curr_files = os.listdir(curr_dir)  # everything currently in the source folder
new_files = []
for file in curr_files:
    # Keep only the names that were not seen in an earlier run
    if file not in bag_of_files:
        new_files.append(file)
        bag_of_files.append(file)

new_files

Output:

['AP Output.csv',
 'Delhi Output.csv',
 'Gujrat Output.csv',
 'Haryana Output.csv',
 'Jharkhand Output V1.csv',
 'Jharkhand Output V1.xlsx',
 'Jharkhand Output.csv',
 'Karnataka Output.csv']

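On every later run, use the following code. The only difference is that the line initializing bag_of_files is commented out, so the list from the previous run is reused. This works only while the same Python session (for example, a notebook kernel) stays alive. Before each run I added some new files to the same folder.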

curr_dir = "D:/2018/Address Matching/Data/Statewise Output/"
#bag_of_files = [] #Comment out this line after using 1st time
curr_files = os.listdir(curr_dir)
new_files = []
for file in curr_files:
    if file not in bag_of_files:
        new_files.append(file)
        bag_of_files.append(file)
new_files

Output:

['Maharashtra Output.csv',
 'MP Output.csv',
 'Punjab Output.csv',
 'Rajsthan Output.csv']

Run it again :)

Output:

['Bihar Output.csv',
 'Tamilnadu Output.csv',
 'Telangana Output.csv',
 'WB Output.csv']
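As a variation on the loop above, the same check can be expressed as a set difference, which avoids scanning a growing list for every file. This is a minimal sketch, not the answer's original method; the function name `find_new_files` and the use of a set for the index are assumptions made for illustration.

```python
import os


def find_new_files(directory, seen):
    """Return names in `directory` that are not in `seen`, and add them to `seen`."""
    current = set(os.listdir(directory))
    new_files = sorted(current - seen)  # names present now but never seen before
    seen.update(new_files)              # remember them for the next call
    return new_files
```

Calling `find_new_files(curr_dir, seen)` repeatedly with the same `seen` set returns only the files added since the previous call.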

0

To keep it simple, you could use os.listdir() to monitor the directory contents. Then, to detect changes to files that the program has already indexed, check their modification time with os.stat().
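A rough sketch of that approach is shown here, assuming the index is a plain dict that maps each file name to the modification time recorded when it was last read. The function name `scan_changes` and the dict layout are hypothetical.

```python
import os


def scan_changes(directory, index):
    """Compare `directory` with `index` (name -> last seen mtime).

    Returns (new, modified) lists of file names and updates `index` in place.
    """
    new, modified = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # ignore sub-directories
        mtime = os.stat(path).st_mtime
        if name not in index:
            new.append(name)        # never indexed before
        elif mtime != index[name]:
            modified.append(name)   # indexed, but changed since then
        index[name] = mtime
    return new, modified
```

Files reported as new or modified are the only ones that need to be read again; everything else can be served from what was loaded earlier.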
