My program has to read about 400,000 CSV files, and it takes a very long time. The code I use is:

        for file in self.files:
            size=2048
            csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'], header=None, skiprows=range(0,int(size/2)), skipfooter=(int(size/2)-10))

            for index in range(0,10):
                s=s+float(csvData['bcol'][index])
            s=s/10
            averages.append(s)
            time=file.rpartition('\\')[2]
            time=int(re.search(r'\d+', time).group())
            times.append(time)

Is there a way to speed this up?

1 Answer

You can use threading. I took the following code from here and modified it for your use case:

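import re
from threading import Thread

import pandas as pd

size = 2048
results = []  # (time, average) pairs; a single append keeps each pair together

def my_func(file):
    # skipfooter is only supported by the Python parser engine.
    csvData = pd.read_csv(file, sep='\t', names=['acol', 'bcol'], header=None, skiprows=range(0,int(size/2)), skipfooter=(int(size/2)-10), engine='python')

    # Average the 10 remaining 'bcol' values, starting from zero for every file.
    s = 0.0
    for index in range(0,10):
        s=s+float(csvData['bcol'][index])
    s=s/10
    time=file.rpartition('\\')[2]
    time=int(re.search(r'\d+', time).group())
    results.append((time, s))

threads = []
# In this case 'self.files' is a list of files to be read.
for file in self.files:
    # We start one thread per file present.
    # With a very large number of files, a bounded pool of threads is usually preferable.
    process = Thread(target=my_func, args=[file])
    process.start()
    threads.append(process)
# We now pause execution on the main thread by 'joining' all of our started threads.
# This ensures that each has finished processing its file.
for process in threads:
    process.join()

# Threads finish in arbitrary order, so split the pairs only after joining.
times = [t for t, _ in results]
averages = [a for _, a in results]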
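
Starting one thread per file may not scale to roughly 400,000 files, so here is a minimal sketch of the same idea with a bounded thread pool. It is not the code from the question or the answer: the names `middle_average` and `process_files` and the `max_workers` value are made up for illustration. It also assumes every file has `size` lines, so that the 10 rows the original `skiprows`/`skipfooter` combination keeps are lines `size/2` to `size/2 + 9`, and it reads those lines directly instead of building a DataFrame.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

SIZE = 2048   # assumed number of lines per file, as in the question
N_ROWS = 10   # number of middle rows to average


def middle_average(path, size=SIZE, n_rows=N_ROWS):
    """Return (time, average) for one tab-separated file (hypothetical helper)."""
    start = size // 2
    with open(path) as handle:
        # islice skips the first `start` lines and stops after `n_rows` more,
        # so the rest of the file is never parsed.
        values = [float(line.split('\t')[1])
                  for line in islice(handle, start, start + n_rows)]
    # Take the file name (after the last / or \) and use its first run of digits.
    name = re.split(r'[\\/]', str(path))[-1]
    time = int(re.search(r'\d+', name).group())
    return time, sum(values) / len(values)


def process_files(paths, max_workers=8):
    """Process all files with a bounded pool; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the order of `paths`,
        # so times[i] and averages[i] always belong to the same file.
        results = list(pool.map(middle_average, paths))
    times = [t for t, _ in results]
    averages = [a for _, a in results]
    return times, averages
```

Calling `times, averages = process_files(self.files)` would then replace the manual thread handling. Threads mainly help while the program waits on disk I/O; if parsing turns out to be the bottleneck, swapping in `ProcessPoolExecutor` from the same module may be worth trying, since CPython threads do not run Python code in parallel.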