
I have a dataset of more than 300k files that I need to read and append to a list.

import os
import pandas as pd

corpus_path = "data"
article_paths = [os.path.join(corpus_path, p) for p in os.listdir(corpus_path)]

doc = []
for path in article_paths:
    dp = pd.read_table(path, header=None, encoding='utf-8', quoting=3, error_bad_lines=False)
    doc.append(dp)

Is there a faster way to do this? The current method takes more than an hour.

  • If you have an SSD, then you can try threads. Otherwise, probably not. Commented Feb 24, 2018 at 15:34
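For reference, here is a minimal sketch of the thread-based approach the comment suggests, using concurrent.futures from the standard library. It assumes article_paths is built as in the question, and the worker count of 8 is an arbitrary starting point rather than a recommendation.

from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def read_file(path):
    # Same read call as in the question.
    return pd.read_table(path, header=None, encoding='utf-8', quoting=3, error_bad_lines=False)

# Threads mainly help when the reads are I/O-bound (e.g. on an SSD),
# since the parsing itself is still limited by the GIL.
with ThreadPoolExecutor(max_workers=8) as executor:
    doc = list(executor.map(read_file, article_paths))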

1 Answer


You can use the multiprocessing module to read the files in parallel.

import os
from multiprocessing import Pool

import pandas as pd

def readFile(path):
    return pd.read_table(path, header=None, encoding='utf-8', quoting=3, error_bad_lines=False)

nprocs = os.cpu_count()  # number of processors
result = list(Pool(processes=nprocs).imap(readFile, article_paths))
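If a single DataFrame is wanted rather than a list of per-file frames, they can be concatenated afterwards. The following sketch is not part of the original answer; it assumes article_paths is defined as in the question, and it wraps the pool in an if __name__ == '__main__': guard, which multiprocessing needs on platforms that use the spawn start method (e.g. Windows).

if __name__ == '__main__':
    # Use a context manager so the worker processes are cleaned up,
    # then combine the per-file DataFrames into one.
    with Pool(processes=nprocs) as pool:
        frames = pool.map(readFile, article_paths)
    corpus = pd.concat(frames, ignore_index=True)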