
I'm trying to open a txt file with 4,605,227 rows (305 MB).

The way I have done this before is:

import numpy as np
import pandas as pd

# Load every field as a string, skipping the header row
data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)

# Build the DataFrame, then convert the integer columns
df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})

But it's using up most of the available RAM (~10 GB) and not finishing. Is there a faster way of reading in this txt file and creating a pandas DataFrame?

Thanks!

Edit: Solved now, thank you. Why is np.loadtxt() so slow?

3 Comments
  • what happens with df = pd.read_csv('file.txt', delimiter='\t', dtype=str, skiprows=1)? Commented Nov 14, 2019 at 15:27
  • stackoverflow.com/questions/25962114/…, a 6 GB file in the linked question Commented Nov 14, 2019 at 15:29
  • agreed with @QuangHoang, 305 MB should be readable with pandas directly Commented Nov 14, 2019 at 15:30

3 Answers


Rather than reading it in with numpy, you could read it directly into a Pandas DataFrame, e.g. using the pandas.read_csv function with something like:

df = pd.read_csv('file.txt', delimiter='\t', usecols=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
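A sketch closer to the original loadtxt call, assuming the same tab-separated file.txt, skipped header row, column names, and integer columns from the question (the remaining columns are left for pandas to infer):

import pandas as pd

# Skip the header row, assign the question's column names, and parse
# the three integer columns while reading (file assumed tab-separated)
df = pd.read_csv(
    'file.txt',
    sep='\t',
    skiprows=1,
    names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
    dtype={"a": "int64", "h": "int64", "i": "int64"},
)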

2 Comments

Thank you, I didn't realise read_csv allowed for tab-delimited files. However I get this error: TypeError: parser_f() got an unexpected keyword argument 'columns'
Because the parameter is not called columns but usecols (as in the docs Matt linked)

Method 1:

You can read the file in chunks. readlines() also accepts a size hint, so you can cap how much is read into memory on each call:

BUFFER_SIZE = 1024 * 1024  # hint: read roughly 1 MB worth of lines per call

with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(BUFFER_SIZE)
    while buffer_lines:
        # logic goes here
        buffer_lines = input_file.readlines(BUFFER_SIZE)
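If the end goal is still a DataFrame, pandas can do the chunking itself via read_csv's chunksize parameter; a minimal sketch, assuming the tab-separated file.txt and column names from the question (the chunk size is an arbitrary value to tune against available memory):

import pandas as pd

# Read 100,000 rows at a time and concatenate the pieces at the end
chunks = pd.read_csv('file.txt', sep='\t', skiprows=1,
                     names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
                     chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)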

Method 2:

You can also use the mmap module; the link below explains its usage.

import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

https://docs.python.org/3/library/mmap.html

Comments


You can read it directly in as a Pandas DataFrame, e.g.

import pandas as pd
pd.read_csv(path)  # for a tab-separated file like the question's, pass sep='\t' too

If you want to read faster, you can use modin:

import modin.pandas as pd
pd.read_csv(path)

https://github.com/modin-project/modin

Comments
