
I'm trying to open a txt file with 4,605,227 rows (305 MB).

The way I have done this before is:

import numpy as np
import pandas as pd

# Load every field as a string, skipping the header row
data = np.loadtxt('file.txt', delimiter='\t', dtype=str, skiprows=1)

# Build the DataFrame, then convert the integer columns
df = pd.DataFrame(data, columns=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
df = df.astype(dtype={"a": "int64", "h": "int64", "i": "int64"})

But it's using up most of the available RAM (~10 GB) and not finishing. Is there a faster way of reading in this txt file and creating a pandas DataFrame?

Thanks!

Edit: Solved now, thank you. Why is np.loadtxt() so slow?

3 Comments
  • what happens with df = pd.read_csv('file.txt', delimiter='\t', dtype=str, skiprows=1)? Commented Nov 14, 2019 at 15:27
  • stackoverflow.com/questions/25962114/…, a 6 GB file in the linked question Commented Nov 14, 2019 at 15:29
  • agreed with @QuangHoang, 305 MB should be readable with pandas directly Commented Nov 14, 2019 at 15:30

3 Answers


Rather than reading it in with numpy, you could read it directly into a Pandas DataFrame, e.g. using the pandas.read_csv function with something like:

df = pd.read_csv('file.txt', delimiter='\t', usecols=["a", "b", "c", "d", "e", "f", "g", "h", "i"])
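A sketch closer to the original loadtxt call, assuming the same tab-separated file.txt, skipped header row, column names, and integer columns from the question (the remaining columns are left for pandas to infer):

import pandas as pd

# Skip the header row, assign the question's column names, and parse
# the three integer columns while reading (file assumed tab-separated)
df = pd.read_csv(
    'file.txt',
    sep='\t',
    skiprows=1,
    names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
    dtype={"a": "int64", "h": "int64", "i": "int64"},
)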

2 Comments

Thank you, I didn't realise read_csv allowed for tab-delimited files. However I get this error: TypeError: parser_f() got an unexpected keyword argument 'columns'
Because the parameter is not called columns but usecols (as in the docs Matt linked)

Method 1:

You can read the file in chunks. readlines() also accepts a size hint, so you can cap how much is read into memory on each call:

BUFFER_SIZE = 1024 * 1024  # hint: read roughly 1 MB worth of lines per call

with open('inputTextFile', 'r') as input_file:
    buffer_lines = input_file.readlines(BUFFER_SIZE)
    while buffer_lines:
        # logic goes here
        buffer_lines = input_file.readlines(BUFFER_SIZE)
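If the end goal is still a DataFrame, pandas can do the chunking itself via read_csv's chunksize parameter; a minimal sketch, assuming the tab-separated file.txt and column names from the question (the chunk size is an arbitrary value to tune against available memory):

import pandas as pd

# Read 100,000 rows at a time and concatenate the pieces at the end
chunks = pd.read_csv('file.txt', sep='\t', skiprows=1,
                     names=["a", "b", "c", "d", "e", "f", "g", "h", "i"],
                     chunksize=100_000)
df = pd.concat(chunks, ignore_index=True)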

Method 2:

You can also use the mmap module; the link below explains its usage.

import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file, size 0 means whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!\n"
    # read content via slice notation
    print(mm[:5])  # prints b"Hello"
    # update content using slice notation;
    # note that new content must have same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello  world!\n"
    # close the map
    mm.close()

https://docs.python.org/3/library/mmap.html

Comments


You can read it directly in as a Pandas DataFrame, e.g.

import pandas as pd
pd.read_csv(path)  # for a tab-separated file like the question's, pass sep='\t' too

If you want to read faster, you can use modin:

import modin.pandas as pd
pd.read_csv(path)

https://github.com/modin-project/modin

Comments
