I am working on a project that involves big data stored in .txt files. My program runs rather slowly, and I think one reason is that it parses the file inefficiently.

FILE SAMPLE:

X | Y | Weight
--------------

1  1  1
1  2  1
1  3  1
1  4  1
1  5  1
1  6  1
1  7  1
1  8  1
1  9  1
1  10  1

PARSER CODE:

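def _parse(pathToFile):
    myList = []
    with open(pathToFile) as f:
        for line in f:
            # Each data line holds three whitespace-separated integers: x, y, weight.
            s = line.split()
            x, y, w = [int(v) for v in s]
            # CoresetPoint is the project's own class (not shown here).
            obj = CoresetPoint(x, y, w)
            myList.append(obj)
    return myList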

This function is invoked NumberOfRows/N times, because I parse only a small chunk of N rows at a time and process it, until no lines are left. My .txt file is several gigabytes.

Clearly the loop iterates NumberOfLines times in total, which I believe is a huge bottleneck. This leads me to my question:

Question: What is the right and most efficient approach to parsing such a file? Would organizing the data differently in the .txt file speed up the parser? If so, how should I organize the data inside the file?

1 Answer

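Python has a library for this called pandas. Import the data with pandas as follows:

import pandas as pd
# Assumes whitespace-separated columns and no header rows in the file itself.
df = pd.read_csv('<pathToFile>.txt', sep=r'\s+', header=None, names=['x', 'y', 'weight'])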

If the file is too big to load into memory all at once, you can loop through the data and load one part at a time. Here is a pretty good blog post that can help you do that.
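
As a minimal sketch of loading one part at a time, `pd.read_csv` accepts a `chunksize` argument and then returns an iterator of DataFrames instead of a single DataFrame. The helper name `iter_chunks`, the column names, and the tiny demo file are made up for illustration. The sketch also assumes the file contains only whitespace-separated integer rows with no header lines; if the header shown in the sample is really in the file, `skiprows` would be needed as well.

```python
import os
import tempfile

import pandas as pd


def iter_chunks(path, chunk_size=100_000):
    """Return an iterator of DataFrames with at most chunk_size rows each."""
    # Assumption: whitespace-separated integer columns, no header rows.
    return pd.read_csv(
        path,
        sep=r"\s+",
        header=None,
        names=["x", "y", "weight"],  # hypothetical column names
        chunksize=chunk_size,
    )


# Tiny demo file standing in for the multi-gigabyte input.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    for y in range(1, 11):
        tmp.write(f"1  {y}  1\n")
    demo_path = tmp.name

rows_seen = 0
total_weight = 0
for chunk in iter_chunks(demo_path, chunk_size=4):
    # Only the current chunk is held in memory here.
    rows_seen += len(chunk)
    total_weight += int(chunk["weight"].sum())

os.remove(demo_path)
print(rows_seen, total_weight)
```

Each chunk is an ordinary DataFrame, so per-chunk processing can use vectorized column operations rather than building one Python object per row.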

2 Comments

I can't hold the whole file in main memory because it's very large. Won't this load all of it into main memory?
Yes, this will load it into memory. How big is it? If you really need to, you could move to distributed tools such as Spark's RDDs, but that would take some time. What about sampling the data? Look at this question: stackoverflow.com/questions/22258491/… You could also loop over parts of the data so as not to load everything into memory at once.
