I want to use pandas read_csv() with a Python iterator as the input, where each next() call returns the next line of my text file. What would you suggest? I want the best performance.
As I understand it, StringIO works in this case, but I would rather not use it.

BTW, after that I'm using the as_matrix() function to create a numpy array.
I'm doing this because it's much faster than np.loadtxt(), which is horribly slow :(

  • Why the iterator, why not the file object? Commented Nov 29, 2015 at 21:59
  • @MaxNoe Because I can only access the content of the txt file through a generator. Commented Nov 30, 2015 at 11:44
  • Just because I'm curious: where does this strange limitation come from? Commented Nov 30, 2015 at 19:24
  • @MaxNoe I'm working with Apache Spark and my big text file is distributed across the nodes in my cluster. I want to compute some function on each part of the text file at each node. The only way I can access that part is through a Python iterator. Commented Nov 30, 2015 at 22:32

1 Answer

You should use:

 from io import StringIO
 import pandas as pd
 df = pd.read_csv(StringIO("\n".join(lines)))

where lines is your iterator / generator variable (named lines here so that it does not shadow the built-in iter).
This is still faster than using np.loadtxt(lines).
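
For a fuller picture, here is a minimal sketch of the whole round trip, from iterator to numpy array. The generator line_source and its sample rows are hypothetical stand-ins for your real iterator, and I'm assuming the first line is a header. Note that as_matrix() was removed in pandas 1.0, so the sketch uses to_numpy() instead.

```python
from io import StringIO

import pandas as pd


def line_source():
    """Hypothetical stand-in for an iterator that yields one line per next()."""
    yield "a,b,c"
    yield "1,2,3"
    yield "4,5,6"


# Strip any trailing newline so that joining does not produce blank rows.
text = "\n".join(line.rstrip("\n") for line in line_source())

# StringIO gives read_csv the file-like object it expects.
df = pd.read_csv(StringIO(text))

# as_matrix() no longer exists in pandas >= 1.0; to_numpy() replaces it.
arr = df.to_numpy()
```

Keep in mind that the join builds the complete text in memory before parsing, so this works best when each part of the file fits comfortably in RAM.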
