
Could you help me? I'm facing a problem reading random rows from a large CSV file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).

In Read a small random sample from a big CSV file into a Python data frame I saw an approach; however, it turned out to be very memory-consuming on my PC. This is the relevant part of the code:

import random as rnd
import pandas as pd

n = 100
s = 10
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)

So if I want to take random rows from a file with not just 100 rows but 100,000, it becomes hard; taking non-random (consecutive) rows from the file, however, works almost fine:

skiprows = xrange(100000)
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skiprows, nrows=10000)

So the question is: how can I read a large number of random rows from a large CSV file with pandas? Since I can't read the entire CSV file, even in chunks, I'm interested specifically in random rows. Thanks.

3 Comments

  • I think most of the discussion on this question will be relevant. In short, there wasn't a great solution that wasn't memory-intensive. stackoverflow.com/questions/38039723/… Commented Jul 6, 2016 at 21:05
  • skiprows uses a lot of memory; try using chunks instead: stackoverflow.com/questions/36874993/… Commented Jul 6, 2016 at 21:10
  • If you go to one of these links and find them useful, please upvote them if you can. You need 15 reputation to upvote. You now have 6. Commented Jul 6, 2016 at 21:13

2 Answers


If memory is the biggest issue, a possible solution might be to use chunks and randomly select rows from the chunks.

import random

import pandas as pd

s = 10        # total number of rows to sample
factor = 1    # should be an integer; controls the chunk size
chunksize = int(s / factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'}, chunksize=chunksize)

out = []
tot = 0
for df in reader:
    # draw a random number of rows from this chunk
    nsample = random.randint(factor, chunksize)
    tot += nsample
    if tot > s:
        # trim the last draw so the total is exactly s
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break

data = pd.concat(out)

You can use factor to control the size of the chunks (and the minimum number of rows drawn from each chunk).
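
For reference, here is a minimal sketch of how the same idea might be wrapped into a reusable helper; the function name sample_csv_chunks and the commented-out invocation are illustrative assumptions, not part of the original answer:

import random

import pandas as pd

def sample_csv_chunks(path, s, factor=1, **read_kwargs):
    # Sample roughly s rows by drawing a random number of rows from
    # consecutive chunks until the target size is reached.
    chunksize = int(s / factor)
    reader = pd.read_csv(path, chunksize=chunksize, **read_kwargs)
    out, tot = [], 0
    for df in reader:
        nsample = random.randint(factor, chunksize)
        tot += nsample
        if tot > s:
            nsample = s - (tot - nsample)
        out.append(df.sample(nsample))
        if tot >= s:
            break
    return pd.concat(out)

# Hypothetical usage for the question's scale:
# data = sample_csv_chunks(path, s=100000, factor=10,
#                          usecols=['Col1', 'Col2'],
#                          dtype={'Col1': 'int32', 'Col2': 'int32'})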


3 Comments

One issue is that this approach isn't truly random, as you force selections from each chunk. For example, it's possible, albeit not probable, that a random selection of n items is the first n items, and the chunked approach does not admit this possibility. However, you could also argue that this approach is good enough for most cases that require random selection.
That's right. I edited the answer to take that into account, so that the number of samples from each chunk is random. However, there is still a caveat: it needs to take at least factor samples per chunk in order to reach the required number of samples before the end of the file. To be truly random, the possibility of looping over the file more than once would need to be considered.
Thanks, yes, indeed, I believe that is also a way out: to randomly choose chunks and randomly select rows from them, and to do this iteratively. It looks less memory-demanding than the skiprows option.
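
To address the randomness caveat discussed in these comments, one single-pass alternative (not part of either answer) is reservoir sampling, which keeps a uniformly random sample of s rows without knowing the total row count in advance. A minimal sketch, assuming the same chunked pandas reader:

import numpy as np
import pandas as pd

def reservoir_sample_csv(path, s, chunksize=10000, **read_kwargs):
    # Classic "Algorithm R": every row of the file ends up in the
    # final sample with equal probability s / total_rows.
    reservoir = []   # current sample, as a list of row Series
    seen = 0         # number of data rows seen so far
    for chunk in pd.read_csv(path, chunksize=chunksize, **read_kwargs):
        for _, row in chunk.iterrows():
            seen += 1
            if len(reservoir) < s:
                reservoir.append(row)
            else:
                j = np.random.randint(0, seen)  # uniform in [0, seen)
                if j < s:                       # happens with probability s / seen
                    reservoir[j] = row
    return pd.DataFrame(reservoir)

Iterating row by row is slow in pandas, so this trades speed for exact uniformity; it is meant as an illustration of the idea rather than a tuned implementation.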

I think this is faster than the other methods shown here and may be worth trying.

Say we have already chosen the rows to be skipped and stored them in a list skipped. First, I convert it to a boolean lookup table.

import numpy as np

# Some preparation:
skipped = np.asarray(skipped)
# MAX >= number of rows in the file
bool_skipped = np.zeros(shape=(MAX,), dtype=bool)
bool_skipped[skipped] = True
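
The answer assumes skipped already exists. For completeness, here is one hedged way it might be built for a random sample; the line-counting loop, the np.random.choice call, and the variable names below are my assumptions, not part of the original answer (note that line 0 is the header and must not be skipped):

import numpy as np

filename = 'test.csv'   # the file generated in the test below

# Count data rows once without parsing them (exclude the header line).
with open(filename) as f:
    n_rows = sum(1 for _ in f) - 1

s = n_rows // 10   # keep 10% of the rows, i.e. skip ~90% as in the test below
keep = np.random.choice(np.arange(1, n_rows + 1), size=s, replace=False)
skipped = np.setdiff1d(np.arange(1, n_rows + 1), keep)   # line numbers to skip
MAX = n_rows + 1   # header line included in the lookup table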

Main stuff:

from io import StringIO
# in Python 2 use
# from StringIO import StringIO
import pandas as pd

def load_with_buffer(filename, bool_skipped, **kwargs):
    # Copy only the lines that are not marked as skipped into an
    # in-memory buffer, then let pandas parse the buffer.
    s_buf = StringIO()
    with open(filename) as f:
        for count, line in enumerate(f):
            if bool_skipped[count]:
                continue
            s_buf.write(line)
    s_buf.seek(0)
    df = pd.read_csv(s_buf, **kwargs)
    return df

I tested it as follows:

df = pd.DataFrame(np.random.rand(100000, 100))
df.to_csv('test.csv')

df1 = load_with_buffer('test.csv', bool_skipped, index_col=0)

with 90% of rows skipped. It performs comparably to

pd.read_csv('test.csv', skiprows=skipped, index_col=0)

and is about 3-4 times faster than using dask or reading in chunks.
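
For the dask comparison mentioned above, a rough equivalent might look like the following; dask samples by fraction here rather than by an absolute row count, and this snippet is an assumed reconstruction of the benchmark, not code from the answer:

import dask.dataframe as dd

# Read lazily, sample ~10% of the rows, then materialize as a pandas DataFrame.
ddf = dd.read_csv('test.csv')
df_dask = ddf.sample(frac=0.1, random_state=0).compute()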
