I am reading a large CSV file in chunks because I don't have enough memory to load it all at once. I would like to read its first 10 rows (rows 0 to 9), skip the next 10 rows (10 to 19), read the next 10 rows (20 to 29), skip the next 10 (30 to 39), read rows 40 to 49, and so on. This is the code I am using:

# initializing n1 and n2
n1 = 1
n2 = 2
# reading data in chunks
for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes,
                         skiprows=list(range((n1 * 10) + 1, (n2 * 10) + 1))):
    sample_chunk = chunk
    # displaying the sample_chunk
    print(sample_chunk)
    # incrementing n1
    n1 = n1 + 2
    # incrementing n2
    n2 = n2 + 2

However, the code does not work the way I intended. It only skips rows 10 to 19: it reads rows 0 to 9, skips 10 to 19, reads 20 to 29, then reads 30 to 39, then 40 to 49, and keeps reading all the remaining rows. Please help me identify what I am doing wrong.

  • It is because skiprows is evaluated only once, when you call pd.read_csv: you pass skiprows=[11, 12, 13, 14, 15, 16, 17, 18, 19, 20], and changing n1 and n2 inside the loop has no effect on it. Commented Feb 19, 2019 at 11:38
  • Check this answer using chunksize: stackoverflow.com/questions/25962114/… Commented Feb 19, 2019 at 11:39
  • @Noor how many total rows do you have? Commented Feb 19, 2019 at 11:42
  • @Nihal more than 2 million rows. Commented Feb 19, 2019 at 11:44

2 Answers

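code:

# lengthOfFile is the number of data rows in the CSV (header excluded)
ro = list(range(0, lengthOfFile + 10, 10))  # block boundaries: 0, 10, 20, ...
# File lines to skip; + 1 because the header occupies file line 0.
# Stopping at len(ro) - 1 guarantees that ro[i + 1] always exists.
d = [j + 1 for i in range(1, len(ro) - 1, 2) for j in range(ro[i], ro[i + 1])]
print(d)

for chunk in pd.read_csv('../input/train.csv', chunksize=10, dtype=dtypes, skiprows=d):
    print(chunk)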

for example (this prints the data-row numbers, before the + 1 header offset is applied):

lengthOfFile = 100
ro = list(range(0, lengthOfFile + 10, 10))
d = [j for i in range(1, len(ro), 2) for j in range(ro[i], ro[i + 1])]
print(d)

output: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99]
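The header offset discussed in the comments below can be checked on a small in-memory file. This is only a sketch: the 50-row CSV, its single `row` column, and the variable names are made up for illustration and stand in for `train.csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for train.csv: a header line plus 50 data rows
# whose single column holds the 0-based data-row number.
n_rows = 50
csv_text = "row\n" + "\n".join(str(i) for i in range(n_rows)) + "\n"

# Data rows to drop: 10-19, 30-39, ...
skip_data_rows = [r for r in range(n_rows) if (r // 10) % 2 == 1]

# The header occupies file line 0, so data row r sits on file line r + 1.
skip_lines = [r + 1 for r in skip_data_rows]

kept_rows = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10, skiprows=skip_lines):
    kept_rows.extend(chunk["row"].tolist())

print(kept_rows)
```

Passing `skip_data_rows` instead of `skip_lines` shifts every skipped block up by one row, which is the off-by-one behaviour reported in the first comment.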

2 Comments

Thanks @Nihal . I have implemented your code, however, it reads rows 0,1,2,3,4,5,6,7,8,19 instead of 0-9, then reads rows 20,21,22,23,24,25,26,27,28,39 instead of 20-29, then reads 40,41,42,43,44,45,46,47, 48,59 instead of 40-49 and so on.
Updated my answer; just use j + 1 when creating d.

With your method, you need to define all the rows to skip at the time you call pd.read_csv, which you can do like this:

# + 1 on both bounds because the header occupies file line 0
rowskips = [i for x in range(1, lengthOfFile // 10 + 1, 2) for i in range(x*10 + 1, (x+1)*10 + 1)]

where lengthOfFile is the number of data rows in the file.

Then pass the list to pd.read_csv:

pd.read_csv('../input/train.csv',chunksize=10, dtype=dtypes,skiprows=rowskips)
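The `lengthOfFile` value has to be known before the skip list can be built. One way to get it without loading the file into memory is to stream the file and count lines. The helper name `count_data_rows` is hypothetical, and the sketch assumes one record per physical line (no quoted fields with embedded newlines).

```python
import os
import tempfile

def count_data_rows(path, has_header=True):
    """Count data rows by streaming the file line by line.

    Assumes one record per physical line (no quoted fields
    containing embedded newlines).
    """
    with open(path, "rb") as f:
        n_lines = sum(1 for _ in f)
    if has_header and n_lines > 0:
        n_lines -= 1
    return n_lines

# Small demonstration on a throwaway file: a header plus 25 data rows.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as tmp:
    tmp.write("row\n" + "".join(f"{i}\n" for i in range(25)))
demo_path = tmp.name

lengthOfFile = count_data_rows(demo_path)
os.remove(demo_path)
print(lengthOfFile)
```

For the real file this would be called as `count_data_rows('../input/train.csv')`; it reads the file once but keeps only a counter in memory.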

From the documentation:

skiprows : list-like, int or callable, optional

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

So you can pass an int, a list, or a callable:

int -> skips that many lines at the start of the file
list -> skips the (0-indexed) line numbers given in the list
callable -> is called with each line number and skips the line when it returns True
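The three forms can be compared side by side on a tiny in-memory file. In this sketch the CSV is deliberately header-less (read with `header=None`) so that file line numbers and row numbers coincide; the data and the `read_values` helper are made up for illustration.

```python
import io
import pandas as pd

# Six lines, no header: the value on each line equals its 0-based line number.
csv_text = "0\n1\n2\n3\n4\n5\n"

def read_values(skiprows):
    """Read the demo CSV with the given skiprows and return the values kept."""
    df = pd.read_csv(io.StringIO(csv_text), header=None, names=["value"],
                     skiprows=skiprows)
    return df["value"].tolist()

with_int = read_values(2)                            # drop the first two lines
with_list = read_values([1, 3])                      # drop lines 1 and 3
with_callable = read_values(lambda i: i % 2 == 1)    # drop every odd line

print(with_int, with_list, with_callable)
```

When the file does have a header, the header is line 0, so every data row sits one line further down than its row number.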

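You were passing a list, which fixes the lines to skip at the moment pd.read_csv is called. It cannot be updated afterwards, so changing n1 and n2 inside the loop has no effect. Another way is to pass a callable, e.g. lambda x: x in rowskips, which is evaluated for each line number to decide whether that line is skipped. (Converting rowskips to a set first makes the membership test much faster on a large file.)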
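A callable can also compute the answer directly from the line number, in which case neither the skip list nor the file length is needed. The following is a sketch under assumptions: the file has a single header line, and the 45-row in-memory CSV, the `BLOCK` constant, and the `skip_line` function are hypothetical names used only for this illustration.

```python
import io
import pandas as pd

BLOCK = 10  # rows per block: keep one block, skip the next, and so on

def skip_line(line_no):
    """Return True for file lines that fall in an odd-numbered data block."""
    if line_no == 0:
        return False            # line 0 is the header: always keep it
    data_row = line_no - 1      # undo the header offset
    return (data_row // BLOCK) % 2 == 1

# Hypothetical stand-in for train.csv: a header plus 45 data rows.
csv_text = "row\n" + "\n".join(str(i) for i in range(45)) + "\n"

summary = []
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=BLOCK, skiprows=skip_line):
    # Record (first row, last row, number of rows) for each chunk.
    summary.append((int(chunk["row"].iloc[0]), int(chunk["row"].iloc[-1]), len(chunk)))

print(summary)
```

For the real file the same function could be passed as `skiprows=skip_line` alongside the original `chunksize=10` and `dtype=dtypes` arguments. The trade-off is that pandas calls the function once per line, which may be slower than a precomputed list on a very large file.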

6 Comments

Your program only keeps rows 0-9 and skips all the others.
@Nihal Yeah, I missed the 2 in range.
Still wrong: let's say I have length=400, then it will go up to 4000.
@Nihal Thanks for that. Yeah, now this should be fine; I overlooked the *10 in the second for.
@Noor Added the explanation.