0

I am processing a csv file and before that I am getting the row count using the below code.

total_rows=sum(1 for row in open(csv_file,"r",encoding="utf-8"))

The code has been written with the help given in this link. However, the total_rows doesn't match the actual number of rows in the csv file. I have found an alternative to do it but would like to know why is this not working correctly??

In the CSV file, there are cells with huge text and I have to use the encoding to avoid errors reading the csv file.

Any help is appreciated!

10
  • 1
    When you have cells with huge text, csv parsers such as pandas.read_csv will read it correctly as opposed to open which reads the file line by line, not considering the huge text as one. Commented Mar 15, 2019 at 8:11
  • @Chris yes. I have found alternate ways to read the rows correctly. But what is wrong with the above code that was suggested for using to count rows?? Commented Mar 15, 2019 at 8:12
  • 3
    As I said, suppose your huge text is two or more lines. In csv, this is considered as one cell which will be inside a single line. Nevertheless, open doesn't know it should be considered as one cell, and simply returns count of lines. Commented Mar 15, 2019 at 8:14
  • 1
    How many lines are in your file? Are you sure there is not a mix of \r and \n that might give your a wrong count? or some \n or \r within some cell text? Commented Mar 15, 2019 at 8:14
  • I have run your code on a large file and it was providing me the correct output: python count_row.py 1715181568 same result with wc -l Commented Mar 15, 2019 at 8:17

2 Answers 2

3

Let's assume you have a csv file in which some cell's a multi-line text.

$ cat example.csv
colA,colB
1,"Hi. This is Line 1.
And this is Line2"

Which, by look of it, has three lines and wc -l agrees:

$ wc -l example.csv
3 example.csv

And so does open with sum:

sum(1 for row in open('./example.csv',"r",encoding="utf-8"))
# 3

But now if you read is with some csv parser such as pandas.read_csv:

import pandas as pd

df = pd.read_csv('./example.csv')
df
   colA                                    colB
0     1  Hi. This is Line 1.\nAnd this is Line2

The other alternative way to fetch the correct number of rows is given below:

with open(csv_file,"r",encoding="utf-8") as f:
     reader = csv.reader(f,delimiter = ",")
     data = list(reader)
     row_count = len(data)

Excluding the header, the csv contains 1 line, which I believe is what you expect. This is because colB's first cell (a.k.a. huge text block) is now properly handled with the quotes wrapping the entire text.

Sign up to request clarification or add additional context in comments.

Comments

1

I think that the problem in here is because you are not counting rows, but counting newlines (either \r\n in windows or \n in linux). The problem lies when you have a cell with text where you have newline character example:

1, "my huge text\n with many lines\n"
2, "other text"

Your method for data above will return 4 when accutaly there are only 2 rows

Try to use Pandas or other library for reading CSV files. Example:

import pandas as pd
data = pd.read_csv(pathToCsv, sep=',', header=None);
number_of_rows = len(df.index) # or df[0].count()

Note that len(df.index) and df[0].count() are not interchangeable as count excludes NaNs.

3 Comments

I was able to get the correct number of rows without using pandas but are you suggesting that open function is counting the number of lines in each cell as well?
Yes, because open just reads file. It is not taking into account that this is CSV.
Chris summed what you're saying. Thank you.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.