
So I currently have a dataset with a column called 'logid', which consists of 4-digit numbers. I have about 200k rows in my csv files, and I would like to count each unique logid and output it something like this:

Logid | # of occurrences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the 'logid' column of the csv file. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:

import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts

Using this I get a weird output that I don't quite understand:

4            16463
10013          490
pserverno        1
Name: my_data, dtype: int64

I know I am doing something wrong in the last line

counts = df['my_data'].value_counts()

but I am not too sure what. For reference, the values I am extracting are from column C in the excel file (so I guess that's column 3?). Thanks in advance!

  • will you provide your csv file structure? Commented Jul 31, 2017 at 4:06
  • It's made up of 64 columns (all str values) and 200k rows made up of int values. I only want to look at the 3rd column, which has the heading 'logid', but for all 200k rows. They are 100% all integers. Not sure what else you mean. Commented Jul 31, 2017 at 4:43
  • Possible duplicate of Searching CSV files with Pandas (unique id's) - Python Commented Jul 31, 2017 at 6:33
  • Seems like you were asking the same question a few days ago here. How is this different? Why not simply edit the first question with your updated code? Commented Jul 31, 2017 at 6:35

3 Answers


OK, from my understanding, I think the csv file may look like this:

row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h

And I have handled this format of csv file like this:

# skipping the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())

And the output looks like this:

1001    2
1000    2

Hope this will help.

Update 1

 df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)

In this line, the parameter names=['my_data'] creates a new header for the data frame. As your csv file already has a header row, you can skip this parameter. And since the header row you want comes after the first three rows, you can skip those with skiprows=3. One last thing: you are reading every csv file in the given path, so make sure all of the csv files have the same format. Happy coding.
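Putting that advice together, here is a minimal, self-contained sketch; it uses an in-memory csv via io.StringIO purely for illustration, in place of a real file from the glob loop:

```python
import io

import pandas as pd

# In-memory stand-in for one of the csv files described above:
# three junk rows, then the real header row, then the data.
csv_text = """row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
"""

# Skip the first three rows so 'logid' becomes the header,
# then count how often each logid appears.
df = pd.read_csv(io.StringIO(csv_text), skiprows=3)
counts = df['logid'].value_counts()
print(counts)
```

With a real file you would pass the filename instead of the StringIO object; the skiprows=3 and value_counts calls are unchanged.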


2 Comments

The csv file was slightly different from how you described it; however, using this section of your code, print(df['logid'].value_counts()), I was able to output the logid and the number of times it shows up in the column of the csv file. Thanks!!
@jezrael - I did some modification of your code and posted it to my answer. OK, let me remove it. :(

I think you need to create one big DataFrame by appending all the dfs to a list and then using concat first:

dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)

df = pd.concat(dfs)

Then use value_counts - the output is a Series. So to get a 2-column DataFrame you need rename_axis with reset_index:

counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts

Or groupby and aggregate size:

counts = df.groupby('logid').size().reset_index(name='count')
counts
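A quick self-contained comparison of the two routes (with a made-up 'logid' column for illustration) shows they produce the same two-column result, differing only in row order, since value_counts sorts by count while groupby sorts by key:

```python
import pandas as pd

# Small stand-in for the concatenated frame; the column name 'logid'
# matches the question, the values are made up.
df = pd.DataFrame({'logid': [1000, 1001, 1000, 1001, 1000]})

# Route 1: value_counts -> name the index axis -> move it into a column.
via_value_counts = (df['logid'].value_counts()
                               .rename_axis('logid')
                               .reset_index(name='count'))

# Route 2: groupby + size, giving the same two columns.
via_groupby = df.groupby('logid').size().reset_index(name='count')

print(via_value_counts)
print(via_groupby)
```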

4 Comments

This would work; however, I have 6000 csv files with 200000 rows in each file, so concatenating them would not be a wise idea. In addition, I also needed the data separate for each file. Look below for the answer I was looking for if you are curious :)
Do you mean the solution of R.A.Munna? What was helpful about it? I don't understand. Can you explain more?
OK, and is it possible to filter only the column logid by df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)? I edited my answer.
I think it removes all columns except logid - pandas.pydata.org/pandas-docs/stable/…

You may try this:

counts = df['logid'].value_counts()

Now counts should give you the count of each value.
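A likely cause of the error reported in the comment below is that the file was read with names=['my_data'], which replaces the real header, so no 'logid' column exists. A minimal reproduction with an in-memory csv (made-up values):

```python
import io

import pandas as pd

csv_text = "logid\n1000\n1001\n1000\n"

# Passing names= forces a new single-column header, so the real header
# 'logid' becomes an ordinary data row under the column 'my_data'.
df = pd.read_csv(io.StringIO(csv_text), names=['my_data'])
print(df.columns.tolist())   # no 'logid' column here

# Reading without names= keeps 'logid' as the column name,
# so df2['logid'].value_counts() works as intended.
df2 = pd.read_csv(io.StringIO(csv_text))
counts = df2['logid'].value_counts()
print(counts)
```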

1 Comment

I get this error: 'the label [logid] is not in the [index]'
