
So I currently have a dataset with a column called 'logid', which consists of 4-digit numbers. I have about 200k rows in my csv files, and I would like to count each unique logid and output it something like this:

Logid | # of occurrences for each unique id. So it might be 1000 | 10, meaning that the logid 1000 is seen 10 times in the 'logid' column of the csv file. The separator | is not necessary, just making it easier for you guys to read. This is my code currently:

import pandas as pd
import os, sys
import glob
count = 0
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)
    counts = df['my_data'].value_counts()
counts

Using this I get a weird output that I don't quite understand:

4            16463
10013          490
pserverno        1
Name: my_data, dtype: int64

I know I am doing something wrong in the last line

counts = df['my_data'].value_counts()

but I am not too sure what. For reference, the values I am extracting are from column C in the excel file (so I guess that's column 3?). Thanks in advance!

  • will you provide your csv file structure? Commented Jul 31, 2017 at 4:06
  • It's made up of 64 columns (all str values) and 200k rows made up of int values. I only want to look at the 3rd column, which has the heading 'logid', but for all 200k rows. They are 100% all integers. Not sure what else you mean. Commented Jul 31, 2017 at 4:43
  • Possible duplicate of Searching CSV files with Pandas (unique id's) - Python Commented Jul 31, 2017 at 6:33
  • Seems like you were asking the same question a few days ago here. How is this different? Why not simply edit the first question with your updated code? Commented Jul 31, 2017 at 6:35

3 Answers


OK, from my understanding, I think the csv file may look like this:

row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h

And I have handled this format of csv file like this:

# skipping the first three rows
df = pd.read_csv("file_name.csv", skiprows=3)
print(df['logid'].value_counts())

And the output looks like this:

1001    2
1000    2

Hope this will help.

Update 1

 df = pd.read_csv(fname, dtype=None, names=['my_data'], low_memory=False)

In this line, the parameter names=['my_data'] creates a new header for the data frame. As your csv file already has a header row, you can skip this parameter. And since the header row you want comes after the first three rows, you can skip those with skiprows=3. One last thing: you are reading every csv file in the given path, so make sure all of the csv files have the same format. Happy coding.
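Putting that advice together, here is a minimal, self-contained sketch; it uses an in-memory csv via io.StringIO purely for illustration, in place of a real file from the glob loop:

```python
import io

import pandas as pd

# In-memory stand-in for one of the csv files described above:
# three junk rows, then the real header row, then the data.
csv_text = """row1,row1,row1
row2,row2,row2
row3,row3,row3
logid,header1,header2
1000,a,b
1001,c,d
1000,e,f
1001,g,h
"""

# Skip the first three rows so 'logid' becomes the header,
# then count how often each logid appears.
df = pd.read_csv(io.StringIO(csv_text), skiprows=3)
counts = df['logid'].value_counts()
print(counts)
```

With a real file you would pass the filename instead of the StringIO object; the skiprows=3 and value_counts calls are unchanged.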


2 Comments

The csv file was slightly different from how you described it; however, using this section of your code, print(df['logid'].value_counts()), I was able to output the logid and the number of times it shows up in the column of the csv file. Thanks!!
@jezrael - I did some modification of your code and posted it to my answer. OK, let me remove it. :(

I think you need to create one big DataFrame by appending all the dfs to a list and then using concat first:

dfs = []
path = "C:\\Users\\cam19\\Desktop\\New folder\\*.csv"
for fname in glob.glob(path):
    df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)
    dfs.append(df)

df = pd.concat(dfs)

Then use value_counts - the output is a Series. So to get a 2-column DataFrame you need rename_axis with reset_index:

counts = df['logid'].value_counts().rename_axis('logid').reset_index(name='count')
counts

Or groupby and aggregate size:

counts = df.groupby('logid').size().reset_index(name='count')
counts
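A quick self-contained comparison of the two routes (with a made-up 'logid' column for illustration) shows they produce the same two-column result, differing only in row order, since value_counts sorts by count while groupby sorts by key:

```python
import pandas as pd

# Small stand-in for the concatenated frame; the column name 'logid'
# matches the question, the values are made up.
df = pd.DataFrame({'logid': [1000, 1001, 1000, 1001, 1000]})

# Route 1: value_counts -> name the index axis -> move it into a column.
via_value_counts = (df['logid'].value_counts()
                               .rename_axis('logid')
                               .reset_index(name='count'))

# Route 2: groupby + size, giving the same two columns.
via_groupby = df.groupby('logid').size().reset_index(name='count')

print(via_value_counts)
print(via_groupby)
```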

4 Comments

This would work; however, I have 6000 csv files with 200000 rows in each file, so concatenating them would not be a wise idea. In addition, I also needed the data separate for each file. Look below for the answer I was looking for if you are curious :)
Do you mean the solution of R.A.Munna? What was helpful about it? I don't understand. Can you explain more?
OK, and is it possible to filter only the column logid by df = pd.read_csv(fname, dtype=None, usecols=['logid'], low_memory=False)? I edited my answer.
I think it removes all columns except logid - pandas.pydata.org/pandas-docs/stable/…

You may try this:

counts = df['logid'].value_counts()

Now counts should give you the count of each value.
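A likely cause of the error reported in the comment below is that the file was read with names=['my_data'], which replaces the real header, so no 'logid' column exists. A minimal reproduction with an in-memory csv (made-up values):

```python
import io

import pandas as pd

csv_text = "logid\n1000\n1001\n1000\n"

# Passing names= forces a new single-column header, so the real header
# 'logid' becomes an ordinary data row under the column 'my_data'.
df = pd.read_csv(io.StringIO(csv_text), names=['my_data'])
print(df.columns.tolist())   # no 'logid' column here

# Reading without names= keeps 'logid' as the column name,
# so df2['logid'].value_counts() works as intended.
df2 = pd.read_csv(io.StringIO(csv_text))
counts = df2['logid'].value_counts()
print(counts)
```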

1 Comment

I get this error: 'the label [logid] is not in the [index]'
