Splitting a csv file into panda dataframe by multiple columns

Question

I have a tsv file with multiple columns. There are 10 and more columns but the columns important to me are the ones with the name user_name, shift_id, url_id. I want to create a data frame that first separates the entire csv file based on user_names i.e only rows with same user_name are grouped together. From that chunk I make another chunk where only rows with certain shift_id are grouped together and then from that chunk make a chunk with same url. I unfortunately cannot share the data because of the company rule and making an imaginary data table might be more confusing.

Two of the other columns have time-stamps. I want to calculate the time duration of the chunk but only after I group chunk according to those columns.

I have seen answers that split data-frame by a specific column value,but in my case I have three column values and the order in which they are separated matters too.

Thank you for your help!

Do you mean only dataframe that selects only 3 columns that are important? — niraj
– niraj, Commented Jun 19, 2017 at 3:42
How about showing us some data, or code, or anything really. — Stephen Rauch
– Stephen Rauch ♦, Commented Jun 19, 2017 at 3:43
So are you trying to create dataframe with where username = x, col2 = y and col3 = z? — niraj
– niraj, Commented Jun 19, 2017 at 4:02
@open-source Yes. That sounds right. Other columns will have different value but these columns will have same value. — Zeta11
– Zeta11, Commented Jun 19, 2017 at 4:10

niraj · Accepted Answer · 2017-06-19 05:12:47Z

2

Assuming you read the columns to dataframe

df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6],'col3':[7,8,9],
               'col4':[1,2,3],'col5':[1,2,3],'col6':[1,2,3],
               'col7':[1,2,3],'col8':[1,2,3],'col9':[1,2,3],
               'col91':[1,2,3]})
print(df)

Output:

     col1  col2  col3  col4  col5  col6  col7  col8  col9  col91
0     1     4     7     1     1     1     1     1     1      1
1     2     5     8     2     2     2     2     2     2      2
2     3     6     9     3     3     3     3     3     3      3

Now, we can select only three columns of interest, let it be col1, col2, and col3

tmp_df = df[['col1', 'col2', 'col3']]
print(tmp_df)

Output:

     col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9

Further we want to filter based on three column values:

final_df = tmp_df[(tmp_df.col1 == 1) & (tmp_df.col2 == 4) & (tmp_df.col3== 7)]
print(final_df)

Output:

    col1  col2  col3
0     1     4     7

After reading to dataframe, all these above steps can be acheived in single line:

final = df[['col1', 'col2', 'col3']][(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final

Hope it helps!

Update:

df = pd.DataFrame({'col1':[1,1,1,1,1], 'col2':[4,4,4,4,7],'col3':[7,7,9,7,7],
               'col4':['X','X','X','X','X'],'col5':['X','X','X','X','X'],'col6':['X','X','X','X','X'],
               'col7':['X','X','X','X','X'],'col8':['X','X','X','X','X'],'col9':['X','X','X','X','X'],
               'col91':['X','X','X','X','X']})
print(df)

Output:

     col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
2     1     4     9    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X
4     1     7     7    X    X    X    X    X    X     X

Now, usinig similar masking as above:

final = df[(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final

Output:

    col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X

edited Jun 19, 2017 at 5:12

answered Jun 19, 2017 at 4:23

niraj

18.2k4 gold badges36 silver badges50 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

Zeta11 Over a year ago

This is great. But assume you have more rows with 1 in col1, more different rows with 4 in col2 and more with 7 in column 3. Since the order matters I first want all rows that has 1 in col1. Then from these rows , I want all rows that has 4 in col2 and then from these rows I want rows with 7 in col3. In the end I will have rows with 1 in col1, 4 in col2, and 7 in col3 along with other cols with different value. I need to then operate on the other cols. I hope this helped. Thank you so much for your time and patience. I owe you big time!!

niraj Over a year ago

@BinamrataSharma I added the updated section, don't know if it is what you meant. I still don't know why would ordering matter, since we are looking for intersection of three columns.

Zeta11 Over a year ago

This is good. I will try to proceed from here. But instead of saying col1 == 1, I wanted something generic like for each unique value in this column because it will impossible to keep track of unique values in all the columns manually. But this is a great start.

Collectives™ on Stack Overflow

Splitting a csv file into panda dataframe by multiple columns

1 Answer 1

Update:

3 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Update:

3 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related