3

I have a tsv file with multiple columns. There are 10 and more columns but the columns important to me are the ones with the name user_name, shift_id, url_id. I want to create a data frame that first separates the entire csv file based on user_names i.e only rows with same user_name are grouped together. From that chunk I make another chunk where only rows with certain shift_id are grouped together and then from that chunk make a chunk with same url. I unfortunately cannot share the data because of the company rule and making an imaginary data table might be more confusing.

Two of the other columns have time-stamps. I want to calculate the time duration of the chunk but only after I group chunk according to those columns.

I have seen answers that split data-frame by a specific column value,but in my case I have three column values and the order in which they are separated matters too.

Thank you for your help!

8
  • Do you mean only dataframe that selects only 3 columns that are important? Commented Jun 19, 2017 at 3:42
  • 1
    How about showing us some data, or code, or anything really. Commented Jun 19, 2017 at 3:43
  • @open-source Did my edit help ? Commented Jun 19, 2017 at 3:59
  • So are you trying to create dataframe with where username = x, col2 = y and col3 = z? Commented Jun 19, 2017 at 4:02
  • @open-source Yes. That sounds right. Other columns will have different value but these columns will have same value. Commented Jun 19, 2017 at 4:10

1 Answer 1

2

Assuming you read the columns to dataframe

df = pd.DataFrame({'col1':[1,2,3], 'col2':[4,5,6],'col3':[7,8,9],
               'col4':[1,2,3],'col5':[1,2,3],'col6':[1,2,3],
               'col7':[1,2,3],'col8':[1,2,3],'col9':[1,2,3],
               'col91':[1,2,3]})
print(df)

Output:

     col1  col2  col3  col4  col5  col6  col7  col8  col9  col91
0     1     4     7     1     1     1     1     1     1      1
1     2     5     8     2     2     2     2     2     2      2
2     3     6     9     3     3     3     3     3     3      3

Now, we can select only three columns of interest, let it be col1, col2, and col3

tmp_df = df[['col1', 'col2', 'col3']]
print(tmp_df)

Output:

     col1  col2  col3
0     1     4     7
1     2     5     8
2     3     6     9

Further we want to filter based on three column values:

final_df = tmp_df[(tmp_df.col1 == 1) & (tmp_df.col2 == 4) & (tmp_df.col3== 7)]
print(final_df)

Output:

    col1  col2  col3
0     1     4     7

After reading to dataframe, all these above steps can be acheived in single line:

final = df[['col1', 'col2', 'col3']][(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final

Hope it helps!

Update:

df = pd.DataFrame({'col1':[1,1,1,1,1], 'col2':[4,4,4,4,7],'col3':[7,7,9,7,7],
               'col4':['X','X','X','X','X'],'col5':['X','X','X','X','X'],'col6':['X','X','X','X','X'],
               'col7':['X','X','X','X','X'],'col8':['X','X','X','X','X'],'col9':['X','X','X','X','X'],
               'col91':['X','X','X','X','X']})
print(df)

Output:

     col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
2     1     4     9    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X
4     1     7     7    X    X    X    X    X    X     X

Now, usinig similar masking as above:

final = df[(df.col1 == 1) & (df.col2 == 4) & (df.col3== 7)]
final

Output:

    col1  col2  col3 col4 col5 col6 col7 col8 col9 col91
0     1     4     7    X    X    X    X    X    X     X
1     1     4     7    X    X    X    X    X    X     X
3     1     4     7    X    X    X    X    X    X     X
Sign up to request clarification or add additional context in comments.

3 Comments

This is great. But assume you have more rows with 1 in col1, more different rows with 4 in col2 and more with 7 in column 3. Since the order matters I first want all rows that has 1 in col1. Then from these rows , I want all rows that has 4 in col2 and then from these rows I want rows with 7 in col3. In the end I will have rows with 1 in col1, 4 in col2, and 7 in col3 along with other cols with different value. I need to then operate on the other cols. I hope this helped. Thank you so much for your time and patience. I owe you big time!!
@BinamrataSharma I added the updated section, don't know if it is what you meant. I still don't know why would ordering matter, since we are looking for intersection of three columns.
This is good. I will try to proceed from here. But instead of saying col1 == 1, I wanted something generic like for each unique value in this column because it will impossible to keep track of unique values in all the columns manually. But this is a great start.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.