2

I have a csv file dataset that contains 170 columns, the first 5 columns contain unique identifiers (Platform, ID, date, length of call, name). The remaining columns 175 contain binary data covering 10 categories. I want to condense those columns so that the number of columns in my data frame is 15. Including an example below:

import pandas as pd

df1 = pd.DataFrame({'Platform': ['Telephone', 'Chat', 'Text'], 'ID': [1, 2, 
3], 'Length': [1545,1532,1511], 'Name': ['andy', 'helen', 'peter'], 'Problem: 
A':[0,1,0], 'Problem: B':[1,0,0], 'Problem: C': [0,0,1], 'Solution: A': 
[0,1,0], 'Solution: B':[1,0,0], 'Solution: C': [0,0,1]})

The output is:

df.head()

ID  Date        Length\\
1   2015-10-16    1545
2   2015-10-09    1532
3   2015-10-13    1511 

Name Problem: A Problem: B  Problem: C  Solution: A Solution: B Solution: C
andy         0          1           0            0           1           0
helen        1          0           0            1           0           0
peter        0          0           1            0           0           1

What I want the data frame to look like:

  Platform ID Length  Name   Problem  Solution
  Telephone 1 1545    andy    B        B
  Chat      2 1532    helen   A        A
  Text      3 1511    peter   C        C

FYI this is not the full dataframe. There are a total of 170 colums that I would like to transform into 15.

1

2 Answers 2

1

You can use groupby + apply with a dot product on the columns ;

df = df.set_index('Name')
df.groupby(df.columns.str.split(':').str[0], axis=1).apply(
    lambda x: x.dot(x.columns.str.split(': ').str[1])
)

      Problem Solution
Name                  
andy        B        B
helen       A        A
peter       C        C
Sign up to request clarification or add additional context in comments.

3 Comments

When I do this I receive a "ValueError: Grouper and axis must be same length" message. I'm sure that it is something in the code that I don't understand. Would you mind telling me what each of the suggested bit of above code does? I'm familiar with df.groupby() and .apply(), but not with lambda or x.dot.
From what I can tell, this specifies that the column name will be the word prior to the (' : ') and the value will be the word following the (' : ').
@REFER it means that what you passed df.columns.str.split(':').str[0] and the number of columns in your dataframe aren't the same. Can you look into it please?
0

I created this custom function which will serve your purpose. I got the idea from this stackoverflow article

def condenseCols(data,finalCol,*cols):
    cols = list(cols)
    x = data[cols] # Slice the cols
    x = x.idxmax(axis=1) 
    # x is now a series, holding column name of the max value in the row i.e one of the column from cols
    x = x.apply(lambda s : s.split(": ")[1]) # extract only the prefix (A,B,C)

    data[finalCol] = x
    data = data.drop(cols, axis=1, inplace=True) # Drop the columns : cols
    return data

Call this method, by passing the column names you want to condense, along with the final name of column

condenseCols(df1,'Problem','Problem: A','Problem: B','Problem: C')
condenseCols(df1,'Solution','Solution: A','Solution: B','Solution: C')

There are other ways too, to do this, as cited in the article stackoverflow article

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.