I have a pandas dataframe like this.

   Time                      Source Level  County  Town
0  2021-12-01 10:01:41.443   NaN    NaN    NaN     NaN
1                      NaN   Test   3      C1      C1-T1
2                      NaN   Test   5-     C2      C2-T0
3                      NaN   Test   5-     C2      C2-T1
4  2021-12-01 10:01:46.452   NaN    NaN    NaN     NaN

I want to combine the Town values of rows that share the same Source, Level and County values.

I have tried isin, groupby, and diff (but my values are strings), and I still can't figure it out.

Below is the output I want to get.

   Time                      Source Level  County  Town
0  2021-12-01 10:01:41.443   NaN    NaN    NaN     NaN
1                      NaN   Test   3      C1      C1-T1
2                      NaN   Test   5-     C2      C2-T0, C2-T1
3  2021-12-01 10:01:46.452   NaN    NaN    NaN     NaN

Really appreciate your help!

1 Answer

We can make this work by collecting each group's Town values into a list with groupby() and apply(list), then joining each list into a comma-separated string. Let's split it into two steps for clarity.

Personally, I would keep this data as a list within a pandas Series and skip step 2. A comma-separated string might not be ideal to work with later on.
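To illustrate why a list column can be easier to work with, here is a minimal sketch. The small `grouped` frame is a hypothetical stand-in for the result of step 1, typed in by hand rather than computed.

```python
import pandas as pd

# Hypothetical stand-in for the grouped result, with Town kept as lists.
grouped = pd.DataFrame({
    "County": ["C1", "C2"],
    "Town": [["C1-T1"], ["C2-T0", "C2-T1"]],
})

# Count the towns per group directly from the lists.
grouped["n_towns"] = grouped["Town"].str.len()

# Go back to one row per town without any string parsing.
long_form = grouped.explode("Town")

print(grouped)
print(long_form)
```

With a comma-separated string, both operations would need the string to be split again first.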

Step 1:

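# dropna=False (pandas >= 1.1) keeps groups whose keys contain NaN; by default those rows are dropped
output = df.groupby(['Time','Source','Level','County'], dropna=False)['Town'].apply(list).reset_index()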

Returns:

                      Time Source Level County            Town
0  2021-12-01 10:01:41.443    NaN   NaN    NaN           [nan]
1  2021-12-01 10:01:46.452    NaN   NaN    NaN           [nan]
2                      NaN   Test     3     C1         [C1-T1]
3                      NaN   Test    5-     C2  [C2-T0, C2-T1]

Now, we can format them correctly into strings (step 2):

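# Join only the string entries (skipping NaN); groups without any town become NaN again (needs numpy imported as np)
output['Town'] = output['Town'].apply(lambda towns: ', '.join(t for t in towns if isinstance(t, str))).replace('', np.nan)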

Which outputs our desired result:

                      Time Source Level County          Town
0  2021-12-01 10:01:41.443    NaN   NaN    NaN           NaN
1  2021-12-01 10:01:46.452    NaN   NaN    NaN           NaN
2                      NaN   Test     3     C1         C1-T1
3                      NaN   Test    5-     C2  C2-T0, C2-T1
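
For reference, the two steps can be put together as one self-contained sketch. The sample frame is rebuilt by hand from the question, and it assumes that Time is stored as plain strings, which the question does not confirm. The groupby passes dropna=False (pandas 1.1+) so that groups whose keys contain NaN are kept.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question; Time as strings is an assumption.
df = pd.DataFrame({
    "Time": ["2021-12-01 10:01:41.443", np.nan, np.nan, np.nan,
             "2021-12-01 10:01:46.452"],
    "Source": [np.nan, "Test", "Test", "Test", np.nan],
    "Level": [np.nan, "3", "5-", "5-", np.nan],
    "County": [np.nan, "C1", "C2", "C2", np.nan],
    "Town": [np.nan, "C1-T1", "C2-T0", "C2-T1", np.nan],
})

keys = ["Time", "Source", "Level", "County"]

# Step 1: collect the Town values of each group into a list.
output = df.groupby(keys, dropna=False)["Town"].apply(list).reset_index()

# Step 2: join the string entries; groups with no town go back to NaN.
output["Town"] = (
    output["Town"]
    .apply(lambda towns: ", ".join(t for t in towns if isinstance(t, str)))
    .replace("", np.nan)
)

print(output)
```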

6 Comments

OK, I understand. I'll try it. But I don't think I need to group by Time, right? Rows 1-3 have the same Time as row 0, and I want the Time value to appear only in the first row.
You need to use Time if you want your expected output, otherwise you will have 3 rows as output, not 4. But feel free to modify the code to fit your purpose.
Happy to help! If you have any questions, let me know :) Also feel free to accept the answer with the tick mark on the left. It will mark your question as solved and reward some reputation too!
I used groupby with "Time", but it returned an empty dataframe. Also, I need the dataframe sorted in its original order... Any tips? Thanks!
Hmm I'm not too sure about the first issue. To sort a dataframe you can use .sort_values(by=['Column 1','Column 2'])
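
On the two points raised in this thread, a possible explanation and workaround, offered as a sketch rather than a confirmed diagnosis: groupby drops any row whose key contains NaN by default, and in this data every row has a NaN in at least one key column, which would leave nothing to group. Passing dropna=False keeps those rows, and sort=False should return the groups in order of first appearance instead of sorted by key. The sample frame is rebuilt by hand, with Time assumed to be stored as strings.

```python
import numpy as np
import pandas as pd

# Sample data typed in from the question; Time as strings is an assumption.
df = pd.DataFrame({
    "Time": ["2021-12-01 10:01:41.443", np.nan, np.nan, np.nan,
             "2021-12-01 10:01:46.452"],
    "Source": [np.nan, "Test", "Test", "Test", np.nan],
    "Level": [np.nan, "3", "5-", "5-", np.nan],
    "County": [np.nan, "C1", "C2", "C2", np.nan],
    "Town": [np.nan, "C1-T1", "C2-T0", "C2-T1", np.nan],
})

keys = ["Time", "Source", "Level", "County"]

# Default behaviour: rows with NaN in any key are excluded from grouping.
n_default = len(df.groupby(keys).size())
print("groups with default dropna:", n_default)

# Keep NaN keys and preserve the order in which groups first appear.
kept = (
    df.groupby(keys, dropna=False, sort=False)["Town"]
    .apply(list)
    .reset_index()
)
print(kept)
```

This avoids a separate .sort_values() call, which would need a column that encodes the original row order.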