1

I have a custom dataframe B with the following properties (obtained via B.dtypes):

[236 rows x 10 columns]
Filename               object
Path                   object
Filenumber              int64
Channel                object
Folder                 object
User-defined labels    object
Ranking                 int64
Lower limit             int64
Upper limit             int64
Enabled                  bool
dtype: object

As example, calling

print(
    B.loc[
        B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
        & (B["Filenumber"] == 98)
    ]
)

gives

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\test_user\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

Now, I would like to replace/add new labels to the column ["User-defined labels"] by setting them via a list of labels. The list of labels is defined as

new_labels = ["Label 1", "Label 3"]

I accessed the column via

B.loc[
    B["User-defined labels"].explode().eq("Label 1").groupby(level=0).any()
    & (B["Filenumber"] == 98),
    ["User-defined labels"],
]

Initially, I am starting from an output of

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1]        0            0       999999     True

and would like to obtain

        Filename                                               Path  Filenumber      Channel   Folder User-defined labels  Ranking  Lower limit  Upper limit  Enabled
0    File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_0  Label 1           [Label 1, Label 3]        0            0       999999     True
39   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_1  Label 1           [Label 1, Label 3]        0            0       999999     True
78   File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_2  Label 1           [Label 1, Label 3]        0            0       999999     True
117  File_98.csv  C:\Users\riro\Documents\Training_data\Label 1\...          98  subfolder_3  Label 1           [Label 1, Label 3]        0            0       999999     True

For that, I tried the following operations:

  • B.loc[] = new_labels
    Fails with Must have equal len keys and value when setting with an iterable

  • B.loc[] = [new_labels]
    Fails with Must have equal len keys and value when setting with an ndarray

  • B.loc[] = pd.Series(new_labels, dtype=object)
    Sets the first entry in the column User-defined labels to Label 1 (from [Label 1]) and the entries in all further columns to NaN

  • B.loc[] = pd.Series([new_labels], dtype=object) Sets the first entry in the column User-defined labels to [Label 1, Label 3] (as intended), but all further columns are set to NaN

I repeated the same approach with at() instead of loc[], as described here: https://stackoverflow.com/a/70968810/2546099, resulting in:

  • B.at[] = new_labels
    Fails with Must have equal len keys and value when setting with an iterable

  • B.at[] = [new_labels]
    Fails with Must have equal len keys and value when setting with an ndarray

  • B.at[] = pd.Series(new_labels, dtype=object)
    Fails with unhashable type: 'list'

  • B.at[] = pd.Series([new_labels], dtype=object) Fails with unhashable type: 'list'

Are there other approaches which can solve my issues? Of course, I could just add new columns when adding new labels, but that would raise additional issues (based on my current knowledge):

  • Each time I add new labels for some rows, I would have to update the entire dataframe to new columns
  • If I remove labels, I would have to keep empty columns. For cleanup, I would have to check if a column is completely empty before deleting it
  • Iterating over labels is faster if I keep them in a list in one column compared to iterating over all columns which contain the name "label"

Or are those non-issues?

5
  • Do you want to change the value of label to new_labels? Why don't you use B["User-defined labels"] = anything. Commented Aug 5, 2022 at 9:44
  • Wouldn't that change the entire column? I only want to change some specific rows, not the entire column Commented Aug 5, 2022 at 9:45
  • That's right. You can still use df.loc[row_indexer,column_indexer]. Take a look at here if this is what you meant. Commented Aug 5, 2022 at 9:47
  • Hmm, I tried that (first test), and it failed for me, as stated above. Or did I misunderstand something? Commented Aug 5, 2022 at 9:55
  • I tested with numpy and understood your problem ... It seems stupid that pandas gives error :( Commented Aug 5, 2022 at 10:06

1 Answer 1

2

I think you can achieve your goal with apply:

import pandas as pd

df = pd.DataFrame({'a':
    [
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1', 'label 2'],
        ['label 2', 'label 3'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
        ['label 1'],
    ],
    'b': [
        1,
        2,
        3,
        1,
        1,
        2,
        3,
        1,
        2,
        3,
    ]
})

print(df)
                    a  b
0           [label 1]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 2]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7           [label 1]  1
8           [label 1]  2
9           [label 1]  3

iterating through the rows with lambda will set a new value in every row with an index that exists in the mask

mask = df['a'].loc[df['a'].explode().eq("label 1").groupby(level=0).any() & (df['b'] == 1)]
df['a'] = df.apply(lambda x: ['label 1', 'label 3'] if x.name in mask.index else x['a'], axis=1)

print(df)

                    a  b
0  [label 1, label 3]  1
1           [label 1]  2
2           [label 1]  3
3  [label 1, label 3]  1
4  [label 2, label 3]  1
5           [label 1]  2
6           [label 1]  3
7  [label 1, label 3]  1
8           [label 1]  2
9           [label 1]  3
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.