Can you please help me with the following issue? Suppose I have the following df:

import pandas as pd

data = {
    'A': ['A1, B2, C', 'A2, A9, C', 'A3', 'A4, Z', 'A5, A1, Z'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B4'],
}
df = pd.DataFrame(data)

How can I create a list of the unique values stored in column 'A'? I want something like this:

 list_A = ['A1', 'B2', 'C', 'A2', 'A9', 'A3', 'A4', 'Z', 'A5']
  • All values of the column A are already unique: 'A1, B2, C' != 'A2, A9, C' != 'A3' != 'A4, Z' != 'A5, A1, Z' Commented Dec 20, 2022 at 17:54

3 Answers


Assuming that by "values" you mean the comma-separated substrings, you can split, explode, and use unique:

list_A = df['A'].str.split(r',\s*').explode().unique().tolist()

Output: ['A1', 'B2', 'C', 'A2', 'A9', 'A3', 'A4', 'Z', 'A5']
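
To see what each step of that chain contributes, here is a minimal sketch that runs the same three calls one at a time on the question's DataFrame. The intermediate names (split_lists, exploded) are only illustrative and are not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A1, B2, C', 'A2, A9, C', 'A3', 'A4, Z', 'A5, A1, Z'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B4'],
})

# Step 1: split each string on a comma plus optional whitespace.
# The result is a Series whose elements are Python lists.
split_lists = df['A'].str.split(r',\s*')

# Step 2: explode gives every list element its own row.
# The original row index is repeated for elements of the same list.
exploded = split_lists.explode()

# Step 3: unique keeps the first occurrence of each value, in order.
list_A = exploded.unique().tolist()
print(list_A)
```

Because unique() keeps values in order of first appearance, the result follows the order in which the substrings occur in column 'A'.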


1 Comment

Thank you very much for your reply. Can you please clarify why we need ".str" and what '\s*' is responsible for?

(The code for this answer was posted as an image.) The code applies a lambda function to the 'A' column to remove any whitespace from the strings.

Next, the code uses the str.split() method to split the strings in the 'A' column on the delimiter ',', resulting in a column of lists.

Finally, the code uses a list comprehension to flatten the list of lists into a single list, and then uses the set() function to create a set object containing the unique elements of the list. The set object is then printed to the console.
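
Since the code for this answer is not available as text, the following is only a sketch of what the description suggests, not the answerer's actual code. The variable names (no_spaces, lists, flat, unique_values) are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A1, B2, C', 'A2, A9, C', 'A3', 'A4, Z', 'A5, A1, Z'],
    'B': ['B1', 'B2', 'B3', 'B4', 'B4'],
})

# Remove the spaces from each string, e.g. 'A1, B2, C' -> 'A1,B2,C'.
no_spaces = df['A'].apply(lambda s: s.replace(' ', ''))

# Split on the literal comma, giving a column of lists.
lists = no_spaces.str.split(',')

# Flatten the list of lists, then deduplicate with set().
flat = [item for sublist in lists for item in sublist]
unique_values = set(flat)
print(unique_values)
```

Note that a set has no defined order, so the printed order may differ from the order of appearance in the column.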

Converting column A to the desired list (new column C). In this case, instead of 'A1, B2, C', we will have ['A1', 'B2', 'C'].

df['C'] = df['A'].str.split(r',\s*')

.str is the accessor that applies string methods to each element of the column; it does not convert the values to strings. The pattern ,\s* is a regular expression, so .split splits at every comma together with any whitespace (\s*) that follows it.

Finding the sorted unique values of the converted column (set() alone does not guarantee any order, so sort explicitly):

sorted(set(df['C'].explode()))
# ['A1', 'A2', 'A3', 'A4', 'A5', 'A9', 'B2', 'C', 'Z']

If sorting is not important, and you want to see them in the order of their appearance:

list(df['C'].explode().unique())
# ['A1', 'B2', 'C', 'A2', 'A9', 'A3', 'A4', 'Z', 'A5']

2 Comments

Thank you for your reply. Can you please explain why we need ".str" and 'split(',\s*')' and not just 'split(',')'
@AlbertoAlvarez If you do split(','), then the spaces will remain there. For instance, for 'A1, B2, C', you will get ['A1', ' B2', ' C']. However, you don't want to have spaces there. .str is required: a pandas Series has no .split method of its own, and .str is the accessor that applies string methods to each element. It does not convert values to strings, so if the column may contain something like 1, which is an integer, convert it first with .astype(str).
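
The difference between the two patterns, and the behaviour of .str on non-string values, can be checked with a short sketch. The Series below are made up for illustration; in particular, the mixed column containing the integer 1 is a hypothetical case and does not appear in the question's data.

```python
import pandas as pd

s = pd.Series(['A1, B2, C', 'A4, Z'])

# Literal comma: the space after each comma stays in the pieces.
plain = s.str.split(',').tolist()
# [['A1', ' B2', ' C'], ['A4', ' Z']]

# Comma plus optional whitespace (regex): the pieces come out clean.
clean = s.str.split(r',\s*').tolist()
# [['A1', 'B2', 'C'], ['A4', 'Z']]

# Hypothetical column mixing a string and an integer.
mixed = pd.Series(['A1, B2', 1])

# .str does not convert values: the integer row becomes NaN.
unconverted = mixed.str.split(r',\s*')

# Converting first with astype(str) makes every row splittable.
converted = mixed.astype(str).str.split(r',\s*').tolist()
```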
