0

I want to identify all the columns that contains only a specific pattern, which is 32 characters long hex code. So if I have the following DataFrame for example

Incident                                SLA Contributor                         Priority       Last Group
-----------------------------------------------------------------------------------------------------------------------------
0000329edb523708f8079044db9619f0        DAP/IBM                                    P2        3a03ba3a6f22f2c44484f5269f3ee425
3bb02e66db97a70cf8079044db961907        EUX/SD                                     P4        102a3748dbee2200d4c17868bf9619b1
3bb02e66db97a70cf8079044db961907        fa863d3fdb1d3740205a328c7c961982           P4        102a3748dbee2200d4c17868bf9619b1
82cf9812dbd80098b6629c4adb96198d        DAP/IBM                                    P4        3a03ba3a6f22f2c44484f5269f3ee425
9e12a6c11b228c14de583150cd4bcbbb        3a03ba3a6f22f2c44484f5269f3ee425           P4        64f4d44fdbf666004f9a7cbdae961919
b17c3f43db9bf700b6629c4adb9619ca        NET/BT                                     P3        64f4d44fdbf666004f9a7cbdae961919
fa863d3fdb1d3740205a328c7c961982        DAP/IBM                                    P2        3a03ba3a6f22f2c44484f5269f3ee425

So it should identify the Incident and Last Group column, because all the values in these two columns are 32 characters long hex code. Some values of SLA Contributor column are 32 long hex code but there are other values also which does not follow that pattern, so it should not include SLA Contributor column.

I am not sure if I have to loop through all the values of a column to check, or there is any better solution for it. I would really appreciate if you could point me in the right direction.

Thanks, Sattyaki

2 Answers 2

3

You can use all and a regular expression:

hex_regex = re.compile(r"^[0-9a-fA-F]{32}$")  # regular expression of an hex code of 32 digits

hex_cols = [col for col in df.columns if df[col].map(lambda x: hex_regex.match(x) is not None).all()]  # check that all the entries in a column match the required format

If nan are allowed, you should skip them, so use the following:

hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]

In both cases, hex_cols is ['Incident', 'Last Group'].

Here a full example (with nan values):

import pandas as pd
import re

df = pd.DataFrame(
    data=[
        ['0000329edb523708f8079044db9619f0', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['3bb02e66db97a70cf8079044db961907', 'EUX/SD', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
        ['3bb02e66db97a70cf8079044db961907', 'fa863d3fdb1d3740205a328c7c961982', 'P4', '102a3748dbee2200d4c17868bf9619b1'],
        ['82cf9812dbd80098b6629c4adb96198d', 'DAP/IBM', 'P4', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['9e12a6c11b228c14de583150cd4bcbbb', '3a03ba3a6f22f2c44484f5269f3ee425', 'P4', '64f4d44fdbf666004f9a7cbdae961919'],
        ['b17c3f43db9bf700b6629c4adb9619ca', 'NET/BT', 'P3', '64f4d44fdbf666004f9a7cbdae961919'],
        ['fa863d3fdb1d3740205a328c7c961982', 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        [None, 'DAP/IBM', 'P2', '3a03ba3a6f22f2c44484f5269f3ee425'],
        ['b17c3f43db9bf700b6629c4adb9619ce', 'DAP/IBM', 'P2', None],
        [None, 'DAP/IBM', 'P2', None],
    ],
    columns=['Incident', 'SLA Contributor', 'Priority', 'Last Group'],
)

hex_regex = re.compile(r"^[0-9a-fA-F]{32}$")

hex_cols = [col for col in df.columns if df[df[col].notnull()][col].map(lambda x: hex_regex.match(x) is not None).all()]

If you like one-liner, you can compress the previous code:

[col for col in df.columns if df[df[col].notnull()][col].map(lambda x: re.compile(r"^[0-9a-fA-F]{32}$").match(x) is not None).all()]
Sign up to request clarification or add additional context in comments.

1 Comment

Thanks, your code works like a charm! I wish I could accept this one also as correct.
1

For more general solution first select only strings columns by DataFrame.select_dtypes.

Use Series.str.len for test length per columns and then test if all values are Trues by DataFrame.all, last filter columns names:

df1 = df.select_dtypes(object)

mask = df1.apply(lambda x: x.str.len().eq(32)).all()

If need test hexadecimal 32 length string use Series.str.contains with na parameter with True:

mask = df1.apply(lambda x: x.str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()

cols = df1.columns[mask].tolist()
print (cols)
['Incident', 'Last Group']

List comprehension alternative is:

cols = [c for c in df.columns if df[c].str.contains(r"^[0-9a-fA-F]{32}$", na=True).all()]
print (cols)
['Incident', 'Last Group']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.