Get List of Unique String values per column in a dataframe using python

Question

Here I go with another question.

I have a large DataFrame, about 20 columns by 400,000 rows. This dataset cannot contain strings, since the software that will process the data only accepts numeric values and nulls.

So the way I think it might work is the following:

1. Go through each column
2. Get the list of unique strings
3. Replace each string with a value from 0 to X
4. Repeat the process for the next column
5. Repeat for the next DataFrame
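Step 2 of the list above can be sketched as follows. This is a minimal illustration, not part of the original post: the small `df` is a hypothetical sample modelled on the data in the question, and `unique_strings` is a name introduced here.

```python
import pandas as pd

# Hypothetical sample modelled on the question's data.
df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"],
    "FNRHP306HC": ["NORMAL", "NORMAL", "HIGH", "NORMAL", "NORMAL", "LOW"],
    "FNRHP306_2MEC_MAX": [1050] * 6,
})

# For every column, keep the unique values that are Python strings.
unique_strings = {
    col: [v for v in df[col].dropna().unique() if isinstance(v, str)]
    for col in df.columns
}

# Drop columns that contain no strings at all.
unique_strings = {col: vals for col, vals in unique_strings.items() if vals}

print(unique_strings)
# {'FNRHP306H': ['NORMAL', 'HIGH', 'LOW'], 'FNRHP306HC': ['NORMAL', 'HIGH', 'LOW']}
```

`Series.unique()` returns values in order of first appearance, so the lists are not sorted.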

This is what the DataFrame looks like:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 NORMAL      NORMAL      1050
7-Feb-15    0:01:00 NORMAL      NORMAL      1050
7-Feb-15    0:02:00 NORMAL      HIGH        1050
7-Feb-15    0:03:00 HIGH        NORMAL      1050
7-Feb-15    0:04:00 LOW         NORMAL      1050
7-Feb-15    0:05:00 NORMAL      LOW         1050

This is the expected result:

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 0           0           1050
7-Feb-15    0:01:00 0           0           1050
7-Feb-15    0:02:00 0           1           1050
7-Feb-15    0:03:00 1           0           1050
7-Feb-15    0:04:00 2           0           1050
7-Feb-15    0:05:00 0           2           1050

I am using Python 3.5 and the latest version of pandas.

Thanks in advance

JV

It helps if you don't provide an image, but rather code that can be used to create a sample of your DataFrame or something compatible with pd.read_clipboard() to create it. Anyway - do unique strings need to retain the same unique numeric value over different columns?
– Jon Clements, Commented Sep 22, 2016 at 20:06
Hello Jon, yes, they should retain the same value if the string was the same in a previous column.
– racekiller, Commented Sep 22, 2016 at 20:11

MaxU - stand with Ukraine · Accepted Answer · 2016-09-22 23:57:53Z

Solution:

# try to convert all columns to numbers...
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()

del st

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050

First, we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

Now we can factorize the stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2
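As a side note on this step: `pd.factorize` returns a pair, the integer codes and the unique labels in order of first appearance. The answer keeps only the codes (`[0]`). A minimal sketch, assuming you also want to keep the label-to-code mapping for later reference (the names `values` and `mapping` are illustrative, not from the answer):

```python
import pandas as pd

values = pd.Series(["NORMAL", "NORMAL", "HIGH", "LOW", "NORMAL"])

# codes: one integer per row; uniques: the labels, in order of first appearance.
codes, uniques = pd.factorize(values)

# Build a label -> code lookup from the uniques.
mapping = {label: code for code, label in enumerate(uniques)}

print(codes.tolist())  # [0, 0, 1, 2, 0]
print(mapping)         # {'NORMAL': 0, 'HIGH': 1, 'LOW': 2}
```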

Assign the unstacked result back to the original DataFrame (to the object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050
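Note that the codes produced this way follow the order in which strings first appear, so a second DataFrame processed separately could number the same strings differently. If the same string must always get the same number across DataFrames, one option is a fixed lookup applied with `Series.map`. This is a sketch under that assumption, not the answer's method; `CODES` and `encode_strings` are hypothetical names.

```python
import pandas as pd

# Hypothetical fixed mapping; the labels are taken from the sample data.
CODES = {"NORMAL": 0, "HIGH": 1, "LOW": 2}

def encode_strings(frame, columns, codes):
    """Return a copy of frame with the given columns mapped through codes.

    Strings missing from codes become NaN (a null).
    """
    out = frame.copy()
    for col in columns:
        out[col] = out[col].map(codes)
    return out

df1 = pd.DataFrame({"FNRHP306H": ["NORMAL", "HIGH", "LOW"]})
df2 = pd.DataFrame({"FNRHP306H": ["LOW", "NORMAL", "HIGH"]})

enc1 = encode_strings(df1, ["FNRHP306H"], CODES)
enc2 = encode_strings(df2, ["FNRHP306H"], CODES)

print(enc1["FNRHP306H"].tolist())  # [0, 1, 2]
print(enc2["FNRHP306H"].tolist())  # [2, 0, 1]
```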

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW

edited Sep 22, 2016 at 23:57

answered Sep 22, 2016 at 20:49


8 Comments

racekiller Over a year ago

Wow! You guys know your stuff, I will try this one ASAP, seems that it will work.

racekiller Over a year ago

Hi @MaxU. Apparently this line is not doing the job: df.filter(like='FNR').select_dtypes(include=['object']). It is not removing the numeric columns, perhaps because their values are stored as strings rather than as a numeric type (real, double, etc.)?

MaxU - stand with Ukraine Over a year ago

@racekiller, yes, I think they are of object dtype. You can check it: print(df.dtypes)

racekiller Over a year ago

Yes I just checked and they all are object dtype

MaxU - stand with Ukraine Over a year ago

@racekiller, so convert them to numbers
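The conversion discussed in these comments can be sketched as below. The answer does it with `pd.to_numeric(x, errors='ignore')`; that option is deprecated in recent pandas releases (2.2 and later, as far as I know), so this sketch uses a try/except instead. The helper name `to_numeric_if_possible` and the sample `df` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "FNRHP306H": ["NORMAL", "HIGH", "LOW"],
    "FNRHP306_2MEC_MAX": ["1050", "1050", "1050"],  # numbers stored as text
})

def to_numeric_if_possible(col):
    """Convert a column to numbers, or return it unchanged if parsing fails."""
    try:
        return pd.to_numeric(col)
    except (ValueError, TypeError):
        return col

# Apply column by column; only fully numeric columns change dtype.
df = df.apply(to_numeric_if_possible)

print(df.dtypes)
```

After this step the numeric column no longer has a string dtype, so the `select_dtypes` filter in the answer should leave it out.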
