1

here I go with another question

I have a large dataframe about 20 columns by 400.000 rows. In this dataset I can not have string since the software that will process the data only accepts numeric and nulls.

So they way I am thinking it might work is following. 1. go thru each column 2. Get List of unique strings 3. Replace each string with a value from 0 to X 4. repeat the process for the next column 5. Repeat for the next dataframe

This is how the dataframe looks like

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 NORMAL      NORMAL      1050
7-Feb-15    0:01:00 NORMAL      NORMAL      1050
7-Feb-15    0:02:00 NORMAL      HIGH        1050
7-Feb-15    0:03:00 HIGH        NORMAL      1050
7-Feb-15    0:04:00 LOW         NORMAL      1050
7-Feb-15    0:05:00 NORMAL      LOW         1050

This is the result expected

DATE        TIME    FNRHP306H   FNRHP306HC  FNRHP306_2MEC_MAX
7-Feb-15    0:00:00 0           0           1050
7-Feb-15    0:01:00 0           0           1050
7-Feb-15    0:02:00 0           1           1050
7-Feb-15    0:03:00 1           0           1050
7-Feb-15    0:04:00 2           0           1050
7-Feb-15    0:05:00 0           2           1050

enter image description here

I am using python 3.5 and the latest version of Pandas

Thanks in advance

JV

2
  • 1
    It helps if you don't provide an image, but rather code that can be used to create a sample of your DataFrame or something compatible with pd.read_clipboard() to create it. Anyway - do unique strings need to retain the same unique numeric value over different columns? Commented Sep 22, 2016 at 20:06
  • Hello Jon, yes they should retain same value if the string were the same in previous column. Commented Sep 22, 2016 at 20:11

1 Answer 1

1

Solution:

# try to convert all columns to numbers...
df = df.apply(lambda x: pd.to_numeric(x, errors='ignore'))

cols = df.filter(like='FNR').select_dtypes(include=['object']).columns
st = df[cols].stack().to_frame('name')
st['cat'] = pd.factorize(st.name)[0]
df[cols] = st['cat'].unstack()

del st

Demo:

In [233]: df
Out[233]:
       DATE     TIME FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00    NORMAL     NORMAL               1050
1  7-Feb-15  0:01:00    NORMAL     NORMAL               1050
2  7-Feb-15  0:02:00    NORMAL       HIGH               1050
3  7-Feb-15  0:03:00      HIGH     NORMAL               1050
4  7-Feb-15  0:04:00       LOW     NORMAL               1050
5  7-Feb-15  0:05:00    NORMAL        LOW               1050

first we stack all object (string) columns:

In [235]: cols = df.filter(like='FNR').select_dtypes(include=['object']).columns

In [236]: st = df[cols].stack().to_frame('name')

now we can factorize stacked column:

In [238]: st['cat'] = pd.factorize(st.name)[0]

In [239]: st
Out[239]:
                name  cat
0 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
1 FNRHP306H   NORMAL    0
  FNRHP306HC  NORMAL    0
2 FNRHP306H   NORMAL    0
  FNRHP306HC    HIGH    1
3 FNRHP306H     HIGH    1
  FNRHP306HC  NORMAL    0
4 FNRHP306H      LOW    2
  FNRHP306HC  NORMAL    0
5 FNRHP306H   NORMAL    0
  FNRHP306HC     LOW    2

assign unstacked result back to original DF (to object columns):

In [241]: df[cols] = st['cat'].unstack()

In [242]: df
Out[242]:
       DATE     TIME  FNRHP306H  FNRHP306HC  FNRHP306_2MEC_MAX
0  7-Feb-15  0:00:00          0           0               1050
1  7-Feb-15  0:01:00          0           0               1050
2  7-Feb-15  0:02:00          0           1               1050
3  7-Feb-15  0:03:00          1           0               1050
4  7-Feb-15  0:04:00          2           0               1050
5  7-Feb-15  0:05:00          0           2               1050

Explanation:

In [248]: df.filter(like='FNR')
Out[248]:
  FNRHP306H FNRHP306HC  FNRHP306_2MEC_MAX
0    NORMAL     NORMAL               1050
1    NORMAL     NORMAL               1050
2    NORMAL       HIGH               1050
3      HIGH     NORMAL               1050
4       LOW     NORMAL               1050
5    NORMAL        LOW               1050

In [249]: df.filter(like='FNR').select_dtypes(include=['object'])
Out[249]:
  FNRHP306H FNRHP306HC
0    NORMAL     NORMAL
1    NORMAL     NORMAL
2    NORMAL       HIGH
3      HIGH     NORMAL
4       LOW     NORMAL
5    NORMAL        LOW
Sign up to request clarification or add additional context in comments.

8 Comments

Wow! You guys know your stuff, I will try this one ASAP, seems that it will work.
Hi @MaxU. Apparently this line is not doing the job df.filter(like='FNR').select_dtypes(include=['object']) It is not removing the columns that are numeric perhaps because those values are not numeric type (real, double, etc) but string type?
@racekiller, yes, i think they are of object dtype. You can check it: print(df.dtypes)
Yes I just checked and they all are object dtype
@racekiller, so convert them to numbers
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.