How to translate "bytes" objects into literal strings in pandas Dataframe, Python3.x?

Question

I have a Python3.x pandas DataFrame in which certain columns are strings that are expressed as bytes (as in Python2.x).

import pandas as pd
df = pd.DataFrame(...)
df
       COLUMN1         ....
0      b'abcde'        ....
1      b'dog'          ....
2      b'cat1'         ....
3      b'bird1'        ....
4      b'elephant1'    ....

When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object

However, if I access by element, it is a "bytes" object

df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'

How do I convert these into "regular" strings? That is, how can I get rid of this b'' prefix?

EdChum · Accepted Answer · 2016-11-02 21:27:20Z

82

You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")
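Note that str.decode returns a new Series rather than changing the column in place, so the result has to be assigned back. A minimal sketch, assuming a small made-up frame in place of the asker's data:

```python
import pandas as pd

# Hypothetical sample data standing in for the asker's DataFrame.
df = pd.DataFrame({"COLUMN1": [b"abcde", b"dog", b"cat1"]})

# str.decode returns a new Series; assign it back to replace the bytes column.
df["COLUMN1"] = df["COLUMN1"].str.decode("utf-8")

print(df["COLUMN1"].tolist())
```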

To do this for multiple columns you can select just the str columns:

str_df = df.select_dtypes([object])  # the np.object alias was removed in NumPy 1.24

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]
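Selecting by dtype alone picks up every object column, including ones that hold lists or arrays rather than bytes. One possible way to narrow the selection is to inspect the values themselves. This is only a sketch: is_bytes_column is a hypothetical helper (not a pandas function), and the sample frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: one bytes column, one column of lists, one numeric column.
df = pd.DataFrame({
    "name": [b"dog", b"cat", b"bird"],
    "tags": [[1, 2], [3], []],
    "count": [1, 2, 3],
})

def is_bytes_column(col: pd.Series) -> bool:
    """Return True if col has object dtype and every non-null value is bytes."""
    if col.dtype != object:
        return False
    values = col.dropna()
    return len(values) > 0 and all(isinstance(v, bytes) for v in values)

# Decode only the columns that really contain bytes; leave the rest untouched.
bytes_cols = [name for name in df.columns if is_bytes_column(df[name])]
for name in bytes_cols:
    df[name] = df[name].str.decode("utf-8")

print(bytes_cols)
print(df["name"].tolist())
```

Checking every value costs a full pass over each object column, so on very large frames it may be worth testing only a sample of rows instead.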

answered Nov 2, 2016 at 21:27

EdChum


3 Comments

pnina Over a year ago

I had the same problem, but my dataframe includes other NumPy objects which aren't bytes (arrays, for example). Is there a way to convert only the bytes columns?

Gene Burinsky Over a year ago

Perhaps a more efficient way would be to select the columns to decode and swap through using the one-liner (or two) for c in df.columns[df.dtypes==object]:df.loc[:,c]=df.loc[:,c].str.decode('utf-8')

w. Patrick Gale Dec 11, 2024 at 14:12

You can simplify by not using numpy in the second command and say str_df = df.select_dtypes([object]). np.object was deprecated in NumPy 1.20 and removed in 1.24.

Christabella Irwanto · Accepted Answer · 2020-07-22 06:58:40Z

6

Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == object:  # Object columns only; assumes every value in them is bytes.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))
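The loop above calls decode on every value of every object column, so it raises AttributeError as soon as a value is not bytes. A minimal sketch of a guarded variant follows; decode_if_bytes and the sample frame are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical frame with a mixed object column and a numeric column.
df = pd.DataFrame({"a": [b"x", "y", 3], "b": [1, 2, 3]})

def decode_if_bytes(value, encoding="utf-8"):
    """Decode bytes values; return anything else unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value

for col, dtype in df.dtypes.items():
    if dtype == object:  # Numeric columns are skipped entirely.
        df[col] = df[col].map(decode_if_bytes)

print(df["a"].tolist())
```

This runs a Python function per element, so it is slower than the vectorised str.decode, but it leaves non-bytes values exactly as they were.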

answered Jul 22, 2020 at 6:58

Christabella Irwanto


3 Comments

Mad Physicist Over a year ago

Apply is not the way to go here. Use df[col].str.decode('utf-8')

jtlz2 Over a year ago

What about objects that are not strings of any sort?

jtlz2 Over a year ago

DeprecationWarning: np.object is a deprecated alias for the builtin object.

Jan · Accepted Answer · 2023-01-20 14:19:30Z

6

I had an issue with some columns being either full of str or a mix of str and bytes in a dataframe. I solved it with a minor modification of the solution provided by @Christabella Irwanto (I'm more of a fan of str.decode('utf-8'), as suggested by @Mad Physicist):

for col, dtype in df.dtypes.items():
    if dtype == object:  # Only process object columns.
        # Decode, or keep the original value where decode returns NaN.
        df[col] = df[col].str.decode('utf-8').fillna(df[col])

>>> df[col]
0        Element
1     b'Element'
2         b'165'
3            165
4             25
5             25

>>> df[col].str.decode('utf-8').fillna(df[col])
0     Element
1     Element
2         165
3         165
4          25
5          25

(replaced np.object with object to work with recent numpy version)

edited Jan 20, 2023 at 14:19

Jan


answered Apr 11, 2021 at 23:26

GentilsTo


2 Comments

jtlz2 Over a year ago

DeprecationWarning: np.object is a deprecated alias for the builtin object.

Jan Over a year ago

Yes, since numpy version 1.24 np.object doesn't work anymore. But following the numpy deprecation info, replacing it with object solves the problem.

Gibolt · Accepted Answer · 2018-11-15 21:56:34Z

1

df['COLUMN1'].apply(lambda x: x.decode("utf-8"))

edited Nov 15, 2018 at 21:56

Gibolt


answered Nov 15, 2018 at 21:54

Yu Zhou


1 Comment

Alexander Over a year ago

Hello and welcome to SO. A little bit more text would be nice. ;-)

Carl Smith · Accepted Answer · 2020-08-14 09:37:33Z

I came across this thread while trying to solve the same problem but more generally for a Series where some values may be of type str, others of type bytes. Drawing from earlier solutions, I achieved this selective decoding as follows, resulting in a Series all of whose values are of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn't find one documented.
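On the "errors" point: str.decode does take an errors argument, but it is forwarded to bytes.decode and so only governs invalid byte sequences. Values that are not bytes still come back as NaN whatever is passed. A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical values: valid UTF-8 bytes, an invalid byte, and a plain str.
ser = pd.Series([b"caf\xc3\xa9", b"\xff", "plain"])

# errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
decoded = ser.str.decode("utf-8", errors="replace")

print(decoded.tolist())
```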

EDIT: One can definitely achieve the same in one line, but the ways I have thought of to do so take about 25% longer (tested for Series of length 10^4 and 10^6), though they presumably do no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")
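The one-liner decodes the whole Series and relies on index alignment to keep only the masked results. A variant that decodes just the bytes elements might look like the sketch below; the sample values are made up and no timing claim is attached to it.

```python
import pandas as pd

# Hypothetical mixed Series of bytes and str values.
ser = pd.Series([b"value_1", "value_2", b"value_3"])

# Boolean mask marking the elements that are bytes.
is_bytes = ser.map(lambda v: isinstance(v, bytes))

# Decode only those elements and write them back by label.
ser.loc[is_bytes] = ser.loc[is_bytes].str.decode("utf-8")

print(ser.tolist())
```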
