50

I have a Python3.x pandas DataFrame whereby certain columns are strings which as expressed as bytes (like in Python2.x)

import pandas as pd
df = pd.DataFrame(...)
df
       COLUMN1         ....
0      b'abcde'        ....
1      b'dog'          ....
2      b'cat1'         ....
3      b'bird1'        ....
4      b'elephant1'    ....

When I access by column with df.COLUMN1, I see Name: COLUMN1, dtype: object

However, if I access by element, it is a "bytes" object

df.COLUMN1.ix[0].dtype
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'dtype'

How do I convert these into "regular" strings? That is, how can I get rid of this b'' prefix?

0

5 Answers 5

82

You can use vectorised str.decode to decode byte strings into ordinary strings:

df['COLUMN1'].str.decode("utf-8")

To do this for multiple columns you can select just the str columns:

str_df = df.select_dtypes([np.object])

convert all of them:

str_df = str_df.stack().str.decode('utf-8').unstack()

You can then swap out converted cols with the original df cols:

for col in str_df:
    df[col] = str_df[col]
Sign up to request clarification or add additional context in comments.

3 Comments

I had the same problem, but my dataframe includes other np objects which aren't bytes (array i.e. ). Is there a way to set only the bytes columns?
Perhaps a more efficient way would be to select the columns to decode and swap through using the one-liner (or two) for c in df.columns[df.dtypes==object]:df.loc[:,c]=df.loc[:,c].str.decode('utf-8')
You can simplify by not using numpy in the second command and say str_df = df.select_dtypes([object]). The np.object is deprecated as of 2024.
6

Combining the answers by @EdChum and @Yu Zhou, a simpler solution would be:

for col, dtype in df.dtypes.items():
    if dtype == np.object:  # Only process byte object columns.
        df[col] = df[col].apply(lambda x: x.decode("utf-8"))

3 Comments

Apply is not the way to go here. Use df[col].str.decode('utf-8')
What about objects that are not strings of any sort?
DeprecationWarning: np.object is a deprecated alias for the builtin object.
6

I add issue with some columns being either full of str or mixed of str and bytes in a dataframe. Solved with a minor modification of the solution provided by @Christabella Irwanto: (i'm more of fan of the str.decode('utf-8') as suggested by @Mad Physicist)

for col, dtype in df.dtypes.items():
        if dtype == object:  # Only process object columns.
            # decode, or return original value if decode return Nan
            df[col] = df[col].str.decode('utf-8').fillna(df[col]) 


>>> df[col]
0        Element
1     b'Element'
2         b'165'
3            165
4             25
5             25

>>> df[col].str.decode('utf-8').fillna(df[col])
0     Element
1     Element
2         165
3         165
4          25
5          25
6          25

(replaced np.object with object to work with recent numpy version)

2 Comments

DeprecationWarning: np.object is a deprecated alias for the builtin object.
Yes, since numpy version 1.24 np.object doesn't work anymore. But following the numpy deprecation info, replacing it with object solves the problem.
1
df['COLUMN1'].apply(lambda x: x.decode("utf-8"))

1 Comment

Hello and welcome to SO. A litle bit more text would be nice. ;-)
1

I came across this thread while trying to solve the same problem but more generally for a Series where some values my be of type str, others of type bytes. Drawing from earlier solutions, I achieved this selective decoding as follows, resulting in a Series all of whose values are of type str. (python 3.6.9, pandas 1.0.5)

>>> import pandas as pd
>>> ser = pd.Series(["value_1".encode("utf-8"), "value_2"])
>>> ser.values
array([b'value_1', 'value_2'], dtype=object)
>>> ser2 = ser.str.decode("utf-8")
>>> ser[~ser2.isna()] = ser2
>>> ser.values
array(['value_1', 'value_2'], dtype=object)

Maybe there exists a more convenient/efficient one-liner for this use case? At first I figured there would be some value to pass in the "errors" kwarg to str.decode but I didn't find one documented.

EDIT: One can definitely achieve the same in one line, but the ways I have thought to so do so take about 25% (tested for Series of length 10^4 and 10^6), but presumably does no copying. E.g.:

ser[ser.apply(type) == bytes] = ser.str.decode("utf-8")

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.