Convert string column to array of fixed length strings in pandas dataframe

Question

I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of strings with fixed length.

Here is what the current table looks like:

+-----+--------------------+--------------------+
|col1 |         col2       |         col3       |
+-----+--------------------+--------------------+
|   1 |Marco               | LITMATPHY          |
|   2 |Lucy                | NaN                |
|   3 |Andy                | CHMHISENGSTA       |
|   4 |Nancy               | COMFRNPSYGEO       |
|   5 |Fred                | BIOLIT             |
+-----+--------------------+--------------------+

How can I split the strings in col3 into arrays of strings of length 3, as follows? PS: col3 can contain blanks or NaN, and those should be replaced with an empty array.

+-----+--------------------+----------------------------+
|col1 |         col2       |         col3               |
+-----+--------------------+----------------------------+
|   1 |Marco               | ['LIT','MAT','PHY']        |
|   2 |Lucy                | []                         |
|   3 |Andy                | ['CHM','HIS','ENG','STA']  |
|   4 |Nancy               | ['COM','FRN','PSY','GEO']  |
|   5 |Fred                | ['BIO','LIT']              |
+-----+--------------------+----------------------------+

Is the length of a string in col3 always a multiple of 3?

– T C Molenaar, Commented Sep 26, 2022 at 10:13
Yes, the length is expected to be a multiple of 3.

– Vortex, Commented Sep 26, 2022 at 10:24

Nuri Taş · Accepted Answer · 2022-09-26 10:30:17Z

4

Use textwrap.wrap:

import textwrap

import pandas as pd

df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

If some strings have lengths that are not a multiple of 3, the leftover letters end up in a shorter final element. If you want only strings of length 3, chain one more apply to drop that element:

# `x and ...` skips empty lists (NaN rows), where x[-1] would raise IndexError
(df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])
           .apply(lambda x: x[:-1] if x and len(x[-1]) != 3 else x))
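
As a minimal sketch of the two steps above folded into one helper: `wrap_fixed` and its `drop_short` flag are hypothetical names introduced here for illustration, not part of the answer. It assumes any non-string value (NaN, None) should become an empty list.

```python
import textwrap


def wrap_fixed(value, width=3, drop_short=False):
    """Split a string into chunks of `width` characters.

    Anything that is not a string (NaN, None) becomes an empty list.
    `drop_short` is a hypothetical flag for discarding a short final chunk.
    """
    if not isinstance(value, str):
        return []
    chunks = textwrap.wrap(value, width)
    # Check that the list is non-empty before looking at its last element.
    if drop_short and chunks and len(chunks[-1]) != width:
        chunks = chunks[:-1]
    return chunks


print(wrap_fixed("LITMATPHY"))               # ['LIT', 'MAT', 'PHY']
print(wrap_fixed(float("nan")))              # []
print(wrap_fixed("BIOLI"))                   # ['BIO', 'LI']
print(wrap_fixed("BIOLI", drop_short=True))  # ['BIO']
```

On a dataframe this would be used as df['col3'].apply(wrap_fixed). Note that textwrap.wrap breaks lines at whitespace, so the sketch assumes the codes contain no spaces.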

edited Sep 26, 2022 at 10:30

answered Sep 26, 2022 at 10:17

Nuri Taş


3 Comments

T C Molenaar Over a year ago

Maybe add the link to the documentation? Nice solution +1

Vortex Over a year ago

Thanks Nuri. This works. Just one question: will apply() be slower than map() if I have more than a million rows in the dataframe?

Nuri Taş Over a year ago

apply is actually faster in this case, although the difference is not big. Timeit results for 2 million rows in my IDE: map: 22.3 s ± 2.14 s per loop; apply: 15.4 s ± 1.71 s per loop. I have a terrible GPU and RAM, so the results will probably be faster in general :)
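
A comparison like the one in this comment could be reproduced with a sketch along these lines. The helper name `wrap3` and the row count are assumptions chosen for illustration (far smaller than 2 million rows so it finishes quickly), and the timings it prints will differ from machine to machine.

```python
import textwrap
import timeit

import numpy as np
import pandas as pd


def wrap3(value):
    # Same logic as the lambda in the answer.
    return textwrap.wrap(value, 3) if pd.notna(value) else []


# Hypothetical benchmark data: the question's rows repeated 1,000 times.
values = ["LITMATPHY", np.nan, "CHMHISENGSTA", "COMFRNPSYGEO", "BIOLIT"] * 1_000
col = pd.Series(values)

# Both calls should produce the same lists.
via_apply = col.apply(wrap3)
via_map = col.map(wrap3)

# Time each call a few times; absolute numbers depend on the machine.
t_apply = timeit.timeit(lambda: col.apply(wrap3), number=3)
t_map = timeit.timeit(lambda: col.map(wrap3), number=3)
print(f"apply: {t_apply:.3f} s, map: {t_map:.3f} s")
```

Either way the function runs once per row in Python, so most of the time is likely spent inside textwrap.wrap rather than in apply or map themselves.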

Sachin Kohli · Accepted Answer · 2022-09-26 10:47:13Z

2

Another way is this:

import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})

def split_str(s):
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst

df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))

# Output

           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]

edited Sep 26, 2022 at 10:47

answered Sep 26, 2022 at 10:19

Sachin Kohli


3 Comments

FObersteiner Over a year ago

You could do a more general check for NaN, for instance: lambda x: [] if pd.isna(x) else split_str(x)

Vortex Over a year ago

Thanks Sachin, but I already tried a combination of your way (a method with textwrap.wrap()) and it works fine. But I was looking for a solution similar to what Nuri posted. Thanks again.

Sachin Kohli Over a year ago

That's great, and thanks for your input @FObersteiner. I edited the code and will keep it in mind :) My earlier way of checking for np.nan might not be optimal: what if the input contains "nan" as a string? So I changed it to pd.isna(x) :)

T C Molenaar · Accepted Answer · 2022-09-26 10:43:15Z

1

Using only pandas, we can do:

import numpy as np
import pandas as pd

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])

def to_list(string, n):
    if string != string:  # True only for NaN, which is never equal to itself
        lst = []
    else:
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst

df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))

Output:

           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]
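
A loop-free variant, offered here as an alternative that is not taken from any of the answers, could rely on Series.str.findall with a regular expression. In this sketch, "BIOLI" is a hypothetical row added to show what happens when the length is not a multiple of 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "", "CHFIOD", "BIOLI"]})

# Treat missing values as empty strings, then collect every full
# 3-character match; a shorter trailing remainder is dropped.
df["new_col3"] = df["col3"].fillna("").str.findall(r".{3}")

# Variant that keeps a shorter trailing chunk as the last element.
df["with_remainder"] = df["col3"].fillna("").str.findall(r".{1,3}")

print(df)
```

The fillna("") step is what turns NaN rows into empty lists; without it, str.findall leaves NaN in place for missing values.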

edited Sep 26, 2022 at 10:43

answered Sep 26, 2022 at 10:21

T C Molenaar


3 Comments

FObersteiner Over a year ago

what's string != string supposed to do?

T C Molenaar Over a year ago

It checks whether string is NaN: np.nan is never equal to itself, so string != string is True only for NaN.

FObersteiner Over a year ago

cool, wasn't aware that works with np.nan (it doesn't with native Python's None since there is only one of those...)
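
The self-inequality trick discussed in these comments can be checked with a short sketch. The function name `is_nan` is made up for this illustration.

```python
nan = float("nan")  # np.nan is an ordinary Python float of this kind


def is_nan(value):
    # NaN compares unequal to everything, including itself.
    return value != value


print(is_nan(nan))    # True
print(is_nan(None))   # False
print(is_nan("nan"))  # False
print(is_nan(""))     # False
```

Because None and the string "nan" both compare equal to themselves, the check only catches a real float NaN; pd.isna is the more general test when a column may also hold None.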
