
I have a pandas dataframe with a few columns. I want to convert one of the string columns into an array of strings with fixed length.

Here is what the current table looks like:

+-----+--------------------+--------------------+
|col1 |         col2       |         col3       |
+-----+--------------------+--------------------+
|   1 |Marco               | LITMATPHY          |
|   2 |Lucy                | NaN                |
|   3 |Andy                | CHMHISENGSTA       |
|   4 |Nancy               | COMFRNPSYGEO       |
|   5 |Fred                | BIOLIT             |
+-----+--------------------+--------------------+

How can I split the strings in col3 into arrays of strings of length 3, as follows? PS: col3 can contain blanks or NaN, and those should be replaced with an empty array.

+-----+--------------------+----------------------------+
|col1 |         col2       |         col3               |
+-----+--------------------+----------------------------+
|   1 |Marco               | ['LIT','MAT','PHY']        |
|   2 |Lucy                | []                         |
|   3 |Andy                | ['CHM','HIS','ENG','STA']  |
|   4 |Nancy               | ['COM','FRN','PSY','GEO']  |
|   5 |Fred                | ['BIO','LIT']              |
+-----+--------------------+----------------------------+
  • Is the length of a string in col3 always a multiple of 3? Commented Sep 26, 2022 at 10:13
  • Yes, it is expected to be a multiple of 3. Commented Sep 26, 2022 at 10:24

3 Answers


Use textwrap.wrap:

import textwrap
import pandas as pd

# Split each string into 3-character pieces; NaN becomes an empty list
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else [])

If there are strings whose lengths aren't a multiple of 3, the remaining letters end up in a shorter last element. If you only want strings of length 3, you can chain one more apply to drop that element:

# `x and ...` skips empty lists (NaN rows), where x[-1] would raise IndexError
df['col3'].apply(lambda x: textwrap.wrap(x, 3) if pd.notna(x) else []).\
           apply(lambda x: x[:-1] if x and len(x[-1]) != 3 else x)
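
For reference, here is a self-contained sketch of this approach on data similar to the question's sample. The helper names `chunk3` and `chunk3_strict` and the extra rows (an empty string and a 5-letter code) are made up for illustration. Note that textwrap.wrap is a text-wrapping tool that also collapses whitespace, so this sketch assumes the codes contain no spaces.

```python
import textwrap

import numpy as np
import pandas as pd

# Sample column with a NaN, an empty string, and a length that is not a multiple of 3
df = pd.DataFrame({"col3": ["LITMATPHY", np.nan, "CHMHISENGSTA", "", "BIOLI"]})


def chunk3(value):
    """Split a string into pieces of up to 3 characters; NaN gives []."""
    return textwrap.wrap(value, 3) if pd.notna(value) else []


def chunk3_strict(value):
    """Like chunk3, but drop a trailing piece shorter than 3 characters."""
    parts = chunk3(value)
    # `parts and ...` avoids indexing into an empty list
    return parts[:-1] if parts and len(parts[-1]) != 3 else parts


df["chunks"] = df["col3"].apply(chunk3)
df["chunks_strict"] = df["col3"].apply(chunk3_strict)
print(df)
```

Both the NaN row and the empty string come out as empty lists, and only the strict variant drops the leftover "LI".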

3 Comments

Maybe add the link to the documentation? Nice solution +1
Thanks Nuri. This works. Just one question: will apply cause a performance reduction compared to map() if I have more than a million rows in the dataframe?
No, apply is actually faster in this case, although the difference is not big. timeit results for 2 million rows on my IDE: map: 22.3 s ± 2.14 s per loop; apply: 15.4 s ± 1.71 s per loop. I have a terrible GPU and RAM, so the results will probably be faster in general :)
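
To check this on your own data, a rough timing sketch along these lines may help. The row count, the repeat count, and the helper name `chunk3` are arbitrary choices for illustration. The numbers depend heavily on the machine and the pandas version, so none are quoted here.

```python
import textwrap
import timeit

import numpy as np
import pandas as pd


def chunk3(value):
    """Split a string into pieces of up to 3 characters; NaN gives []."""
    return textwrap.wrap(value, 3) if pd.notna(value) else []


# Small synthetic column: the question's sample values repeated
values = ["LITMATPHY", np.nan, "CHMHISENGSTA", "COMFRNPSYGEO", "BIOLIT"]
col = pd.Series(values * 2_000)

# Both calls should produce the same lists; only the speed may differ
via_apply = col.apply(chunk3)
via_map = col.map(chunk3)

t_apply = timeit.timeit(lambda: col.apply(chunk3), number=3)
t_map = timeit.timeit(lambda: col.map(chunk3), number=3)
print(f"apply: {t_apply:.3f} s, map: {t_map:.3f} s (3 runs each, {len(col)} rows)")
```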

Another way is this:

import pandas as pd
import numpy as np
df = pd.DataFrame({"col3":["LITMATPHY",np.nan,"CHMHISENGSTA","COMFRNPSYGEO","BIOLIT"]})

def split_str(s):
    lst=[]
    for i in range(0,len(s),3):
        lst.append(s[i:i+3])
    return lst

df["col3_result"] = df["col3"].apply(lambda x: [] if pd.isna(x) else split_str(s=x))

# Output

           col3           col3_result
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2  CHMHISENGSTA  [CHM, HIS, ENG, STA]
3  COMFRNPSYGEO  [COM, FRN, PSY, GEO]
4        BIOLIT            [BIO, LIT]

3 Comments

You could do a more general check for NaN, for instance lambda x: [] if pd.isna(x) else split_str(x)
Thanks Sachin, but I already tried a combination of your way (a method with textwrap.wrap()) and it works fine. But I was looking for a solution similar to what Nuri posted. Thanks again.
That's great, and thanks for your input @FObersteiner. I edited the code and will keep it in mind :) My earlier way of checking for np.nan might not be optimal: what if the input contains "nan" as a string? So I changed it to pd.isna(x) :)
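
To illustrate the point about a literal "nan" string: a check based on the string form of a value cannot tell a real missing value from the text "nan", while pd.isna can. The `looks_like_nan` helper below is a hypothetical stand-in for such a string-based check, not the code that was originally posted.

```python
import numpy as np
import pandas as pd


def looks_like_nan(value):
    # Hypothetical string-based check, shown only for contrast
    return str(value) == "nan"


samples = [np.nan, "nan", None, "LITMATPHY"]

# Columns: value, string-based check, pd.isna
for value in samples:
    print(repr(value), looks_like_nan(value), pd.isna(value))
```

The string-based check flags the text "nan" as missing and misses None, whereas pd.isna treats np.nan and None as missing and leaves the text alone.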

Using only pandas, we can do:

import numpy as np
import pandas as pd

df = pd.DataFrame(['LITMATPHY', np.nan, '', 'CHFDIOSFF', 'CHFIOD', 'FHDIFOSDFJKL'], columns=['col3'])

def to_list(string, n):
    if string != string:  # True only for NaN, which never equals itself
        lst = []
    else:
        # Slice the string into consecutive chunks of n characters
        lst = [string[i:i+n] for i in range(0, len(string), n)]
    return lst

df['new_col3'] = df['col3'].apply(lambda x: to_list(x, 3))

Output:

           col3              new_col3
0     LITMATPHY       [LIT, MAT, PHY]
1           NaN                    []
2                                  []
3     CHFDIOSFF       [CHF, DIO, SFF]
4        CHFIOD            [CHF, IOD]
5  FHDIFOSDFJKL  [FHD, IFO, SDF, JKL]

3 Comments

what's string != string supposed to do?
It checks whether string is NaN: np.nan is not equal to itself, so string != string is True only for NaN.
cool, wasn't aware that works with np.nan (it doesn't with native Python's None since there is only one of those...)
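
A quick sketch of why the self-comparison works, and of the None caveat raised in the last comment. NaN compares unequal to everything, including itself, whereas None == None is True, so a None entry would fall through to the slicing branch and fail on len(). The `to_list_safe` name is hypothetical; it swaps the self-comparison for pd.isna so that None and NaN are both treated as missing.

```python
import numpy as np
import pandas as pd

nan = np.nan
print(nan != nan)    # NaN never compares equal, even to itself
print(None != None)  # None is equal to itself, so the trick misses it


def to_list_safe(value, n):
    """Split value into chunks of n characters; any missing value gives []."""
    if pd.isna(value):
        return []
    return [value[i:i + n] for i in range(0, len(value), n)]


# dtype=object keeps None as None instead of converting it
col = pd.Series(["LITMATPHY", np.nan, None, "", "CHFIOD"], dtype=object)
print(col.apply(lambda x: to_list_safe(x, 3)).tolist())
```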
