Slice pandas dataframe by part of string value in column

Question

I have a pandas dataframe that contains a column with a 9-character string. I would like to group the rows of the dataframe by the first 3 of the 9 characters in this string.

My current solution creates a new column in the dataframe that simply slices the first 3 characters of the string, but I would like to solve this without creating a new column (since I have to delete it later). I generally prefer not to alter the dataframe if I can help it.

Example:

import pandas as pd

# sample dataframe:
cid=[1,2,3,4,5,6,7,8,9,10]
strings=[
    'tncduuqcr',
    'xqjfykalt',
    'arzouazgz',
    'tncknojbi',
    'xqjgfcekh',
    'arzupnzrx',
    'tncfjxyox',
    'xqjeboxdn',
    'arzphbdcs',
    'tnctnfoyi',
]

df=pd.DataFrame(list(zip(cid,strings)),columns=['cid','strings'])

# This is the step I would like to avoid doing:
df['short_strings']=df['strings'].str[0:3]

out_dict={}

for x in df['short_strings'].unique():
    df2=df[df['short_strings']==x]
    out_dict[x]=df2

# the separate dataframes:
for x in out_dict.keys():
    print(out_dict[x])

Output:

   cid    strings short_strings
0    1  tncduuqcr           tnc
3    4  tncknojbi           tnc
6    7  tncfjxyox           tnc
9   10  tnctnfoyi           tnc
   cid    strings short_strings
1    2  xqjfykalt           xqj
4    5  xqjgfcekh           xqj
7    8  xqjeboxdn           xqj
   cid    strings short_strings
2    3  arzouazgz           arz
5    6  arzupnzrx           arz
8    9  arzphbdcs           arz

I have tried simply comparing ==df['strings'].str[0:3] but this does not seem to work.

Can you add the expected output to your question?

– Mehdi Golzadeh, Commented Nov 9, 2020 at 18:43

I've added the printed dataframes.

– amquack, Commented Nov 9, 2020 at 19:40

ansev · Accepted Answer · 2020-11-09 18:44:53Z

1

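For this type of operation, use DataFrame.groupby() together with GroupBy.__iter__(). The grouping key can be the sliced Series itself, so no extra column is needed. Filtering once per value of Series.unique is slower, because each comparison scans the whole column again: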

mydict = dict(df.groupby(df.strings.str[:3]).__iter__())
print(mydict)

Output

{'arz':    cid    strings
 2    3  arzouazgz
 5    6  arzupnzrx
 8    9  arzphbdcs,
 'tnc':    cid    strings
 0    1  tncduuqcr
 3    4  tncknojbi
 6    7  tncfjxyox
 9   10  tnctnfoyi,
 'xqj':    cid    strings
 1    2  xqjfykalt
 4    5  xqjgfcekh
 7    8  xqjeboxdn}
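
As a side note on the attempt mentioned in the question: the sliced prefix can be held in a temporary Series and used either as a boolean mask or as the groupby key, without ever being assigned to the dataframe. The following is a minimal sketch on a shortened copy of the sample data; the names prefix and tnc_rows are only illustrative.

```python
import pandas as pd

# Shortened version of the sample dataframe from the question.
df = pd.DataFrame({
    'cid': [1, 2, 3, 4],
    'strings': ['tncduuqcr', 'xqjfykalt', 'arzouazgz', 'tncknojbi'],
})

# Temporary Series holding the first 3 characters; it is never stored in df.
prefix = df['strings'].str[:3]

# Boolean mask: compare the prefix to a single value.
tnc_rows = df[prefix == 'tnc']

# The same Series works as a groupby key.
out_dict = {key: group for key, group in df.groupby(prefix)}

print(tnc_rows)
print(list(df.columns))  # df itself is unchanged
```

The comparison has to be against a value (or use Series.isin for several values); df is left with its original two columns.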


1 Comment

amquack Over a year ago

I should have been more clear - in my application I am only interested in a subset of the groups, and some of them will be grouped together. Thus I can substitute the .unique with the list of strings I am interested in. Additionally, some strings will be grouped together (ex: 'arz' and 'tnc' saved in the same dataframe/dictionary entry). Is there a way to do that with groupby? (or perhaps better asked: is there a way to include an "if" statement into the groupby?)
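
One possible way to handle the case described in this comment, offered as a sketch rather than as the answerer's method: map each prefix to a group label before grouping. Prefixes that share a label end up in the same dataframe, and prefixes left out of the mapping become NaN, which groupby drops by default. The dictionary label_for_prefix and the label 'arz_tnc' are hypothetical names chosen for illustration.

```python
import pandas as pd

# Sample dataframe from the question.
df = pd.DataFrame({
    'cid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'strings': [
        'tncduuqcr', 'xqjfykalt', 'arzouazgz', 'tncknojbi', 'xqjgfcekh',
        'arzupnzrx', 'tncfjxyox', 'xqjeboxdn', 'arzphbdcs', 'tnctnfoyi',
    ],
})

# Hypothetical mapping from prefix to group label.
# 'arz' and 'tnc' share a label; 'xqj' is omitted, so it is excluded.
label_for_prefix = {'arz': 'arz_tnc', 'tnc': 'arz_tnc'}

# Prefixes missing from the mapping become NaN.
labels = df['strings'].str[:3].map(label_for_prefix)

# groupby skips NaN keys by default, so only the mapped rows are kept.
out_dict = dict(iter(df.groupby(labels)))

for label, group in out_dict.items():
    print(label)
    print(group)
```

The mapping plays the role of the "if" statement asked about: which prefixes are kept, and which are merged, is decided entirely by the dictionary, and df is still not modified.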
