I have a pandas DataFrame with a column of 9-character strings. I would like to find the rows whose strings share the same first 3 characters, so that each 3-character prefix gets its own set of rows.

My current solution creates a new column in the dataframe that simply slices the first 3 characters of the string, but I would like to solve this without creating a new column (since I have to delete it later). I generally prefer not to alter the dataframe if I can help it.

Example:

import pandas as pd

# sample dataframe:
cid=[1,2,3,4,5,6,7,8,9,10]
strings=[
    'tncduuqcr',
    'xqjfykalt',
    'arzouazgz',
    'tncknojbi',
    'xqjgfcekh',
    'arzupnzrx',
    'tncfjxyox',
    'xqjeboxdn',
    'arzphbdcs',
    'tnctnfoyi',
]

df=pd.DataFrame(list(zip(cid,strings)),columns=['cid','strings'])

# This is the step I would like to avoid doing:
df['short_strings']=df['strings'].str[0:3]

out_dict={}

for x in df['short_strings'].unique():
    df2=df[df['short_strings']==x]
    out_dict[x]=df2

# the separate dataframes:
for x in out_dict.keys():
    print(out_dict[x])

Output:

   cid    strings short_strings
0    1  tncduuqcr           tnc
3    4  tncknojbi           tnc
6    7  tncfjxyox           tnc
9   10  tnctnfoyi           tnc
   cid    strings short_strings
1    2  xqjfykalt           xqj
4    5  xqjgfcekh           xqj
7    8  xqjeboxdn           xqj
   cid    strings short_strings
2    3  arzouazgz           arz
5    6  arzupnzrx           arz
8    9  arzphbdcs           arz

I have tried comparing against df['strings'].str[0:3] directly with ==, but this does not seem to work.
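For reference, a minimal sketch of what selecting by prefix without a helper column could look like, assuming the comparison is made against a concrete prefix such as 'tnc'. The shortened frame and the variable names (prefix, tnc_rows, tnc_rows_alt) are only for illustration:

```python
import pandas as pd

# Shortened version of the sample data, for illustration only.
df = pd.DataFrame({
    'cid': [1, 2, 3, 4],
    'strings': ['tncduuqcr', 'xqjfykalt', 'arzouazgz', 'tncknojbi'],
})

# The sliced Series lives only in a local variable; df is not modified.
prefix = df['strings'].str[:3]
tnc_rows = df[prefix == 'tnc']

# For a single known prefix, str.startswith builds the same boolean mask.
tnc_rows_alt = df[df['strings'].str.startswith('tnc')]

print(tnc_rows)
```

Either mask leaves df with its original two columns, so there is nothing to delete afterwards.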

  • Can you add the expected output to your question? Commented Nov 9, 2020 at 18:43
  • I've added the printed dataframes. Commented Nov 9, 2020 at 19:40

1 Answer

For this type of operation, use DataFrame.groupby() and iterate over the resulting GroupBy object (GroupBy.__iter__()). Filtering the frame once per value of Series.unique() is slower:

mydict = dict(iter(df.groupby(df['strings'].str[:3])))
print(mydict)

Output

{'arz':    cid    strings
 2    3  arzouazgz
 5    6  arzupnzrx
 8    9  arzphbdcs,
 'tnc':    cid    strings
 0    1  tncduuqcr
 3    4  tncknojbi
 6    7  tncfjxyox
 9   10  tnctnfoyi,
 'xqj':    cid    strings
 1    2  xqjfykalt
 4    5  xqjgfcekh
 7    8  xqjeboxdn}
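The same grouping can also be spelled out as a dict comprehension, which some readers may find easier to follow than calling the iterator directly. This is only a sketch of an equivalent form, using the sample data from the question; the name groups is illustrative:

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
strings = [
    'tncduuqcr', 'xqjfykalt', 'arzouazgz', 'tncknojbi', 'xqjgfcekh',
    'arzupnzrx', 'tncfjxyox', 'xqjeboxdn', 'arzphbdcs', 'tnctnfoyi',
]
df = pd.DataFrame({'cid': cid, 'strings': strings})

# Group by the 3-character prefix computed on the fly.
# No helper column is added to df.
groups = {key: group for key, group in df.groupby(df['strings'].str[:3])}

for key, group in groups.items():
    print(key)
    print(group)
```

Each value in groups is a sub-frame with only the original cid and strings columns.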

1 Comment

I should have been more clear - in my application I am only interested in a subset of the groups, and some of them will be grouped together. Thus I can substitute the .unique with the list of strings I am interested in. Additionally, some strings will be grouped together (ex: 'arz' and 'tnc' saved in the same dataframe/dictionary entry). Is there a way to do that with groupby? (or perhaps better asked: is there a way to include an "if" statement into the groupby?)
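One possible way to approach this follow-up, offered as a sketch rather than a definitive answer: map each prefix of interest to a group label before grouping. The mapping label_for_prefix and the label 'arz_tnc' are hypothetical names chosen for illustration. Prefixes missing from the mapping become NaN, and groupby drops NaN keys by default, so those rows are left out:

```python
import pandas as pd

cid = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
strings = [
    'tncduuqcr', 'xqjfykalt', 'arzouazgz', 'tncknojbi', 'xqjgfcekh',
    'arzupnzrx', 'tncfjxyox', 'xqjeboxdn', 'arzphbdcs', 'tnctnfoyi',
]
df = pd.DataFrame({'cid': cid, 'strings': strings})

# Hypothetical mapping from prefix to output group. 'arz' and 'tnc' share
# a label so they end up in the same sub-frame; 'xqj' is intentionally absent.
label_for_prefix = {'arz': 'arz_tnc', 'tnc': 'arz_tnc'}

# Unmapped prefixes become NaN here.
labels = df['strings'].str[:3].map(label_for_prefix)

# Rows with a NaN label are excluded because groupby drops NaN keys by default.
selected = {label: group for label, group in df.groupby(labels)}

print(selected['arz_tnc'])
```

The mapping plays the role of the "if": only prefixes listed in it are kept, and prefixes that share a label are merged into one entry.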
