I have the following code from this question Df groupby set comparison:

import pandas as pd

wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''

for index, row in wordlist.iterrows() :
    row['split'] = list(row['word'])

    anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
    wordlist['anagrams'] = anaglist

wordlist = wordlist.drop(['split'], axis=1)

wordlist = wordlist['anagrams'].drop_duplicates(keep='first')

print(wordlist)
print(wordlist.dtypes)

Some of the input in my example.txt file seems to be read as ints, particularly when the strings have different lengths. I can't seem to force pandas to treat the data as strings using .astype(str).

What's going on?

1 Answer

First, you can force a column to be read as strings by passing the parameter dtype=str to read_csv, but that is only needed when a column that would otherwise be parsed as numeric has to be converted explicitly. Here, because the column contains string values, it seems that all values in the column are already converted to str implicitly.

I changed your code a bit:
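As a minimal sketch of the dtype=str point, the snippet below reads a small digits-only sample with and without the parameter. The sample strings and the variable names (`sample`, `inferred`, `as_text`, `mixed`) are made up for illustration and are not from the original example.txt.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: every line consists of digits only.
sample = "123\n321\n4567"

# Without dtype, pandas infers an integer column from digits-only input.
inferred = pd.read_csv(StringIO(sample), header=None, names=['word'])

# With dtype=str, every value is kept as text.
as_text = pd.read_csv(StringIO(sample), header=None, names=['word'], dtype=str)

print(inferred['word'].dtype)
print(as_text['word'].dtype)

# Because the values are str, sorting their characters works as intended.
as_text['anagrams'] = as_text['word'].apply(lambda x: ''.join(sorted(x)))
print(as_text)

# A column that mixes digits-only and alphabetic lines is read as text anyway.
mixed = pd.read_csv(StringIO("123\nabc"), header=None, names=['word'])
print(mixed['word'].map(type))
```

If the real file only ever contains alphabetic words, the parameter should make no difference; it matters when some or all lines look numeric.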

Setup:

import pandas as pd
from io import StringIO

temp=u'''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
#after testing, replace 'StringIO(temp)' with 'example.txt'
wordlist = pd.read_csv(StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
      word
0      acb
1      acb
2      bca
3      foo
4      oof
5  spaniel

#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))

print (wordlist)
      word anagrams
0      acb      abc
2      bca      abc
3      foo      foo
4      oof      foo
5  spaniel  aeilnps

#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')

#get duplicated rows, excluding the first occurrence of each anagram key
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
  word anagrams
2  bca      abc
4  oof      foo

#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
  word anagrams
0  acb      abc
2  bca      abc
3  foo      foo
4  oof      foo
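Since the question came from a groupby-based approach, here is a rough sketch of the same filtering done by grouping on the sorted-letter key instead of using duplicated. The small DataFrame is a hypothetical stand-in for the deduplicated word list above, and `groups` and `anagram_groups` are names chosen only for this example.

```python
import pandas as pd

# Hypothetical stand-in for the deduplicated word list used above.
wordlist = pd.DataFrame({'word': ['acb', 'bca', 'foo', 'oof', 'spaniel']})
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(x)))

# Collect the words that share each sorted-letter key.
groups = wordlist.groupby('anagrams')['word'].apply(list)

# Keep only the keys shared by more than one word.
anagram_groups = groups[groups.map(len) > 1]
print(anagram_groups)
```

This gives one row per anagram key with the matching words in a list, which may be easier to read than the flat duplicated(keep=False) result when there are many groups.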