Pandas keeps converting strings to int

Question

I have the following code from this question Df groupby set comparison:

import pandas as pd

wordlist = pd.read_csv('data/example.txt', sep='\r', header=None, index_col=None, names=['word'])
wordlist = wordlist.drop_duplicates(keep='first')
# wordlist['word'] = wordlist['word'].astype(str)
wordlist['split'] = ''
wordlist['anagrams'] = ''

for index, row in wordlist.iterrows() :
    row['split'] = list(row['word'])

    anaglist = wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))
    wordlist['anagrams'] = anaglist

wordlist = wordlist.drop(['split'], axis=1)

wordlist = wordlist['anagrams'].drop_duplicates(keep='first')

print(wordlist)
print(wordlist.dtypes)

Some input in my example.txt file seems to be read as ints, particularly if the strings have different lengths. I can't seem to force pandas to treat the data as strings using .astype(str).

What's going on?

jezrael · Accepted Answer · 2018-01-19 06:33:25Z

First, to force a column to be read as strings you can pass the parameter dtype=str to read_csv, but that is only needed when a column would otherwise be parsed as numeric and has to be converted explicitly. Here the column contains string values, so it seems all values in the column are converted to str implicitly.
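
As a minimal sketch of the dtype=str point (the sample values below are made up for illustration and are not from your example.txt), compare what read_csv infers with and without the parameter:

```python
from io import StringIO
import pandas as pd

# Hypothetical input: every line looks numeric, so pandas infers integers
data = "007\n42\n123\n"

# Default behaviour: the column is parsed as an integer dtype
inferred = pd.read_csv(StringIO(data), header=None, names=['word'])

# dtype=str keeps the raw text, including the leading zeros
forced = pd.read_csv(StringIO(data), header=None, names=['word'], dtype=str)

print(inferred['word'].tolist())
print(forced['word'].tolist())
```

Converting with .astype(str) after reading would be too late for a value like 007, because the leading zeros are already lost once the column has been parsed as integers.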

I changed your code a bit:

Setup:

import pandas as pd
import numpy as np
from io import StringIO

temp=u'''"acb"
"acb"
"bca"
"foo"
"oof"
"spaniel"'''
#pd.compat.StringIO was removed in newer pandas versions, so io.StringIO is used here
#after testing, replace 'StringIO(temp)' with 'example.txt'
wordlist = pd.read_csv(StringIO(temp), sep="\r", index_col=None, names=['word'])
print (wordlist)
      word
0      acb
1      acb
2      bca
3      foo
4      oof
5  spaniel

#first remove duplicates
wordlist = wordlist.drop_duplicates()
#create lists and join them
wordlist['anagrams'] = wordlist['word'].apply(lambda x: ''.join(sorted(list(x))))

print (wordlist)
      word anagrams
0      acb      abc
2      bca      abc
3      foo      foo
4      oof      foo
5  spaniel  aeilnps

#sort DataFrame by column anagrams
wordlist = wordlist.sort_values('anagrams')

#get duplicated rows, skipping the first occurrence of each anagram key
wordlist1 = wordlist[wordlist['anagrams'].duplicated()]
print (wordlist1)
  word anagrams
2  bca      abc
4  oof      foo

#get all duplicated rows
wordlist2 = wordlist[wordlist['anagrams'].duplicated(keep=False)]
print (wordlist2)
  word anagrams
0  acb      abc
2  bca      abc
3  foo      foo
4  oof      foo
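
If the goal is to see which words belong together rather than just which rows are duplicated, one possible follow-up is to group on the anagrams column. This is only a sketch on the same sample words; the names words and groups are introduced here for illustration and are not part of the code above:

```python
import pandas as pd

# Same sample words as in the setup, with duplicates already removed
words = pd.DataFrame({'word': ['acb', 'bca', 'foo', 'oof', 'spaniel']})
words['anagrams'] = words['word'].apply(lambda x: ''.join(sorted(x)))

# Collect all words that share the same sorted-letter key
groups = words.groupby('anagrams')['word'].apply(list)

# Keep only keys shared by more than one word
groups = groups[groups.map(len) > 1]
print(groups)
```

This keeps the abc and foo keys and drops spaniel, which has no anagram in the sample.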
