
This question follows the question Problem in Pandas : impossible to do sum of int with arbitrary precision, where I used the accepted answer: df["my_int"].apply(int).sum()

But it does not work in all cases.

For example, with this file:

my_int
9220426963983292163
5657924282683240

The output is -9220659185443576213.

After looking at the apply(int) output, I understand the problem: in this case, apply(int) returns dtype: int64.

0    9220426963983292163
1       5657924282683240
Name: my_int, dtype: int64

But with larger numbers, it returns dtype: object:

0    1111111111111111111111111111111111111111111111...
1    2222222222222222222222222222222222222222222222...
Name: my_int, dtype: object
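
For reference, the failure is easy to reproduce without a file. A minimal sketch (building the Series directly in code is my assumption; the CSV route behaves the same way):

import pandas as pd

# Both values fit in int64, so pandas packs the results of apply(int)
# back into an int64 column, and the subsequent sum wraps around.
s = pd.Series(["9220426963983292163", "5657924282683240"], name="my_int")
converted = s.apply(int)
print(converted.dtype)  # int64
print(converted.sum())  # -9220659185443576213 (wrapped modulo 2**64)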

Is it possible to solve this with pandas? Or should I follow Tim Roberts' answer from the previous question?

Edit 1:

An ugly workaround: append a line containing a very large integer to the end of the file:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Then the sum is taken over all rows except the last one (the oversized sentinel forces apply(int) to keep the column as object dtype, so the remaining values are summed as Python ints):

data['my_int'].apply(int).iloc[:-1].sum()

2 Answers


Solution using Pandas:

sum(data['my_int'].apply(int).to_list())

Why do I say so?

df1:

my_int
9220426963983292163
5657924282683240

df2:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Let S1 and S2 denote the sum of elements in the column my_int in df1 and df2, respectively:

S1 = 9226084888265975403
S2 = 11111111111111111111111111111111111111111111111111111120337195999377086514

If we check the NumPy documentation on overflow errors, we see that NumPy offers only limited precision:

>>> np.iinfo(int)
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

The maximum representable int64 value is smaller than both S1 and S2.
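
This is also where the negative number in the question comes from: int64 arithmetic is performed modulo 2**64, so a true sum above the signed maximum reappears as a negative value. A quick sketch of the arithmetic:

s1 = 9220426963983292163 + 5657924282683240
print(s1)          # 9226084888265975403, larger than 2**63 - 1
print(s1 - 2**64)  # -9220659185443576213, the value int64 actually stores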

Pandas' own sum on df2, however, gives a wrong answer: the oversized value forces the column to dtype object, so the values remain strings and summing concatenates them:

>>> df2['my_int'].sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>> 
>>> df2['my_int'].astype(object).sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111

The proposed solution, in contrast, returns the correct sum:

>>> sum(df2['my_int'].apply(int).to_list())
11111111111111111111111111111111111111111111120337195999377086514

EDIT: Prefer sum over np.sum:

>>> np.sum(df1['my_int'].apply(int).to_list())
-9220659185443576213
>>> sum(df1['my_int'].apply(int).to_list())
9226084888265975403
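
The difference comes from coercion rather than from the summation itself: np.sum first converts the list of Python ints back into an int64 array. A small sketch (assuming NumPy is imported as np):

values = [9220426963983292163, 5657924282683240]
print(np.asarray(values).dtype)  # int64: the list is coerced before summing
print(np.sum(values))            # -9220659185443576213 (wraps in int64)
print(sum(values))               # 9226084888265975403 (Python ints never overflow)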

The sums of the my_int column above were verified with WolframAlpha: df1, df2.


2 Comments

I'm not sure what gives the negative answer for df1. I am not specifying any dtype when converting the data to a DataFrame, and the my_int column has dtype int64 once df1 is generated.
I removed my comments because they were useless. Your solution was right.

Solution:

df["my_int"].apply(int).astype(object).sum()

apply(int): converts the strings to Python ints, avoiding string concatenation with large numbers.

astype(object): converts the int64 result back to object, so the sum is computed with Python's arbitrary-precision integers instead of wrapping in int64.
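
A short end-to-end sketch of this (the Series is built inline as an assumption, and str(10**30) stands in for an arbitrarily large value):

import pandas as pd

df = pd.DataFrame({"my_int": ["9220426963983292163",
                              "5657924282683240",
                              str(10**30)]})
# apply(int) parses the strings into Python ints; astype(object) guarantees
# they stay Python ints even when they would all fit in int64.
print(df["my_int"].apply(int).astype(object).sum())
# 1000000000009226084888265975403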

