
This question follows the question Problem in Pandas : impossible to do sum of int with arbitrary precision, where I used the accepted answer: df["my_int"].apply(int).sum()

But it does not work in all cases.

For example, with this file:

my_int
9220426963983292163
5657924282683240

The output is -9220659185443576213.

After looking at the apply(int) output, I understand the problem: in this case, apply(int) returns dtype: int64.

0    9220426963983292163
1       5657924282683240
Name: my_int, dtype: int64

But with larger numbers, it returns dtype: object:

0    1111111111111111111111111111111111111111111111...
1    2222222222222222222222222222222222222222222222...
Name: my_int, dtype: object
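
For reference, the failure is easy to reproduce without a file. A minimal sketch (building the Series directly in code is my assumption; the CSV route behaves the same way):

import pandas as pd

# Both values fit in int64, so pandas packs the results of apply(int)
# back into an int64 column, and the subsequent sum wraps around.
s = pd.Series(["9220426963983292163", "5657924282683240"], name="my_int")
converted = s.apply(int)
print(converted.dtype)  # int64
print(converted.sum())  # -9220659185443576213 (wrapped modulo 2**64)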

Is it possible to solve this with pandas? Or should I follow Tim Roberts' answer from the previous question?

Edit 1:

An ugly workaround: append a line containing a very large integer to the end of the file:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Then the sum is taken over all rows except the last one (the oversized sentinel forces apply(int) to keep the column as object dtype, so the remaining values are summed as Python ints):

data['my_int'].apply(int).iloc[:-1].sum()

2 Answers


Solution using Pandas:

sum(data['my_int'].apply(int).to_list())

Why do I say so?

df1:

my_int
9220426963983292163
5657924282683240

df2:

my_int
9220426963983292163
5657924282683240
11111111111111111111111111111111111111111111111111111111111111111111111111

Let S1 and S2 denote the sum of elements in the column my_int in df1 and df2, respectively:

S1 = 9226084888265975403
S2 = 11111111111111111111111111111111111111111111111111111120337195999377086514

If we check the NumPy documentation on overflow errors, we see that NumPy offers only limited precision:

>>> np.iinfo(int)
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)

The maximum representable int64 value is smaller than both S1 and S2.
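
This is also where the negative number in the question comes from: int64 arithmetic is performed modulo 2**64, so a true sum above the signed maximum reappears as a negative value. A quick sketch of the arithmetic:

s1 = 9220426963983292163 + 5657924282683240
print(s1)          # 9226084888265975403, larger than 2**63 - 1
print(s1 - 2**64)  # -9220659185443576213, the value int64 actually stores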

Pandas' own sum on df2, however, gives a wrong answer: the oversized value forces the column to dtype object, so the values remain strings and summing concatenates them:

>>> df2['my_int'].sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111
>>> 
>>> df2['my_int'].astype(object).sum()
9220426963983292163565792428268324011111111111111111111111111111111111111111111111111111111111111111

The proposed solution, in contrast, returns the correct sum:

>>> sum(df2['my_int'].apply(int).to_list())
11111111111111111111111111111111111111111111120337195999377086514

EDIT: Prefer sum over np.sum:

>>> np.sum(df1['my_int'].apply(int).to_list())
-9220659185443576213
>>> sum(df1['my_int'].apply(int).to_list())
9226084888265975403
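
The difference comes from coercion rather than from the summation itself: np.sum first converts the list of Python ints back into an int64 array. A small sketch (assuming NumPy is imported as np):

values = [9220426963983292163, 5657924282683240]
print(np.asarray(values).dtype)  # int64: the list is coerced before summing
print(np.sum(values))            # -9220659185443576213 (wraps in int64)
print(sum(values))               # 9226084888265975403 (Python ints never overflow)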

The sums of the my_int column above were verified with WolframAlpha: df1, df2.


2 Comments

I'm not sure what gives the negative answer for df1. I am not specifying any dtype when converting the data to a DataFrame, and the my_int column has dtype int64 once df1 is generated.
I removed my comments because they were useless. Your solution was right.

Solution:

df["my_int"].apply(int).astype(object).sum()

apply(int): converts the strings to Python ints, avoiding string concatenation with large numbers.

astype(object): converts the int64 result back to object, so the sum is computed with Python's arbitrary-precision integers instead of wrapping in int64.
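
A short end-to-end sketch of this (the Series is built inline as an assumption, and str(10**30) stands in for an arbitrarily large value):

import pandas as pd

df = pd.DataFrame({"my_int": ["9220426963983292163",
                              "5657924282683240",
                              str(10**30)]})
# apply(int) parses the strings into Python ints; astype(object) guarantees
# they stay Python ints even when they would all fit in int64.
print(df["my_int"].apply(int).astype(object).sum())
# 1000000000009226084888265975403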

