15,494 questions
2
votes
1
answer
132
views
Accurate computation of the inverse gamma function with the standard C math library
The inverse of the gamma function over the reals is multivalued with an infinite number of branches. This self-answered question is about the principal inverse of the gamma function, Γ₀⁻¹(x), whose ...
5
votes
3
answers
163
views
PHP 8.4 rounding gives different result than 8.2
I am in the process of upgrading our code from PHP 8.2 to 8.4.
I noticed we are getting some test failures because of round() returning different values than expected. Ultimately the problem could be ...
4
votes
1
answer
108
views
Invalid Operation with Arm64 fcmp and simd
Consider the following snippet:
ldr q0, [x0]
cmeq v0.16b, v0.16b, #0
shrn v0.8b, v0.8h, #4
fcmp d0, #0.0
This is a common way to implement functions such as strlen with SIMD. According to the Arm64 ...
Advice
1
vote
10
replies
164
views
What is the minimum range of the quotient assigned by `remquo()`?
Is the minimum range that remquo(x, y, quo) assigns *quo [-7 ... +7]?
If not, what is the minimum compliant range?
double remquo(double x, double y, int *quo); has the following description:
The ...
16
votes
2
answers
696
views
Parsing of small floats with std::istream
I have a program that reads the coordinates of some points from a text file using std::istringstream, and then it verifies the correctness of parsing by calling stream's operator bool().
In general it ...
-4
votes
3
answers
296
views
the pow() function isn't accurate
I read Calculating very large exponents in python, and Ignacio Vazquez-Abrams's answer was to use the pow() function.
So I ran the following command in Python 3.9.6:
print(int(pow(1.5,96)))
and ...
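A minimal sketch of one way to get every digit, assuming the goal is the integer part of 1.5 to the 96th power. `pow(1.5, 96)` returns a double with a 53-bit significand, while 3**96 alone needs more than 150 bits, so the float result cannot carry all the digits. Exact rational arithmetic avoids the rounding. The helper name `exact_floor_pow` is illustrative and not taken from the question.

```python
from fractions import Fraction


def exact_floor_pow(base: Fraction, exponent: int) -> int:
    """Return floor(base**exponent) using exact rational arithmetic."""
    value = base ** exponent  # exact Fraction, no rounding takes place
    return value.numerator // value.denominator


exact = exact_floor_pow(Fraction(3, 2), 96)
approx = int(pow(1.5, 96))  # limited to 53 significant bits
print(exact, approx)
```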
2
votes
1
answer
244
views
Why does Newton’s method overshoot on the first deceleration step in my motion profile generator?
I’m porting a Python motion profile generator to C to implement for my STM32H743. The generator produces step timings for a simple acceleration → cruise → deceleration motion profile. See the ...
0
votes
1
answer
295
views
Inaccuracy replicating Fortran mixed-precision expression in Rust
I have the following code in my Fortran program, where both a and b are declared as REAL (KIND=8):
a = 0.12497443596150659d0
b = 1.0 + 0.00737 * a
This yields b as 1.0009210615647672
For comparison, ...
4
votes
1
answer
169
views
Weird behavior in large complex128 NumPy arrays, imaginary part only [closed]
I'm working on numerical simulations. I ran into an issue with large NumPy arrays (~ 26 GB) on Linux with 128 GB of RAM. The arrays are of type complex128.
Arrays are instantiated without errors (if ...
0
votes
2
answers
248
views
turn Python float argument into numpy array, keep array argument the same
I have a simple function that is math-like:
def y(x):
return x**2
I know the operation x**2 will return a numpy array if supplied a numpy array and a float if supplied a float.
for more complicated ...
25
votes
13
answers
3k
views
How can I parse a string to a float in C in a way that isn't affected by the current locale?
I'm writing a program where I need to parse some configuration files in addition to user input from a graphical user interface. In particular, I'm having issues with parsing strings taken from the ...
2
votes
0
answers
206
views
Why does floating point division take less than 50% of the latency of integer division and also 10x more latency than usual when underflow occurs?
I am measuring the latency of instructions.
For 64-bit primitives, integer division takes about 25 cycles each, usually on my 2.3GHz Digital Ocean vCPU, while floating point division takes about 10 ...
4
votes
2
answers
311
views
Why does adding a value to Float.MAX_VALUE not reach infinity?
According to the standard, overflow in Java is handled using a special value called infinity, but here the sum is 3.4028235E38. Why is this the case?
public class FloatingPointTest {
public static ...
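The rounding behaviour behind this can be sketched with Python doubles standing in for Java's float (an assumption made for illustration; the question itself uses Float.MAX_VALUE). Under round-to-nearest, a sum only overflows to infinity once the addend reaches half the gap above the largest finite value; anything smaller rounds back down.

```python
import math
import sys

largest = sys.float_info.max   # largest finite double, 2**1024 - 2**971
half_gap = 2.0 ** 970          # half of the spacing (2**971) at the top of the range

print(largest + 1.0 == largest)            # addend far below half a gap: rounds back
print(largest + half_gap / 2 == largest)   # still below half a gap: rounds back
print(math.isinf(largest + half_gap))      # reaches the halfway point: becomes infinity
```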
6
votes
1
answer
260
views
Speeding up integer division with doubles
I have a fixed-point math-heavy project and I was looking to speed up integer divisions. I tested double division with SSE4 and AVX2 and got nearly 2x speedup versus scalar integer division. I wonder ...
0
votes
1
answer
150
views
GCC offers a _Float16 type, but - what about the functions to work with it?
GCC offers a 16-bit floating point type, outside of the C language standard: _Float16 - at least for x86_64. This allowance is described here.
However - the GCC documentation does not seem to indicate ...
5
votes
1
answer
156
views
Is it expected that vmapping over different input sizes for the same function impacts the accuracy of the result?
I was surprised to see that depending on the size of an input matrix, which is vmapped over inside of a function, the output of the function changes slightly. That is, not only does the size of the ...
3
votes
2
answers
194
views
How does Oracle convert decimal values to float?
If I have a float(5) column, why does 7.89 get rounded to 7.9 but 12.79 gets rounded to 13, not 12.8?
Binary forms are as follows for 3 examples:
7.89 0111.01011001 ------ round to------> 7.9 ...
0
votes
1
answer
221
views
How can a long double be that big in C++? [duplicate]
The sizeof(long double) is 8, which means that if I use all the bits for the integer part of an unsigned number, I can maximum store 2^64-1=18446744073709551615.
However, std::numeric_limits<long ...
0
votes
0
answers
135
views
Previous representable floating point value with `ffast-math`
I am binary searching an array of floating points. In order to modify the result of a binary search for certain values, I insert previous representable floating point values.
I started with:
std::...
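For comparison only, the "previous representable value" operation can be sketched in Python, assuming Python 3.9 or later for `math.nextafter`. This does not address how `-ffast-math` affects the C++ version in the question; the helper name `prev_float` is hypothetical.

```python
import math


def prev_float(x: float) -> float:
    """Return the largest double strictly less than x."""
    return math.nextafter(x, -math.inf)


print(prev_float(1.0))
print(prev_float(0.0))
```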
1
vote
1
answer
106
views
integer exact computation with logs
I need to compute ceil(log_N(i)) where log_N is the log with positive integer base N and i is also a positive integer.
The straightforward Python implementation using floating point math fails at ...
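One integer-only approach, sketched under the assumption that N is at least 2 and i is at least 1: multiply up powers of the base until they reach i. Python integers are arbitrary precision, so no rounding occurs. The function name `ceil_log` is illustrative.

```python
def ceil_log(i: int, base: int) -> int:
    """Return the smallest e >= 0 with base**e >= i, using integers only."""
    if i < 1 or base < 2:
        raise ValueError("expects i >= 1 and base >= 2")
    exponent, power = 0, 1
    while power < i:
        power *= base
        exponent += 1
    return exponent


print(ceil_log(1000, 10), ceil_log(1001, 10))
```

The loop runs about log_N(i) times, which is usually acceptable; a bit-length based starting estimate could reduce that for very large inputs.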
53
votes
2
answers
3k
views
Why is 0.0 printed as 0.00001 when rounding upward?
If in a C++ program, I activate upward rounding mode for floating-point numbers and print some double-precision value already rounded to an integer, e.g.:
#include <cfenv>
#include <iostream>...
1
vote
2
answers
158
views
Can I tell my compiler that a floating-point value is not NaN nor +/-infinity?
I'm writing a C function float foo(float x) which manipulates a floating-point value. It so happens, that I can guarantee the function will only ever be called with finite values - neither NaN's, nor +...
0
votes
1
answer
118
views
Why does NVCC not optimize ldexpf with a constexpr power-of-two exponent into a simple fmul?
Consider the following CUDA code:
enum { p = 5 };
__device__ float adjust_mul(float x) { return x * (1 << p); }
__device__ float adjust_ldexpf(float x) { return ldexpf(x, p); }
I would expect ...
1
vote
1
answer
123
views
Problems with printing double constants in c
I have code that looks like this
#include <float.h>
#include <stdio.h>
int main(void)
{
long double x = LDBL_MAX;
printf("%Lf ", x);
printf("%Lf\n", 1....
0
votes
3
answers
120
views
Is there a known solution for converting IEEE Float values to Hexadecimal in Ada without using the IEEE package?
I do not have the ability to update our Ada compiler set to include the IEEE packages. Is there a way to convert a Float into a Hexadecimal integer? For instance, a Float value of 1.5 as input should ...
0
votes
1
answer
253
views
GCC -Wunsuffixed-float behaviour
(major edit of the question)
I'm using arm-none-eabi-gcc 10.2.
The following code
const double d = 1.0;
int main()
{
return d*d;
}
compiled with
arm-none-eabi-gcc -Wall -Wextra -Wpedantic -...
10
votes
4
answers
310
views
Why does this lookup table sine estimation perform worse when using float instead of double?
I've written a simple sine estimation function which uses a lookup table. Out of curiosity, I tried both float and double types, expecting float to perform a bit better because of being able to pack ...
2
votes
2
answers
264
views
How to get integer unix timestamps in python? Not casting float to integer
The float timestamp used in Python seems to be problematic. First, it will lose its precision as time goes on, and second, the timestamp is not usable, for example, storing in a database big integer ...
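A short sketch of an integer-only route, assuming Python 3.7 or later: `time.time_ns()` returns the time since the epoch as an `int` of nanoseconds, so coarser units can be derived with integer division and no float is involved.

```python
import time

ns = time.time_ns()              # integer nanoseconds since the epoch
seconds = ns // 1_000_000_000    # integer seconds
millis = ns // 1_000_000         # integer milliseconds

print(seconds, millis)
```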
17
votes
1
answer
896
views
Performance of two inverse square root expressions (accounting for CPU pipelining)
To compute the inverse of the square root of a double float in C, the most natural code would be the following: 1 / sqrt(x) (sqrt being a math.h function). At high optimization levels, I expected that ...
2
votes
1
answer
183
views
Fast bithacked log2 approximation
In order to estimate entropy of a sequence, I came up with something like this:
float log2_dirty(float x) {
return std::bitcast<float>((std::bitcast<int>(x) >> 4) + 0x3D800000)) -...
-2
votes
2
answers
231
views
Taylor Series in Java using float
I am implementing a method berechneCosinus(float x, int ordnung) in Java that should approximate the cosine of x using the Taylor series up to the specified order.
My questions:
Can I optimize the ...
3
votes
2
answers
178
views
Efficiently find the minimum exponent to eliminate the fractional part of a floating point number?
Given a float or a double, I want to find the smallest exponent e such that n*pow(2,e) is an integer.
I've currently got this code that just loops multiplying by 2 until the fractional part disappears:...
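A loop-free sketch in Python, assuming the exponent is meant to be non-negative and the input is finite: `float.as_integer_ratio()` returns the exact value as a fraction in lowest terms, and for a binary float the denominator is always a power of two. The name `min_binary_exponent` is hypothetical.

```python
def min_binary_exponent(x: float) -> int:
    """Smallest e >= 0 such that x * 2**e is an integer (x must be finite)."""
    _, denominator = x.as_integer_ratio()  # exact fraction in lowest terms
    return denominator.bit_length() - 1    # denominator == 2**e


print(min_binary_exponent(0.375))  # 0.375 == 3/8
```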
2
votes
1
answer
206
views
Comparing initialized floating point variable to zero, again
We know that floating point values cannot be compared with the == operator, due to precision issues. However, the following code, which initializes a double variable to an integer 0, successfully ...
4
votes
3
answers
174
views
How to calculate largest float smaller than input integer in C#?
I need a C# function that takes in an integer and returns a float, which is the largest float smaller than that integer. In normal (not floating point) math, this is an example:
For input of 5 the ...
4
votes
0
answers
185
views
Questions regarding the new C++ fixed-width floating-point types in C++23
With the introduction of the new fixed-width floating point types in C++23 (the std::floatN_t types in <stdfloat>), some guarantees regarding the representation of floating-point types are not ...
0
votes
4
answers
264
views
Rounding floats / decimals in e-commerce (PHP 8.3 and Laravel 11)
So due to the fact that product quantities are not limited to integers but can store decimals up to 4 digits, I can have a product with quantity of 69.4160 square meters of something.
Now assuming the ...
0
votes
1
answer
193
views
C# display full float value
I know that binary storage for floats cannot always store the exact value. I know that .NET Framework only displays up to 7 significant digits for floats, and uses the 8th to round the 7th. I also ...
4
votes
1
answer
134
views
floating-point implementation of a function with removable singularity
Consider the following function:
ƒ(x) = sin(sqrt(x)) / sqrt(x) if x > 0
ƒ(x) = 1 if x = 0
ƒ(x) = sinh(sqrt(-x)) / sqrt(-x) if x < 0
This is an entire function with a Taylor series at 0 of: 1/1!...
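One common way to handle such a function, sketched here as an assumption rather than as the question's own solution: evaluate a few terms of the Taylor series near zero and the closed forms elsewhere. The switch-over threshold of 1e-4 is an illustrative choice.

```python
import math


def f(x: float) -> float:
    """sin(sqrt(x))/sqrt(x) for x > 0, extended through x = 0 to x < 0."""
    if abs(x) < 1e-4:
        # Taylor series 1 - x/3! + x^2/5! - x^3/7!, written in nested form;
        # the first omitted term, x^4/9!, is below 1e-21 in this range.
        return 1.0 - x / 6.0 * (1.0 - x / 20.0 * (1.0 - x / 42.0))
    if x > 0:
        r = math.sqrt(x)
        return math.sin(r) / r
    r = math.sqrt(-x)
    return math.sinh(r) / r


print(f(0.0), f(1.0), f(-1.0))
```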
3
votes
2
answers
178
views
Why does sqrt(x+1)-sqrt(x) result to 0 in JS? [duplicate]
I’m trying to compute the difference between two square roots in JS.
function deltaSqrt(x) {
return Math.sqrt(x + 1) - Math.sqrt(x);
}
console.log(deltaSqrt(1e6)); // 0.0004999998750463419
...
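The usual remedy for this kind of cancellation is to multiply by the conjugate, which gives sqrt(x+1) - sqrt(x) = 1 / (sqrt(x+1) + sqrt(x)). A sketch in Python (the question uses JavaScript, but both use IEEE 754 doubles); the function names are illustrative.

```python
import math


def delta_sqrt_naive(x: float) -> float:
    # Subtracts two nearly equal values, so leading digits cancel.
    return math.sqrt(x + 1) - math.sqrt(x)


def delta_sqrt_stable(x: float) -> float:
    # Algebraically identical, but uses only an addition and a division.
    return 1.0 / (math.sqrt(x + 1) + math.sqrt(x))


for x in (1e6, 1e20):
    print(x, delta_sqrt_naive(x), delta_sqrt_stable(x))
```

For x = 1e20 the naive form returns 0 because x + 1 already rounds to x, while the rewritten form still returns a value close to 5e-11.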
0
votes
2
answers
90
views
Why is the output 4 seconds when the actual running time is less? [closed]
I have a simple decorator to measure time of function in seconds
from functools import wraps
from time import time
def time_it(func):
@wraps(func)
def wrapper(*args, **kwargs):
...
8
votes
2
answers
730
views
Sine approximation, did I beat Remez?
First, it is the most compact sine approximation ever. It seems I can do better than Remez in terms of precision/performance. Here is the [0,pi/2] approximation range.
double p[] =
-0.0020836519336891552,
-0....
0
votes
0
answers
106
views
In C#, why are 0.025f and 0.025 equal to 0.025m? [duplicate]
In C#, I have this code:
float initialResolutionFloat = 0.025f;
double initialResolutionDouble = 0.025;
decimal initialResolutionDecimal = 0.025m;
decimal resolutionFloat = (decimal)...
0
votes
0
answers
74
views
How to print the value of a floating point register using radare2
I'm debugging an executable program with radare2 on an armv8 computer, and I don't know how to print the values in the floating-point registers.
I tried using the dr directive, but the output is wrong....
0
votes
2
answers
90
views
How do I use g_assert_cmpfloat () to check if two floats are equal without generating a safety warning?
I'm writing some testcases for a program using the GLib testing facilities. I want to assert that two floats have the same value, and g_assert_cmpfloat () seemed like an appropriate function for that. ...
0
votes
1
answer
108
views
How to tell ARM clang to emit SCVTF with the #fbits parameter?
In an embedded application we need to convert a signed-fractional number (S1.31 format) to a single-precision floating point number. A C function looks like this:
#include <stdint.h>
float ...
3
votes
1
answer
223
views
How do I convert a random uint32_t into a random float in the interval [0, 1) using the method from Xoshiro documentation?
On the Xoshiro webpage, they have the following statement under the heading "Generating uniform doubles in the unit interval":
A standard double (64-bit) floating-point number in IEEE ...
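The Xoshiro page's recipe for 64-bit input is `(x >> 11) * 0x1.0p-53`, which keeps the top 53 bits. A sketch of the analogous 32-bit version, assuming a single-precision target with a 24-bit significand: keep the top 24 bits and scale by 2**-24. The function name `u32_to_unit_float` is hypothetical.

```python
def u32_to_unit_float(x: int) -> float:
    """Map a 32-bit unsigned integer to a value in [0, 1) with 24 bits of precision."""
    if not 0 <= x <= 0xFFFFFFFF:
        raise ValueError("expects a 32-bit unsigned integer")
    return (x >> 8) * 2.0 ** -24  # top 24 bits, scaled into [0, 1)


print(u32_to_unit_float(0), u32_to_unit_float(0xFFFFFFFF))
```

Every result is a multiple of 2**-24 below 1, so it is exactly representable in single precision and the upper bound 1.0 is never produced.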
4
votes
5
answers
313
views
Why no floating point error occurs in print(0.1* 100000) vs (Decimal(0.1)*100000) due to FP representation of 0.1?
I am studying numerical analysis and I have come across this dilemma.
Running the following script,
from decimal import Decimal
a = 0.1 ;
N = 100000 ;
# product calculation
P = N*a
# Print product ...
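A compact sketch of the difference, using the default `decimal` context (28 significant digits): `Decimal(0.1)` captures the exact binary value stored for 0.1, whereas the float product is rounded back to the nearest double, which here happens to be exactly 10000.0.

```python
from decimal import Decimal

a = 0.1
n = 100000

print(Decimal(a))         # exact value of the double nearest to 0.1
print(n * a == 10000.0)   # float product is rounded to the nearest double
print(Decimal(a) * n)     # Decimal product keeps the representation error visible
```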
3
votes
3
answers
181
views
For a float number a.bc without underflow and overflow, does a.bc.toString() always =="a.bc"? May it return something like "a.bc00000000000000000001"?
For example, I know 0.12 is not exactly equal to 12/100 in decimal because it needs to be rounded to some value. However, I'm not asking about the mathematical problem, but about the JavaScript spec that ...
5
votes
2
answers
168
views
Is the Java float to double cast lossy?
I am creating a float value using raw int bits where the mantissa has all bits set to 1. However, casting to double flips the final bit to 0. Why does this happen? From what I've been seeing online, ...
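Every IEEE 754 single-precision value is exactly representable in double precision, so the widening itself should not change any bit of the value. A sketch in Python that checks this with `struct` (Python floats are doubles); the bit pattern 0x3FFFFFFF and the helper names are chosen for illustration.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 and widen it to a Python double."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def float32_to_bits(value: float) -> int:
    """Narrow a double to float32 and return its 32-bit pattern."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


bits = 0x3FFFFFFF                  # exponent for [1, 2), all 23 mantissa bits set
widened = float32_from_bits(bits)  # held as a double from here on
print(widened, hex(float32_to_bits(widened)))
```

If a round trip like this preserves the pattern, an apparent bit flip is more likely to come from how the value is printed or inspected than from the conversion.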
0
votes
1
answer
83
views
Fortran selected_real_kind and MKL double precision
Following this best practices guide, I have a module in my Fortran code that defines a double precision type,
module kind_parameters
implicit none
public
! Double precision real numbers, ...