Range Reduction algorithm for simple precision float

Question

I am trying to implement a small library of mathematical functions for 32-bit floats (simple precision) as part of one of my java projects. When it comes to calculating the sine of very large arguments, I can't manage to implement the Payne-Hanek range-reduction algorithm.

I know there are many threads about this algorithm on stackoverflow, and I've looked at and analyzed various explanatory articles such as ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (K. C. Ng and the members of the FP group of SunPro), so I think I understand it pretty well for double precision float, but after several days of reflection I still can't implement this algorithm in my 32-bits case. Is there a place where I could find a clear explanation of my particular case, and if not, a similar algorithm for my problem? (For the sinus of average argument x, I already use the Cody-Waite reduction, which however becomes very imprecise above 2^23 (about 10e9)).

Thanks in advance for your help, I'm still a programming novice and this is my first question on StackOverflow. I apologize for the possible mistakes in my English as it is not my first language.

EDIT :

In my implementation attempt, I've taken up the method described in ARGUMENT REDUCTION FOR HUGE ARGUMENTS: Good to the Last Bit (https://redirect.cs.umbc.edu/~phatak/645/supl/Ng-ArgReduction.pdf), which I've tried to modify for 32-bit floats as follows (I won't re-explain the principle of the algorithm to avoid a lengthy message):

The result is a 23-bit mantissa for float x, with an exponent between -128 and 127. In the case of simple precision, I have considered that the float closest to a multiple of pi/2 y is at least 2^-31 (It might be 30 or 29 but I never explicitly calculated it). For our calculation, we must therefore store at least T bits of 2/pi with :

T = max_exponent + mantissa + bits_for_worst_reduction_case = 23 + 127 + 31 = 181.

Since these bits will be stored as int (32bits), I choose to directly store 192 of them (i.e. 6 blocks).however, I use this algorithm for floats greater than 2^23 (min_exponent = 23). Posing M the number of trailing zeros before the binary point, the only part used among all these stored bits is the bits between the M-1st and the M+bits_for_worst_reduction_case+min_exposant = M+54th

(the reason for using only these bits is very well described in the paper presented above, but quickly :

The bits preceding the M-2nd included are such that x * these bits = 4*N where N is an odd integer by definition of M.

The bits following the M+55th are such that x*these bits <= 2^32 (so it doesn't matter in our calculation, even in the worst-case scenario).

The only thing that remains is to calculate x * B. The method is explained in the article, with the difference that here, in order to obtain perfectly equal result, we need to separate x and B into x_i and b_j blocks of at most 10 bits in length, so that x_i * b_j is a 20 significant bit number and the final multiplication (which for each result block y_i requires 2 additions of x_i * b_j sub-blocks) is a n = 22 significant bit number (so that we have n<23 and those bits can fit in a float mantissa).

Now that we have the results block y_i, we can have the remainder r of x%(pi/2) with :

y_i = y_i%4
y = sum(y_i, 1<i<6)
t = (y1-y) + sum(y_i, 2<i<6)
k = round(y)
f = (y-k) + t
r = f*pi/2.

My attempt of implementation is below. There is a few mistakes :

I do not know how to compute the bits of 2OverPi=0.6366... . Should I put them as the int 6366... or as the bits of the float 0.6366...
I can't manage to compute the number of trailing zeros before the binary point of my float. My attempt is in the countTrailingZeros() method.


public static final float piOver2 = 1.57079637051f; //0x3fc90fdb;

    /* 192 bits de Pi/2 for reduction : */
    public static final int TwoOverPiBits[] = {
            0x00000000,
            0x28be60db,
            0x9391054a,
            0x7f09d5f4,
            0x7d4d3770,
            0x36d8a566,
            0x4f10e410
    };

    // return the exponant of a float in simple precision
    public int getExp(float x) {
        int bits = Float.floatToIntBits(x);
        int exp = ((bits >> 23) & 0xFF) - 127;
        return exp;
    }

    // return the mantissa of a float in simple precision
    public int getMant(float x) {
        int bits = Float.floatToIntBits(x);
        int mant = bits & 0x7fffff;
        return mant;
    }

    // return the sign of a float in simple precision
    public int getSgn(float x) {
        int bits = Float.floatToIntBits(x);
        int sgn = (bits >> 31) & 1;
        return sgn;
    }

    // Method to extract 55 bits from TwoOverPiBits for the index k and stock them in 10 bits blocks
    public int[] packBits(int k) {
        int numPackets = 6;
        int[] packets = new int[numPackets];

        int bitIndex = k; // Index of first bit
        for (int i = 0; i < numPackets; i++) {
            packets[i] = extract12Bits(bitIndex);
            bitIndex += 10;
        }

        return packets;
    }

    // Method to extract 10 bits from TwoOverPiBits at a given index
    public int extract12Bits(int bitIndex) {
        int intIndex = bitIndex / 32; // Index de l'entier dans le tableau
        int bitOffset = bitIndex % 32; // Décalage du bit dans l'entier

        int result;
        if (bitOffset <= 22) {
            // All bits are in the same TwoOverPiBits block
            result = (deuxSurPiBits[intIndex] >> bitOffset) & 0x3FF; // Masqk to extract 10 bits
        } else {
            // 10 bits are in two cosecutive TwoOverPiBits blocks
            int bitsFirstPart = deuxSurPiBits[intIndex] >>> bitOffset;
            int bitsSecondPart = deuxSurPiBits[intIndex + 1] & ((1 << (10 - (32 - bitOffset))) - 1);
            result = (bitsSecondPart << (32 - bitOffset)) | bitsFirstPart;
        }

        return result;
    }

    //Methode to separate a float in 4 10 bits blocks
    public int[] extract12BitPackets(float x) {

        int bits = Float.floatToIntBits(x);

        int[] packets = new int[4];

        // Extract the 4 blocks : 32 bits in 2-10-10-10
        // First block : 10 bits but only 2 that are informative : 1 sign bit and the first exposant bit
        // Second block : 7 exposant bits and 3 mantissa bits
        packets[1] = (bits >> 20) & 0x3FF;

        // third block : 10 mantissa bits
        packets[2] = (bits >> 10) & 0x3FF;

        // last block : 10 last mantissa bits
        packets[3] = (bits & 0x3FF);

        return packets;
    }
    
    //Method to count number of trailing zeros before the binary point
    public int countTrailingZeros(int number) {
        if (number == 0) return 32;

        int count = 0;
        while ((number & 1) == 0) {
            count++;
            number >>= 1;
        }
        return count;
    }


    public float payneHanek(float x) {

        if (x<0){
            return -payneHanek(-x);
        }
        int q;
        float f, r;
        int k = getExp(x);
        int mant = getMant(x);
        int M = (k - 23) + countTrailingZeros(mant);

        int[] bits552Overpi = packBits(M - 6); //We begin at M-6 but we then erase the first five bits with the &0x3FF
        int 2Overpibits5 = bits552Overpi[5];
        int 2Overpibits4 = bits552Overpi[4];
        int 2Overpibits3 = bits552Overpi[3];
        int 2Overpibits2 = bits552Overpi[2];
        int 2Overpibits1 = bits552Overpi[1];
        int 2Overpibits0 = bits552Overpi[0] & 0x3FF;

        //Separate x in 10 bits block
        int[] bits32x = extract12BitPackets(x);
        int xbits3 = bits32x[3];
        int xbits2 = bits32x[2];
        int xbits1 = bits32x[1];
        int xbits0 = bits32x[0];

        //x*2OverPi with the blocks
        int y8 = xbits3 * 2Overpibits5;
        int y7 = xbits3 * 2Overpibits4 + xbits2 * 2Overpibits5;
        int y6 = xbits3 * 2Overpibits3 + xbits2 * 2Overpibits4 + xbits1 * 2Overpibits5;
        int y5 = xbits3 * 2Overpibits2 + xbits2 * 2Overpibits3 + xbits1 * 2Overpibits4 + xbits0 * 2Overpibits5;
        int y4 = xbits3 * 2Overpibits1 + xbits2 * 2Overpibits2 + xbits1 * 2Overpibits3 + xbits0 * 2Overpibits4;
        int y3 = xbits3 * 2Overpibits0 + xbits2 * 2Overpibits1 + xbits1 * 2Overpibits2 + xbits0 * 2Overpibits3;
        int y2 = xbits2 * 2Overpibits0 + xbits1 * 2Overpibits1 + xbits0 * 2Overpibits2;
        int y1 = xbits1 * 2Overpibits0 + xbits0 * 2Overpibits1;
        int y0 = xbits0 * 2Overpibits0;


        //Modulo of each blocks : those variables are exact
        y8 = y8 % 4;
        y7 = y7 % 4;
        y6 = y6 % 4;
        y5 = y5 % 4;
        y4 = y4 % 4;
        y3 = y3 % 4;
        y2 = y2 % 4;
        y1 = y1 % 4;
        y0 = y0 % 4;

        //final result :
        float y = y0 + (y1 + (y2 + (y3 + (y4 + (y5 + (y6 + (y7 + y8)))))));
        //error :
        int error = (((((((y1 - y) + y2) + y3) + y4) + y5) + y6) + y7) + y8;
        q = (int) Math.rint(y);
        f = (y - q) + error;
        r = f * piSur2;
        
        //Calculate sin(r)....

Please provide enough code so others can better understand or reproduce the problem. — Community
– Community Bot, Commented Jan 18, 2024 at 9:09
Hello, I apologize for my late reply. I've edited my post, adding a link to the article and my code with as clear an explanation as possible. Thank you in advance for any help you can provide. Don't hesitate to let me know if there are any errors or if you need further clarification. For the 2OverPi bits, I have so far recovered those described in this post: stackoverflow.com/q/30463616/23259690 — dananr
– dananr, Commented Jan 18, 2024 at 9:25

Collectives™ on Stack Overflow

Range Reduction algorithm for simple precision float

0

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Sign up or log in

Post as a guest

Linked