Fast Inverse Square Root in Fortran
Fast Inverse Square Root in Fortran
In the early days of computing, the square root operation was one of the slowest operations possible. As such, computer scientists and mathematicians worked hard to develop efficient algorithms to evaluate the square root.
Evaluating the square root is often necessary to normalize a vector. For example, we may want to calculate the direction between two coordinates, and
. Then, the direction between these two vectors,
is
We can define the inverse square root function as
Evaluating the inverse square root in a fast fashion is often important to video games and other graphics processing. The common method introduces an approximation and uses bit manipulation. An example implementation can be found in the “Quake III Arena” video game and more information can be found on Wikipedia.
Bit manipulation is not as common in Fortran as it is in C but can be performed with clever use of the transfer
function. To investigate the efficient evaluation of the inverse square root function in Fortran, we will consider three possibilities:
- Linking with a C function.
- Using Fortran intrinsics and the
transfer
function. - Using the default
sqrt
function in Fortran.
Linking with a C Program
The traditional form of the fast inverse square root algorithm is written in C to allow for simple bit manipulation. To use the algorithm in Fortran, one option is to call a C implementation of the algorithm using the Fortran 2003 standard C interface. This consists of a fair amount of boilerplate code.
First, the conventional fast inverse square root algorithm. This implementation uses a union
to avoid undefined behavior in accordance with the C standard.
/* fast_inv_sqrt.c */
float fast_inv_sqrt(float * x_in)
{
const float x = *x_in;
const float x2 = x*0.5F;
union {
float f;
uint32_t i;
} conv = { .f = x };
conv.i = 0x5f3759df - ( conv.i >> 1 );
conv.f *= 1.5f - ( x2 * conv.f * conv.f );
/* Optionally, include another Newton iteration. */
return conv.f;
}
Next, we need a Fortran interface for the C function.
! mod_c_interface.f90
module c_interface
use, intrinsic :: ISO_C_BINDING, only : C_FLOAT
IMPLICIT NONE
interface
real(C_FLOAT) function fast_inv_sqrt(x) bind(c)
use, intrinsic :: ISO_C_BINDING, only : C_FLOAT
real(C_FLOAT), intent(in) :: x
endfunction fast_inv_sqrt
endinterface
endmodule c_interface
Finally, we can call the C function from a Fortran program by use
-ing the Fortran module containing the C interface.
! main.f90
program main
use, intrinsic :: ISO_C_BINDING, only : C_FLOAT
use c_interface
IMPLICIT NONE
real(C_FLOAT), parameter :: x = 2.0
write(*,*) fast_inv_sqrt(x)
endprogram main
Using the GCC compilers, these files can be compiled as follows.
gcc -ansi -O3 -Wall -Wextra -c fast_inv_sqrt.c
gfortran -std=f2003 -O3 -Wall -Wextra -o exec.x mod_c_interface.f90 main.f90 fast_inv_sqrt.o
Using Fortran Intrinsics and transfer
Alternatively, we could use the transfer
function to perform the necessary bit manipulation in Fortran without ever calling a C function. More information about transfer
can be found in the GNU documentation.
First, we write the fast inverse square root algorithm in Fortran. transfer
is used to perform the type punning and the bit-shift (i.e., <<
in C) is replaced with a division operation (i.e., /2
).
! f_fast_inf_sqrt.f90
real(4) function f_fast_inv_sqrt(x)
IMPLICIT NONE
real(4), intent(in) :: x
real(4) :: x2, y
integer(4) :: i
x2 = x*5e-1
y = x
i = transfer(y, 1_4)
i = 1597463007 - ( i / 2 )
y = transfer(i, 1e0)
y = y * (1.5e0 - (x2 * y**2))
! Optionally, include another Newton iteration.
f_fast_inv_sqrt = y
return
endfunction f_fast_inv_sqrt
Then, we write a simple Fortran program to call the Fortran implementation of the fast inverse square root function.
! main.f90
program main
IMPLICIT NONE
real(4), external :: f_fast_inv_sqrt
real(4), parameter :: x = 2.0
write(*,*) f_fast_inv_sqrt(x)
endprogram main
And using the gfortran
compiler, this example can be compiled as follows.
gfortran -std=f2003 -O3 -Wall -Wextra -o exec.x f_fast_inv_sqrt.f90 main.f90
Using sqrt
Function
Finally, we could simply evaluate the inverse square root using the “slow” intrinsic sqrt
function in Fortran. After all, how slow could it be?
f = 1.0/sqrt(x)
Results
I have ran some simple timing tests to compare the three implementations above. The tests are very simple and simply call the function of interest one-billion times. Note: we have to be a bit clever here. I am a firm believer that if we want to compare these methods, we should compare the optimized executables (i.e., compare -O3
rather than -O0
). The problem is, the optimization can be too good. If we simply use the intrinsic sqrt
function and evaluate it many times, the optimizer will replace it with a single evaluation. The code below is how I tested these methods.
program main
use timing ! some timing module providing tic and toc
use, intrinsic :: ISO_C_BINDING, only : C_FLOAT
IMPLICIT NONE
integer :: i
integer, parameter :: maxiter = 1000000000
real(C_FLOAT) :: xi
real(8) :: x
x = 0d0
call tic()
do i = 1,maxiter
call random_number(xi)
xi = xi + 1_C_FLOAT ! algorithm can perform poorly for zero
x = x + 1.0/sqrt(xi)
enddo
call toc()
endprogram main
My hardware is an AMD Ryzen 9 3900X. Total runtime in seconds is in the table below.
Method | Elapsed [s] |
---|---|
C | 5.057 |
Fortran | 13.697 |
Intrinsic | 4.050 |
Conclusions
So what does any of this mean? I see two main conclusions.
First and most importantly, you should be using the intrinsic Fortran functions. Unless you have some very specialized application, they are typically going to be the fastest option available. In this case also, the fast inverse square root algorithm introduces approximation errors and this is avoided by using sqrt
. As with most of software design, it is more important to have a working algorithm before we worry about runtime. Don’t optimize too early. And if sqrt
evaluation is restrictive on modern computers, you may need a new algorithm.
Second, bit manipulation is probably best left to C. Performing bit manipulation using transfer
in Fortran is almost three-times slower than linking to a C function. If you need bit manipulation, it is probably better to use a language that supports it natively.