float(3) BSD Library Functions Manual float(3)

## NAME

float -- description of floating-point types available on OS X and iOS

## DESCRIPTION

This page describes the available C floating-point types. For a list of math library functions that operate on these types, see the page on the math library, "man math".

## TERMINOLOGY

Floating-point numbers are represented in three parts: a sign, a mantissa (or significand), and an exponent. Given such a representation with sign s, mantissa m, and exponent e, the corresponding numerical value is s*m*2**e.

Floating-point types differ in the number of bits of accuracy in the mantissa (called the precision), and in the set of available exponents (the exponent range).

Floating-point numbers with the maximum available exponent are reserved operands, denoting an infinity if the significand is precisely zero, and a Not-a-Number, or NaN, otherwise.

Floating-point numbers with the minimum available exponent are either zero, if the significand is precisely zero, or denormal otherwise. Note that zero is signed: +0 and -0 are distinct floating-point numbers.

Floating-point numbers with exponents other than the maximum and minimum available are called normal numbers.

## PROPERTIES OF IEEE-754 FLOATING-POINT

Basic arithmetic operations in IEEE-754 floating-point are correctly rounded: this means that the result delivered is the same as the result that would be achieved by computing the exact real-number operation on the operands, then rounding the real-number result to a floating-point value.

Overflow occurs when the value of the exact result is too large in magnitude to be represented in the floating-point type in which the computation is being performed; doing so would require an exponent outside of the exponent range of the type. By default, computations that result in overflow return a signed infinity.

Underflow occurs when the value of the exact result is too small in magnitude to be represented as a normal number in the floating-point type in which the computation is being performed. By default, underflow is gradual, and produces a denormal number or a zero.

All floating-point numbers of a given type are integer multiples of the smallest non-zero floating-point number of that type; however, the converse is not true. This means that, in the default mode, (x-y) = 0 only if x = y.

The sign of zero transforms correctly through multiplication and division, and is preserved by addition of zeros with like signs, but x - x yields +0 for every finite floating-point number x. The only operations that reveal the sign of a zero are x/(+-0) and copysign(x,+-0). In particular, comparisons (x > y, x != y, etc) are not affected by the sign of zero.

The sign of infinity transforms correctly through multiplication and division, and infinities are unaffected by addition or subtraction of any finite floating-point number. But Inf-Inf, Inf*0, and Inf/Inf are, like 0/0 or sqrt(-3), invalid operations that produce NaN.

NaNs are the default results of invalid operations, and they propagate through subsequent arithmetic operations. If x is a NaN, then x != x is TRUE, and every other comparison predicate (x > y, x = y, x <= y, etc) evaluates to FALSE, regardless of the value of y.
Predicates that entail an ordered comparison (rather than mere equality or inequality) additionally signal Invalid Operation when one of the arguments is NaN.

IEEE-754 provides five kinds of floating-point exceptions, listed below:

| Exception         | Default Result    |
|-------------------|-------------------|
| Invalid Operation | NaN or FALSE      |
| Overflow          | +-Infinity        |
| Divide by Zero    | +-Infinity        |
| Underflow         | Gradual Underflow |
| Inexact           | Rounded Value     |

NOTE: An exception is not an error unless it is handled incorrectly. What makes a class of exceptions exceptional is that no single default response can be satisfactory in every instance. On the other hand, because a default response will serve most instances of the exception satisfactorily, simply aborting the computation cannot be justified.

For each kind of floating-point exception, IEEE-754 provides a flag that is raised each time its exception is signaled, and remains raised until the program resets it. Programs may test, save, and restore the flags, or a subset thereof.

## PRECISION AND EXPONENT RANGE OF SPECIFIC FLOATING-POINT TYPES

On both OS X and iOS, the type float corresponds to IEEE-754 single precision. A single-precision number is represented in 32 bits, and has a precision of 24 significant bits, roughly like 7 significant decimal digits. 8 bits are used to encode the exponent, which gives an exponent range from -126 to 127, inclusive.

The header <float.h> defines several useful constants for the float type:

- FLT_MANT_DIG: the number of binary digits in the significand of a float.
- FLT_MIN_EXP: one more than the smallest exponent available in the float type.
- FLT_MAX_EXP: one more than the largest exponent available in the float type.
- FLT_DIG: the precision in decimal digits of a float. A decimal value with this many digits, stored as a float, always yields the same value up to this many digits when converted back to decimal notation.
- FLT_MIN_10_EXP: the smallest n such that 10**n is a non-zero normal number as a float.
- FLT_MAX_10_EXP: the largest n such that 10**n is finite as a float.
- FLT_MIN: the smallest positive normal float.
- FLT_MAX: the largest finite float.
- FLT_EPSILON: the difference between 1.0 and the smallest float bigger than 1.0.

On both OS X and iOS, the type double corresponds to IEEE-754 double precision. A double-precision number is represented in 64 bits, and has a precision of 53 significant bits, roughly like 16 significant decimal digits. 11 bits are used to encode the exponent, which gives an exponent range from -1022 to 1023, inclusive.

The header <float.h> defines several useful constants for the double type:

- DBL_MANT_DIG: the number of binary digits in the significand of a double.
- DBL_MIN_EXP: one more than the smallest exponent available in the double type.
- DBL_MAX_EXP: one more than the largest exponent available in the double type.
- DBL_DIG: the precision in decimal digits of a double. A decimal value with this many digits, stored as a double, always yields the same value up to this many digits when converted back to decimal notation.
- DBL_MIN_10_EXP: the smallest n such that 10**n is a non-zero normal number as a double.
- DBL_MAX_10_EXP: the largest n such that 10**n is finite as a double.
- DBL_MIN: the smallest positive normal double.
- DBL_MAX: the largest finite double.
- DBL_EPSILON: the difference between 1.0 and the smallest double bigger than 1.0.

On Intel Macs, the type long double corresponds to IEEE-754 double extended precision. A double extended number is represented in 80 bits, and has a precision of 64 significant bits, roughly like 19 significant decimal digits. 15 bits are used to encode the exponent, which gives an exponent range from -16382 to 16383, inclusive.

The header <float.h> defines several useful constants for the long double type:

- LDBL_MANT_DIG: the number of binary digits in the significand of a long double.
- LDBL_MIN_EXP: one more than the smallest exponent available in the long double type.
- LDBL_MAX_EXP: one more than the largest exponent available in the long double type.
- LDBL_DIG: the precision in decimal digits of a long double. A decimal value with this many digits, stored as a long double, always yields the same value up to this many digits when converted back to decimal notation.
- LDBL_MIN_10_EXP: the smallest n such that 10**n is a non-zero normal number as a long double.
- LDBL_MAX_10_EXP: the largest n such that 10**n is finite as a long double.
- LDBL_MIN: the smallest positive normal long double.
- LDBL_MAX: the largest finite long double.
- LDBL_EPSILON: the difference between 1.0 and the smallest long double bigger than 1.0.

On ARM iOS devices, the type long double corresponds to IEEE-754 double precision. Thus, the values of the LDBL_* macros are identical to those of the corresponding DBL_* macros.

## SEE ALSO

math(3), complex(3)

## STANDARDS

Floating-point arithmetic conforms to the ISO/IEC 9899:2011 standard.

BSD March 28, 2007 BSD

Mac OS X 10.9.1 - Generated Tue Jan 7 19:42:11 CST 2014