@c Mirror of https://sourceware.org/git/glibc.git, synced 2025-11-03 20:53:13 +03:00
@node Floating-Point Limits
@chapter Floating-Point Limits
@pindex <float.h>
@cindex floating-point number representation
@cindex representation of floating-point numbers

Because floating-point numbers are represented internally as approximate
quantities, algorithms for manipulating floating-point data often need
to be parameterized in terms of the accuracy of the representation.
Some of the functions in the C library itself need this information; for
example, the algorithms for printing and reading floating-point numbers
(@pxref{I/O on Streams}) and for calculating trigonometric and
irrational functions (@pxref{Mathematics}) use information about the
underlying floating-point representation to avoid round-off error and
loss of accuracy.  User programs that implement numerical analysis
techniques also often need to be parameterized in this way in order to
minimize or compute error bounds.

The specific representation of floating-point numbers varies from
machine to machine.  The GNU C Library defines a set of parameters that
characterize each of the supported floating-point representations on a
particular system.

@menu
* Floating-Point Representation::   Definitions of terminology.
* Floating-Point Parameters::       Descriptions of the library facilities.
* IEEE Floating Point::             An example of a common representation.
@end menu

@node Floating-Point Representation
@section Floating-Point Representation

This section introduces the terminology used to characterize the
representation of floating-point numbers.

You are probably already familiar with most of these concepts in terms
of scientific or exponential notation for floating-point numbers.  For
example, the number @code{123456.0} could be expressed in exponential
notation as @code{1.23456e+05}, a shorthand notation indicating that the
mantissa @code{1.23456} is multiplied by the base @code{10} raised to
the power @code{5}.

More formally, the internal representation of a floating-point number
can be characterized in terms of the following parameters:

@itemize @bullet
@item
The @dfn{sign} is either @code{-1} or @code{1}.
@cindex sign (of floating-point number)

@item
The @dfn{base} or @dfn{radix} for exponentiation; an integer greater
than @code{1}.  This is a constant for the particular representation.
@cindex base (of floating-point number)
@cindex radix (of floating-point number)

@item
The @dfn{exponent} to which the base is raised.  The upper and lower
bounds of the exponent value are constants for the particular
representation.
@cindex exponent (of floating-point number)

Sometimes, in the actual bits representing the floating-point number,
the exponent is @dfn{biased} by adding a constant to it, to make it
always be represented as an unsigned quantity.  This is only important
if you have some reason to pick apart the bit fields making up the
floating-point number by hand, which is something for which the GNU
library provides no support.  So this is ignored in the discussion that
follows.
@cindex bias, in exponent (of floating-point number)

@item
The value of the @dfn{mantissa} or @dfn{significand}, which is an
unsigned quantity.
@cindex mantissa (of floating-point number)
@cindex significand (of floating-point number)

@item
The @dfn{precision} of the mantissa.  If the base of the representation
is @var{b}, then the precision is the number of base-@var{b} digits in
the mantissa.  This is a constant for the particular representation.

Many floating-point representations have an implicit @dfn{hidden bit} in
the mantissa.  Any such hidden bits are counted in the precision.
Again, the GNU library provides no facilities for dealing with such low-level
aspects of the representation.
@cindex precision (of floating-point number)
@cindex hidden bit, in mantissa (of floating-point number)
@end itemize

The mantissa of a floating-point number actually represents an implicit
fraction whose denominator is the base raised to the power of the
precision.  Since the largest representable mantissa is one less than
this denominator, the value of the fraction is always strictly less than
@code{1}.  The mathematical value of a floating-point number is then the
product of this fraction, the sign, and the base raised to the exponent.

If the floating-point number is @dfn{normalized}, the mantissa is also
greater than or equal to the base raised to the power of one less
than the precision (unless the number represents a floating-point zero,
in which case the mantissa is zero).  The fractional quantity is
therefore greater than or equal to @code{1/@var{b}}, where @var{b} is
the base.
@cindex normalized floating-point number

@node Floating-Point Parameters
@section Floating-Point Parameters

@strong{Incomplete:}  This section needs some more concrete examples
of what these parameters mean and how to use them in a program.

These macro definitions can be accessed by including the header file
@file{<float.h>} in your program.

Macro names starting with @samp{FLT_} refer to the @code{float} type,
while names beginning with @samp{DBL_} refer to the @code{double} type
and names beginning with @samp{LDBL_} refer to the @code{long double}
type.  (In implementations that do not support @code{long double} as
a distinct data type, the values for those constants are the same
as the corresponding constants for the @code{double} type.)@refill

Note that only @code{FLT_RADIX} is guaranteed to be a constant
expression, so the other macros listed here cannot be reliably used in
places that require constant expressions, such as @samp{#if}
preprocessing directives and array size specifications.

Although the ANSI C standard specifies minimum and maximum values for
most of these parameters, the GNU C implementation uses whatever
floating-point representations are supported by the underlying hardware.
So whether GNU C actually satisfies the ANSI C requirements depends on
what machine it is running on.

@comment float.h
@comment ANSI
@defvr Macro FLT_ROUNDS
This value characterizes the rounding mode for floating-point addition.
The following values indicate standard rounding modes:

@table @code
@item -1
The mode is indeterminable.
@item 0
Rounding is towards zero.
@item 1
Rounding is to the nearest number.
@item 2
Rounding is towards positive infinity.
@item 3
Rounding is towards negative infinity.
@end table

@noindent
Any other value represents a machine-dependent nonstandard rounding
mode.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_RADIX
This is the value of the base, or radix, of exponent representation.
This is guaranteed to be a constant expression, unlike the other macros
described in this section.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{float} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MANT_DIG
This is the number of base-@code{FLT_RADIX} digits in the floating-point
mantissa for the @code{long double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_DIG
This is the number of decimal digits of precision for the @code{float}
data type.  Technically, if @var{p} and @var{b} are the precision and
base (respectively) for the representation, then the decimal precision
@var{q} is the maximum number of decimal digits such that any
floating-point number with @var{q} base-10 digits can be rounded to a
floating-point number with @var{p} base-@var{b} digits and back again,
without change to the @var{q} decimal digits.

The value of this macro is guaranteed to be at least @code{6}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{double} data
type.  The value of this macro is guaranteed to be at least @code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_DIG
This is similar to @code{FLT_DIG}, but is for the @code{long double}
data type.  The value of this macro is guaranteed to be at least
@code{10}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_EXP
This is the minimum negative integer such that @code{FLT_RADIX} raised
to one less than this power can be represented as a normalized
floating-point number of type @code{float}.  In terms of the
actual implementation, this is just the smallest value that can be
represented in the exponent field of the number.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_EXP
This is similar to @code{FLT_MIN_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN_10_EXP
This is the minimum negative integer such that @code{10} raised to this
power lies in the range of normalized floating-point numbers of type
@code{float}.  This is guaranteed to be no greater than @code{-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN_10_EXP
This is similar to @code{FLT_MIN_10_EXP}, but is for the @code{long
double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_EXP
This is the maximum integer such that @code{FLT_RADIX} raised to one
less than this power can be represented as a finite floating-point
number of type @code{float}.  In terms of the actual
implementation, this is just the largest value that can be represented
in the exponent field of the number.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{double} data
type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_EXP
This is similar to @code{FLT_MAX_EXP}, but is for the @code{long double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX_10_EXP
This is the maximum integer such that @code{10} raised to this power
lies in the range of representable finite floating-point numbers of type
@code{float}.  This is guaranteed to be at least @code{37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{double}
data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX_10_EXP
This is similar to @code{FLT_MAX_10_EXP}, but is for the @code{long
double} data type.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MAX
The value of this macro is the maximum representable floating-point
number of type @code{float}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MAX
The value of this macro is the maximum representable floating-point
number of type @code{long double}, and is guaranteed to be at least
@code{1E+37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{float}, and is
guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{double}, and
is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_MIN
The value of this macro is the minimum normalized positive
floating-point number that is representable by type @code{long double},
and is guaranteed to be no more than @code{1E-37}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro FLT_EPSILON
This is the difference between @code{1.0} and the smallest
floating-point number of type @code{float} that is greater than
@code{1.0}, so @code{1.0 + FLT_EPSILON != 1.0} is true.  It's guaranteed
to be no greater than @code{1E-5}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro DBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{double}
type.  It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@comment float.h
@comment ANSI
@defvr Macro LDBL_EPSILON
This is similar to @code{FLT_EPSILON}, but is for the @code{long double}
type.  It is guaranteed to be no greater than @code{1E-9}.
@end defvr

@node IEEE Floating Point
@section IEEE Floating Point

Here is an example showing how these parameters work for a common
floating-point representation, specified by the @cite{IEEE Standard for
Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985)}.

The IEEE single-precision float representation uses a base of 2.  There
is a sign bit, a mantissa with 23 bits plus one hidden bit (so the total
precision is 24 base-2 digits), and an 8-bit exponent that can represent
values in the range -125 to 128, inclusive.

So, for an implementation that uses this representation for the
@code{float} data type, appropriate values for the corresponding
parameters are:

@example
FLT_RADIX                         2
FLT_MANT_DIG                     24
FLT_DIG                           6
FLT_MIN_EXP                    -125
FLT_MIN_10_EXP                  -37
FLT_MAX_EXP                     128
FLT_MAX_10_EXP                  +38
FLT_MIN             1.17549435E-38F
FLT_MAX             3.40282347E+38F
FLT_EPSILON         1.19209290E-07F
@end example