Floating-Point Numbers
“Floating point” refers to a set of data types that encode real numbers, including fractions and decimals. Floating-point data types allow for a varying number of digits after the decimal point, while fixed-point data types have a specific number of digits reserved before and after the decimal point. So, floating-point data types can represent a wider range of numbers than fixed-point data types.
Due to limited memory for number representation and storage, computers can represent a finite set of floating-point numbers that have finite precision. This finite precision can limit accuracy for floating-point computations that require exact values or high precision, as some numbers are not represented exactly. Despite their limitations, floating-point numbers are widely used due to their fast calculations and sufficient precision and range for solving real-world problems.
Floating-Point Numbers in MATLAB
MATLAB® has data types for double-precision (double) and single-precision (single) floating-point numbers following IEEE® Standard 754. By default, MATLAB represents floating-point numbers in double precision. Double precision allows you to represent numbers to greater precision but requires more memory than single precision. To conserve memory, you can convert a number to single precision by using the single function.
You can store numbers between approximately -3.4 × 10^38 and 3.4 × 10^38 using either double or single precision. If you have numbers outside of that range, store them using double precision.
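As a minimal sketch of this trade-off, the following compares the memory used by the same 1000-element vector in each precision, and then converts a value outside the single-precision range. The variable names are illustrative only.

```matlab
% Same data stored in double and single precision.
a = ones(1, 1000);           % double by default
b = single(a);               % converted to single precision

infoA = whos("a");
infoB = whos("b");
bytesDouble = infoA.bytes    % 8 bytes per element
bytesSingle = infoB.bytes    % 4 bytes per element

% 1e39 is finite in double precision but lies outside the single range.
big = 1e39;
bigAsSingle = single(big)    % overflows to Inf
```

Halving the storage is the main reason to choose single precision. The overflow to Inf shows why values beyond about 3.4 × 10^38 need to stay in double precision.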
Create Double-Precision Data
Because the default numeric type for MATLAB is type double, you can create a double-precision floating-point number with a simple assignment statement.
x = 10; c = class(x)
c = 'double'
You can convert numeric data, characters or strings, and logical data to double precision by using the double function. For example, convert a signed integer to a double-precision floating-point number.
x = int8(-113); y = double(x)
y = -113
Create Single-Precision Data
To create a single-precision number, use the single function.
x = single(25.783);
You can also convert numeric data, characters or strings, and logical data to single precision by using the single function. For example, convert a signed integer to a single-precision floating-point number.
x = int8(-113); y = single(x)
y = single -113
How MATLAB Stores Floating-Point Numbers
MATLAB constructs its double and single floating-point data types according to IEEE format and follows the round to nearest, ties to even rounding mode by default.
A floating-point number x has the form:
$$x = (-1)^s \cdot (1 + f) \cdot 2^e$$
where:
s determines the sign.
f is the fraction, or mantissa, which satisfies 0 ≤ f < 1.
e is the exponent.
s, f, and e are each determined by a finite number of bits in memory, with f and e depending on the precision of the data type.
Storage of a double number requires 64 bits, as shown in this table.
Bits | Width | Usage |
---|---|---|
63 | 1 | Stores the sign, where 0 is positive and 1 is negative |
62 to 52 | 11 | Stores the exponent, biased by 1023 |
51 to 0 | 52 | Stores the mantissa |
Storage of a single number requires 32 bits, as shown in this table.
Bits | Width | Usage |
---|---|---|
31 | 1 | Stores the sign, where 0 is positive and 1 is negative |
30 to 23 | 8 | Stores the exponent, biased by 127 |
22 to 0 | 23 | Stores the mantissa |
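To make the bit layout concrete, the following sketch splits a double into its three fields and rebuilds the value. It assumes the normalized form (-1)^s · (1 + f) · 2^e and uses typecast, bitshift, and bitand; the variable names are illustrative, and the reconstruction does not cover zero, subnormal values, Inf, or NaN.

```matlab
% Split a double into its sign, exponent, and fraction fields.
x = -6.5;                                   % -6.5 = -(1 + 0.625) * 2^2
bits = typecast(x, 'uint64');               % reinterpret the 64 bits as an integer

s = bitshift(bits, -63);                                % bit 63
eBiased = bitand(bitshift(bits, -52), uint64(2047));    % bits 62 to 52
fBits = bitand(bits, uint64(2^52 - 1));                 % bits 51 to 0

e = double(eBiased) - 1023;                 % remove the exponent bias
f = double(fBits) / 2^52;                   % fraction, 0 <= f < 1

xRebuilt = (-1)^double(s) * (1 + f) * 2^e   % reproduces x for normalized values
```

For this input the fields are s = 1, e = 2, and f = 0.625. The same approach applies to single by using 'uint32', a bias of 127, and a 23-bit fraction.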
Largest and Smallest Values for Floating-Point Data Types
The double- and single-precision data types have a largest and smallest value that you can represent. Numbers outside of the representable range are assigned positive or negative infinity. However, some numbers within the representable range cannot be stored exactly due to the gaps between consecutive floating-point numbers, and these numbers can have round-off errors.
Largest and Smallest Double-Precision Values
Find the largest and smallest positive values that can be represented with the double data type by using the realmax and realmin functions, respectively.
m = realmax
m = 1.7977e+308
n = realmin
n = 2.2251e-308
realmax and realmin return normalized IEEE values. You can find the largest and smallest negative values by multiplying realmax and realmin by -1. Numbers greater than realmax or less than -realmax are assigned the values of positive or negative infinity, respectively.
Largest and Smallest Single-Precision Values
Find the largest and smallest positive values that can be represented with the single data type by calling the realmax and realmin functions with the argument "single".
m = realmax("single")
m = single 3.4028e+38
n = realmin("single")
n = single 1.1755e-38
You can find the largest and smallest negative values by multiplying realmax("single") and realmin("single") by -1. Numbers greater than realmax("single") or less than -realmax("single") are assigned the values of positive or negative infinity, respectively.
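The following sketch shows what happens at both ends of the range. The variable names are illustrative. The last two lines rely on the fact that realmin is the smallest normalized value: IEEE arithmetic also provides smaller subnormal values below it before results reach zero.

```matlab
% Results beyond realmax overflow to infinity.
overD = realmax * 2              % Inf
underD = -realmax * 2            % -Inf
overS = realmax("single") * 2    % single Inf

% Below realmin, results become subnormal and eventually zero.
tiny = realmin / 2               % still nonzero (subnormal)
gone = realmin * eps / 2         % underflows to 0
```

If a computation can approach these limits, consider checking results with isinf or isfinite before using them.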
Largest Consecutive Floating-Point Integers
Not all integers are representable using floating-point data types. The largest consecutive integer, x, is the greatest integer for which all integers less than or equal to x can be exactly represented, but x + 1 cannot be represented in floating-point format. The flintmax function returns the largest consecutive integer. For example, find the largest consecutive integer in double-precision floating-point format, which is 2^53, by using the flintmax function.
x = flintmax
x = 9.0072e+15
Find the largest consecutive integer in single-precision floating-point format, which is 2^24.
y = flintmax("single")
y = single 16777216
When you convert an integer data type to a floating-point data type, integers that are not exactly representable in floating-point format lose accuracy. flintmax, which is a floating-point number, is less than the greatest integer representable by integer data types using the same number of bits. For example, flintmax for double precision is 2^53, while the maximum value for type int64 is 2^63 - 1. Therefore, converting an integer greater than 2^53 to double precision can result in a loss of accuracy.
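As an illustration, the following sketch takes an int64 value one above flintmax through a round trip to double and back. The variable names are illustrative.

```matlab
% An int64 value just above flintmax does not survive a round trip through double.
n = int64(flintmax) + int64(1);   % 2^53 + 1, exact in int64
d = double(n);                    % rounded to the nearest representable double
roundTrip = int64(d);             % back to int64
lostAmount = n - roundTrip        % 1
isExact = (roundTrip == n)        % false
```

The sum is formed in int64 arithmetic on purpose. Writing int64(2^53 + 1) would evaluate 2^53 + 1 in double precision first, so the extra 1 would already be lost before the conversion.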
Accuracy of Floating-Point Data
The accuracy of floating-point data can be affected by several factors:
Limitations of your computer hardware — For example, hardware with insufficient memory truncates the results of floating-point calculations.
Gaps between each floating-point number and the next larger floating-point number — These gaps are present on any computer and limit precision.
Gaps Between Floating-Point Numbers
You can determine the size of a gap between consecutive floating-point numbers by using the eps function. For example, find the distance between 5 and the next larger double-precision number.
e = eps(5)
e = 8.8818e-16
You cannot represent numbers between 5 and 5 + eps(5) in double-precision format. If a double-precision computation returns the answer 5, the result is accurate within eps(5). This radius of accuracy is often called machine epsilon.
The gaps between floating-point numbers are not equal. For example, the gap between 1e10 and the next larger double-precision number is larger than the gap between 5 and the next larger double-precision number.
e = eps(1e10)
e = 1.9073e-06
Similarly, find the distance between 5 and the next larger single-precision number.
x = single(5); e = eps(x)
e = single 4.7684e-07
Gaps between single-precision numbers are larger than the gaps between double-precision numbers because there are fewer single-precision numbers. So, results of single-precision calculations are less precise than results of double-precision calculations.
When you convert a double-precision number to a single-precision number, you can determine the upper bound for the amount the number is rounded by using the eps function. For example, when you convert the double-precision number 3.14 to single precision, the number is rounded by at most eps(single(3.14)).
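A quick check of this bound can be sketched as follows. The variable names are illustrative, and the error is measured in double precision so that the comparison itself does not add further rounding.

```matlab
% Rounding error from converting 3.14 to single precision.
d = 3.14;
s = single(d);
err = abs(double(s) - d)        % actual rounding error
bound = double(eps(s))          % gap between s and the next larger single
withinBound = (err <= bound)    % true
```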
Gaps Between Consecutive Floating-Point Integers
The flintmax function returns the largest consecutive integer in floating-point format. Above this value, consecutive floating-point integers have a gap greater than 1.
Find the gap between flintmax and the next floating-point number by using eps:
format long
x = flintmax
x = 9.007199254740992e+15
e = eps(x)
e = 2
Because eps(x) is 2, the next larger floating-point number that can be represented exactly is x + 2.
y = x + e
y = 9.007199254740994e+15
If you add 1 to x, the result is rounded to x.
z = x + 1
z = 9.007199254740992e+15
Arithmetic Operations on Floating-Point Numbers
You can use a range of data types in arithmetic operations with floating-point numbers, and the data type of the result depends on the input types. However, when you perform operations with different data types, some calculations may not be exact due to approximations or intermediate conversions.
Double-Precision Operands
You can perform basic arithmetic operations with double and any of the following data types. If one or more operands are an integer scalar or array, the double operand must be a scalar. The result is of type double, except where noted otherwise.
single: The result is of type single.
double
int8, int16, int32, int64: The result is of the same data type as the integer operand.
uint8, uint16, uint32, uint64: The result is of the same data type as the integer operand.
char
logical
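The rules in this list can be checked with the class function. The following sketch uses a scalar double operand and illustrative variable names. The last two lines show a related behavior that the list does not describe: integer results are rounded to the nearest integer and saturate at the limits of the integer type.

```matlab
% Result class when a double scalar is combined with other types.
d = 2.5;
c1 = class(d + single(1))     % 'single'
c2 = class(d + int8(10))      % 'int8'
c3 = class(d * uint16(4))     % 'uint16'
c4 = class(d + 'a')           % 'double' ('a' is treated as 97)
c5 = class(d + true)          % 'double'

% Integer results are rounded and saturated to the integer range.
r = d + int8(10)              % 12.5 rounds to 13
sat = 100 * int8(2)           % 200 saturates at 127
```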
Single-Precision Operands
You can perform basic arithmetic operations with single and any of the following data types. The result is of type single.
single
double
char
logical
Unexpected Results with Floating-Point Arithmetic
Almost all operations in MATLAB are performed in double-precision arithmetic conforming to IEEE Standard 754. Because computers represent numbers to a finite precision, some computations can yield mathematically nonintuitive results. Some common issues that can arise while computing with floating-point numbers are round-off error, cancellation, swamping, and intermediate conversions. The unexpected results are not bugs in MATLAB and occur in any software that uses floating-point numbers. For exact rational representations of numbers, consider using the Symbolic Math Toolbox™.
Round-Off Error
Round-off error can occur due to the finite-precision representation of floating-point numbers. For example, the number 4/3 cannot be represented exactly as a binary fraction. As such, this calculation returns the quantity eps(1), rather than 0.
e = 1 - 3*(4/3 - 1)
e = 2.2204e-16
Similarly, because pi is not an exact representation of π, sin(pi) is not exactly zero.
x = sin(pi)
x = 1.2246e-16
Round-off error is most noticeable when many operations are performed on floating-point numbers, allowing errors to accumulate and compound. A best practice is to minimize the number of operations whenever possible.
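As a small illustration of accumulated round-off, the following sketch adds 0.1 ten times and compares the result with 1. The tolerance of 10*eps(1) is an assumption chosen for this example, not a general recommendation.

```matlab
% Accumulated round-off: adding 0.1 ten times does not give exactly 1.
total = 0;
for k = 1:10
    total = total + 0.1;
end
isExactlyOne = (total == 1)             % false
diffFromOne = abs(total - 1)            % on the order of eps(1)

% Compare with a tolerance instead of testing for exact equality.
tol = 10 * eps(1);
isCloseToOne = (abs(total - 1) <= tol)  % true
```

Comparing against a tolerance scaled to the magnitude of the operands is a common way to make such checks robust.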
Cancellation
Cancellation can occur when you subtract a number from another number of roughly the same magnitude, as measured by eps. For example, eps(2^53) is 2, so the numbers 2^53 + 1 and 2^53 have the same floating-point representation.
x = (2^53 + 1) - 2^53
x = 0
When possible, try rewriting computations in an equivalent form that avoids cancellations.
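One possible rewrite, sketched here for the expression sqrt(x + 1) - sqrt(x), multiplies by the conjugate so that the subtraction disappears. The expression and the value of x are chosen for illustration only.

```matlab
% sqrt(x + 1) - sqrt(x) subtracts two nearly equal numbers when x is large.
x = 1e15;
naive = sqrt(x + 1) - sqrt(x)            % most significant digits cancel
stable = 1 / (sqrt(x + 1) + sqrt(x))     % algebraically equal, no subtraction
relDiff = abs(naive - stable) / stable   % far larger than eps
```

Both forms are mathematically identical, but only the second keeps close to full double precision. For some common cases MATLAB also provides dedicated functions, such as expm1 for exp(x) - 1 and log1p for log(1 + x).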
Swamping
Swamping can occur when you perform operations on floating-point numbers that differ by many orders of magnitude. For example, this calculation shows a loss of precision that makes the addition insignificant.
x = 1 + 1e-16
x = 1
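The order of operations can change how much is lost to swamping. In the following sketch, which uses illustrative variable names, adding 1000 small terms to 1 one at a time loses all of them, while summing the small terms first preserves their combined contribution.

```matlab
% Adding small terms to a large one, one at a time, loses every small term.
small = 1e-16 * ones(1, 1000);

lost = 1;
for k = 1:numel(small)
    lost = lost + small(k);     % each addition rounds back to 1
end
lost                            % 1

% Summing the small terms first lets them accumulate before meeting the large term.
kept = 1 + sum(small)           % approximately 1 + 1e-13
```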
Intermediate Conversions
When you perform arithmetic with different data types, intermediate calculations and conversions can yield unexpected results. For example, although x and y are both 0.2, subtracting them yields a nonzero result. The reason is that y is first converted to double before the subtraction is performed. This subtraction result is then converted to single, z.
format long
x = 0.2
x = 0.200000000000000
y = single(0.2)
y = single 0.2000000
z = x - y
z = single -2.9802323e-09
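The conversion order described above can be written out explicitly. The following sketch assumes that order, uses illustrative variable names, and also shows that converting both operands to the same type first gives zero.

```matlab
x = 0.2;
y = single(0.2);

% Mixed-type subtraction, and the same steps written out explicitly.
zMixed = x - y
zExplicit = single(x - double(y))
sameResult = (zMixed == zExplicit)   % should be true if the order is as described

% Converting both operands to single first removes the discrepancy.
zSingle = single(x) - y              % single 0
```

Converting operands to a common type before the operation makes the rounding step explicit and the result easier to predict.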
Linear Algebra
Common issues in floating-point arithmetic, such as the ones described above, can compound when applied to linear algebra problems because the related calculations typically consist of multiple steps. For example, when solving the system of linear equations Ax = b, MATLAB warns that the results may be inaccurate because operand matrix A is ill conditioned.
A = diag([2 eps]); b = [2; eps]; x = A\b;
Warning: Matrix is close to singular or badly scaled. Results may be inaccurate. RCOND = 1.110223e-16.
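You can inspect the conditioning of A directly with rcond and cond. The following sketch also rescales the rows of this particular diagonal system. The row scaling is an illustrative remedy for this example, not a general fix for ill-conditioned problems.

```matlab
A = diag([2 eps]);
b = [2; eps];

rc = rcond(A)       % close to eps, which signals an ill-conditioned matrix
c = cond(A)         % 2-norm condition number, about 2/eps

% Rescaling each row by its diagonal entry gives an equivalent system.
D = diag(1 ./ diag(A));
xScaled = (D*A) \ (D*b)     % [1; 1]
```

After scaling, the coefficient matrix is the identity, so the solve completes without the warning.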