·
Floating point representation is used to represent real numbers (i.e. numbers with
fractions)
·
Floating point representation supports a huge range with a reasonable storage size
·
Examples:
A 32-bit storage size (e.g. IEEE 754 single precision) can represent a number
o
As large as about 10^38
o
As small as about 10^-45
·
Floating
point representation suffers from the following main disadvantages:
o
Potential
loss of precision due to limited number of significant digits
o
Relatively large storage requirements
o
Slow
calculations
·
This chapter covers:
o
Review of exponential notation
o
Floating point representation in computers
o
Floating point calculations
o
IEEE 754 floating point standard
o
Packed Decimal Format (BCD)
o
Overflow and Underflow Conditions
·
Exponential notation or scientific notation is a conventional
method for representing floating point numbers
·
Exponential notation format consists of 6 components
o
Mantissa
o
Sign of the mantissa
o
Exponent
o
Sign of the exponent
o
Base of the exponent
o
Location of the fraction point
Example:
-50.5 x 10^-20
·
Fraction
point position is flexible and can be adjusted without changing the number
magnitude
·
Changes to the fraction point position require an adjustment to the exponent
o
For
every move to the right, the exponent must be decremented
o
For
every move to the left, the exponent must be incremented
·
However, shifting should not be arbitrary, since it may affect the precision of the
number
·
If we limit the mantissa to 5 digits, the last representation results in the loss of 1
digit of precision (i.e. an error)
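The point-shifting rule can be checked with a short sketch. It uses Python's `decimal` module, which is a choice made for this illustration and not something the notes prescribe, to confirm that moving the fraction point while adjusting the exponent leaves the value unchanged.

```python
from decimal import Decimal

original = Decimal("-50.5E-20")

# Move the point one place to the right: decrement the exponent.
shifted_right = Decimal("-505E-21")

# Move the point one place to the left: increment the exponent.
shifted_left = Decimal("-5.05E-19")

# All three spellings denote the same number.
print(original == shifted_right == shifted_left)   # True
```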
·
Computers use a representation method very similar to exponential notation
·
Binary
is used instead of decimal
·
Storage sizes of 32, 64, and 128 bits are typically used
·
See Figure 5.4 on page 132 for a typical 32-bit floating point format
o
Leftmost
bit is the mantissa sign
o
Followed by an 8-bit exponent
o
Followed by a 23-bit mantissa
o
The
fraction point is implied to be at the beginning of the mantissa
o
Exponent
is stored in excess-128 notation
o
Base
of the exponent is implied as base 2
·
Computers use many different proprietary and standard floating point representation
methods
·
The
following standards are in common use and will be studied later in the chapter
o
IEEE
754 Floating Point representation standard
·
Assume
the following decimal floating point format: SMMMMMMM
(S = mantissa sign, M = mantissa, decimal point is
implied at the end)
·
The
above format provides a range of: ±9999999
·
Let’s introduce 2 exponent digits (EE) and an exponent sign in place of 3 mantissa digits: SSEEMMMM (the second S is the exponent sign)
·
The above format provides a range of: ±0001 x 10^-99 to ±9999 x 10^99
·
With this representation we have traded off 3 digits of precision (7 mantissa digits down to 4) to increase the
range
·
There exists a trade-off between precision and range
o
The
more digits assigned for the mantissa, the higher the precision and lower the
range
o
The
more digits assigned for the exponent the higher the range and lower the
precision
·
Floating point formats have the following attributes
o
A
number is assigned a storage space (i.e. fixed number of bits)
o
The
storage space is divided into 4 parts
§
Mantissa sign
§
Mantissa
§
Exponent sign
§
Exponent
o
The
following remaining parts are implied and hence do not need to be stored
§
Exponent base
§
Fraction point position
·
There
are a number of trade-offs that need to be considered when designing a floating
point format:
o
Storage size
§
Increase
precision and range
§
But
also increase storage requirements
o
Base of the exponent
§
A binary base provides a lower range but requires simpler calculations
§
A higher base (e.g. hex) provides a higher range but results in more complex calculations
o
Location of binary point
§
Usually
positioned at the beginning of number to provide maximum precision
o
Number of bits to use for the exponent
§
The higher the number, the higher the range and the lower the precision, and vice versa
o
Number of bits to use for the mantissa
§
The higher the number, the higher the precision and the lower the range, and vice versa
o
Method to handle the sign for the exponent
§
Sign
free representation is required
§
2’s
complement can be used but excess-N is more common
o
Method to handle the sign for the mantissa
§
Sign
free representation is not required
§
Sign-and-magnitude
is typically used
§
2’s complement can also be used but is less common
o
SEEMMMMM
format
o
Excess-50
o
Base
10 exponent
o
Base
10 mantissa
o
Implied
decimal point at the beginning of the number
·
This format provides a range as small as ±.00001 x 10^-50 and as large
as ±.99999 x 10^+49
Excess-N
·
One
important consideration in floating point is
o
How
to handle the sign of the exponent
o
2’s
complement is an obvious solution
o
However,
the Excess-N method is more commonly used
·
N
is a predefined value separating the positive range from the negative one
o
Value
≥ N is positive
o
Value
< N is negative
·
See Figure 5.1 on page 125 for the Excess-50 representation
·
Excess-N
provides the following important advantages over 2’s complement
o
Simpler
in calculation
o
More
flexible as N can be adjusted to adjust the range of positives and
negatives
§
The
smaller the N the larger the positive range and the smaller the negative range
§
The
larger the N the smaller the positive range and the larger the negative range
·
To convert from Excess-N to sign-and-magnitude, subtract N from the stored exponent
1.
Convert 30 represented in Excess-50 to sign-and-magnitude representation
= 30 – 50 = -20
2.
Convert 60 represented in Excess-50 to sign-and-magnitude representation
= 60 - 50 = 10
·
To convert from sign-and-magnitude to Excess-N, add N to the exponent
1.
Convert –10 represented in sign-and-magnitude to Excess-50
= 50 + (–10) = 40
2.
Convert 0 represented in sign-and-magnitude to Excess-50
= 50 + 0
= 50
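The two conversion rules can be written as a minimal sketch. The function names `to_excess` and `from_excess` are hypothetical, chosen for this illustration, and N defaults to 50 to match the examples.

```python
def to_excess(exponent: int, n: int = 50) -> int:
    """Return the stored Excess-N form of a signed exponent."""
    return exponent + n


def from_excess(stored: int, n: int = 50) -> int:
    """Return the signed exponent for a stored Excess-N value."""
    return stored - n


print(from_excess(30))   # -20
print(from_excess(60))   # 10
print(to_excess(-10))    # 40
print(to_excess(0))      # 50
```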
·
Normalization
is the process of eliminating leading zeros from the mantissa
·
The
objective of normalization is to maximize precision given the number of digits
limitation
·
Normalization
can only be performed if the exponent has enough range
1.
Normalize .0003 x 10^20
= .3 x 10^17
2.
Normalize .0003 x 10^-20
= .3 x 10^-23
3.
Normalize .0003 x 10^-98 assuming 2 exponent digits
= not possible: the normalized form .3 x 10^-101 needs a 3-digit exponent
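The normalization examples can be reproduced with a small sketch. The helper `normalize` is hypothetical; it takes the mantissa digits that follow the implied leading point, and it rejects results whose exponent no longer fits in the assumed number of exponent digits.

```python
def normalize(digits: str, exponent: int, exp_digits: int = 2):
    """Normalize .digits x 10^exponent by removing leading zeros.

    Each leading zero removed moves the point one place to the right,
    so the exponent is decremented once per zero.
    """
    stripped = digits.lstrip("0")
    if not stripped:                      # value is zero: nothing to do
        return digits, exponent
    shift = len(digits) - len(stripped)   # number of leading zeros
    new_exponent = exponent - shift
    limit = 10 ** exp_digits - 1          # e.g. 99 for two exponent digits
    if abs(new_exponent) > limit:
        raise ValueError(
            f"exponent {new_exponent} does not fit in {exp_digits} digits"
        )
    return stripped, new_exponent


print(normalize("0003", 20))    # ('3', 17)
print(normalize("0003", -20))   # ('3', -23)
try:
    normalize("0003", -98)
except ValueError as err:
    print("cannot normalize:", err)
```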
·
The
following steps provide the method to convert an integer or real number to
floating point format:
Given
the following floating point format
o
SEEMMMMM
format
o
Sign digit: 0 for positive and 5 for negative
o
Excess-50
o
Base
10 exponent
o
Base
10 mantissa
o
Implied
decimal point at the beginning of the mantissa
·
Convert 246.8035 into the above floating point format
1. Convert to exponent notation format = 246.8035 x 10^0
2. Set decimal point to proper position = .2468035 x 10^3
3. Normalize = already normalized
4. Convert exponent to Excess-N = 50 + 3 = 53
5. Store in floating point format = 05324680
·
Convert -.00000075 into the above floating point format
1. Convert to exponent notation format = -.00000075 x 10^0
2. Set decimal point to proper position = already in proper position
3. Normalize = -.75 x 10^-6
4. Convert exponent to Excess-N = 50 + (-6) = 44
5. Store in floating point format = 54475000
·
Convert 1255 x 10^-3 into the above floating point format
1. Convert to exponent notation format = 1255. x 10^-3
2. Set decimal point to proper position = .1255 x 10^1
3. Normalize = already normalized
4. Convert exponent to Excess-N = 50 + 1 = 51
5. Store in floating point format = 05112550
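The conversion steps above can be sketched in code. The function `encode` is hypothetical and uses Python's `decimal` module; rounding half up to 5 mantissa digits and the encoding chosen for zero are assumptions of this sketch, since the notes do not specify either.

```python
from decimal import Decimal, ROUND_HALF_UP


def encode(value: str) -> str:
    """Encode a decimal string in the SEEMMMMM format (Excess-50, base 10)."""
    d = Decimal(value)
    sign = "5" if d.is_signed() else "0"      # 0 = positive, 5 = negative
    d = abs(d)
    if d == 0:
        return sign + "5000000"               # assumption: zero with exponent 0
    exponent = d.adjusted() + 1               # power of ten for a .ddddd mantissa
    mantissa = d.scaleb(-exponent)            # now 0.1 <= mantissa < 1
    mantissa = mantissa.quantize(Decimal("0.00001"), rounding=ROUND_HALF_UP)
    if mantissa == 1:                         # rounding carried, e.g. .999996
        mantissa = Decimal("0.10000")
        exponent += 1
    stored = exponent + 50                    # convert to Excess-50
    if not 0 <= stored <= 99:
        raise OverflowError("exponent does not fit in two digits")
    return f"{sign}{stored:02d}{str(mantissa)[2:]}"


print(encode("246.8035"))       # 05324680
print(encode("-0.00000075"))    # 54475000
print(encode("1255E-3"))        # 05112550
```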
·
The
following steps provide the method to convert from floating point format to
real number format
·
Assume
the SEEMMMMM floating point format
·
Convert 05324657 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 53 - 50 = 3
3. Convert to exponential notation format = .24657 x 10^3
4. Convert to real number format = 246.57
·
Convert 54810000 to real number
1. Convert the sign digit = -
2. Convert from Excess-N to sign-and-magnitude = 48 - 50 = -2
3. Convert to exponential notation format = -.10000 x 10^-2
4. Convert to real number format = -.001
·
Convert 05112550 to real number
1. Convert the sign digit = +
2. Convert from Excess-N to sign-and-magnitude = 51 - 50 = 1
3. Convert to exponential notation format = .12550 x 10^1
4. Convert to real number format = 1.255
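The reverse conversion can be sketched the same way. The function `decode` is a hypothetical helper that follows the four steps of the examples above and returns a `decimal.Decimal` so that no binary rounding is introduced.

```python
from decimal import Decimal


def decode(word: str) -> Decimal:
    """Decode an SEEMMMMM word (Excess-50, base 10) into a decimal value."""
    sign = -1 if word[0] == "5" else 1        # step 1: sign digit
    exponent = int(word[1:3]) - 50            # step 2: Excess-50 to signed
    mantissa = Decimal("0." + word[3:])       # step 3: implied leading point
    return sign * mantissa.scaleb(exponent)   # step 4: apply the exponent


print(decode("05324657"))   # 246.57
print(decode("54810000") == Decimal("-0.001"))   # True
print(decode("05112550") == Decimal("1.255"))    # True
```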
·
Floating
point arithmetic is more complex and costly than that of integer arithmetic
·
The exponent and the mantissa must be computed separately
·
Addition/subtraction
is done using the following method
·
Assume
the SEEMMMMM floating point format
1.
Add 05199520 + 04967850
1. Align the second number's exponent with the first = 0510067850 (exponent 51, mantissa .0067850)
2. Add the mantissas = .99520 + .0067850 = 1.0019850
3. Adjust decimal point = .10019850 and exponent = 51 + 1 = 52
4. Store in floating point format (rounded to 5 digits) = 05210020
2.
Subtract 05199520 - 04967850
1. Align the second number's exponent with the first = 0510067850 (exponent 51, mantissa .0067850)
2. Subtract the mantissas = .99520 - .0067850 = .9884150
3. Adjust decimal point = already in proper position
4. Store in floating point format (rounded to 5 digits) = 05198842
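Addition and subtraction in the SEEMMMMM format can be sketched as follows. The helpers `unpack`, `pack`, `fp_add`, and `fp_sub` are hypothetical. As an implementation choice, the sketch aligns by scaling the integer mantissas to the smaller exponent, so no digits are lost before the final rounding; this gives the same result as shifting the smaller number right while keeping the extra digits. Rounding half up and the encoding of zero are assumptions.

```python
def unpack(word: str):
    """Split an SEEMMMMM word into (signed integer mantissa, stored exponent)."""
    sign = -1 if word[0] == "5" else 1
    return sign * int(word[3:]), int(word[1:3])


def pack(total: int, exp: int) -> str:
    """Round an exact integer mantissa to five digits and build the word."""
    if total == 0:
        return "05000000"                 # assumption: one encoding of zero
    sign = "5" if total < 0 else "0"
    mag = abs(total)
    extra = len(str(mag)) - 5             # digits beyond the five we can store
    if extra > 0:
        mag, rem = divmod(mag, 10 ** extra)
        if 2 * rem >= 10 ** extra:        # round half up
            mag += 1
        if mag == 100000:                 # rounding carried into a sixth digit
            mag, extra = 10000, extra + 1
    else:
        mag *= 10 ** -extra               # normalize away leading zeros
    stored = exp + extra
    if not 0 <= stored <= 99:
        raise OverflowError("exponent does not fit in two digits")
    return f"{sign}{stored:02d}{mag:05d}"


def fp_add(a: str, b: str) -> str:
    ma, ea = unpack(a)
    mb, eb = unpack(b)
    common = min(ea, eb)                  # align both to the smaller exponent
    total = ma * 10 ** (ea - common) + mb * 10 ** (eb - common)
    return pack(total, common)


def fp_sub(a: str, b: str) -> str:
    negated = ("0" if b[0] == "5" else "5") + b[1:]   # flip the sign digit
    return fp_add(a, negated)


print(fp_add("05199520", "04967850"))   # 05210020
print(fp_sub("05199520", "04967850"))   # 05198842
```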
·
Alignment
is not necessary when performing multiplication
·
Multiplication
is done using the following method
1.
Multiply
the two mantissas
2.
Add
the two exponents and subtract N
3.
Normalize
if necessary
4.
Store
number in the floating point format
·
Assume
the SEEMMMMM floating point format
o
Excess-50
o
Base
10 exponent
o
Base
10 mantissa
o
Implied
decimal point at the beginning of the number
·
Multiply 05220000 x 04712500
1. Multiply the 2 mantissas = .20000 x .12500 = .02500
2. Compute exponent = 52 + 47 - 50 = 49
3. Normalize result = .25000 and adjust exponent = 49 - 1 = 48
4. Store in floating point format = 04825000
·
Alignment
is not necessary when performing division
·
Division
is done by
1.
Divide
the two mantissas
2.
Subtract
the second number's exponent from the first number's exponent and add N
3.
Place
decimal point in proper position
4.
Store
number in the floating point format
·
Divide 05275000 ÷ 05025000
1. Divide the 2 mantissas = .75000 ÷ .25000 = 3.00000
2. Compute exponent = 52 - 50 + 50 = 52
3. Place decimal point in proper position = .30000 and adjust exponent = 52 + 1 = 53
4. Store in floating point format = 05330000
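Division can be sketched in the same style. The function `fp_div` is hypothetical and assumes both operands are normalized (first mantissa digit nonzero); it rounds the quotient half up to 5 digits and does not check the exponent range.

```python
def fp_div(a: str, b: str) -> str:
    """Divide two normalized SEEMMMMM words (Excess-50, base 10)."""
    sign = "0" if a[0] == b[0] else "5"
    ma, mb = int(a[3:]), int(b[3:])
    if mb == 0:
        raise ZeroDivisionError("division by zero")
    if ma == 0:
        return "05000000"                     # assumption: one encoding of zero
    # Subtracting two Excess-50 exponents cancels N, so add it back once.
    exp = int(a[1:3]) - int(b[1:3]) + 50
    if ma >= mb:
        # Quotient is in [1, 10): move the point one place left.
        scale, exp = 10 ** 4, exp + 1
    else:
        # Quotient is already in (0.1, 1).
        scale = 10 ** 5
    mant = (2 * ma * scale + mb) // (2 * mb)  # integer division, rounded half up
    return f"{sign}{exp:02d}{mant:05d}"


print(fp_div("05275000", "05025000"))   # 05330000
```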
·
IEEE has developed a standard for both 32-bit and 64-bit floating point representation
·
The standard was targeted for use in personal computers (IBM-type PCs and Apple
Macintosh)
·
Apple
Macintosh also provides its own 80-bit format
·
IEEE 754 defines a 32-bit format called single-precision floating point format
o
Leftmost bit is the mantissa sign (0 for positive and 1 for negative)
o
Followed by an 8-bit exponent
o
Followed by a 24-bit mantissa (23 stored bits + an implied leading bit, which is always assumed to be 1)
o
Exponent is represented using Excess-127, which gives an exponent range of 2^-126
to 2^+127
Exponent fields 0 (2^-127) and 255 (2^+128) are reserved for special use
o
Implied exponent base is 2
o
Fraction point position is to the right of the leading mantissa bit
o
Special numbers (e.g. 0, ∞, very small non-normalized numbers, etc.) are supported
o
Supported precision is approximately 7 significant decimal digits
o
Allows for an approximate range of 10^-45 to 10^+38
·
IEEE 754 defines a 64-bit format called double-precision floating point format
o
It works similarly to the single-precision format
o
11 bits for the exponent and 52 bits for the mantissa
o
Supported precision is approximately 15 significant decimal digits
o
Allows for an approximate range of 10^-308 to 10^+308
·
The
following steps provide the method to convert a decimal real number to IEEE 754
Floating Point format:
o
Convert
the decimal number to binary
o
Adjust
binary point to proper position
o
Normalize
the number
o
Convert
exponent from sign-and-magnitude to Excess-127
o
Convert
exponent to binary
o
Store
the number in the floating point format
1.
Convert 36.5 (base 10) to single-precision IEEE 754 floating point format
1. Convert to binary = 100100.1
2. Adjust binary point to proper position = 1.001001 x 2^5
3. Normalize = already normalized
4. Convert exponent to Excess-127 = 127 + 5 = 132
5. Convert exponent to binary = 10000100
6. Store in floating point format = 0 10000100 00100100000000000000000
2.
Convert -0.25 to single-precision IEEE 754 floating point format
1. Convert to binary = -.01
2. Adjust binary point to proper position = -0.1 x 2^-1
3. Normalize = -1.0 x 2^-2
4. Convert exponent to Excess-127 = 127 - 2 = 125
5. Convert exponent to binary = 01111101
6. Store in floating point format = 1 01111101 00000000000000000000000
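The two hand conversions can be cross-checked against the machine's own encoding. This sketch uses Python's standard `struct` module to pack a value as a big-endian single-precision float and then prints the sign, exponent, and mantissa fields; the helper name `float_to_bits` is hypothetical.

```python
import struct


def float_to_bits(x: float) -> str:
    """Return the IEEE 754 single-precision fields of x as 'S EEEEEEEE MMM...'."""
    # Pack as a 32-bit float, then reinterpret the same 4 bytes as an integer.
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    bits = f"{raw:032b}"
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


print(float_to_bits(36.5))    # 0 10000100 00100100000000000000000
print(float_to_bits(-0.25))   # 1 01111101 00000000000000000000000
```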
·
The
following steps provide the method to convert IEEE 754 Floating Point to
decimal real number:
o
Convert
exponent from binary to decimal
o
Convert
from Excess-127 to sign-and-magnitude
o
Convert
to exponent notation
o
Remove
exponent (if possible)
o
Convert
from binary to decimal real number
1. Convert 1 01111101 00000000000000000000000 to decimal real number
1. Convert exponent to decimal = 125
2. Convert Excess-127 to sign-and-magnitude = 125 - 127 = -2
3. Convert to exponent notation = -1.0 x 2^-2
4. Remove exponent = -0.01 (binary)
5. Convert to decimal real number = -0.25
2. Convert 0 10000001 11001100000000000000000 to decimal real number
1. Convert exponent to decimal = 129
2. Convert Excess-127 to sign-and-magnitude = 129 - 127 = 2
3. Convert to exponent notation = 1.110011 x 2^2
4. Remove exponent = 111.0011 (binary)
5. Convert to decimal real number = 7.1875
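The decoding steps can be sketched directly from the three fields. The helper `bits_to_float` is hypothetical and handles normalized numbers only; the reserved exponent fields 0 and 255 are deliberately left out of this sketch.

```python
def bits_to_float(fields: str) -> float:
    """Decode 'S EEEEEEEE MMM...' (normalized single precision) to a float."""
    sign_bit, exp_bits, frac_bits = fields.split()
    exponent = int(exp_bits, 2) - 127              # Excess-127 to signed
    mantissa = 1 + int(frac_bits, 2) / 2 ** 23     # restore the implied leading 1
    value = mantissa * 2.0 ** exponent
    return -value if sign_bit == "1" else value


print(bits_to_float("1 01111101 00000000000000000000000"))   # -0.25
print(bits_to_float("0 10000001 11001100000000000000000"))   # 7.1875
```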
·
Conversion of floating point numbers may lose accuracy when converted to another base
(e.g. decimal to binary and vice versa)
·
Many applications, especially business applications that deal with money, require
full accuracy of the numbers
·
BCD
satisfies the full accuracy objective
·
BCD
in floating point is very similar to the BCD used to represent integer numbers
·
Many business-oriented high-level languages (e.g. COBOL) support the packed decimal
format
·
Figure 5.8, page 138 shows the 128-bit packed decimal format used in IBM 370/390 and VAX
computers
·
The
format allows for 31 decimal digits (1 digit per 4 bits)
·
The least significant 4 bits are used for the sign (1100 for +, 1101 for -)
·
The
location of the decimal point is not stored and must be maintained by the
application program
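The accuracy problem that motivates decimal storage is easy to observe. In this sketch, Python's binary `float` stands in for binary floating point, and the `decimal` module stands in for a decimal representation; `decimal` is not packed BCD, so this only illustrates the idea of keeping digits in base 10.

```python
from decimal import Decimal

# 0.1, 0.2 and 0.3 have no exact binary representation, so the
# binary floating point sum does not compare equal to 0.3.
print(0.1 + 0.2 == 0.3)                                        # False

# The same amounts held as decimal digits are exact.
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))    # True
```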
·
Convert -150.54 (base 10) to IBM 370/390 packed decimal (BCD) format
o
Convert sign to BCD format = 1101
o
Convert digit by digit to BCD format = 0001 0101 0000 0101 0100
o
Pad with zeros to fill the entire storage space = 128 - 20 - 4 = 104 leading zeros are needed
o
Store in packed decimal format = (… 104 zeros …) 0001 0101 0000 0101 0100 1101
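The packing steps can be sketched as a small function. The name `pack_decimal` is hypothetical; the sketch follows the layout described above (one digit per 4 bits, sign in the least significant 4 bits) and simply drops the decimal point, since its position is not stored.

```python
def pack_decimal(value: str, total_bits: int = 128) -> str:
    """Return the packed decimal bit string for a decimal literal."""
    sign_nibble = "1101" if value.startswith("-") else "1100"
    # Keep only the digits; the decimal point position is not stored.
    digits = [ch for ch in value if ch.isdigit()]
    nibbles = "".join(f"{int(d):04b}" for d in digits)
    pad = total_bits - len(nibbles) - len(sign_nibble)
    if pad < 0:
        raise ValueError("too many digits for the storage size")
    return "0" * pad + nibbles + sign_nibble


packed = pack_decimal("-150.54")
print(len(packed))      # 128
print(packed[-24:])     # 000101010000010101001101
```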
·
An overflow occurs when the number is too large to be stored
·
An underflow occurs when the number is too small (too close to zero) to be stored
·
See Figure 5.2 on page 115 for an illustration of overflow and underflow in floating
point representation
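Both conditions can be observed with ordinary Python floats. This sketch assumes that Python's `float` is an IEEE 754 double-precision value, which holds on typical platforms; in that format an overflow yields infinity and an underflow yields zero.

```python
import math
import sys

largest = sys.float_info.max         # about 1.8 x 10^308
smallest = sys.float_info.min        # smallest normalized value, about 2.2 x 10^-308

overflowed = largest * 10            # too large to store: becomes infinity
underflowed = smallest / 2.0 ** 60   # too close to zero to store: becomes 0.0

print(math.isinf(overflowed))        # True
print(underflowed == 0.0)            # True
```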