Chapter 5 – Floating Point Numbers

·         Floating point representation is used to represent real numbers (i.e. numbers with fractions)

·         Floating point representation supports a huge range with a reasonable storage size

·         Example: a 32-bit storage size can represent a number

o        As large as 10^38

o        As small as 10^-45

 

·         Floating point representation suffers from the following main disadvantages:

o        Potential loss of precision due to limited number of significant digits

o        Relatively large storage requirements

o        Slow calculations

 

·         This chapter covers:

o        Review of exponential notation

o        Floating point representation in computers

o        Floating point calculations

o        IEEE 754 floating point standard

o        Packed Decimal Format (BCD)

o        Overflow and Underflow Conditions


Review of Exponential Notation

·         Exponential notation or scientific notation is a conventional method for representing floating point numbers

·         Exponential notation format consists of 6 components

o        Mantissa

o        Sign of the mantissa

o        Exponent

o        Sign of the exponent

o        Base of the exponent

o        Location of the fraction point

Example: -50.5 x 10^-20

 

·         Fraction point position is flexible and can be adjusted without changing the number magnitude

·         Moving the fraction point requires a matching adjustment to the exponent

o        For every move to the right, the exponent must be decremented

o        For every move to the left, the exponent must be incremented

 
Examples
The number -50.5 x 10^-20 can be represented as
-505. x 10^-21
-.505 x 10^-18
-.000505 x 10^-15
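
As a quick check of the examples above, the following Python sketch compares the four forms using exact rational arithmetic, so no rounding can hide a mistake. The helper name `value` is an illustrative choice and is not part of the notes.

```python
from fractions import Fraction

def value(mantissa: str, exponent: int) -> Fraction:
    """Exact value of mantissa x 10^exponent (mantissa given as a decimal string)."""
    return Fraction(mantissa) * Fraction(10) ** exponent

# The four representations of the same number from the examples.
forms = [("-50.5", -20), ("-505", -21), ("-0.505", -18), ("-0.000505", -15)]
values = [value(m, e) for m, e in forms]

# Moving the point right by one place and decrementing the exponent
# (or the reverse) leaves the magnitude unchanged.
all_equal = all(v == values[0] for v in values)
print(all_equal)  # True
```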

 

·         However, shifting should not be arbitrary, since it may affect the precision of the number

·         If we limit the mantissa to 5 digits, the last representation (.000505) must be stored as .00050, which loses 1 digit of precision (i.e. introduces an error)


Floating Point Representation in Computers

·         Computers use a representation method very similar to the exponential notation

·         Binary is used instead of decimal

·         Storage sizes of 32, 64, and 128 bits are typically used

·         See Figure 5.4 on page 132 for a typical 32-bit floating point format

o        Leftmost bit is the mantissa sign

o        Followed by an 8-bit exponent

o        Followed by a 23-bit mantissa

o        The fraction point is implied to be at the beginning of the mantissa

o        Exponent is stored in excess-128 notation

o        Base of the exponent is implied as base 2

·         Computers use many different proprietary and standard floating point representation methods

·         The following standard is in common use and will be studied later in the chapter

o        IEEE 754 Floating Point representation standard

 

Floating Point Representation Review

 

·         Assume the following decimal floating point format: SMMMMMMM              

(S = mantissa sign, M = mantissa, decimal point is implied at the end)

·         The above format provides a range of: ±9999999

·         Let’s introduce 2 exponent digits (EE) and an exponent sign in place of 3 mantissa digits: SSEEMMMM  (second S is the exponent sign)

·         The above format provides a range of:  ± 0001 x 10^-99 to ± 9999 x 10^99

·         With this representation we have traded off 3 digits of precision to increase the range

·         There exists a trade-off between precision and range

o        The more digits assigned for the mantissa, the higher the precision and lower the range

o        The more digits assigned for the exponent the higher the range and lower the precision

 

·         Floating point formats have the following attributes

o        A number is assigned a storage space (i.e. fixed number of bits)

o        The storage space is divided into 4 parts

§         Mantissa sign

§         Mantissa

§         Exponent sign

§         Exponent

o        The following remaining parts are implied and hence do not need to be stored

§         Exponent base

§         Fraction point position

 

·         There are a number of trade-offs that need to be considered when designing a floating point format:

o        Storage size

§         Increase precision and range

§         But also increase storage requirements

o        Base of the exponent

§         Binary base provides low range capability but requires simple calculations

§         Higher base (e.g. hex) provides high range but results in more complex calculations

o        Location of binary point

§         Usually positioned at the beginning of number to provide maximum precision

o        Number of bits to use for the exponent

§         The higher the number, the higher the range and the lower the precision, and vice versa

o        Number of bits to use for the mantissa

§         The higher the number, the higher the precision and the lower the range, and vice versa

o        Method to handle the sign for the exponent

§         Sign free representation is required

§         2’s complement can be used but excess-N is more common

o        Method to handle the sign for the mantissa

§         Sign free representation is not required

§         Sign-and-magnitude is typically used

§         2’s complement can also be used but less common

 

Example Floating Point Format

o        SEEMMMMM format

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the number

·         This format provides a range as small as: ± .00001 x 10^-50 and as large as: ± .99999 x 10^+49

 

Excess-N

·         One important consideration in floating point is

o        How to handle the sign of the exponent

o        2’s complement is an obvious solution

o        However, the Excess-N method is more commonly used

·         N is a predefined value separating the positive range from the negative one

o        A stored value ≥ N represents a zero or positive exponent

o        A stored value < N represents a negative exponent

·         See Figure 5.1 on page 125 for Excess-50 representation

 

·         Excess-N provides the following important advantages over 2’s complement

o        Simpler in calculation

o        More flexible as N can be adjusted to adjust the range of positives and negatives 

§         The smaller the N the larger the positive range and the smaller the negative range

§         The larger the N the smaller the positive range and the larger the negative range

 

Conversion from Excess–N to Sign-and-Magnitude

·         Subtract N from the stored exponent

 

Examples

1. Convert 30 represented in Excess-50 to sign-and-magnitude representation

= 30 – 50 = -20

 

2. Convert 60 represented in Excess-50 to sign-and-magnitude representation

= 60 – 50 = 10

 

Conversion from Sign-and-Magnitude Notation to Excess–N

·         Add N to the exponent

 

Examples

1. Convert –10 represented in sign-and-magnitude to Excess-50

= 50 + (–10) = 40

 

2. Convert 0 represented in sign-and-magnitude to Excess-50

= 50 + 0  = 50
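
The two conversion rules can be sketched in Python as follows. The function names `to_excess` and `from_excess` and the default N = 50 are illustrative assumptions chosen to match the examples.

```python
def to_excess(exponent: int, n: int = 50) -> int:
    """Sign-and-magnitude exponent -> Excess-N stored value (add N)."""
    return exponent + n

def from_excess(stored: int, n: int = 50) -> int:
    """Excess-N stored value -> sign-and-magnitude exponent (subtract N)."""
    return stored - n

print(from_excess(30))  # -20
print(from_excess(60))  # 10
print(to_excess(-10))   # 40
print(to_excess(0))     # 50
```

Because the stored value is never negative, two stored exponents can be compared as plain unsigned numbers.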

 

Normalization and Formatting of Floating Point Numbers

·         Normalization is the process of eliminating leading zeros from the mantissa

·         The objective of normalization is to maximize precision given the number of digits limitation

·         Normalization can only be performed if the exponent has enough range

 

Examples

1. Normalize .0003 x 10^20

= .3 x 10^17

 

2. Normalize .0003 x 10^-20

= .3 x 10^-23

 

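3. Normalize .0003 x 10^-98 assuming 2 exponent digits

= .003 x 10^-99 (only partially normalized: full normalization would need an exponent of -101, which does not fit in 2 digits)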
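
A minimal sketch of normalization, assuming the mantissa is held as a string of digits after the implied leading point and that the exponent may not go below a limit `min_exp` (-99 for 2 exponent digits). The function name and the string representation are assumptions made for illustration.

```python
def normalize(digits: str, exponent: int, min_exp: int = -99):
    """Strip leading zeros from .digits x 10^exponent while the exponent allows it."""
    if set(digits) == {"0"}:
        return digits, exponent  # zero has no nonzero digit to bring forward
    while digits.startswith("0") and exponent > min_exp:
        digits = digits[1:]   # move the point one place to the right...
        exponent -= 1         # ...and decrement the exponent to compensate
    return digits, exponent

print(normalize("0003", 20))   # ('3', 17)
print(normalize("0003", -20))  # ('3', -23)
print(normalize("0003", -98))  # ('003', -99)
```

In the third call the loop stops early because the exponent has reached its limit, so the mantissa keeps two leading zeros.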

 

Converting from real number to Floating Point format

·         The following steps provide the method to convert an integer or real number to floating point format:

  1. Convert the number to exponential notation format
  2. Place the fraction point to its proper position
  3. Normalize the number
  4. Convert exponent from sign-and-magnitude to Excess-N
  5. Store the number in the floating point format

 

Examples

Given the following floating point format

o        SEEMMMMM format

o        Use 0 for positive and 5 for negative

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the mantissa

 

·         Convert 246.8035 into the above floating point format

1. Convert to exponent notation format   = 246.8035 x 10^0

2. Set decimal point to proper position     = .2468035 x 10^3

3. Normalize                                   already normalized

4. Convert exponent to Excess-N              = 50 + 3 = 53

5. Store in floating point format = 05324680

 

·         Convert – .00000075 into the above floating point format

1. Convert to exponent notation format   = - .00000075 x 10^0

2. Set decimal point to proper position   already in proper position

3. Normalize                                   = - .75 x 10^-6

4.  Convert exponent to Excess-N             = 50 + (-6) = 44

5.  Store in floating point format                = 54475000

 

·         Convert 1255 x 10^-3 into the above floating point format

1. Convert to exponent notation format   = 1255. x 10^-3

2. Set decimal point to proper position     = .1255 x 10^1

3. Normalize                                   already normalized

4.  Convert to Excess-N                               = 50 + 1 = 51

5.  Store in floating point format                = 05112550
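
The five steps can be sketched in Python with the `decimal` module. This is one possible implementation and rests on some assumptions: the name `encode` is illustrative, excess mantissa digits are rounded half up (the notes do not state a rounding rule), and zero is rejected because the notes do not define its encoding.

```python
from decimal import Decimal, ROUND_HALF_UP

EXCESS = 50          # Excess-50 exponent
MANTISSA_DIGITS = 5  # MMMMM

def encode(text: str) -> str:
    """Encode a decimal literal such as '246.8035' as an SEEMMMMM string."""
    d = Decimal(text)
    sign = "5" if d < 0 else "0"          # 0 = positive, 5 = negative
    d = abs(d)
    if d == 0:
        raise ValueError("no encoding for zero is defined here")
    # Steps 1-3: write d as .MMMMM... x 10^exponent with a nonzero first digit.
    exponent = d.adjusted() + 1
    mantissa = d.scaleb(-exponent)        # 0.1 <= mantissa < 1
    # Keep 5 mantissa digits (assumption: round half up).
    digits = int(mantissa.scaleb(MANTISSA_DIGITS)
                 .quantize(Decimal(1), rounding=ROUND_HALF_UP))
    if digits == 10 ** MANTISSA_DIGITS:   # rounding carried, e.g. .999996
        digits //= 10
        exponent += 1
    # Step 4: sign-and-magnitude exponent -> Excess-50.
    stored = exponent + EXCESS
    if not 0 <= stored <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # Step 5: assemble S, EE and MMMMM.
    return f"{sign}{stored:02d}{digits:05d}"

print(encode("246.8035"))     # 05324680
print(encode("-0.00000075"))  # 54475000
print(encode("1255E-3"))      # 05112550
```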

 

Converting from Floating Point format to real number

·         The following steps provide the method to convert from floating point format to real number format

  1. Convert the mantissa sign digit to (+ or -)
  2. Convert from Excess-N to sign-and-magnitude
  3. Convert to exponential notation format
  4. Convert to real number format

 

Examples

·         Assume the SEEMMMMM floating point format

 

·         Convert 05324657 to real number

1. Convert the sign digit                                                             =  +

2. Convert from Excess-N to sign-and-magnitude = 53 – 50 = 3

3. Convert to exponential notation format                               = .24657 x 10^3

4.  Convert to real number format                                              = 246.57

 

·         Convert 54810000 to real number

1. Convert the sign digit                                                             =  -

2. Convert from Excess-N to sign-and-magnitude                  = 48 – 50 = -2

3. Convert to exponential notation format                               = - .10000 x 10^-2

4.  Convert to real number format                                              = - .001

 

·         Convert 05112550 to real number

1. Convert the sign digit                                                             =  +

2. Convert from Excess-N to sign-and-magnitude = 51 – 50 = 1

3. Convert to exponential notation format                               = .12550 x 10^1

4.  Convert to real number format                                              = 1.255
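
The reverse conversion can be sketched the same way. The name `decode` is an illustrative choice; the format parameters (0/5 sign digit, Excess-50, five mantissa digits after the implied point) are those of the SEEMMMMM format used in the examples.

```python
from decimal import Decimal

def decode(word: str) -> Decimal:
    """Decode an SEEMMMMM string into its decimal value."""
    sign = -1 if word[0] == "5" else 1               # step 1: sign digit
    exponent = int(word[1:3]) - 50                   # step 2: Excess-50 -> signed
    mantissa = Decimal(int(word[3:])).scaleb(-5)     # step 3: .MMMMM
    return sign * mantissa.scaleb(exponent)          # step 4: apply the exponent

print(decode("05324657") == Decimal("246.57"))   # True
print(decode("54810000") == Decimal("-0.001"))   # True
print(decode("05112550") == Decimal("1.255"))    # True
```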


Floating Point Calculations

·         Floating point arithmetic is more complex and costly than integer arithmetic

·         The exponent and the mantissa must each be computed separately

 

Addition and Subtraction

·         Addition/subtraction is done using the following method

  1. Align the exponents (shift the mantissa of the number with the smaller exponent to the right until its exponent matches the larger exponent)
  2. Add/subtract the mantissas
  3. Place decimal point in proper position if necessary
  4. Store number in the floating point format

 

Examples

·         Assume the SEEMMMMM floating point format

 

1. Add 05199520 + 04967850

1. Align the second number exponent      = 0510067850

2. Add the mantissas                                   = .99520 + .0067850 = 1.0019850

3. Adjust decimal point                               = .10019850 and Exponent = 51 + 1 = 52

4.  Store in floating point format                = 05210020

 

2. Subtract 05199520 - 04967850

1. Align the second number exponent      = 0510067850

2. Subtract the mantissas                            = .99520 – .0067850 = .9884150

3. Adjust decimal point                               already in proper position

4.  Store in floating point format                = 05198842
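
The addition and subtraction method can be sketched as below for the SEEMMMMM format. The helper names (`unpack`, `pack`, `add`, `subtract`) are illustrative, and rounding half up to five mantissa digits is an assumption, chosen because it reproduces the stored result of the addition example.

```python
from decimal import Decimal, ROUND_HALF_UP

def unpack(word: str):
    """Split SEEMMMMM into (sign, stored exponent, mantissa as .MMMMM)."""
    sign = -1 if word[0] == "5" else 1
    return sign, int(word[1:3]), Decimal(int(word[3:])).scaleb(-5)

def pack(value: Decimal, stored_exp: int) -> str:
    """Renormalize value x 10^(stored_exp - 50) and store it as SEEMMMMM."""
    sign = "5" if value < 0 else "0"
    value = abs(value)
    if value == 0:
        raise ValueError("no encoding for zero is defined here")
    shift = value.adjusted() + 1            # places to move the point
    stored_exp += shift
    digits = int(value.scaleb(5 - shift).quantize(Decimal(1), rounding=ROUND_HALF_UP))
    if digits == 100000:                    # rounding carried into a new digit
        digits, stored_exp = 10000, stored_exp + 1
    if not 0 <= stored_exp <= 99:
        raise OverflowError("exponent out of range")
    return f"{sign}{stored_exp:02d}{digits:05d}"

def add(a: str, b: str) -> str:
    sa, ea, ma = unpack(a)
    sb, eb, mb = unpack(b)
    exp = max(ea, eb)
    # Step 1: align by shifting the smaller-exponent mantissa to the right.
    ma, mb = ma.scaleb(ea - exp), mb.scaleb(eb - exp)
    # Steps 2-4: add the signed mantissas, then renormalize and store.
    return pack(sa * ma + sb * mb, exp)

def subtract(a: str, b: str) -> str:
    flipped = ("0" if b[0] == "5" else "5") + b[1:]   # negate b, then add
    return add(a, flipped)

print(add("05199520", "04967850"))  # 05210020
```

`subtract` reuses `add` by flipping the sign digit of the second operand, which mirrors the fact that both operations share the same alignment step.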

 

Multiplication

·         Alignment is not necessary when performing multiplication

·         Multiplication is done using the following method

1.        Multiply the two mantissas

2.        Add the two exponents and subtract N

3.        Normalize if necessary

4.        Store number in the floating point format

 

Examples

·         Assume the SEEMMMMM floating point format

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the number

 

·         Multiply 05220000 x 04712500

1. Multiply the 2 mantissas                        = .20000 x .12500 = .02500

2. Compute exponent                                   = 52 + 47 – 50 = 49

3. Normalize result                                        = .25000 and adjust Exponent = 49 – 1 = 48

4. Store in floating point format = 04825000

 

Division

·         Alignment is not necessary when performing division

·         Division is done by

1.        Divide the two mantissas

2.        Subtract the second number's exponent from the first number's exponent and add N

3.        Place decimal point in proper position

4.        Store number in the floating point format

 

Examples

·         Divide 05275000 ÷ 05025000

1. Divide the 2 mantissas                                            = .75000 ÷ .25000 = 3.00000

2. Compute exponent                                                   = 52 – 50 + 50 = 52

3. Place decimal point in proper position = .30000 and adjust Exponent = 52 + 1 = 53

4. Store in floating point format                 = 05330000
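
Multiplication and division in the SEEMMMMM format can be sketched together, since they differ only in how the mantissas and exponents are combined. This version uses exact fractions; the helper names and the round-half-up rule for the fifth mantissa digit are assumptions made for illustration.

```python
from fractions import Fraction

EXCESS = 50

def unpack(word: str):
    """Split SEEMMMMM into (is_negative, stored exponent, mantissa as a fraction)."""
    return word[0] == "5", int(word[1:3]), Fraction(int(word[3:]), 10**5)

def pack(negative: bool, mantissa: Fraction, stored_exp: int) -> str:
    if mantissa == 0:
        raise ValueError("no encoding for zero is defined here")
    while mantissa >= 1:                  # point too far right: shift it left
        mantissa /= 10
        stored_exp += 1
    while mantissa < Fraction(1, 10):     # leading zero: normalize
        mantissa *= 10
        stored_exp -= 1
    digits = int(mantissa * 10**5 + Fraction(1, 2))   # round half up
    if digits == 10**5:
        digits, stored_exp = 10**4, stored_exp + 1
    if not 0 <= stored_exp <= 99:
        raise OverflowError("exponent out of range")
    sign = "5" if negative else "0"
    return f"{sign}{stored_exp:02d}{digits:05d}"

def multiply(a: str, b: str) -> str:
    na, ea, ma = unpack(a)
    nb, eb, mb = unpack(b)
    # Multiply mantissas; add the stored exponents and remove one excess.
    return pack(na != nb, ma * mb, ea + eb - EXCESS)

def divide(a: str, b: str) -> str:
    na, ea, ma = unpack(a)
    nb, eb, mb = unpack(b)
    # Divide mantissas; subtract the stored exponents and restore the excess.
    return pack(na != nb, ma / mb, ea - eb + EXCESS)

print(multiply("05220000", "04712500"))  # 04825000
print(divide("05275000", "05025000"))    # 05330000
```

The excess is subtracted once in multiplication because each stored exponent already contains it, and added back in division because the subtraction cancels it.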


IEEE 754 Floating Point Standard

·         IEEE has developed a standard for both 32-bit and 64-bit floating point representation

·         The standard was targeted at personal computers (IBM-type PCs and the Apple Macintosh)

·         Apple Macintosh also provides its own 80-bit format

·         IEEE 754 defines a 32-bit format called single-precision floating point format

o        Leftmost bit is the mantissa sign (0 for positive and 1 for negative)

o        Followed by an 8-bit exponent

o        Followed by a 24-bit mantissa (23 stored bits plus an implied leading bit that is always assumed to be 1)

o        Exponent is represented using Excess-127, which gives an exponent range of 2^-126 to 2^+127

o        Stored exponents 0 (2^-127) and 255 (2^+128) are reserved for special use

o        Implied exponent base is 2

o        Fraction point position is to right of the leading mantissa bit

o        Special numbers (e.g. 0, ∞, very small non-normalized numbers, etc.) are supported

o        Supported precision is approximately 7 significant decimal digits

o        Allows for an approximate range of 10^-45 to 10^+38

 

·         IEEE 754 defines a 64-bit format called double-precision floating point format

o        It works similarly to the single-precision format

o        11 bits for exponent and 52 bits for mantissa

o        Supported precision is approximately 15 significant decimal digits

o        Allows for an approximate range of 10^-300 to 10^+300

 

Convert Decimal Real Number to IEEE 754 Floating Point Format

·         The following steps provide the method to convert a decimal real number to IEEE 754 Floating Point format:

o        Convert the decimal number to binary

o        Adjust binary point to proper position

o        Normalize the number

o        Convert exponent from sign-and-magnitude to Excess-127

o        Convert exponent to binary

o        Store the number in the floating point format

 

1. Convert 36.5 (decimal) to single-precision IEEE 754 floating point format

1. Convert to binary                                                     = 100100.1

2. Adjust binary point to proper position                                = 1.001001 x 2^5

3. Normalize                                                                     already normalized

4. Convert exponent to Excess-127                           = 127 + 5 = 132

5. Convert exponent to binary                                   = 10000100

6. Store in floating point format                 = 0 10000100 00100100000000000000000

 

2. Convert –0.25 to single-precision IEEE 754 floating point format

1. Convert to binary                                                     = - .01

2. Adjust binary point to proper position                                = - 0.1 x 2^-1

3. Normalize                                                                   = - 1.0 x 2^-2

4. Convert exponent to Excess-127                           = 127 - 2 = 125

5. Convert exponent to binary                                   = 01111101

6. Convert to floating point format                            = 1 01111101 00000000000000000000000
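
Hand-worked encodings like the two above can be checked with Python's standard `struct` module, which packs a value as an IEEE 754 single-precision number. The function name `float_to_fields` is an illustrative choice; it simply splits the 32 bits into sign, exponent and mantissa fields.

```python
import struct

def float_to_fields(x: float) -> str:
    """Return the single-precision bit pattern of x as 'S EEEEEEEE MMM...M'."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(float_to_fields(36.5))   # 0 10000100 00100100000000000000000
print(float_to_fields(-0.25))  # 1 01111101 00000000000000000000000
```

The leading 1 of the normalized mantissa does not appear in the output, because it is implied rather than stored.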

 

Convert from IEEE 754 Floating Point Format to real number

·         The following steps provide the method to convert IEEE 754 Floating Point to decimal real number:

o        Convert exponent from binary to decimal

o        Convert from Excess-127 to sign-and-magnitude

o        Convert to exponent notation

o        Remove exponent (if possible)

o        Convert from binary to decimal real number

 

1. Convert 1 01111101 00000000000000000000000 to decimal real number

1. Convert exponent to decimal                                 = 125

2. Convert Excess-127 to sign-and-magnitude        = 125 – 127 = -2

3. Convert to exponent notation                                = - 1.0 x 2^-2

4. Remove exponent                                                    = - 0.01 (binary)

5. Convert to decimal real number                             = - 0.25

 

2. Convert 0 10000001 11001100000000000000000 to decimal real number

1. Convert exponent to decimal                                 = 129

2. Convert Excess-127 to sign-and-magnitude        = 129 – 127 = 2

3. Convert to exponent notation                                = 1.110011 x 2^2

4. Remove exponent                                                    = 111.0011 (binary)

5. Convert to decimal real number                             = 7.1875
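
The decoding steps can be sketched directly in Python for normalized numbers. The function name `fields_to_float` and the space-separated input layout are assumptions for illustration; the reserved exponent values 0 and 255 are rejected rather than interpreted.

```python
def fields_to_float(pattern: str) -> float:
    """Decode 'S EEEEEEEE MMM...M' (single precision, normalized) to a float."""
    sign_bit, exp_bits, frac_bits = pattern.split()
    stored = int(exp_bits, 2)                 # exponent: binary -> decimal
    if stored in (0, 255):
        raise ValueError("reserved exponent: special value, not handled here")
    exponent = stored - 127                   # Excess-127 -> sign-and-magnitude
    # Restore the implied leading 1 in front of the 23 stored fraction bits.
    mantissa = 1 + int(frac_bits, 2) / 2**23
    value = mantissa * 2.0**exponent
    return -value if sign_bit == "1" else value

print(fields_to_float("1 01111101 00000000000000000000000"))  # -0.25
print(fields_to_float("0 10000001 11001100000000000000000"))  # 7.1875
```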


Packed Decimal Format (BCD)

·         Floating point numbers may lose accuracy when converted to another base (e.g. decimal to binary and vice versa)

·         Many applications, especially business applications that deal with money, require full accuracy of the numbers

·         BCD satisfies the full accuracy objective

·         BCD in floating point is very similar to the BCD used to represent integer numbers

·         Many business-oriented high-level languages (e.g. COBOL) support the packed decimal format

·         Figure 5.8, page 138 shows the 128-bit packed decimal format used in IBM 370/390 and VAX computers

·         The format allows for 31 decimal digits (1 digit per 4 bits)

·         The least significant 4 bits are used for the sign (1100 for +, 1101 for -)

·         The location of the decimal point is not stored and must be maintained by the application program

 

Examples

·         Convert -150.54 (decimal) to IBM 370/390 BCD Floating Point format

o        Convert sign to BCD format                                       = 1101

o        Convert digit by digit to BCD format                        = 0001 0101 0000 0101 0100

o        Pad with zeros to fill the entire storage space         = 104 leading zeros are needed

o        Assemble the complete 128-bit value

= 0000 (… 104 zeros …) 000101010000010101001101
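
The packing procedure can be sketched as follows. The function name `pack_bcd` is hypothetical, and the sketch assumes the layout described above: one digit per 4 bits, the sign in the least significant 4 bits, leading zero padding, and a decimal point that is dropped because the application tracks its position.

```python
def pack_bcd(number: str, total_bits: int = 128) -> str:
    """Return the packed decimal bit string for a decimal literal such as '-150.54'."""
    sign = "1101" if number.startswith("-") else "1100"   # 1101 = minus, 1100 = plus
    # One 4-bit group per decimal digit; the point itself is not stored.
    nibbles = [f"{int(c):04b}" for c in number if c.isdigit()]
    nibbles.append(sign)                                   # sign goes last
    pad = total_bits // 4 - len(nibbles)
    if pad < 0:
        raise OverflowError("too many digits for the storage size")
    return "0000" * pad + "".join(nibbles)

bits = pack_bcd("-150.54")
print(len(bits))   # 128
print(bits[-24:])  # 000101010000010101001101
```

Every decimal digit is stored exactly, which is why this format avoids the base-conversion error of binary floating point.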


Overflow and Underflow Conditions

·         An overflow occurs when the magnitude of a number is too large to be stored

·         An underflow occurs when the magnitude of a number is too small (too close to zero) to be stored

·         See Figure 5.2 on page 115 for an illustration of overflow and underflow in floating point representation
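
Both conditions can be illustrated with a short Python sketch. The first part checks whether an exponent fits the two-digit Excess-50 field of the SEEMMMMM format; the function name `check_exponent` is an illustrative assumption. The second part shows the same effects in Python's built-in `float`, which is an IEEE 754 double-precision number.

```python
def check_exponent(exponent: int, n: int = 50, digits: int = 2) -> str:
    """Classify a sign-and-magnitude exponent for an Excess-N field of the given width."""
    stored = exponent + n
    if stored > 10**digits - 1:
        return "overflow"    # magnitude too large to store
    if stored < 0:
        return "underflow"   # magnitude too close to zero to store
    return "ok"

print(check_exponent(49))   # ok
print(check_exponent(50))   # overflow
print(check_exponent(-51))  # underflow

# In double precision, overflow produces infinity and underflow produces zero.
print(1e308 * 10 == float("inf"))  # True
print(5e-324 / 2 == 0.0)           # True
```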