Chapter 5 – Floating Point Numbers

         Floating point representation is used to represent real numbers (i.e. numbers with fractions)

         Floating point representation supports huge range with reasonable storage size

         Example: a 32-bit storage size can represent a number

o        As large as 10^45

o        As small as 10^-45

 

         Floating point representation suffers from the following main disadvantages:

o        Potential loss of precision due to the limited number of significant digits

o        Relatively large storage requirements

o        Slow calculations

 

         This chapter covers:

o        Review of exponential notation

o        Floating point representation in computers

o        Floating point calculations

o        IEEE 754 floating point standard

o        Packed Decimal Format (BCD)

o        Overflow and Underflow Conditions


Review of Exponential Notation

         Exponential notation or scientific notation is a conventional method for representing floating point numbers

         Exponential notation format consists of 6 components

o        Mantissa

o        Sign of the mantissa

o        Exponent

o        Sign of the exponent

o        Base of the exponent

o        Location of the fraction point

Example: -50.5 x 10^-20

 

         Fraction point position is flexible and can be adjusted without changing the number magnitude

         Moving the fraction point requires a matching adjustment to the exponent

o        For every move to the right, the exponent must be decremented

o        For every move to the left, the exponent must be incremented

 
Examples
The number -50.5 x 10^-20 can be represented as
-505. x 10^-21
-.505 x 10^-18
-.000505 x 10^-15

 

         However, shifting should not be arbitrary, since it may affect the precision of the number

         If we limit the mantissa to 5 digits, the last representation loses one significant digit (.000505 becomes .00050), which introduces an error
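
The points above can be checked with a short Python sketch. It uses the standard decimal module so that the arithmetic is exact. The helper names value and truncate_mantissa are made up for this illustration, and truncation (rather than rounding) of the extra digits is an assumption.

```python
from decimal import Decimal, ROUND_DOWN

def value(mantissa, exponent):
    """Return mantissa x 10^exponent as an exact Decimal."""
    return Decimal(mantissa).scaleb(exponent)

def truncate_mantissa(mantissa, digits=5):
    """Keep only `digits` digits after the fraction point (no rounding)."""
    step = Decimal(1).scaleb(-digits)          # 0.00001 for 5 digits
    return Decimal(mantissa).quantize(step, rounding=ROUND_DOWN)

# Four ways of writing the same number as (mantissa, exponent) pairs.
forms = [("-50.5", -20), ("-505", -21), ("-.505", -18), ("-.000505", -15)]
print({value(m, e) for m, e in forms})         # a set with a single element

print(truncate_mantissa("-.505"))              # all three digits survive
print(truncate_mantissa("-.000505"))           # the trailing 5 is lost
```

The first three forms fit in a 5-digit mantissa; only the last one pushes a significant digit off the end.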


Floating Point Representation in Computers

         Computers use a representation method very similar to exponential notation

         Binary is used instead of decimal

         Storage sizes of 32, 64, and 128 bits are typically used

         See Figure 5.4 on page 132 for a typical 32-bit floating point format

o        Leftmost bit is the mantissa sign

o        Followed by an 8-bit exponent

o        Followed by a 23-bit mantissa

o        The fraction point is implied to be at the beginning of the mantissa

o        Exponent is stored in excess-128 notation

o        Base of the exponent is implied as base 2

         Computers use many different proprietary and standard floating point representation methods

         The following standard is in common use and will be studied later in the chapter

o        IEEE 754 Floating Point representation standard

 

Floating Point Representation Review

 

         Assume the following decimal floating point format: SMMMMMMM              

(S = mantissa sign, M = mantissa, decimal point is implied at the end)

         The above format provides a range of: -9999999 to +9999999

         Let’s introduce 2 exponent (EE) digits in place of 2 mantissa digits: SSEEMMMM  (second S is the exponent sign)

         The above format provides a range of:  0001 x 10^-99 to 9999 x 10^99

         With this representation we have traded off 2 digits of precision to increase the range

         There exists a trade-off between Precision and Range

o        The more digits assigned for the mantissa, the higher the precision and lower the range

o        The more digits assigned for the exponent the higher the range and lower the precision

 

         Floating point formats have the following attributes

o        A number is assigned a storage space (i.e. fixed number of bits)

o        The storage space is divided into 4 parts

         Mantissa sign

         Mantissa

         Exponent sign

         Exponent

o        The following remaining parts are implied and hence do not need to be stored

         Exponent base

         Fraction point position

 

         There are a number of trade-offs that need to be considered when designing a floating point format:

o        Storage size

         Increase precision and range

         But also increase storage requirements

o        Base of the exponent

         Binary base provides low range capability but requires simple calculations

         Higher base (e.g. hex) provides high range but results in more complex calculations

o        Location of binary point

         Usually positioned at the beginning of number to provide maximum precision

o        Number of bits to use for the exponent

         The higher the number, the higher the range and the lower the precision, and vice versa

o        Number of bits to use for the mantissa

         The higher the number, the higher the precision and the lower the range, and vice versa

o        Method to handle the sign for the exponent

         Sign free representation is required

         2’s complement can be used but excess-N is more common

o        Method to handle the sign for the mantissa

         Sign free representation is not required

         Sign-and-magnitude is typically used

         2’s complement can also be used but less common

 

Example Floating Point Format

o        SEEMMMMM format

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the number

         This format provides a range as small as: .00001 x 10^-50 and as large as: .99999 x 10^+49
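
As a rough check of the range quoted above, the limits of the SEEMMMMM format can be computed from its parameters. This is only a sketch: the constant names N, EXP_DIGITS and MANT_DIGITS are made up for the illustration.

```python
from decimal import Decimal

N = 50            # excess value used for the exponent
EXP_DIGITS = 2    # EE
MANT_DIGITS = 5   # MMMMM

min_exp = 0 - N                        # stored exponent 00
max_exp = 10**EXP_DIGITS - 1 - N       # stored exponent 99

smallest_mantissa = Decimal(1).scaleb(-MANT_DIGITS)   # .00001
largest_mantissa = 1 - smallest_mantissa              # .99999

smallest = smallest_mantissa.scaleb(min_exp)
largest = largest_mantissa.scaleb(max_exp)
print(min_exp, max_exp)
print(smallest, largest)
```

Changing N moves the split between negative and positive exponents without changing how many exponent values are available.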

 

Excess-N

         One important consideration in floating point is

o        How to handle the sign of the exponent

o        2’s complement is an obvious solution

o        However, the Excess-N method is more commonly used

         N is a predefined value separating the positive range from the negative one

o        Value ≥ N is positive

o        Value < N is negative

         See Figure 5.1 on page 125 for Excess-50 representation

 

         Excess-N provides the following important advantages over 2’s complement

o        Simpler in calculation

o        More flexible as N can be adjusted to adjust the range of positives and negatives 

         The smaller the N the larger the positive range and the smaller the negative range

         The larger the N the smaller the positive range and the larger the negative range

 

Conversion from Excess–N to Sign-and-Magnitude

         Subtract N from the stored exponent

 

Examples

1. Convert 30 represented in Excess-50 to sign-and-magnitude representation

= 30 – 50 = -20

 

2. Convert 60 represented in Excess-50 to sign-and-magnitude representation

= 60 – 50 = 10

 

Conversion from Sign-and-Magnitude Notation to Excess–N

         Add N to the exponent

 

Examples

1. Convert –10 represented in sign-and-magnitude to Excess-50

= 50 + (–10) = 40

 

2. Convert 0 represented in sign-and-magnitude to Excess-50

= 50 + 0  = 50
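
The two conversions are simple enough to write as a pair of functions. The names from_excess and to_excess are hypothetical, chosen only for this sketch, and N defaults to 50 to match the examples above.

```python
def from_excess(stored, n=50):
    """Excess-N to sign-and-magnitude: subtract N from the stored value."""
    return stored - n

def to_excess(exponent, n=50):
    """Sign-and-magnitude to Excess-N: add N to the exponent."""
    return exponent + n

print(from_excess(30), from_excess(60))   # the two Excess-50 examples
print(to_excess(-10), to_excess(0))       # the two sign-and-magnitude examples
```

The two functions are inverses of each other, so converting in one direction and back returns the original exponent.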

 

Normalization and Formatting of Floating Point Numbers

         Normalization is the process of eliminating leading zeros from the mantissa

         The objective of normalization is to maximize precision given the number of digits limitation

         Normalization can only be performed if the exponent has enough range

 

Examples

1. Normalize .0003 x 10^20

= .3 x 10^17

 

2. Normalize .0003 x 10^-20

= .3 x 10^-23

 

3. Normalize .0003 x 10^-98 assuming 2 exponent digits

= .003 x 10^-99 (cannot be fully normalized: .3 x 10^-101 would need a 3-digit exponent)
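
A minimal sketch of normalization follows. The mantissa is passed as the string of digits after the fraction point. The function name normalize and the min_exp parameter are assumptions made for this illustration; stopping at the smallest available exponent is one reasonable way to handle the case where the exponent runs out of range.

```python
def normalize(digits, exponent, min_exp=-99):
    """Remove leading zeros from .digits x 10^exponent.

    Each zero removed moves the fraction point one place to the right,
    so the exponent is decremented. Stops when the exponent reaches
    min_exp, even if leading zeros remain.
    """
    while digits.startswith("0") and len(digits) > 1 and exponent > min_exp:
        digits = digits[1:]
        exponent -= 1
    return digits, exponent

print(normalize("0003", 20))
print(normalize("0003", -20))
print(normalize("0003", -98))   # runs out of exponent range after one step
```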

 

Converting from real number to Floating Point format

         The following steps provide the method to convert an integer or real number to floating point format:

  1. Convert the number to exponential notation format
  2. Place the fraction point to its proper position
  3. Normalize the number
  4. Convert exponent from sign-and-magnitude to Excess-N
  5. Store the number in the floating point format

 

Examples

Given the following floating point format

o        SEEMMMMM format

o        Use 0 for positive and 5 for negative

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the mantissa

 

         Convert 246.8035 into the above floating point format

1. Convert to exponent notation format   = 246.8035 x 10^0

2. Set decimal point to proper position     = .2468035 x 10^3

3. Normalize                                   already normalized

4. Convert exponent to Excess-N              = 50 + 3 = 53

5. Store in floating point format = 05324680

 

         Convert – .00000075 into the above floating point format

1. Convert to exponent notation format   = -.00000075 x 10^0

2. Set decimal point to proper position   already in proper position

3. Normalize                                   = -.75 x 10^-6

4.  Convert exponent to Excess-N             = 50 + (-6) = 44

5.  Store in floating point format                = 54475000

 

         Convert 1255 x 10^-3 into the above floating point format

1. Convert to exponent notation format   = 1255. x 10^-3

2. Set decimal point to proper position     = .1255 x 10^1

3. Normalize                                   already normalized

4.  Convert to Excess-N                               = 50 + 1 = 51

5.  Store in floating point format                = 05112550
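
The five steps can be sketched in Python for the SEEMMMMM format used in these examples. The function name encode is hypothetical. Extra mantissa digits are truncated, which matches the first example but is otherwise an assumption, and zero is rejected because the format as described has no normalized form for it.

```python
from decimal import Decimal

def encode(number, n=50, mant_digits=5):
    """Encode a decimal string into SEEMMMMM (0 = positive, 5 = negative)."""
    sign, digits, exp10 = Decimal(number).as_tuple()
    if not any(digits):
        raise ValueError("zero has no normalized form in this sketch")
    # Steps 1-3: the digits have no leading zeros, so the number equals
    # .digits x 10^exponent with the exponent below.
    exponent = len(digits) + exp10
    stored_exp = exponent + n                    # step 4: to Excess-N
    if not 0 <= stored_exp <= 99:
        raise OverflowError("exponent does not fit in two digits")
    mantissa = "".join(map(str, digits))[:mant_digits].ljust(mant_digits, "0")
    return ("5" if sign else "0") + f"{stored_exp:02d}" + mantissa   # step 5

for text in ("246.8035", "-.00000075", "1255E-3"):
    print(text, "->", encode(text))
```

The input is taken as a string so that no binary rounding happens before the decimal digits are examined.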

 

Converting from Floating Point format to real number

         The following steps provide the method to convert from floating point format to real number format

  1. Convert the mantissa sign digit to (+ or -)
  2. Convert from Excess-N to sign-and-magnitude
  3. Convert to exponential notation format
  4. Convert to real number format

 

Examples

         Assume the SEEMMMMM floating point format

 

         Convert 05324657 to real number

1. Convert the sign digit                                                             =  +

2. Convert from Excess-N to sign-and-magnitude = 53 – 50 = 3

3. Convert to exponential notation format                               = .24657 x 10^3

4.  Convert to real number format                                              = 246.57

 

         Convert 54810000 to real number

1. Convert the sign digit                                                             =  -

2. Convert from Excess-N to sign-and-magnitude                  = 48 – 50 = -2

3. Convert to exponential notation format                               = -.10000 x 10^-2

4.  Convert to real number format                                              = - .001

 

         Convert 05112550 to real number

1. Convert the sign digit                                                             =  +

2. Convert from Excess-N to sign-and-magnitude = 51 – 50 = 1

3. Convert to exponential notation format                               = .12550 x 10^1

4.  Convert to real number format                                              = 1.255
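
The reverse conversion can be sketched the same way. The function name decode is hypothetical, and the sketch assumes the sign digit is either 0 or 5 as defined earlier.

```python
from decimal import Decimal

def decode(word, n=50):
    """Decode an SEEMMMMM word into an exact Decimal value."""
    sign = -1 if word[0] == "5" else 1         # step 1: sign digit
    exponent = int(word[1:3]) - n              # step 2: Excess-N to signed
    mantissa = Decimal("0." + word[3:])        # step 3: .MMMMM x 10^exponent
    return sign * mantissa.scaleb(exponent)    # step 4: apply the exponent

for word in ("05324657", "54810000", "05112550"):
    print(word, "->", decode(word))
```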


Floating Point Calculations

         Floating point arithmetic is more complex and costly than integer arithmetic

         The exponent and the mantissa must be computed separately

 

Addition and Subtraction

         Addition/subtraction is done using the following method

  1. Align the exponents (shift the mantissa of the number with the smaller exponent until its exponent matches the larger one)
  2. Add/subtract the mantissas
  3. Place decimal point in proper position if necessary
  4. Store number in the floating point format

 

Examples

         Assume the SEEMMMMM floating point format

 

1. Add 05199520 + 04967850

1. Align the second number exponent      = 0 51 0067850

2. Add the mantissas                                   = .99520 + .0067850 = 1.0019850

3. Adjust decimal point                               = .10019850 and Exponent = 51 + 1 = 52

4.  Store in floating point format                = 05210020

 

2. Subtract 05199520 - 04967850

1. Align the second number exponent      = 0 51 0067850

2. Subtract the mantissas                            = .99520 – .0067850 = .9884150

3. Adjust decimal point                               already in proper position

4.  Store in floating point format                = 05198842
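
A sketch of the addition and subtraction method for SEEMMMMM words follows. The names split and fp_add are hypothetical. The result is rounded to 5 digits, as in the addition example above; rounding half up is an assumption, and a zero result is rejected because the format as described does not define one.

```python
from decimal import Decimal, ROUND_HALF_UP

FIVE_PLACES = Decimal("0.00001")

def split(word):
    """Return (signed mantissa, stored Excess-50 exponent) of a word."""
    sign = -1 if word[0] == "5" else 1
    return sign * Decimal("0." + word[3:]), int(word[1:3])

def fp_add(a, b, subtract=False):
    ma, ea = split(a)
    mb, eb = split(b)
    if subtract:
        mb = -mb
    # 1. Align both mantissas to the larger exponent.
    e = max(ea, eb)
    # 2. Add the aligned mantissas.
    m = ma.scaleb(ea - e) + mb.scaleb(eb - e)
    if m == 0:
        raise ValueError("zero result: not representable in this sketch")
    # 3. Move the decimal point so the first non-zero digit follows it.
    shift = m.adjusted() + 1
    m, e = m.scaleb(-shift), e + shift
    m = m.quantize(FIVE_PLACES, rounding=ROUND_HALF_UP)
    if abs(m) >= 1:              # rounding carried, e.g. .999996 -> 1.00000
        m, e = m.scaleb(-1).quantize(FIVE_PLACES), e + 1
    if not 0 <= e <= 99:
        raise OverflowError("exponent does not fit in two digits")
    # 4. Store in SEEMMMMM form.
    return ("5" if m < 0 else "0") + f"{e:02d}" + f"{abs(m):.5f}"[2:]

print(fp_add("05199520", "04967850"))
print(fp_add("05199520", "04967850", subtract=True))
```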

 

Multiplication

         Alignment is not necessary when performing multiplication

         Multiplication is done using the following method

1.        Multiply the two mantissas

2.        Add the two exponents and subtract N

3.        Normalize if necessary

4.        Store number in the floating point format

 

Examples

         Assume the SEEMMMMM floating point format

o        Excess-50

o        Base 10 exponent

o        Base 10 mantissa

o        Implied decimal point at the beginning of the number

 

         Multiply 05220000 x 04712500

1. Multiply the 2 mantissas                        = .20000 x .12500 = .02500

2. Compute exponent                                   = 52 + 47 – 50 = 49

3. Normalize result                                        = .25000 and adjust Exponent = 49 – 1 = 48

4. Store in floating point format = 04825000
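
The multiplication steps can be sketched as follows. The name fp_mul is hypothetical, and truncating the product to 5 digits is an assumption (the example above happens to be exact).

```python
from decimal import Decimal, ROUND_DOWN

def fp_mul(a, b, n=50):
    """Multiply two SEEMMMMM words (Excess-n exponent, mantissa .MMMMM)."""
    sign = "0" if a[0] == b[0] else "5"                  # like signs give +
    m = Decimal("0." + a[3:]) * Decimal("0." + b[3:])    # 1. multiply mantissas
    e = int(a[1:3]) + int(b[1:3]) - n                    # 2. add exponents, subtract N
    if m == 0:
        raise ValueError("zero result: not representable in this sketch")
    shift = m.adjusted() + 1                             # 3. normalize
    m, e = m.scaleb(-shift), e + shift
    if not 0 <= e <= 99:
        raise OverflowError("exponent does not fit in two digits")
    m = m.quantize(Decimal("0.00001"), rounding=ROUND_DOWN)
    return sign + f"{e:02d}" + f"{m:.5f}"[2:]            # 4. store

print(fp_mul("05220000", "04712500"))
```

N is subtracted once because each stored exponent already contains one copy of the excess; adding them would otherwise count it twice.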

 

Division

         Alignment is not necessary when performing division

         Division is done by

1.        Divide the two mantissas

2.        Subtract the second number's exponent from the first number's exponent and add N

3.        Place decimal point in proper position

4.        Store number in the floating point format

 

Examples

         Divide 05275000 ÷ 05025000

1. Divide the 2 mantissas                                            = .75000 ÷ .25000 = 3.00000

2. Compute exponent                                                   = 52 – 50 + 50 = 52

3. Place decimal point in proper position = 0.30000 and adjust Exponent = 52 + 1 = 53

4. Store in floating point format                 = 05330000


IEEE 754 Floating Point Standard

         IEEE has developed a standard for both 32-bit and 64-bit floating point representation

         The standard was targeted at personal computers (IBM-type PCs and the Apple Macintosh)

         Apple Macintosh also provides its own 80-bit format

         IEEE 754 defines a 32-bit format called single-precision floating point format

o        Leftmost bit is the mantissa sign (0 for positive and 1 for negative)

o        Followed by an 8-bit exponent

o        Followed by a 24-bit mantissa (23 stored bits plus an implied leading bit that is always assumed to be 1)

o        Exponent is represented using Excess-127, which gives an exponent range of: 2^-126 to 2^+127

Exponents 0 (2^-127) and 255 (2^+128) are reserved for special use

o        Implied exponent base is 2

o        Fraction point position is to the right of the leading mantissa bit

o        Special numbers (e.g. 0, ∞, very small non-normalized numbers, etc.) are supported

o        Supported precision is approximately 7 significant decimal digits

o        Allows for approximate range of 10^-45 to 10^+38

 

         IEEE 754 defines a 64-bit format called double-precision floating point format

o        It works similarly to the single-precision format

o        11 bits for exponent and 52 bits for mantissa

o        Supported precision is approximately 15 significant decimal digits

o        Allows for approximate range of 10^-300 to 10^+300
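
The limits quoted above can be checked on a real machine. The sketch below uses Python's standard struct and sys modules and assumes the interpreter stores floats as IEEE 754 doubles, which is the usual case. The helper name from_bits32 is made up for this illustration.

```python
import struct
import sys

def from_bits32(bits):
    """Interpret a 32-bit integer as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

largest = from_bits32(0x7F7FFFFF)    # exponent 254, all mantissa bits set
smallest = from_bits32(0x00000001)   # exponent 0: smallest non-normalized value

print(largest, smallest)             # single-precision range limits
print(sys.float_info.mant_dig)       # mantissa bits of a double, implied bit included
print(sys.float_info.dig)            # decimal digits a double reliably holds
```

The smallest value comes from the reserved exponent 0, which is why the low end of the range reaches below 2^-126.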

 

Convert Decimal Real Number to IEEE 754 Floating Point Format

         The following steps provide the method to convert a decimal real number to IEEE 754 Floating Point format:

o        Convert the decimal number to binary

o        Adjust binary point to proper position

o        Normalize the number

o        Convert exponent from sign-and-magnitude to Excess-127

o        Convert exponent to binary

o        Store the number in the floating point format

 

1. Convert 36.5 (base 10) to single-precision IEEE 754 floating point format

1. Convert to binary                                                     = 100100.1

2. Adjust binary point to proper position                                = 1.001001 x 2^5

3. Normalize                                                                     already normalized

4. Convert exponent to Excess-127                           = 127 + 5 = 132

5. Convert exponent to binary                                   = 10000100

6. Store in floating point format                 = 0 10000100 00100100000000000000000

 

2. Convert –0.25 to single-precision IEEE 754 floating point format

1. Convert to binary                                                     = -.01

2. Adjust binary point to proper position                                = -0.1 x 2^-1

3. Normalize                                                                   = -1.0 x 2^-2

4. Convert exponent to Excess-127                           = 127 - 2 = 125

5. Convert exponent to binary                                   = 01111101

6. Store in floating point format                            = 1 01111101 00000000000000000000000
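
The two hand conversions above can be cross-checked with Python's standard struct module, which packs a value into IEEE 754 single-precision form. This is a verification aid rather than a re-implementation of the manual steps, and the function name to_ieee754_single is made up for the sketch.

```python
import struct

def to_ieee754_single(x):
    """Return (sign, exponent, mantissa) bit strings of x as a single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return s[0], s[1:9], s[9:]       # 1 + 8 + 23 bits

print(to_ieee754_single(36.5))
print(to_ieee754_single(-0.25))
```

Only the 23 bits after the implied leading 1 appear in the mantissa field.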

 

Convert from IEEE 754 Floating Point Format to real number

         The following steps provide the method to convert IEEE 754 Floating Point to decimal real number:

o        Convert exponent from binary to decimal

o        Convert from Excess-127 to sign-and-magnitude

o        Convert to exponent notation

o        Remove exponent (if possible)

o        Convert from binary to decimal real number

 

1. Convert 1 01111101 00000000000000000000000 to decimal real number

1. Convert exponent to decimal                                 = 125

2. Convert Excess-127 to sign-and-magnitude        = 125 – 127 = -2

3. Convert to exponent notation                                = - 1.0 x 2^-2

4. Remove exponent                                                    = - 0.01

5. Convert to decimal real number                             = - 0.25

 

2. Convert 0 10000001 11001100000000000000000 to decimal real number

1. Convert exponent to decimal                                 = 129

2. Convert Excess-127 to sign-and-magnitude        = 129 – 127 = 2

3. Convert to exponent notation                                = 1.110011 x 2^2

4. Remove exponent                                                    = 111.0011

5. Convert to decimal real number                             = 7.1875
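
The decoding steps can be sketched directly from the three bit fields. The function name from_ieee754_single is hypothetical, and the sketch handles only normalized numbers: the reserved exponents 0 and 255 are rejected rather than interpreted.

```python
def from_ieee754_single(sign, exponent, fraction):
    """Decode normalized single-precision fields given as bit strings."""
    e = int(exponent, 2)                       # exponent field to decimal
    if e in (0, 255):
        raise ValueError("reserved exponent: special values not handled here")
    mantissa = 1 + int(fraction, 2) / 2**23    # restore the implied leading 1
    value = mantissa * 2.0 ** (e - 127)        # Excess-127 to signed exponent
    return -value if sign == "1" else value

print(from_ieee754_single("1", "01111101", "0" * 23))
print(from_ieee754_single("0", "10000001", "11001100000000000000000"))
```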


Packed Decimal Format (BCD)

         Floating point numbers may lose accuracy when converted to another base (e.g. decimal to binary and vice versa)

         Many applications, especially business applications that deal with money, require full accuracy of the numbers

         BCD satisfies the full accuracy objective

         BCD in floating point is very similar to the BCD used to represent integer numbers

         Many business-oriented high-level languages (e.g. COBOL) support the packed decimal format

         Figure 5.8 on page 138 shows the 128-bit packed decimal format used in IBM 370/390 and VAX computers

         The format allows for 31 decimal digits (1 digit per 4 bits)

         Least significant 4 bits are used for the sign (1100 for +, 1101 for -)

         The location of the decimal point is not stored and must be maintained by the application program

 

Examples

         Convert -150.54 (base 10) to IBM 370/390 BCD Floating Point format

o        Convert sign to BCD format                                       = 1101

o        Convert digit by digit to BCD format                        = 0001 0101 0000 0101 0100

o        Pad with zeros to fill the entire storage space         = 104 leading zeros are needed (128 – 4 sign bits – 20 digit bits)

o        Convert to BCD format

= 0000 … 0000 (104 zeros) 0001 0101 0000 0101 0100 1101
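
The packing procedure can be sketched as a small function that returns the 128 bits as a string. The name pack_decimal is hypothetical. As stated above, the decimal point is simply dropped, since its position is not stored and must be tracked by the application.

```python
def pack_decimal(number_text, total_bits=128):
    """Pack a decimal string into packed decimal with a trailing sign code."""
    sign = "1101" if number_text.startswith("-") else "1100"
    digits = [c for c in number_text if c.isdigit()]   # drops "-" and "."
    max_digits = total_bits // 4 - 1                   # 31 digits for 128 bits
    if len(digits) > max_digits:
        raise OverflowError("too many digits for the storage size")
    body = "".join(f"{int(c):04b}" for c in digits)    # 4 bits per digit
    return body.rjust(total_bits - 4, "0") + sign      # pad left, sign last

packed = pack_decimal("-150.54")
print(len(packed), packed[-24:])
```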


Overflow and Underflow Conditions

         An overflow occurs when the number is too large in magnitude to be stored

         An underflow occurs when the number is too small in magnitude (too close to zero) to be stored

         See Figure 5.2 on page 115 for illustration of overflow and underflow in floating point representation
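
Both conditions are easy to provoke in Python, assuming its floats are IEEE 754 doubles (the usual case). How they are reported varies by language and operation; in this sketch the multiplication overflows silently to infinity and the division underflows silently to zero.

```python
import sys

largest = sys.float_info.max     # largest finite double
overflowed = largest * 10        # too large to store: becomes infinity

tiny = 5e-324                    # smallest positive non-normalized double
underflowed = tiny / 4           # too close to zero: becomes 0.0

print(overflowed, underflowed)
```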