3.4 Rounding versus truncating

The right shift by itself is a truncation operator: the four least significant bits are simply discarded. This means that the value 1.11110000₂ and the value 1.11111111₂ are both truncated to the same represented value, 1.1111₂. This appears harmless, but it has a very bad effect “in the long run”.
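As a minimal sketch of this behavior, the following Python snippet stores each value as a plain integer holding its bit pattern (so 1.11110000₂ becomes `0b1_11110000`). The names `FRAC_BITS` and `truncate` are illustrative assumptions and do not come from the text.

```python
FRAC_BITS = 4  # number of low-order bits dropped by the shift, as in the text


def truncate(z: int) -> int:
    """Drop the four least significant bits with a plain right shift."""
    return z >> FRAC_BITS


a = 0b1_11110000  # bit pattern of 1.11110000 (8 fractional bits)
b = 0b1_11111111  # bit pattern of 1.11111111 (8 fractional bits)

# Both values collapse to the same pattern, 1.1111 (4 fractional bits).
print(bin(truncate(a)), bin(truncate(b)))
```

Both calls return the same bit pattern, even though `b` is almost one full step of the 4-bit format larger than `a`.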

If we carry out several multiplications in a row, or add the products of several terms, the effect of truncation becomes apparent. The computed values become smaller and smaller compared to the actual values. This is because truncation never rounds up, so every result is biased toward a smaller value.
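The bias can be observed with a small experiment. The sketch below assumes operands with four fractional bits, multiplies every pair of values from 1.0000₂ to 1.1111₂, and compares the sum of the truncated products with the exact sum. The choice of operands and the helper name `to_real` are assumptions made for illustration only.

```python
FRAC_BITS = 4  # fractional bits of each operand (assumed, as in the text)


def to_real(q: int, frac_bits: int = FRAC_BITS) -> float:
    """Convert a fixed-point bit pattern to the real number it represents."""
    return q / (1 << frac_bits)


operands = range(0b10000, 0b100000)  # 1.0000 through 1.1111

exact_sum = 0.0
truncated_sum = 0  # accumulated in the 4-fractional-bit format
for x in operands:
    for y in operands:
        product = x * y  # the raw product has 8 fractional bits
        exact_sum += to_real(product, 2 * FRAC_BITS)
        truncated_sum += product >> FRAC_BITS  # truncation drops the low 4 bits

# Each truncation error is zero or negative, so the errors cannot cancel out.
print(exact_sum, to_real(truncated_sum))
```

Because no individual product is ever rounded up, the truncated sum can only fall below the exact sum, and the gap grows with the number of terms.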

Let us rethink this problem. 1.11110000₂ should definitely be rounded to 1.1111₂. However, 1.11111111₂ is much closer to 10.0000₂ than to 1.1111₂, so it makes sense to round 1.11111111₂ to 10.0000₂. We can then generalize and say that we round a number to a less precise representation based on whether it is closer to the smaller representable value or to the larger one.

There is one problem left. What about 1.11111000₂? It is exactly halfway between 1.11110000₂ and 10.0000₂. What we need to consider in this case is: how many values are rounded to 1.1111₂, and how many values are rounded to 10.0000₂? The two numbers should be the same.

Because 1.11110000₂ is “rounded” to 1.1111₂, the values 1.11110000₂, …, 1.11110111₂ are all rounded to 1.1111₂. That makes 8 distinct values. It makes sense, then, to round 1.11111000₂ to 10.0000₂, so that the values 1.11111000₂, …, 1.11111111₂ (also 8 of them) are all rounded to 10.0000₂.
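The counting argument can be checked by enumeration. The sketch below walks through the sixteen bit patterns from 1.11110000₂ to 1.11111111₂ and rounds up whenever the four discarded bits are 1000₂ or larger. The function name `round_to_nearest` is an illustrative assumption.

```python
from collections import Counter

FRAC_BITS = 4        # bits to be discarded
HALF = 0b1000        # half of one step in the coarser format
LOW_MASK = 0b1111    # selects the four bits that will be discarded


def round_to_nearest(z: int) -> int:
    """Round to the coarser format, sending the halfway case upward."""
    kept = z >> FRAC_BITS
    if (z & LOW_MASK) >= HALF:
        kept += 1
    return kept


# Count how many of the 16 inputs land on each rounded result.
counts = Counter(
    round_to_nearest(z) for z in range(0b1_11110000, 0b1_11111111 + 1)
)
for result, count in sorted(counts.items()):
    print(bin(result), count)
```

Sending the halfway value 1.11111000₂ upward is what splits the sixteen inputs into two groups of eight.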

Rounding is not difficult: we only need to add 1000₂ to z = xy before the right shift operation. In other words, we want to make z = rs((xy + 1000₂),4).
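A minimal Python sketch of this formula follows. It assumes non-negative operands with four fractional bits, and it models `rs` as an ordinary right shift; `multiply_rounded` and `multiply_truncated` are hypothetical helper names introduced only for comparison.

```python
FRAC_BITS = 4
HALF = 0b1000  # the constant 1000 (binary) added before the shift


def rs(value: int, n: int) -> int:
    """Right shift by n bits (assumes a non-negative value)."""
    return value >> n


def multiply_truncated(x: int, y: int) -> int:
    """z = rs(xy, 4): the product with its low four bits discarded."""
    return rs(x * y, FRAC_BITS)


def multiply_rounded(x: int, y: int) -> int:
    """z = rs(xy + 1000 (binary), 4): the product rounded to nearest."""
    return rs(x * y + HALF, FRAC_BITS)


x = y = 0b10011  # 1.0011 in binary, i.e. 1.1875
print(bin(multiply_truncated(x, y)), bin(multiply_rounded(x, y)))

# The boundary cases discussed in the text:
print(bin(rs(0b1_11110111 + HALF, FRAC_BITS)))  # just below the halfway point
print(bin(rs(0b1_11111000 + HALF, FRAC_BITS)))  # exactly at the halfway point
```

Adding half of one step before the shift makes the carry do the rounding: inputs whose discarded bits are below 1000₂ are unaffected, while the rest are pushed into the next representable value.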
