Printing Floating-Point Numbers March 05, 2014
Part 1: Background
Part 2: Dragon4
Part 3: Formatting
Part 4: Results
References
If you're interested in more details on the subject, these are the primary set of papers.
If you're okay with using code with a less permissive license, these are the only correct implementations I've been able to find.
- dtoa.c - Written by David Gay (author of one of the above papers), this seems to be the accepted safe implementation. It has been iterated on since 1991 and is used in everything from Python to MySQL to the Nintendo Wii.
- double-conversion - This is an implementation of the Grisu algorithm mentioned above and is maintained by the author of the paper. It was written for the V8 JavaScript engine in Google Chrome. I think it is also used by Firefox among other large projects.
If you've found your way here, you might also enjoy these two sites.
- Random ASCII - Bruce Dawson's blog contains a ton of floating-point related articles that are easy reads.
- Exploring Binary - Rick Regan's blog is almost entirely floating-point related articles, and he often dissects conversion implementations, finding all sorts of bugs.
Background
Before we dig into the real algorithm, let's walk through a simple (and inaccurate) way of converting a positive floating-point number. If floating-point numbers could store infinite precision, this algorithm would work perfectly. Unfortunately, our computers can't represent values exactly, so consider this a small appetizer before Dragon4.
A number can be represented as a series of digits multiplied by powers of ten (as every programmer should know). For example, \(123.456\) equals \(1 \cdot 10^2 + 2 \cdot 10^1 + 3 \cdot 10^0 + 4 \cdot 10^{-1} + 5 \cdot 10^{-2} + 6 \cdot 10^{-3}\).
const int c_maxDigits = 256;

// input binary floating-point number to convert from
double value = 122.5; // the number to convert

// output decimal representation to convert to
char decimalDigits[c_maxDigits]; // buffer to put the decimal representation in
int numDigits = 0;          // this will be set to the number of digits in the buffer
int firstDigitExponent = 0; // this will be set to the base 10 exponent of the first
                            // digit in the buffer

// Compute the first digit's exponent in the equation digit*10^exponent
// (e.g. 122.5 would compute a 2 because its first digit is in the hundreds place)
firstDigitExponent = (int)floor( log10(value) );

// Scale the input value such that the first digit is in the ones place
// (e.g. 122.5 would become 1.225).
value = value / pow(10, firstDigitExponent);

// while there is a non-zero value to print and we have room in the buffer
while (value > 0.0 && numDigits < c_maxDigits)
{
    // Output the current digit.
    double digit = floor(value);
    decimalDigits[numDigits] = '0' + (char)digit; // convert to an ASCII character
    ++numDigits;

    // Compute the remainder by subtracting the current digit
    // (e.g. 1.225 would become 0.225)
    value -= digit;

    // Scale the next digit into the ones place.
    // (e.g. 0.225 would become 2.25)
    value *= 10.0;
}
So why is it inaccurate? The problem is that the loop keeps performing floating-point arithmetic on the value. If we run this program on enough inputs, we quickly find numbers that fail to print correctly.
The algorithm is also missing some necessary features.
- Precision requirements (e.g. print only 6 digits after the decimal point).
- If the number is truncated by a width or buffer limit, the final digit should be correctly rounded.
- We should be able to convert the number (with correct rounding) to a unique representation that can be converted exactly back from text.
Integers
To fix our implementation, we need to bring in some real math. The first step is deconstructing the floating-point number. As programmers, we should all know that an IEEE floating-point number is composed of sign bits, exponent bits and mantissa bits.
type | sign bits | exponent bits | mantissa bits |
32-bit float | 1 | 8 | 23 |
64-bit double | 1 | 11 | 52 |
Once again, this is still the beginner version of the algorithm, so we don't need to support negative numbers yet and can ignore the sign bit. Before we can do some clever math with logarithms, we need to understand how a value is built from those three parts.
There are four kinds of value that can be represented by the exponent and mantissa.

type | exponent | mantissa | usage |
Denormalized | 0 | anything | Denormalized numbers are values too small to be represented in normalized form. Their value is computed as \(\Large{\frac{mantissa}{2^{numMantissaBits}} \cdot 2^{1 + 1 - 2^{numExponentBits-1}}}\). For a 32-bit float this becomes \(\Large{\frac{mantissa}{2^{23}} \cdot 2^{1-127}}\). For a 64-bit double this becomes \(\Large{\frac{mantissa}{2^{52}} \cdot 2^{1-1023}}\). |
Normalized | greater than 0 and less than the maximum | anything | Most floating-point numbers are normalized. Their value is computed as \(\Large{(1 + \frac{mantissa}{2^{numMantissaBits}}) \cdot 2^{exponent + 1 - 2^{numExponentBits-1}}}\). For a 32-bit float this becomes \(\Large{(1 + \frac{mantissa}{2^{23}}) \cdot 2^{exponent - 127}}\). For a 64-bit double this becomes \(\Large{(1 + \frac{mantissa}{2^{52}}) \cdot 2^{exponent - 1023}}\). |
Infinity | the maximum (0xFF for a 32-bit float, 0x7FF for a 64-bit double) | 0 | Infinities result from out-of-range computations such as dividing by zero. |
NaN | the maximum (0xFF for a 32-bit float, 0x7FF for a 64-bit double) | greater than 0 | NaN means "not a number" and usually results from a computation whose result is not a real number, such as taking the square root of a negative value. |
Our conversion algorithm only needs to extract digits from normalized and denormalized numbers. As with the sign bit, handling infinities and NaNs is left to the calling function, so we won't deal with them here.
For both normalized and denormalized numbers, we now know how to compute the value of a 32-bit float or a 64-bit double. Using an unsigned integer valueMantissa for the mantissa and a signed integer valueExponent for the exponent, the value of any input equals \((valueMantissa \cdot 2^{valueExponent})\).
representation | derivation | valueMantissa | valueExponent |
32-bit denormalized | The floating-point equation is \(value = \Large{\frac{mantissa} {2^{23}} \cdot 2 ^ {1-127}}\). Factor a \(2^{23}\) out of the exponent and cancel out the denominator: \(value = \Large{\frac{mantissa} {2^{23}} \cdot 2^{23} \cdot 2 ^ {1-127-23}}\), so \(value = mantissa \cdot 2 ^ {1-127-23}\). | \(mantissa\) | \(-149\) |
32-bit normalized | The floating-point equation is \(value = \Large{(1 + \frac{mantissa} {2^{23}}) \cdot 2 ^ {exponent-127}}\). Factor a \(2^{23}\) out of the exponent and cancel out the denominator: \(value = \Large{(1 + \frac{mantissa} {2^{23}}) \cdot 2^{23} \cdot 2 ^ {exponent-127-23}}\), so \(value = (2^{23} + mantissa) \cdot 2 ^ {exponent-127-23}\). | \(2^{23} + mantissa\) | \(exponent - 150\) |
64-bit denormalized | The floating-point equation is \(value = \Large{\frac{mantissa} {2^{52}} \cdot 2 ^ {1-1023}}\). Factor a \(2^{52}\) out of the exponent and cancel out the denominator: \(value = \Large{\frac{mantissa} {2^{52}} \cdot 2^{52} \cdot 2 ^ {1-1023-52}}\), so \(value = mantissa \cdot 2 ^ {1-1023-52}\). | \(mantissa\) | \(-1074\) |
64-bit normalized | The floating-point equation is \(value = \Large{(1 + \frac{mantissa} {2^{52}}) \cdot 2 ^ {exponent-1023}}\). Factor a \(2^{52}\) out of the exponent and cancel out the denominator: \(value = \Large{(1 + \frac{mantissa} {2^{52}}) \cdot 2^{52} \cdot 2 ^ {exponent-1023-52}}\), so \(value = (2^{52} + mantissa) \cdot 2 ^ {exponent-1023-52}\). | \(2^{52} + mantissa\) | \(exponent - 1075\) |
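As a sanity check on these derivations, here is a small sketch (the helper name decode_double is mine) that reads a double's bits with memcpy and produces valueMantissa and valueExponent for the 64-bit rows of the table:

```c
#include <stdint.h>
#include <string.h>

// Decode a positive 64-bit double into value = valueMantissa * 2^valueExponent,
// following the table above. Infinities and NaNs are not handled here.
static void decode_double(double value, uint64_t *valueMantissa, int32_t *valueExponent)
{
    uint64_t bits;
    memcpy(&bits, &value, sizeof bits); // reinterpret the double's bits

    uint32_t exponentBits = (uint32_t)(bits >> 52) & 0x7FF;
    uint64_t mantissaBits = bits & 0x000FFFFFFFFFFFFFull;

    if (exponentBits == 0)
    {
        // denormalized: value = mantissa * 2^(-1074)
        *valueMantissa = mantissaBits;
        *valueExponent = -1074;
    }
    else
    {
        // normalized: value = (2^52 + mantissa) * 2^(exponent - 1075)
        *valueMantissa = (1ull << 52) | mantissaBits;
        *valueExponent = (int32_t)exponentBits - 1075;
    }
}
```

For 122.5 (stored as 0x405EA00000000000) this produces valueMantissa = 8620171161763840 and valueExponent = -46, the same constants used in the big-integer listing below.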
Big Integers
Now we can compute the value of any floating-point number as \((valueMantissa \cdot 2^{valueExponent})\). Let's rebuild our algorithm with big-integer arithmetic.

Representing a whole number such as 12 with an integer is easy, but representing a fractional number such as 12.3 is not. To handle the fractional part, we represent the value with two integers: a numerator and a denominator. For example, 12.3 can be represented by the numerator 123 and the denominator 10 because \(12.3 = \large{\frac{123}{10}}\). With this approach we can convert a binary floating-point number exactly, but we need very large integers to preserve the precision. 32-bit and 64-bit integers would overflow; in fact, some 64-bit double values require integers over 1000 bits wide. To handle numbers this large, we declare a tBigInt structure and write the arithmetic functions it needs. Assuming such a structure exists, we can rewrite the algorithm as follows.

Once more, I'll remind you that this version is still written for clarity rather than speed. Now the real math begins!
const int c_maxDigits = 256;

// input number to convert from: value = valueMantissa * 2^valueExponent
double value = 122.5; // in memory 0x405EA00000000000
                      // (sign=0x0, exponent=0x405, mantissa=0xEA00000000000)
uint64_t valueMantissa = 8620171161763840ull;
int32_t valueExponent = -46;

// output decimal representation to convert to
char decimalDigits[c_maxDigits]; // buffer to put the decimal representation in
int numDigits = 0;          // this will be set to the number of digits in the buffer
int firstDigitExponent = 0; // this will be set to the base 10 exponent of the first
                            // digit in the buffer

// Compute the first digit's exponent in the equation digit*10^exponent
// (e.g. 122.5 would compute a 2 because its first digit is in the hundreds place)
firstDigitExponent = (int)floor( log10(value) );

// We represent value as (valueNumerator / valueDenominator)
tBigInt valueNumerator;
tBigInt valueDenominator;
if (valueExponent > 0)
{
    // value = (mantissa * 2^exponent) / (1)
    valueNumerator = BigInt_ShiftLeft(valueMantissa, valueExponent);
    valueDenominator = 1;
}
else
{
    // value = (mantissa) / (2^(-exponent))
    valueNumerator = valueMantissa;
    valueDenominator = BigInt_ShiftLeft(1, -valueExponent);
}

// Scale the input value such that the first digit is in the ones place
// (e.g. 122.5 would become 1.225).
// value = value / pow(10, firstDigitExponent)
if (firstDigitExponent > 0)
{
    // The exponent is positive creating a division so we multiply up the denominator.
    valueDenominator = BigInt_Multiply( valueDenominator, BigInt_Pow(10, firstDigitExponent) );
}
else if (firstDigitExponent < 0)
{
    // The exponent is negative creating a multiplication so we multiply up the numerator.
    valueNumerator = BigInt_Multiply( valueNumerator, BigInt_Pow(10, -firstDigitExponent) );
}

// while there is a non-zero value to print and we have room in the buffer
while ( !BigInt_IsZero(valueNumerator) && numDigits < c_maxDigits )
{
    // Output the current digit.
    int digit = BigInt_DivideRoundDown(valueNumerator, valueDenominator);
    decimalDigits[numDigits] = '0' + (char)digit; // convert to an ASCII character
    ++numDigits;

    // Compute the remainder by subtracting the current digit
    // (e.g. 1.225 would become 0.225)
    valueNumerator = BigInt_Subtract( valueNumerator,
                                      BigInt_Multiply(digit, valueDenominator) );

    // Scale the next digit into the ones place.
    // (e.g. 0.225 would become 2.25)
    valueNumerator = BigInt_Multiply(valueNumerator, 10);
}
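Because every intermediate value for 122.5 happens to fit in 64 bits, the same loop can be sketched with plain uint64_t arithmetic standing in for tBigInt (the function name is mine; general inputs need true big integers, since the numerator can grow far past 2^64):

```c
#include <stdint.h>

// Convert 122.5 = 8620171161763840 * 2^-46 exactly, using a numerator and
// denominator small enough to fit in uint64_t. Returns the digit count.
static int print_122_5_exactly(char *digits, int maxDigits)
{
    uint64_t valueNumerator   = 8620171161763840ull; // valueMantissa
    uint64_t valueDenominator = 1ull << 46;          // 2^(-valueExponent)

    // scale by 10^firstDigitExponent (2) so the first digit is in the ones place
    valueDenominator *= 100;

    int numDigits = 0;
    while (valueNumerator != 0 && numDigits < maxDigits)
    {
        uint64_t digit = valueNumerator / valueDenominator; // round-down division
        digits[numDigits++] = (char)('0' + (int)digit);
        valueNumerator -= digit * valueDenominator; // exact remainder
        valueNumerator *= 10;                       // next digit into the ones place
    }
    return numDigits;
}
```

Unlike the floating-point version, this terminates after exactly four digits, 1225, because the integer remainder really does reach zero.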
Logarithms
In the example above, the only remaining floating-point computation is the log10() call used to find the exponent of the first digit (for scientific notation). This is a potential source of inaccurate conversions. In their Dragon4 implementation, Steele and White computed it by repeatedly dividing the high-precision value by ten. It is a simple and obviously correct approach, but also a very slow one. In the follow-up paper, Burger and Dybvig present two faster ways of estimating the result such that the estimate is at most one less than the correct value.

The first way still uses log10(), but adds a small epsilon so that floor() rounds to the correct value.

The second way computes the value with some clever math, and my implementation is based on it with a small optimization of my own. To compute the exponent of the first digit, we solve: \(firstDigitExponent = \lfloor log_{10} value \rfloor\)

Using the change of base formula, we can rewrite the base 10 logarithm in terms of base 2: \(log_{10} value = \Large{ \frac { log_{2} value } { log_{2} 10 } }\)

Using the change of base formula once more gives \( log_{2} 10 = \Large{ \frac { log_{10} 10 } { log_{10} 2 } = \frac { 1 } { log_{10} 2 } } \), and substituting back into the equation for \(log_{10} value\) removes the division: \(log_{10} value = log_{2} value \cdot log_{10} 2 \)

The equation to compute is now: \(firstDigitExponent = \lfloor log_{2} value \cdot log_{10} 2 \rfloor\)

Remember that \(value = valueMantissa \cdot 2^{valueExponent}\)? Substituting it in gives: \(log_{2} value = log_{2}(valueMantissa) + valueExponent\)

Because the mantissa is stored as a binary integer, the floor of its base 2 logarithm is the index of its highest set bit. Precisely: \( highestMantissaBitIndex = \lfloor log_{2}(valueMantissa) \rfloor \)

Because \(valueExponent\) is an integer, we get: \( \lfloor log_{2} value \rfloor = \lfloor log_{2}(valueMantissa) \rfloor + valueExponent = highestMantissaBitIndex + valueExponent\)

And because \(highestMantissaBitIndex\) is also an integer, we get the inequality:

\( log_{2}(value) - 1 < highestMantissaBitIndex + valueExponent \leq log_{2}(value)\)

Multiplying the inequality by \(log_{10} 2\) gives:

\( log_{2}(value) \cdot log_{10} 2 - log_{10} 2 < log_{10} 2 \cdot (highestMantissaBitIndex + valueExponent) \leq log_{2}(value) \cdot log_{10} 2\) \( log_{10}(value) - log_{10} 2 < log_{10} 2 \cdot (highestMantissaBitIndex + valueExponent) \leq log_{10}(value)\)

Because we are doing this computation with floating-point numbers, we need to account for rounding error when multiplying by \(log_{10} 2\). The error could push us slightly past the upper bound, so we subtract a small epsilon (e.g. 0.00001): \( log_{10}(value) - log_{10} 2 - epsilon < log_{10} 2 \cdot (highestMantissaBitIndex + valueExponent) - epsilon < log_{10}(value)\)

Finally, we arrive at our approximation of the result:

\(log_{10}(value) \approx log_{10} 2 \cdot (highestMantissaBitIndex + valueExponent) - epsilon\)

From the inequality, the estimate is within \(log_{10} 2 + epsilon\) of the correct result. Evaluating \(log_{10} 2\) gives approximately 0.301, which is great because even with the epsilon our estimation error is less than one. Since what we actually want is \( \lfloor log_{10}(value) \rfloor \), rounding the estimate down also leaves an error smaller than one: the estimate is either correct or exactly one too small. The image below helps visualize the error range and how an estimate that is too small differs from the correct result by one.

After dividing by \(10^{firstDigitExponent}\), we can check whether the estimate was correct and fix it if it wasn't. If the result represented by our two integers is still ten or larger (i.e. the numerator is at least ten times the denominator), the estimate was too small, so we multiply the denominator by ten and increment the estimated exponent. When we get to writing the actual Dragon4 algorithm, you will see that it simplifies the computation to calculate \(1 + \lfloor log_{10} v \rfloor\) instead. We therefore round the estimate up rather than down.

This rounding approach comes from Burger and Dybvig's paper. As shown above, our current error range is about (-0.301, 0]. By subtracting \(1.0 - log_{10} 2 - epsilon\), we can shift our error bounds lower. Explicitly, I subtract 0.69 from the estimate which moves the error range from about (-0.301, 0] to about (-0.991, -0.69].
The final inequality is: \( \lfloor log_{10}(value) \rfloor \leq \lceil log_{10} 2 \cdot (highestMantissaBitIndex + valueExponent) - 0.69 \rceil \leq \lfloor log_{10}(value) \rfloor + 1\)

In other words, the rounded-up estimate is either the desired \(1 + \lfloor log_{10}(value) \rfloor\) or one less than it.
Minimal Output and Rounding

As it stands, our prototype algorithm will print a number until all precision is exhausted (i.e. when the remainder reaches zero). Because there are discrete deltas between each representable number, we don't need to print that many digits to uniquely identify a number. Thus, we are going to add support for printing the minimal number of digits needed to identify a number from its peers. For example, let's consider the following three adjacent 32-bit floating-point values.
| binary | exact decimal representation |
\(\large{v_{-1}}\) | 0x4123c28e | 10.2349987030029296875 |
\(\large{v_0}\) | 0x4123c28f | 10.23499965667724609375 |
\(\large{v_1}\) | 0x4123c290 | 10.2350006103515625 |
There is a margin of 0.00000095367431640625 between each of these numbers. For any value, all numbers within the exclusive range \(\large{(value - \frac{margin}{2}, value + \frac{margin}{2})}\) can uniquely identify it. For the above value, \(\large{v_0}\), we can uniquely identify it with the shorter number \(\large{v_{0s}} = 10.235\) because it is closer to \(\large{v_0}\) than it is to either \(\large{v_{-1}}\) or \(\large{v_1}\).
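We can verify that claim with the standard library, assuming a correctly rounding strtof (the helper names are mine): parsing the shortened string "10.235" recovers exactly the bits of \(\large{v_0}\).

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Return the raw bit pattern of a 32-bit float.
static uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Parse a decimal string into a float and return its bit pattern.
static uint32_t parse_bits(const char *text)
{
    return float_bits(strtof(text, NULL));
}
```

Both the shortened string "10.235" and the exact decimal representation of \(\large{v_0}\) parse back to the bit pattern 0x4123c28f, because 10.235 is closer to \(\large{v_0}\) than to its neighbors.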
Solving this problem is what lets the Dragon4 algorithm output "pretty" representations of floating-point numbers.
To do it, we can track the margin as a big integer with the same denominator as the value. Every time we advance to the next digit by multiplying the numerator by 10, we will also multiply the margin by 10. Once the next digit to print can be safely rounded up or down without leaving the bounds of the margin, we are far enough from the neighboring values to print a rounded digit and stop.
In this image, we can see how the algorithm decides to terminate on the shortest decimal representation (the chosen numbers don't correspond to any real world floating-point format and are just for illustration).
- A: We start off with an input value that has the exact representation 7.625. The margin between each value is 0.25 which means that the set of values within 0.125 units below or above the input value can be safely rounded to it. This plus or minus 0.125 range has been highlighted by the box around the end of the value bar.
- B: We divide our value by 10 to get the first printed digit, 7. We then set the value to the division's remainder: 0.625. The low end of the margin box is greater than zero and the high end is less than one. This means that if we were to stop printing now and round the current digit to 7 or 8, we would print a number closer to another value than it is to 7.625. Because neither the rounded-down nor the rounded-up digit is contained by our margin box, we continue.
- C: To get our next output digit in the ones place, we multiply our value by 10 giving us 6.25. We also multiply our margin by 10 giving us a distance of 2.5 between numbers. The safe rounding distance of plus or minus half the margin is 1.25.
- D: We divide our value by 10 to get the second printed digit, 6. We then set the value to the division's remainder: 0.25. The low end of the margin box is now less than zero and the high end is greater than one. This means we can safely stop printing and output the current digit as a 6 or 7 because they will both uniquely identify our number. By checking if the remainder of 0.25 is below or above 0.5, we can decide to round down or up (we'll discuss ties shortly). In this case, we round down and the shortest output for 7.625 is "7.6".
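The four steps above can be reproduced with exact integer fractions (a sketch with names of my own): 7.625 is held as 7625/1000 and the margin of 0.25 as 250/1000.

```c
#include <stdint.h>

// Emit the shortest digit string for 7.625 given a margin of 0.25 between
// representable values. Value and margin share the denominator 1000.
static int shortest_digits_7_625(char *digits)
{
    uint64_t numerator   = 7625; // value = numerator / denominator
    uint64_t denominator = 1000;
    uint64_t margin      = 250;  // distance to the neighboring values

    int numDigits = 0;
    for (;;)
    {
        uint64_t digit = numerator / denominator;
        uint64_t rem   = numerator % denominator;

        // Steps B/D: can we stop here and round this digit down (or up)
        // while staying within margin/2 of the exact value?
        int canRoundDown = (2 * rem < margin);                   // rem < margin/2
        int canRoundUp   = (2 * rem + margin > 2 * denominator); // rem + margin/2 > 1
        if (canRoundDown || canRoundUp)
        {
            // choose the final digit by comparing the remainder against one half
            if (canRoundUp && (!canRoundDown || 2 * rem >= denominator))
                ++digit;
            digits[numDigits++] = (char)('0' + (int)digit);
            return numDigits;
        }

        digits[numDigits++] = (char)('0' + (int)digit);
        numerator = rem * 10; // step C: shift the next digit into the ones place
        margin   *= 10;       // the margin scales along with the value
    }
}
```

With an implied first-digit exponent of 0, the two digits produced, 7 and 6, print as "7.6", matching step D.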
Unequal Margins

Unfortunately, the IEEE floating-point format adds an extra wrinkle we need to deal with. As the exponent increases, the margin between adjacent numbers also increases. This is why floating-point values get less accurate as they get farther from zero. For normalized values, every time the mantissa rolls over to zero, the exponent increases by one. When this happens, the higher margin is twice the size of the lower margin and we need to account for this special case when testing the margin bounds in code.
To understand this better, I find it helpful to make a small floating-point format where we can examine every value. The tables below show every positive floating-point value for a 6-bit format that contains 1 sign bit, 3 exponent bits and 2 mantissa bits.
Positive Denormalized Numbers

\(value = \Large{ \frac{mantissa}{2^2} \cdot 2^{1+1-2^{3-1}} = \frac{mantissa}{4} \cdot 2^{1-3}}\)

Here we can see our format's minimal margin of 0.0625 between each number. No smaller delta can be represented.

binary | exponent | mantissa | equation | exact value | minimal value |
0 000 00 | 0 | 0 | (0/4)*2^(-2) | 0 | 0 |
0 000 01 | 0 | 1 | (1/4)*2^(-2) | 0.0625 | 0.06 |
0 000 10 | 0 | 2 | (2/4)*2^(-2) | 0.125 | 0.1 |
0 000 11 | 0 | 3 | (3/4)*2^(-2) | 0.1875 | 0.2 |
Positive Normalized Numbers

\(value = \Large{ (1 + \frac{mantissa}{2^2}) \cdot 2^{exponent+1-2^{3-1}} = (1 + \frac{mantissa}{4}) \cdot 2^{exponent-3}}\)

Here we can see our format's lowest exponent has the minimal margin of 0.0625 between each number. Each time the exponent increases, so does our margin by a factor of two, and for our largest values the margin is 2.0.

binary | exponent | mantissa | equation | exact value | minimal value |
0 001 00 | 1 | 0 | (1 + 0/4)*2^(-2) | 0.25 | 0.25 |
0 001 01 | 1 | 1 | (1 + 1/4)*2^(-2) | 0.3125 | 0.3 |
0 001 10 | 1 | 2 | (1 + 2/4)*2^(-2) | 0.375 | 0.4 |
0 001 11 | 1 | 3 | (1 + 3/4)*2^(-2) | 0.4375 | 0.44 |
0 010 00 | 2 | 0 | (1 + 0/4)*2^(-1) | 0.5 | 0.5 |
0 010 01 | 2 | 1 | (1 + 1/4)*2^(-1) | 0.625 | 0.6 |
0 010 10 | 2 | 2 | (1 + 2/4)*2^(-1) | 0.75 | 0.8 |
0 010 11 | 2 | 3 | (1 + 3/4)*2^(-1) | 0.875 | 0.9 |
0 011 00 | 3 | 0 | (1 + 0/4)*2^(0) | 1 | 1 |
0 011 01 | 3 | 1 | (1 + 1/4)*2^(0) | 1.25 | 1.2 |
0 011 10 | 3 | 2 | (1 + 2/4)*2^(0) | 1.5 | 1.5 |
0 011 11 | 3 | 3 | (1 + 3/4)*2^(0) | 1.75 | 1.8 |
0 100 00 | 4 | 0 | (1 + 0/4)*2^(1) | 2 | 2 |
0 100 01 | 4 | 1 | (1 + 1/4)*2^(1) | 2.5 | 2.5 |
0 100 10 | 4 | 2 | (1 + 2/4)*2^(1) | 3 | 3 |
0 100 11 | 4 | 3 | (1 + 3/4)*2^(1) | 3.5 | 3.5 |
0 101 00 | 5 | 0 | (1 + 0/4)*2^(2) | 4 | 4 |
0 101 01 | 5 | 1 | (1 + 1/4)*2^(2) | 5 | 5 |
0 101 10 | 5 | 2 | (1 + 2/4)*2^(2) | 6 | 6 |
0 101 11 | 5 | 3 | (1 + 3/4)*2^(2) | 7 | 7 |
0 110 00 | 6 | 0 | (1 + 0/4)*2^(3) | 8 | 8 |
0 110 01 | 6 | 1 | (1 + 1/4)*2^(3) | 10 | 10 |
0 110 10 | 6 | 2 | (1 + 2/4)*2^(3) | 12 | 11 |
0 110 11 | 6 | 3 | (1 + 3/4)*2^(3) | 14 | 12 |
Positive Infinity

binary | exponent | mantissa | value |
0 111 00 | 7 | 0 | Inf |

Positive NaNs

binary | exponent | mantissa | value |
0 111 01 | 7 | 1 | NaN |
0 111 10 | 7 | 2 | NaN |
0 111 11 | 7 | 3 | NaN |
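The exact-value columns of these tables can be reproduced with a few lines of C (toy_value is my own helper; its argument is the low 5 bits of the encoding, without the sign bit, and the Inf/NaN exponent of 7 is not handled):

```c
#include <math.h>

// Exact value of a positive encoding in the toy 6-bit format
// (3 exponent bits, 2 mantissa bits), following the equations above.
// Encodings with exponent 7 (Inf/NaN) are not handled.
static double toy_value(unsigned bits)
{
    unsigned exponent = (bits >> 2) & 0x7;
    unsigned mantissa = bits & 0x3;
    if (exponent == 0)
        return (mantissa / 4.0) * 0.25; // denormalized: (m/4) * 2^(-2)
    return (1.0 + mantissa / 4.0) * ldexp(1.0, (int)exponent - 3); // normalized
}
```

toy_value(1) (encoding 0 000 01) gives the minimal margin of 0.0625, while the gap between toy_value(25) and toy_value(24) (the encodings for 10 and 8) shows the margin growing to 2.0 at the top of the range.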
As far as I can tell, the original Dragon4 paper by Steele and White actually computes the low margin wrong for the lowest normalized value because it divides it in half. It should actually remain the same because the transition from denormalized to normalized numbers does not change the exponent. This error is not present in the follow-up paper by Burger and Dybvig.
Breaking Ties

When deciding how to round off the final printed digit, it is possible for the exact value to lie directly between the lower digit and the higher digit. In this case, a rounding rule needs to be decided. You could round towards zero, away from zero, towards negative infinity, etc. In my implementation, I always round to the even digit because it is similar to the rule used by IEEE floating-point operations.
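A sketch of that rule (the names are mine): given the candidate final digit and the remaining fraction remainder/denominator, compare against one half and break exact ties toward the even digit.

```c
#include <stdint.h>

// Round off the final printed digit. On an exact tie (remainder == 1/2 of the
// denominator), pick the even digit, mirroring IEEE round-half-to-even.
static int round_final_digit(int digit, uint64_t remainder, uint64_t denominator)
{
    if (2 * remainder > denominator) return digit + 1; // closer to the higher digit
    if (2 * remainder < denominator) return digit;     // closer to the lower digit
    return (digit % 2 == 0) ? digit : digit + 1;       // tie: round to even
}
```

For example, a remainder of exactly one half bumps a trailing 7 up to 8 but leaves a trailing 6 alone.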
In the reverse algorithm of converting decimal strings into binary floating-point, you can hit a similar case where the string representation lies exactly between two representable numbers. Once again, a rounding rule must be chosen. If you know the rule used by both algorithms, there are times when you can print a unique representation of a number with one less digit.
For example, consider how the digit printing algorithm tests whether half the high margin encompasses a digit to round up to. I tested if the remaining value was greater than one. If we knew how the reverse algorithm would handle ties, we might be able to test for greater than or equal to one instead. David Gay's dtoa implementation seems to do this; therefore it should always be run in conjunction with his strtod implementation.
Personally, I prefer printing extra digits until the rounded number is inclusively contained within the margin. This way the output is guaranteed to uniquely identify the number regardless of the tie-breaking algorithm chosen by a strtod parser.
Unfortunately, this is only guaranteed for a proper strtod implementation. The implementation that ships with Microsoft Visual C++ 2010 Express sometimes has an off-by-one error when rounding to the closest value. For example, the input "6.439804741657803e-031" should produce the 64-bit double 0x39AA1F79C0000000, but instead produces 0x39AA1F79BFFFFFFF.
I can't say for certain where Microsoft's bug comes from, but I can say that the problem of reading a decimal string is more difficult than printing one, so it's more likely to have bugs. Hopefully I can find the time to try my hand at it in the near future.
Coding Dragon4

Next, we will program an optimized implementation of the Dragon4 algorithm. Click here to read part 2.