Intro
This describes an adaptive, stable, natural mergesort, modestly called
timsort (hey, I earned it <wink>). It has supernatural performance on many
kinds of partially ordered arrays (less than lg(N!) comparisons needed, and
as few as N-1), yet as fast as Python's previous highly tuned samplesort
hybrid on random arrays.
In a nutshell, the main routine marches over the array once, left to right,
alternately identifying the next run, then merging it into the previous
runs "intelligently". Everything else is complication for speed, and some
hard-won measure of memory efficiency.
Comparison with Python's Samplesort Hybrid
- timsort can require a temp array containing as many as N//2 pointers,
which means as many as 2*N extra bytes on 32-bit boxes. It can be
expected to require a temp array this large when sorting random data; on
data with significant structure, it may get away without using any extra
heap memory. This appears to be the strongest argument against it, but
compared to the size of an object, 2 temp bytes worst-case (also expected-
case for random data) doesn't scare me much.
It turns out that Perl is moving to a stable mergesort, and the code for
that appears always to require a temp array with room for at least N
pointers. (Note that I wouldn't want to do that even if space weren't an
issue; I believe its efforts at memory frugality also save timsort
significant pointer-copying costs, and allow it to have a smaller working
set.)
- Across about four hours of generating random arrays, and sorting them
under both methods, samplesort required about 1.5% more comparisons
(the program is at the end of this file).
- In real life, this may be faster or slower on random arrays than
samplesort was, depending on platform quirks. Since it does fewer
comparisons on average, it can be expected to do better the more
expensive a comparison function is. OTOH, it does more data movement
(pointer copying) than samplesort, and that may negate its small
comparison advantage (depending on platform quirks) unless comparison
is very expensive.
- On arrays with many kinds of pre-existing order, this blows samplesort out
of the water. It's significantly faster than samplesort even on some
cases samplesort was special-casing the snot out of. I believe that lists
very often do have exploitable partial order in real life, and this is the
strongest argument in favor of timsort (indeed, samplesort's special cases
for extreme partial order are appreciated by real users, and timsort goes
much deeper than those, in particular naturally covering every case where
someone has suggested "and it would be cool if list.sort() had a special
case for this too ... and for that ...").
- Here are exact comparison counts across all the tests in sortperf.py,
when run with arguments "15 20 1".
First the trivial cases, trivial for samplesort because it special-cased
them, and trivial for timsort because it naturally works on runs. Within
an "n" block, the first line gives the # of compares done by samplesort,
the second line by timsort, and the third line is the percentage by
which the samplesort count exceeds the timsort count:
      n   \sort   /sort   =sort
-------  ------  ------  ------
  32768   32768   32767   32767  samplesort
          32767   32767   32767  timsort
          0.00%   0.00%   0.00%  (samplesort - timsort) / timsort

  65536   65536   65535   65535
          65535   65535   65535
          0.00%   0.00%   0.00%

 131072  131072  131071  131071
         131071  131071  131071
          0.00%   0.00%   0.00%

 262144  262144  262143  262143
         262143  262143  262143
          0.00%   0.00%   0.00%

 524288  524288  524287  524287
         524287  524287  524287
          0.00%   0.00%   0.00%

1048576 1048576 1048575 1048575
        1048575 1048575 1048575
          0.00%   0.00%   0.00%
The algorithms are effectively identical in these cases, except that
timsort does one less compare in \sort.
Now for the more interesting cases. lg(n!) is the information-theoretic
limit for the best any comparison-based sorting algorithm can do on
average (across all permutations). When a method gets significantly
below that, it's either astronomically lucky, or is finding exploitable
structure in the data.
      n    lg(n!)     *sort     3sort    +sort    ~sort     !sort
-------  --------  --------  --------  -------  -------  --------
  32768    444255    453084    453604    32908   130484    469132  samplesort
                     449235     33019    33016   188720     65534  timsort
                      0.86%  1273.77%   -0.33%  -30.86%   615.86%  %ch from timsort

  65536    954037    973111    970464    65686   260019   1004597
                     963924     65767    65802   377634    131070
                      0.95%  1375.61%   -0.18%  -31.15%   666.46%

 131072   2039137   2100019   2102708   131232   555035   2161268
                    2058863    131422   131363   755476    262142
                      2.00%  1499.97%   -0.10%  -26.53%   724.46%

 262144   4340409   4461471   4442796   262314  1107826   4584316
                    4380148    262446   262466  1511174    524286
                      1.86%  1592.84%   -0.06%  -26.69%   774.39%

 524288   9205096   9448146   9368681   524468  2218562   9691553
                    9285454    524576   524626  3022584   1048574
                      1.75%  1685.95%   -0.03%  -26.60%   824.26%

1048576  19458756  19950541  20307955  1048766  4430616  20433371
                   19621100   1048854  1048933  6045418   2097150
                      1.68%  1836.20%   -0.02%  -26.71%   874.34%
Discussion of cases:
*sort: There's no structure in random data to exploit, so the theoretical
limit is lg(n!). Both methods get close to that, and timsort is hugging
it (indeed, in a *marginal* sense, it's a spectacular improvement --
there's only about 1% left before hitting the wall, and timsort knows
darned well it's doing compares that won't pay on random data -- but so
does the samplesort hybrid). For contrast, Hoare's original random-pivot
quicksort does about 39% more compares than the limit, and the median-of-3
variant about 19% more.
3sort and !sort: No contest; there's structure in this data, but not of
the specific kinds samplesort special-cases. Note that structure in !sort
wasn't put there on purpose -- it was crafted as a worst case for a
previous quicksort implementation. That timsort nails it came as a
surprise to me (although it's obvious in retrospect).
+sort: samplesort special-cases this data, and does a few less compares
than timsort. However, timsort runs this case significantly faster on all
boxes we have timings for, because timsort is in the business of merging
runs efficiently, while samplesort does much more data movement in this
(for it) special case.
~sort: samplesort's special cases for large masses of equal elements are
extremely effective on ~sort's specific data pattern, and timsort just
isn't going to get close to that, despite that it's clearly getting a
great deal of benefit out of the duplicates (the # of compares is much less
than lg(n!)). ~sort has a perfectly uniform distribution of just 4
distinct values, and as the distribution gets more skewed, samplesort's
equal-element gimmicks become less effective, while timsort's adaptive
strategies find more to exploit; in a database supplied by Kevin Altis, a
sort on its highly skewed "on which stock exchange does this company's
stock trade?" field ran over twice as fast under timsort.
However, despite that timsort does many more comparisons on ~sort, and
that on several platforms ~sort runs highly significantly slower under
timsort, on other platforms ~sort runs highly significantly faster under
timsort. No other kind of data has shown this wild x-platform behavior,
and we don't have an explanation for it. The only thing I can think of
that could transform what "should be" highly significant slowdowns into
highly significant speedups on some boxes are catastrophic cache effects
in samplesort.
But timsort "should be" slower than samplesort on ~sort, so it's hard
to count that it isn't on some boxes as a strike against it.
A detailed description of timsort follows.
Runs
count_run() returns the # of elements in the next run. A run is either
"ascending", which means non-decreasing:
a0 <= a1 <= a2 <= ...
or "descending", which means strictly decreasing:
a0 > a1 > a2 > ...
Note that a run is always at least 2 long, unless we start at the array's
last element.
The definition of descending is strict, because the main routine reverses
a descending run in-place, transforming a descending run into an ascending
run. Reversal is done via the obvious fast "swap elements starting at each
end, and converge at the middle" method, and that can violate stability if
the slice contains any equal elements. Using a strict definition of
descending ensures that a descending run contains distinct elements.
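As an illustration, here is a rough Python sketch of what count_run() computes
and how a descending run gets reversed. The real code is C inside listobject.c
and works on the list's element pointers directly, so the names and signatures
below are only for exposition.

    def count_run(a, lo, hi):
        # Length of the run beginning at a[lo], plus whether it's descending.
        # hi is one past the last usable index; a run is at least 2 long
        # unless lo is the array's last element.
        n = lo + 1
        if n == hi:
            return 1, False
        descending = a[n] < a[lo]      # strictly less: a descending run starts
        n += 1
        if descending:
            while n < hi and a[n] < a[n - 1]:
                n += 1
        else:
            while n < hi and a[n] >= a[n - 1]:
                n += 1
        return n - lo, descending

    def reverse_slice(a, lo, hi):
        # Swap from both ends toward the middle.  This would break stability
        # if the slice could contain equal elements, which is why
        # "descending" is defined strictly.
        hi -= 1
        while lo < hi:
            a[lo], a[hi] = a[hi], a[lo]
            lo += 1
            hi -= 1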
If an array is random, it's very unlikely we'll see long runs. If a natural
run contains less than minrun elements (see next section), the main loop
artificially boosts it to minrun elements, via a stable binary insertion sort
applied to the right number of array elements following the short natural
run. In a random array, all runs are likely to be minrun long as a
result. This has two primary good effects:
- Random data strongly tends then toward perfectly balanced (both runs have
the same length) merges, which is the most efficient way to proceed when
data is random.

- Because runs are never very short, the rest of the code doesn't make
heroic efforts to shave a few cycles off per-merge overheads. For
example, reasonable use of function calls is made, rather than trying to
inline everything. Since there are no more than N/minrun runs to begin
with, a few "extra" function calls per merge is barely measurable.
Computing minrun
If N < 64, minrun is N. IOW, binary insertion sort is used for the whole
array then; it's hard to beat that given the overheads of trying something
fancier.
When N is a power of 2, testing on random data showed that minrun values of
16, 32, 64 and 128 worked about equally well. At 256 the data-movement cost
in binary insertion sort clearly hurt, and at 8 the increase in the number
of function calls clearly hurt. Picking some power of 2 is important
here, so that the merges end up perfectly balanced (see next section). We
pick 32 as a good value in the sweet range; picking a value at the low end
allows the adaptive gimmicks more opportunity to exploit shorter natural
runs.
Because sortperf.py only tries powers of 2, it took a long time to notice
that 32 isn't a good choice for the general case! Consider N=2112:
>>> divmod(2112, 32)
(66, 0)
If the data is randomly ordered, we're very likely to end up with 66 runs
each of length 32. The first 64 of these trigger a sequence of perfectly
balanced merges (see next section), leaving runs of lengths 2048 and 64 to
merge at the end. The adaptive gimmicks can do that with fewer than 2048+64
compares, but it's still more compares than necessary, and-- mergesort's
bugaboo relative to samplesort --a lot more data movement (O(N) copies just
to get 64 elements into place).
If we take minrun=33 in this case, then we're very likely to end up with 64
runs each of length 33, and then all merges are perfectly balanced. Better!
What we want to avoid is picking minrun such that in
q, r = divmod(N, minrun)
q is a power of 2 and r>0 (then the last merge only gets r elements into
place, and r<minrun is small compared to N), or r=0 and q a little larger
than a power of 2 (then we've got a case similar to "2112", again leaving
too little work for the last merge to do).
Instead we pick a minrun in range(32, 65) such that N/minrun is exactly a
power of 2, or if that isn't possible, is close to, but strictly less than,
a power of 2. This is easier to do than it may sound: take the first 6
bits of N, and add 1 if any of the remaining bits are set. In fact, that
rule covers every case in this section, including small N and exact powers
of 2; merge_compute_minrun() is a deceptively simple function.
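In Python, that bit-twiddling rule looks roughly like this (a sketch of what
merge_compute_minrun() does, not a verbatim translation of the C):

    def merge_compute_minrun(n):
        # Take the first 6 bits of n, and add 1 if any of the remaining bits
        # are set.  For n < 64 this just returns n.
        r = 0                 # becomes 1 if any bit below the top 6 is set
        while n >= 64:
            r |= n & 1
            n >>= 1
        return n + r

For the 2112 example above, merge_compute_minrun(2112) returns 33, and
divmod(2112, 33) is (64, 0): sixty-four runs of length 33, merging perfectly.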
The Merge Pattern
In order to exploit regularities in the data, we're merging on natural
run lengths, and they can become wildly unbalanced. That's a Good Thing
for this sort! It means we have to find a way to manage an assortment of
potentially very different run lengths, though.
Stability constrains permissible merging patterns. For example, if we have
3 consecutive runs of lengths
A:10000 B:20000 C:10000
we dare not merge A with C first, because if A, B and C happen to contain
a common element, it would get out of order wrt its occurrence(s) in B. The
merging must be done as (A+B)+C or A+(B+C) instead.
So merging is always done on two consecutive runs at a time, and in-place,
although this may require some temp memory (more on that later).
When a run is identified, its base address and length are pushed on a stack
in the MergeState struct. merge_collapse() is then called to see whether it
should merge it with preceding run(s). We would like to delay merging as
long as possible in order to exploit patterns that may come up later, but we
like even more to do merging as soon as possible to exploit that the run just
found is still high in the memory hierarchy. We also can't delay merging
"too long" because it consumes memory to remember the runs that are still
unmerged, and the stack has a fixed size.
What turned out to be a good compromise maintains two invariants on the
stack entries, where A, B and C are the lengths of the three rightmost not-yet
merged slices:

1. A > B+C
2. B > C
Note that, by induction, #2 implies the lengths of pending runs form a
decreasing sequence. #1 implies that, reading the lengths right to left,
the pending-run lengths grow at least as fast as the Fibonacci numbers.
Therefore the stack can never grow larger than about log_base_phi(N) entries,
where phi = (1+sqrt(5))/2 ~= 1.618. Thus a small # of stack slots suffice
for very large arrays.
If A <= B+C, the smaller of A and C is merged with B (ties favor C, for the
freshness-in-cache reason), and the new run replaces the A,B or B,C entries;
e.g., if the last 3 entries are
A:30 B:20 C:10
then B is merged with C, leaving
A:30 BC:30
on the stack. Or if they were
A:500 B:400 C:1000
then A is merged with B, leaving
AB:900 C:1000
on the stack.
In both examples, the stack configuration after the merge still violates
invariant #2, and merge_collapse() goes on to continue merging runs until
both invariants are satisfied. As an extreme case, suppose we didn't do the
minrun gimmick, and natural runs were of lengths 128, 64, 32, 16, 8, 4, 2,
and 2. Nothing would get merged until the final 2 was seen, and that would
trigger 7 perfectly balanced merges.
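As a sketch of that rule in Python: suppose the pending runs live on a list of
(base, length) pairs, and a hypothetical helper merge_at(i) merges stack
entries i and i+1 into a single combined entry (that part is the real work,
elided here). Then the merge_collapse() loop amounts to:

    def merge_collapse(stack, merge_at):
        # Restore the invariants A > B+C and B > C on the run stack, where
        # A, B, C are the lengths of the three rightmost pending runs.
        while len(stack) > 1:
            n = len(stack) - 2
            if n > 0 and stack[n - 1][1] <= stack[n][1] + stack[n + 1][1]:
                # A <= B+C: merge the smaller of A and C with B (ties favor C).
                if stack[n - 1][1] < stack[n + 1][1]:
                    n -= 1
                merge_at(n)          # merges stack[n] with stack[n+1]
            elif stack[n][1] <= stack[n + 1][1]:
                # B <= C: merge B with C.
                merge_at(n)
            else:
                break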
The thrust of these rules when they trigger merging is to balance the run
lengths as closely as possible, while keeping a low bound on the number of
runs we have to remember. This is maximally effective for random data,
where all runs are likely to be of (artificially forced) length minrun, and
then we get a sequence of perfectly balanced merges (with, perhaps, some
oddballs at the end).
OTOH, one reason this sort is so good for partly ordered data has to do
with wildly unbalanced run lengths.
Merge Memory
Merging adjacent runs of lengths A and B in-place is very difficult.
Theoretical constructions are known that can do it, but they're too difficult
and slow for practical use. But if we have temp memory equal to min(A, B),
it's easy.
If A is smaller (function merge_lo), copy A to a temp array, leave B alone,
and then we can do the obvious merge algorithm left to right, from the temp
area and B, starting the stores into where A used to live. There's always a
free area in the original area comprising a number of elements equal to the
number not yet merged from the temp array (trivially true at the start;
proceed by induction). The only tricky bit is that if a comparison raises an
exception, we have to remember to copy the remaining elements back in from
the temp area, lest the array end up with duplicate entries from B. But
that's exactly the same thing we need to do if we reach the end of B first,
so the exit code is pleasantly common to both the normal and error cases.
If B is smaller (function merge_hi, which is merge_lo's "mirror image"),
much the same, except that we need to merge right to left, copying B into a
temp array and starting the stores at the right end of where B used to live.
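Here is a stripped-down Python sketch of merge_lo's one-pair-at-a-time core
(no galloping, no exception handling; the names and bookkeeping are simplified
for exposition):

    def merge_lo(a, base_a, len_a, base_b, len_b):
        # Runs are adjacent: A is a[base_a:base_a+len_a], B starts right
        # after it, and len_a <= len_b.
        tmp = a[base_a:base_a + len_a]      # copy the smaller run A aside
        i = 0                               # next unmerged element of A (tmp)
        j = base_b                          # next unmerged element of B
        dest = base_a                       # stores go where A used to live
        end_b = base_b + len_b
        while i < len(tmp) and j < end_b:
            if a[j] < tmp[i]:               # strict < keeps the merge stable
                a[dest] = a[j]
                j += 1
            else:
                a[dest] = tmp[i]
                i += 1
            dest += 1
        # If B ran out first, the rest of A is copied back; if A ran out
        # first, the remaining B elements are already in their final places.
        a[dest:dest + len(tmp) - i] = tmp[i:]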
A refinement: When we're about to merge adjacent runs A and B, we first do
a form of binary search (more on that later) to see where B[0] should end up
in A. Elements in A preceding that point are already in their final
positions, effectively shrinking the size of A. Likewise we also search to
see where A[-1] should end up in B, and elements of B after that point can
also be ignored. This cuts the amount of temp memory needed by the same
amount.
These preliminary searches may not pay off, and can be expected not to
repay their cost if the data is random. But they can win huge in all of
time, copying, and memory savings when they do pay, so this is one of the
"per-merge overheads" mentioned above that we're happy to endure because
there is at most one very short run. It's generally true in this algorithm
that we're willing to gamble a little to win a lot, even though the net
expectation is negative for random data.
Merge Algorithms
merge_lo() and merge_hi() are where the bulk of the time is spent. merge_lo
deals with runs where A <= B, and merge_hi where A > B. They don't know
whether the data is clustered or uniform, but a lovely thing about merging
is that many kinds of clustering "reveal themselves" by how many times in a
row the winning merge element comes from the same run. We'll only discuss
merge_lo here; merge_hi is exactly analogous.
Merging begins in the usual, obvious way, comparing the first element of A
to the first of B, and moving B[0] to the merge area if it's less than A[0],
else moving A[0] to the merge area. Call that the "one pair at a time"
mode. The only twist here is keeping track of how many times in a row "the
winner" comes from the same run.
If that count reaches MIN_GALLOP, we switch to "galloping mode". Here
we search B for where A[0] belongs, and move over all the B's before
that point in one chunk to the merge area, then move A[0] to the merge
area. Then we search A for where B[0] belongs, and similarly move a
slice of A in one chunk. Then back to searching B for where A[0] belongs,
etc. We stay in galloping mode until both searches find slices to copy
less than MIN_GALLOP elements long, at which point we go back to one-pair-
at-a-time mode.
Galloping
Still without loss of generality, assume A is the shorter run. In galloping
mode, we first look for A[0] in B. We do this via "galloping", comparing
A[0] in turn to B[0], B[1], B[3], B[7], ..., B[2**j - 1], ..., until finding
the k such that B[2**(k-1) - 1] < A[0] <= B[2**k - 1]. This takes at most
roughly lg(B) comparisons, and, unlike a straight binary search, favors
finding the right spot early in B (more on that later).
After finding such a k, the region of uncertainty is reduced to 2**(k-1) - 1
consecutive elements, and a straight binary search requires exactly k-1
additional comparisons to nail it. Then we copy all the B's up to that
point in one chunk, and then copy A[0]. Note that no matter where A[0]
belongs in B, the combination of galloping + binary search finds it in no
more than about 2*lg(B) comparisons.
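A simplified Python sketch of that combination (exponential probe, then a
binary search over the remaining gap); the real gallop_left()/gallop_right()
also take a hint index and handle the stability rules for equal elements on
both sides:

    import bisect

    def gallop_left(x, b):
        # Return the smallest index p such that b[p-1] < x <= b[p], probing
        # b[0], b[1], b[3], b[7], ... and then binary-searching the gap.
        n = len(b)
        if n == 0 or x <= b[0]:
            return 0
        lo = 0          # invariant: b[lo] < x
        probe = 1       # probes visit indices 1, 3, 7, 15, ...
        while probe < n and b[probe] < x:
            lo = probe
            probe = probe * 2 + 1
        hi = min(probe, n)              # x <= b[hi], or hi == n
        return bisect.bisect_left(b, x, lo + 1, hi)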
If we did a straight binary search, we could find it in no more than
ceiling(lg(B+1)) comparisons -- but straight binary search takes that many
comparisons no matter where A[0] belongs. Straight binary search thus loses
to galloping unless the run is quite long, and we simply can't guess
whether it is in advance.
If data is random and runs have the same length, A[0] belongs at B[0] half
the time, at B[1] a quarter of the time, and so on: a consecutive winning
sub-run in B of length k occurs with probability 1/2**(k+1). So long
winning sub-runs are extremely unlikely in random data, and guessing that a
winning sub-run is going to be long is a dangerous game.
OTOH, if data is lopsided or lumpy or contains many duplicates, long
stretches of winning sub-runs are very likely, and cutting the number of
comparisons needed to find one from O(B) to O(log B) is a huge win.
Galloping compromises by getting out fast if there isn't a long winning
sub-run, yet finding such very efficiently when they exist.
I first learned about the galloping strategy in a related context; see:
"Adaptive Set Intersections, Unions, and Differences" (2000)
Erik D. Demaine, Alejandro López-Ortiz, J. Ian Munro
and its followup(s). An earlier paper called the same strategy
"exponential search":
"Optimistic Sorting and Information Theoretic Complexity"
Peter McIlroy
SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms), pp
467-474, Austin, Texas, 25-27 January 1993.
and it probably dates back to an earlier paper by Bentley and Yao. The
McIlroy paper in particular has good analysis of a mergesort that's
probably strongly related to this one in its galloping strategy.
Galloping with a Broken Leg
So why don't we always gallop? Because it can lose, on two counts:
1. While we're willing to endure small per-run overheads, per-comparison
overheads are a different story. Calling Yet Another Function per
comparison is expensive, and gallop_left() and gallop_right() are
too long-winded for sane inlining.

2. Ignoring function-call overhead, galloping can-- alas --require more
comparisons than linear one-at-time search, depending on the data.

#2 requires details. If A[0] belongs before B[0], galloping requires 1
compare to determine that, same as linear search, except it costs more
to call the gallop function. If A[0] belongs right before B[1], galloping
requires 2 compares, again same as linear search. On the third compare,
galloping checks A[0] against B[3], and if it's <=, requires one more
compare to determine whether A[0] belongs at B[2] or B[3]. That's a total
of 4 compares, but if A[0] does belong at B[2], linear search would have
discovered that in only 3 compares, and that's a huge loss! Really. It's
an increase of 33% in the number of compares needed, and comparisons are
expensive in Python.
index in B where    # compares linear  # gallop  # binary  gallop
A[0] belongs        search needs       compares  compares  total
----------------    -----------------  --------  --------  ------
               0                    1         1         0       1
               1                    2         2         0       2
               2                    3         3         1       4
               3                    4         3         1       4
               4                    5         4         2       6
               5                    6         4         2       6
               6                    7         4         2       6
               7                    8         4         2       6
               8                    9         5         3       8
               9                   10         5         3       8
              10                   11         5         3       8
              11                   12         5         3       8
...
In general, if A[0] belongs at B[i], linear search requires i+1 comparisons
to determine that, and galloping a total of 2*floor(lg(i))+2 comparisons.
The advantage of galloping is unbounded as i grows, but it doesn't win at
all until i=6. Before then, it loses twice (at i=2 and i=4), and ties
at the other values. At and after i=6, galloping always wins.
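Those two cost formulas are easy to check against the table above; a small
Python sketch, using i.bit_length() - 1 for floor(lg(i)):

    def linear_compares(i):
        # Linear one-at-a-time search: i+1 compares to learn A[0] goes at B[i].
        return i + 1

    def gallop_compares(i):
        # Galloping plus binary search: 2*floor(lg(i)) + 2 compares for i >= 1;
        # a single compare settles i == 0.
        return 1 if i == 0 else 2 * (i.bit_length() - 1) + 2

    for i in range(12):
        print(i, linear_compares(i), gallop_compares(i))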
We can't guess in advance when it's going to win, though, so we do one pair
at a time until the evidence seems strong that galloping may pay. MIN_GALLOP
is 8 as I type this, and that's pretty strong evidence. However, if the data
is random, it simply will trigger galloping mode purely by luck every now
and again, and it's quite likely to hit one of the losing cases next. 8
favors protecting against a slowdown on random data at the expense of giving
up small wins on lightly clustered data, and tiny marginal wins on highly
clustered data (they win huge anyway, and if you're getting a factor of
10 speedup, another percent just isn't worth fighting for).
Galloping Complication
The description above was for merge_lo. merge_hi has to merge "from the
other end", and really needs to gallop starting at the last element in a run
instead of the first. Galloping from the first still works, but does more
comparisons than it should (this is significant -- I timed it both ways).
For this reason, the gallop_left() and gallop_right() functions have a
"hint" argument, which is the index at which galloping should begin. So
galloping can actually start at any index, and proceed at offsets of 1, 3,
7, 15, ... or -1, -3, -7, -15, ... from the starting index.
In the code as I type it's always called with either 0 or n-1 (where n is
the # of elements in a run). It's tempting to try to do something fancier,
melding galloping with some form of interpolation search; for example, if
we're merging a run of length 1 with a run of length 10000, index 5000 is
probably a better guess at the final result than either 0 or 9999. But
it's unclear how to generalize that intuition usefully, and merging of
wildly unbalanced runs already enjoys excellent performance.
Comparing Average # of Compares on Random Arrays
Here list.sort() is samplesort, and list.msort() this sort:
import random
from time import clock as now

def fill(n):
    from random import random
    return [random() for i in xrange(n)]

def mycmp(x, y):
    global ncmp
    ncmp += 1
    return cmp(x, y)

def timeit(values, method):
    global ncmp
    X = values[:]
    bound = getattr(X, method)
    ncmp = 0
    t1 = now()
    bound(mycmp)
    t2 = now()
    return t2-t1, ncmp

format = "%5s %9.2f %11d"
f2     = "%5s %9.2f %11.2f"

def drive():
    count = sst = sscmp = mst = mscmp = nelts = 0
    while True:
        n = random.randrange(100000)
        nelts += n
        x = fill(n)

        t, c = timeit(x, 'sort')
        sst += t
        sscmp += c

        t, c = timeit(x, 'msort')
        mst += t
        mscmp += c

        count += 1
        if count % 10:
            continue

        print "count", count, "nelts", nelts
        print format % ("sort", sst, sscmp)
        print format % ("msort", mst, mscmp)
        print f2 % ("", (sst-mst)*1e2/mst, (sscmp-mscmp)*1e2/mscmp)

drive()
I ran this on Windows and kept using the computer lightly while it was
running. time.clock() is wall-clock time on Windows, with better than
microsecond resolution. samplesort started with a 1.52% #-of-comparisons
disadvantage, fell quickly to 1.48%, and then fluctuated within that small
range. Here's the last chunk of output before I killed the job:
count 2630 nelts 130906543
sort 6110.80 1937887573
msort 6002.78 1909389381
1.80 1.49
We've done nearly 2 billion comparisons apiece at Python speed there, and
that's enough.
For random arrays of size 2 (yes, there are only 2 interesting ones),
samplesort has a 50%(!) comparison disadvantage. This is a consequence of
samplesort special-casing at most one ascending run at the start, then
falling back to the general case if it doesn't find an ascending run
immediately. The consequence is that it ends up using two compares to sort
[2, 1]. Gratifyingly, timsort doesn't do any special-casing, so had to be
taught how to deal with mixtures of ascending and descending runs
efficiently in all cases.