## Tuesday, September 21, 2010

### An Array Initialization Trick

Here is a neat trick that lets you avoid initializing large arrays. I learned about it long ago from Aho, Hopcroft, and Ullman, The Design and Analysis of Computer Algorithms, an old but still great book — but I think it is probably an old hacker's idea.

To illustrate the idea, suppose we are trying to solve the element distinctness problem on a restricted universe. In this problem we are given a list of integers and we want to determine if there are any repeated elements. The input is an array A[1..n] of integers, where each integer is between (say) 1 and M, where M is not so large compared to n — maybe M is about 10n or something like that.

The usual way to do this is to create an array B[1..M] such that B[j] = 1 if j is an element of A, and 0 otherwise. We start by initializing all the entries of B to 0. Then we loop through the elements of A. For each i, 1 ≤ i ≤ n, we first check B[A[i]]. If it equals 1, then the value A[i] already occurred in A. Otherwise, we set B[A[i]] := 1, to signify that we've seen A[i].

If M = O(n), this gives a linear-time algorithm for element distinctness. It even lets us list the repeated elements, if there are any.

Now here's the deal. Suppose we are solving element distinctness many times in a program, as a subroutine. Then it is conceivable that initializing all the entries of B to 0 could actually be a significant time drain. Although we have to allocate the memory for B once at the start, could there be a way to avoid the time-consuming Θ(M) initialization for B each time we call the subroutine again? We have to handle the problem that the entries of B could be arbitrary junk that we have no control over.

The answer is yes, using the following trick: instead of containing 1 or 0, we will set it up so that B[j] contains the position p in A where j is found. The key point is that we always actually verify this by looking at A[p], so we can never go wrong, even if B[j] is junk. More precisely, we want to ensure that if B[j] = p, then A[p] = j.

Now for each i we are going to check to see if A[i] = d has already been seen. To do this, we look in B[d]; say it equals c. If c is not between 1 and i-1, then we know that c represents uninitialized junk, so we can confidently ignore it and set B[d] = i.

If c is between 1 and i-1, then it either represents a true pointer back to the place in A where d was found earlier, or it just happens to be junk that is in the right range. In either case, we look at A[c]. Either it equals d, in which case we have found d earlier in the array at the entry A[c], or it equals something else. In the latter case we also set B[d] = i. This works whether B[d] is a true pointer or not!

Here's the code:
`for i := 1 to n do     d := A[i]     c := B[d]     if (c < 1) or (c ≥ i) then              B[d] := i     else if A[c] = d then                print(d, " occurred at positions ", c," and ", i)      else B[d] := i;`

And that's it! This code fragment prints out the duplications. Of course, if you'd rather terminate as soon as a repeated element is found, you can do that instead. Or if you'd rather save the repeated elements in a linked list, you can do that, too. And of course, this trick can be used in other settings, such as large sparse matrices, where you want to avoid initialization costs.

Pseudonym said...

Of course, this "trick" means that B must occupy lg(|A|) times as much storage as it would otherwise need to, and that's only if you tune the representation of B to take into account the length of A. More typically, it would be 16, 32 or 64 times as big as it needs to be.

An interesting trick, but I can't really imagine using it anywhere. Were M so big that the Θ(M) time was really hurting performance, I'd be tempted to try undoing the initialisation by traversing the piece of A that has been "consumed" and only zeroing out those entries; if you haven't consumed much of A, they're still likely to be in cache, after all.

A more exotic option is to store B in zero-fill demand paged memory. Then the initialisation would only occur as needed.

But of course, the first step would always be to do instruction-level profiling and make sure that initialisation of B really was the bottleneck. For this toy problem, it almost certainly isn't.

Jeffrey Shallit said...

Pseudo:

Except that in many implementations, an array of booleans takes up the same amount of space as an array of integers. C, for example, has no native boolean type.

I can't tell you how many times I've heard clever ideas dismissed as "An interesting trick, but I can't really imagine using it anywhere." The point is to have lots of tricks in your toolbox, so that you can pull one out as needed.

Pseudonym said...

C does indeed have a native boolean type. It's been in the standard for over ten years. If your compiler doesn't have it, then it's not compliant.

What it doesn't have is a native bit vector type, but then, C doesn't really have native vector types at all. Every C programmer has an implementation in easy reach, because it's a critical piece of infrastructure. In C++, of course, it's in the standard library.

I can't tell you how many times I've heard clever ideas dismissed as "An interesting trick, but I can't really imagine using it anywhere." The point is to have lots of tricks in your toolbox, so that you can pull one out as needed.

There are tools and there are tools. I can probably best illustrate this with a tale of two sort algorithms, both of which are usually considered obsolete.

Sort algorithm #1: Shell sort.

I used this algorithm to solve a real problem a couple of years ago. It was a real-time firmware application where I had a moderately small amount of data that needed to be sorted, but it was still large enough that a quadratic algorithm wouldn't do the job. On the other hand, while I had lots of read-only memory available, I had essentially no read-write memory spare, so all of the lowest-constant-factor O(n log n) algorithms were out of the question. Shell sort turned out to be the best fit for this problem.

Sort algorithm #2: Bubble sort.

This algorithm really is useless. On every workload, on every CPU, in every situation that is ever going to come up, insertion sort is always a superior choice. This "always" sounds like a rash universal quantifier, but it's true.

Okay, in theory, it could potentially work better than insertion sort on certain special-purpose exotic hardware, like parallel stream processing something-or-other. But on such hardware, there are algorithms which are just as simple and perform even better.

In summary, an excellent rule of thumb is that there is no situation where bubble sort is ever the right answer to a real problem. But every programmer should know about it, if only so they don't accidentally re-invent it.

I do apologise for being glib. I didn't quite mean "anywhere". What I meant was that for this toy problem, this trick simply isn't a practical solution, when you consider the alternatives.

What I didn't mention is that I did work out a fairly realistic scenario where the trick could come in handy. The details are unimportant, but it involved doing some kind of processing on A, which may occasionally backtrack (where "occasionally" means that the number of choice points is small relative to the length of A). In that situation, you only want to "undo" visiting part of A. Details are left as an exercise.

Jeffrey Shallit said...

Pseudo:

Sorry about the mistake about C. I don't keep up with "advances" in the C standard, since I find C to be so unpleasant to use. I should have said "had" instead of "has".

Jeffrey Shallit said...

BTW, heapsort is done in place, so I'm having trouble understanding why you think Shell sort was preferable to heapsort, even in your situation.

Gareth McCaughan said...

The constant factor for heapsort is pretty bad, and the constant factor for shellsort is pretty good.

I put together simple implementations of shellsort and heapsort and rigged them to count "mems" (memory accesses that would be needed with a sufficiently smart compiler). The crossover point above which heapsort wins was somewhere between n=3000 and n=4000.

Jeffrey Shallit said...

Gareth:

What increments did you use in Shell sort?

Gareth McCaughan said...

1, 4, 13, 40, ...: start at 1, do n -> 3n+1 until you get too big. Equivalently, (3^n-1)/2.

I think folklore says you can actually do better than that, but clearly it's good enough as it is to beat heapsort for moderate-sized arrays. At least if you count mems; I haven't tried making streamlined implementations in C and actually timing them, which would probably be more appropriate for measuring smallish-array performance. (Because for small arrays the "memory access is the only thing that really costs" approximation, which is one of the main justifications for counting mems, isn't as true as it is for larger arrays. For that matter, cache effects mean that even for large arrays mems are at best a useful approximation. But I digress.)

Pseudonym said...

Thanks, Gareth, for saying pretty much what I was going to say. The key thing is that for small-to-moderate amounts of data, constant factors dominate in heap sort. Constant factors matter just as much as complexity.

Incidentally, here's something that a theorist should appreciate: "Memory access is the only thing that really costs" is a good approximation a lot of the time, especially on microcontrollers. On desktop/server CPUs, however, one thing that often dominates is branch misprediction.

Algorithmic information theory 201 requires that comparison-based sorting must perform O(n log n) data comparisons at some point. What often isn't appreciated is that the result of these comparisons are essentially impossible to predict in advance. This should be clear why: the result of those comparisons is precisely the information that you're trying to uncover.

By contrast, loop branches and exception branches are usually very easy to predict: Most of the time, loops are "taken" and exceptions are not "taken". Both static and dynamic branch prediction usually do an extremely good job on these cases.

So when you have a job for which a comparison-based sort is a bottleneck, and memory access times aren't an issue, then more often than not, most time is spent undoing the conditional branch resulting from the data comparison.

There are always a bunch of caveats here, of course. Many CPUs can recover from speculative execution of a single conditional branch more cheaply than multiple branches. So if your comparison is actually on multiple keys where the primary key is "equal" a significant amount of the time, the effect is often more noticeable. Some CPUs have a conditional move or conditional skip instruction, which can help if it applies. And, of course, some variant of radix sorting is always an option.

One of the more surreal moments I've spent in my HPC career is rewriting whole modules just to avoid a single conditional branch. Needless to say, this is something you only do after much profiling, thinking, back-of-the-envelope reasoning and just a little swearing.

D. Swart said...

Pseudonym. You with your 'always's 'every's and 'never's!

For my own one-off personal projects, the computer's time is often less valuable than my own.

In these cases, ease-of-implementation trumps time, space, and scalability.

So I'm not ashamed to say I've whipped off bubble-sort caliber algorithms when no prepackaged implementations were at hand (i.e., selection sort).

F. Andy Seidl said...

>> Although we have to allocate the memory for B once at the start<<

Or you could use a hash map and avoid the initialization cost AND the allocation overhead.

>>This algorithm really is useless. On every workload, on every CPU, in every situation that is ever going to come up, insertion sort is always a superior choice.<<

Pseudonym, please. There is a time for every purpose under the sun.

Suppose the arrays you're going to sort are never longer than three elements?

Suppose the arrays you're going to sort are already sorted except that one new element may have been added at the end?

In both case, it'd be tough to beat a bubble sort.

Pseudonym said...

If you're going to sort only three elements, then bubble sort performs the same amount of work as insertion sort or, if you don't implement it carefully, performs one extra comparison.

As for the second example, if the new element that you need to add is at the end (as opposed to the start) then the bubble sort that's in your standard algorithms textbook does O(n^2) work in general. If you change it to "bubble" from right to left, it does almost exactly the same amount of work as an insertion sort that's been modified similarly. An unmodified insertion sort does roughly twice the work as the modified one, but it's still linear as opposed to quadratic.

Either way, bubble sort is never better than insertion sort, and is almost always worse. Insertion sort is also easier to write.

Note that it may be possible to come up with an unusual problem where a modified bubble sort outperforms many other algorithms. For "modified bubble sort", read "problem-specific algorithm".

Pseudonym said...

Oh, and D. Swart: I agree. For one-off personal projects (or even professional projects which won't be run many times), programmer time dominates run-time. In that case, the best sort algorithm is almost certainly the one that you don't have to write: either the one in the standard library of your programming language or, as I did about a month ago, piping data through /usr/bin/sort.