Binary array set

stereoabuse | 63 points

Wow! I independently came up with this algorithm a few years ago and wasn't even sure what to search for to find the prior art. Happy to see someone finally gave it a name and attempted to find the history.

Fun fact #1 that I also realized, which I have yet to see mentioned elsewhere (though people have almost surely realized this in various contexts):

This is not limited to binary search or mergesort.

This is a very general-purpose meta-algorithm for turning any batch algorithm into a streaming one. (!)

The merge() step is basically "redo computation on a batch 2x the size". You pay a log(n) cost for this ability. You can combine the batches in any way you want there. Here it happens to be a merge of two sorted arrays. But you can imagine this being anything, like "train your ML model again on the larger batch".
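A minimal sketch of that meta-algorithm (the name `make_stream_builder` and the slot representation are mine, not from the comment): slot k holds either `None` or a batch built from 2**k items, and insertion works like binary increment, combining equal-sized batches on the way up.

```python
import heapq

def make_stream_builder(combine):
    """Generic "logarithmic method": slot k holds a batch built from 2**k
    items, or None. Inserting works like binary increment, combining two
    equal-sized batches ("carrying") on the way up, so each item is
    re-combined O(log n) times. `combine` may be any batch operation."""
    slots = []

    def insert(batch):  # a batch of size 2**0
        carry, k = batch, 0
        while True:
            if k == len(slots):
                slots.append(carry)
                return
            if slots[k] is None:
                slots[k] = carry
                return
            carry = combine(slots[k], carry)
            slots[k] = None
            k += 1

    return slots, insert

# Using merge-of-two-sorted-lists as the combine step recovers the
# binary array set; swapping in "retrain on the larger batch" gives
# the streaming-ML variant described above.
slots, insert = make_stream_builder(lambda a, b: list(heapq.merge(a, b)))
for x in [5, 1, 9, 3, 7]:
    insert([x])
```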

Fun fact #2: I believe you can add deletion support as well. This can be done with a hierarchical bitmap to help you quickly search for occupied slots. It comes at the cost of another log(n) factor in some operations. I have yet to search if there is a name for the hierarchical bitmap structure I have in mind, so I'm just calling it that for now.
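One plausible reading of that hierarchical bitmap (my interpretation and naming, not necessarily what the commenter has in mind): a two-level structure where a summary word keeps one bit per 64-slot group, so finding the next occupied slot at or after a given index takes only a couple of word scans per level.

```python
class TwoLevelBitmap:
    """Occupancy bitmap over n slots with a one-bit-per-64-slots summary level."""

    def __init__(self, n):
        self.words = [0] * ((n + 63) // 64)
        self.summary = 0  # bit g is set iff words[g] != 0

    def set(self, i):
        self.words[i >> 6] |= 1 << (i & 63)
        self.summary |= 1 << (i >> 6)

    def clear(self, i):
        self.words[i >> 6] &= ~(1 << (i & 63))
        if self.words[i >> 6] == 0:
            self.summary &= ~(1 << (i >> 6))

    def next_occupied(self, i):
        """Smallest occupied slot >= i, or -1 if there is none."""
        g = i >> 6
        if g < len(self.words):
            masked = self.words[g] & (-1 << (i & 63))  # bits at or above i within the group
            if masked:
                return (g << 6) + (masked & -masked).bit_length() - 1
        rest = self.summary & (-1 << (g + 1))  # later groups containing any occupied slot
        if not rest:
            return -1
        g = (rest & -rest).bit_length() - 1
        w = self.words[g]
        return (g << 6) + (w & -w).bit_length() - 1
```

Deeper levels (a summary of the summary) extend this to very large n the same way.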

mehrdadn | a month ago

This implementation is missing a lot of the usual tricks.

By slicing a single array you only need O(1) not O(log n) overhead. This does of course mean dropping down to amortized insertion due to the occasional large reallocation (but if your OS supports `mremap` and your language allows you to use it, that goes away ... but the merge means you pay anyway). It never makes sense to allow a "null" array except for the smallest items (which I tend to put at the end, not the start). Also, below some small number of elements (16?) you should fall back to a linear search.

Also worth noting that tombstones are a legitimate way of deleting elements. If your elements are (or contain, if structs, especially map-like entries) pointers you can just use a pointer to a private static variable (since you know nobody else can access it); if integers you can use the smallest integer except for the very first element of the array, which you can avoid turning into a tombstone. The trick is never to insert one at an intermediate level of the "tree", but instead bubble them down. (Bubbling down of self-invalidating elements like weak pointers is also quite possible, but turning all reads into potential writes complicates thread-safety if you care about that). You should also keep track of the number of deleted elements and perform a full re-sort if it exceeds the number of real elements or so. That said, there's enough branch-predictor overhead in this that you probably want to implement the delete-allowed version separately from the no-delete version.

Sketch of the `get` code as I'm familiar with it:

  def get(array, elem, start, end):
    length = end - start
    chunk_size = round_down_to_power_of_2(length)
    while True:
      assert start + length == end
      if length < SMALL:
        # note that, after the first iteration,
        # length can be 0, 1, etc. even if chunk_size is still large
        return linear_search(array, elem, start, end)
      assert chunk_size <= length
      assert start + chunk_size <= end
      rv = binary_search(array, elem, start, start + chunk_size)
      if rv != NOT_FOUND: return rv
      start += chunk_size
      length -= chunk_size
      chunk_size //= 2
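A runnable version of the sketch above, assuming the single-array layout where the run sizes are the set bits of the element count, largest run first (this adaptation recomputes the chunk size from the remaining length each round, which also handles counts whose halvings skip a run; the constants are mine):

```python
import bisect

NOT_FOUND = -1
SMALL = 4  # the comment suggests ~16; 4 here so a small demo still hits the binary-search path

def get(array, elem, start, end):
    while True:
        length = end - start
        if length < SMALL:
            # linear scan of the small tail
            for i in range(start, end):
                if array[i] == elem:
                    return i
            return NOT_FOUND
        # largest power of 2 <= length == size of the next sorted run
        chunk_size = 1 << (length.bit_length() - 1)
        i = bisect.bisect_left(array, elem, start, start + chunk_size)
        if i < start + chunk_size and array[i] == elem:
            return i
        start += chunk_size

# 13 elements => runs of sizes 8, 4, 1 (the example layout from the thread):
arr = [1, 4, 6, 7, 8, 10, 11, 12, 2, 3, 9, 13, 5]
```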

o11c | a month ago

I'm being pretty thick here. I understand that each bucket is sorted. But what distinguishes the buckets? It talks about merging into slot 0, and then merging into slot 1... with what?

I know I could spend some time and read the code, but I'm hoping someone is kind enough to post a better explanation of the structure invariant and the insertion process.

convolvatron | a month ago

This looks like an in-memory log-structured merge tree [0].

(Albeit based on sorted runs rather than trees, as in the more recent variations of the idea.)

[0] https://www.cs.umb.edu/~poneil/lsmtree.pdf

okennedy | a month ago

Oh nice, I've designed something similar before. You can do this within a single array by always adding new elements to the end of the array and performing an in-place merge of adjacent equal-sized bins. So the example given would look like:

    1, 4, 6, 7, 8, 10, 11, 12, 2, 3, 9, 13, 5

The structure is implicit in the size (as mentioned in the article). No need for an explicit array-of-array structure.

----

A related data structure I've been playing around with in F# (but haven't had time to write up) --

1. Instead of merging arrays, just append them.

2. Instead of arrays, use binary trees.

3. Instead of an array of these arrays/binary trees, use a linked list (so, a linked list of complete binary trees of strictly increasing depth). Omit elements whose trees are empty.

4. Put two of these back to back ("large" ends touching).

5. Recognize that if one of these lists is empty, a binary tree (or half of one) can be moved from the back of the other list to populate it.

Now you have a purely functional double-ended list with log(n) indexing, insertion, deletion, and append, while retaining a high degree of sharing.
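Steps 1-3 can be sketched one-sided in Python (the back-to-back pairing of steps 4-5 is omitted; names mine). With elements at the leaves, "appending" two equal trees is an O(1) link, and the result is essentially a binary random-access list in the style of Okasaki:

```python
def cons(x, digits):
    """Prepend x. `digits` is a tuple of (size, tree) pairs with sizes strictly
    increasing; a tree of size 1 is a bare value, larger trees are (left, right)."""
    carry = (1, x)
    rest = list(digits)
    while rest and rest[0][0] == carry[0]:
        size, t = rest.pop(0)
        carry = (2 * size, (carry[1], t))  # O(1) link: just a new root, no merging
    return tuple([carry] + rest)

def lookup(digits, i):
    """0-based indexing in O(log n)."""
    for size, t in digits:
        if i < size:
            while size > 1:  # descend; each subtree holds half the elements
                size //= 2
                left, right = t
                if i < size:
                    t = left
                else:
                    t, i = right, i - size
            return t
        i -= size
    raise IndexError(i)
```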

----

Similarly, another related data structure I've been playing around with in Prolog --

1. No merging or appending.

2. Instead of arrays, use binary trees.

3. Instead of an array of these arrays/binary trees, use an infinite linked list, in order of increasing depth.

4. Do not explicitly instantiate this structure. Represent everything initially as a logic variable, only expanding linked list and binary tree nodes as needed.

5. Place two of these back to back, and add one extra logic variable in-between.

Now you have a purely relational array with integer indexes, supporting log(n) indexing and assignment, no upper or lower bound, and native Prolog unification. (Which, coincidentally, is identical to the notion of "array" in SMT-LIB v2.)

----

TL;DR: sequences of complete binary trees of increasing depth can be used for many interesting data structures.

colanderman | a month ago

Reminds me a little bit of a Bloom filter in its functionality.

goggy_googy | a month ago

Considering the nested arrays (indirection), this doesn't seem that attractive for testing membership/insertion compared to a hashset. What am I missing?

keybored | a month ago