HACKER Q&A
📣 Uptrenda

What are some cool but obscure data structures you know about?


I'm very interested in what types of interesting data structures are out there HN. Totally your preference.

I'll start: bloom filters. They let you test if a value is definitely NOT in a list of pre-stored values (or POSSIBLY in the list - with an adjustable false-positive probability that trades off against how much storage the values need.)

Good use-case: routing. Say you have a list of 1 million IPs that are blacklisted. A trivial algorithm would be to compare a given IP against every element of the set, so the lookup time grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose lookup time does not grow with the number of elements, because the 'keys' never need to be stored ('search' and 'insert' cost depends only on the number of hash functions.)

Bonus section: Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.
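To make the mechanics concrete, here's a minimal sketch in TypeScript (the seeded FNV-style hash is a stand-in; a real filter would pick its hash functions more carefully):

    // Minimal Bloom filter: m bits, k hash functions derived from one
    // seeded hash. add() sets k bits; mightContain() checks them.
    class BloomFilter {
      private bits: Uint8Array;
      constructor(private m: number, private k: number) {
        this.bits = new Uint8Array(Math.ceil(m / 8));
      }
      // Seeded FNV-1a-style hash; a stand-in, not cryptographic.
      private hash(s: string, seed: number): number {
        let h = (2166136261 ^ seed) >>> 0;
        for (let i = 0; i < s.length; i++) {
          h ^= s.charCodeAt(i);
          h = Math.imul(h, 16777619);
        }
        return (h >>> 0) % this.m;
      }
      add(s: string): void {
        for (let i = 0; i < this.k; i++) {
          const bit = this.hash(s, i);
          this.bits[bit >> 3] |= 1 << (bit & 7);
        }
      }
      // false => definitely not in the set; true => possibly in the set
      mightContain(s: string): boolean {
        for (let i = 0; i < this.k; i++) {
          const bit = this.hash(s, i);
          if (!(this.bits[bit >> 3] & (1 << (bit & 7)))) return false;
        }
        return true;
      }
    }

    const blocklist = new BloomFilter(8 * 1024 * 1024, 6); // 1 MB of bits
    blocklist.add("192.0.2.1");
    blocklist.mightContain("192.0.2.1");    // true (possibly in set)
    blocklist.mightContain("198.51.100.7"); // almost certainly false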


  👤 skrebbel
Not a very deep CS-y one, but still one of my favourite data structures: Promise Maps.

It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.

When caching the result of an expensive computation or a network call, don't actually cache the result, but cache the promise that awaits the result. Ie don't make a

    Map<Key, Value>
but a

    Map<Key, Promise<Value>>
This way, if a new, uncached key gets requested twice in rapid succession, ie faster than the computation takes, you avoid computing/fetching the same value twice. This trick works because:

- promises hold on to their result value indefinitely (until they're GC'ed)

- you can await (or .then()) an existing promise as many times as you want

- awaiting an already-resolved promise is a very low-overhead operation.

In other words, the promise acts as a mutex around the computation, and the resulting code is understandable even by people unfamiliar with mutexes, locks and so on.
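A minimal sketch of the trick in TypeScript, with `fetchUser` standing in for whatever expensive call is being cached:

    type User = { id: string; name: string };

    // Stand-in for the expensive network call being cached.
    async function fetchUser(id: string): Promise<User> {
      const res = await fetch(`https://example.com/users/${id}`);
      return res.json();
    }

    const cache = new Map<string, Promise<User>>();

    function getUser(id: string): Promise<User> {
      let p = cache.get(id);
      if (!p) {
        // Cache the promise immediately, before it resolves, so a second
        // caller arriving mid-flight awaits the same in-progress fetch.
        p = fetchUser(id);
        cache.set(id, p);
        // Optional: evict failures so an error isn't cached forever.
        p.catch(() => cache.delete(id));
      }
      return p;
    }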


👤 jiggawatts
Cache-Oblivious Data Structures: https://cs.au.dk/~gerth/MassiveData02/notes/demaine.pdf

A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.

A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.

In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.

Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.

This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.


👤 jtolmar
The union-find data structure / algorithm is useful and a lot of fun.

The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)

The implementation starts out pretty obvious - a tree where every element either points at itself or some thing it was merged with. To check if two elements are in the same set, check if they have the same root. Without analyzing it, it sounds like you'll average findRoot() performance of O(log(n)), worst-case O(n).

There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.

And neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(α(n)), where α is the inverse of the Ackermann function. Or, for any reasonable n, O(4 or less).
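Here's a sketch of both optimizations in TypeScript (using path halving rather than full path compression, which keeps find() to one loop but gives the same asymptotics):

    const parent: number[] = [];
    const size: number[] = [];

    function makeSet(): number {
      parent.push(parent.length);
      size.push(1);
      return parent.length - 1;
    }

    function find(x: number): number {
      while (parent[x] !== x) {
        parent[x] = parent[parent[x]]; // re-parent nodes as we walk up
        x = parent[x];
      }
      return x;
    }

    function union(a: number, b: number): void {
      let ra = find(a), rb = find(b);
      if (ra === rb) return;
      if (size[ra] < size[rb]) [ra, rb] = [rb, ra]; // larger set on top
      parent[rb] = ra;
      size[ra] += size[rb];
    }

    const a = makeSet(), b = makeSet(), c = makeSet();
    union(a, b); union(b, c);
    find(a) === find(c); // true: a and c are in the same set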


👤 loxias
(Fantastic post idea OP. One of the best I've ever seen :D)

Related to bloom filters, xor filters are faster and more memory efficient, but immutable.

HyperLogLog is an efficient way to estimate cardinality.

Coolest thing I've learned recently was the Y-fast trie. If your dataset M is bounded integers (say, the set of all 128-bit numbers), you get membership, predecessor, or successor queries in log log time, not the log time of a normal tree.

see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)

Would love to learn more "stupidly fast at the cost of conceptual complexity" things.

edit:

(adding) I don't know a name for it, because it's not one thing but a combination, but one can:

Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.

Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.

Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.

Very clever. (I first saw in the Aerospike key value store)


👤 filoeleven
HAMT: Hash Array Mapped Trie. This data structure makes efficient immutable data possible. You can update a list of a million items, and keep a reference to the original list, by changing 3 or 4 references and some bytes.

This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.

https://en.m.wikipedia.org/wiki/Hash_array_mapped_trie


👤 dwaite
Splay trees:

These are binary search trees, but when you 'splay' the tree (such as by search) it rebalances the tree so that the search result is on top. This means while it is O(log n) for operations like other self-balancing trees, it optimizes tree depth in the face of non-random access so that recently accessed items are fewer steps into the tree.

Piece tables:

A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).

This has interesting variations:

- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.

- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.

- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.


👤 zarzavat
Spatial hashing.

Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.

Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
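A sketch of the whole thing in TypeScript; `cellSize` is an assumption you'd tune to roughly your typical query radius:

    const cellSize = 10;
    const grid = new Map<string, { x: number; y: number }[]>();

    const cellKey = (x: number, y: number) =>
      `${Math.floor(x / cellSize)},${Math.floor(y / cellSize)}`;

    function insert(p: { x: number; y: number }): void {
      const key = cellKey(p.x, p.y);
      if (!grid.has(key)) grid.set(key, []);
      grid.get(key)!.push(p);
    }

    // Neighbour query: scan the 3x3 block of cells around the point.
    function nearby(x: number, y: number): { x: number; y: number }[] {
      const cx = Math.floor(x / cellSize), cy = Math.floor(y / cellSize);
      const out: { x: number; y: number }[] = [];
      for (let dx = -1; dx <= 1; dx++)
        for (let dy = -1; dy <= 1; dy++)
          out.push(...(grid.get(`${cx + dx},${cy + dy}`) ?? []));
      return out;
    }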


👤 romes
Equality graphs (e-graphs) for theorem proving and equality saturation and other equality-related things.

They're awesome data structures that efficiently maintain a congruence relation over many expressions.

> At a high level, e-graphs extend union-find to compactly represent equivalence classes of expressions while maintaining a key invariant: the equivalence relation is closed under congruence.

e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and then said "x == y" (merged "x" and "y" in the e-graph), then the e-graph, by congruence, would be able to tell me that "f(x) == f(y)"

e.g. If I were to represent "a*(2/2)", in the e-graph, then say "2/2 == 1", and "x*1 == x", by congruence the e-graph would know "a*(2/2) == a" !

The most recent description of e-graphs with an added insight on implementation is https://arxiv.org/pdf/2004.03082.pdf to the best of my knowledge.

P.S: I'm currently implementing them in Haskell https://github.com/alt-romes/hegg


👤 gpestana
Conflict-free replicated data type (CRDT) is super interesting data type (and algorithm), especially when it is implemented for complex data structures like JSON: A JSON CRDT is "[...] an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state (a conflict-free replicated datatype or CRDT)." [1].

[1] A Conflict-Free Replicated JSON Datatype (https://arxiv.org/abs/1608.03960) by Martin Kleppmann and Alastair R. Beresford.


👤 anon291
Finger trees allow you to do amazing things. In essence they let you build an index, or multiple indices, for your dataset and then store them in a single structure.

When I go from Haskell back to imperative land I find myself greatly missing this ability. Sure I can make multiple hashmaps or trees or whatever but being able to stuff it all in one data structure is amazing.

One structure I built with them that is much more difficult with typical structures is a "block list" structure.

In this structure each block has a particular width and they're all stacked side by side.

I want to perform a query, "which box is at position X". So if my boxes are of widths 7,20,10, then the lookup for 2 yields the first box, the lookup for 12 yields the second, etc.

More interestingly, if I add a new box between the second and third, the indices covered by the last box are shifted up.

With finger trees you use a sum monoid. This is easy. In other languages you have to roll your own structure, either using a list (with O(n) lookup) or a tree with O(log n) lookup but O(n) inserts (to translate the indices of later blocks).

https://andrew.gibiansky.com/blog/haskell/finger-trees/


👤 kcbanner

  "This is the story of a clever trick that's been around for at least 35 years,     in which array values can be left uninitialized and then read during normal   operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time."
https://research.swtch.com/sparse

👤 3wolf
A minor quibble with your use-case explanation: The advantage of a bloom filter isn't strictly time complexity. For example, a hash table would also have constant lookup time (best case), and would give a definitive answer on set membership. However, to store 1 million IPv6 addresses would take 16 MB. You can see very quickly that this would not scale very well to, say, a billion addresses stored in-memory on a laptop. With a bloom filter, we can shrink the amount of storage space required* while maintaining an acceptable, calculable false positive rate.

* IP addresses actually aren't a great use case for basic bloom filters, as they're fairly storage efficient to begin with, as opposed to a url for example. Taking your example, say we need to store 1 million IP addresses in our bloom filter and we're okay with a ~1% false positive rate. Well then, if we use a bloom filter with 2^23 bits (1 MB), the optimal number of hash functions is (2^23)/(10^6)*ln(2) ≈ 6, yielding a false positive rate of (1 - exp(-6*10^6/2^23))^6 ≈ 1.8%. So we're using 6% of the storage space, but with a nearly 2% false positive rate.


👤 EnderShadow8
Treap: https://en.wikipedia.org/wiki/Treap

It's like a Red-Black tree in use case, but much faster to implement, which is good for competitive programming. The average case complexity is the same for all operations, but there's an unlikely degeneration to worst-case linked-list behaviour.

Lazy Propagation Segment Tree: https://cp-algorithms.com/data_structures/segment_tree.html

Like a segment tree in that it supports range queries in logarithmic time, but supports range updates in log time as well.

I know a few more fun ones that I occasionally use in contests, but I've never touched otherwise.


👤 chriswarbo
The Zipper acts like a linked-list with a cursor, or "focused element"; it's implemented as a pair of lists in opposite orders or, equivalently but more symmetrically, as a triple of (List[A], A, List[A])

Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're focusing on the 3. In code this will look like:

    ([2, 1, 0], 3, [4, 5])
Where [a, b, c] denotes a singly-linked list, with O(1) head (returning a) and tail (returning [b, c]). Notice that the first list is in reverse order.

To focus on the next element, we put the currently focused element on the first list, and pull the head off the second list:

    ([3, 2, 1, 0], 4, [5])
And vice versa to focus on the previous element:

    ([1, 0], 2, [3, 4, 5])
I like to imagine this as a string of beads, with the focused element held in our fingers and the rest hanging down:

     3
    / \
    2 4
    | |
    1 5
    |
    0
To move the focus forwards and backwards, we move our grip to the next/previous bead.

This works nicely as an immutable datastructure, since the tails of both lists can be shared (i.e. we don't need to copy the whole list, just wrap/unwrap an element onto the head)
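Here is that walk as a small TypeScript sketch (arrays stand in for the persistent lists, so unlike the real thing the tails here are copied rather than shared):

    type Zipper<A> = { left: A[]; focus: A; right: A[] };

    // Both moves assume a neighbour exists in that direction.
    function next<A>(z: Zipper<A>): Zipper<A> {
      const [head, ...rest] = z.right;
      return { left: [z.focus, ...z.left], focus: head, right: rest };
    }

    function prev<A>(z: Zipper<A>): Zipper<A> {
      const [head, ...rest] = z.left;
      return { left: rest, focus: head, right: [z.focus, ...z.right] };
    }

    let z: Zipper<number> = { left: [2, 1, 0], focus: 3, right: [4, 5] };
    z = next(z); // { left: [3, 2, 1, 0], focus: 4, right: [5] }
    z = prev(z); // back to focusing on 3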

Zippers were first applied to lists, but have since been generalised:

- To trees: focusing on one node, and being able to move the focus up, to the left-child, or to the right child

- To having more than one focus

- To datastructures more generally, by treating them as 'derivatives' (as in calculus)

https://en.wikipedia.org/wiki/Zipper_(data_structure)

As an example, the XMonad window manager uses a zipper to keep track of the currently-focused window.

I've also used Zippers for window management, e.g. having a list of Displays, with one of them focused (accepting keyboard input); each Display containing a list of Workspaces, with one of them focused (currently shown); and each Workspace containing a list of Windows, with one of them focused (the active window). Keyboard shortcuts could shift between the previous/next Window, Workspace, or Display.


👤 jhallenworld
Some I've used recently:

The "golden section search" to find a the minimum (or maximum) of a unimodal function. An actual real-world use case for the golden ratio.

Exponentially Weighted Moving Average filters. Or how to have a moving average without saving any data points.

Some of my classic favorites:

Skiplists: they act like sorted trees, but the algorithms are simple to implement, which is nice.

Boyer-Moore string search is awesome.

Bit twiddling algorithms from Hackers Delight: for example (x &-x) isolates the least significant set bit: basically use the ALU's carry chain to determine priority.

Another is compress from Hacker's Delight. It very quickly compresses selected bits (from a bitmask), all to the right of the word. It's useful to make certain hash functions.

The humble circular queue. If your circular queue is a power of 2 in size, then the number of bits needed for indexes is at least log2(size) + 1. The + 1 is key- it allows you to distinguish between completely full and completely empty with no effort. So:

    Empty: (write_index == read_index)
    Full: (write_index == read_index + size)
    Count: (write_index - read_index)
    Insert: queue[write_index++ & (size - 1)] = data
    Remove: data = queue[read_index++ & (size - 1)];
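A runnable TypeScript version of the same queue (in C the free-running counters would wrap via unsigned overflow; here they simply stay small enough not to matter for a sketch):

    const SIZE = 8; // must be a power of 2
    const queue = new Array<number>(SIZE);
    let writeIndex = 0, readIndex = 0; // free-running counters

    const isEmpty = () => writeIndex === readIndex;
    const isFull = () => writeIndex - readIndex === SIZE;
    const count = () => writeIndex - readIndex;

    function insert(data: number): void {
      if (isFull()) throw new Error("queue full");
      queue[writeIndex++ & (SIZE - 1)] = data;
    }

    function remove(): number {
      if (isEmpty()) throw new Error("queue empty");
      return queue[readIndex++ & (SIZE - 1)];
    }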
All lossless data compression algorithms are amazing.

👤 binarymax
HNSW, or Hierarchical Navigable Small World is a graph data structure for approximate nearest neighbor search of vectors.

https://arxiv.org/abs/1603.09320

The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.

Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.


👤 nathell
Directed acyclic word graphs (DAWGs)!

They’re like tries, but with identical subtrees glued together. They’re capable of encoding real-world dictionaries of millions of words into ~1 byte per word, when stored cleverly (see [0]). They let you do things like efficiently finding a list of words in a dictionary whose Levenshtein’s distance to a given word is less than x. Their cousins, GADDAGs, are _the_ data structure of choice in Scrabble-playing engines.

[0]: http://www.jandaciuk.pl/fsa.html


👤 petethepig
Tries (or prefix trees).

We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g. indexes in Mongo) or file formats (e.g. debug symbols in the macOS/iOS Mach-O format are compressed using tries).

We have an article with some animations that illustrate the concept in case anyone's interested [0].

[0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...


👤 SahAssar
https://en.wikipedia.org/wiki/Macaroons_(computer_science) are very interesting to me.

Think of it as a JWT that you can narrow down the authorizations for without needing to communicate with a server, so if you have read permissions to all your photos you can add a caveat saying `photoid=123456` and share it, and the recipient can only read the photo 123456. The caveats can be anything, including requiring third party authentication via third-party caveats. I've implemented systems with it being validated by a lua module in nginx which worked well, but the whole concept never took off.

I still think it seems like one of the most interesting authZ strategies around.


👤 amenghra
Purely Functional Data Structures[1] by Chris Okasaki is worth reading. There's a book version if you prefer that to reading a thesis.

Even though the domain application is functional programming, these datastructures can come in handy when you want to enable state sharing / keeping old versions around without having to copy data.

[1] https://www.cs.cmu.edu/~rwh/students/okasaki.pdf


👤 scottcodie
Reservoir sampling is a statistical technique used to randomly select a finite number of elements from a population. The elements are chosen such that each element has an equal probability of being selected. This technique is often used when it is impractical to select a random sample of elements from a very large population.

To do reservoir sampling, you first need to decide how many items you want in your sample. This number is called the size of the reservoir. Once you have the size of the reservoir, you seed it with the first k items, then replace them with later items at random with ever-decreasing probability.

The code is beautifully simple:

  reservoir = population[0..k-1]  // seed with the first k items
  for i = k to population.length-1
    j = random integer between 0 and i, inclusive
    if j < k
       reservoir[j] = population[i]
  return reservoir
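The same algorithm in streaming form, as a TypeScript sketch (note it never needs to know the stream's length up front):

    function reservoirSample<T>(stream: Iterable<T>, k: number): T[] {
      const reservoir: T[] = [];
      let i = 0;
      for (const item of stream) {
        if (i < k) {
          reservoir.push(item); // fill with the first k items
        } else {
          const j = Math.floor(Math.random() * (i + 1)); // 0..i inclusive
          if (j < k) reservoir[j] = item; // kept with probability k/(i+1)
        }
        i++;
      }
      return reservoir;
    }

    reservoirSample([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 3); // e.g. [1, 7, 4]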

👤 zbobet2012
Here is one not on the list so far:

Set Sketches. They let you compute the difference between two sets (for example, to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to the other.

Say you have two sets of the numbers [1, ..., 1million] all 32 bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS which allows the other party to compute those missing numbers. A naive algorithm would require 4MB of data be transferred to calculate the same thing.

(In particular PinSketch: https://github.com/sipa/minisketch)


👤 mijoharas
My favorite one that I've had the chance to use professionally is the Marisa trie[0].

Context is a data scientist had written a service that essentially was just lookups on a trie. He'd left and the service was starting to have memory problems so I dug in and swapped the implementation. Iirc swapping the trie implementation changed memory usage from 8gb to 100mb and sped everything up as well.

The actual data structure is equivalent to a trie, but cannot be modified once it's been built (I think it may be the same as a LOUDS trie, I don't remember the specifics)

[0] https://github.com/s-yata/marisa-trie


👤 remorses
Struct of arrays (also called MultiArrayList in Zig), instead of storing big structs in an array you store each field in a separate array and if you need to fetch the full struct you reconstruct it from the arrays.

The benefit is that each item's memory footprint is smaller and has no padding; it also increases cache locality.
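A sketch of the layout difference in TypeScript, using typed arrays for the SoA side:

    // Array of structs: one heap object per particle.
    type Particle = { x: number; y: number; health: number };
    const aos: Particle[] = [];

    // Struct of arrays: one tightly packed array per field.
    const N = 10_000;
    const xs = new Float32Array(N);
    const ys = new Float32Array(N);
    const health = new Float32Array(N);

    // A pass that only touches positions now streams through two dense
    // arrays instead of hopping between scattered objects.
    function moveAll(dx: number, dy: number): void {
      for (let i = 0; i < N; i++) {
        xs[i] += dx;
        ys[i] += dy;
      }
    }

    // Reconstructing the full "struct" on demand:
    const particle = (i: number): Particle =>
      ({ x: xs[i], y: ys[i], health: health[i] });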


👤 cr__
Huet's zipper. https://en.wikipedia.org/wiki/Zipper_(data_structure).

One zipper value represents a regular tree of data, but from the perspective of the "current position". There are traversal operations that take a zipper value and return the same tree from the perspective of an element above/beside/below the current position, and there are operations that return the same structure but with the current element changed, or items deleted or inserted.

Huet's paper is an easy read if you've been exposed to OCaml or similar languages: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced... . His "returned glove" metaphor is what made it click for me.

Clojure includes an implementation of Huet's zippers https://github.com/clojure/clojure/blob/master/src/clj/cloju... that is <300 lines of code. It's very, very clever and broadly useful for dealing with XML or other nested data structures.


👤 silverlake
Most of the data structures posted here are taught in CS classes. Here’s an interesting list of more obscure ones: https://web.stanford.edu/class/cs166/handouts/090%20Suggeste...

👤 kaymanb
Disjoint-Sets have a very cool implementation whose amortized time complexity is extremely slow growing. It is not quite constant, but even for a disjoint-set with as many elements as there are particles in the universe, the amortized cost of an operation will be less than or equal to 4.

https://en.wikipedia.org/wiki/Disjoint-set_data_structure


👤 actually_a_dog
Previously on HN:

https://news.ycombinator.com/item?id=1370847 (2010)

https://news.ycombinator.com/item?id=2363522 (2011)

https://news.ycombinator.com/item?id=3369454 (2011) -- Although not much is in the discussion, it points to a Stack Overflow post with some examples.

https://news.ycombinator.com/item?id=7079427 (2014)

https://news.ycombinator.com/item?id=28758106 (2021) -- Not exactly a data structure, but an interesting algorithm.


👤 nemo1618
My contribution: the Binary Numeral Tree: https://eprint.iacr.org/2021/038

This is a tree-like structure where the structure of the tree is fully determined by the number of leaves N. Specifically, each 1-bit in the binary representation of N corresponds to one of the tree's perfect binary subtrees. For that reason, I think of it more as "structure imposed upon a list" rather than a typical mutable tree with pointers and such.

What are the applications of this data structure? I am only aware of one: computing the Merkle root of a large amount of data in streaming fashion. The funny thing is that it's been discovered independently multiple times: in BLAKE3, Sia, Utreexo, and more. Maybe someday, someone will find another use for it. :)


👤 gordaco
Fenwick Trees (which, despite the name, are implemented using an array) allow querying prefix sums AND updating elements in O(log n) time. Very useful when n is in the order of millions. I have used them a few times in Project Euler problems.

https://en.wikipedia.org/wiki/Fenwick_tree
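A compact TypeScript sketch of the array trick (the `i & -i` steps walk the implicit tree):

    class Fenwick {
      private tree: number[];
      constructor(n: number) { this.tree = new Array(n + 1).fill(0); }
      add(i: number, delta: number): void { // point update, O(log n)
        for (i++; i < this.tree.length; i += i & -i) this.tree[i] += delta;
      }
      prefixSum(i: number): number { // sum of elements [0..i], O(log n)
        let s = 0;
        for (i++; i > 0; i -= i & -i) s += this.tree[i];
        return s;
      }
    }

    const f = new Fenwick(10);
    f.add(3, 5);
    f.add(7, 2);
    f.prefixSum(6); // 5
    f.prefixSum(9); // 7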


👤 abetusk
Succinct Data Structures [0] [1]. It encompasses many different underlying data structure types, but the overarching idea is that you want small data size while still keeping "big O" run time.

In other words, data structures that effectively reach a 'practical' entropy lower bound while still keeping asymptotic run time.

[0] https://en.wikipedia.org/wiki/Succinct_data_structure

[1] https://github.com/simongog/sdsl-lite


👤 pavelboyko
OBDD: Ordered Binary Decision Diagrams. This data structure allows efficient symbolic operations on boolean functions. Widely used in electronic design automation software.

[1] https://en.wikipedia.org/wiki/Binary_decision_diagram [2] https://apps.dtic.mil/sti/pdfs/ADA470446.pdf


👤 vivegi
1. Merkle-trees. Helps with content-based-retrieval, efficient add/modify/delete/move node operations. Useful in scenarios where you are building a data-serialization format that needs to support efficient updates.

2. Not necessarily a data structure, but SMT solvers (like Z3) provide a very useful abstraction for certain kinds of problems. They also use some cool data structures like disjoint-sets among other things.

3. Lock-free datastructures


👤 felixr
The SO question "What are the lesser known but useful data structures?" has a good list

https://stackoverflow.com/questions/500607/what-are-the-less...

[edit: the link above has been discussed on HN multiple times https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...]


👤 nicwolff
The Aho-Corasick automaton [0]. You have a bunch of strings, and you want to know if and where they appear in a longer string. You build a search trie from the strings, but add links back from each node to the node representing the longest suffix of the string it represents that is also in the trie, so you can skip backtracking as you scan, yielding linear time complexity in O(length of string being searched + number of results found).

[0] https://en.wikipedia.org/wiki/Aho–Corasick_algorithm


👤 whycombagator
I really like this thread, but now I am concerned which of these will show up in a future interview.

Edit: forgot to add mine. Not obscure by any means but Fenwick Trees are very cool IMO

https://en.m.wikipedia.org/wiki/Fenwick_tree


👤 kaylynb
Sparse sets.

They're often used in Entity Component System architectures since you have O(1) add/remove/find, & O(n) iteration. Iterations are also packed, which is a bonus for cache efficiency.

Can be difficult to implement well, but the concept is simple and a neat example of a useful specialty data structure. Take a look at https://github.com/skypjack/entt


👤 beeforpork
Cuckoo hashing is really nice: fast O(1) access, amortized O(1) insert. And really space efficient, and it can be parameterized easily for a space efficiency vs. access time trade-off (it is always O(1), but with larger constants for better space efficiency). It is also very simple: basically just an array (plus an algorithm). Unexpectedly simple in my opinion for its good time and space performance.

https://en.wikipedia.org/wiki/Cuckoo_hashing


👤 valbaca
(EDIT: using "mail" because I was using "letter" too much in my description)

The way I think of a bloom filter is like at an office mailroom or at post office mailbox where mail is grouped by the letters of people's name.

Given someone's name, you can glance and see "Uptrenda? No letters for 'U'" very quickly. This is when a bloom filter returns FALSE. But if they glance and see mail in the "U" box, then they return MAYBE. After a MAYBE, to see if you actually have any mail, they need to actually go through the box (if it turns out none of it is yours, that was a false positive).

Mine are a couple that are present in the Java Collections, but I'm always disappointed that other languages' standard libraries don't include them:

- TreeMap/Set: A RedBlack tree. It keeps keys ordered, either naturally or by using the provided Comparator, but has log(n) time on basic ops.

    - https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
- LinkedHashMap/Set: keeps keys ordered by insert order via a linkedlist but can be configured to use access-order making it easy to use as a LRU cache. It keeps O(1) time but is less efficient than HashMap:

    - https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html
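JS/TS has no LinkedHashMap, but since Map iterates in insertion order you can fake the access-ordered variant; a small LRU sketch:

    // Re-inserting a key moves it to the back of the iteration order,
    // so the first key is always the least recently used.
    class LRU<K, V> {
      private map = new Map<K, V>();
      constructor(private capacity: number) {}
      get(key: K): V | undefined {
        if (!this.map.has(key)) return undefined;
        const value = this.map.get(key)!;
        this.map.delete(key); // refresh recency
        this.map.set(key, value);
        return value;
      }
      set(key: K, value: V): void {
        this.map.delete(key);
        this.map.set(key, value);
        if (this.map.size > this.capacity) {
          // Evict the least recently used: the first key in order.
          this.map.delete(this.map.keys().next().value as K);
        }
      }
    }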

👤 ignoramous
> Good use-case: routing. Say you have a list of 1 million IPs that are [deny listed].

Apparently, bloom filters make for lousy IP membership checks, read: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/

CritBit Trie [0] and possibly Allotment Routing Table (ART) are better suited for IPs [1].

[0] https://github.com/agl/critbit

[1] https://web.archive.org/web/20210720162224/https://www.harig...


👤 Blahah
1. Probabilistic filtering and matching:

Since you mentioned bloom filters - other probabilistic data structures like count-min sketches (roughly, streaming bloom filters) are super useful. Approximate kmer methods like minhash and w-shingling use them in really cool ways. Rolling hash methods like Rabin chunking also work really nicely with probabilistic/streaming hash tables - splitting the data stream into chunks that can be consistently filtered or matched is useful in many circumstances! Think NSA-level data harvesting, or realtime genome/geospatial data classification.

2. Checking for, locating and counting subsequences in giant string datasets:

Wavelet trees, FM-indices, suffix arrays more generally - and structures that allow extreme performance in very niche situations with arbitrary encodings like bitvectors and compressed integer vectors. If you have a million books, or all the genomes ever sequenced, you use the first structures to find where a specific phrase can be found, even if you don't quite spell it right. You can use the latter ones to do super fast comparisons of giant datasets - given a million viruses, how similar is each one to the human genome, and where specifically do they most closely match?

3. Reconstructing structured information from (potentially noisy) fragments:

De Bruijn graphs (and related graphs like string graphs and colored De Bruijn graphs). This is a way to find all the overlap connections between fragments of information, and then weight the possible traversals of those connections by the evidence. These can be represented using the data structures from #1 (FM-indices for example), and efficiently used in some circumstances with those from #2 to enable some kinds of graph algorithms. If you have a shredded set of documents, or a billion short DNA reads from a genome sequencing experiment, this is how you reconstruct the original.

4. Decentralised coordination structures. Merkle-DAGs and Kademlia DHTs in particular. Being able to compare any trees by root-first hashes, and being able to request the content of the subtree for any node hash from an arbitrary set of peers - these structures enable the p2p web infrastructure. Everything from Limewire to bittorrent, IPFS and blockchains, and, most importantly, Sci-Hub.

1, 2 and 3 together are some of the fundamentals of computational biology. If you're interested in understanding them, https://rosalind.info/problems/list-view/ is a great starting place.


👤 fulafel
This post: http://www.frankmcsherry.org/graph/scalability/cost/2015/01/...

"Scalability! But at what COST?"

In it Frank McSherry uses a handful of data structure tricks to make graph algorithms like connectivity and PageRank run progressively faster on his laptop, beating distributed cluster implementations. It's actually a version of their HotOS paper with Isard and Murray.

Admittedly it's the opposite of obscure, being kind of a myth busting piece about distributed processing being faster and how even largish data sets can fit on a single node (and even your laptop).


👤 nvarsj
Probabilistic data structures are pretty cool.

Count-min sketch [1] is another one. It gives a reasonably accurate count of different events, even millions of unique events, in a fixed memory size. It shows up a bunch in high volume stream processing, and it's easy to understand like the bloom filter.

Another cool data structure is HeavyKeeper [2], which was built as an improvement on count-min sketch for one of its use cases: ranking the most frequent events (like for a leaderboard). It can get 4 nines of accuracy with a small fixed size even for enormous data sets. It's implemented by Redis's topk.

[1]: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch

[2]: https://www.usenix.org/system/files/conference/atc18/atc18-g...


👤 mgraczyk
Fractional Cascading. A simple and very cool way to speed up searching for the same key in multiple lists. Instead of K binary searches taking K·log(N) time, you can do it in log(N) time without using asymptotically more space.

https://en.m.wikipedia.org/wiki/Fractional_cascading

I wrote a simple demo in Rust a while back to help myself learn the language.

https://github.com/mgraczyk/fractional_cascading

I also think pairing heaps are neat.


👤 m12k
The Bag. Also known as a Multiset. I can't believe it took me so many years to learn the name of the most basic data structure imaginable - it basically makes no promises whatsoever about the structure. Think of it like a Set that allows duplicates - you can put stuff in and take stuff out, just like a bag - what more could you want?

👤 samorozco
It's not a data structure but a really cool algorithm: Locality Sensitive Hashing. It hashes similar items to the same value. So instead of a typical hash function that wants to avoid collisions, this algorithm tries to optimize for collisions.

https://en.wikipedia.org/wiki/Locality-sensitive_hashing


👤 zeroonetwothree
Fibonacci Heap

Famous as the data structure that achieved the optimal runtime for Dijkstra's algorithm. In practice it isn't often used because simpler heaps outperform it except in very specific circumstances.

https://en.wikipedia.org/wiki/Fibonacci_heap

Soft heap

As far as I know this hasn’t really been implemented. It’s used for the best known minimum spanning tree algorithm. Pretty crazy. https://en.wikipedia.org/wiki/Soft_heap


👤 allanrbo
Frugal streaming.

For estimating a median or percentile in a stream, using constant memory.

It's very simple: if the current value is higher than our estimate, then increase our estimate by one. Else decrease by one. It will converge over long enough time. This is called the Frugal-1U algorithm.

The Frugal-2U is a slightly more advanced version that modifies the step size along the way, plus other optimizations, to converge faster and not oscillate too much.

https://pastebin.com/wAcCS8WZ
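A TypeScript sketch of Frugal-1U, plus the one-line generalization to arbitrary quantiles:

    // Median estimate in O(1) memory: drift toward the stream.
    let median = 0;
    function updateMedian(x: number): void {
      if (x > median) median++;
      else if (x < median) median--;
    }

    // For a quantile q (e.g. 0.9): step up with probability q,
    // down with probability 1 - q.
    let estimate = 0;
    function updateQuantile(x: number, q: number): void {
      if (x > estimate && Math.random() < q) estimate++;
      else if (x < estimate && Math.random() > q) estimate--;
    }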


👤 didip
Do you know about:

* HyperLogLog? Convenient for doing approximate counting across huge datasets.

* SkipList is another probabilistic data structure that allows you to skip ahead N elements in 1 step. I believe ClickHouse uses it.

* Bitmap index organizes database pointers into a matrix of bits. The bitmap scans simply skip over the zeros. This type of index gets in trouble if you have high cardinality, though.


👤 didip
Have you heard of Geohash? My mind was blown the first time I learned it. https://en.m.wikipedia.org/wiki/Geohash

👤 xyproto
Cuckoo filter

"A space-efficient probabilistic data structure that is used to test whether an element is a member of a set, like a Bloom filter does. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters."

https://en.m.wikipedia.org/wiki/Cuckoo_filter


👤 ffhhj
Latent Dirichlet Allocation (LDA) structure:

"LDA represents documents as mixtures of topics that spit out words with certain probabilities."

"LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents."

https://cfss.uchicago.edu/notes/topic-modeling/

I'm making a next-gen search engine.


👤 dahart
The Suffix Array is surprisingly cool and useful https://en.wikipedia.org/wiki/Suffix_array

The Alias Table for sampling a discrete distribution in O(1) time is a very clever idea https://www.keithschwarz.com/darts-dice-coins/


👤 tialaramex
Hazard Pointers are an interesting concurrent data structure.

Suppose we've got a lot of Doodads, let's say there's a Graph of ten million Doodads, and a whole bunch (dozens? hundreds?) of threads are poking around in this same graph, maybe looking at Doodads and sometimes (but not often) removing them from the Graph.

What happens if my thread is looking at a Doodad, and meanwhile a different thread removes it from the Graph? Is the Doodad destroyed while I was looking at it? That sounds very bad. What if while I'm looking at it, the Doodad is destroyed, but, a nearly identical Doodad appears in its place. How would I know? Not acceptable.

We could reference count the Doodads, we'd need atomic reference counts, something like C++ shared_ptr -- but this would work. However now our performance is terrible, because we've got 100 threads increasing and decreasing these reference counts just to look at Doodads and each such change causes a write that must be observed on other threads.

So instead Hazard pointers go like this:

Every thread who wants to look at Doodads asks for a Hazard Pointer, usually these are never deleted once made because it's much easier. A thread sets this Hazard Pointer to point at a Doodad when it wants to look at one, it checks this was successful (if not the Doodad is gone now), then contemplates the Doodad (it can be in this state for an arbitrarily long period of time) and then it clears the hazard pointer once it is done. It can only point one Hazard Pointer at one Doodad (some schemes allow a thread to have say, two or some finite number of Hazard Pointers each)

Somewhere a Linked List of all these hazard pointers is kept. When you want to remove a Doodad, you take it out of the graph, then you check the List of Hazard Pointers. If any of them was pointing at the Doodad you removed, you put the Doodad on a special pile since it can't be deleted yet, otherwise you can delete it immediately because nobody else needs it any more. This is a relatively expensive operation, but you're only doing it for removal, not just reads.

Finally there needs to be some mopping up of that pile of "couldn't delete yet" Doodads by trying to delete them again (checking the Hazard Pointers again of course). This can be periodic, it can happen every time somebody tries to delete one, or whatever, the best policy may vary between applications.

This approach means reading Doodads costs only two local writes, no waiting around for the cache. Removing them is more expensive, but that's OK because it was rare.


👤 freen
Fountain Codes: reconstruct data from random selection of packets.

Was deeply patent encumbered until recently.

https://en.m.wikipedia.org/wiki/Fountain_code


👤 sterlind
Skip lists! Super wacky, super fun to implement. O(log n) insertion/deletion/search on a collection that stays sorted. It's probabilistic; you have a stack of linked-list levels, and to insert a node you flip coins to pick a level k, then link the node into the lowest k lists.

also, radix heaps are underappreciated.


👤 RiverBucketShoe
Splay trees: https://en.wikipedia.org/wiki/Splay_tree recently searched items are always near the top.

👤 FullyFunctional
My absolute favorite, Cuckoo hashing, has been mentioned already, but so far (239 comments in) nobody has mentioned this approach (does it have a name?) to storing an array A[] of N-bit numbers, most of which are zero or small: use N sets of numbers S[0..N-1], such that for each A[I] you store I in S[J] for every bit position J where A[I] has a set bit. In other words, S[J] holds the indices of elements of A that have bit J set.

The representation of S can be anything, and even dynamically adapting to the data. I’ve seen trees of bitmaps for example.


👤 sgtnoodle
Disruptor queues are also fun. They are lock free multi-producer multi-consumer circular buffers. I figured out the basics on my own in a vacuum out of need, then discovered a relevant white paper.

I used one to implement a fast shared memory pub-sub message bus for communicating across dozens of processes in a simulation framework.


👤 okennedy
Linked Hash/Tree Maps, simple, but elegant. A Map with its nodes connected in a linked list so you can traverse them in insertion order (and O(n) time). Very useful for window queries over sequential data and other cases where you want FIFO access, but also quick access by a field of the data.

👤 stevenjgarner
Crit-bit trees [0]. Also known as TRIES, with an interesting use described in "TRASH - A dynamic LC-trie and hash data structure", which combines a trie with a hash function [1].

Also ROPE, a string allowing for prepends, substrings, middle insertions and appends [2]

[0] http://cr.yp.to/critbit.html

[1] https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96....

[2] https://en.wikipedia.org/wiki/Rope_%28data_structure%29


👤 lscharen
CDAWGs. Compressed Directed Acyclic Word Graphs.

They are like tries but share prefixes and suffixes of a word corpus, can be built in linear time, and since they are graphs, one can define an inner product between two CDAWGs and use them for kernel-based machine learning algorithms.

https://hal-upec-upem.archives-ouvertes.fr/hal-00620006/docu...


👤 CreRecombinase
Roaring bitmaps are compressed bitmaps and they're pretty great. https://roaringbitmap.org/about/

👤 mikewarot
Trivial, but very useful - phase accumulator / numerically controlled oscillator.

With two unsigned integers, freq and phase, you get amazingly fine control over timing. You only use the top N bits of phase for output. It's saved my bacon in embedded systems more than once.


👤 jedberg
I saw this and bloom filter was my very first thought, so I'm glad you had it.

We used it to check if you had voted on a discussion on reddit. In almost all cases the user had not voted, so it was a lot quicker to check that you hadn't voted than that you had.


👤 zeugmasyllepsis
A GADDAG[1] is a combination pre-/suffix trie that's used for move generation in Scrabble. Unlike a traditional dictionary to find anagrams for words, a GADDAG makes it easier to identify plays based off of available letters already on the board. I made a little visualization of how they work for a data visualization class a few years back.[2]

[1] https://en.wikipedia.org/wiki/GADDAG [2] https://www.axioms.me/gaddag/#?page-5



👤 bell-cot
> ...bloom filters... Good use-case: routing. Say you have a list of 1 million IPs that are black listed...

Seems-needed clarification: Bloom filters suffer from false positives, but not false negatives. So - if BloomFilterBlacklistCheck( $IP ) returns false, your router can safely conclude "not blacklisted". But if it returns true, then you'll need to perform further checks. (And you can't just add a second Bloom filter of the known false positives - that filter will have false positives of its own, potentially letting in traffic from blacklisted IPs.)


👤 efortis
The humble array but with a twist for accessing its indices in a hardware multiplexer way with 'shift left' and 'or' bitwise operations.

    /* Bits: selected, hovered */
    const colors = [
      grey,  // 00
      green, // 01
      blueA, // 10
      blueB  // 11
    ]
    const color = colors[selected << 1 | hovered]

https://blog.uidrafter.com/bitwise-table-lookup

👤 winrid
Structures good for geospatial information, like R-trees and quadtrees.

https://en.m.wikipedia.org/wiki/R-tree

Also concurrent data structures.

https://youtu.be/jcqGSehrMGU


👤 benreesman
https://github.com/facebook/folly/blob/main/folly/docs/Packe...

is pretty awesome and I’ve personally gotten big wins from it.

The parent mentioned Bloom Filters, HyperLogLog-type stuff is on that family tree and also very interesting and very useful.

But the coolest thing lately by far? Swiss table (and the largely equivalent F14).

That thing is a game changer in certain scenarios.


👤 sam0x17
Years ago as part of my thesis research I created a rather obscure family of data structures called DFI Filters, or DFilters for short. The work centers around indexing typed trees such as abstract syntax trees or a web browser DOM such that you can answer arbitrary queries of the form "give me the set of all descendants of node P that are of type T" in something that in practice collapses to O(1) in the number of tree nodes.

There are several versions of the data structure, but the general setup is a sort of "tree of trees" where the main tree is a binary search tree organized based on the depth-first indices of the nodes in the tree being indexed, and there are links into each type-specific variant of the main tree (in other words, simplified versions of the main tree containing only one type).

The key intuition behind the whole thing, and the reason I called them DFI Filters, is that if you are at some particular node in a tree, and you know ahead of time the depth-first indices of your node and its siblings, you already know exactly how many descendants your node has, based on the difference in DFI number between your node and its depth-first successor. You can then build a variety of indexing approaches and data structures on this insight -- as an undergrad I came up with ones that query in truly constant time but are expensive to update, and in grad school I was able to build a version that updates on the fly but still retains amortized O(1) for querying and amortized O(m) for updating, where m is the size of the update.

By the way if you're curious how I can get constant time when I'm returning a set of descendants, it's because I can give you the size and pointers to the first and last elements right off the bat.

The link to the original paper seems to be down (looking into it), but you can find the master's thesis covering all variants here: https://github.com/sam0x17/hierarch_old/blob/master/Master's...

I've been working on a rust implementation lately as well. There are a number of applications for DFilters including fast file-system searching, DOM traversal, very fast static analysis on abstract syntax trees, and more.


👤 sgtnoodle
Monotonic stacks are neat.

I made up a data structure once, consisting of a pyramid of deques. It lets you efficiently compute any associative function over a streaming window of data.
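For a taste of the monotonic flavour, here's the classic sliding-window minimum with a monotonic deque, as a TypeScript sketch:

    // Deque holds indices whose values increase front-to-back, so the
    // window minimum is always at the front. Amortized O(1) per element.
    function windowMins(xs: number[], w: number): number[] {
      const deque: number[] = [];
      const out: number[] = [];
      for (let i = 0; i < xs.length; i++) {
        while (deque.length && xs[deque[deque.length - 1]] >= xs[i])
          deque.pop(); // drop values that can never be a minimum again
        deque.push(i);
        if (deque[0] <= i - w) deque.shift(); // expire indices outside window
        if (i >= w - 1) out.push(xs[deque[0]]);
      }
      return out;
    }

    windowMins([4, 2, 12, 11, -5, 1], 3); // [2, 2, -5, -5]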


👤 contingencies
The LIFO buffer. Also surfaces as the messy desk, the million email inbox, or ten-thousand browser tabs. It really works at progressively sorting frequently accessed items to the top, even accounting for changing or seasonal habits and preferences.

👤 stevefan1999
https://en.wikipedia.org/wiki/Fibonacci_heap?wprov=sfla1

Fibonacci heap is theoretically better than a binary heap, but its cache behavior is terrible

https://en.wikipedia.org/wiki/Van_Emde_Boas_tree?wprov=sfla1

Well, I saw this in CLRS tho. Very clever way of abusing bit patterns for quick range query in O(log log M) where M is the integer size.

https://en.wikipedia.org/wiki/Suffix_array?wprov=sfla1

A simpler way to do text matching since suffix array is the preorder traversal of a suffix trie constructed using the same string. Not heavily used in practice, but I do know bioinfo people used it a lot.

https://study.com/academy/lesson/robin-hood-hashing-concepts...

Not a data structure, but still very underrated. Robin Hood hashing basically means you track how far each entry has been pushed from its ideal slot; on insertion, if the new element is further from its ideal slot than the entry it collides with, the two swap - it "steals from the rich" to even out the probe-length distribution, hence Robin Hood. It got more prominent again thanks to Rust: the standard library's original HashMap used Robin Hood hashing (its current implementation, hashbrown, is a port of Google's SwissTable instead).

https://en.wikipedia.org/wiki/Skip_list?wprov=sfla1

The ordered linked list with a fast lane every log(n) levels. Not used very much, but one place I know of that uses it is Redis

https://en.wikipedia.org/wiki/Double-ended_priority_queue?wp...

Basically one priority queue that does two things. Can ironically be implemented with two priority queues (dual)


👤 rockwotj
Interval trees. Really cool trees that allow fast lookups of all the ranges that contain a given point.

https://en.wikipedia.org/wiki/Interval_tree


👤 synalx
PR Quadtrees (https://opendsa-server.cs.vt.edu/OpenDSA/Books/CS3/html/PRqu...) got me my current job at Google. One of my interviewers asked basically "how would you design Google Maps?" and I based my answer on this data structure.

👤 harporoeder
The entire space of probabilistic and sublinear data structures is fascinating. From hyperloglog for high cardinality uniqueness counting, to Cuckoo filters (similar to bloom filters), or Count–min sketch for frequency counting.

https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...


👤 fho
Due to data immutability, a lot of the standard data structures in Haskell are quite interesting.

Eg `Data.Sequence` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) is based on Finger Trees (http://staff.city.ac.uk/~ross/papers/FingerTree.html) which allow O(log(min(i,n-i))) inserts (at i) and O(1) inserts at the beginning or end, even though the data structure is basically copy-on-write.

In general Haskell has a lot of specialized data structures. One of my favorites is `Data.IntMap` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) which is a specialized map for integer keys (based on Patricia trees iirc).

(Man I miss working in Haskell :-( )


👤 XCSme
Not a data structure, and not really obscure, but I am still amazed by the concept of solving DP (dynamic-programming) problems in O(log N) instead of O(N) time by writing the solution as a Fast Matrix Exponentiation equation: https://www.geeksforgeeks.org/matrix-exponentiation/
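For example, Fibonacci in O(log n) multiplications by exponentiating [[1,1],[1,0]]; a TypeScript sketch with bigints:

    type Mat = [bigint, bigint, bigint, bigint]; // row-major 2x2

    const mul = (a: Mat, b: Mat): Mat => [
      a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
      a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3],
    ];

    // [[1,1],[1,0]]^n = [[F(n+1), F(n)], [F(n), F(n-1)]]
    function fib(n: number): bigint {
      let result: Mat = [1n, 0n, 0n, 1n]; // identity
      let base: Mat = [1n, 1n, 1n, 0n];
      for (; n > 0; n >>= 1) { // square-and-multiply on the exponent
        if (n & 1) result = mul(result, base);
        base = mul(base, base);
      }
      return result[1];
    }

    fib(10); // 55n
    fib(90); // 2880067194370816120n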

👤 pgt
Not technically a data structure, but Reservoir Sampling is an interesting probabilistic sampling technique used to choose a random sample from a population of unknown size N in a single pass, so useful for a data structure that does not fit in memory.

https://en.wikipedia.org/wiki/Reservoir_sampling


👤 haroldl
In a Scrabble AI coding contest I discovered the GADDAG which is like a trie but indexes all of the words forwards and backwards from any starting point within the words.

https://en.wikipedia.org/wiki/GADDAG#:~:text=A%20GADDAG%20is....


👤 Gehinnn
Finger trees. They work on an arbitrary monoid and maintain a list of items. You can insert and remove items in logarithmic time, as well as ask for the sum of items in a given range, also in logarithmic time.

This is useful for example when you want to convert between offsets and line/column numbers efficiently, while you also want to efficiently update/insert/replace lines.


👤 earthnail
Vantage-point trees. Great structure for nearest neighbour searches.

Bonus tip: When you have a fast distance function (like a hamming distance) and write your tree from scratch in C (which almost automatically means you'll optimise it for your use case), they can become absurdly fast - much faster than the general-purpose nearest neighbour libraries I'm aware of.


👤 weinzierl
Nearly 300 comments and no one said

Rainbow Tables.

Sure, they're not used anymore and are not particularly useful nowadays, but I find the idea behind them very beautiful.


👤 almog
The Hierarchical Timing Wheel is an efficient data structure/algorithm for managing timers (event scheduling) when: 1. The variance among timer durations is large. 2. Timers are likely to be cancelled. 3. A fixed (configurable) precision is acceptable.

This talk provides a nice overview of different timing wheel implementations, including hierarchical and hashed timing wheels.


👤 slazaro
I don't know whether it already exists or if it has a name, but internally I call it Virtual List.

- Reasoning: It's used in cases where you'd ideally use an array because you want contiguous memory (because you'll usually iterate through them in order), but you don't know beforehand how many elements you'll insert. But you can't use a resizeable version like std::vector, because it invalidates any pointers to the elements you've already allocated, and you need to store those pointers.

- Implementation: As a black box, it's used like a list: you index elements with an integer, and you can append to the end. But internally it uses different lists. Like a std::vector, it starts with an allocated size, and when it needs a bigger size, it allocates newer blocks that are "added" to the end, but doesn't move the already allocated elements.

- Downsides: Data is not all contiguous, at some point there are jumps between elements that should be close together. For performance when iterating, it's negligible since those cache misses happen "rarely" (depends on the block size, bigger block size means better cache performance but more potential wasted allocated memory), and most of the time you're iterating from start to end. But this means that you can't use this structure with algorithms that use "pointer + size", since internally it doesn't work like that. And since internally you can't shrink the lists and invalidate pointers, there's no way to delete elements, but in my case it's not a problem since I'd reuse indices afterwards with an added structure.

- Applications: I used it when coding games, when you have a big list of entities/components but you don't know how many beforehand. You access them by index, you might store pointers to elements, and most of the time you'll iterate through the whole list (when updating, drawing, etc.).


👤 _trackno5
Not really obscure, but I had never heard of it till recently (and most people I've talked to had no idea about it): Difference Arrays.

Great for when you need to do ranged updates in constant time.

More info: https://codeforces.com/blog/entry/78762


👤 sebastianconcpt
Good! I've always been fascinated by non-boolean (probabilistic, instead) indexing.

I remember implementing an OOP indexed version of U.C. Berkeley's Cheshire-II in 2007 with Dolphin Smalltalk.

I was using BTrees though, hence O(log N). The non-boolean part was in the values of the keys, which were probabilities of a good match against your search target.


👤 andrenotgiant
Can I describe a data queueing problem that I feel like there's a specific data (or queue) structure for, but whose name I don't know?

Let's say you are trying to "synchronize" a secondary data store with a primary data store. Changes in the primary data store are very "bursty", one row will not change for days, then it'll change 300 times in a minute. You are willing to trade a bit of latency (say 10 seconds) to reduce total message throughput. You don't care about capturing every change to the primary, you just want to keep it within 10 seconds.

It feels like there should be a clever way to "debounce" an update when another update overrides it 500ms later. I know debounce from the world of front-end UI where you wait a little when doing autocomplete search based on keyboard input so as not to overwhelm the search.


👤 manuelabeledo
Concurrent tries with non-blocking snapshots [0]

Say that you have a dataset that needs to be ordered, easily searchable, but is also updated quite frequently. Fast accesses are a pain if you decide to use traditional read-write locks.

Ctries are entirely lock-free, thus there is no waiting for your read operations when an update is happening, i.e. you run lookups on snapshots while updates happen.

They are also a lot of fun to implement, especially if you aren't familiar with lock-free algorithms! I did learn a lot doing it myself [1]

[0] http://aleksandar-prokopec.com/resources/docs/ctries-snapsho...

[1] https://github.com/mabeledo/ctrie-java


👤 jll29
Associative memory? Store data in them and they will never complain they are full, but they may "forget" instead. Querying them with slightly wrong keys, they'll auto-correct the keys as long as they are not overloaded.

Binary fuse filters? (Xor|Bloom) filters on steroids.

Non-deterministic finite state transducers? Reversible, more elegant and more powerful version of non-deterministic finite state recognizers (transitions are labelled with two outputs, which can be interpreted as inputs versus outputs, but these roles can be swapped as they are symmetric).

Ropes? What, you are still using strings? Dude, please stop writing that newbie editor.

Caches with Time aware least recently used (TLRU) or Segmented LRU (SLRU) cache replacement strategies (Oh, you thought Least recently used (LRU) was cool?)


👤 sairahul82
The FST is an underappreciated data structure. It's used in places like speech recognition and synthesis, machine translation, optical character recognition, pattern matching, string processing, machine learning, and information extraction. If you know about it, you know exactly where to use it.

👤 nathas
Hashed time wheels: https://blog.acolyer.org/2015/11/23/hashed-and-hierarchical-...

Great for short-lived values in a cache that are frequently accessed. If all of your values always have the same expiration period (i.e. expires at insert time + N period) then it's super efficient.

More or less you have an array with a head and tail. Every "tick" (usually each second but you can customize it) you move the read pointer ahead by 1 array slot. The tail is then evicted.

If you have a highly accessed part of your code, you can be lazy and avoid using an extra thread and just tick ahead the read pointer when the time changes.
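
A minimal sketch of the fixed-expiry wheel described above (tick length and sizes are illustrative):

    class TimerWheel:
        """Everything expires exactly (slots - 1) ticks after insertion."""
        def __init__(self, ttl_ticks):
            self.slots = [[] for _ in range(ttl_ticks + 1)]
            self.pointer = 0

        def add(self, item):
            # Insert one slot "behind" the pointer: expires after ttl_ticks ticks.
            n = len(self.slots)
            self.slots[(self.pointer + n - 1) % n].append(item)

        def tick(self):
            # Advance the pointer; whatever it lands on has just expired.
            self.pointer = (self.pointer + 1) % len(self.slots)
            expired, self.slots[self.pointer] = self.slots[self.pointer], []
            return expired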


👤 bjoli
We all know about Clojure's immutable vectors (tries, basically). I always preferred the RRB-tree, which is the same thing but relaxes the leftwise-dense requirement. This means you have a bit-partitioned trie until you do a split, merge or insert, after which you get slightly slower operations, but still very competitive.

It is actually a quite trivial change until you come to the merge algorithm, which is finicky in all the wrong ways. The benefits are (in Clojure terminology) O(1)-ish splits, inserts and concatenations instead of O(n).

I don't know if it can be called obscure anymore, since Scala's vectors are RRB-trees and clojure actually has an implementation of them, but I often see people talk about ropes in situations where an RRB tree should be a no-brainer.


👤 apgwoz
Good old Calendar Queues: https://en.m.wikipedia.org/wiki/Calendar_queue

These things are common in discrete event simulation, but, are useful anytime you have a time based queue.


👤 z3t4
A sorted list. It's often cheaper to sort the list after each change than to search the whole list, especially when you do collision detection in 3D... (maybe not so obscure, but at least underestimated) When I pair my socks I first place them in color order :)


👤 myaccount80
Dancing links https://en.m.wikipedia.org/wiki/Dancing_Links

“ In computer science, dancing links (DLX) is a technique for adding and deleting a node from a circular doubly linked list. It is particularly useful for efficiently implementing backtracking algorithms, such as Donald Knuth's Algorithm X for the exact cover problem.[1] Algorithm X is a recursive, nondeterministic, depth-first, backtracking algorithm that finds all solutions to the exact cover problem. Some of the better-known exact cover problems include tiling, the n queens problem, and Sudoku.”


👤 agilob
Not super obscure, but I remember one specific time when a circular linked list made a lot of sense to use. Well, I wanted to use it, so I used it.

I had a bunch of API keys with poor expiry documentation and implementation, so to find out if a key had expired, it had to be used. I put them in a main "keys.pop loop" and all methods below tried to use the key. If the HTTP response was (some other obscure HTTP status code like) 505, I simply called `continue;` in the loop to jump to the next key, without caring at all where I was before.

https://github.com/atedja/gring


👤 magicalhippo
Not sure you could call it a data structure as such, but Latent Semantic Indexing[1] kinda blew my mind when I learned about it.

The fact that you could perform linear algebra on documents and words, to get out a number that somehow quantified how close a document was to another really surprised me back in the day. Grew a new appreciation for what exactly those vectors in linear algebra could represent.

Not in the field so don't know how obscure it is though.

[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...


👤 carlopi
Not sure whether they qualify as obscure, but I haven't seen these cited already:

- Dominator tree (https://en.wikipedia.org/wiki/Dominator_(graph_theory))

- Single-Connected-Components-Graph

- Deterministic data structures (e.g. a set that behaves deterministically despite addresses being somehow randomly assigned; very useful for ensuring reproducibility)

Already cited, but it's clearly among the most elegant:

- union-find (!!!!)

and as a bonus one that is easily overlooked:

- std::deque, which when restricted to push_back() and push_front() guarantees never to move objects around.


👤 api
I love the set reconciliation structures like the IBLT (Invertible Bloom Lookup Table) and BCH set digests like minisketch.

https://github.com/sipa/minisketch

Let's say you have a set of a billion items. Someone else has mostly the same set, but they differ by 10 items. These let you exchange messages that would fit in one UDP packet to reconcile the sets.

Minisketch is more costly in CPU but has deterministic guarantees. IBLTs are very fast but have a chance of failure requiring a randomized iterative procedure.


👤 alfiedotwtf
The XOR Linked List blew my mind when I first read about it. Clever!

https://en.wikipedia.org/wiki/XOR_linked_list


👤 mkindahl
Also, the algorithms around using Reduced Ordered Binary Decision Diagrams (ROBDDs) to represent sets efficiently. They are used in verification algorithms with very large state spaces, such as "Symbolic Model Checking: 10^20 states and beyond" by Burch, Clarke, and McMillan <http://www.cse.chalmers.se/edu/year/2012/course/TDA956/Paper...>

👤 benhoyt
The BK-Tree, which allows fast querying of "close" matches, such as Hamming distance (number of bits different). http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK...

I wrote a Python library implementing them a number of years ago: https://github.com/benhoyt/pybktree


👤 anmalhot
Since you mentioned Bloom filters, here are ribbon filters: https://engineering.fb.com/2021/07/09/data-infrastructure/ri...

Edit: another that I enjoyed working with 'Burst Tries': https://dl.acm.org/doi/10.1145/506309.506312


👤 edem
It must be [CHAMP](https://blog.acolyer.org/2015/11/27/hamt/#:~:text=CHAMP%20st....), an acronym for Compressed Hash-Array Mapped Prefix-tree. It is the current state-of-the-art persistent hash table. I'm surprised that no one mentioned it before, because it is super useful!

👤 zasdffaa
> Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.

Humm. A link to these would have been good[1], but anyway: I understand that an optimally filled bloom filter takes ~1.44 times more space than the theoretical optimum (which I understand Golomb encoding gives you?), so 'much smaller' is not a helpful measure. Likewise 'worse performance' is not a helpful phrase; I believe a linear scan is needed for decoding, but a small amount of space spent on an extra indexing structure can speed things up a lot. Bloom filters are updatable (addition only, in the simplest bloom filter); Golomb coded sets practically can't be updated except by rebuilding, AIUI.

I suppose binary heaps should get a mention <https://en.wikipedia.org/wiki/Binary_heap> cos they barely seem to figure in other comments here. The neat bit is that they are trees, but implicit (stored in an array) rather than explicit (which would use pointers), and always balanced, giving you log access times in a compact form (sketch below).

[1] <https://en.wikipedia.org/wiki/Golomb_coding> but there are simpler explanations on the web
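
The implicit layout mentioned above is just index arithmetic; a minimal Python sketch (heapq uses exactly this array layout):

    import heapq

    # In the array form, node i's relatives need no pointers at all:
    def parent(i): return (i - 1) // 2
    def left(i):   return 2 * i + 1
    def right(i):  return 2 * i + 2

    h = []
    for x in [5, 1, 9, 3]:
        heapq.heappush(h, x)   # sift-up walks the parent() chain
    print(heapq.heappop(h))    # 1; sift-down follows the smaller child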


👤 odipar
SeqHash

This data structure is an immutable, uniquely represented Sequence ADS (Authenticated Data Structure) with various interesting use cases.

It is based on a novel hashing scheme that was first introduced with SeqHash.

See http://www.bu.edu/hic/files/2015/01/versum-ccs14.pdf by Jelle van den Hooff, a brilliant young researcher back then.

SeqHashes are uniquely represented Merkle trees that also represent a sequence. Concatenation of two SeqHashes is O(log(n)). In turn, cryptographic signing of SeqHashes is O(1), as each tree node carries a cryptographic hash.

Of course, having each node carry a hash incurs a hefty overhead, but that can be alleviated by (lazily) grouping nodes into buckets (turning it into some kind of B-tree).

SeqHashes also don't support splitting in O(log(n)) like for example AVL trees.

I've created a B-tree version of SeqHash, called SplitHash, that also allows splitting in O(log(n)).

See: https://github.com/odipar/spread/blob/master/src/main/scala/...


👤 mamcx
I made https://github.com/mamcx/tree-flat, a flattened tree stored in pre-order that allows very fast iteration, even for child/parent queries. It's based on APL, so not that novel.

I also really like the relational model; it's not that well represented, so I'm making a language on top of it: https://tablam.org.



👤 scary-size
Count-min sketch comes to mind [1]. It's a probabilistic data structure similar to the bloom filter. Learnt about it in this system design video "Top K Problem" [2].

[1] https://en.wikipedia.org/wiki/Count–min_sketch

[2] https://www.youtube.com/watch?v=kx-XDoPjoHw


👤 RF_Savage
Low-density parity-check (LDPC) codes [1].

Forward error correction so good that up to a few years ago it was all under patents.

Gold codes [2] are also very cool, in CDMA systems.

[1] https://en.m.wikipedia.org/wiki/Low-density_parity-check_cod...

[2] https://en.m.wikipedia.org/wiki/Gold_code


👤 cdelahousse
Min-Max Heaps

Think of them as double-ended priority queues: O(n) construction, O(lg N) mutations, and both the min and max elements are O(1) to get.

The paper is short and the data structure is elegant. I read it a few years ago at uni and it made me appreciate how something greatly useful can be implemented very simply.

https://dl.acm.org/doi/pdf/10.1145/6617.6621


👤 epelesis
BW Trees are pretty neat [1][2]. Typically BTrees in database engines need latches to prevent concurrent writes messing up the tree but BW Trees conveniently sidestep that requirement by appending all node updates to a linked list (called a Delta Chain), and having each reader apply the delta of updates by traversing the linked list until it hits the final node in the list. Periodically, these linked lists are compacted back into single nodes.

If you like these kinds of data structures, the Database Internals [3] book has a good introduction and explanation of a handful of BTree optimizations like this.

1: https://www.microsoft.com/en-us/research/wp-content/uploads/...

2: https://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pd...

3: https://www.databass.dev/


👤 JKCalhoun
A certain ePub reader on MacOS used to use a bloom filter for text search. You would decompose an ePub into pages of text and generate a bit array for the text on each page. As the user types in the search field you could often reject large portions of the document — leaving perhaps only a handful of pages where the code would only then have to drop down to the more tedious string matching algorithm.

Very cool.


👤 jcadam
How about the R-Tree, used for accessing/searching spatial data: https://en.wikipedia.org/wiki/R-tree

Maybe not so obscure if you've worked in the geospatial realm.

I actually had a job interview once that asked me to implement an R-Tree from scratch in C++ as a homework assignment - I didn't get that job :)


👤 paskozdilar
I've had a situation where I needed to stream blocks of data from a remote computer in soft-realtime, but still needed to share the same data with many different consumers.

The code was simple, but effective (Go):

    import "sync"

    type Muxer[T any] struct {
        ret T
        mut sync.RWMutex
    }

    func (c *Muxer[T]) Multiplex(fn func() T) (ret T) {
        if c.mut.TryLock() {
            // We won the race: do the expensive fetch ourselves and cache it.
            defer c.mut.Unlock()
            ret = fn()
            c.ret = ret
        } else {
            // Someone else is fetching: block until they finish, then reuse
            // their freshly cached result.
            c.mut.RLock()
            defer c.mut.RUnlock()
            ret = c.ret
        }
        return ret
    }
The way to use it is to write a non-concurrent nullary get() function, then pass it to the Multiplex() function in as many goroutines as needed:

    get := func() DataType { ... }
    muxer := Muxer[DataType]{}
    
    // in other goroutines
    value := muxer.Multiplex(get)

The value will be shared across all goroutines, so we get minimum latency with minimum copying.

👤 groby_b
All probabilistic structures are fascinating to me. I'm most enamored with HyperLogLog - estimating the number of distinct elements in extremely large sets of data without having to build a set of seen elements - but the entire field is pretty awesome.

The canonical use case for HLL is counting unique visitors. But any large dataset where you only need a good approximation is a valid target :)


👤 anigbrowl
Quadrilateral Simmelian backbones

Count quadrangle structures in graphs to derive an embeddedness coefficient for each edge and expand the graph along the most embedded edges, yielding nice clusters.

https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...


👤 imachine1980_
Use bit arrays instead of matrices in board games like Tetris or chess, efficiently: "A bitboard is a specialized bit array data structure commonly used in computer systems that play board games, where each bit corresponds to a game board space or piece. This allows parallel bitwise operations to set or query the game state, or determine moves or plays in the game."

👤 fulmicoton
Pure algorithm & perf stuff...

- Exponential unrolled linked list: I don't know what they are called, so I called them that way. https://fulmicoton.com/posts/tantivy-stacker/

- Radix heap. I actually had to use that data structure on a real use case.

- Radix tree. Actually super useful.

- HashMap where the string and the value are contiguous in memory. Weirdly I've never seen that one... The hashmap can store the full hash in its hash table anyway, so false positives are super rare. It will have to check the string anyway... You can improve your locality by keeping the key and the value in the same place in RAM.

just practical stuff, in python, I often use a magic dict defined as follows.

    from collections import defaultdict

    def MagicDict():
        return defaultdict(MagicDict)

then you can use your MagicDict as a weird json-like map.

    c = MagicDict()
    c[3]["aaa"][2] = 5


👤 mysterydip
Dijkstra maps. This page describes them better than I can: http://www.roguebasin.com/index.php/Dijkstra_Maps_Visualized but they can be a really simple way to add emergent behavior to an AI in a game (my interest in them), among other things.

👤 yeputons
Eertree. As one can guess from the name, it stores all sub-palindromes of a given string and does stuff to them. Somewhat similar to a suffix tree in spirit, but was only invented in 2015.

See https://en.wikipedia.org/wiki/Palindrome_tree and the original paper


👤 carapace
DLX https://en.wikipedia.org/wiki/Dancing_links

Stanford Lecture: Don Knuth — "Dancing Links" (2018) https://www.youtube.com/watch?v=_cR9zDlvP88


👤 dskloet
Invertible Bloom Lookup Tables. They're like Bloom filters but you can also remove elements and even reconstruct elements in certain cases. Useful for syncing 2 databases which are almost in-sync, by sending only a small amount of data between them. It's used by Bitcoin nodes to communicate the contents of newly mined blocks.

👤 Dwedit
Here's another one I was thinking about. It's an idea for a hierarchical Free Space Bitmap. I don't know if anything like this has appeared in a real file system or not.

At the lowest level, let's have one bit represent if a 4K chunk is free space or occupied. 64 bits at this level tells you the usage of 64 * 4096 bytes (256KB)

Then Level 2. Two arrays. One bit at this level represents 64 bits in the level 1 array. One array for All-Bits-Set in level 1, one array for Any-Bit-Set in level 1. 64 bits at Level 2 tells you usage of 64 * 64 * 4096 bytes (16MB).

Then Level 3. Two arrays. One bit at this level represents 64 bits in the level 2 array. One array for All-Bits-Set in level 2, one array for Any-Bit-Set in level 2. 64 bits at Level 3 tells you the usage of 1GB.

I'd imagine there would be ways to use the hierarchical tables to quickly find a hole of a particular size without scanning the entire lowest level.
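
A hedged sketch of building one summary level (Python bools standing in for bits; sizes as described above):

    WORD = 64

    def summarize(level):
        """One all-set bit and one any-set bit per 64-bit word of the level below."""
        n_words = (len(level) + WORD - 1) // WORD
        all_bits = [all(level[w * WORD:(w + 1) * WORD]) for w in range(n_words)]
        any_bits = [any(level[w * WORD:(w + 1) * WORD]) for w in range(n_words)]
        return all_bits, any_bits

    # Level 1: one bool per 4K chunk (True = occupied). 192 chunks = 3 words.
    l1 = [True] * 64 + [False] * 64 + [True] * 32 + [False] * 32
    l2_all, l2_any = summarize(l1)
    print(l2_all)  # [True, False, False]
    print(l2_any)  # [True, False, True]
    # any=False: 16MB guaranteed free at a glance; any=True with all=False:
    # a partial word worth descending into when hunting for a hole.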


👤 nonbirithm
Colonies/hives, which have fast insertion/erasure/iteration and preserve pointers on insertion/erasure. I remember that Cataclysm: DDA used this data structure for keeping track of the items on each map tile.

https://plflib.org/colony.htm


👤 Dowwie
Slotmap seems like a superior data structure to Vec-backed adjacency lists, but it still hasn't replaced them in graph libraries: https://docs.rs/slotmap/latest/slotmap/

👤 gb
The half-edge data structure, used in computational geometry.

I remember when I first discovered it, it drove home the point that choice of data structure can really matter - it made a bunch of problems I'd been working on much easier to solve, as it gave a different way of looking at things.


👤 Dwedit
Here's one I don't know if I've ever seen documented anywhere. If anyone knows a proper name for this one, let me know!

Imagine it's for a text editor, and you want to map Line Numbers to Byte Positions. But then you want to insert a byte somewhere, and you need to add 1 to all your Byte Position values.

Instead of actually keeping a big array of Byte Position values, you have a hierarchical array. The convention is that whenever you want to read the value, you add the Parent's value, and that parent's value, etc. Number of parents would be logarithmic.

So if you want to add 1 to everything past a certain point, you just apply it to some leaf nodes, then the next internal nodes, rather than applying it to every single leaf node.
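
This is essentially the lazy-offset trick from segment trees (a cousin of the Fenwick tree). A hedged Python sketch, where each stored value is the sum of the node offsets on its root path:

    import math

    class OffsetArray:
        """value(i) = sum of offsets from the root down to leaf i."""
        def __init__(self, n):
            self.base = 1 << max(0, math.ceil(math.log2(n)))
            self.node = [0] * (2 * self.base)   # implicit tree, 1-indexed

        def get(self, i):
            j, total = self.base + i, 0
            while j >= 1:
                total += self.node[j]
                j //= 2
            return total

        def add_from(self, i, delta):
            """Add delta to every element at index >= i, in O(log n)."""
            j = self.base + i
            self.node[j] += delta
            while j > 1:
                if j % 2 == 0:                 # left child: the right sibling
                    self.node[j + 1] += delta  # covers all larger indices here
                j //= 2

    pos = OffsetArray(8)
    pos.add_from(3, 1)   # "a byte was inserted before position 3"
    print([pos.get(i) for i in range(8)])  # [0, 0, 0, 1, 1, 1, 1, 1]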


👤 hansvm
Succinct data structures in general are interesting. They use close to the information theoretic bound for a given problem and still support fast operations -- a decent mental model is that they're able to jump to a roughly correct place and (de)compress the data locally.

A good example is a wavelet tree. Suppose you have a text of length N comprised of an alphabet with C characters and wish to execute full-text pattern searches for patterns of length P. Once the tree is created, you can do your full-text search in O(P log C) time, and you use less space than any constant multiplicative factor of the information theoretic lower bound. Note that the runtime is independent of the original string's length.


👤 chrsig
I'll add: the count-min sketch[0]. Used for streaming top-k, where the set of keys may be too large to fit in memory. It's essentially a composite of a fixed size (k) heap and a counting bloom filter.

Instead of using a bit field for membership, use an integer field for tracking count.

When querying: take the min of each value in the field location given by the hash functions.

When updating: increment each field location given by the hash functions and take the min of the new values. If the min is greater than the least value in the heap, push the updated key into the heap (see the sketch below).

[0] https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
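
A minimal sketch of the counting side (width, depth and the hashing scheme are illustrative); the fixed-size top-k heap bolts on exactly as described:

    import hashlib

    class CountMinSketch:
        def __init__(self, width=2048, depth=4):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _cells(self, key):
            for row in range(self.depth):
                digest = hashlib.blake2b(f"{row}:{key}".encode()).digest()
                yield row, int.from_bytes(digest[:8], "big") % self.width

        def add(self, key, count=1):
            for row, col in self._cells(key):
                self.table[row][col] += count

        def estimate(self, key):
            # Collisions only ever inflate a cell, so the min is the best bound.
            return min(self.table[row][col] for row, col in self._cells(key))

    cms = CountMinSketch()
    for word in ["a", "b", "a", "a"]:
        cms.add(word)
    print(cms.estimate("a"), cms.estimate("b"))  # 3 1 (may overestimate)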


👤 winReInstall
Input/output unions. Basically an input struct for an algorithm, laid out in such a way that the algorithm's output only ever overwrites input that is no longer needed. Done well, this allows hot loops that basically go over one array for reads and write-backs. After all, it's all just memory, and using what you have in situ is the fastest way to go.

No pointers to dereference and wait on: just the input, computed, stored back into the same cache line, to be used for the next part of the hot loop. If the hot loop is multi-staged and the output continuously decreases in size, one can even "compress" the data by packing an array of origMaxSize/outputSize subunits into the unit.


👤 flobosg
Another use case for Bloom filters: Bioinformatics (see https://en.wikipedia.org/wiki/Bloom_filters_in_bioinformatic...)

👤 pklausler
Briggs-Torczon sets of small integers (0 <= x < N) deserve to be better known.

(Set up two N-element arrays "member" and "index" of ints, plus an elementCount; set elementCount to zero. You don't have to initialize the arrays if you can live with some valgrind grumbling. The leading "elementCount" entries in "member" are the current members of the set, unordered, and for each of those values x, index[x] is its position in "member".

Then isInSet(x) = index[x] >= 0 && index[x] < elementCount && member[index[x]] == x and addToSet(x) = isInSet(x) || (member[index[x] = elementCount++] = x)

and rmFromSet() is left as an exercise.)
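
The same thing spelled out in runnable form (Python has to zero the arrays, so the no-initialization trick is lost, but the O(1) mechanics are identical):

    class SparseSet:
        """Briggs-Torczon set of ints in [0, n)."""
        def __init__(self, n):
            self.member = [0] * n   # dense: first `count` slots hold the set
            self.index = [0] * n    # sparse: index[x] locates x in member
            self.count = 0

        def __contains__(self, x):
            i = self.index[x]
            return i < self.count and self.member[i] == x

        def add(self, x):
            if x not in self:
                self.index[x] = self.count
                self.member[self.count] = x
                self.count += 1

        def remove(self, x):        # the exercise: swap x with the last member
            if x in self:
                last = self.member[self.count - 1]
                i = self.index[x]
                self.member[i] = last
                self.index[last] = i
                self.count -= 1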


👤 Gravityloss
Don't know if it fits the criteria, but a very basic concept: circular arrays.

They are great for live-updated time series data: you never have to grow your arrays, allocate memory, etc. Extremely fast, predictable and resource-efficient.


👤 ur-whale
Cuckoo hashing: https://en.wikipedia.org/wiki/Cuckoo_hashing

Hash tables that come with a parameter that lets you control density vs. speed.


👤 CharanSriram
This isn't so much a data structure as an implementation of one, but map-represented trees are great.

For context, they were used by Figma to represent its scene-graph (https://www.figma.com/blog/how-figmas-multiplayer-technology...).

Essentially, it’s a way to represent a normal tree as a map; keys of the map represent individual nodes on the trees, and one key “ROOT” points to the rootmost node.

Each node is connected via the following relationship: parents have an unordered set of their children’s keys (not the actual node reference) and children know their parent’s key. Children also maintain a property to track their index using a technique called fractional indexing.

What's great about this data structure is that you get O(1) transformations for most operations. If you need to change a node's property, height for instance (assuming this is a design tool or something), all you need is that node's key. If you need to reposition a node under a new parent, simply remove the key from the old parent's set, change the node's parent key, and update the index. This structure plays well with collaborative apps too, as most edits to the tree are O(1): edits are swift and can be communicated to other collaborative instances quite simply, with small payloads.

Last thing I’ll touch on is fractional indexing. Very crudely put, it forces children nodes to track their position in a list via an “index” that can be a normal float. To reposition a child node, you first find its neighbors and average their indices to find a new index. Ordering the list is a sort from smallest indexes to largest.

In an example, say we had 3 nodes A, B, and C. A’s index is 0.1, B’s is 0.3, and C’s is 0.5. To insert an element D between A and B, I average A and B’s index (0.2), set that as the index of D and insert it into the parent node’s children set. At render, the set is ordered and we find A, D, B, C.
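
The averaging step from that example, as a tiny sketch (real systems rebalance when neighboring indices get too close, or use arbitrary-precision strings instead of floats):

    def index_between(before, after):
        """Fractional index strictly between two neighbors (None = list edge)."""
        if before is None:
            return after - 1.0
        if after is None:
            return before + 1.0
        return (before + after) / 2

    children = {"A": 0.1, "B": 0.3, "C": 0.5}
    children["D"] = index_between(children["A"], children["B"])  # ~0.2
    print(sorted(children, key=children.get))  # ['A', 'D', 'B', 'C']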

Shameless plug, but I’m using these techniques to build a multi-framework mobile app builder at phazia.com :). I love nerding out about these things so thanks for the chance to do so @op.


👤 dicroce
I came up with a clever one recently. I had a stream of fixed-size records arriving exactly one per second, I needed to keep approximately two weeks of this data, and I needed to be able to query the records by time. The idea: use modulus on the creation time to calculate a write position in the file, and use the current write position to locate where queries start and stop. The file then has zero space overhead. Is this something known, or did I actually invent something here? If it's my invention, I'm open to suggestions for a cool name!
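
For concreteness, the offset math (a hedged sketch; record size and retention are illustrative). It behaves like a time-indexed circular buffer on disk:

    RECORD_SIZE = 64                # fixed-size records, one per second
    RETENTION = 14 * 24 * 3600      # roughly two weeks of slots

    def offset_for(ts):
        """Byte offset in the file where the record for second `ts` lives."""
        return (int(ts) % RETENTION) * RECORD_SIZE

    # Write: seek to offset_for(now) and overwrite in place.
    # Query [t0, t1]: read offset_for(t0) through offset_for(t1), wrapping
    # at the end of the file. Zero space overhead, as described.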

👤 adontz
Succinct Trie http://stevehanov.ca/blog/?id=120

Super effective way to store a searchable list of items, like a dictionary of words.


👤 Joel_Mckay
First contact with Functors, multidimensional Kalman filtering abstractions, and Golomb ruler optimized DFT... kind of like meeting the Wizard of Oz, and finding out it was a chicken all along. ;)

👤 the-alchemist
Very cool stuff...

To take advantage of all these, where would one go to ask "what's the best data structure to use to implement X, Y, and Z?"

Is there a stackoverflow tag or something like that that folks use??


👤 benou
Very good idea!

And I love bloom filters too (and their more modern successor, the cuckoo filter), but I'd challenge the use case you mention: 1 million IPv4 addresses is 4MB, and 16MB for IPv6. That's tiny; you're better off using some kind of hashtable, unless you have a fast small memory plus a slow big memory (say, an embedded processor with a small CPU cache and some DRAM).

Bloom filters are useful when your working set cannot fit in your fast memory, to allow you to go to the slow memory only for a small number of requests.


👤 ajjenkins
I recently learned about deques in Python (other languages may have it too) which is like a linked list but it keeps pointers for the first and last elements. This allows you to insert/read/remove elements from the beginning or end of the list in constant time. I’ve never used it, but I could imagine it being useful in certain situations.

https://docs.python.org/3/library/collections.html


👤 ChrisMarshallNY
Ole Begemann and Chris Eidhof wrote Advanced Swift, and, in there, described a really efficient FIFO queue.

I implemented a variant of it, in my Generic Swift Toolbox Package[0]. It’s lightning fast.

[0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...


👤 turkeywelder
Might not be very obscure, but it was new to me. It ended up being really useful for finding out whether dates intersect a particular period, and I ended up using it quite a bit at work.

Range Tree:

https://en.wikipedia.org/wiki/Range_tree

You give it a range of dates and it'll tell you if any intersect, so if you're looking for "how many people are absent in this time period" you can find out really quickly by using a range tree.


👤 Philip-J-Fry
I want to write a content based pub/sub system. Where a subscriber says "I want messages where field X = A, y = B and z > C".

I've looked around and I've seen R+ Trees are potentially a solution but there's very little info on how they're actually created in code.

Does anyone know of any nice data structures to do this? Obviously the most basic way is just indexing each field and that works for simple equality but gets harder when you're trying to do other comparators like < and >.



👤 berendka
Nested Sets

https://en.wikipedia.org/wiki/Nested_set_model

You can use it to store tree structures in relational databases for faster querying of things like "all nodes below node x" or "how many nodes to the top node", and so on.

Querying is easy, but it comes at the cost that changes like inserts and deletes are more expensive.

I used this for an auth service with roles, users and org units. Pretty query-heavy; it worked very well.


👤 noname120
HyperLogLog[1] — approximately count distinct elements from an extremely large set (e.g. 10⁹ elements) super efficiently.

→ Distinct count time complexity[2]: O(1)

See Redis's PFCOUNT[3].

[1] https://en.wikipedia.org/wiki/HyperLogLog

[2] For a fixed number of registers (e.g. Redis's implementation)

[3] https://redis.io/commands/pfcount/


👤 reality_inspctr
Resource Description Framework (RDF)

The Resource Description Framework (RDF) is the W3C standard for web data. It allows data to be linked, anywhere, and to be "grounded" in semantic descriptions of the data. The core RDF data model is very simple: it is a set of triples organized into an RDF graph.

https://www.w3.org/egov/wiki/Resource_Description_Format


👤 ben__bradley
Bitboards for chess board representation, the fastest and most complex such representation: https://www.chessprogramming.org/Bitboards Other structures for chess are easier to follow and understand: https://www.chessprogramming.org/Board_Representation

👤 6510
I did an append-only list of URLs, plus raw files where each bit says whether something is true/false about the URL at that offset.

For example, 26x26 raw files asking whether the page contains the letter combination "aa" or "ab", all the way up to "zy" and "zz".

When one types a search query, after 2 letters a file is pulled; at the 3rd letter we have a second 2-letter combination, so we AND the two together.

It is much like a bloom filter and tells you what is not in the set.



👤 nl
I came here to say Golomb compressed sets except now I see it's part of the question!

They are used by default in the OpenMined implementation of Private Set Intersection[1] - a multi-party computation technique.

[1] https://github.com/OpenMined/PSI/blob/master/private_set_int...


👤 Loranubi
I am trying to compile a list of data structures (and algorithms and other stuff) into a kind of knowledge base: https://github.com/Dobatymo/data-algos/blob/master/data-stru.... There are a few cool and obscure data structures in there already. Please feel free to help extend it!

👤 perlgeek
Skip lists.

They are useful for checking in O(log(n)) if a number (or string) is in any of n ranges of numbers.

You just keep a sorted list of the endpoints of the ranges, and when you do a lookup, you binary-search for whether the value is in the list, and if not, between which two indexes it falls. If it's between an even and an odd index, it's in.

Very useful if you remember that IP addresses are also integers, and with IPv6 networks become way too huge to enumerate.
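
The endpoint-parity lookup in runnable form (a minimal sketch; the ranges must be non-overlapping, here half-open):

    import bisect

    # Ranges [10, 20) and [50, 60), flattened into one sorted endpoint list.
    endpoints = [10, 20, 50, 60]

    def in_any_range(x):
        # An odd insertion point means x fell between a start and its end.
        return bisect.bisect_right(endpoints, x) % 2 == 1

    print(in_any_range(15), in_any_range(20), in_any_range(55))  # True False True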


👤 Fiahil
Append-Log. A grow-only linked list where you can only append elements to it.

Good use-case: lock-free concurrent reads are allowed, because there is no way committed content could change. Writes need to be linearizable (so locking is required). Given this property, the structure provides faster reads than a mutex-protected list, with similar write perfs.

Bonus section: provides broadcast capabilities if used as a channel.


👤 throwaway45631
Not obscure by itself, but gifted with obscurity due to its language‘s success: JavaScript‘s Map type.

Just because the type was introduced very late in the language's life, and another very successful method is named identically, it's almost impossible to google your way out of any situation.

You have to rely on core documentation and link lists. Just by using "new Map()" in JS, you're suddenly, ehem, mapped 30 years back in time!


👤 kqr
Not particularly innovative, but extremely useful: HDRHistograms let you very efficiently create, combine and query empirical probability distributions.

If I'm doing anything with random or random-looking variation, I prefer to store the full distribution in a HDRHistogram rather than just the last value, last few values, mean value, or whatever. Opens up so many possibilities for analysis later down the road.



👤 strawhatguy
I actually got to use a k-d tree once. I had been naïvely brute-forcing tiles together according to their 2D positions, the tiles being part of a larger image, and it was perfect.

Before, it would run out of memory on a 96GB machine; after, it could process the tiles as quickly as they came in from the camera, with only a couple held in memory at a time.

When you have the right data structure, man it’s like night and day performance!


👤 vnorilo
One of my favourites is the treap [1], which doubles as a great excuse to forget how to balance binary trees. It is similarly "probably efficient" to the HAMT mentioned in other comments; both require a well-behaved distribution in the hash function.

1: https://en.m.wikipedia.org/wiki/Treap


👤 foxyv
For my final project in college Data Structures I wrote a linked hexagonal data structure. To index it I used two different solutions. One was to use a base-6 system to encode a series of hops in the six different directions. Another was to have two coordinates which indicated how far to traverse in two directions to reach the target hex cell.

👤 mzaks
Not a data structure, but a nice concept for very fast search on an immutable pre-sorted array: Eytzinger order. I did some experiments; it is about 2-3x faster than binary search. https://medium.com/swlh/binary-search-vs-eytzinger-order-301...

👤 simonpure
The BIP buffer.

It's a ring buffer compatible with APIs that don't specifically support it.

It's a great fit for any network I/O and I believe Android uses it under the hood.

https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The...


👤 KETpXDDzR
Locality-sensitive hashing[0] is similar to bloom filters. It's a hashing function that hashes values into buckets. One interesting application is neural networks with sparse inputs.

[0] https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing


👤 openfuture
I left this thread alone since it is a rabbithole but I was surprised how monotonous it ended up being. The datastructure I am quite interested in these days is 'concept lattices' and object-property (bi)graphs. Although I am less interested in the dynamic side atm and am more thinking about representation of data while being exchanged.

👤 mungoman2
Not really a pure data structure, but I like hash tables where the entries expire after some time. This is very useful in interactive situations like in games or chat bots, where for example you want the AI/bot to remember a conversation or piece of information for some time.

Simplest to implement is to just wrap a hash table and insert time checks in the accessors.
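
The wrapper version, as a minimal sketch (eviction here is lazy, happening on access):

    import time

    class ExpiringDict:
        def __init__(self, ttl_seconds):
            self.ttl = ttl_seconds
            self.data = {}                  # key -> (deadline, value)

        def __setitem__(self, key, value):
            self.data[key] = (time.monotonic() + self.ttl, value)

        def get(self, key, default=None):
            entry = self.data.get(key)
            if entry is None:
                return default
            deadline, value = entry
            if time.monotonic() >= deadline:
                del self.data[key]          # forgotten, just like the bot should
                return default
            return value

    memory = ExpiringDict(ttl_seconds=300)
    memory["last_topic"] = "data structures"   # gone five minutes from now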


👤 aliceryhl
Link cut trees. They can be used to query information about paths in a forest, while also allowing you to add and remove edges in the forest.

For example, you could use them to store the minimum spanning tree of a graph, while supporting updates if edges are added to the graph.

They can also be used to speed up max flow algorithms by finding the augmenting path in logarithmic time.


👤 demindiro
This isn't a data structure technically but I find it clever: ZFS' space maps.

It keeps track of which space is allocated by logging every allocation and deallocation. When initializing an allocator, which can be implemented in any way, this log is replayed.

If the map becomes too big it is rewritten so it is equivalent to the old map but with no redundant entries.


👤 swizzler
I like consistent hashing (https://en.m.wikipedia.org/wiki/Consistent_hashing). When a hash table (or load-balancing pool) needs to be resized, only a small fraction of the keys (clients) need to be remapped, instead of nearly all of them.

👤 Arainach
Deterministic acyclic finite state automata. Essentially a compressed trie. Significantly more memory efficient for real world dictionaries.

https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...


👤 elromulous
Somewhat straddling pure cs and cryptography/coding theory: one of my absolute favorites: Merkle tree. It's used for easily confirming the integrity of data. E.g. a filesystem.

https://en.m.wikipedia.org/wiki/Merkle_tree


👤 eru
Soft heaps are pretty neat. They are like normal min-heaps, but support all operations you care about in O(1) instead of O(log n) time. The catch is that they are allowed to make a configurable proportion of errors.

They are basically only useful in theory, not in practice. They allow construction of some asymptotically optimal algorithms.


👤 anonymoushn
Robin Hood tables are a common-ish hash map implementation that often shows up on benchmarks for fast inserts. Depending on the implementation, a Robin Hood table is also a sorted sparse array! This could be useful for data that needs random access and sequential access to a subset of the sorted data, like an order book.

👤 karsinkk
Lazy Adaptive Trees : https://dl.acm.org/doi/10.14778/1687627.1687669

It is a B-Tree where updates to the tree are buffered in the branch nodes and are propagated down to the leaf nodes at a later point in time.


👤 Chris_Newton
Piece tables: a persistent data structure used to represent the current document with relatively efficient editing operations in various text editors, word processors, etc.

https://en.wikipedia.org/wiki/Piece_table


👤 whateveracct
A lazy, immutable rose tree

This article shows some really cool stuff you can do with it https://www.well-typed.com/blog/2019/05/integrated-shrinking...


👤 ImJasonH
Nested Set Model: a way to model nested sets of entities and query them in a SQL database. I haven't used it for anything but it seems fun.

https://en.m.wikipedia.org/wiki/Nested_set_model


👤 fn1
The disruptor ring-buffer for providing lock-free parallelization for multiple readers/writers: https://martinfowler.com/articles/lmax.html

👤 1024core
> Say you have a list of 1 million IPs that are black listed. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements.

In the case of IPs, you just have to make 4 comparisons (for IPv4), if you store them in a tree.


👤 darksaints
The Rete algorithm / graph network, and its variants.

You can evaluate infinite rules, with automatic incremental updates (no re-evaluation if not necessary), with O(1) time complexity. The tradeoff being that you have to hold a very large graph which represents your rule set in memory.


👤 kqr
Suffix trees are known to anyone who's done text search, but very neat! Fairly easy idea too: in order to allow quick substring search, build a mapping from all suffixes to documents. This is compressed as a tree, and prefix matching is done by partial traversal.

👤 FridgeSeal
Succinct data structures are one of my favourites.

The idea that we can very cleverly pack a collection of information into a much smaller form, yet query the compressed form with good computational complexity and without unpacking the data we need to read, is amazing.


👤 ur-whale
The Alias method: O(n) in space and O(1) biased number generator

https://en.wikipedia.org/wiki/Alias_method

Mind was utterly blown when I first understood the way this works in my late teens.
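
A compact sketch of the table construction (Vose's variant) and the O(1) draw:

    import random

    def build_alias(weights):
        n, total = len(weights), sum(weights)
        prob = [w * n / total for w in weights]   # rescaled so the mean is 1
        alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1]
        large = [i for i, p in enumerate(prob) if p >= 1]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                          # top up column s from l
            prob[l] -= 1 - prob[s]
            (small if prob[l] < 1 else large).append(l)
        return prob, alias

    def draw(prob, alias):
        i = random.randrange(len(prob))           # pick a column uniformly...
        return i if random.random() < prob[i] else alias[i]  # ...then a half

    prob, alias = build_alias([0.5, 0.3, 0.2])
    counts = [0, 0, 0]
    for _ in range(10_000):
        counts[draw(prob, alias)] += 1
    print(counts)  # roughly [5000, 3000, 2000]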


👤 zbentley
HyperLogLogs! They're extremely useful for cardinality count estimation, and support set operations.

https://en.m.wikipedia.org/wiki/HyperLogLog


👤 jupiter909
o Fractal tree indexes are rather nifty, as used in the TokuDB engine. The basic gist is that you get the speed of a B-tree but your updates are less painful.

https://en.wikipedia.org/wiki/Fractal_tree_index http://www-db.deis.unibo.it/courses/TBD/Lezioni/mysqluc-2010...

o k-d trees for space partitioning; handy real-world applications in mapping and gaming.

o Bloom filters, as you mention, are great; always a handy one to know.


👤 jillesvangurp
String tries are pretty fun if you need to implement auto completion. I spent some time implementing my own on one project and open sourced it. And then brought it in on a few more projects.

I've used bloom filters with in memory caches a few times.


👤 justahuman74
I'll just sit here hoping that all of the above datatypes get a nice rust crate :)


👤 hamilyon2
My attention seems to be attracted to perfect hash functions. They are typically used for things like keyword detection in parsers, but they are an interesting concept that should be possible to extend much further.


👤 jordanmorgan10
So it's not snazzy, but I wrote Objective-C for a long time without knowing about NSPointerArray.

Among other things, it can store nil values, which if you’ve written any Cocoa apps, is an all too common way to crash your software.


👤 philosopher1234
A linked list where each entry is a fixed-size array (an unrolled linked list). O(1) inserts into the array; if an array fills, you split it into two nodes. Amortizes the cost of traversing list pointers.

👤 johncalvinyoung
Not crazy obscure, but half-edge mesh data structures are super cool to me.

👤 jonathan-kosgei
Time complexity doesn't seem like a great reason to use a bloom filter. Why not just check if the IP is in a set? Set membership is supposed to be O(1).

The reason to use a bloom filter would be to save space.



👤 rjh29
For routing you can also use a sparse radix tree, also known as a Patricia tree. This allows fast matching against a set of prefixes (which is what IP routing needs) without taking up a lot of space.

👤 s1k3
There's this one called a binary search tree. I've only heard about it a handful of times, during engineering interviews, but it is otherwise reserved for the most esoteric of problem spaces.

👤 foobiekr
Skew heaps. They aren’t useful but their utter simplicity is really fun.

👤 unsupp0rted
If somebody made a Youtube channel going over the data structures mentioned in this thread with clear examples of pros/cons, I'd watch the heck out of it and support it on Patreon.

👤 ww520
Extendible hashing allows a space-efficient, growable hashtable on disk, with a maximum of two reads to reach the data page, and an extremely simple and efficient way to grow the hashtable.

👤 nailer
> Lets you test if a value is definitely NOT in a list of pre-stored values

wouldn’t a hashmap / dict / object do this in 0(1)? Input goes in, is hashed, the key exists or doesn’t exist?


👤 raydiatian
I’ve bookmarked this thread. I have a new reading list! Thanks OP!


👤 JohnDeHope
I always thought the relational calculus was awesome. There’s only one place I’ve seen it in a real world use case, outside of college, and that’s in a database called Suneido.

👤 glouwbug
Static arrays - when was the last time you used a static array?

👤 mtlmtlmtlmtl
N-dimensional arrays are pretty cool, and fun to implement. I did one in C for an Advent of Code problem that switched from 3 to 4 dimensions between the two parts.


👤 DeathArrow
>Good use-case: routing. Say you have a list of 1 million IPs that are black listed. A trivial algorithm would be to compare every element of the set with a given IP. The time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements due to the 'keys' not needing to be stored ('search' and 'insert' is based on the number of hash functions.)

The easy and efficient way to test if a value is in a list is to use a hash set or dictionary. Complexity is O(1) on average.


👤 slaymaker1907
I really like the splay tree. It optimizes itself during reads and writes such that it is optimal for a fixed probability distribution of accesses.

👤 fatih-erikli
- Prefix tree (also known as a trie)

- Splay tree (a self-balancing tree optimized for fast queries of frequently accessed nodes)

👤 icsa
[Ordered] Minimal Perfect Hash Functions with O(N) construction time.

They have been around for 30 years and I don't see them used in practice.


👤 nullc
How about a pinsketch:

A pinsketch is a set data structure that takes a specified amount of memory, into which you can insert and remove set members (or even add whole sets) in time O(memory size). You can insert an unbounded number of entries, and at any time that it holds no more entries than its size, you can decode the list of members.

For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).

Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000 which takes 4000 bytes (1000 * 4bytes because IP addresses are 32-bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses, you subtract yours from the one I sent and then you decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).

Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.

The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P

Here is an implementation I contributed to: https://github.com/sipa/minisketch/

There is a related data structure called an invertible Bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially its decoding are much faster, and it has asymptotically the same communication efficiency. However, the constant factors on the communication efficiency are poor, so for a "small" set difference (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.

IBLT also has the benefit that the decoder is a fun bit of code golf to implement.

The encoding for IBLT is simple: take your entries and append a check value (it can't be a plain CRC). Then hash each entry with three hash functions to obtain three locations in a hash table, and xor the entry into those locations (removal works the same way, due to xor's self-inverse property). Decoding works by finding entries whose check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and xor them out of all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space, the decoding algorithm will succeed (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (This description is probably enough for a reader to make a working IBLT implementation, though it omits some improvements.)
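
Since that description is nearly complete, here is a hedged toy transcription in Python (sizes and the hashing scheme are illustrative; it assumes nonzero integer keys, and uses three subtables so an entry's cells never collide with each other):

    import hashlib

    SUB = 64  # three subtables of this size; slack needed over the difference

    def _h(x, salt):
        d = hashlib.blake2b(f"{salt}:{x}".encode()).digest()
        return int.from_bytes(d[:8], "big")

    def cells(x):
        return [s * SUB + _h(x, s) % SUB for s in (0, 1, 2)]

    class IBLT:
        def __init__(self):
            self.key = [0] * (3 * SUB)
            self.chk = [0] * (3 * SUB)

        def toggle(self, x):                # insert and delete are the same xor
            c = _h(x, "check")
            for i in cells(x):
                self.key[i] ^= x
                self.chk[i] ^= c

        def decode(self):
            """Peel pure cells until empty; may fail if overloaded."""
            found, progress = set(), True
            while progress:
                progress = False
                for i in range(3 * SUB):
                    x = self.key[i]
                    if x and self.chk[i] == _h(x, "check") and i in cells(x):
                        found.add(x)
                        self.toggle(x)      # removing it may purify other cells
                        progress = True
            return found

    t = IBLT()
    for x in (3, 17, 99):
        t.toggle(x)
    print(t.decode())  # {3, 17, 99}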


👤 smokel
Bitboards, although perhaps not so obscure, allow for some very nice optimization tricks.

I've used them to solve Sudoku puzzles really fast.


👤 peterhunt
Sketching data structures: count min sketch, hyperloglog, sliding window hyperloglog, t digest

Skiplists

Locality sensitive hashing

Log structured storage (bitcask, lsm)


👤 talideon
Skip lists: 90% of the benefits of more complex tree data structures, with a fraction of the complexity.

👤 jmartrican
Not super obscure, but also not well known... LinkedHashMap. You get the O(1) lookups and ordered iteration.

👤 mcqueenjordan
A Patricia Merkle tree. Sometimes affectionately referred to as a Perkle tree in some large companies.

👤 lizardactivist
The Trie is pretty cool. I think it's a bit obscure because the memory use can grow quite fast.

👤 crazybob123
FST: Finite State Transducer. Widely used inside search (Lucene), DB powering prefix, fuzzy search

👤 nunez
patricia tries are cool. used them once while working on an infrastructure management tool; makes searching word dictionaries very efficient.

also love the Option struct with pattern matching that all of the hippie-dippie functional languages seem to like lol


👤 untererdisch
Three of my favorite probabilistic data structures are: Skip Lists, Bloom Filters, and SimHashes.

👤 victor106
Algorithms and Data Structures for Massive Datasets By Manning.

Just started reading this. It looks interesting


👤 O__________O
Wikipedia’s list of algorithms and data structures is pretty extensive:

https://en.m.wikipedia.org/wiki/List_of_algorithms

https://en.m.wikipedia.org/wiki/List_of_data_structures

__________________

As for a specific algorithm, this paper covers analysis of HyperLogLog (HLL):

http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf

TLDR: Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.

Source:

https://en.wikipedia.org/wiki/HyperLogLog

Recent HN coverage of HyperLogLog:

https://news.ycombinator.com/item?id=32138790


👤 vuonghv
I like the ring buffer; it plays nicely with the CPU cache line and can be used to store an event queue.

👤 motoboi
If you like Bloom filters, you'll love the cuckoo filter, which allows key deletion.

👤 mkindahl
Tree automata and the BURS (Bottom Up Rewrite System) algorithm are pretty cool.

👤 m463
patricia trees

They are particularly interesting for storing long strings.

You can look up all strings with a common prefix quite easily.

The lookup function is really interesting: just check the first differing bit.


👤 tealpod
The Bloom filter is a very clever and simple data structure, thank you.

👤 brightball
Actually came here to say bloom filter but you beat me to it.

👤 beached_whale
I find the R-tree beautiful, even if it's not used as much these days.

👤 boredbot
Arrays are pretty underground; they're like lists but better. C++ offers arrays but unfortunately Python doesn't :(

👤 vbezhenar
Ropes. Something between array and tree.

👤 markdown
> I'll start

You're supposed to post yours as a comment, so it can be voted on like every other type of structure suggested here.


👤 kelseyfrog
I am enamored by data structures in the sketch/summary/probabilistic family: t-digest[1], q-digest[2], count-min sketch[3], matrix-sketch[4], graph-sketch[5][6], Misra-Gries sketch[7], top-k/spacesaving sketch[8], &c.

What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems still work even if the accuracy is not 100%. Furthermore, many (if not all) of these can tune accuracy by parameter adjustment anyway. They tend to have favorable combinatorial properties, i.e. they form monoids or semigroups under merge operations. In short, a property of data structures that gives me the ability to solve problems I couldn't before.

I hope they are as useful or intriguing to you as they are to me.

1. https://github.com/tdunning/t-digest

2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html

3. https://florian.github.io/count-min-sketch/

4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...

5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf

6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...

7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf

8. https://www.sciencedirect.com/science/article/abs/pii/S00200...

9. It may better be described as error-speed and error-space, but I've avoided the term error because the term for programming audiences typically evokes the idea of logic errors and what I mean is statistical error.


👤 hackarama
Quadruply linked lists

👤 cratermoon
Skip lists.

👤 samsquire
I'm trying to create a first class citizen of loops and trying to improve software responsiveness.

There's two ideas: concurrent loops and userspace scheduling and preemption.

Nested loops can be independently scheduled. I wrote an M:N userspace scheduler[2] and it does something simple: you can interrupt a thread that is in a hot loop by changing its limit variable to the end. You don't need to slow down a hot loop by placing an if statement in it or counting how many statements have executed. (Many runtimes implement scheduling that way, and you don't need to.)

Golang schedules away from a goroutine by checking during stack expansion if the thread has used its timeslice.

I think this simple observation is really powerful. The critical insight is that frontend performance can be drastically improved by reducing latency from the user's perspective. Did you ever notice that the cancel button rarely cancels immediately? Or resizing an image is slow and gives no feedback or encrypting gives no feedback?

By having a watching thread interrogate the output buffer, we can create GUIs that visually do things and provide observability to what the computer is doing. Imagine watching a resize operation occur in real time. Or encryption. Or network communication.

One of my ideas for concurrent loops is "cancellation trees"[1]. If you model every behaviour of a piece of software as a tree of concurrent loops that are reentrant (they never block, and each call is a loop iteration at a different index), you can create really responsive, low-latency software. If you cancel a branch of the tree, you cancel all the loops related to that branch. It's a bit like a tree of processes from PID 0, or loops as lightweight green threads/goroutines.

So as you're moving around in your text editor or browser, if you invalidate the current action - such as press ESCAPE while Intellisense is trying to find related code, you can interrupt all the loops that fanned out from that GUI operation.

Telling the computer to stop doing what it is doing and observability are two key weaknesses of modern computing. GUIs do not always accurately communicate what the computer is doing and you usually need to wait for the computer to finish before it shows you what it is doing. I am sure the cancellation token could be extended to follow this idea.

If you liked this comment, check out my profile to links to my ideas documents.

1: https://github.com/samsquire/ideas4/blob/main/README.md#120-...

2: https://github.com/samsquire/preemptible-thread


👤 sidcool
Tim sort.

👤 llimos
The Israeli queue.

Like a regular queue, but if something new that comes in sees its friend in the queue already, it can jump the queue and go and stand next to her.

Useful for when something has a big overhead on top of its own processing, but the overhead can be shared between several similar entries. If you're doing it anyway for the one already in the queue, you get to make the most of it for its friends too.
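
A tiny sketch (the "friend" relation here is just a shared group key):

    from collections import OrderedDict, deque

    class IsraeliQueue:
        def __init__(self):
            self.groups = OrderedDict()              # group key -> its members

        def enqueue(self, group, item):
            if group in self.groups:
                self.groups[group].append(item)      # jump ahead to a friend
            else:
                self.groups[group] = deque([item])   # politely join the back

        def dequeue(self):
            group, members = next(iter(self.groups.items()))
            item = members.popleft()
            if not members:
                del self.groups[group]
            return item

    q = IsraeliQueue()
    q.enqueue("render", 1); q.enqueue("io", 2); q.enqueue("render", 3)
    print([q.dequeue() for _ in range(3)])  # [1, 3, 2]: 3 skipped ahead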

I chose it because I live here and it's totally true :)


👤 b20000
The BroTree. It's the tree you don't know about but will inevitably be queried about during a Google interview, by a Googler on a mission to prove supreme smartness.

👤 cl0rkster
How about the current DATE formats based on the EPOCH. Hell, they knew there was nothing to it, but some backstreet performances inspired my Y2K formative years where my normal, kind, religious mother became a hoarder that destroyed everything she touched. Why can Java still not do simple date processing when even M$$ nailed that formula generations ago? Not letting devs communicate cross platform without Linux level religious arguments is ridiculous. Git this, Linus, is the EPOCH the final destination of the post-Christian measure of time (i.e. the common era)