I'll start: bloom filters. They let you test whether a value is definitely NOT in a set of pre-stored values (or POSSIBLY in the set, with a tunable false-positive probability that trades off against storage size).
Good use-case: routing. Say you have a list of 1 million IPs that are blacklisted. A trivial algorithm would be to compare every element of the set with a given IP; the time complexity grows with the number of elements. Not so with a bloom filter! A bloom filter is one of the few data structures whose time complexity does not grow with the number of elements, because the 'keys' don't need to be stored ('search' and 'insert' depend only on the number of hash functions).
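A minimal sketch of the structure itself (the sizes and the way the k bit positions are derived from one SHA-256 digest are arbitrary choices; a real filter sizes itself from the target false-positive rate):

```
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=6):
        self.num_bits = num_bits
        self.num_hashes = num_hashes          # supports up to 8 hashes from one digest
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from one strong hash of the item.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means "definitely not present"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))
```

Usage would be along the lines of `blacklist.add("203.0.113.7")` at build time and `blacklist.might_contain(ip)` on the hot path, falling back to the exact list only on a True.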
Bonus section: Golomb Coded Sets are similar to bloom filters but the storage space is much smaller. Worse performance though.
It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.
When caching the result of an expensive computation or a network call, don't actually cache the result, but cache the promise that awaits the result. I.e. don't make a
Map<Key, Value>
but a Map<Key, Promise<Value>>
This way, if a new, uncached key gets requested twice in rapid succession, i.e. faster than the computation takes, you avoid computing/fetching the same value twice. This trick works because:
- promises hold on to their result value indefinitely (until they're GC'ed)
- you can await (or .then()) an existing promise as many times as you want
- awaiting an already-resolved promise is a very low-overhead operation.
In other words, the promise acts as a mutex around the computation, and the resulting code is understandable even by people unfamiliar with mutexes, locks and so on.
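A rough Python asyncio analogue of the same trick (function names made up): cache the Task rather than the value, so concurrent requests for an uncached key share one in-flight computation.

```
import asyncio

_cache: dict[str, asyncio.Task] = {}

async def fetch_expensive(key: str) -> str:
    await asyncio.sleep(1)                     # stand-in for a network call
    return f"value-for-{key}"

def get(key: str) -> asyncio.Task:
    # Cache the Task (promise), not the value: awaiting a finished Task
    # any number of times just hands back its stored result.
    if key not in _cache:
        _cache[key] = asyncio.create_task(fetch_expensive(key))
    return _cache[key]

async def main():
    t1, t2 = get("x"), get("x")                # second call reuses the in-flight Task
    print(await t1, await t2)                  # one fetch, two awaiters

asyncio.run(main())
```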
A vaguely related notion is that naive analysis of big-O complexity in typical CS texts glosses over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.
A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.
In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.
Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.
This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.
The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)
The implementation starts out pretty obvious - a tree where every element either points at itself or something it was merged with. To check if two elements are in the same set, check if they have the same root. Without analyzing it, it sounds like you'll average findRoot() performance of O(log(n)), worst-case O(n).
There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.
And neither of those actually does anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(α(n)), the inverse Ackermann function. Or, for any reasonable n, O(4 or less).
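Both optimizations together fit in a few lines; a minimal sketch:

```
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))   # each element starts as its own root
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:                     # path compression:
            self.parent[x], x = root, self.parent[x]      # re-parent nodes we visited
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:                 # union by size:
            ra, rb = rb, ra                               # larger set becomes the parent
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

    def same_set(self, a, b):
        return self.find(a) == self.find(b)
```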
Related to bloom filters, xor filters are faster and more memory efficient, but immutable.
HyperLogLog is an efficient way to estimate cardinality.
Coolest thing I've learned recently was Y-fast trie. If your dataset is drawn from a bounded integer universe M (say, the set of all 128 bit numbers), you get membership, predecessor, or successor queries in O(log log M) time rather than the O(log n) of a normal tree.
see: https://www.youtube.com/playlist?list=PLUl4u3cNGP61hsJNdULdu... (6.851 Advanced Data Structures, Erik Demaine)
Would love to learn more "stupidly fast at the cost of conceptual complexity" things.
edit:
(adding) I don't know a name for it, because it's not one thing but a combination, but one can:
Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.
Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.
Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.
Very clever. (I first saw in the Aerospike key value store)
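A toy sketch of the combination, with plain dicts standing in for the M per-shard trees and one lock per shard:

```
import threading
import zlib

NUM_SHARD_BITS = 4                              # N: tune shard count vs. per-shard size
NUM_SHARDS = 1 << NUM_SHARD_BITS                # M = 2**N sub-structures

class ShardedMap:
    """Spread keys over M smaller structures (plain dicts here, standing in
    for trees) using the top bits of a cheap hash; lock only one shard."""

    def __init__(self):
        self.shards = [dict() for _ in range(NUM_SHARDS)]
        self.locks = [threading.Lock() for _ in range(NUM_SHARDS)]

    def _shard(self, key: str) -> int:
        h = zlib.crc32(key.encode())            # very quick hash
        return h >> (32 - NUM_SHARD_BITS)       # first N bits pick the shard

    def put(self, key, value):
        i = self._shard(key)
        with self.locks[i]:                     # lock a tiny sub-structure,
            self.shards[i][key] = value         # not the whole thing

    def get(self, key, default=None):
        i = self._shard(key)
        with self.locks[i]:
            return self.shards[i].get(key, default)
```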
This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.
These are binary search trees, but when you 'splay' the tree (such as by searching) it rebalances the tree so that the search result ends up at the root. This means that while operations are amortized O(log n) like other self-balancing trees, it optimizes tree depth in the face of non-random access, so that recently accessed items are fewer steps into the tree.
Piece tables:
A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).
This has interesting variations:
- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.
- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.
- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.
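A bare-bones sketch of the basic structure (in-memory strings standing in for the mmap'd original; insert only, no attributes):

```
class PieceTable:
    def __init__(self, original: str):
        self.original = original                   # read-only (imagine it mmap'd)
        self.add = ""                              # append-only edit buffer
        self.pieces = [("orig", 0, len(original))] # (buffer, start, length)

    def _buf(self, name):
        return self.original if name == "orig" else self.add

    def insert(self, pos: int, text: str):
        new_piece = ("add", len(self.add), len(text))
        self.add += text
        offset = 0
        for i, (buf, start, length) in enumerate(self.pieces):
            if offset <= pos <= offset + length:
                split = pos - offset
                before = (buf, start, split)
                after = (buf, start + split, length - split)
                self.pieces[i:i + 1] = [p for p in (before, new_piece, after) if p[2] > 0]
                return
            offset += length
        self.pieces.append(new_piece)              # pos past the end: just append

    def text(self) -> str:
        return "".join(self._buf(b)[s:s + n] for b, s, n in self.pieces)

# pt = PieceTable("hello world"); pt.insert(5, ","); pt.text() -> "hello, world"
```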
Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.
Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
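A minimal sketch (the cell size is something you tune to your typical query radius):

```
from collections import defaultdict

CELL_SIZE = 10.0                       # tune to the typical object/query size

def cell(x, y):
    # Round coordinates down to a grid cell; that cell is the hash key.
    return (int(x // CELL_SIZE), int(y // CELL_SIZE))

class SpatialHash:
    def __init__(self):
        self.grid = defaultdict(list)

    def insert(self, obj, x, y):
        self.grid[cell(x, y)].append(obj)

    def nearby(self, x, y):
        # Everything in the 3x3 block of cells around (x, y); callers still
        # do an exact distance check on these candidates.
        cx, cy = cell(x, y)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                yield from self.grid.get((cx + dx, cy + dy), ())
```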
They're awesome data structures that efficiently maintain a congruence relation over many expressions
> At a high level, e-graphs extend union-find to compactly represent equivalence classes of expressions while maintaining a key invariant: the equivalence relation is closed under congruence.
e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and then said "x == y" (merged "x" and "y" in the e-graph), then the e-graph, by congruence, would be able to tell me that "f(x) == f(y)"
e.g. If I were to represent "a*(2/2)", in the e-graph, then say "2/2 == 1", and "x*1 == x", by congruence the e-graph would know "a*(2/2) == a" !
The most recent description of e-graphs with an added insight on implementation is https://arxiv.org/pdf/2004.03082.pdf to the best of my knowledge.
P.S: I'm currently implementing them in Haskell https://github.com/alt-romes/hegg
[1] A Conflict-Free Replicated JSON Datatype (https://arxiv.org/abs/1608.03960) by Martin Kleppmann and Alastair R. Beresford.
When I go from Haskell back to imperative land I find myself greatly missing this ability. Sure I can make multiple hashmaps or trees or whatever but being able to stuff it all in one data structure is amazing.
One structure I built with them that is much more difficult with typical structures is a "block list" structure.
In this structure each block has a particular width and they're all stacked side by side.
I want to perform a query, "which box is at position X". So if my boxes are of widths 7,20,10, then the lookup for 2 yields the first box, the lookup for 12 yields the second, etc.
More interestingly, if I add a new box between the second and third, the indices covered by the last box shift up.
With finger trees you use a sum monoid. This is easy. In other languages you have to roll your own structure, either using a list (with O(n) lookup) or a tree with O(log n) lookup but O(n) inserts (to translate the indices of future blocks).
"This is the story of a clever trick that's been around for at least 35 years, in which array values can be left uninitialized and then read during normal operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time."
https://research.swtch.com/sparse
* IP addresses actually aren't a great use case for basic bloom filters, as they're fairly storage efficient to begin with, as opposed to a url for example. Taking your example, say we need to store 1 million IP addresses in our bloom filter and we're okay with a ~1% false positive rate. Well then, if we use a bloom filter with 2^23 bits (1 MB), the optimal number of hash functions is (2^23)/(10^6)*ln(2) = 6, yielding a false positive rate of (1 - exp(-6* 10^6 /2^23))^6 = ~1.8%. So we're using 6% of the storage space, but with a nearly 2% false positive rate.
It's like a Red-Black tree in use case, but much faster to implement, which is good for competitive programming. The average case complexity is the same for all operations, but there's an unlikely degeneration to worst-case linked-list behaviour.
Lazy Propagation Segment Tree: https://cp-algorithms.com/data_structures/segment_tree.html
Like a segment tree in that it supports range queries in logarithmic time, but supports range updates in log time as well.
I know a few more fun ones that I occasionally use in contests, but I've never touched otherwise.
Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're focusing on the 3. In code this will look like:
([2, 1, 0], 3, [4, 5])
Where [a, b, c] denotes a singly-linked list, with O(1) head (returning a) and tail (returning [b, c]). Notice that the first list is in reverse order. To focus on the next element, we put the currently focused element on the first list, and pull the head off the second list:
([3, 2, 1, 0], 4, [5])
And vice versa to focus on the previous element: ([1, 0], 2, [3, 4, 5])
I like to imagine this as a string of beads, with the focused element held in our fingers and the rest hanging down:

      3
     / \
    2   4
    |   |
    1   5
    |
    0
To move the focus forwards and backwards, we move our grip to the next/previous bead. This works nicely as an immutable datastructure, since the tails of both lists can be shared (i.e. we don't need to copy the whole list, just wrap/unwrap an element on the head).
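The same list zipper in Python, with ordinary lists used as stacks standing in for the persistent singly-linked lists:

```
class ListZipper:
    """The ([2, 1, 0], 3, [4, 5]) picture from above; the ends of the two
    Python lists play the role of the linked lists' heads."""

    def __init__(self, items, focus_index=0):
        self.before = list(items[:focus_index])                # top of stack = item just before focus
        self.focus = items[focus_index]
        self.after = list(reversed(items[focus_index + 1:]))   # top of stack = item just after focus

    def next(self):
        self.before.append(self.focus)   # O(1): push the focus, pull the next element
        self.focus = self.after.pop()

    def prev(self):
        self.after.append(self.focus)
        self.focus = self.before.pop()

# z = ListZipper([0, 1, 2, 3, 4, 5], 3); z.next(); z.focus -> 4
```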
Zippers were first applied to lists, but have since been generalised:
- To trees: focusing on one node, and being able to move the focus up, to the left-child, or to the right child
- To having more than one focus
- To datastructures more generally, by treating them as 'derivatives' (as in calculus)
https://en.wikipedia.org/wiki/Zipper_(data_structure)
As an example, the XMonad window manager uses a zipper to keep track of the currently-focused window.
I've also used Zippers for window management, e.g. having a list of Displays, with one of them focused (accepting keyboard input); each Display containing a list of Workspaces, with one of them focused (currently shown); and each Workspace containing a list of Windows, with one of them focused (the active window). Keyboard shortcuts could shift between the previous/next Window, Workspace, or Display.
The "golden section search" to find a the minimum (or maximum) of a unimodal function. An actual real-world use case for the golden ratio.
Exponentially Weighted Moving Average filters. Or how to have a moving average without saving any data points.
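The whole "filter" is one stored number; a sketch:

```
class EWMA:
    """Exponentially weighted moving average: one stored float, no history."""
    def __init__(self, alpha=0.1):
        self.alpha = alpha        # higher alpha = reacts faster, smooths less
        self.value = None

    def update(self, sample):
        if self.value is None:
            self.value = sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value
```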
Some of my classic favorites:
Skiplists: they are sorted trees, but the algorithms are low complexity which is nice.
Boyer-Moore string search is awesome..
Bit twiddling algorithms from Hacker's Delight: for example (x & -x) isolates the least significant set bit: basically use the ALU's carry chain to determine priority.
Another is compress from Hacker's Delight. It very quickly compresses selected bits (from a bitmask), all to the right of the word. It's useful to make certain hash functions.
The humble circular queue. If your circular queue is a power of 2 in size, then the number of bits needed for indexes is at least log2(size) + 1. The + 1 is key- it allows you to distinguish between completely full and completely empty with no effort. So:
Empty: (write_index == read_index)
Full: (write_index == read_index + size)
Count: (write_index - read_index)
Insert: queue[write_index++ & (size - 1)] = data
Remove: data = queue[read_index++ & (size - 1)];
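The same queue as a small class (a sketch; Python integers never wrap, while in C the same arithmetic works with free-running unsigned indices):

```
class RingBuffer:
    """Power-of-two circular queue with free-running indices; the extra
    index bit distinguishes completely full from completely empty."""
    def __init__(self, size):
        assert size and (size & (size - 1)) == 0, "size must be a power of 2"
        self.size = size
        self.buf = [None] * size
        self.read = 0
        self.write = 0            # indices only ever increase

    def count(self):
        return self.write - self.read

    def is_empty(self):
        return self.write == self.read

    def is_full(self):
        return self.write == self.read + self.size

    def push(self, item):
        assert not self.is_full()
        self.buf[self.write & (self.size - 1)] = item
        self.write += 1

    def pop(self):
        assert not self.is_empty()
        item = self.buf[self.read & (self.size - 1)]
        self.read += 1
        return item
```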
All lossless data compression algorithms are amazing.
https://arxiv.org/abs/1603.09320
The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.
Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.
They’re like tries, but with identical subtrees glued together. They’re capable of encoding real-world dictionaries of millions of words into ~1 byte per word, when stored cleverly (see [0]). They let you do things like efficiently finding a list of words in a dictionary whose Levenshtein’s distance to a given word is less than x. Their cousins, GADDAGs, are _the_ data structure of choice in Scrabble-playing engines.
We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g indexes in Mongo) or file formats (e.g debug symbols in macOS/iOS Mach-O format are compressed using tries).
We have an article with some animations that illustrate the concept in case anyone's interested [0].
[0] https://github.com/pyroscope-io/pyroscope/blob/main/docs/sto...
Think of it as a JWT that you can narrow down the authorizations for without needing to communicate with a server, so if you have read permissions to all your photos you can add a caveat saying `photoid=123456` and share it, and the recipient can only read the photo 123456. The caveats can be anything, including requiring third party authentication via third-party caveats. I've implemented systems with it being validated by a lua module in nginx which worked well, but the whole concept never took off.
I still think it seems like one of the most interesting authZ strategies around.
Even though the domain application is functional programming, these datastructures can come in handy when you want to enable state sharing / keeping old versions around without having to copy data.
To do reservoir sampling, you first need to decide how many items you want in your sample; this number k is called the size of the reservoir. You fill the reservoir with the first k items of the population, then replace items at random, with decreasing probability, as the rest of the population streams past.
The code is beautifully simple:
import random

def reservoir_sample(population, k):
    reservoir = list(population[:k])        # start with the first k items
    for i in range(k, len(population)):
        j = random.randint(0, i)            # random integer between 0 and i, inclusive
        if j < k:
            reservoir[j] = population[i]
    return reservoir
Set Sketches. They allow you to compute the difference between two sets (for example to see if data has been replicated between two nodes) WITHOUT transmitting all the keys in one set to another.
Say you have two sets of the numbers [1, ..., 1million] all 32 bit integers, and you know one set is missing 2 random numbers. Set sketches allow you to send a "set checksum" that is only 64 BITS which allows the other party to compute those missing numbers. A naive algorithm would require 4MB of data be transferred to calculate the same thing.
*(in particular pin sketch https://github.com/sipa/minisketch).
Context is a data scientist had written a service that essentially was just lookups on a trie. He'd left and the service was starting to have memory problems so I dug in and swapped the implementation. Iirc swapping the trie implementation changed memory usage from 8gb to 100mb and sped everything up as well.
The actual data structure is equivalent to a trie, but cannot be modified once it's been built (I think it may be the same as a LOUDS trie, I don't remember the specifics)
The benefit is that the array items' memory footprint is smaller and has no padding; it also increases cache locality.
One zipper value represents a regular tree of data, but from the perspective of the "current position". There are traversal operations that take a zipper value and return the same tree from the perspective of an element above/beside/below the current position, and there are operations that return the same structure but with the current element changed, or items deleted or inserted.
Huet's paper is an easy read if you've been exposed to OCaml or similar languages: https://www.st.cs.uni-saarland.de/edu/seminare/2005/advanced... . His "returned glove" metaphor is what made it click for me.
Clojure includes an implementation of Huet's zippers https://github.com/clojure/clojure/blob/master/src/clj/cloju... that is <300 lines of code. It's very, very clever and broadly useful for dealing with XML or other nested data structures.
https://news.ycombinator.com/item?id=1370847 (2010)
https://news.ycombinator.com/item?id=2363522 (2011)
https://news.ycombinator.com/item?id=3369454 (2011) -- Although not much is in the discussion, it points to a Stack Overflow post with some examples.
https://news.ycombinator.com/item?id=7079427 (2014)
https://news.ycombinator.com/item?id=28758106 (2021) -- Not exactly a data structure, but an interesting algorithm.
This is a tree-like structure where the structure of the tree is fully determined by the number of leaves N. Specifically, each 1-bit in the binary representation of N corresponds to one of the tree's perfect binary subtrees. For that reason, I think of it more as "structure imposed upon a list" rather than a typical mutable tree with pointers and such.
What are the applications of this data structure? I am only aware of one: computing the Merkle root of a large amount of data in streaming fashion. The funny thing is that it's been discovered independently multiple times: in BLAKE3, Sia, Utreexo, and more. Maybe someday, someone will find another use for it. :)
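A rough sketch of the idea (the SHA-256 hashing with ad-hoc leaf/node prefixes is illustrative, not BLAKE3's or Utreexo's actual scheme): keep one "peak" hash per 1-bit of the leaf count, merging equal-height peaks like a binary carry.

```
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

class StreamingMerkle:
    def __init__(self):
        # One (height, hash) "peak" per 1-bit in the number of leaves so far.
        self.peaks = []

    def add_leaf(self, leaf: bytes):
        node = (0, _h(b"\x00" + leaf))
        # Merging equal-height peaks mirrors binary carry propagation.
        while self.peaks and self.peaks[-1][0] == node[0]:
            height, left = self.peaks.pop()
            node = (height + 1, _h(b"\x01" + left + node[1]))
        self.peaks.append(node)

    def root(self) -> bytes:
        # Fold the remaining peaks (the perfect subtrees) right to left.
        acc = None
        for height, digest in reversed(self.peaks):
            acc = digest if acc is None else _h(b"\x01" + digest + acc)
        return acc
```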
In other words, data structures that effectively reach a 'practical' entropy lower bound while still keeping asymptotic run time.
[1] https://en.wikipedia.org/wiki/Binary_decision_diagram [2] https://apps.dtic.mil/sti/pdfs/ADA470446.pdf
2. Not necessarily a data structure, but SMT solvers (like Z3) provide a very useful abstraction for certain kinds of problems. They also use some cool data structures like disjoint-sets among other things.
3. Lock-free datastructures
https://stackoverflow.com/questions/500607/what-are-the-less...
[edit: the link above has been discussed on HN multiple times https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...]
Edit: forgot to add mine. Not obscure by any means but Fenwick Trees are very cool IMO
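Since Fenwick trees came up, a minimal sketch: point update and prefix sum, both O(log n), using the same `i & -i` lowest-set-bit trick mentioned in the bit-twiddling comment above.

```
class Fenwick:
    """Binary indexed tree over n values, all initially zero."""
    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)          # 1-indexed internally

    def add(self, i, delta):
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i                    # jump to the next responsible node

    def prefix_sum(self, i):
        # Sum of elements [0, i] inclusive.
        i += 1
        total = 0
        while i > 0:
            total += self.tree[i]
            i -= i & -i
        return total
```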
They're often used in Entity Component System architectures since you have O(1) add/remove/find, & O(n) iteration. Iterations are also packed, which is a bonus for cache efficiency.
Can be difficult to implement well, but the concept is simple and a neat example of a useful specialty data structure. Take a look at https://github.com/skypjack/entt
The way I think of a bloom filter is like at an office mailroom or at post office mailbox where mail is grouped by the letters of people's name.
Given someone's name, you can glance and see "Uptrenda? No letters for 'U'" very quickly. This is when a bloom filter returns FALSE. But if they glance and see mail in the "U" box, then they return MAYBE. After a MAYBE, to see if you actually have any mail, they need to actually go through the box (i.e. a bloom filter miss).
Mine are a couple that are present in the Java Collections but I'm always disappointed that other languages' standard libraries don't include them:
- TreeMap/Set: A RedBlack tree. It keeps keys ordered, either naturally or by using the provided Comparator, but has log(n) time on basic ops.
- https://docs.oracle.com/javase/8/docs/api/java/util/TreeMap.html
- LinkedHashMap/Set: keeps keys ordered by insert order via a linkedlist but can be configured to use access-order making it easy to use as a LRU cache. It keeps O(1) time but is less efficient than HashMap: - https://docs.oracle.com/javase/8/docs/api/java/util/LinkedHashMap.html
Apparently, bloom filters make for lousy IP membership checks, read: https://blog.cloudflare.com/when-bloom-filters-dont-bloom/
CritBit Trie [0] and possibly Allotment Routing Table (ART) are better suited for IPs [1].
[0] https://github.com/agl/critbit
[1] https://web.archive.org/web/20210720162224/https://www.harig...
Since you mentioned bloom filters - other probabilistic data structures like count-min sketches (roughly, streaming bloom filters) are super useful. Approximate kmer methods like minhash and w-shingling use them in really cool ways. Rolling hash methods like Rabin chunking also work really nicely with probabilistic/streaming hash tables - splitting the data stream into chunks that can be consistently filtered or matched is useful in many circumstances! Think NSA-level data harvesting, or realtime genome/geospatial data classification.
2. Checking for, locating and counting subsequences in giant string datasets:
Wavelet trees, FM-indices, suffix arrays more generally - and structures that allow extreme performance in very niche situations with arbitrary encodings like bitvectors and compressed integer vectors. If you have a million books, or all the genomes ever sequenced, you use the first structures to find where a specific phrase can be found, even if you don't quite spell it right. You can use the latter ones to do super fast comparisons of giant datasets - given a million viruses, how similar is each one to the human genome, and where specifically do they most closely match?
3. Reconstructing structured information from (potentially noisy) fragments:
De Bruijn graphs (and related graphs like string graphs and colored de Bruijn graphs). This is a way to find all the overlap connections between fragments of information, and then weight the possible traversals of those connections by the evidence. These can be represented using the data structures from #1 (FM-indices for example), and efficiently used in some circumstances with those from #2 to enable some kinds of graph algorithms. If you have a shredded set of documents, or a billion short DNA reads from a genome sequencing experiment, this is how you reconstruct the original.
4. Decentralised coordination structures. Merkle-DAGs and Kademlia DHTs in particular. Being able to compare any trees by root-first hashes, and being able to request the content of the subtree for any node hash from an arbitrary set of peers - these structures enable the p2p web infrastructure. Everything from Limewire to bittorrent, IPFS and blockchains, and, most importantly, Sci-Hub.
1, 2 and 3 together are some of the fundamentals of computational biology. If you're interested in understanding them, https://rosalind.info/problems/list-view/ is a great starting place.
"Scalability! But at what COST?"
In it Frank McSherry uses a handful of data structure tricks to make graph algorithms like connectivity and PageRank run progressively faster on his laptop, beating distributed cluster implementations. It's actually a version of their HotOS paper with Isard and Murray.
Admittedly it's the opposite of obscure, being kind of a myth busting piece about distributed processing being faster and how even largish data sets can fit on a single node (and even your laptop).
Count-min sketch [1] is another one. It gives a reasonably accurate count of different events, even millions of unique events, in a fixed memory size. It shows up a bunch in high volume stream processing, and it's easy to understand like the bloom filter.
Another cool data structure is HeavyKeeper [2], which was built as an improvement on count-min sketch for one of its use cases: ranking the most frequent events (like for a leaderboard). It can get 4 nines of accuracy with a small fixed size even for enormous data sets. It's implemented by Redis's topk.
[1]: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
[2]: https://www.usenix.org/system/files/conference/atc18/atc18-g...
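A minimal count-min sketch (the width, depth, and salted built-in hash here are illustrative stand-ins for properly chosen sizes and pairwise-independent hash functions):

```
import random

class CountMinSketch:
    def __init__(self, width=2048, depth=4, seed=42):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]
        self.salts = [rng.getrandbits(64) for _ in range(depth)]

    def _index(self, row, item):
        # Python's built-in hash is fine for a demo; not for adversarial input.
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, item)] += count

    def estimate(self, item):
        # Never under-counts; over-counts by a bounded amount with high probability.
        return min(self.rows[r][self._index(r, item)] for r in range(self.depth))
```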
https://en.m.wikipedia.org/wiki/Fractional_cascading
I wrote a simple demo in Rust a while back to help myself learn the language.
https://github.com/mgraczyk/fractional_cascading
I also think pairing heaps are neat.
Famous as the data structure that achieved the optimal asymptotic runtime for Dijkstra's algorithm. In practice it isn't often used because simpler heaps outperform it except in very specific circumstances.
https://en.wikipedia.org/wiki/Fibonacci_heap
Soft heap
As far as I know this hasn’t really been implemented. It’s used for the best known minimum spanning tree algorithm. Pretty crazy. https://en.wikipedia.org/wiki/Soft_heap
For estimating a median or percentile in a stream, using constant memory.
It's very simple: if the current value is higher than our estimate, then increase our estimate by one. Else decrease by one. It will converge over long enough time. This is called the Frugal-1U algorithm.
The Frugal-2U is a slightly more advanced version that modifies the step size along the way, plus other optimizations, to converge faster and not oscillate too much.
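The median version described above fits in a few lines (a sketch; the general quantile form adds the coin flips):

```
def frugal_median(stream):
    """One integer of state, stepped by +-1 toward the running median."""
    estimate = 0
    for value in stream:
        if value > estimate:
            estimate += 1
        elif value < estimate:
            estimate -= 1
    return estimate

# For an arbitrary quantile q, Frugal-1U steps up with probability q and
# down with probability 1 - q instead of always stepping.
```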
* HyperLogLog? Convenient for doing approximate counting across huge datasets.
* SkipList is another probabilistic data structure that allows you to skip ahead N elements in 1 step. I believe ClickHouse uses it.
* Bitmap index organizes database pointers into a matrix of bits. The bitmap scans simply skip over the zeros. This type of index gets in trouble if you have high cardinality, though.
"A space-efficient probabilistic data structure that is used to test whether an element is a member of a set, like a Bloom filter does. False positive matches are possible, but false negatives are not – in other words, a query returns either "possibly in set" or "definitely not in set". A cuckoo filter can also delete existing items, which is not supported by Bloom filters. In addition, for applications that store many items and target moderately low false positive rates, cuckoo filters can achieve lower space overhead than space-optimized Bloom filters."
"LDA represents documents as mixtures of topics that spit out words with certain probabilities."
"LDA assumes that each document in a corpus contains a mix of topics that are found throughout the entire corpus. The topic structure is hidden - we can only observe the documents and words, not the topics themselves. Because the structure is hidden (also known as latent), this method seeks to infer the topic structure given the known words and documents."
https://cfss.uchicago.edu/notes/topic-modeling/
I'm making a next-gen search engine.
The Alias Table for sampling a discrete distribution in O(1) time is a very clever idea https://www.keithschwarz.com/darts-dice-coins/
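A sketch of Vose's variant of the method: O(n) to build, O(1) per sample.

```
import random

def build_alias_table(weights):
    """Vose's alias method for the discrete distribution proportional to 'weights'."""
    n = len(weights)
    total = float(sum(weights))
    prob = [w * n / total for w in weights]        # scaled so the average is 1
    alias = [0] * n
    small = [i for i, p in enumerate(prob) if p < 1.0]
    large = [i for i, p in enumerate(prob) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = l                               # s's leftover probability is covered by l
        prob[l] -= 1.0 - prob[s]
        (small if prob[l] < 1.0 else large).append(l)
    for leftover in small + large:                 # floating-point slack: these columns are full
        prob[leftover] = 1.0
    return prob, alias

def sample(prob, alias):
    i = random.randrange(len(prob))                # pick a column uniformly...
    return i if random.random() < prob[i] else alias[i]   # ...then a biased coin flip

# prob, alias = build_alias_table([1, 2, 7]); sample(prob, alias) -> 2 about 70% of the time
```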
Suppose we've got a lot of Doodads, let's say there's a Graph of ten million Doodads, and a whole bunch (dozens? hundreds?) of threads are poking around in this same graph, maybe looking at Doodads and sometimes (but not often) removing them from the Graph.
What happens if my thread is looking at a Doodad, and meanwhile a different thread removes it from the Graph? Is the Doodad destroyed while I was looking at it? That sounds very bad. What if while I'm looking at it, the Doodad is destroyed, but, a nearly identical Doodad appears in its place. How would I know? Not acceptable.
We could reference count the Doodads, we'd need atomic reference counts, something like C++ shared_ptr -- but this would work. However now our performance is terrible, because we've got 100 threads increasing and decreasing these reference counts just to look at Doodads and each such change causes a write that must be observed on other threads.
So instead Hazard pointers go like this:
Every thread who wants to look at Doodads asks for a Hazard Pointer, usually these are never deleted once made because it's much easier. A thread sets this Hazard Pointer to point at a Doodad when it wants to look at one, it checks this was successful (if not the Doodad is gone now), then contemplates the Doodad (it can be in this state for an arbitrarily long period of time) and then it clears the hazard pointer once it is done. It can only point one Hazard Pointer at one Doodad (some schemes allow a thread to have say, two or some finite number of Hazard Pointers each)
Somewhere a Linked List of all these hazard pointers is kept. When you want to remove a Doodad, you take it out of the graph, then you check the List of Hazard Pointers. If any of them was pointing at the Doodad you removed, you put the Doodad on a special pile since it can't be deleted yet, otherwise you can delete it immediately because nobody else needs it any more. This is a relatively expensive operation, but you're only doing it for removal, not just reads.
Finally there needs to be some mopping up of that pile of "couldn't delete yet" Doodads by trying to delete them again (checking the Hazard Pointers again of course). This can be periodic, it can happen every time somebody tries to delete one, or whatever, the best policy may vary between applications.
This approach means reading Doodads costs only two local writes, no waiting around for the cache. Removing them is more expensive, but that's OK because it was rare.
Was deeply patent encumbered until recently.
also, radix heaps are underappreciated.
The representation of S can be anything, and even dynamically adapting to the data. I’ve seen trees of bitmaps for example.
I used one to implement a fast shared memory pub-sub message bus for communicating across dozens of processes in a simulation framework.
Also ropes, a string representation allowing for efficient prepends, substrings, middle insertions and appends [2]
[0] http://cr.yp.to/critbit.html
[1] https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.96....
They are like tries but share prefixes and suffixes of a word corpus, can be built in linear time, and since they are graphs, one can define an inner product between two CDAWGs and use them for kernel-based machine learning algorithms.
https://hal-upec-upem.archives-ouvertes.fr/hal-00620006/docu...
With two unsigned integers, freq and phase, you get amazingly fine control over timing. You only use the top N bits of phase for output. It's saved my bacon in embedded systems more than once.
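A sketch of that phase-accumulator (DDS/NCO) idea; the 32-bit width, the 8 output bits, and the example rates are all arbitrary choices:

```
# 32-bit phase accumulator: 'freq' is the per-tick increment, and only the
# top N bits of 'phase' are used as output.
MASK32 = 0xFFFFFFFF
N = 8                                             # output resolution: 2**N steps per cycle

def make_nco(output_hz, tick_hz):
    freq = int(output_hz / tick_hz * (1 << 32))   # fractional increment per tick
    phase = 0
    def tick():
        nonlocal phase
        phase = (phase + freq) & MASK32           # wraps exactly once per output cycle
        return phase >> (32 - N)                  # top N bits = current step
    return tick

# e.g. a 440 Hz index into a 256-entry wavetable, ticked at 48 kHz:
# step = make_nco(440, 48_000); step() -> 0..255
```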
We used it to check if you had voted on a discussion on reddit. In almost all cases the user had not voted, so it was a lot quicker to check that you hadn't voted than that you had.
[1] https://en.wikipedia.org/wiki/GADDAG [2] https://www.axioms.me/gaddag/#?page-5
Seems-needed clarification: Bloom filters suffer from false positives, but not false negatives. So - if BloomFilterBlacklistCheck( $IP ) returns false, your router can safely conclude "not blacklisted". But if it returns true, then you'll need to perform further checks. (And you can't just add a second Bloom filter of the known false positives - that filter will have false positives of its own, potentially letting in traffic from blacklisted IPs.)
/* Bits: selected, hovered */
const colors = [
grey, // 00
green, // 01
blueA, // 10
blueB // 11
]
const color = colors[selected << 1 | hovered]
https://blog.uidrafter.com/bitwise-table-lookup
https://en.m.wikipedia.org/wiki/R-tree
Also concurrent data structures.
is pretty awesome and I’ve personally gotten big wins from it.
The parent mentioned Bloom Filters, HyperLogLog-type stuff is on that family tree and also very interesting and very useful.
But the coolest thing lately by far? Swiss table (and the largely equivalent F14).
That thing is a game changer in certain scenarios.
By the way if you're curious how I can get constant time when I'm returning a set of descendants, it's because I can give you the size and pointers to the first and last elements right off the bat.
The link to the original paper seems to be down (looking into it), but you can find the master's thesis covering all variants here: https://github.com/sam0x17/hierarch_old/blob/master/Master's...
I've been working on a rust implementation lately as well. There are a number of applications for DFilters including fast file-system searching, DOM traversal, very fast static analysis on abstract syntax trees, and more.
I made up a data structure once, consisting of a pyramid of deques. It lets you efficiently compute any associative function over a streaming window of data.
Fibonacci heap is theoretically better than Binary heap but the cache behavior is terrible
https://en.wikipedia.org/wiki/Van_Emde_Boas_tree?wprov=sfla1
Well, I saw this in CLRS tho. Very clever way of abusing bit patterns for quick predecessor/successor queries in O(log log M), where M is the size of the integer universe.
https://en.wikipedia.org/wiki/Suffix_array?wprov=sfla1
A simpler way to do text matching, since a suffix array lists the suffixes in the order the leaves of a suffix trie over the same string would be visited. Not heavily used in practice, but I do know bioinfo people use it a lot.
https://study.com/academy/lesson/robin-hood-hashing-concepts...
Not a data structure, but still very underrated. Robin Hood hashing basically means you track each entry's probe distance (how far it sits from its ideal slot); when inserting, if the new element is further from home than the resident, they swap, "stealing from the rich" to get a better expected access distribution, hence Robin Hood. It got more prominent thanks to Rust, whose original standard HashMap used Robin Hood hashing (the current default, hashbrown, is a SwissTable-style design).
https://en.wikipedia.org/wiki/Skip_list?wprov=sfla1
The ordered linked list with a fast lane every log(n) levels. Not used very much, but one place I know that uses it is Redis.
https://en.wikipedia.org/wiki/Double-ended_priority_queue?wp...
Basically one priority queue that does two things. Can ironically be implemented with two priority queues (dual)
https://en.wikipedia.org/wiki/Category:Probabilistic_data_st...
Eg `Data.Sequence` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) is based on Finger Trees (http://staff.city.ac.uk/~ross/papers/FingerTree.html) which allow O(log(min(i,n-i))) inserts (at i) and O(1) inserts at the beginning or end even though the data structure is basically copy-on-write.
In general Haskell has a lot of specialized data structures. One of my favorites is `Data.IntMap` (https://hackage.haskell.org/package/containers-0.6.5.1/docs/...) which is a specialized map for integer keys (based on Patricia trees iirc).
(Man I miss working in Haskell :-( )
https://en.wikipedia.org/wiki/GADDAG#:~:text=A%20GADDAG%20is....
This is useful for example when you want to convert between offsets and line/column numbers efficiently, while you also want to efficiently update/insert/replace lines.
Bonus tip: When you have a fast distance function (like a hamming distance) and write your tree from scratch in C (which almost automatically means you'll optimise it for your use case), they can become absurdly fast - much faster than the general-purpose nearest neighbour libraries I'm aware of.
Rainbow Tables.
Sure, they're not used anymore and are not particularly useful nowadays, but I find the idea behind them very beautiful.
This talk provides a nice overview of different timing wheel implementations, including hierarchical and hashed timing wheels.
- Reasoning: It's used in cases where you'd ideally use an array because you want contiguous memory (because you'll usually iterate through them in order), but you don't know beforehand how many elements you'll insert. But you can't use a resizeable version like std::vector, because it invalidates any pointers to the elements you've already allocated, and you need to store those pointers.
- Implementation: As a black box, it's used like a list: you index elements with an integer, and you can append to the end. But internally it uses different lists. Like a std::vector, it starts with an allocated size, and when it needs a bigger size, it allocates newer blocks that are "added" to the end, but doesn't move the already allocated elements.
- Downsides: Data is not all contiguous, at some point there are jumps between elements that should be close together. For performance when iterating, it's negligible since those cache misses happen "rarely" (depends on the block size, bigger block size means better cache performance but more potential wasted allocated memory), and most of the time you're iterating from start to end. But this means that you can't use this structure with algorithms that use "pointer + size", since internally it doesn't work like that. And since internally you can't shrink the lists and invalidate pointers, there's no way to delete elements, but in my case it's not a problem since I'd reuse indices afterwards with an added structure.
- Applications: I used it when coding games, when you have a big list of entities/components but you don't know how many beforehand. You access them by index, you might store pointers to elements, and most of the time you'll iterate through the whole list (when updating, drawing, etc.).
Great for when you need to do ranged updates in constant time.
More info: https://codeforces.com/blog/entry/78762
I remember having implemented an OOP indexed version of U.C. Berkeley's Cheshire-II in 2007 with Dolphin Smalltalk.
I was using BTrees tho, hence O(log N). The non-boolean part was in the values of the keys which were probability of good match against your search target.
Let's say you are trying to "synchronize" a secondary data store with a primary data store. Changes in the primary data store are very "bursty", one row will not change for days, then it'll change 300 times in a minute. You are willing to trade a bit of latency (say 10 seconds) to reduce total message throughput. You don't care about capturing every change to the primary, you just want to keep it within 10 seconds.
It feels like there should be a clever way to "debounce" an update when another update overrides it 500ms later. I know debounce from the world of front-end UI where you wait a little when doing autocomplete search based on keyboard input so as not to overwhelm the search.
Say that you have a dataset that needs to be ordered, easily searchable, but is also updated quite frequently. Fast accesses are a pain if you decide to use traditional read-write locks.
Ctries are entirely lock-free, thus there is no waiting for your read operations when an update is happening, i.e. you run lookups on snapshots while updates happen.
They are also a lot of fun to implement, especially if you aren't familiar with lock-free algorithms! I did learn a lot doing it myself [1]
[0] http://aleksandar-prokopec.com/resources/docs/ctries-snapsho...
Binary fuse filters? (Xor|Bloom) filters on steroids.
Non-deterministic finite state transducers? Reversible, more elegant and more powerful version of non-deterministic finite state recognizers (transitions are labelled with two outputs, which can be interpreted as inputs versus outputs, but these roles can be swapped as they are symmetric).
Ropes? What, you are still using strings? Dude, please stop writing that newbie editor.
Caches with Time aware least recently used (TLRU) or Segmented LRU (SLRU) cache replacement strategies (Oh, you thought Least recently used (LRU) was cool?)
Great for short-lived values in a cache that are frequently accessed. If all of your values always have the same expiration period (i.e. expires at insert time + N period) then it's super efficient.
More or less you have an array with a head and tail. Every "tick" (usually each second but you can customize it) you move the read pointer ahead by 1 array slot. The tail is then evicted.
If you have a highly accessed part of your code, you can be lazy and avoid using an extra thread and just tick ahead the read pointer when the time changes.
It is actually a quite trivial change until you come to the merge algorithm which is finicky in all the wrong ways. The benefits are (in clojure terminology) O(1)ish splits, inserts and concatenations instead of O(n).
I don't know if it can be called obscure anymore, since Scala's vectors are RRB-trees and clojure actually has an implementation of them, but I often see people talk about ropes in situations where an RRB tree should be a no-brainer.
These things are common in discrete event simulation, but, are useful anytime you have a time based queue.
https://www.averylaird.com/programming/the%20text%20editor/2...
“ In computer science, dancing links (DLX) is a technique for adding and deleting a node from a circular doubly linked list. It is particularly useful for efficiently implementing backtracking algorithms, such as Donald Knuth's Algorithm X for the exact cover problem.[1] Algorithm X is a recursive, nondeterministic, depth-first, backtracking algorithm that finds all solutions to the exact cover problem. Some of the better-known exact cover problems include tiling, the n queens problem, and Sudoku.”
I had a bunch of API keys with poor expiry documentation and implementation, so to find out if a key had expired it had to be used. I put it in a main "keys.pop loop" and all methods below tried to use the key. If the HTTP response was (some other obscure HTTP status code like) 505, I simply called `continue;` in the loop to jump to another, without caring at all where I was before.
The fact that you could perform linear algebra on documents and words, to get out a number that somehow quantified how close a document was to another really surprised me back in the day. Grew a new appreciation for what exactly those vectors in linear algebra could represent.
Not in the field so don't know how obscure it is though.
[1]: https://en.wikipedia.org/wiki/Latent_semantic_analysis#Laten...
- Dominator tree (https://en.wikipedia.org/wiki/Dominator_(graph_theory))
- Single-Connected-Components-Graph
- Deterministic data structures (eg. a set that acts deterministic to the fact that addresses might be somehow randomly assigned, very useful for ensuring reproducibility)
Already cited, but it's clearly among the most elegant:
- union-find (!!!!)
and as a bonus one that is easily overlooked:
- std::deque, which when restricted to push_back() or push_front() guarantees never to move objects around.
https://github.com/sipa/minisketch
Let's say you have a set of a billion items. Someone else has mostly the same set but they differ by 10 items. These let you exchange messages that would fit in one UDP packet to reconcile the sets.
Minisketch is more costly in CPU but has deterministic guarantees. IBLTs are very fast but have a chance of failure requiring a randomized iterative procedure.
I wrote a Python library implementing them a number of years ago: https://github.com/benhoyt/pybktree
Edit: another that I enjoyed working with 'Burst Tries': https://dl.acm.org/doi/10.1145/506309.506312
Humm. A link to these would have been good[1], but anyway I understand that an optimally filled bloom filter takes ~1.44 times more space than the optimal theoretic value (which I understand Golomb encoding gives you?), so 'much smaller' is not a helpful measure. Likewise the 'worse performance' is not a helpful phrase, I believe a linear lookup time is needed for decoding, but a small amount of space for an extra indexing structure can speed things up lots. Bloom filters are updatable (addition only in the simplest bloom filter), golomb coded sets practically can't be updated except by rebuilding AIUI.
I suppose binary heaps should get a mention <https://en.wikipedia.org/wiki/Binary_heap> cos they barely seem to figure in other comments here. The neat bit is they are trees but implicit (stored in an array) not explicit (which would use pointers) and are always balanced giving you log access times in a compact form.
[1] <https://en.wikipedia.org/wiki/Golomb_coding> but there are simpler explanations on the web
This data structure is an immutable, uniquely represented Sequence ADS (Authenticated Data Structure) with various interesting use cases.
It is based on a novel hashing scheme that was first introduced with SeqHash.
See http://www.bu.edu/hic/files/2015/01/versum-ccs14.pdf by Jelle van den Hooff, a brilliant young researcher back then.
SeqHashes are uniquely represented Merkle Trees that also represent a sequence. Concatenation of two SeqHashes is O(log(n)). In turn, the cryptographic signing of SeqHashes is O(1) as each tree node carries a cryptographic hash.
Of course, for each node to carry a hash incurs a hefty overhead but that can be alleviated by (lazily) grouping nodes into buckets (turning it in some kind of BTree).
SeqHashes also don't support splitting in O(log(n)) like for example AVL trees.
I've created a BTree version of SeqHash that also allows splitting SeqHashes in O(log(n)), called SplitHash.
See: https://github.com/odipar/spread/blob/master/src/main/scala/...
I also really like the relational model; it's not that well represented, so I'm making a language on top of it: https://tablam.org.
So good forward error correction that up to a few years ago it was all under patents.
Gold Codes (2) are also very cool in CDMA systems.
[1] https://en.m.wikipedia.org/wiki/Low-density_parity-check_cod...
Think of them as double-ended priority queues. O(n) construction with O(lg N) mutations. Both the min and max elements are quick to get: O(1).
The paper is short and the data structure is elegant. I read it a few years ago in uni and it made me appreciate how something greatly useful can be implemented very simply.
If you like these kinds of data structures, the Database Internals [3] book has a good introduction and explanation of a handful of BTree optimizations like this.
1: https://www.microsoft.com/en-us/research/wp-content/uploads/...
2: https://www.cs.cmu.edu/~huanche1/publications/open_bwtree.pd...
Very cool.
Maybe not so obscure if you've worked in the geospatial realm.
I actually had a job interview once that asked me to implement an R-Tree from scratch in C++ as a homework assignment - I didn't get that job :)
The code was simple, but effective (Go):
import "sync"
type Muxer[T any] struct {
ret T
mut sync.RWMutex
}
func (c *Muxer[T]) Multiplex(fn func() T) (ret T) {
if c.mut.TryLock() {
defer c.mut.Unlock()
ret = fn()
c.ret = ret
} else {
c.mut.RLock()
defer c.mut.RUnlock()
ret = c.ret
}
return ret
}
The way to use it is to write a non-concurrent nullary get() function, then pass it to the Multiplex() function in as many goroutines as needed: get := func() DataType { ... }
muxer := Muxer[DataType]{}
// in other goroutines
value := muxer.Multiplex(get)
The value will be shared across all goroutines, so we get minimum latency with minimum copying.
The canonical use case for HLL is counting unique visitors. But any large dataset where you only need a good approximation is a valid target :)
Count quadrangle structures in graphs to derive an embeddedness coefficient for each edge and expand the graph along the most embedded edges, yielding nice clusters.
https://jgaa.info/accepted/2015/NocajOrtmannBrandes2015.19.2...
Just practical stuff: in Python, I often use a magic dict defined as follows.
```
from collections import defaultdict

def MagicDict():
    return defaultdict(MagicDict)
```
then you can use your MagicDict as a weird json-like map.
```
c = MagicDict()
c[3]["aaa"][2] = 5
```
See https://en.wikipedia.org/wiki/Palindrome_tree and the original paper
Stanford Lecture: Don Knuth — "Dancing Links" (2018) https://www.youtube.com/watch?v=_cR9zDlvP88
At the lowest level, let's have one bit represent if a 4K chunk is free space or occupied. 64 bits at this level tells you the usage of 64 * 4096 bytes (256KB)
Then Level 2. Two arrays. One bit at this level represents 64 bits in the level 1 array. One array for All-Bits-Set in level 1, one array for Any-Bit-Set in level 1. 64 bits at Level 2 tells you usage of 64 * 64 * 4096 bytes (16MB).
Then Level 3. Two arrays. One bit at this level represents 64 bits in the level 2 array. One array for All-Bits-Set in level 2, one array for Any-Bit-Set in level 2. 64 bits at Level 3 tells you the usage of 1GB.
I'd imagine there would be ways to use the hierarchical tables to quickly find a hole of a particular size without checking only the lowest level.
I remember when I first discovered it, it drove home the point that choice of data structure can really matter - it made a bunch of problems I'd been working on much easier to solve, as it gave a different way of looking at things.
Imagine it's for a text editor, and you want to map Line Numbers to Byte Positions. But then you want to insert a byte somewhere, and you need to add 1 to all your Byte Position values.
Instead of actually keeping a big array of Byte Position values, you have a hierarchical array. The convention is that whenever you want to read the value, you add the Parent's value, and that parent's value, etc. Number of parents would be logarithmic.
So if you want to add 1 to everything past a certain point, you just apply it to some leaf nodes, then the next internal nodes, rather than applying it to every single leaf node.
A good example is a wavelet tree. Suppose you have a text of length N comprised of an alphabet with C characters and wish to execute full-text pattern searches for patterns of length P. Once the tree is created, you can do your full-text search in O(P log C) time, and you use less space than any constant multiplicative factor of the information theoretic lower bound. Note that the runtime is independent of the original string's length.
Instead of using a bit field for membership, use an integer field for tracking count.
When querying: take the min of each value in the field location given by the hash functions.
When updating: increment each field location given by the hash functions and take the min of the new values. If the min is greater than the least value in the heap, push the updated key into the heap.
No pointers to dereference and wait on, just the input, computed, stored back into the same cache line - to be used for the next part of the hot loop. If the hot loop is multi-staged and the output continuously decreases in size, one can even "compress" the data by packing an array of origMaxSize/outputSize subunits into the unit.
(Set up two N-element arrays "member" and "index" of ints and an elementCount; set elementCount to zero. You don't have to initialize the arrays if you can live with some valgrind grumbling. The leading "elementCount" entries in "member" are the current members of the set, unordered, and for each of those values x, index[x] is its index in "member".
Then isInSet(x) = index[x] >= 0 && index[x] < elementCount && member[index[x]] == x and addToSet(x) = isInSet(x) || (member[index[x] = elementCount++] = x)
and rmFromSet() is left as an exercise.)
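The same trick in Python (which can't really leave arrays uninitialized, but the O(1) clear and the validity check carry over):

```
class SparseSet:
    """Members of {0, ..., n-1}; clear() is O(1) because stale entries in
    'index'/'member' are harmless -- the validity check below rejects them."""
    def __init__(self, n):
        self.member = [0] * n       # would be left uninitialized in C
        self.index = [0] * n
        self.count = 0

    def __contains__(self, x):
        i = self.index[x]
        return i < self.count and self.member[i] == x

    def add(self, x):
        if x not in self:
            self.member[self.count] = x
            self.index[x] = self.count
            self.count += 1

    def remove(self, x):
        if x in self:
            i = self.index[x]
            last = self.member[self.count - 1]   # swap the last member into the hole
            self.member[i] = last
            self.index[last] = i
            self.count -= 1

    def clear(self):
        self.count = 0              # constant time, no zeroing
```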
This is great for live updated time series data: you never have to expand your arrays, allocate memory etc. It is extremely fast, predictable and resource efficient.
Hash tables that come with a parameter that lets you control density vs. speed.
For context, they were used by Figma to represent its scene-graph (https://www.figma.com/blog/how-figmas-multiplayer-technology...).
Essentially, it’s a way to represent a normal tree as a map; keys of the map represent individual nodes on the trees, and one key “ROOT” points to the rootmost node.
Each node is connected via the following relationship: parents have an unordered set of their children’s keys (not the actual node reference) and children know their parent’s key. Children also maintain a property to track their index using a technique called fractional indexing.
What’s great about this data structure is that you have O(1) transformations for most parts. If you need to change a node’s property, height for instance, (assuming this is a design tool or something), all you need is that nodes key. If you need to reposition a node under a new parent, simply remove the key from the parent’s set, change the node’s parent key, and update the index. This structure plays well with collaborative apps too as most edits to the tree are O(1)— edits are swift and can be communicated to other collaborative instances quite simply with small payloads.
Last thing I’ll touch on is fractional indexing. Very crudely put, it forces children nodes to track their position in a list via an “index” that can be a normal float. To reposition a child node, you first find its neighbors and average their indices to find a new index. Ordering the list is a sort from smallest indexes to largest.
In an example, say we had 3 nodes A, B, and C. A’s index is 0.1, B’s is 0.3, and C’s is 0.5. To insert an element D between A and B, I average A and B’s index (0.2), set that as the index of D and insert it into the parent node’s children set. At render, the set is ordered and we find A, D, B, C.
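A tiny sketch of that averaging step (real systems typically use arbitrary-precision keys or occasional rebalancing, since repeated averaging eventually exhausts float precision):

```
def index_between(before, after):
    """Fractional index for a node inserted between two siblings;
    None means 'no neighbour on that side'."""
    if before is None and after is None:
        return 0.5
    if before is None:
        return after / 2
    if after is None:
        return before + 1.0
    return (before + after) / 2

# The A/B/C/D example above: A=0.1, B=0.3, C=0.5
# index_between(0.1, 0.3) -> 0.2, so sorting by index gives A, D, B, C.
```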
Shameless plug, but I’m using these techniques to build a multi-framework mobile app builder at phazia.com :). I love nerding out about these things so thanks for the chance to do so @op.
Super effective way to store a searchable list of items, like a dictionary of words.
To take advantage of all these, where would one go to ask "what's the best data structure to use to implement X, Y, and Z?"
Is there a stackoverflow tag or something like that that folks use??
And I love bloom filter too (and their more modern successor cuckoo filter) but I'd challenge the usecase you mention though: 1 million IPv4 is 4MB, and 16MB for IPv6. That's tiny, you're better off using some kind of hashtable, unless you have a fast and small memory, and then a slow and big memory (say embedded processor with small CPU cache and some DRAM).
Bloom filters are useful when your working set cannot fit in your fast memory, to allow you to go to the slow memory only for a small number of requests.
I implemented a variant of it, in my Generic Swift Toolbox Package[0]. It’s lightning fast.
[0] https://github.com/RiftValleySoftware/RVS_Generic_Swift_Tool...
Range Tree:
https://en.wikipedia.org/wiki/Range_tree
You give it a range of dates and it'll tell you if any intersect so if you're looking for "how many people are absent in this time period" you can really quickly find out by using a range tree.
I've looked around and I've seen R+ Trees are potentially a solution but there's very little info on how they're actually created in code.
Does anyone know of any nice data structures to do this? Obviously the most basic way is just indexing each field and that works for simple equality but gets harder when you're trying to do other comparators like < and >.
https://en.wikipedia.org/wiki/Nested_set_model
You can use it to store tree structures in relational databases for faster querying of stuff like "all nodes below node x", "how many nodes to the top node" and so on.
Querying is easy, but comes at the cost that changes like inserting and deleting are more expensive.
I used this for an auth service with roles, user and org units. pretty query heavy, worked very well.
→ Distinct count time complexity[2]: O(1)
See Redis's PFCOUNT[4].
[1] https://en.wikipedia.org/wiki/HyperLogLog
[2] For a fixed number of registers (e.g. Redis's implementation)
The Resource Description Framework (RDF) is the W3C standard for web data. It allows data to be linked, anywhere, and to be "grounded" in semantic descriptions of the data. The core RDF data model is very simple: it is a set of triples organized into an RDF graph.
For example, 26x26 raw files, one per letter combination, recording whether a page contains "aa" or "ab" all the way up to "zy" and "zz".
When one types a search query, after 2 letters the matching file is pulled; at the 3rd letter we have a second 2-letter combination, so pull that file too and do the AND operation.
It is much like a bloom filter and tells you what is not in the set.
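As I read it, the scheme is one bucket per letter pair recording which pages contain that pair; a toy Python sketch (made-up page data) of the AND step:

    from collections import defaultdict

    pages = {
        1: "the cat sat on the mat",
        2: "a cab drove past",
        3: "zebras graze at dawn",
    }

    # one bucket per two-letter combination, recording which pages contain it
    bigram_index = defaultdict(set)
    for page_id, text in pages.items():
        letters = text.replace(" ", "")
        for a, b in zip(letters, letters[1:]):
            bigram_index[a + b].add(page_id)

    def candidate_pages(query):
        """AND together the per-bigram page sets; like a bloom filter this only rules
        pages out -- surviving pages still need to be checked for the real query."""
        result = set(pages)
        for a, b in zip(query, query[1:]):
            result &= bigram_index.get(a + b, set())
        return result

    print(candidate_pages("cat"))   # {1}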
The piece table. https://news.ycombinator.com/item?id=15387672
Mirror of original: https://web.archive.org/web/20160308183811/http://1017.songt...
They are used by default in the OpenMined implementation of Private Set Intersection[1] - a multi-party computation technique.
[1] https://github.com/OpenMined/PSI/blob/master/private_set_int...
They are useful for checking in O(log(n)) if a number (or string) is in any of n ranges of numbers.
You just keep a sorted list of the ranges' endpoints. On lookup, you binary-search to find whether the value is in the list and, if not, between which two indexes it falls. If it lands between an even and an odd index, it's inside one of the ranges.
Very useful if you remember that IP addresses are also integers, and with IPv6 networks become way too huge to enumerate.
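A minimal sketch of the sorted-endpoints trick with Python's bisect, assuming disjoint half-open ranges:

    from bisect import bisect_right

    # n disjoint ranges kept as a flat sorted list of endpoints
    # (half-open: start inclusive, end exclusive): [s0, e0, s1, e1, ...]
    endpoints = [10, 20, 35, 40, 100, 250]    # i.e. [10,20), [35,40), [100,250)

    def in_any_range(x):
        # binary search for where x would land; an odd position means it fell
        # between a start and its matching end, i.e. inside one of the ranges
        return bisect_right(endpoints, x) % 2 == 1

    print(in_any_range(15))    # True  (inside [10, 20))
    print(in_any_range(20))    # False (end is exclusive)
    print(in_any_range(150))   # True  (inside [100, 250))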
Good use-case: lock-free concurrent reads, because there is no way committed content could change. Writes need to be linearizable (so locking is required). Given this property, this data structure provides faster reads than a mutex. Bonus section: provides broadcast capabilities if used as a channel.
> and similar write perfs.
Just because the type was introduced very late to the language, and another very successful method is named identically, it's almost impossible to google your way out of any situation.
You have to rely on the core documentation and lists of links. Just by using “new Map()” in JS, you're suddenly, ahem, mapped 30 years back in time!
If I'm doing anything with random or random-looking variation, I prefer to store the full distribution in an HDRHistogram rather than just the last value, the last few values, the mean, or whatever. It opens up so many possibilities for analysis later down the road.
It would run out of memory on a 96 GB machine before; after, it could process the tiles as quickly as they came in from the camera, with only a couple held in memory at a time.
When you have the right data structure, man it’s like night and day performance!
It's a ring buffer compatible with APIs that don't specifically support it.
It's a great fit for any network I/O and I believe Android uses it under the hood.
https://www.codeproject.com/Articles/3479/The-Bip-Buffer-The...
[0] https://en.m.wikipedia.org/wiki/Locality-sensitive_hashing
The simplest implementation is to just wrap a hash table and add time checks in the accessors.
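Assuming this is about a map whose entries expire, a minimal sketch of "wrap a hash table and check timestamps in the accessors" (ExpiringMap is a hypothetical name):

    import time

    class ExpiringMap:
        """A hash table whose entries silently disappear after `ttl` seconds.
        Expiry is checked lazily in the accessors."""
        def __init__(self, ttl):
            self.ttl = ttl
            self._data = {}          # key -> (value, insertion time)

        def __setitem__(self, key, value):
            self._data[key] = (value, time.monotonic())

        def __getitem__(self, key):
            value, stored_at = self._data[key]
            if time.monotonic() - stored_at > self.ttl:
                del self._data[key]  # expired: drop it and pretend it was never there
                raise KeyError(key)
            return value

        def __contains__(self, key):
            try:
                self[key]
                return True
            except KeyError:
                return False

    cache = ExpiringMap(ttl=0.5)
    cache["session"] = "abc123"
    print("session" in cache)   # True
    time.sleep(0.6)
    print("session" in cache)   # False, the entry timed out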
For example, you could use them to store the minimum spanning tree of a graph, while supporting updates if edges are added to the graph.
They can also be used to speed up max flow algorithms by finding the augmenting path in logarithmic time.
It keeps track of which space is allocated by logging every allocation and deallocation. When initializing an allocator, which can be implemented in any way, this log is replayed.
If the map becomes too big it is rewritten so it is equivalent to the old map but with no redundant entries.
https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...
They are basically only useful in theory, not in practice. They allow construction of some asymptotically optimal algorithms.
It is a B-Tree where updates to the tree are buffered in the branch nodes and are propagated down to the leaf nodes at a later point in time.
This article shows some really cool stuff you can do with it https://www.well-typed.com/blog/2019/05/integrated-shrinking...
In the case of IPs, you just have to make 4 comparisons (for IPv4), if you store them in a tree.
You can evaluate an arbitrarily large set of rules, with automatic incremental updates (no re-evaluation when it isn't necessary), with O(1) time complexity. The tradeoff is that you have to hold a very large graph representing your rule set in memory.
The idea that we can very cleverly pack a collection of information into a much smaller form, yet query the compressed form with good computational complexity and without unpacking the data you need to read, is amazing.
https://en.wikipedia.org/wiki/Alias_method
Mind was utterly blown when I first understood the way this works in my late teens.
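For anyone who hasn't met it, a compact sketch of the alias method (Vose's variant) in Python: O(n) to build the tables, then O(1) per weighted sample:

    import random

    def build_alias(weights):
        # Vose's alias method: O(n) setup, O(1) per sample afterwards
        n = len(weights)
        total = float(sum(weights))
        prob = [w * n / total for w in weights]   # rescaled so the average weight is 1
        alias = [0] * n
        small = [i for i, p in enumerate(prob) if p < 1.0]
        large = [i for i, p in enumerate(prob) if p >= 1.0]
        while small and large:
            s, l = small.pop(), large.pop()
            alias[s] = l                          # the light bucket borrows from a heavy one
            prob[l] -= 1.0 - prob[s]
            (small if prob[l] < 1.0 else large).append(l)
        for i in small + large:                   # whatever is left is exactly 1 (up to rounding)
            prob[i] = 1.0
        return prob, alias

    def sample(prob, alias):
        i = random.randrange(len(prob))           # pick a bucket uniformly...
        return i if random.random() < prob[i] else alias[i]   # ...then flip its biased coin

    prob, alias = build_alias([5, 3, 2])          # weights for outcomes 0, 1, 2
    counts = [0, 0, 0]
    for _ in range(100_000):
        counts[sample(prob, alias)] += 1
    print(counts)                                 # roughly [50000, 30000, 20000]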
https://en.wikipedia.org/wiki/Fractal_tree_index http://www-db.deis.unibo.it/courses/TBD/Lezioni/mysqluc-2010...
o k-d tree for space partitioning, handy real world applications in mapping and gaming.
o Bloom Filters as you mention are great, always a handy one to know.
I've used bloom filters with in memory caches a few times.
Among other things, it can store nil values, which if you’ve written any Cocoa apps, is an all too common way to crash your software.
The reason to use a bloom filter would be to save space.
Wouldn’t a hashmap / dict / object do this in O(1)? Input goes in, is hashed, and the key either exists or doesn’t exist?
The easy and efficient way to test if a value is in a list is to use a hash set or dictionary. Complexity is O(1) on average.
They have been around for 30 years and I don't see them used in practice.
A pinsketch is a set that takes a specified amount of memory and into which you can insert and remove set members or even add whole sets in time O(memory size). You can insert an unbounded number of entries, and at any time that it has equal or fewer entries than the size you can decode the list of members.
For an example usage, say I have a list of ten million IP addresses of people who have DOS attacked my systems recently. I want to send my list to you over an expensive pay as you go iridium connection, so I don't want to just send the 40MiB list since that would cost about $700. (Or 12.7 MiB if you use an optimal set encoding, or somewhat in between if you use the Golomb coded set mentioned in the OP).
Fortunately you've been making your own observations (and maybe have stale data from me), and we don't expect our lists to differ by more than 1000 entries. So I make and maintain a pinsketch with size 1000, which takes 4000 bytes (1000 * 4 bytes, because IP addresses are 32 bits). Then to send you an update I just send it over. You maintain your own pinsketch of addresses; you subtract yours from the one I sent and then you decode it. If the number of entries in the resulting set (the number different between us) is 1000 or fewer, you're guaranteed to learn the difference (otherwise the decode will fail, or give a false decode with odds ~= 1/2^(1000)).
Bonus: We don't need to know in advance how different our sets are-- I can send the sketch in units as small as one word at a time (32-bits in this case) and stop sending once you've got enough to decode. Implemented that way I could send you exactly the amount of data you're missing without even knowing which entries you're missing.
The sketch itself is simple: To insert an element you raise it to a sequence of odd powers (1,3,5,7...) in a finite field and add it to an accumulator for each power. The tricky part is decoding it. :P
Here is an implementation I contributed to: https://github.com/sipa/minisketch/
There is a related, application-oriented data structure called an inverted bloom lookup table (IBLT) that accomplishes the same task. Its encoding and especially its decoding are much faster, and it has asymptotically the same communications efficiency. However, the constant factors on the communications efficiency are poor, so for "small" set differences (like the 1000 above) it has a rather high overhead factor, and it can't guarantee decoding. I think that makes it much less magical, though it may be the right tool for some applications.
IBLT also has the benefit that the decoder is a fun bit of code golf to implement.
The encoding for IBLT is simple: take your entries and append a check value (it can't be a plain CRC). Then hash them with three hash functions to obtain three locations in a hash table and xor them into those locations (removal works the same way, due to xor's self-inverse property). Decoding an IBLT works by finding entries whose check value passes (which will be true when they are the only entry in their position) and subtracting them from the table (hash them to get their other two positions, and xor them out of all three). Decoding is successful when the data structure is all zeros. When the tables are big enough and have a modest amount of slack space, the decoding algorithm will succeed (for it to fail there has to be an overlapping cycle, which becomes rare in sufficiently large random graphs). (This description is probably enough for a reader to make a working IBLT implementation, though it omits some improvements.)
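Working only from that description (so treat it as a toy sketch, not a reference implementation), an IBLT in Python with 32-bit keys, a hash-derived check value, and xor-only cells split across three sub-tables:

    import hashlib

    SUBTABLES = 3     # three hash functions -> one cell in each of three sub-tables
    CELLS = 64        # cells per sub-table (needs slack above the expected difference)
    CHECK_MASK = (1 << 32) - 1

    def _h(data, salt):
        return int.from_bytes(hashlib.sha256(bytes([salt]) + data).digest()[:8], "big")

    def _check(key):
        # check value derived from the key (not a plain CRC, per the description above)
        return _h(key.to_bytes(4, "big"), 0) & CHECK_MASK

    def _positions(key):
        kb = key.to_bytes(4, "big")
        return [t * CELLS + _h(kb, t + 1) % CELLS for t in range(SUBTABLES)]

    def toggle(table, key):
        # insert *or* remove a 32-bit key: xor is its own inverse
        entry = (key << 32) | _check(key)
        for p in _positions(key):
            table[p] ^= entry

    def sketch(keys):
        table = [0] * (SUBTABLES * CELLS)
        for k in keys:
            toggle(table, k)
        return table

    def decode(table):
        # peel: a nonzero cell whose check value matches is (very probably) a lone
        # entry; pull it out of all three of its positions and repeat until done
        table, recovered, progress = table[:], set(), True
        while progress:
            progress = False
            for cell in table:
                key, chk = cell >> 32, cell & CHECK_MASK
                if cell and _check(key) == chk:
                    recovered.add(key)
                    toggle(table, key)
                    progress = True
        return recovered if not any(table) else None   # None -> too many differences

    # set reconciliation: xoring two sketches leaves only the symmetric difference
    mine  = sketch({10, 20, 30, 40})
    yours = sketch({20, 30, 50})
    print(decode([a ^ b for a, b in zip(mine, yours)]))   # {10, 40, 50} (in some order)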
I've used them to solve Sudoku puzzles really fast.
Skiplists
Locality sensitive hashing
Log structured storage (bitcask, lsm)
also love the Option type
Just started reading this. It looks interesting
https://en.m.wikipedia.org/wiki/List_of_algorithms
https://en.m.wikipedia.org/wiki/List_of_data_structures
__________________
As for a specific algorithm, this paper covers analysis of HyperLogLog (HLL):
http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf
TLDR: Calculating the exact cardinality of a multiset requires an amount of memory proportional to the cardinality, which is impractical for very large data sets. Probabilistic cardinality estimators, such as the HyperLogLog algorithm, use significantly less memory than this, at the cost of obtaining only an approximation of the cardinality. The HyperLogLog algorithm is able to estimate cardinalities of > 10^9 with a typical accuracy (standard error) of 2%, using 1.5 kB of memory.
Source:
https://en.wikipedia.org/wiki/HyperLogLog
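For the curious, a bare-bones HLL sketch in Python (2^14 registers, no small/large-range corrections; the standard error is roughly 1.04/sqrt(m), so under 1% here):

    import hashlib

    P = 14                    # 2**14 = 16384 registers
    M = 1 << P
    ALPHA = 0.7213 / (1 + 1.079 / M)   # bias-correction constant for large register counts

    registers = [0] * M

    def add(value):
        x = int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")  # 64-bit hash
        idx = x & (M - 1)                        # low P bits choose a register
        w = x >> P                               # the remaining 64-P bits
        rank = (64 - P) - w.bit_length() + 1     # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)

    def count():
        # raw HLL estimate; real implementations add small/large-range corrections
        return ALPHA * M * M / sum(2.0 ** -r for r in registers)

    for i in range(100_000):
        add(f"user-{i}")
    print(round(count()))     # ~100000, give or take a percent, from 16384 registers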
Recent HN coverage of HyperLogLog:
They are particularly interesting for storing long strings.
You can look up all strings with a common prefix quite easily.
The lookup function is really interesting: just check the first differing bit.
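The bit-level trick is the crit-bit/radix variant; as a minimal Python sketch, here's the plain character-trie version of the common-prefix lookup described above:

    class TrieNode:
        __slots__ = ("children", "is_word")
        def __init__(self):
            self.children = {}
            self.is_word = False

    class Trie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, word):
            node = self.root
            for ch in word:
                node = node.children.setdefault(ch, TrieNode())
            node.is_word = True

        def with_prefix(self, prefix):
            """Yield every stored string that starts with `prefix`."""
            node = self.root
            for ch in prefix:
                if ch not in node.children:
                    return
                node = node.children[ch]
            stack = [(node, prefix)]
            while stack:
                node, word = stack.pop()
                if node.is_word:
                    yield word
                for ch, child in node.children.items():
                    stack.append((child, word + ch))

    t = Trie()
    for w in ("card", "care", "carp", "dart"):
        t.insert(w)
    print(sorted(t.with_prefix("car")))   # ['card', 'care', 'carp']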
You're supposed to post yours as a comment, so it can be voted on like every other type of structure suggested here.
What I like about them is that they give me a set of engineering tradeoffs that I typically don't have access to: accuracy-speed[9] or accuracy-space. There have been too many times that I've had to say, "I wish I could do this, but it would take too much time/space to compute." Most of these problems still work even if the accuracy is not 100%. And furthermore, many (if not all) of these can tune accuracy by parameter adjustment anyway. They tend to have favorable combinatorial properties, i.e. they form monoids or semigroups under merge operations. In short, they're a family of data structures that gave me the ability to solve problems I couldn't before.
I hope they are as useful or intriguing to you as they are to me.
1. https://github.com/tdunning/t-digest
2. https://pdsa.readthedocs.io/en/latest/rank/qdigest.html
3. https://florian.github.io/count-min-sketch/
4. https://www.cs.yale.edu/homes/el327/papers/simpleMatrixSketc...
5. https://www.juanlopes.net/poly18/poly18-juan-lopes.pdf
6. https://courses.engr.illinois.edu/cs498abd/fa2020/slides/20-...
7. https://people.csail.mit.edu/rrw/6.045-2017/encalgs-mg.pdf
8. https://www.sciencedirect.com/science/article/abs/pii/S00200...
9. It may better be described as error-speed and error-space, but I've avoided the term error because the term for programming audiences typically evokes the idea of logic errors and what I mean is statistical error.
There are two ideas: concurrent loops, and userspace scheduling and preemption.
Nested loops can be independently scheduled. I wrote an M:N userspace scheduler[2] and it does something simple: you can interrupt a thread that is in a hot loop by changing its limit variable to the end. You don't need to slow down a hot loop by adding an if statement or counting how many statements have executed. (Many runtimes do this to implement scheduling, and you don't need to do it this way.)
Golang schedules away from a goroutine by checking during stack expansion if the thread has used its timeslice.
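Not the scheduler from [2], just a tiny Python sketch (LoopTask is a hypothetical name) of preempting a hot loop purely by shrinking its limit variable, with no per-iteration "should I yield?" check:

    import threading, time

    class LoopTask:
        def __init__(self, n):
            self.i = 0
            self.limit = n      # the scheduler owns this; shrinking it preempts the loop

        def run(self):
            # no extra yield check inside the body: the loop condition itself
            # doubles as the preemption point
            while self.i < self.limit:
                self.i += 1     # hot work would go here

    task = LoopTask(10**12)
    worker = threading.Thread(target=task.run)
    worker.start()
    time.sleep(0.1)
    task.limit = 0              # "changing its limit variable": the loop simply falls out
    worker.join()
    print("preempted after", task.i, "iterations")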
I think this simple observation is really powerful. The critical insight is that frontend performance can be drastically improved by reducing latency from the user's perspective. Did you ever notice that the cancel button rarely cancels immediately? Or that resizing an image is slow and gives no feedback, or that encrypting gives no feedback?
By having a watching thread interrogate the output buffer, we can create GUIs that visually do things and provide observability to what the computer is doing. Imagine watching a resize operation occur in real time. Or encryption. Or network communication.
One of my ideas of concurrent loops is "cancellation trees"[1]: if you model every behaviour of a piece of software as a tree of concurrent loops that are reentrant (as in, they never block, and each call is a loop iteration at a different index), you can create really responsive, low-latency software. If you cancel a branch of the tree, you cancel all the loops that belong to that branch. It's a bit like a tree of processes from PID 0, or loops as lightweight green threads/goroutines.
So as you're moving around in your text editor or browser, if you invalidate the current action - such as press ESCAPE while Intellisense is trying to find related code, you can interrupt all the loops that fanned out from that GUI operation.
Telling the computer to stop doing what it is doing and observability are two key weaknesses of modern computing. GUIs do not always accurately communicate what the computer is doing and you usually need to wait for the computer to finish before it shows you what it is doing. I am sure the cancellation token could be extended to follow this idea.
If you liked this comment, check out my profile to links to my ideas documents.
1: https://github.com/samsquire/ideas4/blob/main/README.md#120-...
Like a regular queue, but if something new that comes in sees its friend in the queue already, it can jump the queue and go and stand next to her.
Useful for when something has a big overhead on top of its own processing, but the overhead can be shared between several similar entries. If you're doing it anyway for the one already in the queue, you get to make the most of it for its friends too.
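A minimal Python sketch of that "friend queue" (batching by a shared key; names are made up):

    from collections import OrderedDict

    class BatchingQueue:
        """FIFO of batches: a new item that shares a key with a queued batch
        joins that batch instead of going to the back of the line."""
        def __init__(self):
            self._batches = OrderedDict()   # key -> items, ordered by key arrival

        def push(self, key, item):
            if key in self._batches:
                self._batches[key].append(item)   # jump the queue, stand next to a friend
            else:
                self._batches[key] = [item]

        def pop_batch(self):
            """Remove and return the oldest batch, so its shared overhead is paid once."""
            return self._batches.popitem(last=False)

    q = BatchingQueue()
    q.push("img-small.png", "request-1")
    q.push("img-large.png", "request-2")
    q.push("img-small.png", "request-3")    # joins request-1's batch
    print(q.pop_batch())   # ('img-small.png', ['request-1', 'request-3'])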
I chose it because I live here and it's totally true :)