ALTERNATE UNIVERSE DEV

CodingBlocks

98. Data Structures – Heaps and Tries

We dig into heaps and tries as Allen gives us an up to date movie review while Joe and Michael compare how the bands measure up.

Reading this episode’s show notes via your podcast player? Head to https://www.codingblocks.net/episode98 to find this episode’s full show notes and participate in the conversation.

Sponsors

  • O’Reilly Software Architecture Conference – Microservices, domain-driven design, and more. The O’Reilly Software Architecture Conference covers the skills and tools every software architect needs. Use the code BLOCKS during registration to get 20% off of most passes.
  • Discover.bot – a digital space for bot developers and enthusiasts of all skill levels to learn from one another, share stories, and move the conversation forward. New to bots? Get started with their Beginner’s Guide.
  • Clubhouse – The first project management platform for software development that brings everyone on every team together to build better products. Sign up for two free months of Clubhouse by visiting clubhouse.io/codingblocks.

Survey Says

Anonymous Vote
Sign in with Wordpress
How much time do you spend coding outside of work?
  • Nada. I don't write code for free.
  • Until I go to bed.
  • I take the weekend off. Otherwise I'm writing code.
  • Like a sine wave, I go through phases. Sometimes more. Sometimes less.

News

  • Thank you to everyone that took time out of their day to leave us a review:
    • iTunes: Pendelgeist, bothzoli
    • Stitcher: jbzcooper, Terrance, Saltire Steve
  • We’ll be giving away a ticket to React Amsterdam.

Trie as you may …

Heaps

What is it?
  • Heaps are specialized binary trees that keep data sorted-ish.
  • This makes them slower for lookups, but faster for insertions.
  • They are also really good at getting their roots plucked.
  • Wikipedia list 18 different types of heaps, but we focus on Min/Max heaps (particularly max).
  • Sorted-ish?
    • Heaps are very similar to binary search tree (BST), they have two values – but unlike a BST, the children are always less than the parent (or more in a min tree).
    • This means that your root node is always the biggest value in the whole tree.
    • Because heaps aren’t so strict on the ordering, it’s quicker, O(log n), to insert data than it is into a sorted array.
    • It’s also quicker on average to insert into than a BST, in both cases the insertion time comes down to the height of the tree but we can cheaply keep the heap balanced – no straight lines here!
    • The downside is that heaps are slower for lookups, because you potentially have to look at every node in the tree.
    • Finally, the killer feature – it’s really efficient to remove the root node.
How does it work?
  • Heaps are typically stored in an array. Because it’s a binary tree you can quickly calculate a node’s children based on its index.
  • Downside is you need to pre-allocate memory, and either have to set a max size or use a dynamic array data structure that will grow as needed.
  • Adding to the heap has cool names like “up-heap”, “bubble up”, “trickle up”.
    • Add your node to the bottom level (easy in an array, just shove it in the first open value).
    • If the new node is less than its parent, swap and repeat.
  • Extraction, downheap, trickle-down, bubble-down, etc.
    • Move the last filled value in the array to 0.
    • If this node is greater than one of its children, swap and repeat.
Pros
  • Fast insert O(1) on average, O(log n) for worst case.
  • Fast removal of root node.
Cons
  • Slow search O(n).
When to use?
  • When you need to quickly insert data, and you only care about the min or max value.
  • Don’t use a heap when you need to search for arbitrary values.
  • Priority Queues (BFS).

Tries

What is it?
  • Tries, (technically) pronounced “tree” as in retrieval, aka digital tree, (compressed) radix tree, prefix tree.
  • They are specialized tree where the nodes don’t matter.
  • Instead, the “key” or “value” or “payload” is associated with the edges, and the nodes just exist for convenience.
  • The true values in the tree only exist on the leaves, and siblings share what is called a common prefix.
How does it work?
  • English Spellchecker example:
    • English is a mess, every rule has an exception – q is not always followed by u, i is not always before e, y isn’t always a vowel.
    • You need to store every english word in order to implement a spell checker (~1M depending on how you count).
    • You could keep a hash table, or a big sorted array then scan the document word by word to look to see if it’s valid.
      • This should be fast enough too, worst case for hash or sorted array is O(log n) (worst case 20 operations to find a word in the WORST case!) but how much RAM would you need to keep 1M words in memory…
    • However, if you look at the data you’ll see that a lot of words are really similar – code, codes, coder, coders, coding, …
    • We can do better with tries!!
    • Let’s start with an empty trie, and every time we add a word, lets have an edge represent one letter of the word.
    • When we add the word “code” we get an edge for c, o, d, and e – when we add “coder” …we already have edges for 4 of the letters, so we just add an edge for the r.
    • When we stored each of those coding words in an array, we ended up storing 26 characters, using a trie: 10.
    • Sure, we saved some space – but think about what it takes to see if a word exists? Checking for “code” just means looking for a branch for each letter – the number of comparisons is equal to the number of letters.
    • To make things even better, it’s actually really easy to offer auto-complete now – as the user types a letter, we can take another step down the trie to see if it exists … and even follow the tree down to show what words we think you might be typing.
  • While the example used natural language, tries are also useful for permutations, digits, or even binary – basically anything that has a lot of identical prefix data.
  • Note there is actually a better data structure for an autocomplete/spell checker: Deterministic Acyclic Finite State Automaton (DAFSA for short)!
  • Tries also compress really well, like combining any nodes that only have one child – in the spellchecker example, imagine that “q” was always followed by a “u” – then you could combine to a “qu” node.
Pros
  • Often used in place of a hash.
    • Faster worse case lookup than an imperfect hash table.
    • No messy/bad hash functions to deal with.
    • No messy/bad collision strategies to deal with.
    • Easy to provide an ordered listing of keys with no additional data structures.
Cons
  • Trie lookups can be slower than hash lookups, particularly if the random-access time of the medium is bad.
  • Not good when there would be long meaningless chains (like 2.3333333333334).
  • Can potentially take more space than a hash table (imagine having the pointer over head for each character where there isn’t a lot of overhead, as opposed to a single pointer to a string).
When to use?
  • Fast insert, fast and efficient sorting, lots of duplicated data…like in the bowels of search engines, or spell checkers/autocomplete.

Resources We Like

Tip of the Week

  • Order yourself some dev stickers! (Amazon)
  • How to programmatic disable C# Console Application’s Quick Edit mode? (Stack Overflow)
  • Package Thief vs Glitter Bomb Trap (YouTube)
  • Use the Semantic Syntax Colorizer to liven up your IDE experience in Visual Studio Code.

Episode source