ALTERNATE UNIVERSE DEV

CodingBlocks

83. Search Driven Apps

We’re talking databases, indexes, search engines, and why they’re basically microwaves in this episode while Joe wears a polo, Allen’s quick brown fox jumps over whatever, and Michael gives out fake URLs.

Sponsors

  • Datadog.com/codingblocks – Sign up today for a free 14 day trial and get a free Datadog t-shirt after creating your first dashboard.

Survey Says

Anonymous Vote
Sign in with Wordpress
Now that you had some time to digest the news, how do you feel about Microsoft's acquisition of GitHub?
  • Very excited! Looking forward to the awesome things Microsoft will add to GitHub.
  • I'm concerned but not enough to do anything about it. Yet.
  • I don't care. At all. Should I?
  • OMG the sky is falling. Why? How could we let this happen?
  • Already packed up my code and moved to GitLab.

News

  • Thank you to everyone that left us a review:
    • iTunes – vr6apparatus, The All Ighty Ollar, BonerGs, MatiasWasAlreadyTaken, galusssss
    • Stitcher – Vothis, JavaRuns100BillionDevices, SoulSurvivor, Twentyone
  • The Cynical Developer (James Studdart) has Sean Martz on the show to talk about transitioning from a sys admin to a developer. (Cynical Developer Episode 79)
  • In the Orlando area on June 21? Come see Joe’s talk on Building Search Driven Apps with Elasticsearch at the Orlando Backend Developer’s Meetup.
    • Sign up for Elasticsearch and Go for the Backend at Meetup
    • Source code and Slides over on GitHub
  • Joe invents a new way to listen to podcasts … by topic.

Search Driven Apps

What’s the problem? Why do we need search driven apps?

  • Search is a core tenant of modern computing
    • Google, Shopping, Cortana, Spotlight, Contacts
    • Can’t use the phone, listen to music, buy underwear or watch TV without searching
  • In some cases, the number of items we’re searching are small, can get away with a simple LIKE % search
  • But people aren’t very good at searching, and Google has spoiled us
    • Auto-Complete helps with typos
      • What if there’s too much data? Or the user doesn’t know what the thing is called?
      • And what if there’s a lot of data?

Some interesting numbers:

  • Google servers 40k searches … per second (Ardor Seo)
  • Amazon sells over 500M products (ScrapeHero)
  • Splunk indexes 100’s of TB per day
  • 5B videos watched on YouTube every day (MerchDope)

Q: How are these services doing it?
A: Lots and lots of joins, and LIKE clauses?

Relational Databases …

  • Don’t scale horizontally very well
  • Are designed for ACID instead of BASE
    • ACID
      • From Wikipedia:
        • Atomicity – requires that each transaction be “all or nothing”:
        • Consistency – ensures that any transaction will bring the database from one valid state to another.
        • Isolation – ensures that the concurrent execution of transactions results in a system state that would be obtained if transactions were executed sequentially
        • Durability – ensures that once a transaction has been committed, it will remain so, even in the event of power loss, crashes, or errors
    • BASE – Basic Availability, Soft State, Eventual Consistency
  • In regards to the CAP Theorem, Relational DB have an emphasis on Consistency
  • CAP Theorem
    • From Wikipedia:
      • Consistency – every read receives the most recent write or an error
      • Availability – every request receives a (non-error) response – without guarantee that it contains the most recent write
      • Partition Tolerance – the system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes
  • Shredding is inefficient
  • Writing dynamic ANSI SQL is the pits

DB vs Search Engine…why not just do it all in SQL?

  • Why not just turn on full text search in SQL Server?
  • Elasticsearch Search Reference (Elastic Docs)
  • Ability to search across all indexes, quickly and easily…
    • Fast aggregations (facets)
    • Ranged aggregations
    • Paging
    • Scale
    • Migrations / Versions
    • Simpler query requests

Ever written a SQL query that started out simple

  • And ended up a monster?
  • Start with a LIKE
  • Add some dynamic JOINs
  • In clauses lead to temp tables
  • Paging leads to support for multi-page actions …
  • CTE, Pivoting, OpenQuery

Writing dynamic SQL stinks.

How Inverted Index Searches solve the problem

So … why don’t we use a NoSQL implementation, denormalize the heck out of it, and then index the heck out of it, then map reduce the heck out of it – Well… you just described a basic search engine!

  • NoSQL DB: Horizontal Scaling
  • DeNormalize our data
  • Index everything!
  • Map/Reduce to search quickly
  • Declarative language

A little bit about indexes …

  • Forward index, and …
  • Inverted index (Wikipedia)
  • Let’s consider them both at the same time using a book analogy.
    • A table of contents is a _forward index_. It tells you where to go in the book for a “document” (aka a chapter).
    • The index you’ve seen in books is an _inverted index_. It tells you where to go in the book for all uses of the relevant words.
    • Great answer on Stack Overflow.

But What’s a Reverse Index?

  • Let’s consider a row in a database the “document” we want.
  • A typical DB index would be a forward index. In other words, when we want row with ID = 123, we can seek right to it.
    • Or put another way, a typical DB index acts like a table of contents for the “document” (i.e. row) we want
  • Busy databases/indexes can suffer from index block contention, especially those indexes for monotonically increasing sequences. Like numerical based primary keys.
    • Consider that the index for a table is stored across several “blocks”. Let’s hypothetically say each block contains 100 keys.
    • Let’s also consider our table has a integer primary key that sequentially increases. So we have keys like 123, 124, 125, etc.
    • So, 0-99 are in the first block, 100-199 are in the second block, 200-299 are in the third block, etc.
    • If 100 write requests happen at the same time, it’s possible they could all queue up to write to the same block.
    • Best case, two blocks given that scenario
    • This is what a reverse index aims to solve.
  • A reverse index simply reverses the key
    • So for key = 123, it would go in the block as 321 (i.e. in our “300” block) and key = 124 would go in another block (in our “400” block) as 421, etc.
  • This way reads/writes are distributed.
  • In the case of 100 concurrent write requests, they would be spread across 10 different indexes given this scenario.

Back to Inverted Indexes

Take the sentence “The Quick Brown Fox Jumps Over the Lazy Dog”

  • Store the entire document
  • Throw away the “stop” words (like “the”)
  • Break the sentence down into tokens for a hash table
  • Each token contains an array of pointers to records
  • When the user searches for “lazy fox” we can break that sentence into tokens, and return the records that are pointed to by our indexes
  • Advanced rules apply: synonyms, antonyms, stemming, pluralization, scoring, etc
  • Example of synonym: “PWA” and “Progressive Web Apps” or “.Net” and “Dot Net” or “JS” and “JavaScript”
  • Because of the storage mechanism, counting is easy too – we don’t even need to fetch the document, just count pointers

Inverted Index Search Engines

  • Slow to Write, Fast to Read (Sort of…there’s near real time on many now)
  • NoSQL before it was called NoSQL
  • Scale Super Well
  • Wicked Fast Reads
  • Support Complex Filters
  • Declarative syntax

Popular Search Engines

Resources We Like

Tip of the Week

Episode source