How Big things change the rules

What basketball teaches us about IoT databases

When Lew Alcindor arrived at UCLA in 1965, he already had a winning pedigree. After losing only 6 of the more than 100 games he played in his four high school years, and winning more than 70 in a row, he wasn’t the average college freshman. Add to that his height (7'2") and athletic ability, and UCLA had reason to be excited. Then again, the university already had the #1 ranked team in the country and had won the championship each of the previous two years, so for Alcindor to stand out, he’d have to be extra special.

Just as an outsized talent like Alcindor changed the rules of college basketball, the outsized scale of IoT data changes the rules for databases. To cope, an IoT database must do the following:

  • Decouple compute and storage — Growing compute in lock-step with storage when data arrives at high speed is too costly. But even if the costs could be contained, IoT workloads typically query only a small portion of the most recently inserted data. As a result, coupling compute to storage is horrendously inefficient when the bulk of the data is not regularly queried, because the bulk of the compute will sit consistently idle.
  • Leverage low-cost storage — With increasing data sizes come increasing costs. Amplifying this effect, many IoT use cases require the data (or at least aggregates of it) to be stored indefinitely for regulatory reasons. This clearly necessitates leveraging the cheapest storage possible.
  • Continue to support fast queries — While cheap storage is essential, if leveraging it causes queries to slow down dramatically, the value of the database is severely diminished. To solve this problem, IoT databases must employ intelligent caches (over multiple tiers) to keep low-cost (and low-performance) storage viable. Additionally, they must leverage auxiliary structures (like indexes and synopses) that let queries efficiently weave their way through the massive amounts of data; a minimal sketch of the synopsis idea follows this list.
  • Keep data open — With increasingly large data sets that must be maintained forever, system lock-in creates significant issues. If you choose to switch vendors after 5 years of ingesting data at megabytes per second, you’ll have hundreds of terabytes to migrate to the new system. To prevent this, IoT systems should store data in an open format that an entire ecosystem of tools can read directly, thereby preventing system or vendor lock-in.
  • Provide rich tooling for data analysis — Much of IoT analysis is ML-based and leverages time series analytics. The ideal IoT database must provide rich data science tooling (including up-to-date frameworks, notebooks for analysis, and model management infrastructure) as well as time series functions that let arriving data be analyzed quickly and easily (see the second sketch after this list).
  • Be continuously available — IoT systems have limited back-pressure capacity. As a result, if the database goes down, arriving data that cannot be stored is lost forever. To avoid data loss, and to allow for 24/7 analytics, the database must be immune to hardware and software failures.
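
To make the synopsis idea concrete, here is a minimal Python sketch of how cheap per-block metadata lets a query skip data it can never need. Everything here (the BlockSynopsis structure, the block layout, the timestamps) is invented for illustration and reflects no particular product’s internals.

    # Per-block synopsis: cheap metadata (min/max timestamp) kept for
    # every block of data that lands on low-cost storage.
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class BlockSynopsis:
        block_id: int
        min_ts: int  # smallest timestamp stored in the block
        max_ts: int  # largest timestamp stored in the block

    def blocks_to_scan(synopses: List[BlockSynopsis],
                       query_start: int, query_end: int) -> List[int]:
        """Return only the blocks whose time range overlaps the query's;
        every other block is skipped without ever being read."""
        return [s.block_id for s in synopses
                if s.max_ts >= query_start and s.min_ts <= query_end]

    # Four blocks, each covering roughly 1,000 time units.
    synopses = [BlockSynopsis(i, i * 1000, i * 1000 + 999) for i in range(4)]

    # A query over only the most recent data touches one block of four.
    print(blocks_to_scan(synopses, 3200, 3400))  # -> [3]

The same trick scales: the synopses stay small enough to keep in memory, while the data itself stays on cheap storage and is only read when a block actually matters to the query.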
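
And as a small illustration of the kind of time series function meant above, the sketch below downsamples per-second sensor readings to per-minute averages. It uses plain pandas rather than any particular vendor’s time series library, and the readings are fabricated for the example.

    import pandas as pd

    # Ten minutes of fabricated per-second temperature readings.
    readings = pd.DataFrame({
        "ts": pd.date_range("2021-01-01 00:00:00", periods=600, freq="s"),
        "temperature": [20.0 + (i % 60) * 0.01 for i in range(600)],
    }).set_index("ts")

    # Downsample: aggregate per-second readings into per-minute averages.
    per_minute = readings["temperature"].resample("1min").mean()
    print(per_minute)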

A database built with these requirements in mind:
  • Stores all data in the open Apache Parquet format and on shared storage, thus keeping data open and decoupling compute and storage (see the sketch after this list)
  • Leverages low-cost cloud object storage
  • Utilizes both indexes and synopses to dramatically speed up queries
  • Is packaged with Watson Studio and an advanced time series library for rich data analysis
  • Is designed to remain available for both ingest and queries, even in the presence of node failures
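
To show what the open Parquet format buys you in practice, here is a hedged sketch of reading such data directly with off-the-shelf ecosystem tools (pyarrow and pandas), with no export step and no proprietary driver. The path and file layout below are hypothetical.

    import pyarrow.parquet as pq

    # Read one day's worth of device events straight from shared storage.
    # (The path is hypothetical; any Parquet reader could do the same.)
    table = pq.read_table("/shared/iot/events/2021-01-01/part-000.parquet")

    # Hand the data to pandas (or Spark, Dask, DuckDB, and so on).
    df = table.to_pandas()
    print(df.head())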

Adam has been developing and designing complex software systems for the last 15+ years. He is also a son, brother, husband and father.
