How Big Things Change the Rules

What basketball teaches us about IoT databases

Adam Storm
Jul 8, 2019


When Lew Alcindor arrived at UCLA in 1965, he already had a winning pedigree. After losing only 6 games in his four high school years (out of more than 100 played) and winning more than 70 consecutive games, he wasn’t the average college freshman. Add to that his height (7'2") and athletic ability, and UCLA had reason to be excited. That being said, the university had the #1 ranked team in the country and had won the championship the last two years in a row, so for Alcindor to stand out, he’d have to be extra special.

Due to NCAA regulations at the time, Alcindor was prohibited from playing on the varsity team when he arrived and instead led the freshman team. But when the freshman team was given the opportunity, midway through the year, to play the varsity team in an exhibition game, the Alcindor-led freshmen won by 15 points — yes, the freshman team beat the national champions by 15 points. Alcindor was extra special, and he was just getting started.

By the time his sophomore year arrived, Alcindor was ready to show the world the full breadth of his talents. In the 66–67 season, he averaged 29 points a game, and the Bruins won all 30 of their games, leading them to a National Championship in the NCAA tournament, in which Alcindor was named tournament MVP.

Unfortunately, he may have dominated a little too much in his sophomore year. It should come as no surprise that Alcindor’s height presented a significant advantage on the court, as it allowed him to dunk with ease, so much so that the NCAA thought it gave him an unfair advantage. He was literally too big for the game, and in the spring of ’67 the NCAA changed its rules, banning dunking from the college game (a rule which stood for almost a decade).

Fortunately for Alcindor, when the rules changed he was able to change too, and when the NCAA prevented him from dunking, he developed a vicious hook shot (known as the skyhook) which was even harder to defend than his dunk. At the end of his college career, he’d won a total of 88 games and lost only 2 (one of which he’d played with an eye injury), captured 3 National Championships, and was named tournament MVP three times (something that’s never been repeated).

He then went on to the NBA, where he changed his name to Kareem Abdul-Jabbar, proceeded to win 6 NBA titles, and to this day is widely regarded as the best center to ever play the game.

For the first 40+ years of its existence, the database world followed a certain set of rules. Systems were designed either for fast transactions or for fast analytics, but not both; data was persisted locally (on the compute node) for low-latency writes and reads; and, almost invariably, as data sizes grew, compute was required to grow in step. The arrival of massive amounts of IoT data over the past decade, however, is forcing all of those rules to change.

IoT data is dominating the world. According to IDC, the rate of global data growth will exceed 61% annually for the next 6 years, resulting in an additional 142 zettabytes of data being created. The bulk of this data will be machine-generated IoT data. To put 142 zettabytes into perspective, the largest hard disk available for purchase today holds 16 terabytes. It would take nearly 9 billion of these enormous drives to contain 142 zettabytes. It’s a truly remarkable amount of data.
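
For the curious, that drive count is simple back-of-the-envelope arithmetic. Here’s a quick sketch, assuming decimal units and the 16 terabyte drive capacity mentioned above:

```python
# Back-of-the-envelope check of the "nearly 9 billion drives" figure
# (assumes decimal units: 1 ZB = 10**21 bytes, 1 TB = 10**12 bytes).
ZETTABYTE = 10**21
TERABYTE = 10**12

total_data = 142 * ZETTABYTE      # projected additional data
drive_capacity = 16 * TERABYTE    # largest hard disk available in 2019

print(f"{total_data / drive_capacity:,.0f} drives")  # 8,875,000,000 -> nearly 9 billion
```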

This tremendous data growth, which is simply too big for traditional database systems to handle, requires a rethinking of the traditional database rules and a redesign of existing systems so that they can efficiently store and query IoT data.

Specifically, IoT databases must:

  • Ingest data quickly — It’s not uncommon for IoT use cases to generate millions of data points per second (from a group of IoT devices). Additionally, ingested data must be made immediately available for analysis so that insights can be derived without delay.
  • Decouple compute and storage — Growing compute in lock-step with storage when data is arriving at a high rate is too costly. But even if the costs could be contained, it’s typical for IoT workloads to query only a small portion of the most recently inserted data. As a result, it’s horrendously inefficient to couple compute to storage when the bulk of the data is not regularly queried, because most of the compute will sit idle.
  • Leverage low cost storage — With increasing data sizes come increasing costs. Amplifying this effect is the fact that many IoT use cases require the data (or at least aggregated data) to be stored indefinitely for regulatory reasons. This clearly necessitates leveraging the cheapest storage possible.
  • Continue to support fast queries — While cheap storage is essential, if leveraging it causes queries to slow down dramatically, the value of the database is severely diminished. To solve this problem, IoT databases must employ intelligent caches (over multiple tiers) to ensure that low cost (and low performance) storage remains viable. Additionally, they must leverage auxiliary structures (like indexes and synopses) to allow queries to efficiently weave their way through the massive amounts of data (the first sketch after this list illustrates the synopsis idea).
  • Keep data open — With increasingly large data sets that must be maintained forever, system lock-in creates significant issues. If you choose to switch vendors after 5 years of ingesting data at megabytes per second, you’ll have hundreds of terabytes of data to migrate to the new system. To prevent this, IoT systems should store data in an open format that an entire ecosystem of tools can read directly (see the second sketch after this list), thereby preventing system or vendor lock-in.
  • Provide rich tooling for data analysis — Much of IoT analysis is ML-based and leverages time series analytics. The ideal IoT database must provide rich data science tooling (including up-to-date frameworks, notebooks for analysis, and model management infrastructure) as well as time series functions which allow the arriving data to be quickly and easily analyzed.
  • Be continuously available — IoT systems have limited capacity to absorb back pressure. As a result, if the database goes down, arriving data that cannot be stored will be lost forever. To avoid data loss, and to allow for 24/7 analytics, the database must remain available in the face of hardware or software failures.
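
To make the synopsis idea above more concrete, here is a minimal, generic sketch of a min/max block summary (sometimes called a zone map). It is purely illustrative, with made-up names and block sizes, and is not Db2 Event Store’s actual implementation:

```python
# A toy "synopsis": per-block min/max summaries that let a query skip blocks
# which cannot possibly contain matching rows. Names, block size, and
# structure are hypothetical, not Db2 Event Store internals.
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 1_000_000  # rows summarized per synopsis entry (illustrative)

@dataclass
class BlockSummary:
    block_id: int
    min_ts: int
    max_ts: int

def build_synopsis(timestamps: List[int]) -> List[BlockSummary]:
    """Record the min/max timestamp for each block of ingested rows."""
    synopsis = []
    for start in range(0, len(timestamps), BLOCK_SIZE):
        block = timestamps[start:start + BLOCK_SIZE]
        synopsis.append(BlockSummary(start // BLOCK_SIZE, min(block), max(block)))
    return synopsis

def blocks_to_scan(synopsis: List[BlockSummary], lo: int, hi: int) -> List[int]:
    """Return only the blocks whose [min, max] range overlaps the query range."""
    return [s.block_id for s in synopsis if s.max_ts >= lo and s.min_ts <= hi]
```

A query over the most recent few minutes of data then touches only the last few blocks, which is what keeps cheap (and slow) storage viable for the cold bulk of the data.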

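And to illustrate what an open format buys you: Parquet files can be read directly by a wide ecosystem of tools (Spark, pandas, pyarrow, and so on) without going through the database that wrote them. A minimal sketch using the pyarrow library, with a made-up sensor schema:

```python
# Writing and reading an open Parquet file with pyarrow.
# The schema and file name are made up for illustration.
import pyarrow as pa
import pyarrow.parquet as pq

readings = pa.table({
    "device_id": [17, 17, 42],
    "ts":        [1562592000, 1562592001, 1562592000],
    "temp_c":    [21.4, 21.5, 19.8],
})

pq.write_table(readings, "readings.parquet")

# Any Parquet-aware tool can now read the same file directly.
df = pq.read_table("readings.parquet").to_pandas()
print(df)
```
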
The rules have changed. Only a database which addresses these newly identified design constraints will excel in the IoT world.

Recently, IBM launched the latest version of Db2 Event Store, a data store designed specifically for IoT workloads, which addresses these constraints in the following ways:

  • Is able to ingest millions of data points per second and makes the ingested data immediately available for analysis
  • Stores all data in the open Apache Parquet format and on shared storage, thus keeping data open, and decoupling compute and storage
  • Leverages low-cost cloud object storage
  • Utilizes both indexes and synopses to dramatically speed up queries
  • Is packaged with Watson Studio and an advanced time series library for rich data analysis (the sketch following this list illustrates the kind of analysis involved)
  • Is designed to remain available for both ingest and queries, even in the presence of node failures

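Db2 Event Store’s time series library has its own API; purely to illustrate the kind of time series analysis referred to above, here is a generic sketch using pandas as a stand-in (the data, column names, and threshold are made up):

```python
# Generic time series analysis sketch using pandas as a stand-in
# (not Db2 Event Store's time series library; data and threshold are made up).
import pandas as pd

# One hour of per-second readings from a hypothetical sensor.
raw = pd.DataFrame({
    "ts": pd.date_range("2019-07-08", periods=3600, freq="s"),
    "temp_c": 20.0 + pd.Series(range(3600)).mod(60) * 0.01,
}).set_index("ts")

per_minute = raw["temp_c"].resample("1min").mean()           # downsample to 1-minute averages
smoothed = per_minute.rolling(window=15).mean()              # 15-minute rolling mean
anomalies = per_minute[(per_minute - smoothed).abs() > 0.2]  # flag minutes far from the trend
print(anomalies)
```
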
Additionally, in the 2.0 release, Db2 Event Store has been integrated with the Db2 Common SQL Engine, the most sophisticated SQL-based analytics query engine available. Regardless of your query needs, Db2 Event Store can handle them quickly and efficiently.

If you’re looking for a solution to your IoT problems, we have you covered. Feel free to reach out for more information.


Adam Storm

Adam has been developing and designing complex software systems for the last two decades. He is also a son, brother, husband and father.