Defining an Approach to Industrial Big Data

A few years back, in 2008, I was working at a digital advertising company. In measuring the audiences that together make up the internet population, and building a quantitative understanding of the people in those audiences, I learned the meaning of “big data”. An easy rule of thumb was: when you have enough data to fill a box of hard drives that you can’t pick up, you’ve got big data.

What we were building back then was really big data 1.0, developed for the Consumer Internet to support web search and ad targeting. The key challenges were to get the technology to work at all, and to scale cheaply enough to let us run our business. Now, internet-scale data processing is a commodity. Big data is getting ready for the Industrial Internet, and with that comes a whole new set of challenges.

The industrial world is a fundamentally different beast. Translating Consumer Internet technology to this world requires shifts in three major areas: the storage architecture, the level of security, and the quality of the data. (Another factor is data sovereignty, but we’ll save that topic for another post.) Without addressing these, we can’t credibly tackle Industrial Internet-scale data processing. Let’s discuss each in turn.

Storage

First, we must build a fundamentally different storage architecture. As data shifts from massive reams of similar event data to integrated time-series data, storing flat files on HDFS is no longer sufficient, nor are traditional horizontally scaling key-value stores (Cassandra, Aerospike, etc.). The critical differences in the structure of the data that make this change necessary are:

  • The searching and indexing needs of industrial big data systems. Consider a search one might perform in an event-based data store: find me all the people who visited cnn.com. This can be served by a standard B-tree index. Now consider a time-series query: find me all the voltage sensors that had a spike similar to this one. An exact-match index does not help here, because the query asks for similar shapes rather than equal values. Existing time-series databases (aka data historians) are not particularly well suited to this either: most were designed to store time-ordered series of data, not to search them.
  • Time-series data is necessarily continuous, while event data is not. For example, between any two temperature readings, you know the temperature had some value at every instant in between, even though it was not recorded. Between two clicks on different websites, by contrast, you can’t assume an intervening click.
  • We aren’t dealing with event data alone or time-series data alone, but with an integrated data stream that combines both. To crack analytics for industry, we need systems that integrate the capabilities of Consumer Internet data processing tools with the new time-series-oriented capabilities required for handling industrial data.
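
To make the contrast with an exact-match lookup concrete, here is a minimal sketch in Python of the "find sensors with a spike similar to this one" query. It is a brute-force scan under stated assumptions, not how any particular data historian works: the sensor IDs, the readings, and the helper names (`z_normalize`, `best_match`, `rank_sensors`) are all hypothetical, and z-normalized Euclidean distance is just one common way to compare shapes.

```python
import math


def z_normalize(window):
    """Rescale a window to zero mean and unit variance so shape, not scale, is compared."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    if std == 0:
        return [0.0] * n  # a flat window has no shape to compare
    return [(x - mean) / std for x in window]


def best_match(series, pattern):
    """Return (offset, distance) of the window in `series` most similar to `pattern`."""
    m = len(pattern)
    if m == 0 or len(series) < m:
        raise ValueError("series must be at least as long as a non-empty pattern")
    target = z_normalize(pattern)
    best_offset, best_dist = -1, math.inf
    # Slide a window over the series and keep the closest one.
    for offset in range(len(series) - m + 1):
        candidate = z_normalize(series[offset:offset + m])
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(candidate, target)))
        if dist < best_dist:
            best_offset, best_dist = offset, dist
    return best_offset, best_dist


def rank_sensors(readings_by_sensor, pattern):
    """Rank sensors by how closely any stretch of their readings matches the pattern."""
    scored = []
    for sensor_id, series in readings_by_sensor.items():
        offset, dist = best_match(series, pattern)
        scored.append((dist, sensor_id, offset))
    return sorted(scored)


# Hypothetical data: one sensor with a spike, one with a steady ramp.
spike = [1.0, 1.0, 5.0, 1.0, 1.0]
sensors = {
    "v-001": [230.0, 230.0, 230.0, 230.0, 250.0, 230.0, 230.0, 230.0],
    "v-002": [230.0, 231.0, 232.0, 233.0, 234.0, 235.0, 236.0, 237.0],
}
ranking = rank_sensors(sensors, spike)
print(ranking[0][1], "best matches the spike at offset", ranking[0][2])
```

Every query here touches every window of every series, which is exactly why this kind of search needs index structures that a B-tree over keys does not provide.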

Building a fundamentally different storage architecture enables the use of Consumer Internet tools and capabilities on industrial data. With a more flexible storage architecture, we can query the data and leverage the analytic and data processing capabilities from the Consumer Internet world.

Security

Second, we must build for security in depth. It’s no longer enough to secure data by putting a VPN around storage systems. Security must be built in at every layer: at the network boundary, at each interface that touches the data while it is in motion, and finally in the data itself, which must be encrypted at rest.
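
As a rough illustration of layering, the Python sketch below makes a request pass two independent checks before it is accepted: a network boundary check and a per-request signature check. Everything in it is an assumption made for the example (the `10.20.0.0/16` network, the demo key, the function names). Encryption at rest is deliberately left out, since that layer should come from a vetted cryptography library rather than hand-written code.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical values for illustration only.
ALLOWED_NETWORK = ipaddress.ip_network("10.20.0.0/16")
API_KEY = b"demo-key-not-for-production"


def inside_network_boundary(client_ip):
    """Layer 1: only accept traffic from the expected network."""
    return ipaddress.ip_address(client_ip) in ALLOWED_NETWORK


def sign(message):
    """Compute an HMAC-SHA256 signature over the request body."""
    return hmac.new(API_KEY, message, hashlib.sha256).hexdigest()


def request_is_authentic(message, signature):
    """Layer 2: verify the signature, using a constant-time comparison."""
    return hmac.compare_digest(sign(message), signature)


def authorize(client_ip, message, signature):
    """Every layer must pass; no single layer is trusted on its own."""
    if not inside_network_boundary(client_ip):
        return "rejected: outside network boundary"
    if not request_is_authentic(message, signature):
        return "rejected: bad request signature"
    # Layer 3 (not shown): data is decrypted only after the checks above,
    # and is stored encrypted at rest.
    return "accepted"


msg = b"read turbine-17 vibration"
print(authorize("10.20.3.4", msg, sign(msg)))
print(authorize("192.168.1.5", msg, sign(msg)))
print(authorize("10.20.3.4", msg, "0" * 64))
```

The point of the design is that a caller who gets inside the network still cannot read data without a valid signature, and a stolen key is useless from outside the boundary.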

The reason for the increased focus on security is the much higher sensitivity of industrial data. Again, let’s compare digital advertising data to industrial data. On the advertising side, the bulk of the data is anonymous web usage data that ties a cookie to the series of websites it visited and when each visit happened. Now, consider a data breach scenario with this type of data. It could be damaging to the advertising company’s reputation, but it doesn’t impact the anonymous users and there’s no way for a competitor to use that information.

Now, on the industrial side, you have operational data (readings from assets), maintenance data, logistics data, and several other kinds of data making up the constellation of interesting information about industrial assets. This data is much more sensitive by nature. A reasonably careful analysis of it can reveal information that could be very damaging to the asset operator. For example, operational data could indicate the future performance of a company, and thus give an attacker the ability to manipulate the stock price, among a whole host of other damaging activities.

The tools and approaches for achieving security in depth are not unknown, but they require significant work to integrate because they aren’t built into commodity systems, and the necessary engineering practices are rarely adopted across organizations. Building in more thorough security at every level is how we address the sensitive nature of industrial data.

Data Quality

Third, we must build for data quality, both in terms of what we collect and the stability of the collection mechanisms. We also need to manage and report on the backfilling of missing data (the filled-in values are called “imputed data” in statistics) so users have a transparent and accurate accounting of the state of the data.
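
One way to keep that accounting transparent is to carry a flag alongside every value. The Python sketch below backfills gaps in a regularly sampled series by linear interpolation and marks each filled-in point as imputed, so a report can state what share of the data was never actually measured. The `Sample` type, the function names, and the readings are hypothetical, and linear interpolation is only one possible imputation method.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: int          # timestamp in seconds
    value: float
    imputed: bool   # True when the value was filled in rather than measured


def backfill(measured, step):
    """Fill gaps by linear interpolation, marking every filled point as imputed.

    `measured` is a list of (timestamp, value) pairs sorted by timestamp,
    with timestamps on multiples of `step`.
    """
    if not measured:
        return []
    out = [Sample(measured[0][0], measured[0][1], False)]
    for (t0, v0), (t1, v1) in zip(measured, measured[1:]):
        t = t0 + step
        while t < t1:
            frac = (t - t0) / (t1 - t0)
            out.append(Sample(t, v0 + frac * (v1 - v0), True))
            t += step
        out.append(Sample(t1, v1, False))
    return out


def imputed_fraction(samples):
    """Share of samples that were imputed, for reporting to users."""
    if not samples:
        return 0.0
    return sum(s.imputed for s in samples) / len(samples)


# Hypothetical temperature readings with two missing points (t=20 and t=30).
readings = [(0, 20.0), (10, 21.0), (40, 24.0)]
filled = backfill(readings, step=10)
for s in filled:
    print(s.t, round(s.value, 2), "imputed" if s.imputed else "measured")
print("imputed share:", imputed_fraction(filled))
```

Because the flag travels with the data, downstream analytics can choose to exclude imputed points or to widen their uncertainty when a prediction leans heavily on them.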

Assertions based on data carry much higher stakes in the industrial world. If bad data in the consumer world (noisy, simply corrupted, or even generated by some bad actor for nefarious purposes) causes someone to show a digital advertisement to a robot, that just cost someone about $0.000008. In the industrial world, if we inaccurately predict that a $5,000,000 wind turbine needs maintenance, and it’s taken offline for a week for unnecessary maintenance, that’s non-trivial to the bottom line.

Within five years, data from the Industrial Internet is going to dwarf that of the Consumer Internet, and the economic impact will be enormous (some estimate that the spend will be $500 billion by 2020, with over $1.2 trillion in value). However, without some major changes to how we approach big data for the industrial world, much of that value will remain out of reach.

The question remains: what does this mean for those of us who are building industrial data analytics? It means we’ll have to tackle these problems head on, and we will see the emergence of a new architectural normal for Industrial Internet applications. It means we’ll have to struggle for interoperability as competing standards and approaches emerge with no clear winner. And finally, it means we will watch the next major global economic step change from the inside.