What is Big Data?
Big Data is a relatively recent technology trend, brought about by the timely convergence of a new storage model, new data sources, and inexpensive hardware. A distributed file system has matured in the form of an open source framework known as Hadoop. Data has moved far beyond human-generated textual content within transactions: machine-generated logs, RFID signals, location coordinates, imagery, video, and audio can now be analyzed in context. A datacenter cabinet filled with 1U servers holding commodity hard drives and running proven open source software is distinctly more affordable than a scaled-up single server with many cores, hundreds of gigabytes of memory, NAS storage, and the associated software licenses. Big data platforms and ecosystems emerged from the need to efficiently handle data volumes, velocities, and varieties that had not previously existed.
Three Vs of Big Data
Much has been written about the three Vs of Big Data: Volume, Velocity, and Variety. Doug Laney of Gartner is credited with the initial recognition of the Vs, writing of the concept in 2001 while an analyst at META Group. A fourth V is often referenced, usually cited as Value, Veracity, or Viability.
Data storage is a volume metric. The local big box retailers sell consumer-grade multi-terabyte enclosed drives for a few hundred dollars. These drives serve adequately in many businesses for local storage, with remote backup to offline storage. Larger firms invest in network attached storage (NAS) units costing a few thousand to tens of thousands of dollars.
The progression in storage requirements from gigabytes to terabytes to petabytes, exabytes, and zettabytes has occurred rapidly, and the volume of data continues to grow. Much of that growth comes from the velocity with which data is generated by the sensors, machines, and everyday devices that have become known as the Internet of Things. Boeing’s 787 Dreamliner generates 500GB of data per flight, with dozens of flights daily. The volume of images uploaded to Facebook daily follows a hockey-stick growth curve, and beyond storage for the images themselves, there are dozens of metadata elements associated with each one. There is context to files, whether a family photo on the beach or a slide deck from a conference. Mining such a variety of information starts with a map and reduce process across a range of servers.
Defining a data storage solution different from what existed at the time was the work of Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung of Google, who described it in a 2003 technical paper, The Google File System. A few years later, engineers at Yahoo! created the Hadoop Distributed File System (HDFS), referencing research from as far back as 1993. The Apache Software Foundation (ASF), an all-volunteer organization of developers, stewards, and incubators of open source projects and initiatives, defines Apache™ Hadoop™ as a software framework for reliable, scalable, distributed computing. ASF notes that Hadoop enables data-intensive distributed applications to work with thousands of nodes and exabytes of data.
Data loaded into HDFS is stored in fragments on multiple servers. The servers do not share memory or drive space; they are just commodity servers, and the data is not stored on a single device as it is with network attached storage. If a server malfunctions, the data can be replicated from a known good copy on a healthy server. Retrieving data for analysis takes advantage of multiple servers and their CPUs, rather than the finite CPU power of a centralized data server. Think about it: a central server with, for instance, eight CPUs must work through data from an attached storage subsystem, whereas Hadoop distributes the work to two or twenty servers, each with dedicated CPUs and local storage, to process subsets of the data in parallel. The performance benefit is evident, and the infrastructure cost is a fraction of the norm, especially at large scale.
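The divide-and-conquer idea can be sketched in a few lines of Python. This is a toy stand-in, not HDFS itself: worker processes play the role of commodity servers, each summing its local chunk of the data before a final combine step.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Each "server" (worker process) computes over its local subset of the data.
    return sum(chunk)

def distributed_total(records, workers=4):
    # Split the records into roughly equal chunks, one per worker,
    # mirroring how HDFS spreads blocks across commodity servers.
    size = max(1, len(records) // workers)
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    with Pool(workers) as pool:
        partials = pool.map(process_chunk, chunks)
    # Combine the partial results, as a reduce step would.
    return sum(partials)

print(distributed_total(list(range(1000))))  # 499500
```

The same total comes back regardless of how many workers the data is split across, which is the property that lets a cluster scale out by simply adding servers.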
MapReduce: Slicing and Dicing and Julienne Fries
Have you seen the barkers at a fair or market or on television? They hawk kitchen cutters and slicers. With a flick of the wrist, they show how simple it is, by rearranging a blade or turning a base at an angle, to cut many kinds of veggies (and fingers) into interesting patterns. Those machines are the MapReduce scripts of big data, and it is fitting that veggies begins with a V.
HDFS contains variety and/or significant volume and/or a high velocity of data acquisition. The impetus for a big data environment is the need to ‘query’ the system for analytical insights; that is the reason for acquiring the data. At its core, MapReduce is a framework for a large distributed sort. There are six key elements to the framework:
1. Read and pre-process metadata and data
2. Map function for distributing the load across the environment
3. Partition function
4. Compare function (remember, at its base, this is a sort)
5. Reduce function for collecting and collating and coalescing the results of the map and partition and compare functions
6. Write function, to return metadata and data, perhaps as input to HBase, the Hadoop environment’s column-oriented database engine.
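The six elements above can be sketched as a single-machine word count, the classic MapReduce illustration. This is a minimal sketch of the pattern, not Hadoop's actual API: the map, partition, compare (sort), and reduce steps are plain Python functions.

```python
from collections import defaultdict

def map_phase(documents):
    # 1-2. Read each document and emit intermediate (key, value) pairs.
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def partition(pairs, num_partitions=2):
    # 3. Partition function: route each key to a bucket (here by hash),
    #    so each reducer works on a disjoint set of keys.
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets

def reduce_phase(bucket):
    # 4. Compare: sort the bucket by key (the distributed sort at the core).
    # 5. Reduce: collate and coalesce the values for each key.
    counts = defaultdict(int)
    for key, value in sorted(bucket):
        counts[key] += value
    return dict(counts)

def word_count(documents):
    # 6. Write: merge reducer outputs into the final result.
    result = {}
    for bucket in partition(map_phase(documents)):
        result.update(reduce_phase(bucket))
    return result

docs = ["big data big insights", "data velocity"]
print(sorted(word_count(docs).items()))
# [('big', 2), ('data', 2), ('insights', 1), ('velocity', 1)]
```

In a real cluster, each bucket would land on a different server, and the reducers would run in parallel; the logic per element is the same.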
MapReduce provides the control layer for analytics. The actual analysis and visualizations are provided by other processes, often traditional visualization, dashboarding, and reporting systems operating against a data warehouse or data mart.
So, MapReduce can operate in parallel on many individual servers, with each instance of a MapReduce sort focused on a granular data operation. Operational data is expected, and public data is necessary for 360-degree analysis of almost any subject. Provisioning a big data environment can therefore lead to data hoarding.
Data hoarding is a condition that might befall the unwary team, early in its scaling out of a big data implementation. There are useful datasets, free to acquire, from thousands and thousands of sources. Wikipedia and StackExchange dumps are available. Census and financial datasets are alluring for inclusion with analytics performed on big data. Weather data is freely available, and useful in assessing location-based data from tweets, posts, and pins in the social media universe.
Let’s assume that an analyst has a number of key areas of interest in the data. MapReduce projects are iterative. If the minimal datasets turn out to be interesting, scale horizontally by adding more sources, or scale vertically by loading a few billion rows and a million documents. Adjust the MapReduce scripts and schedule the run. Rinse. Repeat.
Mature data development environments have a suite of technologies for provisioning, querying, scheduling, and maintaining a database. In the Hadoop framework, those capabilities are provided by concepts and code with names such as MapReduce, Pig, Hive, Sqoop, and ZooKeeper. Over time, these early implementations will mature, and industry-leading tools are already working directly with various aspects of the Hadoop framework. Already, data integration, migration, and ETL tools, both open source and commercial, abound for big data implementations.
Outside of HDFS, there is still a need for data storage. The usual suspects of Postgres, MySQL, SQL Server, and Oracle each have libraries for integrating with MapReduce output. The nature of big data is also served well by other data engines, known collectively as NoSQL (“Not Only SQL”) databases, which generally fall into one of four categories. Categories and examples of each:
1. Key-Value (Cassandra, BigTable, Riak)
2. XML (MarkLogic, BaseX)
3. Document (MongoDB, CouchDB)
4. Graph (Neo4j, GraphDB)
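The difference between the categories is easiest to see by modeling the same fact in each. The sketch below uses plain Python structures as stand-ins for the stores themselves (no real driver APIs are used), showing a "person follows person" record in key-value, document, and graph form.

```python
# Key-value: an opaque value looked up by a single key; the store knows
# nothing about the value's internal structure.
kv_store = {"user:1001": '{"name": "Ada", "follows": [1002]}'}

# Document: nested structure the engine can index and query by field.
doc_store = [{"_id": 1001, "name": "Ada", "follows": [1002]},
             {"_id": 1002, "name": "Grace", "follows": []}]

# Graph: explicit nodes and edges, suited to multi-level relationship
# traversal (the LinkedIn-style "who connects to whom" question).
nodes = {1001: "Ada", 1002: "Grace"}
edges = [(1001, "FOLLOWS", 1002)]

def followers_of(person_id):
    # Natural in the graph model: walk (or index) the incoming edges.
    return [src for src, rel, dst in edges if rel == "FOLLOWS" and dst == person_id]

print(followers_of(1002))  # [1001]
```

Answering the same "followers" question against the key-value store would require fetching and parsing every value, which is why matching the category to the query pattern matters.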
Each of the categories performs better on certain solutions than on others. Graph databases are well suited to efficiently maintaining multi-level relationships, as seen with LinkedIn, for instance. The NoSQL offerings are in early stages, when compared with the ecosystems and maturity of the ‘usual suspects.’ Passionate developer communities exist for the NoSQL open source tools. Depending on the nature of the problem domain that the big data environment addresses, one or more NoSQL databases will probably be implemented.
Typical BI Environment
If we think of the big data environment as a super cargo ship, offloading the cargo would be akin to moving contents to distribution centers (data warehouses), and then on to retail or manufacturing companies (data marts). Only upon arrival can the goods return value. This is where analysis of big data begins to use the analytical tools of a BI environment.
Does it seem like the data is moving around more than it should? Why not just leave the data in its raw state in HDFS and query it there for all of the pretty graphs and visualizations? Recall that a MapReduce job must run against the raw data. Unless the incremental results are saved to a datastore of some kind, the documents and raw data will need to be mapped, partitioned, compared, and reduced each time an analysis is desired. That works for a developer analyst with time to spare, but not for the financial, marketing, or operational analyst who seeks patterns from which to predict future outcomes. Hence, the data needs to be available for analysis and presentation with Excel, QlikView, Tableau, Gephi, Pentaho, Cognos, or Reporting Services.
A proper data warehouse takes more than a few days to implement. Using an agile methodology, sprints can elicit requirements and build a glossary of terms from which a warehouse and marts are modeled. When this data governance step is skipped, the result is loose terminology, losses in translation of results, gaps, and general misery a few short weeks into user testing.
Let’s not get too far ahead of the flow. Governance leads to integration scripting. ETL processes that take MapReduce results from HDFS into HBase or a NoSQL store will then need to move the data into the departmental marts for analysis on a daily basis. Even a company of 2,000 employees might have 100 reports and a dozen dashboards drawing on marts for analysis by sales, marketing, finance, operations, and production management.
Only after ensuring that dependencies among data operations are staged properly can the jobs be automated. It is not unusual to have 50 or more data processes executed in a controlled order nightly. Some companies have jobs that run every few minutes or seconds, or that handle streaming data.
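Staging jobs in dependency order is a topological sort, which the Python standard library provides directly. The job names below are hypothetical; each entry lists the jobs that must finish before it can start.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical nightly batch: each job maps to the set of jobs it depends on.
jobs = {
    "extract_hdfs": set(),
    "load_warehouse": {"extract_hdfs"},
    "refresh_sales_mart": {"load_warehouse"},
    "refresh_finance_mart": {"load_warehouse"},
    "rebuild_dashboards": {"refresh_sales_mart", "refresh_finance_mart"},
}

# static_order() yields a valid execution order: every job appears only
# after all of its dependencies.
order = list(TopologicalSorter(jobs).static_order())
print(order)  # extract_hdfs first, rebuild_dashboards last
```

A real scheduler adds retries, alerting, and parallel execution of independent jobs (the two mart refreshes could run concurrently), but the ordering guarantee is the same.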
Finally, the analysts can begin to work interactively with the data, drawing upon a mart and using innovative interface features to work with various measures. With enough analysis, some processes can be further automated, turning a report or visualization dormant in favor of alerts sent to mobile devices and dashboards when a mix of data elements is determined, programmatically, to be outside normal ranges. In either case, the presentation interface for the analyst is either a tool (Excel, Tableau) or a browser-based view with controls, as might be found in Pentaho or Cognos or Reporting Services.
In Part 1 of this exploration of big data and BI, key elements of the Hadoop framework were defined. Shifting gears, the article then followed the movement of big data output to and through a BI environment in a best-practices model.
In Part 2 of this series, a scenario will be presented and explored, using actual code examples and output.
About the Author: David Leininger
Dave Leininger has consulted for 30 years. In that time, he has discussed data issues with managers and executives in hundreds of corporations and consulting companies in 20 countries. Mr. Leininger has shared his insights on data warehouse, data conversion, and knowledge management projects with multinational banks, government agencies, educational institutions, and large manufacturing companies. He is a Senior Consultant at Fusion Alliance. Reach him at [email protected]