Organizations today commonly face Big Data challenges. The term Big Data refers to the use of multiple technologies, both old and new, to extract meaningful information from a huge pile of data. The data set is not only large but also poses its own challenges in capture, management, and processing. Unlike data persisted in relational databases, which is structured, big data can be structured, semi-structured, or unstructured, and it is collected from different sources in different sizes. This article delves into the fundamental aspects of Big Data and its basic characteristics, and gives you a hint of the tools and techniques used to deal with it.
The term Big Data gives an impression only of the size of the data. This is true in a sense, but it does not give the whole picture. The challenges associated with it are not about size alone. In fact, the idea evolved to name a sea of data collected from various sources, in various formats and sizes, that is at the same time difficult to harness or get value out of. The rise of emerging tech and the increasing use of the Internet gave an impetus to the volume and disparity. The volume keeps increasing with every information exchange over the Internet, or even through the minuscule IoT objects we use. Simply answering a phone call or switching on a CCTV camera can generate a chain of data. Today, most devices are connected online. If an organization wants to collect that information, it needs a special way of processing it because the data generated will be massive. Moreover, there may be no uniformity in the format of the data captured. This adds to the complexity because we have to deal with structured, semi-structured, and unstructured data. The tools we have used until now to organize data are incapable of dealing with such variety and volume. Therefore, we can say that the term Big Data applies to data that cannot be processed or analyzed with the traditional tools and techniques normally used for structured or semi-structured data, such as relational databases, XML, and so forth.
Organizations today are replete with unstructured or semi-structured data available in raw format. This data can be a wealth of information if it is processed and value is extracted from it. The problem is how to do that. Traditional techniques and tools, such as relational databases, are inadequate to deal with such a large volume of variegated data. It is also a double-edged problem for organizations: simply discarding the data would mean losing valuable information, if there is any, while keeping it is a waste of resources. Therefore, tools and techniques are sought to deal with the problem. Sometimes we are quite sure that there is potential value lying in the pile, a gold mine of information, but without proper tools it is quite taxing for the business process to reap any benefit from it. The volume of data has exploded in recent years, and there seems to be no stopping it.
Big data is getting bigger every minute in almost every sector, be it tech, media, retail, financial services, travel, or social media, to name just a few. The volume of data processing we are talking about is mind-boggling. Here are some statistics to give you an idea:
- The Weather Channel receives 18,055,555 forecast requests every minute.
- Netflix users stream 97,222 hours of video every minute.
- Skype users make 176,220 calls every minute.
- Instagram users post 49,380 photos every minute.
These numbers are growing every year as more people use the Internet. In 2017, Internet usage reached 47% (3.8 billion people) of the world’s population. With an ever-increasing number of electronic devices, our data output is estimated at approximately 2.5 quintillion bytes per day and growing.
Google Search statistics show 3.5 billion searches per day, which is over 40,000 searches every second on average. We should also remember that other search engines handle searches as well. The Email Statistics Report, 2015-2019, by The Radicati Group, Inc., projects 2.9 billion e-mail users by 2019.
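To put the per-minute and per-day figures quoted above on a common scale, the short sketch below converts them with plain arithmetic. The numbers are taken from the text; the dictionary keys and the function names `per_day` and `per_second` are illustrative choices, not part of any cited source.

```python
# Convert the quoted rates to a common scale (figures come from the article text).
MINUTES_PER_DAY = 24 * 60
SECONDS_PER_DAY = MINUTES_PER_DAY * 60

# Per-minute figures from the list above; the labels are illustrative.
per_minute = {
    "weather forecast requests": 18_055_555,
    "Netflix hours streamed": 97_222,
    "Skype calls": 176_220,
    "Instagram photos": 49_380,
}


def per_day(rate_per_minute: int) -> int:
    """Scale a per-minute count up to a per-day count."""
    return rate_per_minute * MINUTES_PER_DAY


def per_second(rate_per_day: float) -> float:
    """Scale a per-day count down to a per-second rate."""
    return rate_per_day / SECONDS_PER_DAY


for name, rate in per_minute.items():
    print(f"{name}: {per_day(rate):,} per day")

# 3.5 billion searches per day, expressed per second.
searches_per_second = per_second(3.5e9)
print(f"searches per second: {searches_per_second:,.0f}")
```

Scaled this way, the forecast requests alone come to roughly 26 billion per day, and the search figure works out to a little over 40,000 per second, which agrees with the statement above.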
As an attempt to estimate how many photos were taken in 2017: if there were 7.5 billion people in the world in 2017, about 5 billion of them with mobile phones, a probable guess is that 80% of those phones have built-in cameras. That means about 4 billion people use their phone cameras. If each of them takes 10 photos per day, which amounts to 3,650 photos per year per person, this adds up to roughly 14.6 trillion photos taken per year.
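The estimate above is a chain of multiplications, so it is easy to reproduce and to adjust. The sketch below uses only the assumptions stated in the paragraph (5 billion phone owners, 80% with cameras, 10 photos per day); the variable names are illustrative.

```python
# Back-of-the-envelope photo estimate, using the assumptions stated in the text.
phone_owners = 5_000_000_000      # people with mobile phones
camera_share_percent = 80         # guessed share of phones with a built-in camera
photos_per_day = 10               # assumed photos per person per day
DAYS_PER_YEAR = 365

# Integer arithmetic keeps the result exact.
camera_users = phone_owners * camera_share_percent // 100
photos_per_person_per_year = photos_per_day * DAYS_PER_YEAR
photos_per_year = camera_users * photos_per_person_per_year

print(f"camera users: {camera_users:,}")
print(f"photos per person per year: {photos_per_person_per_year:,}")
print(f"photos per year: {photos_per_year:,}")
```

The product is 14.6 trillion. Changing any single assumption, for example halving the photos per day, changes the result proportionally, which shows how sensitive such guesses are to their inputs.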
Therefore, when we say Big Data, we essentially refer to data or sets of records that are too large to grasp readily. They are produced by search engines, business informatics, social networks, social media, genomics, meteorology, weather forecasts, and many other sources. Such data clearly cannot be handled using existing database management tools and techniques. Big Data opens an arena of big challenges in terms of storage, capture, management, maintenance, analysis, research, new tools to handle it, and the like.
Characteristics of Big Data
As with all big things, if we want to manage Big Data, we need to characterize it to organize our understanding. Big Data can be defined by one or more of three characteristics, the three Vs: high volume, high variety, and high velocity. These characteristics raise some important questions that not only help us to decipher it, but also give insight into how to deal with massive, disparate data at a manageable speed within a reasonable time frame so that we can get value out of it, do some real-time analysis, and provide a subsequent response quickly.
- Volume: Volume refers to the sheer size of the ever-exploding data of the computing world. It raises the question about the quantity of data.
- Velocity: Velocity refers to the speed at which data arrives and must be processed. It raises the question of how fast the data has to be handled.
- Variety: Variety refers to the types of data. It raises the question of how disparate the data formats are.
Note that we characterize Big Data by three Vs only to simplify its basic tenets. It is quite possible for the data to be relatively small yet too variegated and complex, or relatively simple yet huge in volume. In addition to these three Vs, we can easily add another: Veracity. Veracity determines the accuracy of the data in relation to the business value we want to extract. Without veracity, it is infeasible for an organization to apply its resources to analyze the pile of data. The more accurate the context of the data, the greater the chance of getting valuable information. Therefore, veracity is another characteristic of Big Data. Companies leverage structured, semi-structured, and unstructured data from e-mail, social media, text streams, and more. But, before analysis, it is important to identify the amount and types of data under consideration that would impact business outcomes.
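The difference between structured, semi-structured, and unstructured data is easier to see with a small example. The sketch below reads one sample of each kind using only the Python standard library. All of the sample data (the CSV rows, the JSON records, and the e-mail text) is made up for illustration, and the regular expression is a deliberately crude stand-in for real text analysis.

```python
import csv
import io
import json
import re

# Structured: a fixed set of columns, and every row has the same fields.
csv_text = "user_id,country,purchases\n1,IN,3\n2,US,5\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
total_purchases = sum(int(row["purchases"]) for row in rows)

# Semi-structured: self-describing records whose fields can differ from one to the next.
json_text = (
    '[{"user": "asha", "tags": ["iot", "sensor"]},'
    ' {"user": "li", "location": {"city": "Pune"}}]'
)
records = json.loads(json_text)
fields_per_record = [sorted(record) for record in records]

# Unstructured: free text with no schema; a pattern match pulls out one detail.
email_body = "Hi team, the sensor in room 12 reported 31.5 C at noon. Please check."
temperatures = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*C\b", email_body)]

print(total_purchases)
print(fields_per_record)
print(temperatures)
```

The structured sample can be summed directly because the column is known in advance. The semi-structured records have to be inspected to find out which fields each one carries, and the unstructured text yields a value only after a pattern is guessed. That growing effort is what variety adds to the problem.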
Tools and Techniques
Artificial Intelligence (AI), IoT, and social media are driving data complexity through new forms and sources. For example, it is crucial that big data coming through sensors, devices, networks, and transactions be captured, managed, and processed in real time with low latency. Big Data enables analysts, researchers, and business users to make more informed decisions faster, using historic data that was otherwise unattainable. One can use text analysis, machine learning, predictive analytics, data mining, and natural language processing to extract new insight from the available pile of data.
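Low-latency processing usually means handling each reading as it arrives instead of waiting to store the full data set first. As a minimal sketch of that idea, the generator below keeps a moving average over the most recent readings of a feed. The function name `moving_average`, the window size, and the sample sensor values are all assumptions made for illustration; production systems use dedicated stream-processing platforms for this.

```python
from collections import deque
from typing import Iterable, Iterator


def moving_average(readings: Iterable[float], window: int) -> Iterator[float]:
    """Yield the average of the last `window` readings after each new reading."""
    recent = deque(maxlen=window)
    total = 0.0
    for value in readings:
        if len(recent) == window:
            total -= recent[0]  # the oldest reading is evicted by the append below
        recent.append(value)
        total += value
        yield total / len(recent)


# Hypothetical temperature feed from one sensor.
sensor_feed = [20.0, 22.0, 21.0, 30.0, 29.0]
averages = list(moving_average(sensor_feed, window=3))
print(averages)
```

Each reading is processed once, and only the last few values are held in memory, so the cost per reading stays constant no matter how long the feed runs.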
Technology has evolved to manage massive volumes of data, which previously was expensive and required the help of supercomputers. With the emergence of social media like Facebook and search engines like Google and Yahoo!, Big Data projects got impetus and grew into what they are today. Technologies such as MapReduce, Hadoop, and Bigtable have been developed to fulfill today’s needs.
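To give a feel for the MapReduce idea mentioned above, here is a single-process sketch of a word count in plain Python. It imitates the three conceptual stages (map, group by key, reduce) in memory. It is not Hadoop's API; the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample documents are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(document: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by their key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values for one key into a single count."""
    return key, sum(values)


documents = ["big data is big", "data is everywhere"]
mapped = (pair for doc in documents for pair in map_phase(doc))
word_counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(word_counts)
```

In a real cluster, the map calls would run in parallel on different machines that each hold part of the data, and the grouping step would move pairs across the network. The sketch keeps only the logical shape of the computation.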
NoSQL repositories are also mentioned in relation to Big Data. They are an alternative to relational databases. These databases do not organize records in tables of rows and columns as conventional relational databases do. There are different types of NoSQL databases, such as Content Store, Document Store, Event Store, Graph, Key Value, and the like. They typically do not use SQL for queries, and they follow a different architectural model. They are found to facilitate Big Data analytics in a favorable manner. Some popular names are HBase, MongoDB, CouchDB, and Neo4j. Apart from them, there are many others.
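The contrast between the relational model and a key-value or document model can be sketched with the Python standard library alone. Below, `sqlite3` plays the relational side, while a plain dictionary of JSON strings stands in for a document store. The dictionary is only an illustration of the data model, not the client API of MongoDB, CouchDB, or any other product, and the table, keys, and records are hypothetical.

```python
import json
import sqlite3

# Relational model: a fixed schema of rows and columns, queried with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (id, name, city) VALUES (?, ?, ?)", (1, "Asha", "Pune"))
relational_row = conn.execute(
    "SELECT name, city FROM users WHERE id = ?", (1,)
).fetchone()
conn.close()

# Document/key-value model (a dict stand-in): each value is a self-contained
# document, looked up by key, and documents need not share the same fields.
document_store = {}
document_store["user:1"] = json.dumps({"name": "Asha", "city": "Pune"})
document_store["user:2"] = json.dumps(
    {"name": "Li", "devices": ["phone", "watch"], "prefs": {"lang": "en"}}
)

user2 = json.loads(document_store["user:2"])

print(relational_row)
print(user2["devices"])
```

Adding the `devices` list to the relational version would require a schema change or a second table, whereas the document version simply stores a differently shaped record under a new key. That flexibility is one reason such stores are used for varied data.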
Big Data has opened a new opportunity for harvesting data and extracting value from it, data that would otherwise lie waste. It is impossible to capture, manage, and process Big Data with traditional tools such as relational databases. The Big Data platform provides the tools and resources to extract insight from the volume, variety, and velocity of data. These piles of data now have the means and a viable context to be used for various purposes in the business process of an organization. Therefore, to pinpoint exactly what type of data we are talking about, the primary step is to understand it and its characteristics.