Twelve years ago, when I wrote the first articles for “Cracking the Code: Breaking Down the Software Development Roles,” I made a conscious and perhaps controversial decision not to include the database administrator or database architect among the roles. At the time, few organizations dealt with data at a scale that required a dedicated role in the software development process. The solution architect could design the organization’s data structures as part of their overall role. However, the world of data has gotten bigger since then.
Big Data
Today, we’re facing more volume, greater velocity, and a more dynamic variety of data sources. We’re no longer talking about just the typical relational databases that have been popular for decades. This expansion of data requires techniques and skills unlike the approaches we have historically used.
Multithreaded data processing is an improvement over the single-threaded approaches that popularized data processing in the 1980s. However, even multithreaded approaches, which rely on a single computer with multiple threads of execution, break down when the amount of processing necessary to extract meaning exceeds the capacity of a single machine.
The Rise of Service-based Computing
In 1999, users at home could donate the spare computing cycles on their computers to the search for extraterrestrial intelligence through the SETI@Home project run through UC Berkeley. This wasn’t the first use of widely distributed computing or grid computing, but it is the project that captured the imagination of Internet users everywhere. Suddenly, they had the chance to be the ones who found “ET.” By design, the project distributed massive amounts of data to many computers, which performed computations on the data to see whether any interesting bits were likely to be more than background noise. SETI@Home was just one of the distributed computing projects that brought awareness to the kind of problems where a single computer wasn’t going to be enough.
IBM, Microsoft, and others are now offering computing and machine learning services to help organizations cope with the data that they’re capturing and make sense of it so they don’t have to mobilize an army of committed volunteers. The platforms aim to provide the computing power and the machine learning necessary to extract the information hidden in the volumes of data. Instead of organizations needing to build and deploy their own data centers with dedicated computing resources, the resources to transform data into information and meaning are available for rent.
It’s Not About the Data, It’s About the Insights
Even though the amount of data that we’re capturing is staggering, it’s not the data that’s interesting. What’s interesting is what the data can tell you—if you’re able to analyze it. The individual readings on the performance of an engine aren’t important, but the ability to predict when the engine needs maintenance or is likely to fail—that’s important.
Data scientists aren’t focused on data storage the way data architects and database administrators were. Instead, they’re focused on the conversion of data into information and, ultimately, insights that the business can use to make better decisions. This means looking for new ways to analyze the data that reveal insights the business can use to its advantage.
Standing on Sets and Statistics
The traditional software development professional is familiar with a procedural approach to solving problems. Developers, leads, and architects are well-schooled in the methods and benefits of procedural construction. A procedural approach is like automating an incredibly dutiful but unoriginal worker. The computer is told which steps (the procedure) to perform, in what order, and under what conditions it should repeat an operation or branch between multiple paths. However, data scientists work not only with procedural approaches but with set-based logic as well. The thinking style differs because it looks for gaps and intersections. It functions based on equality and inequality relationships between different sets of information.
Even though some developers have encountered set-based logic in their work, data scientists must be comfortable and fluent in their ability to manipulate sets of information.
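To make the contrast concrete, here is a minimal sketch that answers the same question twice: once procedurally, step by step, and once with set operations that describe the intersections and gaps directly. The customer IDs and the function names are hypothetical, invented only for illustration.

```python
# Hypothetical customer IDs, invented purely for illustration.
purchased = {101, 102, 103, 104}
clicked_ad = {103, 104, 105}


def procedural_overlap(first, second):
    """Procedural style: visit each item in order and test it."""
    result = []
    for item in sorted(first):
        if item in second:
            result.append(item)
    return result


def set_based_summary(first, second):
    """Set-based style: state the relationships rather than the steps."""
    return {
        "intersection": first & second,  # members of both sets
        "only_first": first - second,    # gap: in first but not second
        "only_second": second - first,   # gap: in second but not first
    }


overlap = procedural_overlap(purchased, clicked_ad)
summary = set_based_summary(purchased, clicked_ad)
```

Both versions find the customers who clicked the ad and also purchased, but the set-based version also exposes the gaps (purchasers who never clicked, and clickers who never purchased) without any extra looping logic.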
In addition, unlike other roles in the software development lifecycle, the data scientist needs a specialized skill outside the realm of software development. Because data scientists look for insights about relationships between various bits of data, they need a solid foundation in statistics to be able to look for and generate statistical values like correlation to answer the questions they pose and find inexact relationships between different data sets.
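As one example of the kind of statistical value mentioned above, the sketch below computes a Pearson correlation coefficient using only the Python standard library. The function name and the engine readings are assumptions made up for illustration; in practice a data scientist would more likely reach for an established statistics library.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and sums of squared deviations.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariation / math.sqrt(spread_x * spread_y)


# Hypothetical readings: engine running hours and a vibration measurement.
running_hours = [100, 200, 300, 400, 500]
vibration = [1.1, 1.9, 3.2, 3.8, 5.1]

r = pearson_correlation(running_hours, vibration)
```

A value of `r` near 1 or -1 suggests a strong linear relationship, while a value near 0 suggests little or none. Correlation alone does not show that one measurement causes the other.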
Where’s the Position Heading, Anyway?
The growth in data has reached a tipping point. Whether it’s social network analysis, click history, or purchasing data, organizations are seeing real business value in the data that’s locked up in their databases, and data scientists are the key to unlocking the potential of that data.
Capturing that value means hiring the people who have the skills to connect the processing algorithms to the data and harness the computing power to create those outcomes.
The Good, The Bad, and The Ugly
Data science is expanding rapidly as Internet of Things devices record all kinds of data from all sorts of places. That means great opportunity and more than a few challenges. Here is how the good, the bad, and the ugly break down:
- Good: There is great opportunity to find new ways to extract insights from data.
- Good: Computing and storage resources can be purchased in large quantities.
- Good: Data scientists are in strong demand and will likely remain so for some time.
- Bad: As algorithms and approaches evolve, you’ll always feel out of date.
- Bad: All data needs cleanup, and a substantial amount of your time will be spent on this work.
- Ugly: Trial and error will mean lots of “failures” and few triumphs.
In Conclusion
The data scientist role is in rapidly growing demand and calls for a different set of skills than the other software development roles. If you loved your statistics class and love finding patterns that other people can’t see, this might be right for you.