Turning data scientists into action heroes: The rise of self-service Hadoop

Click here to view original web page at gigaom.com

Mike is chief operating officer at Altiscale.

The unfortunate truth about data science professionals is that they spend a shockingly small amount of time actually exploring data. Instead, they are stuck devoting significant amounts of time wrangling data and pouring resources into the tedious act of prepping and managing it.

While Hadoop excels at turning massive amounts of data into valuable insights, it’s also a notorious culprit for sucking up resources. In fact, these hurdles are serious bottlenecks to big data success, with research firm Gartner predicting that through 2018, 70 percent of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges.

Whether it’s time stuck in a queue behind higher priority jobs or functioning as a Hadoop operations person, — building their own clusters, accessing data sources, and running and troubleshooting jobs — data scientists are wasting time on administrative tasks. Sure, it’s necessary to do some heavy lifting to successfully perform analysis on data. But it isn’t the best use of a data scientist’s time, and it’s a drain on an organization’s resources.

That said, how can data scientists stop serving as substitute Hadoop administrators and become analytics action heroes?

Just as the business intelligence industry has moved to a more self-service model, the Hadoop industry is also moving to a self-service model. Operational challenges are moving to the background, so that data scientists are liberated to spend more time building models, exploring data, testing hypotheses, and developing new analytics.

Self-service Hadoop solutions simplify, streamline, and automate the steps needed to create a data exploration environment. Self-service is achieved when a provider (one who runs and operates a scalable, secure Hadoop environment) delivers a data science platform for the analytics team.

With a self-service environment, data scientists can focus on the data analysis, while being confident that the data and Hadoop operations are well taken care of. And these environments can be kept separate from production environments, ensuring that test data science jobs don’t interfere with a production Hadoop environment that is core to business operations, thereby reducing risk of operational mishaps.

As we see a rise in self-service Hadoop, organizations will realize the benefits of analytics action heroes and their super power contributions. Here are a few reasons why:

Hadoop has unlimited potential to drive business forward. Yet, it can quickly become a drain on internal operational resources when running in production and at scale. Organizations need to devote more time on data science and not on the Hadoop infrastructure to fully realize big data’s potential — self-service tools make this a reality.