Anil Kumar Muppalla

This book provides a guide to the distributed computing technologies of Hadoop and Spark, from the perspective of industry practitioners.

Features: describes the fundamentals of building scalable software systems for large-scale data processing in the new paradigm of high-performance distributed computing; presents an overview of the Hadoop and Spark ecosystem, followed by step-by-step instructions for installation, programming and execution; reviews the basics of Spark, including resilient distributed datasets, and examines Hadoop streaming and working with Scalding; provides detailed case studies on approaches to clustering, data classification and regression analysis; explains the process of creating a working recommender system using Scalding and Spark; supplies working source code to aid understanding through step-by-step implementation.