Data Science using Big Data Hadoop (Development)

1. Big Data – the actual reason for Hadoop

  • Understanding Big data
  • Collecting and cleaning data
  • Traditional approach for processing and its challenges
  • Big data vs Hadoop

2. An introduction to Hadoop

  • Hadoop overview
  • Hadoop components
  • Hadoop distributions
  • Getting started
  • What is HDFS
  • What is MapReduce
  • Hadoop stack
  • Hands On – Hadoop setup and basic operations

3. HDFS

  • HDFS explained
  • High availability
  • Federation
  • Architecture
  • File system Shell
  • Hands On

4. MapReduce

  • MapReduce flow
  • Hello World
  • MapReduce API concepts
  • Mapper
  • Reducer
  • Other components – combiner, partitioner, shuffle/sort
  • Hadoop 1.x vs 2.x
  • Hadoop streaming API
  • Hands on with Eclipse

5. YARN

  • Architecture
  • Scheduler
  • Resource Manager (RM)
  • RM HA
  • YARN commands
  • Hands On with YARN applications

6. Integrating Hadoop into the Workflow

  • RDBMS interaction using Sqoop
  • Workflow management using Oozie
  • Back office jobs with ZooKeeper
  • Hands On with actual data sets

7. Data Mining

  • Unstructured data mining using Pig
  • Structured data mining using Hive
  • Hands On with actual data sets

8. HBASE

  • Problem with SQL Database
  • Introduction to NoSQL
  • Hands On Exercises
  • Introduction to HBASE
  • Column Families
  • Delving deeper into HBASE
  • HBASE Architecture
  • HBASE Hands-On Exercises

9. Delving Deeper Into the Hadoop API

  • More about ToolRunner
  • Testing with MRUnit
  • Reducing Intermediate Data With Combiners
  • The configure and close methods for Map/Reduce Setup and Teardown
  • Writing Partitioners for Better Load Balancing
  • Hands-On Exercise
  • Directly Accessing HDFS
  • Using the Distributed Cache

10. Practical Development Tips and Techniques

  • Debugging MapReduce Code
  • Using LocalJobRunner Mode for Easier Debugging
  • Retrieving Job Information with Counters
  • Logging
  • Splittable File Formats
  • Determining the Optimal Number of Reducers
  • Map-Only MapReduce Jobs
  • Hands-On Exercise

11. Joining Data Sets in MapReduce

  • Map-Side Joins
  • The Secondary Sort
  • Reduce-Side Joins