Training

Big Data (Hadoop/Spark)

Course Syllabus

Module 1:
  • Introduction to Big Data and Hadoop
  • Components of Hadoop and Hadoop Architecture
  • HDFS, MapReduce & YARN Deep Dive
  • Installation & Configuration of Hadoop in a VM (Single Node)
  • Multinode Installation (3 Nodes)
    • On-Premise on Local Machines
    • Cloud
  • Performance Tuning, Advanced Administration Activities & Monitoring of the Hadoop Cluster
    • Hadoop Benchmarking (TeraGen & TeraSort on 10 GB of Data)
    • Hadoop Web UI Monitoring
    • Advanced Hadoop Administration Commands from the CLI
    • Tuning the Hadoop Cluster by Tweaking the Performance Tuning Parameters for the HDFS & MapReduce Frameworks
    • Node Commissioning (Addition) and Decommissioning (Removal)
    • Running the Balancer to Redistribute Data in Hadoop
  • Writing MapReduce Programs in Java: WordCount
    • Webserver Log Analysis
    • Recommendation Engine (Product Recommendation Generator)
    • Sentiment Analysis
    • Custom Record Readers, Partitioners, Combiners
    • Distributed Copy
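The WordCount exercise above can be sketched conceptually before writing any Java — a minimal pure-Python simulation of the map, shuffle, and reduce phases of a MapReduce job (the course itself implements these jobs in Java against the Hadoop API; the function names and sample data here are illustrative only):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/sort: group all values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts emitted for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data on hadoop", "hadoop runs mapreduce"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts["hadoop"])  # 2
```

The same three-phase structure carries over directly to the Java Mapper and Reducer classes; a Combiner (covered above) is simply the reducer applied early, on each mapper's local output.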
  • Introduction to Pig and Pig Latin: Installation & WordCount
    • Webserver Log analysis
    • Sentiment Analysis
    • Processing JSON data in Pig using Elephant Bird library
    • Advanced Pig processing using Piggybank Library
    • Building Pig UDFs and calling from Pig scripts
  • Advanced Pig Concepts
    • Performance Tuning parameters
    • Controlling parallelism
    • Running Pig Scripts on Tez
  • Introduction to Hive: Installation & WordCount
    • Webserver Log analysis
    • Sentiment Analysis
    • Recommendation Engine in Hive (Product-Based Recommendation)
    • Hive Performance Tuning Parameters
    • Loading CSV Data, JSON Data, etc., in Hive
    • Hive File Formats, including Text, ORC & Parquet
  • Introduction to Sqoop
    • Advanced Sqoop Import/Export Options using Queries
    • Controlling Parallelism
  • Introduction to HBase: Installation and HBase Queries
  • ZooKeeper for Coordination; HBase Multinode Installation with ZooKeeper
  • Cloudera and Hortonworks Distributions of Hadoop
  • Deploying a Multinode Hadoop Cluster using Ambari
  • Workflow Scheduling using Oozie for Automation
Module 2:
  • Other Components of the Hadoop ecosystem
  • Flume for Realtime Data Collection
  • Kafka for Realtime Log Analysis: Log Filtering
  • Spark for Realtime In-Memory Analytics
  • Advanced Spark Concepts, Spark Programming APIs, Spark RDDs
    • Spark Controlling Parallelism, Partitions & Persistence
    • Spark SQL
    • Spark Streaming
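The windowed aggregations taught in the Spark Streaming topic can be illustrated without a cluster — a minimal pure-Python sketch of a sliding-window count over micro-batches, mimicking the behaviour of Spark Streaming's windowed reductions (the function name, window size, and sample data are assumptions for illustration, not Spark's API):

```python
from collections import Counter, deque

def windowed_counts(batches, window_size):
    # Keep only the last `window_size` micro-batches and aggregate over them,
    # the way a sliding window in Spark Streaming re-reduces recent batches
    window = deque(maxlen=window_size)
    results = []
    for batch in batches:
        window.append(Counter(batch))
        total = Counter()
        for counts in window:
            total += counts
        results.append(dict(total))
    return results

batches = [["error", "info"], ["error"], ["info", "info"]]
print(windowed_counts(batches, window_size=2))
```

Each result reflects only the two most recent micro-batches, so old events age out of the count automatically — the core idea behind windowed metrics such as "errors in the last 5 minutes".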
  • Scala Programming: Basics to Advanced
  • Python Introduction & Python Spark programming using PySpark
  • Spark for Realtime Log analysis: Analytics
    • Creating and Deploying End-to-End Web Log Analysis Solution
    • Realtime Log collection using Flume
    • Filtering the Logs in Kafka
    • Realtime Threat detection in Spark using Logs from Kafka Stream
    • Click Stream analysis using Spark
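The realtime threat-detection step above boils down to a stateful filter over a log stream — a minimal pure-Python stand-in for the Kafka-to-Spark flow that flags source IPs exceeding a failed-login threshold (the log format, threshold, and function names here are illustrative assumptions, not the course's exact pipeline):

```python
from collections import Counter

def detect_threats(log_stream, threshold=3):
    # Count failed logins per source IP as records arrive (Kafka-style stream),
    # yielding an IP the moment it crosses the threshold
    failures = Counter()
    flagged = set()
    for record in log_stream:
        ip, status = record.split()[:2]
        if status == "LOGIN_FAILED":
            failures[ip] += 1
            if failures[ip] >= threshold and ip not in flagged:
                flagged.add(ip)
                yield ip

stream = [
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.5 LOGIN_FAILED",
    "10.0.0.9 LOGIN_OK",
    "10.0.0.5 LOGIN_FAILED",
]
print(list(detect_threats(stream)))  # ['10.0.0.5']
```

In the full solution, Flume delivers the raw logs, Kafka does the first-pass filtering, and Spark holds the per-key running state across micro-batches.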
  • Hadoop MR2 (YARN) Deployment and Integration with Spark
  • Spark Machine Learning Concepts and the Lambda Architecture
  • Machine Learning using MLlib
  • Customer Churn Modeling using Spark MLlib
  • Zeppelin for Data Visualization; Spark Programming in Zeppelin using IPython Notebooks
  • Case Studies & POC – Run Hadoop on a medium-size dataset (~5 GB of data); the POC can be based on a realtime project from your company or Duratech's live project
  • Course conclusion
  • Final Steps
    • Project Evaluation and Exit Test
    • Profile Building to realign your profile for the Big Data industry
    • Placement assistance & Interview handling support
Course Prerequisites

(If prerequisite skills are not met, we will offer additional training to impart them, free of cost)

  • Working with Linux on the Command Line (CLI)
  • Knowledge of SQL
  • Any Programming Language
Note
  • The cloud account necessary for the course needs to be created by the students using their own debit cards (cloud providers charge a maximum of Rs. 70 to register the cloud account, with a validity of 1 year). You will get cloud credits worth approximately Rs. 20,000
  • Placement will be based on your attendance, regularity in the course, and your performance in the project & exit test evaluation
  • We support you in your placement until you get placed, and beyond
  • We will connect you with our alumni who have been working in the Big Data industry for several years; you can interact with them to learn about current industry trends and live projects, clarify your doubts, etc.