When does class start/end?
Classes begin promptly at 9:00 am and typically end at 5:00 pm.
This four-day hands-on training course delivers the key concepts and knowledge developers need to use Apache Spark to develop high-performance, parallel applications on the Cloudera Data Platform (CDP).
Hands-on exercises allow students to practice writing Spark applications that integrate with CDP core components, such as Hive and Kafka. Participants will learn how to use Spark SQL to query structured data, how to use Spark Streaming to perform real-time processing on streaming data, and how to work with “big data” stored in a distributed file system.
After taking this course, participants will be prepared to face real-world challenges and build applications that support faster decisions, better decisions, and interactive analysis across a wide variety of use cases, architectures, and industries.
This course is designed for developers and data engineers. All students are expected to have basic Linux experience and basic proficiency in either Python or Scala. Basic knowledge of SQL is helpful. Prior knowledge of Spark and Hadoop is not required.
Introduction to Zeppelin
HDFS Introduction
YARN Introduction
Distributed Processing History
Working with DataFrames
Introduction to Apache Hive
Hive and Spark Integration
Data Visualization with Zeppelin
Distributed Processing Challenges
Spark Distributed Processing
Writing, Configuring, and Running Spark Applications
Introduction to Structured Streaming
Message Processing with Apache Kafka
Structured Streaming with Apache Kafka
Aggregating and Joining Streaming DataFrames
Appendix: Working with Datasets in Scala