Big Data Processing with Apache Spark Overview
Organizations across diverse disciplines are inundated with data. The ability to process large datasets effectively and efficiently is critical to making data-driven business decisions and to building data-intensive services (e.g., recommendations, predictions, and diagnoses).
In this course, you will learn to use the Apache Spark framework for big data management and analysis, focusing on Spark's fundamental concepts, overall architecture, components, APIs, and language interfaces. The emphasis is on learning through practical examples and use cases drawn from real-world applications.
Apache Spark is an open-source cluster computing framework. While the well-established Hadoop platform relies on the disk-based MapReduce paradigm, Spark was designed from the ground up to exploit aggregate cluster memory for small- to medium-sized datasets and to scale gracefully to large ones. Spark offers a unified stack of tightly integrated components, making it exceptionally easy to build applications that seamlessly combine different processing models. The framework caters to a variety of applications involving streaming data, structured data, unstructured data, and graph data.
Who is this course for?
Big Data Processing with Apache Spark is for any data enthusiast who wants to derive value from datasets. It is an ideal course for aspiring data engineers and data scientists as well as for:
- Data engineers who wish to broaden their knowledge of modern distributed computing platforms
- Data analysts and scientists who are interested in learning techniques to scale their mathematical models to large datasets
- Individuals with some computing background who are curious about the exciting new field of big data
This course is open to both beginner and experienced programmers; familiarity with programming in a language such as Java, C++, or Python is encouraged, and experience with Java or Scala is a plus. Students are expected to understand the basics of compiling and running programs and of object-oriented programming, and to be familiar with basic Unix/Linux command-line utilities. They should be able to install open-source software on their laptops.
Source Code Management
We will use GitHub throughout the course for sharing and maintaining code, so students should familiarize themselves with common GitHub activities such as forking and cloning repositories, committing code, creating branches, and opening pull requests.
Beginner-level experience with SQL and database query processing is also recommended.
Upon completion of this course, students will: