Lecture 25 – Big Data
by Josh Hug (Fall 2018)
Important: This lecture is a combination of two lectures from the Fall 2018 semester.
- The Google Slides version of the lecture slides has many formatting issues. The PDF version may be easier to read.
- 25.2 contains some redundancy, so feel free to skip the repeated parts.
- Unlike in the Fall 2018 semester, SQL was covered toward the beginning of the class this summer. The lecture also refers to “the project” or “project 2,” which corresponds to Homework 9 this summer.
Data in the organization. Operational data stores and data warehouses. Extract, transform, load (ETL).
The multidimensional data model. Fact tables and dimension tables. Star schemas and snowflake schemas. Online analytical processing (OLAP).
Data warehouses and data lakes.
Distributed file systems and fault tolerance.
Distributed aggregation with MapReduce. The MapReduce abstraction.
MapReduce technologies. Hadoop and Spark. Resilient Distributed Datasets (RDDs).
Spark notebook demo.
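
As a rough illustration of the MapReduce abstraction listed above, here is a minimal single-machine sketch of a word count in plain Python. It is not taken from the lecture and does not use Hadoop or Spark; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample documents are hypothetical, chosen only to show the map, group-by-key, and reduce steps that a real framework would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical sample input, for illustration only.
counts = word_count(["big data big ideas", "data lakes and data warehouses"])
print(counts)
```

Because each call to `map_phase` touches only one document and each call to `reduce_phase` touches only one key, both steps could in principle run in parallel on separate machines, which is the property the abstraction is designed to expose.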