Lecture 6 – Data Cleaning and EDA
Presented by Anthony D. Joseph
Content by Anthony D. Joseph, Joseph Gonzalez, Deborah Nolan, Joseph Hellerstein
A reminder – the right column of the table below contains Quick Checks. These are not required, but they are suggested to help you check your understanding.
Exploratory data analysis and its position in the data science lifecycle. The relationship between data cleaning and EDA.
Exploring different data storage formats and their tradeoffs.
Primary keys and foreign keys. Eliminating redundancy in tables.
Defining and discussing the terms quantitative discrete, quantitative continuous, qualitative ordinal, qualitative nominal.
Discussing the granularity and scope of our data to ensure that they are appropriate for analysis. Discussing various methods of encoding time, and pitfalls to be aware of.
Ways in which our data can be incorrect or corrupt. Different methods for addressing missing values, and their tradeoffs.
Summarizing the process of EDA.
A demo of EDA on real data.
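
As a small illustration of the missing-values topic listed above, the sketch below compares two common ways of addressing missing values: dropping them and filling them with the mean of the observed values. It is only a sketch under stated assumptions, not the lecture's own code: the `readings` list is made-up data, `None` is assumed to mark a missing value, and the helper names `drop_missing` and `impute_mean` are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical readings (not from the lecture); None marks a missing value.
readings = [12.0, None, 15.0, 9.0, None, 14.0]


def drop_missing(values):
    """Keep only the observed values. The sample gets smaller."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Replace each missing value with the mean of the observed values."""
    observed = drop_missing(values)
    if not observed:
        raise ValueError("cannot impute: no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


dropped = drop_missing(readings)
imputed = impute_mean(readings)

# Compare sample size, mean, and spread under each strategy.
print("dropped:", len(dropped), mean(dropped), pstdev(dropped))
print("imputed:", len(imputed), mean(imputed), pstdev(imputed))
```

The tradeoff shows up in the printed summary: dropping keeps only real observations but loses rows, while mean imputation keeps every row and the same mean but makes the data look less spread out than the observed values actually are.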