Lecture 7 – Data Cleaning and EDA
Presented by Anthony D. Joseph
Content by Anthony D. Joseph, Joseph Gonzalez, Deborah Nolan, Joseph Hellerstein
One of the following Google Forms, chosen at random, will give you an alphanumeric code once you submit it. Enter this code into the “Lecture 7” question in the “Quick Check Codes” assignment on Gradescope to get credit for this Quick Check. You must submit by Monday, September 21st at 11:59 PM to receive credit.
Exploratory data analysis and its position in the data science lifecycle. The relationship between data cleaning and EDA.
Exploring different data storage formats and their tradeoffs.
Primary keys and foreign keys. Eliminating redundancy in tables.
Defining and discussing the variable types quantitative discrete, quantitative continuous, qualitative ordinal, and qualitative nominal.
Discussing the granularity and scope of our data to ensure that it's appropriate for analysis. Discussing various methods of encoding time, and flaws to be aware of.
Ways in which our data can be incorrect or corrupt. Different methods for addressing missing values, and their tradeoffs.
Summarizing the process of EDA.
A demo of EDA on real data.
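As a small illustration of the missing-value tradeoffs mentioned above, the sketch below compares two common options on a toy dataset: dropping incomplete records and filling gaps with the mean. The records, the field name temp_c, and the use of None as the missing-value marker are all hypothetical choices made for this example; they are not taken from the lecture demo.

```python
from statistics import mean

# Hypothetical toy records; None marks a missing measurement.
records = [
    {"id": 1, "temp_c": 20.0},
    {"id": 2, "temp_c": None},
    {"id": 3, "temp_c": 24.0},
    {"id": 4, "temp_c": 22.0},
]

# Values that were actually observed.
observed = [r["temp_c"] for r in records if r["temp_c"] is not None]

# Option 1: drop records with a missing value.
# Keeps only real measurements, but loses rows (and may bias the sample).
dropped = [r for r in records if r["temp_c"] is not None]

# Option 2: mean imputation.
# Keeps every row, but the filled value is invented and shrinks the variance.
fill = mean(observed)
imputed = [
    {**r, "temp_c": fill if r["temp_c"] is None else r["temp_c"]}
    for r in records
]

print(len(dropped), "rows after dropping")
print(len(imputed), "rows after imputing with", fill)
```

Neither option is right in general. Which one is acceptable depends on why the values are missing and on what the later analysis needs.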