Part 2: Data Deep Dive

4.2. Part 2: Data Deep Dive

Now that we understand the variables in our datasets through the data dictionaries created in the previous part, it’s time to dive deeper into the data. In this part, we will perform an exploratory data analysis (EDA) for each of our anchor problems (see Part 3: Anchor Problems). The goal of this EDA is to uncover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations.

4.2.1. Task

For this lab, we will use the dataset from Task 2: Netflix Dataset. Refer to the data dictionary created previously and perform a thorough exploratory data analysis (EDA) on the dataset.

Your analysis should include, but is not limited to, the following:

  • Check for duplicate records.

  • Check for missing values. Identify which columns have missing values and the percentage of missing values in each column.

  • Look at the types of variables (categorical, numerical, datetime, etc.) and their distributions, and decide whether they need any transformation or formatting.

  • Plot histograms for numerical variables to understand their distributions and ranges.

  • Plot bar charts of the year-wise release of shows/movies to see trends over time. Also plot movies and shows separately to see whether their trends differ.

  • Lastly, since we want to build a recommendation system later, we want to define a similarity metric between two shows/movies. Analyze the features available in the dataset and suggest which features can be used to define similarity between two shows/movies. Provide a brief explanation for your choices after experimenting with and without different sets of features. For now, we only need a basic understanding of which features can be used for similarity calculation. Do not incorporate text-based features like title, cast, etc. at this stage.
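
As a starting point for the duplicate, missing-value, and variable-type checks listed above, here is a minimal pandas sketch. It runs on a tiny made-up table rather than the real data. The column names (`type`, `director`, `release_year`, `date_added`) are assumptions and should be replaced with the names from your own data dictionary.

```python
import pandas as pd

# Toy stand-in for the Netflix data. Column names and values are
# assumptions for illustration, not taken from the real dataset.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "director": ["A. Director", None, "B. Director", "A. Director"],
    "release_year": [2019, 2020, 2018, 2019],
    "date_added": ["September 1, 2020", "January 5, 2021", None,
                   "September 1, 2020"],
})

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(df.duplicated().sum())

# Percentage of missing values per column, largest first.
missing_pct = (df.isna().mean() * 100).sort_values(ascending=False)

# Convert the date column from text to datetime; values that cannot be
# parsed (and missing values) become NaT instead of raising an error.
df["date_added"] = pd.to_datetime(df["date_added"], errors="coerce")

print("duplicate rows:", n_duplicates)
print(missing_pct)
print(df.dtypes)
```

On the real dataset you would load the file first (for example with `pd.read_csv`) and then decide, column by column, whether duplicates should be dropped and how missing values should be handled.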
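
The histogram and year-wise bar chart tasks above could be approached along the following lines. This is only a sketch on made-up rows. It assumes a `type` column with the values `Movie` and `TV Show` and a numeric `release_year` column, which may not match your dataset exactly.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data; the values are made up and the column names are assumptions.
df = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "Movie", "TV Show", "TV Show"],
    "release_year": [2018, 2019, 2019, 2019, 2020, 2020],
})

# Rows are release years, columns are title types, cells are title counts.
counts = pd.crosstab(df["release_year"], df["type"])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Histogram of a numerical variable to see its range and distribution.
df["release_year"].plot(kind="hist", bins=5, ax=axes[0], title="release_year")

# Separate bar charts so movie and TV show trends can be compared.
counts["Movie"].plot(kind="bar", ax=axes[1], title="Movies per year")
counts["TV Show"].plot(kind="bar", ax=axes[2], title="TV shows per year")

fig.tight_layout()
# In a notebook the figure renders inline; in a script, call plt.show()
# or fig.savefig(...) to view it.
```

Building the count table first keeps the numbers available for inspection, so any trend visible in the charts can be checked against the underlying counts.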
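
For the similarity task above, one possible baseline is a weighted mix of per-feature similarities over non-text features. The sketch below is not a prescribed method. The feature names (`genres`, `type`, `rating`, `release_year`), the weights, and the `year_scale` parameter are all hypothetical choices meant to be varied while you experiment.

```python
def jaccard(a, b):
    """Overlap of two category sets: |a & b| / |a | b| (0.0 if both empty)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(x, y, year_scale=10.0):
    """Weighted similarity in [0, 1] between two titles given as dicts.

    Feature names, weights, and year_scale are illustrative assumptions.
    """
    genre_sim = jaccard(x["genres"], y["genres"])
    type_sim = 1.0 if x["type"] == y["type"] else 0.0
    rating_sim = 1.0 if x["rating"] == y["rating"] else 0.0
    # Release years more than `year_scale` years apart count as dissimilar.
    year_gap = abs(x["release_year"] - y["release_year"])
    year_sim = max(0.0, 1.0 - year_gap / year_scale)
    return (0.5 * genre_sim + 0.2 * type_sim
            + 0.15 * rating_sim + 0.15 * year_sim)


# Made-up titles for illustration only.
a = {"type": "Movie", "genres": {"Dramas", "Comedies"},
     "rating": "PG-13", "release_year": 2018}
b = {"type": "Movie", "genres": {"Dramas"},
     "rating": "PG-13", "release_year": 2020}
c = {"type": "TV Show", "genres": {"Kids' TV"},
     "rating": "TV-Y", "release_year": 2005}

print(similarity(a, b))  # shares type, rating, and one genre with a
print(similarity(a, c))  # shares nothing with a
```

Dropping a feature (setting its weight to zero and rescaling the others) and checking how the nearest neighbours of a few familiar titles change is one simple way to run the with/without comparison the task asks for.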