
Netflix EDA and Visualization


Background

Netflix is a subscription-based streaming service that allows its members to watch TV shows and movies without commercials on an internet-connected device. [Reference]

In this analysis, we use a dataset that describes all the movies and TV shows on Netflix to answer the following research questions: what is Saudi Arabia's top genre, which country produces the most content, and what is the best month to release content?

About the Data

The data represents the catalog of movies and TV shows on Netflix as of September 25, 2021. The dataset was sourced from Kaggle. It has a total of 12 columns: show_id, type, title, director, cast, country, date_added, release_year, rating, duration, listed_in, and description.

Importing required packages

Loading the dataset
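The loading cell is not shown here, so the following is only a minimal sketch of how the data might be read with pandas. The file name netflix_titles.csv is an assumption, and the two rows are made-up stand-ins so that the snippet runs without the download.

```python
import io

import pandas as pd

# Made-up stand-in for the Kaggle CSV (assumed file name: netflix_titles.csv).
# Only the 12 column names are meant to mirror the real dataset.
sample_csv = io.StringIO(
    "show_id,type,title,director,cast,country,date_added,release_year,"
    "rating,duration,listed_in,description\n"
    's1,Movie,Example Film,,,United States,"September 25, 2021",2020,'
    "PG-13,90 min,Documentaries,Made-up row\n"
    's2,TV Show,Example Series,,Actor A,,"September 24, 2021",2021,'
    'TV-MA,2 Seasons,"International TV Shows, TV Dramas",Made-up row\n'
)

# With the real download this would be: df = pd.read_csv("netflix_titles.csv")
df = pd.read_csv(sample_csv)

print(df.shape)   # (rows, columns)
print(df.head())  # first 5 rows; empty fields are read as NaN
```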

The dataset contains 12 columns that we can use for exploratory analysis. Just by observing the first 5 rows, we can see that there are NaN values in multiple columns. This takes us to the next step, cleaning the dataset.

Cleaning the data

From the plot above, we see that the director column contains the most NaN values, while other columns such as cast and country also contain nulls. To get a better idea of how much data is missing, we will find which columns contain null values and the percentage of nulls in each.
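One way to compute the percentage of nulls per column is sketched below. The small DataFrame is made up for illustration and null_pct is a hypothetical variable name.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": [None, "Dir 1", None, "Dir 2"],
    "country": ["India", None, "United States", "India"],
})

# isna() yields a boolean frame; the column mean is the fraction of nulls.
null_pct = (df.isna().mean() * 100).round(2)

# Keep only the columns that actually contain nulls, largest share first.
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```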

About 30% of the director column is null, followed by country and cast. Before we can begin any analysis of the data, we will have to first deal with these null values.

We will not use imputation, a method for dealing with missing values that fills them either with a statistical "best guess" (e.g., mean, median, or mode) or with values predicted by techniques such as KNN or tree-based models. For director, country, and cast, an unknown value is better than an incorrect one. Instead, we will use the fillna function from pandas to indicate that the information is missing.
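A minimal sketch of this step follows. The placeholder string "Unknown" is an assumption, since the text does not say which label was used, and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "title": ["A", "B"],
    "director": [None, "Dir 1"],
    "cast": ["Actor A", None],
    "country": [None, "India"],
})

# "Unknown" is an assumed placeholder label, not taken from the source.
fill_cols = ["director", "cast", "country"]
df[fill_cols] = df[fill_cols].fillna("Unknown")
print(df)
```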

Since the percentage of null values in each of date_added, rating, and duration is less than 1%, we will instead drop every row that contains a NaN value in any of these columns.
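Dropping rows based on a subset of columns could look like the sketch below. The data is made up, so the number of dropped rows here does not match the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "date_added": ["September 25, 2021", None, "July 1, 2019", "May 5, 2020"],
    "rating": ["PG", "TV-MA", None, "R"],
    "duration": ["90 min", "1 Season", "100 min", "95 min"],
})

rows_before = len(df)

# Drop a row if any of the three listed columns is null; other columns are ignored.
df = df.dropna(subset=["date_added", "rating", "duration"]).reset_index(drop=True)

rows_dropped = rows_before - len(df)
print(f"{rows_dropped} of {rows_before} rows dropped")
```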

Only 17 rows out of the original 8807 rows are dropped.


EDA and Visualization

We will first perform some analysis that will help us better understand our dataset.

So, movies clearly outnumber TV shows: of the 8,790 remaining titles, roughly 6,100 are movies and 2,700 are TV shows.
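A count per content type can be obtained with value_counts on the type column. This is a sketch on made-up rows, not the original cell.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "type": ["Movie", "Movie", "TV Show"],
})

# Number of titles per content type, most frequent first.
type_counts = df["type"].value_counts()
print(type_counts)
```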

We now want to plot the year in which titles were added to Netflix. However, since the date_added column has the object dtype, we first convert it to datetime format.
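The conversion might be sketched as follows, assuming the dates are written as "Month day, year" and may carry stray whitespace. The column name year_added is hypothetical and the rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "date_added": ["September 25, 2021", " August 4, 2017", "January 1, 2019"],
})

# Strip stray whitespace, then parse strings such as "August 4, 2017".
df["date_added"] = pd.to_datetime(df["date_added"].str.strip(), format="%B %d, %Y")

# year_added is a hypothetical helper column used for grouping.
df["year_added"] = df["date_added"].dt.year

# Titles added per year, with one column per content type.
titles_per_year = df.groupby(["year_added", "type"]).size().unstack(fill_value=0)
print(titles_per_year)
```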

From the plot, we can see that the most titles (both movies and TV shows) were added in 2019. We can also note that movies have been added since as early as 2008, while TV shows have only been added since 2013.

What is Saudi Arabia's top genre?

Notice that the genres are listed under the listed_in column, with several genres per title. We want to extract each genre individually, so we will first create a pandas Series that contains each individual genre from the listed_in column and then observe the most popular genres.
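One way to build such a Series is to split each listed_in string on commas and explode the result into one row per genre. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "listed_in": [
        "Dramas, International Movies",
        "Comedies, International Movies",
        "Dramas, International Movies",
    ],
})

# Split on commas, emit one row per genre, and trim surrounding spaces.
genres = df["listed_in"].str.split(",").explode().str.strip()

# Most frequent genres first.
genre_counts = genres.value_counts()
print(genre_counts)
```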

Now we will find the most popular genre produced by Saudi Arabia:
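A possible sketch of this filter is shown below. It assumes that a title counts for Saudi Arabia whenever the country field mentions it, including co-productions; the original cell may have filtered differently. The rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "country": ["Saudi Arabia", "United States, Saudi Arabia", "India", None],
    "listed_in": [
        "Comedies, International Movies",
        "Dramas, International Movies",
        "Dramas",
        "Comedies",
    ],
})

# The country field can list several countries, so match by substring.
# na=False treats missing countries as non-matches.
is_saudi = df["country"].str.contains("Saudi Arabia", na=False)

# Split the matching titles into individual genres and count them.
saudi_genres = (
    df.loc[is_saudi, "listed_in"].str.split(",").explode().str.strip().value_counts()
)
print(saudi_genres)
```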

From the plot, we can see that most content produced by Saudi Arabia is listed under International Movies. However, since that label is a broad category rather than a genre, we can say that the most popular genre of movies or TV shows produced by Saudi Arabia is Comedies.

Which country produces the most content?

For our next question, we're going to look at which country produced the most content in general.
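Counting titles per country could be sketched as below. Two assumptions are made: each co-producing country gets credit for a title, and missing countries were filled with the placeholder "Unknown", which is excluded from the ranking. The rows are made up.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "country": [
        "United States",
        "United States, India",
        "India",
        "United Kingdom",
        "Unknown",
        "United States",
    ],
})

# One row per (title, country) pair, with surrounding spaces trimmed.
countries = df["country"].str.split(",").explode().str.strip()

# Exclude the assumed placeholder for missing countries.
countries = countries[countries != "Unknown"]

country_counts = countries.value_counts()
print(country_counts.head(10))
```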

So, we can see that the country with the most productions on Netflix is the United States, by quite a large margin over the second-highest producer, India.

What's the best month to release content?

Besides seasonal movies and TV shows, there is usually a preference to release new content during months when there are not a lot of other new releases. This reduces competition for the new release.
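Under this reasoning, the quietest month is the one with the fewest titles added. A minimal sketch on made-up dates follows; note that a month with no titles at all would not appear in the counts.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "date_added": [
        "February 3, 2020",
        "July 1, 2019",
        "July 15, 2021",
        "December 31, 2018",
        "December 1, 2020",
        "July 4, 2020",
    ],
})

# Parse the dates, then count titles by the name of the month they were added.
added = pd.to_datetime(df["date_added"].str.strip(), format="%B %d, %Y")
month_counts = added.dt.month_name().value_counts()

# The month with the fewest additions has the least competition.
quietest_month = month_counts.idxmin()
print(month_counts)
print(quietest_month)
```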

So, from the plot above, we can see that the best months to release a movie or TV show on Netflix would be February and then May, since these are the months with the fewest titles released.

Conclusion

We were able to draw several inferences from our Netflix titles dataset. To summarize:

  1. Most of the content on Netflix is movies.
  2. Movies have been added to Netflix since 2008, but TV shows weren't added until 2013.
  3. Netflix saw much more content added after 2015-2016.
  4. Saudi Arabia produces mostly comedies, and its content falls mostly under the International Movies category.
  5. International Movies is the most popular listing for movies, with the most popular genre on Netflix being Dramas.
  6. The United States produces the most content on Netflix, followed by India and the United Kingdom.
  7. The best month to release content is February.