
Netflix is a subscription-based streaming service that allows its members to watch TV shows and movies without commercials on an internet-connected device. [Reference]
In this analysis, we use a dataset that describes all the movies and TV shows on Netflix. The dataset will be used to answer the research questions:
The data represents the catalog of movies and TV shows on Netflix as of September 25, 2021. The dataset was sourced from Kaggle. The following is a brief description of each variable (there are a total of 12 columns):
- show_id: Unique ID for every Movie / TV Show
- type: Identifier - A Movie or TV Show
- title: Title of the Movie / TV Show
- director: Director of the Movie
- cast: Actors involved in the movie / show
- country: Country where the movie / show was produced
- date_added: Date it was added on Netflix
- release_year: Actual release year of the movie / show
- rating: TV Rating of the movie / show
- duration: Total Duration - in minutes or number of seasons
- listed_in: Genre
- description: The summary description

import pandas as pd
import numpy as np
import seaborn as sns
from plotnine import ggplot,geom_bar, aes
import matplotlib.pyplot as plt
%matplotlib inline
import missingno as msno
from collections import Counter
netflix_df = pd.read_csv('netflix_titles.csv')
netflix_df.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
netflix_df.shape
(8807, 12)
The dataset contains 12 columns that we can use for exploratory analysis. Just by observing the first 5 rows, we can see that there are NaN values in multiple columns. This takes us to the next step, cleaning the dataset.
msno.matrix(netflix_df)
plt.show()
From the plot above, we see that the director column contains the most NaN values, with other columns such as cast and country also showing null values. To get a better idea of the extent, we will find which columns contain null values and the percentage of null values in each.
print('\nColumns with missing value:')
print(netflix_df.isnull().any())
Columns with missing value:
show_id         False
type            False
title           False
director         True
cast             True
country          True
date_added       True
release_year    False
rating           True
duration         True
listed_in       False
description     False
dtype: bool
(netflix_df.isnull().mean()*100).sort_values(ascending=False)[:6]
director      29.908028
country        9.435676
cast           9.367549
date_added     0.113546
rating         0.045418
duration       0.034064
dtype: float64
About 30% of the director column is null, followed by country and cast. Before we can begin any analysis of the data, we will have to first deal with these null values.
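The two checks above (which columns have nulls, and what percentage) can be combined into one table. The following is a minimal sketch, not part of the original notebook; the helper name `missing_summary` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, highest first."""
    summary = pd.DataFrame({
        "n_missing": df.isnull().sum(),
        "pct_missing": df.isnull().mean() * 100,
    })
    # Keep only the columns that actually contain missing values
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("pct_missing", ascending=False)


# Toy frame standing in for netflix_df (values are made up for illustration)
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "country": ["US", "IN", np.nan, "UK"],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(netflix_df)` would be expected to list the same six columns as the output above.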
We will not use imputation, a method for dealing with missing values by filling them either with an estimated statistical "best guess" (e.g., mean, mode, median) or with techniques such as KNN or tree-based models. For director, country, and cast, it is better to have an unknown value than an incorrect one. So, instead, we will use the fillna function from pandas to indicate that the information is missing.
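# Assign the result back rather than calling fillna(inplace=True) on a single column,
# which newer pandas versions warn about and may not apply to the DataFrame
netflix_df["director"] = netflix_df["director"].fillna("No Director")
netflix_df["cast"] = netflix_df["cast"].fillna("No Cast")
netflix_df["country"] = netflix_df["country"].fillna("Country Unavailable")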
Since the percentage of null values for date_added, rating, and duration is less than 1% each, we will instead drop all rows that contain NaN values in any of these columns.
netflix_df.dropna(subset=["date_added", "rating", "duration"], inplace=True)
print('\nColumns with missing value:')
print(netflix_df.isnull().any())
Columns with missing value:
show_id         False
type            False
title           False
director        False
cast            False
country         False
date_added      False
release_year    False
rating          False
duration        False
listed_in       False
description     False
dtype: bool
netflix_df.shape
(8790, 12)
Only 17 rows out of the original 8807 rows are dropped.
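The cleaning steps above (label the heavily missing columns, drop rows for the rarely missing ones) could also be wrapped in a single function that leaves the input untouched. This is a sketch under that assumption; `clean_titles`, `FILL_VALUES`, `DROP_IF_MISSING`, and the toy rows are illustrative names and data, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Placeholder labels and column groups taken from the steps described above
FILL_VALUES = {
    "director": "No Director",
    "cast": "No Cast",
    "country": "Country Unavailable",
}
DROP_IF_MISSING = ["date_added", "rating", "duration"]


def clean_titles(df: pd.DataFrame) -> pd.DataFrame:
    """Return a cleaned copy: fill placeholder labels, then drop incomplete rows."""
    cleaned = df.fillna(value=FILL_VALUES)          # per-column fill, returns a new frame
    cleaned = cleaned.dropna(subset=DROP_IF_MISSING)
    return cleaned


# Toy rows standing in for netflix_df (made up for illustration)
toy = pd.DataFrame({
    "director": ["X", np.nan, "Z"],
    "cast": [np.nan, "P", "Q"],
    "country": ["US", np.nan, "IN"],
    "date_added": ["September 25, 2021", "September 24, 2021", np.nan],
    "rating": ["PG-13", "TV-MA", "TV-MA"],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})
cleaned = clean_titles(toy)
print(len(toy) - len(cleaned), "row(s) dropped")
```

Returning a copy keeps the raw data available, which makes it easy to report how many rows were removed.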
We will first perform some analysis that will help us better understand our dataset.
plt.figure(figsize=(12,6))
plt.title("Percentage of Netflix Titles as either Movies or TV Shows")
plt.pie(netflix_df.type.value_counts(),explode=(0.01,0.01),labels=netflix_df.type.value_counts().index, colors=['#b1a7a6',"#a4161a"],autopct="%1.2f%%")
plt.show()
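So, movies make up roughly 70% of the titles and TV shows roughly 30%, which corresponds to approximately 6,100 movies and 2,700 TV shows among the 8,790 remaining rows.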
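The pie chart shows percentages only, so it can help to tabulate the counts next to the shares. A minimal sketch, using a made-up `toy_types` Series in place of `netflix_df.type`:

```python
import pandas as pd

# Toy stand-in for netflix_df["type"] (values are made up for illustration)
toy_types = pd.Series(["Movie", "Movie", "Movie", "TV Show"], name="type")

counts = toy_types.value_counts()
shares = (toy_types.value_counts(normalize=True) * 100).round(2)

# Counts and percentages side by side, aligned on the type label
summary = pd.DataFrame({"count": counts, "percent": shares})
print(summary)
```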
We now want to plot the year in which titles were added to Netflix. However, since the date_added column has dtype object, we first convert it to datetime format.
netflix_df.dtypes
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
netflix_df['date_added'] = pd.to_datetime(netflix_df['date_added'])
netflix_df.head()
| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | No Cast | United States | 2021-09-25 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | No Director | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | Country Unavailable | 2021-09-24 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | No Director | No Cast | Country Unavailable | 2021-09-24 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | No Director | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | 2021-09-24 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |
min(pd.DatetimeIndex(netflix_df['date_added']).year)
2008
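If the date strings are not perfectly uniform, the conversion can be made more explicit. The sketch below assumes, without having verified it against the dataset, that some date_added values carry stray leading whitespace; it strips the strings and passes an explicit format, then reads the year through the `.dt` accessor. The `raw` Series is made up for illustration.

```python
import pandas as pd

# Toy stand-in for netflix_df["date_added"]; the leading space is an assumed quirk
raw = pd.Series(["September 25, 2021", " August 4, 2017", "January 1, 2008"])

# Strip whitespace, then parse with an explicit "Month day, year" format
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y")

years = parsed.dt.year    # equivalent to pd.DatetimeIndex(parsed).year
months = parsed.dt.month
print(years.min())
```

An explicit format makes unexpected date strings fail loudly instead of being guessed at.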
netflix_df.groupby([pd.DatetimeIndex(netflix_df['date_added']).year, 'type'])['type'].count().unstack(level=1).plot(kind='line', figsize=(15, 8), color =['#b1a7a6','#a4161a'], linewidth = 4)
plt.xlim([2008,2021])
plt.xticks(np.arange(2008, 2022, step=1))
plt.show()
From the plot, we can see that the most titles (both movies and TV shows) were added in 2019. We can also note that movies have been added since 2008, while TV shows have only been added since 2013.
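The observations read off the plot (busiest year, first year each type appears) can also be checked numerically. This is a sketch on made-up data; `pd.crosstab` is used here as an alternative to the groupby/unstack chain, and the variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the cleaned netflix_df (values are made up for illustration)
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2018-01-05", "2019-06-01", "2019-03-10", "2019-11-20"]
    ),
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})

# Rows: year added, columns: type, cells: number of titles (0 where none)
added_per_year = pd.crosstab(toy["date_added"].dt.year, toy["type"])

# Year with the most titles added overall
busiest_year = int(added_per_year.sum(axis=1).idxmax())

# First year in which each type has at least one title
first_added = {
    col: int(added_per_year[added_per_year[col] > 0].index.min())
    for col in added_per_year.columns
}
print(busiest_year, first_added)
```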
Notice that the genres are listed under the listed_in column; however, we want to extract each genre individually. So, we will first create a pandas Series that counts each individual genre in the listed_in column and then observe the most popular genres.
genre = netflix_df['listed_in']
seperated_genre = ','.join(genre).replace(' ,',',').replace(', ',',').split(',')
genre_count = pd.Series(dict(Counter(seperated_genre))).sort_values(ascending=False)
genre_count
International Movies            2752
Dramas                          2426
Comedies                        1674
International TV Shows          1349
Documentaries                    869
Action & Adventure               859
TV Dramas                        762
Independent Movies               756
Children & Family Movies         641
Romantic Movies                  616
Thrillers                        577
TV Comedies                      573
Crime TV Shows                   469
Kids' TV                         448
Docuseries                       394
Music & Musicals                 375
Romantic TV Shows                370
Horror Movies                    357
Stand-Up Comedy                  343
Reality TV                       255
British TV Shows                 252
Sci-Fi & Fantasy                 243
Sports Movies                    219
Anime Series                     174
Spanish-Language TV Shows        173
TV Action & Adventure            167
Korean TV Shows                  151
Classic Movies                   116
LGBTQ Movies                     102
TV Mysteries                      98
Science & Nature TV               92
TV Sci-Fi & Fantasy               83
TV Horror                         75
Anime Features                    71
Cult Movies                       71
Teen TV Shows                     69
Faith & Spirituality              65
TV Thrillers                      57
Stand-Up Comedy & Talk Shows      56
Movies                            53
Classic & Cult TV                 26
TV Shows                          16
dtype: int64
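The same per-genre count can be produced with pandas string methods instead of joining everything into one string and using Counter. The following is a sketch of that alternative on a made-up Series; `genre_count_alt` is an illustrative name.

```python
import pandas as pd

# Toy stand-in for netflix_df["listed_in"] (values are made up for illustration)
toy_listed_in = pd.Series(
    ["Dramas, International Movies", "Comedies,Dramas", "Dramas , Comedies"],
    name="listed_in",
)

genre_count_alt = (
    toy_listed_in
    .str.split(",")   # one list of genre labels per title
    .explode()        # one row per (title, genre) pair
    .str.strip()      # remove whitespace left around the commas
    .value_counts()   # sorted from most to least frequent
)
print(genre_count_alt)
```

Stripping after the split handles any spacing around the commas in one step.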
genre_top = genre_count[:20]
plt.figure(figsize=(20,12))
# Pass x and y as keyword arguments; seaborn 0.12+ no longer accepts them positionally
sns.barplot(x=genre_top.values, y=genre_top.index, palette="RdGy")
plt.show()
Now we will find the most popular genre produced by Saudi Arabia:
netflix_genre_country = pd.DataFrame([netflix_df['country'].apply(lambda x: x.split(',')[0]), netflix_df['listed_in']])
netflix_genre_country_t = netflix_genre_country.T
netflix_df_exploded = netflix_genre_country_t.set_index(['country']).apply(lambda x: x.str.split(',').explode()).reset_index()
country_count_df = netflix_df_exploded.value_counts().rename_axis().reset_index(name='counts')
country_count_df
| | country | listed_in | counts |
|---|---|---|---|
| 0 | India | International Movies | 807 |
| 1 | United States | Documentaries | 429 |
| 2 | United States | Dramas | 429 |
| 3 | India | Dramas | 404 |
| 4 | United States | Comedies | 374 |
| ... | ... | ... | ... |
| 1436 | Mexico | Thrillers | 1 |
| 1437 | Mexico | Classic Movies | 1 |
| 1438 | Mexico | Docuseries | 1 |
| 1439 | Mexico | International Movies | 1 |
| 1440 | Zimbabwe | Comedies | 1 |
1441 rows × 3 columns
sa_count = country_count_df.loc[country_count_df['country'] == 'Saudi Arabia'].reset_index(drop=True)
del sa_count['country']
sa_count
| | listed_in | counts |
|---|---|---|
| 0 | International Movies | 7 |
| 1 | Comedies | 5 |
| 2 | International TV Shows | 3 |
| 3 | Romantic Movies | 2 |
| 4 | Dramas | 2 |
| 5 | Independent Movies | 2 |
| 6 | TV Dramas | 2 |
| 7 | TV Shows | 1 |
| 8 | Dramas | 1 |
| 9 | TV Comedies | 1 |
sa_count.plot.bar(x = 'listed_in',y = 'counts', color = '#a4161a')
plt.title("Most popular genre produced by Saudi Arabia on Netflix")
plt.xlabel("Genre")
plt.show()
From the plot, we can see that most content produced by Saudi Arabia is listed under International Movies. However, since that label describes origin rather than subject matter, we can say that the most popular genre among the movies and TV shows produced is comedy.
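In the Saudi Arabia table, "Dramas" appears in two separate rows (counts 2 and 1). A likely cause, stated here as an assumption, is that splitting listed_in on "," leaves a leading space on every genre after the first, so " Dramas" and "Dramas" are counted as different labels. The sketch below strips the labels before counting; the toy data and variable names are illustrative.

```python
import pandas as pd

# Toy stand-in for the cleaned netflix_df (values are made up for illustration)
toy = pd.DataFrame({
    "country": ["Saudi Arabia, United States", "Saudi Arabia", "India"],
    "listed_in": ["Dramas, International Movies", "Comedies, Dramas", "Dramas"],
})

pairs = pd.DataFrame({
    # Keep only the first listed country, as in the analysis above
    "country": toy["country"].str.split(",").str[0].str.strip(),
    "listed_in": toy["listed_in"].str.split(","),
}).explode("listed_in")

# Remove the whitespace left over from splitting on ","
pairs["listed_in"] = pairs["listed_in"].str.strip()

counts = pairs.value_counts().reset_index(name="counts")
sa = counts[counts["country"] == "Saudi Arabia"].reset_index(drop=True)
print(sa)
```

With stripped labels, each genre appears once per country, so the two "Dramas" rows would merge into one.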
For our next question, we're going to look at which country produced the most content in general.
country_count=netflix_df['country'].value_counts().sort_values(ascending=False)
country_count=pd.DataFrame(country_count)
topcountries=country_count[0:14]
topcountries
| | country |
|---|---|
| United States | 2809 |
| India | 972 |
| Country Unavailable | 829 |
| United Kingdom | 418 |
| Japan | 243 |
| South Korea | 199 |
| Canada | 181 |
| Spain | 145 |
| France | 124 |
| Mexico | 110 |
| Egypt | 106 |
| Turkey | 105 |
| Nigeria | 95 |
| Australia | 85 |
topcountries.plot.bar(color = '#a4161a')
plt.title("Country with most content produced on Netflix")
plt.xlabel("Country")
plt.legend([])
plt.show()
So, we can see that the country with the most productions on Netflix is the United States, well ahead of the second-highest producer, India.
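One caveat: value_counts on the raw country column treats each distinct string as its own category. Assuming the column can hold comma-separated lists of countries (consistent with the `split(',')[0]` used earlier), co-productions would not be credited to the individual countries. A sketch of a per-country count on made-up data, with the placeholder label excluded:

```python
import pandas as pd

# Toy stand-in for netflix_df["country"] (values are made up for illustration)
toy_country = pd.Series([
    "United States",
    "United States, India",
    "India",
    "Country Unavailable",
    "United Kingdom, United States",
], name="country")

per_country = (
    toy_country[toy_country != "Country Unavailable"]  # skip the placeholder label
    .str.split(",")
    .explode()        # one row per (title, country) pair
    .str.strip()
    .value_counts()
)
print(per_country)
```

This counts a title once for every country it lists, so the totals can exceed the number of titles.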
Besides seasonal movies and TV shows, there is usually a preference to release new content during months when there are not a lot of other new releases. This reduces competition for the new release.
netflix_date_df = pd.DataFrame()
netflix_date_df['content_added_month'] = netflix_df['date_added'].dt.month
netflix_date_df['type'] = netflix_df['type']
"""netflix_date_df['content_added_month'] = netflix_date_df['content_added_month'].map({
1: 'January', 2: 'February', 3: 'March', 4: "April", 5: "May", 6: "June", 7: "July", 8: "August", 9: "September", 10: "October", 11: "November", 12: "December"})"""
netflix_date_df
| | content_added_month | type |
|---|---|---|
| 0 | 9 | Movie |
| 1 | 9 | TV Show |
| 2 | 9 | TV Show |
| 3 | 9 | TV Show |
| 4 | 9 | TV Show |
| ... | ... | ... |
| 8802 | 11 | Movie |
| 8803 | 7 | TV Show |
| 8804 | 11 | Movie |
| 8805 | 1 | Movie |
| 8806 | 3 | Movie |
8790 rows × 2 columns
netflix_date_df.groupby(['content_added_month', 'type'])['type'].count().unstack(level=1).sort_values('content_added_month', ascending = True).plot(kind='bar', subplots=False, figsize=(15, 8), colormap="RdGy")
my_xticks = ["January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"]
y_pos = np.arange(len(my_xticks))
plt.xticks(y_pos, my_xticks, rotation=45, horizontalalignment='right')
plt.show()
So, from the plot above, we can see that the best months to release a movie or TV show on Netflix would be February and then May, since these are the months in which the least content was added.
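The quietest months can also be ranked directly from the monthly counts rather than read off the bars. This is a sketch on made-up data; reindexing over months 1 to 12 is an added safeguard so that a month with no titles still gets a row, and the names `monthly` and `quietest` are illustrative.

```python
import calendar

import pandas as pd

# Toy stand-in for the cleaned netflix_df (values are made up for illustration)
toy = pd.DataFrame({
    "date_added": pd.to_datetime(
        ["2021-02-01", "2021-07-15", "2020-07-01", "2019-12-25"]
    ),
    "type": ["Movie", "TV Show", "Movie", "Movie"],
})

# Month (1-12) by type, with every month present even if nothing was added
monthly = pd.crosstab(toy["date_added"].dt.month, toy["type"]).reindex(
    range(1, 13), fill_value=0
)
monthly.index = [calendar.month_name[m] for m in monthly.index]

# Two months with the fewest titles added; a stable sort keeps calendar order on ties
quietest = monthly.sum(axis=1).sort_values(kind="stable").index[:2].tolist()
print(quietest)
```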
We were able to draw several inferences from the Netflix titles dataset. To summarize some of them: