This project utilizes SQL to conduct an in-depth analysis of Netflix's movies and TV shows dataset, aiming to extract valuable insights and answer key business questions. This README provides a detailed account of the project's objectives, business problems, solutions, findings, and conclusions.
- Content Type Distribution: Analyze the distribution of content types (movies vs TV shows) to understand the composition of Netflix's catalog.
- Rating Analysis: Identify the most common ratings for movies and TV shows to inform content acquisition and recommendation strategies.
- Content Attributes: List and analyze content based on release years, countries, and durations to identify trends and patterns.
- Content Categorization: Explore and categorize content based on specific criteria and keywords to enable targeted content discovery and recommendation.
The Data for this Project sourced from the Kaggle Dataset:
- Dataset Link Netflix_Shows
Drop table if exists netflix
create table netflix
(
show_id varchar(10),
type varchar(10),
title varchar(150),
director varchar(220),
casts varchar(1000),
country varchar(150),
date_added varchar(150),
release_year int,
rating varchar(10),
duration varchar(20),
listed_in varchar(100),
description varchar(270)
);
select type, count(*) as shows_count
from netflix
group by type;
Objective: Determine the distribution of content types on Netflix.
select t1.type, t1.rating
from (
select type,rating,count(*),
rank() over (partition by type order by count(*) desc) as ranking
from netflix
group by 1,2) as t1
where t1.ranking =1;
Objective: Identify the most frequently occurring rating for each type of content.
select n.*
from netflix n
where release_year = 2020 and type = 'Movie';
Objective: Retrieve all movies released in a specific year.
- Some rows having multiple entires like (india, canada,us) in same row
- For this we using an uncut and unnest function --> (unnest(string_to_array(country, ',')))
select unnest(string_to_array(country, ',')) as country_name, count(show_id) as total_count
from netflix
group by 1
order by 2 desc
limit 5;
Objective: Identify the top 5 countries with the highest number of content items.
select movie_list.type, movie_list.duration, movie_list.title
from (
select type, duration, title,
rank () over(order by duration desc) total_duration
from netflix
where type = 'Movie' and duration <> '') as movie_list
where movie_list.total_duration = 1
Objective: Find the movie with the longest duration.
-- Here need to convert the date from varchar to string
select *
from netflix
where to_date(date_added, 'month DD,YYY') >= current_date - interval '5 years'
Objective: Retrieve content added to Netflix in the last 5 years.
select type, title
from netflix
where director = 'Rajiv Chilaka'
-- If there are more then 1-direcor then we need to slipt the rows which have more than 1 name
select new_one.director_name, new_one.type, new_one.title
from
(select unnest(string_to_array(director, ',')) as director_name, type, title
from netflix) as new_one
where new_one.director_name = 'Rajiv Chilaka';
-- or simple
select *
from netflix
where director Ilike '%Rajiv Chilaka%' --Here Ilike will check the lower case also
Objective: List all content directed by 'Rajiv Chilaka'.
-- In text we not use the '>= are <, <=, >' it may give false result
-- Hence using an split function
select kt.*
from
(select *, split_part(duration, ' ',1):: numeric as seasons
from netflix
where type = 'TV Show') as kt
where kt.seasons > 5;
--or
select *
from netflix
where type = 'TV Show' and split_part(duration, ' ',1):: numeric > 5;
Objective: Identify TV shows with more than 5 seasons.
select gn.genre, count(gn.title)
from(select *, unnest(string_to_array(listed_in, ',')) as genre
from netflix) as gn
group by 1
order by 2 desc;
Objective: Count the number of content items in each genre.
10. Find each year and the average numbers of content release only in India on netflix.Return top 5 year with highest avg content release !
SELECT Extract(Year from to_date(date_added, 'Month DD, YYYY')) as years, count(*) as yearly_content,
Round(count(*)::numeric / (select count(*) from netflix where country = 'India')*100,2) as Avg_content
FROM netflix
WHERE country = 'India'
group by 1
order by avg_content desc
limit 5;
Objective: Calculate and rank years by the average number of content releases by India.
select *
from netflix
where type = 'Movie' and listed_in Ilike '%Documentaries%';
Objective: Retrieve all movies classified as documentaries.
select *
from netflix
where director is null;
Objective: List content that does not have a director.
select *
from netflix
where casts Ilike '%Salman Khan%' and release_year > extract(year from current_date) - 10;
-- For static datasets
select *
from netflix
where casts Ilike '%Salman Khan%' and release_year between 2014 and 2024
Objective: Count the number of movies featuring 'Salman Khan' in the last 10 years.
select count(*), jk.actors
from(select *, unnest(string_to_array(casts, ',')) as actors
from netflix
where country Ilike '%India%' and type = 'Movie') as jk
group by 2
order by 1 desc
limit 10
Objective: Identify the top 10 actors with the most appearances in Indian-produced movies.
select cont.category, count(cont.description) as Total_count
from
(select *,
case
when
description Ilike '%kill%' or description Ilike '%violence%' then 'Bad Content'
else 'Good Content'
end category
from netflix) as cont
group by 1
Objective: Categorize content as 'Bad' if it contains 'kill' or 'violence' and 'Good' otherwise. Count the number of items in each category.
This analysis of the Netflix dataset reveals several key insights:
- Content Distribution: The dataset showcases a diverse range of movies and TV shows, with varying ratings and genres, catering to a broad audience.
- Common Ratings: The most common ratings provide valuable insights into the target audience for each content type, informing content acquisition and recommendation strategies.
- Geographical Insights: The top countries and average content releases by India highlight regional content distribution patterns, enabling targeted content localization.
- Content Categorization: Categorizing content based on specific keywords provides a deeper understanding of the nature of content available on Netflix, facilitating content discovery and recommendation.
For inquiries and collaborations, feel free to reach out to [email protected]