This project investigates the relationship between the emotional arcs of films and their success, focusing on how emotions evolve within genres, across continents, and over time. The goal is to identify whether universal or genre-specific emotional patterns exist and how these patterns influence a film’s reception, measured by box office performance, ratings, and awards. We ask whether films with more varied emotional trajectories perform better and whether certain emotional dynamics are associated with greater success. We also examine the role of historical context in shaping emotional arcs across genres. The motivation is to better understand the emotional strategies that make films resonate with viewers worldwide, and thereby the factors that drive audience engagement, enjoyment, and connection to the story.
- Does a universal or genre-specific emotional arc exist across movies, and how does it change over time and across different eras?
- What emotions or sentiments dominate in each genre, and why?
- How do emotional arcs in films within the same genre vary across continents, and can we link this variation to historical contexts or events? In particular, do differences emerge when North America is excluded?
- To what extent does the average emotional tone of a film influence its success (box office, ratings, awards)? Is emotion a leading factor? Does a more varied emotional arc contribute to the film’s success?
- How do emotions differ between successful and unsuccessful films? What impact does a film’s ending have on its success: should it end on a high note or a more melancholic one? Does a joyful conclusion lead to greater success?
To address missing data, we complete the CMU dataset with additional information from the films' Wikipedia pages. Approximately 40% of film summaries and over 90% of box office values are missing, and these values are essential for our analysis. To fill these gaps, we use libraries such as wikipedia-api and pywikibot, and in cases requiring more detail, we use requests and BeautifulSoup for web scraping. We extract summaries from the Plot/Synopsis/Summary sections and box office revenue from the infobox. To make sure we reach the correct page, we handle title variations (such as appending "(film)") and follow page redirects.
This process enriches our data by replacing incomplete summaries (those under 200 words) with the Wikipedia entry and by adding missing box office values. Regarding data size, we applied our scraping techniques to a sample of 2000 films, which took 12 minutes for the summaries and 25 minutes for the box office revenues. Extrapolating to the full dataset, we estimate maximal running times of about 8 hours and 16 hours, respectively. We expect the actual running time to be lower, because we will impose stricter validity and usefulness constraints on the dataset and drop outliers (for example, "films" that are too short, or films with too many unknown values that cannot be scraped).
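
The running-time estimate above is a linear extrapolation from the 2000-film sample. The sketch below reproduces that arithmetic. The total number of films is not stated in this section, so `N_FILMS = 80_000` is a hypothetical round figure, chosen only because it is consistent with the 8-hour and 16-hour bounds; the function name is likewise illustrative.

```python
def estimated_hours(sample_minutes: float, sample_size: int, total_items: int) -> float:
    """Scale a measured sample runtime linearly to the full dataset, in hours."""
    return sample_minutes * total_items / sample_size / 60


# Assumption: dataset size is NOT given in the text; 80,000 is a placeholder.
N_FILMS = 80_000

# Measured on the 2000-film sample: 12 min (summaries), 25 min (box office).
summaries_hours = estimated_hours(12, 2000, N_FILMS)
box_office_hours = estimated_hours(25, 2000, N_FILMS)

print(f"summaries:  ~{summaries_hours:.1f} h")
print(f"box office: ~{box_office_hours:.1f} h")
```

Because only films with missing or too-short entries need to be scraped, the real cost should fall below these bounds.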
To estimate the sentiment of a movie, we apply sentiment classification models to the text of its summary. First, we clean each summary by removing irrelevant or problematic content (e.g., HTML tags and stray citation markers). After cleaning, we compute the word count of each summary to check that it meets a minimum length requirement. For summaries that are too short, we retrieve additional text by scraping extended descriptions from the internet.
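
The cleaning and length check could look roughly like the sketch below. The regular expressions are illustrative assumptions about what "problematic content" looks like (HTML tags, `{{...}}` wiki templates, numeric markers such as `[12]`), not the exact rules used in the project. The 200-word threshold is the one mentioned for replacing short summaries; the function names are hypothetical.

```python
import re

MIN_WORDS = 200  # threshold used in this project for "too short" summaries


def clean_summary(text: str) -> str:
    """Strip markup debris from a raw summary (patterns are assumptions)."""
    text = re.sub(r"<[^>]+>", " ", text)          # HTML tags
    text = re.sub(r"\{\{[^{}]*\}\}", " ", text)   # wiki templates, e.g. {{cite ...}}
    text = re.sub(r"\[\d+\]", " ", text)          # numeric citation markers, e.g. [12]
    return re.sub(r"\s+", " ", text).strip()      # collapse whitespace


def word_count(text: str) -> int:
    """Count whitespace-separated tokens."""
    return len(text.split())


def needs_more_text(raw_summary: str, min_words: int = MIN_WORDS) -> bool:
    """True if the cleaned summary is below the minimum length."""
    return word_count(clean_summary(raw_summary)) < min_words


raw = "<b>Hero</b> wins.[1] {{cite web|url=x}} The end."
print(clean_summary(raw))
print(needs_more_text(raw))
```

Summaries flagged by `needs_more_text` are the ones that would be sent to the scraping step.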
Once the summary is ready, we segment it, either by splitting it into sentences or by grouping text into segments (possibly longer than one sentence) based on semantic similarity. Each segment is then passed through a sentiment analysis model, specifically a distilled version of RoBERTa. This model scores each segment on seven emotions (anger, disgust, fear, joy, neutral, sadness, and surprise), giving a proportional score for each.
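
As a minimal sketch of the sentence-level variant, the code below splits a summary into sentences and scores each one with a pluggable classifier. The regex splitter is a naive assumption, and `toy_classifier` is a keyword-based placeholder that only makes the example runnable; it is not the distilled RoBERTa model. The real model would be wrapped in a function with the same signature (segment in, one score per emotion out).

```python
import re
from typing import Callable, Dict, List

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

Scores = Dict[str, float]


def split_sentences(summary: str) -> List[str]:
    """Naive split on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", summary.strip())
    return [p for p in parts if p]


def emotion_arc(summary: str, classify: Callable[[str], Scores]) -> List[Scores]:
    """Return one emotion-score dict per sentence, in reading order."""
    return [classify(segment) for segment in split_sentences(summary)]


def toy_classifier(segment: str) -> Scores:
    """Placeholder classifier (hypothetical): keyword lookup, not a real model."""
    scores = {emotion: 0.0 for emotion in EMOTIONS}
    text = segment.lower()
    if "happy" in text:
        scores["joy"] = 1.0
    elif "afraid" in text:
        scores["fear"] = 1.0
    else:
        scores["neutral"] = 1.0
    return scores


arc = emotion_arc("She is happy. He is afraid! They leave.", toy_classifier)
for step, scores in enumerate(arc):
    print(step, max(scores, key=scores.get))
```

Keeping the classifier as a parameter lets the same segmentation code serve both the sentence-based and the similarity-based variants.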
To compare emotion dynamics across movies, we interpolate (or extrapolate when needed) each movie's sequence of segment scores onto a common timeline. This lets us approximate how emotions evolve over the length of a movie from the emotions expressed in its summary.
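
One way to put arcs of different lengths on a common timeline is linear interpolation onto a fixed number of evenly spaced points, applied separately to each emotion. The dependency-free sketch below illustrates the idea; `resample_arc` and the choice of grid size are assumptions, and holding the value constant for a single-segment summary is one possible reading of "extrapolation". NumPy's `np.interp` offers the same linear interpolation for array inputs.

```python
from typing import List


def resample_arc(values: List[float], n_points: int = 20) -> List[float]:
    """Linearly resample one emotion's per-segment scores to n_points values."""
    if not values:
        raise ValueError("values must not be empty")
    if n_points < 2:
        raise ValueError("n_points must be at least 2")
    if len(values) == 1:
        # Assumption: a single segment is held constant over the whole timeline.
        return [values[0]] * n_points

    last = len(values) - 1
    resampled = []
    for i in range(n_points):
        pos = i * last / (n_points - 1)   # fractional index into the original arc
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        resampled.append(values[lo] * (1 - frac) + values[hi] * frac)
    return resampled


# A 2-segment and a 3-segment arc mapped onto the same 5-point timeline.
print(resample_arc([0.0, 1.0], 5))
print(resample_arc([0.2, 0.8, 0.2], 5))
```

After resampling, every movie has the same number of time steps, so arcs can be averaged or compared point by point.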
In the final step, we fill in any missing data, such as box office revenue, release year, or genre, by scraping additional sources as needed, and then remove every row that still contains a NaN value. The resulting dataset captures not only the sentiment profile of each movie summary but also key contextual and financial data.
Details of the methods are presented in the results notebook.
Week | Description | Person in charge |
---|---|---|
Week 10 | Finish scraping if needed, run sentiment analysis on all films once done | Ines |
Week 10 | Draw plots and answer general questions about emotional dynamics for all movies, across genres, time, and continents | Florian |
Week 11 | Dig deeper into differences across genres | Alix |
Week 11 | Dig deeper into differences across time | Mathieu |
Week 11 | Dig deeper into differences across continents | Xavier |
Week 12 | Investigate the box office - emotional relationship in detail | Mathieu |
Week 12 | Investigate the reviews, awards - emotional relationship in detail | Ines |
Week 13 | Get-together week: joint analysis of results | Xavier |
Week 14 | Data story writing | Alix and Florian |