Forecasting movie sales based on its trailer’s features

Development of a tool for predicting box office receipts from film sales and return on investment. Analysis of the textual content used in trailers. Categorization of words into positive and negative. Building a movie production model that is attractive to buyers.

Category: Management and labor relations
Type: diploma thesis
Language: English
Date added: 10.12.2019

Federal state educational institution of higher education

National Research University

Higher School of Economics

Saint Petersburg School of Economics and Management

Department of Management

Bachelor's thesis

Forecasting movie sales based on its trailer's features

In the field 38.03.02 'Management'

Educational programme 'Management'

Tashpulatov Azizbek Nabiyevich

Supervisor: E. Antipov

Associate professor of department of economics and management, PhD

Saint Petersburg 2019

Abstract

The filmmaking market, together with advertising, is one of the most popular business segments, and it faces the problem of forecasting box office revenue. Investors need to know their return on investment (ROI) from movie sales before the movie is released. Existing models offer some predictions; however, most of them rely on information that becomes available only after the premiere date. Hence, the goal of the paper is to establish the trailer as the main instrument for forecasting a movie's income using information available before the premiere date. More specifically, the research covers the impact of trailer text on the total gross of the movie, using sentiment analysis and topic modeling. The methodology consists of 3 stages. The first stage was topic modeling of the words used in trailers. The second was the categorization of words into sentiments (positive and negative). The third was a linear regression used to find correlations among the variables and to build the model. These steps were expected to detect the key topics and words of a trailer that might have a positive or negative impact on box office revenue. As the study showed, some of the suggested trailer features correlated positively with subsequent movie sales. All the results were closely related to consumers' perception of words and were explained using statistical data. Although not all of the suggested hypotheses were confirmed, the results of the study are valuable. Using this model, both motion picture studios and cinemas can foresee their return on investment and make relevant decisions. The study can also serve the advertising market as a guide to making a product attractive and favorable to customers.

Keywords: Topic modeling; Words; Trailer; Pre-released activities; Forecasting; Features; Advertising; ROI; Box-office revenue.

Table of contents

Introduction

1. Previous studies and existing models

1.1 A prediction model based on the movies' scenario

1.2 A prediction model based on Celebrities' power (cast)

1.3 Word of mouth effect on prediction

1.4 Demographics of a customer as another variable for prediction models

1.5 A prediction model based on Audio Dub

1.6 Neural networking predicting model

1.7 A gap in the literature review

2. Formulation of the research statement

3. Theoretical background

3.1 Main Concepts of the research (Trailers, Scripts, Box office revenue)

3.1.1 Trailers

3.1.2 Limitation of the trailers

3.1.3 Scripts

3.1.4 Box office revenue

3.2 Word (topic) perception and its influence on sales

3.3 Positive and negative words (topics) effect on sales

4. The methodology and research design

4.1 Variables

4.2 Data collection

4.3 Topic modeling

4.4 Sentiment analysis

4.5 Regression model

5. Results and Discussion

5.1 Descriptive analysis of scripts

5.2 Descriptive analysis of gross (income)

5.3 Topic modeling results

5.4 Sentimental analysis

5.5 Regression model

6. Discussion of results and hypotheses

Conclusion

Reference list

Appendices

Introduction

With the growing cultural consumption of movies and theatre, the financial sector of the film-making industry continues to grow (Algesheimer, Borle, Dholakia, & Singh, 2010). Accordingly, investments in advertising and other pre-premiere activities have risen dramatically since the 2000s (Joshi & Hanssens, 2009); the term "pre-premiere" here denotes the period before the official release date of a movie (Dhar, Sun, & Weinberg, 2012). Yet the question of whether box office revenue covers all the expenditures and expectations remains controversial. Such uncertainty raises the risk that investors and motion picture studios lose both their money and their reputation if the ROI turns out to be negative. ROI, or return on investment, is usually measured as a percentage: the larger the figure, the more money is gained back from the investment ('ROI Formula, Calculation, and Examples of Return on Investment', n.d.). Therefore, given this need for assurance, researchers have created several forecasting models for predicting box office sales (Gopinath, Chintagunta, & Venkataraman, 2013). Such models can roughly estimate the amount of money an upcoming product will earn in the near future. Each of these models has its own approach and variables. For example, Kaimann and Pannicke (2015) use “genre” to predict the potential score on the IMDb service (Kaimann & Pannicke, 2015); Lee considers movie length the main feature for calculating the gross of a movie (S. Lee & Choeh, 2018).

However, most of these models are computed after the movie's release date. In addition, they rely on widely used features of a movie, while some pre-release activities are ignored. For instance, 86% of these studies predict the profit of a movie using variables such as genre, cast, video effects, etc. (Dumler, 2018). The purpose of this paper is therefore to focus on the textual content of trailers: in other words, on how a word or group of words in a trailer can influence the profit of an upcoming movie before the official premiere date.

There are several reasons for conducting this research, ranging from the cognitive study of words to the needs of the advertising market.

Since the study endeavors to explain fluctuations in gross through words and texts, the research involves the human cognitive perception of text. Such an instrument helps to find appropriate words for advertising. For example, Posner et al. (1988) consider the word "mother" to be associated with love and care; hence this term might be used to advantage in adverts for children's goods (Posner, Petersen, Fox, & Raichle, 1988). The same principle is used in this paper: a term that correlates positively with high sales can act as a driving force that makes people respond favorably to advertising.

Turning to advertising itself, it is noteworthy that trailers, one of the most powerful instruments in SMM, are not fully explored. Elliott and Simmons treat advertising and trailers as two different but closely related things: “Advertising and trailers play a significant role in the box office performance since its strong connection to them” (Elliott & Simmons, 2008). Simmons says that PR companies, together with advertising agencies, mainly focus on traditional ways of promoting information, such as posters, short video clips or fliers. In Elliott's opinion, such an oversimplification of trailers has led to scientific gaps, one of which is the lack of scientific research on trailers. Yet, according to statistics by Karray et al. (2017), 58% of advert consumption belongs to video formats such as trailers or short movies, whereas posters and fliers account for only 42% (Karray & Debernitz, 2017). Based on these facts, this paper might help to fill these gaps. In my opinion, the indicators shown in the research will help both advertising companies and moviemaking studios to understand which features of a trailer deserve more attention and which do not.

Given the problems mentioned above, there is a need for research on the text of movie trailers. Accordingly, the goal of the research, as has partly been mentioned, can be stated as: “Forecasting movie sales based on its trailer's textual features”.

Because the forecasting model in the paper describes a causal relationship between trailer scripts and box office revenue, the research is explanatory in type. Such an approach helps to find the links between these two research objects and yields recommendations not only for investors but also for motion picture studios and advertising agencies (Basuroy, Chatterjee, & Ravid, 2003).

Objective to be explored: The objective of the paper can be described as forecasting the box office revenue of an upcoming movie through pre-premiere activities, trailers in particular.

The subject of the paper: The correlation of trailers' textual features with the movie sales.

The main areas of the study are:

- Trailer's textual features (words, topics, sentiments)

- Revenue of the movie measured in money (U.S. dollars)

Tasks to be fulfilled to achieve the goal:

- To explore the previous studies about prediction models;

- To examine the literature about word perception and formulate hypotheses;

- To develop the research design and the appropriate methodology;

- To collect information about the gross of English-language movies in the period from the 1990s to 2018;

- To obtain the scripts of the trailers used for these movies;

- To carry out a vertical sentiment analysis of the words used in trailers (data preparation and descriptive analysis);

- To generate topics from the words used in trailers, using the LDA method, which is described in detail in the “methodology” chapter, section 4.3 (data preparation and descriptive analysis);

- To carry out the correlation analysis and build a linear regression model (confirming or rejecting the suggested hypotheses);

- To explain the results and draw relevant conclusions.

The research will mainly use LDA and vertical sentiment analysis as the core methods of the study; the latter is explained in detail in the “methodology” chapter, section 4.4.

Structure of Paper. The paper is divided into 6 sections. The first section is devoted to the literature review, followed by the formulation of the research problem in the second. The 3rd and 4th sections describe the theoretical background and the methodology of the paper. All the results are given in the 5th part of the thesis, with the implications drawn in the last, 6th section.

1. Previous studies and existing models

Since the research draws on a large body of material, it is necessary to review the relevant literature. According to Weinberg (2012), there are many articles and publications devoted to forecasting the box office revenue of released movies (Oh, Baek, & Ahn, 2017). Most of these models have been successfully implemented in cinemas and still produce accurate results. However, as already mentioned, the core value of this paper lies in prediction before the movie has premiered. It is also worth highlighting the uniqueness of this approach, which uses the textual features of trailers for successful advertising and sales. The following subsections describe the most popular variables used in predictive models in previous studies.

1.1 A prediction model based on the movies' scenario

One revenue prediction model was built by J. Eliashberg et al. (2007) in the study “From storyline to box office”. The study underlines the importance of the words and scripts used in films by examining their scenarios. The authors note that major studios still employ qualified readers to help them analyze the scenarios of upcoming movies (Eliashberg, Hui, & Zhang, 2007). More specifically, some motion picture studios assign four or five people, called “readers”, to scan each script and then make recommendations. As the authors put it: “the success of a movie production depends on the quality of the available readers and their acumen in picking out promising scripts. This approach becomes especially problematic when disagreements among readers occur.” (Eliashberg et al., 2007) Hence, they proposed a new approach that can potentially help studios make more profitable decisions about scripts.

The main data used in that paper came from spoilers, by which the authors meant descriptions of the movie. The method they applied is called Bag-CART (Bootstrap Aggregated Classification and Regression Tree) (Liaw & Wiener, 2002). This approach allowed them to forecast movie revenue based on separate words and sentences.

One more variable introduced by the authors was the order of words in the text. Analyzing individual words alone would overlook problems that can arise. For example, "Batman kills Joker" and "Joker kills Batman" will trigger many questions and very different emotional reactions from the audience (Eliashberg et al., 2007).

Although these two statements use the same words, their order plays a significant role, so it was included in the model. The model is very effective; however, it requires the scenarios to be already written. Also, spoilers usually do not use the same language as the actual movie does. So, the model has some shortcomings.

1.2 A prediction model based on Celebrities' power (cast)

Other researchers suggest that actors have a major influence on box office revenue and advertising. In other words, a celebrity's standing usually affects the movie's revenue and has the strongest effect on it. Treme and Craig (2013), who analyzed the effect of star power on income by looking at an actor's popularity and name references, found a positive correlation between them (Treme & Craig, 2013).

Although the model showed accurate results, the approach is not appropriate for this research. There are several reasons, ranging from movies that feature new actors to the objectivity of spectators. Some movies are known not to cast popular celebrities, for reasons of budget and role fit. Such films fall outside the model. The same problem emerges in other cases, so this variable is not included in this paper.

1.3 Word of mouth effect on prediction

De Vany and Walls (2002) say that what matters is no longer the role of a particular celebrity but the word-of-mouth effect (De Vany & Walls, 2007). In the authors' view, the stronger the impression a movie makes, the more people will share their feelings about it and, hence, the greater the profit will be. The process is illustrated schematically below (Figure 1).

Figure 1. Word of mouth scheme

Despite the value of the variable, the approach has some drawbacks. How to measure viewers' emotions remains controversial because of differing approaches and cultural differences (De Vany & Walls, 2007). In detail, some companies use feedback as the main data for calculating the level of impression, while others prefer surveys. Yet there is a risk of missing other factors in spectators' preferences about celebrities. All in all, as Treme and Craig (2013) say, this measure can only blur the research, for reasons such as viewers' associations, emotional background and the attractiveness of an actor (Kaimann & Pannicke, 2015). Therefore, this variable will not be included in the paper, except to explain some statements and indicators about trailers.

1.4 Demographics of a customer as another variable for prediction models

Another study found a strong correlation between a movie's income and certain demographic figures.

The research by S. Gopinath et al. (2013) claims that movie sales can be estimated from the advertising platform, including banners, social networking sites, blogs, and newsgroups, aimed at a specific target audience. In the study, the authors explore the relationship between people from different demographic groups and the advert itself (Karray & Debernitz, 2017).

According to their study, younger people under the age of 25-30 with a low level of education and income are more likely to be influenced by advertising; the same effect holds for women aged 25-40 (Karray & Debernitz, 2017). Such information can help motion picture studios and advertising companies expand their audience by choosing a particular mass media channel. Since this paper evaluates the effect of the trailer, it will draw on the results of that article to explain some figures and interesting tendencies, such as the split between male and female viewers.

1.5 A prediction model based on Audio Dub

It is interesting to note that experts in this sector from the University of Zurich, namely Sharad Burle (2012), have identified the audio dub as another important variable for forecasting box office revenue.

According to their study, the better synchronized the audio dubbing is, the more profitable and the higher in quality the movie is. In other words, about 85% of the movies tested with high-quality audio synchronization went on to show an upward trend in sales of 20% (Nelson & Glotfelty, 2012). Since the present research explores trailers filmed in their native language, specifically English, this variable probably does not need to be considered. However, when animated movies are analyzed, it has a role to play.

1.6 Neural networking predicting model

Yunian Ru et al. (2018) created a neural network prediction model that estimates the potential revenue of a movie from its producer, screen count, and distributor. The core of the work is machine learning and artificial intelligence, and the results are given as exact numbers. Almost the same method was used by another researcher, Leonid Yasnitskiy (2018) of the National Research University Higher School of Economics. He used about 20 variables of a movie, such as budget, genre, characters and even the age of the producer. According to his work, the film "The Da Vinci Code" could have raised its box office revenue by 25% if it had been 3 minutes longer and filmed as a sequel instead of a drama (Dumler, 2018). Still, his work is based on movies that have already been released and have received awards, whereas this paper predicts income before the premiere date, using trailers.

1.7 A gap in the literature review

All in all, the literature in this area is rich in studies, ranging from simple prediction models to neural network systems. Most of them have already been implemented as tools for calculating sales. However, the essence of forecasting lies in making estimates before the movie is released and in finding the specific trailer features associated with successful box office revenue, so this paper endeavors to find a proper solution to these issues.

2. Formulation of the research statement

Considerable attention has recently been paid to the movie market, particularly to predictive models. Yet there are still some interesting and relevant problems to be addressed.

One of them directly concerns the way these models are built. As already discussed, the majority of them use data obtained after the premiere date, which can be a problem.

The reason is that, despite the effectiveness and accuracy of these models in forecasting, investors and motion picture studios may not find the approach suitable for their investments and interests.

According to the research by Lash and Zhao (2016), these two parties (motion picture studios and investors) strive to see their return on investment before any decision is taken (Lash & Zhao, 2016).

That is why it is important to build a model that can analyze data before the movie is released to the public.

The next and most significant issue is the neglect of the trailer as a main driving force for potential revenue.

The core of this drawback lies in the variables used in these models. As the literature review has demonstrated, many predictive variables, such as genre, cast, director, running time, etc., are used as indicators of potential movie income.

However, it is an oversimplification to overlook another factor that might play a no less significant role in forecasting. By this factor, I mean text (the scripts of trailers).

Based on the approach presented in the “Green lighting effect” by Eliashberg et al. (2007), the text might have a strong impact on the final gross of the film (Eliashberg et al., 2007).

For example, if there is a positive word in the movie trailer, the probability that the movie will be positively perceived might grow.

This is just an example to illustrate how it works.

In sum, these traditional approaches might miss an important factor by ignoring the variable “text” during forecasting. Hence, the main focus of this paper is on this specific aspect.

3. Theoretical background

3.1 Main Concepts of the research (Trailers, Scripts, Box office revenue)

Since the paper relies on three main concepts (trailers, scripts and box office revenue), it is important to define them so that the overall picture becomes clearer. The structure of this subsection is as follows:

- Trailers

- Scripts

- Box office revenue of the movies

3.1.1 Trailers

In this research, the term trailer means a short video clip made of different scenes from the movie, arranged mostly in random order (Le, 1991). Trailers are always shown before the movie is released, acting as pre-premiere advertising. Their main purpose is to attract more people to come and see the movie. Smeaton (2006) describes them as a cluster of dynamic scenes full of video effects, made to impress the spectator with action rather than with the scenario (Smeaton, Lehane, O'Connor, Brady, & Craig, 2006). It is interesting to note that trailers are not restricted to films: statistics for 2015 show impressive consumption of them in the video game market (Madigan, 2015). The average number of likes on YouTube for video game trailers and movie trailers is 3.12 and 5.6 million, respectively (Burgess & Green, 2018). This means that these short clips are in high demand. As Chen et al. claim, some of these trailers are used for advertising purposes. For example, the trailer for the movie "Psycho" was filmed in the Bates hotel. About 678 thousand people saw the atmosphere inside the hotel. It is important to mention that the movie itself has nothing in common with this place; however, the Bates paid a lot of money to have the trailer made inside. After this activity, the total number of its visitors increased by 25%, showing that trailers have a strong advertising effect (Burgess & Green, 2018).

Turning to the marketing aspect, trailers have become another popular form of advertisement. As Simmons claims, a trailer usually acts as a short description of the movie, whereas an advert promotes it as a product (Elliott & Simmons, 2008). In my view, both use spectators' expectations as a gauge of the level of attraction. In other words, the higher the expectations are, the more people are interested in the movie.

Overall, as has been mentioned, this sector of advertising is not popular among researchers. According to Eliashberg (2000): "Efficiency of movie advertising content and execution seem overlooked by the research community, specifically how the design of trailers can influence investors' valuation of the movie." (Eliashberg, Jonker, Sawhney, & Wierenga, 2000) In the author's view, it is important to carry out a pre-release analysis of a movie because of investors' expectations. That is why this paper aims to fill this gap.

3.1.2 Limitation of the trailers

Because the values and tastes of viewers (movie spectators, consumers) change over time, the content of trailers will undoubtedly change as well (Smeaton et al., 2006). Starting in 2017, the number of words used in trailers began to decline (Oh et al., 2017). Many of them have been replaced by pictures accompanied by music. Some motion picture studios still use words, but in the form of floating captions. There are also trailers with no words at all (written or spoken). This situation is a limitation of the study, since in such cases the trailer factors affecting profit will no longer include text or subject matter.

3.1.3 Scripts

As already mentioned, the study examines the textual effect of trailers on box office revenue. That is why the paper involves scripts. By "scripts" the paper means the written form of the words used in trailers. More details about them can be found in a later chapter named "Data".

3.1.4 Box office revenue

One more concept listed above, box office revenue, is defined as the amount of money gained from sales of a movie in commercial cinemas. In the research, this term will also be referred to as income, gross and profit.

3.2 Word (topic) perception and its influence on sales

The word, as one of the basic units of information, plays a significant role in marketing. As DeCarlo said in his work on negative word-of-mouth effects (2007), “a correctly chosen word can either save a person from great misfortune or, on the contrary, destroy him” (DeCarlo, Laczniak, Motley, & Ramaswami, 2007). Alisher Navoiy, the national Uzbek writer, considers the word a more dangerous weapon than a sword (Abdulkhayrov, 2012).

In advertising, the process of finding an appropriate word or an interesting sentence that can attract people takes about a week or more (Joshi & Hanssens, 2009). This suggests that the text in most cases determines the success of a particular promotional product by changing the customer's perception. The study by Lieberman M.D. et al. (2008) found that words such as “now”, “soon” and “tomorrow” have great appeal to consumers (Tabibnia, Lieberman, & Craske, 2008). As the authors' experiment showed, advertising posts in which these words were highlighted received more clicks than those in which they were missing. The authors explain that these words respond to the human desire to receive what is wished for immediately. According to Deci et al. (2000), a person always wants to reach his or her goal as soon as possible. Accordingly, when someone sees the word “now” or a similar one, he or she perceives a signal of getting the desired thing in the near future, regardless of what is actually written there (Deci & Ryan, 2000).

Another study, conducted by Lee K.C. (2001), found that adverts made in a homey style, using a domestic atmosphere, are positively received by consumers (K.-C. Lee, 2001). In the author's opinion, home visualization in advertising mostly matches the customers' own. According to the results of the research, 78% of people preferred to buy the product shown against the background of a home (K.-C. Lee, 2001). One more study of the same research object, by S. S. Banerjee et al. (2008), claimed that customers take a favorable attitude to this topic because of objects related to it, such as relationships, personal life, etc. (Banerjee & Dholakia, 2008). In the authors' opinion, people find a place interesting not only because of its atmosphere but also because of the emotions and associations connected with it. According to the figures given by the authors, around 47% of people in the USA associate their homes with relationships and family; others either prefer to be alone or do not have any clue about their future (Banerjee & Dholakia, 2008).

One more study, on the effect of context on popularity, demonstrated that videos with criminal overtones on YouTube are more likely to spread among consumers than videos full of justice and morality (Burgess & Green, 2018). The research was done among American customers, taking local demographic features into account. It showed that people aged 18-30 are more strongly attracted by videos containing cruelty or crime. The authors explain this phenomenon by people's tendency towards feelings of fear and danger. As Burgess and Green say in their article on people's perception, humans are prone to experiencing fear and danger much more strongly than other emotions. Based on this logic, it can be assumed that trailers containing crime will have more influence on viewers, encouraging them to watch the upcoming movie. This, in turn, will increase the number of people in cinemas, which will lead to an increase in the box office revenue of the film.

3.3 Positive and negative words (topics) effect on sales

According to the statistics and marketing research by DeCarlo (2007), the presence of negative words about a product in its description or feedback often lowers its quality in consumers' opinion (DeCarlo et al., 2007). This effect is stronger when the product is unfamiliar to the consumer. The reason is that potential buyers pay more attention to the description and reviews when they know nothing about the product. If any negative reaction emerges, the consumer will automatically doubt the quality of the desired product.

However, Jonah Berger et al. (2010) consider that it is not as simple as described: according to the figures from their research, consumers spend more time reading negative reviews than positive ones (Berger, Sorensen, & Rasmussen, 2010). Often they do not read any positive reviews at all, looking only for negative ones. As the authors claim, positive reviews are often shorter than negative ones, hence people skip them quickly. Also, according to their study, negative reviews are more believable and interesting for buyers than praise (Berger et al., 2010). The core of their explanation lies in the number of details included in negative posts: the more details a post contains, the more trustworthy it is. With these data in mind, the presence of negative words in a text may also make it attractive, because of how people perceive reality and detail.

Deci and Ryan (2000) addressed this phenomenon in their study, explaining it as a form of human expression. As stated above, people tend to have stronger feelings towards fear and a sense of danger. In many cases negative words characterize these feelings; therefore, it is suggested that bad words have more emotional power than good ones (Deci & Ryan, 2000).

Based on these data, it can be hypothesized that the presence of negative words can increase the attractiveness of a trailer and, consequently, raise the profit of the film.

To sum up the information provided above, I have listed all 4 hypotheses suggested for this research. They can be seen below.

Hypotheses:

H1 - There is a correlation between topics of the texts in the trailers and the box office revenue;

H1.1 - The topic that has crime content is positively received by viewers, which increases the revenue;

H2 - There is a correlation between sentiments and the revenue;

H2.1 - The presence of negative words in the trailer increases the total gross of the upcoming movie.

4. The methodology and research design

To solve the mentioned problem, I have divided the methodology part into 2 sections. The first section covers the data itself (how the data was collected and what the main variables of the research are). The second section is mainly devoted to the analysis (the approach, formulas, etc.). For this stage, the research used RStudio as its main instrument.

The research type and design: (figure 2)

Figure 2. Research type model

The main framework for the study was a mix of qualitative and quantitative research methods. In both cases, the analysis was done using a machine learning approach. The topics were constructed from words with the LDA method based on Gibbs sampling (quantitative method), whereas the sentiment analysis relied on a dictionary (qualitative method).

4.1 Variables

Since the paper studies the correlation between the trailer and the movie, the dependent and independent variables are easy to identify. The textual features of a trailer act as independent variables, whereas the revenue from the box office is the dependent one.

For textual features, I chose sentiment variables (good and bad words) and topic modeling variables (several words joined in one common topic). As for box office revenue, the gross was selected as the dependent variable. However, it is important to check whether the gross substantially exceeds the movie budget. To be more specific, even when the sales of a movie are high, they might not fully cover the budget spent on filming. In this case, the movie might be considered unsuccessful. Hence, under the variable "gross" the paper takes the difference between the income and the budget (formula 1). It is important to mention that the list of movies was arranged according to the following rule: only movies whose difference in gross was 2 times more than their budget were accepted.

\text{net gross} = \text{gross (income)} - \text{budget} \qquad (1)

Based on this formula, the research endeavors to select the films whose revenue is much bigger than their expenditures (successful movies).
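The net gross variable and the selection rule can be illustrated with a minimal R sketch. The three movies, the column names and the threshold name `min_ratio` are invented for illustration and are not taken from the thesis data. The rule is read here as "net gross of at least twice the budget", which is one possible reading of the text.

```r
# Toy data: titles and amounts are invented, not taken from the thesis datasets
movies <- data.frame(
  title  = c("A", "B", "C"),
  gross  = c(350e6, 50e6, 20e6),   # box office income, USD
  budget = c(100e6, 40e6, 5e6),    # production budget, USD
  stringsAsFactors = FALSE
)

# Formula (1): net gross is the income minus the budget
movies$netgross <- movies$gross - movies$budget

# Assumed reading of the selection rule: keep movies whose net gross
# is at least `min_ratio` times the budget
min_ratio <- 2
successful <- movies[movies$netgross >= min_ratio * movies$budget, ]
print(successful)
```

Movie B earns more than it cost, yet it is dropped because its net gross is small relative to its budget, which is the situation the paragraph above describes.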

4.2 Data collection

The data collection was also divided into two phases. The first phase was data gathering, in which all the necessary data was collected. The second was data preparation (filtering, merging, cleaning) for analysis. The collection process raised some issues, since popular web pages such as IMDB or "Kinopoisk" did not provide any information about trailers' texts. That is why I used Kaggle as the main source of my data. Kaggle is a portal where data scientists can share their research with each other free of charge. I obtained two datasets from it: (`Predict IMDB score with data mining algorithms | Kaggle', n.d.)

1) movie dataset: has 9 columns (X1, actors, characters, movie title, genres, release year, IMDB ID, scripts) - 5043 observations (films)

2) movie_metadata: has 28 columns - 3923 observations (films), see Appendix 1

The following step was data preparation (merging, filtering, cleaning). At this stage, it was important to remove unnecessary columns such as actors, characters, likes, IMDB scores, etc. Then, I started to prepare the scripts for extracting the main topics. This process took several steps, which are described below:

a) Deleting stop words from the scripts (pronouns, prepositions, articles, etc.). This is needed so that topics are built from meaningful words only. Otherwise the program often treats the article "a" as a separate word, which has no value for the research. The same happens with pronouns, which are statistically the biggest part of any speech. (Eliashberg et al., 2007) The list of these stop words is given in Appendix 2. (Positive Effects of Negative Publicity: When Negative Reviews Increase Sales, n.d.)

b) Deleting punctuation and other unknown symbols such as *, &, $, etc.

c) Removing all numbers from the data, for the same reason as the stop words.

d) Stemming - the process of finding the morphological root of a word in order to avoid repetitions. For example, the words go, went, gone must be joined into the one word “go”, etc.
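Steps a) to c) can be sketched in base R as follows. This is only an illustration under stated assumptions: the function name `clean_script`, the sample sentence and the very short stop-word list are invented (the real list is the one in Appendix 2), and the thesis does not say which R functions were actually used. Stemming (step d) is left out because base R has no stemmer; a package such as SnowballC would be needed for it.

```r
# Tiny illustrative stop-word list; the thesis uses the list from Appendix 2
stop_words <- c("a", "the", "i", "you", "it", "and", "to", "of")

# Hypothetical helper: cleans one script and returns its remaining tokens
clean_script <- function(text, stop_words) {
  text <- tolower(text)
  text <- gsub("[0-9]+", "", text)            # c) remove numbers
  text <- gsub("[^a-z[:space:]]", "", text)   # b) remove punctuation and other symbols
  tokens <- strsplit(trimws(text), "[[:space:]]+")[[1]]
  tokens[nzchar(tokens) & !(tokens %in% stop_words)]   # a) remove stop words
}

cleaned <- clean_script("I need the $100, and you don't need it & 2 guns!", stop_words)
print(cleaned)
```

Dropping the apostrophe without inserting a space turns "don't" into "dont", which is the form that appears among the popular words in the results.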

Finally, I have merged these two data sets together and got 1377 movies (trailers) released from 1992 to 2018.

4.3 Topic modeling

For this part, I chose the method named "LDA" (Latent Dirichlet Allocation), which performs topic modeling of words. To be more specific, this algorithm transforms each word of the trailer into an independent vector. All of these vectors are then compared by distance and frequency, so that the common topic of these vectors emerges. (Nikolenko, n.d.)

The basis of this model relies on machine learning: a machine is taught to detect words that can be joined together by a common topic. It usually outputs 3-4 words combined, and the theme is then given a name based on its content. For example, the words love, wedding, and husband are joined into one topic under the name "Family", and so on.

In the LDA model, the words that form a topic are treated as independent and have no correlation with each other. However, the topics are internally close, which is why some of them share words. For instance, both the “family” topic and the “drama” topic contain the word “love”.
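To make the idea of LDA with Gibbs sampling more concrete, here is a very small collapsed Gibbs sampler written in base R. It is a teaching sketch, not the code used in the thesis: the four toy documents, the number of topics `K`, the priors `alpha` and `beta` and the number of iterations are all assumptions chosen for illustration.

```r
# Toy corpus of already-cleaned tokens (invented for illustration)
docs <- list(
  c("love", "wedding", "husband", "love"),
  c("gun", "war", "fire", "gun"),
  c("wedding", "husband", "love"),
  c("war", "fire", "gun", "order")
)
vocab <- sort(unique(unlist(docs)))
V <- length(vocab)
K <- 2; alpha <- 0.1; beta <- 0.01   # assumed number of topics and Dirichlet priors
set.seed(1)

w <- lapply(docs, match, table = vocab)                                # word ids per document
z <- lapply(w, function(ids) sample.int(K, length(ids), replace = TRUE))  # random initial topics

ndk <- matrix(0, length(docs), K)   # document-topic counts
nkw <- matrix(0, K, V)              # topic-word counts
for (d in seq_along(w)) for (i in seq_along(w[[d]])) {
  ndk[d, z[[d]][i]] <- ndk[d, z[[d]][i]] + 1
  nkw[z[[d]][i], w[[d]][i]] <- nkw[z[[d]][i], w[[d]][i]] + 1
}

for (iter in 1:200) {
  for (d in seq_along(w)) for (i in seq_along(w[[d]])) {
    k <- z[[d]][i]; v <- w[[d]][i]
    # remove the current assignment of this token from the counts
    ndk[d, k] <- ndk[d, k] - 1; nkw[k, v] <- nkw[k, v] - 1
    # conditional probability of each topic for this token (up to a constant)
    p <- (ndk[d, ] + alpha) * (nkw[, v] + beta) / (rowSums(nkw) + V * beta)
    k <- sample.int(K, 1, prob = p)
    z[[d]][i] <- k
    ndk[d, k] <- ndk[d, k] + 1; nkw[k, v] <- nkw[k, v] + 1
  }
}

# Estimated word distribution of each topic and its three most probable words
phi <- (nkw + beta) / (rowSums(nkw) + V * beta)
colnames(phi) <- vocab
top_terms <- apply(phi, 1, function(p) vocab[order(p, decreasing = TRUE)[1:3]])
print(top_terms)
```

Each column of `top_terms` plays the role of one line in a list of top words per topic. Naming the topic ("Family", "War" and so on) remains a manual step.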

It is worth noting how LDA differs from the CTM (correlated topic model): LDA draws the topic proportions from a Dirichlet distribution, whereas the CTM uses a logistic normal distribution. According to standard statistical results, the logistic normal distribution can model correlations between topics, which makes that model more expressive. (Efron, 1975)

Visualization of the topics is given in the graph named "Intertopic distance map". A detailed explanation of it is provided in the section "Results and Discussion".

4.4 Sentiment analysis

This kind of analysis, as it has been mentioned above, is needed to identify the effect of positive and negative words on the box office revenue. It is based on the division of words into "good" and "bad" using the dictionary "Mining and Summarizing Customer Reviews." (Hand, Keim, & Ng, 2002)

Initially, I counted the number of negative words, then the positive ones. Since the more words a trailer uses, the more sentiment words it will contain, it was decided to use the residuals of a regression, in other words, Log (number of characters) ~ log (sentiment), as in formula 2. This allows the study to show the contrast between the two groups of words (positive and negative) more clearly and makes the regression model more intuitive.

\log(\text{number of characters}) \sim \log(\text{sentiment}) \qquad (2)
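The dictionary count and the residual adjustment can be sketched in R as below. Several things here are assumptions made for illustration only: the two three-word lexicons stand in for the real dictionary, the four toy trailers are invented, text length is measured in words, one is added before taking the logarithm so that trailers without sentiment words do not produce log(0), and the sentiment count is taken as the response with the length as the predictor. The thesis writes the two terms of the formula in the opposite order, so the direction used here is only one possible reading.

```r
# Stand-in lexicons; the thesis uses a published opinion dictionary instead
positive_words <- c("good", "love", "hope")
negative_words <- c("bad", "kill", "fear")

# Toy trailers given as vectors of cleaned tokens (invented)
tokens <- list(
  t1 = c("good", "love", "day", "need"),
  t2 = c("kill", "fear", "bad", "good", "time", "find", "would", "could"),
  t3 = c("bad", "day"),
  t4 = c("hope", "good", "love", "kill", "ask", "use")
)

count_hits <- function(tok, lexicon) sum(tok %in% lexicon)
n_words <- lengths(tokens)
n_neg <- sapply(tokens, count_hits, lexicon = negative_words)
n_pos <- sapply(tokens, count_hits, lexicon = positive_words)

# Length-adjusted sentiment: the part of the log count not explained by log length
neg_adj <- resid(lm(log(n_neg + 1) ~ log(n_words)))
pos_adj <- resid(lm(log(n_pos + 1) ~ log(n_words)))
print(round(cbind(n_words, n_neg, n_pos, neg_adj, pos_adj), 3))
```

A positive residual marks a trailer that uses more words of that polarity than its length alone would suggest.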

4.5 Regression model

All the results of sentiment and topic modeling analysis were tested for significance by linear regression. Based on the results obtained, appropriate conclusions were made regarding the influence of variables on the films' box office.

5. Results and Discussion

5.1 Descriptive analysis of scripts

· The average number of words in trailers' scenario - 255 words

· The median of Word count in trailers - 248 words

· The biggest number of words used in trailers - 1724

· The lowest number of words used in trailers - 2

These numbers are depicted in the histogram below (figure 3, “Histogram of words used in trailers”), where WC means “word count” and the frequency axis shows the number of movie trailers.

Figure 3. Histogram of words used in trailers

As the histogram shows, about 300 movie trailers have fewer than 250 words; the lowest word counts belong to ~ 10 trailers that have almost zero words.

According to the word cloud, the most popular word used in trailers since 1992 was “F*ck” (this word was then removed from the data because of its wide variety of meanings). The other words are illustrated in the word cloud below (figure 4, "Word cloud"), where the darkness of a word reflects its frequency (the darker the word, the more frequently it is used in the trailers).

Figure 4. Word cloud of most frequent words used in trailers

After the second filtering of the data, the most popular words are those shown in table 1 “Popular words”.

Table 1

Popular words after the second data filtering

 No.   words   term.freq   doc.freq
  1    good      34787       1370
  2    need      22445       1366
  3    day       13998       1365
  4    dont      73642       1357
  5    said      13149       1357
  6    could     17600       1352
  7    find      10549       1351
  8    would     19704       1351
  9    ask        9259       1346
 10    use        8786       1346

Where term.freq is the number of times the word is used in the whole data,

doc.freq is the number of trailers in which the word appears.
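The difference between the two columns can be shown with a short R sketch on three invented token lists. The data frame `popular` and its ordering by doc.freq are assumptions that simply mirror the layout of Table 1.

```r
# Three toy trailers as token vectors (invented)
docs <- list(
  c("good", "need", "good", "day"),
  c("need", "day"),
  c("good", "day", "day")
)

# term.freq: how many times a word occurs in the whole data
term_freq <- table(unlist(docs))
# doc.freq: in how many trailers a word occurs at least once
doc_freq <- table(unlist(lapply(docs, unique)))

words <- names(term_freq)
popular <- data.frame(
  words     = words,
  term.freq = as.integer(term_freq[words]),
  doc.freq  = as.integer(doc_freq[words]),
  stringsAsFactors = FALSE
)
popular <- popular[order(-popular$doc.freq, popular$words), ]
print(popular)
```

A word can therefore have a high term.freq and a lower doc.freq when it is repeated many times inside a few scripts.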

5.2 Descriptive analysis of gross (income)

· The mean net gross of the movies - 72843139 USD

· The median net gross - 45462098 USD

· The highest net gross - 658672302 USD

· The lowest net gross - 12561 USD

Schematically, the data looks as follows (figure 5, “Histogram of gross”), where e is a mathematical constant (equal to 2.71828). (Efron, 1975)

Figure 5. The gross spreading among the movies

In the histogram in figure 5, the vertical axis depicts the number of movies, whereas the gross itself is on the horizontal axis.

5.3 Topic modeling results

The LDA model produced 15 topics with 30 words in each. It is interesting to note that many of these words occur in several topics. For example, the word "good" appears among the top words of 7 topics. The top words for each of the topics are listed below:

1. fire control gun war readi weve men order train air

2. laugh grunt chuckl play scream speak sigh ring groan door

3. captain master dave sea bird ship bit aye eh sail

4. dont good would must time father find need could lord

5. dont dr human m need time find good john would

6. run team presid state sam game today field america school

7. jack water mrs town red mama mari sing certain tree

8. dont good time need would god could didnt kid day

9. case murder death law honor prison lawyer court arrest la

10. gotta play watch huh ass world hit three car hell

11. dont good need find time frank would offic phone hell

12. lm dont lts find good book georg must may time

13. dont good time need didnt aint could would said money

14. facesansserif aii dont iik weii wiii derek good iii christma

15. san facemicrosoft serif paul tom yцu art walter ami lou

Based on these top words, I have chosen 5 topics for further checking:

1) Drama + relationship

2) Spying

3) Action (criminal)

4) Wonders

5) Adventures

These topics can be seen in detail in the following graphs, together with information about the movies' revenue.

5.3.1 Topic name: Drama + relationship

Terms: "mom", "househol", "girli", "cowgirl", "schoolgirl", "exgirlfrien", "war", "warm", "warrior", "warmer", "warmth", "hometown", "housekeep", "housewif", "housew", "song", "farther", "fatherson", "fatherinlaw", "belief", "human", "stayin", "meetin", "nicer", "finer", "finest"

Movie trailers with this topic and their net gross (US dollars)

[1] "The Croods" - 187165546

[2] "Dragon Blade" - 72413

[3] "Wrath of the Titans" - 83640426

[4] "Star Wars: Episode III - Revenge of the Sith" - 380262555

[5] "The Chronicles of Narnia: Prince Caspian" - 141614023

Figure 6. Map of “Drama + relationship” topic words

The Intertopic Distance Map (figure 6) reflects the words used in one common topic. As can be seen, the graph is divided into 2 parts. The first (left) part shows the distance between the topics. For example, if topic 1 is about drama and relationships, topic 3 (which stands close to 1) might be about love and family.

The distance between the circles reflects how close the topics are to each other. The bigger the circle, the more general its topic; the smaller the circle, the more specific its topic. (Cho, Bae, & Woo, 2017)

The second (right) side of the illustration lists the terms used in a particular topic. Each word has 2 bars. The first bar (red) shows the frequency of the term within the selected topic, while the second (blue) one shows the overall term frequency. There is also a regulator of the correspondence between these 2 bars, placed above the listed words as λ. (Cho et al., 2017)

The smaller λ is, the more specific the selected words are; the bigger it is, the more general the words selected for the topic. For example, with λ = 1 the topic drama would have the following words: bad, sad, blue, long, etc.; with a different value, the topic drama would have "mother", "hope", "death", "story" as its terms. So, in order to regulate all the words equally, it was decided to fix λ at a single value.
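The effect of the regulator can be reproduced with a few lines of R, assuming the map comes from an LDAvis-style tool, where the slider is called λ and terms are ranked by relevance = λ·log p(w | topic) + (1 − λ)·log(p(w | topic) / p(w)). The thesis does not name the tool, so this definition is an assumption, and the probabilities below are invented for illustration.

```r
# Relevance of each term for one topic, as defined for LDAvis-style term ranking
relevance <- function(phi_k, p_w, lambda) {
  lambda * log(phi_k) + (1 - lambda) * log(phi_k / p_w)
}

# Invented probabilities: phi_k = p(word | topic), p_w = overall p(word)
phi_k <- c(good = 0.30, time = 0.25, murder = 0.20, court = 0.15, prison = 0.10)
p_w   <- c(good = 0.40, time = 0.35, murder = 0.05, court = 0.04, prison = 0.03)

rank_terms <- function(lambda) {
  names(sort(relevance(phi_k, p_w, lambda), decreasing = TRUE))
}

top_general  <- rank_terms(1)[1:2]   # lambda = 1: ranked by frequency inside the topic
top_specific <- rank_terms(0)[1:2]   # lambda = 0: ranked by how exclusive the word is to the topic
print(top_general)
print(top_specific)
```

With the largest value of the regulator the common words "good" and "time" lead the list, while with the smallest value the topic-specific words "murder" and "court" come first, which matches the behaviour described above.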

5.3.2 Topic name: Spying

Terms: "weapon", "weaponri", "russian", "govern", "agent", "unit", "bomb", "comman", "control", "controll", "senat", "american", "target", "oper"

Movie trailers with this topic and their net gross (US dollars)

[1] "Cars 2" - 191450875

[2] "Zoolander" - 45162741

[3] "You, Me and Dupree" - 75604320

[4] "The Campaign" - 86897182

[5] "The Hunger Games: Catching Fire" - 424645577

Figure 7. Map of “Spying” topic words

5.3.3 Topic name: Action (criminal)

Terms: "mon", "cop", "helicopt", "bitch", "sonofabitch", "bitchin", "bitchass", "ass", "asshol", "shot", "shotgun", "gunshot", "babe", "gunman", "gunmen", "gunner", "motherfuck", "motherfuckin", "shoot", "fight", "brother", "game", "mike", "polic", "coach", "player"

Movie trailers with this topic and their net gross (US dollars)

[1] "Hardball" - 40219708

[2] "The Longest Yard" - 158115031

[3] "Remember the Titans" - 115648585

[4] "Semi-Pro" - 33472850

[5] "We Are Marshall" - 43532294

Figure 8. Map of “Action” topic words

5.3.4 Topic name: Wonders

Terms: "hearth", "mama", "heaven", "soul", "church", "faith", "pray", "spirit", "amen", "sea", "miracl", "miracul", "sing", "singer", "white", "smother", "holi"

Movie trailers with this topic and their net gross (US dollars)

[1] "The Apostle" - 20733485

[2] "The Prince of Egypt" - 101217900

[3] "Happy Feet Two" - 63992328

[4] "Happy Feet" - 197992827

[5] "Oscar and Lucinda" - 1508689

Figure 9. Map of “Wonders” topic words

5.3.5 Topic name: Adventures

Terms: "hole", "firecrack", "firework", "longlost", "captain", "million", "millionair", "water", "rop", "rope", "avi", "seat", "mark", "plane", "jump", "jumper","safe", "chief" ,"boat", "flight", "ship", "box"

Movie trailers with this topic and their net gross (US dollars)

[1] "U-571" - 77086030

[2] "Crimson Tide" - 91400000

[3] "Captain Phillips" - 107100855

[4] "Insurgent" - 129995817

[5] "The Adventures of Tintin" - 77564037

Figure 10. Map of “Adventures” topic words

All the listed topics were selected in order to check their influence on the movie gross using linear regression. Since the model produced 15 topics that were internally close to each other, I chose only 5. The reason is that these 5 topics are different from each other and share no similar words.

5.4 Sentiment analysis

The histogram of negative word frequencies shows that the average number of such words used in trailers equals 250 (figure 11). This is considerably lower than the same index for positive words, which reaches almost 400 (figure 12). It is interesting to note that the zero bins of the two graphs clearly differ: in the positive figure the zero bin holds around 40 trailers, whereas in the negative one it holds more than 150. Following this, we can suggest that motion studios mostly prefer to leave negative words in their trailers, although the total share shows that film trailers mostly contain words with a positive rather than a negative meaning.

Figure 11. The frequency of negative words used in movie trailers

Figure 12. The frequency of positive words used in movie trailers

It is noteworthy that these results were obtained without a log function, counting each word of the trailer. The adjusted version, which uses the logarithm formula, is sharper and more concentrated (figures 13 and 14). The adjustment is needed to avoid repetitions and inaccuracies in the regression, since the number of words is huge.

Figure 13. The density of positive words in trailers

Figure 14. The density of negative words in trailers

The results of these two graphs were further used in the regression model, which is explored in detail in the following section.

5.5 Regression model

At this stage, I tested the linear regression model using the variables mentioned above. The dependent variable y is the net gross of the movie, whereas x comprises the topics of the trailers obtained from the LDA model and the sentiments (negative and positive words). (Table 2)

lm(formula = netgross ~ drama + spys + crime + miracles + adventures)

Code in R for the linear regression model (3)

Table 2

Linear Regression results

Residuals:
        Min          1Q      Median          3Q         Max
 -115174142   -51360289   -23820121    26240972   566280552

Coefficients:
                Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     72895897      2145126    33.982   < 2e-16  ***
drama           -7635253     12543299    -0.609   0.54282
spys           -48559897     15104707    -3.215   0.00134  **
crime           67436031     10962507     6.152   1.01e-09 ***
miracles        36151306     11283348     3.204   0.00139  **
adventures     -30646393     11406985    -2.687   0.00730  **

Residual standard error: 79600000 on 1371 degrees of freedom

Multiple R-squared: 0.0446, Adjusted R-squared: 0.04112

F-statistic: 12.8 on 5 and 1371 DF, p-value: 3.488e-12
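The figures in Table 2 are tied together by standard formulas, so they can be cross-checked from the reported numbers alone. The R sketch below does this without access to the underlying data; the variable names are chosen for illustration. The t value is the estimate divided by its standard error, the number of observations follows from the residual degrees of freedom plus the six estimated coefficients, and the adjusted R-squared and F-statistic follow from the rounded R-squared.

```r
# Reported estimates and standard errors from Table 2
estimate <- c(`(Intercept)` = 72895897, drama = -7635253, spys = -48559897,
              crime = 67436031, miracles = 36151306, adventures = -30646393)
std_error <- c(2145126, 12543299, 15104707, 10962507, 11283348, 11406985)

# t value = estimate / standard error; two-sided p-value from the t distribution
t_value  <- round(estimate / std_error, 3)
df_resid <- 1371
p_value  <- 2 * pt(-abs(estimate / std_error), df_resid)

# Sample size implied by the degrees of freedom: 5 predictors plus the intercept
p <- 5
n <- df_resid + p + 1

# Adjusted R-squared and F-statistic recomputed from the rounded R-squared
r2     <- 0.0446
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(t_value)
print(signif(p_value, 3))
print(c(n = n, adj_r2 = round(adj_r2, 5), f_stat = round(f_stat, 2)))
```

The implied sample size equals the 1377 trailers obtained after merging the datasets, and the low R-squared shows that the five topics explain only a small share of the variation in net gross even though four of them are statistically significant.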

As can be seen from the output:

1) The trailers that have dramatic + relationship [drama] context in their texts have no effect on the films' box office revenue;

2) Spying [spys] topic in the trailers has a negative effect on the gross (making it decrease);

...
