Pages:
2 pages/≈550 words
Sources:
No Sources
Style:
APA
Subject:
IT & Computer Science
Type:
Other (Not Listed)
Language:
English (U.S.)
Document:
MS Word
Topic:

Project Report in RStudio

Other (Not Listed) Instructions:

Analyze the RStudio output (graphs) in a Word document. You can skip the introduction and go directly to the analysis. The code may be modified later to make the graphs more colorful and readable, based on the professor’s comments.
https://www(dot)kaggle(dot)com/datasets/ashishjangra27/imdb-top-250-movies
library(tidyverse)
library(dplyr)
library(ggplot2)
movies<-read_csv('movies.csv')
sum(is.na(movies$certificate))
sum(is.na(movies$duration))
#Movie "Das Boot" does not have certificate and duration.
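# Count of ratings
# after_stat(count) replaces the deprecated ..count.. notation in recent ggplot2 versions
p <- ggplot(movies, aes(imdb_rating)) + geom_bar() +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.1) +
  ggtitle('Frequency of Ratings in Top 250')
p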
p1 <- ggplot(movies, aes(year))+geom_bar()+ggtitle('Top Movies per Year')
p1
# Among the IMDb Top 250 (ranked by user reviews), most films were produced after 1971.
# Before 1971, not every year has a film in the Top 250,
# but after 1971, every year has at least one.
# The year 1996 has eight of these 250 films, the most of any year.
# Is a movie's rating influenced by its duration?
# (named rating_duration to avoid confusion with base R's length() function)
rating_duration <- select(movies, imdb_rating, duration)
rating_duration$imdb_rating <- as.character(rating_duration$imdb_rating)
p2 <- ggplot(rating_duration, aes(x = imdb_rating, y = duration)) + geom_boxplot() + ggtitle('Distribution of Duration by Rating')
p2
# This boxplot shows the relationship between a movie's duration and its imdb_rating, grouped by rating.
# Is a movie's rating influenced by its genre?
genre <- select(movies, imdb_rating, genre)
# Some movies list several comma-separated genres. strsplit() turns the column into a list,
# which ggplot cannot map, so separate_rows() (tidyr) is used to get one row per movie-genre pair.
genre <- separate_rows(genre, genre, sep = ',\\s*')
p3 <- ggplot(genre, aes(imdb_rating, genre)) + geom_point()
p3
# Is a movie's rating influenced by its release year?
year <- select(movies, imdb_rating, year)
p4 <- ggplot(year, aes(year, imdb_rating)) + geom_point() + geom_smooth() + ggtitle('Ratings by Year')
p4
# From this graph, there is no significant correlation between rating and year.
# Is a movie's rating influenced by its director?
director <- select(movies, imdb_rating, director_id, director_name)
# Number of Top 250 movies per director
director_counts <- director %>% count(director_name)
#ggplot(director,aes(director_id,imdb_rating))+geom_point()
# Keep only directors with more than one movie in the list
p5 <- ggplot(director_counts[director_counts$n > 1,], aes(x = director_name, y = n)) +
  geom_bar(stat = 'identity') + theme(axis.text.x = element_text(angle = 90)) +
  ggtitle('Directors with Multiple Movies in the Top 250') + ylab('Count')
p5
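The genre step in the script above runs into a list column after strsplit(). A minimal base-R sketch of one way to go from that list to one row per movie-genre pair is shown here. The `toy` data frame and its values are hypothetical stand-ins, not rows from movies.csv, and the comma-separated genre format is an assumption based on the script.

```r
# Hypothetical stand-in for the imdb_rating and genre columns of movies.csv
toy <- data.frame(
  imdb_rating = c(9.0, 8.5, 8.5),
  genre = c("Crime,Drama", "Drama", "Action,Crime,Drama"),
  stringsAsFactors = FALSE
)

# strsplit() returns a list with one character vector per movie
genres <- strsplit(toy$genre, ",\\s*")

# Repeat each rating once per genre, then flatten the list into a plain column
toy_long <- data.frame(
  imdb_rating = rep(toy$imdb_rating, lengths(genres)),
  genre = unlist(genres),
  stringsAsFactors = FALSE
)

# Mean rating per genre, and how many movies carry each genre
genre_means <- aggregate(imdb_rating ~ genre, data = toy_long, FUN = mean)
genre_counts <- table(toy_long$genre)

print(genre_means)
print(genre_counts)
```

A long data frame like `toy_long` can be passed to ggplot directly, for example as a boxplot of imdb_rating by genre, because every column is an ordinary vector.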
Introduction
Description of Data and Methodology
Visualizations and Analysis
Summary

Other (Not Listed) Sample Content Preview:

Project Report
Student Name
Department, University
Course Code: Course Name
Professor’s Name
Due Date
Introduction
The project uses the given dataset to generate graphical representations of the data in RStudio. The data used is the IMDb Top 250 movies dataset. This report describes the data and the methods used to generate the scatter plots and graphs.
Description of Data and Methodology
The dataset supported the statistical analysis carried out in RStudio and is described below. The first part has three fields: Year, no_of_movies, and movie_ids, that is, the year, the number of movies from that year, and their specific IDs. The list is arranged chronologically from 1921 to 2022. The data also includes ratings and votes, movie genres, and rankings.
Here are the first entries of the genre dataset:
{"Musical": {"tt0045152": "Singin’ in the Rain"}, "Horror": {"tt0054215": "Psycho", "tt0078748": "Alien", "tt0081505": "The Shining", "tt0084787": "The Thing", "tt1201607": "Harry Potter and the Deathly Hallows: Part 2", "tt0070047": "The Exorcist"}, "Mystery": {"tt0114369": "Se7en", "tt0054215": "Psycho"}
Personal details: This data contains the id, name, and frequency of people, with a total of 4,304 entries. A related dataset is person_movies_specific, which identifies the specific person requested by clicking on movie links. Each entry contains an id, a name, and a link. Here are a few entries:
id,name,link
nm0949985, Richard Young,/name/nm0949985
nm0238105,Robert Drivas,/name/nm0238105
nm0502425,Melissa Leo,/name/nm0502425
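As a rough illustration of how entries in this id,name,link layout can be loaded and checked, the sketch below parses the three rows listed above with base R. The variable names `csv_text` and `people` and the `link_matches_id` column are introduced here for illustration only. The check that each link is "/name/" followed by the id is an observation about these three rows, not a documented rule of the dataset.

```r
# The three sample entries, copied as listed (note the stray space before "Richard")
csv_text <- "id,name,link
nm0949985, Richard Young,/name/nm0949985
nm0238105,Robert Drivas,/name/nm0238105
nm0502425,Melissa Leo,/name/nm0502425"

# strip.white = TRUE trims whitespace around unquoted fields
people <- read.csv(text = csv_text, strip.white = TRUE, stringsAsFactors = FALSE)

# Flag whether each link is "/name/" followed by that row's id
people$link_matches_id <- people$link == paste0("/name/", people$id)

print(people)
```

For the full file, the same read.csv() call would take a file path instead of the `text` argument.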
Visualizations and Analysis
RStudio generates graphical representations of the data from user input. For this project, the following visualizations were exported from RStudio. After setup, the psych package is used to create a scatterplot matrix of the numeric variables.
pairs.panels(movies[,c(2:6)]); the statistical relationships among columns 2-6 were as follows:
The above matrix shows that year and imbd_votes have a strong relationship. The plots indicate a non-linear relationship between the two variables. The above-average values of the years ten...
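One way to check a claim that a relationship is strong but non-linear is to compare the Pearson correlation, which measures linear association, with the Spearman rank correlation, which measures monotonic association. The sketch below uses made-up numbers only: `year` and `votes` are hypothetical vectors with exponential growth, not values from the movies data, and the growth rate is an arbitrary assumption.

```r
# Hypothetical data: vote counts that grow exponentially with year (not real IMDb values)
year <- 1921:2022
votes <- 1000 * exp(0.05 * (year - 1921))

# Pearson measures linear association; Spearman works on ranks and so
# captures any monotonic trend, linear or not
pearson <- cor(year, votes, method = "pearson")
spearman <- cor(year, votes, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")
```

When Spearman is clearly higher than Pearson, as it is for this made-up curve, the relationship is monotonic but curved, and a log transform of the skewed variable is often worth plotting.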