Talking Stone Film

Film Reviews & Headlines


Jabril: Yeah, that was…
John-Green-bot: … the BEST movie ever!
Jabril: That’s not what I was gonna say. How about for the next movie night we pick a new movie that we’ll both probably like?
John-Green-bot: Maybe something romantic? How about Pride & Prejudice?
Jabril: Oh John-Green-bot… I’m going to need this. Okay, I think it’s time to make a movie recommender system AI. It’s the only hope for the future of John-Green-bot and my friendship… or at least our movie nights!

INTRO

Hey, I’m Jabril, and welcome to Crash Course AI. Last time, we introduced the idea of recommender
systems, which are AIs that use information about something and its social ratings to
recommend new things to people. These things can be ads, products, YouTube
videos, or pretty much anything like that. Today, I’m going to build a recommender
system for movies to hopefully find a new movie that both me and John-Green-bot want
to watch for our next movie night. Like in previous labs, I’ll be writing all
of the code using a language called Python in a tool called Google Colaboratory. And as you watch this video, you can follow
along with the code in your browser from the link we put in the description. In these Colaboratory files, there’s some
regular text explaining what we’re trying to do, and pieces of code that you can run
by pushing the play button. These pieces of code build on each other,
so keep in mind that you have to run them in order from top to bottom, otherwise you
might get an error. To actually run the code or make changes to
it, you’ll have to either click “open in playground” at the top of the page or
open the File menu and click “Save a Copy to Drive”. And just an FYI: you’ll need a Google account
for this. So, if I’m going to build a movie-recommending
AI, the first thing I know is that AI systems need data. I’ll need to find and import a dataset of
movies, and ideally it’ll already have ratings given by lots of different people to lots
of different movies, so I won’t have to go through and rank every single movie by
myself. That would take a while. Second, I’ll need to do some basic analysis. Let’s start by finding some generic recommendations,
like the top-rated movies in both John-Green-bot’s and my favorite genres. Maybe we’ll get lucky and find a movie we
both want to watch and haven’t seen yet on those lists. But… I don’t really have hope for that because
we like such different movies. So, third, John-Green-bot and I will need
to personalize this dataset by providing some of our own movie ratings. Fourth, I’ll use a technique known as user-user
collaborative filtering to generate a set of recommendations for both me and John-Green-bot. Hopefully there will be SOME overlap on those
recommendation lists. Alright, let’s get started. The first step is getting data. And just like other labs, I’m not going
to start from scratch. This time, I’m using an existing dataset
published by MovieLens, which has about 100,000 user ratings for about 10,000 different movies. MovieLens has bigger datasets available, going
up to tens of millions of ratings, but this smaller set should be enough to plan movie
nights for John-Green-bot and me. I’m also going to use a library known as
LensKit, which comes built-in with some nice tools for building recommender systems. So now, I’ve got data, but let’s make
sure I understand what data are even there. This code lets me see the first 10 rows of
the ratings dataset. There’s one important thing that I notice
about this dataset right away: how it handles missing data. Like, for example, here I can see that user
#1 gave a rating of 4.0 to item #1, and that they provided a rating of 4.0 to item #3. But I don’t see a rating for item #2 at
all. Most people don’t watch most movies, so
it makes sense that there would be missing data. And storing a bunch of zeros would take a
lot of space, so it’s good to know that MovieLens decided to avoid zeros in this dataset
by not storing unrated items at all. But the way it stores movie data isn’t super
useful for this current problem, because I want to know what these movies are! Not just ID numbers like “item #2.” John-Green-bot and I can’t exactly search
for “item #2” when we’re trying to rent a movie. Thankfully, the MovieLens dataset has more
than just “ratings.” It also contains a table called “movies” that
has a bunch of information about each of these items, like titles and genres. So we can get a better sense of the data by
joining the “ratings” and “movies” files. From now on, let’s include the genre and
title whenever I print results, because that’s much clearer. So I’m done with Step 1! Step 2 is getting some generic recommendations
from the MovieLens dataset, just to see what happens. Let’s just average the ratings for each
movie and print out a sorted list, with the best-rated movies at the top. Uh… Paper Birds? Bill Hicks: Revelations? I have no idea what these movies are… but
they’re supposed to be good? They all have a perfect 5.0 average rating. I would expect to see movies like Harry Potter
or Titanic or I dunno… The Avengers? So let’s look at the data and see why these
are perfect. Let’s add a count column to see how many
people rated these movies so highly… Okay, so these movies have a perfect 5.0 average
rating because only one person actually rated each of these! That doesn’t really help me pick what to
watch, because if I just wanted ONE person’s opinion, I’d ask a friend who knows me! We’re using the MovieLens dataset to get
a more general idea of good movies. So let’s try only sorting movies with at
least a certain number of ratings. This is kind of arbitrary, but I guess I’d
want at least 20 people to weigh in before I trust an average rating. Okay, now I’ve heard of most of these movies
and I trust that they’re actually sort of popular recommendations. But these movies are all sorts of genres,
so maybe I can narrow the list down a little more based on what John-Green-bot and I usually
watch. I like action movies and John-Green-Bot likes
romance movies. There’s actually one movie that’s on both
of our recommended lists: The Princess Bride!
Jabril: John-Green-bot, I’ve got the perfect movie. You’re gonna love it. It’s got a love story, sword fights, the greatest movie line of all time: “Hello, my name is Inigo Montoya, you killed my father, prepare to die…”
John-Green-bot: Seen it. Let’s watch something new.
*sighs*… Our lists don’t have any other
movies in common. So even though finding generic recommendations
is sort of helpful, our AI system hasn’t found us a new movie to watch together. What we’re facing is the cold-start problem
we talked about in the last video. The recommender system we’re programming
doesn’t know anything about John-Green-bot and me to make personalized recommendations. So for Step 3, it’s time to get personal! To personalize our recommender system AI,
we need to give it our own movie data. Okay, we’ve got two spreadsheets now, but
I don’t think that they’re in the right format for LensKit, so I need to check the
documentation, which is linked in the description. It looks like I need to import our spreadsheets
and store the data in item-rating pairs just like the original dataset. Thankfully, Python is great for changing data
formats. As a sanity check to make sure I coded everything
correctly, let’s print both of our ratings for The Princess Bride, since we know we’ve
both seen it. This all looks reasonable, so we’re done with Step 3! Remember, our goal is to program an AI to
give us personalized movie recommendations based on our ratings. So, to make this happen, I’ll implement
User-User Collaborative Filtering in Step 4. There are techniques like Item-Item Collaborative
Filtering, latent factor analysis, and others too, but the User-User approach is pretty
common and a nice first step to understanding recommender systems. In multiple episodes of Crash Course AI, we’ve
talked about visualizing AI features on a graph, whether it’s petal lengths on a flower
or weather and swimmers. As we add more features, we add more dimensions
to that graph. In user-user collaborative filtering, each
item is its own dimension. So if we have 10,000 movies in our dataset,
that’s 10,000 dimensions. We’re not even going to try to visualize
that, but we can understand the logic behind user-user collaborative filtering with a two-movie
example. To be totally honest, this is going to be
a pretty simplified explanation of what the user-user algorithm does. Dealing with thousands of dimensions and lots
of missing data requires a lot of clever linear algebra and statistics. But I can use the LensKit library to do this
math and understand what’s happening conceptually, without diving under the hood. So, okay, let’s say we have a graph where
one axis is the movie Inception and the other axis is The Notebook. And for this example, we’ll plot social
ratings on it from everyone who has seen and rated both movies, such as John-Green-bot,
me, and a bunch of other people in the MovieLens dataset. Some people may really like or hate both movies. I like Inception but dislike The Notebook,
and John-Green-bot is the opposite of me. The user-user algorithm will try to cluster
people who gave the movies similar ratings. This is a classic unsupervised learning approach,
except there isn’t a “correct” size for these clusters, so I have to set parameters. First, I have to set a minimum neighborhood
size, or the minimum number of people the algorithm should put in one cluster. Like, for example, if I set the minimum neighborhood
size to 5, when the algorithm looks for people similar to John-Green-bot, it may select this
neat cluster here. But if I set the minimum neighborhood size
higher, the algorithm may be forced to include some people who are less similar to each other
and John-Green-bot. I also have to set a maximum neighborhood
size, or the maximum number of people the algorithm should put in one cluster. Again, having clusters that are too big might
give recommendations that are too generic and don’t consider individual taste enough. After the algorithm has defined the cluster
of people who like these movies just about as much as John-Green-Bot, it can analyze
how those users have rated movies that John-Green-bot hasn’t seen yet, such as Casablanca. Now, this is a classic supervised learning
problem. The user-user algorithm trains on past data
from users in the cluster to guess how John-Green-bot would rate Casablanca. It might predict something like “4.6.” And then the algorithm will do the same thing
for all the other movies that his cluster-neighbors have seen and John-Green-bot hasn’t. In the end, I want the algorithm to give us
a sorted list of the top 10 movies John-Green-bot will probably like. There isn’t really a “best” minimum
and maximum neighborhood size. It really depends on what I want this AI to
recommend. Different parameters have different pros and
cons. A small neighborhood size would mean the AI
considers fewer people who have more similar movie tastes, and it has less data to make
predictions. So I’m more likely to run into the “Bill
Hicks: Revelations” situation from earlier, where surprising or obscure movies were recommended based on what only a few people like. A big neighborhood size would mean the AI
considers more people who have less similar movie tastes, and it has more data to make
predictions. So I’m more likely to get movie recommendations
that are generally popular and more widely known. Figuring out the best approach to clustering
requires a lot of tinkering. But if someone did work on it, they could
make a video streaming service that could recommend videos to billions of different
people online. YouTube. It’s a joke on YouTube if you didn’t get it. For this movie night AI, I’ll just set a
minimum neighborhood size of 3 and maximum size of 15, because those seem reasonable. But feel free to play around with those values
in your own code to see how it changes the recommendations. Now that the AI system has run the user-user
collaborative filtering algorithm and has clusters, I can give it our personal ratings to get its top 10 recommended movies for both John-Green-bot and me! Now we’re talking… show me what to watch! Remember, for each of us, the user-user algorithm
finds a neighborhood of similar users based on their movie ratings compared to ours. The algorithm looks for movies that people
in that neighborhood have seen, and rated, that we HAVEN’T seen yet. And based on the ratings in our neighborhoods,
the algorithm will predict how we might rate each of those movies, and print a list of
its “top 10” recommendations for us. So now we have thoughtful movie recommendations
by our newly programmed AI, but there’s still a huge problem. John-Green-bot and I have to AGREE on a movie
to watch, and our “top 10” lists don’t overlap at all because we like such different
things. We need another STEP! This is the beauty of representing movies
we like as lists of numbers! I can create a Jabril-Green-bot hybrid! Uh, but not a cyborg. Just a dataset. So if both of us have rated a movie, I’ll
use the average of our ratings. Using the two-axis graph of Inception and
The Notebook from before, this would place our Jabril-Green-bot hybrid around here. And if only one of us has rated a movie, I’ll
just add that movie rating to the list. I know this isn’t a perfect strategy. Like, it’s possible that I might hate some
movie that I haven’t seen but John-Green-Bot highly rated. But this keeps things simple, and it should
give a reasonable estimate across both of our ratings. Like always when I reorganize data with code,
I should do a quick sanity check. Let’s look at The Princess Bride again because
I rated it as a 4.5 and John-Green-bot rated it as a 3.5, so I’d expect our combined list
would have it as a 4. Looks like everything checks out! So now, I have a combined dataset of ratings
that I can plug right into our user-user collaborative filtering model from earlier. And I SHOULD get a ranked list of 10 movies
that we’ll both like! The number one recommendation is Submarine,
which seems to be a quirky movie from 2010. I’ve never heard of it, but I’m willing to
give it a try. If that’s too obscure for John-Green-bot,
we could pick a different recommendation from this list… like I’ve heard some good things about True Grit. In fact, all these movies seem like they might
have some stuff we both like. At this point, I could also go back to step
4.1 and select different settings for my clusters. Bigger neighborhoods would probably give me a list of better-known
movies. But that list may also be a little less
tailored to our individual interests. Anyway, we know what we’ll be watching this
weekend. Anyone can use our spreadsheets as a template
to enter their own preferences and see some recommendations for themselves and their friends. Of course, these spreadsheets don’t have
EVERY MOVIE EVER — that’s just one of the limits of our smaller dataset. By using one of the bigger datasets from MovieLens,
anyone can create a new set of spreadsheets for this project that does include more movies. But be warned that more movies will mean that
all the math will take a LOT longer to do before you get your recommendations! There’s also nothing that limits our algorithm
to just two people! You could combine a ten-person movie club
into one rating dataset to see what results it comes up with. Next time, we’ll take a look at a different
kind of recommendation that we use all the time: search engines. I’ll see ya then. Crash Course AI is produced in association
with PBS Digital Studios. If you want to help keep Crash Course free
for everyone, forever, you can join our community on Patreon. And if you want some more movie recommendations
along with analysis, check out Crash Course Film Criticism.
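
Companion sketch: the episode leans on LensKit for the real math, but the core ideas it walks through (filtering averages by a minimum number of ratings, averaging two people into one hybrid profile, and predicting a rating from a neighborhood of similar users) can be illustrated in plain Python. This is a simplified sketch under stated assumptions, not LensKit's implementation or the lab's actual code. The function names, the toy data, and the choice of mean-centered cosine similarity are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def top_rated(ratings, min_count=20):
    """Average rating per movie, keeping only movies with >= min_count ratings."""
    by_movie = defaultdict(list)
    for _user, movie, score in ratings:
        by_movie[movie].append(score)
    averages = [
        (movie, sum(scores) / len(scores), len(scores))
        for movie, scores in by_movie.items()
        if len(scores) >= min_count
    ]
    # Best average first; ties broken by number of ratings.
    return sorted(averages, key=lambda row: (row[1], row[2]), reverse=True)


def combine_profiles(a, b):
    """Hybrid profile: average shared ratings, keep single ratings unchanged."""
    combined = {}
    for movie in a.keys() | b.keys():
        scores = [profile[movie] for profile in (a, b) if movie in profile]
        combined[movie] = sum(scores) / len(scores)
    return combined


def similarity(a, b):
    """Cosine similarity of mean-centered ratings over movies both users rated."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    mean_a = sum(a.values()) / len(a)
    mean_b = sum(b.values()) / len(b)
    dot = sum((a[m] - mean_a) * (b[m] - mean_b) for m in shared)
    norm_a = sqrt(sum((a[m] - mean_a) ** 2 for m in shared))
    norm_b = sqrt(sum((b[m] - mean_b) ** 2 for m in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict(target, others, movie, min_nbrs=3, max_nbrs=15):
    """Predict the target's rating for a movie from similar users who rated it."""
    candidates = [
        (similarity(target, profile), profile)
        for profile in others.values()
        if movie in profile
    ]
    # Keep only positively similar users, most similar first, capped at max_nbrs.
    neighbors = sorted(
        (c for c in candidates if c[0] > 0), key=lambda c: c[0], reverse=True
    )[:max_nbrs]
    if len(neighbors) < min_nbrs:
        return None  # too few similar users to make a prediction
    target_mean = sum(target.values()) / len(target)
    # Similarity-weighted average of how far each neighbor rated the movie
    # above or below their own average. The result is not clamped to 0.5-5.0.
    weighted = sum(
        sim * (profile[movie] - sum(profile.values()) / len(profile))
        for sim, profile in neighbors
    )
    return target_mean + weighted / sum(sim for sim, _ in neighbors)


# The Princess Bride ratings come from the episode; Casablanca is made up.
jabril = {"The Princess Bride": 4.5}
john_green_bot = {"The Princess Bride": 3.5, "Casablanca": 5.0}
hybrid = combine_profiles(jabril, john_green_bot)
print(hybrid["The Princess Bride"])  # 4.0
```

The default `min_count=20`, `min_nbrs=3`, and `max_nbrs=15` mirror the values chosen in the episode. Raising `max_nbrs` pulls in less similar users, which tends to push predictions toward generally popular movies.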
