Steam Recommender

This post goes into some of the detail behind the site that generates game recommendations based on your Steam play history. I have received several questions about how I am able to generate the recommendations, so I hope to give a fair account of the methods that I use.

If you have spent any time working on recommender systems, then you have heard of the Netflix Prize. In 2009, Netflix awarded the BellKor's Pragmatic Chaos team one million dollars for a 10% improvement in movie and TV recommendations based on user ratings provided by Netflix. One outcome of the Netflix Prize was raised awareness of recommender systems and standardized ways of talking about recommendations.

Training data for the Netflix Prize consisted of a collection of triples: (user, item, rating), where user refers to a unique (anonymized, of course) user ID, item is the movie or TV show being rated, and rating is a quality score from one to five, with five being best. There are many open-source tools that work with this format or similar formats, e.g. GraphChi (with work by Danny Bickson), PredictionIO, MLlib, and others.

In the end, I decided not to go with one of these existing services. This project was for my personal enjoyment and so that I could learn more about building web services, so I decided to build my own recommender system that would plug in nicely with my service. In the next few sections I describe different aspects of the system as a whole.

Data Ingestion

The recommendations generated from the site are primarily based on an Amazon system that has been going strong for almost 20 years. You would recognize it as the panel beneath items in the Amazon store with the title “Customers Who Bought This Item Also Bought.” The idea is that this panel would represent what your friends would recommend you purchase, if your friends were the entire user base of Amazon.

In order to generate this type of personalized recommendation for video games, I searched for datasets where I could gather all of the games owned by a person and how well they liked each game. The Steam store by Valve came the closest, since they provide a web API for developers to build integrations. In a piece published on Ars Technica, Kyle Orland describes a technique for querying this API for random Steam users to build up a dataset. With this technique we are able to get all of the games owned by a Steam user, along with the number of minutes spent playing each game in the last two weeks as well as the total time spent playing it. Using similar techniques I am able to gather a sample of roughly 50k players every day.

Model Building

The data ingestion system runs uninterrupted, gathering more of these training samples every day. After a month of gathering publicly accessible Steam accounts, I have roughly 1.5 million profiles to construct recommendations from. It is important to note that Steam users still have control over their privacy. At any time they can set their profile to private, and I would then have no access to the games they own, nor to the amount of time they have spent playing those games.

The recommendation engine powering the site is based on the original Amazon paper: Amazon.com Recommendations: Item-to-Item Collaborative Filtering. The idea is to construct a matrix wherein the rows are the Steam users and each column is a different Steam game. The challenge is deciding what the rating should be for a user and a game. We briefly explore the different options and the intuition behind each below.

  1. ownership: In the ownership rating system, each cell in the matrix contains a one if the user owns the game and a zero if the user does not. Under this scheme, popular games are recommended more frequently simply because they happen to be owned by a larger number of people. Relying on ownership for a rating is limited. Just looking through my own Steam library, I realized that I had a rather large backlog of games that I owned but had not yet played. If you were to personally ask me about those games, I would not be able to recommend them because I had not played them. This insight led to my second option for a rating.

  2. played: In the played rating system, each cell in the matrix contains a one if the user has played the game and a zero otherwise. The user must have spent at least 30 minutes in the game before we consider it played. If you quit a game before spending 30 minutes in it, you either hate the game or are saving it to come back to later. In either case, I see that as the user not wanting to recommend that game. Recommendations from the played system are qualitatively better than those from the ownership system, since we can filter out impulse purchases from Steam sales and have a better signal for removing disliked games.

  3. inferred: For the inferred ratings, I go one step further and try to compute a 5-star rating for each game from your playtime. My 5-star rating system assumes you would give a 5-star rating to a game if you played it to completion or longer. I make some assumptions (that may or may not be true) in order to create inferred ratings. First, I assume that the majority of players will not finish a game. They may become bored, or may dislike the game, but either way they will not play through to completion. Second, I assume that there are some die-hard players who will continue to replay a game over and over simply because they love it. Based on these two assumptions, I decided that the 75th percentile of the total-playtime distribution (across all Steam users who played the game) would represent the expected playtime of a game that you loved playing and would happily recommend.

For every player, the inferred rating is computed by dividing the number of minutes spent playing the game by the larger of that playtime and the game's 75th percentile, and then multiplying by five (for five stars), which caps the rating at 5. For example, if you played a game for 120 minutes and the 75th percentile is 280 minutes, then the 5-star rating for this game would be (120 / max(280, 120)) * 5 ≈ 2.1.
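The inferred rating above can be expressed as a small function. This is a sketch of the computation as described, not the exact code running on the site:

```python
def inferred_rating(minutes_played, p75_minutes):
    """Infer a 0-5 star rating from playtime.

    Dividing by the larger of the two values caps the rating at 5 stars
    for players who played well past the 75th percentile.
    """
    return minutes_played / max(p75_minutes, minutes_played) * 5

print(inferred_rating(120, 280))  # ~2.14 stars
print(inferred_rating(900, 280))  # capped at 5.0 stars
```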

The inferred rating system is the one that is currently deployed on the website, as it generated the best qualitative results. Once you have a method for computing the ratings, building a model for generating recommendations is straightforward. I will try to provide the intuition behind the collaborative filter that I built and leave the details to the original paper. For a more gentle introduction to recommendation techniques, I recommend Programming Collective Intelligence.

To generate recommendations for a single user, we first gather the games that he/she has played by making a call to the Steam web API. The playtimes associated with each game are converted into inferred ratings and every game that has not been played is dropped. At this point we should have a collection of games that our Steam user has played enough to generate a rating. What we would like is a principled way to find games similar to these that other users have also found favorable.

That is where the power of the collective comes in handy. We iterate through all of the players and look at the pairs of games that they have played. If a player has two games that they have played and given high ratings to, then odds are another player who has played one of those games will enjoy the other one. If you do that over 1.5 million players, this signal becomes pretty clear. We convert this intuition into something more formal by defining a distance metric between two games as the cosine similarity between the rating vectors. An example will hopefully help clear up any confusion.

|       | Skyrim | Bioshock Infinite | Left 4 Dead 2 | Civilization V |
|-------|--------|-------------------|---------------|----------------|
| Jim   | 5      | 5                 |               | 3              |
| Sally |        | 5                 | 5             |                |
| Gina  | 2      | 3                 | 3             | 3              |
| Bob   | 1      | 1                 | 1             | 1              |

In this example, we can see that Jim either does not own or has not played Left 4 Dead 2 (L4D2), and the same goes for Sally with Skyrim and Civilization V. Both Gina and Bob have played all of the games, but it seems that Bob cannot find anything he really likes and tends to play games for short periods of time. Our recommendation system relies on recommending games similar to the ones that you have played before, so we compute the cosine similarity between all pairs of games. For example, suppose we want to compute the similarity between L4D2 and Bioshock Infinite. Jim has not played both games, so we do not include his ratings; the remaining rating vectors (Sally, Gina, Bob) are [5, 3, 1] and [5, 3, 1]. This example is not particularly interesting: since the vectors are identical, the similarity between them is one. Of course, with more examples this would not be the case.
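To make the distance metric concrete, here is a Python sketch of the cosine similarity computation using the example table (the data is just the example above, not anything from the real system):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# L4D2 and Bioshock Infinite ratings from Sally, Gina, and Bob
# (Jim is dropped because he has no L4D2 rating).
l4d2 = [5, 3, 1]
bioshock = [5, 3, 1]
print(cosine_similarity(l4d2, bioshock))  # identical vectors -> 1.0
```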

The result is a matrix of similarities between games. A single column represents a single game and its similarity to all of the other games. If someone had only played Sid Meier's Civilization V and was looking for new recommendations, then we could sort this column by the games most similar to Civilization V and generate decent recommendations (note: these recommendations were actually generated on 03/26/2015).

Sid Meier's Civilization V

Sid Meier's Civilization: Beyond Earth

Endless Legend

Sim City 4: Deluxe Edition

Empire: Total War

If someone has played more than one game, we weight each column by the rating that the user gave to that particular game, sum across all of the games the user has played, and sort to find the games most similar to this particular user's tastes.
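The weighted-sum step can be sketched as follows. The dictionary-of-dictionaries layout and the game names are my own simplification for illustration, not the site's actual data structures:

```python
def recommend(similarity, user_ratings, top_n=5):
    """Score unplayed games by rating-weighted similarity.

    similarity: {game: {other_game: cosine similarity}}
    user_ratings: {game: inferred star rating}
    """
    scores = {}
    for game, rating in user_ratings.items():
        for other, sim in similarity.get(game, {}).items():
            if other in user_ratings:
                continue  # never recommend a game the user already played
            scores[other] = scores.get(other, 0.0) + rating * sim
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

similarity = {"Skyrim": {"Fallout": 0.9, "FIFA": 0.1},
              "Civ V": {"Fallout": 0.2, "FIFA": 0.3}}
print(recommend(similarity, {"Skyrim": 5, "Civ V": 1}))  # ['Fallout', 'FIFA']
```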


This post has taken longer than I had originally planned and glosses over some details while getting into the gory details of others. Hopefully this post, mixed with some external reading, will make everything clear.

If you have built recommendations before, or happen to be a researcher in the field, you may wonder why I am using technology from the early 2000s instead of using methods like matrix factorization that helped win the Netflix prize. The primary reason is that I needed to be able to generate recommendations for users who were not in my sample set. I would have no guarantees that I have a factorization for a new user as they may have not been part of my initial training set. Of course, I could always add them to the set and redo the factorization, but that takes time and would result in a poor user experience.

The original idea was to build a website that showcased recommendations for a product area that I thought was woefully neglected. At the time, Valve had not launched their updated recommendation system and I often wondered what I should play next. Steam sales were how I would decide, because I would often purchase whatever was on sale. Given my background in recommender systems, I thought that there had to be a better way. Now the site has become a fun toy for me to think about different recommender challenges and sometimes discover new games to play.

Now that this post is finished, I will begin work on the matrix factorization blog post where I discuss computing game embeddings and using them to improve discoverability, search, and recommendations.

1000 Words

So, I’m trying to get better at writing and I have heard that the best way to do that is by actually writing. So, the goal that I am going to work towards is writing at least 1000 words per night during the week on some topic. Tonight I am going to ramble a bit about Steam and the Web API that they provide.

Steam Web API

For those of you not aware, I built and maintain a Steam game recommendation site. I originally thought I had a brilliant idea, because the recommendations within Steam were not that good and no other recommendation engines were making use of the data coming out of the Steam platform. So, I started working on an ingestion system that would (eventually) crawl all of the Steam users. Conveniently, Valve provides a web API against which I could make up to 100,000 requests per day.

My first idea was to generate a seed set of Steam users (which primarily consisted of my friends) and then make the right API calls to get their public details. In this case the public details consisted of all of the games that they owned, the number of minutes they have spent playing each, and a list of all of their friends. The first call would be to get all of the games for a known Steam ID.
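The original snippet is missing here, but the call looks roughly like the following. The endpoint shape comes from the public Steam Web API; the key and Steam ID are placeholders:

```python
STEAM_API = "https://api.steampowered.com"

def owned_games_url(api_key, steam_id):
    """URL for IPlayerService/GetOwnedGames: every game a user owns,
    with total (and recent) playtime in minutes."""
    return (f"{STEAM_API}/IPlayerService/GetOwnedGames/v0001/"
            f"?key={api_key}&steamid={steam_id}"
            "&include_played_free_games=1&format=json")

print(owned_games_url("YOUR_KEY", "76561197960435530"))
```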


The second call would help me to grow the graph of Steam users that I could crawl.
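Again the snippet is gone, but the friends call is along these lines (same caveats: the endpoint shape is from the public Steam Web API, the values are placeholders):

```python
STEAM_API = "https://api.steampowered.com"

def friend_list_url(api_key, steam_id):
    """URL for ISteamUser/GetFriendList: the (public) friends of a user,
    which is what lets the crawler grow its graph of Steam IDs."""
    return (f"{STEAM_API}/ISteamUser/GetFriendList/v0001/"
            f"?key={api_key}&steamid={steam_id}&relationship=friend")

print(friend_list_url("YOUR_KEY", "76561197960435530"))
```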


The friend graph within Steam consisted of far more isolated subgraphs than I had expected when I first started looking for users. I had to quickly devise new plans to find additional users so that I could continue growing the number of players in my samples. I soon discovered the Steam communities, where people could express interest in certain games and become part of their groups. At the time you could visit a page that contained a listing of every member of a group, so I generated a seed set of popular games and wrote some code to scrape the Steam IDs from these pages. Even with these additions, growth was still slow. (Note: Valve has since removed the group member listings.)

Another reason that I faced slow growth in the acquisition of new Steam IDs was that I wanted to ensure that I was always working with fresh data. Things change over time; users acquire new friends and new games. Therefore, I needed to build a part of the system that would refresh each user every so often in order to have the latest snapshot. With the 100,000-requests-per-day limit, some fraction of my requests went to refreshing Steam IDs whose data had gone stale, and another fraction went to the backlog of Steam IDs that had not been loaded yet. That left very few queries for gathering new Steam IDs through the friends API.

Despite these hurdles I was able to gather enough player information to build an item-item collaborative filter that generated reasonable recommendations. This allowed me to launch the website and begin playing with improving the recommendations.

Well hell. I only made it to 550 words on my first try. Hopefully with more dedication and effort, then tomorrow I can have a better turnout.

Procedural Forest Generation

Forest Generation

Before diving into how we are going to generate random forests in Slappy, I wanted to give a brief reminder of the type of game that we are building. Our player is stranded in an abandoned world that has been left to nature to rebuild. So, there are large swaths of open land and some old-growth forests that cover parts of the landscape. We already established that each level will be rectangular, and now the challenge will be to fill this area with forests without overwhelming the player.

So, why have forests in the first place? Forests provide natural cover for our enemies and so the player must choose to enter the forest and be prepared to deal with what is within. In addition to enemies hiding in the forest the player could find valuable items and useful power ups.


The forest algorithm is based on some research done during my time at the University of Wyoming. Our research lab had a grant to build a multi-agent simulator and demonstrate how decentralized control of flying swarms would allow us to provide surveillance for a forested area. In the summer of 2004, I began development on the simulator and needed to devise a way to generate random forests that felt natural and covered some percentage of a rectangular area (sound familiar?). For those interested, the original paper can be found here. The rest of this post will focus just on the forest generation algorithm.


We decided to build an algorithm that behaves very similarly to the cellular automata described in a previous post, but rather than cells switching immediately between on and off, we introduced the idea of seeding.

We start with a small set of trees that grow and seed the neighboring area. Each iteration, some seeds become trees and grow to begin expanding the seeded area. In the previous post on procedural cave generation, there were relatively few parameters that could be tuned to generate different caves, whereas this algorithm has more parameters that can be varied to create different types of forests.

  • nf - The initial number of trees in the forest.
  • Rs - The radius that seeds can travel.
  • Ss - The amount of seeds that fall during seeding.
  • ds - The decay rate of the seeds that fall.
  • Fc - The percentage of ground to be covered with forest.

In each iteration of the forest generation algorithm we go through a seeding phase, and then some portion of those seeds become trees for the next generation. Seeds that do not become trees decay according to the defined decay rate.

Initial Implementation

All of the code is available on my GitHub account.

The algorithm begins by initializing the map with a small set of trees. One caveat is that rather than being a single pixel, each tree is a collection of pixels, shown below.

Initialization of forest generation algorithm.

From the initial set of trees, we will “grow” a forest until we have covered some predefined percentage of the area. Each round of growth is divided into three steps. In the first step, existing seeds decay by ds. In the next section we will discuss what the decay is for in more depth, but the idea is that we don’t want seeds to pile up indefinitely in each grid cell, because in the second step we select new trees from the existing seeds. Grid cells with more seeds in them have a higher likelihood of becoming a tree during this phase.

The final step in the forest growth algorithm is to have all of the trees (including the new ones) seed the area around them for the next round of growth. The following image represents the effects of the seeding algorithm. Seeds are spread up to Rs away from the original tree. Think of this as the canopy of a mighty oak tree and how far away seeds can drop from the base of the tree. When two trees are close to each other and their canopies overlap, then there will be twice as many seeds that fall within that area increasing the likelihood that a tree will occur in the next round of growth.

Seeding of new trees.
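Putting the three steps together, one round of growth can be sketched like this. This is a Python sketch of the algorithm as described (the actual project is Java/libGDX), and the sprout probability model is my own assumption:

```python
import random

def forest_step(trees, seeds, Rs, Ss, ds, sprout_prob=0.05):
    """One round of growth: decay seeds, sprout trees, spread new seeds.

    trees: dict keyed by (x, y) grid cell; seeds: {(x, y): seed amount}.
    Parameter names (Rs, Ss, ds) match the list above.
    """
    # Step 1: existing seeds decay by ds.
    for cell in list(seeds):
        seeds[cell] *= (1.0 - ds)
    # Step 2: cells with more seeds are more likely to sprout a tree.
    for cell, amount in list(seeds.items()):
        if cell not in trees and random.random() < sprout_prob * amount:
            trees[cell] = True
    # Step 3: every tree (including new ones) drops Ss seeds on each
    # cell within radius Rs -- overlapping canopies stack their seeds.
    for (x, y) in trees:
        for dx in range(-Rs, Rs + 1):
            for dy in range(-Rs, Rs + 1):
                if dx * dx + dy * dy <= Rs * Rs:
                    c = (x + dx, y + dy)
                    seeds[c] = seeds.get(c, 0.0) + Ss
    return trees, seeds
```

An outer loop would call forest_step repeatedly until the fraction of tree cells reaches Fc.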

The following screenshots show the forest creation algorithm in action.

Small forest after just a few steps.
Same forest after a few more steps.

At this point we have reached a reasonable-looking forest. Since the forest generation algorithm is supposed to work in conjunction with the cave generation algorithm we outlined earlier, we should see what it looks like overlaid on top of a randomly generated level.

Random forest overlaid on the level.

As you can see, there is definite overlap between the forest and the empty areas within the level, which is what we desired. The forest is also not limited to the empty areas and can exist on top of filled-in areas. I deliberately chose this route and will need to consider how the art blends these two regions together.

One last thing to note is that the forest grid seems to be much finer than the grid we used to generate the level. This is because each grid cell in the level is supposed to be about 10m by 10m. We can definitely fit more than one tree in a grid of this size and that’s reflected in the granularity of the forest grid.


I want to try to capture what this algorithm is doing intuitively, so that you will not have to blindly vary the parameters; instead, you can set them and the forests it generates will be what you expect.

  • nf - The initial number of trees in the forest often determines how many clumps of trees you will have. If you have spent any time in a forest, you will notice that there are often stands of trees that are isolated from each other but densely populated. If we have a 10x10 grid representing our forest and we want 30% of the ground covered in forest, we could set the initial number of trees to 30 and have a uniform distribution of trees.

  • Rs - Setting the radius that seeds can travel to a small number means that the forest can only grow roughly one pixel at a time. So, you will end up with very tight clusters of trees in your forest, whereas if you set this number higher, your forest could still be very densely populated but you would have small areas of light shining through.

Random forest with _Rs_=4.
Random forest with _Rs_=20.
  • Ss - In each iteration, seed is spread across the seed radius with the strength determined by this parameter. The higher the value, the higher the probability that a cell will sprout a tree. If this number is high enough, each iteration will result in a new tree in every cell that is seeded. The result is a perfectly rounded forest.
Random forest with _Ss_=1.
  • ds - The decay rate prevents seeds from stacking up indefinitely until one becomes a tree. The idea is that if a tree does not sprout the first year, then that seed has a smaller chance of sprouting the next year, and so on. A very high decay rate should result in a forest with many more patches, as demonstrated below.
Random forest with _ds_=0.99.

The parameters should be relatively straightforward. Try it out for yourselves, modifying the parameters in the code and running it.

Running Demo

I included a demo of the system running (thanks to LibGDX HTML5 output).

Thanks again for taking the time to read this far, and feel free to send me any feedback.

Procedural Cave Generation

Level Generation

The key to level generation for Slappy is to assume that the player is not stranded in a wasteland à la the Fallout series, but rather in an abandoned part of the world. Around him are forests, open space, and enemies trying to kill him. In my head I imagine it being very similar to the movie/book I Am Legend, but with our hero choosing to survive in the woods rather than a major metropolitan area.

Each level will be rectangular in nature (easier to model with a computer) and the focus of this blog post is how I have decided to fill in the rectangular area with empty and filled space. The goal of level generation is to create an environment that feels natural and spacious.

Initial Research

Searching for “procedurally generated levels” on Google yields plenty of resources and tutorials. A large portion of these resources are similar to this one: Create a Procedurally Generated Dungeon Cave System. These tend to generate maps that feel like they were designed on graph paper. Whenever I see these I think back to my time playing Dungeons and Dragons with pen and paper. Lately, a lot of these levels have been associated with the rogue-like sub-genre within RPG/Action games.

The tutorials that caught my eye were the ones discussing cave generation that were not focused on rogue-likes. I would see images like this one:

Image Source

Looking at that image I can see the top down view of a level that provides freedom to explore the level as the player sees fit, and more importantly it is a level that I would easily believe could occur naturally.

Many of the “procedural cave generation” articles that I read referenced cellular automata, and one algorithm in particular: Cellular Automata Method for Generating Random Cave-Like Levels. I am not sure who first discovered this particular cellular automaton, nor who first applied it to procedural cave generation, but I will try to call out each of the interesting articles that helped me refine my level generator.

Initial Implementation

I have written a simple libGDX application for exploring different procedural generation algorithms that can be found on my GitHub account. Each blog post will have a tag so that you can download the specific version of the code that I’m referring to, since I plan to continually update the code.

The cave generation algorithm is based on a cellular automaton (Wikipedia). The level is represented as a two-dimensional matrix (boolean[][]) and each cell can be in one of two states (EMPTY or FILLED). The matrix is initialized randomly, with roughly 40% of the matrix starting out filled. The borders of the matrix are also initialized as filled to ensure that the player is unable to walk off of the map. A sample initialized matrix is visualized below.

Random initialization of level.

Now that our matrix is initialized we can begin evolving the cave structure by repeatedly applying some rules for deciding whether or not a cell stays filled or becomes empty. The rules are applied to each cell in the matrix independently. In the cellular automata literature, these rules can also be referred to as a birth rule and a survival rule. There are two inputs to the rules that we define, each one is a count of the number of neighbors within a specified range that are filled. A visual demonstration of the neighborhoods is shown below.

Cellular Automata neighborhoods.

It is important to point out that although it looks like the neighborhoods are disjoint, d=2 actually contains all of d=1.

The rules that I used are a combination of those found in these two articles: one and two. After initialization there are two phases to our cellular automaton. The first phase is responsible for carving out the main cave structure, and the second phase refines the structure and clears out some of the middle to create more open spaces.

We define the following rules:

if count(d_1) >= x
  cell = FILLED
else if count(d_2) <= y
  cell = FILLED
else
  cell = EMPTY
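As a concrete sketch, here is the update rule in Python (the project itself is Java/libGDX). Whether the center cell counts toward its own neighborhood is an assumption on my part; articles on this automaton differ on that detail:

```python
def ca_step(grid, x, y):
    """One automaton iteration. grid[r][c] is True when FILLED.

    A cell becomes FILLED if >= x cells within distance 1 are filled,
    or if <= y cells within distance 2 are filled; otherwise EMPTY.
    """
    h, w = len(grid), len(grid[0])

    def filled_within(r, c, d):
        # Count filled cells in the (2d+1)x(2d+1) box, excluding the center.
        return sum(grid[i][j]
                   for i in range(max(0, r - d), min(h, r + d + 1))
                   for j in range(max(0, c - d), min(w, c + d + 1))
                   if (i, j) != (r, c))

    return [[filled_within(r, c, 1) >= x or filled_within(r, c, 2) <= y
             for c in range(w)] for r in range(h)]
```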

In the first phase x=5 and y=2. After one iteration we have the following:

After one iteration of phase one.

And after a few more iterations we can see the cave structure emerging.

After four iterations of phase one.

In the second phase we set x=5 and y=-1, effectively shutting down the second conditional, since a cell cannot have a negative number of filled neighbors. This begins removing small chunks in the middle that do not have enough neighbors to stay around and opens up the space in the interior.

After one iteration of phase two.

And after a few more iterations.

After four iterations of phase two.

We have now reached a reasonable-looking cave, and we happened to be lucky enough that all of the chambers are connected. There is a small bit of code included with the cave generation code to ensure that all rooms are connected. The basic idea is to use a “flood fill” algorithm to identify all of the unique rooms in the level. Once we have identified the unique rooms, we begin a random walk towards the center of the map, clearing out filled cells until we hit another room.
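The flood-fill room labeling can be sketched like this (a Python sketch; the project code is Java):

```python
from collections import deque

def label_rooms(grid):
    """Find connected EMPTY regions; grid[r][c] is True when FILLED.

    Returns a list of rooms, each a set of (row, col) empty cells.
    """
    h, w = len(grid), len(grid[0])
    seen, rooms = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] or (r, c) in seen:
                continue
            room, queue = set(), deque([(r, c)])
            seen.add((r, c))
            while queue:  # breadth-first flood fill from (r, c)
                cr, cc = queue.popleft()
                room.add((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc),
                               (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not grid[nr][nc] and (nr, nc) not in seen):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            rooms.append(room)
    return rooms
```

If more than one room comes back, the connection step carves a tunnel from each extra room toward the map center.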

Running Demo

I included a demo of the system running (thanks to LibGDX HTML5 output).

Thanks for reading this far, and feel free to send me any feedback.

Game Idea

Codename Slappy

So, I’m terrible at naming things, but I need to give this game idea a name so that I can refer to it in future blog posts. Slappy is a top-down 2D game that tries to emulate all of the things that I loved about the Left 4 Dead series.

Another Zombie Game

The hope is that this is not another zombie game, but I have not figured out what to replace the brain-dead zombies with. Each level will be self contained and will require the player to navigate the unsafe world until they find the “goal.” In Left 4 Dead (L4D) this was a safe house and reaching the safe house triggered the next level in addition to some details about your performance on the current level.

Procedural Generation

I plan on relying on procedural generation in several parts of the gameplay experience. The first and most obvious, is procedurally generated levels. The plan for each level is for it to last between five and ten minutes so that you can get in, knock out a level, and continue on with your day.

As part of the level generation, I will also be implementing a forest generation algorithm that will provide cover / hiding areas for enemies. Things within the forest cannot be seen until you enter the forest, and even then you have only a limited range of sight. One fun challenge for level generation is developing the two algorithms so that they complement each other and provide a cohesive experience.

The last area where I can currently envision procedural generation is in the actual player experience. The L4D series ships with an AI Director which is responsible for creating enjoyable experiences for the player. The AI Director is designed to take into account the player’s abilities, current performance, position, mood, and tension to decide how many enemies to spawn, where to spawn them, etc. I plan on developing something similar because the AI of L4D has intrigued me for a very long time.

Other Details

The game will be developed in Java with the libGDX library and will be available on all major platforms (iOS, Android, Mac, Windows). This blog will host all of the major updates to the game until I decide on an actual name for the project and purchase the domain for it.