Developing a Steam Data Crawler
- Christian Haja

- Mar 6
- 4 min read
Updated: Jun 28
Introduction

I have always been driven by curiosity, constantly looking for ways to understand and grow. Recently, I remembered a project at SEAL.GAMES where I was part of a team that developed a Google Looker-based analytics dashboard to consolidate data from multiple mobile games across different platforms. I took the lead in defining key performance indicators, ensuring they aligned with business objectives, and designing the logic behind critical performance metrics. Additionally, I built data visualizations to transform raw data into actionable insights, improving decision-making across teams. All technical aspects were handled by my dear former colleague Tom Phillipp Möller, a Full-Stack magician.
That experience got me thinking about how I could extract insights from publicly available data sources, and while I had a strong foundation in data analytics, I recognized that my Python scripting skills needed improvement. Instead of letting that limit me, I embraced the challenge and decided to learn by doing.
So, I set out to build a Steam API data crawler, a tool capable of retrieving comprehensive data on all published games. My goal wasn’t just to collect raw information but to create a structured data set that can later be used in data visualization tools like Tableau Public. This blog post chronicles that journey, highlighting the challenges I encountered and the improvements I made along the way. Disclaimer: Use this guide and provided script at your own risk; I am not responsible for any issues, data loss, or API restrictions. The full script can be downloaded at the bottom.
Step 1: Getting started with your data collection project
Install the latest Python version
Check the box for "Add Python to PATH" (important!)*
Install Python packages:
requests (to fetch data from Steam API)
pandas (to process data)
openpyxl (to save data in Excel)
An IDE like Visual Studio or PyCharm
*"Add Python to PATH" allows you to run Python from any command prompt or terminal without needing to specify its full installation path.
Step 2: Setting Up the Steam API Crawler
The first step was to retrieve data from Steam's public API. Steam provides several endpoints, but the two most important ones for this project were:
GetAppList: Fetches a complete list of all games available on Steam.
GetAppDetails: Retrieves detailed information about each game, including pricing, user reviews, and supported platforms.
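Both endpoints return JSON. Here is a trimmed sketch of the two response shapes, reduced to the keys this project actually reads (the sample values are illustrative):

```python
# Trimmed shapes of the two JSON responses, based on what the script parses:
app_list = {"applist": {"apps": [{"appid": 10, "name": "Counter-Strike"}]}}
app_details = {"10": {"success": True, "data": {"name": "Counter-Strike", "type": "game"}}}

# Navigating to the useful parts:
apps = app_list["applist"]["apps"]  # list of {appid, name} pairs
data = app_details["10"]["data"]    # per-game detail dictionary, keyed by app ID as a string
```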
Initial Python Script
Since I was new to writing Python scripts, I started with a basic approach:
import requests
import pandas as pd
import time

APP_LIST_URL = "https://api.steampowered.com/ISteamApps/GetAppList/v2/"
APP_DETAILS_URL = "https://store.steampowered.com/api/appdetails"

def fetch_steam_games():
    response = requests.get(APP_LIST_URL)
    return response.json()["applist"]["apps"] if response.status_code == 200 else []

def fetch_game_details(app_id):
    params = {"appids": app_id}
    response = requests.get(APP_DETAILS_URL, params=params)
    return response.json().get(str(app_id), {}).get("data", None) if response.status_code == 200 else None

def main():
    games = fetch_steam_games()
    game_data_list = []
    for game in games[:100]:  # Fetching only 100 games initially
        game_data = fetch_game_details(game["appid"])
        if game_data:
            game_data_list.append(game_data)
        time.sleep(1)  # Avoid overwhelming the API
    df = pd.DataFrame(game_data_list)
    df.to_excel("steam_games.xlsx", index=False)
    print("Data saved to steam_games.xlsx")

if __name__ == "__main__":
    main()

This script fetches all game IDs and retrieves their details sequentially, storing them in an Excel file.
Step 3: Debugging Review Data Handling
After running the script, I encountered an issue:
AttributeError: 'str' object has no attribute 'get'

This error occurred because, in some cases, the reviews field was returned as a string instead of a dictionary. To prevent this, I modified the extract_game_info function:
reviews = game_data.get("reviews", {})
if not isinstance(reviews, dict):
    reviews = {}  # Ensure reviews is always a dictionary

This simple fix prevented the script from crashing due to inconsistencies in the API response format.
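The extract_game_info helper is referenced but not shown in full. A minimal sketch of what such a function might look like, with the defensive check in place (the field names follow Steam's appdetails payload, but the exact selection of fields here is my assumption):

```python
def extract_game_info(game_data):
    """Flatten one appdetails payload into a row; tolerate missing or oddly typed fields."""
    reviews = game_data.get("reviews", {})
    if not isinstance(reviews, dict):
        reviews = {}  # some apps return reviews as a plain string

    price_overview = game_data.get("price_overview") or {}
    return {
        "name": game_data.get("name"),
        "type": game_data.get("type"),
        "is_free": game_data.get("is_free", False),
        # price_overview["final"] is in minor currency units (e.g. cents)
        "price": price_overview.get("final", 0) / 100 if price_overview else None,
        "platforms": ", ".join(
            k for k, v in (game_data.get("platforms") or {}).items() if v
        ),
    }
```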
Step 4: Improving Performance with Parallel Requests
While the script worked, it was too slow because it fetched game details one at a time (time.sleep(1) made it even slower). To speed things up, I implemented concurrent requests using Python’s ThreadPoolExecutor.
Optimized Script with Multi-Threading
import concurrent.futures

NUM_THREADS = 10  # Number of parallel requests

def process_game(game):
    app_id = game["appid"]
    game_data = fetch_game_details(app_id)
    return game_data if game_data else None

def main():
    games = fetch_steam_games()
    game_data_list = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_THREADS) as executor:
        results = list(executor.map(process_game, games[:500]))
        game_data_list = [r for r in results if r]
    df = pd.DataFrame(game_data_list)
    df.to_excel("steam_games.xlsx", index=False)
    print("Data saved to steam_games.xlsx")

Performance Gains:
Original approach: ~8 minutes for 500 games
Optimized approach: ~50 seconds for 500 games
By running multiple API requests in parallel, I reduced execution time by 10x!
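One caveat: more parallel requests also means hitting Steam's rate limits sooner (the store API throttles aggressively, though the exact limits are undocumented). A sketch of a retry wrapper with exponential backoff that could stand in for fetch_game_details; the retried status codes and delay values are my assumptions, not documented Steam behavior:

```python
import time
import requests

APP_DETAILS_URL = "https://store.steampowered.com/api/appdetails"

def backoff_delays(max_retries=3, base_delay=2.0):
    """Exponential delays between attempts: 2s, 4s, 8s for the defaults."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]

def fetch_with_backoff(app_id, max_retries=3, base_delay=2.0):
    """Fetch appdetails, retrying on throttling (429) or server errors with backoff."""
    for delay in backoff_delays(max_retries, base_delay):
        response = requests.get(APP_DETAILS_URL, params={"appids": app_id}, timeout=10)
        if response.status_code == 200:
            return response.json().get(str(app_id), {}).get("data")
        if response.status_code in (429, 500, 502, 503):
            time.sleep(delay)  # back off, then retry
            continue
        return None  # any other status: give up on this app
    return None
```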
Step 5: Final Thoughts and Next Steps
This project was an excellent opportunity to explore API integration, error handling, and performance optimization. By testing different approaches and iterating on them, I turned an idea into a working solution and gained valuable experience in Python scripting and data collection.
What’s Next?
👉 Data storage: Excel sheets are a good start for smaller data sets, but as soon as you want to import large data sets into tools like Tableau Public, you quickly run into challenges. Switching to CSV could be a good option.
👉 Data Visualization: I plan to load the extracted data into Tableau Public for interactive insights.
👉 Expanding Data Scope: Fetching more economic & user data (e.g., owner estimates via SteamSpy API).
👉 Automating Updates: Scheduling the script to run automatically on a periodic basis.
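Of these, the CSV switch is the smallest change, since pandas already supports it. A sketch of the swapped export (the file name and encoding choice are mine):

```python
import pandas as pd

# A tiny stand-in frame; in the crawler this would be the collected game data
df = pd.DataFrame([{"appid": 10, "name": "Counter-Strike"}])

# utf-8-sig keeps umlauts and other special characters intact if the CSV
# is later opened in Excel; Tableau Public reads plain CSV files directly
df.to_csv("steam_games.csv", index=False, encoding="utf-8-sig")
```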
Would you like to try something similar, or do you have questions or suggestions? Let’s connect! Download the script: https://drive.google.com/file/d/1A0AFGKiMXW9U6Qr12TbHVq-CXaAui35F/view?usp=sharing

