Project Overview

League of Legends (LoL) is one of the most popular online multiplayer battle arena games, where two teams of five players compete to destroy the opposing team's base. The game features a ranked system to match players of similar skill levels for fair competition.


A major issue in ranked play is the presence of smurfs — experienced players who create new low-level accounts to play against less experienced opponents. This disrupts matchmaking, creates unfair games, and negatively impacts the experience of legitimate players.


This project aims to detect smurf accounts by analyzing player performance data. The pipeline involves scraping player rank data from op.gg, collecting match statistics via the Riot Games API, performing data cleaning and feature engineering (e.g., KDA ratio, gold per minute), and training a machine learning model to identify outlier accounts likely to be smurfs.

Riot API Integration

Verified Developer Account with Riot Games.

Created and managed personal project API keys.

Integrated Riot API for retrieving game and player data.

AWS Cloud Infrastructure

Set up AWS EC2 Instance (Amazon Linux 2023) tailored to security and project needs.

Configured Security Roles and Elastic IP for stable hosting.

Used the EC2 instance for:

Running background scripts
Hosting Docker containers and website
Handling file transfers between environments

Collaboration and Tools

Version Control: GitHub for repository management.

Collaboration: DeepNote for live Jupyter notebook development.

IDE & Local Development: Heavy usage of PyCharm.

File Transfers: Integrated AWS workflows for moving data.

Docker & Web Technologies

Containerized web applications with Docker.

Implemented routing for worldwide access.

Set up the web server using NGINX with an HTML/CSS front end.

League of Legends Domain Knowledge

Deep dive into the game mechanics:

Champion statistics
Item statistics
Gold generation dynamics
Player behavior and ranking systems

LiveClient Application

Developed an .exe application to:

Extract live match data at intervals
Store extracted data as JSON
Provide a GUI for monitoring game events
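
A minimal sketch of the polling loop, assuming Riot's local Live Client Data API endpoint (https://127.0.0.1:2999/liveclientdata/allgamedata); the interval, file naming, and error handling are illustrative, and the GUI layer is omitted.

    import json
    import os
    import time
    from datetime import datetime

    import requests
    import urllib3

    # The Live Client Data API is served locally by the running game client
    # over HTTPS with a self-signed certificate, so verification is disabled.
    LIVE_CLIENT_URL = "https://127.0.0.1:2999/liveclientdata/allgamedata"
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    def poll_live_client(interval_seconds=10, out_dir="live_data"):
        """Poll the local endpoint and store each snapshot as a JSON file."""
        os.makedirs(out_dir, exist_ok=True)
        while True:
            try:
                response = requests.get(LIVE_CLIENT_URL, verify=False, timeout=5)
                response.raise_for_status()
                stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                out_path = os.path.join(out_dir, f"snapshot_{stamp}.json")
                with open(out_path, "w", encoding="utf-8") as f:
                    json.dump(response.json(), f, indent=2)
            except requests.RequestException as exc:
                # No active game or the client is closed; keep retrying.
                print(f"Live client not reachable: {exc}")
            time.sleep(interval_seconds)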

Web Scraping & Data Collection

Built multiple web scrapers with BeautifulSoup for:

Player names
Champion stats
Item stats
Champion tiers
Analyzed dynamic websites and scraped HTML content for data enrichment.

Feature Engineering

Performed deep feature engineering using Python libraries (Pandas, NumPy, etc.).

Created new heuristic features and indirect attributes derived from multiple raw attributes.

Flattened nested JSON data into Pandas DataFrames for analysis.

Designed heuristic labels for weakly supervised classification.

Data Preprocessing

Applied imputation methods (mean, mode).

Encoding techniques: Ordinal Encoding and One-Hot Encoding.

Data normalization with StandardScaler.

Dimensionality reduction using PCA.
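
As a sketch of how these steps fit together in Scikit-Learn (the column groups and the 95% variance target for PCA are illustrative, not the project's exact configuration):

    from sklearn.compose import ColumnTransformer
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

    # Hypothetical column groups; the real project derives these from the match DataFrame.
    numeric_cols = ["kda", "goldPerMinute", "visionPerMinute"]
    nominal_cols = ["role"]
    ordinal_cols = ["tier"]

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),           # mean imputation
        ("scale", StandardScaler()),                          # normalization
    ])
    nominal_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    ordinal_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder()),
    ])

    preprocess = ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("nom", nominal_pipe, nominal_cols),
        ("ord", ordinal_pipe, ordinal_cols),
    ])

    # Full preprocessing pipeline ending in dimensionality reduction.
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ])
    # X_reduced = pipeline.fit_transform(df)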

Data Visualization & Analysis
Created visualizations for insights and validation:
Boxplots
Heatmaps
Confusion matrices
Deep analysis of in-game processes (e.g., gold generation, item usage).
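
For example, a boxplot of KDA per heuristic label and a feature-correlation heatmap can be produced with Seaborn; the file path and column names below are illustrative:

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # Hypothetical path; in practice the engineered match DataFrame is used.
    df = pd.read_csv("engineered_features.csv")

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Boxplot: KDA distribution per heuristic label.
    sns.boxplot(data=df, x="smurf_flag", y="kda", ax=axes[0])
    axes[0].set_title("KDA by heuristic label")

    # Heatmap: correlations between engineered features.
    features = ["kda", "killShare", "damageShare", "visionPerMinute"]
    sns.heatmap(df[features].corr(), annot=True, cmap="coolwarm", ax=axes[1])
    axes[1].set_title("Feature correlations")

    plt.tight_layout()
    plt.show()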

Machine Learning Models

Focus on Scikit-Learn and ML algorithms:

Anomaly detection models
IsolationForest
One-Class SVM
Local Outlier Factor
Autoencoders
Ensemble techniques
Bagging
RandomForest
XGBoost

Model Evaluation

Used ROC curves, confusion matrices, and classification reports for evaluation.

Compared predictions against heuristic labels to validate performance.
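
A minimal example of that comparison with Scikit-Learn metrics, assuming `y_test` holds the heuristic labels and `y_pred`/`y_score` come from a trained model:

    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

    # y_test: heuristic smurf labels; y_pred / y_score: model predictions and scores.
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_score))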

Project Structure & Documentation
Built a clean, GitHub-ready project:
Proper .gitignore
Structured directories
Detailed README.md
Added requirements.txt
Ensured reproducibility and collaborative workflow.

CRISP-DM Workflow

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment (Optional)

Define Goal: Detect Smurfs
Copy HTML blocks (op.gg)
Scrape player names (BeautifulSoup)
Fetch match data (Riot API JSON)
Data Cleaning in Jupyter
Feature Engineering (KDA, Winrate)
Train Model: Outlier Detection
Validate Precision & Recall
(Future: Integrate into Dashboard)

ReadMe Overview

Overview

League of Legends (LoL) is a popular multiplayer online battle arena game where two teams of five compete to destroy each other’s base. A recurring problem is smurfing — experienced players creating new low-ranked accounts to play against beginners, causing unfair matches.

This project detects smurfs by analyzing gameplay data. The pipeline includes scraping player ranks from op.gg, collecting match data via the Riot API, cleaning and enriching the data with ranks, engineering features like KDA ratio and gold per minute, and training outlier detection models (Isolation Forest, One-Class SVM, Local Outlier Factor, Neural Networks).

Setup & Preparation

  • RiotWatcher: a thin wrapper on top of the Riot Games API for League of Legends: https://riot-watcher.readthedocs.io/en/latest/
  • Pandas: powerful Python data analysis toolkit: https://pandas.pydata.org/docs/
  • NumPy: the fundamental package for scientific computing with Python: https://numpy.org/doc
  • MatPlotLib: a comprehensive library for creating static, animated, and interactive visualizations in Python: https://matplotlib.org/
  • Scikit-Learn: a Python module for machine learning built on top of SciPy: https://scikit-learn.org/
  • Python-DotEnv: reads key-value pairs from a .env file and can set them as environment variables: https://pypi.org/project/python-dotenv/
  • TensorFlow: an end-to-end open-source platform for machine learning: https://www.tensorflow.org/
  • XGBoost: an optimized distributed gradient boosting library: https://xgboost.readthedocs.io/

    Data Collection

    The Riot API is used to fetch match data; player names are first collected from OP.GG:

    OP.GG provides useful public leaderboards but is rendered dynamically (JavaScript)
    HTML blocks from OP.GG can therefore be copied and parsed offline using BeautifulSoup (`bs4`) in the script `OP_GG_name_scraper.py`
    
    from bs4 import BeautifulSoup
    import glob
    import os
    
    # Path to offline-saved HTML files
    path = os.path.join(os.path.dirname(__file__), "NameScrapeTxt")
    files = sorted(glob.glob(os.path.join(path, "*.txt")))
    
    players = []
    
    for file in files:
        with open(file, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        for row in soup.select("tr"):
            name = row.select_one("span.whitespace-pre-wrap.text-gray-900")
            tag = row.select_one("span.text-gray-500.truncate")
            if name and tag:
                players.append(f"{name.text.strip()}{tag.text.strip()}")
    
    # Remove duplicates and save results
    players = list(set(players))
    print(players)
    print(f"{len(players)} Spieler gefunden")
    
    with open("alle_spieler.txt", "w", encoding="utf-8") as f:
        for p in players:
            f.write(p + "\n")
                                
    Extracted summoner names are passed to `main_only_all_matches.py`
    This script uses `lol_watcher` and `riot_watcher` to retrieve the most recent match for each player
    Since each match contains 10 players, this approach helps expand the dataset naturally
    The resulting data is saved as a large `.json` file for further processing

    API Fetch Example
    
    from riotwatcher import LolWatcher, RiotWatcher
    from dotenv import load_dotenv
    import os, json, time
    from tqdm import tqdm
    
    # Load API key and region info
    load_dotenv()
    api_key = os.getenv("RIOT_API_KEY")
    platform = os.getenv("PLATFORM")
    region = os.getenv("REGION")
    
    lol_watcher = LolWatcher(api_key)
    riot_watcher = RiotWatcher(api_key)
    
    def riot_api_request(func, *args, max_retries=3, sleep=2, **kwargs):
        """Generic retry wrapper for Riot API calls"""
        for attempt in range(1, max_retries+1):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"[Retry {attempt}/{max_retries}] Error: {e}")
                if attempt == max_retries:
                    raise
                time.sleep(sleep)
    
    # Example: fetch recent match IDs for one player.
    # Note: account-v1 and match-v5 use regional routing (e.g. "europe"),
    # so the REGION value is used here rather than the platform (e.g. "euw1").
    account = riot_api_request(
        riot_watcher.account.by_riot_id,
        region, "SummonerName", "TAG"
    )
    match_ids = riot_api_request(
        lol_watcher.match.matchlist_by_puuid,
        region, account["puuid"], count=5
    )
    print(match_ids)
                        
    Feature Engineering

    This step transforms raw gameplay data into structured machine learning input. It includes several preprocessing steps and the design of domain-specific features to detect smurf behavior effectively.

    Missing Values: Missing numerical values are imputed using the mean or median; categorical columns are imputed using the most frequent value
    Column Filtering: Redundant columns and constant features are dropped to reduce noise and overfitting
    Encoding: Categorical features (such as `role`) are one-hot encoded, depending on the model context, whereas ordinal features are ordinal-encoded
    Outlier Filtering: Some filters (e.g., minimum game count or suspicious match duration) are used to remove invalid matches or rare outliers
    Type Conversion: Several string-based numbers are converted into `float` or `int` for computation

    The following features are among those derived or selected from the match data:

    kills
    deaths
    assists
    visionPerMinute
    skillshotsHit
    kda
    killShare
    turretPlatesTaken
    damageShare

    These features are designed to capture mechanical skill, team contribution, and player dominance in matches
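
    A sketch of how such features can be derived from a single Riot match-v5 JSON payload; the field names follow the match-v5 participant schema (adjust if the payload differs) and the helper name is illustrative:

    import pandas as pd

    def participant_features(match: dict) -> pd.DataFrame:
        """Derive per-player features from one match-v5 JSON payload."""
        info = match["info"]
        minutes = info["gameDuration"] / 60  # gameDuration is in seconds
        rows = []
        for p in info["participants"]:
            team = [q for q in info["participants"] if q["teamId"] == p["teamId"]]
            team_kills = sum(q["kills"] for q in team) or 1
            team_damage = sum(q["totalDamageDealtToChampions"] for q in team) or 1
            rows.append({
                "kills": p["kills"],
                "deaths": p["deaths"],
                "assists": p["assists"],
                "kda": (p["kills"] + p["assists"]) / max(p["deaths"], 1),
                "killShare": p["kills"] / team_kills,
                "damageShare": p["totalDamageDealtToChampions"] / team_damage,
                "visionPerMinute": p["visionScore"] / minutes,
                "goldPerMinute": p["goldEarned"] / minutes,
            })
        return pd.DataFrame(rows)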

    Heuristic Features

    To simulate smurf behavior (since no ground truth exists), we engineered a custom score (`smurf_score`) and binary label (`smurf_flag`) using domain knowledge and rules such as

    High APM or `kills` in early game
    Low summoner level but high winrate or performance
    High CS lead, vision score, or perfect game count
    Low number of deaths or ragequits (filtered via `earlySurrender` and `AFK`-proxy)
    First-to-level-6 events and fast item completion

    Features were combined into a weighted score to assign weakly supervised labels

    0: Normal player
    1: Smurf (high-performing anomaly)
    2: Suspicious/Boosted (potential low-quality anomaly)

    These labels are used as training or evaluation targets for classification and anomaly detection models
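
    A minimal sketch of this weak-labelling step; the weights, thresholds, and some column names (`summonerLevel`, `winrate`) are illustrative rather than the project's exact rules:

    import numpy as np
    import pandas as pd

    def add_heuristic_labels(df: pd.DataFrame) -> pd.DataFrame:
        """Combine rule-based signals into a weighted smurf_score and a weak label."""
        score = (
            0.30 * (df["kda"] > df["kda"].quantile(0.95)).astype(int)
            + 0.25 * ((df["summonerLevel"] < 50) & (df["winrate"] > 0.65)).astype(int)
            + 0.25 * (df["killShare"] > 0.4).astype(int)
            + 0.20 * (df["visionPerMinute"] > df["visionPerMinute"].quantile(0.9)).astype(int)
        )
        df["smurf_score"] = score
        df["smurf_flag"] = np.select(
            [score >= 0.6, score >= 0.4],
            [1, 2],        # 1 = smurf, 2 = suspicious/boosted
            default=0,     # 0 = normal player
        )
        return df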

    Model Training

    The training pipeline consists of two parallel approaches: unsupervised anomaly detection and supervised classification, both evaluated using the same weakly supervised label (`smurf_flag`)

    Unsupervised Ensemble (Anomaly Detection)
    Supervised Classification
    Models were trained on standardized features, reduced via PCA, and evaluated on held-out test sets
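
    A sketch of both approaches under those assumptions; `X` and `y` are assumed to be the engineered feature matrix and the weak `smurf_flag` labels, and the hyperparameters shown are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import IsolationForest, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Standardize, then reduce dimensionality with PCA.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))
    Z_test = pca.transform(scaler.transform(X_test))

    # Unsupervised ensemble: each detector votes "anomaly" (-1) or "normal" (+1).
    detectors = [
        IsolationForest(contamination=0.05, random_state=42).fit(Z_train),
        OneClassSVM(nu=0.05).fit(Z_train),
        LocalOutlierFactor(novelty=True, contamination=0.05).fit(Z_train),
    ]
    votes = np.stack([d.predict(Z_test) == -1 for d in detectors])
    anomaly_pred = (votes.sum(axis=0) >= 2).astype(int)  # majority vote -> 1 = anomaly

    # Supervised classifier trained directly on the weak labels.
    clf = RandomForestClassifier(random_state=42).fit(Z_train, y_train)
    supervised_pred = clf.predict(Z_test)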

    Evaluation

    All models — both supervised and unsupervised — are evaluated using standard classification metrics. Since ground truth labels are derived heuristically (`smurf_flag`), this step assesses how well the models can replicate or generalize the heuristic logic

    Classification Report
    Confusion Matrix
    ROC-AUC Score
    Voting Threshold Sensitivity
    PCA Robustness Sweep
    General Model Hyperparameter Tuning
    t-SNE Visualization
    Anomalies (red) are scattered throughout the feature space, not forming distinct or isolated clusters
    The majority of data points are normal cases (blue), indicating strong class imbalance
    Cheaters/anomalies are not easily separable from normal cases with the current features, making unsupervised detection challenging
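
    The projection behind these observations can be reproduced with a sketch like the following, assuming `Z` is the standardized feature matrix and `labels` is a NumPy array of `smurf_flag` values:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # 2-D embedding of the engineered features.
    embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(Z)

    plt.figure(figsize=(7, 6))
    plt.scatter(embedding[labels == 0, 0], embedding[labels == 0, 1],
                s=5, c="tab:blue", label="normal")
    plt.scatter(embedding[labels != 0, 0], embedding[labels != 0, 1],
                s=10, c="tab:red", label="anomaly")
    plt.legend()
    plt.title("t-SNE projection of engineered features")
    plt.show()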

    Requirements.txt
    
    absl-py==2.3.1
    anyio==4.10.0
    argon2-cffi==25.1.0
    argon2-cffi-bindings==25.1.0
    arrow==1.3.0
    asttokens==3.0.0
    astunparse==1.6.3
    async-lru==2.0.5
    attrs==25.3.0
    babel==2.17.0
    beautifulsoup4==4.13.4
    bleach==6.2.0
    bs4==0.0.2
    cabarchive==0.2.4
    certifi==2025.8.3
    cffi==1.17.1
    charset-normalizer==3.4.2
    colorama==0.4.6
    comm==0.2.3
    contourpy==1.3.3
    cx_Freeze==8.3.0
    cx_Logging==3.2.1
    cycler==0.12.1
    debugpy==1.8.16
    decorator==5.2.1
    defusedxml==0.7.1
    executing==2.2.0
    fastjsonschema==2.21.1
    filelock==3.18.0
    flatbuffers==25.2.10
    fonttools==4.59.0
    fqdn==1.5.1
    gast==0.6.0
    google-pasta==0.2.0
    grpcio==1.74.0
    h11==0.16.0
    h5py==3.14.0
    httpcore==1.0.9
    httpx==0.28.1
    idna==3.10
    ipykernel==6.30.1
    ipython==9.4.0
    ipython_pygments_lexers==1.1.1
    isoduration==20.11.0
    jedi==0.19.2
    Jinja2==3.1.6
    joblib==1.5.1
    json5==0.12.0
    jsonpointer==3.0.0
    jsonschema==4.25.0
    jsonschema-specifications==2025.4.1
    jupyter-events==0.12.0
    jupyter-lsp==2.2.6
    jupyter_client==8.6.3
    jupyter_core==5.8.1
    jupyter_server==2.16.0
    jupyter_server_terminals==0.5.3
    jupyterlab==4.4.5
    jupyterlab_pygments==0.3.0
    jupyterlab_server==2.27.3
    keras==3.11.1
    kiwisolver==1.4.8
    lark==1.2.2
    libclang==18.1.1
    lief==0.16.5
    Markdown==3.8.2
    markdown-it-py==3.0.0
    MarkupSafe==3.0.2
    matplotlib==3.10.5
    matplotlib-inline==0.1.7
    mdurl==0.1.2
    mistune==3.1.3
    ml_dtypes==0.5.3
    namex==0.1.0
    nbclient==0.10.2
    nbconvert==7.16.6
    nbformat==5.10.4
    nest-asyncio==1.6.0
    notebook==7.4.5
    notebook_shim==0.2.4
    numpy==2.3.2
    opt_einsum==3.4.0
    optree==0.17.0
    overrides==7.7.0
    packaging==25.0
    pandas==2.3.1
    pandas-stubs==2.3.0.250703
    pandocfilters==1.5.1
    parso==0.8.4
    pillow==11.3.0
    platformdirs==4.3.8
    prometheus_client==0.22.1
    prompt_toolkit==3.0.51
    protobuf==5.29.5
    psutil==7.0.0
    pure_eval==0.2.3
    pycparser==2.22
    Pygments==2.19.2
    pyparsing==3.2.3
    python-dateutil==2.9.0.post0
    python-dotenv==1.1.1
    python-json-logger==3.3.0
    pytz==2025.2
    pywin32==311
    pywinpty==2.0.15
    PyYAML==6.0.2
    pyzmq==27.0.1
    referencing==0.36.2
    requests==2.32.4
    rfc3339-validator==0.1.4
    rfc3986-validator==0.1.1
    rfc3987-syntax==1.1.0
    rich==14.1.0
    riotwatcher==3.3.1
    rpds-py==0.27.0
    scikit-learn==1.7.1
    scipy==1.16.1
    seaborn==0.13.2
    Send2Trash==1.8.3
    setuptools==80.4.0
    six==1.17.0
    sniffio==1.3.1
    soupsieve==2.7
    stack-data==0.6.3
    striprtf==0.0.29
    tensorboard==2.20.0
    tensorboard-data-server==0.7.2
    tensorflow==2.20.0rc0
    termcolor==3.1.0
    terminado==0.18.1
    threadpoolctl==3.6.0
    tinycss2==1.4.0
    tornado==6.5.1
    tqdm==4.67.1
    traitlets==5.14.3
    ttkbootstrap==1.14.2
    types-python-dateutil==2.9.0.20250708
    typing_extensions==4.14.1
    tzdata==2025.2
    uri-template==1.3.0
    urllib3==2.5.0
    wcwidth==0.2.13
    webcolors==24.11.1
    webencodings==0.5.1
    websocket-client==1.8.0
    Werkzeug==3.1.3
    wheel==0.45.1
    wrapt==1.17.2
    xgboost==3.0.3
                    
    Docker and Certification

    The primary goal was to run everything from a single Docker Compose file (docker-compose.yml). The only remaining issue was certificate issuance. The following command performs a one-time run of the Certbot client in a temporary Docker container to obtain a TLS/SSL certificate from the Let’s Encrypt Certificate Authority:

    
    docker run --rm \
      -v /data/compose/9/web:/usr/share/nginx/html \
      -v /data/compose/9/letsencrypt:/etc/letsencrypt \
      certbot/certbot:latest certonly \
        --non-interactive --agree-tos --keep-until-expiring \
        --email "*****@****.de" \
        --webroot -w /usr/share/nginx/html \
        -d eneemr.sabuncuoglu.de
                        

    The Docker Compose YAML:

    
    version: "3.8"
    
    services:
      nginx_http:
        image: nginx:1.27-alpine
        container_name: nginx_http
        restart: unless-stopped
        ports:
          - "80:80"  # Serve HTTP traffic on IPv4/IPv6
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html:ro  # Static web content (read-only)
    
      certbot:
        image: certbot/certbot:latest
        container_name: certbot
        restart: unless-stopped
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html          # Webroot for ACME HTTP-01 challenge
          - /data/compose/9/letsencrypt:/etc/letsencrypt       # Persistent certificate/key storage
        entrypoint: ["/bin/sh", "-lc"]
        command:
          - |
            set -e;  # Abort on any command error
            # Ensure required dirs exist
            mkdir -p /usr/share/nginx/html/.well-known/acme-challenge /etc/letsencrypt;
    
            # Initial certificate issuance (only if no cert present)
            if [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem ]; then
              echo "[certbot] requesting initial certificate for eneemr.sabuncuoglu.de";
              certbot certonly \
                --non-interactive --agree-tos --keep-until-expiring \
                --email "******@*****.de" \
                --webroot -w /usr/share/nginx/html \
                -d eneemr.sabuncuoglu.de || true;  # Ignore failure to avoid container crash
            fi;
    
            # Renewal loop: run every 12h
            while :; do
              # Attempt silent renewal
              certbot renew --webroot -w /usr/share/nginx/html --quiet || true;
              sleep 12h;
            done
    
      certbot_renew:
        image: certbot/certbot:latest
        container_name: certbot_renew
        restart: unless-stopped
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html
          - /data/compose/9/letsencrypt:/etc/letsencrypt
        command:
          - /bin/sh
          - -lc
          - |
            # Renewal-only loop — no initial issuance logic
            while :; do
              certbot renew --webroot -w /usr/share/nginx/html --quiet || true
              sleep 12h
            done
    
      nginx_https_redirect:
        image: nginx:1.27-alpine
        container_name: nginx_https_redirect
        restart: unless-stopped
        depends_on:
          - certbot  # Ensure certbot container has run at least once
        ports:
          - "443:443"  # Serve HTTPS connections
        volumes:
          - /data/compose/9/letsencrypt:/etc/letsencrypt:ro  # Read-only mount of issued certs
        entrypoint: ["/bin/sh", "-lc"]
        command:
          - |
            set -e;
    
            # Wait until certificate + key exist before starting Nginx
            while [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem ] ||
                  [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/privkey.pem ]; do
              echo "[https] waiting for certificates...";
              sleep 5;
            done;
    
            # Generate minimal nginx.conf: TLS termination with 308 redirect to HTTP
            printf '%s\n' '
            worker_processes auto;
            error_log  /var/log/nginx/error.log warn;
            pid        /var/run/nginx.pid;
            events { worker_connections 1024; }
            http {
              include       /etc/nginx/mime.types;
              default_type  application/octet-stream;
              access_log /var/log/nginx/access.log;
              sendfile on;
              keepalive_timeout 65;
              server_tokens off;
              server {
                listen 443 ssl http2;
                listen [::]:443 ssl http2;
                server_name eneemr.sabuncuoglu.de;
                ssl_certificate     /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem;
                ssl_certificate_key /etc/letsencrypt/live/eneemr.sabuncuoglu.de/privkey.pem;
                ssl_session_timeout 1d;
                ssl_session_cache shared:SSL:10m;
                ssl_protocols TLSv1.2 TLSv1.3;
                ssl_ciphers HIGH:!aNULL:!MD5;
                ssl_prefer_server_ciphers on;
                location / { return 308 http://$$host$$request_uri; }  # Preserve method/body in redirect
              }
            }' > /etc/nginx/nginx.conf;
    
            exec nginx -g 'daemon off;'
                        
    All redirects are 308 Permanent Redirect to preserve method and body
    No HSTS is set in the redirect endpoint to avoid pinning clients to HTTPS (by design, site is HTTP-first)
    On renewal, nginx must be reloaded (or the container restarted) to pick up the updated PEMs, since certificates are loaded at startup rather than on each TLS handshake

    GitHub Repository

    View Private Repository (Login Required)

    Ethics & Privacy

    Complies with Riot Games API terms and GDPR.
    No competitive advantage; research purpose only.
    Only anonymized match data stored; no personal info.

    PROJECT OWNERS

    Enes Sabuncuoglu
    • 2nd Semester Master Data Science

    Emre Tahir Tursun
    • 1st Semester Master Data Science