Project Overview

League of Legends (LoL) is one of the most popular online multiplayer battle arena games, where two teams of five players compete to destroy the opposing team's base. The game features a ranked system to match players of similar skill levels for fair competition.


A major issue in ranked play is the presence of smurfs — experienced players who create new low-level accounts to play against less experienced opponents. This disrupts matchmaking, creates unfair games, and negatively impacts the experience of legitimate players.


This project aims to detect smurf accounts by analyzing player performance data. The pipeline involves scraping player rank data from op.gg, collecting match statistics via the Riot Games API, performing data cleaning and feature engineering (e.g., KDA ratio, gold per minute), and training a machine learning model to identify outlier accounts likely to be smurfs.

Riot API Integration

Verified Developer Account with Riot Games.

Created and managed personal project API keys.

Integrated Riot API for retrieving game and player data.

AWS Cloud Infrastructure

Set up AWS EC2 Instance (Amazon Linux 2023) tailored to security and project needs.

Configured Security Roles and Elastic IP for stable hosting.

Used the EC2 instance for:

Running background scripts
Hosting Docker containers and website
Handling file transfers between environments

Collaboration and Tools

Version Control: GitHub for repository management.

Collaboration: DeepNote for live Jupyter notebook development.

IDE & Local Development: Heavy usage of PyCharm.

File Transfers: Integrated AWS workflows for moving data.

Docker & Web Technologies

Containerized web applications with Docker.

Implemented routing for worldwide access.

Set up the web server using NGINX with an HTML/CSS front end.

League of Legends Domain Knowledge

Deep dive into the game mechanics:

Champion statistics
Item statistics
Gold generation dynamics
Player behavior and ranking systems

LiveClient Application

Developed an .exe application to:

Extract live match data at intervals
Store extracted data as JSON
Provide a GUI for monitoring game events
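
A minimal sketch of the polling loop, assuming Riot's local Live Client Data API endpoint (https://127.0.0.1:2999/liveclientdata/allgamedata); the interval, file naming, and error handling are illustrative, and the GUI layer is omitted.

    import json
    import os
    import time
    from datetime import datetime

    import requests
    import urllib3

    # The Live Client Data API is served locally by the running game client
    # over HTTPS with a self-signed certificate, so verification is disabled.
    LIVE_CLIENT_URL = "https://127.0.0.1:2999/liveclientdata/allgamedata"
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    def poll_live_client(interval_seconds=10, out_dir="live_data"):
        """Poll the local endpoint and store each snapshot as a JSON file."""
        os.makedirs(out_dir, exist_ok=True)
        while True:
            try:
                response = requests.get(LIVE_CLIENT_URL, verify=False, timeout=5)
                response.raise_for_status()
                stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
                out_path = os.path.join(out_dir, f"snapshot_{stamp}.json")
                with open(out_path, "w", encoding="utf-8") as f:
                    json.dump(response.json(), f, indent=2)
            except requests.RequestException as exc:
                # No active game or the client is closed; keep retrying.
                print(f"Live client not reachable: {exc}")
            time.sleep(interval_seconds)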

Web Scraping & Data Collection

Built multiple web scrapers with BeautifulSoup for:

Player names
Champion stats
Item stats
Champion tiers
Analyzed dynamic websites and scraped HTML content for data enrichment.

Feature Engineering

Performed deep feature engineering using Python libraries (Pandas, NumPy, etc.).

Created new heuristic features and indirect attributes derived from multiple raw attributes.

Flattened nested JSON data into Pandas DataFrames for analysis.

Designed heuristic labels for weakly supervised classification.

Data Preprocessing

Applied imputation methods (mean, mode).

Encoding techniques: Ordinal Encoding and One-Hot Encoding.

Data normalization with StandardScaler.

Dimensionality reduction using PCA.
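
As a sketch of how these steps fit together in Scikit-Learn (the column groups and the 95% variance target for PCA are illustrative, not the project's exact configuration):

    from sklearn.compose import ColumnTransformer
    from sklearn.decomposition import PCA
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

    # Hypothetical column groups; the real project derives these from the match DataFrame.
    numeric_cols = ["kda", "goldPerMinute", "visionPerMinute"]
    nominal_cols = ["role"]
    ordinal_cols = ["tier"]

    numeric_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),           # mean imputation
        ("scale", StandardScaler()),                          # normalization
    ])
    nominal_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    ordinal_pipe = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OrdinalEncoder()),
    ])

    preprocess = ColumnTransformer([
        ("num", numeric_pipe, numeric_cols),
        ("nom", nominal_pipe, nominal_cols),
        ("ord", ordinal_pipe, ordinal_cols),
    ])

    # Full preprocessing pipeline ending in dimensionality reduction.
    pipeline = Pipeline([
        ("preprocess", preprocess),
        ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ])
    # X_reduced = pipeline.fit_transform(df)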

Data Visualization & Analysis
Created visualizations for insights and validation:
Boxplots
Heatmaps
Confusion matrices
Deep analysis of in-game processes (e.g., gold generation, item usage).
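
For example, a boxplot of KDA per heuristic label and a feature-correlation heatmap can be produced with Seaborn; the file path and column names below are illustrative:

    import matplotlib.pyplot as plt
    import pandas as pd
    import seaborn as sns

    # Hypothetical path; in practice the engineered match DataFrame is used.
    df = pd.read_csv("engineered_features.csv")

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    # Boxplot: KDA distribution per heuristic label.
    sns.boxplot(data=df, x="smurf_flag", y="kda", ax=axes[0])
    axes[0].set_title("KDA by heuristic label")

    # Heatmap: correlations between engineered features.
    features = ["kda", "killShare", "damageShare", "visionPerMinute"]
    sns.heatmap(df[features].corr(), annot=True, cmap="coolwarm", ax=axes[1])
    axes[1].set_title("Feature correlations")

    plt.tight_layout()
    plt.show()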

Machine Learning Models

Focus on Scikit-Learn and ML algorithms:

Anomaly detection models
IsolationForest
One-Class SVM
Local Outlier Factor
Autoencoders
Ensemble techniques
Bagging
RandomForest
XGBoost

Model Evaluation

Used ROC curves, confusion matrices, and classification reports for evaluation.

Compared predictions against heuristic labels to validate performance.
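
A minimal example of that comparison with Scikit-Learn metrics, assuming `y_test` holds the heuristic labels and `y_pred`/`y_score` come from a trained model:

    from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

    # y_test: heuristic smurf labels; y_pred / y_score: model predictions and scores.
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print("ROC-AUC:", roc_auc_score(y_test, y_score))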

Project Structure & Documentation
Built a clean, GitHub-ready project:
Proper .gitignore
Structured directories
Detailed README.md
Added requirements.txt
Ensured reproducibility and collaborative workflow.

CRISP-DM Workflow

Business Understanding → Data Understanding → Data Preparation → Modeling → Evaluation → Deployment (Optional)

Define Goal: Detect Smurfs
Copy HTML blocks (op.gg)
Scrape player names (BeautifulSoup)
Fetch match data (Riot API JSON)
Data Cleaning in Jupyter
Feature Engineering (KDA, Winrate)
Train Model: Outlier Detection
Validate Precision & Recall
(Future: Integrate into Dashboard)

ReadMe Overview

Overview

League of Legends (LoL) is a popular multiplayer online battle arena game where two teams of five compete to destroy each other’s base. A recurring problem is smurfing — experienced players creating new low-ranked accounts to play against beginners, causing unfair matches.

This project detects smurfs by analyzing gameplay data. The pipeline includes scraping player ranks from op.gg, collecting match data via the Riot API, cleaning and enriching the data with ranks, engineering features like KDA ratio and gold per minute, and training outlier detection models (Isolation Forest, One-Class SVM, Local Outlier Factor, Neural Networks).

Setup & Preparation

  • RiotWatcher: a thin wrapper on top of the Riot Games API for League of Legends: https://riot-watcher.readthedocs.io/en/latest/
  • Pandas: powerful Python data analysis toolkit: https://pandas.pydata.org/docs/
  • NumPy: the fundamental package for scientific computing with Python: https://numpy.org/doc
  • MatPlotLib: a comprehensive library for creating static, animated, and interactive visualizations in Python: https://matplotlib.org/
  • Scikit-Learn: a Python module for machine learning built on top of SciPy: https://scikit-learn.org/
  • Python-DotEnv: reads key-value pairs from a .env file and can set them as environment variables: https://pypi.org/project/python-dotenv/
  • TensorFlow: an end-to-end open-source platform for machine learning: https://www.tensorflow.org/
  • XGBoost: an optimized distributed gradient boosting library: https://xgboost.readthedocs.io/

    Data Collection

    The Riot API is used to fetch match data; player names are first collected from OP.GG:

    OP.GG provides useful public leaderboards but is rendered dynamically (JavaScript)
    HTML blocks from OP.GG can therefore be copied and parsed offline using BeautifulSoup (`bs4`) in the script `OP_GG_name_scraper.py`
    
    from bs4 import BeautifulSoup
    import glob
    import os
    
    # Path to offline-saved HTML files
    path = os.path.join(os.path.dirname(__file__), "NameScrapeTxt")
    files = sorted(glob.glob(os.path.join(path, "*.txt")))
    
    players = []
    
    for file in files:
        with open(file, "r", encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "html.parser")
        for row in soup.select("tr"):
            name = row.select_one("span.whitespace-pre-wrap.text-gray-900")
            tag = row.select_one("span.text-gray-500.truncate")
            if name and tag:
                players.append(f"{name.text.strip()}{tag.text.strip()}")
    
    # Remove duplicates and save results
    players = list(set(players))
    print(players)
    print(f"{len(players)} Spieler gefunden")
    
    with open("alle_spieler.txt", "w", encoding="utf-8") as f:
        for p in players:
            f.write(p + "\n")
                                
    Extracted summoner names are passed to `main_only_all_matches.py`
    This script uses `lol_watcher` and `riot_watcher` to retrieve the most recent match for each player
    Since each match contains 10 players, this approach helps expand the dataset naturally
    The resulting data is saved as a large `.json` file for further processing

    API Fetch Example
    
    from riotwatcher import LolWatcher, RiotWatcher
    from dotenv import load_dotenv
    import os, json, time
    from tqdm import tqdm
    
    # Load API key and region info
    load_dotenv()
    api_key = os.getenv("RIOT_API_KEY")
    platform = os.getenv("PLATFORM")
    region = os.getenv("REGION")
    
    lol_watcher = LolWatcher(api_key)
    riot_watcher = RiotWatcher(api_key)
    
    def riot_api_request(func, *args, max_retries=3, sleep=2, **kwargs):
        """Generic retry wrapper for Riot API calls"""
        for attempt in range(1, max_retries+1):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print(f"[Retry {attempt}/{max_retries}] Error: {e}")
                if attempt == max_retries:
                    raise
                time.sleep(sleep)
    
    # Example: fetch recent match IDs for one player.
    # Note: account-v1 and match-v5 use regional routing (e.g. "europe"),
    # so the REGION value is used here rather than the platform (e.g. "euw1").
    account = riot_api_request(
        riot_watcher.account.by_riot_id,
        region, "SummonerName", "TAG"
    )
    match_ids = riot_api_request(
        lol_watcher.match.matchlist_by_puuid,
        region, account["puuid"], count=5
    )
    print(match_ids)
                        
    Feature Engineering

    This step transforms raw gameplay data into structured machine learning input. It includes several preprocessing steps and the design of domain-specific features to detect smurf behavior effectively.

    Missing Values: Missing numerical values are imputed using the mean or median; categorical columns are imputed using the most frequent value
    Column Filtering: Redundant columns and constant features are dropped to reduce noise and overfitting
    Encoding: Categorical features (such as `role`) are one-hot encoded, depending on the model context, whereas ordinal features are ordinal-encoded
    Outlier Filtering: Some filters (e.g., minimum game count or suspicious match duration) are used to remove invalid matches or rare outliers
    Type Conversion: Several string-based numbers are converted into `float` or `int` for computation

    The following features are among those derived or selected from the match data:

    kills
    deaths
    assists
    visionPerMinute
    skillshotsHit
    kda
    killShare
    turretPlatesTaken
    damageShare

    These features are designed to capture mechanical skill, team contribution, and player dominance in matches
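
    A sketch of how such features can be derived from a single Riot match-v5 JSON payload; the field names follow the match-v5 participant schema (adjust if the payload differs) and the helper name is illustrative:

    import pandas as pd

    def participant_features(match: dict) -> pd.DataFrame:
        """Derive per-player features from one match-v5 JSON payload."""
        info = match["info"]
        minutes = info["gameDuration"] / 60  # gameDuration is in seconds
        rows = []
        for p in info["participants"]:
            team = [q for q in info["participants"] if q["teamId"] == p["teamId"]]
            team_kills = sum(q["kills"] for q in team) or 1
            team_damage = sum(q["totalDamageDealtToChampions"] for q in team) or 1
            rows.append({
                "kills": p["kills"],
                "deaths": p["deaths"],
                "assists": p["assists"],
                "kda": (p["kills"] + p["assists"]) / max(p["deaths"], 1),
                "killShare": p["kills"] / team_kills,
                "damageShare": p["totalDamageDealtToChampions"] / team_damage,
                "visionPerMinute": p["visionScore"] / minutes,
                "goldPerMinute": p["goldEarned"] / minutes,
            })
        return pd.DataFrame(rows)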

    Heuristic Features

    To simulate smurf behavior (since no ground truth exists), we engineered a custom score (`smurf_score`) and binary label (`smurf_flag`) using domain knowledge and rules such as

    High APM or `kills` in early game
    Low summoner level but high winrate or performance
    High CS lead, vision score, or perfect game count
    Low number of deaths or ragequits (filtered via `earlySurrender` and `AFK`-proxy)
    First-to-level-6 events and fast item completion

    Features were combined into a weighted score to assign weakly supervised labels

    0: Normal player
    1: Smurf (high-performing anomaly)
    2: Suspicious/Boosted (potential low-quality anomaly)

    These labels are used as training or evaluation targets for classification and anomaly detection models
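
    A minimal sketch of this weak-labelling step; the weights, thresholds, and some column names (`summonerLevel`, `winrate`) are illustrative rather than the project's exact rules:

    import numpy as np
    import pandas as pd

    def add_heuristic_labels(df: pd.DataFrame) -> pd.DataFrame:
        """Combine rule-based signals into a weighted smurf_score and a weak label."""
        score = (
            0.30 * (df["kda"] > df["kda"].quantile(0.95)).astype(int)
            + 0.25 * ((df["summonerLevel"] < 50) & (df["winrate"] > 0.65)).astype(int)
            + 0.25 * (df["killShare"] > 0.4).astype(int)
            + 0.20 * (df["visionPerMinute"] > df["visionPerMinute"].quantile(0.9)).astype(int)
        )
        df["smurf_score"] = score
        df["smurf_flag"] = np.select(
            [score >= 0.6, score >= 0.4],
            [1, 2],        # 1 = smurf, 2 = suspicious/boosted
            default=0,     # 0 = normal player
        )
        return df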

    Model Training

    The training pipeline consists of two parallel approaches: unsupervised anomaly detection and supervised classification, both evaluated using the same weakly supervised label (`smurf_flag`)

    Unsupervised Ensemble (Anomaly Detection)
    Supervised Classification
    Models were trained on standardized features, reduced via PCA, and evaluated on held-out test sets
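
    A sketch of both approaches under those assumptions; `X` and `y` are assumed to be the engineered feature matrix and the weak `smurf_flag` labels, and the hyperparameters shown are illustrative:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import IsolationForest, RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import LocalOutlierFactor
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import OneClassSVM

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )

    # Standardize, then reduce dimensionality with PCA.
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=0.95).fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))
    Z_test = pca.transform(scaler.transform(X_test))

    # Unsupervised ensemble: each detector votes "anomaly" (-1) or "normal" (+1).
    detectors = [
        IsolationForest(contamination=0.05, random_state=42).fit(Z_train),
        OneClassSVM(nu=0.05).fit(Z_train),
        LocalOutlierFactor(novelty=True, contamination=0.05).fit(Z_train),
    ]
    votes = np.stack([d.predict(Z_test) == -1 for d in detectors])
    anomaly_pred = (votes.sum(axis=0) >= 2).astype(int)  # majority vote -> 1 = anomaly

    # Supervised classifier trained directly on the weak labels.
    clf = RandomForestClassifier(random_state=42).fit(Z_train, y_train)
    supervised_pred = clf.predict(Z_test)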

    Evaluation

    All models — both supervised and unsupervised — are evaluated using standard classification metrics. Since ground truth labels are derived heuristically (`smurf_flag`), this step assesses how well the models can replicate or generalize the heuristic logic

    Classification Report
    Confusion Matrix
    ROC-AUC Score
    Voting Threshold Sensitivity
    PCA Robustness Sweep
    General Model Hyperparameter Tuning
    t-SNE Visualization
    Anomalies (red) are scattered throughout the feature space, not forming distinct or isolated clusters
    The majority of data points are normal cases (blue), indicating strong class imbalance
    Cheaters/anomalies are not easily separable from normal cases with the current features, making unsupervised detection challenging
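
    The projection behind these observations can be reproduced with a sketch like the following, assuming `Z` is the standardized feature matrix and `labels` is a NumPy array of `smurf_flag` values:

    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    # 2-D embedding of the engineered features.
    embedding = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(Z)

    plt.figure(figsize=(7, 6))
    plt.scatter(embedding[labels == 0, 0], embedding[labels == 0, 1],
                s=5, c="tab:blue", label="normal")
    plt.scatter(embedding[labels != 0, 0], embedding[labels != 0, 1],
                s=10, c="tab:red", label="anomaly")
    plt.legend()
    plt.title("t-SNE projection of engineered features")
    plt.show()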

    Requirements.txt
    
    absl-py==2.3.1
    anyio==4.10.0
    argon2-cffi==25.1.0
    argon2-cffi-bindings==25.1.0
    arrow==1.3.0
    asttokens==3.0.0
    astunparse==1.6.3
    async-lru==2.0.5
    attrs==25.3.0
    babel==2.17.0
    beautifulsoup4==4.13.4
    bleach==6.2.0
    bs4==0.0.2
    cabarchive==0.2.4
    certifi==2025.8.3
    cffi==1.17.1
    charset-normalizer==3.4.2
    colorama==0.4.6
    comm==0.2.3
    contourpy==1.3.3
    cx_Freeze==8.3.0
    cx_Logging==3.2.1
    cycler==0.12.1
    debugpy==1.8.16
    decorator==5.2.1
    defusedxml==0.7.1
    executing==2.2.0
    fastjsonschema==2.21.1
    filelock==3.18.0
    flatbuffers==25.2.10
    fonttools==4.59.0
    fqdn==1.5.1
    gast==0.6.0
    google-pasta==0.2.0
    grpcio==1.74.0
    h11==0.16.0
    h5py==3.14.0
    httpcore==1.0.9
    httpx==0.28.1
    idna==3.10
    ipykernel==6.30.1
    ipython==9.4.0
    ipython_pygments_lexers==1.1.1
    isoduration==20.11.0
    jedi==0.19.2
    Jinja2==3.1.6
    joblib==1.5.1
    json5==0.12.0
    jsonpointer==3.0.0
    jsonschema==4.25.0
    jsonschema-specifications==2025.4.1
    jupyter-events==0.12.0
    jupyter-lsp==2.2.6
    jupyter_client==8.6.3
    jupyter_core==5.8.1
    jupyter_server==2.16.0
    jupyter_server_terminals==0.5.3
    jupyterlab==4.4.5
    jupyterlab_pygments==0.3.0
    jupyterlab_server==2.27.3
    keras==3.11.1
    kiwisolver==1.4.8
    lark==1.2.2
    libclang==18.1.1
    lief==0.16.5
    Markdown==3.8.2
    markdown-it-py==3.0.0
    MarkupSafe==3.0.2
    matplotlib==3.10.5
    matplotlib-inline==0.1.7
    mdurl==0.1.2
    mistune==3.1.3
    ml_dtypes==0.5.3
    namex==0.1.0
    nbclient==0.10.2
    nbconvert==7.16.6
    nbformat==5.10.4
    nest-asyncio==1.6.0
    notebook==7.4.5
    notebook_shim==0.2.4
    numpy==2.3.2
    opt_einsum==3.4.0
    optree==0.17.0
    overrides==7.7.0
    packaging==25.0
    pandas==2.3.1
    pandas-stubs==2.3.0.250703
    pandocfilters==1.5.1
    parso==0.8.4
    pillow==11.3.0
    platformdirs==4.3.8
    prometheus_client==0.22.1
    prompt_toolkit==3.0.51
    protobuf==5.29.5
    psutil==7.0.0
    pure_eval==0.2.3
    pycparser==2.22
    Pygments==2.19.2
    pyparsing==3.2.3
    python-dateutil==2.9.0.post0
    python-dotenv==1.1.1
    python-json-logger==3.3.0
    pytz==2025.2
    pywin32==311
    pywinpty==2.0.15
    PyYAML==6.0.2
    pyzmq==27.0.1
    referencing==0.36.2
    requests==2.32.4
    rfc3339-validator==0.1.4
    rfc3986-validator==0.1.1
    rfc3987-syntax==1.1.0
    rich==14.1.0
    riotwatcher==3.3.1
    rpds-py==0.27.0
    scikit-learn==1.7.1
    scipy==1.16.1
    seaborn==0.13.2
    Send2Trash==1.8.3
    setuptools==80.4.0
    six==1.17.0
    sniffio==1.3.1
    soupsieve==2.7
    stack-data==0.6.3
    striprtf==0.0.29
    tensorboard==2.20.0
    tensorboard-data-server==0.7.2
    tensorflow==2.20.0rc0
    termcolor==3.1.0
    terminado==0.18.1
    threadpoolctl==3.6.0
    tinycss2==1.4.0
    tornado==6.5.1
    tqdm==4.67.1
    traitlets==5.14.3
    ttkbootstrap==1.14.2
    types-python-dateutil==2.9.0.20250708
    typing_extensions==4.14.1
    tzdata==2025.2
    uri-template==1.3.0
    urllib3==2.5.0
    wcwidth==0.2.13
    webcolors==24.11.1
    webencodings==0.5.1
    websocket-client==1.8.0
    Werkzeug==3.1.3
    wheel==0.45.1
    wrapt==1.17.2
    xgboost==3.0.3
                    
    Docker and Certification

    The primary goal was to run everything from a single Docker Compose file (docker-compose.yml). The only remaining issue was certificate issuance. The following command performs a one-time run of the Certbot client in a temporary Docker container to obtain a TLS/SSL certificate from the Let’s Encrypt Certificate Authority:

    
    docker run --rm \
      -v /data/compose/9/web:/usr/share/nginx/html \
      -v /data/compose/9/letsencrypt:/etc/letsencrypt \
      certbot/certbot:latest certonly \
        --non-interactive --agree-tos --keep-until-expiring \
        --email "*****@****.de" \
        --webroot -w /usr/share/nginx/html \
        -d eneemr.sabuncuoglu.de
                        

    The Docker Compose YAML:

    
    version: "3.8"
    
    services:
      nginx_http:
        image: nginx:1.27-alpine
        container_name: nginx_http
        restart: unless-stopped
        ports:
          - "80:80"  # Serve HTTP traffic on IPv4/IPv6
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html:ro  # Static web content (read-only)
    
      certbot:
        image: certbot/certbot:latest
        container_name: certbot
        restart: unless-stopped
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html          # Webroot for ACME HTTP-01 challenge
          - /data/compose/9/letsencrypt:/etc/letsencrypt       # Persistent certificate/key storage
        entrypoint: ["/bin/sh", "-lc"]
        command:
          - |
            set -e;  # Abort on any command error
            # Ensure required dirs exist
            mkdir -p /usr/share/nginx/html/.well-known/acme-challenge /etc/letsencrypt;
    
            # Initial certificate issuance (only if no cert present)
            if [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem ]; then
              echo "[certbot] requesting initial certificate for eneemr.sabuncuoglu.de";
              certbot certonly \
                --non-interactive --agree-tos --keep-until-expiring \
                --email "******@*****.de" \
                --webroot -w /usr/share/nginx/html \
                -d eneemr.sabuncuoglu.de || true;  # Ignore failure to avoid container crash
            fi;
    
            # Renewal loop: run every 12h
            while :; do
              # Attempt silent renewal
              certbot renew --webroot -w /usr/share/nginx/html --quiet || true;
              sleep 12h;
            done
    
      certbot_renew:
        image: certbot/certbot:latest
        container_name: certbot_renew
        restart: unless-stopped
        volumes:
          - /data/compose/9/web:/usr/share/nginx/html
          - /data/compose/9/letsencrypt:/etc/letsencrypt
        command:
          - /bin/sh
          - -lc
          - |
            # Renewal-only loop — no initial issuance logic
            while :; do
              certbot renew --webroot -w /usr/share/nginx/html --quiet || true
              sleep 12h
            done
    
      nginx_https_redirect:
        image: nginx:1.27-alpine
        container_name: nginx_https_redirect
        restart: unless-stopped
        depends_on:
          - certbot  # Ensure certbot container has run at least once
        ports:
          - "443:443"  # Serve HTTPS connections
        volumes:
          - /data/compose/9/letsencrypt:/etc/letsencrypt:ro  # Read-only mount of issued certs
        entrypoint: ["/bin/sh", "-lc"]
        command:
          - |
            set -e;
    
            # Wait until certificate + key exist before starting Nginx
            while [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem ] ||
                  [ ! -f /etc/letsencrypt/live/eneemr.sabuncuoglu.de/privkey.pem ]; do
              echo "[https] waiting for certificates...";
              sleep 5;
            done;
    
            # Generate minimal nginx.conf: TLS termination with 308 redirect to HTTP
            printf '%s\n' '
            worker_processes auto;
            error_log  /var/log/nginx/error.log warn;
            pid        /var/run/nginx.pid;
            events { worker_connections 1024; }
            http {
              include       /etc/nginx/mime.types;
              default_type  application/octet-stream;
              access_log /var/log/nginx/access.log;
              sendfile on;
              keepalive_timeout 65;
              server_tokens off;
              server {
                listen 443 ssl http2;
                listen [::]:443 ssl http2;
                server_name eneemr.sabuncuoglu.de;
                ssl_certificate     /etc/letsencrypt/live/eneemr.sabuncuoglu.de/fullchain.pem;
                ssl_certificate_key /etc/letsencrypt/live/eneemr.sabuncuoglu.de/privkey.pem;
                ssl_session_timeout 1d;
                ssl_session_cache shared:SSL:10m;
                ssl_protocols TLSv1.2 TLSv1.3;
                ssl_ciphers HIGH:!aNULL:!MD5;
                ssl_prefer_server_ciphers on;
                location / { return 308 http://$$host$$request_uri; }  # Preserve method/body in redirect
              }
            }' > /etc/nginx/nginx.conf;
    
            exec nginx -g 'daemon off;'
                        
    All redirects are 308 Permanent Redirect to preserve method and body
    No HSTS is set in the redirect endpoint to avoid pinning clients to HTTPS (by design, site is HTTP-first)
    On renewal, nginx must be reloaded (or the container restarted) to pick up the updated PEMs, since certificates are loaded at startup rather than on each TLS handshake

    GitHub Repository

    View Private Repository (Login Required)

    Ethics & Privacy

    Complies with Riot Games API terms and GDPR.
    No competitive advantage; research purpose only.
    Only anonymized match data stored; no personal info.

    PROJECT OWNERS

    Enes Sabuncuoglu
    • 2nd Semester Master Data Science

    Emre Tahir Tursun
    • 1st Semester Master Data Science