locorecto/distributed-ops-game
GitHub: locorecto/distributed-ops-game
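A browser-based failure-drill simulator for distributed systems that uses interactive gameplay to help users build operations and troubleshooting skills for Kafka, Redis, Elasticsearch, Flink, and RabbitMQ.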
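# Distributed Ops Game
## Overview
Distributed Ops Game drops you into production systems that are starting to fail. Your job: diagnose the problem, apply the right configuration, and restore the system to health before it collapses.
Every scenario is grounded in a real engineering situation, from a pizza-ordering platform overwhelmed at peak hours, to a Redis cluster in split-brain, to a Flink job whose state grows without bound. Each technology's simulation engine models actual behavior in TypeScript, so the mechanics you learn carry over directly to real systems.
**150 scenarios** across 5 technology tracks, ranging from Beginner to Master difficulty.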
## Technology Tracks
| Technology | Scenarios | Concepts Covered |
|---|---|---|
| 🟠 Apache Kafka | 30 | Topics, Partitions, Consumer Groups, Replication, Exactly-Once, Kafka Streams, Schema Registry, MirrorMaker |
| 🔴 Redis | 30 | Data Structures, Pub/Sub, Streams, Persistence, Sentinel, Cluster, Eviction, Redlock, RediSearch |
| 🟡 Elasticsearch | 30 | Shards, Mappings, Query DSL, Aggregations, ILM, CCR, Snapshots, ML Anomaly Detection, EQL |
| 🔵 Apache Flink | 30 | Windowing, Watermarks, Checkpointing, State Backends, Backpressure, Exactly-Once, CEP, Rescaling |
| 🟣 RabbitMQ | 30 | Exchanges, Queues, Routing, Publisher Confirms, Dead Letters, Quorum Queues, Streams, Federation |
All 5 tracks are open from the start; no cross-technology unlocking is required.
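## How to Play
1. **Pick a technology track**: choose Kafka, Redis, Elasticsearch, Flink, or RabbitMQ
2. **Read the briefing**: learn the system architecture and the reported symptoms
3. **Watch the simulation**: entities animate in real time and message particles flow between them
4. **Diagnose the failure**: use the metrics panel (throughput, latency, error rate, health score) to pin down the root cause
5. **Apply the fix**: adjust the configuration in the control panel
6. **Sustain the recovery**: keep the system in its fixed state for 10 consecutive ticks to win
Scoring is based on time taken, hints used, and final system health. Scenarios within each track unlock sequentially.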
## Kafka Scenarios (30)
| # | Scenario | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Pizza Order System | Beginner | Consumer lag, max.poll.records |
| 2 | Flash Sale Inventory | Easy | Partitions, consumer groups |
| 3 | Ride-Sharing Dispatch | Easy | Message keys, partition routing |
| 4 | Chat App Fan-Out | Easy | Multiple consumer groups, auto.offset.reset |
| 5 | IoT Sensor Pipeline | Medium | Batching, linger.ms, compression |
| 6 | Stock Market Data Feed | Medium | Key strategy, per-symbol ordering |
| 7 | Payment Gateway | Medium | Idempotent producer, acks=all |
| 8 | Log Aggregation Pipeline | Medium | Retention (time + size), cleanup.policy |
| 9 | E-Commerce Order Pipeline | Medium | Transactional read-process-write |
| 10 | Audit Log Compliance | Medium | Replication factor, min.insync.replicas |
| 11 | Real-Time Analytics Dashboard | Medium-Hard | Manual commit, offset reset |
| 12 | Video Streaming Platform | Medium-Hard | Large messages, fetch.max.bytes |
| 13 | Supply Chain Event Tracker | Medium-Hard | Transactions, isolation.level |
| 14 | Gaming Leaderboard | Medium-Hard | Partition scaling, rebalance |
| 15 | Healthcare Patient Monitor | Hard | session.timeout.ms, SLA enforcement |
| 16 | Microservices Event Bus | Hard | Dead letter queue, retry logic |
| 17 | Database CDC Sync | Hard | Log compaction, exactly-once |
| 18 | Fraud Detection Engine | Expert | Kafka Streams, stateful windowing |
| 19 | Schema Registry Migration | Expert | Schema evolution, BACKWARD compatibility |
| 20 | Multi-DC Disaster Recovery | Master | MirrorMaker, geo-replication lag |
| 21 | Log Compaction Deep Dive | Hard | Tombstones, compaction lag, cleaner threads |
| 22 | Consumer Rebalance Storm | Hard | Eager vs cooperative sticky rebalancing |
| 23 | Quota Throttling Crisis | Hard | Producer/consumer byte-rate quotas |
| 24 | Kafka Connect — JDBC Sink | Hard | Sink connector, error tolerance, DLQ |
| 25 | Debezium CDC Source | Hard | CDC, binlog offset, idempotent producer |
| 26 | Schema Forward Compatibility | Expert | FORWARD compat, field removal, Avro unions |
| 27 | Partition Leadership Imbalance | Expert | Preferred replica election, leader skew |
| 28 | Active-Active Geo-Replication | Expert | MirrorMaker 2, cycle detection |
| 29 | ACL & SASL Security Incident | Expert | SASL/PLAIN, ACLs, authorization failures |
| 30 | Multi-Tenant Cluster Isolation | Master | Quotas per client-id, namespace isolation |
## Redis Scenarios (30)
| # | Scenario | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Session Cache Miss Storm | Beginner | GET/SET, TTL, cache-aside pattern |
| 2 | Leaderboard Sorted Set | Easy | ZADD/ZRANGE, sorted sets |
| 3 | Shopping Cart Hash | Easy | HSET/HGET, hash operations |
| 4 | Rate Limiter Race Condition | Easy | INCR + EXPIRE, atomic operations |
| 5 | Pub/Sub Fan-Out Failure | Easy | PUBLISH/SUBSCRIBE vs Streams |
| 6 | Task Queue Data Loss | Medium | LPUSH/BRPOPLPUSH, reliable queue |
| 7 | Cache Stampede | Medium | Thundering herd, mutex lock |
| 8 | Inventory Race Condition | Medium | WATCH/MULTI/EXEC transactions |
| 9 | Bloom Filter Memory | Medium | Probabilistic structures, false positives |
| 10 | Geospatial Delivery Zones | Medium | GEOADD/GEORADIUS, spatial queries |
| 11 | RDB Snapshot Blocking | Medium | BGSAVE, fork, COW, latency spikes |
| 12 | AOF Rewrite Overhead | Medium | appendfsync, AOF rewrite, disk I/O |
| 13 | Memory Eviction Crisis | Medium-Hard | maxmemory-policy, LRU vs LFU |
| 14 | Streams Consumer Group | Medium-Hard | XADD/XREADGROUP, pending entries |
| 15 | Keyspace Notification Flood | Medium-Hard | notify-keyspace-events, filtering |
| 16 | Lua Script Blocking | Medium-Hard | EVAL, event loop, atomicity |
| 17 | Pipeline Throughput | Medium-Hard | Pipelining, RTT reduction |
| 18 | Sentinel Failover | Hard | Sentinel quorum, leader election |
| 19 | Cluster Slot Resharding | Hard | CLUSTER RESHARD, MOVED redirects |
| 20 | Hot Key Overload | Hard | Key sharding, read replicas |
| 21 | Redlock Race Condition | Hard | SET NX PX, fencing tokens |
| 22 | Replica Lag Under Load | Hard | repl-backlog-size, partial resync |
| 23 | Connection Pool Exhaustion | Hard | maxclients, connection multiplexing |
| 24 | Large Value Fragmentation | Hard | OBJECT ENCODING, compression |
| 25 | Time Series High Cardinality | Expert | RedisTimeSeries, downsampling |
| 26 | RediSearch Index Corruption | Expert | FT.CREATE, SORTABLE, query optimization |
| 27 | Transaction Isolation Failure | Expert | MULTI/EXEC, WATCH, retry backoff |
| 28 | Cluster Brain-Split | Expert | Quorum, cluster-require-full-coverage |
| 29 | ACL Security Breach | Expert | ACL SETUSER, command categories |
| 30 | Active-Active Geo-Replication | Master | CRDT, conflict resolution, causal consistency |
## Elasticsearch Scenarios (30)
| # | Scenario | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Unassigned Shards | Beginner | Primary shard allocation, cluster yellow/red |
| 2 | Index Not Found | Easy | Index creation, dynamic vs explicit mappings |
| 3 | Slow Query | Easy | match vs term queries, _source filtering |
| 4 | Mapping Conflict | Easy | Field type mismatch, strict mapping |
| 5 | Over-Sharding OOM | Medium | Shard sizing, heap per shard |
| 6 | Relevance Tuning | Medium | BM25 scoring, field boost |
| 7 | Analyzer Mismatch | Medium | standard vs keyword analyzers |
| 8 | Nested Object Query | Medium | nested field type, nested query |
| 9 | Aggregation Memory OOM | Medium | Terms agg circuit breaker |
| 10 | Index Template Migration | Medium | Template priority, component templates |
| 11 | Reindex Performance | Medium-Hard | Sliced scroll, pipeline ingest |
| 12 | Disk Watermark Breach | Medium-Hard | flood_stage, read-only index |
| 13 | Split-Brain Cluster | Medium-Hard | Master quorum, voting config |
| 14 | Alias Rollover | Medium-Hard | Write alias, ILM rollover |
| 15 | Ingest Pipeline Failure | Medium-Hard | Enrich processor, GeoIP, refresh |
| 16 | ILM Policy Misconfiguration | Medium-Hard | hot/warm/cold/delete phases |
| 17 | Cross-Cluster Replication Lag | Hard | CCR leader/follower, lag monitoring |
| 18 | Snapshot Restore Failure | Hard | SLM policy, partial restore |
| 19 | Deep Pagination OOM | Hard | search_after, point-in-time |
| 20 | Circuit Breaker Tripping | Hard | Request/fielddata breakers, heap |
| 21 | Security Role Mapping | Hard | Document-level security, field masking |
| 22 | Watcher Alert Latency | Hard | Trigger, condition, action throttle |
| 23 | EQL Sequence Matching | Expert | EQL syntax, max_span |
| 24 | ML Anomaly Detection | Expert | Datafeed, job state, index patterns |
| 25 | Runtime Field Performance | Expert | Painless scripts, doc_values |
| 26 | Async Search | Expert | Long-running queries, status polling |
| 27 | Percolator Queries | Expert | Document matching, alerting |
| 28 | Geo-Shape Indexing | Expert | geo_shape, BKD tree, spatial |
| 29 | Transform Pivot Aggregation | Expert | Transforms, checkpointing |
| 30 | Cross-Cluster Search | Master | CCS, skip_unavailable, minimize_roundtrips |
## Apache Flink Scenarios (30)
| # | Scenario | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | DataStream Backpressure | Beginner | Operator chaining, throughput |
| 2 | Tumbling Window Late Data | Easy | Window triggers, allowedLateness |
| 3 | Event Time Semantics | Easy | Watermarks, out-of-order records |
| 4 | Unbounded ValueState | Easy | StateTtlConfig, key TTL |
| 5 | Sliding Window Memory | Medium | Window pane overhead |
| 6 | Session Window Timeout | Medium | Session gap, dynamic sessions |
| 7 | Checkpoint Failure | Medium | Checkpoint barriers, recovery |
| 8 | Savepoint Migration | Medium | Operator UIDs, restore |
| 9 | Kafka Source Offset | Medium | scan.startup.mode, backfill |
| 10 | Side Output Late Data | Medium | Tagged outputs, late events |
| 11 | Async I/O DB Lookup | Medium-Hard | AsyncFunction, capacity |
| 12 | RocksDB State Backend | Medium-Hard | Heap vs RocksDB, incremental checkpoints |
| 13 | Broadcast State Rules | Medium-Hard | BroadcastStream, dynamic config |
| 14 | Temporal Join | Medium-Hard | Versioned table, event-time join |
| 15 | Watermark Alignment | Medium-Hard | Multi-source drift, idle timeout |
| 16 | Late Data Side Output | Medium-Hard | allowedLateness, side output |
| 17 | Task Manager OOM | Hard | Managed memory fraction, network buffers |
| 18 | State Rescaling | Hard | Key group redistribution, savepoint |
| 19 | Exactly-Once Sink | Hard | TwoPhaseCommitSink, pre-commit |
| 20 | CEP Pattern Matching | Hard | Strict/relaxed contiguity |
| 21 | State TTL Cleanup | Hard | StateTtlConfig, background cleanup |
| 22 | Dynamic Parallelism | Hard | Per-operator parallelism, auto-scaling |
| 23 | Flink SQL CDC Pipeline | Expert | CDC connector, upsert Kafka sink |
| 24 | Temporal Join Versioned Table | Expert | Versioned table, changelog mode |
| 25 | Prometheus Metrics | Expert | MetricGroup, custom reporters |
| 26 | Multi-Sink Fan-Out | Expert | Independent exactly-once per sink |
| 27 | Changelog Compaction | Expert | CHANGELOG_MODE, retract vs upsert |
| 28 | Kubernetes HA | Expert | Application mode, HA config |
| 29 | Global Window Trigger | Master | Custom trigger, purge logic |
| 30 | Unified Batch + Streaming | Master | BATCH execution mode, bounded |
## RabbitMQ Scenarios (30)
| # | Scenario | Difficulty | Key Concepts |
|---|---|---|---|
| 1 | Queue Overflow | Beginner | max-length, consumer overload |
| 2 | Direct Exchange Routing | Easy | Routing keys, bindings |
| 3 | Fanout Broadcast | Easy | Fanout exchange, multi-queue |
| 4 | Topic Exchange Wildcards | Easy | `#`/`*` routing patterns |
| 5 | Message TTL Expiry | Medium | Per-message TTL, x-message-ttl |
| 6 | Dead Letter Infinite Loop | Medium | DLX, nack requeue=false |
| 7 | Priority Queue Starvation | Medium | x-max-priority, fairness |
| 8 | Manual Acknowledgements | Medium | manual-ack, nack on failure |
| 9 | Publisher Confirms | Medium | confirm mode, retry on nack |
| 10 | Prefetch Throttling | Medium | basic.qos, unacked limits |
| 11 | Lazy Queue Memory | Medium-Hard | Lazy mode, disk spooling |
| 12 | Headers Exchange | Medium-Hard | x-match all/any, complex routing |
| 13 | Classic Mirrored HA | Medium-Hard | ha-mode, ha-sync-mode |
| 14 | Shovel Plugin | Medium-Hard | Shovel config, frame max |
| 15 | Federation Link | Medium-Hard | Federation upstream, link state |
| 16 | Vhost Isolation | Medium-Hard | Virtual hosts, per-vhost limits |
| 17 | Memory Alarm Blocking | Hard | vm_memory_high_watermark, flow control |
| 18 | Disk Free Alarm | Hard | disk_free_limit, publish blocking |
| 19 | Quorum Queue Election | Hard | Raft consensus, leader election |
| 20 | Classic → Quorum Migration | Hard | Drain-and-delete migration |
| 21 | Split-Brain Partition | Hard | cluster_partition_handling, autoheal |
| 22 | Connection Storm | Hard | channel_max, connection pooling |
| 23 | OAuth 2.0 Auth | Hard | rabbitmq-auth-backend-oauth2, JWT |
| 24 | Per-User Rate Limiting | Hard | Credit flow, per-connection rate |
| 25 | Consistent Hash Exchange | Expert | Sharded queues, slot redistribution |
| 26 | Stream Queue Throughput | Expert | RabbitMQ Streams, publisher offsets |
| 27 | Stream Offset Replay | Expert | Offset spec, timestamp-based restart |
| 28 | Delayed Message Exchange | Expert | rabbitmq-delayed-message-exchange |
| 29 | Multi-AZ Active-Passive | Expert | Quorum queues, node evacuation |
| 30 | Cross-Protocol AMQP → MQTT | Master | MQTT plugin, QoS levels, session |
## Tech Stack
| Concern | Library |
|---|---|
| UI | React 18 + TypeScript |
| Build | Vite 6 |
| Styling | Tailwind CSS 4 |
| State | Zustand 5 |
| Animation | Framer Motion 12 |
| Charts | Recharts 3 |
| Testing | Vitest + @testing-library/react |
The simulation engines (`src/technologies/*/engine/`) are written in pure TypeScript with no React dependency, so they can be unit-tested without a browser and run headlessly.
## Quick Start
```
# Install dependencies
npm install
# Start the dev server
npm run dev
# Run the tests
npm test
# Production build
npm run build
```
Requires Node 18+. No external services are needed; everything runs in the browser.
## Architecture
```
src/
├── technologies/            # Per-technology engines + scenarios
│   ├── types.ts             # TechKey, TechDefinition, TECH_DEFINITIONS
│   ├── kafka/
│   │   ├── engine/          # Kafka simulation engine (11 modules)
│   │   └── scenarios/       # 30 Kafka scenario definitions
│   ├── redis/
│   │   ├── engine/          # Redis simulation engine
│   │   └── scenarios/       # 30 Redis scenario definitions
│   ├── elasticsearch/
│   │   ├── engine/          # Elasticsearch simulation engine
│   │   └── scenarios/       # 30 ES scenario definitions
│   ├── flink/
│   │   ├── engine/          # Flink simulation engine
│   │   └── scenarios/       # 30 Flink scenario definitions
│   └── rabbitmq/
│       ├── engine/          # RabbitMQ simulation engine
│       └── scenarios/       # 30 RabbitMQ scenario definitions
│
├── store/
│   ├── gameStore            # Phase, tech selection, per-tech progress (localStorage)
│   ├── simulationStore      # Live snapshot from active engine
│   └── metricsStore         # 300-tick circular buffer for charts
│
└── components/
    ├── screens/
    │   ├── TechnologyLobby  # 5-technology selection screen
    │   ├── MainMenu         # Per-tech scenario grid
    │   └── GameScreen       # Simulation canvas + control panel
    ├── canvas/              # SimulationCanvas, node components, particles
    ├── panels/              # ControlPanel: per-entity config UI
    ├── metrics/             # MetricsPanel, Recharts charts
    └── tutorial/            # HintPanel, scenario briefing
```
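The tree describes `metricsStore` as a 300-tick circular buffer. The project's actual implementation is not shown here, so the following is only a minimal sketch of how such a buffer could work; the `RingBuffer` class and its method names are hypothetical.

```typescript
// Hypothetical fixed-capacity ring buffer (illustrative names, not the project's API).
// Once full, each push overwrites the oldest entry, so memory use stays constant.
class RingBuffer<T> {
  private readonly capacity: number;
  private readonly items: (T | undefined)[];
  private head = 0;  // index the next push will write to
  private count = 0; // number of valid entries currently stored

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.items = new Array<T | undefined>(capacity).fill(undefined);
  }

  get size(): number {
    return this.count;
  }

  push(item: T): void {
    this.items[this.head] = item;
    this.head = (this.head + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns the stored entries ordered from oldest to newest.
  toArray(): T[] {
    const start = (this.head - this.count + this.capacity) % this.capacity;
    const out: T[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.items[(start + i) % this.capacity] as T);
    }
    return out;
  }
}
```

With a capacity of 300 and one sample per tick, a chart fed from `toArray()` would always show the most recent 300 ticks in chronological order.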
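### Simulation Tick Loop
Each engine runs a 100 ms tick loop (which can be sped up to 2x or 4x):
1. **Entity step**: update entity state according to the current configuration
2. **Failure injector**: fire scripted `FailureEvent`s at their designated ticks
3. **Metrics step**: compute health score, error rate, and throughput
4. **Win check**: evaluate the win conditions; 10 consecutive passes = win
5. **Emit**: push a snapshot to `simulationStore`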
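The five steps above can be sketched as a single `step()` method. This is a simplified illustration, not the project's engine: apart from `FailureEvent`, every name is hypothetical, the "entity" is reduced to one capacity-limited service, and the win condition (health score of at least 90) is an assumption. It also assumes that the 2x and 4x speeds shorten the timer interval.

```typescript
// Hypothetical names throughout; only the step order, the 100 ms base tick,
// and the 10-pass win streak come from the description above.
interface EngineState {
  tick: number;
  offeredLoad: number;          // requests arriving per tick
  served: number;               // requests handled in the latest tick
  config: { capacity: number }; // the knob a player would tune
}

interface FailureEvent {
  atTick: number;
  apply: (state: EngineState) => void;
}

interface Snapshot {
  tick: number;
  throughput: number;
  errorRate: number;
  healthScore: number; // 0..100
  won: boolean;
}

const BASE_TICK_MS = 100;
const WINS_REQUIRED = 10;
const HEALTHY_THRESHOLD = 90; // assumed win condition

// Assumption: a higher speed shortens the interval between ticks.
function tickIntervalMs(speed: 1 | 2 | 4): number {
  return BASE_TICK_MS / speed;
}

class SketchEngine {
  won = false;
  private streak = 0;
  private readonly state: EngineState;
  private readonly failures: FailureEvent[];
  private readonly emit: (snapshot: Snapshot) => void;

  constructor(state: EngineState, failures: FailureEvent[], emit: (snapshot: Snapshot) => void) {
    this.state = state;
    this.failures = failures;
    this.emit = emit;
  }

  step(): Snapshot {
    const s = this.state;
    s.tick += 1;
    // 1. Entity step: apply the current configuration.
    s.served = Math.min(s.offeredLoad, s.config.capacity);
    // 2. Failure injector: fire events scripted for this tick.
    for (const f of this.failures) {
      if (f.atTick === s.tick) f.apply(s);
    }
    // 3. Metrics step.
    const dropped = Math.max(0, s.offeredLoad - s.served);
    const errorRate = s.offeredLoad === 0 ? 0 : dropped / s.offeredLoad;
    const healthScore = Math.round(100 * (1 - errorRate));
    // 4. Win check: any failing tick resets the streak.
    this.streak = healthScore >= HEALTHY_THRESHOLD ? this.streak + 1 : 0;
    if (this.streak >= WINS_REQUIRED) this.won = true;
    // 5. Emit the snapshot to whoever is listening (a store, a test, a logger).
    const snapshot: Snapshot = {
      tick: s.tick, throughput: s.served, errorRate, healthScore, won: this.won,
    };
    this.emit(snapshot);
    return snapshot;
  }
}
```

In a browser, a timer such as `setInterval(() => engine.step(), tickIntervalMs(2))` could drive the loop; in a headless test, calling `step()` in a plain `while` loop gives deterministic results without waiting on real time.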
## Scoring
```
score = 1000
− (secondsTaken × 5) # time penalty
− (hintsUsed × 50) # hint penalty
+ (finalHealthScore × 2) # health bonus (max +200)
− (duplicates × 10) # correctness penalty
Stars: ≥ 800 → 3★ ≥ 500 → 2★ ≥ 1 → 1★
```
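The formula above translates directly into a small function. The sketch below is illustrative rather than the project's code: the function and field names are hypothetical, the health bonus is clamped at 200 to match the stated maximum, and returning 0 stars for scores below 1 is an assumption, since the formula does not say what happens there.

```typescript
// Hypothetical helper names; the arithmetic follows the formula above.
interface ScoreInput {
  secondsTaken: number;
  hintsUsed: number;
  finalHealthScore: number; // assumed to be on a 0..100 scale
  duplicates: number;
}

function computeScore(input: ScoreInput): number {
  const timePenalty = input.secondsTaken * 5;
  const hintPenalty = input.hintsUsed * 50;
  const healthBonus = Math.min(input.finalHealthScore * 2, 200); // max +200
  const correctnessPenalty = input.duplicates * 10;
  return 1000 - timePenalty - hintPenalty + healthBonus - correctnessPenalty;
}

function starsFor(score: number): 0 | 1 | 2 | 3 {
  if (score >= 800) return 3;
  if (score >= 500) return 2;
  if (score >= 1) return 1;
  return 0; // assumption: no stars when the score is below 1
}

// Example: 60 seconds, 1 hint, full health, no duplicates.
// 1000 - 300 - 50 + 200 - 0 = 850, which is a 3-star result.
const exampleScore = computeScore({
  secondsTaken: 60,
  hintsUsed: 1,
  finalHealthScore: 100,
  duplicates: 0,
});
console.log(exampleScore, starsFor(exampleScore));
```

Because the time penalty is 5 points per second, a run with full health, no hints, and no duplicates must finish within 80 seconds to reach the 800-point threshold for three stars.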
Progress and scores are persisted to `localStorage`, keyed by technology.
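How the stores write to `localStorage` is not shown here, so the following is only a sketch of one way per-technology persistence could look. The key format, the `TechProgress` shape, and all function names are hypothetical, and the `TechKey` union is an assumed definition of the type named in `types.ts`. The code depends on a two-method storage interface that `window.localStorage` satisfies, which lets it run headlessly with an in-memory stand-in.

```typescript
// Assumed shape of TechKey; the real definition lives in src/technologies/types.ts.
type TechKey = "kafka" | "redis" | "elasticsearch" | "flink" | "rabbitmq";

// Hypothetical progress record: best score per scenario id.
interface TechProgress {
  bestScores: Record<string, number>;
}

// The subset of the Web Storage API this sketch needs.
interface KeyValueStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Hypothetical key format: one entry per technology.
const keyFor = (tech: TechKey): string => `ops-game:progress:${tech}`;

function loadProgress(store: KeyValueStore, tech: TechKey): TechProgress {
  const raw = store.getItem(keyFor(tech));
  if (raw === null) return { bestScores: {} };
  try {
    const parsed = JSON.parse(raw) as Partial<TechProgress> | null;
    if (parsed && typeof parsed.bestScores === "object" && parsed.bestScores !== null) {
      return { bestScores: parsed.bestScores };
    }
  } catch {
    // Corrupt JSON: fall through and start from empty progress.
  }
  return { bestScores: {} };
}

// Stores the score only when it beats the previous best for that scenario.
function recordScore(
  store: KeyValueStore,
  tech: TechKey,
  scenarioId: string,
  score: number,
): TechProgress {
  const progress = loadProgress(store, tech);
  const previous = progress.bestScores[scenarioId] ?? -Infinity;
  if (score > previous) {
    progress.bestScores[scenarioId] = score;
    store.setItem(keyFor(tech), JSON.stringify(progress));
  }
  return progress;
}

// In-memory stand-in for tests and headless runs.
class MemoryStore implements KeyValueStore {
  private readonly data = new Map<string, string>();
  getItem(key: string): string | null {
    return this.data.get(key) ?? null;
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value);
  }
}
```

In the browser, passing `window.localStorage` as the `store` argument would persist across page reloads. Keeping one key per technology means a corrupt entry for one track does not affect the others.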
## Concepts Covered
### Kafka
`consumer-lag` · `partitions` · `consumer-groups` · `message-keys` · `auto.offset.reset` · `linger.ms` · `batch.size` · `compression` · `idempotent-producer` · `acks` · `transactions` · `isolation.level` · `retention` · `compaction` · `replication-factor` · `min.insync.replicas` · `manual-commit` · `max.request.size` · `session.timeout.ms` · `dead-letter-queue` · `schema-evolution` · `mirrormaker` · `kafka-connect` · `quotas` · `acl`
### Redis
`strings` · `hashes` · `lists` · `sets` · `sorted-sets` · `streams` · `pub-sub` · `ttl` · `eviction` · `rdb` · `aof` · `pipelining` · `transactions` · `lua-scripts` · `sentinel` · `cluster` · `redlock` · `keyspace-notifications` · `bloom-filter` · `geospatial` · `timeseries` · `redisearch`
### Elasticsearch
`shards` · `replicas` · `mappings` · `analyzers` · `query-dsl` · `aggregations` · `ilm` · `aliases` · `rollover` · `ccr` · `snapshots` · `ingest-pipelines` · `circuit-breakers` · `security` · `eql` · `ml-anomaly-detection` · `runtime-fields` · `async-search` · `percolator` · `transforms`
### Apache Flink
`datastream` · `windowing` · `watermarks` · `event-time` · `processing-time` · `checkpointing` · `savepoints` · `state-backends` · `rocksdb` · `backpressure` · `exactly-once` · `cep` · `broadcast-state` · `temporal-join` · `async-io` · `rescaling` · `flink-sql` · `kubernetes-ha`
### RabbitMQ
`exchanges` · `queues` · `bindings` · `routing-keys` · `publisher-confirms` · `consumer-acks` · `prefetch` · `dead-letter-exchange` · `ttl` · `priority-queues` · `lazy-queues` · `quorum-queues` · `streams` · `federation` · `shovel` · `vhosts` · `flow-control` · `oauth2` · `consistent-hash` · `mqtt`