anantha037/aiops-log-anomaly-detection

# AIOps Log Anomaly Detection Engine 🛡️

An end-to-end unsupervised machine learning pipeline designed to ingest, process, and score system logs in real time. By applying an Isolation Forest to compressed text features, the engine identifies out-of-bounds system states and structural anomalies without requiring labeled historical datasets.

## 📈 Value Proposition & Business Context

In modern distributed cloud architectures, applications and infrastructure daemons generate millions of unstructured log entries daily. Standard rule-based alerting systems (e.g., searching for the word `ERROR`) fail to catch novel structural regressions, and they suffer from high false-alarm rates.

This **AIOps Log Anomaly Detection Engine** addresses these gaps through:

- **Zero-Label Dependency**: Learns the baseline of normal system operation completely unsupervised.
- **Structural Outlier Detection**: Automatically flags rare events, such as failed task configurations, temporary folder setups, or unusual daemon sequence paths, which are often precursors to critical service outages or security breaches.
- **Low-Latency Inference**: Operates in real time, vectorizing and scoring incoming log streams on the fly to fit within high-throughput log ingestion pipelines.

## 🏗️ System Architecture Flow

The workflow below illustrates the journey of a log line from ingest to live operations monitoring:

```text
[ Raw HDFS Logs ]
        │
        ▼  (ingest.py / Regex Engine)
[ Dynamic Text Masking ] ──► Mask Block IDs, IPs, and Standalone Numbers
        │
        ▼  (scikit-learn)
[ TF-IDF Vectorization ] ──► Transforms clean strings into 240-term vocabulary
        │
        ▼  (train_anomaly.py / Isolation Forest)
[ Model Offline Baseline ] ──► Learns normal structure (Contamination = 5%)
        │
        ▼  (joblib Serialization)
[ Serialized Model & Vectorizer (.pkl) ]
        │
        ▼  (main.py / FastAPI Lifespan Startup)
[ FastAPI Live Service ] ──► Exposes /score-log POST Endpoint
        │
        ▼  (app.py / Streamlit)
[ Streamlit Operations Dashboard ] ──► Visualizes normal streams vs. glowing alerts
```

## 🛠️ Key Engineering Highlights

### 1. Dynamic Text Masking & Vocabulary Compression

Unstructured log messages contain variable tokens (e.g., transaction IDs, block hashes like `blk_38865049064139660`, IP addresses, and file sizes) that each occur only once. Running raw TF-IDF on these tokens inflates the feature space and encourages overfitting to noise.

- **Our Solution**: We apply deterministic regex rules that replace transient values with generic placeholders, one each for block IDs, IP addresses, and standalone numbers.
- **Impact**: This compressed our TF-IDF feature vocabulary from thousands of noise-filled dimensions to **just 240 clean, structural features**, forcing the machine learning model to focus purely on the operational phrases (e.g., `addStoredBlock`, `PacketResponder terminating`).

### 2. Unsupervised Anomaly Isolation

- **Isolation Forest Model**: Instead of modeling the density of normal logs (which is highly complex and non-linear), we use an Isolation Forest to isolate anomalies. Outliers end up closer to the root of the forest's trees because they require fewer random splits to separate from the rest of the data.
- **Calibration**: Configured with `contamination=0.05` (5% expected anomaly rate) and a fixed `random_state=42` for reproducibility. It flags rare task executions and temporary file paths with high precision.

### 3. Production-Ready APIs & UI

- **FastAPI Backend**: Built with the modern `lifespan` context manager, which loads the trained Isolation Forest and TF-IDF vectorizer into memory once at startup so that requests never pay the model-loading cost; the stated goal is sub-millisecond response latency.
- **Streamlit Command Center**: Features a custom dark-themed Operations Dashboard. It simulates log streams and renders alerts dynamically, complete with a glowing red terminal emulator box for flagged anomalies.

## 🚀 Local Setup & Installation

### Prerequisites

Ensure you have Python 3.9+ and Git installed.

### 1. Clone the Repository

```bash
git clone <repository-url>
cd "Log Anomaly Detection"
```

### 2. Run Data Ingestion & Model Training

Prepare the dataset and train the Isolation Forest model (the architecture flow above names `ingest.py` and `train_anomaly.py` for these two stages).

### 3. Launch the FastAPI Backend

Start the API server using Uvicorn:

```bash
# Starts the server locally on http://127.0.0.1:8000
python main.py
```

### 4. Launch the Streamlit Operations Dashboard

In a separate terminal window, start the frontend interface:

```bash
# Starts the dashboard locally on http://127.0.0.1:8501
streamlit run app.py
```

## 🎯 Verification Sandbox

Once both servers are running:

1. Open `http://localhost:8501` in your browser.
2. Under the **Log Injector Control Panel**, select a preset message:
   - Selecting **Normal Log** will return `is_anomaly: false` with a positive anomaly score and display a green console stream.
   - Selecting **Anomalous Log** will return `is_anomaly: true` with a negative score and trigger a flashing red **[ALERT] SYSTEM ANOMALY DETECTED** banner.
3. Test your own custom logs using the text field to see how changes in log structure affect the Isolation Forest decision scores.
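
The masking step described under Dynamic Text Masking & Vocabulary Compression can be illustrated with a minimal sketch. The regex patterns, the placeholder names (`<BLK>`, `<IP>`, `<NUM>`), and the sample log lines are assumptions made for illustration; the actual rules in `ingest.py` are not shown in this README and may differ.

```python
import re

# Hypothetical masking rules and placeholder names (not taken from ingest.py).
# Order matters: block IDs and IPs are masked before standalone numbers.
MASK_RULES = [
    (re.compile(r"blk_-?\d+"), "<BLK>"),
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"), "<IP>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def mask_log_line(line: str) -> str:
    """Replace transient tokens with generic placeholders."""
    for pattern, placeholder in MASK_RULES:
        line = pattern.sub(placeholder, line)
    return line


def vocabulary(lines):
    """Set of whitespace-separated tokens across all lines."""
    return {token for line in lines for token in line.split()}


# Made-up sample lines written in the style of HDFS logs.
raw_lines = [
    "Receiving block blk_38865049064139660 src: /10.250.19.102:54106 dest: /10.250.19.102:50010",
    "Receiving block blk_-1608999687919862906 src: /10.250.10.6:40524 dest: /10.250.10.6:50010",
    "PacketResponder 1 for block blk_38865049064139660 terminating",
    "PacketResponder 2 for block blk_-1608999687919862906 terminating",
]
masked_lines = [mask_log_line(line) for line in raw_lines]

for line in masked_lines:
    print(line)
print(len(vocabulary(raw_lines)), "raw tokens ->",
      len(vocabulary(masked_lines)), "masked tokens")
```

After masking, lines that differ only in their IDs, addresses, or counters collapse to the same template, which is what keeps the downstream TF-IDF vocabulary small.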
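
The claim under Unsupervised Anomaly Isolation, that outliers need fewer random splits to separate from the rest of the data, can be checked with a toy one-dimensional sketch. This is not scikit-learn's implementation (which builds subsampled trees over many features and normalizes path lengths); the function names and the synthetic data are hypothetical and exist only to show the isolation principle.

```python
import random


def isolation_depth(point, data, rng, depth=0, max_depth=50):
    """Count random splits until `point` is alone (or the depth cap is hit)."""
    lo, hi = min(data), max(data)
    if len(data) <= 1 or lo == hi or depth >= max_depth:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the values that fall on the same side of the split as `point`.
    same_side = [x for x in data if (x < split) == (point < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def mean_depth(point, data, n_trees=200, seed=42):
    """Average isolation depth over several independent random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees


# Synthetic data: a tight cluster of "normal" values plus one distant outlier.
cluster = [i * 0.01 for i in range(50)]
outlier = 100.0
data = cluster + [outlier]

outlier_depth = mean_depth(outlier, data)
inlier_depth = mean_depth(cluster[25], data)
print(f"outlier mean depth: {outlier_depth:.2f}")
print(f"inlier mean depth:  {inlier_depth:.2f}")
```

The outlier is usually cut off by the very first split, while a point inside the cluster needs many more. scikit-learn's `IsolationForest` turns the same idea into a score whose `decision_function` is negative for outliers and positive for inliers, which matches the sign convention described in the Verification Sandbox.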