ahmedmir221/Edge-to-Cloud-log-Analyzer

GitHub: ahmedmir221/Edge-to-Cloud-log-Analyzer

# ⚡ Edge-to-Cloud Automated Log ETL Pipeline

An enterprise-grade, event-driven serverless architecture built to automatically ingest, parse, and analyze edge device logs in real time. Built entirely as Infrastructure as Code (IaC) and hosted locally using LocalStack.

This project serves as the foundational data engineering layer for a Machine Learning workflow: it extracts raw, unstructured edge logs, transforms them via an event-driven microservice, and loads the structured analytics into a NoSQL database for downstream model training and visualization.

## 🏗️ Architecture Overview

![Cloud ETL Pipeline](https://static.pigsec.cn/wp-content/uploads/repos/2026/06/9275aa8692093017.png)

This project simulates a real-time ETL (Extract, Transform, Load) data pipeline using AWS services.

1. **Storage (AWS S3):** Acts as the raw data lake. Edge devices drop `.txt` log files into the `raw-edge-logs-ahmed` bucket.
2. **Nervous System (S3 Event Notifications):** Detects `s3:ObjectCreated:*` events and triggers the compute layer without manual intervention.
3. **Compute Engine (AWS Lambda):** A Python 3.10 microservice (`LogAnalyzerProcessor`) wakes up, downloads the raw file into memory, parses the text, and runs a count-based anomaly check (tallying critical errors and warnings).
4. **Database (AWS DynamoDB):** The parsed, structured data is loaded into a NoSQL table (`EdgeLogAnalytics`) keyed by the `LogFileName` hash key.
5. **Alerting System (AWS SNS):** If the pipeline detects a critical threshold (>= 2 errors in a single log), an automated alarm is published to `critical-errors-topic` for engineering response.
6. **Data Extraction:** A local Python extraction script queries the NoSQL database and compiles the processed JSON reports into a structured CSV dataset, ready for Pandas or ML ingestion.
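The transform, load, and alert steps (items 3 to 5 above) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the repository's `lambda_handler.py`: the function names `analyze_log` and `lambda_handler`, every attribute name other than `LogFileName`, the `[ERROR]`/`[WARN]` prefix matching, and the `ALERT_TOPIC_ARN` environment variable are hypothetical. Only the table name, the hash key, and the `>= 2` error threshold come from the overview.

```python
import os
import urllib.parse
from datetime import datetime, timezone

ERROR_THRESHOLD = 2  # from the overview: alert when a log has >= 2 errors
TABLE_NAME = "EdgeLogAnalytics"


def analyze_log(file_name: str, text: str) -> dict:
    """Count ERROR and WARN lines and decide whether to alert (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    errors = sum(1 for line in lines if line.startswith("[ERROR]"))
    warnings = sum(1 for line in lines if line.startswith("[WARN]"))
    return {
        "LogFileName": file_name,  # hash key named in the overview
        "TotalLines": len(lines),  # the remaining attribute names are assumed
        "ErrorCount": errors,
        "WarningCount": warnings,
        "Alert": errors >= ERROR_THRESHOLD,
    }


def lambda_handler(event, context):
    """Assumed wiring: S3 event -> analyze -> DynamoDB -> optional SNS alert."""
    import boto3  # imported lazily so analyze_log can be exercised without AWS

    s3 = boto3.client("s3")
    table = boto3.resource("dynamodb").Table(TABLE_NAME)
    sns = boto3.client("sns")

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        report = analyze_log(key, body.decode("utf-8"))
        report["ProcessedAt"] = datetime.now(timezone.utc).isoformat()
        table.put_item(Item=report)
        if report["Alert"]:
            sns.publish(
                TopicArn=os.environ["ALERT_TOPIC_ARN"],  # assumed env variable
                Subject="Critical edge log",
                Message=f"{key}: {report['ErrorCount']} errors detected",
            )
    return {"processed": len(event["Records"])}
```

Keeping the counting logic in a pure function separate from the AWS calls means the threshold rule can be checked locally without LocalStack running.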
## 🛠️ Tech Stack

* **Cloud Simulation:** LocalStack Pro (with Docker-in-Docker configuration)
* **Infrastructure as Code:** HashiCorp Terraform (installed via `winget`)
* **Compute / ETL Logic:** Python 3.10 (`boto3`, `csv`, `datetime`)
* **AWS Services Used:** S3, Lambda, DynamoDB, SNS, IAM, CloudWatch
* **CLI Tools:** `awslocal`, `docker`, `powershell`

## 🚀 Quick Start & Deployment

### 1. Install Dependencies

    pip install boto3

### 2. Start the Local Cloud Engine

Boot the LocalStack Pro container, mapping the host's Docker socket so that Lambda can spin up isolated execution environments:

    docker run -d --name localstack-main -e LOCALSTACK_AUTH_TOKEN="" -p 4566:4566 -v /var/run/docker.sock:/var/run/docker.sock localstack/localstack-pro:latest

### 3. Package the Compute Logic

Compress the Python ETL script into a deployment package:

    Compress-Archive -Path lambda_handler.py -DestinationPath function.zip -Force

### 4. Build the Infrastructure

Initialize the AWS provider and deploy the architecture:

    terraform init
    terraform apply -auto-approve

## 🧪 Testing the End-to-End Pipeline

To trigger the automated pipeline, generate a dummy log file and push it to the S3 data lake:

**1. Create a clean, UTF-8 encoded log file:**

    $logs = @"
    [INFO] System booted successfully
    [WARN] CPU temperature reaching 85C
    [INFO] User authentication passed
    [ERROR] Database connection timeout
    [ERROR] Retrying connection... failed.
    "@
    [System.IO.File]::WriteAllText("$PWD\server_crash.txt", $logs)

**2. Ingest to S3:**

    awslocal s3 cp server_crash.txt s3://raw-edge-logs-ahmed/

**3. Verify the Pipeline Execution:**

Check the engine logs to see the Lambda execution and the SNS alarm dispatch:

    docker logs --tail 20 localstack-main

Check the DynamoDB table to verify the Load phase of the ETL process:

    awslocal dynamodb scan --table-name EdgeLogAnalytics

**4. 
Extract for Data Science (CSV Export):**

Run the extraction script to pull all processed reports from the database into a machine-learning-ready format:

    python export_reports.py

## 🧠 Technical Challenges & Infrastructure Solutions

1. **Docker-in-Docker (DinD) Socket Isolation:**
   * **Issue:** Lambda builds failed with permission errors.
   * **Solution:** Mapped `-v /var/run/docker.sock:/var/run/docker.sock` in the main boot sequence, granting LocalStack authority to manage sibling Lambda containers.
   * **Business Impact:** Ensures the CI/CD pipeline is portable and not dependent on host-level OS permissions.
2. **The Windows UTF-16 Encoding Trap:**
   * **Issue:** PowerShell output redirection (`>`) created `UTF-16` files, causing `UnicodeDecodeError` in the Linux-based Lambda.
   * **Solution:** Used .NET classes (`[System.IO.File]::WriteAllText`) to enforce cloud-standard UTF-8.
   * **Business Impact:** Robust handling of data from heterogeneous edge environments without manual encoding intervention.
3. **Terraform DNS Routing:**
   * **Issue:** `no such host` failures because Windows failed to resolve `.localhost` subdomains.
   * **Solution:** Enforced `s3_use_path_style = true` in Terraform, forcing path-style addressing, which keeps the bucket name in the URL path rather than in a hostname subdomain.
4. **Local API Endpoint Spoofing:**
   * **Issue:** Terraform attempted to contact live AWS production servers.
   * **Solution:** Explicitly mapped the `dynamodb` and `sns` endpoints to `http://localhost:4566` in the provider configuration to sandbox traffic.
5. **CloudWatch Variable Interpretation:**
   * **Issue:** PowerShell expanded the `$LATEST` in AWS log stream names (`[$LATEST]`) as a variable.
   * **Solution:** Enforced literal string interpretation by using single quotes for all AWS CLI queries.

## 📜 License

Distributed under the MIT License. See `LICENSE` for more information.

## 👤 About the Author

**Ahmed Mir** | Computer Engineering Graduate (GIKI) | Machine Learning & Cloud Enthusiast

* www.linkedin.com/in/ahmed-mir-247325139