ahmedmir221/Edge-to-Cloud-log-Analyzer
# ⚡ Edge-to-Cloud Automated Log ETL Pipeline
An event-driven, serverless pipeline that automatically ingests, parses, and analyzes edge device logs in near real time. The infrastructure is defined entirely as Infrastructure as Code (IaC) and runs locally on LocalStack.
This project serves as the foundational data engineering layer for a Machine Learning workflow, extracting raw unstructured edge logs, transforming them via an event-driven microservice, and loading the structured analytics into a NoSQL database for downstream model training and visualization.
## 🏗️ Architecture Overview

This project simulates a real-time ETL (Extract, Transform, Load) data pipeline using AWS services.
1. **Storage (AWS S3):** Acts as the raw data lake. Edge devices drop `.txt` log files into the `raw-edge-logs-ahmed` bucket.
2. **Nervous System (S3 Event Notifications):** Instantly detects `s3:ObjectCreated:*` events and triggers the compute layer without manual intervention.
3. **Compute Engine (AWS Lambda):** A Python 3.10 microservice (`LogAnalyzerProcessor`) wakes up, reads the raw file into memory, parses the log lines, and runs a count-based anomaly check (tallying critical errors and warnings).
4. **Database (AWS DynamoDB):** The parsed, structured data is loaded into a NoSQL table (`EdgeLogAnalytics`) keyed by `LogFileName` as the hash (partition) key.
5. **Alerting System (AWS SNS):** If the pipeline detects a critical threshold (>= 2 errors in a single log), an automated alarm is published to `critical-errors-topic` for engineering response.
6. **Data Extraction:** A local Python extraction engine queries the NoSQL database and compiles the processed JSON reports into a structured CSV dataset, ready for Pandas or ML ingestion.
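
Steps 3 to 5 above can be sketched in Python as follows. This is a minimal illustration, not the repository's `lambda_handler.py`: the `[ERROR]`/`[WARN]` line prefixes are taken from the sample log in the testing section, while the helper names and the attribute names `ErrorCount`, `WarningCount`, and `ProcessedAt` are assumptions. Endpoint configuration for LocalStack is omitted here.

```python
import json
import urllib.parse
from datetime import datetime, timezone

ERROR_THRESHOLD = 2                   # alert at >= 2 errors in a single log
TABLE_NAME = "EdgeLogAnalytics"
TOPIC_NAME = "critical-errors-topic"


def count_levels(text):
    """Count lines that start with [ERROR] or [WARN] in raw log text."""
    errors = warnings = 0
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("[ERROR]"):
            errors += 1
        elif line.startswith("[WARN]"):
            warnings += 1
    return {"errors": errors, "warnings": warnings}


def build_item(file_name, counts):
    """Shape one table record; LogFileName is the hash key."""
    return {
        "LogFileName": file_name,
        "ErrorCount": counts["errors"],        # hypothetical attribute name
        "WarningCount": counts["warnings"],    # hypothetical attribute name
        "ProcessedAt": datetime.now(timezone.utc).isoformat(),  # hypothetical
    }


def lambda_handler(event, context):
    import boto3  # imported lazily so the pure helpers above run without it

    s3 = boto3.client("s3")
    sns = boto3.client("sns")
    table = boto3.resource("dynamodb").Table(TABLE_NAME)

    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        counts = count_levels(body.decode("utf-8"))
        table.put_item(Item=build_item(key, counts))
        if counts["errors"] >= ERROR_THRESHOLD:
            # create_topic is idempotent and returns the existing topic's ARN.
            topic_arn = sns.create_topic(Name=TOPIC_NAME)["TopicArn"]
            sns.publish(TopicArn=topic_arn,
                        Message=json.dumps({"file": key, **counts}))
    return {"statusCode": 200}
```

Keeping the parsing in `count_levels` separate from the AWS calls means the counting logic can be unit-tested without LocalStack running.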
## 🛠️ Tech Stack
* **Cloud Simulation:** LocalStack Pro (with Docker-in-Docker configuration)
* **Infrastructure as Code:** HashiCorp Terraform (installed via `winget`)
* **Compute / ETL Logic:** Python 3.10 (`boto3`, `csv`, `datetime`)
* **AWS Services Used:** S3, Lambda, DynamoDB, SNS, IAM, CloudWatch
* **CLI Tools:** `awslocal`, `docker`, `powershell`
## 🚀 Quick Start & Deployment
### 1. Install Dependencies

    pip install boto3

### 2. Start the Local Cloud Engine
Boot the LocalStack Pro container with the host's Docker socket mounted, so that LocalStack can spin up isolated Lambda execution environments. Set `LOCALSTACK_AUTH_TOKEN` to your own token; it is left empty here:

    docker run -d --name localstack-main -e LOCALSTACK_AUTH_TOKEN="" -p 4566:4566 -v /var/run/docker.sock:/var/run/docker.sock localstack/localstack-pro:latest

### 3. Package the Compute Logic
Compress the Python ETL script into a deployment package:

    Compress-Archive -Path lambda_handler.py -DestinationPath function.zip -Force

### 4. Build the Infrastructure
Initialize the AWS provider and deploy the architecture:

    terraform init
    terraform apply -auto-approve

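
After `terraform apply` finishes, a short script can confirm that the resources named in this README exist. The sketch below is not part of the repository; it assumes LocalStack listens on `http://localhost:4566`, uses the region `us-east-1` and the dummy credentials `test`/`test` as placeholders, and the function names are hypothetical.

```python
LOCALSTACK_URL = "http://localhost:4566"


def client_kwargs():
    """Keyword arguments that point a boto3 client at LocalStack."""
    return {
        "endpoint_url": LOCALSTACK_URL,
        "region_name": "us-east-1",        # assumed region
        "aws_access_key_id": "test",       # dummy credentials (assumption)
        "aws_secret_access_key": "test",
    }


def smoke_check():
    """Raise if the bucket or table is missing; report table and topic state."""
    import boto3                      # imported lazily; requires `pip install boto3`
    from botocore.config import Config

    # Path-style addressing keeps the bucket name out of the hostname.
    s3 = boto3.client("s3", config=Config(s3={"addressing_style": "path"}),
                      **client_kwargs())
    dynamodb = boto3.client("dynamodb", **client_kwargs())
    sns = boto3.client("sns", **client_kwargs())

    s3.head_bucket(Bucket="raw-edge-logs-ahmed")   # raises ClientError if absent
    table = dynamodb.describe_table(TableName="EdgeLogAnalytics")["Table"]
    topic_arns = [t["TopicArn"] for t in sns.list_topics()["Topics"]]
    return {
        "table_status": table["TableStatus"],
        "topic_found": any(arn.endswith(":critical-errors-topic")
                           for arn in topic_arns),
    }
```

Calling `smoke_check()` with the container running returns a small dictionary; an exception from `head_bucket` or `describe_table` indicates that the deployment did not complete.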
## 🧪 Testing the End-to-End Pipeline
To trigger the automated pipeline, generate a dummy log file and push it to the S3 data lake:
**1. Create a clean, UTF-8 encoded log file:**

    $logs = @"
    [INFO] System booted successfully
    [WARN] CPU temperature reaching 85C
    [INFO] User authentication passed
    [ERROR] Database connection timeout
    [ERROR] Retrying connection... failed.
    "@
    [System.IO.File]::WriteAllText("$PWD\server_crash.txt", $logs)

**2. Ingest to S3:**

    awslocal s3 cp server_crash.txt s3://raw-edge-logs-ahmed/

**3. Verify the Pipeline Execution:**
Check the LocalStack container logs to see the Lambda execution and the SNS alarm dispatch:

    docker logs --tail 20 localstack-main

Check the DynamoDB table to verify the Load phase of the ETL process:

    awslocal dynamodb scan --table-name EdgeLogAnalytics

**4. Extract for Data Science (CSV Export):**
Run the extraction script to pull all processed reports from the database into a machine-learning-ready format:

    python export_reports.py

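
For orientation, an export step of this kind could look like the sketch below. It is not the repository's `export_reports.py`: the function names, the output file name `reports.csv`, the region, and the dummy credentials are assumptions. It reads every page of the table scan and converts the `Decimal` numbers that boto3's resource API returns into plain numbers before writing the CSV.

```python
import csv
import io
from decimal import Decimal


def to_plain(value):
    """Convert DynamoDB Decimal numbers to int or float; pass others through."""
    if isinstance(value, Decimal):
        return int(value) if value == value.to_integral_value() else float(value)
    return value


def write_csv(items, out):
    """Write a list of item dicts to the file-like object `out`."""
    # Items may not all share the same attributes, so collect every name.
    fieldnames = sorted({name for item in items for name in item})
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for item in items:
        writer.writerow({k: to_plain(v) for k, v in item.items()})


def scan_all(table):
    """Follow LastEvaluatedKey so every page of the scan is read."""
    items, kwargs = [], {}
    while True:
        page = table.scan(**kwargs)
        items.extend(page["Items"])
        if "LastEvaluatedKey" not in page:
            return items
        kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]


def export_reports(path="reports.csv"):  # hypothetical output file name
    import boto3  # imported lazily; requires `pip install boto3`

    dynamodb = boto3.resource(
        "dynamodb",
        endpoint_url="http://localhost:4566",
        region_name="us-east-1",          # assumed region
        aws_access_key_id="test",         # dummy credentials (assumption)
        aws_secret_access_key="test",
    )
    with open(path, "w", newline="", encoding="utf-8") as fh:
        write_csv(scan_all(dynamodb.Table("EdgeLogAnalytics")), fh)
```

The resulting file can then be loaded with `pandas.read_csv("reports.csv")` for downstream analysis.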
## 🧠 Technical Challenges & Infrastructure Solutions
1. **Docker-in-Docker (DinD) Socket Isolation:**
   * **Issue:** Lambda builds failed with permission errors.
   * **Solution:** Mounted the host socket with `-v /var/run/docker.sock:/var/run/docker.sock` in the `docker run` command, which lets LocalStack start and manage sibling Lambda containers.
   * **Business Impact:** Ensures the CI/CD pipeline is portable and not dependent on host-level OS permissions.
2. **The Windows UTF-16 Encoding Trap:**
   * **Issue:** In Windows PowerShell, output redirection (`>`) writes `UTF-16` files, which caused a `UnicodeDecodeError` in the Linux-based Lambda.
   * **Solution:** Used the `.NET` method `[System.IO.File]::WriteAllText`, which writes UTF-8 by default.
   * **Business Impact:** Robust handling of data from heterogeneous edge environments without manual encoding intervention.
3. **Terraform DNS Routing:**
   * **Issue:** `no such host` failures because Windows could not resolve the `.localhost` subdomains used for bucket addressing.
   * **Solution:** Set `s3_use_path_style = true` in the Terraform provider, which places the bucket name in the URL path instead of the hostname, so no subdomain lookup is needed.
4. **Local API Endpoint Spoofing:**
   * **Issue:** Terraform attempted to contact live AWS production servers.
   * **Solution:** Explicitly mapped `dynamodb` and `sns` endpoints to `http://localhost:4566` in the provider configuration to sandbox traffic.
5. **CloudWatch Variable Interpretation:**
   * **Issue:** PowerShell expanded `$LATEST` in the AWS log stream syntax (`[$LATEST]`) as a PowerShell variable.
   * **Solution:** Wrapped the arguments of all AWS CLI queries in single quotes, which PowerShell treats as literal strings.
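
The encoding trap in item 2 can be reproduced in a few lines of Python. The sketch below also shows one defensive alternative: a decoder that checks for a byte-order mark before choosing a codec. This is an illustration only; the project itself solves the problem by writing UTF-8 at the source, and `decode_log_bytes` is a hypothetical helper name.

```python
import codecs


def decode_log_bytes(raw):
    """Decode log bytes, tolerating UTF-16 files that start with a BOM."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")    # the utf-16 codec consumes the BOM
    return raw.decode("utf-8-sig")     # also strips a UTF-8 BOM if present


# Bytes as a UTF-16 writer would produce them (BOM included).
utf16_bytes = "[ERROR] Database connection timeout\n".encode("utf-16")

try:
    utf16_bytes.decode("utf-8")        # what a strict UTF-8 reader would do
except UnicodeDecodeError as exc:
    print("strict UTF-8 decode failed:", exc.reason)

print(decode_log_bytes(utf16_bytes), end="")
```

A decoder like this trades strictness for tolerance: it accepts both encodings, but it relies on the BOM being present, so UTF-16 data without a BOM would still fail.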
## 📜 License
Distributed under the MIT License. See `LICENSE` for more information.
## 👤 About the Author
**Ahmed Mir** | Computer Engineering Graduate (GIKI) | Machine Learning & Cloud Enthusiast
* www.linkedin.com/in/ahmed-mir-247325139