NarahariRaghava/Pre-Deployment-Detection-of-Terraform-Security-Misconfigurations-Using-Machine-Learning
# Terraform Security Misconfiguration Detector
A CLI tool that scans Terraform `.tf` files and flags security risks before they reach production, built after reviewing the infrastructure of 45+ cloud applications and seeing the same misconfigurations surface repeatedly.

## The Problem
Terraform is how most teams manage AWS infrastructure. It's fast and repeatable, but that speed cuts both ways: a single misconfigured resource block, such as an open SSH port, a public S3 bucket, or a wildcard IAM policy, gets deployed just as easily as a correct one.
Most tools catch these issues after deployment, during a cloud security audit or an incident. This tool catches them at the source, before `terraform apply` is ever run.
## What It Does
Point it at a `.tf` file or a project folder. It finds every resource block, runs it through a trained ML model, and tells you what's risky and why.

    Resource : aws_security_group.open_ssh
    Risk     : High (High: 94% | Low: 2% | Medium: 4%)
    Reason   : CIDR range is open to the entire internet (0.0.0.0/0); SSH (port 22) is exposed.

    Resource : aws_db_instance.reporting_db
    Risk     : Medium (High: 13% | Low: 10% | Medium: 76%)
    Reason   : the database is publicly accessible.

    Resource : aws_db_instance.secure_db
    Risk     : Low (High: 1% | Low: 73% | Medium: 26%)
    Reason   : No high-risk indicators detected.

    Summary  : High=3 Medium=1 Low=5

It also generates a colour-coded HTML report you can share or open in a browser.
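
To make the "finds every resource block" step concrete, here is a minimal sketch of one way to locate resource blocks in Terraform source. It is an illustration under stated assumptions, not the project's actual parser: the function name `find_resource_blocks` and the sample input are hypothetical, and braces are counted naively rather than with a real HCL parser.

```python
import re

# Matches the opening line of a resource block, e.g.
#   resource "aws_s3_bucket" "logs" {
RESOURCE_HEADER = re.compile(
    r'^\s*resource\s+"([^"]+)"\s+"([^"]+)"\s*\{', re.MULTILINE
)


def find_resource_blocks(tf_source: str) -> list[dict]:
    """Return the type, name, and body text of each resource block.

    Simplification (assumption): braces are counted naively, so a brace
    inside a string literal or a comment would confuse the matcher.
    """
    blocks = []
    for match in RESOURCE_HEADER.finditer(tf_source):
        depth = 1
        pos = match.end()
        # Walk forward until the brace opened by the header is closed.
        while pos < len(tf_source) and depth > 0:
            ch = tf_source[pos]
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
            pos += 1
        blocks.append({
            "type": match.group(1),
            "name": match.group(2),
            "body": tf_source[match.end():pos - 1],
        })
    return blocks


# Hypothetical input, modelled on the resource names in the sample output.
SAMPLE_TF = '''
resource "aws_security_group" "open_ssh" {
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_db_instance" "reporting_db" {
  publicly_accessible = true
}
'''

for block in find_resource_blocks(SAMPLE_TF):
    print(f'{block["type"]}.{block["name"]}')
```

A production scanner would more likely rely on a dedicated HCL parser, which handles strings, comments, and heredocs correctly.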
## How It Works
**1. Parse:** each `.tf` file is read and split into its individual resource blocks.
**2. Extract:** each block is checked against 17 security indicators across common AWS resource types (open CIDR ranges, exposed ports, unencrypted storage, hardcoded secrets, and more). Each indicator is a 1 or 0. The block becomes a row of 17 numbers.
**3. Classify:** that row goes into a trained Random Forest model, which outputs Low, Medium, or High risk along with a per-class confidence score.
**4. Explain:** whichever indicators fired are translated into a plain-English reason, so engineers know exactly what to fix.
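
The extract and explain steps can be sketched as follows. This is a simplified illustration, not the code in `feature_extractor.py`: it uses only four made-up indicators instead of the 17 the tool checks, the indicator names and regular expressions are assumptions, and the classification step is left out (in the real tool, the feature row would be passed to the trained model).

```python
import re

# Hypothetical subset of the indicators: (name, pattern, plain-English reason).
INDICATORS = [
    ("open_cidr", re.compile(r'"0\.0\.0\.0/0"'),
     "CIDR range is open to the entire internet (0.0.0.0/0)"),
    ("ssh_port", re.compile(r"\bfrom_port\s*=\s*22\b"),
     "SSH (port 22) is exposed"),
    ("public_db", re.compile(r"\bpublicly_accessible\s*=\s*true\b"),
     "the database is publicly accessible"),
    ("unencrypted_storage", re.compile(r"\bstorage_encrypted\s*=\s*false\b"),
     "storage is not encrypted"),
]


def extract_features(block_body: str) -> list[int]:
    """Turn one resource block into a row of 1/0 indicator values."""
    return [1 if pattern.search(block_body) else 0
            for _, pattern, _ in INDICATORS]


def explain(features: list[int]) -> str:
    """Join the reasons for every indicator that fired."""
    reasons = [reason
               for (_, _, reason), hit in zip(INDICATORS, features)
               if hit]
    if not reasons:
        return "No high-risk indicators detected."
    return "; ".join(reasons) + "."


# Hypothetical body of a security group that exposes SSH to the internet.
ssh_block = '''
  ingress {
    from_port   = 22
    to_port     = 22
    cidr_blocks = ["0.0.0.0/0"]
  }
'''

row = extract_features(ssh_block)
print(row)           # [1, 1, 0, 0]
print(explain(row))
```

Because every feature is a named yes/no check, the explanation is just a lookup over the indicators that fired, which keeps the reason text consistent with what the model saw.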
## Resource Types Covered
| Resource | What gets flagged |
|---|---|
| `aws_security_group` / `aws_security_group_rule` | SSH/RDP open to internet, DB ports exposed, IPv6 open |
| `aws_s3_bucket` | Public-read ACL, missing public access block |
| `aws_iam_policy` | Wildcard action or resource |
| `aws_db_instance` | Publicly accessible, storage not encrypted |
| `aws_instance` | Public IP assigned, unencrypted EBS, hardcoded credentials |
| `aws_lambda_function` | Hardcoded passwords or tokens in environment variables |
| `aws_lb_listener` | Plain HTTP with no redirect to HTTPS |
| `aws_cloudtrail` | Logging explicitly disabled |
## Models
Three classifiers were trained and compared on a balanced dataset of 300 labeled Terraform snippets:
| Model | Test Accuracy | 5-Fold CV |
|---|---|---|
| Logistic Regression | **81.33%** | 78.67% ± 3.71% |
| Random Forest | 77.33% | 78.33% ± 3.50% |
| Decision Tree | 77.33% | 77.00% ± 3.23% |
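Random Forest is the default because it provides per-class confidence scores and, in the evaluation reported here, never misclassified a High-risk resource as Low. Logistic Regression achieves the highest test accuracy, which is plausible given that the binary feature space is largely linearly separable. Its cross-validation score is only about a third of a percentage point above Random Forest's, well within the reported spread.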
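
The "High misclassified as Low" criterion is a different measurement from accuracy, and a short sketch shows how the two can be computed side by side. The labels below are made up purely for illustration and are not results from this project; the function names are hypothetical.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(y_true, y_pred))


def high_as_low_errors(y_true, y_pred):
    """Number of High-risk resources that were predicted as Low."""
    return confusion_counts(y_true, y_pred)[("High", "Low")]


# Made-up labels for illustration only.
actual = ["High", "High", "Medium", "Low", "Low", "Medium"]
predicted = ["High", "Medium", "Medium", "Low", "Medium", "Low"]

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(f"accuracy = {accuracy:.2%}")
print("High predicted as Low:", high_as_low_errors(actual, predicted))
```

A model can have lower overall accuracy and still be the safer default if its errors avoid the most costly cell of the confusion matrix.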
## Project Structure

    terraform-security-ml/
    ├── data/
    │   ├── generate_dataset.py
    │   ├── terraform_dataset.csv
    │   └── sample_tf/
    │       └── example.tf
    ├── src/
    │   ├── feature_extractor.py
    │   ├── model_trainer.py
    │   ├── predictor.py
    │   └── report_generator.py
    ├── notebooks/
    │   └── exploration.ipynb
    ├── outputs/
    │   ├── evaluation_report.json / .txt
    │   ├── confusion_matrix_*.png
    │   ├── feature_importance_*.png
    │   ├── scan_*.json / .txt / .html
    │   └── models/
    ├── main.py
    └── requirements.txt

## Setup

    cd terraform-security-ml
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

## Usage
Run the tool:

    python main.py

This launches an interactive menu where you can choose to scan a file, scan a directory, train the models, or run demo predictions.
You can also use command-line arguments directly:

    python main.py --file path/to/main.tf   # Scan a single file
    python main.py --dir path/to/project/   # Scan a project directory
    python main.py --train                  # Train the models
    python main.py --predict                # Run demo predictions

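
For readers curious how flags like these are typically wired up, here is a minimal `argparse` sketch. It is an assumption about how such a CLI could be structured, not the contents of `main.py`: the helper names are hypothetical, and treating the four flags as mutually exclusive is a design choice made for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="main.py",
        description="Scan Terraform files for security misconfigurations.",
    )
    # Assumption: at most one mode may be given per run.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("--file", metavar="PATH", help="scan a single .tf file")
    mode.add_argument("--dir", metavar="PATH", help="scan a project directory")
    mode.add_argument("--train", action="store_true", help="train the models")
    mode.add_argument("--predict", action="store_true",
                      help="run demo predictions")
    return parser


def choose_mode(args: argparse.Namespace) -> str:
    """Describe which action the parsed arguments select."""
    if args.file:
        return f"scan file {args.file}"
    if args.dir:
        return f"scan directory {args.dir}"
    if args.train:
        return "train"
    if args.predict:
        return "predict"
    # No flags given: fall back to the interactive menu.
    return "interactive menu"


# Parse an explicit argument list instead of sys.argv for the demo.
demo_args = build_parser().parse_args(["--file", "path/to/main.tf"])
print(choose_mode(demo_args))
```

Falling through to the interactive menu when no flag is given mirrors the behaviour described above for a bare `python main.py`.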
After scanning, open the generated HTML report:

    open outputs/scan_.html
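
As a rough idea of what producing a colour-coded report involves, the sketch below renders findings into an HTML table using only the standard library. It is not the code in `report_generator.py`: the colour values, the `render_report` name, and the shape of the findings dictionaries are all assumptions made for illustration.

```python
import html

# Assumed colour scheme; the real report's colours may differ.
RISK_COLOURS = {"High": "#c0392b", "Medium": "#e67e22", "Low": "#27ae60"}


def render_report(findings: list[dict]) -> str:
    """Render findings as a small HTML table, one row per resource."""
    rows = []
    for finding in findings:
        colour = RISK_COLOURS[finding["risk"]]
        rows.append(
            "<tr>"
            f'<td>{html.escape(finding["resource"])}</td>'
            f'<td style="color:{colour}">{finding["risk"]}</td>'
            # Escape free text so it cannot inject markup into the report.
            f'<td>{html.escape(finding["reason"])}</td>'
            "</tr>"
        )
    return (
        "<!DOCTYPE html>\n<html><body><table>\n"
        "<tr><th>Resource</th><th>Risk</th><th>Reason</th></tr>\n"
        + "\n".join(rows)
        + "\n</table></body></html>\n"
    )


# Example findings, taken from the sample output earlier in this README.
report = render_report([
    {"resource": "aws_db_instance.reporting_db", "risk": "Medium",
     "reason": "the database is publicly accessible."},
    {"resource": "aws_db_instance.secure_db", "risk": "Low",
     "reason": "No high-risk indicators detected."},
])
print(report)
```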