thisisdhruvchopra/log-analyser
# HTTP Honeypot Log Analyser
A modular Python-based threat intelligence toolkit for analysing HTTP honeypot logs. Built for real-world deployment as part of a deception technology engagement at a critical industrial infrastructure client.
Each script is standalone and targets a specific aspect of attacker behaviour: event classification, credential harvesting, geolocation, endpoint targeting, and temporal analysis.
## Module Overview
| Script | Purpose |
|---|---|
| `analysis.py` | Classifies log events by type and severity |
| `geo_analysis.py` | Maps attacking IPs to countries using GeoIP |
| `endpoint.py` | Identifies which endpoints are being targeted and by whom |
| `passwords.py` | Extracts and ranks credential (password) attempts |
| `usernames.py` | Extracts and ranks username attempts |
| `summary.py` | Aggregates outputs into a single attack summary |
| `split_logs_by_date.py` | Slices logs into a specific date range |
| `remove_localhost_logs.py` | Cleans localhost/loopback entries from raw logs |
## Requirements
- Python 3.8+
- [`maxminddb`](https://pypi.org/project/maxminddb/) — for GeoIP lookups
- `GeoLite2-Country.mmdb` — MaxMind GeoLite2 database (included in repo)
Install the dependency with `pip install maxminddb`.
## Usage
Each script is run independently and prompts for input interactively.
### 1. Clean the logs first
Remove loopback/localhost noise before any analysis:
`python remove_localhost_logs.py`
- Input: your raw `.log` file
- Output: `_no_localhost.log`
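The cleaning step can be sketched roughly as follows. This is an illustration and not the code in `remove_localhost_logs.py`: it assumes the client address is the first field of each line, and the `LOOPBACK` pattern, `is_localhost`, and `remove_localhost` are hypothetical names.

```python
import re

# Assumption: the client address is the first whitespace-delimited field.
# The set of loopback markers below is a guess, not the script's actual list.
LOOPBACK = re.compile(r"^(127\.\d{1,3}\.\d{1,3}\.\d{1,3}|::1|localhost)\b")


def is_localhost(line: str) -> bool:
    """Return True when the line starts with a loopback address."""
    return bool(LOOPBACK.match(line.lstrip()))


def remove_localhost(lines):
    """Keep only the lines that did not originate from loopback."""
    return [line for line in lines if not is_localhost(line)]
```

Applied to a file, this would amount to reading the raw log line by line and writing only the surviving lines to the output file.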
### 2. Slice by date range (optional)
Focus analysis on a specific week or reporting period:
`python split_logs_by_date.py`
- Accepted date formats: `YYYY-MM-DD`, `DD-MM-YYYY`, `YYYY/MM/DD`, etc.
- Output: `__to_.log`
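One plausible way to accept several date formats and slice on them is shown below. It is a sketch under assumptions: the log timestamp is taken to look like `[14/May/2025 10:00:00]`, only the three formats named above are tried, and `parse_date`, `line_date`, and `slice_by_date` are hypothetical helpers rather than functions from the script. Dates are compared exactly as written in the log, with no timezone conversion.

```python
import re
from datetime import date, datetime

# The three formats listed above; the real script may accept more.
DATE_FORMATS = ("%Y-%m-%d", "%d-%m-%Y", "%Y/%m/%d")

# Assumed timestamp shape inside each log line: [14/May/2025 10:00:00]
STAMP = re.compile(r"\[(\d{1,2})/([A-Za-z]{3})/(\d{4})")
MONTHS = {name: number for number, name in enumerate(
    ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"), start=1)}


def parse_date(text: str) -> date:
    """Try each known format in turn and return the first that parses."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date: {text!r}")


def line_date(line: str):
    """Extract the calendar date from a log line, or None if absent."""
    match = STAMP.search(line)
    if not match:
        return None
    day, month_name, year = match.groups()
    month = MONTHS.get(month_name.title())
    return date(int(year), month, int(day)) if month else None


def slice_by_date(lines, start: str, end: str):
    """Keep lines whose date falls within [start, end], inclusive."""
    first, last = parse_date(start), parse_date(end)
    return [line for line in lines
            if (d := line_date(line)) is not None and first <= d <= last]
```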
### 3. Classify events by type and severity
`python analysis.py`
- Output: `analysis_event_statistics.csv`, `analysis_severity_breakdown.csv`
**Event types detected:**
| Event | Severity | Examples |
|---|---|---|
| `command_execution_attempts` | High | shell pipes, backtick execution, `$()` |
| `exploit_attempts` | High | `/etc/passwd`, `../`, `/bin/sh`, `/cmd=` |
| `attack_log_attempts` | High | Bad request version, HTTP 400 |
| `scan_attempts` | Medium | `/admin`, `/wp-`, `.env`, `.git`, `/phpmyadmin` |
| `connect_attempts` | Low | Standard GET/POST/HEAD requests |
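The table above can be read as an ordered rule list. The sketch below assumes a first-match-wins order from highest to lowest severity and uses simplified regexes derived from the Examples column. The actual patterns and precedence in `analysis.py` may differ, and `RULES`, `classify`, and `severity_breakdown` are hypothetical names.

```python
import re
from collections import Counter

# (event type, severity, pattern), checked in order; the first match wins.
# Patterns are simplified approximations of the examples in the table.
RULES = [
    ("command_execution_attempts", "High",
     re.compile(r"[;|]\s*\w+|`[^`]+`|\$\([^)]*\)")),
    ("exploit_attempts", "High",
     re.compile(r"/etc/passwd|\.\./|/bin/sh|/cmd=")),
    ("attack_log_attempts", "High",
     re.compile(r"Bad request version|code 400", re.IGNORECASE)),
    ("scan_attempts", "Medium",
     re.compile(r"/admin|/wp-|\.env|\.git|/phpmyadmin", re.IGNORECASE)),
    ("connect_attempts", "Low",
     re.compile(r'"(GET|POST|HEAD) ')),
]


def classify(line: str):
    """Return (event_type, severity) for a log line, or None if no rule fits."""
    for event, severity, pattern in RULES:
        if pattern.search(line):
            return event, severity
    return None


def severity_breakdown(lines):
    """Count classified lines per severity level."""
    return Counter(result[1] for result in map(classify, lines) if result)
```

Ordering the rules by severity means a request such as `/admin/../../etc/passwd` is reported as an exploit attempt rather than as a scan.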
### 4. Geolocate attacking IPs
`python geo_analysis.py`
- Output: `attacking_countries.csv`, `attacking_ips_country.csv`
Only log lines that match an attack pattern are counted, consistent with `analysis.py`.
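A country lookup with `maxminddb` might look like the sketch below. `maxminddb.open_database` and `Reader.get` are the library's documented calls; the helper names `open_country_lookup` and `count_countries`, the `"Unknown"` fallback, and the choice of the English country name are assumptions made for illustration.

```python
from collections import Counter


def open_country_lookup(db_path: str = "GeoLite2-Country.mmdb"):
    """Return a function that maps an IP string to an English country name."""
    import maxminddb  # third-party: pip install maxminddb

    reader = maxminddb.open_database(db_path)

    def lookup(ip: str) -> str:
        # Reader.get returns None when the address is not in the database.
        record = reader.get(ip) or {}
        return record.get("country", {}).get("names", {}).get("en", "Unknown")

    return lookup


def count_countries(ips, lookup):
    """Tally attack events per country using the supplied lookup function."""
    return Counter(lookup(ip) for ip in ips)
```

Passing the lookup in as a function keeps the counting logic independent of the database file, so it can be exercised with a stub. Note that `Reader.get` raises `ValueError` for strings that are not valid IP addresses, so malformed fields would need to be filtered out beforehand.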
### 5. Extract credential attempts
- `python passwords.py`: top 50 passwords attempted
- `python usernames.py`: top 50 usernames attempted
- Output: `_top_passwords.csv` / `_top_usernames.csv`
Both scripts extract values from POST body parameters (`password=`, `pass=`, `username=`, `user=`, etc.) and automatically filter out path-traversal strings and other exploit payloads.
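The extraction and ranking can be approximated with the standard library. The sketch assumes URL-encoded POST bodies, covers only the parameter names listed above, and uses a small illustrative `JUNK` list as the filter; the scripts' real key lists and filtering rules are not shown in this README, and `extract_values` and `top_values` are hypothetical names.

```python
from collections import Counter
from urllib.parse import parse_qs

# Parameter names taken from the description above; the scripts may check more.
PASSWORD_KEYS = ("password", "pass")
USERNAME_KEYS = ("username", "user")

# Assumed markers of traversal or exploit payloads to discard.
JUNK = ("../", "/etc/", "/bin/", "$(", "`")


def extract_values(body: str, keys):
    """Pull the values of the given parameters out of a URL-encoded body."""
    params = parse_qs(body)  # also decodes percent-encoding
    return [value
            for key in keys
            for value in params.get(key, [])
            if not any(marker in value for marker in JUNK)]


def top_values(bodies, keys, n: int = 50):
    """Rank the most frequently attempted values, highest count first."""
    counts = Counter(v for body in bodies for v in extract_values(body, keys))
    return counts.most_common(n)
```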
### 6. Analyse endpoint targeting
`python endpoint.py`
- Prompts for a specific endpoint, e.g. `/admin` or `/cgi-bin`
- Output: `endpoint.csv` (IP → hit count, sorted descending)
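Counting hits per IP for one endpoint could be done along these lines. The log layout (`IP - - [timestamp] "METHOD /path HTTP/x" ...`) and the prefix-matching rule are assumptions, and `hits_for_endpoint` is a hypothetical name rather than a function from `endpoint.py`.

```python
import re
from collections import Counter

# Assumed line shape: <ip> - - [<timestamp>] "<METHOD> <path> HTTP/x.y" ...
REQUEST = re.compile(r'^(\S+) .*?"[A-Z]+ (\S+)')


def hits_for_endpoint(lines, endpoint: str):
    """Return (ip, hit_count) pairs for the endpoint, sorted descending."""
    hits = Counter()
    for line in lines:
        match = REQUEST.match(line)
        # Prefix match: "/admin" also counts "/admin/login".
        if match and match.group(2).startswith(endpoint):
            hits[match.group(1)] += 1
    return hits.most_common()
```

The returned pairs are already in descending order, so they could be written row by row to a CSV file with `csv.writer`.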
### 7. Generate full attack summary
`python summary.py`
- Requires: `attacking_countries.csv` and `attacking_ips_country.csv`, both produced by `geo_analysis.py` in step 4, must already exist
- Output: top date, top country, top IP, most targeted endpoint
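Picking the top entry from one of the input CSVs reduces to a maximum over the count column. The column layout is not documented in this README, so the sketch assumes a header row, the label in the first column, and an integer count in the last column; `top_row` is a hypothetical helper.

```python
import csv
import io


def top_row(csv_text: str):
    """Return (label, count) for the row with the highest count, or None.

    Assumes a header row, the label in the first column and an integer
    count in the last column. The real CSV layout may differ.
    """
    rows = list(csv.reader(io.StringIO(csv_text)))[1:]  # skip the header
    return max(((row[0], int(row[-1])) for row in rows if row),
               key=lambda pair: pair[1], default=None)
```

Reading a file from disk would work the same way by passing the opened file to `csv.reader` instead of wrapping a string in `io.StringIO`.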
## Sample Summary Output
=== BSP Aggregated Attack Summary ===
Most Traffic / Attacks On
14 May 2025 (3,241 events)
Most Attacking Country
China (1,872 events)
Most Attacking IP
45.xxx.xxx.xxx (304 events)
Most Targeted Endpoint
/admin (918 hits)
## Attack Detection Logic
All scripts share a consistent regex-based detection engine. A log line is classified as an attack if it matches any of the following patterns:
- **Command injection**: `;cmd`, `|cmd`, `` `cmd` ``, `$(cmd)`
- **Path traversal / exploits**: `../`, `/etc/passwd`, `/proc/self`, `/bin/sh`
- **CMS / admin scanning**: `/wp-`, `/phpmyadmin`, `/manager/html`, `/HNAP1`
- **Sensitive file probing**: `.env`, `.git`, `/cgi-bin`
- **Malformed HTTP**: `Bad request version`, `code 400`, `invalid HTTP version`
ANSI escape sequences are stripped from every log line before processing, so logs captured from coloured terminal output are matched correctly.
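The sketch below shows why stripping comes first: a colour code such as `ESC[1;32m` contains `;32m`, which a naive command-injection pattern would otherwise flag. The combined regex is a simplified approximation of the pattern list above, not the shared engine itself, and `strip_ansi` and `is_attack` are hypothetical names.

```python
import re

# CSI sequences: ESC [ parameter bytes, intermediate bytes, one final byte.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

# Simplified approximation of the pattern families listed above.
ATTACK = re.compile("|".join([
    r"[;|]\s*\w+", r"`[^`]+`", r"\$\([^)]*\)",              # command injection
    r"\.\./", r"/etc/passwd", r"/proc/self", r"/bin/sh",    # traversal / exploits
    r"/wp-", r"/phpmyadmin", r"/manager/html", r"/HNAP1",   # CMS / admin scanning
    r"\.env", r"\.git", r"/cgi-bin",                        # sensitive file probing
    r"Bad request version", r"code 400", r"invalid HTTP version",  # malformed HTTP
]), re.IGNORECASE)


def strip_ansi(line: str) -> str:
    """Remove terminal colour and cursor control sequences."""
    return ANSI_ESCAPE.sub("", line)


def is_attack(line: str) -> bool:
    """Strip ANSI codes first, then test against the attack patterns."""
    return bool(ATTACK.search(strip_ansi(line)))
```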
## Notes
- Log timestamps are expected to be in IST (UTC+5:30); date-based slicing assumes this timezone.
- The `GeoLite2-Country.mmdb` database must be present in the working directory for `geo_analysis.py` to function.
- All CSVs are UTF-8 encoded and compatible with Excel and Google Sheets.
- Scripts are designed for **weekly threat intel reporting** workflows.
## Author
**Dhruv Chopra**
Associate – Deception Technology, C3iHub @ IIT Kanpur
[github.com/thisisdhruvchopra](https://github.com/thisisdhruvchopra)