venkat15vk/ato-nlp-features

GitHub: venkat15vk/ato-nlp-features


# Per-Account NLP Features for Account-Takeover Detection

Reproducible code for a study of per-account TF–IDF feature engineering for account-takeover (ATO) detection on the public AWS Fraud Detector sample dataset.

**Paper:** [arXiv preprint; link added after submission]

## Headline result

Across five chronological splits on 529 expert-labeled login events (161 confirmed account takeovers in a corpus of 735K total events on 100K users):

| Method | AUC-PR | AUC-ROC |
|---|---|---|
| RF [per-account + global TF-IDF] | **0.838 ± 0.020** | **0.763 ± 0.039** |
| RF [per-account TF-IDF] | 0.820 ± 0.034 | 0.690 ± 0.023 |
| Global TF-IDF (raw, unsupervised) | 0.817 ± 0.016 | 0.749 ± 0.027 |
| RF [global TF-IDF] | 0.744 ± 0.061 | 0.631 ± 0.055 |
| Logistic regression on count features | 0.715 ± 0.084 | 0.611 ± 0.049 |
| RF on count features only | 0.648 ± 0.057 | 0.510 ± 0.011 |
| Isolation Forest on count features | 0.597 ± 0.039 | 0.363 ± 0.082 |

Per-account TF–IDF features beat the count-features-only random forest baseline by +0.17 AUC-PR (0.820 vs. 0.648; 5/5 splits, paired t-test p = 0.0002). The improvement is **heterogeneous across users**: per-account modeling under-performs global modeling for cold-start users with ≤ 3 events, then dominates as history depth increases (see Fig. 1 in the paper).

## Repository layout

```
.
├── src/
│   ├── tfidf.py               - PerAccountTFIDF and GlobalTFIDF
│   ├── run_ato.py             - End-to-end ATO experiment (main entry point)
│   ├── run_experiments.py     - Generic harness (RBA-schema datasets)
│   ├── run_openssh.py         - Adapter for the loghub OpenSSH dataset
│   ├── parse_openssh.py       - Log parser for the OpenSSH dataset
│   └── tokenize_event.py      - Tokenizer for RBA-schema events
├── results/
│   ├── ato.csv                - Main results: AWS ATO dataset, 5 splits
│   ├── hybrid.csv             - Cohort-aware policy ablation
│   ├── openssh.csv            - Secondary check: OpenSSH (2K sample)
│   └── openssh_multisplit.csv - OpenSSH, multi-split
├── make_figure.py             - Generates Fig. 1 (volume-stratified bars)
├── requirements.txt
└── README.md
```

## Reproducing the main result

```bash
# 1. Install dependencies
python3 -m pip install -r requirements.txt

# 2. Download the dataset (51 MB compressed, 152 MB uncompressed)
mkdir -p data
curl -L -o data/ato.zip \
  https://raw.githubusercontent.com/aws-samples/aws-fraud-detector-samples/master/data/ato_data_800K_full.csv.zip
unzip data/ato.zip -d data/

# 3. Run the experiment (~30 seconds on a laptop)
python3 src/run_ato.py \
  --data data/ato_data_800K_full.csv \
  --out results/ato.csv

# 4. (Optional) regenerate Fig. 1
python3 make_figure.py
```

Output should match `results/ato.csv` in this repo (up to numerical noise).

## Method (brief)

For each login event, build a bag of textual tokens from its structured fields (IP octets, user-agent components, device fingerprint, hour-of-day, credential outcome). For each user account with sufficient history, compute a TF–IDF score against that user's prior events ("per-account"). Separately compute a TF–IDF score against the full corpus ("global"). Concatenate these scores with seven count-based per-user features (event count, distinct IPs, distinct user-agents, etc.) and train a random forest classifier.

Full method description, math, and ablations are in the paper.

## Dataset

The AWS Fraud Detector Account Takeover Insights sample dataset, hosted publicly at `aws-samples/aws-fraud-detector-samples`. It contains 735,683 login events on 100,000 distinct user accounts over approximately three months in 2022, with 529 expert-labeled events (161 confirmed account takeovers, 368 legitimate).

License: Apache 2.0 / MIT-0 per the source repository.
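The per-account and global scoring described under "Method (brief)" can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the repository's `PerAccountTFIDF` or `GlobalTFIDF` code: the event field names (`ip`, `user_agent`, `hour`), the helper names `tokenize` and `tfidf_score`, and the smoothed IDF formula are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def tokenize(event):
    """Turn one structured login event (a dict) into field-prefixed tokens.

    The field names used here are hypothetical, chosen for illustration.
    """
    octets = event["ip"].split(".")
    # Emit growing IP prefixes so nearby addresses share some tokens.
    tokens = ["ip:" + ".".join(octets[:n]) for n in range(1, len(octets) + 1)]
    tokens.append("ua:" + event["user_agent"])
    tokens.append("hour:%02d" % event["hour"])
    return tokens


def tfidf_score(tokens, reference_docs):
    """Sum of TF * IDF over the event's tokens; higher means more unusual.

    IDF is the smoothed form ln((1 + N) / (1 + df)) + 1, an assumed choice.
    """
    if not tokens:
        return 0.0
    n_docs = len(reference_docs)
    doc_freq = Counter()
    for doc in reference_docs:
        doc_freq.update(set(doc))  # count each token once per document
    score = 0.0
    for token, count in Counter(tokens).items():
        tf = count / len(tokens)
        idf = math.log((1 + n_docs) / (1 + doc_freq[token])) + 1.0
        score += tf * idf
    return score


# Toy data: one user with a stable login pattern, plus another user's events.
alice_history = [tokenize({"ip": "10.0.0.5", "user_agent": "firefox", "hour": 9})
                 for _ in range(3)]
other_events = [tokenize({"ip": "203.0.113.7", "user_agent": "chrome", "hour": 22})
                for _ in range(3)]
corpus = alice_history + other_events

familiar = tokenize({"ip": "10.0.0.5", "user_agent": "firefox", "hour": 9})
unusual = tokenize({"ip": "203.0.113.7", "user_agent": "chrome", "hour": 22})

# "Per-account": score against the user's own prior events.
per_account_familiar = tfidf_score(familiar, alice_history)
per_account_unusual = tfidf_score(unusual, alice_history)
# "Global": score the same event against the full corpus.
global_unusual = tfidf_score(unusual, corpus)
```

In this toy example the unusual event scores higher against the user's own history than against the whole corpus, because its tokens are common globally but absent from that user's past logins. In the pipeline described above, such scores would be concatenated with the count-based features before training the random forest.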
## Citation

```bibtex
@misc{gopalakrishnan2026ato,
  author       = {Gopalakrishnan, Venkatakrishnan},
  title        = {Per-Account NLP Features for Account-Takeover Detection: A Reproducible Study on Real-World Login Data},
  year         = {2026},
  howpublished = {arXiv preprint [arXiv ID to be added after submission]}
}
```

## License

Code: MIT (see LICENSE).

The dataset used in this work is distributed under the terms of the aws-samples/aws-fraud-detector-samples repository; please refer to that project for dataset usage rights.

## Contact

Issues and pull requests welcome.