venkat15vk/ato-nlp-features
# Per-Account NLP Features for Account-Takeover Detection
Reproducible code for a study of per-account TF–IDF feature engineering for
account-takeover (ATO) detection on the public AWS Fraud Detector sample
dataset.
**Paper:** [arXiv preprint — link added after submission]
## Headline result
Across five chronological splits on 529 expert-labeled login events
(161 confirmed account takeovers in a corpus of 735K total events on
100K users):
| Method | AUC-PR | AUC-ROC |
|---|---|---|
| RF [per-account + global TF-IDF] | **0.838 ± 0.020** | **0.763 ± 0.039** |
| RF [per-account TF-IDF] | 0.820 ± 0.034 | 0.690 ± 0.023 |
| Global TF-IDF (raw, unsupervised) | 0.817 ± 0.016 | 0.749 ± 0.027 |
| RF [global TF-IDF] | 0.744 ± 0.061 | 0.631 ± 0.055 |
| Logistic regression on count features | 0.715 ± 0.084 | 0.611 ± 0.049 |
| RF on count features only | 0.648 ± 0.057 | 0.510 ± 0.011 |
| Isolation Forest on count features | 0.597 ± 0.039 | 0.363 ± 0.082 |
RF [per-account TF–IDF] beats the count-only random forest by +0.17 AUC-PR
(0.820 vs. 0.648; 5/5 splits, paired t-test p = 0.0002). Against the
strongest count-feature baseline in the table, logistic regression at
0.715, the gap is about +0.10.

The improvement is **heterogeneous across users**: per-account modeling
underperforms global modeling for cold-start users with ≤ 3 events,
then pulls ahead as history depth increases (see Fig. 1 in the paper).
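
The cold-start behavior described above suggests a simple cohort-aware rule: fall back to the global score when a user has little history, and use the per-account score otherwise. The sketch below is illustrative only. The function name `pick_score`, the default cutoff of 3 events, and the idea of switching between two precomputed scores are assumptions, not necessarily the policy evaluated in `results/hybrid.csv`.

```python
def pick_score(n_prior_events: int, per_account: float, global_score: float,
               min_history: int = 3) -> float:
    """Return the global score for cold-start users, else the per-account score.

    min_history=3 mirrors the "<= 3 events" cohort mentioned in this README;
    it is an assumed cutoff, not a tuned value.
    """
    if n_prior_events <= min_history:
        return global_score
    return per_account
```

A hard switch like this is the simplest option. A smoother alternative would be to pass both scores and the history depth to the classifier and let it learn the trade-off.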
## Repository layout

    .
    ├── src/
    │   ├── tfidf.py                # PerAccountTFIDF and GlobalTFIDF
    │   ├── run_ato.py              # End-to-end ATO experiment (main entry point)
    │   ├── run_experiments.py      # Generic harness (RBA-schema datasets)
    │   ├── run_openssh.py          # Adapter for the loghub OpenSSH dataset
    │   ├── parse_openssh.py        # Log parser for the OpenSSH dataset
    │   └── tokenize_event.py       # Tokenizer for RBA-schema events
    ├── results/
    │   ├── ato.csv                 # Main results: AWS ATO dataset, 5 splits
    │   ├── hybrid.csv              # Cohort-aware policy ablation
    │   ├── openssh.csv             # Secondary check: OpenSSH (2K sample)
    │   └── openssh_multisplit.csv  # OpenSSH, multi-split
    ├── make_figure.py              # Generates Fig. 1 (volume-stratified bars)
    ├── requirements.txt
    └── README.md

## Reproducing the main result

    # 1. Install dependencies
    python3 -m pip install -r requirements.txt

    # 2. Download the dataset (51 MB compressed, 152 MB uncompressed)
    mkdir -p data
    curl -L -o data/ato.zip \
      https://raw.githubusercontent.com/aws-samples/aws-fraud-detector-samples/master/data/ato_data_800K_full.csv.zip
    unzip data/ato.zip -d data/

    # 3. Run the experiment (~30 seconds on a laptop)
    python3 src/run_ato.py \
      --data data/ato_data_800K_full.csv \
      --out results/ato.csv

    # 4. (Optional) regenerate Fig. 1
    python3 make_figure.py

Output should match `results/ato.csv` in this repo (up to numerical noise).
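
One way to check a rerun against the committed results is a cell-by-cell comparison with a numeric tolerance. Step 3 writes to `results/ato.csv`, so pointing `--out` at a different path keeps the committed file available for comparison; the name `results/ato_rerun.csv` below is hypothetical. The helper `csv_close` and its tolerances are also assumptions, since this README does not quantify "numerical noise". The sketch uses only the standard library and makes no assumption about column names.

```python
import csv
import math


def csv_close(path_a: str, path_b: str,
              rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Return True if both CSVs have the same shape and matching cells.

    Cells that parse as floats are compared with a tolerance; all other
    cells (headers, method names) must match exactly.
    """
    with open(path_a, newline="") as file_a, open(path_b, newline="") as file_b:
        rows_a = list(csv.reader(file_a))
        rows_b = list(csv.reader(file_b))
    if len(rows_a) != len(rows_b):
        return False
    for row_a, row_b in zip(rows_a, rows_b):
        if len(row_a) != len(row_b):
            return False
        for cell_a, cell_b in zip(row_a, row_b):
            try:
                x, y = float(cell_a), float(cell_b)
            except ValueError:
                # At least one cell is non-numeric: require exact equality.
                if cell_a != cell_b:
                    return False
                continue
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return False
    return True


if __name__ == "__main__":
    # Hypothetical rerun path; adjust to wherever --out pointed.
    print(csv_close("results/ato.csv", "results/ato_rerun.csv"))
```

If the comparison fails, loosening `rel_tol` shows whether the difference is noise or a real divergence.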
## Method (brief)
For each login event, build a bag of textual tokens from its structured
fields (IP octets, user-agent components, device fingerprint, hour-of-day,
credential outcome). For each user account with sufficient history,
compute a TF–IDF score against that user's prior events ("per-account").
Separately compute a TF–IDF score against the full corpus ("global").
Concatenate these scores with seven count-based per-user features (event
count, distinct IPs, distinct user-agents, etc.) and train a random
forest classifier.
Full method description, math, and ablations are in the paper.
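
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the code in `src/tfidf.py`. The event schema (`user`, `ip`, `user_agent`, `device`, `hour`, `login_ok`), the token format, the smoothed IDF formula, and the `min_history` threshold are all hypothetical choices made for this example. Each token occurs once per event, so the TF–IDF score reduces to a mean IDF over the event's tokens.

```python
import math
from collections import Counter, defaultdict


def tokenize(event: dict) -> list[str]:
    """Turn structured fields into field-prefixed tokens (hypothetical schema)."""
    octets = event["ip"].split(".")
    return [
        "ip1=" + octets[0],
        "ip12=" + ".".join(octets[:2]),
        "ua=" + event["user_agent"],
        "dev=" + event["device"],
        "hour=" + str(event["hour"]),
        "ok=" + str(event["login_ok"]),
    ]


def rarity_score(tokens: list[str], doc_freq: Counter, n_docs: int) -> float:
    """Mean smoothed IDF of the tokens against a reference history.

    Higher values mean the event looks more unusual relative to that history.
    """
    idf = [math.log((1 + n_docs) / (1 + doc_freq[t])) for t in tokens]
    return sum(idf) / len(idf)


def score_stream(events: list[dict], min_history: int = 1):
    """Yield (per_account, global) scores using only strictly earlier events.

    `events` must be sorted by time. The per-account score is None until the
    user has at least `min_history` prior events (assumed threshold).
    """
    user_df = defaultdict(Counter)  # user -> token document frequencies
    user_n = Counter()              # user -> number of prior events
    global_df = Counter()
    global_n = 0
    for ev in events:
        toks = tokenize(ev)
        uid = ev["user"]
        per_account = None
        if user_n[uid] >= min_history:
            per_account = rarity_score(toks, user_df[uid], user_n[uid])
        yield per_account, rarity_score(toks, global_df, global_n)
        # Update histories only after scoring, so no event sees itself.
        user_df[uid].update(set(toks))
        user_n[uid] += 1
        global_df.update(set(toks))
        global_n += 1
```

In a full pipeline, the two scores would become feature columns next to the count-based features before training the classifier. Scoring each event against strictly earlier events keeps the features consistent with a chronological evaluation.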
## Dataset
The AWS Fraud Detector Account Takeover Insights sample dataset, hosted
publicly at `aws-samples/aws-fraud-detector-samples`. 735,683 login
events on 100,000 distinct user accounts over approximately three months
in 2022, with 529 expert-labeled positives/negatives (161 confirmed
account takeovers, 368 legitimate). License: Apache 2.0 / MIT-0 per
the source repository.
## Citation

    @misc{gopalakrishnan2026ato,
      author       = {Gopalakrishnan, Venkatakrishnan},
      title        = {Per-Account NLP Features for Account-Takeover Detection: A Reproducible Study on Real-World Login Data},
      year         = {2026},
      howpublished = {arXiv preprint [arXiv ID to be added after submission]}
    }

## License
Code: MIT (see LICENSE)
The dataset used in this work is distributed under the terms of the
aws-samples/aws-fraud-detector-samples repository; please refer to that
project for dataset usage rights.
## Contact
Issues and pull requests welcome.