HaniBacha/isms-bench-artifact

GitHub: HaniBacha/isms-bench-artifact

Stars: 0 | Forks: 0

# KI-SEC Assist / ISMS-Bench This repository contains the research artifact for an ACSAC 2026 submission on false compliance in AI-assisted ISMS Incident Response pre-assessment. The narrow claim is: The artifact does not certify compliance, replace auditors, or validate deployment performance on confidential real-world audits. ## What Is Included - Deterministic synthetic Incident Response benchmark generation. - Evidence passages, benchmark cases, mutation cases, paraphrase/multilingual stress cases, manual challenge cases, public-document-derived diagnostic stress cases, and adversarial fixtures. - Retrieval and assessment baselines: BM25, TF-IDF, metadata-aware rules, provenance-balanced, provenance-conservative, provenance-conservative with source guard, and trivial constant-status baselines for bounding false-compliance trade-offs. - Modular LLM/RAG baseline runner for OpenAI-compatible endpoints, with dry-run/mock mode and parsed prediction outputs. - Evaluation scripts for retrieval, compliance classification, attacks, public-document diagnostics, metadata spoofing, bootstrap intervals, and table/figure generation. - Generated tables/figures, claim-evidence matrix, benchmark card, and reproducibility notes. ## Quick Start python -m venv .venv source .venv/bin/activate pip install -r requirements.txt python -m pytest python scripts/check_label_leakage.py python scripts/check_generator_coupling.py python scripts/run_label_audit.py python scripts/make_tables.py --version v04 If `rank_bm25` is unavailable, the BM25 retriever falls back to a deterministic local implementation. ## Reproduce Main Deterministic Results python scripts/build_corpus.py python scripts/generate_synthetic_cases.py --version v03 --split development_template --seed 42 python scripts/generate_synthetic_cases.py --version v03 --split heldout_template --seed 42 python scripts/generate_synthetic_cases.py --version v03 --split stress_test --seed 42 python scripts/generate_mutation_cases.py --seed 42 python scripts/generate_paraphrase_stress_v04.py --seed 42 python scripts/run_retrieval_eval.py --method bm25 --k 5 --version v04 python scripts/run_compliance_eval.py --method bm25 --k 5 --version v04 python scripts/run_attack_eval.py --method bm25 --k 5 --version v04 python scripts/make_tables.py --version v04 ## LLM/RAG Baselines Real LLM/RAG runs require an OpenAI-compatible endpoint and local environment variables. No API keys are included in this repository. JGU_API_BASE="https://your-openai-compatible-endpoint.example/v1" JGU_MODEL="gpt-oss-120b" python scripts/run_llm_baselines_v13.py --seed 42 --max-cases 150 --real-api --output-stem llm_medium_150_v14 python scripts/run_llm_baselines_v13.py --seed 42 --max-cases 5 --dry-run --output-stem llm_mock_smoke The paper reports one real model family (`gpt-oss-120b`) on subset evaluations. These results are model-specific diagnostics, not a broad model comparison. ## Repository Layout src/kisec/ Package code for data models, generation, retrieval, assessment, attacks, metrics, and LLM/RAG utilities scripts/ Reproducibility entry points tests/ Fast pytest suite and leakage/coupling checks data/benchmark/ Synthetic, manual, LLM subset, and public-document-derived benchmark files data/attacks/ Original and advanced static adversarial cases data/external_public/ Public source inventory and short paraphrase/excerpt evidence corpus data/synthetic_cases/ Synthetic evidence passages experiments/results/ Result CSV/JSON/MD files used by submitted-paper tables and figures artifact_outputs/ Generated tables and figures for traceability ## Important Scope Notes - Most benchmark labels are synthetic or project-authored. - The public-document-derived split uses real public Incident Response documents, but its labels remain project-initial and have not been independently reviewed; it is diagnostic stress evidence, not independent validation. - The benchmark currently covers Incident Response, not the full ISMS control space. - LLM/RAG evidence uses one available model family and subset runs. - Human/expert validation is documented as a protocol unless completed annotations are present. See `ARTIFACT.md`, `REPRODUCIBILITY.md`, `BENCHMARK_CARD.md`, `CLAIMS_AND_EVIDENCE.md`, and `SCOPE_AND_LIMITATIONS.md` for reviewer-facing details. Useful Make targets: make artifact-check make eval-small make sensitivity