Syleena10/AI-Powered-Phishing-Email-Detection-for-Digital-Forensics

GitHub: Syleena10/AI-Powered-Phishing-Email-Detection-for-Digital-Forensics

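A phishing email detection and forensic analysis tool based on TF-IDF and logistic regression. It helps classify suspicious emails quickly during incident response and extracts the words that most influence each decision.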

Stars: 0 | Forks: 0

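# AI-Powered Phishing Email Detection for Digital Forensics

An AI-driven phishing email detection system designed for digital forensics investigations. The project uses TF-IDF and machine learning to analyze email text as digital evidence, helping to identify suspicious communications and to support evidence triage during incident response.

## Overview

This project uses machine learning (AI) to detect phishing emails as part of a digital forensics investigation workflow. Emails are treated as digital evidence, and the model helps identify suspicious communications for further analysis.

## Objectives

- Classify emails as safe (0) or phishing (1)
- Extract meaningful text features with TF-IDF
- Train a machine learning model to detect phishing patterns
- Support forensic evidence triage

## Technologies Used

- Python
- Pandas
- NumPy
- Scikit-learn
- Matplotlib

## Dataset

Download: https://www.kaggle.com/datasets/subhajournal/phishingemails

A CSV file containing:

- Email Text
- Email Type (Safe Email / Phishing Email)

# Step-by-Step Reproduction Guide

## Step 1: Install Dependencies

```bash
pip install pandas numpy scikit-learn matplotlib
```

## Step 2: Import Libraries

```python
import pandas as pd
import numpy as np
import re
import string

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
```

## Step 3: Load the Dataset

```python
df = pd.read_csv("Phishing_Email.csv")
print(df.head())
```

## Step 4: Clean and Prepare the Data

```python
# Keep only the label and text columns; copy to avoid modifying a view
df = df[['Email Type', 'Email Text']].copy()
df.columns = ['label', 'text']

# Encode labels: safe = 0, phishing = 1
df['label'] = df['label'].map({
    'Safe Email': 0,
    'Phishing Email': 1
})

# Drop rows with missing text or an unmapped label
df = df.dropna()
```

## Step 5: Clean the Text

```python
def clean_text(text):
    text = str(text).lower()                  # normalize case
    text = re.sub(r"http\S+", "", text)       # remove URLs
    text = re.sub(r"\d+", "", text)           # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return text

df['text'] = df['text'].apply(clean_text)
```

## Step 6: Train/Test Split

```python
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```

## Step 7: TF-IDF Vectorization

```python
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=5000,
    token_pattern=r'\b[a-zA-Z]{3,}\b'  # alphabetic tokens of 3+ letters
)

# Fit on the training set only, then reuse the vocabulary for the test set
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)
```

## Step 8: Train the Model

```python
# max_iter is raised from the default to avoid convergence warnings
model = LogisticRegression(max_iter=1000)
model.fit(X_train_vec, y_train)
```

## Step 9: Make Predictions

```python
y_pred = model.predict(X_test_vec)
```

## Step 10: Evaluate the Model

```python
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
```

## Step 11: Visualize the Confusion Matrix

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(model, X_test_vec, y_test)
plt.show()
```

## Step 12: Extract Important Words

```python
feature_names = vectorizer.get_feature_names_out()
coefficients = model.coef_[0]

# Pair each word with its logistic regression coefficient
word_importance = list(zip(feature_names, coefficients))

# Largest positive coefficients push toward phishing (1),
# most negative coefficients push toward safe (0)
top_phishing = sorted(word_importance, key=lambda x: x[1], reverse=True)[:10]
top_safe = sorted(word_importance, key=lambda x: x[1])[:10]

print("Top Phishing Words:", top_phishing)
print("Top Safe Words:", top_safe)
```

## Step 13: Visualize Word Importance

```python
words = [w[0] for w in top_phishing]
values = [w[1] for w in top_phishing]

plt.figure()
plt.barh(words, values)
plt.title("Top Phishing Words")
plt.gca().invert_yaxis()  # highest-weighted word at the top
plt.show()
```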
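## Sketch: Scoring New Emails

The steps above stop at evaluation. The following is a minimal sketch of how the same cleaning, TF-IDF settings, and logistic regression could be bundled to score previously unseen emails during triage. It is not part of the original repository: the helper names `build_pipeline` and `triage`, the use of scikit-learn's `Pipeline`, the 0.5 threshold, and the tiny training corpus are all assumptions made for illustration. In practice the pipeline would be fitted on the Kaggle dataset instead.

```python
import re
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def clean_text(text):
    """Mirror the cleaning from step 5: lowercase, drop URLs, digits, punctuation."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"\d+", "", text)
    return text.translate(str.maketrans("", "", string.punctuation))


def build_pipeline():
    """Hypothetical helper: bundle cleaning, TF-IDF, and the classifier."""
    return Pipeline([
        ("tfidf", TfidfVectorizer(
            preprocessor=clean_text,           # applied to every raw email
            stop_words="english",
            max_features=5000,
            token_pattern=r"\b[a-zA-Z]{3,}\b",
        )),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def triage(pipeline, emails, threshold=0.5):
    """Hypothetical helper: score emails and flag those at or above the threshold."""
    # Column 1 of predict_proba is the probability of label 1 (phishing),
    # because the fitted classes are sorted as [0, 1].
    probabilities = pipeline.predict_proba(emails)[:, 1]
    return [
        {"text": email, "phishing_probability": float(p), "flagged": bool(p >= threshold)}
        for email, p in zip(emails, probabilities)
    ]


# Invented toy corpus, for illustration only (1 = phishing, 0 = safe).
train_texts = [
    "verify your account now click the link to avoid suspension",
    "urgent your password expires today confirm your bank details",
    "you won a prize claim your reward by confirming your account",
    "security alert verify your login credentials immediately",
    "meeting moved to thursday see the attached agenda",
    "thanks for the report i will review it tomorrow morning",
    "lunch on friday with the project team at noon",
    "the quarterly schedule is attached for your review",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = build_pipeline()
pipeline.fit(train_texts, train_labels)

new_emails = [
    "Verify your account password immediately: http://example.test/login",
    "Agenda for the Thursday project meeting is attached.",
]
results = triage(pipeline, new_emails)

for r in results:
    print(f"{r['phishing_probability']:.2f}  flagged={r['flagged']}  {r['text']}")
```

Wrapping the vectorizer and model in one `Pipeline` means new emails pass through exactly the same preprocessing as the training data. Returning probabilities instead of hard labels lets an investigator rank evidence by suspicion and choose a threshold that fits the case.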

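Tags: AI, AMSI bypass, Apex, Kaggle dataset, Matplotlib, meg, NLP, NumPy, Object Callbacks, Python, Scikit-learn, TF-IDF, binary classification, artificial intelligence, code examples, information security, spam filtering, domain penetration, threat detection, security operations, library, incident response, scanning framework, digital forensics, digital evidence analysis, data analysis, data cleaning, text classification, no backdoor, machine learning, feature extraction, user-mode hook bypass, electronic data forensics, cybersecurity, phishing, automated code review, automation scripts, evidence triage, reverse engineering tools, logistic regression, phishing email detection, privacy protection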