Ericaaixue/fyp-tcr-pmhc-pu-learning


# Evaluating Positive-Unlabeled Learning and Sampling Strategies for Reliable Negative Construction in TCR-pMHC Binding Prediction

This repository contains a reproducible Python workflow for reliable negative construction in T-cell receptor and peptide-MHC (TCR-pMHC) binding prediction.

The project treats presumed non-interactive TCR-pMHC pairs as unlabeled candidate negatives rather than confirmed non-binders, then evaluates positive-unlabeled (PU) learning methods and sampling strategies for constructing reliable negative datasets.

## Research Goal

Experimentally verified non-binding TCR-pMHC pairs are limited. In practice, presumed non-interactive pairs are often used as negative samples, although some may be untested binders. This project asks whether PU-learning based reliable-negative screening and sampling strategies can construct more informative negative datasets for downstream TCR-pMHC binding prediction.

## Project Summary

The workflow uses:

- `51,183` curated positive TCR-pMHC binders.
- `3,342,225` presumed non-interactive pairs as an unlabeled candidate negative pool.
- `44` sequence-derived CDR3beta and peptide features.
- Seven reliable-negative screening methods: Spy, Rocchio, kNN, biased SVM, weighted logistic regression, uPU, and nnPU.
- One random baseline plus four PU-guided sampling strategies.
- `29` negative datasets, each with exactly `50,000` selected negative samples.
- Stratified 5-fold downstream validation with Logistic Regression, Complement Naive Bayes, and Random Forest.

The main finding is that top-k reliable-negative sampling often gives the highest classifier separability, but may select overly easy and distributionally narrow negatives. Peptide-balanced and HLA-peptide-balanced sampling generally reduce classifier scores but better preserve biological coverage. Reliable-negative quality should therefore be evaluated using both downstream performance and distributional representativeness.
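The trade-off between top-k and peptide-balanced sampling can be illustrated with a small sketch. This is not the code in `rn_screener/` or `scripts/04_negative_sampling.py`; the function names, the `rn_score` and `peptide` keys, and the toy sequences are all hypothetical, and the round-robin rule is only one plausible way to balance across peptides.

```python
from collections import defaultdict


def top_k_sample(candidates, k):
    """Keep the k candidates with the highest (hypothetical) reliable-negative score."""
    ranked = sorted(candidates, key=lambda c: c["rn_score"], reverse=True)
    return ranked[:k]


def peptide_balanced_sample(candidates, k):
    """Round-robin over peptides, taking each peptide's best remaining candidate in turn."""
    by_peptide = defaultdict(list)
    for c in candidates:
        by_peptide[c["peptide"]].append(c)
    for group in by_peptide.values():
        group.sort(key=lambda c: c["rn_score"], reverse=True)

    selected = []
    depth = 0  # rank within each peptide group visited in the current pass
    while len(selected) < k:
        added = False
        for peptide in sorted(by_peptide):
            group = by_peptide[peptide]
            if depth < len(group) and len(selected) < k:
                selected.append(group[depth])
                added = True
        if not added:  # every group is exhausted
            break
        depth += 1
    return selected


# Toy candidate pool: illustrative values only, not project data.
candidates = [
    {"cdr3b": "CASSA", "peptide": "PEP_A", "rn_score": 0.99},
    {"cdr3b": "CASSB", "peptide": "PEP_A", "rn_score": 0.98},
    {"cdr3b": "CASSC", "peptide": "PEP_A", "rn_score": 0.97},
    {"cdr3b": "CASSD", "peptide": "PEP_B", "rn_score": 0.80},
    {"cdr3b": "CASSE", "peptide": "PEP_C", "rn_score": 0.75},
]

top = top_k_sample(candidates, 3)
balanced = peptide_balanced_sample(candidates, 3)
top_peptides = {c["peptide"] for c in top}
balanced_peptides = {c["peptide"] for c in balanced}
print(len(top_peptides), len(balanced_peptides))  # 1 3
```

In this toy pool, top-k selection draws all three negatives from a single peptide, while the balanced variant accepts lower scores in exchange for covering three peptides. That mirrors the separability versus coverage tension described in the main finding.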
## Repository Layout

    fyp-tcr-pmhc-pu-learning/
    |-- README.md
    |-- environment.yml
    |-- .gitignore
    |
    |-- data/
    |   `-- example_data/
    |       |-- positive_sample.csv
    |       |-- unlabeled_candidate_sample.csv
    |       |-- reliable_negative_sample.csv
    |       `-- README.md
    |
    |-- rn_screener/
    |   `-- core implementation package
    |
    |-- scripts/
    |   |-- 01_data_preprocessing.py
    |   |-- 02_feature_extraction.py
    |   |-- 03_rn_screening.py
    |   |-- 04_negative_sampling.py
    |   |-- 05_downstream_validation.py
    |   |-- 06_stage10_analysis.py
    |   |-- 07_plot_figures.py
    |   `-- legacy_helpers/
    |       `-- auxiliary plotting and continuation scripts from the project workflow
    |
    |-- tests/
    |   `-- smoke tests for core workflows
    |
    `-- results/
        |-- figures/
        |   |-- data_distribution.png
        |   |-- stage10_loss_gap_boxplot.png
        |   `-- stage10_mcc_gap_heatmap.png
        |
        `-- tables/
            |-- stage10_validation_loss.csv
            |-- stage10_validation_mcc.csv
            |-- stage10_test_mcc.csv
            |-- stage10_loss_gap.csv
            `-- stage10_mcc_gap.csv

## Example Data

Small preview CSV files are included under `data/example_data/` so readers can inspect the expected input format. These files contain only 20 data rows each and are not intended for final analysis.

The full project data are excluded from GitHub because of file size. The full-data workflow expects:

    data/positive_stats_output1/positive_cleaned_mouse_MHC_removed.csv
    data/omics_neg_with_HLA_peptide.csv
    rn_features/
    rn_method_outputs_50k/
    rn_sampling_strategies_50k_29/
    staged_validation_outputs/
    dataset_level_validation_outputs/

Full data package:

https://drive.google.com/file/d/1Heify61gDa-YvUQNqAEI4KFRv8rh8Zkx/view?usp=drive_link

## Environment

Create the conda environment:

    conda env create -f environment.yml
    conda activate bio319-rn-analysis

The project was developed with Python `3.8.10`.

Optional dependency notes:

- PyTorch is required only for rerunning the uPU and nnPU screeners.
- XGBoost is included for optional reruns, but XGBoost results are not part of the main final analysis.
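Because PyTorch and XGBoost are optional, it can help to check which of them are installed before choosing what to rerun. The following is a minimal sketch using only the standard library; the helper name `optional_dependency_status` is hypothetical and not part of this repository, and `torch` and `xgboost` are assumed to be the import names of the two packages.

```python
import importlib.util


def optional_dependency_status(modules=("torch", "xgboost")):
    """Report whether each optional module can be found, without importing it."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = optional_dependency_status()
for name, available in status.items():
    print(f"{name}: {'available' if available else 'missing'}")
```

If `torch` is reported as missing, the uPU and nnPU screeners cannot be rerun in that environment, while the other screening methods are unaffected according to the notes above.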
## Main Workflow Scripts

The numbered scripts in `scripts/` represent the main project workflow.

Preview and validate example data:

    python scripts\01_data_preprocessing.py

Build features:

    python scripts\02_feature_extraction.py `
      --positive-input data\positive_stats_output1\positive_cleaned_mouse_MHC_removed.csv `
      --unlabeled-input data\omics_neg_with_HLA_peptide.csv `
      --output-dir rn_features

Run reliable-negative screening:

    python scripts\03_rn_screening.py `
      --x-pos rn_features\X_pos.npy `
      --x-unlabeled rn_features\X_unlabeled.npy `
      --metadata rn_features\U_metadata.csv `
      --method all `
      --ratio 0.05 `
      --output-dir rn_method_outputs_50k

Construct the 29 negative datasets:

    python scripts\04_negative_sampling.py `
      --rn-output-dir rn_method_outputs_50k `
      --metadata rn_features\U_metadata.csv `
      --output-dir rn_sampling_strategies_50k_29 `
      --rn-count 50000

Run downstream validation:

    python scripts\05_downstream_validation.py `
      --staged-run-dir staged_validation_outputs\full_stratified_stage10_20260507_233700 `
      --negative-dir rn_sampling_strategies_50k_29 `
      --classifiers logistic complement_nb random_forest `
      --n-splits 5 `
      --save-plots

Generate stage-10 analysis tables and heatmaps:

    python scripts\06_stage10_analysis.py

Generate manuscript-style data distribution figures:

    python scripts\07_plot_figures.py

## Results Included

This upload includes selected lightweight result artifacts for inspection:

- `results/figures/data_distribution.png`
- `results/figures/stage10_loss_gap_boxplot.png`
- `results/figures/stage10_mcc_gap_heatmap.png`
- `results/tables/stage10_validation_loss.csv`
- `results/tables/stage10_validation_mcc.csv`
- `results/tables/stage10_test_mcc.csv`
- `results/tables/stage10_loss_gap.csv`
- `results/tables/stage10_mcc_gap.csv`

These are representative thesis-facing outputs. Full intermediate data, full method outputs, and full validation output directories are excluded from GitHub.
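Several of the tables listed above report the Matthews correlation coefficient (MCC). As a reminder of what that score measures, here is a minimal sketch that computes MCC from confusion-matrix counts. The function name and the counts are illustrative assumptions, not project results, and the project's own scripts may compute the metric differently (for example through a library routine).

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention: return 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


perfect = mcc(tp=50, tn=50, fp=0, fn=0)     # every pair classified correctly
chance = mcc(tp=25, tn=25, fp=25, fn=25)    # predictions unrelated to labels
print(perfect, chance)  # 1.0 0.0
```

MCC ranges from -1 to 1, so a value close to 1 means positives and selected negatives are almost perfectly separable. A high value on its own does not show that the negatives are biologically representative.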
## Testing

Smoke tests are included under `tests/`:

    python -B tests\smoke_test_rn_screeners.py
    python -B tests\smoke_test_rn_sampling.py
    python -B tests\smoke_test_staged_validation.py
    python -B tests\smoke_test_staged_continue.py
    python -B tests\smoke_test_fixed_stage_dataset_validation.py
    python -B tests\smoke_test_stage_history_summary.py

## Limitations

- Reliable negatives are inferred from unlabeled data, not experimentally confirmed non-binders.
- Top-k sampling can produce very high MCC but may overrepresent easy negatives.
- Balanced sampling better preserves peptide or HLA-peptide coverage but may reduce classifier separability.
- Large raw and generated data files must be downloaded separately.

## AI Use Declaration