fastino-ai/GLiNER2

GLiNER2 is a lightweight, unified information-extraction framework that combines named entity recognition, text classification, structured data extraction, and relation extraction in a single model, with efficient CPU inference and fully local deployment.

# GLiNER2: Unified Schema-Based Information Extraction

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![PyPI version](https://badge.fury.io/py/gliner2.svg)](https://badge.fury.io/py/gliner2) [![Downloads](https://pepy.tech/badge/gliner2)](https://pepy.tech/project/gliner2)

GLiNER2 unifies **Named Entity Recognition**, **Text Classification**, **Structured Data Extraction**, and **Relation Extraction** in a single 205M-parameter model. It delivers efficient CPU inference with no complex pipelines and no external API dependencies.

## ✨ Why GLiNER2?

- **🎯 One model, four tasks**: entity extraction, classification, structured data, and relation extraction in a single forward pass
- **💻 CPU-first**: fast inference on standard hardware, no GPU required
- **🛡️ Privacy-preserving**: 100% local processing, zero external dependencies

## 🚀 Installation & Quick Start

```bash
pip install gliner2
```

```python
from gliner2 import GLiNER2

# Load model once, use everywhere
extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

# Extract entities in one line
text = "Apple CEO Tim Cook announced iPhone 15 in Cupertino yesterday."
result = extractor.extract_entities(text, ["company", "person", "product", "location"])
print(result)
# {'entities': {'company': ['Apple'], 'person': ['Tim Cook'], 'product': ['iPhone 15'], 'location': ['Cupertino']}}
```

### 🌐 API Access: GLiNER XL 1B

Our largest and most capable model, **GLiNER XL 1B**, is available exclusively through an API. No GPU and no model download: instant access to state-of-the-art extraction. Get your API key at [gliner.pioneer.ai](https://gliner.pioneer.ai).

```python
from gliner2 import GLiNER2

# Access GLiNER XL 1B via API
extractor = GLiNER2.from_api()  # Uses PIONEER_API_KEY env variable

result = extractor.extract_entities(
    "OpenAI CEO Sam Altman announced GPT-5 at their San Francisco headquarters.",
    ["company", "person", "product", "location"]
)
# {'entities': {'company': ['OpenAI'], 'person': ['Sam Altman'], 'product': ['GPT-5'], 'location': ['San Francisco']}}
```

## 📦 Available Models

| Model | Parameters | Description | Use case |
|-------|------------|-------------|----------|
| `fastino/gliner2-base-v1` | 205M | Base | Extraction / classification |
| `fastino/gliner2-large-v1` | 340M | Large | Extraction / classification |

Models are available on [Hugging Face](https://huggingface.co/collections/fastino/gliner2-family).

## 📚 Documentation & Tutorials

Comprehensive guides covering every GLiNER2 feature:

### Core Features
- **[Text Classification](tutorial/1-classification.md)** - Single-label and multi-label classification with confidence scores
- **[Entity Extraction](tutorial/2-ner.md)** - Named entity recognition with descriptions and text spans
- **[Structured Data Extraction](tutorial/3-json_extraction.md)** - Parse complex JSON structures from text
- **[Combined Schemas](tutorial/4-combined.md)** - Multi-task extraction in a single pass
- **[Regex Validators](tutorial/5-validator.md)** - Filter and validate extracted text spans
- **[Relation Extraction](tutorial/6-relation_extraction.md)** - Extract relationships between entities
- **[API Access](tutorial/7-api.md)** - Use GLiNER2 through the cloud API

### Training & Customization
- **[Training Data Format](tutorial/8-train_data.md)** - Complete guide to preparing training data
- **[Model Training](tutorial/9-training.md)** - Train custom models for your domain
- **[LoRA Adapters](tutorial/10-lora_adapters.md)** - Parameter-efficient fine-tuning
- **[Adapter Switching](tutorial/11-adapter_switching.md)** - Switch between adapters for different domains

## 🎯 Core Capabilities

### 1. Entity Extraction

Extract named entities, with optional descriptions for higher precision:

```python
# Basic entity extraction
entities = extractor.extract_entities(
    "Patient received 400mg ibuprofen for severe headache at 2 PM.",
    ["medication", "dosage", "symptom", "time"]
)
# Output: {'entities': {'medication': ['ibuprofen'], 'dosage': ['400mg'], 'symptom': ['severe headache'], 'time': ['2 PM']}}

# Enhanced with descriptions for medical accuracy
entities = extractor.extract_entities(
    "Patient received 400mg ibuprofen for severe headache at 2 PM.",
    {
        "medication": "Names of drugs, medications, or pharmaceutical substances",
        "dosage": "Specific amounts like '400mg', '2 tablets', or '5ml'",
        "symptom": "Medical symptoms, conditions, or patient complaints",
        "time": "Time references like '2 PM', 'morning', or 'after lunch'"
    }
)
# Same output but with higher accuracy due to context descriptions

# With confidence scores
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product", "location"],
    include_confidence=True
)
# Output: {
#   'entities': {
#     'company': [{'text': 'Apple Inc.', 'confidence': 0.95}],
#     'person': [{'text': 'Tim Cook', 'confidence': 0.92}],
#     'product': [{'text': 'iPhone 15', 'confidence': 0.88}],
#     'location': [{'text': 'Cupertino', 'confidence': 0.90}]
#   }
# }

# With character positions (spans)
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product"],
    include_spans=True
)
# Output: {
#   'entities': {
#     'company': [{'text': 'Apple Inc.', 'start': 0, 'end': 9}],
#     'person': [{'text': 'Tim Cook', 'start': 15, 'end': 23}],
#     'product': [{'text': 'iPhone 15', 'start': 35, 'end': 44}]
#   }
# }

# With both confidence and spans
entities = extractor.extract_entities(
    "Apple Inc. CEO Tim Cook announced iPhone 15 in Cupertino.",
    ["company", "person", "product"],
    include_confidence=True,
    include_spans=True
)
# Output: {
#   'entities': {
#     'company': [{'text': 'Apple Inc.', 'confidence': 0.95, 'start': 0, 'end': 9}],
#     'person': [{'text': 'Tim Cook', 'confidence': 0.92, 'start': 15, 'end': 23}],
#     'product': [{'text': 'iPhone 15', 'confidence': 0.88, 'start': 35, 'end': 44}]
#   }
# }
```

### 2. Text Classification

Single-label or multi-label classification with configurable confidence:

```python
# Sentiment analysis
result = extractor.classify_text(
    "This laptop has amazing performance but terrible battery life!",
    {"sentiment": ["positive", "negative", "neutral"]}
)
# Output: {'sentiment': 'negative'}

# Multi-aspect classification
result = extractor.classify_text(
    "Great camera quality, decent performance, but poor battery life.",
    {
        "aspects": {
            "labels": ["camera", "performance", "battery", "display", "price"],
            "multi_label": True,
            "cls_threshold": 0.4
        }
    }
)
# Output: {'aspects': ['camera', 'performance', 'battery']}

# With confidence scores
result = extractor.classify_text(
    "This laptop has amazing performance but terrible battery life!",
    {"sentiment": ["positive", "negative", "neutral"]},
    include_confidence=True
)
# Output: {'sentiment': {'label': 'negative', 'confidence': 0.82}}

# Multi-label with confidence
schema = extractor.create_schema().classification(
    "topics",
    ["technology", "business", "health", "politics", "sports"],
    multi_label=True,
    cls_threshold=0.3
)
text = "Apple announced new health monitoring features in their latest smartwatch, boosting their stock price."
results = extractor.extract(text, schema, include_confidence=True)
# Output: {
#   'topics': [
#     {'label': 'technology', 'confidence': 0.92},
#     {'label': 'business', 'confidence': 0.78},
#     {'label': 'health', 'confidence': 0.65}
#   ]
# }
```

### 3. Structured Data Extraction

Parse complex structured information with field-level control:

```python
# Product information extraction
text = "iPhone 15 Pro Max with 256GB storage, A17 Pro chip, priced at $1199. Available in titanium and black colors."
result = extractor.extract_json(
    text,
    {
        "product": [
            "name::str::Full product name and model",
            "storage::str::Storage capacity like 256GB or 1TB",
            "processor::str::Chip or processor information",
            "price::str::Product price with currency",
            "colors::list::Available color options"
        ]
    }
)
# Output: {
#   'product': [{
#     'name': 'iPhone 15 Pro Max',
#     'storage': '256GB',
#     'processor': 'A17 Pro chip',
#     'price': '$1199',
#     'colors': ['titanium', 'black']
#   }]
# }

# Multiple structured entities
text = "Apple Inc. headquarters in Cupertino launched iPhone 15 for $999 and MacBook Air for $1299."
result = extractor.extract_json(
    text,
    {
        "company": [
            "name::str::Company name",
            "location::str::Company headquarters or office location"
        ],
        "products": [
            "name::str::Product name and model",
            "price::str::Product retail price"
        ]
    }
)
# Output: {
#   'company': [{'name': 'Apple Inc.', 'location': 'Cupertino'}],
#   'products': [
#     {'name': 'iPhone 15', 'price': '$999'},
#     {'name': 'MacBook Air', 'price': '$1299'}
#   ]
# }

# With confidence scores
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage.",
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True
)
# Output: {
#   'product': [{
#     'name': {'text': 'MacBook Pro', 'confidence': 0.95},
#     'price': [{'text': '$1999', 'confidence': 0.92}],
#     'features': [
#       {'text': 'M3 chip', 'confidence': 0.88},
#       {'text': '16GB RAM', 'confidence': 0.90},
#       {'text': '512GB storage', 'confidence': 0.87}
#     ]
#   }]
# }

# With character positions (spans)
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip.",
    {
        "product": [
            "name::str",
            "price"
        ]
    },
    include_spans=True
)
# Output: {
#   'product': [{
#     'name': {'text': 'MacBook Pro', 'start': 4, 'end': 15},
#     'price': [{'text': '$1999', 'start': 22, 'end': 27}]
#   }]
# }

# With both confidence and spans
result = extractor.extract_json(
    "The MacBook Pro costs $1999 and features M3 chip, 16GB RAM, and 512GB storage.",
    {
        "product": [
            "name::str",
            "price",
            "features"
        ]
    },
    include_confidence=True,
    include_spans=True
)
# Output: {
#   'product': [{
#     'name': {'text': 'MacBook Pro', 'confidence': 0.95, 'start': 4, 'end': 15},
#     'price': [{'text': '$1999', 'confidence': 0.92, 'start': 22, 'end': 27}],
#     'features': [
#       {'text': 'M3 chip', 'confidence': 0.88, 'start': 32, 'end': 39},
#       {'text': '16GB RAM', 'confidence': 0.90, 'start': 41, 'end': 49},
#       {'text': '512GB storage', 'confidence': 0.87, 'start': 55, 'end': 68}
#     ]
#   }]
# }
```

### 4. Relation Extraction

Extract relationships between entities as directed tuples:

```python
# Basic relation extraction
text = "John works for Apple Inc. and lives in San Francisco. Apple Inc. is located in Cupertino."
result = extractor.extract_relations(
    text,
    ["works_for", "lives_in", "located_in"]
)
# Output: {
#   'relation_extraction': {
#     'works_for': [('John', 'Apple Inc.')],
#     'lives_in': [('John', 'San Francisco')],
#     'located_in': [('Apple Inc.', 'Cupertino')]
#   }
# }

# With descriptions for better accuracy
schema = extractor.create_schema().relations({
    "works_for": "Employment relationship where person works at organization",
    "founded": "Founding relationship where person created organization",
    "acquired": "Acquisition relationship where company bought another company",
    "located_in": "Geographic relationship where entity is in a location"
})
text = "Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California."
results = extractor.extract(text, schema)
# Output: {
#   'relation_extraction': {
#     'founded': [('Elon Musk', 'SpaceX')],
#     'located_in': [('SpaceX', 'Hawthorne, California')]
#   }
# }

# With confidence scores
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_confidence=True
)
# Output: {
#   'relation_extraction': {
#     'works_for': [{
#       'head': {'text': 'John', 'confidence': 0.95},
#       'tail': {'text': 'Apple Inc.', 'confidence': 0.92}
#     }],
#     'lives_in': [{
#       'head': {'text': 'John', 'confidence': 0.94},
#       'tail': {'text': 'San Francisco', 'confidence': 0.91}
#     }]
#   }
# }

# With character positions (spans)
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_spans=True
)
# Output: {
#   'relation_extraction': {
#     'works_for': [{
#       'head': {'text': 'John', 'start': 0, 'end': 4},
#       'tail': {'text': 'Apple Inc.', 'start': 15, 'end': 25}
#     }],
#     'lives_in': [{
#       'head': {'text': 'John', 'start': 0, 'end': 4},
#       'tail': {'text': 'San Francisco', 'start': 33, 'end': 46}
#     }]
#   }
# }

# With both confidence and spans
results = extractor.extract_relations(
    "John works for Apple Inc. and lives in San Francisco.",
    ["works_for", "lives_in"],
    include_confidence=True,
    include_spans=True
)
# Output: {
#   'relation_extraction': {
#     'works_for': [{
#       'head': {'text': 'John', 'confidence': 0.95, 'start': 0, 'end': 4},
#       'tail': {'text': 'Apple Inc.', 'confidence': 0.92, 'start': 15, 'end': 25}
#     }],
#     'lives_in': [{
#       'head': {'text': 'John', 'confidence': 0.94, 'start': 0, 'end': 4},
#       'tail': {'text': 'San Francisco', 'confidence': 0.91, 'start': 33, 'end': 46}
#     }]
#   }
# }
```

### 5. Multi-Task Schema Composition

Combine all extraction types when you need comprehensive analysis:

```python
# Use create_schema() for multi-task scenarios
schema = (extractor.create_schema()
    # Extract key entities
    .entities({
        "person": "Names of people, executives, or individuals",
        "company": "Organization, corporation, or business names",
        "product": "Products, services, or offerings mentioned"
    })
    # Classify the content
    .classification("sentiment", ["positive", "negative", "neutral"])
    .classification("category", ["technology", "business", "finance", "healthcare"])
    # Extract relationships
    .relations(["works_for", "founded", "located_in"])
    # Extract structured product details
    .structure("product_info")
    .field("name", dtype="str")
    .field("price", dtype="str")
    .field("features", dtype="list")
    .field("availability", dtype="str", choices=["in_stock", "pre_order", "sold_out"])
)

# Comprehensive extraction in one pass
text = "Apple CEO Tim Cook unveiled the revolutionary iPhone 15 Pro for $999. The device features an A17 Pro chip and titanium design. Tim Cook works for Apple, which is located in Cupertino."
results = extractor.extract(text, schema)
# Output: {
#   'entities': {
#     'person': ['Tim Cook'],
#     'company': ['Apple'],
#     'product': ['iPhone 15 Pro']
#   },
#   'sentiment': 'positive',
#   'category': 'technology',
#   'relation_extraction': {
#     'works_for': [('Tim Cook', 'Apple')],
#     'located_in': [('Apple', 'Cupertino')]
#   },
#   'product_info': [{
#     'name': 'iPhone 15 Pro',
#     'price': '$999',
#     'features': ['A17 Pro chip', 'titanium design'],
#     'availability': 'in_stock'
#   }]
# }
```

## 🏭 Use Case Examples

### Financial Document Processing

```python
financial_text = """
Transaction Report: Goldman Sachs processed a $2.5M equity trade for Tesla Inc.
on March 15, 2024. Commission: $1,250. Status: Completed.
"""

# Extract structured financial data
result = extractor.extract_json(
    financial_text,
    {
        "transaction": [
            "broker::str::Financial institution or brokerage firm",
            "amount::str::Transaction amount with currency",
            "security::str::Stock, bond, or financial instrument",
            "date::str::Transaction date",
            "commission::str::Fees or commission charged",
            "status::str::Transaction status",
            "type::[equity|bond|option|future|forex]::str::Type of financial instrument"
        ]
    }
)
# Output: {
#   'transaction': [{
#     'broker': 'Goldman Sachs',
#     'amount': '$2.5M',
#     'security': 'Tesla Inc.',
#     'date': 'March 15, 2024',
#     'commission': '$1,250',
#     'status': 'Completed',
#     'type': 'equity'
#   }]
# }
```

### Medical Information Extraction

```python
medical_record = """
Patient: Sarah Johnson, 34, presented with acute chest pain and shortness of breath.
Prescribed: Lisinopril 10mg daily, Metoprolol 25mg twice daily.
Follow-up scheduled for next Tuesday.
"""

result = extractor.extract_json(
    medical_record,
    {
        "patient_info": [
            "name::str::Patient full name",
            "age::str::Patient age",
            "symptoms::list::Reported symptoms or complaints"
        ],
        "prescriptions": [
            "medication::str::Drug or medication name",
            "dosage::str::Dosage amount and frequency",
            "frequency::str::How often to take the medication"
        ]
    }
)
# Output: {
#   'patient_info': [{
#     'name': 'Sarah Johnson',
#     'age': '34',
#     'symptoms': ['acute chest pain', 'shortness of breath']
#   }],
#   'prescriptions': [
#     {'medication': 'Lisinopril', 'dosage': '10mg', 'frequency': 'daily'},
#     {'medication': 'Metoprolol', 'dosage': '25mg', 'frequency': 'twice daily'}
#   ]
# }
```

### Legal Contract Analysis

```python
contract_text = """
Service Agreement between TechCorp LLC and DataSystems Inc., effective January 1, 2024.
Monthly fee: $15,000. Contract term: 24 months with automatic renewal.
Termination clause: 30-day written notice required.
"""

# Multi-task extraction for comprehensive analysis
schema = (extractor.create_schema()
    .entities(["company", "date", "duration", "fee"])
    .classification("contract_type", ["service", "employment", "nda", "partnership"])
    .relations(["signed_by", "involves", "dated"])
    .structure("contract_terms")
    .field("parties", dtype="list")
    .field("effective_date", dtype="str")
    .field("monthly_fee", dtype="str")
    .field("term_length", dtype="str")
    .field("renewal", dtype="str", choices=["automatic", "manual", "none"])
    .field("termination_notice", dtype="str")
)
results = extractor.extract(contract_text, schema)
# Output: {
#   'entities': {
#     'company': ['TechCorp LLC', 'DataSystems Inc.'],
#     'date': ['January 1, 2024'],
#     'duration': ['24 months'],
#     'fee': ['$15,000']
#   },
#   'contract_type': 'service',
#   'relation_extraction': {
#     'involves': [('TechCorp LLC', 'DataSystems Inc.')],
#     'dated': [('Service Agreement', 'January 1, 2024')]
#   },
#   'contract_terms': [{
#     'parties': ['TechCorp LLC', 'DataSystems Inc.'],
#     'effective_date': 'January 1, 2024',
#     'monthly_fee': '$15,000',
#     'term_length': '24 months',
#     'renewal': 'automatic',
#     'termination_notice': '30-day written notice'
#   }]
# }
```

### Knowledge Graph Construction

```python
# Extract entities and relations for knowledge graph building
text = """
Elon Musk founded SpaceX in 2002. SpaceX is located in Hawthorne, California.
SpaceX acquired Swarm Technologies in 2021. Many engineers work for SpaceX.
"""

schema = (extractor.create_schema()
    .entities(["person", "organization", "location", "date"])
    .relations({
        "founded": "Founding relationship where person created organization",
        "acquired": "Acquisition relationship where company bought another company",
        "located_in": "Geographic relationship where entity is in a location",
        "works_for": "Employment relationship where person works at organization"
    })
)
results = extractor.extract(text, schema)
# Output: {
#   'entities': {
#     'person': ['Elon Musk', 'engineers'],
#     'organization': ['SpaceX', 'Swarm Technologies'],
#     'location': ['Hawthorne, California'],
#     'date': ['2002', '2021']
#   },
#   'relation_extraction': {
#     'founded': [('Elon Musk', 'SpaceX')],
#     'acquired': [('SpaceX', 'Swarm Technologies')],
#     'located_in': [('SpaceX', 'Hawthorne, California')],
#     'works_for': [('engineers', 'SpaceX')]
#   }
# }
```

## ⚙️ Advanced Configuration

### Custom Confidence Thresholds

```python
# High-precision extraction for critical fields
result = extractor.extract_json(
    text,
    {
        "financial_data": [
            "account_number::str::Bank account number",  # default threshold
            "amount::str::Transaction amount",           # default threshold
            "routing_number::str::Bank routing number"   # default threshold
        ]
    },
    threshold=0.9  # High confidence for all fields
)

# Per-field thresholds using schema builder (for multi-task scenarios)
schema = (extractor.create_schema()
    .structure("sensitive_data")
    .field("ssn", dtype="str", threshold=0.95)   # Highest precision
    .field("email", dtype="str", threshold=0.8)  # Medium precision
    .field("phone", dtype="str", threshold=0.7)  # Lower precision
)
```

### Field Types & Constraints

```python
# Structured extraction with choices and types
result = extractor.extract_json(
    "Premium subscription at $99/month with mobile and web access.",
    {
        "subscription": [
            "tier::[basic|premium|enterprise]::str::Subscription level",
            "price::str::Monthly or annual cost",
            "billing::[monthly|annual]::str::Billing frequency",
            "features::[mobile|web|api|analytics]::list::Included features"
        ]
    }
)
# Output: {
#   'subscription': [{
#     'tier': 'premium',
#     'price': '$99/month',
#     'billing': 'monthly',
#     'features': ['mobile', 'web']
#   }]
# }
```

## 🔍 Regex Validators

Filter extracted text spans so they match expected patterns, improving extraction quality and reducing false positives.

```python
from gliner2 import GLiNER2, RegexValidator

extractor = GLiNER2.from_pretrained("fastino/gliner2-base-v1")

# Email validation
email_validator = RegexValidator(r"^[\w\.-]+@[\w\.-]+\.\w+$")
schema = (extractor.create_schema()
    .structure("contact")
    .field("email", dtype="str", validators=[email_validator])
)
text = "Contact: john@company.com, not-an-email, jane@domain.org"
results = extractor.extract(text, schema)
# Output: {'contact': [{'email': 'john@company.com'}]}  # Only valid emails

# Phone number validation (US format)
phone_validator = RegexValidator(r"\(\d{3}\)\s\d{3}-\d{4}", mode="partial")
schema = (extractor.create_schema()
    .structure("contact")
    .field("phone", dtype="str", validators=[phone_validator])
)
text = "Call (555) 123-4567 or 5551234567"
results = extractor.extract(text, schema)
# Output: {'contact': [{'phone': '(555) 123-4567'}]}  # Second number filtered out

# URL validation
url_validator = RegexValidator(r"^https?://", mode="partial")
schema = (extractor.create_schema()
    .structure("links")
    .field("url", dtype="list", validators=[url_validator])
)
text = "Visit https://example.com or www.site.com"
results = extractor.extract(text, schema)
# Output: {'links': [{'url': ['https://example.com']}]}  # www.site.com filtered out

# Exclude test data
import re
no_test_validator = RegexValidator(r"^(test|demo|sample)", exclude=True, flags=re.IGNORECASE)
schema = (extractor.create_schema()
    .structure("products")
    .field("name", dtype="list", validators=[no_test_validator])
)
text = "Products: iPhone, Test Phone, Samsung Galaxy"
results = extractor.extract(text, schema)
# Output: {'products': [{'name': ['iPhone', 'Samsung Galaxy']}]}  # Test Phone excluded

# Multiple validators (all must pass)
username_validators = [
    RegexValidator(r"^[a-zA-Z0-9_]+$"),                               # Alphanumeric + underscore
    RegexValidator(r"^.{3,20}$"),                                     # 3-20 characters
    RegexValidator(r"^(?!admin)", exclude=True, flags=re.IGNORECASE)  # No "admin"
]
schema = (extractor.create_schema()
    .structure("user")
    .field("username", dtype="str", validators=username_validators)
)
text = "Users: ab, john_doe, user@domain, admin, valid_user123"
results = extractor.extract(text, schema)
# Output: {'user': [{'username': 'john_doe'}]}  # Only valid usernames
```

## 📦 Batch Processing

Process multiple texts efficiently in a single call:

```python
# Batch entity extraction
texts = [
    "Google's Sundar Pichai unveiled Gemini AI in Mountain View.",
    "Microsoft CEO Satya Nadella announced Copilot at Build 2023.",
    "Amazon's Andy Jassy revealed new AWS services in Seattle."
]
results = extractor.batch_extract_entities(
    texts,
    ["company", "person", "product", "location"],
    batch_size=8
)
# Returns list of results, one per input text

# Batch relation extraction
texts = [
    "John works for Microsoft and lives in Seattle.",
    "Sarah founded TechStartup in 2020.",
    "Bob reports to Alice at Google."
]
results = extractor.batch_extract_relations(
    texts,
    ["works_for", "founded", "reports_to", "lives_in"],
    batch_size=8
)
# Returns list of relation extraction results for each text
# All requested relation types appear in each result, even if empty

# Batch with confidence and spans
results = extractor.batch_extract_entities(
    texts,
    ["company", "person"],
    include_confidence=True,
    include_spans=True,
    batch_size=8
)
```

## 🎓 Training Custom Models

Train GLiNER2 on your own data to specialize it for your domain or use case.

### Training Quick Start

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# 1. Prepare training data
examples = [
    InputExample(
        text="John works at Google in California.",
        entities={"person": ["John"], "company": ["Google"], "location": ["California"]}
    ),
    InputExample(
        text="Apple released iPhone 15.",
        entities={"company": ["Apple"], "product": ["iPhone 15"]}
    ),
    # Add more examples...
]

# 2. Configure training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./output",
    num_epochs=10,
    batch_size=8,
    encoder_lr=1e-5,
    task_lr=5e-4
)

# 3. Train
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=examples)
```

### Training Data Format (JSONL)

GLiNER2 uses a JSONL format where each line contains an `input` and an `output` field:

```
{"input": "Tim Cook is the CEO of Apple Inc., based in Cupertino, California.", "output": {"entities": {"person": ["Tim Cook"], "company": ["Apple Inc."], "location": ["Cupertino", "California"]}, "entity_descriptions": {"person": "Full name of a person", "company": "Business organization name", "location": "Geographic location or place"}}}
{"input": "OpenAI released GPT-4 in March 2023.", "output": {"entities": {"company": ["OpenAI"], "model": ["GPT-4"], "date": ["March 2023"]}}}
```

**Classification examples:**

```
{"input": "This movie is absolutely fantastic! I loved every minute of it.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["positive"]}]}}
{"input": "The service was terrible and the food was cold.", "output": {"classifications": [{"task": "sentiment", "labels": ["positive", "negative", "neutral"], "true_label": ["negative"]}]}}
```

**Structured extraction example:**

```
{"input": "iPhone 15 Pro Max with 256GB storage, priced at $1199.", "output": {"json_structures": [{"product": {"name": "iPhone 15 Pro Max", "storage": "256GB", "price": "$1199"}}]}}
```

**Relation extraction example:**

```
{"input": "John works for Apple Inc. and lives in San Francisco.", "output": {"relations": [{"works_for": {"head": "John", "tail": "Apple Inc."}}, {"lives_in": {"head": "John", "tail": "San Francisco"}}]}}
```

### Training from a JSONL File

```python
from gliner2 import GLiNER2
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Load model and train from JSONL file
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(output_dir="./output", num_epochs=10)
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data="train.jsonl")  # Path to your JSONL file
```

### LoRA Training (Parameter-Efficient Fine-Tuning)

Train lightweight adapters for domain-specific tasks:

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Prepare domain-specific data
legal_examples = [
    InputExample(
        text="Apple Inc. filed a lawsuit against Samsung Electronics.",
        entities={"company": ["Apple Inc.", "Samsung Electronics"]}
    ),
    # Add more examples...
]

# Configure LoRA training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./legal_adapter",
    num_epochs=10,
    batch_size=8,
    encoder_lr=1e-5,
    task_lr=5e-4,
    # LoRA settings
    use_lora=True,           # Enable LoRA
    lora_r=8,                # Rank (4, 8, 16, 32)
    lora_alpha=16.0,         # Scaling factor (usually 2*r)
    lora_dropout=0.0,        # Dropout for LoRA layers
    save_adapter_only=True   # Save only adapter (~5MB vs ~450MB)
)

# Train adapter
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=legal_examples)

# Use the adapter
model.load_adapter("./legal_adapter/final")
results = model.extract_entities(legal_text, ["company", "law"])
```

**LoRA benefits:**
- **Smaller artifacts**: adapters are roughly 2-10 MB versus ~450 MB for a full model
- **Faster training**: 2-3x faster than full fine-tuning
- **Easy switching**: swap adapters for different domains in milliseconds

### Complete Training Example

```python
from gliner2 import GLiNER2
from gliner2.training.data import InputExample, TrainingDataset
from gliner2.training.trainer import GLiNER2Trainer, TrainingConfig

# Prepare training data
train_examples = [
    InputExample(
        text="Tim Cook is the CEO of Apple Inc., based in Cupertino, California.",
        entities={
            "person": ["Tim Cook"],
            "company": ["Apple Inc."],
            "location": ["Cupertino", "California"]
        },
        entity_descriptions={
            "person": "Full name of a person",
            "company": "Business organization name",
            "location": "Geographic location or place"
        }
    ),
    # Add more examples...
]

# Create and validate dataset
train_dataset = TrainingDataset(train_examples)
train_dataset.validate(strict=True, raise_on_error=True)
train_dataset.print_stats()

# Split into train/validation
train_data, val_data, _ = train_dataset.split(
    train_ratio=0.8, val_ratio=0.2, test_ratio=0.0, shuffle=True, seed=42
)

# Configure training
model = GLiNER2.from_pretrained("fastino/gliner2-base-v1")
config = TrainingConfig(
    output_dir="./ner_model",
    experiment_name="ner_training",
    num_epochs=15,
    batch_size=16,
    encoder_lr=1e-5,
    task_lr=5e-4,
    warmup_ratio=0.1,
    scheduler_type="cosine",
    fp16=True,
    eval_strategy="epoch",
    save_best=True,
    early_stopping=True,
    early_stopping_patience=3
)

# Train
trainer = GLiNER2Trainer(model, config)
trainer.train(train_data=train_data, val_data=val_data)

# Load best model
model = GLiNER2.from_pretrained("./ner_model/best")
```

For more details, see the [training tutorial](tutorial/9-training.md) and the [data format guide](tutorial/8-train_data.md).

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

## 📚 Citation

If you use GLiNER2 in your research, please cite:

```bibtex
@inproceedings{zaratiana-etal-2025-gliner2,
    title = "{GL}i{NER}2: Schema-Driven Multi-Task Learning for Structured Information Extraction",
    author = "Zaratiana, Urchade and Pasternak, Gil and Boyd, Oliver and Hurn-Maloney, George and Lewis, Ash",
    editor = {Habernal, Ivan and Schulam, Peter and Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-demos.10/",
    pages = "130--140",
    ISBN = "979-8-89176-334-0",
    abstract = "Information extraction (IE) is fundamental to numerous NLP applications, yet existing solutions often require specialized models for different tasks or rely on computationally expensive large language models. We present GLiNER2, a unified framework that enhances the original GLiNER architecture to support named entity recognition, text classification, and hierarchical structured data extraction within a single efficient model. Built on a fine-tuned encoder architecture, GLiNER2 maintains CPU efficiency and compact size while introducing multi-task composition through an intuitive schema-based interface. Our experiments demonstrate competitive performance across diverse IE tasks with substantial improvements in deployment accessibility compared to LLM-based alternatives. We release GLiNER2 as an open-source library available through pip, complete with pre-trained models and comprehensive documentation."
}
```

## 🙏 Acknowledgements

Built by the [Fastino AI](https://fastino.ai) team on the original [GLiNER](https://github.com/urchade/GLiNER) architecture.
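As a recap of the `field::type::description` descriptor strings used in the structured-extraction examples, each segment is separated by `::`, with an optional bracketed `[a|b|c]` choices segment. The sketch below shows how such a descriptor could be split into its parts; `parse_field_descriptor` is a hypothetical illustration of the convention, not part of the gliner2 API.

```python
import re

def parse_field_descriptor(descriptor: str) -> dict:
    """Split a 'name::[choices]::type::description' style field descriptor
    into its parts. Hypothetical helper, not part of the gliner2 library."""
    parts = descriptor.split("::")
    field = {"name": parts[0], "choices": None, "dtype": "str", "description": None}
    for part in parts[1:]:
        m = re.fullmatch(r"\[(.+)\]", part)
        if m:
            # Bracketed segment: constrained choices like [basic|premium|enterprise]
            field["choices"] = m.group(1).split("|")
        elif part in ("str", "list"):
            # Known data types from the README examples
            field["dtype"] = part
        else:
            # Anything else is treated as the free-text description
            field["description"] = part
    return field

print(parse_field_descriptor("tier::[basic|premium|enterprise]::str::Subscription level"))
# {'name': 'tier', 'choices': ['basic', 'premium', 'enterprise'], 'dtype': 'str', 'description': 'Subscription level'}
```

A bare field name such as `"price"` falls back to the defaults (`dtype="str"`, no choices, no description), matching how the shorthand forms appear in the examples above.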
Ready to extract insights from your data?

```bash
pip install gliner2
```
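Before training on your own data, it can help to sanity-check that every line of a JSONL training file has the `{"input": ..., "output": ...}` shape described in the training-data section. `check_training_jsonl` below is an illustrative standard-library sketch, not part of the gliner2 API.

```python
import json

def check_training_jsonl(lines):
    """Return a list of (line_number, problem) pairs for records that do not
    match the {"input": str, "output": dict} shape GLiNER2 training expects."""
    problems = []
    for i, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((i, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record.get("input"), str):
            problems.append((i, "missing or non-string 'input'"))
        if not isinstance(record.get("output"), dict):
            problems.append((i, "missing or non-dict 'output'"))
    return problems

sample = [
    '{"input": "OpenAI released GPT-4 in March 2023.", "output": {"entities": {"company": ["OpenAI"]}}}',
    '{"input": "broken record"}',
]
print(check_training_jsonl(sample))
# [(2, "missing or non-dict 'output'")]
```

Running this over `train.jsonl` before calling `trainer.train(train_data="train.jsonl")` surfaces malformed lines up front rather than mid-training.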