feast-dev/feast
GitHub: feast-dev/feast
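An open-source feature store for managing and serving machine learning features, designed to keep data consistent and reproducible between model training and inference.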
Stars: 6954 | Forks: 1299
## Join us on Slack!
👋👋👋 [Come say hi on Slack!](https://communityinviter.com/apps/feastopensource/feast-the-open-source-feature-store)
[Check out our DeepWiki!](https://deepwiki.com/feast-dev/feast)
## Overview

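Feast (**Fea**ture **St**ore) is an open source feature store for machine learning. Feast is the fastest path to manage existing infrastructure and productionize analytic data for model training and online inference.
Feast allows ML platform teams to:
* **Make features consistently available for training and serving** by managing an _offline store_ (to process historical data for scale-out batch scoring or model training), a low-latency _online store_ (to power real-time prediction), and a battle-tested _feature server_ (to serve pre-computed features online).
* **Avoid data leakage** by generating point-in-time correct feature sets, so data scientists can focus on feature engineering rather than debugging error-prone dataset joining logic. This ensures that future feature values do not leak into models during training.
* **Decouple ML from data infrastructure** by providing a single data access layer that abstracts feature storage from feature retrieval, so models remain portable as you move from training to serving, from batch models to real-time models, and from one data infrastructure system to another.
Please see our [documentation](https://docs.feast.dev/) for more information about the project.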
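
The point-in-time guarantee described above can be illustrated without Feast. The following is a minimal sketch using only the Python standard library; it is not how Feast implements its joins, and the observations, values, and the `value_as_of` helper are hypothetical. For a given label timestamp, it returns the most recent feature value observed at or before that moment, so later values cannot leak into the training row.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature observations for one driver, sorted by timestamp.
observations = [
    (datetime(2021, 4, 12, 8, 0, 0), 0.10),
    (datetime(2021, 4, 12, 10, 0, 0), 0.20),
    (datetime(2021, 4, 12, 12, 0, 0), 0.30),
]


def value_as_of(observations, as_of):
    """Return the latest value observed at or before `as_of`, else None."""
    timestamps = [ts for ts, _ in observations]
    # bisect_right counts the observations whose timestamp is <= as_of.
    idx = bisect_right(timestamps, as_of)
    return observations[idx - 1][1] if idx else None


label_time = datetime(2021, 4, 12, 10, 59, 42)
conv_rate = value_as_of(observations, label_time)
print(conv_rate)
```

Here the 12:00 observation is ignored because it happened after the 10:59:42 label timestamp. With pandas dataframes, `pandas.merge_asof` with `direction="backward"` performs the same kind of lookup.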
## 📐 Architecture

The architecture shown above is the minimal Feast deployment. Want to run the full Feast on Snowflake/GCP/AWS? Click [here](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws).
## 🐣 Getting Started
### 1. Install Feast
```
pip install feast
```
### 2. Create a feature repository
```
feast init my_feature_repo
cd my_feature_repo/feature_repo
```
### 3. Register your feature definitions and set up your feature store
```
feast apply
```
### 4. Explore your data in the web UI (experimental)

```
feast ui
```
### 5. Build a training dataset
```
from datetime import datetime

import pandas as pd
from feast import FeatureStore

# One row per training example: the entity key plus the timestamp
# at which feature values should be looked up.
entity_df = pd.DataFrame.from_dict({
    "driver_id": [1001, 1002, 1003, 1004],
    "event_timestamp": [
        datetime(2021, 4, 12, 10, 59, 42),
        datetime(2021, 4, 12, 8, 12, 10),
        datetime(2021, 4, 12, 16, 40, 26),
        datetime(2021, 4, 12, 15, 1, 12),
    ],
})

store = FeatureStore(repo_path=".")

# Features are referenced as "<feature view>:<feature name>".
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=[
        'driver_hourly_stats:conv_rate',
        'driver_hourly_stats:acc_rate',
        'driver_hourly_stats:avg_daily_trips',
    ],
).to_df()

print(training_df.head())

# Train model
# model = ml.fit(training_df)
```
```
            event_timestamp  driver_id  conv_rate  acc_rate  avg_daily_trips
0 2021-04-12 08:12:10+00:00       1002   0.713465  0.597095              531
1 2021-04-12 10:59:42+00:00       1001   0.072752  0.044344               11
2 2021-04-12 15:01:12+00:00       1004   0.658182  0.079150              220
3 2021-04-12 16:40:26+00:00       1003   0.162092  0.309035              959
```
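
The entity dataframe above is built from naive `datetime` values, while the sample output prints timestamps with a `+00:00` offset. If you would rather state the time zone explicitly than leave it implied, one option is to attach UTC before building the dataframe. This is a small sketch of that preference, not a Feast requirement; `as_utc` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def as_utc(naive):
    """Attach UTC to a naive datetime without shifting the clock time."""
    return naive.replace(tzinfo=timezone.utc)


event_timestamps = [
    as_utc(datetime(2021, 4, 12, 10, 59, 42)),
    as_utc(datetime(2021, 4, 12, 8, 12, 10)),
]
print(event_timestamps[0].isoformat())
```

Note that `replace(tzinfo=...)` only labels the value as UTC; it does not convert from a local time zone.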
### 6. Load feature values into your online store
**Option 1: Incremental materialization (recommended)**
```
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize-incremental $CURRENT_TIME
```
**Option 2: Full materialization with timestamps**
```
CURRENT_TIME=$(date -u +"%Y-%m-%dT%H:%M:%S")
feast materialize 2021-04-12T00:00:00 $CURRENT_TIME
```
**Option 3: Simple materialization without timestamps**
```
feast materialize --disable-event-timestamp
```
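The `--disable-event-timestamp` flag lets you materialize all available feature data using the current datetime as the event timestamp, without specifying start and end timestamps. This is useful when your source data lacks a proper event timestamp column.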
```
Materializing feature view driver_hourly_stats from 2021-04-14 to 2021-04-15 done!
```
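
Conceptually, materialization copies the most recent feature values per entity within a time window from the offline store into the online store. The sketch below illustrates that idea with plain Python data structures. It is a simplified illustration rather than Feast's implementation: the rows, the `materialize` function, and the dict standing in for an online store are all hypothetical, and the window boundaries are treated as inclusive for simplicity.

```python
from datetime import datetime

# Hypothetical offline rows: (driver_id, event_timestamp, conv_rate).
offline_rows = [
    (1001, datetime(2021, 4, 14, 1, 0), 0.1),
    (1001, datetime(2021, 4, 14, 9, 0), 0.4),
    (1002, datetime(2021, 4, 14, 5, 0), 0.7),
    (1002, datetime(2021, 4, 16, 5, 0), 0.9),  # outside the window below
]


def materialize(rows, start, end):
    """Return {driver_id: latest conv_rate} for rows with start <= ts <= end."""
    latest = {}  # driver_id -> (timestamp, value)
    for driver_id, ts, value in rows:
        in_window = start <= ts <= end
        is_newer = driver_id not in latest or ts > latest[driver_id][0]
        if in_window and is_newer:
            latest[driver_id] = (ts, value)
    return {driver_id: value for driver_id, (_, value) in latest.items()}


online_store = materialize(
    offline_rows, datetime(2021, 4, 14), datetime(2021, 4, 15)
)
print(online_store)
```

Only one value per driver survives, which is what makes the later online lookup a simple key-based read.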
### 7. Read online features at low latency
```
from pprint import pprint

from feast import FeatureStore

store = FeatureStore(repo_path=".")

# Look up the latest feature values for one driver in the online store.
feature_vector = store.get_online_features(
    features=[
        'driver_hourly_stats:conv_rate',
        'driver_hourly_stats:acc_rate',
        'driver_hourly_stats:avg_daily_trips',
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

pprint(feature_vector)

# Make prediction
# model.predict(feature_vector)
```
```
{
    "driver_id": [1001],
    "driver_hourly_stats__conv_rate": [0.49274],
    "driver_hourly_stats__acc_rate": [0.92743],
    "driver_hourly_stats__avg_daily_trips": [72]
}
```
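
The response is column-oriented: each key maps to a list with one value per requested entity row. Many model APIs expect row-oriented input in a fixed feature order, so a small conversion step is often needed before calling `predict`. The following is a minimal sketch that uses the sample output above as input. `FEATURE_ORDER` and `to_rows` are hypothetical names, and the key names are copied from the sample output, so they may differ from what your Feast version returns.

```python
# Copied from the sample output above.
feature_vector = {
    "driver_id": [1001],
    "driver_hourly_stats__conv_rate": [0.49274],
    "driver_hourly_stats__acc_rate": [0.92743],
    "driver_hourly_stats__avg_daily_trips": [72],
}

# Hypothetical: the feature order the model was trained with.
# The entity key (driver_id) is left out because it identifies the row
# rather than describing it.
FEATURE_ORDER = [
    "driver_hourly_stats__conv_rate",
    "driver_hourly_stats__acc_rate",
    "driver_hourly_stats__avg_daily_trips",
]


def to_rows(feature_vector, feature_order):
    """Turn the column-oriented dict into one list of values per entity row."""
    columns = [feature_vector[name] for name in feature_order]
    return [list(row) for row in zip(*columns)]


rows = to_rows(feature_vector, FEATURE_ORDER)
print(rows)
```

Keeping the feature order in one place helps ensure that training and serving pass values to the model in the same positions.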
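## 📦 Functionality and Roadmap
The list below contains the functionality that contributors are planning to develop for Feast.
* We welcome contributions to all items in the roadmap!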
* **Natural Language Processing**
  * [x] Vector Search (Alpha release. See [RFC](https://docs.google.com/document/d/18IWzLEA9i2lDWnbfbwXnMCg3StlqaLVI-uRpQjr_Vos/edit#heading=h.9gaqqtox9jg6))
  * [ ] [Enhanced Feature Server and SDK for native support for NLP](https://github.com/feast-dev/feast/issues/4964)
* **Data Sources**
  * [x] [Snowflake source](https://docs.feast.dev/reference/data-sources/snowflake)
  * [x] [Redshift source](https://docs.feast.dev/reference/data-sources/redshift)
  * [x] [BigQuery source](https://docs.feast.dev/reference/data-sources/bigquery)
  * [x] [Parquet file source](https://docs.feast.dev/reference/data-sources/file)
  * [x] [Azure Synapse + Azure SQL source (contrib plugin)](https://docs.feast.dev/reference/data-sources/mssql)
  * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
  * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/data-sources/postgres)
  * [x] [Spark (contrib plugin)](https://docs.feast.dev/reference/data-sources/spark)
  * [x] [Couchbase (contrib plugin)](https://docs.feast.dev/reference/data-sources/couchbase)
  * [x] [Athena (contrib plugin)](https://docs.feast.dev/reference/data-sources/athena)
  * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/data-sources/clickhouse)
  * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/data-sources/oracle)
  * [x] Kafka / Kinesis sources (via [push support into the online store](https://docs.feast.dev/reference/data-sources/push))
* **Offline Stores**
  * [x] [Snowflake](https://docs.feast.dev/reference/offline-stores/snowflake)
  * [x] [Redshift](https://docs.feast.dev/reference/offline-stores/redshift)
  * [x] [BigQuery](https://docs.feast.dev/reference/offline-stores/bigquery)
  * [x] [DuckDB](https://docs.feast.dev/reference/offline-stores/duckdb)
  * [x] [Dask](https://docs.feast.dev/reference/offline-stores/dask)
  * [x] [Remote](https://docs.feast.dev/reference/offline-stores/remote-offline-store)
  * [x] [Azure Synapse + Azure SQL (contrib plugin)](https://docs.feast.dev/reference/offline-stores/mssql)
  * [x] [Hive (community plugin)](https://github.com/baineng/feast-hive)
  * [x] [Postgres (contrib plugin)](https://docs.feast.dev/reference/offline-stores/postgres)
  * [x] [Trino (contrib plugin)](https://docs.feast.dev/reference/offline-stores/trino)
  * [x] [Spark (contrib plugin)](https://docs.feast.dev/reference/offline-stores/spark)
  * [x] [Couchbase (contrib plugin)](https://docs.feast.dev/reference/offline-stores/couchbase)
  * [x] [Athena (contrib plugin)](https://docs.feast.dev/reference/offline-stores/athena)
  * [x] [Clickhouse (contrib plugin)](https://docs.feast.dev/reference/offline-stores/clickhouse)
  * [x] [Ray (contrib plugin)](https://docs.feast.dev/reference/offline-stores/ray)
  * [x] [Oracle (contrib plugin)](https://docs.feast.dev/reference/offline-stores/oracle)
  * [x] [Hybrid](https://docs.feast.dev/reference/offline-stores/hybrid)
  * [x] [Custom offline store support](https://docs.feast.dev/how-to-guides/customizing-feast/adding-a-new-offline-store)
* **Online Stores**
  * [x] [Snowflake](https://docs.feast.dev/reference/online-stores/snowflake)
  * [x] [DynamoDB](https://docs.feast.dev/reference/online-stores/dynamodb)
  * [x] [Redis](https://docs.feast.dev/reference/online-stores/redis)
  * [x] [Dragonfly](https://docs.feast.dev/reference/online-stores/dragonfly)
  * [x] [Datastore](https://docs.feast.dev/reference/online-stores/datastore)
  * [x] [Bigtable](https://docs.feast.dev/reference/online-stores/bigtable)
  * [x] [SQLite](https://docs.feast.dev/reference/online-stores/sqlite)
  * [x] [Remote](https://docs.feast.dev/reference/online-stores/remote)
  * [x] [Postgres](https://docs.feast.dev/reference/online-stores/postgres)
  * [x] [HBase](https://docs.feast.dev/reference/online-stores/hbase)
  * [x] [Cassandra / AstraDB](https://docs.feast.dev/reference/online-stores/cassandra)
  * [x] [ScyllaDB](https://docs.feast.dev/reference/online-stores/scylladb)
  * [x] [MySQL](https://docs.feast.dev/reference/online-stores/mysql)
  * [x] [Hazelcast](https://docs.feast.dev/reference/online-stores/hazelcast)
  * [x] [Elasticsearch](https://docs.feast.dev/reference/online-stores/elasticsearch)
  * [x] [SingleStore](https://docs.feast.dev/reference/online-stores/singlestore)
  * [x] [Couchbase](https://docs.feast.dev/reference/online-stores/couchbase)
  * [x] [MongoDB](https://docs.feast.dev/reference/online-stores/mongodb)
  * [x] [Qdrant (vector store)](https://docs.feast.dev/reference/online-stores/qdrant)
  * [x] [Milvus (vector store)](https://docs.feast.dev/reference/online-stores/milvus)
  * [x] [Faiss (vector store)](https://docs.feast.dev/reference/online-stores/faiss)
  * [x] [Hybrid](https://docs.feast.dev/reference/online-stores/hybrid)
  * [x] [Azure Cache for Redis (community plugin)](https://github.com/Azure/feast-azure)
  * [x] [Custom online store support](https://docs.feast.dev/how-to-guides/customizing-feast/adding-support-for-a-new-online-store)
* **Feature Engineering**
  * [x] On-demand Transformations (On Read) (Beta release. See [RFC](https://docs.google.com/document/d/1lgfIw0Drc65LpaxbUu49RCeJgMew547meSJttnUqz7c/edit#))
  * [x] Streaming Transformations (Alpha release. See [RFC](https://docs.google.com/document/d/1UzEyETHUaGpn0ap4G82DHluiCj7zEbrQLkJJkKSv4e8/edit))
  * [ ] Batch transformation (In progress. See [RFC](https://docs.google.com/document/d/1964OkzuBljifDvkV-0fakp2uaijnVzdwWNGdz7Vz50A/edit))
  * [x] On-demand Transformations (On Write) (Beta release. See [GitHub Issue](https://github.com/feast-dev/feast/issues/4376))
* **Streaming**
  * [x] [Custom streaming ingestion job support](https://docs.feast.dev/how-to-guides/customizing-feast/creating-a-custom-provider)
  * [x] [Push-based streaming data ingestion to the online store](https://docs.feast.dev/reference/data-sources/push)
  * [x] [Push-based streaming data ingestion to the offline store](https://docs.feast.dev/reference/data-sources/push)
* **Deployments**
  * [x] AWS Lambda (Alpha release. See [RFC](https://docs.google.com/document/d/1eZWKWzfBif66LDN32IajpaG-j82LSHCCOzY6R7Ax7MI/edit))
  * [x] Kubernetes (See [guide](https://docs.feast.dev/how-to-guides/running-feast-in-production))
* **Feature Serving**
  * [x] Python Client
  * [x] [Python feature server](https://docs.feast.dev/reference/feature-servers/python-feature-server)
  * [x] [Feast Operator (alpha)](https://github.com/feast-dev/feast/blob/master/infra/feast-operator/README.md)
  * [x] [Java feature server (alpha)](https://github.com/feast-dev/feast/blob/master/infra/charts/feast/README.md)
  * [x] [Go feature server (alpha)](https://docs.feast.dev/reference/feature-servers/go-feature-server)
  * [x] [Offline Feature Server (alpha)](https://docs.feast.dev/reference/feature-servers/offline-feature-server)
  * [x] [Registry server (alpha)](https://github.com/feast-dev/feast/blob/master/docs/reference/feature-servers/registry-server.md)
* **Data Quality Management (See [RFC](https://docs.google.com/document/d/110F72d4NTv80p35wDSONxhhPBqWRwbZXG4f9mNEMd98/edit))**
  * [x] Data profiling and validation
* **Feature Discovery and Governance**
  * [x] Python SDK for browsing the feature registry
  * [x] CLI for browsing the feature registry
  * [x] Model-centric feature tracking (feature services)
  * [x] Amundsen integration (see [Feast extractor](https://github.com/amundsen-io/amundsen/blob/main/databuilder/databuilder/extractor/feast_extractor.py))
  * [x] DataHub integration (see [DataHub Feast docs](https://datahubproject.io/docs/generated/ingestion/sources/feast/))
  * [x] Feast Web UI (Beta release. See [docs](https://docs.feast.dev/reference/alpha-web-ui))
  * [ ] Feast Lineage Explorer
## 🎓 Important Resources
Please refer to the official documentation at [Documentation](https://docs.feast.dev/)
* [Quickstart](https://docs.feast.dev/getting-started/quickstart)
* [Tutorials](https://docs.feast.dev/tutorials/tutorials-overview)
* [Examples](https://github.com/feast-dev/feast/tree/master/examples)
* [Running Feast with Snowflake/GCP/AWS](https://docs.feast.dev/how-to-guides/feast-snowflake-gcp-aws)
* [Change Log](https://github.com/feast-dev/feast/blob/master/CHANGELOG.md)
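## 👋 Contributing
Feast is a community project and is still under active development. If you want to contribute to the project, please have a look at our contributing and development guides:
- [Contribution Process for Feast](https://docs.feast.dev/project/contributing)
- [Development Guide for Feast](https://docs.feast.dev/project/development-guide)
- [Development Guide for the Main Feast Repository](./CONTRIBUTING.md)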
## ✨ Contributors
Thanks go to these incredible people: