From 48a4285b3a91bd6e490d6562591de612b96a5e02 Mon Sep 17 00:00:00 2001 From: Francisco Arceo Date: Sun, 9 Feb 2025 12:09:09 -0500 Subject: [PATCH] docs: Adding blog posts from feast.dev website (#5034) --- docs/SUMMARY.md | 1 + docs/blog/README.md | 21 +++ docs/blog/a-state-of-feast.md | 80 +++++++++ docs/blog/announcing-feast-0-11.md | 65 +++++++ ...faster-feature-transformations-in-feast.md | 50 ++++++ docs/blog/feast-0-10-announcement.md | 163 ++++++++++++++++++ ...vers-and-feature-views-without-entities.md | 45 +++++ ...st-0-14-adds-aws-lambda-feature-servers.md | 56 ++++++ ...ake-support-and-data-quality-monitoring.md | 37 ++++ ...-20-adds-api-and-connector-improvements.md | 41 +++++ docs/blog/feast-benchmarks.md | 65 +++++++ ...-joins-the-linux-foundation-for-ai-data.md | 37 ++++ ...2-adds-aws-redshift-and-dynamodb-stores.md | 46 +++++ docs/blog/feast-supports-vector-database.md | 43 +++++ docs/blog/go-feature-server-benchmarks.md | 55 ++++++ ...how-danny-chiao-is-keeping-feast-simple.md | 31 ++++ ...kubeflow-and-feast-with-david-aronchick.md | 31 ++++ ...ime-fraud-prediction-using-feast-on-gcp.md | 50 ++++++ ...t-for-python-based-feast-feature-server.md | 61 +++++++ docs/blog/rbac-role-based-access-controls.md | 71 ++++++++ ...g-feature-engineering-with-denormalized.md | 135 +++++++++++++++ docs/blog/the-future-of-feast.md | 38 ++++ docs/blog/the-road-to-feast-1-0.md | 35 ++++ docs/blog/what-is-a-feature-store.md | 85 +++++++++ 24 files changed, 1342 insertions(+) create mode 100644 docs/blog/README.md create mode 100644 docs/blog/a-state-of-feast.md create mode 100644 docs/blog/announcing-feast-0-11.md create mode 100644 docs/blog/faster-feature-transformations-in-feast.md create mode 100644 docs/blog/feast-0-10-announcement.md create mode 100644 docs/blog/feast-0-13-adds-on-demand-transforms-feature-servers-and-feature-views-without-entities.md create mode 100644 docs/blog/feast-0-14-adds-aws-lambda-feature-servers.md create mode 100644 docs/blog/feast-0-18-adds-snowflake-support-and-data-quality-monitoring.md create mode 100644 docs/blog/feast-0-20-adds-api-and-connector-improvements.md create mode 100644 docs/blog/feast-benchmarks.md create mode 100644 docs/blog/feast-joins-the-linux-foundation-for-ai-data.md create mode 100644 docs/blog/feast-release-0-12-adds-aws-redshift-and-dynamodb-stores.md create mode 100644 docs/blog/feast-supports-vector-database.md create mode 100644 docs/blog/go-feature-server-benchmarks.md create mode 100644 docs/blog/how-danny-chiao-is-keeping-feast-simple.md create mode 100644 docs/blog/kubeflow-and-feast-with-david-aronchick.md create mode 100644 docs/blog/machine-learning-data-stack-for-real-time-fraud-prediction-using-feast-on-gcp.md create mode 100644 docs/blog/performance-test-for-python-based-feast-feature-server.md create mode 100644 docs/blog/rbac-role-based-access-controls.md create mode 100644 docs/blog/streaming-feature-engineering-with-denormalized.md create mode 100644 docs/blog/the-future-of-feast.md create mode 100644 docs/blog/the-road-to-feast-1-0.md create mode 100644 docs/blog/what-is-a-feature-store.md diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index bbda7773b4..127b27463e 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -1,6 +1,7 @@ # Table of contents * [Introduction](README.md) +* [Blog](blog/README.md) * [Community & getting help](community.md) * [Roadmap](roadmap.md) * [Changelog](https://github.com/feast-dev/feast/blob/master/CHANGELOG.md) diff --git a/docs/blog/README.md b/docs/blog/README.md new file mode 100644 
index 0000000000..cc42cfe442 --- /dev/null +++ b/docs/blog/README.md @@ -0,0 +1,21 @@ +# Blog Posts + +Welcome to the Feast blog! Here you'll find articles about feature store development, new features, and community updates. + +## Featured Posts + +{% content-ref url="what-is-a-feature-store.md" %} +[what-is-a-feature-store.md](what-is-a-feature-store.md) +{% endcontent-ref %} + +{% content-ref url="the-future-of-feast.md" %} +[the-future-of-feast.md](the-future-of-feast.md) +{% endcontent-ref %} + +{% content-ref url="feast-supports-vector-database.md" %} +[feast-supports-vector-database.md](feast-supports-vector-database.md) +{% endcontent-ref %} + +{% content-ref url="rbac-role-based-access-controls.md" %} +[rbac-role-based-access-controls.md](rbac-role-based-access-controls.md) +{% endcontent-ref %} diff --git a/docs/blog/a-state-of-feast.md b/docs/blog/a-state-of-feast.md new file mode 100644 index 0000000000..7343effdcb --- /dev/null +++ b/docs/blog/a-state-of-feast.md @@ -0,0 +1,80 @@ +# A State of Feast + +*January 21, 2021* | *Willem Pienaar* + +## Introduction + +Two years ago we first announced the launch of Feast, an open source feature store for machine learning. Feast is an operational data system that solves some of the key challenges that ML teams encounter while productionizing machine learning systems. + +Recognizing that ML and Feast have advanced since we launched, we take a moment today to discuss the past, present and future of Feast. We consider the more significant lessons we learned while building Feast, where we see the project heading, and why teams should consider adopting Feast as part of their operational ML stacks. + +## Background + +Feast was developed to address the challenges faced while productionizing data for machine learning. In our original [Google Cloud article](https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning), we highlighted some of these challenges, namely: + +1. Features aren't reused. +2. Feature definitions are inconsistent across teams. +3. Getting features into production is hard. +4. Feature values are inconsistent between training and serving. + +Whereas an industry to solve data transformations and data-quality problems already existed, our focus for shaping Feast was to overcome operational ML hurdles that exist between data science and ML engineering. Toward that end, our initial aim was to provide: + +1. Registry: The registry is a common catalog with which to explore, develop, collaborate on, and publish new feature definitions within and across teams. It is the central interface for all interactions with the feature store. +2. Ingestion: A means for continually ingesting batch and streaming data and storing consistent copies in both an offline and online store. This layer automates most data-management work and ensures that features are always available for serving. +3. Serving: A feature-retrieval interface which provides a temporally consistent view of features for both training and online serving. Serving improves iteration speed by minimizing coupling to data infrastructure, and prevents training-serving skew through consistent data access. + +Guided by this design, we co-developed and shipped Feast with our friends over at Google. We then open sourced the project in early 2019, and have since been running Feast in production and at scale. 
In our follow-up blog post, [Bridging ML Models and Data](https://blog.gojekengineering.com/feast-bridging-ml-models-and-data), we touched on the impact Feast has had at companies like Gojek.

## Feast today

Teams, large and small, are increasingly searching for ways to simplify the productionization and maintenance of their ML systems at scale. Since open sourcing Feast, we've seen both the demand for these tools and the activity around this project soar. Working alongside our open source community, we've released key pieces of our stack throughout the last year, and steadily expanded Feast into a robust feature store. Highlights include:

* Point-in-time correct queries that prevent feature data leakage.
* A query optimized table-based data model in the form of feature sets.
* Storage connectors with implementations for Cassandra and Redis Cluster.
* Statistics generation and data validation through TFDV integration.
* Authentication and authorization support for SDKs and APIs.
* Diagnostic tooling through request/response logging, audit logs, and Statsd integration.

Feast has grown more rapidly than initially anticipated, with multiple large companies, including Agoda, Gojek, Farfetch, Postmates, and Zulily, adopting and/or contributing to the project. We've also been working closely with other open source teams, and we are excited to share that Feast is now a [component in Kubeflow](https://www.kubeflow.org/docs/components/feature-store/). Over the coming months we will be enhancing this integration, making it easier for users to deploy Feast and Kubeflow together.

## Lessons learned

Through frequent engagement with our community and by way of running Feast in production ourselves, we've learned critical lessons:

* **Feast requires too much infrastructure:** Requiring users to provision a large system is a big ask. A minimal Feast deployment requires Kafka, Zookeeper, Postgres, Redis, and multiple Feast services.
* **Feast lacks composability:** Requiring all infrastructural components to be present in order to have a functional system removes all modularity.
* **Ingestion is too complex:** Incorporating a Kafka-based stream-first ingestion layer makes data consistency across stores trivial, but the complete ingestion flow from source to sink can still mysteriously fail at multiple points.
* **Our technology choices hinder generalization:** Leveraging technologies like BigQuery, Apache Beam on Dataflow, and Apache Kafka has allowed us to move faster in delivering functionality. However, these technologies now impede our ability to generalize to other clouds or deployment environments.

## The future of Feast

> *"Always in motion is the future."*
> – Yoda, The Empire Strikes Back

While feature stores have already become essential systems at large technology companies, we believe their widespread adoption will begin in 2021. We also foresee the release of multiple managed feature stores over the next year, as vendors seek to enter the burgeoning operational ML market.

As we've discussed, feature stores serve both offline and production ML needs, and therefore are primarily built by engineers for engineers. What we need, however, is a feature store that's purpose-built for data-science workflows. Feast will move away from an infrastructure-centric approach toward a more localized experience that does just this: builds on teams' existing data-science workflows.
+ +The lessons we've learned during the preceding two years have crystallized a vision for what Feast should become: a light-weight modular feature store. One that's easy to pick up, adds value to teams large and small, and can be progressively applied to production use cases that span multiple teams, projects, and cloud-environments. We aim to reach this by applying the following design principles: + +1. Python-first: First-class support for running a minimal version of Feast entirely from a notebook, with all infrastructural dependencies becoming optional enhancements. + * Encourages quick evaluation of the software and ensures Feast is user friendly + * Minimizes the operational burden of running the system in production + * Simplifies testing, developing, and maintaining Feast + +## Next Steps + +Our vision for Feast is not only ambitious, but actionable. Our next release, Feast 0.8, is the product of collaborating with both our open source community and our friends over at [Tecton](https://tecton.ai/). + +1. Python-first: We are migrating all core logic to Python, starting with training dataset retrieval and job management, providing a more responsive development experience. +2. Modular ingestion: We are shifting to managing batch and streaming ingestion separately, leading to more actionable metrics, logs, and statistics and an easier to understand and operate system. +3. Support for AWS: We are replacing GCP-specific technologies like Beam on Dataflow with Spark and adding native support for running Feast on AWS, our first steps toward cloud-agnosticism. +4. Data-source integrations: We are introducing support for a host of new data sources (Kinesis, Kafka, S3, GCS, BigQuery) and data formats (Parquet, JSON, Avro), ensuring teams can seamlessly integrate Feast into their existing data-infrastructure. + +## Get involved + +We've been inspired by the soaring community interest in and contributions to Feast. If you're curious to learn more about our mission to build a best-in-class feature store, or are looking to build your own: Check out our resources, say hello, and get involved! diff --git a/docs/blog/announcing-feast-0-11.md b/docs/blog/announcing-feast-0-11.md new file mode 100644 index 0000000000..b3b1f2f4ac --- /dev/null +++ b/docs/blog/announcing-feast-0-11.md @@ -0,0 +1,65 @@ +# Announcing Feast 0.11 + +*June 23, 2021* | *Jay Parthasarthy & Willem Pienaar* + +Feast 0.11 is here! This is the first release after the major changes introduced in Feast 0.10. We've focused on two areas in particular: + +1. Introducing a new online store, Redis, which supports feature serving at high throughput and low latency. +2. Improving the Feast user experience through reduced boilerplate, smoother workflows, and improved error messages. A key addition here is the introduction of *feature inferencing,* which allows Feast to dynamically discover data schemas in your source data. + +Let's get into it! + +### Support for Redis as an online store 🗝 + +Feast 0.11 introduces support for Redis as an online store, allowing teams to easily scale up Feast to support high volumes of online traffic. Using Redis with Feast is as easy as adding a few lines of configuration to your feature_store.yaml file: + +```yaml +project: fraud +registry: data/registry.db +provider: local +online_store: + type: redis + connection_string: localhost:6379 +``` + +Feast is then able to read and write from Redis as its online store. 

```bash
$ feast materialize

Materializing 3 feature views to 2021-06-15 18:43:03+00:00 into the redis online store.

user_account_features from 2020-06-16 18:43:04 to 2021-06-15 18:43:13:
100%|███████████████████████| 9944/9944 [00:04<00:00, 20065.15it/s]
user_transaction_count_7d from 2021-06-08 18:43:21 to 2021-06-15 18:43:03:
100%|███████████████████████| 9674/9674 [00:04<00:00, 19943.82it/s]
```

We're also working on making it easier for teams to add their own storage and compute systems through plugin interfaces. Please see this RFC for more details on the proposal.

### Feature Inferencing 🔎

Before 0.11, users had to define each feature individually when defining Feature Views. Now, Feast infers the schema of a Feature View based on upstream data sources, significantly reducing boilerplate.

Before:
```python
driver_hourly_stats_view = FeatureView(
    name="driver_hourly_stats",
    entities=["driver_id"],
    ttl=timedelta(days=1),
    features=[
        Feature(name="conv_rate", dtype=ValueType.FLOAT),
    ],
    input=BigQuerySource(table_ref="feast-oss.demo_data.driver_hourly_stats"),
)
```

With 0.11, the `features` list can simply be omitted: Feast infers the feature names and types directly from the upstream BigQuery source.

Aside from these additions, a wide variety of small bug fixes and UX improvements made it into this release. [Check out the changelog](https://github.com/feast-dev/feast/blob/master/CHANGELOG.md) for a full list of what's new.

Special thanks and a big shoutout to the community contributors whose changes made it into this release: [MattDelac](https://github.com/MattDelac), [mavysavydav](https://github.com/mavysavydav), [szalai1](https://github.com/szalai1), [rightx2](https://github.com/rightx2)

### Help us design Feast for AWS 🗺️

The 0.12 release will include native support for AWS. We are looking to meet with teams that are considering using Feast to gather feedback and help shape the product as design partners. We often help our design partners out with architecture or design reviews. If this sounds helpful to you, [join us in Slack](http://slack.feastsite.wpenginepowered.com/), or [book a call with Feast maintainers here](https://calendly.com/d/gc29-y88c/feast-chat-w-willem).

### Feast from around the web 📣

diff --git a/docs/blog/faster-feature-transformations-in-feast.md b/docs/blog/faster-feature-transformations-in-feast.md
new file mode 100644
index 0000000000..689a5b84ab
--- /dev/null
+++ b/docs/blog/faster-feature-transformations-in-feast.md
@@ -0,0 +1,50 @@
# Faster Feature Transformations in Feast 🏎️💨

*December 5, 2024* | *Francisco Javier Arceo, Shuchu Han*

*Thank you to [Shuchu Han](https://www.linkedin.com/in/shuchu/), [Ross Briden](https://www.linkedin.com/in/ross-briden/), [Ankit Nadig](https://www.linkedin.com/in/ankit-nadig/), and the folks at Affirm for inspiring this work and creating an initial proof of concept.*

Feature engineering is at the core of building high-performance machine learning models. The Feast team has introduced two major enhancements to [On Demand Feature Views](https://docs.feast.dev/reference/beta-on-demand-feature-views) (ODFVs), pushing the boundaries of efficiency and flexibility for data scientists and engineers. Here's a closer look at these exciting updates:

## 1. Transformations with Native Python

Traditionally, transformations in ODFVs were limited to Pandas-based operations. While powerful, Pandas transformations can be computationally expensive for certain use cases. Feast now introduces Native Python Mode, a feature that allows users to write transformations using pure Python.
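As a minimal sketch of what a Python-mode transformation looks like (it assumes a `driver_hourly_stats_view` feature view with a `conv_rate` feature defined elsewhere in the repo; names here are illustrative):

```python
from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

@on_demand_feature_view(
    sources=[driver_hourly_stats_view],  # assumed to be defined elsewhere
    schema=[Field(name="conv_rate_adjusted", dtype=Float64)],
    mode="python",  # run the transformation in native Python instead of Pandas
)
def conv_rate_adjusted(inputs: dict) -> dict:
    # In Python mode, `inputs` maps each source feature name to a list of values
    return {
        "conv_rate_adjusted": [rate * 1.1 for rate in inputs["conv_rate"]],
    }
```
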
Key benefits of Native Python Mode include:

* Blazing Speed: Transformations using Native Python are nearly 10x faster compared to Pandas for many operations.
* Intuitive Design: This mode supports list-based and singleton (row-level) transformations, making it easier for data scientists to think in terms of individual rows rather than entire datasets.
* Versatility: Users can now switch between batch and singleton transformations effortlessly, catering to both historical and online retrieval scenarios.

Using the cProfile library and snakeviz, we profiled the runtime of the ODFV transformation in both Pandas and Native Python mode and observed a nearly 10x reduction in runtime.

## 2. Transformations on Writes

Until now, ODFVs operated solely as transformations on reads, applying logic during online feature retrieval. While this ensured flexibility, it sometimes came at the cost of increased latency during retrieval. Feast now supports transformations on writes, enabling users to apply transformations during data ingestion and store the transformed features in the online store.

Why does this matter?

* Reduced Online Latency: With transformations pre-applied at ingestion, online retrieval becomes a straightforward lookup, significantly improving performance for latency-sensitive applications.
* Operational Flexibility: By toggling the write_to_online_store parameter, users can choose whether transformations should occur at write time (to optimize reads) or at read time (to preserve data freshness).

Here's an example of applying transformations during ingestion (the `driver_hourly_stats_view` source is assumed to be defined elsewhere in the repo):

```python
import pandas as pd
from feast import Field
from feast.on_demand_feature_view import on_demand_feature_view
from feast.types import Float64

@on_demand_feature_view(
    sources=[driver_hourly_stats_view],  # defined elsewhere in the repo
    schema=[Field(name="conv_rate_adjusted", dtype=Float64)],
    write_to_online_store=True,  # apply the transformation at write time
)
def driver_stats_adjusted(features_df: pd.DataFrame) -> pd.DataFrame:
    df = pd.DataFrame()
    df["conv_rate_adjusted"] = features_df["conv_rate"] * 1.1
    return df
```

With this new capability, data engineers can optimize online retrieval performance without sacrificing the flexibility of on-demand transformations.

### The Future of ODFVs and Feature Transformations

These enhancements bring ODFVs closer to the goal of seamless feature engineering at scale. By combining high-speed Python-based transformations with the ability to optimize retrieval latency, Feast empowers teams to build more efficient, responsive, and production-ready feature pipelines.

For more detailed examples and use cases, check out the [documentation for On Demand Feature Views](https://docs.feast.dev/reference/beta-on-demand-feature-views). Whether you're a data scientist prototyping features or an engineer optimizing a production system, the new ODFV capabilities offer the tools you need to succeed.

The future of Feature Transformations in Feast will be to unify feature transformations and feature views to allow for a simpler API. If you have thoughts or interest in giving feedback to the maintainers, feel free to comment directly on [the GitHub Issue](https://github.com/feast-dev/feast/issues/4584) or in [the RFC](https://docs.google.com/document/d/1KXCXcsXq1bU...).

diff --git a/docs/blog/feast-0-10-announcement.md b/docs/blog/feast-0-10-announcement.md
new file mode 100644
index 0000000000..54b10880d1
--- /dev/null
+++ b/docs/blog/feast-0-10-announcement.md
@@ -0,0 +1,163 @@
# Announcing Feast 0.10

*April 15, 2021* | *Jay Parthasarthy & Willem Pienaar*

Today, we're announcing Feast 0.10, an important milestone towards our vision for a lightweight feature store. Feast is an open source feature store that helps you serve features in production.
It prevents feature leakage by building training datasets from your batch data, automates the process of loading and serving features in an online feature store, and ensures your models in production have a consistent view of feature data. + +With Feast 0.10, we've dramatically simplified the process of managing a feature store. This new release allows you to: + +* Run a minimal local feature store from your notebook +* Deploy a production-ready feature store into a cloud environment in 30 seconds +* Operate a feature store without Kubernetes, Spark, or self-managed infrastructure + +We think Feast 0.10 is the simplest and fastest way to productionize features. Let's get into it! + +## The challenge with feature stores + +In our previous post, [A State of Feast](https://blog.feastsite.wpenginepowered.com/post), we shared our vision for building a feature store that is accessible to all ML teams. Since then, we've been working towards this vision by shipping support for AWS, Azure, and on-prem deployments. + +Over the last couple of months we've seen a surge of interest in Feast. ML teams are increasingly being tasked with building production ML systems, and many are looking for an open source tool to help them operationalize their feature data in a structured way. However, many of these teams still can't afford to run their own feature stores: + +> "Feature stores are big infrastructure!" + +The conventional wisdom is that feature stores should be built and operated as platforms. It's not surprising why many have this notion. Feature stores require access to compute layers, offline and online databases, and need to directly interface with production systems. + +This infrastructure-centric approach means that operating your own feature store is a daunting task. Many teams simply don't have the resources to deploy and manage a feature store. Instead, ML teams are being forced to hack together their own custom scripts or end up delaying their projects as they wait for engineering support. + +## Towards a simpler feature store + +Our vision for Feast is to provide a feature store that a single data scientist can deploy for a single ML project, but can also scale up for use by large platform teams. We've made all infrastructure optional in Feast 0.10. That means no Spark, no Kubernetes, and no APIs, unless you need them. If you're just starting out we won't ask you to deploy and manage a platform. + +Additionally, we've pulled out the core of our software into a single Python framework. This framework allows teams to define features and declaratively provision a feature store based on those definitions, to either local or cloud environments. If you're just starting out with feature stores, you'll only need to manage a Git repository and run the Feast CLI or SDK, nothing more. + +Feast 0.10 introduces a first-class local mode: not installed through Docker containers, but through pip. It allows users to start a minimal feature store entirely from a notebook, allowing for rapid development against sample data and for testing against the same ML frameworks they're using in production. Finally, we've also begun adding first-class support for managed services. Feast 0.10 ships with native support for GCP, with more providers on the way. Platform teams running Feast at scale get the best of both worlds: a feature store that is able to scale up to production workloads by leveraging serverless technologies, with the flexibility to deploy the complete system to Kubernetes if needed. 

## The new experience

Machine learning teams today are increasingly being tasked with building models that serve predictions online. These teams are also sitting on a wealth of feature data in warehouses like BigQuery, Snowflake, and Redshift. It's natural to use these features for model training, but hard to serve these features online at low latency.

## 1. Create a feature repository

Installing Feast is now as simple as:
```bash
pip install feast
```

We'll scaffold a feature repository based on a GCP template:
```bash
feast init driver_features -t gcp
```

A feature repository consists of a *feature_store.yaml* and a collection of feature definitions.
```
driver_features/
├── feature_store.yaml
└── driver_features.py
```

The *feature_store.yaml* file contains infrastructural configuration necessary to set up a feature store. The *project* field is used to uniquely identify a feature store, the *registry* is a source of truth for feature definitions, and the *provider* specifies the environment in which our feature store will run.

feature_store.yaml:
```yaml
project: driver_features
registry: gs://driver-fs/
provider: gcp
```

The feature repository also contains Python based feature definitions, like *driver_features.py*. This file contains a single entity and a single feature view. Together they describe a collection of features in BigQuery that can be used for model training or serving.

## 2. Set up a feature store

Next we run *apply* to set up our feature store on GCP.
```bash
feast apply
```

Running `feast apply` will register our feature definitions with the GCS feature registry and prepare our infrastructure for writing and reading features. Apply can be run idempotently, and is meant to be executed from CI when feature definitions change.

At this point we haven't moved any data. We've only stored our feature definition metadata in the object store registry (GCS), and Feast has configured our infrastructure (Firestore in this case).

## 3. Build a training dataset

Feast is able to build training datasets from our existing feature data, including data at rest in our upstream tables in BigQuery. Now that we've registered our feature definitions with Feast, we are able to build a training dataset.

From our training pipeline:
```python
# Connect to the feature registry
fs = FeatureStore(
    RepoConfig(
        registry="gs://driver-fs/",
        project="driver_features"
    )
)

# Load our driver events table. This dataframe will be enriched with features from BigQuery
driver_events = pd.read_csv("driver_events.csv")

# Build a training dataset from features in BigQuery
training_df = fs.get_historical_features(
    feature_refs=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate"
    ],
    entity_df=driver_events
).to_df()

# Train a model, and ship it into production
model = ml.fit(training_df)
```

The code snippet above will join the user-provided driver_events dataframe to our driver_stats BigQuery table in a point-in-time correct way. Feast is able to use the temporal properties (event timestamps) of feature tables to reconstruct a view of features at a specific point in time, from any number of feature tables or views.

## 4. Load features into the online store

At this point we have trained our model and we are ready to serve it. However, our online feature store contains no data. In order to load features into the feature store we run *materialize-incremental* from the command line.
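For example (the end timestamp, up to which features are loaded, is illustrative):

```bash
feast materialize-incremental 2021-04-15T00:00:00
```
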
Feast provides materialization commands that load features from an offline store into an online store. The default GCP provider exports features from BigQuery and writes them directly into Firestore using an in-memory process. Teams running at scale may want to leverage cloud-based ingestion by using a different provider configuration.

## 5. Read features at low latency

Now that our online store has been populated with the latest feature data, it's possible for our ML model services to read online features for prediction.

From our model serving service:
```python
# Connect to the feature store
fs = feast.FeatureStore(
    RepoConfig(registry="gs://driver-fs/", project="driver_features")
)

# Query Firestore for online feature values
online_features = fs.get_online_features(
    feature_refs=[
        "driver_hourly_stats:conv_rate",
        "driver_hourly_stats:acc_rate"
    ],
    entity_rows=[{"driver_id": 1001}, {"driver_id": 1002}],
).to_dict()

# Make a prediction
model.predict(online_features)
```

## 6. That's it

At this point, you can schedule a Feast materialization job and set up your CI pipelines to update your infrastructure as feature definitions change.

## What's next

Our vision for Feast is to build a simple yet scalable feature store. With 0.10, we've shipped local workflows, infrastructure pluggability, and removed all infrastructural overhead. But we're still just beginning this journey, and there's still lots of work left to do.

Over the next few months we will focus on making Feast as accessible to teams as possible. This means adding support for more data sources, streams, and cloud providers, but also means working closely with our users in unlocking new operational ML use cases and integrations.

Feast is a community driven project, which means extensibility is always a key focus area for us. We want to make it super simple for you to add new data stores, compute layers, or bring Feast to a new stack. We've already seen teams begin development towards community providers for 0.10 during pre-release, and we welcome community contributions in this area.

The next few months are going to be big ones for the Feast project. Stay tuned for more news, and we'd love for you to get started using Feast 0.10 today!

## Get started

* ✨ Try out our [quickstart](https://docs.feastsite.wpenginepowered.com/quickstart) if you're new to Feast, or learn more about Feast through our [documentation](https://docs.feastsite.wpenginepowered.com).
* 👋 Join our [Slack](http://slack.feastsite.wpenginepowered.com/) and say hello! Slack is the best forum for you to get in touch with Feast maintainers, and we love hearing feedback from teams trying out Feast 0.10.
* 📢 Register for [apply()](https://www.applyconf.com/) – the ML data engineering conference, where we'll [demo Feast 0.10](https://www.applyconf.com/agenda/rethinking-feature-stores) and discuss [future developments for AWS](https://www.applyconf.com/agenda/bringing-feast-to-aws).
* 🔥 For teams that want to continue to run Feast on Kubernetes with Spark, have a look at our installation guides and Helm charts.

🛠️ Thinking about contributing to Feast? Check out our [code on GitHub](https://github.com/feast-dev/feast)!
diff --git a/docs/blog/feast-0-13-adds-on-demand-transforms-feature-servers-and-feature-views-without-entities.md b/docs/blog/feast-0-13-adds-on-demand-transforms-feature-servers-and-feature-views-without-entities.md
new file mode 100644
index 0000000000..63dd71a52b
--- /dev/null
+++ b/docs/blog/feast-0-13-adds-on-demand-transforms-feature-servers-and-feature-views-without-entities.md
@@ -0,0 +1,45 @@
# Feast 0.13 adds on-demand transforms, feature servers, and feature views without entities

*October 2, 2021* | *Danny Chiao, Tsotne Tabidze, Achal Shah, and Felix Wang*

We are delighted to announce the release of [Feast 0.13](https://github.com/feast-dev/feast/releases/tag/v0.13.0), which introduces:

* [Experimental] On demand feature views, which allow for consistently applied transformations in both training and online paths. This also introduces the concept of request data (data only available at the time of the prediction request) as a potential input into these transformations.
* [Experimental] Python feature servers, which allow you to quickly deploy a local HTTP server to serve online features. Serverless deployments and Java feature servers are coming soon!
* Feature views without entities, which allow you to specify features that should only be joined on event timestamps. You do not need lists of entities / entity values when defining and retrieving features from these feature views.

Experimental features are subject to API changes in the near future as we collect feedback. If you have thoughts, please don't hesitate to reach out to the Feast team!

### [Experimental] On demand feature views

On demand feature views allow users to use existing features and request data to transform and create new features. Users define Python transformation logic which is executed in both the historical retrieval and online retrieval paths. This unlocks many use cases, including fraud detection and recommender systems, and reduces training / serving skew by allowing for consistently applied transformations. Example features may include:

* Transactional features such as `transaction_amount_greater_than_7d_average`, where the inputs to features are part of the transaction, booking, or order event.
* Features requiring the current location or time, such as `user_account_age` or `distance_driver_customer`.
* Feature crosses where the keyspace is too large to precompute, such as `movie_category_x_movie_rating` or `lat_bucket_x_lon_bucket`.

Currently, these transformations are executed locally. Future milestones include building a feature transformation server for executing transformations at higher scale.

First, we define a request data source, which encodes the information that is only available at request time and serves as an input to the transformation (the transformation function itself is defined with the `@on_demand_feature_view` decorator; see the documentation linked below for the full example):

```python
# Define a request data source which encodes features / information only
# available at request time (e.g. part of the user initiated HTTP request)
input_request = RequestDataSource(
    name="vals_to_add",
    schema={
        "val_to_add": ValueType.INT64,
    }
)
```

See [On demand feature view](https://docs.feastsite.wpenginepowered.com/reference/on-demand-feature-view) for detailed info on how to use this functionality.

### [Experimental] Python feature server

The Python feature server provides an HTTP endpoint that serves features from the feature store. This enables users to retrieve features from Feast using any programming language that can make HTTP requests. As of now, it's only possible to run the server locally. A remote serverless feature server is currently being developed.
Additionally, a low-latency Java feature server is in development.

To try the new feature server locally, first initialize a feature repository:

```bash
$ feast init feature_repo
Creating a new Feast repository in /home/tsotne/feast/feature_repo.
```

After running `feast apply`, the local server can then be started with the `feast serve` command.

diff --git a/docs/blog/feast-0-14-adds-aws-lambda-feature-servers.md b/docs/blog/feast-0-14-adds-aws-lambda-feature-servers.md
new file mode 100644
index 0000000000..45062d2d54
--- /dev/null
+++ b/docs/blog/feast-0-14-adds-aws-lambda-feature-servers.md
@@ -0,0 +1,56 @@
# Feast 0.14 adds AWS Lambda feature servers

*October 23, 2021* | *Tsotne Tabidze, Felix Wang*

We are delighted to announce the release of [Feast 0.14](https://github.com/feast-dev/feast/releases/tag/v0.14.0), which introduces a new feature and several important improvements:

* [Experimental] AWS Lambda feature servers, which allow you to quickly deploy an HTTP server to serve online features on AWS Lambda. GCP Cloud Run and Java feature servers are coming soon!
* Bug fixes around performance. The core online serving path is now significantly faster.
* Improvements for developer experience. The integration tests are now faster, and temporary tables created during integration tests are immediately dropped after the test.

Experimental features are subject to API changes in the near future as we collect feedback. If you have thoughts, please don't hesitate to reach out to the Feast team!

### [Experimental] AWS Lambda feature servers

Prior to Feast 0.13, the only way for users to retrieve online features was to use the Python SDK. This was restrictive, so Feast 0.13 introduced local Python feature servers, allowing users to deploy a local HTTP server to serve their online features. Feast 0.14 now allows users to deploy a feature server on AWS Lambda to quickly serve features at scale. The new AWS Lambda feature servers are available for feature stores using the AWS provider.

To deploy a feature server to AWS Lambda, it must be enabled and given the appropriate permissions:

```yaml
project: dev
registry: s3://feast/registries/dev
provider: aws
online_store:
  region: us-west-2
offline_store:
  cluster_id: feast
  region: us-west-2
  user: admin
  database: feast
  s3_staging_location: s3://feast/redshift/tests/staging_location
  iam_role: arn:aws:iam::{aws_account}:role/redshift_s3_access_role
flags:
  alpha_features: true
  aws_lambda_feature_server: true
feature_server:
  enabled: True
  execution_role_name: arn:aws:iam::{aws_account}:role/lambda_execution_role
```

Calling `feast apply` will then deploy the feature server. The precise endpoint can be determined by calling `feast endpoint`, and the endpoint can then be queried over HTTP.

See [AWS Lambda feature server](https://docs.feastsite.wpenginepowered.com/reference/feature-servers/aws-lambda) for detailed info on how to use this functionality.

### Performance bug fixes and developer experience improvements

The provider for a feature store is now cached instead of being instantiated repeatedly, making the core online serving path 30% faster.

Integration tests now run significantly faster on GitHub Actions due to caching. Also, tables created during integration tests were previously not always cleaned up properly; now they are always deleted immediately after the integration tests finish.

### What's next

We are collaborating with the community on supporting streaming sources, low latency serving, a Python feature transformation server for on demand transforms, improved support for Kubernetes deployments, and more.
+ +In addition, there is active community work on building Hive, Snowflake, Azure, Astra, Presto, and Alibaba Cloud connectors. If you have thoughts on what to build next in Feast, please fill out this [form](https://docs.google.com/forms/d/e/1FAIpQLSfa1nR). + +Download Feast 0.14 today from [PyPI](https://pypi.org/project/feast/) (or pip install feast) and try it out! Let us know on our [slack channel](http://slack.feastsite.wpenginepowered.com/). diff --git a/docs/blog/feast-0-18-adds-snowflake-support-and-data-quality-monitoring.md b/docs/blog/feast-0-18-adds-snowflake-support-and-data-quality-monitoring.md new file mode 100644 index 0000000000..4b4321e325 --- /dev/null +++ b/docs/blog/feast-0-18-adds-snowflake-support-and-data-quality-monitoring.md @@ -0,0 +1,37 @@ +# Feast 0.18 adds Snowflake support and data quality monitoring + +*February 14, 2022* | *Felix Wang* + +We are delighted to announce the release of Feast [0.18](https://github.com/feast-dev/feast/releases/tag/v0.18.0), which introduces several new features and other improvements: + +* Snowflake offline store, which allows you to define and use features stored in Snowflake. +* [Experimental] Saved Datasets, which allow training datasets to be persisted in an offline store. +* [Experimental] Data quality monitoring, which allows you to validate your training data with Great Expectations. Future work will allow you to detect issues with upstream data pipelines and check for training-serving skew. +* Python feature server graduation from alpha status. +* Performance improvements to on demand feature views, protobuf serialization and deserialization, and the Python feature server. + +Experimental features are subject to API changes in the near future as we collect feedback. If you have thoughts, please don't hesitate to reach out to the Feast team through our [Slack](http://slack.feastsite.wpenginepowered.com/)! + +### Snowflake offline store + +Prior to Feast 0.18, Feast had first-class support for Google BigQuery and AWS Redshift as offline stores. In addition, there were various plugins for Snowflake, Azure, Postgres, and Hive. Feast 0.18 introduces first-class support for Snowflake as an offline store, so users can more easily leverage features defined in Snowflake. The Snowflake offline store can be used with the AWS, GCP, and Azure providers. + +### [Experimental] Saved Datasets + +Training datasets generated via `get_historical_features` can now be persisted in an offline store and reused later. This functionality will be primarily needed to generate reference datasets for validation purposes (see next section) but also could be useful in other use cases like caching results of a computationally intensive point-in-time join. + +### [Experimental] Data quality monitoring + +Feast 0.18 includes the first milestone of our data quality monitoring work. Many users have requested ways to validate their training and serving data, as well as monitor for training-serving skew. Feast 0.18 allows users to validate their training data through an integration with [Great Expectations](https://greatexpectations.io/). Users can declare one of the previously generated training datasets as a reference for this validation by persisting it as a "saved dataset" (see previous section). More details about future milestones of data quality monitoring can be found [here](https://docs.feastsite.wpenginepowered.com/v/master/reference/data-quality). 
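As a rough sketch of the saved-dataset flow described above (names and paths here are illustrative, and `entity_df` / `features` are assumed to be defined elsewhere; see the tutorial linked below for the authoritative API):

```python
from feast import FeatureStore
from feast.infra.offline_stores.file_source import SavedDatasetFileStorage

store = FeatureStore(repo_path=".")

# Build a training dataset as usual (entity_df and features defined elsewhere)
job = store.get_historical_features(entity_df=entity_df, features=features)

# Persist the retrieval result as a saved dataset so it can be reused later,
# e.g. as a reference dataset for validation
dataset = store.create_saved_dataset(
    from_=job,
    name="my_training_dataset",
    storage=SavedDatasetFileStorage(path="my_training_dataset.parquet"),
)
```
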
There's also a [tutorial on validating historical features](https://docs.feastsite.wpenginepowered.com/v/master/how-to-guides/validation/validating-historical-features) that demonstrates all new concepts in action. + +### Performance improvements + +The Feast team and community members have made several significant performance improvements. For example, the Python feature server performance was improved by switching to a more efficient serving interface. Improving our protobuf serialization and deserialization logic led to speedups in on demand feature views. The Datastore implementation was also sped up by batching operations. For more details, please see our [blog post](https://feastsite.wpenginepowered.com/blog/feast-benchmarks/) with detailed benchmarks! + +### What's next + +We are collaborating with the community on the first milestone of the `feast plan` command, future milestones of data quality monitoring, and a consolidation of our online serving logic into Golang. + +In addition, there is active community work on adding support for Snowflake as an online store, merging the Azure plugin into the main Feast repo, and more. If you have thoughts on what to build next in Feast, please fill out this [form](https://docs.google.com/forms/d/e/1FAIpQLSfa1nR). + +Download Feast 0.18 today from [PyPI](https://pypi.org/project/feast/) diff --git a/docs/blog/feast-0-20-adds-api-and-connector-improvements.md b/docs/blog/feast-0-20-adds-api-and-connector-improvements.md new file mode 100644 index 0000000000..a15482b634 --- /dev/null +++ b/docs/blog/feast-0-20-adds-api-and-connector-improvements.md @@ -0,0 +1,41 @@ +# Feast 0.20 adds API and connector improvements + +*April 21, 2022* | *Danny Chiao* + +We are delighted to announce the release of Feast 0.20, which introduces many new features and enhancements: + +* Many connector improvements and bug fixes (DynamoDB, Snowflake, Spark, Trino) + * Note: Trino has been officially bundled into Feast. You can now run this with `pip install "feast[trino]"`! +* Feast API changes +* [Experimental] Feast UI as an importable npm module +* [Experimental] Python SDK with embedded Go mode + +### Connector optimizations & bug fixes + +Key changes: + +* DynamoDB online store implementation is now much more efficient with batch feature retrieval (thanks [@TremaMiguel](https://github.com/TremaMiguel)!). As per updates on the [benchmark blog post](https://feastsite.wpenginepowered.com/blog/feast-benchmarks/), DynamoDB now is much more performant at high batch sizes for online feature retrieval! +* Snowflake offline store connector supports key pair authentication. +* Contrib plugins (documentation still pending, but see [old docs](https://github.com/Shopify/feast-trino)) + +### Feast API simplification + +In planning for upcoming functionality (data quality monitoring, batch + stream transformations), certain parts of the Feast API are changing. As part of this change, Feast 0.20 addresses API inconsistencies. No existing feature repos will be broken, and we intend to provide a migration script to help upgrade to the latest syntax. + +Key changes: + +* Naming changes (e.g. `FeatureView` changes from features -> schema) +* All Feast objects will be defined with keyword args (in practice not impacting users unless they use positional args) +* Key Feast object metadata will be consistently exposed through constructors (e.g. owner, description, name) +* [Experimental] Pushing transformed features (e.g. 
from a stream) directly to the online store: + * Favoring push sources + +### [Experimental] Feast Web UI + +See [https://github.com/feast-dev/feast/tree/master/ui](https://github.com/feast-dev/feast/tree/master/ui) to check out the new Feast Web UI! You can generate registry dumps via the Feast CLI and stand up the server at a local endpoint. You can also embed the UI as a React component and add custom tabs. + +### What's next + +In response to survey results (fill out this [form](https://forms.gle/9SpCeJnq3MayAqHe6) to give your input), the Feast community will be diving much more deeply into data quality monitoring, batch + stream transformations, and more performant / scalable materialization. + +The community is also actively involved in many efforts. Join [#feast-web-ui](https://tectonfeast.slack.com/channels/feast-web-ui) to get involved with helping on the Feast Web UI. diff --git a/docs/blog/feast-benchmarks.md b/docs/blog/feast-benchmarks.md new file mode 100644 index 0000000000..49cd0624ed --- /dev/null +++ b/docs/blog/feast-benchmarks.md @@ -0,0 +1,65 @@ +# Serving features in milliseconds with Feast feature store + +*February 1, 2022* | *Tsotne Tabidze, Oleksii Moskalenko, Danny Chiao* + +Feature stores are operational ML systems that serve data to models in production. The speed at which a feature store can serve features can have an impact on the performance of a model and user experience. In this blog post, we show how fast Feast is at serving features in production and describe considerations for deploying Feast. + +## Updates +Apr 19: Updated DynamoDB benchmarks for Feast 0.20 given batch retrieval improvements + +## Background + +One of the most common questions Feast users ask in our [community Slack](http://slack.feastsite.wpenginepowered.com/) is: how scalable / performant is Feast? (spoiler alert: Feast is *very* fast, serving features at <1.5ms @p99 when using Redis in the below benchmarks) + +In a survey conducted last year ([results](https://docs.google.com/forms/d/e/1FAIpQLScV2RX)), we saw that most users were tackling challenging problems like recommender systems (e.g. recommending items to buy) and fraud detection, and had strict latency requirements. + +Over 80% of survey respondents needed features to be read at less than 100ms (@p99). Taking into account that most users in this survey were supporting recommender systems, which often require ranking 100s-1000s of entities simultaneously, this becomes even more strict. Feature serving latency scales with batch size because of the need to query features for random entities and other sources of tail latency. + +In this blog, we present results from a benchmark suite ([RFC](https://docs.google.com/document/d/12UuvTQnTTCJ)), describe the benchmark setup, and provide recommendations for how to deploy Feast to meet different operational goals. + +## Considerations when deploying Feast + +There are a couple of decisions users need to make when deploying Feast to support online inference. There are two key decisions when it comes to performance: + +1. How to deploy a feature server +2. Choice of online store + +Each approach comes with different tradeoffs in terms of performance, scalability, flexibility, and ease of use. This post aims to help users decide between these approaches and enable users to easily set up their own benchmarks to see if Feast meets their own latency requirements. 

### How to deploy a feature server

While all users set up a Feast feature repo in the same way (using the Python SDK to define and materialize features), users retrieve features from Feast in a few different ways (see also [Running Feast in Production](https://docs.feastsite.wpenginepowered.com/how-to-guides/running-feast-in-production)):

1. Deploy a Java gRPC feature server (Beta)
2. Deploy a Python HTTP feature server
3. Deploy a serverless Python HTTP feature server on AWS Lambda
4. Use the Python client SDK to directly fetch features
5. (Advanced) Build a custom client (e.g. in Go or Java) to directly read the registry and read from an online store

The first four above come for free with Feast, while the fifth requires custom work. All options communicate with the same Feast registry component (managed by `feast apply`) to understand where features are stored.

Deploying a feature server service (compared to using a Feast client that directly communicates with online stores) can enable many improvements such as better caching (e.g. across clients), improved data access management, rate limiting, centralized monitoring, supporting client libraries across multiple languages, etc. However, this comes at the cost of increased architectural complexity. Serverless architectures are on the other end of the spectrum, enabling simple deployments at the cost of latency overhead.

### Choice of online stores

Feast is highly pluggable and extensible, and supports serving features from a range of online stores (e.g. Amazon DynamoDB, Google Cloud Datastore, Redis, PostgreSQL). Many users build their own plugins to support their specific needs / online stores. [Building a Feature Store](https://www.tecton.ai/blog/how-to-build-a-feature-store/) dives into some of the trade-offs between online stores. Easier-to-manage solutions like DynamoDB or Datastore often lose against Redis in terms of read performance and cost. Each store also has its own API idiosyncrasies that can impact performance. The Feast community is continuously optimizing store-specific performance.

## Benchmark Results

The raw data exists at [https://github.com/feast-dev/feast-benchmarks](https://github.com/feast-dev/feast-benchmarks). We choose a subset of comparisons here to answer some of the most common questions we hear from the community.

### Summary

* The Java feature server is very fast (e.g. p99 latency is ~1.3 ms for a single row fetch of 250 features)
  * Note: The Java feature server is in Beta and does not support new functionality such as the more scalable SQL registry.

The Beta Feast Java feature server with Redis provides very low latency retrieval (p99 < 1.5ms for single row retrieval of 250 features), but at increased architectural complexity, less first-class support for functionality (e.g. no SQL registry support), and more overhead in managing Redis clusters. Using a Python server with other managed online stores like DynamoDB or Datastore is easier to manage.

Note: there are managed services for Redis, like Redis Enterprise Cloud, which remove the additional complexity associated with managing Redis clusters and provide additional benefits.

### What's next

The community is always improving Feast performance, and we'll post updates to performance improvements in the future. Future improvements in the works include:

* Improved on demand transformation performance
* Improved pooling of clients (e.g.
we've seen that caching Google clients significantly improves response times and reduces memory consumption) diff --git a/docs/blog/feast-joins-the-linux-foundation-for-ai-data.md b/docs/blog/feast-joins-the-linux-foundation-for-ai-data.md new file mode 100644 index 0000000000..19af5632b8 --- /dev/null +++ b/docs/blog/feast-joins-the-linux-foundation-for-ai-data.md @@ -0,0 +1,37 @@ +# Feast Joins The Linux Foundation for AI & Data + +*January 22, 2021* | *Christina Harter* + +([Original post](https://lfaidata.foundation/blog/2020/11/10/feast-joins-lf-ai-data-as-new-incubation-project/)) + +LF AI & Data Foundation—the organization building an ecosystem to sustain open source innovation in artificial intelligence (AI), machine learning (ML), deep learning (DL), and Data open source projects—today is announcing FEAST as its latest Incubation Project. [Feast](https://feastsite.wpenginepowered.com/) (Feature Store) is an open source feature store for machine learning. + +Today, teams running operational machine learning systems are faced with many technical and organizational challenges: + +1. Models don't have a consistent view of feature data and are tightly coupled to data infrastructure. +2. Deploying new features in production is difficult. +3. Feature leakage decreases model accuracy. +4. Features aren't reused across projects. +5. Operational teams can't monitor the quality of data served to models. + +Developed collaboratively between [Gojek](https://www.gojek.com/) and [Google Cloud](https://cloud.google.com/) in 2018, Feast was open sourced in early 2019. The project sets out to address these challenges as follows: + +1. Providing a single data access layer that decouples models from the infrastructure used to generate, store, and serve feature data. +2. Decoupling the creation of features from the consumption of features through a centralized store, thereby allowing teams to ship features into production with minimal engineering support. +3. Providing point-in-time correct retrieval of feature data for both model training and online serving. +4. Encouraging reuse of features by allowing organizations to build a shared foundation of features. +5. Providing data-centric operational monitoring that ensures operational teams can run production machine learning systems confidently at scale. + +"Feast was created to address the data challenges we faced at Gojek while scaling machine learning for ride-hailing, food delivery, digital payments, fraud detection, and a myriad of other use cases" said Willem Pienaar, creator of Feast. "After open sourcing the project we've seen an explosion of demand for the software, leading to strong adoption and community growth. Entering the LF AI & Data Foundation is an important step for us toward decentralized governance and wider industry adoption and development." + +Jeremy Lewi, Kubeflow founder, said "Feast entering the LF AI & Data Foundation is both a major milestone for the project and recognition of the strides the project has made toward solving some of the hardest problems in productionizing data for machine learning. Technologies like Feast have the potential to shape the machine learning stack of the future, and with its incubation in LF AI & Data, the project now has the ideal environment to expand its community in building a best-in-class open source feature store." + +Dr. 
Ibrahim Haddad, Executive Director of LF AI & Data, said: "We are very excited to welcome FEAST to LF AI & Data and help it thrive in a vendor-neutral environment under an open governance model. With the addition of FEAST, we are increasing the number of hosted projects under the Data category and look forward to tighter collaboration between our data projects and all other projects to drive innovation in data, analytics, and AI open source technologies." + +LF AI & Data supports projects via a wide range of services, and the first step is joining as an Incubation Project. LF AI & Data will support the neutral open governance for FEAST to help foster the growth of the project. Check out the [Documentation](https://docs.feastsite.wpenginepowered.com/) to start working with FEAST today. Learn more about FEAST on their [GitHub](https://github.com/feast-dev/feast) and be sure to join the [FEAST-Announce](https://lists.lfaidata.foundation/g/feast-announce) and [FEAST-Technical-Discuss](https://lists.lfaidata.foundation/g/feast-technical-discuss) mail lists to join the community and stay connected on the latest updates. + +A warm welcome to FEAST! We look forward to the project's continued growth and success as part of the LF AI & Data Foundation. To learn about how to host an open source project with us, visit the [LF AI & Data website](https://lfaidata.foundation/proposal-and-hosting/). + +FEAST Key Links: +* [Website](https://feastsite.wpenginepowered.com/) +* [GitHub](https://github.com/feast-dev/feast) diff --git a/docs/blog/feast-release-0-12-adds-aws-redshift-and-dynamodb-stores.md b/docs/blog/feast-release-0-12-adds-aws-redshift-and-dynamodb-stores.md new file mode 100644 index 0000000000..c79008caaf --- /dev/null +++ b/docs/blog/feast-release-0-12-adds-aws-redshift-and-dynamodb-stores.md @@ -0,0 +1,46 @@ +# Feast 0.12 adds AWS Redshift and DynamoDB stores + +*August 11, 2021* | *Jules S. Damji, Tsotne Tabidze, and Achal Shah* + +We are delighted to announce [Feast 0.12](https://github.com/feast-dev/feast/blob/master/CHANGELOG.md) is released! With this release, Feast users can take advantage of AWS technologies such as Redshift and DynamoDB as feature store backends to power their machine learning models. We want to share three key additions that extend Feast's ecosystem and facilitate a convenient way to group features via a Feature Service for serving: + +1. Adding [AWS Redshift](https://aws.amazon.com/redshift/), a cloud data warehouse, as an offline store, which supports features serving for training and batch inference at high throughput + +Let's briefly take a peek at each and how easily you can use them through simple declarative APIs and configuration changes. + +### AWS Redshift as a feature store data source and an offline store + +Redshift data source allows you to fetch historical feature values from Redshift for building training datasets and materializing features into an online store (see below how to materialize). A data source is defined as part of the [Feast Declarative API](https://rtd.feastsite.wpenginepowered.com/en/latest/) in the feature repo directory's Python files. For example, `aws_datasource.py` defines a table from which we want to fetch features. 
+
+```python
+from feast import RedshiftSource
+
+my_redshift_source = RedshiftSource(table="redshift_driver_table")
+```
+
+### AWS DynamoDB as an online store
+
+To allow teams to scale up and support high volumes of online transaction requests for machine learning (ML) predictions, Feast now supports DynamoDB as a scalable online store to serve fresh features to your model in production in the AWS cloud. To enable DynamoDB as your online store, just change `feature_store.yaml`:
+
+```yaml
+project: fraud_detection
+registry: data/registry.db
+provider: aws
+online_store:
+  type: dynamodb
+  region: us-west-2
+```
+
+To materialize your features into your DynamoDB online store, simply issue the command:
+
+```bash
+$ feast materialize
+```
+
+Use a Feature Service when you want to logically group features from multiple Feature Views. This way, when requested from Feast via `feature_store.get_historical_features(...)` or `feature_store.get_online_features(...)`, all of the grouped features will be returned together.
+
+### What's next
+
+We are working on a Feast tutorial use case on AWS; meanwhile, you can check out other [tutorials in documentation](https://docs.feastsite.wpenginepowered.com/). For more documentation about the aforementioned features, check the following Feast links:
+
+* [Online stores](https://docs.feastsite.wpenginepowered.com/reference/online-stores/)
diff --git a/docs/blog/feast-supports-vector-database.md b/docs/blog/feast-supports-vector-database.md
new file mode 100644
index 0000000000..6463a271ae
--- /dev/null
+++ b/docs/blog/feast-supports-vector-database.md
@@ -0,0 +1,43 @@
+# Feast Launches Support for Vector Databases 🚀
+
+*July 25, 2024* | *Daniel Dowler, Francisco Javier Arceo*
+
+## Feast and Vector Databases
+
+With the rise of generative AI applications, the need to serve vectors has grown quickly. We are pleased to announce that Feast now supports (as an experimental feature in Alpha) embedding vector features for popular GenAI use-cases such as RAG (retrieval augmented generation).
+
+An important consideration is that GenAI applications using embedding vectors stand to benefit from a formal feature framework, just as traditional ML applications do. We are excited about adding support for embedding vector features because of the opportunity to improve GenAI backend operations. The integration of embedding vectors as features into Feast allows GenAI developers to take advantage of MLOps best practices, lowering development time, improving quality of work, and setting the stage for [Retrieval Augmented Fine Tuning](https://techcommunity.microsoft.com/t5/ai-ai-platform-blog/retrieval-augmented-fine-tuning-raft-with-azure-ai/ba-p/3979114).
+
+## Setting Up a Document Embedding Feature View
+
+The [feast-workshop repo example](https://github.com/feast-dev/feast-workshop/tree/main) shows how Feast users can define feature views with vector database sources. They can easily convert text queries to embedding vectors, which are then matched against a vector database to retrieve the closest vector records. All of this works seamlessly within the Feast toolset, so that vector features become a natural addition to the Feast feature store solution.
+
+Defining a feature backed by a vector database is very similar to defining other types of features in Feast. Specifically, we can use the FeatureView class with an Array type field, roughly as follows (the names below are illustrative).
+
+```python
+from datetime import timedelta
+from feast import Entity, FeatureView, FileSource
+from feast.field import Field
+from feast.types import Array, Float32
+
+item = Entity(name="item_id", join_keys=["item_id"])
+
+# Feature view with an Array(Float32) field holding document embeddings
+document_embeddings = FeatureView(
+    name="document_embeddings",
+    entities=[item],
+    schema=[Field(name="embedding", dtype=Array(Float32))],
+    source=FileSource(
+        path="data/document_embeddings.parquet",
+        timestamp_field="event_timestamp",
+    ),
+    ttl=timedelta(days=1),
+)
+```
+
+## Supported Vector Databases
+
+The Feast development team has conducted preliminary testing with the following vector stores:
+
+* SQLite
+* Postgres with the PGVector extension
+* Elasticsearch
+
+There are many more vector store solutions available, and we are excited about discovering how Feast may work with them to support vector feature use-cases. We welcome community contributions in this area–if you have any thoughts, feel free to join the conversation on GitHub.
+
+## Final Thoughts
+
+Feast brings formal feature operations support to AI/ML teams, enabling them to produce models faster and at higher levels of quality. The need for feature store support naturally extends to vector embeddings as features from vector databases (i.e., online stores). Vector storage and retrieval is an active space with lots of development and solutions. We are excited by where the space is moving, and look forward to Feast's role in operationalizing embedding vectors as first-class features in the MLOps ecosystem.
+
+If you are new to feature stores and MLOps, this is a great time to give Feast a try. Check out [Feast documentation](https://feast.dev/) and the [Feast GitHub](https://github.com/feast-dev/feast) page for more on getting started. Big thanks to [Hao Xu](https://www.linkedin.com/in/hao-xu-a04436103/) and the community for their contributions to this effort.
diff --git a/docs/blog/go-feature-server-benchmarks.md b/docs/blog/go-feature-server-benchmarks.md
new file mode 100644
index 0000000000..1bd2539cab
--- /dev/null
+++ b/docs/blog/go-feature-server-benchmarks.md
@@ -0,0 +1,55 @@
+# Go feature server benchmarks
+
+*July 19, 2022* | *Felix Wang*
+
+## Background
+
+The Feast team published a [blog post](https://feastsite.wpenginepowered.com/blog/feast-benchmarks/) several months ago with latency benchmarks for all of our online feature retrieval options. Since then, we have built a Go feature server. It is currently in alpha mode, and only supports Redis as an online store. The docs are [here](https://docs.feastsite.wpenginepowered.com/reference/feature-servers/go-feature-server/). We recommend that teams requiring extremely low-latency feature serving try the Go feature server. To test it, we ran our benchmarks against it; the results are presented below.
+
+## Benchmark Setup
+
+See [https://github.com/feast-dev/feast-benchmarks](https://github.com/feast-dev/feast-benchmarks) for the exact benchmark code. The feature servers were deployed in Docker on AWS EC2 instances (c5.4xlarge, 16vCPU, 64GiB memory).
+
+## Data and query patterns
+
+Feast's feature retrieval primarily involves retrieving the latest values of a given feature for specified entities. In this benchmark, the online stores contain:
+
+* 25 feature views (with 10 features per feature view) for a total of 250 features
+* 1M entity rows
+
+As described in [RFC-031](https://docs.google.com/document/d/12UuvTQnTTCJ), we simulate different query patterns by additionally varying the number of entity rows in a request (i.e. *batch size*), the requests per second, and the concurrency of the feature server. The goal here is to have numbers that apply to a diverse set of teams, regardless of their scale and typical query patterns.
Users are welcome to extend the benchmark suite to better test their own setup.
+
+## Online store setup
+
+These benchmarks only used Redis as an online store. We used a single Redis server, run locally with Docker Compose on an EC2 instance. This should closely approximate usage of a separate Redis server in AWS. Typical network latency within the same availability zone in AWS is [< 1-2 ms](https://aws.amazon.com/blogs/architecture/improving-performance-and-reducing-cost-using-availability-zone-affinity/). In these benchmarks, we did not hit limits that required use of a Redis cluster. With higher batch sizes, the benchmark suite would likely only work with Redis clusters. Redis clusters should improve Feast's performance.
+
+## Benchmark Results
+
+### Summary
+
+* The Go feature server is very fast (e.g. p99 latency is ~3.9 ms for a single row fetch of 250 features)
+* For the same number of features and batch size, the Go feature server is about 3-5x faster than the Python feature server
+  * Despite this, there are still compelling reasons to use Python, depending on your situation (e.g. simplicity of deployment)
+* Feature server latency…
+  * scales linearly (moderate slope) with batch size
+  * scales linearly (low slope) with number of features
+  * does not substantially change as requests per second increase
+
+### Latency when varying by batch size
+
+For this comparison, we check retrieval of 50 features across 5 feature views. At p99, we see that Go significantly outperforms Python, by ~3-5x. It also scales much better with batch size.
+
+p99 retrieval times (ms), varying by batch size (50 features)
+
+| Batch size | 1 | 10 | 20 | 30 | 40 | 50 | 60 | 70 | 80 | 90 | 100 |
+|------------|---|----|----|----|----|----|----|----|----|----|-----|
+| Python | 7.23 | 15.14 | 23.96 | 32.80 | 41.44 | 50.43 | 59.88 | 94.57 | 103.28 | 111.93 | 124.87 |
+| Go | 4.32 | 3.88 | 6.09 | 8.16 | 10.13 | 12.32 | 14.3 | 16.28 | 18.53 | 20.27 | 22.18 |
+
+### Latency when varying by number of requested features
+
+The Go feature server scales a bit better than the Python feature server in terms of supporting a large number of features:
+
+p99 retrieval times (ms), varying by number of requested features (batch size = 1)
+
+| Num features | 50 | 100 | 150 | 200 | 250 |
+|--------------|----|-----|-----|-----|-----|
+| Python | 8.42 | 10.28 | 13.36 | 16.69 | 45.41 |
+| Go | 1.78 | 2.43 | 2.98 | 3.33 | 3.92 |
diff --git a/docs/blog/how-danny-chiao-is-keeping-feast-simple.md b/docs/blog/how-danny-chiao-is-keeping-feast-simple.md
new file mode 100644
index 0000000000..24955b5aa8
--- /dev/null
+++ b/docs/blog/how-danny-chiao-is-keeping-feast-simple.md
@@ -0,0 +1,31 @@
+# How Danny Chiao is Keeping Feast Simple
+
+*March 2, 2022* | *Claire Besset*
+
+Tecton Engineer Danny Chiao recently appeared on *The Feast Podcast* to have a conversation with host Demetrios Brinkmann, head of the [MLOps Community](https://mlops.community). Demetrios and Danny spent an hour together discussing why Danny left Google to work on Feast, what it's like to be a leader in an open-source community, and what the future holds for Feast and MLOps. You can read about the highlights from their conversation below, or listen to the full episode [here](https://anchor.fm/featurestore).
+
+## From Google to Feast
+
+Prior to joining Tecton, Danny spent 7.5 years at Google, working on everything from Google+ to Android to Google Workspace. As a machine learning engineer, he worked with stakeholders from both product and research teams.
Bridging gaps between these teams was a two-way challenge: product teams needed help applying learnings from research teams, and research teams needed to be convinced to take on projects from the product space.
+
+In addition, it was difficult to share data from enterprise Google products with research teams due to security and privacy mandates. Danny's experience working on multiple ML products and interfacing between diverse stakeholder groups would later prove to be highly valuable in his role in the Feast open source community.
+
+What prompted Danny to leave Google and join Tecton? He noticed how the ML landscape outside of Google was starting to look very different from how it did internally. While Google was still using ETL jobs to read data from data lakes or databases and perform massive transformations, other companies were taking advantage of new data warehouse technologies: "I was hearing that the ecosystem for iterating, developing, and shipping models was oddly enough more mature outside of Google…Internally, a lot of these massive systems are dependent on the infrastructure, so you can't iterate as quickly."
+
+Excited by the innovations in ML infrastructure that were appearing in the broader community, Danny moved to Tecton to work on [Feast](https://www.tecton.ai/blog/feast-announcement/), an open-source feature store. [Feature stores](https://www.tecton.ai/blog/what-is-a-feature-store/) act as a central hub for feature data across an ML project's lifecycle, and are responsible for transforming raw data into features, storing and managing features, and serving features for training and prediction. Feature stores are quickly becoming a critical piece of infrastructure for data science teams putting ML into production.
+
+## What it's like to work in the Feast open source community
+
+As a leader in the Feast community, Danny splits his time between engineering projects and community engagement. In working with the community, Danny is learning about the current and emerging use cases for Feast. One of the big challenges with Feast is its broad user base: "We have users coming to us like, 'Hey, I don't have that much data. I don't have super-strict latency requirements. I don't need a lot of complexity.' Then you have the Twitters of the world who are like, 'Hey, we need massive scale, massive low latency.' There's definitely that tug."
+
+There are also diverse use cases for Feast, from recommender systems, to fraud detection, to credit scoring, to biotech. The solution has been to keep Feast as simple and streamlined as possible. It should be flexible and extensible enough to meet the needs of its broad community, but it also aims to be accessible for small companies just beginning machine learning operations. As Danny says, "You can't get all these new users to come in and enjoy value if it's going to take a really, really long time to stand something up."
+
+This was the vision behind the release of Feast 0.10, which is Python-centric and can run on a developer's local machine. Overall, Danny holds a very positive outlook on the future of collaboration within Feast, noting how the diversity of the community can be an asset: "If you can motivate the right people and drive people towards the same vision, then you can do things way faster than if you were just a small team executing on it."
+
+## The future for Feast
+
+What's on the docket for Feast development this year?
They're working with companies like Twitter and Redis to get benchmarks on how performant Feast is and harden the serving layer. Danny's excited to work on data quality monitoring and make that practice more standardized in the community. He's also looking forward to the launch of the Feast Web UI, because users have been asking for easier ways to discover and share features and data pipelines. + +True to the vision of keeping Feast simple, the team is focused on targeting new users in the ML space and getting them from zero-to-one. This is the plan for a world where machine learning is becoming even more ubiquitous. "It's going to become something that is just expected of companies," Demetrios said. "Right now, it doesn't feel like we've even gotten at 2% of what is potentially possible if every single business is going to be using machine learning." Fortunately, feature stores are a technology that can dramatically shorten the time it takes a new company to begin realizing value from machine learning. + +From meeting the machine learning needs of a broad user base to helping new teams get started with ML, there's a lot of exciting work to be done at Feast! You can learn more about the Feast project on our [website](https://www.tecton.ai/feast/), or read updates in Danny's community newsletter on the [Feast google group](https://groups.google.com/g/feast-dev/). diff --git a/docs/blog/kubeflow-and-feast-with-david-aronchick.md b/docs/blog/kubeflow-and-feast-with-david-aronchick.md new file mode 100644 index 0000000000..ca8ec91469 --- /dev/null +++ b/docs/blog/kubeflow-and-feast-with-david-aronchick.md @@ -0,0 +1,31 @@ +# Kubeflow + FEAST With David Aronchick, Co-creator of Kubeflow + +*April 29, 2022* | *demetrios* + +A recent episode of *The Feast Podcast* featured the co-creator of [Kubeflow](https://www.kubeflow.org/), David Aronchick, along with hosts Willem Pienaar and Demetrios Brinkmann. David, Willem, and Demetrios talked about the complexities of setting up machine learning (ML) infrastructure today and what's needed in the future to improve this process. You can read about the highlights from the podcast below or listen to the full episode [here](https://anchor.fm/featurestore/episodes/Kubeflo...). + +## Creation and philosophy behind Kubeflow + +[Kubeflow](https://www.kubeflow.org/) is a project that improves the deployment process of ML workflows on [Kubernetes](https://kubernetes.io/), a system for managing containers. It's an open-source platform originally based on Google's internal method to deploy [TensorFlow](https://www.tensorflow.org/) models, and is available for public use. It can deploy systems everywhere that Kubernetes is supported: e.g. on-premise installations, Google Cloud, AWS, and Azure. + +For machine learning practitioners, training is usually done in one of two ways. If the data set is small, users typically work in a Jupyter notebook, which allows them to quickly iterate on the necessary parameters without having to do much manual setup. On the other hand, if the data set is very large, distributed training is required with many physical or virtual machines. + +Originally, Kubeflow started as a way to connect the two worlds, so one could start with a Jupyter notebook and then move into distributed training with more features, pipelines, and feature stores as the data set grows. 
By itself, Kubeflow did not provide these additional capabilities, but wanted to partner with a service that did — hence, the beginning of a great collaboration with [Feast](https://www.tecton.ai/feast/). David described how Kubeflow is built on a mix of services: Kubeflow defines the pipeline language, Feast provides the feature store, [Argo](https://argoproj.github.io/workflows/) does work under the hood, [Katib](https://www.kubeflow.org/docs/components/katib/...) provides a hyperparameter sweep, and [Seldon](https://www.kubeflow.org/docs/external-add-ons/...) provides an inference endpoint. As Kubeflow becomes more mature, the goal is to restructure from a monolithic infrastructure where many services are installed at once to something cleaner and more specialized, so users only install the services they need. Currently we can see that happening with the graduation of KServe.
+
+## Improving the collaboration between data scientists and software engineers
+
+Next, David discussed how data scientists and software engineers work together to build and deploy ML systems. Data scientists fine-tune the parameters of the model while engineers work on productionizing the model — that is, making sure it runs smoothly without interruptions. Unfortunately, the production deployment process cannot be fully automated yet. One of the core problems is that the APIs for ML systems are complicated to use, which is a hindrance to data scientists.
+
+A lot of work in ML is closer to science, where hypotheses are made and tested, as opposed to software development, where there is an iterative process and new versions are always being shipped. If you start building a distributed model based on a large data set, it may be hard or impossible to work in an interactive notebook like Jupyter unless a completely new, smaller model is created.
+
+The general process for ML practitioners is a pipeline, but the individual steps are often not clearly described, so it is difficult to map each step to the correct tool for the job. A data scientist's daily work can often look like downloading a CSV, deleting a column of data, uploading it to a feature store, running a Python script, and then doing training. Willem stated the need for a better solution: "Small groups should be able to independently deploy solutions for specific use cases that solve business problems." David wants to make this pipeline easier to perform with existing tools: "While there certainly are components of that available in Kubeflow and Kubernetes and others, I'm thinking about what that next abstraction layer looks like and it should be available shortly."
+
+## What's needed to accelerate the industry
+
+The landscape of ML operations platforms is very complex. There are several infrastructure options out there: Kubernetes was chosen as the backbone of Kubeflow because it's simple to set up and tweak. Willem talked about the consolidation of ML and ML operations tools: "It's going to happen eventually because there's just too many out there, and they're not all going to make it. Right now, it's the breeding grounds, and then it's going to be survival of the fittest." We can already see this playing out with DataRobot acquiring Algorithmia, Snowflake purchasing Streamlit and, a few days ago, Databricks buying Cortex Labs.
+
+For open-source projects like Kubeflow, there should be a working group around core components that establishes standards. It isn't necessary to have one person who makes all of the decisions in this space.
If a new feature is needed, code discussions are 10% of the problem, but the majority of the work is around deciding implementation details and making sure that it works. The fastest way to get something done is just to build it yourself and try to get it merged.
+
+David mentioned that to really improve the ecosystem for ML, we "need to develop not just a standard layer for describing the entire platform, but also a system that describes many of the common objects in machine learning: a feature, a feature store, a data set, a training run, an experiment, a serving inference, and so on. It will spur innovation because it defines a set of clear contracts that users can produce and consume." Currently, this is hard to do programmatically because the variety of systems means that auxiliary tools need to be written to connect data sets.
+
+If expanding the future of ML infrastructure sounds exciting to you, there are a lot of contributions that are needed! You can learn more about [Feast](https://www.tecton.ai/feast/), the feature store connected to Kubeflow, and start using it today. Jump in our [slack](http://slack.feastsite.wpenginepowered.com/) and say hi!
diff --git a/docs/blog/machine-learning-data-stack-for-real-time-fraud-prediction-using-feast-on-gcp.md b/docs/blog/machine-learning-data-stack-for-real-time-fraud-prediction-using-feast-on-gcp.md
new file mode 100644
index 0000000000..729787dc4a
--- /dev/null
+++ b/docs/blog/machine-learning-data-stack-for-real-time-fraud-prediction-using-feast-on-gcp.md
@@ -0,0 +1,50 @@
+# Machine learning data stack for real-time fraud detection using Feast on GCP
+
+*September 8, 2021* | *Jay Parthasarthy and Jules S. Damji*
+
+A machine learning (ML) model decides whether your transaction is blocked or approved every time you purchase using your credit card. Fraud detection is a canonical use case for real-time ML. Predictions are made quickly upon each request while you wait at the point of sale for payment approval.
+
+Even though this is a common problem with ML, companies often build custom tooling to tackle these predictions. Like most ML problems, the hard part of fraud prediction is in the data. The fundamental data challenges are the following:
+
+1. Some data needed for prediction is available as part of the transaction request. This data is easy to pass to the model.
+2. Other data (for example, a user's historical purchases) provides a high signal for predictions, but it isn't available as part of the transaction request. This data takes time to look up: it's stored in a batch system like a data warehouse. This data is challenging to fetch since it requires a system to handle many queries per second (QPS).
+3. Together, they comprise ML features as signals to the model for predicting whether the requested transaction is fraudulent.
+
+[Feast](https://feastsite.wpenginepowered.com/) is an open-source feature store that helps teams use batch data for real-time ML applications. It's used as part of fraud [prediction and other high-volume transaction systems](https://www.youtube.com/watch?v=ED81DvicQuQ) to prevent fraud for billions of dollars' worth of transactions at companies like [Gojek](https://www.gojek.com/en-id/) and [Postmates](https://postmates.com/). In this blog, we discuss how we can use Feast to build a stack for fraud predictions. You can also follow along on Google Cloud Platform (GCP) by running this [Colab tutorial notebook](https://colab.research.google.com/github/feast-dev/feast-fraud-tutorial).
+
+## Generic data stack for fraud detection
+
+Here's what a generic stack for fraud prediction looks like:
+
+## 1. Generating batch features from data sources
+
+The first step in deploying an ML model is to generate features from raw data stored in an offline system, such as a data warehouse (DWH) or a modern data lake. After that, we use these features in our ML model for training and inference. But before we get into the specifics of fraud detection related to our example below, let's quickly understand some high-level concepts.
+
+Data sources: This data repository records all historical transaction data for a user, account information, and any indication of user fraud history. Usually, it's a data warehouse (DWH) with respective tables. The diagram above shows that features are generated from these data sources and put into another offline store (or the same store). Using transformational queries, like SQL, this data, joined from multiple tables, could be injected or stored as another table in a DWH—refined and computed as features.
+
+Features used: In the fraud use case, one set of the raw data is a record of historical transactions. This record includes data about the transaction:
+* Amount of transaction
+* Timestamp when the event occurred
+* User account information
+
+## 3. Materialize features to low-latency online stores
+
+We have a model that's ready for real-time inference. However, we won't be able to make predictions in real-time if we need to fetch or compute data out of the data warehouse on each request because it's slow.
+
+Feast allows you to make real-time predictions based on warehouse data by materializing it into an [online store](https://docs.feastsite.wpenginepowered.com/concepts/registry). Using the Feast CLI, you can incrementally materialize your data, covering the period from the previous materialization up to the current time:
+
+```bash
+feast materialize-incremental $(date -u +"%Y-%m-%dT%H:%M:%S")
+```
+
+With our feature values loaded into the online store, a low-latency key-value store, as shown in the diagram above, we can retrieve new data when a new transaction request arrives in our system.
+
+Note that the `feast materialize-incremental` command needs to be run regularly so that the online store can continue to contain fresh feature values. We suggest that you integrate this command into your company's scheduler (e.g., Airflow).
+
+## Conclusion
+
+In summary, we outlined a general data stack for real-time fraud prediction use cases. We implemented an end-to-end fraud prediction system using [Feast on GCP](https://github.com/feast-dev/feast-fraud-tutorial) as part of our tutorial.
+
+We'd love to hear how your organization's setup differs. This setup roughly corresponds to the most common patterns we've seen from our users, but things are usually more complicated as teams introduce feature logging, streaming features, and operational databases.
+
+You can bootstrap the simple stack illustrated in this blog by running our [tutorial notebook on GCP](https://colab.research.google.com/github/feast-dev/feast-fraud-tutorial). From there, you can integrate your prediction service into your production application and start making predictions in real-time. We can't wait to see what you build with Feast, and please share with the [Feast community](http://slack.feastsite.wpenginepowered.com/).
diff --git a/docs/blog/performance-test-for-python-based-feast-feature-server.md b/docs/blog/performance-test-for-python-based-feast-feature-server.md
new file mode 100644
index 0000000000..f9ef252e93
--- /dev/null
+++ b/docs/blog/performance-test-for-python-based-feast-feature-server.md
@@ -0,0 +1,61 @@
+# Performance Test for the Python-Based Feast Feature Server: Comparison Between DataStax Astra DB (Based on Apache Cassandra), Google Datastore & Amazon DynamoDB
+
+*April 17, 2023* | *Stefano Lottini*
+
+## Introduction
+
+Feature stores are an essential part of the modern stack around machine learning (ML); in particular, they rationalize the access patterns to the features associated with ML models for the various functions revolving around those models (from data engineers to data scientists). At its core, a feature store provides a layer atop a persistent data store (a database) that facilitates shared access to the features associated with the entities belonging to a business domain, making it easier to retrieve them consistently for both training and prediction.
+
+Out of several well-established feature stores available today, the most popular open-source solution is arguably [Feast](https://feastsite.wpenginepowered.com/). With its active base of contributors and support for a growing list of backends to choose from, ML practitioners don't have to worry about the boilerplate setup of their data system and can focus on delivering their product—all while retaining the freedom to choose the backend that best suits their needs.
+
+Last year, the Feast team published [extensive benchmarks](https://feastsite.wpenginepowered.com/blog/feast-benchmarks/) comparing the performance of the feature store when using different storage layers for retrieval of "online" features (that is, up-to-date reads to calculate inferences, as opposed to batch or historical "offline" reads). The storage backends used in the test, each powered by its own Feast plugin, were: Redis (running locally), Google Datastore, and Amazon DynamoDB—the latter in the same cloud region as the testing client. The main takeaways were:
+
+* Redis yields the lowest response times (but at a cost; see below)
+* Among the cloud DB vendors, DynamoDB is noticeably faster than Datastore
+* Latencies increase with the number of features needed and, albeit less so, with the number of rows ("entities")
+
+Moreover, Feast offers an SDK for both Java and Python. Although choosing the Java stack for the feature server results in faster responses, the vast majority of Feast users work with a Python-centered stack. So in our tests, we'll focus on Python end-to-end setups.
+
+Surveys done by Feast also showed that more than 60% of the interviewees required that P99 latency stay below 50 ms. These ultra-low-latency ML use cases often fall in the [fraud detection](https://www.tecton.ai/blog/how-to-build-a-fraud-detection-ml-system/) and [recommender system](https://www.tecton.ai/blog/guide-to-building-online-recommender-systems/) categories.
+
+## Feature stores & Cassandra / Astra DB
+
+The need for persistent data stores is ubiquitous in any ML application—and it comes in all sizes and shapes, of which the "feature store" pattern is but a certain, albeit very common, instance. As is discussed at length in the Feast blog post, many factors influence the architectural choices for ML-based systems.
Besides serving latency, there are considerations about fault tolerance and data redundancy, ease of use, pricing, ease of integration with the rest of the stack, and so forth. For example, in some cases, it may be convenient to employ an in-memory store such as Redis, trading data durability and ease of scaling for reduced response times.
+
+In this [recently published guide](https://planetcassandra.org/post/practitioners-guide-to-cassandra-for-ml/), the author highlights the fact that a feature store lies at the core of most ML-centered architectures, possibly (and, looking forward, more and more so) augmented with real-time capabilities owing to a combination of CDC (Change Data Capture), event-streaming technologies, and sometimes in-memory cache layers. The guide makes the case that Cassandra and DataStax's cloud-based DBaaS [Astra DB](https://astra.datastax.com/) (which is built on Cassandra) are great databases to build a feature store on top of, owing to the world-class fault tolerance, 100% uptime, and extremely low latencies they can offer out of the box.
+
+We then set out to extend the performance measurements to Astra DB, with the intent to provide hard data corroborating our claim that Cassandra and Astra DB are performant first-class choices for an online feature store. In other words, once the plugin made its way to Feast, we took the next logical step: running the very same testing already done for the other DBaaS choices, but this time on Astra DB. The next section reports on our findings.
+
+## Performance benchmarks for Feast on Astra DB
+
+The Feast team published a Github [repository](https://github.com/feast-dev/feast-benchmarks) with the code used for the benchmarks. We added coverage for Astra DB (plus a one-node Cassandra cluster running locally, serving the purpose of a functional test) and upgraded the Feast version used in all benchmarks to use v0.26 consistently.
+
+*Note: The original tests used v0.20 for DynamoDB, v0.17 for Datastore and v0.21 for Redis. Because we reproduced all pre-existing benchmarks, finding most values to be in acceptable agreement (see below for more remarks on this point), we are confident that upgrading the Feast version does not significantly alter the performance.*
+
+The tests have been run on comparable AWS and GCP machines (respectively c5.4xlarge and c2-standard-16 instances) running in the same region as the cloud database (thereby mimicking the desired architecture for a production system). We did not change any benchmark parameters, in order to keep the comparison with prior results meaningful. As stated earlier, we focused on the Python feature server, which has wider adoption among the Feast community and supports a broader ecosystem of plugins.
+
+Here's how we conducted the benchmarking. First, a moderate amount of synthetic "feature data" (10k entities with 250 integer features each, for a total of about 11 MB) was materialized to the online store. Then various one-minute test runs were performed, each with a certain choice of feature-retrieval parameters, all while collecting statistics (in particular, high percentiles) on the response time of these retrieval operations. The parameters that varied between runs were:
+
+* batch size (1 to 100 entities per request)
+* number of requested features (50 to 250 per request)
+
+Let's go back to the Cassandra plugin for Feast and examine some properties of how it was structured.
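+
+To make the discussion below concrete, here is a minimal, hypothetical sketch of the access pattern described in the next paragraphs: reading whole per-entity partitions concurrently with the Cassandra Python driver, then filtering down to the requested features client-side. The keyspace, table, and column names are illustrative only, not the plugin's actual schema.
+
+```python
+from cassandra.cluster import Cluster
+from cassandra.concurrent import execute_concurrent_with_args
+
+session = Cluster(["127.0.0.1"]).connect("feast_keyspace")  # illustrative contact point
+
+# One bound statement per entity: fetch the entity's whole partition
+# (all of its features), avoiding IN clauses entirely.
+stmt = session.prepare(
+    "SELECT feature_name, value FROM online_features WHERE entity_key = ?"
+)
+entity_keys = [("user_001",), ("user_002",), ("user_003",)]
+results = execute_concurrent_with_args(session, stmt, entity_keys, concurrency=100)
+
+# Client-side post-query filtering: keep only the requested features.
+requested = {"feature_1", "feature_7"}
+per_entity = [
+    [row for row in rows if row.feature_name in requested]
+    for ok, rows in results
+    if ok
+]
+```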
+
+First, one might notice that, regardless of which features are requested at runtime, the whole partition (i.e., all features for a given entity) is read. This was chosen to avoid using IN clauses when querying Cassandra; these are indeed discouraged unless the number of values is very, very small (as a rule of thumb, less than half a dozen). Moreover, since one does not know at write-time which features will be read together, there is no preferred way to arrange the clustering column(s) to have these features grouped together in the partition (as done, for example, with Facebook's ["feature re-ordering"](https://engineering.fb.com/2022/09/19/ml-applications/feature-store-announcement/) which purportedly results in a 30%-70% latency reduction). A reasonable compromise was then to always read the whole partition and apply client-side post-query filtering to avoid burdening the query coordinators with additional work—at the cost, of course, of increased network throughput.
+
+Second, when features from multiple entities are needed, the plugin makes good use of the `execute_concurrent_with_args` primitive offered by the Cassandra Python driver, thereby spawning one thread per partition and firing all requests at once (up to a maximum concurrency threshold, which can be configured). This leverages the excellent support for concurrency in the Cassandra architecture, which accounts for the observed moderate dependency of latencies on the batch size.
+
+## Conclusion
+
+We put the Cassandra plugin for Feast to the test in the same way as other DBaaS plugins were tested; that is, using the Astra DB cloud database built on Cassandra, we ran the same benchmarks that were applied to Redis, Datastore, and DynamoDB.
+
+Besides broadly confirming the previous results published by the Feast team, our main finding is that the performance with Astra DB is on par with that of AWS DynamoDB and noticeably better than that of Google Datastore.
+
+All these tests target the Python implementation. As mentioned in the Feast article, switching to a Java feature server greatly improves the performance, but requires a more convoluted setup and architecture and overall more expertise both for setup and maintenance.
+
+Other evidence points to the fact that, *if one is mainly concerned about performance*, replacing any feature store with a direct-to-DB implementation may be the best choice. In this regard, our extensive investigations clearly make the case that Cassandra is a good fit for ML applications, regardless of whether a feature store is involved or not.
+
+Some results might be made statistically stronger by more extensive tests, which could be a task for a future iteration of these performance benchmarks. It is possible that longer runs and/or much larger amounts of stored data would better highlight the underlying patterns in how the response times behave as a function of batch size and/or number of requested features.
+
+## Acknowledgements
+
+The author would like to thank Alan Ho, Scott Regan, and Jonathan Shook for a critical reading of this manuscript, and the Feast team for a pleasant and fruitful collaboration around the development (first) and the benchmarking (afterwards) of the Cassandra / Astra DB plugin for the namesake feature store.
diff --git a/docs/blog/rbac-role-based-access-controls.md b/docs/blog/rbac-role-based-access-controls.md
new file mode 100644
index 0000000000..683dd9c694
--- /dev/null
+++ b/docs/blog/rbac-role-based-access-controls.md
@@ -0,0 +1,71 @@
+# Feast Launches Role Based Access Control (RBAC)! 🚀
+
+*November 21, 2024* | *Daniele Martinoli, Francisco Javier Arceo*
+
+Feast is proud to introduce Role-Based Access Control (RBAC), a game-changing feature for secure and scalable feature store management. With RBAC, administrators can define granular access policies, ensuring each team has the appropriate permissions to access and manage only the data they need. Built on Kubernetes RBAC and OpenID Connect (OIDC), this powerful model enhances data governance, fosters collaboration, and makes Feast a trusted solution for teams handling sensitive, proprietary data.
+
+## What is the Feast Permission Model?
+
+Feast now supports Role-Based Access Control (RBAC) so you can secure and govern your data. If you ever wanted to securely partition your feature store across different teams, the new Feast permissions model is here to make that possible!
+
+This powerful feature allows administrators to configure granular authorization policies, letting them decide which users and groups can access specific resources and what operations they can perform.
+
+The default implementation is based on Role-Based Access Control (RBAC): user roles determine whether a user has permission to perform specific functions on registered resources.
+
+## Why is RBAC important for Feast?
+
+Feature stores often operate on sensitive, proprietary data, and we want to make sure teams are able to govern the access and control of that data thoughtfully, while benefiting from transparent code and an open source community like Feast.
+
+That's why we built RBAC using [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) and [OpenID Connect protocol (OIDC)](https://auth0.com/docs/authenticate/protocols/openid-connect), ensuring secure, fine-grained access control in Feast.
+
+## What are the Benefits of using Feast Permissions?
+
+Using the Feast Permissions Model offers two key benefits:
+
+1. Securely share and partition your feature store: grant each team only the minimum privileges necessary to access and manage the relevant resources.
+2. Adopt a Service-Oriented Architecture and leverage the benefits of a distributed system.
+
+## How Feast Uses RBAC
+
+### Permissions as Feast resources
+
+The RBAC configuration is defined using a new Feast object type called "Permission". Permissions are registered in the Feast registry and are defined and applied like all the other registry objects, using Python code.
+
+A permission is defined by these three components (see the sketch below):
+
+* A resource: a Feast object that we want to secure against unauthorized access. It's identified by the matching type(s), a possibly empty list of name patterns, and a dictionary of required tags.
+* An action: a logical operation performed on the secured resource, such as managing the resource state with CREATE, DESCRIBE, UPDATE, or DELETE, or accessing the resource data with READ and WRITE (differentiated by ONLINE and OFFLINE store types)
+* A policy: the rule to enforce authorization decisions based on the current user. The default implementation uses role-based policies.
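+
+Putting the three components together, a permission definition might look roughly like the following sketch, based on Feast's permissions API; the permission name and role names are illustrative.
+
+```python
+from feast import FeatureView
+from feast.permissions.action import AuthzedAction
+from feast.permissions.permission import Permission
+from feast.permissions.policy import RoleBasedPolicy
+
+# Let users holding the "data-scientist" role discover feature views
+# and read their values from the online store.
+feature_reader = Permission(
+    name="feature-reader",                             # illustrative name
+    types=[FeatureView],                               # the secured resource type(s)
+    policy=RoleBasedPolicy(roles=["data-scientist"]),  # who is allowed
+    actions=[                                          # what they may do
+        AuthzedAction.DESCRIBE,
+        AuthzedAction.READ_ONLINE,
+    ],
+)
+```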
+
+The resource types supported by the permission framework are the Feast objects that define the feature store itself, including:
+
+* Project
+* Entity
+
+Clients use the feature store transparently, with authorization headers automatically injected in every request, and service-to-service communications are permitted automatically.
+
+Currently, only the following Python servers are supported in an authorized environment:
+- Online REST feature server
+- Offline Arrow Flight feature server
+- gRPC Registry server
+
+### Configuring Feast Authorization
+
+For backward compatibility, no authorization is enforced by default. The authorization functionality must be explicitly enabled using the auth configuration section in `feature_store.yaml`. Of course, all server and client applications must have a consistent configuration.
+
+Currently, Feast supports [OIDC](https://auth0.com/docs/authenticate/protocols/openid-connect) and [Kubernetes RBAC](https://kubernetes.io/docs/reference/access-authn-authz/rbac/) authentication/authorization.
+
+* With OIDC authorization, the client uses an OIDC server to fetch a JSON Web Token (JWT), which is then included in every request. On the server side, the token is parsed to extract user roles and validate them against the configured permissions.
+* With Kubernetes authorization, the client injects its service account JWT token into the request. The server then extracts the service account name from the token and uses it to look up the associated role in the Kubernetes RBAC resources.
+
+### Inspecting and Troubleshooting the Permissions Model
+
+The `feast` CLI includes a new `permissions` command to list the registered permissions, with options to identify the matching resources for each configured permission and the existing resources that are not covered by any permission.
+
+For troubleshooting purposes, it also provides a command to list all the resources and operations allowed to any managed role.
+
+## How Can I Get Started?
+
+This new feature includes working examples for both supported authorization protocols. You can start by experimenting with these examples to see how they fit your own feature store and assess their benefits.
+
+As this is a completely new functionality, your feedback will be extremely valuable. It will help us adapt the feature to meet real-world requirements and better serve our customers.
diff --git a/docs/blog/streaming-feature-engineering-with-denormalized.md b/docs/blog/streaming-feature-engineering-with-denormalized.md
new file mode 100644
index 0000000000..ace3ec2df6
--- /dev/null
+++ b/docs/blog/streaming-feature-engineering-with-denormalized.md
@@ -0,0 +1,135 @@
+# Streaming Feature Engineering with Denormalized
+
+*December 17, 2024* | *Matt Green*
+
+Learn how to use Feast with [Denormalized](https://www.denormalized.io/)
+
+Thank you to [Matt Green](https://www.linkedin.com/in/mgreen9/) and [Francisco Javier Arceo](https://www.linkedin.com/in/franciscojavierarceo) for their contributions!
+
+## Introduction
+
+Feature stores have become a critical component of the modern AI stack, where they serve as a centralized repository for model features. Typically, they consist of both an offline store for aggregating large amounts of data while training models, and an online store, which allows for low-latency delivery of specific features when running inference.
+
+A popular open source example is [Feast](https://feast.dev/), which allows users to store features together by ingesting data from different data sources.
While Feast allows you to define features and query data stores using those definitions, it relies on external systems to calculate and update online features. This post will demonstrate how to use [Denormalized](https://www.denormalized.io/) to build real-time feature pipelines.
+
+The full working example is available at the [feast-dev/feast-denormalized-tutorial](https://github.com/feast-dev/feast-denormalized-tutorial) repo. Instructions for configuring and running the example can be found in the README file.
+
+## The Problem
+
+Fraud detection is a classic example of a model that uses real-time features. Imagine you are building a model to detect fraudulent user sessions. One feature you would be interested in is the number of login attempts made by a user and how many of those were successful. You could calculate this feature by looking back in time over a sliding interval (AKA "a sliding window"). If you notice a large number of failed login attempts over the previous 5 seconds, you might infer the account is being brute-forced and choose to invalidate the session and lock the account.
+
+To simulate this scenario, we wrote a simple script that emits fake login events to a Kafka cluster: [session_generator](https://github.com/feast-dev/feast-denormalized-tutorial).
+
+This script will emit JSON events according to the following schema:
+
+```python
+from dataclasses import dataclass
+from datetime import datetime
+
+
+@dataclass
+class AuthAttempt:  # class name inferred from the feature views below
+    timestamp: datetime
+    user_id: str
+    ip_address: str
+    success: bool
+```
+
+## Configuring the Feature Store with Feast
+
+Before we can start writing our features, we need to first configure the feature store. Feast makes this easy using a Python API. In Feast, features are referred to as Fields and are grouped into FeatureViews. FeatureViews have corresponding PushSources for ingesting data from online sources (i.e., we can push data to Feast). We also define an offline data store using the FileSource class, though we won't be using that in this example.
+
+```python
+from pathlib import Path
+
+from feast import Entity, FeatureView, Field, FileSource, PushSource
+from feast import types as feast_types
+
+# Entity keyed on the user performing the login attempts (join key assumed)
+auth_attempt = Entity(name="auth_attempt", join_keys=["user_id"])
+
+file_sources = []
+push_sources = []
+feature_views = []
+
+for i in [1, 5, 10, 15]:
+    file_source = FileSource(
+        path=str(Path(__file__).parent / f"./data/auth_attempt_{i}.parquet"),
+        timestamp_field="timestamp",
+    )
+    file_sources.append(file_source)
+
+    push_source = PushSource(
+        name=f"auth_attempt_push_{i}",
+        batch_source=file_source,
+    )
+    push_sources.append(push_source)
+
+    feature_views.append(
+        FeatureView(
+            name=f"auth_attempt_view_w{i}",
+            entities=[auth_attempt],
+            schema=[
+                Field(name="user_id", dtype=feast_types.String),
+                Field(name="timestamp", dtype=feast_types.String),
+                Field(name=f"{i}_success", dtype=feast_types.Int32),
+                Field(name=f"{i}_total", dtype=feast_types.Int32),
+                Field(name=f"{i}_ratio", dtype=feast_types.Float32),
+            ],
+            source=push_source,
+            online=True,
+        )
+    )
+```
+
+The code creates 4 different FeatureViews, each containing its own features. As discussed previously, fraud features can be calculated over a sliding interval. It can be useful to not only look at recent failed authentication attempts but also the aggregate of attempts made over longer time intervals. This could be useful when trying to detect things like credential testing, which can happen over a longer period of time.
+
+In our example, we're creating 4 different FeatureViews that will ultimately be populated by 4 different window lengths. This can help our model detect various types of attacks over different time intervals. Before we can use our features, we'll need to run `feast apply` to set up the online data store.
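+
+Once `feast apply` has run and the pipelines below start pushing values, the windowed features can be fetched at inference time. Here is a rough sketch, assuming the entity join key is `user_id` as in the event schema above:
+
+```python
+from feast import FeatureStore
+
+store = FeatureStore(repo_path=".")
+
+# Fetch the 5-second-window aggregates for one user; feature references
+# follow the "view_name:field_name" convention of the views defined above.
+features = store.get_online_features(
+    features=[
+        "auth_attempt_view_w5:5_success",
+        "auth_attempt_view_w5:5_total",
+        "auth_attempt_view_w5:5_ratio",
+    ],
+    entity_rows=[{"user_id": "user_123"}],
+).to_dict()
+print(features)
+```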
+
+## Writing the Pipelines with Denormalized
+
+Now that we have our online data store configured, we need to write our data pipelines for computing the features. Simply put, these pipelines need to:
+
+1. Read messages from Kafka
+2. Aggregate those messages over a varied timeframe
+3. Write the resulting aggregate value to the feature store
+
+Denormalized makes this really easy. First, we create our DataStream object from a `Context()`:
+
+```python
+ds = FeastDataStream(
+    Context().from_topic(
+        config.kafka_topic,
+        feature_service,
+        f"auth_attempt_push_{config.feature_prefix}",
+    )
+)
+```
+
+This will start the Denormalized Rust stream processing engine, which is powered by DataFusion, so it's ultra-fast.
+
+## Running Multiple Pipelines
+
+The `write_feast_feature()` method is a blocking call that continuously executes one pipeline to produce a set of features across a single sliding window. If we want to calculate features using different sliding window lengths, we will need to configure and start multiple pipelines. We can easily do this using the `multiprocessing` library in Python:
+
+```python
+processes = []
+
+for window_length in [1, 5, 10, 15]:
+    config = PipelineConfig(
+        window_length_ms=window_length * 1000,
+        slide_length_ms=1000,
+        feature_prefix=f"{window_length}",
+        kafka_bootstrap_servers=args.kafka_bootstrap_servers,
+        kafka_topic=args.kafka_topic,
+    )
+    process = multiprocessing.Process(
+        target=run_pipeline,
+        args=(config,),
+        name=f"PipelineProcess-{window_length}",
+        daemon=False,
+    )
+    processes.append(process)
+
+for p in processes:
+    try:
+        p.start()
+    except Exception as e:
+        logger.error(f"Failed to start process {p.name}: {e}")
+        cleanup_processes(processes)
+        return
+```
+
+For each group of features we defined earlier, we spin up a different system process with a different window length. Each process will then execute its own instance of the Denormalized stream processing engine, which has its own thread pools for effective parallelization of work.
+
+While this example demonstrates how you can easily run multiple Denormalized pipelines, in a production environment you'd probably want each pipeline running in its own container.
+
+## Final Thoughts
+
+We've demonstrated how you can easily create real-time features using Feast and Denormalized. While working with streaming data can be a challenge, modern Python libraries backed by fast native code are making it easier than ever to quickly iterate on model inputs.
+
+Denormalized is currently in the early stages of development. If you have any feedback or questions, feel free to reach out at [hello@denormalized.io](mailto:hello@denormalized.io).
diff --git a/docs/blog/the-future-of-feast.md b/docs/blog/the-future-of-feast.md
new file mode 100644
index 0000000000..f08547a1bf
--- /dev/null
+++ b/docs/blog/the-future-of-feast.md
@@ -0,0 +1,38 @@
+# The Future of Feast
+
+*February 23, 2024* | *Willem Pienaar*
+
+AI has taken center stage with the rise of large language models, but production ML systems remain the lifeblood of most AI-powered companies today. At the heart of these products are feature stores like Feast, serving real-time, batch, and streaming data points to ML models. I'd like to spend a moment taking stock of what we've accomplished over the last six years and what the growing Feast community has to look forward to.
+
+## Act 1: Gojek and Google
+
+Feast was started in 2018 as a [collaboration](https://cloud.google.com/blog/products/ai-machine-learning/introducing-feast-an-open-source-feature-store-for-machine-learning) between Gojek and our friends at Google Cloud. The primary motivation behind the project was to rein in the rampant duplication of feature engineering across the Southeast Asian decacorn's many ML teams.
+
+Almost immediately, the key challenge with feature stores became clear: can a feature store be generalized across various ML use cases?
+
+The natural way to answer that question is to battle test the software out in the open. So in late 2018, spurred on by our friends in the Kubeflow project, we open sourced Feast. A community quickly formed around the project. This group was mostly made up of software engineers at data-rich technology companies, trying to find a way to help their ML teams productionize models at a much higher pace.
+
+A community-centric approach is in the DNA of the project. All of our RFCs, discussions, designs, community calls, and code are open source. The project became a vehicle for ML platform teams globally to collaborate. Many teams saw the project as a means of stress testing their internal feature store designs, while others like Agoda, Zulily, Farfetch, and Postmates adopted the project wholesale and became core contributors.
+
+As time went by, the demand grew for the project to have neutral ownership and formal governance. This led to us [entering the project into the Linux Foundation for AI in 2020](https://lfaidata.foundation/blog/2020/11/10/feast-joins-lf-ai-data-as-new-incubation-project/).
+
+## Act 2: Rise of the Feature Store
+
+By 2020, the demand for feature stores had reached a fever pitch. If you were dealing with more than just an Excel sheet of data, you were likely planning to either build or buy a feature store. A category formed around feature stores and MLOps. Being a neutrally governed open source project brought in a raft of contributions, which helped the project generalize not just to different data platforms and vendors, but also to different use cases and deployment patterns. A few of the highlights include:
+
+* We worked closely with AI teams at [Snowflake](https://quickstarts.snowflake.com/guide/getting_started_with_feast/), [Azure](https://techcommunity.microsoft.com/t5/ai-customer-engineering-team/using-feast-feature-store-with-azure-ml/ba-p/2908404)
+
+It's also important to mention that by far the biggest contributor to Feast was [Tecton](https://www.tecton.ai/), who invested considerable resources into the project and helped create the category.
+
+Today, the project is battle hardened and stable. It's seen adoption and/or contribution from companies like Adyen, Affirm, Better, Cloudflare, Discover, Experian, Lowes, Red Hat, Robinhood, Palo Alto Networks, Porch, Salesforce, Seatgeek, Shopify, and Twitter, just to name a few.
+
+## Act 3: The Road to 1.0
+
+The rate of change in AI has accelerated, and nowhere is it moving faster than in open source. Keeping up with this rate of change for AI infra requires the best minds, so with that, we'd like to introduce a set of contributors who will be graduating to official project maintainers:
+
+* [Francisco Javier Arceo](https://www.linkedin.com/in/franciscojavierarceo/) – Engineering Manager, [Affirm](https://www.affirm.com/)
+* [Hao Xu](https://www.linkedin.com/in/hao-xu-a04436103/) – Lead Software Engineer, J.P.
Morgan
+
+Over the next few months, these maintainers will focus on bringing the project to a major 1.0 release. In our next post we will take a closer look at what the road to 1.0 looks like.
+
+If you'd like to get involved, try out the project [over at GitHub](https://github.com/feast-dev/feast) or join our [Slack](https://feastopensource.slack.com) community!
diff --git a/docs/blog/the-road-to-feast-1-0.md b/docs/blog/the-road-to-feast-1-0.md
new file mode 100644
index 0000000000..27be07aade
--- /dev/null
+++ b/docs/blog/the-road-to-feast-1-0.md
@@ -0,0 +1,35 @@
+# The road to Feast 1.0
+
+*February 28, 2024* | *Edson Tirelli*
+
+### Past Achievements and a Bright Future
+
+In the [previous blog](https://feast.dev/blog/the-future-of-feast/) we recapped Feast's journey over the last 6 years and hinted at what is coming in the future. We also announced a new group of maintainers that joined the project to help drive it to the 1.0 milestone. Today, we will drill down a little bit into the goals for the project towards that milestone.
+
+### The Goals for Feast 1.0
+
+* Tighter Integration with [**Kubeflow**](https://www.kubeflow.org/): Recognizing the growing importance of Kubernetes in the ML workflow, a primary objective is to achieve a closer integration with [Kubeflow](https://www.kubeflow.org/). This will enable smoother workflows and enhanced scalability for ML projects.
+
+* Development of Enterprise Features: With the aim of making Feast more robust for enterprise usage, we are focusing on developing features that cater to the complex needs of large-scale organizations. These include advanced security measures, scalability enhancements, and improved data management capabilities.
+
+* Graduation from [**LF AI and Data Foundation Incubation**](https://landscape.lfai.foundation/?selected=feast): Currently incubating under the [LF AI and Data Foundation](https://landscape.lfai.foundation/?selected=feast), we are setting our sights on graduating Feast to become a fully-fledged project under the foundation. This step will mark a significant milestone in our journey, recognizing the maturity and stability of Feast.
+
+* Research and Development for Novel Use Cases: Keeping pace with the rapidly evolving ML landscape (e.g., Large Language Models and Retrieval Augmented Generation), we are committed to exploring new research areas. Our aim is to adapt Feast to support novel use cases, keeping it at the forefront of technology.
+
+* Support for Latest ML Model Advancements: As ML models become more sophisticated, Feast will evolve to support these advancements. This includes accommodating new model architectures and training techniques.
+
+This new phase is not just about setting goals but laying down a concrete roadmap to achieve Feast version 1.0. This version will encapsulate all our efforts towards making Feast more integrated, enterprise-ready, and aligned with the latest ML advancements.
+
+### Why Invest in Feast?
+
+Many industry applications of machine learning require intensely sophisticated data pipelines. Over the last decade, the data infrastructure and analytics community collaborated to build powerful frameworks like dbt that enabled analytics to flourish. We believe Feast can do the same for the machine learning community–particularly those who spend most of their time on data pipelining and feature engineering.
+We believe Feast is a core foundation for the future of machine learning, and we will build it to offer a standard set of patterns and industry best practices that help ML Engineering and MLOps teams avoid common pitfalls, while (1) offering the flexibility of choosing their own infrastructure and (2) providing ML practitioners with a Python-based interface.
+
+### In Conclusion
+
+This transition marks a pivotal moment in Feast's journey. We are excited about the opportunities and challenges ahead. With the support of the ML community, the dedication of our new maintainers, and the clear vision set by our steward committee, Feast is poised to reach new heights and continue to be a central tool in the ML ecosystem.
+
+We invite everyone to join us in this exciting journey and contribute to the future of Feast. Together, let's shape the next chapter in the evolution of feature stores and machine learning.
+
+For updates and discussions, join our [Slack channel](http://feastopensource.slack.com/) and follow our [GitHub repository](https://github.com/feast-dev/feast/).
+
+*This post reflects the collective vision and aspirations of the new Feast steward committee. For more detailed discussions and contributions, please reach out to us on our [community channels](https://docs.feast.dev/community).*
diff --git a/docs/blog/what-is-a-feature-store.md b/docs/blog/what-is-a-feature-store.md
new file mode 100644
index 0000000000..720d6ab1ab
--- /dev/null
+++ b/docs/blog/what-is-a-feature-store.md
@@ -0,0 +1,85 @@
+# What is a Feature Store?
+
+*January 21, 2021* | *Willem Pienaar & Mike Del Balso*
+
+Blog co-authored with Mike Del Balso, Co-Founder and CEO of Tecton, and cross-posted [here](https://www.tecton.ai/blog/what-is-a-feature-store/)
+
+Data teams are starting to realize that operational machine learning requires solving data problems that extend far beyond the creation of data pipelines. In [Why We Need DevOps for ML Data](https://www.tecton.ai/blog/devops-ml-data/), Tecton highlighted some of the key data challenges that teams face when productionizing ML systems:
+
+* Accessing the right raw data
+* Building features from raw data
+* Combining features into training data
+* Calculating and serving features in production
+* Monitoring features in production
+
+Production data systems, whether for large-scale analytics or real-time streaming, aren't new. However, *operational machine learning* — ML-driven intelligence built into customer-facing applications — is new for most teams. The challenge of deploying machine learning to production for operational purposes (e.g. recommender systems, fraud detection, personalization, etc.) introduces new requirements for our data tools.
+
+A new kind of ML-specific data infrastructure is emerging to make that possible. Increasingly, Data Science and Data Engineering teams are turning towards feature stores to manage the data sets and data pipelines needed to productionize their ML applications. This post describes the key components of a modern feature store and how the sum of these parts acts as a force multiplier for organizations, by reducing duplication of data engineering efforts, speeding up the machine learning lifecycle, and unlocking a new kind of collaboration across data science teams.
+
+Quick refresher: in ML, a feature is data used as an input signal to a predictive model.
+For example, if a credit card company is trying to predict whether a transaction is fraudulent, a useful feature might be *whether the transaction is happening in a foreign country*, or *how the size of this transaction compares to the customer's typical transaction*. When we refer to a feature, we're usually referring to the concept of that signal (e.g. "transaction_in_foreign_country"), not a specific value of the feature (e.g. not "transaction #1364 was in a foreign country").
+
+## Enter the feature store
+
+*"The interface between models and data"*
+
+We first introduced feature stores in our blog post describing Uber's [Michelangelo](https://eng.uber.com/michelangelo-machine-learning-platform/) platform. Feature stores have since emerged as a necessary component of the operational machine learning stack.
+
+Feature stores make it easy to:
+1. Productionize new features without extensive engineering support
+2. Automate feature computation, backfills, and logging
+3. Share and reuse feature pipelines across teams
+4. Track feature versions, lineage, and metadata
+5. Achieve consistency between training and serving data
+6. Monitor the health of feature pipelines in production
+
+Feature stores aim to solve the full set of data management problems encountered when building and operating operational ML applications. A feature store is an ML-specific data system that:
+
+* Runs data pipelines that transform raw data into feature values
+* Stores and manages the feature data itself, and
+* Serves feature data consistently for training and inference purposes
+
+Feature stores bring economies of scale to ML organizations by enabling collaboration. When a feature is registered in a feature store, it becomes available for immediate reuse by other models across the organization. This reduces duplication of data engineering efforts and allows new ML projects to bootstrap with a library of curated production-ready features.
+
+## Components of a Feature Store
+
+There are 5 main components of a modern feature store: Transformation, Storage, Serving, Monitoring, and Feature Registry.
+
+In the following sections we'll give an overview of the purpose and typical capabilities of each of these components.
+
+## Transformation
+
+Models need access to fresh feature values for inference. Feature stores accomplish this by regularly recomputing features on an ongoing basis. Transformation jobs are orchestrated to ensure new data is processed and turned into fresh feature values. These jobs are executed on data processing engines (e.g. Spark or Pandas) to which the feature store is connected.
+
+Model development introduces different transformation requirements. When iterating on a model, new features are often engineered to be used in training datasets that correspond to historical events (e.g. all purchases in the past 6 months). To support these use cases, feature stores make it easy to run "backfill jobs" that generate and persist historical values of a feature for training. Some feature stores automatically backfill newly registered features for preconfigured time ranges for registered training datasets.
+
+Transformation code is reused across environments, preventing training-serving skew and freeing teams from having to rewrite code from one environment to the next. Feature stores manage all feature-related resources (compute, storage, serving) holistically across the feature lifecycle.
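+
+To make the transformation and backfill ideas above concrete, here is a minimal sketch of what defining and materializing the fraud features from the earlier refresher might look like. It uses Feast's current Python API (which postdates this post); the source path, entity, and field names are hypothetical, and a `feature_store.yaml` is assumed to exist in the repo:
+
+```python
+from datetime import datetime, timedelta
+
+from feast import Entity, FeatureStore, FeatureView, Field, FileSource
+from feast.types import Bool, Float32
+
+# Hypothetical batch source holding precomputed transaction features.
+transaction_source = FileSource(
+    path="data/transaction_features.parquet",
+    timestamp_field="event_timestamp",
+)
+
+customer = Entity(name="customer", join_keys=["customer_id"])
+
+# The feature *concepts* from the refresher, registered as a feature view.
+transaction_stats = FeatureView(
+    name="transaction_stats",
+    entities=[customer],
+    ttl=timedelta(days=1),
+    schema=[
+        Field(name="transaction_in_foreign_country", dtype=Bool),
+        Field(name="transaction_amount_vs_typical", dtype=Float32),
+    ],
+    source=transaction_source,
+)
+
+# A backfill-style job: load feature values for a historical time range
+# into the online store so they can be served for inference.
+store = FeatureStore(repo_path=".")
+store.materialize(
+    start_date=datetime(2021, 1, 1),
+    end_date=datetime(2021, 1, 21),
+)
+```
+
+Because one set of definitions drives both the backfill and ongoing serving, there is a single source of truth for how each feature is computed.
+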
+By automating the repetitive engineering tasks needed to productionize a feature, feature stores enable a simple and fast path to production. Management optimizations (e.g. retiring features that aren't being used by any models, or deduplicating feature transformations across models) can bring significant efficiencies, especially as teams grow and the complexity of managing features manually increases.
+
+## Monitoring
+
+When something goes wrong in an ML system, it's usually a data problem. Feature stores are uniquely positioned to detect and surface such issues. They can calculate metrics on the features they store and serve that describe correctness and quality. Feature stores monitor these metrics to provide a signal of the overall health of an ML application.
+
+Feature data can be validated based on user-defined schemas or other structural criteria. Data quality is tracked by monitoring for drift and training-serving skew. For example, feature data served to models is compared to the data on which the model was trained, to detect inconsistencies that could degrade model performance.
+
+When running production systems, it's also important to monitor operational metrics. Feature stores track operational metrics relating to their core functionality, for example metrics relating to feature storage (availability, capacity, utilization, staleness) or feature serving (throughput, latency, error rates). Other metrics describe the operations of important adjacent system components, such as operational metrics for external data processing engines (e.g. job success rate, throughput, and processing lag).
+
+Feature stores make these metrics available to existing monitoring infrastructure. This allows ML application health to be monitored and managed with the existing observability tools in the production stack. Because they have visibility into which features are used by which models, feature stores can automatically aggregate alerts and health metrics into views relevant to specific users, models, or consumers.
+
+It's not essential that all feature stores implement such monitoring internally, but they should at least provide the interfaces into which data quality monitoring systems can plug. Different ML use cases can have different, specialized monitoring needs, so pluggability here is important.
+
+## Registry
+
+A critical component in all feature stores is a centralized registry of standardized feature definitions and metadata. The registry acts as a single source of truth for information about a feature in an organization.
+
+The registry is a central interface for user interactions with the feature store. Teams use the registry as a common catalog to explore, develop, collaborate on, and publish new feature definitions within and across teams.
+
+The registry allows important metadata to be attached to feature definitions. This provides a route for tracking ownership, project- or domain-specific information, and a path to easily integrate with adjacent systems. This includes information about dependencies and versions, which is used for lineage tracking.
+
+To help with common debugging, compliance, and auditing workflows, the registry acts as an immutable record of what's available analytically and what's actually running in production.
+
+## Where to go to get started
+
+We see feature stores as the heart of the data flow in modern ML applications. They are quickly proving to be [critical infrastructure](https://a16z.com/2020/10/15/the-emerging-architectures-of-modern-data/) for data science teams putting ML into production.
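+
+The Serving component listed above isn't detailed in this post, so before turning to the options below, here is a hedged sketch of what retrieval typically looks like in Feast, reusing the hypothetical `transaction_stats` view from the earlier sketch. The same feature definitions back both point-in-time-correct training data and low-latency online lookups:
+
+```python
+from datetime import datetime
+
+import pandas as pd
+from feast import FeatureStore
+
+store = FeatureStore(repo_path=".")
+features = ["transaction_stats:transaction_in_foreign_country"]
+
+# Offline retrieval: join feature values onto historical events
+# (point-in-time correct) to build a training dataset.
+entity_df = pd.DataFrame(
+    {
+        "customer_id": [1001, 1002],
+        "event_timestamp": [datetime(2021, 1, 10), datetime(2021, 1, 12)],
+    }
+)
+training_df = store.get_historical_features(
+    entity_df=entity_df, features=features
+).to_df()
+
+# Online retrieval: fetch the freshest values of the same features
+# at inference time.
+online_features = store.get_online_features(
+    features=features, entity_rows=[{"customer_id": 1001}]
+).to_dict()
+```
+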
+We expect 2021 to be a year of massive feature store adoption, as machine learning becomes a key differentiator for technology companies.
+
+There are a few options for getting started with feature stores:
+
+* [Feast](https://feastsite.wpenginepowered.com/) is a great option if you already have transformation pipelines to compute your features, but need a great storage and serving layer to help you use them in production. Feast is GCP/AWS only today, but we're working hard to make Feast available as a lightweight feature store for all environments. Stay tuned.