Effective AI Infrastructure or Why Feature Store Is Not Enough
Modern AI infrastructures can accelerate the ML lifecycle and create a peaceful interaction between Data Scientists and Engineers. But what are they? And how do they differ from MLOps?
In the last few years, the AI industry has exploded and advanced enormously. This growth has led to numerous new projects aiming to standardize model development. Still, model development is only half of the job, and most projects fail to reach production.
Deploying a model as a functional unit of a product (aka an ML-driven product) remains an art that requires its practitioners to carefully craft all of the pieces together, one by one, in a waterfall methodology. Modern AI infrastructures, as described below, have the potential to change that.
Although Scrum and Agile methodologies became popular in the tech industry, they cannot be applied when developing ML-driven products due to the fragile dependencies between the steps:
- Data discovery — Data teams collect information from internal and external sources.
- Data preparation — Data teams transform raw data and curate data points relevant to the problem (aka "data features").
- Model training — Data scientists experiment with the data to solve an objective.
- Model validation — Data scientists track model experiments and find the best-performing model.
- Production code — Engineers write a microservice that utilizes the model to serve predictions in production.
- Deployment & test — Production code should be designed for scale, then deployed and tested using the organization's CI/CD system.
- Continuous monitoring and learning
Although the ML-driven product development cycle (aka the ML workflow) is a very complex process, we can split it into three major stages: (1) Preparation, (2) Training, and (3) Productization.
The most challenging part of the ML lifecycle, and its main barrier, is undoubtedly the Productization stage.
Providing the model with features at time-of-prediction accuracy (aka "live features") can generate the most accurate and up-to-date predictions, which is mandatory for ML-driven products.
However, it is still very challenging due to the infamous friction between data scientists and software engineers.
Software engineers need to work closely with the data scientists who trained the model to make it production-ready. Sometimes they even reverse-engineer their peers' code or rewrite it entirely from scratch.
To build such a solution, they need to write a new microservice application that will:
- Wrap the model's weights inside inferencing code.
- Transform data from the production data sources into "live features," online, exactly as was done in the preparation process.
- Supply the live features to the model.
- Serve the prediction.
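The four responsibilities above can be sketched as a minimal, framework-free Python program. Everything here is invented for illustration: the weights, the feature names, and the normalization constants stand in for whatever a real trained model and preparation pipeline produced.

```python
import math

def load_model():
    """Step 1: wrap the model's 'weights' in inferencing code.
    The weights and bias are made-up stand-ins for a trained model."""
    weights = {"amount_zscore": 0.8, "tx_per_hour": 0.5}
    bias = -0.3

    def predict(features):
        score = bias + sum(w * features[name] for name, w in weights.items())
        return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> probability

    return predict

def to_live_features(raw_event):
    """Step 2: transform raw production data into 'live features',
    mirroring the offline preparation logic (constants are illustrative)."""
    return {
        "amount_zscore": (raw_event["amount"] - 50.0) / 25.0,
        "tx_per_hour": float(raw_event["recent_tx_count"]),
    }

def serve(raw_event, model):
    """Steps 3 and 4: supply the live features to the model and serve the prediction."""
    return {"fraud_probability": model(to_live_features(raw_event))}

model = load_model()
result = serve({"amount": 120.0, "recent_tx_count": 3}, model)
```

In a real service, `serve` would sit behind an HTTP endpoint; the point is that the transformation logic in `to_live_features` must replicate the offline preparation exactly.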
Current architectures are built as a monolithic, domain-specific service (e.g., a "fraud detection service") responsible for translating the domain-specific request (e.g., a transaction), transforming the data, predicting, monitoring, and managing the model.
These kinds of architectures require high-touch cooperation between data scientists and software engineers to build such a service.
Due to the complex workflow of deploying models, companies started looking for new approaches to standardize the development and deployment of ML-driven products and reduce the long time it takes to productionize a project.
In 2017, Uber published an article about Michelangelo — Uber's ML platform. The article described the initiatives Uber took to speed up the development of ML-driven products, including a unique data management layer — the "Feature Store."
The Feature Store is a unique storage layer that allows the reusability of "data features." It utilizes "cold" storage to serve features for training and "hot" caching storage to serve them in production.
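A toy sketch of that dual-storage idea, assuming an append-only log for the "cold" store and a latest-value key-value cache for the "hot" store. The API shape is illustrative and does not mirror any real product:

```python
class FeatureStore:
    """Toy feature store: one write fans out to both storage layers."""

    def __init__(self):
        self.offline = []   # "cold" storage: full history, for training sets
        self.online = {}    # "hot" cache: latest value per (entity, feature)

    def write(self, entity_id, feature, value, ts):
        """A single ingestion path keeps both layers consistent."""
        self.offline.append((entity_id, feature, value, ts))
        self.online[(entity_id, feature)] = value

    def get_online(self, entity_id, feature):
        """Low-latency read path used at prediction time."""
        return self.online[(entity_id, feature)]

    def get_training_rows(self, feature):
        """Batch read path used to build training sets."""
        return [(e, v, ts) for (e, f, v, ts) in self.offline if f == feature]
```

Production systems back these layers with very different technologies (e.g., a data warehouse versus an in-memory cache), but the two read paths are the essence of the pattern.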
Although a shared caching layer is not a new approach in software development, the concept of breaking apart the waterfall workflow was revolutionary. It allowed the process to be split, with data processing handled separately from model development, and eased the feature engineering process.
Feature engineering is considered the most iterative, time-consuming, and resource-intensive phase of the workflow. In fact, data teams spend about 80% of their time on it. Thus, simplifying it looks like a necessary step in the evolution of AI infrastructure.
This emerging architecture enables organizations to separate the feature and model concerns, standardize a contract between the model and the data, and reduce the friction between data scientists and software engineers. Like the "microservices" revolution, standardizing units in the workflow allows us to generalize new parts and speed up development.
A new horizon is shining.
Compared to current architectures, modern AI infrastructures will separate the operational effort of model deployment from feature deployment and partially decouple data scientists from software engineers.
Furthermore, new architectures in the ML industry are rapidly adopting the feature store concept and its separation of concerns. Still, feature stores are not enough and only solve part of the problem; thus, modern infrastructures will have to provide comprehensive solutions:
Data Science Platforms (aka MLOps platforms)
Data science platforms are comprehensive solutions responsible for developing models, tracking experiments, and managing, deploying, and monitoring models. They are responsible for the operational aspect of this process and ease the complexity of day-to-day ops.
MLOps platforms supply a generic inference server that can provide an inference layer to the solution given a set of input features. These servers might also implement a "transformer layer" that enables connecting to the feature platforms. Some products on the market already provide such a layer: KFServing, TensorFlow Serving, Seldon, SageMaker, GCP, and more.
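The "transformer layer" pattern can be sketched as a small adapter that resolves an entity id into a feature vector before delegating to the generic inference server. All names below are hypothetical and do not follow any specific product's API:

```python
class Transformer:
    """Adapter placed in front of a generic inference server: it fetches live
    features from the feature platform, then delegates prediction."""

    def __init__(self, predict_fn, feature_lookup):
        self.predict_fn = predict_fn          # the generic inference server
        self.feature_lookup = feature_lookup  # the feature platform client

    def __call__(self, entity_id):
        features = self.feature_lookup(entity_id)  # resolve id -> live features
        return self.predict_fn(features)           # delegate to the model

# Stub wiring: a fake feature platform and a trivial "model".
features_by_user = {"user-1": [0.2, 0.9]}
model = Transformer(predict_fn=lambda f: sum(f),
                    feature_lookup=features_by_user.get)
```

The value of the pattern is that the inference server stays generic: swapping the feature platform or the model changes only the two injected callables.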
Many have written about the importance of MLOps systems and their ability to reduce the dependency on Ops. Nevertheless, it is vital not to confuse them with infrastructure solutions, as they play different roles.
An excellent analogy is Kafka and Jenkins — Kafka is an infrastructure system, while Jenkins is a DevOps system.
Feature (Engineering) Platforms
Feature platforms are a missing component in the ML ecosystem. The feature platforms are responsible for transforming, serving, and monitoring the models’ features.
Due to their role as a functional part of the production systems, modern feature platforms must adhere to the production SLA and provide stability, scalability, low latency, and failover resilience.
It is important to emphasize that features are required for both training and inference, and they must be engineered in both phases. This two-legged procedure can create a skew between training and inference; it is the platform's responsibility to provide a mechanism that ensures alignment between the two.
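One common way to provide that alignment is to define each transformation exactly once and reuse the same function in both the batch (training) and online (serving) paths. A minimal sketch, with illustrative constants:

```python
def amount_zscore(amount, mean=50.0, std=25.0):
    """Single definition of the transformation, shared by both phases,
    so training and serving logic cannot drift apart. Constants are
    illustrative stand-ins for statistics computed during preparation."""
    return (amount - mean) / std

def build_training_features(historical_amounts):
    """Offline/batch path: applied over historical records."""
    return [amount_zscore(a) for a in historical_amounts]

def build_live_feature(raw_event):
    """Online path: applied to a single production event."""
    return amount_zscore(raw_event["amount"])
```

Because both paths call the same `amount_zscore`, any change to the logic propagates to training and serving together, which is precisely the skew-prevention mechanism described above.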
Unlike the MLOps platforms, the feature platforms are not responsible for the operational ecosystem but for accessing the production data flow.
Feature platforms are responsible for the following objectives:
- Accessing and storing feature values and feature sets
- Managing and monitoring feature metadata
- Enabling and managing sugared engineering processes
- Acting as an operative function and ensuring a high-scale SLA
Modern feature platforms won't provide only stores, but a standardized communication layer between the engineering effort of moving and transforming data and the actual transformation (business) logic.
Feature Platforms should include the following functions to fulfill their objective: Governance layer, Feature Store, Transformation framework, and an Operation layer.
Feature Platforms should provide unified governance for the features, including:
- Metadata and feature registry — e.g., feature name, a textual explanation of its logic, ownership, etc.
- Feature Catalog — Allowing collaboration and reusability of features across the organization.
- Monitoring — Enables tracking the feature performance and discovering drifts in the data.
- Versioning — Tracking different implementations of the data transformation.
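A minimal registry sketch covering three of the governance items above (metadata, catalog, and versioning; monitoring is omitted). The class and field names are assumptions, not any real platform's schema:

```python
class FeatureRegistry:
    """Illustrative registry: metadata, a catalog view, and version history."""

    def __init__(self):
        self.entries = {}

    def register(self, name, description, owner, version):
        """Record a feature's metadata; re-registering appends a new version."""
        entry = self.entries.setdefault(
            name, {"description": description, "owner": owner, "versions": []}
        )
        entry["versions"].append(version)

    def catalog(self):
        """Catalog view: everything registered, for discovery and reuse."""
        return sorted(self.entries)

    def latest_version(self, name):
        """Versioning: the most recent implementation of the transformation."""
        return self.entries[name]["versions"][-1]
```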
Feature Store (storage)
The feature store is responsible for serving features for both training and serving. It should also act as the single point of truth for feature values in training and serving, ensuring alignment between online and offline values.
This component is also responsible for enabling a “time-travel” functionality, which is essential for tracking different values of time-series features and synchronizing the values across other features in the data set.
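A "time-travel" lookup can be sketched as a point-in-time scan over a feature's history, returning the value the model would have seen at a given timestamp. Real stores index this; the linear scan below is only for clarity:

```python
def value_as_of(history, ts):
    """Return the latest feature value whose timestamp is <= ts, i.e. the
    value that would have been observable at prediction time `ts`.
    `history` is a list of (timestamp, value) pairs sorted by timestamp;
    returns None if the feature did not exist yet at `ts`."""
    result = None
    for t, v in history:
        if t <= ts:
            result = v
        else:
            break
    return result
```

Joining training labels against `value_as_of` at each label's timestamp is what prevents future information from leaking into the training set.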
Different architectures might be implemented to reach this goal; however, the typical architecture combines "online" and "offline" storage (hence "feature store"), splitting the load according to the latency requirement.
Transformation framework
The transformation framework should lay out tools that communicate with the feature store to process, enrich, and compute raw data into feature values and save them into feature storage.
The transformation framework should offer developers sugared interfaces that reduce the amount of engineering code required for "high-level use-cases" such as backfilling, online transformations, bulk (offline) transformations, functional features (e.g., geo-distance, derived features), etc.
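Two of the sugared use-cases named above, sketched minimally: a "functional" geo-distance feature (here computed with the haversine formula) and a naive backfill helper. A real framework would distribute the backfill over historical data rather than loop in memory:

```python
import math

def geo_distance_km(lat1, lon1, lat2, lon2):
    """Functional derived feature: great-circle (haversine) distance in km
    between two (latitude, longitude) points, e.g. card-holder home vs.
    transaction location."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def backfill(events, transform):
    """Backfilling sugar: apply a transformation over historical events so a
    newly defined feature gains a full history without bespoke pipeline code."""
    return [transform(e) for e in events]
```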
Finally, since the Feature Platform is a critical part of the production solution, the platform must comply with the product’s SLA — providing low-latency serving, scalability, high availability, etc.
Large-scale deployments might also have to handle the challenge of deploying features to different cloud environments.
The big picture
Deploying an ML-driven product that can serve online model predictions accurately as a functional part of the product is a very complex task, but it shouldn’t be that way. Modern AI infrastructure can effectively reduce friction and create a peaceful interaction between data scientists and engineers.
Like the traditional software process, we anticipate a break-up of the workflow that reduces the operational and engineering overhead of developing and deploying new services.
Due to the complexity of data-driven products, the ML realm will have to divide its monolithic system into multiple systems. Each is responsible for a different task, similar to the traditional software realm.
Feature platforms will gather raw data from the production data sources and curate transformed features that the model platforms can consume for training and serving purposes. In contrast, the MLOps platforms will help data scientists to develop and deploy ML models.
With the current solutions in the market, it’s still hard to differentiate between the different parts of the architecture. Thus, it’s essential to distinguish between MLOps and AI infrastructure due to the very different roles of each system.
Although some emerging architectures have made considerable progress, current solutions still focus on operating and serving features instead of on the challenge of creating them.
Modern feature platforms will have to focus on feature creation rather than caching storage and could be the key to easing the development and deployment of new ML-driven products and services.
What do you think? Let me know about the use-cases you are adopting infrastructure for and the challenges you face (or if you need help setting up a system such as this). You’re more than welcome to drop me a line via an email or LinkedIn.
Effective AI Infrastructure or Why Feature Store Is Not Enough was originally published in Towards Data Science on Medium.