
Architecture Overview

This guide introduces Kubeflow projects and how they fit in each stage of the AI lifecycle.

Read the introduction to learn more about Kubeflow, Kubeflow projects, and the Kubeflow AI reference platform.

Kubeflow Ecosystem

The following diagram gives an overview of the Kubeflow Ecosystem and how it relates to the wider Kubernetes and AI landscape. Kubeflow builds on Kubernetes as a system for deploying, scaling, and managing AI platforms.

Introducing the AI Lifecycle

When you develop and deploy an AI application, the AI lifecycle typically consists of several stages. Developing an AI system is an iterative process. You need to evaluate the output of various stages of the AI lifecycle, and apply changes to the model and parameters when necessary to ensure the model keeps producing the results you need.

The following diagram shows the AI lifecycle stages in sequence:

Looking at the stages in more detail:

- In the Data Preparation step you ingest raw data, perform feature engineering to extract ML features for the offline feature store, and prepare training data for model development. This step is usually associated with data processing tools such as Spark, Dask, Flink, or Ray.
- In the Model Development step you choose an ML framework, develop your model architecture, and explore existing pre-trained models, such as BERT or Llama, for fine-tuning.
- In the Model Training step you train or fine-tune your models in a large-scale compute environment. Use distributed training if a single GPU can't handle your model size. The result of model training is a trained model artifact that you can store in the Model Registry.
- In the Model Optimization step you tune your model hyperparameters and optimize your model with various AutoML algorithms, such as neural architecture search and model compression. During model optimization you can store ML metadata in the Model Registry.
- In the Model Serving step you serve your model artifact for online or batch inference. Your model may perform predictive or generative AI tasks depending on the use case. During the model serving step you may use an online feature store to extract features. You monitor the model's performance and feed the results back into earlier steps of the AI lifecycle.

AI Lifecycle for Production and Development Phases

The AI lifecycle can be conceptually split into development and production phases. The following diagram shows which stages fit into each phase:

Kubeflow Projects in the AI Lifecycle

The next diagram shows how Kubeflow projects map to each stage of the AI lifecycle:

See the following links for more information about each Kubeflow project:

- Kubeflow Spark Operator can be used for the data preparation and feature engineering step.
- Kubeflow Notebooks can be used for model development and interactive data science to experiment with your AI workflows.
- Kubeflow Trainer can be used for large-scale distributed training or LLM fine-tuning.
- Kubeflow Katib can be used for model optimization and hyperparameter tuning using various AutoML algorithms.
- Kubeflow Model Registry can be used to store ML metadata and model artifacts, and to prepare models for production serving.
- KServe can be used for online and batch inference in the model serving step.
- Feast can be used as a feature store and to manage offline and online features.
- Kubeflow Pipelines can be used to build, deploy, and manage each step in the AI lifecycle.
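For example, the model serving step with KServe is typically driven by an `InferenceService` manifest. The following is a minimal sketch in the style of the KServe scikit-learn quickstart; the service name and `storageUri` are illustrative and would point at your own model artifact in practice.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      # Illustrative location; replace with the URI of your model artifact.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying a manifest like this with `kubectl apply -f` creates an inference endpoint managed by KServe.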

AI platform teams can build on top of Kubeflow by using each project independently or deploying the entire AI reference platform to meet their specific needs.

Kubeflow Interfaces

This section introduces the interfaces that you can use to interact with Kubeflow projects.

Kubeflow Dashboard

The Kubeflow AI reference platform includes the Kubeflow Central Dashboard, which acts as a hub for your AI platform and tools by exposing the UIs of components running in the cluster.

The Kubeflow Central Dashboard looks like this:

Kubeflow APIs and SDKs

Various Kubeflow projects offer APIs and Python SDKs.

See the following sets of reference documentation:

- Pipelines reference docs for the Kubeflow Pipelines API and SDK, including the Kubeflow Pipelines domain-specific language (DSL).
- Kubeflow Python SDK to interact with Kubeflow Trainer APIs and to manage TrainJobs.
- Katib Python SDK to manage Katib hyperparameter tuning Experiments using Python APIs.
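To make the Katib SDK entry more concrete, here is a toy sketch of a Katib-style objective function. Katib Trials report metrics by printing `name=value` lines that the metrics collector parses; the function body and all names below are illustrative, not a real model.

```python
# Hypothetical sketch of a Katib-style training objective. Everything
# here is illustrative: the "model" is a toy quadratic function.
def objective(parameters):
    lr = float(parameters["lr"])
    # Toy objective: quadratic loss minimized at lr = 0.05.
    loss = (lr - 0.05) ** 2
    # Katib's metrics collector parses "name=value" lines from stdout.
    print(f"loss={loss}")


objective({"lr": "0.05"})

# With the kubeflow-katib SDK installed and a cluster available, the
# function could be submitted for tuning roughly like this (hedged;
# check the Katib Python SDK reference for exact signatures):
#
# import kubeflow.katib as katib
# client = katib.KatibClient()
# client.tune(
#     name="tune-lr",
#     objective=objective,
#     parameters={"lr": katib.search.double(min=0.01, max=0.1)},
#     objective_metric_name="loss",
#     max_trial_count=6,
# )
```

The function itself is plain Python, so you can iterate on it locally before handing it to Katib for hyperparameter search.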



