Kubernetes-native platform for federated learning without data exposure
Lukas Wuttke
Federated learning unlocks value from data that can't be centralized—training models across distributed locations, sensitive datasets, or multi-party collaborations while keeping data in place.
It's particularly powerful for vendor evaluation: benchmark external ML models against proprietary data without sharing it, or train models across geographically distributed datasets without violating data residency requirements.
The challenge is implementation. Building federated infrastructure from scratch means orchestrating distributed model training, handling edge failures, aggregating weights across heterogeneous environments, and maintaining security controls throughout. Most teams either spend months building custom solutions or leave valuable datasets unused.
tracebloc provides federated learning as a managed platform. Deploy via Helm charts, and the system handles edge orchestration, weight aggregation, fault tolerance, and compliance logging. Data scientists work through a familiar SDK to upload models, configure training plans, and launch experiments, while the platform manages the distributed coordination. External vendors submit models that execute entirely inside your infrastructure, with results appearing in configurable leaderboards.
This guide explains the architecture, supported workflows, and what engineering teams can build with the platform.
tracebloc deploys as a Kubernetes application via Helm charts and runs entirely within your infrastructure perimeter. External models execute in isolated pods, so all training and inference happen inside your environment rather than on external systems.
Data never leaves your environment because the platform is built around structural safeguards:
Every operation within the platform is logged and auditable. Events ranging from model access to system-level activity are recorded in formats aligned with GDPR, ISO 27001, and EU AI Act requirements.
Administrators can configure:
tracebloc is designed around use-case-driven benchmarking. Teams define their own metrics—accuracy, F1, latency, energy consumption, or domain-specific KPIs—and every submission is evaluated under identical conditions using the same data, hardware, and configuration.
Before execution, models are validated against enterprise policy requirements to ensure compatibility. Results appear in configurable leaderboards that make comparisons transparent, reproducible, and directly relevant to the real problem being solved.
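To make "use-case-driven metrics" concrete, here is a minimal sketch of the kind of evaluation a team might define, combining prediction quality with latency. How such a metric is registered with tracebloc's leaderboard is not shown here; the function below is illustrative only.

```python
# Illustrative only: a use-case metric combining quality and latency.
# The registration mechanism for tracebloc leaderboards is not shown here.
import time
from sklearn.metrics import f1_score

def evaluate_submission(model, X_val, y_val):
    start = time.perf_counter()
    y_pred = model.predict(X_val)
    latency_ms = (time.perf_counter() - start) / len(X_val) * 1000

    return {
        "f1_macro": f1_score(y_val, y_pred, average="macro"),
        "latency_ms_per_sample": latency_ms,
    }
```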
Under the hood, the benchmarking layer is built on a full competition management system designed for structured vendor evaluation:
For engineering teams evaluating multiple vendors against the same problem, this structure provides a repeatable and auditable comparison framework rather than ad-hoc benchmarking.
One of the first questions engineers ask when evaluating a platform is: does it support my use case? tracebloc covers a broad range of ML domains and frameworks out of the box.
The platform natively supports ten task categories across computer vision, natural language processing, tabular data, and time series:

Each task category can be implemented using one or more of the following frameworks:
Custom containers are also supported for workloads that require dependencies beyond these frameworks.
The platform validates and optimizes execution based on recognized model architectures. For object detection, both YOLO and R-CNN pipelines are supported. For tabular data, the system recognizes linear models, tree-based models, ensemble methods, SVMs, neural networks, XGBoost, CatBoost, LightGBM, naive Bayes, and clustering models. Each type is tagged on upload and used to configure the appropriate training pipeline.
External partners can train or fine-tune models directly inside the secure environment. Supported workflows include:
The platform supports PyTorch, TensorFlow, and custom containers, with workloads scheduled across CPU, GPU, or TPU compute resources through Kubernetes. For compute-intensive workloads such as large language models, the platform can leverage optimized frameworks such as DeepSpeed, which distributes training efficiently across multiple GPUs and machines. Advanced parallelization strategies, including pipeline parallelism and tensor parallelism, are supported for models too large to fit on a single machine.
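For intuition about what DeepSpeed-backed training looks like, here is a minimal sketch of a DeepSpeed-managed training step. The configuration values are placeholders, and how tracebloc wires DeepSpeed into its training pods is not shown here.

```python
# Minimal DeepSpeed sketch with placeholder config values; tracebloc's
# internal integration is not shown here.
import torch
import deepspeed

model = torch.nn.Linear(1024, 2)  # stand-in for a real architecture

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},  # shard optimizer state across GPUs
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

# deepspeed.initialize returns an engine that handles distributed data
# parallelism, mixed precision, and optimizer-state sharding.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

def train_step(batch, labels):
    outputs = engine(batch.to(engine.device))
    loss = torch.nn.functional.cross_entropy(outputs, labels.to(engine.device))
    engine.backward(loss)   # DeepSpeed scales and accumulates gradients
    engine.step()           # optimizer step plus gradient reset
    return loss.item()
```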
For organizations with geographically distributed datasets, federated learning enables distributed model training across multiple locations while keeping data local to each site.
Every experiment is governed by a training plan that gives the submitting data scientist full control over how their model is trained. The following parameters are configurable through the SDK:
For image-based tasks, augmentation parameters are configurable directly in the training plan: rotation range, width and height shift, brightness range, shear range, zoom range, channel shift, fill mode, horizontal and vertical flip, and rescale factor.
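The augmentation parameter set mirrors what Keras exposes through ImageDataGenerator, which is a reasonable mental model for what the training plan configures; the exact SDK method names are not shown here, and the values below are placeholders.

```python
# Rough Keras equivalent of the augmentation options listed above; the
# tracebloc training plan exposes the same knobs through the SDK.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=15,            # degrees
    width_shift_range=0.1,
    height_shift_range=0.1,
    brightness_range=(0.8, 1.2),
    shear_range=0.1,
    zoom_range=0.1,
    channel_shift_range=10.0,
    fill_mode="nearest",
    horizontal_flip=True,
    vertical_flip=False,
    rescale=1.0 / 255,
)
```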
For text classification tasks using HuggingFace models, parameter-efficient fine-tuning is supported through LoRA with the following configurable parameters:
LoRA configuration is validated before experiment launch by running a small training loop to confirm compatibility with the uploaded model.
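The dry run described above roughly corresponds to wrapping the model with a LoRA adapter and running a single forward and backward pass, sketched here with HuggingFace PEFT. tracebloc's actual validation code is not shown, and the model name below is a placeholder.

```python
# Sketch of a LoRA compatibility dry run using HuggingFace PEFT; tracebloc's
# own validation loop may differ. Model name is a placeholder.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "distilbert-base-uncased"
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

lora_config = LoraConfig(r=128, lora_alpha=256, lora_dropout=0.05)
model = get_peft_model(model, lora_config)  # fails if no compatible target modules

# One tiny training step confirms the adapted model trains end to end.
batch = tokenizer(["a short sample text"], return_tensors="pt")
loss = model(**batch, labels=torch.tensor([1])).loss
loss.backward()
```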
Time series forecasting tasks expose additional configuration:
For tabular datasets that allow feature modification, the SDK provides a feature interaction API. Data scientists can create derived features by specifying interaction methods between columns (e.g., ratios, products, differences), as well as include or exclude specific features from training. Available features and methods are retrieved dynamically from the dataset schema.
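The SDK's exact call signatures aside, the derived features themselves are straightforward; conceptually, the interaction methods reduce to column-wise operations like these:

```python
# What ratio / product / difference feature interactions amount to,
# illustrated with pandas; tracebloc generates these on the platform side
# from the dataset schema.
import pandas as pd

df = pd.DataFrame({"revenue": [120.0, 340.0], "cost": [80.0, 150.0]})

df["revenue_cost_ratio"] = df["revenue"] / df["cost"]
df["revenue_cost_product"] = df["revenue"] * df["cost"]
df["revenue_cost_difference"] = df["revenue"] - df["cost"]
```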
After each cycle, trained weights from all edges are collected and aggregated by a dedicated averaging service. The service supports both PyTorch and TensorFlow weight formats and produces a new global model that is distributed for the next cycle, keeping updates synchronized across all participating nodes.
The platform calculates data distribution across edges automatically. When a training plan is created, the backend determines how many samples each edge holds per class, computes the minimum viable validation split based on the smallest edge, and distributes training accordingly. Each node trains on its local batch of data before weights are aggregated. Sub-dataset selection is supported, allowing data scientists to specify how many samples per class should be used from each edge.
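Conceptually, the cycle-end aggregation amounts to federated averaging: each edge's weights are combined, typically weighted by the number of samples that edge trained on. A minimal PyTorch-style sketch follows; tracebloc's averaging service is not shown here.

```python
# Minimal federated-averaging sketch for PyTorch state_dicts, weighted by
# per-edge sample counts; tracebloc's averaging service also handles
# TensorFlow weight formats and is not shown here.
import torch

def federated_average(edge_state_dicts, edge_sample_counts):
    total = sum(edge_sample_counts)
    averaged = {}
    for key in edge_state_dicts[0].keys():
        averaged[key] = sum(
            state[key].float() * (count / total)
            for state, count in zip(edge_state_dicts, edge_sample_counts)
        )
    return averaged
```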
Each experiment progresses through a defined state machine:
Updated weights are available for download once training completes.
An important practical question for any engineer submitting a model: what does the platform expect?
Models are submitted as Python files (.py) or, in some cases, as zip archives containing model code and dependencies. The model file must be importable and expose the architecture so the platform can perform automated validation.
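The exact entry-point convention is defined by the platform, but as an illustration, an importable PyTorch submission can be as simple as a single file exposing the architecture; the class name and head below are hypothetical.

```python
# my_resnet_model.py -- illustrative submission file; the exact class or
# function name the platform expects is defined by tracebloc, not shown here.
import torch.nn as nn
from torchvision.models import resnet50

class MyModel(nn.Module):
    """ResNet-50 backbone with a custom classification head."""

    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.backbone = resnet50(weights=None)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_classes)

    def forward(self, x):
        return self.backbone(x)
```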
On upload, every model goes through an 8-step validation pipeline that checks:
If validation fails, the upload is rejected with a descriptive error message indicating which check failed and what the expected values are.
After upload, a model must be linked to a dataset. During linking, the platform verifies compatibility between the model and dataset:
If there is a mismatch, the SDK displays the expected parameters from the dataset alongside the model's parameters, so the data scientist can adjust accordingly.
Prebuilt ingestion pipelines support common dataset formats. Training and test data are stored in Kubernetes persistent volumes so datasets, logs, and artifacts persist across pod restarts.
Throughout the process, raw data remains inside the cluster. Only metadata is exposed to the interface layer.
The data ingestion layer provides a config-driven framework with ready-made templates for each supported task type. The core pattern is:
```python
from tracebloc_ingestor import Config, Database, APIClient, CSVIngestor
from tracebloc_ingestor.utils.constants import TaskCategory, Intent, DataFormat

config = Config()
database = Database(config)
api_client = APIClient(config)

ingestor = CSVIngestor(
    database=database,
    api_client=api_client,
    table_name=config.TABLE_NAME,
    data_format=DataFormat.IMAGE,
    category=TaskCategory.IMAGE_CLASSIFICATION,
    csv_options={"chunk_size": 1000, "delimiter": ","},
    label_column="label",
    intent=Intent.TRAIN,
)

with ingestor:
    failed_records = ingestor.ingest(config.LABEL_FILE, batch_size=config.BATCH_SIZE)
```
Both CSV and JSON ingestors are available. For tabular data, schemas can be defined explicitly with column types. For image data, options such as target size and allowed file extensions are configured at the ingestor level.
The ingestion pipeline includes a chain of validators that run automatically during data import:
Templates are provided for all ten task categories, each preconfigured with the relevant validators and data format options. The ingestion framework is containerized with Docker support and can be deployed as a Kubernetes job.
tracebloc is infrastructure-agnostic and can run on:
Once deployed, the platform can launch hundreds of isolated training or inference pods in parallel. Each runs in its own namespace for fault tolerance, autoscaling, and resource isolation. Compute or inference quotas can be defined per vendor to maintain control over resource usage.
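tracebloc manages namespaces and quotas itself, but the underlying primitive is a standard Kubernetes ResourceQuota. The sketch below shows that primitive via the Kubernetes Python client; the namespace name and limits are hypothetical.

```python
# Underlying Kubernetes primitive for per-vendor limits: a ResourceQuota in
# the vendor's namespace. Namespace and limits here are hypothetical;
# tracebloc applies its own quota management on top of this mechanism.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="vendor-a-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "64",
            "requests.memory": "256Gi",
            "requests.nvidia.com/gpu": "8",
            "pods": "200",
        }
    ),
)
core.create_namespaced_resource_quota(namespace="vendor-a", body=quota)
```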
tracebloc integrates into existing engineering workflows rather than replacing them. Available interfaces include:
The Python SDK provides the primary programmatic interface. The full workflow from authentication to experiment launch follows this pattern:
```python
from tracebloc_package import User

# Authenticate
user = User(environment="production")  # Prompts for email and password

# Upload a model (with optional pretrained weights)
user.uploadModel("my_resnet_model", weights=True)

# Link model to a dataset and get a training plan object
plan = user.linkModelDataset("dataset-abc-123")

# Configure training parameters
plan.experimentName("ResNet50 fine-tune on manufacturing defects")
plan.epochs(50)
plan.cycles(3)
plan.optimizer("adam")
plan.learningRate({"type": "constant", "value": 0.0001})
plan.lossFunction({"type": "standard", "value": "crossentropy"})
plan.validation_split(0.2)

# Add callbacks
plan.earlystopCallback("val_loss", patience=10)
plan.reducelrCallback("val_loss", factor=0.1, patience=5, min_delta=0.0001)

# Launch the experiment
plan.start()
```
On successful launch, the SDK returns an experiment key and a direct link to the experiment in the web interface. The training plan details are printed for confirmation.
For LLM fine-tuning with LoRA, the SDK extends the standard workflow:
```python
plan = user.linkModelDataset("text-dataset-456")

plan.enable_lora(True)
plan.set_lora_parameters(
    lora_r=128,
    lora_alpha=256,
    lora_dropout=0.05,
    q_lora=False
)

plan.start()
```
The platform validates the LoRA configuration against the uploaded model before launching the experiment.
Time series forecasting tasks have additional configuration options:
```python
plan = user.linkModelDataset("sensor-timeseries-789")

plan.sequence_length(24)   # 24 past time steps as input
plan.forecast_horizon(12)  # predict 12 steps ahead
plan.scaler("StandardScaler")

plan.start()
```
Example notebooks demonstrate the full lifecycle for each supported task type, allowing data scientists to operate within familiar tooling without managing Kubernetes or infrastructure details.
Vendor onboarding is handled through controlled access assignment and use-case permissions.
tracebloc includes built-in sustainability metrics as a first-class feature, not just an optional KPI. Every experiment automatically tracks:
These metrics are available in the web interface and through the API, enabling engineering teams to include energy consumption and carbon footprint data in their model evaluation criteria. For organizations with ESG reporting requirements, the platform provides auditable sustainability data at the experiment level.
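The conversion from measured energy to reported carbon footprint is simple to reproduce. The grid intensity figure below is an assumed placeholder; in practice it depends on the deployment region and reporting year.

```python
# Back-of-the-envelope check on energy -> CO2e reporting. The grid carbon
# intensity is a placeholder, not a value provided by the platform.
energy_kwh = 42.0                  # energy measured for an experiment
grid_intensity_kg_per_kwh = 0.35   # assumed regional grid carbon intensity

co2e_kg = energy_kwh * grid_intensity_kg_per_kwh
print(f"{co2e_kg:.1f} kg CO2e")    # 14.7 kg CO2e for this example
```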
tracebloc turns proprietary data from a locked asset into an active resource by bringing AI models to the data instead of exposing the data itself. The platform combines federated learning architecture, enterprise governance, and scalable infrastructure to enable secure external collaboration without sacrificing control.
For engineering teams, this means a practical path to evaluating and deploying external AI capabilities while maintaining full ownership of data, compliance posture, and infrastructure.
The platform supports ten task categories across vision, NLP, tabular, time series, and survival analysis workloads, with five frameworks: PyTorch, TensorFlow, scikit-learn, Lifelines, and scikit-survival. The SDK provides a clear workflow from model upload to experiment launch, with rich configuration options for hyperparameters, augmentation, LoRA fine-tuning, and feature engineering, all executed securely inside your infrastructure.