clusterrace

Services

A broad bench. We mix and match — most engagements touch two or three of these.

Kubernetes & Infra

We design, build, and care for clusters that don't wake your team at 3am — on your cloud or on bare metal.

What we do

  • Cluster architecture (managed or self-hosted: EKS, GKE, AKS, k3s, Talos)
  • Networking (CNI, ingress, mesh) and storage that survives node death
  • Upgrade strategy, capacity planning, and cost-down passes

What success looks like

  • Predictable upgrades with no surprise downtime
  • A clear story for "what happens when X fails"
  • Infra costs that map to actual usage, not historical accident

Platform Engineering

Internal platforms that engineers actually use because they make the next thing easier, not harder.

What we do

  • Service templates (golden paths) covering deploy, telemetry, secrets, healthchecks
  • Self-serve developer portals (Backstage or roll-your-own)
  • Platform team operating model and on-call hand-off

What success looks like

  • New service from scratch to prod in under a day
  • Platform metrics show enablement, not just uptime
  • Engineers stop building one-off scripts to work around the platform

CI/CD

Pipelines that get out of the way — fast, deterministic, and boring.

What we do

  • Pipeline rewrites: parallel test sharding, smart caching, kill-the-cruft passes
  • Promotion model: PR → preview env → main → prod with the right gates
  • Test infra (containers, fixtures, ephemeral DBs) that holds up under load

What success looks like

  • PR feedback in minutes, not hours
  • No-fear deploys, multiple times a day
  • CI failures that point at real bugs, not flaky infra

DevSecOps

Security woven into the platform, not bolted onto it after audit.

What we do

  • Supply chain: SBOMs, signed images, provenance (SLSA, Sigstore)
  • Policy as code (OPA/Kyverno) at admission and CI time
  • Secrets management (Vault, Sealed Secrets, External Secrets Operator) and rotation

What success looks like

  • Audit-ready posture without a fire drill
  • Vulns fixed before they ship, not after they're reported
  • Engineers who can move fast without cutting security corners

Observability & OpenTelemetry

Signals that explain — not just alarm. We help you correlate logs, metrics, and traces so debugging takes minutes.

What we do

  • OpenTelemetry instrumentation across services and runtimes
  • Backend choice and config (Tempo/Loki/Mimir, Datadog, Honeycomb, Grafana Cloud)
  • SLOs, dashboards, and alerting that wakes the right person

What success looks like

  • Mean-time-to-understand drops sharply
  • You can answer "why was that slow" without grepping logs
  • Alerts engineers trust enough not to silence

Automation

Toil-killing scripts, controllers, and operators — wherever the repetition lives.

What we do

  • Custom Kubernetes operators and controllers (Go, Python, Java)
  • Workflow orchestration (Argo Workflows, Temporal, n8n)
  • Internal tools that turn 30-minute manual jobs into one-click ops

What success looks like

  • Recurring work disappears from your team's plate
  • Operations stop being a single-person dependency
  • Mistakes from manual repetition stop happening

Software Development

When the right answer is to ship code: backend, infra-adjacent services, agents, integrations.

What we do

  • Java, Python, Go, C++: chosen pragmatically to fit the problem
  • Service design with operability built in from day one
  • Code reviews and pairing that level your team up while we ship

What success looks like

  • Code your team can own and extend after we leave
  • Test suites that catch real bugs and run fast
  • Documentation that survives contact with reality

AI Engineering

LLM-powered systems engineered for production: evals, observability, fallbacks, and cost discipline.

What we do

  • Agentic systems and tool-using assistants (Anthropic, OpenAI, local models)
  • RAG pipelines with retrieval that actually works on your corpus
  • Eval harnesses, observability for LLM calls, cost controls

What success looks like

  • AI features that pass real user tests, not just demos
  • Latency, cost, and accuracy you can defend to a CFO
  • A clear path from prototype to production