Andrey Lesnikov — Senior DevOps Engineer & Cloud Architect
Senior DevOps Engineer and Cloud Architect at CONTACT Software GmbH building production AI infrastructure with Kubernetes/EKS, GitOps, observability, Terraform, and AI gateways. Cloud Native Rockstars 2026 Company Award, 3rd place.
Focus
- AI infrastructure and AI gateway operations
- Kubernetes/EKS, GitOps, Argo CD, Terraform, observability, and platform engineering
- Cloud-native systems for production reliability, cost visibility, and operational clarity
Technical Case Studies
- Automatic SaaS Restore System — Problem: Restores are high-pressure, manual, and easy to execute inconsistently. Constraints: Cloud state, safety gates, auditability, and rollback clarity. Architecture: Repeatable restore workflow with dry-run visibility, status checks, and operational handoff. Result: Recovery becomes a platform capability instead of an emergency script.
- Cloud-Native AI Gateway — Problem: AI usage needs routing, policy, budget awareness, and provider resilience. Constraints: Latency, observability, prompt safety, rate limits, and failover behavior. Architecture: Gateway layer for model selection, request shaping, telemetry, and controlled fallback paths. Result: AI becomes operable infrastructure, not an opaque API call.
- Kanister Backup Restore — Problem: Application-aware Kubernetes restores need more than volume snapshots and manual runbooks. Constraints: Stateful services, namespace boundaries, object storage retention, test restores, and auditable recovery steps. Architecture: Kanister blueprints coordinate backup actions, restore actions, validation hooks, and operator handoff around Kubernetes workloads. Result: Restore behavior becomes repeatable, reviewable, and easier to exercise before an incident.
- GitOps ArgoCD Flux — Problem: Teams need a clear delivery model before GitOps becomes another layer of operational confusion. Constraints: Multi-environment promotion, drift detection, rollback safety, secret handling, and developer feedback loops. Architecture: Comparison of Argo CD and Flux reconciliation patterns, sync ownership, policy boundaries, and platform team responsibilities. Result: GitOps decisions become explicit platform contracts instead of tool preference debates.
- SBOM Integration — Problem: Software supply-chain data is often generated late, stored separately, and disconnected from deployment decisions. Constraints: CI/CD speed, artifact provenance, vulnerability context, policy gates, and developer-readable remediation feedback. Architecture: SBOM generation in the pipeline, artifact attachment, vulnerability enrichment, policy evaluation, and release evidence storage. Result: Supply-chain visibility becomes part of the delivery system, not a quarterly compliance export.
- LLM Infrastructure Runtime — Problem: LLM workloads move faster than traditional platform controls and can quickly become expensive, opaque, and hard to operate. Constraints: GPU/CPU placement, model latency, token cost, prompt boundaries, provider limits, data privacy, and fallback behavior. Architecture: Runtime layer with model routing, request budgets, telemetry, policy checks, provider abstraction, and operational dashboards around inference flows. Result: LLM usage becomes a controlled platform capability with observability and operating contracts instead of isolated API calls.
- RAG Knowledge Platform — Problem: Engineering knowledge is spread across repositories, runbooks, tickets, architecture notes, and project history. Constraints: Source freshness, citation quality, chunking, access boundaries, hallucination control, and explainable answers. Architecture: Curated ingestion pipeline with markdown exports, project metadata, embedding-ready documents, source references, and fallback local answers. Result: The AI assistant can answer infrastructure questions with project context, sources, and a safer boundary around what it knows.
- EKS Platform Foundation — Problem: Kubernetes clusters become inconsistent when networking, identity, ingress, storage, and observability are assembled per project. Constraints: AWS account boundaries, workload identity, node lifecycle, ingress policy, autoscaling, logging, and upgrade safety. Architecture: EKS foundation with Terraform modules, baseline add-ons, workload identity, GitOps bootstrap, default observability, and controlled environment overlays. Result: Clusters become a repeatable platform product rather than a one-off infrastructure build.
- OpenTelemetry Observability Mesh — Problem: Metrics, logs, and traces often exist separately, making incidents slower and ownership unclear. Constraints: Cardinality, sampling, cost, multi-service correlation, dashboard sprawl, and alert fatigue. Architecture: OpenTelemetry collection layer with service conventions, trace context propagation, metric normalization, log correlation, and dashboard/runbook links. Result: Production behavior becomes easier to understand from request path to workload to infrastructure signal.
- SLO Driven Monitoring — Problem: Dashboards can look healthy while users experience latency, errors, or degraded workflows. Constraints: Service ownership, error budgets, burn-rate alerts, noisy dependencies, and product-facing reliability language. Architecture: SLO model with user-centric indicators, burn-rate alerts, Grafana-style views, incident thresholds, and runbook context near alerts. Result: Monitoring shifts from raw infrastructure charts to reliability decisions teams can act on.
- Multi Region GitOps — Problem: Multi-region systems need repeatable promotion and rollback without turning every deployment into manual coordination. Constraints: Regional overlays, failover state, secret distribution, traffic switching, drift, and environment-specific policy. Architecture: GitOps layout with regional overlays, promotion gates, sync waves, health checks, and clear ownership between platform and application teams. Result: Regional delivery becomes auditable and reversible while keeping infrastructure state understandable.
- Terraform Platform Modules — Problem: Cloud platforms drift when teams copy infrastructure snippets and adjust them under delivery pressure. Constraints: Module versioning, state boundaries, reviewable plans, environment variance, and provider upgrade safety. Architecture: Terraform module contracts for networking, EKS, IAM, storage, DNS, and platform defaults with CI-compatible plan workflows. Result: Infrastructure changes become reviewable product changes instead of undocumented console state.
- ArgoCD App of Apps — Problem: As platforms grow, application onboarding, add-ons, and environment drift become hard to reason about. Constraints: Bootstrap order, namespace ownership, secrets, cluster add-ons, team autonomy, and rollback visibility. Architecture: Argo CD app-of-apps pattern with platform add-ons, application sets, sync waves, health checks, and environment-level ownership. Result: Platform state becomes visible in Git and easier to bootstrap, audit, and recover.
- Policy as Code Guardrails — Problem: Security and platform rules are often discovered only after deployment or during reviews. Constraints: Developer experience, admission control, exception handling, auditability, and avoiding fragile gatekeeping. Architecture: Policy-as-code guardrails with OPA/Kyverno-style checks, CI feedback, admission policies, and documented exception paths. Result: Teams get fast feedback while platform standards are enforced consistently across environments.
- Secrets and Certificate Automation — Problem: Manual secret rotation and certificate handling create outage risk and hidden operational debt. Constraints: Rotation cadence, Kubernetes consumption, identity boundaries, audit trail, renewals, and emergency revocation. Architecture: Secret delivery model with external secret sources, workload identity, certificate automation, renewal monitoring, and rotation runbooks. Result: Sensitive material becomes lifecycle-managed infrastructure instead of scattered manual state.
- Incident Runbook Automation — Problem: Incidents are slower when context, dashboards, logs, and recovery steps live in different places. Constraints: On-call pressure, incomplete symptoms, permissions, dry-run safety, and post-incident learning. Architecture: Runbook-linked alerts with diagnostic commands, status checks, escalation context, safe remediation steps, and follow-up documentation hooks. Result: Incident response becomes calmer, more repeatable, and easier to improve after the event.
- Self Healing Infrastructure — Problem: Transient infrastructure failures can become user-facing incidents when recovery depends on manual detection. Constraints: False positives, blast radius, rollback safety, observability confirmation, and human override. Architecture: Failure detection with health signals, bounded remediation actions, chaos validation, alert correlation, and operator approval for risky paths. Result: Common failure modes can recover faster while preserving control over high-risk actions.
- AI DevOps Labs Platform — Problem: DevOps learning is often passive and disconnected from real infrastructure failure modes. Constraints: Safe execution, terminal UX, generated scenarios, repeatability, cost control, and guided feedback. Architecture: AI-generated lab platform with scenario generation, interactive terminal flow, containerized execution, scoring, and cloud-native learning paths. Result: Infrastructure knowledge becomes hands-on practice instead of static documentation.
- Developer Platform Interface — Problem: Developers lose time when every deployment, environment, and infrastructure request requires platform team translation. Constraints: Self-service boundaries, golden paths, ownership, auditability, and avoiding an unmaintainable portal. Architecture: Platform interface with paved workflows, templates, environment contracts, GitOps-backed changes, and visible operational status. Result: Teams can ship through clear platform paths while platform engineers keep control of the underlying system.
- Cost and Token Observability — Problem: AI and cloud costs can grow quietly when usage is disconnected from teams, services, and deployment changes. Constraints: Token attribution, cloud tags, model pricing, request volume, budget alerts, and developer-readable reports. Architecture: Cost telemetry tied to services, AI gateway requests, deployment events, dashboards, and threshold-based feedback loops. Result: Cost becomes an operational signal teams can understand before it becomes a finance surprise.
- FleetDM Endpoint Visibility — Problem: Endpoint visibility is often separate from cloud and Kubernetes operations, leaving security context incomplete. Constraints: Device inventory, query safety, rollout control, privacy, vulnerability context, and integration with existing operations. Architecture: FleetDM-style visibility layer connected to inventory, policy queries, vulnerability signals, and operational reporting. Result: Endpoint state becomes part of the broader infrastructure picture instead of a separate security island.
- Zero Trust Service Mesh — Problem: Internal traffic is often trusted by default, making lateral movement and policy gaps hard to see. Constraints: Service identity, mTLS, policy rollout, observability, latency overhead, and developer debugging. Architecture: Service mesh model with workload identity, mTLS, authorization policy, traffic telemetry, and progressive rollout controls. Result: East-west traffic becomes governed, observable, and easier to reason about during security reviews and incidents.
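As one concrete illustration of the burn-rate alerts named in the SLO Driven Monitoring case study, here is a minimal Python sketch of a two-window burn-rate check. It is not taken from the project itself: the `Window` type, the function names, and the 14.4 threshold (a commonly cited paging value for a 1-hour/5-minute window pair against a 30-day SLO) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one lookback window (hypothetical type)."""
    total: int
    errors: int


def burn_rate(window: Window, slo_target: float) -> float:
    """Return how fast the error budget is being spent (1.0 = exactly on budget)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if window.total == 0:
        return 0.0  # no traffic, so no budget is consumed
    error_budget = 1.0 - slo_target
    return (window.errors / window.total) / error_budget


def should_page(long_w: Window, short_w: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, which cuts alerts on already-recovered incidents.
    """
    return (burn_rate(long_w, slo_target) >= threshold
            and burn_rate(short_w, slo_target) >= threshold)
```

With a 99.9% target, a window with 2% errors burns the budget roughly 20 times faster than allowed, so `should_page` fires only if the short window confirms the same burn. Requiring both windows is the design choice that trades a little detection delay for less alert fatigue.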
Selected Projects
- infra-labs.ai — AI-generated DevOps labs with guided scenarios, interactive terminal flow, and cloud-native learning paths.
- self-healing-infrastructure-chaos-engineering — Self-healing infrastructure experiments: failure injection, automated recovery loops, and chaos-driven validation of platform behavior.
- gitops-duel-argocd-vs-flux — Interactive duel between Argo CD and Flux — drift, reconciliation, and deployment tradeoffs as a platform game.
- insurance-platform-infrastructure — Multi-account insurance platform foundation: networking, EKS, identity, and environment-scoped Terraform modules.
- cloud-devops-platform — End-to-end cloud DevOps platform skeleton — clusters, delivery, and operational defaults as reusable IaC.
- devops-admin-platform — Admin console for platform operations — workflows, visibility, and control surfaces for DevOps teams.
Talks and Signals
- Cloud Native Rockstars 2026 — Company Award finalist, 3rd place — CONTACT Software GmbH
- Cloud Native Conference Frankfurt 2026 — Cloud-native systems, GitOps, and platform operations
- Fourier AI talks — AI infrastructure, automation, and production readiness