Monday, April 20, 2026

Docker Compose Gets You to the Demo. In Regulated Domains, Here Is What Gets You to Production.

I built an APRA-compliant insurance platform in my spare time to prove a point. Then I asked an honest question: could I actually run it in production? The answer revealed something counterintuitive about regulatory burden. 


In my previous post, I made the case that the enterprise modernisation playbook is broken. The evidence I offered was UnderwriteAI: a production-grade, APRA-compliant insurance platform I built entirely in my spare time, using GitHub Copilot powered by Claude Sonnet 4.6, across 41 working sessions. Eight microservices, a React portal, an API gateway, real-time Kafka event streaming, 156 automated BDD test scenarios, and a live demo in which an AI agent executes the complete insurance policy lifecycle in eleven natural language commands.

The platform works. The demos are compelling. The test coverage is real.

And then I asked a more uncomfortable question: could I actually run this in production?

 


The Docker Compose Fiction

The current deployment descriptor for UnderwriteAI is a single docker-compose.yml file. It starts 29 containers on a single machine, hardwires service discovery via a Docker bridge network, and manages persistence through named volumes on the local file system. It works perfectly on my MacBook. It has worked perfectly for 41 sessions of development and demonstration.

It is not a production deployment model.

Docker Compose is a development orchestration tool. It assumes a single host. It has no concept of the machine being unavailable. If the host restarts, you run docker compose up and everything comes back. If a container crashes, the restart: unless-stopped directive brings it back on the same host. If load increases and a service needs more instances, Compose cannot scale automatically to meet it; scaling is a manual command against the same single machine. There is no concept of a rolling deployment. There is no concept of a disruption budget. There is no way to say "this service requires at least one replica to be available at all times."
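To make the gap concrete, here is a minimal, illustrative Compose fragment (service names and values are hypothetical, not taken from the actual UnderwriteAI file). The restart policy is the full extent of the resilience model:

```yaml
# docker-compose.yml (illustrative fragment)
services:
  policy-service:
    image: underwriteai/policy-service:latest
    restart: unless-stopped   # the only resilience primitive: restart the
                              # container on the SAME host if it crashes
    depends_on:
      - policy-db
  policy-db:
    image: postgres:16
    volumes:
      - policy-data:/var/lib/postgresql/data   # named volume on local disk

volumes:
  policy-data:
```

There is nowhere in this file to express "keep one replica available during maintenance" or "add instances when CPU crosses a threshold" because the tool has no vocabulary for either.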

None of this matters for development. All of it matters for production.

I'm not raising this as a gap in the AI-assisted development story. I'm raising it because the distinction between "working software" and "production software" is consistently underweighted in the industry conversation about what AI-accelerated development can actually deliver. Working software is a necessary condition. It is not a sufficient one.

 


Resilience Is Not a Feature. It Is a Deployment Architecture.

The regulatory context sharpens this considerably.

APRA's CPS 230, which came into effect on 1 July 2025, sets explicit requirements for operational resilience in Australian regulated entities. It requires demonstrated availability controls: documented tolerance for disruption, tested recovery procedures, and evidence that critical business services can withstand realistic failure scenarios.

An insurance platform running on a single Docker host does not satisfy CPS 230. It cannot, structurally. There is no redundancy. There is no automated failover. There is no mechanism for demonstrating controlled disruption.

The standard artefacts that satisfy CPS 230 requirements in a modern deployment model are Kubernetes-native constructs: Pod Disruption Budgets (defining how many replicas can be unavailable during voluntary disruption), HorizontalPodAutoscalers (scaling replicas in response to load, ensuring capacity under demand), rolling update strategies (allowing new versions to be deployed without service interruption), and liveness and readiness probes (enabling the cluster to remove unhealthy instances from the load pool automatically, without human intervention).
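To show how small the artefact itself is relative to the capability it encodes, here is a sketch of a PodDisruptionBudget (the service name and labels are illustrative, not taken from the actual chart):

```yaml
# During voluntary disruption (a node drain, a cluster upgrade), the
# scheduler must keep at least one replica of this service running.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: policy-service
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: policy-service
```

Ten lines of YAML, but it is a machine-enforced statement of disruption tolerance, which is precisely the kind of evidence a prudential review asks for.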

These are not nice-to-have engineering hygiene items. For a regulated insurer, they are the substance of the operational resilience capability that a prudential regulator asks you to demonstrate.

An enterprise programme that defers infrastructure architecture to a later phase is deferring the regulatory capability itself. It cannot be discovered in integration. It has to be designed in.

 


Twenty-Nine Containers, Three Categories, One Tractable Problem


As always, I want to be specific here, because the move from Docker Compose to Kubernetes is often described at a level of abstraction that makes it sound either trivial ("just deploy the containers differently") or impossibly complex ("you need a dedicated platform team"). Neither characterisation is accurate.

The 29 containers in my stack fall into three categories, and each requires a different approach.

 

Category 1: Application services (nine containers)

Eight Java microservices and the React frontend. For each of these, the Kubernetes work is mechanical. A Deployment manifest encoding replica count and the resource limits already documented in the project's architecture guide. A Service manifest for internal cluster DNS. A HorizontalPodAutoscaler targeting 70% CPU utilisation with a minimum of one replica and a maximum of three. A PodDisruptionBudget with minAvailable: 1. Liveness and readiness probes wired to the Spring Boot Actuator health endpoints that already exist in every service.
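As a sketch of the autoscaling piece with the parameters just described (70% target CPU, one to three replicas; the resource name is illustrative):

```yaml
# HorizontalPodAutoscaler using the stable autoscaling/v2 API
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: policy-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: policy-service
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```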

This is templatable. The services share enough structural similarity that nine manifests can be generated from a single template with per-service variable substitution. That is what Helm charts are: parameterised Kubernetes manifest templates with environment-specific values files.

In practice, the structural approach is a single deployment.yaml template that iterates over a services: map using a Go template range loop. All eight microservices are declared as entries in values.yaml under a shared key. The template renders one Deployment, one Service, one ConfigMap, and one Secret per entry, and the only per-service inputs are port numbers, database credentials, and the small number of service-specific environment variables (Redis cache config for the premium service, document storage paths for the document service). The alternative of one template file per service produces eight times the maintenance surface area for changes that are structurally identical across all eight.
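A sketch of that structure, with hypothetical service names and keys (the actual UnderwriteAI values schema is not reproduced here):

```yaml
# values.yaml (fragment) — every microservice is an entry in one map
services:
  policy-service:
    port: 8081
  premium-service:
    port: 8082
    env:
      REDIS_HOST: redis   # the per-service exception: cache config
```

```yaml
# templates/deployment.yaml (fragment) — one range loop, one Deployment
# rendered per entry in the services map
{{- range $name, $svc := .Values.services }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ $name }}
spec:
  replicas: {{ $svc.replicas | default 1 }}
  selector:
    matchLabels:
      app: {{ $name }}
  template:
    metadata:
      labels:
        app: {{ $name }}
    spec:
      containers:
        - name: {{ $name }}
          image: "underwriteai/{{ $name }}:{{ $.Values.imageTag }}"
          ports:
            - containerPort: {{ $svc.port }}
{{- end }}
```

Adding a ninth service becomes a values-file entry rather than a new template file, which is the entire maintenance argument.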

Non-sensitive configuration (datasource URLs, Kafka bootstrap addresses, Keycloak JWK endpoints) goes into ConfigMap. Passwords and signing keys go into Kubernetes Secret objects using stringData. The two are mounted into the container together via envFrom. A checksum/config annotation on the Deployment (a SHA-256 hash of the ConfigMap content) ensures that updating a config value triggers a rolling restart automatically, without requiring a manual image rebuild. That is a Helm convention, not a Kubernetes built-in. Kubernetes does not watch ConfigMap content directly. What it watches is the Deployment spec, and when Helm recalculates the hash on the next helm upgrade and writes an updated annotation value, the spec has changed, so Kubernetes sees a new revision and triggers a rolling update. The end result is automatic configuration change propagation; the mechanism is a chart-level pattern built on top of standard Kubernetes rollout behaviour.
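The two mechanisms in that paragraph look like this in template form (a conventional Helm pattern, documented in the Helm chart development guide; names are illustrative):

```yaml
# templates/deployment.yaml (fragment)
spec:
  template:
    metadata:
      annotations:
        # Hash of the rendered ConfigMap. When config changes, the hash
        # changes, the pod template spec changes, and Kubernetes rolls
        # the Deployment. The hash is the trigger; Kubernetes never
        # watches the ConfigMap content itself.
        checksum/config: {{ include (print $.Template.BasePath "/configmap.yaml") . | sha256sum }}
    spec:
      containers:
        - name: {{ $name }}
          envFrom:
            - configMapRef:
                name: {{ $name }}-config   # non-sensitive configuration
            - secretRef:
                name: {{ $name }}-secret   # credentials and signing keys
```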

The liveness probe wires to /actuator/health/liveness and the readiness probe to /actuator/health/readiness, the Spring Boot Actuator endpoints that already exist in every service. No additional instrumentation is required.
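In manifest form (timing values here are illustrative defaults, not the project's actual tuning):

```yaml
# Probes wired to the Spring Boot Actuator health groups
livenessProbe:
  httpGet:
    path: /actuator/health/liveness    # restart the container if this fails
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /actuator/health/readiness   # remove from load balancing if this fails
    port: 8081
  initialDelaySeconds: 15
  periodSeconds: 5
```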

 

Category 2: Infrastructure services (20 containers)

This is where the real work is. PostgreSQL (ten databases: eight for the application services, plus dedicated instances for Keycloak and Kong, each maintaining its own schema, kept as separate containers to preserve the database-per-service isolation pattern), Redis, Apache Kafka, Zookeeper, Confluent Schema Registry, Keycloak, Kong API Gateway, Mailpit, Prometheus, Grafana, and Swagger UI.

For none of these do you write manifests from scratch. The ecosystem provides well-maintained community Helm charts: Bitnami's postgresql chart, Bitnami's kafka chart, the official Kong chart, the kube-prometheus-stack umbrella chart. The work is configuration: translating the environment variables in the Docker Compose file into the values schema expected by each community chart, ensuring persistent storage is correctly provisioned via PersistentVolumeClaim objects, and preserving the service interconnections (the Kafka bootstrap address, the Schema Registry URL, the Keycloak JWK endpoint) that the application services depend on.

This is the category that consumes most of the effort in any real Kubernetes migration. Configuration surface area is large, the community chart schemas differ from what you'd design yourself, and the failure modes during initial bring-up are obscure. It takes iteration.

The specific friction point in this stack is service discovery. Docker Compose's bridge network uses the service name as a DNS hostname (policy-db, kafka, redis), and every microservice's configuration already hardwires those names as spring.datasource.host, spring.kafka.bootstrap-servers, and so on. The default behaviour of the Bitnami community charts is to name Kubernetes services using the Helm release name as a prefix: a release named underwriteai with a PostgreSQL subchart aliased as policy-db would create a service called underwriteai-policy-db, not policy-db. That prefix would break every microservice's database connection configuration without a single changed line of application code.

The solution is fullnameOverride. Every infrastructure dependency in the umbrella chart's dependency declarations includes a fullnameOverride value matching the Docker Compose hostname exactly. The result is Kubernetes service DNS names that are identical to the docker-compose names, which means the application configuration files require zero changes. The umbrella chart for UnderwriteAI declares 14 dependencies: ten aliased bitnami/postgresql instances, bitnami/redis, bitnami/kafka, bitnami/keycloak, and kong/kong. Each has a fullnameOverride.
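The mechanism spans two files: the dependency is aliased in Chart.yaml, and the override lives in the umbrella chart's values under that alias. A sketch, with illustrative version constraints:

```yaml
# Chart.yaml (fragment) — two of the 14 dependency declarations.
# The same postgresql chart is declared once per database, each with
# its own alias matching the Docker Compose hostname.
dependencies:
  - name: postgresql
    version: "18.x.x"   # illustrative constraint
    repository: https://charts.bitnami.com/bitnami
    alias: policy-db
  - name: redis
    version: "21.x.x"   # illustrative constraint
    repository: https://charts.bitnami.com/bitnami
```

```yaml
# values.yaml (fragment) — without these overrides, a release named
# "underwriteai" would render services called underwriteai-policy-db
# and underwriteai-redis, breaking every hardwired connection string.
policy-db:
  fullnameOverride: policy-db
redis:
  fullnameOverride: redis
```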

Two infrastructure charts offer meaningful topology differences between environments. The Bitnami Kafka chart supports KRaft mode (Kafka's internal Raft consensus mechanism, available from Kafka 3.3), which eliminates the Zookeeper dependency entirely. In the Kubernetes deployment, the chart runs single-node KRaft in development (one pod, no Zookeeper sidecar) and scales the controller pool to three replicas in production. This is a cleaner topology than the docker-compose configuration, which still runs a separate Zookeeper container because the docker-compose image predates the KRaft stabilisation. The Redis chart runs in standalone mode for development and switches to replication with Sentinel enabled in the production values file.

A question that arises consistently at this point in the conversation: who operates a Kubernetes cluster? For most organisations deploying a single application of this scale, the answer is that you do not operate the control plane. EKS (AWS), AKS (Azure), and GKE (Google Cloud) provide Kubernetes as a managed service; the control plane is the cloud provider's operational responsibility. What you need is someone who can write and maintain Helm charts, understand the cluster's operational model, and own the deployment pipeline. For an eight-service application, that is one person with a platform engineering or SRE background, not an organisational function. The 'dedicated platform team' threshold is real for organisations running hundreds of services. It is not the right framing for a greenfield deployment of this scale, and treating it as such is how the infrastructure conversation gets indefinitely deferred.

helm dependency update resolves and downloads all 14 dependency charts into a local charts/ directory in a single command. The pull takes roughly 90 seconds on a reasonable connection. The output names the exact chart version pulled for each dependency (bitnami/postgresql:18.5.24, bitnami/kafka:32.4.3, bitnami/keycloak:25.2.0), which is the version-pinned audit trail the regulatory framework expects of dependency management.

 

Category 3: Secrets (a category of its own)

The Docker Compose file contains roughly 40 plaintext credentials: database passwords, Redis authentication strings, JWT signing keys, Kafka configuration. Every one of these needs to be removed from the manifest layer and replaced with a Kubernetes Secret reference before this stack goes anywhere near a production cluster.

This is not just a security requirement. It is a baseline expectation of any modern infrastructure audit. Credentials hardcoded into deployment files cannot be rotated cleanly, cannot be scoped by environment, and cannot be managed without modifying source-controlled configuration. Kubernetes Secret objects are the minimum viable solution. A full implementation would use a secrets management tool such as HashiCorp Vault with sidecar injection, but that is a subsequent step. The immediate requirement is to remove the plaintext.
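As a minimal sketch of that minimum viable solution (the names and the values path are hypothetical), a templated Secret using stringData:

```yaml
# templates/secret.yaml (fragment) — stringData accepts plain strings;
# the API server base64-encodes them into .data on write. The value
# itself arrives at deploy time (e.g. via --set or an external secrets
# tool) and is never committed to the repository.
apiVersion: v1
kind: Secret
metadata:
  name: policy-service-secret
type: Opaque
stringData:
  SPRING_DATASOURCE_PASSWORD: {{ .Values.secrets.policyDbPassword | quote }}
```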

The reason secrets warrant their own category is that the problem class is distinct from both application configuration and infrastructure topology. The challenges are operational: how do you rotate a database password without downtime? How do you promote credentials across environments without committing them to source control? How do you produce an audit trail showing which workloads accessed which credentials, and when? These questions have answers (External Secrets Operator pulling from AWS Secrets Manager, HashiCorp Vault with Kubernetes auth, sealed secrets for GitOps workflows), but each introduces operational surface area that needs to be staffed, monitored, and tested. In a regulated domain, the audit trail for secret access is as important as the audit trail for deployment configuration. They are separate records requiring separate toolchains, and conflating them is where the scope of this work expands unexpectedly.

This three-layer framing (application services, infrastructure services, secrets) applies to any containerised migration, regardless of stack. The proportions of effort will vary depending on how much of your infrastructure is already cloud-native; the categories will not.

 

A note on sequencing: the three categories do not need to be resolved in parallel. Start with Category 1 (application services). It is mechanical, and the process of templating eight structurally similar deployments builds the chart familiarity required for the harder infrastructure work. Move to Category 2 (infrastructure services) dependency by dependency rather than attempting a full migration in a single pass. Address secrets management early in the Category 2 phase, not as a final step. Establishing how credentials flow through the system while infrastructure charts are being wired is significantly less disruptive than retrofitting a secrets model after 14 dependency charts already have credentials embedded in their values files.

 


The Deployment Pipeline and the Compliance Audit Trail Are the Same Thing


There is a non-obvious benefit to this work that I want to name directly.

Helm charts are infrastructure-as-code. They are source-controlled, versioned, and diffable. Every change to the deployment configuration produces a commit. Every deployment can be rolled back with a single command. The entire history of how the platform has been deployed is preserved in the repository.

For a regulated insurer, this is not incidental. The ability to produce an immutable record of what was deployed, when, and with what configuration is a compliance requirement in its own right. Docker Compose running on a development laptop is the opposite of this. A Helm chart in a version-controlled repository with a CI/CD pipeline running helm upgrade is exactly the audit trail the regulatory framework expects.

The operational resilience capability and the auditability capability are not separate concerns. They are the same work, expressed at the infrastructure layer.

The Helm chart is not a deployment pipeline on its own. That distinction matters for the compliance story. A CI/CD pipeline running helm upgrade on a validated merge to main automates deployment execution. A GitOps controller such as ArgoCD or Flux takes this further: the desired cluster state is declared in version control, and the controller continuously reconciles the cluster against it. The compliance value is in the approval gate. A pull request review and merge approval on the values file is the change management record. The deployment cannot proceed without it, and the audit trail is the repository history rather than a separate ITSM ticket. For regulated organisations, this collapses the deployment toolchain and the change management toolchain into a single artefact. That is not a small thing.
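As a sketch of what that pipeline could look like, here is a hypothetical GitHub Actions workflow (the project's actual pipeline is not described in this post; paths follow the repository layout, everything else is illustrative):

```yaml
# .github/workflows/deploy.yml (hypothetical)
# Runs only on a reviewed merge to main: the PR approval is the
# change management record, and the run log is the deployment record.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint and render the chart
        run: |
          helm lint infrastructure/helm/underwriteai
          helm template infrastructure/helm/underwriteai \
            -f infrastructure/helm/values/values-prod.yaml > /dev/null
      - name: Deploy
        run: |
          helm upgrade --install underwriteai infrastructure/helm/underwriteai \
            -f infrastructure/helm/values/values-prod.yaml \
            --atomic --timeout 10m   # roll back automatically on failure
```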

 


The Production Readiness Question Always Gets Asked. The Variable is When.


I have sat in a large number of enterprise architecture reviews over 20 years. A recurring pattern: the question "how will this run in production?" is asked late, often after significant investment in application design, and the answer frequently requires renegotiating assumptions that were baked in at the beginning.

Container orchestration is a specific example of this. Organisations that built their containerisation strategy on Docker Compose or Docker Swarm (a reasonable early-phase choice) found themselves rearchitecting the operational layer when production requirements became concrete. The application code was fine. The infrastructure model needed to change.

The pattern is repeatable because the organisational incentive is to demonstrate capability quickly. Docker Compose lets you demonstrate working software on a laptop in a review meeting. That is genuinely useful. But the demonstration creates an impression of production-readiness that can persist longer than it should.

I am not immune to this. I made the same choice. UnderwriteAI runs on Docker Compose because it let me ship working software quickly and demonstrate the full platform in a compelling way. That was the right choice for the phase I was in.

The right choice for the next phase is different. Working software that is not yet production software is a pattern I encounter consistently in enterprise modernisation engagements, and it is rarely the result of poor engineering. It is the structural consequence of the incentive described above: the demonstration succeeds, and its implied production-readiness outlasts the phase it was designed for.

 


The Counterintuitive Advantage of Regulated Domains

Here is the observation that tends to surprise people when I raise it: in my experience, organisations operating in regulated domains have this conversation earlier and with less organisational resistance than their unregulated counterparts.

The reason is structural. APRA does not ask whether you have thought about resilience. It asks you to demonstrate it, before you operate at scale. CPS 230 requires documented tolerance for disruption, tested recovery procedures, and evidence of availability controls. It is not a checkbox exercise. An auditor will ask to see the Pod Disruption Budgets, the rollback procedures, the incident response runbooks. The regulator has, in effect, mandated that the production infrastructure conversation happen before go-live.

That is an uncomfortable constraint when you first encounter it. It adds work to a phase of the programme that feels like it should be focused on features. But the constraint is doing something useful: it prevents the infrastructure debt from accumulating in the first place.

Compare this to the pattern I have observed consistently in non-regulated organisations. The production readiness conversation gets deferred. The team ships features. The deployment model that worked for the demo becomes the deployment model for production, because changing it would delay the launch. The launch happens. For some period, the single-host deployment holds. Then load increases, or a dependency fails, or a deployment goes wrong and there is no rollback path, and the production infrastructure conversation finally happens. Now it is happening under operational pressure, with real customers affected, in a remediation context rather than a design context. The cost is higher, the options are narrower, and the team is working against the clock.

This cost has been quantified. Google Cloud's DORA programme (Accelerate State of DevOps Report, dora.dev) has tracked software delivery performance across tens of thousands of practitioners for over a decade. A consistent finding: high-performing organisations excel at both speed and stability simultaneously. The assumption that trading production infrastructure maturity for early-phase delivery velocity is a rational choice does not hold up in the data. DORA's 2019 research found that elite performers were more than 23 times more likely to have fully adopted flexible cloud infrastructure than low performers. Their 2023 report found that organisations leveraging flexible infrastructure demonstrate 30% higher organisational performance than those that lift and shift without adopting cloud-native practices. The 2024 report was direct: 'simply migrating to the cloud without adopting its inherent flexibility can be more harmful than staying in a traditional data center' (Accelerate State of DevOps Report 2024, dora.dev). Deferred infrastructure work does not preserve optionality. It compounds a performance deficit.


It is worth naming the governance pattern underneath this. The team that makes the deferral decision is rarely the team that inherits the remediation cost. The engineering team that shipped the demo successfully moved on to the next programme. The operations team, or the team contracted to modernise the platform six months later, inherited the production stability debt. There is a funding structure that reinforces this split. The delivery programme is capitalised: CAPEX, with a defined budget and a clear end date, typically overseen by an executive sponsor accountable for shipping on time. The team that inherits the platform operates under OPEX, a cost centre under sustained pressure to reduce expenditure year on year. The production stability debt crosses the boundary between those two funding models invisibly. It does not appear in the CAPEX programme's final cost. It appears as operational overhead in a budget that was already too small. This is a governance gap, not an engineering failure. The incentive to demonstrate working software quickly is rational for the team that faces it. The cost falls elsewhere, to someone who was not in the room when the deferral was decided.

I have lived both versions. The regulated path feels slower at the time. In retrospect it is faster, because you do not pay the production stability debt after launch.

 

The lesson for technology leaders in non-regulated domains is uncomfortable but clear: the regulator is not the reason to build production-grade infrastructure before go-live. The reason is that it is cheaper and less risky to build it before go-live than after. The regulator is simply the external forcing function that makes regulated organisations do what all organisations should be doing anyway.

If your programme does not have a regulator imposing that constraint, consider voluntarily imposing it yourself. Define your production readiness criteria at the start of the programme, and make them specific enough to be binding. 'We will use Kubernetes' is not a criterion. 'Helm charts passing helm lint before the first sprint' is. 'We will manage secrets properly' is not a criterion. 'No credentials in deployment files before the first integration environment' is.

Add observability to that list explicitly. An instrumentation layer is not the same as a monitoring capability, and the difference is only visible under production load. Defined service level objectives, alerting on SLO breach, and enough baseline telemetry to distinguish normal behaviour from abnormal are as much a production readiness criterion as liveness probes. Without them, the first indication of a degraded service is a customer complaint.

Treat container orchestration model, secrets management approach, liveness and disruption budget configuration, observability baselines, and tested rollback procedures as launch-blocking requirements with named exit criteria in the programme charter. The charter is the right place for these precisely because it is agreed before anyone has an incentive to defer them.

The conversation will happen eventually. The only question is whether it happens while you still have the full set of options available, and while the team that defined the architecture is still in the room.

 


From Blank Directory to 145 Resources: What the Work Actually Involved

I completed this work between writing the first and second drafts of this post, so I can give a precise account rather than an estimate.

The finished umbrella chart renders 145 Kubernetes resources in total: 37 from the custom templates (8 Deployments, 8 Services, 8 ConfigMaps, 8 Secrets, 1 PVC for document storage, 1 frontend Deployment, 1 frontend Service, 1 frontend ConfigMap, 1 Ingress) and 108 from the 14 dependency sub-charts. helm lint reports zero failures. helm template against the development values file completes without errors and produces valid YAML for every resource.

The directory structure, consistent with the project's architecture decision records on file organisation:

infrastructure/helm/
├── underwriteai/                       # Umbrella chart
│   ├── Chart.yaml                      # 14 dependency declarations
│   ├── values.yaml                     # Default values (development credentials)
│   └── templates/                      # 10 files
│       ├── deployment.yaml             # Range loop over services map
│       ├── service.yaml
│       ├── configmap.yaml
│       ├── secret.yaml
│       ├── hpa.yaml                    # HorizontalPodAutoscaler
│       ├── pvc.yaml
│       ├── frontend-deployment.yaml
│       ├── frontend-service.yaml       # Service + ConfigMap + Ingress
│       ├── _helpers.tpl
│       └── NOTES.txt
└── values/
    ├── values-dev.yaml                 # 1 replica, 2Gi PVCs, Always pull
    └── values-prod.yaml                # HA replicas, TLS, empty passwords

The entire chart (application templates and all 14 infrastructure dependencies) was produced in a single AI agent session. Total elapsed time from blank directory to passing helm lint: approximately 15 minutes.

Again, I want to be precise about what that means, because it is easy to read "15 minutes" and conclude that the work was simple. It was not. The application template work was mechanical: the range-loop design over a services map is a structural pattern, and the shared Spring Boot probe configuration required no per-service customisation. But the infrastructure configuration work (translating 20 Docker Compose container definitions into correctly wired Helm dependency values, discovering fullnameOverride, resolving the Bitnami chart schemas across 14 dependencies) is exactly the category of work to which a senior platform engineer, doing it manually, would have allocated the better part of a day, if not several. The fullnameOverride problem alone (understanding why the Bitnami chart was not producing the expected DNS name and finding the correct values key to override it) is the class of problem that documentation does not surface until you encounter it. It appears in the Bitnami chart's values.yaml on line 63, unremarked, between unrelated configuration items.


The AI agent resolved it in minutes. This is the part of the AI-accelerated development story that the industry has not yet fully priced in: the compression is not happening in feature development alone. It is happening in the infrastructure layer that was previously the primary bottleneck to production readiness.

The programme-level implication is sharper than it might initially appear. If your organisation is treating Kubernetes migration as a multi-quarter platform programme requiring specialist hiring, and a principal architect in a competing organisation can produce a validated 145-resource chart in 15 minutes, the competitive gap is not only in feature velocity. It is in infrastructure maturity. Organisations that have internalised AI-assisted development at the infrastructure layer are arriving at production-ready deployment configurations in the time it previously took to write the design document for one. The distance between working software and production software has not disappeared. It has shortened to the point where deferring it is a choice, not a constraint imposed by capability.

 

The secrets layer remains an outstanding item. The development values files contain the same plaintext credentials used in Docker Compose, which is acceptable for a private development repository. A production deployment requires every credential reference replaced with either a --set flag at deploy time or an External Secrets Operator integration pulling from AWS Secrets Manager or HashiCorp Vault. The production values file is structured to make this transition explicit: every password field is set to an empty string with a # REQUIRED: override via --set comment. The shape of the secrets surface is defined; the management mechanism is deferred.
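The convention in the production values file looks like this (the values path is illustrative; it follows the Bitnami postgresql chart's auth block):

```yaml
# values-prod.yaml (fragment) — the shape of the secrets surface is
# explicit; the value is deliberately absent from source control.
policy-db:
  auth:
    password: ""   # REQUIRED: override via --set policy-db.auth.password=...
```

An empty string fails loudly at bring-up rather than silently shipping a development credential, which is the point of the convention.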

The prerequisite for validating this in a real cluster is a local Kubernetes environment. Docker Desktop includes one (Settings → Kubernetes → Enable Kubernetes). That is sufficient for development and cluster-level validation before deploying to a managed service such as EKS, AKS, or GKE.

 


If You Don’t Have a Regulator, Consider Becoming Your Own


The broader point I want to leave with technology leaders is this.

AI-assisted development has materially shortened the distance between intent and working software. That is real, and the implications for enterprise programme economics are significant, as I argued in the previous post.

But the distance between working software and production software has not shortened by the same factor. Infrastructure architecture, operational resilience design, secrets management, and the regulatory capability layer that sits on top of all of it are still substantial engineering work. AI tooling helps with the mechanical parts. The design judgements are still human.

The risk for organisations that have adopted AI-assisted development without yet internalising this distinction is that they are delivering working software faster than their infrastructure capability can absorb. Demos improve. Release pipelines, operational resilience frameworks, and audit-ready deployment configurations do not automatically improve alongside them.

Product-led modernisation, the position I argued for in the previous post, does not mean "ship features and work out production later." It means the path to production should be short and known from the beginning. Feature velocity and infrastructure maturity need to advance together, or the gap between what you can demonstrate and what you can actually operate at scale will quietly widen.

I’m closing that gap on my own platform now. It is, predictably, the hardest part of the project.



Link to my previous post 👉 "The Enterprise Modernisation Playbook Is Broken. I Know Because I Helped Write It ... " 



I’m Tyrell Perera, an Enterprise Solutions Architect and Fractional CTO with 20+ years of experience leading digital transformation in Insurance, Telecommunications, Energy, Retail, and Media across Australia. The gap between working software and production software is the one I see most consistently underestimated in enterprise modernisation programmes, regardless of how well the application development has gone. If you’re leading a programme where the working software story is strong and the production readiness story is not yet written, that is the specific conversation I’m set up for. Find me at tyrell.co or on GitHub.

 

Saturday, April 11, 2026

The Enterprise Modernisation Playbook Is Broken. I Know Because I Helped Write It ...


After two decades inside large-scale transformation programmes, I stopped waiting for the right conditions. I built the proof myself. On weekends.
 

I've spent 20+ years in Enterprise Solutions Architect and other technology leadership roles inside large-scale technology transformation programmes. Insurance. Telecommunications. Energy. Retail. Media. Different industries, different technology stacks, different executive sponsors. The same programme structure, year after year.

Discovery phase. Architecture blueprints. Governance frameworks. Vendor selections. Roadmaps that stretch eighteen months before a single line of production code is written. And somewhere in month fourteen, when the business context has shifted and the original assumptions are quietly no longer true, a measured renegotiation of scope begins. The "minimum viable" becomes the "maximum achievable."

I've built a career navigating this model. I'm not writing this from the outside. I've led engineering organisations of 90 people inside that model, managing platforms supporting hundreds of millions in annual revenue at one of Australia's largest telecommunications companies. I'm not dismissing it wholesale. For some problems, it's still the right approach. But something has changed in the last eighteen months that makes the old playbook genuinely obsolete for a significant class of enterprise modernisation challenges.

I was frustrated enough to prove it on my own time.

 


What the Old Playbook Assumes 

Large-scale transformation programmes are built on a set of assumptions that were reasonable when they were formed.

Assumption 1: Building software is expensive and slow. Therefore, front-load the planning. Get the architecture right before committing to implementation. The cost of changing direction mid-programme is prohibitive.

Assumption 2: Complexity requires specialisation. Regulated domains like insurance, banking, and healthcare require deep domain expertise, and that expertise takes time to co-ordinate across teams. Move carefully.

Assumption 3: Working software is a late-stage deliverable. The artefacts of early phases are documents: requirements, designs, blueprints. Stakeholders validate against slides and wireframes. Working software comes at the end, when you integrate and test.

These assumptions shaped programme structures, governance models, vendor relationships, and, critically, the way executives are asked to think about technology investment.

Every one of these assumptions is now wrong.

 


What Changed: AI Collapsed the Distance Between Intent and Working Software

 
 

I want to be precise here, because this point is usually made too broadly.

I'm not saying "AI speeds up development." That framing undersells the structural change. What has actually happened is that the distance between a clear statement of intent and working, tested, production-grade software has collapsed to a degree that invalidates the planning-heavy programme model entirely.

To test this hypothesis properly, I chose the hardest domain I could think of: Australian insurance. Regulatory obligations under APRA, the Privacy Act 1988, and the Insurance Contracts Act 1984. Multi-service architecture requirements. Real-time event streaming, audit trail integrity, compliance reporting. If you want a genuinely complex proving ground, insurance qualifies.

I started building outside of work hours. No team. No budget. No programme governance structure.

Over 41 working sessions, using GitHub Copilot powered by Claude Sonnet 4.6, I built UnderwriteAI: a working reference system for Australian insurance. Production-grade, eight microservices, compliance-complete. Policy management. Customer onboarding with Privacy Act consent capture. A rating engine covering five insurance products. Claims workflow from lodgement through settlement. APRA regulatory reporting. Kafka event streaming across six topics. A React portal. Kong API gateway. Keycloak authentication. 156 automated BDD test scenarios covering Australian compliance requirements.

The architecture is not a prototype. The compliance is not simulated. The test coverage is not aspirational.

And I built the whole thing in my spare time (hence the 41 sessions; the way I preserved agent context and memory between those sessions deserves a dedicated blog post 😉).

An equivalent programme scoped through a traditional delivery model, with vendor selection, requirements workshops, architecture review boards, and staged releases, would conservatively carry an 18 to 24 month timeline and a seven-figure budget before a line of production code shipped. This took 41 sessions.

 


The Moment That Clarified Everything

There are actually two demonstrations from this project, and the progression between them is the point.

The first is a 16-chapter walk-through of the complete insurance lifecycle: customer creation, premium rating, policy binding, claims lodgement, workflow progression, notifications, APRA reporting, renewals. A browser opens. Every screen is navigated. Every form is filled. Every button is clicked. It looks like a polished product demonstration performed by a skilled operator.

There is no human operator. The entire browser session is driven by a Playwright script authored by the same AI that built the platform. I provided the instruction to run it. That is the full extent of my involvement. The AI that wrote the code also wrote the tests, and the tests are the demo.

That realisation sat with me for a while. Then I took it one step further.


I wired GitHub Copilot into the live platform via the Model Context Protocol, a standard that allows AI agents to call real APIs directly as tools. In the second demonstration, there is no browser at all. No Playwright script. No human navigating screens. Just a VS Code chat window and natural language instructions.

In eleven tool calls, Copilot created a customer with Privacy Act consent captured, ran the premium rating engine for a comprehensive motor policy, bound the policy, lodged a claim for a not-at-fault rear collision, advanced the claim through the full regulatory workflow (acknowledge, investigate, assess, approve, settle), and pulled the immutable APRA audit trail.

Every step landed in the live database. Every Kafka event fired. Every notification dispatched. Every audit record written.

The progression across the two demos is not a technical curiosity. It is a directional signal. In the first demo, the AI uses the interface designed for humans because it can. In the second, it discards that interface entirely and operates the system directly. The browser, and by extension the entire human-facing layer, turns out to be optional infrastructure.

I've spent years explaining to executive stakeholders what possible looks like in a regulated domain. These two demonstrations are now the explanation.

Watch the full demo:



What This Means for Your Technology Organisation

 

I want to offer four genuinely consequential implications for CIOs and CTOs. Not the usual list of AI adoption recommendations.

 

1. Your planning horizon is your biggest risk.

If your modernisation programme is spending its first twelve months producing documents rather than working software, you are not managing risk. You are accumulating it. The business context that justified the programme will change. The technology landscape will change. The AI tools available to your engineering teams will change dramatically. Programmes that defer working software to the integration phase will arrive at that phase with outdated assumptions and no mechanism to detect it.

Product-led modernisation, defined simply as shipping working, tested, incrementally improving software from week one, is not an Agile methodology recommendation. It is a risk management position.

 

2. The regulated domain objection no longer holds.

The most common pushback I receive when discussing faster, more iterative approaches to enterprise transformation is: "Our domain is too complex. We have regulatory obligations. We can't move that quickly."

I built UnderwriteAI specifically to empirically test this objection. APRA compliance, dual-consent privacy obligations, statutory notice timelines, immutable audit trails: none of these prevented iterative delivery. Some of them were easier to implement correctly when tested continuously from the beginning rather than bolted on at the end. Compliance that is woven into every sprint cannot be descoped. Compliance that is scheduled for the "integration phase" routinely is.

 

3. AI is now simultaneously the builder and the operator of enterprise systems.

This is the implication that most organisations haven't fully absorbed.

The MCP demonstration is not a curiosity. It is a preview of enterprise architecture in which AI agents are first-class participants in business workflows. Not augmenting human activity. Executing it. The question for your technology organisation is not whether to prepare for this, but whether your current modernisation investments are producing the kind of clean, API-first, event-driven architecture that AI agents can actually operate.

Legacy systems with opaque integrations and inconsistent APIs are not just technically awkward. They are structurally incompatible with the direction enterprise computing is moving. Every year of deferred modernisation is a year of compounding incompatibility with the operational model that is already emerging.

 

4. You can start smaller than you think, and sooner than your governance model assumes.

The most common response I get when sharing this with technology leaders is: "That's compelling, but we can't restructure our whole programme around it." That is not what I'm suggesting.

Pick one bounded domain. A single workflow that is materially important but not mission-critical enough to paralyse decision-making. Set a 90-day deadline. Ship working software against it. Not a prototype, not a proof of concept: working software, with tests, running against real data.

What you learn in those 90 days about what AI can and cannot do in your specific environment, with your specific constraints, is worth more than the outputs of a six-month discovery phase. And you will have working software at the end of it, which means the next conversation with your board is grounded in evidence rather than projections.

 


The Question I'd Leave You With

Most modernisation programmes can show you a roadmap. Many can show you a milestone report. Very few can show you working software that solves the actual problem: real compliance, real test coverage, and a live demonstration you can put in front of a sceptical stakeholder today.

I built that in my spare time to prove a point about what is possible.

The question worth asking of your current transformation programme (or the one you are about to commission) is simple: what is the working software that proves this is on the right track? Not the wireframes, not the architecture diagrams, not the vendor's reference implementation. The working software, running against real data, that a sceptical stakeholder can interact with today.

If the answer is "we'll have that in the integration phase," the programme structure is carrying more risk than the governance papers are showing you.

 


I'm Tyrell Perera, an Enterprise Solutions Architect and Fractional CTO with 20+ years of experience in digital transformation across Insurance, Telecommunications, Energy, Retail, and Media in Australia. 

UnderwriteAI is a project I built entirely in my own time, outside of my day job. It is currently in a private repository while I work through what comes next, whether that is open sourcing it, building a product around it, or using it as a foundation for advisory engagements. If you're navigating modernisation decisions for your organisation and want to explore what this model looks like in your context, I'd welcome the conversation. 

Find me at tyrell.co or on GitHub.

 

Wednesday, March 18, 2026

NVIDIA's Inferencing Chip Launch: Market Validation of the Enterprise AI Strategy I Predicted in January

March 18, 2026

Seven weeks ago, I published a blog post arguing that enterprises should focus on AI inferencing rather than training, based on a casual lunch conversation with fellow architects. Today, NVIDIA's announcement of their new chip specifically designed for AI inferencing workloads provides compelling market validation of that thesis.

This isn't just another hardware launch. It's a definitive signal that the AI infrastructure market is bifurcating exactly as I predicted, and enterprises that recognised this shift early are now perfectly positioned for the next phase of AI adoption.

 

What NVIDIA's Move Tells Us About Market Reality

When one of the world's most influential AI infrastructure companies invests in developing dedicated silicon for inferencing, it confirms several critical market dynamics that I outlined in my original analysis:

Enterprise Inferencing Demand Has Reached Scale

NVIDIA doesn't develop new chips on speculation. This launch indicates that enterprise demand for optimised inferencing performance has reached sufficient scale to justify the massive R&D investment required for new silicon development.

In January, I wrote:

"For most enterprise IT departments, the strategic focus should be on inferencing and model consumption rather than large scale model training."

The market has spoken, and enterprises globally are clearly following this path, creating enough demand to drive hardware innovation.

Performance Optimisation is Now a Competitive Differentiator

Real time inferencing performance has evolved from a technical requirement to a business competitive advantage. Organisations that can serve AI predictions faster, more reliably, and at lower cost will outperform those still grappling with infrastructure basics.

This aligns perfectly with my January prediction about where enterprise value creation occurs:

"Enterprise Value Creation: Data preparation and feature engineering; Business process integration and workflow automation; User experience and interface design; Governance, compliance, and risk management; Model monitoring and performance optimisation"

Infrastructure Specialisation is Accelerating

The development of inferencing specific hardware confirms that the "one size fits all" approach to AI infrastructure is over. Training and inferencing require fundamentally different optimisations, and the market is now mature enough to support this specialisation.

 

Why This Validates My Original Enterprise AI Framework

In my January post, I argued that enterprises should focus on four key areas rather than attempting to compete with Big Tech on model training:

✅ Model Consumption: Leverage existing foundation models through APIs
✅ Fine Tuning Excellence: Customise models for domain specific applications
✅ Inferencing Infrastructure: Invest in robust, scalable serving capabilities
✅ Governance and Compliance: Build frameworks for responsible AI deployment

NVIDIA's inferencing chip directly supports points 2, 3, and 4 by providing:

  • Enhanced fine tuning capabilities through optimised inference performance
  • Superior inferencing infrastructure with dedicated silicon
  • Better governance support through consistent, auditable performance metrics
 

What This Means for Enterprise Strategy Moving Forward

The Infrastructure Investment Decision is Clearer

Seven weeks ago, some enterprises were still debating whether to invest heavily in training infrastructure or focus on inferencing capabilities. NVIDIA's move settles this debate definitively for most organisations.

The message is clear: invest in inferencing infrastructure excellence, not training infrastructure competition.

Early Adopters Have a Significant Advantage

Organisations that began focusing on inferencing capabilities, governance frameworks, and operational excellence in late 2025 and early 2026 are now positioned to leverage this next wave of specialised infrastructure immediately.

Those still allocating significant resources to training infrastructure may find themselves at a disadvantage as the market continues to specialise.

Cost Efficiency Becomes Strategic

With dedicated inferencing hardware available, the enterprises that master cost efficient model serving will have substantial competitive advantages. This reinforces my January emphasis on "Inferencing Cost Optimisation" as a critical enterprise capability.

 

Looking Forward: The Enterprise AI Maturity Model

Based on this market validation, I'm seeing a clear enterprise AI maturity progression:

Stage 1: Experimentation (2023-2024)

  • Proof of concept projects
  • Basic API consumption
  • Limited governance

Stage 2: Strategic Focus (2025-2026)

  • Choose between training vs inferencing investment
  • Develop governance frameworks
  • Build operational capabilities

Stage 3: Infrastructure Excellence (2026-2027) ← We are here

  • Optimised inferencing infrastructure
  • Advanced governance and compliance
  • Competitive differentiation through AI performance

Stage 4: Business Integration (2027+)

  • AI native business processes
  • Real time decision systems
  • Continuous optimisation and evolution
 

Key Implications for Solutions Architects

Infrastructure Planning

  • Immediate: Evaluate current inferencing infrastructure against new performance benchmarks
  • Short term: Develop business cases for inferencing specific hardware investments
  • Medium term: Design architectures that can leverage specialised inferencing capabilities

Investment Priorities

  • Deprioritise: Large scale training infrastructure investments
  • Maintain: API consumption and model evaluation capabilities
  • Accelerate: Inferencing optimisation, monitoring, and governance frameworks

Skills Development

  • Critical: Inferencing performance tuning and optimisation
  • Important: Multi model orchestration and management
  • Essential: AI governance and compliance frameworks
 

The Broader Industry Implications

NVIDIA's inferencing chip launch signals several broader trends that will reshape the enterprise AI landscape:

Hardware Ecosystem Maturation

We can expect other hardware vendors to follow with their own inferencing optimised solutions, creating a competitive market that will drive further innovation and cost reduction.

Software Stack Specialisation

Infrastructure software will increasingly optimise for inferencing specific workloads, creating more sophisticated orchestration, monitoring, and management capabilities.

Service Provider Evolution

Cloud providers and managed service vendors will develop inferencing specific offerings, making advanced capabilities accessible to smaller organisations.

 

Vindication and Forward Momentum

The NVIDIA announcement validates the strategic framework I proposed in January, but more importantly, it provides clear direction for enterprise AI investments moving forward.

The key insight remains unchanged: enterprises should focus their resources on becoming excellent at AI consumption, integration, and governance rather than attempting to compete with Big Tech on foundational infrastructure.

What's new: The market has now provided dedicated hardware to support this strategy, making the performance and cost benefits even more compelling.

The next challenge: Organisations must move quickly to capitalise on this infrastructure evolution. Those that continue to debate strategy while others implement inferencing excellence will find themselves increasingly disadvantaged.

For solutions architects and enterprise IT leaders, the path forward is clear. The question isn't whether to invest in inferencing capabilities, but how quickly and effectively you can build them.

The future belongs to organisations that excel at leveraging AI capabilities, not those trying to recreate them.

 


This post builds on my January analysis: "AI Training vs Inferencing: An Enterprise Solutions Architect's Guide to Building Secure, Compliant AI Systems". What trends are you seeing in your organisation's AI infrastructure decisions? I'd love to hear about your experiences in the comments.

 

Thursday, January 29, 2026

AI Training vs Inferencing: An Enterprise Solutions Architect's Guide to Building Secure, Compliant AI Systems

As enterprises increasingly adopt artificial intelligence to drive innovation and operational efficiency, understanding the fundamental differences between AI training and inferencing becomes crucial for solutions architects. This distinction isn't just technical; it has profound implications for security, compliance, data governance, and infrastructure architecture in enterprise environments.

In this post, I'll break down the key differences between AI training and inferencing from an enterprise perspective, highlighting the critical guardrails and considerations necessary when building AI solutions for large organisations, particularly in regulated industries.

 

Understanding the Fundamentals

 

AI Training: Building the Intelligence


AI Training is the process of teaching a machine learning model to recognise patterns, make predictions, or generate outputs based on historical data. During training:

  • Large datasets are processed to adjust model parameters
  • The model learns from examples and feedback
  • Computational resources are heavily utilised for extended periods
  • The goal is to optimise model accuracy and performance metrics

 

AI Inferencing: Applying the Intelligence


AI Inferencing is the operational phase where a trained model applies its learned knowledge to new, unseen data to make predictions or generate outputs. During inferencing:

  • Real time or batch processing of new data inputs
  • Pre trained models execute predictions quickly
  • Lower computational overhead compared to training
  • The focus shifts to latency, throughput, and availability
 

 

The Enterprise Reality: Focus on Inferencing, Not Training

Before diving into the technical considerations, it's crucial to address a fundamental strategic question: Should your enterprise be building its own AI models from scratch?

For most enterprise IT departments, the answer is definitively no. Here's why:


Why Enterprises Should Avoid Large-Scale Model Training

Infrastructure Reality:

  • Training state of the art models requires thousands of high end GPUs
  • Infrastructure costs can range from hundreds of thousands to millions of dollars
  • Specialised engineering teams with deep ML expertise are required
  • Power consumption and cooling requirements are substantial

Business Focus Alignment:

  • Enterprise IT exists to serve the core business (banking, insurance, retail, healthcare)
  • Your competitive advantage lies in your domain expertise, not in building foundation models
  • Resources are better invested in business specific applications and integrations
  • Time to market is critical for business solutions

Market Dynamics:

  • Companies like OpenAI, Anthropic, Google, and Meta have massive infrastructure investments
  • Pre trained models are becoming increasingly sophisticated and accessible
  • The cost of using existing models via APIs is often lower than building from scratch
  • Rapid innovation in the foundation model space makes internal development risky

 

The Practical Enterprise AI Strategy

Model Consumption, Not Creation:

  • Leverage existing foundation models through APIs (GPT 4, Claude, Gemini)
  • Focus on fine tuning and prompt engineering for your specific use cases
  • Invest in model evaluation and selection processes
  • Build expertise in model integration and orchestration

Training Where It Makes Sense:

  • Small, domain specific models for specialised tasks
  • Fine tuning existing models with your proprietary data
  • Transfer learning from pre trained models
  • Custom models for unique business processes where no alternatives exist

Enterprise Value Creation:

  • Data preparation and feature engineering
  • Business process integration and workflow automation
  • User experience and interface design
  • Governance, compliance, and risk management
  • Model monitoring and performance optimisation

 

Enterprise Considerations: Beyond the Technical


1. Data Classification and Governance

Training Phase Challenges (When Applicable):

  • Fine tuning requires access to curated, domain specific datasets
  • Often involves sensitive proprietary data for model customisation
  • Data preparation and feature engineering for specialised models
  • Model validation and testing with business specific metrics

Note: Most enterprises will focus on fine tuning pre trained models rather than training from scratch.

Inferencing Phase Challenges:

  • Processes real time customer data
  • Requires immediate access to current business context
  • Must maintain data lineage for audit purposes
  • Output data may contain derived sensitive information

Enterprise Guardrails:

  1. Implement data classification frameworks (Public, Internal, Confidential, Restricted)
  2. Establish clear data retention and purging policies for both phases
  3. Deploy data loss prevention (DLP) tools to monitor data movement
  4. Create separate data governance processes for training vs. operational data
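
As a concrete illustration of the first guardrail, here is a small Python sketch of a classification scheme with a per-phase ceiling: a dataset or request is admitted to a phase only if its label does not exceed what that phase is allowed to handle. The labels follow the framework above; the per-phase policy values are illustrative assumptions, not a prescribed standard.

```python
from enum import IntEnum

class Classification(IntEnum):
    """Ordered labels: higher value means more sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    RESTRICTED = 3

# Hypothetical policy: the maximum classification each phase may ingest.
PHASE_CEILING = {
    "fine_tuning": Classification.CONFIDENTIAL,  # curated, de-identified data only
    "inference": Classification.RESTRICTED,      # live customer data, tightly audited
}

def admit(phase: str, label: Classification) -> bool:
    """Gate a dataset or request against the phase's classification ceiling."""
    return label <= PHASE_CEILING[phase]
```

The useful property of encoding the policy this way is that the decision is data, not code: tightening the fine-tuning ceiling to INTERNAL is a one-line configuration change that a DLP tool or pipeline gate can enforce uniformly.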

 

2. Security Architecture Considerations

Training Environment Security (for Fine Tuning):

  • Isolated compute environments for model customisation
  • Secure data transfer protocols for proprietary training datasets
  • Encryption at rest for custom training data and model artifacts
  • Access controls limiting who can initiate fine tuning jobs

Inferencing Environment Security:

  • Real time threat detection and response capabilities
  • API security and rate limiting for model endpoints
  • Input validation and sanitisation to prevent adversarial attacks
  • Secure model serving infrastructure with load balancing
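
To make two of these controls concrete, here is a hedged Python sketch of a token-bucket rate limiter and a basic input validator for a model endpoint. The specific limits are illustrative assumptions; in production these checks typically live at the API gateway and WAF rather than in application code.

```python
import time

class TokenBucket:
    """Simple per-client rate limiter for a model-serving endpoint."""
    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens replenished per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        # Refill proportionally to elapsed time, capped at capacity.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

MAX_PROMPT_CHARS = 4000  # hypothetical limit for this sketch

def validate_input(prompt: str) -> str:
    """Reject oversized or control-character inputs before they reach the model."""
    if len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError("prompt exceeds maximum length")
    if any(ord(c) < 32 and c not in "\n\t" for c in prompt):
        raise ValueError("prompt contains control characters")
    return prompt
```

A gateway would keep one bucket per API key; validation like this is the first, cheapest layer of defence against adversarial inputs, not a substitute for model-level robustness.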

Enterprise Security Framework:

Training Security Stack:
├── Secure Data Lake/Warehouse
├── Isolated Training Clusters (Air gapped if required)
├── Encrypted Model Storage
└── Audit Logging and Monitoring

Inferencing Security Stack:
├── API Gateway with Authentication/Authorisation
├── WAF and DDoS Protection
├── Runtime Application Self Protection (RASP)
└── Real time Security Monitoring

 

 

3. Regulatory Compliance Implications

 

GDPR and Data Privacy

Training Considerations (Fine Tuning Scenarios):

  • Right to be forgotten requires model retraining or reversion capabilities
  • Data minimisation principles affect feature selection for custom models
  • Consent management for using personal data in model customisation
  • Cross border data transfer restrictions for fine tuning datasets

Inferencing Considerations:

  • Real time consent validation for processing personal data
  • Purpose limitation ensuring inference aligns with original consent
  • Data portability requirements for inference results
  • Transparent decision making processes

 

Financial Services (SOX, PCI DSS, Basel III)

Training Compliance (Fine Tuning Context):

  • Model customisation lifecycle documentation
  • Data lineage and transformation tracking for proprietary datasets
  • Version control for custom training data and model variants
  • Independent validation for fine tuned models

Inferencing Compliance:

  • Real time transaction monitoring and alerting
  • Explainable AI requirements for credit and lending decisions
  • Audit trails for all model predictions
  • Stress testing and back testing capabilities

 

Healthcare (HIPAA, HITECH)

Training Safeguards (Fine Tuning Scenarios):

  • De identification of PHI before model customisation
  • Business Associate Agreements with cloud providers offering fine tuning services
  • Secure multi party computation for collaborative model development
  • Regular privacy impact assessments for custom model development

Inferencing Protections:

  • Patient consent verification before processing
  • Minimum necessary standard for data access
  • Secure messaging for AI generated insights
  • Integration with existing EMR audit systems

 

4. Infrastructure and Operational Excellence

Resource Management

Training Infrastructure:

  • High performance computing clusters
  • GPU optimised instances for deep learning
  • Distributed storage systems for large datasets
  • Batch processing orchestration platforms

Inferencing Infrastructure:

  • Low latency serving infrastructure
  • Auto scaling capabilities for variable load
  • Multi region deployment for disaster recovery
  • Edge computing for real time decisions

 

Cost Optimisation Strategies

Training Cost Management:

  • Spot instances for non critical training jobs
  • Model compression and pruning techniques
  • Efficient data pipeline design to reduce preprocessing costs
  • Training job scheduling during off peak hours

Inferencing Cost Optimisation:

  • Model optimisation for efficient serving
  • Caching strategies for repeated queries
  • Serverless computing for variable workloads
  • Progressive deployment strategies (A/B testing)
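
The caching point is the easiest of these to illustrate. Here is a minimal sketch, assuming identical prompts can safely share a response, which holds for deterministic or reference-data workloads but not for per-user personalised output:

```python
from functools import lru_cache

# Stand-in for an expensive model call; in practice this would hit a
# serving endpoint. The counter lets us observe how many real calls occur.
CALLS = {"n": 0}

def run_model(prompt: str) -> str:
    CALLS["n"] += 1
    return f"prediction for: {prompt}"

@lru_cache(maxsize=1024)
def cached_predict(prompt: str) -> str:
    """Serve repeated identical queries from cache instead of the model."""
    return run_model(prompt)
```

Two calls to `cached_predict` with the same prompt cost one model invocation. In a real deployment the same idea is usually implemented with a shared cache such as Redis keyed on a hash of the normalised input, with a TTL so stale predictions age out.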

 

5. Model Governance and Lifecycle Management

Version Control and Lineage

Training Governance:
├── Dataset versioning and lineage tracking
├── Hyperparameter and configuration management
├── Model performance metrics and validation
└── Automated testing and quality gates

Inferencing Governance:
├── Model deployment pipeline automation
├── A/B testing and canary deployment frameworks
├── Performance monitoring and alerting
└── Rollback and recovery procedures

 

Monitoring and Observability

Training Monitoring:

  • Resource utilisation and cost tracking
  • Data quality and drift detection
  • Training convergence and performance metrics
  • Automated failure detection and notification

Inferencing Monitoring:

  • Real time performance metrics (latency, throughput)
  • Model accuracy and drift detection
  • Business metrics and KPI tracking
  • Anomaly detection for unusual prediction patterns
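
A minimal sketch of the latency side of this monitoring, assuming a single-process server; real deployments would export these numbers to a metrics system such as Prometheus rather than compute percentiles in-process:

```python
import time
from collections import deque
from functools import wraps

LATENCIES = deque(maxlen=1000)  # rolling window of recent request latencies (seconds)

def timed(fn):
    """Record wall-clock latency of each inference call."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            LATENCIES.append(time.perf_counter() - start)
    return wrapper

def p95_ms() -> float:
    """95th percentile latency in milliseconds over the rolling window."""
    if not LATENCIES:
        return 0.0
    return sorted(LATENCIES)[int(0.95 * (len(LATENCIES) - 1))] * 1000
```

Wrapping the serving function with `@timed` gives you the raw signal; alerting on the p95 rather than the mean is what catches the tail-latency degradation that users actually feel.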

 

6. Risk Management Framework

Model Risk Management

Training Risks:
├── Data bias and fairness issues
├── Overfitting and generalisation problems
├── Intellectual property and trade secret exposure
└── Adversarial training data attacks

Inferencing Risks:
├── Model degradation over time
├── Adversarial input attacks
├── Availability and performance issues
└── Incorrect predictions leading to business impact
 

Mitigation Strategies

Training Risk Mitigation:

  • Diverse and representative training datasets
  • Regular bias testing and fairness audits
  • Secure development environments with access controls
  • Adversarial training techniques for robustness

Inferencing Risk Mitigation:

  • Continuous monitoring and automated retraining triggers
  • Input validation and anomaly detection
  • Circuit breakers and fallback mechanisms
  • Human in the loop for high risk decisions
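
Circuit breakers are worth sketching because they are simple and frequently skipped. A minimal illustration, with thresholds chosen arbitrarily for this sketch; in practice the fallback might be a cached answer, a simpler model, or a hand-off to a human:

```python
import time

class CircuitBreaker:
    """Trip after `threshold` consecutive failures; use the fallback while open."""
    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after  # seconds before retrying the model
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback()          # open: short-circuit to fallback
            self.opened_at = None          # half-open: try the model again
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback()
        self.failures = 0
        return result
```

The point of the breaker is not just graceful degradation: while the circuit is open, the failing model endpoint receives no traffic at all, which is often what it needs to recover.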

 


Best Practices for Enterprise AI Implementation

 

1. Establish Clear Boundaries

  • Separate training and production environments completely
  • Implement network segmentation and access controls
  • Define clear data flow and approval processes
  • Create role based access control (RBAC) for different phases

 

2. Implement Defence in Depth

Security Layers:
├── Physical Security (Data centres, hardware)
├── Network Security (Firewalls, VPNs, network segmentation)
├── Application Security (Authentication, authorisation, input validation)
├── Data Security (Encryption, tokenisation, data masking)
└── Monitoring and Response (SIEM, SOC, incident response)

 

3. Build for Auditability

  • Comprehensive logging for all AI operations
  • Immutable audit trails for compliance reporting
  • Automated compliance checking and reporting
  • Regular third party security assessments
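
An immutable audit trail can be approximated without special infrastructure by hash-chaining entries, so that tampering with any past record breaks every subsequent hash. A minimal sketch (field names are illustrative; production systems would also sign entries and ship them to write-once storage):

```python
import hashlib
import json

def append_entry(trail, event):
    """Append an event whose hash covers the previous entry's hash,
    chaining the records together."""
    prev = trail[-1]["hash"] if trail else "genesis"
    payload = json.dumps(event, sort_keys=True)
    trail.append({
        "event": event,
        "prev": prev,
        "hash": hashlib.sha256((prev + payload).encode()).hexdigest(),
    })
    return trail

def verify(trail):
    """Recompute every hash; any edited or reordered entry fails the check."""
    prev = "genesis"
    for entry in trail:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + payload).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

trail = []
append_entry(trail, {"action": "model_invocation", "user": "app_service"})
append_entry(trail, {"action": "prediction_logged", "user": "app_service"})
```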

 

4. Plan for Scale and Evolution

  • Modular architecture supporting multiple AI workloads
  • Container based deployment for consistency and portability
  • API first design for integration flexibility
  • Continuous integration and deployment pipelines

 

Conclusion

For most enterprise IT departments, the strategic focus should be on inferencing and model consumption rather than large scale model training. The distinction between AI training and inferencing extends far beyond technical implementation details, and the practical reality is that enterprises should leverage the massive investments already made by AI companies rather than attempt to recreate them.


The Enterprise AI Sweet Spot:

  • Consume foundation models via APIs or cloud services
  • Focus on fine tuning for domain specific applications
  • Invest heavily in inferencing infrastructure and governance
  • Build competitive advantage through integration and user experience

Success in enterprise AI implementations requires:

  • Strategic Focus: Concentrating resources on business value creation, not infrastructure
  • Practical Security: Implementing robust governance for model consumption and fine tuning
  • Compliance by Design: Building regulatory requirements into AI workflows from day one
  • Operational Excellence: Ensuring reliable, scalable inferencing systems that serve business needs
  • Smart Risk Management: Understanding the risks of both model consumption and custom development

 

As AI continues to transform enterprise operations, the architects who understand these nuances and implement appropriate guardrails will be best positioned to deliver successful, sustainable AI solutions that drive business value whilst maintaining the trust and confidence of customers and regulators.


Thursday, December 11, 2025

The $11B Power Play: How IBM's Confluent Acquisition Reshapes Enterprise Data Architecture for the AI Era

IBM just made its boldest bet on the future of enterprise data with an $11 billion acquisition of Confluent. This isn't just another corporate deal. It's a strategic repositioning that signals exactly where enterprise data architectures are heading in the age of AI agents and real-time intelligence.

 

The Deal That Changes Everything

On December 8th, 2025, IBM announced its acquisition of Confluent for $31 per share, an $11 billion transaction that immediately caught my attention. As someone who has been architecting enterprise data solutions across multiple organisations since the GenAI revolution began, I see this as more than just a strategic acquisition. It's validation of a fundamental shift in how enterprises must think about data architecture.

The numbers tell part of the story:

  • 6,500+ clients across major industries
  • 40% of Fortune 500 already using Confluent
  • $100 billion TAM in real-time data streaming (doubled in 4 years)
  • 1 billion new applications expected by 2028

But the real story is what this means for enterprise architects and CTOs planning their data strategies.

 

Why This Acquisition Matters Beyond the Headlines

 

The Real-Time Imperative Becomes Non-Negotiable

IDC's projection of over one billion new logical applications by 2028 isn't just a statistic. It's a fundamental reshaping of enterprise IT. Every one of these applications, along with the AI agents that will power them, needs access to connected, trusted data in real-time.

Traditional batch processing architectures that dominated enterprise data strategies for decades are becoming obsolete. The acquisition signals IBM's recognition that real-time data streaming isn't a nice-to-have. It's the foundational infrastructure for AI-driven enterprises.

 

The End of Data Silos in AI Architectures

What struck me most about IBM CEO Arvind Krishna's statement was this: "Data is spread across public and private clouds, datacenters and countless technology providers." This is the reality every enterprise architect faces today.

Confluent's Apache Kafka-based platform doesn't just connect systems. It eliminates the data silos that cripple AI implementations. For agentic AI to work effectively, data must flow seamlessly between environments, applications, and APIs. The acquisition creates a platform specifically designed for this challenge.

 

The Strategic Implications for Enterprise Data Architecture

 

1. Event Streaming Becomes Central Infrastructure

This acquisition positions event streaming as core infrastructure, not middleware. Just as IBM's Red Hat acquisition established containers as fundamental to enterprise cloud strategy, the Confluent deal establishes real-time data streaming as foundational for AI-era enterprises.

What this means for architects:

  1. Event streaming platforms become tier-1 infrastructure investments
  2. Data architecture decisions must prioritise real-time capabilities over traditional ETL approaches
  3. Stream-first thinking becomes the default for new application designs
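
Stream-first thinking is easier to see in code than in prose. The toy in-process event bus below stands in for a real broker such as Kafka; topic names and handlers are invented for illustration, and the point is the shape, not the transport. Producers publish events without knowing who consumes them, and new consumers attach without touching the producer:

```python
from collections import defaultdict

class EventBus:
    """In-process stand-in for a streaming platform: producers publish
    to topics, any number of decoupled consumers react."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, event):
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
audit_log, notifications = [], []

# Two independent consumers of the same event stream.
bus.subscribe("order.created", audit_log.append)
bus.subscribe("order.created",
              lambda e: notifications.append(f"notify {e['customer']}"))

bus.publish("order.created", {"order_id": "O-1", "customer": "c-42"})
```

Swapping the bus for a Kafka topic changes the infrastructure, not the design: the producer still emits facts, and consumers (human-facing services and AI agents alike) subscribe independently.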

 

2. Hybrid Cloud Data Gets First-Class Support

IBM's hybrid cloud expertise combined with Confluent's multi-cloud capabilities addresses one of the biggest enterprise challenges: data integration across heterogeneous environments.

Key architectural implications:

  • Consistent data streaming across on-premises, private cloud, and public cloud
  • Native integration with existing IBM ecosystem (Red Hat OpenShift, Watson, etc.)
  • Simplified governance for data flowing across hybrid environments

 

3. AI-Native Data Architectures Emerge

The acquisition creates the foundation for what I'm calling "AI-native data architectures": systems designed from the ground up to support AI agents and real-time decision making.

Core characteristics:

  • Always-on data streams that AI agents can consume continuously
  • Event-driven architectures that respond to real-time insights
  • Governance frameworks that ensure AI systems have access to clean, trusted data
  • Scalable processing that handles both human and AI-generated workloads

 

The Technical Evolution: What Changes for Enterprise Teams

 

Stream Processing Becomes Mainstream

Confluent's platform offers advanced stream processing capabilities, including Apache Flink integration. This acquisition will accelerate enterprise adoption of stream processing beyond traditional messaging use cases.

Practical implications:

  • Real-time analytics become standard, not exceptional
  • Event-driven microservices replace traditional request-response architectures
  • Continuous data transformation replaces batch ETL jobs
  • Stream governance becomes as important as data governance
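
The shift from batch ETL to continuous transformation is the difference between recomputing an answer nightly and maintaining it on every event. A minimal sketch of a continuously maintained sliding-window aggregate (the window size is an illustrative assumption; real stream processors such as Flink provide time-based windows, checkpointing, and exactly-once semantics):

```python
from collections import deque

class SlidingWindowMean:
    """Continuously maintained aggregate: each incoming event updates
    the answer immediately, instead of waiting for a nightly batch job."""

    def __init__(self, window=5):
        self.events = deque(maxlen=window)  # oldest event evicted automatically

    def observe(self, value):
        self.events.append(value)
        return sum(self.events) / len(self.events)

metric = SlidingWindowMean(window=3)
running_means = [metric.observe(v) for v in [10, 20, 30, 40]]
```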

 

The Kafka Ecosystem Gets Enterprise-Grade

Apache Kafka's open-source foundation gets IBM's enterprise-grade support and security model. This matters enormously for large organisations that need both innovation and stability.

Enterprise benefits:

  • Enterprise security models integrated with streaming platforms
  • Compliance frameworks for regulated industries
  • Professional services for complex implementations
  • Long-term support for mission-critical streaming infrastructure

 

Industry Impact: Winners and Implications

 

Immediate Winners

Enterprise Kafka Adopters: Organisations already using Kafka gain access to IBM's enterprise services and support ecosystem.

Hybrid Cloud Enterprises: Companies with complex multi-cloud strategies get integrated streaming capabilities across their entire infrastructure.

AI-First Organisations: Companies building AI agents and real-time decision systems get purpose-built data infrastructure.

 

Market Dynamics Shift

This acquisition forces other enterprise software vendors to reconsider their data streaming strategies:

  • Microsoft will likely accelerate Azure Event Hubs and Fabric integration
  • AWS may need to enhance Kinesis and MSK enterprise capabilities
  • Google could strengthen Pub/Sub and Dataflow positioning
  • Snowflake and Databricks may need to enhance real-time capabilities

 

What This Means for Your Enterprise Data Strategy

 

Immediate Considerations

If you're planning enterprise data architecture for the next 3-5 years, this acquisition should influence your thinking:

  1. Evaluate real-time requirements: Traditional batch processing may not support your AI ambitions
  2. Assess streaming capabilities: Current data platforms may need augmentation for real-time use cases
  3. Consider vendor consolidation: IBM's expanded platform may simplify your technology stack
  4. Plan for AI integration: Your data architecture should support both human and AI consumers

 

Long-Term Strategic Implications

The Platform Play: IBM is building an end-to-end platform for AI-driven enterprises, not just selling point solutions.

The Skills Gap: Enterprise teams will need new capabilities in stream processing, event-driven architecture, and real-time data governance.

The Competitive Advantage: Organisations that master real-time data architectures will have significant advantages in AI implementation speed and effectiveness.

 

The Bigger Picture: Enterprise AI Infrastructure Matures

This acquisition represents the maturation of enterprise AI infrastructure. We're moving beyond experimental AI projects to production-scale AI implementations that require enterprise-grade data foundations.

The combination of IBM's enterprise expertise with Confluent's streaming technology creates a platform specifically designed for the challenges of AI-era enterprises:

  • Trusted data flows that AI agents can rely on
  • Real-time governance that maintains data quality at streaming speeds
  • Scalable architecture that handles exponential growth in data and applications
  • Hybrid deployment that works across complex enterprise environments

 

The Path Forward for Enterprise Architects

As someone who has guided multiple organisations through AI-enabled transformations, I see this acquisition as validation of the architectural principles I've been advocating:

  1. Data architecture must be AI-first: Design for both human and AI consumers from the start
  2. Real-time capabilities are foundational: Batch processing alone won't support AI agents
  3. Stream processing is becoming mainstream: Event-driven architectures are the new standard
  4. Vendor integration matters: Platform plays win over point solutions

The IBM-Confluent combination creates compelling advantages for enterprises ready to embrace this evolution. But the broader implication is clear: the data architecture decisions you make today will determine your AI capabilities tomorrow.

 

Conclusion: The Future of Enterprise Data is Real-Time

IBM's $11 billion bet on Confluent isn't just about acquiring a streaming platform. It's about positioning for a future where real-time data capabilities determine enterprise competitiveness.

For enterprise leaders and architects, the message is clear: the age of batch processing and siloed data is ending. The future belongs to organisations that can connect, process, and govern data in real-time across hybrid environments.

The question isn't whether your enterprise needs real-time data capabilities. It's how quickly you can build them before your competitors do.


The IBM-Confluent transaction is expected to close by mid-2026. Enterprise leaders should begin evaluating how the combined platform might fit their long-term data architecture strategies, particularly for AI and real-time analytics use cases.