
Building a Cloud Platform: Why Programming Languages Beat DSLs


When rolling out your cloud platform, you’ll use an infrastructure-as-code (IaC) tool. This isn’t controversial. IaC has proven itself as a reliable foundation for deploying infrastructure quickly and repeatably. But which IaC tool you choose for a platform introduces constraints that most teams discover too late. This is the journey I went through.

Vendor Lock-in and the DSL Trap

In the IaC tooling landscape, we have tools that are built by cloud providers: ARM, Bicep, CloudFormation, and others. When I started as an engineer, I frequently heard architects say, “Let’s use the one provided by the vendor, they know best.”

This approach holds true as long as you stay within your cloud provider and its services. The moment you want to start gluing services together, however, vendor-specific tools become limiting. Even something as common as provisioning a PaaS database, loading a starting schema, and setting up users and roles is not possible in one go. So you need a second step: a script written in Bash, Python, or PowerShell that sets up the database once it has been provisioned.
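A typical second step ends up looking something like this (a hypothetical Python sketch using psycopg2; the host and credentials are placeholders, copied by hand from the IaC run):

```python
# Hypothetical post-provisioning step: the vendor template created the
# database server, but schema, users, and roles still live in a script.
import psycopg2

conn = psycopg2.connect(
    host="mydb.example.internal",   # output copied manually from the IaC run
    dbname="postgres",
    user="admin",
    password="from-a-secret-store",  # placeholder
)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CREATE SCHEMA IF NOT EXISTS app")
    cur.execute("CREATE ROLE app_rw LOGIN PASSWORD 'changeme'")
    cur.execute("GRANT USAGE, CREATE ON SCHEMA app TO app_rw")
conn.close()
```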

This is cumbersome and breaks the workflow. A schema, users, and roles are all part of the infrastructure; they should be managed as code, not in separate scripts. Terraform was born long ago as a cloud-agnostic tool, but its plugin architecture also turned out to be ideal for exactly this kind of problem. Through providers, it could not only deploy infrastructure across cloud vendors, it could also integrate with SaaS products like Datadog or with software like PostgreSQL. Coming from CloudFormation, the difference was immediate: one codebase where AWS resource outputs fed directly into Datadog alert definitions.

One of Terraform’s major strengths is that the end state is what matters. It uses a declarative language (HCL): I want a database, and it needs to look like this. You create a module for a database or a VPC that contains all the required resources, and from then on you can spin up databases or VPCs in minutes, where a traditional datacenter would need days or months.

But Terraform also has its limitations. Give a tool like this to engineers and they tend to start programming with it. So conditionals (early on, with count) and loops (much later, with for_each) were introduced to ease the pain of writing modules whose inputs decide what gets deployed. The result is a declarative language with imperative constructs bolted on. Still, the ability to run a plan (to check what will be deployed) before applying remains a major advantage: you can see, more or less, what a module will produce before it hits your cloud provider.

For a platform team, these limitations hurt. You’re not writing one-off infrastructure. You’re building reusable components that product teams will consume. Engineers will push DSLs past their limits, and maintenance becomes your problem.

Six years ago, I joined a team (a mix of developers and infrastructure people) that wrote its infrastructure in Python with troposphere, which compiled into CloudFormation templates. Those were deployed to AWS with tools the team had written themselves to orchestrate the stacks. Coming from Terraform, this opened my eyes. We shouldn’t be programming inside a tool like Terraform. HCL is a brilliant DSL, but it is not a programming language; it lacks the core constructs you would expect from one. Using an imperative language to produce a declarative file that could then be deployed (even with CloudFormation) was less of a struggle than writing Terraform modules.
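That setup looked roughly like this (a simplified sketch, not our actual codebase): ordinary Python, with loops and functions driving what gets generated, rendered down to a CloudFormation template.

```python
# Minimal troposphere sketch: plain Python that renders a
# CloudFormation template as JSON.
from troposphere import Ref, Template, ec2

template = Template()
vpc = template.add_resource(ec2.VPC("PlatformVpc", CidrBlock="10.0.0.0/16"))

# Ordinary Python constructs decide which resources end up in the template.
for i, cidr in enumerate(["10.0.1.0/24", "10.0.2.0/24"]):
    template.add_resource(
        ec2.Subnet(f"PrivateSubnet{i}", VpcId=Ref(vpc), CidrBlock=cidr)
    )

print(template.to_json())
```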

After a year or two, we noticed CloudFormation was lacking in third-party service compatibility and slow to support new AWS features. We set out to find an alternative. We needed a tool where we could program in a real language and still integrate with third-party services.

During that time AWS released its CDK, and CDK for Terraform (CDKTF) followed later. Both solved the syntax problem, but they still compile to a configuration file that then has to be deployed: your code runs before deployment, so you have no access to resource information while it executes. We set them aside because they gave us no real advantage over the solution we already had. Pulumi did it differently. You could program in your language of choice (in our case Python), have access to the information of provisioned resources, and still get state management and declarative behaviour during deployment.
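The difference shows up even in a trivial program (a sketch assuming the pulumi_aws provider): resource outputs are ordinary values you can keep working with in Python.

```python
# Minimal Pulumi program in Python. Outputs of provisioned resources
# are available to the rest of the program.
import pulumi
import pulumi_aws as aws

bucket = aws.s3.Bucket("platform-artifacts")

# bucket.id is only known after provisioning; apply() lets us build on it
# without leaving the language.
alert_message = bucket.id.apply(lambda name: f"Monitor bucket {name}")

pulumi.export("bucket_name", bucket.id)
pulumi.export("alert_message", alert_message)
```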

Starting with Pulumi was underwhelming at first. We saw a lot of things we already had with Python and troposphere. The providers were bridged from Terraform. Even the CLI commands were more or less the same as Terraform’s, with the same advantages: preview instead of plan, and up instead of apply. The way Pulumi handled state management was the same as Terraform.

So why did Pulumi convince me? With Terraform, if a provider doesn’t exist you’re limited to writing one in Go. With Pulumi, you can write providers in your own language, and they integrate with state just like the native ones. That random API from a product your company bought? You can integrate it directly into your IaC without learning another language. Over time we integrated a couple of those APIs, which cut down the integration work. You can also use try/except blocks to check whether a resource already exists, or specify programmatically which resources to import. These things eliminated manual steps.
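A dynamic provider for such an API can be a few dozen lines of Python. This is a sketch with a hypothetical internal DNS service; update/delete and error handling are omitted, and the resource shape is illustrative only.

```python
# Sketch of a Pulumi dynamic provider wrapping an in-house API.
# "InternalDns" and its behaviour are hypothetical; the pattern is the point.
from pulumi.dynamic import CreateResult, Resource, ResourceProvider


class InternalDnsProvider(ResourceProvider):
    def create(self, props):
        # Here you would call the product's API (e.g. with requests)
        # and return the identifier and outputs it gives back.
        record_id = f"{props['name']}.{props['zone']}"
        return CreateResult(id_=record_id, outs=props)


class InternalDnsRecord(Resource):
    def __init__(self, name, props, opts=None):
        super().__init__(InternalDnsProvider(), name, props, opts)


record = InternalDnsRecord("app-dns", {"name": "app", "zone": "internal.example"})
```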

The Automation Advantage

Platform teams don’t just provision infrastructure. We manage migrations, enforce policies, and generate documentation. We want to maximize automation of these processes. With Terraform, you need external orchestration for this. Pulumi’s Automation API builds it into the tool.

The biggest selling point was the Pulumi Automation API. It exposes a programmatic way to configure and drive Pulumi: you can define where your state comes from, control the flow, glue together multiple executions, and more. This let us automate all our processes, from deploying to migrating, in a single Python CLI tool. And because all the information is available in Python, we also generate Markdown documentation automatically and publish it to an internal site. The infrastructure settings are documented straight from the code, so there is no manual documentation to maintain, freeing up our team to do other things.
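In outline, driving Pulumi from Python looks like this (a minimal sketch; the project, stack, and program are illustrative, not our actual tool):

```python
# Sketch of the Pulumi Automation API: configure and run a stack from Python.
import pulumi
from pulumi import automation as auto


def program():
    # An inline Pulumi program; real ones would declare infrastructure here.
    pulumi.export("greeting", "hello from the platform")


stack = auto.create_or_select_stack(
    stack_name="dev",
    project_name="platform",
    program=program,
)
stack.set_config("aws:region", auto.ConfigValue("eu-west-1"))

preview = stack.preview()            # equivalent of `pulumi preview`
result = stack.up(on_output=print)   # equivalent of `pulumi up`

# Outputs are plain Python objects, so generating Markdown documentation
# from them is just string handling.
outputs = stack.outputs()
```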

Adoption

Adopting a tool like Pulumi, certainly as a platform team that focuses on infrastructure, is not a walk in the park. Documentation for the more advanced parts of Pulumi is sometimes lacking, so you will want some application development experience in the team so that the basics of your application are strong. Do not expect to get it right on the first go; some things are trial and error, and be prepared to rewrite parts of your codebase. We rewrote ours four times as we learned. The biggest rewrite was adopting a plugin-style approach, where the CLI application together with the service catalog (our standard infrastructure) is a separate Python project, and the projects that contain team-specific infrastructure are imported based on their folder name.
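The loading mechanism is simpler than it sounds. Roughly (a simplified sketch; the package name and the deploy() convention are hypothetical):

```python
# Sketch of plugin-style loading: each team's infrastructure lives in its
# own folder and is imported by name, keeping the CLI and service catalog
# in a separate project.
import importlib
from pathlib import Path


def load_infrastructure(root: Path):
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        module = importlib.import_module(f"infrastructure.{folder.name}")
        module.deploy()  # convention: every plugin exposes a deploy() entry point
```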

Conclusion

Platform teams face a different problem than application teams. You’re not deploying infrastructure once. You’re building systems that other teams use repeatedly. Migrations, policy enforcement, and documentation need to be automated, not manual.

We use Pulumi’s Automation API to codify these processes. For example, our firewall whitelisting is a tool that reads a YAML file as the input for firewall rules. A firewall change is a YAML edit and a pull request: pure GitOps. The approval is tracked in the pull request, the firewall updates automatically, and the CISO gets up-to-date documentation. Not every process belongs in code, but the ones that do eliminate waiting on the platform team.
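The core of such a tool is small. This is a sketch only: the file name, rule shape, and the choice of an AWS security group are hypothetical stand-ins for our actual setup.

```python
# Sketch of a GitOps-style firewall tool: the YAML file in the repo is the
# single input, and the Pulumi program turns it into rules.
import yaml
import pulumi_aws as aws

with open("firewall_rules.yaml") as f:
    rules = yaml.safe_load(f)["rules"]

group = aws.ec2.SecurityGroup(
    "platform-ingress",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp",
            from_port=rule["port"],
            to_port=rule["port"],
            cidr_blocks=[rule["source"]],
            description=rule["reason"],  # the PR carries the approval trail
        )
        for rule in rules
    ],
)
```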

With these things in mind, the platform team doesn’t block product teams from shipping.