Optimize Development Cold Start

Zach Wolfe · Published in CodeX · 6 min read · Mar 4, 2022


Photo by Yulia Matvienko

In my career at Amazon as a software engineer, I’ve contributed to over 200 packages across 43 applications and services for 6 software products, from high-throughput AWS services to customer-facing web apps to infrequently-changed internal tools.

As somewhat of a nomad, having contributed to so many different applications, each for a relatively short period of time, I’m more aware than most of the problem of the “Cold Start” development experience.

After another team transfer in the last few months, I’ve realized yet again just how impactful the Cold Start development experience is to a software product, and I’m more convinced than ever that it’s worth tracking and optimizing.

Cold Start Development

I define the Cold Start development experience of an application as the process an engineer uses to change that application for the first time, or after a long hiatus.

Cold start development experience is one component of “engineer onboarding,” but is limited to the software development lifecycle. It includes things like granting permissions to source code, encryption keys, databases, and other resources; one-time configuration; and bootstrapping software, dependencies, deploy tools, and hardware.

In this context, Cold Start is what it takes for a typical new-to-the-team engineer to make the simplest possible change to an application (think: Hello World).

I use the term “Cold Start” as I’m thinking a lot about the Cold Start problem of Function as a Service products these days, and the analogy between Functions and Engineers works pretty well.

“Operating Lambda: Performance optimization”

The Impacts of Cold Start on Software

Below I argue that the Cold Start development experience of an application impacts nearly every major facet of software and software teams: productivity, availability, quality, security, fungibility, morale, and attrition.

Engineer Productivity

The most obvious thing Cold Start impacts is productivity. Considering the low average tenure and high turnover rate of software engineers, a long cold start meaningfully reduces the total contribution an average engineer will make to a product.

If an average engineer will only be expected to spend 18 months on a team that owns a dozen or so actively-changing applications, each incurring a day-long cold-start that needs to be done once every 6 months or so, the opportunity cost to the business is a meaningful amount of an engineer’s total time on the team. Factoring in the effort needed to hire engineers, I’d argue this cost is in the tens of thousands of dollars per year per team spent setting up development environments. That’s silly!
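To put rough numbers on that claim, here is a back-of-envelope sketch. The team size, cadence, and daily cost figures are illustrative assumptions, not data from any real team:

```python
# Back-of-envelope opportunity cost of cold start for one team.
# All figures below are illustrative assumptions.
apps = 12                      # actively-changing applications the team owns
cold_start_days = 1            # engineer-days lost per cold start
starts_per_app_per_year = 2    # each app cold-started roughly every 6 months
cost_per_engineer_day = 1_000  # assumed fully-loaded cost, in dollars

days_lost = apps * cold_start_days * starts_per_app_per_year
annual_cost = days_lost * cost_per_engineer_day
print(f"{days_lost} engineer-days lost, ${annual_cost:,} per year")
# → 24 engineer-days lost, $24,000 per year
```

Tweak the assumptions however you like; the total lands in the tens of thousands of dollars per team per year either way.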

Availability and Lost Revenue

For products with an on-call rotation that includes software engineers, a long cold start is also an availability risk. I’ve seen more than one outage meaningfully prolonged because the on-call engineer spent hours struggling to set up the service just to fix a bug.

When a service has tight availability SLAs with many, many “nines,” an hour wasted can be devastating. Reading a Correction of Error timeline that includes an on-call spending hours setting up a service is embarrassing (setting aside the obvious point that this should have been done beforehand).

Factoring in lost revenue per minute of outages, the cost of a long cold start can be enormous.

Quality

When a dev is tasked with building a feature, any additional time spent just getting the application to run eats into the time allotted to complete the feature. When pressed for time, quality tends to suffer first.

Additionally, a painful cold start setup experience contributes to the effect of the Broken Window Theory on that application. If the application already takes 3 days to set up, what’s another 5 minutes?

Another facet of software quality impacted by cold start is architectural cohesion. In a service-oriented architecture, an unfortunate side effect of a long cold start can be a short-sighted motivation to put logic where it doesn’t belong, just to ship faster or avoid friction. It’d be a shame for architectural cohesion to be compromised by a one-day barrier, but it happens.

Security

Recall the recent high-profile Log4j/JNDI zero-day that set the internet on fire. When every second increases the likelihood that a web application is compromised, Cold Start is an existential threat. If applying a public zero-day patch is delayed for hours because a developer is figuring out for the first time (at 3 a.m.) how to build an application no one on the team has touched in a year, customer trust can evaporate.

image from log4jmemes.com

Knowledge Silos and Engineer Fungibility

Cold start also creates silos of expertise that don’t have to exist, and they are hugely detrimental. In my experience, self-organizing teams distribute tasks to engineers who already have environments set up instead of distributing that work evenly across available peers. It’s natural to want to save wall-clock time by giving a task to the person who already has the dev environment running, even though long-term it’s bad for team health.

This application-setup-induced siloing can be toxic. It leaves some engineers “stuck” primarily working on tasks in a particular part of the system, simply because the team is too lazy to set up the environment. In my experience, if an application has a long cold-start setup cost, it usually also shows other hallmarks of bad software: a high degree of coupling, complexity, and/or poor test coverage.

Having seen this situation more than once, I can say it can be demoralizing to be the default choice for tasks on systems that suck, simply because you happened to get them running most recently.

Team Morale and Attrition

A particularly painful experience getting software set up initially is also a factor in attrition.

While just an anecdatum, I once watched a new engineer make an abrupt “about-face!” and transfer away from the team after fighting with a particularly annoying-to-setup application, citing setup difficulty as a major complaint on their way out (though one of several).

As mentioned in the section on siloing above, applications with a long cold-start setup time typically come with other bad software patterns. At the risk of personifying software, friction in first-time environment setup makes the application seem like it is actively resisting change, leading to discontent and a desire to avoid that component.

How to Measure and Judge Cold Start

Engineering leadership can measure this Cold Start simply: Observe a new hire or recent transfer when they make their first change to the system. The time between feature/bugfix assignment and an exclamation of “I finally got it to run!” is the Cold Start.
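The arithmetic is trivial once those two timestamps are recorded. A minimal sketch, with made-up dates standing in for a real assignment and a real “it runs!” moment:

```python
from datetime import datetime

# Hypothetical measurement: Cold Start = time from task assignment to the
# first successful local run of the application. Dates are made up.
assigned = datetime(2022, 3, 1, 9, 0)                # bugfix assigned
first_successful_run = datetime(2022, 3, 3, 16, 30)  # "I finally got it to run!"

cold_start = first_successful_run - assigned
print(f"Cold Start: {cold_start.total_seconds() / 3600:.1f} hours")
# → Cold Start: 55.5 hours
```

Even tracking this informally, per new hire, gives the team a trend line to improve against.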

A Cold Start of a few minutes is achievable, even for the hairiest legacy systems I’ve seen. A Cold Start on the order of days is more common than I’d like to admit, and it’s unacceptable.

If you’re considering a team change, ask your to-be teammates how long it would take to get a one-line change from “New Laptop, Who ‘Dis?” into the CD pipeline on their most-neglected service or application. That answer can tell you a lot!

How to Improve

In applications that have been around long enough to have long Cold Start, there are probably dozens of obvious, low-hanging opportunities to improve.

I encourage anyone still reading to use this framework for identifying those opportunities.

  1. Ask new developers what the most painful part was. Improve that first.
  2. Can bootstrapping tools needed to build the application be installed with scripts?
  3. Does your development environment really need that resource? Consider mocking dependencies locally, using less-secure proxies instead of SSL certificates (in a development environment!), using static instead of dynamic resources.
  4. Collect and list the steps developers take to cold start an application. Take the day to eliminate or automate just one of them. Do this once a month or once a quarter until you can’t.
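As a sketch of point 3, a development build might swap a real datastore client for an in-memory fake behind the same interface, so new engineers can run the app without provisioning credentials. Every name and environment variable here is hypothetical:

```python
import os

class InMemoryStore:
    """Fake datastore: no network, no credentials, instant cold start."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value

def make_store():
    # In a real codebase the non-dev branch would wire up the real client;
    # here it is just a placeholder for the idea.
    if os.environ.get("APP_ENV", "dev") == "dev":
        return InMemoryStore()
    raise NotImplementedError("production client wiring goes here")

store = make_store()
store.put("greeting", "Hello World")
print(store.get("greeting"))  # → Hello World
```

The point isn’t the fake itself; it’s that the default path (no env var set, no credentials granted) runs immediately.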

Conclusion and References

This isn’t a particularly new topic, but recent experience has compelled me to complain about it.

If you’re looking to read more on the topic, check out The Joel Test and this awesome article, “Want a High Performing Dev Team? Improve Your Time to Productivity.”

If you found this piece useful, please drop a comment, clap, or @ me!
