How Amazon Web Services Uses Formal Methods (2015) [pdf]

mindcrime | 199 points

One of the teams I worked for in AWS used TLA+ on a new back-end feature that was added. It was one that, if things went wrong, would be extremely bad for customers. The needs for operating on scale only added to the necessary complexity. Paranoia was key.

They'd estimated something like an optimistic 4 months for the work.

Two of the more senior developers were assigned the task, and they had to learn TLA+ and formal methods from scratch. (one dev's main complaint was that he was forced to use Eclipse, I don't know if integration has now happened with other IDEs since then).

It took a little bit for them to get up to speed, but they were soon implementing and catching bugs in the proposed designs, and they had all the bugs ironed out within a few weeks.

They then found that going from TLA+ to actual Java code was almost a "fill in the blanks" exercise. The formal model provided almost the exact structure they needed for their final code. Going from model to code took them very little time at all. It was in production, running in shadow mode after not even 2 months, and was fully live not long afterwards.

TLA+ got subsequently used for a number of other features added to the platform.

There had been general push back among developers that learning TLA+ and doing formal verification was only going to slow them down, but we found it was quite to the opposite. TLA+ actually sped up development, with the big bonus of higher confidence of correctness.

Twirrim | 4 years ago

Bryon Cook, a principal scientist at Automated Reasoning Group gives an overview of various solvers in-use today at AWS: https://www.youtube.com/watch?v=UKqVY0SSbus

Here's a presentation by Eric, a distinguished engineer, and Neha, a principal engineer, on analysing IAM policies (Zelkova?) and Network reachability (Tiros?): https://www.youtube.com/watch?v=x6wsTFnU3eY

This webpage has many more links to blog articles on the topic: https://aws.amazon.com/security/provable-security/

ignoramous | 4 years ago

Questions I wondered about that weren't answered until the end of the paper:

Q: Do engineers actually implement AWS systems in TLA+? A: No, TLA+ is called "exhaustively testable pseudo-code" internally, and is only used for designs.

Q: Which AWS systems use it? A: The first was DynamoDB, followed by S3, and now there are 8 additional complex systems that use it for designs.

Q: Who uses TLA+ at Amazon? A: All engineers, from entry level to principal.

I thought it was especially interesting that several engineers at AWS were effectively writing their own model checkers by writing programs that tried to brute force rare combinations of conditions that would lead to system failure. Obviously TLA+ does this better and faster so they switched to that.

lwb | 4 years ago

I’ve started using TLA+ in my spare time. There are several good sites and books to learn more about it:

- https://learntla.com/introduction/

- Pracitcal TLA+

It’s much easier to get running with than other formal methods tools like Isabelle and Coq.

exdsq | 4 years ago
dang | 4 years ago

Does anyone have resources that talk about writing a TLA+ model for an existing software system as opposed to designing a new system using TLA+ as the design reference?

_cairn | 4 years ago

I introduced TLA+ to help understand underlying components in the OpenStack ecosystem and at small companies to solve thorny problems. It's well worth learning as it gives you a very different perspective on the design and construction of systems.

In addition to the books and sites mentioned here, there's also a video course with tutorial content produced by the creatore of TLA+: https://lamport.azurewebsites.net/video/videos.html

agentultra | 4 years ago

How do you go from something like TLA+ to a common language and keep the correctness? Are assertions the only way?

jayd16 | 4 years ago

One thing I'm wondering is whether (or rather, what kind of) bugs arise when you try to translate a model (described as "exhaustively tested pseudocode" elsewhere in this thread) into a real system. For example, if there is if-branch somewhere that gets missed etc?

martythemaniak | 4 years ago