An introduction to Formal Verification for Software Systems
By Santhosh Raju, Kamil Rytarowski
- 12 minutes read - 2525 wordsWhen engineers were designing completely autonomously operated Paris Métro line 14, they had to ensure the safety of tens of millions passengers each year, its 22 stations across Paris as well as smooth running of trains. How to approach the development of software that allows for the automation of public transit? The French team decided to use formal verification – learn what it means!
We will go through an introduction to formally verified software, note when formal verification gives most returns and present two introductory examples. The first example will present a turnstile logic with a flip coin and the other one will be a real, but still basic example of TLA+ methodology on the Die Hard with a Vengeance problem with two jugs.
After reading the article you will gain an overview on fundamentals of Formally Verified Software and will be aware of the method to avoid software bugs which can cause your business to implode.
Why do systems need Formal Verification?
In engineering we have methods to validate (i.e. “Are we trying to make the product meet the user’s requirements?") and verify (i.e. “Does the product conform to the validated specification?"). In the realm of computer hardware Formal Verification is a pretty old concept and has been in existence since 1984 with tools like Verilog and now superseded by tools like SystemVerilog. Such tools have become part of the IEEE specifications for designing and verifying hardware.
However, in the realm of software engineering even though we have had verification tools that have existed since late the 1980s, it still has not been standardized nor is in widespread use in comparison to its hardware counterpart. Even when we see the use of Formal Verification, we can see that it is usually limited to specific fields like cryptography or communication protocols and is frequently done by the academic circles. A good example of implementing formal methods of verification is the automation of Paris Métro line 14. With the ever increasing complexity of software and the layers of abstraction, we have reached a time when writing secure, efficient and resilient code requires some level of formal verification to be done, if not for the whole software at least for the important sub-systems involved. In recent times we have seen more widespread adoption of formal verification by the industry leaders like Intel, Amazon or Microsoft, in products where we have enormous complexity and multiple systems interacting with one another.
What do we formally verify?
In the realm of software engineering, the first step of formal verification is asking the following questions:
- What do we intend to verify?
- How do we verify?
What do we intend to verify?
Setting out to verify the whole system and every line of written code, even though theoretically possible, is a beginner’s mistake. The art of engineering is about being smart enough to know which corners to cut and which ones not to cut. This is not an easy task - it takes time, practice and experience to take decisions on this. We should also keep in mind writing “verification” for a system is not the same as writing tests (Unit / Functional) and fuzzing for it. Writing tests and achieving 100% code coverage does not guarantee that the system works as intended and it is only as good as tests themselves (after all, who tests the unit tests?). Verification and testing can be considered to be two sides of the same coin and one complements the other, completing the picture.
Formal verification involves writing proofs on an abstract “mathematical model” of the system. Depending on the design of the system, there are various options available which can be used as the basis for writing the proof of verification. Let us take the case of Tamarin Prover – the tool used for symbolic modeling and analysis of security protocols. Here abstractions happen at the protocol level and agents that interact with the protocol. This abstraction works well for security and cryptography related code. Our approach discussed in this article would be a more fine grained approach, trying to verify finite state machines and this is called model checking.
How do we verify?
Finite state machines are the simplest mathematical models of computation where we represent states and the rules to transition from one state to another. As an illustrative example let us take a closer look into the working of a coin operated turnstile.
- We have a coin operated turnstile which is either in a “closed” state or “open” state.
- We have a couple of operations that we can do on such a turnstile
- We can “push” it.
- We can “insert coin” into it.
- Applying the operations on a given state allows you transition between states or keeps you in the same state depending on which operation is applied to which state.
The above description can be represented as a diagram and this basically constitutes the representation of a finite state machine.
Our aim is to verify that given this state machine, the turnstile works as expected and more importantly, it does not enter an unexpected state. For example, we do not want the turnstile to go “unlocked” without a coin or remain “unlocked” after a push has been made.
A very simple way of thinking about this is to consider the state machine in a linear timeline and the actions and transitions in this linear timeline.
Our reasoning on the working of the machine and the states it takes up is not absolute :
- The turnstile is not guaranteed to be “open” at all points in time (or “closed” for that matter)
- The statement “The turnstile is open” has truth values that would vary with time
All this forms the basis of temporal logic.
NOTE: One of the basic things to keep at the back of the mind would be that time continues to move forward in this universe even if no further changes occur in the state machine.
So by verifying the behavior of the state machine through time one can in turn verify the behavior of the system and its implementation. Of course things do get complicated when these state machines are distributed over a network and things happen in a concurrent system.
Tools of the trade
Let’s take a look at a very specific tool called TLA+, which can be used to achieve state machine verification for concurrent systems.
Why TLA+?
TLA+ has been around since 1999, and is developed by Leslie Lamport, a Turing award winner. In addition to writing formal specification, it can also be used to design, model, document and verify programs, especially concurrent systems and distributed systems. This is a good toolkit to have since many of the systems level applications and blockchain applications tend to have a combination of distributed and concurrent systems at play.
TLA+ has been used to write up systems level proofs for things like Memory Cache coherence protocols to distributed consensus protocols like Raft. In addition to this, TLA+ specification is also LaTeX compatible making for an excellent way to generate documentation of the proofs.
A simple TLA+ example
Unfortunately, TLA+ does not come with a run off the mill “Hello, World”, like most other programming languages. And we have to take a slightly bigger example to understand the basics of TLA+. We shall choose one of the basic introductory examples given in the TLA+ tutorials called Die Hard problem.
The name of the problem comes from the movie “Die Hard with a Vengeance”, where the two heroes of the movie have to stop a bomb from exploding by placing 4 gallons of water in a jug. The twist being that the heroes are provided only with a 3-gallon jug, 5-gallon jug and a water faucet.
We solve this problem by starting to fill 5-gallon jug. This is the approach taken in the movie.
- Make sure both the jugs are empty at the start.
- Fill the 5-gallon jug.
- Empty the 5-gallon jug into the 3-gallon jug.
- We have 2 gallons of space in the 5-gallon jug.
- Empty 3-gallon jug on to the ground.
- Pour the remaining 2 gallons in the 5-gallon jug into the 3-gallon jug. (1 gallon space remaining in 3-gallon jug)
- Fill the 5-gallon jug.
- Pour 1 gallon out from the 5-gallon jug into the remaining space in the 3-gallon jug.
By the end of this manual process we should have exactly 4 gallons of water in the 5-gallon jug.
There is an alternative solution to this problem, by starting to fill the 3-gallon jug. If can find it out, leave your solution in a comment.
The manual description of the solution happens in discrete steps, like following states in a finite state machine. Each state is well defined and has a path to the next state.
Now we are going to take a look at the solution from a TLA+ perspective.
We are not going to do any fancy stuff with TLA+ for now. But we will be exploring the basic features of TLA+ and how it works, namely making it step through a Finite state machine.
NOTE::TLA+ uses a breadth-first search approach when going through a state machine
Let us have a look at the TLA+ description of the Die Hard problem.
-
Variables - These describe specification of the variables (jugs) in our problem.
VARIABLES big, \* The number of gallons of water in the 5-gallon jug. small \* The number of gallons of water in the 3-gallon jug.
-
Invariants - These describe the invariants within our system, in our case the values for the 3-gallon jug has to be between 0 and 3 (inclusive) and likewise for the 5-gallon jug this has to be between 0 and 5 (inclusive). The invariants keep a sanity check on the operations we do on the jug. This helps to keep having operations that may result in negative values in our jugs or have overflow conditions like having 4 gallons in the 3-gallon jug.
TypeOK == /\ small \in 0..3 /\ big \in 0..5
-
Initial state - These define the entry point to state machine, and we make sure both jugs are at 0 gallons at start.
Init == /\ big = 0 /\ small = 0
-
Actions - These define the various actions that are possible from a given state in the machine that helps it to advance to the next state within the state machine. We list the various actions possible below, actions are defined independently for the 3-gallon jug and the 5-gallon jug.
-
Filling the jug (3 / 5)
FillSmallJug == /\ small' = 3 /\ big' = big FillBigJug == /\ big' = 5 /\ small' = small
-
Emptying the jug (3 / 5)
EmptySmallJug == /\ small' = 0 /\ big' = big EmptyBigJug == /\ big' = 0 /\ small' = small
-
Transferring the water (3 to 5 and 5 to 3)
SmallToBig == /\ big' = Min(big + small, 5) /\ small' = small - (big' - big) BigToSmall == /\ small' = Min(big + small, 3) /\ big' = big - (small' - small)
You will also notice that, for each action, we also make sure the next state of the unused jug (for example when filling or emptying), is mentioned not to change. This is important when specifying actions to make sure that we mention what has changed in that state and what has not changed so that the system has no undefined state for any of the jugs at any given point in time.
-
-
Next state - These define all possible next states in our system using both “disjunction” (OR) and “conjunction” (AND). In the Die Hard example all actions are possible independently of each other. Hence we use “disjunction” to describe the “Next” state from our initial state.
Next == \/ FillSmallJug \/ FillBigJug \/ EmptySmallJug \/ EmptyBigJug \/ SmallToBig \/ BigToSmall
-
Specification - Finally we define our specification to be a conjunction of our Initial state and Next state.
Spec == Init /\ [][Next]_<<big, small>>
Finding the solution
We need 4 gallons of water to prevent the bomb from
exploding! This invariant will make sure all the states not leading to the 5-gallon jug
do not have 4 gallons as errors since they are not a part of the solution we seek.
For this we have a predicate called NotSolved
asserting that big
is not
equal to 4.
NotSolved == big # 4
With the above description we have managed to describe the manual process of finding the solution to a state machine that can be run within TLA+ with well defined actions on how to reach each of the states.
The TLA+ main window presented above shows the loaded program DieHard.tla.
The result of the model checker shows the various reachable states in our simulation.
We have also set the check on the TypeOK
invariant to make sure
the formula holds true for every reachable state. The key to finding the
solution is the big /= 4
invariant. This way we make the model checker show us
a behavior ending with the 5-gallon jug nor equal 4 is false.
The error trace shows the case (the last entry) where the invariant becomes false and our desired solution to the Die Hard Problem.
Each of the steps in the error trace shows how the jugs have to be filled so that we end up with the 4 gallons of water in the 5-gallon jug.
Hence running the simulation within TLA+ has helped our heroes to come to this solution automatically, by just defining the model in the formal language. The bomb has been disarmed on time!
Conclusion
We have demonstrated how we can use TLA+ to play around with states of a finite state machine. However, we still have not actually stepped into the powerful tools available in TLA+ like emulating passing around messages, or declaring a theorem for our proof.
In the next part we shall explore the details of using TLA+ to write a verification for a specification from a description written in an RFC.
Formal Verification is gaining importance with the ongoing trend to automate further aspects of our life. The Paris Métro line 14 mentioned above uses another way of formal verification called “B-Method”, which uses abstract state machines, set theory and first-order predicate logic to help define systems. This approach is different from TLA+ as it has a tight loop between abstract machine and implementation, and uses refinement technique to match the abstract machine with the implementation. The approach of formally verified software in the Paris Métro gave significant returns to the city and motivated the switch to autonomous orchestration of another Paris Métro Line 1.