Saturday, October 23, 2010

Software Testing and Forensics Engineering

When we think about reliability in software development, we often focus on the robustness of the application itself. Is it designed properly, will it fail cleanly if a hardware failure occurs, and will it stand up under load? We create test teams to exercise the software while it's being developed to ensure it's ready when we release it to our customers. The test team is a critical part of the software development process, and it's often one of the most difficult parts of a software development project to manage. How do we know we have a healthy test team? How do we know that our test team will continue to function efficiently throughout the duration of the project? What do we do if our test team breaks down?

In solving problems, whether we are consciously aware of it or not, we use tools called heuristics. A heuristic is a fallible method of solving a problem. Heuristics are things like rules of thumb, roadmaps or methodologies, and models that describe the behavior of a system. A good example of a heuristic would be forensic engineering. Forensic engineering is the science of understanding what causes structural failure, and it is commonly applied to vehicle or aircraft incidents. We call a heuristic a fallible method because it may or may not work every time it is applied, and from experience we know that not every accident or failure can be explained. The fallible nature of heuristics gives us the freedom to take a heuristic from one area of inquiry and apply it somewhere else with the goal of shedding more light on the system we are trying to understand.

So, how might we learn more about software test teams by applying the methods of forensic engineering? A vehicle or aircraft is a self-contained, complex system operating under specific environmental conditions. A test team can also be considered a self-contained system operating within the conditions of a software or hardware development project. A vehicle or aircraft is a moving body and is therefore governed by the laws of physics. A test team can also be considered a moving body that interacts with external entities like the system under test, business analysts, project managers, architects, and developers. A vehicle is designed to operate safely within the laws of physics, but when the right conditions occur a collision becomes inevitable. The same can be said of a test team, and using the methods of forensic engineering we can understand how a test team may behave when adverse conditions present themselves.

One of the first tasks of the forensic analyst is to review the conditions or context of the failure. In the case of a vehicle collision, how fast was the vehicle going at the time of the collision? How heavy was the vehicle? The speed and weight of the vehicle tell the analyst how much momentum the vehicle had before the collision occurred. What were the road conditions? Could the tires create enough friction to stop the vehicle before the collision occurred? In other words, what were the physical laws governing the system when the collision occurred? One of the most basic laws applied in this instance is the law of conservation of linear momentum.

The law of conservation of linear momentum states that the total momentum of a closed system of objects (which has no interactions with external agents) is constant. Here is the law expressed as a mathematical formula:

m1u1 + m2u2 = m1v1 + m2v2

Momentum is the product of mass and velocity, so the first object's momentum before the collision is its mass (m1) times its velocity (u1). The law states that the momentum of the first object (m1u1) plus the momentum of the second object (m2u2) before a collision must equal the total momentum of the two objects after the collision (m1v1 + m2v2), where v1 and v2 are their velocities after the collision. Velocity is a vector quantity because it has both direction and magnitude. Think of velocity as an arrow; the point of the arrow indicates direction, and the length of the arrow indicates speed.
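As a minimal sketch of this bookkeeping, the following Python snippet checks the formula for a collision along a straight line. The function names and the sample masses and velocities are assumptions chosen for illustration; they are not measurements from a real incident.

```python
def momentum(mass, velocity):
    """Linear momentum of one object (1-D, so the sign encodes direction)."""
    return mass * velocity


def total_momentum(bodies):
    """Sum of m * v over a list of (mass, velocity) pairs."""
    return sum(momentum(m, v) for m, v in bodies)


# Hypothetical values: a 1500 kg car at 20 m/s hits a stationary 1000 kg car,
# and the two move off together at 12 m/s (positive = forward).
before = [(1500.0, 20.0), (1000.0, 0.0)]  # (m1, u1), (m2, u2)
after = [(1500.0, 12.0), (1000.0, 12.0)]  # (m1, v1), (m2, v2)

print(total_momentum(before))  # 30000.0
print(total_momentum(after))   # 30000.0
```

Both totals agree, which is what the law requires of a closed system.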

How might this formula describe a test team? A test team has mass in the form of testers. A test team also has velocity: its direction is the set of test scenarios it is executing, and its speed is the rate at which it executes them. We might also define the velocity of a test team by the rate at which it finds bugs in the system. What's interesting is that these variables often change over the duration of the project, and we don't always understand why. If the law of conservation of linear momentum can be applied to test teams, and the momentum of a test team changes, then it must follow that the test team has experienced a collision.
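To make the analogy concrete, here is a rough sketch that treats team momentum as testers multiplied by scenarios executed per day, and flags weeks where that figure drops sharply. Everything in it is an assumption made for illustration: the function names, the 30% threshold, and the weekly figures are hypothetical, and a real team would choose its own measures.

```python
def team_momentum(testers, scenarios_per_day):
    """Analogy only: 'mass' is the tester count, 'speed' is the execution rate."""
    return testers * scenarios_per_day


def find_collisions(weekly, drop_threshold=0.3):
    """Return labels of weeks whose momentum fell by more than
    drop_threshold (as a fraction) relative to the previous week."""
    flagged = []
    for (_, prev_t, prev_r), (label, t, r) in zip(weekly, weekly[1:]):
        prev = team_momentum(prev_t, prev_r)
        cur = team_momentum(t, r)
        if prev > 0 and (prev - cur) / prev > drop_threshold:
            flagged.append(label)
    return flagged


# Hypothetical project data: (week, testers, scenarios executed per day)
weekly = [
    ("week 1", 5, 20),
    ("week 2", 5, 22),
    ("week 3", 4, 10),
    ("week 4", 4, 12),
]

print(find_collisions(weekly))  # ['week 3']
```

A flagged week does not say what the team collided with; it only marks where to start asking.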

Collisions come in two different forms: elastic and inelastic. Momentum is conserved in both; what differs is what happens to the kinetic energy (energy of motion). An elastic collision is one where the total kinetic energy of two bodies before a collision is equal to the total kinetic energy after the collision. Think of two identical billiard balls in space. A moving billiard ball collides head-on with a stationary billiard ball and stops. The first billiard ball now has zero kinetic energy while the second billiard ball travels away with all of the kinetic energy of the first ball. An inelastic collision is one where the kinetic energy of the two bodies is not conserved after a collision. Think of a golf ball hitting the green. The surface of the green deforms, absorbing a lot of the kinetic energy of the golf ball and causing it to slow down or stop completely.
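The difference between the two forms can be sketched in a few lines of Python using the standard one-dimensional collision formulas. The function names and unit values are assumptions for illustration, and the inelastic case shown is the extreme, perfectly inelastic one in which the two bodies stick together.

```python
def elastic(m1, u1, m2, u2):
    """Final velocities after a 1-D elastic collision."""
    total = m1 + m2
    v1 = ((m1 - m2) * u1 + 2 * m2 * u2) / total
    v2 = ((m2 - m1) * u2 + 2 * m1 * u1) / total
    return v1, v2


def perfectly_inelastic(m1, u1, m2, u2):
    """Final velocities when the two bodies stick together."""
    v = (m1 * u1 + m2 * u2) / (m1 + m2)
    return v, v


def kinetic_energy(m1, v1, m2, v2):
    """Total kinetic energy of the two bodies."""
    return 0.5 * m1 * v1 ** 2 + 0.5 * m2 * v2 ** 2


# Hypothetical billiard balls: equal unit masses, one moving at 2 m/s.
m1, u1, m2, u2 = 1.0, 2.0, 1.0, 0.0

print(kinetic_energy(m1, u1, m2, u2))                 # 2.0 before the collision
print(elastic(m1, u1, m2, u2))                        # (0.0, 2.0)
print(kinetic_energy(m1, *elastic(m1, u1, m2, u2)[:1], m2,
                     *elastic(m1, u1, m2, u2)[1:]))   # 2.0, nothing lost
print(perfectly_inelastic(m1, u1, m2, u2))            # (1.0, 1.0)
print(kinetic_energy(m1, 1.0, m2, 1.0))               # 1.0, half the energy lost
```

In both cases the total momentum stays at 2.0; only the kinetic energy distinguishes the two outcomes.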

What kinds of collisions could a test team experience that might affect its momentum? External collisions could come in the form of bad results from the system under test. News that the team is testing the wrong business processes can also be considered a collision. These two collisions share common characteristics. They both come from outside the test team, and they both carry a large quantity of their own momentum. After the collision occurs, the test team may be in a very different state than before. It may lose all its momentum but still remain intact.

Let's look at the first case, which could represent an inelastic collision. Bad results do not mean that the tests failed. Often we want tests to fail so we can find bugs and understand the conditions that cause failure. Bad results are results that destroy confidence in the team. One of the worst things that can happen to a test team is to have people lose confidence in the results it is producing. This can happen when the test team has little control over the system under test. The testers run a set of tests one day, run the same tests another day, and get different results. If they can't explain the reason for the different results, everyone begins to focus on the skills of the testers, the tools they are using, and their testing approach. In other words, the integrity of the test team comes under stress, and it becomes doubtful whether the team will hold together when the next collision occurs. Team members may leave the project, or the project manager may decide to remove a tester from the project or stop testing altogether.

The second case represents more of an elastic collision. The team already has momentum, but the new business requirements cause the team to change direction. Changing direction is not necessarily a negative impact unless the team ends up going backwards. The team may have to start over or re-test its earlier results, but it still remains intact and is capable of regaining its momentum. It's also possible that the new business requirements provide much-needed focus for the test team and end up increasing its momentum.

Crash forensics is also applied in the field of aircraft safety. The investigation into AF 447 is one of the most closely followed crash investigations of recent years. AF 447 was a passenger aircraft that disappeared while crossing the Atlantic. Eventually some debris was found, but the location of the crash has so far made it impossible to recover the flight recorders. The cause of the crash has not been conclusively determined, but the air safety authorities ordered that pitot tubes be replaced on similar aircraft. It is unlikely that a pitot tube alone caused the crash, but the authorities had to produce some kind of response to the tragedy to restore confidence in the industry.

A pitot tube is a device on an aircraft that measures airspeed. If a pitot tube ices over in bad weather, the airspeed readings become unreliable and the automated systems on the aircraft can disengage or malfunction. Airspeed is only one small part of the equation in our law of conservation of linear momentum. By focusing on the potential for a pitot tube failure, the authorities used a small component of the system to explain the failure of the more complex system as a whole.

In testing, when a team experiences an inelastic collision similar to AF 447, management or the test team is often tempted to 'fix the problem' by implementing radical changes that may not really address it. Imagine a testing project where the results are “good” for two months, and then the results become “bad” and stay that way for another two months while the team tries to troubleshoot the unexplained difference. The team has no success in finding the solution until one day the system under test is rebooted. Unix systems don't often get rebooted, so it's not surprising that this “solution” was not attempted earlier. The performance problem disappears after the reboot, but the team has experienced so much pain and expense troubleshooting the issue that management declares that all servers will undergo performance testing before being released to the development teams for application deployment. It is highly unlikely that the new routine performance testing will prevent any performance problems, but with testing, as in crash forensics, management often reacts to “bad” results in a way that is not necessarily good for the team.

Another factor at work in crash forensics is friction. Investigators often try to determine how much friction was at play in the systems. For vehicles, friction between the road and the tires is a critical factor, and for any mechanical system friction between moving parts can play a big part in system failure.

In testing, black boxes can act as a form of friction. Think of a black box in terms of a combination lock, like the ones we used on lockers at school. With a black box you know what goes into the system and what comes out, but you don't know what is going on inside the system. In our combination lock example, we know that numbers go in (or, more precisely, tumbler movements), and either “locked” or “unlocked” comes out.

The problem with black boxes is that since you don't know what's going on inside the system, you often have to perform exhaustive testing for each scenario. For our test team this means that our momentum is going to slow down considerably. With our combination lock example, let's say we don't know how many numbers it takes to open the lock. That means we need to test all combinations of all lengths to prove our system only opens on one combination. If we know that the system opens with a three-number combination only, then the problem becomes a bit smaller. The more information we have about what's going on inside the black box, the better the test team is able to direct its testing to get the required results. In other words, the “blacker” the box, the longer the test.
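A quick count shows how much that extra knowledge is worth. The sketch below assumes a dial with 40 positions and, for the unknown case, an upper bound of four numbers; both figures and the function names are assumptions made for illustration.

```python
def combinations_known_length(positions, length):
    """Number of sequences to try when the combination length is known."""
    return positions ** length


def combinations_up_to(positions, max_length):
    """Number of sequences to try when only an upper bound on length is known."""
    return sum(positions ** n for n in range(1, max_length + 1))


POSITIONS = 40  # assumed number of dial positions

print(combinations_known_length(POSITIONS, 3))  # 64000
print(combinations_up_to(POSITIONS, 4))         # 2625640
```

Under these assumptions, knowing that the lock takes exactly three numbers cuts the exhaustive test from about 2.6 million attempts to 64,000.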

So, applying the heuristic of crash forensics and the law of conservation of linear momentum to software testing helps us understand the dynamics of test teams as they move through their projects. We need to be aware of collisions that may affect our teams, and we need to focus on maintaining or regaining momentum when the collisions prove to be severe. We need to recognize when we have experienced an inelastic collision and make sure we spend some effort re-assembling the team before we expect it to begin gaining momentum. Finally, we need to understand what causes test teams to fail and not blindly react to the event for the wrong reasons.