Recently, the development team where I work has started collecting bona fide metrics based on our ticketing system. So few development shops (especially small ones) collect real information on how they work that it's exciting to be doing it at all.
Here's what we're doing:
- Number of releases during QA (we do a daily release, so more than daily is an indicator)
- Defects found, by severity and priority
- Average time from accepting a ticket (starting work) to resolving it (sending it for testing)
- Number of re-opens (i.e. a defect was marked resolved and sent to testing, but testing found it wasn't actually fixed)
- Average time from resolving to closing (i.e. testing the fix)
- Defects due to coding errors vs. unclear requirements (this is really great to be able to collect; with our company so new and small, we can introduce this and use it without ruffling a lot of feathers)
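As a rough sketch of how the cycle-time and re-open numbers could be computed from exported ticket data (the field names here are illustrative, not any particular ticketing system's schema):

```python
from datetime import datetime

# Hypothetical ticket records exported from a ticketing system;
# field names and dates are illustrative only.
tickets = [
    {"id": 1, "accepted": "2008-03-03", "resolved": "2008-03-04",
     "closed": "2008-03-07", "reopens": 0, "severity": "critical"},
    {"id": 2, "accepted": "2008-03-03", "resolved": "2008-03-05",
     "closed": "2008-03-08", "reopens": 1, "severity": "minor"},
]

def days(start, end):
    """Whole days between two ISO-format date strings."""
    fmt = "%Y-%m-%d"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).days

# Average time from accepting a ticket (starting work) to resolving it
avg_resolve = sum(days(t["accepted"], t["resolved"]) for t in tickets) / len(tickets)

# Average time from resolving to closing (i.e. testing the fix)
avg_close = sum(days(t["resolved"], t["closed"]) for t in tickets) / len(tickets)

# Total re-opens across the iteration
total_reopens = sum(t["reopens"] for t in tickets)

print(avg_resolve, avg_close, total_reopens)  # 1.5 3.0 1
```

With real exports, the same per-iteration aggregates could be appended to a log and charted over time.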
The tricky thing about metrics is that they are not terribly meaningful by themselves; rather they indicate areas for focussed investigation. For example, if it takes an average of 1 day to resolve a ticket, but 3 days to test and close it, we don't just conclude that testing is inefficient; we have to investigate why. Perhaps we don't have enough testers. Perhaps our testing environment isn't stable enough. Perhaps there are too many show-stoppers that put the testers on the bench while developers are fixing them.
Another way to interpret these values is to watch them over time. If the number of critical defects is decreasing, it stands to reason we're doing a good job. If the number of re-opens is increasing, we are packing too much into one iteration and possibly not doing sufficient requirements analysis. We just started collecting these on the most recent iteration, so in the coming months, it will be pretty cool to see what happens.
These metrics are pretty basic, but it's great to be collecting them. The one thing that can make hard-core analysis of these numbers difficult (esp. over time, as the team grows and new projects are created) is the lack of normalization. If we introduced twice as many critical bugs this iteration as last, are we necessarily "doing worse"? What if the requirements were more complex, or the code required was just...bigger?
Normalizing factors like cyclomatic complexity, lines of code, etc., can shed more light on these questions. These normalizing factors aren't always popular, but, interpreted the right way, they can be very informative. We're the same team, using the same language, working on the same product. If iteration 14 adds 400 lines of code with 3 critical bugs, but iteration 15 adds 800 lines of code with 4 critical bugs, I think we can draw some real conclusions (i.e. we're getting better).
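Working through those iteration 14 vs. 15 numbers makes the point concrete ("critical defects per thousand lines added" is just one possible choice of normalized metric):

```python
# Normalize critical-defect counts by lines of code added, using the
# iteration 14 vs. 15 figures from the text above.
iterations = {
    14: {"lines_added": 400, "critical_defects": 3},
    15: {"lines_added": 800, "critical_defects": 4},
}

# Critical defects per thousand lines of code (KLOC) added
densities = {n: d["critical_defects"] / d["lines_added"] * 1000
             for n, d in iterations.items()}

for n in sorted(densities):
    print(f"iteration {n}: {densities[n]:.1f} critical defects per KLOC added")
# iteration 14: 7.5 critical defects per KLOC added
# iteration 15: 5.0 critical defects per KLOC added
```

The raw count went up (3 to 4), but the normalized rate dropped from 7.5 to 5.0 per KLOC, which is what supports the "we're getting better" reading.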
Another interesting source of data would be our weekly code review. We typically review fresh-but-not-too-fresh code, mostly for knowledge sharing and general "architectural consistency". If we were to actively review code in development, before it is sent to testing, we could get real data on the effectiveness of our code reviews. Are we finding lots of coding errors at testing time? Maybe more code reviews would help. Are we finding fewer critical bugs in iteration 25 than in iterations 24 and 23, where we weren't doing reviews? That's evidence the reviews helped.
These are actually really simple things to do (especially with a small, cohesive team), and can shed real light on the development process. What else can be done?