Friday, August 14, 2009

The Art of Unit Testing

After I read Tim Barcz’s review of Roy Osherove’s “The Art of Unit Testing”, I knew I had to get a copy right away. It just arrived and I read it in one sitting. I am so pleased that I did. I’ll quarrel with it … but do not let that deter you from rushing to buy your own copy.

Let me say that again. I highly recommend this book – five stars – especially to folks like me who are not deep into unit testing. This review is full of my grumpy disagreements. That’s how I engage with a good book. Don’t be dissuaded.

Warning: Long post ahead. The short of it: buy the book. Everything else is commentary.

There is no point in recapping the book’s main points as Tim Barcz did that for us. I’m coming at it from a different angle. I’m coming at it from the angle of a guy who wishes he wrote more tests, wishes he was good at testing, even wishes he practiced (or at least gave a serious, sustained effort at trying) TDD. A guy who doesn’t.

A guy much like the vast multitude of developers out there … who is embarrassed by being “old school”, is looking for an opportunity to catch up, but isn’t going to take crap from an obnoxious TDD fan-boy.

I’ve had plenty of success over the years, thank you very much. I’ve written good programs (and bad) that still work. And I can mop the floor with legions of developers who think TDD/BDD/WTF experience yields greatness. They remind me of newly minted MBAs who believe with unshakeable certainty that they’re entitled to a management position. Think again.

Do I sound defensive? Yup. Enough already. My point is this ...

One of Roy’s goals is to reach people like me. We’re experienced developers who may have mucked around with unit testing but aren’t doing it regularly and may have had some rough experiences. We believe … but we don’t practice. Can he do something for us that makes us want to try again, or try harder? Can he keep it simple and approachable, respectful and non-dogmatic?

Yes he can.

He extends olive branches aplenty throughout. Right out of the gate he writes: “One of the biggest failed projects I worked on had unit tests. … The project was a miserable failure because we let the tests we wrote do more harm than good.”

Thank you. I don’t believe it for a second. Oh, I believe the tests were every bit as unmaintainable as he says. I’m just not buying that the project failed because of the tests. They contributed, perhaps, but in my experience projects fail for other, deeper reasons. That, however, is another post.

What I applaud is that he opens empathetically. He goes straight to the dark heart of our limited test-mania experience: when brittle, inscrutable tests became so onerous that they had to be abandoned. Been there. Seen it several times.

I appreciate that a similarly open and self-critical sensibility shines throughout. I’m particularly fond of the section on alternatives to “Design for Testability” in Appendix A. There he notes that the uncomfortable coding-style changes required to support testing are an artifact of the statically typed languages we use today: “The main problem with non-testable designs is their inability to replace dependencies at runtime. That’s why we need to create interfaces, make methods virtual, and do many other related things.” [266]

Dynamic languages, for example, don’t require such gymnastics. Perhaps with better tools and language extensions (Aspect Oriented Programming comes to mind) we can make testing easier for the statically typed languages.

Here he acknowledges that testing is just too darned hard, harder than it should be, and this difficulty – not resistance to new-ish ideas by crusty old farts like me – is a genuine obstacle.

Until then, we have to accept that incorporating unit testing into our practice requires more than an act of will. You will need hard-won skills and experience, and you will have to contort your code to get the benefits of unit testing. This is not your fault. You will pay a bigger price than you should have to pay. It may be rational to say “I can’t pay that price today, on this project.”

It may be rational. It may also be wrong. In any case, Roy’s goal is to reduce that price as best he can (a) with a progressive curriculum yielding skills you can use at each step and (b) by introducing you to tools that cover for language deficiencies.

Roy succeeds for me on both fronts. Each step was small enough to grasp and big enough to be useful. The tools survey was thin … but at least he has one – with opinions – that gives you places to look and an appreciation of their place in a complete testing regime.

Part 1 – Basics

This part is so important for readers like me. Overall, I thought it was grand. I’m about to freak out about a few of Roy’s choices but before I do I want to say “(mostly) well done!”

My biggest disappointment is Roy’s scant mention of IoC (Inversion of Control). There is a brief treatment of Dependency Injection [62-64] and a listing of IoC offerings in the appendix. That’s it. There is not a single example of IoC usage.

Testing is one of the primary justifications for using IoC. Such short shrift could leave the reader wondering what all the fuss is about. Wrongly, in my opinion. I was really looking forward to guidance on proper use of IoC in unit testing.

The omission felt consequential in Roy’s discussion of test super classes [152ff], where he takes a couple of classes that do logging and refactors their test classes to derive from a BaseTestClass [155] whose only contribution to derived classes is its StubLogger. What a waste of inheritance. Injecting a logger is the IoC equivalent of “Hello, World”. What am I missing?
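To make my complaint concrete, here is a minimal sketch of the injection I had in mind (the names are my own invention, not the book’s). No base test class, no inheritance tax; the stub goes in through the front door:

    using NUnit.Framework;

    // Hypothetical names throughout - my sketch, not Roy's example.
    public interface ILogger { void Log(string message); }

    public class StubLogger : ILogger
    {
        public string LastMessage;
        public void Log(string message) { LastMessage = message; }
    }

    public class OrderProcessor
    {
        private readonly ILogger _logger;
        public OrderProcessor(ILogger logger) { _logger = logger; } // constructor injection
        public void Process() { _logger.Log("processed"); }
    }

    [TestFixture]
    public class OrderProcessorTests
    {
        [Test]
        public void Process_Always_WritesToLog()
        {
            var logger = new StubLogger();          // no BaseTestClass needed
            new OrderProcessor(logger).Process();
            Assert.AreEqual("processed", logger.LastMessage);
        }
    }

An IoC container wires in the real logger in production; the test simply news up the stub by hand. That’s the whole trick.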

I realize (from painful experience) that it’s easy to create an IoC configuration rat’s nest in your test environment. That’s why I was hoping Roy would propose some best practices. Instead, I believe we are served an anti-pattern.

I must also say I was shocked to see favorable mention of using compiler directives [79-80]. He urges caution; I would ban the technique outright.
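For readers who haven’t seen the technique, here is a contrived sketch of the kind of thing I mean (my example, not the book’s): production code that changes shape under a compilation symbol so that tests can peek inside.

    // A contrived sketch of the technique I would ban - my code, not Roy's.
    public class PriceCalculator
    {
    #if DEBUG
        // Exists only in DEBUG builds, so the binary you ship is not
        // the binary you tested. That is precisely my objection.
        public decimal LastDiscountApplied;
    #endif

        public decimal Calculate(decimal price)
        {
            decimal discounted = price * 0.9m;
    #if DEBUG
            LastDiscountApplied = price - discounted;
    #endif
            return discounted;
        }
    }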

I was not fond of Roy’s preference for the AAA (Arrange-Act-Assert) style of test coding. This style facilitates brittle tests because it brings the “arrange” moment into the test method itself, and this has been a source of trouble for me.

“Arrange” code is distracting and bloats the test, making it too hard to see what is going on and leading to test methods that do too many things at once. When I was using this style, I couldn’t stop putting multiple asserts in each method [a “no-no” discussed 199-205]; it was too painful to make separate methods.
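A deliberately exaggerated sketch of my own shows how it goes wrong for me (the supporting types – StubCatalog, StubTaxPolicy, Cart, Receipt – are hypothetical):

    // My contrived example - the supporting types are hypothetical.
    [Test]
    public void Checkout_Computes_Totals()
    {
        // Arrange ... and arrange, and arrange
        var catalog = new StubCatalog();
        catalog.Add("widget", 10m);
        catalog.Add("gadget", 25m);
        var taxPolicy = new StubTaxPolicy(0.08m);
        var cart = new Cart(catalog, taxPolicy);
        cart.Add("widget", 2);
        cart.Add("gadget", 1);

        // Act
        Receipt receipt = cart.Checkout();

        // Assert - one method, many asserts (the "no-no")
        Assert.AreEqual(45m, receipt.Subtotal);
        Assert.AreEqual(3.6m, receipt.Tax);
        Assert.AreEqual(48.6m, receipt.Total);
    }

Having paid for that arrange block once, who wants to write it twice more so each assert can live in its own method?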

His associated test naming convention tends to say more about how the test works than what it is trying to achieve … and I think it is easier to find and understand tests when the names express intent.

Since I adopted more of the Context/Specification style espoused by BDD fans (see, for example, Dan North’s 2006 essay and a more recent manifesto by Scott Bellware), I’ve written smaller tests that are easier to read and easier to maintain. Roy can’t be faulted too much for this; Context/Specification is starting to take hold only this year (2009) and we don’t have the years of experience that go with AAA.
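Restyled as Context/Specification (same hypothetical types as my sketch above), the arrangement moves into a shared context and each specification shrinks to a single named observation:

    // The same contrived example, Context/Specification style.
    public abstract class checkout_context
    {
        protected Receipt receipt;

        [SetUp]
        public void EstablishContext()
        {
            var catalog = new StubCatalog();
            catalog.Add("widget", 10m);
            catalog.Add("gadget", 25m);
            var cart = new Cart(catalog, new StubTaxPolicy(0.08m));
            cart.Add("widget", 2);
            cart.Add("gadget", 1);
            receipt = cart.Checkout();
        }
    }

    [TestFixture]
    public class when_checking_out_a_mixed_cart : checkout_context
    {
        [Test] public void the_subtotal_sums_the_line_items() { Assert.AreEqual(45m, receipt.Subtotal); }
        [Test] public void the_tax_is_eight_percent()         { Assert.AreEqual(3.6m, receipt.Tax); }
        [Test] public void the_total_adds_tax_to_subtotal()   { Assert.AreEqual(48.6m, receipt.Total); }
    }

Notice how the names now express intent rather than mechanics, which is exactly the complaint I made about Roy’s naming convention.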

Two caveats. First, as I made clear at the beginning, I don’t do enough unit testing to be taken seriously as a guide. Second, test regimes falter in year two, as the long-term maintenance of actually-existing unit test implementations overwhelms the development effort; that’s why Roy’s book is important. But the Context/Specification style hasn’t been around long enough to prove its worth in the field. It will take a couple of years to find out.

Part 2 – Core Techniques

The discussion of the difference between stubs and mocks was brilliant: “If it’s used to check an interaction (asserted against), it’s a mock object. Otherwise, it’s a stub.” [90]
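Here’s that distinction in code, a self-contained sketch with names of my own invention. The reader feeds data in and is never asserted against (a stub); the logger is what the assert lands on (a mock):

    using NUnit.Framework;

    // All names are hypothetical - my sketch of Roy's distinction [90].
    public interface IFileReader { int GetSize(string path); }
    public interface ILogger { void LogError(string message); }

    // A hand-rolled stub: supplies canned answers, never asserted against.
    public class FakeFileReader : IFileReader
    {
        public int Size;
        public int GetSize(string path) { return Size; }
    }

    // A hand-rolled mock: records the interaction so the test can assert on it.
    public class FakeLogger : ILogger
    {
        public string LastError;
        public void LogError(string message) { LastError = message; }
    }

    public class FileAnalyzer
    {
        private readonly IFileReader _reader;
        private readonly ILogger _logger;
        public FileAnalyzer(IFileReader reader, ILogger logger)
        { _reader = reader; _logger = logger; }

        public void Analyze(string path)
        {
            if (_reader.GetSize(path) > 1000) _logger.LogError(path);
        }
    }

    [TestFixture]
    public class FileAnalyzerTests
    {
        [Test]
        public void Analyze_BigFile_LogsAnError()
        {
            var stubReader = new FakeFileReader { Size = 5000 };
            var mockLogger = new FakeLogger();

            new FileAnalyzer(stubReader, mockLogger).Analyze("big.txt");

            Assert.AreEqual("big.txt", mockLogger.LastError); // assert against the mock
        }
    }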

I loved that he hand-writes mocks before introducing mocking frameworks (he prefers to call them “Isolation Frameworks”). This is a crucial pedagogical move. Many of us are stunned by mocking framework syntax (e.g., Rhino Mocks) and our instinct is to run away and use only state-based testing.

Those of you who know better will smile knowingly as I confess to the awful mess I made for myself by hand-rolling my own mocks for fear of frameworks. There is a reason, and it is the sheer ugliness of mocking framework APIs.

Roy gets it. That’s why he sneaks up on Rhino Mocks.
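For what it’s worth, the framework version of my hand-rolled test above is a short hop, not a leap. As best I recall the Rhino Mocks 3.5 AAA syntax (check the docs before trusting my memory):

    using NUnit.Framework;
    using Rhino.Mocks;

    // Reuses IFileReader, ILogger, and FileAnalyzer from my sketch above.
    [TestFixture]
    public class FileAnalyzerRhinoTests
    {
        [Test]
        public void Analyze_BigFile_LogsAnError()
        {
            var stubReader = MockRepository.GenerateStub<IFileReader>();
            stubReader.Stub(r => r.GetSize("big.txt")).Return(5000);
            var mockLogger = MockRepository.GenerateMock<ILogger>();

            new FileAnalyzer(stubReader, mockLogger).Analyze("big.txt");

            mockLogger.AssertWasCalled(l => l.LogError("big.txt"));
        }
    }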

“One Mock Per Test” [94]. I like the sound of it. I like Roy’s reasoning. It’s the kind of clear, unambiguous advice that novices like me need. I’m sure there are times when it is smart to set it aside but it has the whiff of hard-earned wisdom.

I much appreciated the “traps to avoid” section at the end of the mocking chapter (chapter 5). It’s easy to say “if it looks complicated, stop”; we should say it again anyway. Roy goes one better and identifies the tell-tale signs of too much mock-framework fascination.

Part 3 – Test Code

I tend to agree with Tim Barcz: chapter 7, “The Pillars of Good Tests”, is essential, and some of it feels like it belongs early in the book … not here, 100 pages in. On the other hand, the reader isn’t ready for an in-depth review of test smells and maintainability until they know the basics. On balance, the timing of this chapter feels right.

The passages on “trustworthy tests” overflow with good sense. How to fix a broken test … which includes breaking the production code to ensure the test still catches the failure … that’s a step you overlook at your peril.
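The drill is worth spelling out because it is so easy to skip. Something like this (my own contrived fragment):

    // Before trusting a "fixed" test, sabotage the production code:
    public static decimal ApplyDiscount(decimal price)
    {
        // return price * 0.9m;   // the real line, commented out
        return -1m;               // deliberate sabotage
    }
    // Run the suite. The test MUST go red; if it stays green, it isn't
    // guarding anything. Then restore the real line and watch it go green.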

It’s proof again, if proof is needed, that you can’t write unit tests on autopilot and you can’t do it by rote. Junk testing is hardly better than no testing … and Roy has an iron grip on this fact.

Chapter 6 concerns build automation, code organization, and conventions … crucial blocking and tackling.

This is the place, mentioned earlier, where Roy speaks favorably of test class inheritance and where I feel IoC techniques are more appropriate. I don’t think much of overriding a virtual setup method either; the Template Pattern is far preferable. With the Template Pattern – in which derived classes override an empty virtual method that is called by the base class – you ensure that base behavior is always invoked, and you don’t trouble the developer with knowing when the base method should be called.
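Here is the shape I mean, sketched in my own (hypothetical) terms, reusing the OrderProcessor and StubLogger from my earlier sketch:

    using NUnit.Framework;

    // My sketch of the Template Pattern applied to test setup.
    public abstract class TestBase
    {
        [SetUp]
        public void BaseSetUp()
        {
            ResetFakes();  // base behavior always runs; derived classes can't forget it
            OnSetUp();     // the managed extension point
        }

        protected virtual void OnSetUp() { }  // empty by default

        private void ResetFakes() { /* shared fixture housekeeping */ }
    }

    public class OrderProcessorSetupTests : TestBase
    {
        private OrderProcessor _processor;

        // No decision about when (or whether) to call base.SetUp() -
        // the base already ran before this method was invoked.
        protected override void OnSetUp()
        {
            _processor = new OrderProcessor(new StubLogger());
        }
    }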

Roy describes something he calls the “Test Template Pattern” [158], which sounds like the Template Pattern but isn’t. His Test Template Pattern consists of abstract test methods which, perforce, must be implemented by derived test classes. The intention is to ensure that all derived test classes implement specific tests – not, as in the Template Pattern, to provide a well-managed base class extension point.
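My reconstruction of the difference, with hypothetical parser types:

    using System;
    using NUnit.Framework;

    // Roy's "Test Template Pattern" [158], as I read it: abstract test
    // methods that every derived fixture is forced to implement.
    public interface IParser { object Parse(string input); }

    public class XmlParser : IParser  // hypothetical CUT
    {
        public object Parse(string input)
        {
            if (string.IsNullOrEmpty(input)) throw new ArgumentException("input");
            return System.Xml.Linq.XDocument.Parse(input);
        }
    }

    public abstract class ParserTestsBase
    {
        protected abstract IParser GetParser();
        public abstract void Parse_EmptyInput_Throws();
        public abstract void Parse_ValidInput_ReturnsResult();
    }

    [TestFixture]
    public class XmlParserTests : ParserTestsBase
    {
        protected override IParser GetParser() { return new XmlParser(); }

        [Test]
        public override void Parse_EmptyInput_Throws()
        {
            Assert.Throws<ArgumentException>(() => GetParser().Parse(""));
        }

        [Test]
        public override void Parse_ValidInput_ReturnsResult()
        {
            Assert.IsNotNull(GetParser().Parse("<a/>"));
        }
    }

Note that the base class delivers no behavior at all; it is a contract for tests, not an extension point.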

The Context/Specification approach employs the Template Pattern (in the form of a virtual Context() method) as the preferred means by which a derived Specification class makes arrangements (adds “context”) that are particular to its needs.

Speaking of Context/Specification: if you prefer that style, you’ll need to adjust Roy’s recommendation from “One Test Class Per Class Under Test (CUT)” [149] to “One Test File Per Class Under Test”. That’s because Context/Specification yields many test classes, each dedicated to a different “context” in which the CUT is revealed. In the examples I’ve seen, these many classes typically live together in a single physical file named after the CUT.
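Put together, a Context/Specification file for a hypothetical Cart might look like this – many small context classes, one file named after the CUT, with the virtual Context() method serving as the Template Pattern extension point I praised above:

    using NUnit.Framework;

    // CartSpecs.cs - hypothetical CUT plus its many contexts, one file.
    public class Cart
    {
        private decimal _total;
        public decimal Total { get { return _total; } }
        public bool CanCheckout { get { return _total > 0m; } }
        public void Add(string item, decimal price) { _total += price; }
    }

    public abstract class CartContext
    {
        protected Cart cart;

        [SetUp]
        public void SetUp()
        {
            cart = new Cart();
            Context();  // Template Pattern: derived specs add their context here
        }

        protected virtual void Context() { }
    }

    [TestFixture]
    public class when_the_cart_is_empty : CartContext
    {
        [Test] public void the_total_is_zero()      { Assert.AreEqual(0m, cart.Total); }
        [Test] public void checkout_is_disallowed() { Assert.IsFalse(cart.CanCheckout); }
    }

    [TestFixture]
    public class when_the_cart_holds_one_item : CartContext
    {
        protected override void Context() { cart.Add("widget", 10m); }

        [Test] public void the_total_is_the_item_price() { Assert.AreEqual(10m, cart.Total); }
        [Test] public void checkout_is_allowed()         { Assert.IsTrue(cart.CanCheckout); }
    }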

I have a feeling that BDD practitioners go farther and argue that you should build tests around scenarios, not classes. They might say it’s a category mistake to force a correlation between CUTs and test files. I just don’t know. Such a correlation seems convenient, but it may distort the design process. I lack the experience necessary to weigh the tradeoffs. I wish Roy had explored this avenue.

Part 4 – Design and Process

Chapter 8 is about the politics of implementing a testing regime where none exists, a hugely important topic. I enjoyed this chapter immensely. Unfortunately, Roy is utterly unpersuasive.

To summarize: a team that writes tests takes twice as long to deliver its first implementation as the team that doesn’t [232]; there are no studies proving that unit tests improve quality [234], even though we believe it anecdotally; there is strong evidence that programmers who write tests won’t do a good job of testing for bugs, despite their best intentions [235]; and finally, it appears most defects stem not from poor code quality but from misunderstanding the application domain [237]. This litany is not the way to management’s heart.

I will expand on each of these observations.

Time to Market

In the “Tough questions and answers” section, Roy prepares an answer to the #1 question on your manager’s mind: “How much time will this add to the current process?”

Roy’s frank answer is “it doubles your initial implementation time …” [232]

That’s a conversation stopper. Management prizes an early delivery date and it is extremely difficult for management to distinguish the first implementation from “the” implementation.

Roy hastens to add “… the overall release date for the product may actually be reduced.”

That may re-open the conversation … because you’re talking about the delivery date again. You’re making the case that the project won’t be considered delivered until it passes some quality bar … that the savings in the mandatory testing phase may compensate for the slower start.

The equivocation – “may” – will be noticed. Management has heard too many stories about Total Cost of Ownership and Reduced Maintenance. It’s going to be tough.

Here’s the worst part. Being in the lead at the first turn often means you win the race. It means you get resource commitments that won’t be available without a (ridiculous) early delivery date. This is so even if we finish much later than the conscientious, test-driven developers. Too bad, because they never get the shot. And by the time the technical debt comes due, there are sunk costs (real and political) that management will be loath to abandon.

This is just how it is. So, while I applaud Roy’s honesty, this is a tough sell. He needs another plan. He needs a way to shift the definition of “delivered” to an implementation that passes a measurable quality bar. He needs to talk about short cycles so that the evidence is experienced on this project and registers in management’s short term memory.

Roy shows some grasp of this dynamic. In his example – a tale of two projects – the debugged release time is 26 days in the worst (no-testing) case. You can win a month to prove your point … but not much longer.

Does unit testing improve code quality?

Roy is his typical honest self here. Unfortunately, what he reports is not likely to advance his cause.

He draws proper attention to code coverage. There are lovely charts. There is just one flaw: you have to convince the skeptics that you’re measuring something that matters.

You think that’s a good metric because it measures unit testing activity. The skeptic doesn’t care about your activity. Activity – expenditure of effort – is irrelevant. The skeptic cares about delivering a system that “works acceptably” as quickly as possible. The skeptic suspects you’re polishing one apple while he wants many apples, perhaps less polished.

A devastating admission: “There aren’t any specific studies I can point to on whether unit testing helps achieve better code quality.” [234] Ouch! That has to be fixed.

Here’s another groaner: “A study by Glenford Myers showed that developers writing tests were not really looking for bugs, and so found only half to two-thirds of the bugs in an application.” [235]

Here’s another citation that Roy interprets as strengthening the case for unit tests although I think it does the opposite: “A study held by … showed that most defects don’t come from the code itself, but result from miscommunication between people, requirements that keep changing, and a lack of application domain knowledge.”[237]

It is not self-evident how unit testing alleviates these sources of error. The best he can say is that, as you correct course, the unit tests provide some assurance that the other things you still think are true are still tested. That’s valuable … but weak beer at best.

This chapter made me wonder again whether I should be so ashamed of my test-less oeuvre.

Nah. We may lack the proof but absence of proof is not proof of absence. Where would we be if we only followed rigorously proven practices? Show me the study that proves “GoTo”s are bad.

There was a prolonged and super-heated argument in the ’70s and ’80s about the (de)merits of GoTo. Steve McConnell covers it in an article drawn from his Code Complete, where he references a Ben Shneiderman “literature survey”. I suspect a literature survey would yield comparable support for unit testing. Literature surveys perhaps reflect the “wisdom of the field”; they are not evidence.

The fact is, we have very little social science on any development practices. The objection that unit testing and TDD are unproven could be raised about almost any practice. The anecdotal support for unit testing remains strong.

We shouldn’t leave it there. We need real studies. I’d like to see some of my former colleagues in economic sociology jump in. There’s at least a master’s thesis here.

It’s also possible that the limited studies to which Roy refers (he does not cite them) produce inconclusive results because they don’t account for test quality. Noise from botched test regimes may be hiding the good news. Roy established early that (a) poor unit testing can be worse than no unit testing and (b) it’s easy to make a mess.

If this interpretation is correct, we are challenged to improve testing as actually practiced in the wild. We lose the argument – and we should lose it – if proper unit testing remains a rare skill, difficult to acquire. If Roy’s book is widely read and more developers learn to write better tests, we can hope for a positive swing in the statistics.

Finally, I’ve heard Steve McConnell claim on a DotNet Rocks show (0:28) that the average project spends “40% to 80% of its effort on unplanned defect correction work … in other words, low quality is the single largest cost driver for the average project.” I don’t know how Steve came by these statistics (and “40% to 80%” is a huge swag), but it is Steve’s business to measure and track this stuff. And if you’re doing something that attacks the “single largest cost driver” … and you’re not disproportionately increasing costs with your remedy [!] … then you’re making business sense.

Chapter 9, on testing legacy code, is a welcome introduction with good advice … but no substitute for Michael Feathers’s Working Effectively with Legacy Code. Feathers’s book is expensive ($47 on Amazon); perhaps Roy’s chapter and his enthusiasm for the book will encourage sales.

Appendices

Appendix “A”, ostensibly about design and testability, is mostly about design for testability. That’s no small leap. Testing code heavily is one thing. It is another to distort your design to satisfy inadequacies in the language that make testing difficult.

I’ve deliberately expressed this point in the most contentious way possible to dramatize the implications of exchanging “for” for “and”.

I hasten to express my enthusiasm for the contribution of “unit testing” to design. Expressing your expectations in code clarifies the design and casts a strong light on otherwise dark edge cases. Many of the test disciplines, loosening dependencies in particular, promote SOLID design principles (especially Single Responsibility) that are beneficial in their own right. Roy is excellent on these points.

The problem is that at least one of Roy’s recommendations, “Make methods virtual by default” [258], reduces design quality in order to make testing easier. Testability and Good Design are at cross purposes.

“Make methods virtual by default” is a terrible idea in my opinion; I explore that opinion in a separate post. My argument, in brief, is that a virtual method is an invitation to extension everywhere. Extensibility is not a frill. You have doors in your house for a reason; that’s where people are expected to enter. They aren’t expected to come through the windows. You don’t punch orifices into every wall. A plethora of virtual methods invites violation of the Open/Closed Principle (the Liskov Substitution Principle, to be more precise) and makes delivery, maintenance, and support of a system pointlessly more difficult.

This aside, the chapter, although brief, is clear and persuasive.

Appendix “B” enumerates helpful tools and test frameworks. Each merits only a brief blurb but I was pleased to have an annotated list of Roy-approved choices.

Conclusion

This is a wonderful book for the experienced developer who is open to unit testing while having limited experience of it. I suspect it will also help technical managers of a certain age … managers whose programming days are behind them, who’ve heard the fuss, been through a few fads, and want a serious, honest, warts-and-all look at unit testing.

I’m told also that it has earned the respect and admiration of many with deep unit testing experience. That’s a confidence builder for me.

Get it.

3 comments:

Jason Diamond said...

Absolutely fantastic review, Ward.

I'm a TDD/BDD fanatic and I also had "issues" with some of what I was reading in Roy's book. I didn't put anywhere near as much thought into my objections as you did, though.

Your review makes me want to re-read his book with a printout of your commentary by my side.

Thanks.

David Nelson said...

"...the unit tests provide some assurance that the other things you still think are true are still tested. That’s valuable … but weak beer at best."

In my opinion, this is not "weak beer" (interesting turn of phrase by the way). Unit testing to find bugs is good, unit testing to improve design is good, but unit testing to serve as a regression suite is where the real gold is to be found. Case in point: I work on a legacy system which, while it works (more or less), is badly written and overly complex (actually complex is probably not the right word - overly coupled would be a better way to put it). Every time we attempt to add a feature or fix a bug, we end up breaking something, sometimes completely unrelated. Users hate bugs, but they especially hate regressions, and they especially hate seemingly unrelated regressions; they give the impression of severe incompetence and drive down user confidence, which in turn drives down cooperation and makes everything more difficult. It has gotten to the point now where we are forgoing adding valuable new features and fixing non-trivial bugs, simply because we have no confidence that we are not going to make things worse.

If I could push a button before every release and determine that, at least, nothing that I hadn't intended to change had changed, that would have incalculable value, even if it told me nothing about the correctness of what I had changed.

Sadly, I am in the same position you are: I believe in the anecdotal evidence of the value of unit testing, but I don't know where or how to start, I maintain legacy systems that I know will be a nightmare to test, I doubt my ability to convince my boss of the value, and I am turned off by the fan boys and their exaggerated claims. Maybe I will give this book a chance to give me a jump start.

Ward Bell said...

@David - I was wrong to call it "weak beer" ... both an excess of rhetoric and just plain wrong.

You are on the money when you write: "unit testing to serve as a regression suite is where the real gold is to be found."

I don't know if anyone has tried to measure this but it sure does align with our experience.

Regression tests have been crucial as we evolved our product from .NET 1.0 to .NET 3.5, including major leaps to support generics, EF, LINQ, and Silverlight.

Businesses have difficulty putting present value on future costs of evolving a product. It's not just the money. It's the managerial truth that "I won't be here when the chickens come home to roost."

Thanks for the correction.