Wednesday, December 21, 2011

Why Pilot Projects Fail

via Megan McArdle : The Atlantic by Megan McArdle on 12/21/11

It seems that the LA Unified School District recently revamped its lunch menus to eliminate fattening standbys like chicken nuggets, nachos, and flavored milk.  The resulting meals are much healthier, but apparently also much less appetizing.  As a result, participation in the program is down, and the LA Times found students replacing the Beef Jambalaya and lentil cutlets with things like Cheetos.

This happened despite the fact that the menu was tested extensively before they put it into operation:

Andre Jahchan, a 16-year-old sophomore at Esteban Torres High School, said the food was "super good" at the summer tasting at L.A. Unified's central kitchen. But on campus, he said, the chicken pozole was watery, the vegetable tamale was burned and hard, and noodles were soggy.

"It's nasty, nasty," said Andre, a member of InnerCity Struggle, an East L.A. nonprofit working to improve school lunch access and quality. "No matter how healthy it is, if it's not appetizing, people won't eat it."

At Van Nuys High School, complaints about the food were so widespread that Principal Judith Vanderbok wrote to Barrett with the plea: "Please help! Bring back better food!"

Among other complaints, Vanderbok said salads dated Oct. 7 were served Oct. 17. (Binkle said the dates indicate when the food is at its highest quality, not when it goes bad. They have been removed to avoid misinterpretation.) On campus, even adults -- including a Junior ROTC officer and an art teacher -- have been found selling black market candy, chips and instant noodles to hungry students, she said.

"I compare it to Prohibition," Vanderbok said.

This is one more installment in a continuing series, brought to you by the universe, entitled "promising pilot projects often don't scale".  They don't scale for corporations, and they don't scale for government agencies.  They don't scale even when you put super smart people with expert credentials in charge of them.  They don't scale even when you make sure to provide an ample budget.  Rolling something out across an existing system is substantially different from even a well-run test, and often, it simply doesn't translate.

Sometimes the "success" of the earlier project was simply a result of random chance, or of what researchers call the Hawthorne Effect.  The effect is named after a factory outside of Chicago which ran tests to see whether workers were more productive at higher or lower levels of light.  When researchers raised the lights, productivity went up.  When researchers lowered the lights, productivity also went up.  Obviously, it wasn't the light that boosted productivity, but something else--the change from the ordinary, or the mere act of being studied.

Sometimes the success was due to what you might call a "hidden parameter", something that researchers don't realize is affecting their test.  Remember the New Coke debacle?  That was not a hasty, ill-thought-out decision by managers who didn't care about their brand.  They did the largest market research study in history, and repeated it several times, before they made the switch.  People invariably told researchers they loved the stuff.  And they did, in the taste test.  But they didn't love the stuff when it cost them the option of drinking old Coke.  More importantly, they were being offered a three-ounce cup of the stuff in a shopping mall lobby or supermarket parking lot, often after they'd spent an hour or so shopping.  New Coke was sweeter, so (like Pepsi before it) it won the taste test.  But that didn't mean that people wanted to drink a whole can of the stuff with a meal.

Sometimes the success was due to a high-quality, fully committed staff.  Early childhood interventions show very solid success rates at doing things like reducing high school dropout and incarceration rates, and boosting employment in later life.  Head Start does not show those same results--not unless you squint hard and kind of cock your head to the side so you can't see the whole study.  Those pilot programs were staffed with highly trained specialists in early childhood education who had been recruited specially to do research.  But when they went to roll out Head Start, it turned out the nation didn't have all these highly trained experts in early childhood education that you could recruit specially--and definitely not at the wages they were paying.  Head Start ended up requiring a two-year associate's degree, and recruiting from a pool that included folks who were just looking for a job, not a life's mission to rescue poor children while adding to the sum of human knowledge.

Sometimes the program becomes unmanageable as it gets larger. You can think about all sorts of technical issues, where architectures that work for a few nodes completely break down when too many connections or users are added.  Or you can think about a pilot mortgage modification program.  In the pilot, you're dealing with a concrete group of people who are already in default, and in every case, both the bank and the individual are better off if you modify the mortgage.  But if you roll the program out nationwide, people will find out that they can get their mortgages modified if they default . . . and then suddenly the bank isn't better off any more.

Sometimes the results reflect survivor bias.  This is an especially big problem when studying health care and the poor: health care, because compliance rates are quite low (by one estimate I heard, something like three-quarters of the blood pressure medication prescribed is not being taken nine months in), and the poor, because their lives are chaotic and they tend to move around a lot, so they may have to drop out, or may not be easy to find and re-enroll if they stop coming.  In the end, you've got a study of unusually compliant and stable people (who may be different in all sorts of ways) and oops! that's not what the general population looks like.
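
To see how dropout alone can manufacture a result, here is a minimal sketch in Python, not anything drawn from an actual study.  It assumes a made-up trait called `stability` that drives both who finishes the study and how well they do, and it assumes the program itself has zero effect.  The function name, the dropout threshold, and every number in it are hypothetical.

```python
import random
from statistics import mean


def simulate_study(n=10_000, dropout_threshold=0.5, seed=0):
    """Toy model: outcomes depend only on a hypothetical 'stability' trait.

    The program being studied contributes nothing to the outcome.
    Participants whose stability falls below the threshold drop out,
    so only the stable ones are measured at the end.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    enrolled = []
    for _ in range(n):
        stability = rng.random()                  # uniform in [0, 1)
        outcome = stability + rng.gauss(0, 0.1)   # trait plus noise, no program effect
        enrolled.append((stability, outcome))

    # Average outcome if we could measure everyone who enrolled.
    everyone_mean = mean(o for _, o in enrolled)
    # Average outcome among those who stayed to the end of the study.
    completer_mean = mean(o for s, o in enrolled if s >= dropout_threshold)
    return everyone_mean, completer_mean


everyone_mean, completer_mean = simulate_study()
print(f"everyone enrolled: {everyone_mean:.2f}")
print(f"completers only:   {completer_mean:.2f}")
```

Under these assumptions the completers score noticeably higher than the full enrolled group, even though the program did nothing.  A study that only measures the people who stuck around would report that gap as a success.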

So consider the LAUSD test.  In the testing phase, when the program was small, they were probably working with a small group of schools which had been specially chosen to participate.  They did not have a sprawling supply chain to manage.  The kids and the workers knew they were being studied.  And they were asking the kids which food they liked--a question which, social science researchers will tell you, is highly likely to elicit the answer that they liked something.

That is very different from choosing to eat it in a cafeteria when no one's looking.  And producing the food is also very different.  Cooking palatable food in large amounts is hard, particularly when you don't have an enormous budget--and the things that make us fat are, by and large, also the things that are palatable when mass-produced.  Bleached grains and processed fats have a much longer shelf life than fresh produce, and can take a hell of a lot more handling.  Salt and sugar are delicious, but they are also preservatives that, among other things, disguise the flavor of stale food.

I think one anecdote in the article is particularly telling.  People complained that salads dated October 7th were served on the 17th--and the district responded by, first, pointing out that that was the "best served by" date, not the date when the food actually went bad; and second, removing the labels because they were "confusing".  Now, as anyone who has forgotten to eat a bag of lettuce knows, while it may not actually be rotten after 10 days, it probably doesn't look much like something you'd eat voluntarily.  This is not something that you can change by stamping a different date on the container, or no date at all.  If that were my choice, I too would come to school with a backup bag of Cheetos.

So why would the district say something so obviously weird?  There are two reasons I can think of:  1) in a large and complicated distribution system, and with limited funds, officials know that there is no way to actually solve this problem, so they mounted the only defense they could.  Or 2) the school district still has the mentality of the old system, which is mostly focused on not poisoning anyone.  After all, there isn't much difference between chicken nuggets that won't poison you and chicken nuggets at their absolute peak of freshness.  And the employees just sort of assumed that the same set of rules would work for lettuce.

That's what real-world applications are up against.  They're not an awesome pilot project with everyone pulling together and a lot of political push behind them; they're being rolled out into a system that already has a very well established mindset, and a comprehensive body of rules.  The new program implemented by the old rules often turns out to be worse than the old program.  You don't move kids from pizza to salad; you move them from pizza to Cheetos.

This is not, obviously, an argument against ever changing anything.  It is, however, an argument against assuming that your changes will work.  No, not even if you had a great pilot.
