Let’s say you are asked to evaluate two computers in terms of user-friendliness. If one is an Apple and the other is a PC, then it’s obvious you need to do something to hide this information from the users. There’s too much of a preconceived notion that the Apple is more user-friendly to allow for a fair comparison. No problem: we’ll make it a blind test so the users don’t know which is which. In fact, we’ll go one step further and make it a double-blind test, so that the individuals administering the test don’t know which is which either. So far so good.
Now for the next part: we ask the users to switch back and forth from one computer to the other every few seconds, making note of any difference in user-friendliness as they go. After the trials are conducted, the feedback is pooled, and careful statistical analysis delivers the result: there is no significant difference in user-friendliness between the two computers. Hooray for the scientific method!
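For the record, the statistics behind such a test are simple enough. A standard ABX run is typically scored with a one-sided binomial test against chance guessing; here is a minimal sketch in Python, with purely hypothetical trial counts:

```python
from math import comb

def abx_p_value(correct: int, trials: int) -> float:
    """One-sided binomial test: the probability of getting at least
    `correct` answers right out of `trials` by pure guessing (p = 0.5)."""
    return sum(comb(trials, k) for k in range(correct, trials + 1)) / 2 ** trials

# Hypothetical numbers: 12 correct identifications out of 20 trials.
print(f"p = {abx_p_value(12, 20):.3f}")  # p = 0.252 -- "no significant difference"
```

Note what a null result like this actually says: the test failed to detect a difference under these conditions. It does not say that no difference exists.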
The two computers are then given to somebody to use as he pleases over the next couple of weeks. He scratches his head as he quickly realizes that one of the machines is more user-friendly than the other. What’s going on here? Well, first off, everybody knows the way this test was conducted is silly. How are you supposed to notice subtle differences when you’re jumping from one machine to the other every few seconds? Let me clarify that statement: everybody outside of the audio industry realizes that this test is silly. In the audio industry, however, it is considered the most sober science to conduct a double-blind (ABX) test that flips back and forth between the two units under test every few seconds.
Physicist Richard Feynman had a term for this: he called it “Cargo Cult Science”. It is the sort of science you do when you appear to go through all the right motions but are missing the big picture. You are doing everything you are “supposed to”, yet you’re not getting useful results. For example, I read that Swedish Radio conducted a “double-blind, triple-stimulus, hidden-reference” test over two years with 60 expert listeners to evaluate the transparency of low bit-rate audio codecs. The result? No statistically significant difference. However, another listener was able to quickly identify a 1.5 kHz tone in the processed sample under normal listening conditions (i.e., not double-blind).
What can be done to remedy this situation? Standard audio ABX tests are of little value except for detecting gross differences. Conversely, going by “whatever sounds best” is too much like reading tea leaves. As I’ve mentioned in a previous post, you really need to be able to live with a piece of audio equipment for a while to detect subtle differences, but that sort of test is very difficult to control. Ah, control… that makes me think of a control group. What if there were a way to conduct these long-term listening trials while maintaining a control group that is given a “placebo”? Follow me for a moment and I’ll show how this might work.
The latest whiz-bang, perfect-sound-forever amplifier is finally out! The objective measurements are phenomenal, and the designer is a Ph.D. who works on quantum chaos in his spare time. The previous generation of this amplifier is very, very good, but the marketing department claims this new one makes its predecessor seem like something an orangutan designed. Demo samples are distributed, and true to the manufacturer’s word, the new amplifier is so good that every single person loves it, and they all burn their old amplifiers so as not to be tempted ever again to listen to something so vile.
That’s the typical scenario (well, sort of), but what if instead the manufacturer of the new amplifier did the following: twenty demo amplifiers are sent to reviewers and early adopters, all eager to hear and comment on the new sound. They are allowed to spend several weeks with the new amplifiers, long enough to detect very subtle differences, and then they are asked for their opinions relative to competitors’ amplifiers and even to the manufacturer’s previous-generation amplifier. Good so far, but now for the coup de grâce: 50% of the demo amplifiers are in fact the previous-generation amplifier inside the new chassis! That’s right, folks: half of you are part of the control group and have been administered the placebo.
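Scoring such a trial would be straightforward. As one possible sketch (the ratings, group sizes, and names below are invented purely for illustration), you could compare the long-term ratings from the placebo group against those from the genuine new amplifiers with a simple permutation test:

```python
import random
from statistics import mean

# Hypothetical 1-10 sound-quality ratings after several weeks of listening.
# One group received the new amplifier; the "placebo" group received the
# previous-generation amplifier hidden inside the new chassis.
new_ratings     = [8, 9, 7, 9, 8, 9, 7, 8, 9, 8]
placebo_ratings = [8, 8, 9, 7, 9, 8, 8, 9, 7, 8]

def permutation_p_value(a, b, n_resamples=100_000, seed=1):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # random relabeling of who got which amplifier
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples

print(f"p = {permutation_p_value(new_ratings, placebo_ratings):.3f}")
```

If the placebo group raves about the “new” sound just as enthusiastically as the group with the real thing, the improvement lives in the reviewers’ expectations, not in the circuit.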
How many manufacturers of audiophile equipment are willing to do this? Can you imagine the change in tone of audio reviews if this practice became common? If an amplifier manufacturer spends time, spends money, and adds complexity to create a next-generation amplifier, then shouldn’t the result be able to pass this test? If an improvement in THD+N, or in any other objective measurement you can think of, cannot pass this long-term, double-blind AB test with a placebo control, then how can you justify manufacturing the new design and selling it to your customers?
I’m an engineer, so of course I love the technical challenge of optimizing a given parameter as much as any other engineer. However, an important difference between research and product development is that in product development any change you make should result in a measurable improvement for the customer (cost, reliability, usability, sound quality, etc.). Unfortunately, it seems like the only people benefiting from efforts to improve THD+N are those in the marketing department. The research simply isn’t there to support the practice.
Earl Geddes sums it up very nicely with the following quote from his Dagogo interview:
“My position is that if some manufacturer claims an improvement in some sonic property, subtle or not, then it is their obligation to measure this (even if they have to figure out how to do that) and show in a statistically significant way that it makes an audible difference.”