Let’s say you are asked to evaluate two computers in terms of user friendliness. If one is an Apple and the other is a PC, then it’s obvious you need to do something to hide this information from the user. There’s too much of a preconceived notion that the Apple is more user-friendly to allow for a fair comparison. No problem, we’ll make it a blind test so the user doesn’t know which is which. In fact we’ll go one step further and make it a double-blind test, so that the individuals administering the test don’t even know which is which. So far so good.
Now for the next part we ask the user to shift back and forth from one computer to the other every few seconds. While they are doing this, the users are asked to make note of any difference in user-friendliness between the two computers. After the trials are conducted, the users sum up their computer user-friendliness feedback. After careful statistical analysis the results are: There is no significant difference in user-friendliness between the two computers. Hooray for the scientific method!
The two computers are then given to somebody to use as he pleases over the next couple of weeks. He scratches his head as he quickly realizes that one of the machines is more user-friendly than the other. What’s going on here? Well, first off everybody knows the way this test was conducted is silly. How are you supposed to notice subtle differences when you’re jumping from one machine to the other every few seconds? I should clarify that statement: Everybody outside of the audio industry realizes that this test is silly. However, in the audio industry it is considered the most somber science to conduct a double-blind (ABX) test that flips back and forth between two units under test every few seconds.
Physicist Richard Feynman had a term for this: he called it “Cargo Cult Science”. It is the sort of science you do when you appear to go through all the right motions, but are missing the big picture. You are doing everything you are “supposed to”, but you’re not getting useful results. For example, I read that Swedish Radio conducted a “double-blind, triple-stimulus, hidden-reference” test over two years with 60 experts to test the transparency of low bit-rate audio codecs. The result? No statistically significant difference. However, another listener was able to quickly identify a 1.5 kHz tone in the processed sample under normal listening conditions (i.e. not double-blind).
What can be done to remedy this situation? Standard audio ABX tests are of little value except for detecting gross differences. Conversely, going by “whatever sounds the best” is too much like reading tea leaves. As I’ve mentioned in a previous post, you really need to be able to live with a piece of audio equipment for a while to detect subtle differences, but that sort of test is very difficult to control. Ah, control…that makes me think of a control group. What if there is a way to conduct these long term listening trials while maintaining a control group that is given a “placebo”? Follow me for a moment and I’ll show how this might work.
The latest, whiz-bang, perfect-sound-forever amplifier is finally out! The objective measurements are phenomenal and the designer is a Ph.D. who works on quantum chaos in his spare time. The previous generation of this amplifier is very, very good, but this new one makes it seem like something an orangutan designed, the marketing department claims. Demo samples are distributed, and true to the manufacturer’s word, the new amplifier is so good that every single person loves it and they all burn their old amplifiers so as not to be tempted to ever again listen to something so vile.
That’s the typical scenario (well, sort of), but what if instead the manufacturer of the new amplifier did the following: Twenty demo amplifiers are sent to reviewers and early adopters, all eager to hear and comment on the new sound. They are allowed to spend several weeks with the new amplifiers, enough to detect very subtle differences, and then they are asked their opinion relative to competitors’ amplifiers and even to the manufacturer’s previous generation amplifier. Good so far, but now for the coup de grâce: 50% of the demo amplifiers were in fact the previous generation amplifier inside of the new chassis! That’s right folks, half of you are part of the control group and have been administered the placebo.
How many manufacturers of audiophile equipment are willing to do this? Can you imagine the change in tone of audio reviews if this practice became common? If an amplifier manufacturer spends time, spends money, and adds complexity in order to create a next generation amplifier, then shouldn’t it be able to pass this test? If improvements in THD+N, or any other objective test you can think of, do not pass this long-term double-blind AB test with placebo, then how can you justify manufacturing it and selling it to your customers?
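To make "pass this test" a little more concrete, here is a minimal sketch in Python of how the placebo trial above might be randomized and scored. Everything specific in it is my own assumption rather than part of any established procedure: the function names, the idea that each reviewer hands back a single numeric rating, and the choice of a one-sided permutation test on the difference in mean ratings. The ratings in the demo are made up purely for illustration.

```python
import random
from statistics import mean


def assign_units(n_units=20, seed=None):
    """Randomly decide which demo units get the old internals (the placebo).

    Hypothetical helper: half the chassis are labelled "new", the rest
    "placebo", and the order is shuffled so nobody handling the units
    can infer the contents from a serial number.
    """
    rng = random.Random(seed)
    labels = ["new"] * (n_units // 2) + ["placebo"] * (n_units - n_units // 2)
    rng.shuffle(labels)
    return labels


def permutation_p_value(new_scores, placebo_scores, n_perm=10000, seed=0):
    """One-sided permutation test on the difference in mean rating.

    Estimates how often a random relabelling of the pooled ratings gives
    a mean difference (new minus placebo) at least as large as observed.
    """
    rng = random.Random(seed)
    observed = mean(new_scores) - mean(placebo_scores)
    pooled = list(new_scores) + list(placebo_scores)
    k = len(new_scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    print(assign_units(20, seed=42))
    # Made-up ratings on a 1-10 scale, for illustration only.
    new_ratings = [8, 7, 9, 8, 7, 8, 9, 7, 8, 8]
    placebo_ratings = [7, 8, 7, 8, 6, 8, 7, 7, 8, 7]
    print(permutation_p_value(new_ratings, placebo_ratings))
```

A small p-value would suggest the reviewers who received the real new amplifier rated it higher than those who unknowingly received the old one; a large one would suggest the "improvement" is indistinguishable from the placebo. With only ten units per group the test has limited power, so a real trial would probably want more units or repeated rounds.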
I’m an engineer, so of course I love the technical challenge of optimizing a given parameter as much as any other engineer. However, an important difference between research and product development is that in product development any change you make should result in a measurable improvement for the customer (cost, reliability, usability, sound quality, etc.). Unfortunately, it seems like the only people benefiting from efforts to improve THD+N are those in the marketing department. The research simply isn’t there to support the practice.
Earl Geddes sums it up very nicely with the following quote from his Dagogo interview:
“My position is that if some manufacturer claims an improvement in some sonic property, subtle or not, then it is their obligation to measure this (even if they have to figure out how to do that) and show in a statistically significant way that it makes an audible difference.”
August 5th, 2013 at 3:53 am
You present a nice summary of why “ABX” and related blind testing should at least be regarded with great suspicion, and why one should require the promoters of such tests to supply formal proof of methodology and statistics and to “calibrate” their tests with known audible stimuli.
As for the test you suggested, we have had occasion to carry it out repeatedly. Only we do not proudly announce new improvements and then have people test them; instead we simply apply what we feel are improvements to a few units in series production WITHOUT notifying anyone. The units fetch up with distributors and dealers and are auditioned…
We then wait for either howls of outrage about how we destroyed the great sound of the piece (in which case we exchange the units and revert the change, something which only happened once and was far from unambiguous) or glowing reports of how much better everything sounds, in which case the change becomes permanent.
Of course, we also test blind in house (but not ABX of course).
One last thing: in an ideal world, I would agree with Earl about making any change quantifiable by measurements.
The reality is that doing so rarely if ever helps any manufacturer to sell more products and thus increase market share, profitability and the many other factors that are of concern to REAL manufacturers (not small outfits of the one- or two-man kind).
In fact, if such a manufacturer did spend the resources to develop new measurement methods and to design the necessary instrumentation and then used these results in advertising, what would happen is the following:
1) Many of the knowledgeable consumers have become jaded and suspicious of measurement results because of the long-standing, singular failure of common synthetic benchmarks to correlate with the experience of sound. At best, in this market there will be no backlash at yet another (perceived to be useless) test; at worst sales will suffer. They CERTAINLY will not increase.
2) In the group of less knowledgeable consumers, new measurements that cannot be compared to other manufacturers’ figures are pointless and will be ignored. Only if your numbers are better than the competition’s (who often made up the numbers anyway) will measurements have any point in improving sales.
3) Among Audio Engineers and Designers etc. there will be severe resistance to accepting the new test, for a variety of reasons, including the “Not Invented Here” syndrome; the violation of so-called common sense, that dreary bog of sullen prejudice and muddy inertia; and the fact that the new metric is morally unaffordable, as it would require a fundamental restructuring of the worldview of many individuals in the engineering community.
Most likely proposing such a test will draw the usual highly vocal opposition from the usual quarters of the “cargo cult” “scientists” who will first insist on an ABX test and the resulting controversy and publicity may well prove detrimental to sales.
Further, making such a new metric available would of course give everyone access to it and could thus reduce the actual advantage in sound quality over competitors that this manufacturer enjoys.
So, I can see few if any incentives for a manufacturer in the current market situation to do as Earl suggests, but many to be in fact as evasive as possible, to shroud things in mystery and claim esoteric knowledge that may not be imparted to the profane and so on, as this is more likely to maximise their advantage in the market place…
We may find this situation regrettable and desire it to be different; however, as I often say, I have not made this world, I only live here and enjoy certain pursuits.
Sir Shagsalot – K.R.D.C. (Knight of the Round Dancing Table)