 
 
c't - The Magazine for Computer Technology 
June/2000, p. 92: MP3-Comparison 
By Carsten Meyer 

Original German article: http://www.heise.de/ct/00/06/092/
 

Cross-examination test 

The c't-Reader’s Listening Test: MP3 versus CD

After our controversial discussion of some fundamental issues of MP3 encoding in the March 2000 issue (see [1]), c't asked its readers to take part in a listening test: unbelievers were to face the task of identifying, 'blind', the source of various music selections. The results of this test surprised not only our reference listeners; our editorial staff, too, came away with some unexpected insights.

We had stirred up a hornet's nest. Long discussions on our Usenet forum, letters to the editor both harsh and constructive, and angry calls to our hotline during business hours showed that the battle between MP3 opponents and supporters was still undecided after that test. Critics accused us of populist opinion-making, argued with great technical skill about the intricacies of HiFi and audio specifications, and damned MP3 compression as the work of the Devil; others praised our explanations as readable and useful for dispelling the esoteric and voodoo superstitions surrounding audio and HiFi, or simply declared us correct with respect to the audibility (or even inaudibility) of the effects of lossy audio compression at different quality levels.

All this persuaded us to take an extraordinary step, which we announced in the April 2000 issue of c't: our critical readers themselves were asked to distinguish MP3-encoded samples of music from the originals in a joint listening test. The participant with the best hit rate would win a cash prize of 1000 DM (approx. US$600). Initially we wanted to invite six readers, but the response was so great (more than 300 serious applications within a week) that we decided to bring twelve participants to Hanover. They were screened first by their qualifications, and the final selection from that group was made at random. We asked sound engineer Gernot von Schultzendorff to take part as our assessor and 'reference listener'. Mr. von Schultzendorff works for Deutsche Grammophon in Hanover, where his main job is preparing masters for the production of classical recordings. Without wanting to anticipate the result of this second test, we can already say that the charts in the March 2000 issue are as valid as ever, and that we see no need to send any of our earlier participants to an ear doctor. 

Reminiscences

This time our comparative listening test took place entirely in our publishing house's studio, where the damping, reflection, and resonance conditions are comparable to those in an audiophile's living room. Some readers may remember the studio from the time when the magazine HIFI-Vision was part of the Heise publishing house. Back then the ceiling was covered with diffusers (sand-filled plastic sacks), additional damping elements hung on the walls, and a built-in, well-filled bookshelf made for dry acoustics. The former conditions of the HIFI-Vision studio could not be completely reconstructed, however: instead of HiFi magazines in the bookshelves, we had to make do with telephone directories from our publisher's own product line as acoustic lining. Our readers will have to forgive us this inaccuracy.

Our top-class audio components were a pair of B&W Nautilus 803 speakers, driven by a Marantz CD14 CD player and a PM14 amplifier. Together with the Straightwire Pro cables and accessories, this combination cost approximately 30,000 DM, an amount few HiFi lovers could afford to spend on their hobby. The Nautilus speakers, of high-quality English manufacture, are a first choice for studios and mastering rooms because of their balanced, analytic, and neutral sound. In addition, Axel Grell of Sennheiser (who is not related to our editor and unofficial contestant Detlef Grell) lent us the electrostatic reference headphones Orpheus together with the matching tube amplifier, unfortunately only for the duration of the test: at 20,000 DM, this noble small-series product was the most expensive component we used.

Four minutes 

We chose an arbitrary list of musical works (17 in all; see the list below). From each of them, a one-minute passage was first played from the original CD as a reference. Then three samples of the same passage (encoded at 128 kbps, encoded at 256 kbps, and taken again from the original) were played in random order. The listeners had to determine the source of each of the three samples and record their answers on a questionnaire. Correctly identifying the 128 kbps sample earned one point per piece, as did correctly identifying the CD sample; identifying all three versions correctly was worth three points. No points at all were awarded if the 256 kbps sample was correctly identified but the 128 kbps and CD samples were reversed. A maximum score of 51 points was therefore possible, and the statistical mean for pure guessing (skewed by this unequal weighting) was 14.1 points. Any contestant scoring above that mark would therefore have heard actual differences in quality.
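
For readers who want to check the arithmetic, here is a minimal sketch (in Python, our own illustration rather than anything used in the test) that enumerates the six possible ways of labelling the three samples, applies the scoring rules above, and arrives at the random-guessing expectation over 17 pieces:

    from itertools import permutations

    SOURCES = ("128k", "256k", "cd")    # the three versions of each passage

    def points(guess):
        # guess[i] is the label a listener assigned to the sample whose
        # true source is SOURCES[i]
        if guess == SOURCES:
            return 3                    # all three identified correctly: 3 points
        pts = 0
        if guess[0] == "128k":          # 128 kbps sample correctly labelled: 1 point
            pts += 1
        if guess[2] == "cd":            # CD sample correctly labelled: 1 point
            pts += 1
        return pts                      # only the 256 kbps sample right, or none: 0

    # average over the six equally likely labellings, times 17 pieces
    mean_random = 17 * sum(points(p) for p in permutations(SOURCES)) / 6
    print(round(mean_random, 1))        # about 14.2, close to the mean quoted above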

In order to eliminate variations that could be caused by the different D-to-A characteristics of a CD player and an MP3 player, we encoded the test samples with MusicMatch 4.4 for Windows in joint stereo mode, converted them back into AIFF format with the Apple QuickTime Player on a Power Mac G3, and burned them onto a single audio CD in random order, together with the passages extracted directly from the original CDs.
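
Purely as an illustration (this sketch is ours, not the tooling actually used for the test), a few lines of Python show how a randomized playback order and a matching answer key for such a comparison CD could be generated:

    import random

    PIECES = 17                          # number of passages used in the test
    VERSIONS = ["128k", "256k", "cd"]    # the three versions of each passage

    answer_key = []
    for piece in range(1, PIECES + 1):
        order = VERSIONS[:]
        random.shuffle(order)            # random playback order for this passage
        answer_key.append((piece, order))

    # the answer sheet that stays with the test supervisor
    for piece, order in answer_key:
        print(f"piece {piece:2d}: {', '.join(order)}")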

Listening Test

After the first half-hour of intense listening, some of the contestants were already ready to give up. 'A lottery' was a comment heard more than once. Many of the listeners were surprised at how good an MP3 recording can sound on the outstanding Marantz player. People chatted about technical issues such as phase relationships, the influence of the (imperfect) room acoustics, and their personal listening habits. They argued about the importance of good cables and praised the superiority of analog recordings on vinyl (which, unfortunately, were not available for the listening test). 

During the break, and after the official joint part of the test, several doubting contestants were allowed to listen to and classify the individual pieces once more over the Orpheus headphones. They were also permitted to jump back and forth between passages for direct one-to-one comparisons of the individual versions, something that was obviously not possible in the joint listening test.

First Place Winner

The unofficial winner, with 26 points, was our 'reference listener' Gernot von Schultzendorff, who after more than an hour of intensive listening had to admit that he was exhausted. 'That was hard. It almost seemed to me as if some of the 256 kbps samples sounded somewhat rounder and more pleasing than the originals from the CD. One must not let oneself be distracted by such characteristics,' he said. And indeed, participants often mistook the 256 kbps sample for the original CD version.

Among the invited readers, Mirko Eßling from Schopp, a student and electronics developer who, according to his application, 'can predict the sound of an audio circuit by merely looking at it', took first place with 22 points. Given the unfamiliar room acoustics, the pressure to perform, the unfamiliar equipment, and the sub-optimal listening conditions, this is a thoroughly respectable score, and it earned him the first prize of our competition: 1000 DM in cash.

We were somewhat surprised when we learned about his musical preferences. 'Actually, I cheated a little in my application. I do have classical piano training, but as an active amateur musician I prefer to play punk rock,' he said. Before the test he practiced intensively by listening to various kinds of MP3s and reached a success rate of 90 percent with 128 kbps encodings, and that despite a severe handicap: 'Since an accident involving an explosion I can hear only up to 8 kHz in my left ear, and in my right ear I had a stubborn ringing until recently. But I can catch the typical flanging effects of the MP3 filters, and perhaps I do that better than my competitors precisely because of my hearing impairment.'

There may be some truth to this. The psycho-acoustic model underlying MP3 encoding is based on a person with normal hearing. Someone who can perceive frequencies only up to 8 kHz will not hear a bright cymbal or triangle crash, but will very likely hear the quantization noise of the filter bank in the lower frequencies, because in his case that noise is not masked by high-frequency sounds as intended. In addition, the steep-edged filters used in MP3 coding can generate a flanging (or 'jet') effect when the signal changes rapidly.

So it is not listeners with perfect hearing, but those whose hearing deviates strongly from the norm, who seem to be especially sensitive to MP3 artifacts. Psycho-acoustic masking effects are the basis of the MP3 encoding algorithm (the alarm clock goes on ticking even while it rings [but the algorithm does not encode the ticking, because it will be masked by the ringing anyway (G)]), and the algorithm relies on the same effects for the quantization noise it generates, which in general is supposed to be masked by the useful signal. But when a hearing impairment causes this noise to surface, it becomes much easier to detect.

A Shared Second Place

With 20 points each, Jochen Kähler and Tom Weidner from Nuremberg shared second place, followed by Martin Eisenmann from Hamburg. Mr. Eisenmann owns the big B&W Nautilus 801 speakers and, because of his 'deep appreciation of music and desire to accept nothing but the best', spent 40,000 DM on his stereo system. Tom Weidner is an engineer who develops hearing aids, works on audio signal processing algorithms, and is used to taking part 'in complex sound tests, mostly dealing with finding artifacts and sound differences'. Jochen Kähler, while previously employed at the Fraunhofer IIS in Erlangen, had the opportunity to work on Advanced Audio Coding (AAC) and other MP3 successors.

Stefan Weiler from Hambühren, blind from birth and an ardent listener of classical music, jazz, and 'serious light music', possesses perfect pitch and has been actively involved in the development of 'Kunstkopf' recording [a recording device in the form of a human head with microphones in place of the ears, used to obtain a more realistic stereo effect (G)]. Because of an inadvertent slip while communicating his choices to his companion, he came in at an undistinguished fourth place; had he not accidentally switched the Brahms samples, he too would have amassed 20 points. As a consolation we have promised him the opportunity to work on a campaign we are launching for the sight impaired. Weiler identified the MP3 encodings chiefly by the missing 'spatiality of the background noise in the quiet passages', as he explained.

From a statistical point of view

It is true that the data we collected do not support watertight conclusions, but they do provide interesting insights. We wanted to find out which pieces of music were hardest to tell apart from the original and which were easiest. The simple sum of the scores obtained by all participants for each title tells us whether distinguishing the original from the different MP3 encodings was easy or difficult (see the table of scores).

Classical recordings by no means always have an advantage here, and for some pieces the participants were consistently wrong in their choices. The Arabian Dance from Edvard Grieg's Peer Gynt, encoded at 128 kbps, was preferred over the original by more than half of our participants; the compression may have smoothed over small weaknesses of the recording, perhaps a certain roughness of the woodwinds. Chic's 'Jusagroove', a very dynamic, tightly played piece of funk, was on the other hand correctly identified by most listeners.

To understand this phenomenon better, we examined the test results more closely, being particularly interested in the causes of the difficulties. Did the testers have trouble distinguishing the high-quality 256 kbps MP3s from the lower-quality 128 kbps ones, or did the MP3s actually sound better to them than the original CD?

To find out, we modified the evaluation procedure a bit. According to the common expectations about MP3 quality, 128 kbps should sound the worst, 256 kbps better, and the original audio CD best. So we re-scored the results: every test sample that a listener identified as 128 kbps received one point, a sample identified as 256 kbps two points, and a sample identified as the original CD three points, regardless of whether the identification was correct. If a listener could not hear any difference between the three versions, we treated all of them as 'CD quality' and gave each sample three points. 

Then we added up the points for each sample across all listeners. If all 14 people had always guessed correctly, every piece of music would show the same distribution: 14 points for its 128 kbps sample, 28 points for its 256 kbps sample, and 42 points for the original CD. A completely different picture emerged, however: for the pieces our listeners most frequently got wrong, the MP3-encoded samples were generally judged to be superior to the CD sample.
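
Expressed as a small sketch (again in Python and purely our own illustration, not the evaluation tool actually used), the re-scoring credits every sample according to the quality level the listener assigned to it:

    PERCEIVED_VALUE = {"128k": 1, "256k": 2, "cd": 3}   # points per assigned label

    def rescore(answers):
        # answers: (true_source, assigned_label) pairs for one piece,
        # one pair per listener and sample; returns totals per true source
        totals = {"128k": 0, "256k": 0, "cd": 0}
        for true_source, assigned in answers:
            totals[true_source] += PERCEIVED_VALUE[assigned]
        return totals

    # with 14 listeners all guessing correctly, every piece would come out as
    # {'128k': 14, '256k': 28, 'cd': 42}, the benchmark mentioned above
    perfect = [(s, s) for s in ("128k", "256k", "cd") for _ in range(14)]
    print(rescore(perfect))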

Our biggest surprise, however, came when we added up the points achieved by all of the samples at each quality level: 128 kbps, 256 kbps, and CD. The 256 kbps samples and the original CD samples achieved exactly the same score of 501 points each, while the 128 kbps samples scored clearly lower, with a total of 439 points. For those interested in statistics: the values 501 and 439 differ significantly, with a probability of error of one percent (in scientific investigations, statistical deviations are considered significant when the error probability is 5 percent or less). Between the 256 kbps and CD samples, which scored exactly alike, there was of course no statistical difference.
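
Purely as an illustration of how such a significance check can be made (this is not the calculation we actually performed, and the numbers below are invented placeholders, not the real per-piece data), a paired sign-flip permutation test in Python looks like this:

    import random

    # hypothetical per-piece differences (CD total minus 128 kbps total);
    # placeholder values only, NOT the actual test results
    diffs = [5, 2, 7, 1, 4, 6, 3, 2, 5, 4, 1, 6, 3, 5, 2, 4, 2]

    observed = sum(diffs)
    random.seed(0)
    trials = 100_000
    extreme = 0
    for _ in range(trials):
        # under the null hypothesis, the sign of each difference is arbitrary
        flipped = sum(d if random.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= abs(observed):
            extreme += 1
    print("two-sided p-value ~", extreme / trials)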

Summing Up

In plain language, this means that our musically trained test listeners could reliably distinguish the poorer 128 kbps MP3s from either of the two higher-quality versions. But when deciding between 256 kbps MP3s and the original CD, no difference could be determined on average across all the pieces: the testers took the 256 kbps samples for the CD just as often as they took the original CD samples themselves.

The fact that some of the 128 kbps samples were consistently judged better than their CD originals by this skilled group, even by its best members, stunned our editor (who took part in the test, although his results were not included in the evaluation, and who has to confess that he scored only 15 points). It seems safe to say that no musical genre is inherently well or ill suited to compression; apparently quite different factors, related to the technical side of the recording, determine how badly a piece later suffers at low bit rates.

This article will not end the ongoing debate about whether the use of MP3 compression is a reasonable procedure or not. Audiophiles who care about brand names and status will never listen to MP3s, no matter how many tests prove that the listening experience is equivalent. Skeptics ('They are all sissies at c't; I would certainly have heard the difference') should get themselves an encoder and a CD burner and then submit, perhaps even with the same pieces and under similar conditions, to their own 'Pepsi test'.

cm 


References: 

[1] Carsten Meyer, Doppelt blind, MP3 gegen CD: Der Hörtest [Double blind, MP3 versus CD: The Listening Test], c't March, 2000, p. 144 
 
Results of Readers' Listening Test
Listeners: a (b) c d e f g h i j k l m (n). Each row gives the points scored by each listener for that title, followed by the total points per title (stat. random average: 11 points).

Chic - Jusagroove:                             3 3 3 0 3 1 1 1 1 3 3 3 3 3  = 31
Brahms - Ungarische Tänze:                     1 1 1 0 3 0 0 0 1 0 3 3 1 0  = 14
Donald Fagen - IGY:                            1 1 0 0 0 1 1 3 1 0 0 1 0 3  = 12
Anne S. von Otter - I’m a Stranger Here...:    0 3 0 0 0 0 0 3 0 1 0 3 3 3  = 16
Peter Gabriel - Steam:                         3 3 0 1 0 3 1 0 3 1 0 3 3 1  = 22
Leonard Cohen - First We Take Manhattan:       1 3 0 0 0 1 3 1 3 3 0 3 0 0  = 18
Orff - Carmina/Gnomus:                         1 3 0 1 0 1 0 1 3 3 1 1 1 3  = 19
Shostakovitch - Jazz/2 March:                  0 1 1 3 3 1 1 1 0 0 1 0 0 3  = 15
Bill Withers - Ain’t No Sunshine:              1 0 3 1 0 0 0 0 0 0 0 1 0 1  = 7
Adrian Legg - Norah Hanleys Waltz:             0 0 3 0 0 0 0 0 0 1 0 0 3 1  = 8
Liszt - Après une lecture du Dante:            1 0 0 1 3 0 1 0 0 3 1 0 0 0  = 10
Mussorgsky - Bilder einer Ausstellung:         1 1 0 3 0 0 0 0 0 0 1 1 0 1  = 8
Sara K. - Tell Me I’m Not Dreamin:             3 3 1 1 0 0 1 1 1 0 0 1 1 1  = 14
Grieg - Arabischer Tanz:                       0 1 3 3 1 0 0 1 0 0 0 0 0 0  = 9
Marla Glen - The Cost Of Freedom:              1 0 1 0 1 1 3 0 1 0 3 1 0 3  = 15
Anne S. von Otter - Quello di Tito è il volto: 0 0 0 3 0 3 0 0 0 1 0 0 1 0  = 8
Clair Marlo - All For The Feeling:             3 3 3 3 0 3 3 3 3 0 3 1 1 0  = 29

Points/Listener:                               20 26 19 20 14 15 15 15 17 16 16 22 17 23
