The Danger of Perceptual Coding

Perceptual coding is responsible for data loss that is greatly misunderstood and perhaps even dangerous to society.

What is perceptual coding? It’s a data compression concept used in audio, video, and streaming technologies.

It is not the same as sending a file to ZIP. ZIP is lossless compression, like FLAC: nothing is thrown away. MP3 and AAC instead use perceptual coding to rank the importance of the data and permanently discard whatever ranks lowest.

Why does perceptual compression exist? Native media files tend to be large. In the 1990s it was difficult to move these files around because they were too large for the network speeds and storage prices of the time. Extreme data compression was needed.

A CD might hold 10 songs at 40 MB each, for a total of 400 MB. How do you get that 40 MB song file small enough to fit through a dial-up modem and play on the other side in real time?
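To put rough numbers on that question, here is a back-of-the-envelope sketch. The CD audio parameters (44.1 kHz, 16-bit, stereo) and the nominal 56 kbit/s modem speed are assumptions added for illustration, not figures from this article, and real dial-up throughput was usually lower.

```python
# Rough arithmetic for the 40 MB song example.
# Assumed (not from the article): standard CD audio parameters and a
# nominal 56 kbit/s dial-up modem.

CD_SAMPLE_RATE_HZ = 44_100
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2
MODEM_BITS_PER_SECOND = 56_000

SONG_BYTES = 40 * 1_000_000  # the 40 MB song, taking 1 MB = 10^6 bytes

# Bitrate of uncompressed CD audio.
cd_bits_per_second = CD_SAMPLE_RATE_HZ * CD_BITS_PER_SAMPLE * CD_CHANNELS

# How long the song plays, and how long it takes to send uncompressed.
song_seconds = SONG_BYTES * 8 / cd_bits_per_second
download_seconds = SONG_BYTES * 8 / MODEM_BITS_PER_SECOND

# Reduction needed for the modem to keep up with playback.
required_ratio = cd_bits_per_second / MODEM_BITS_PER_SECOND

print(f"CD audio bitrate: {cd_bits_per_second} bit/s")
print(f"Song length: {song_seconds / 60:.1f} minutes")
print(f"Uncompressed download time: {download_seconds / 60:.0f} minutes")
print(f"Reduction needed for real-time play: about {required_ratio:.0f}:1")
```

Under those assumptions, a song that plays for under four minutes would take about an hour and a half to download uncompressed, and real-time playback would need roughly a 25:1 reduction.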

The answer was perceptual coding, the trick behind lossy compression. It has been used for decades in voice transmission compression. You have to go inside the audio data and start throwing sound away.

But what sounds can be thrown away? How do you go inside of a mixed piece of music and delete things? And how far can you go before people notice a quality drop?

Perceptual coding can’t do things like delete the second guitar solo or reduce the backing vocals; that can only be done in the mix of the song.

Perceptual coding also can’t make the song acoustic or shorter in length; those changes can only be made in the mixing stage.

What perceptual coding does do is analyze the sounds in the song and prioritize them. The codec’s programmers decided which sounds rank as more important on that scale.

First it locates the lead sounds – the main instruments/voices in the material.

There might be five primary sound makers in your song: say drums, bass, guitar, keys, and voice. Perceptual coding manages to quarantine those and removes only small amounts of their identifying data.

This allows a listener to quickly ID the melody, the lyric, the artist, and the song since these primary elements are only slightly degraded.

But you can’t achieve 90% overall data reduction by only slightly degrading the material. Perceptual coding achieves the bulk of its loss from outside of the primary sounds.

This includes everything not inside the primary sound including the echoes and delays of the primary sounds. In fact all reverbs, delays and room sounds are attacked and removed. Other things outside the primary sound are timbre characteristics, breaths, string and instrument noise, room shape and activity, and soundstage timing cues. All of this is shorthanded to “the tone” and “the soundstage”.

By masking and/or deleting all kinds of sounds that the codec designers believe listeners cannot reliably perceive*, they achieve massive size reductions.

*What the smart DSP programmers behind perceptual coding understood is that while people can easily hear this loss in the music, most can’t identify it reliably and consistently using the same terminology. And good luck having any of this come out in the whacked-out world of ABX listening tests.
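The prioritize-then-discard idea described above can be sketched in toy form. This is only an illustration under heavy assumptions: it is not the psychoacoustic model used by MP3 or AAC, which rely on masking thresholds that vary with frequency and time and which quantize data rather than simply zeroing it. The function name discard_quiet_bins, the single global 40 dB floor, and the test signal are all hypothetical.

```python
import cmath
import math


def dft(samples):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def discard_quiet_bins(spectrum, floor_db):
    """Zero every bin more than `floor_db` decibels below the loudest bin.

    Hypothetical stand-in for a masking decision: one global threshold
    instead of a real psychoacoustic model.
    """
    peak = max(abs(c) for c in spectrum)
    cutoff = peak * 10 ** (-floor_db / 20)
    return [c if abs(c) >= cutoff else 0j for c in spectrum]


N = 64
# A loud "primary" tone in bin 4 plus a quiet detail 60 dB below it in bin 20.
signal = [
    math.sin(2 * math.pi * 4 * t / N) + 0.001 * math.sin(2 * math.pi * 20 * t / N)
    for t in range(N)
]

spectrum = dft(signal)
kept = discard_quiet_bins(spectrum, floor_db=40)

# Only the loud tone survives (bin 4 and its mirror image, bin 60).
surviving_bins = [k for k, c in enumerate(kept) if c != 0]
print(surviving_bins)
```

In this sketch the quiet component is present in the original spectrum but vanishes entirely after thresholding, while the loud tone passes through untouched. That is the general shape of the trade the text describes.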

If most can’t identify what is gone, but can identify the song and sing along, the codec is considered a success. And MP3 was and still is a huge success by those metrics.

But listen to “Ghost in the MP3” to get an idea of what perceptual coding takes away from your music.

The destruction of all of the natural movement, transients, and timing cues has a long-lasting effect on our music, which has a long-lasting effect on our psyche.

The things that perceptual coding deems unnecessary and inaudible are in fact the critical emotional elements of the music.

This amounts to a perceptual loss in all modern music and is the reason behind two trends: (1) robotic voices with fake instruments, and (2) hyper-fast switching of sounds from disparate sources with heavily active pan and audio limiter settings.

When your end result is forced to be artificial and limited in size and range, hip producers know to co-opt the weaknesses and make them strengths. The more artificial and huge you can sound the better.

No point in producing realism when there is none at the distribution.