
New Listening Test – A Proposal

It’s time for a better listening test. It’s time to use our understanding properly.

A proper listening test…

  1. needs to use all available sensory data from a modern smartwatch or wearable device
  2. needs to be portable and self-contained, to allow mobile use and multiple playback locations
  3. needs to account for the musical style preferences of the test subject
  4. needs to treat half-song units as its shortest measurement, rejecting fast-switching between samples
  5. needs to be blind without altering the listener’s normal and natural listening state
  6. needs to avoid comparisons between a memory and a real sample
  7. needs a moniker as easy to remember as ABX or Blind


Why is this needed?

ABX tests are not sufficient to measure musical enjoyment. The articles below offer some thoughts on why that is.

Why ABX Testing Usually Produces Null Results with Human Subjects by Teresa Goodwin
The Problem with A-B’ing and Why Neil Young is Right about Sound Quality by Allen Farmelo
Double Blind Testing by Mark Deneen
Hearing Science Has Not Decoded Musical Enjoyment



Materials Needed.

A combination of 3 small devices and a notebook could deliver the needed features to conduct this new style of test:

A – HD DAP: Used for music playback. A high-quality digital audio player loaded with several albums to the test subject’s liking, at various resolutions, without the ability for the user to see the resolution. They do see a 2-character ID (playback ID) that flashes on the screen occasionally over the album artwork, which they transfer to their logs and the video. I believe volume leveling could be used, but this could be a problem point.

B – Smartwatch: Used for physical inputs during music playback. Possible data points include body motion, GPS location, vibration, heart rate, tap speed, tap frequency, and tap strength. Must be able to quickly add the playback ID to the watch data.

C – Smartphone: Used for video recording during playback, and then for logging emotional feedback after playback. Must also include the playback ID. Prompt survey questions can be designed, such as “starting at 5, rate your emotional state after playback on a 1–10 scale, with 1 being worse and 10 being better,” or some other method of getting a read on their general happiness after playback.

D – Diary/notebook: Used to jot down ideas and emotional feedback during playback, one page per playback.

A & B are on the market and people already own C.
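As a sketch of the blinding mechanism item A describes, the snippet below assigns each prepared version of a song a random 2-character playback ID, so the listener only ever sees the ID while the ID-to-resolution key stays with the experimenter. The function name `assign_playback_ids` and the file names are hypothetical illustrations, not part of any real DAP firmware:

```python
import random
import string

def assign_playback_ids(versions):
    """versions: list of (filename, resolution) tuples for one song.

    Returns a listener-facing playlist (ID -> file) and an
    experimenter-only key (ID -> resolution).
    """
    # Draw unique letter+digit IDs like "F3" or "R9".
    ids = random.sample(
        [c + d for c in string.ascii_uppercase for d in string.digits],
        len(versions),
    )
    playlist = {pid: fname for pid, (fname, _) in zip(ids, versions)}
    key = {pid: res for pid, (_, res) in zip(ids, versions)}
    return playlist, key

playlist, key = assign_playback_ids([
    ("song_hires.flac", "4000k"),
    ("song_cd.flac", "1400k"),
    ("song.mp3", "320k"),
])
```

The listener's logs and video reference only the IDs; the key is joined back in during analysis.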


DAP – Digital Audio Player

Playback ID – basic code given to each playback unit (song) for tracking purposes

Playback session – the listening session being studied, ranging from 2-8 songs (10 – 30 minutes)

Realtime – the actual moment you hear the music, not your recollection of hearing it (these are two very different processes for our brains)

Slate – saying and/or displaying the ID on video

Digital sensors finally paying attention to analog senses!


Test workflow:

Step 1 – When the subject wants to listen to music they turn on the DAP and browse for a song or album they like. It works with all their existing playback systems.

Step 2 – Upon making their selection, the DAP plays a random resolution of the file, hidden from the user behind a simple playback ID like F3 or R9 that flashes on the screen 5-10 times during playback. Its first appearance isn’t until at least 2 minutes into the song, forcing concentration and removing fast-switching from the data. The listener is encouraged to jot down feedback and emotional changes in their notebook using the ID tag, one page per ID.
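A minimal sketch of the flash timing Step 2 calls for: draw 5-10 flash times at random, none earlier than two minutes in. The helper `schedule_id_flashes` and its defaults are hypothetical, assumed here just to make the constraint concrete:

```python
import random

def schedule_id_flashes(song_length_s, first_flash_s=120, n_min=5, n_max=10):
    """Return sorted on-screen times (in seconds) for the playback ID.

    Flashes 5-10 times, with the first appearance no earlier than
    two minutes into the song.
    """
    n = random.randint(n_min, n_max)
    times = [random.uniform(first_flash_s, song_length_s) for _ in range(n)]
    return sorted(times)

flashes = schedule_id_flashes(240)  # a 4-minute song
```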

Step 3 – After playback of the entire song, a 10-second pause is inserted while the screen is slated with the previous ID and the upcoming ID; the second file is then provided for playback, again hidden behind a playback ID.

Step 4 – If the test subject hears an immediate change in quality they are to record this in their notebook, such as “F3 > R9”.

Step 5 – Video should be taken of the subject when listening. The subject can announce, slate or sing the playback ID to the video for later sync upon review.

Step 6 – Watch sensor data can be correlated by playback time and/or entering in the playback ID. (This could be combined with #4 on the watch UI.)
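The time-based correlation in Step 6 could be sketched like this: tag each raw watch sample with the playback ID whose session window contains the sample's timestamp. The function `tag_sensor_samples` and the sample data are hypothetical stand-ins for whatever the watch actually exports:

```python
from datetime import datetime, timedelta

def tag_sensor_samples(samples, sessions):
    """samples: list of (timestamp, heart_rate).
    sessions: list of (playback_id, start, end).
    Returns samples annotated with the matching playback ID (or None).
    """
    tagged = []
    for ts, hr in samples:
        pid = next((p for p, start, end in sessions if start <= ts <= end), None)
        tagged.append((ts, hr, pid))
    return tagged

t0 = datetime(2015, 1, 1, 20, 0)
# Two back-to-back playbacks with the 10-second slate pause between them.
sessions = [
    ("F3", t0, t0 + timedelta(minutes=4)),
    ("R9", t0 + timedelta(minutes=4, seconds=10), t0 + timedelta(minutes=8)),
]
samples = [
    (t0 + timedelta(minutes=2), 72),
    (t0 + timedelta(minutes=6), 80),
]
tagged = tag_sensor_samples(samples, sessions)
```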

The playback session should range from 5:00 (1-2 songs) to 30:00.

Step 7 – Upon completion of their playback session, the subject should review the video in real time while textually logging their experiences and recollections of the playback on the appropriate notebook pages. Emotional ratings can be applied based on mood changes experienced during playback, e.g. “this song made me feel the most improvement this time”.

Any musical details or excitement the subject felt should be put into words as best as possible. If we only ask for positives, the data can be weighted properly, with word count roughly equivalent to excitement.
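The word-count-as-excitement idea reduces to a one-liner; a minimal sketch, assuming a hypothetical `excitement_scores` helper and invented notebook entries:

```python
def excitement_scores(notebook):
    """notebook: dict of playback ID -> free-text positive feedback.

    Since only positives are logged, the word count of each entry
    serves as a rough excitement score.
    """
    return {pid: len(text.split()) for pid, text in notebook.items()}

scores = excitement_scores({
    "F3": "huge dynamic swing in the chorus gave me chills",
    "R9": "fine",
})
# F3 outscores R9 simply because the listener had more to say about it.
```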

Step 8 – An additional data point: ask the subject after each session which single ID out of that batch they would want to own, based on their listen and then their realtime review of the video.

Step 9 – The entire set of test data is then collected for analysis (video, logs, and watch data), and the test subject should be given a reward for completing the task. Perhaps one of the albums in HD or each of the songs they identified as wanting to own.



It’s clear to me that most of this could be done with a watch app and a DAP with a slightly modified UI, though you would probably have to pre-load the DAP with content.

If there were only 48 hours in every day.


The PonoPlayer Revealer feature is a nice ABCDE test, but it’s not blind unless you make it so.


Upon reporting on the data, the simple replacement of playback IDs with genre/resolution tags (classical/4000k; modern pop/320k; classic rock/1400k) will make it human-readable.

This should allow for a variety of graphs covering emotional reactions to genres, resolutions, time of day, or position in the playback session sequence.

You could slice this data many different ways, and data visualizations overlaid on the video evidence could make a compelling case for whatever results are found.
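The ID-to-tag replacement and one of the suggested slices can be sketched together: swap each playback ID for its genre/resolution tag, then average the emotional ratings per resolution. The key contents and ratings below are invented for illustration:

```python
from collections import defaultdict

# Experimenter's key: playback ID -> (genre, resolution), per the tag
# scheme in the text. These particular entries are made up.
KEY = {
    "F3": ("classic rock", "1400k"),
    "R9": ("classic rock", "320k"),
    "A7": ("classical", "4000k"),
}

def average_by_resolution(ratings):
    """ratings: list of (playback_id, emotional_rating on the 1-10 scale).

    Returns the mean rating per resolution, ready for graphing.
    """
    buckets = defaultdict(list)
    for pid, rating in ratings:
        _, resolution = KEY[pid]
        buckets[resolution].append(rating)
    return {res: sum(vals) / len(vals) for res, vals in buckets.items()}

averages = average_by_resolution([("F3", 8), ("R9", 5), ("A7", 9), ("F3", 6)])
```

The same bucketing works for genre, time of day, or sequence position by swapping the grouping field.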



More to come – please comment if this intrigues you. I have been wanting to design a replacement for flawed AB tests for some time now; this could be it.


Addendum – The Ayre method of listening. I have to find the post Charlie Hansen of Ayre made regarding how he and his staff do listening tests. It was awesome and somewhat similar to many of the ideas I lay out above.

To radically summarize and paraphrase: they often pick components in circuit designs based on listening test results alone. But it’s not an ABX test. They use a long-term listening test (over several hours, even days) done with a variety of music, with parameters and suggestions designed by Charlie to force the listener to critically determine how the playback is making them feel. They are told by their boss to select the component that feels, and therefore sounds, best; all but ignoring specs or meter readings and going with an educated ear over figures.

This is a reaction based on the simple fact that music you love will move you more emotionally when presented properly than when presented incorrectly, but you need time and concentration to digest the music properly.

Charlie, or anyone at Ayre, if you read this, please post and keep the discussion going. I would love to find some way to formalize a new listening test format.

If you have any idea how to formalize/publish a new test spec, or want to find some funding to actually run one of these studies, contact me and let’s knock this thing out!