The Power of Suggestion Part II: The Wizards behind Merlin Sound ID

By Ryan Nakano

Two weeks ago I wrote about Merlin Sound ID in response to a long-time member’s inquiry about the automated bird identification tool’s accuracy. At best the article captured the use cases for such an incredibly powerful tool in light of its imperfections and within the context of learning to identify birds. At worst, it flushed readers due to the article’s unnecessary length, or worse still, flushed beginner birders from using Merlin Sound ID out in the field.

I hope it was more the former and less the latter. Regardless, I never answered the original question that prompted the article in the first place: how accurate is Merlin Sound ID?

Hence, Part II. This go-around I went straight to the source and spoke to one of the many wizards behind the magic: Lead Developer of Merlin Sound ID, Dr. Grant Van Horn.

Alright bird nerds, time to get in the weeds on machine learning, recall, precision, and pattern recognition. 

Q: How accurate is Merlin Sound ID? 

GVH: “Because we’re running in real time and the metrics we care about revolve around recognizing every vocalization being produced, we have to optimize two metrics: precision and recall. This notion of “accuracy” is a term that is convenient for English speakers but it doesn’t quite match the machine learning requirements.”  

According to Van Horn, precision in this case is concerned with how often Merlin is correct when it identifies a particular bird species (the failure being a false positive), whereas recall is concerned with how often Merlin detects a particular bird species when it is actually present (the failure being a false negative).

As far as Merlin Sound ID is concerned, the tool’s baseline goalpost is 90% precision and 70% recall for every bird species it’s currently trained on. In layman’s terms, this means that when you hit record on Merlin Bird ID and it picks up a bird vocalization, 9 times out of 10 it will have predicted the correct bird species, and when you hear a bird species vocalize, 7 times out of 10 Merlin will be confident enough to make a prediction. Again, these are the baseline metrics that Merlin Sound ID is checked against by its designers; actual “accuracy” will vary from species to species.

Ultimately, of course, the goal is 100% on both metrics for all bird species, but this is the standard a particular species must meet before it is released into the wild (the Merlin-user community), so to speak.
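For readers who like to see the arithmetic, here is a minimal sketch of how the two metrics are computed. The tallies are made up for illustration (they are not Merlin data), chosen so that one imaginary species lands exactly on the 90/70 baseline.

```python
def precision(true_positives, false_positives):
    """Of all the times the tool named this species, what fraction was right?"""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Of all the times this species really vocalized, what fraction was caught?"""
    return true_positives / (true_positives + false_negatives)


# Hypothetical tallies for a single species (illustrative numbers only).
correct_ids = 63    # tool said "species X" and it was species X
wrong_ids = 7       # tool said "species X" but it was something else
missed_birds = 27   # species X vocalized but the tool stayed quiet

species_precision = precision(correct_ids, wrong_ids)   # 63 / 70
species_recall = recall(correct_ids, missed_birds)      # 63 / 90

print(f"precision: {species_precision:.0%}, recall: {species_recall:.0%}")
# prints "precision: 90%, recall: 70%"
```

Notice that the two numbers have different denominators: precision divides by everything the tool predicted, while recall divides by everything the bird actually did. That is why a single "accuracy" figure does not capture both.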

Why 90/70? To begin to answer this question, we ask a different question. 

Q: When Merlin Sound ID was initially being envisioned, what experience/user did you have in mind? 

GVH: “There were previous projects to Merlin Sound ID and acoustic analysis goes way back. The thing I didn’t see back in 2019 was something that was fun and easy to use, everything had a manual step. When you’re birding by ear with an expert they’re usually just calling things out and pointing in a particular direction saying ‘that’s it, that’s it, that’s it’ so when we started building Sound ID my main goal was to mimic that experience. It was like ‘okay, when I’m not with these amazing birders can I replicate the experience even partially?’ I’ll never tell you that Merlin will be like birding with another human, but can it get you part of the way there? Can it help beginners and novice birders make sense of what they’re hearing?”

This emphasis on mimicking an “expert in the field” and hooking beginners and novice birders helps explain the baseline goalposts. If an expert is not confident in what they’re hearing, they are less likely to call out a particular species; when they do make a call, it’s usually spot on. And of course, when it comes to reputation and trust, new birders expect a high level of both.

GVH: “Sound ID is released exclusively in Merlin, but you could imagine that if we released a version of it in eBird we might set the model configurations to have higher precision and lower recall. The version of Sound ID in Merlin is tuned for a beginner birder.”
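Van Horn doesn’t say which “model configurations” would be changed, so treat the following as a sketch under one big assumption: that the knob is a confidence threshold, which is a common way to trade recall for precision in machine learning generally. The scores below are invented for illustration, not taken from Merlin.

```python
# Hypothetical clips for one species: (model confidence, bird really vocalizing?).
scored_clips = [
    (0.95, True), (0.90, True), (0.85, True), (0.80, True),
    (0.70, False), (0.65, True), (0.55, True), (0.50, False),
    (0.40, True), (0.30, False),
]


def precision_recall(clips, threshold):
    """Only call out the species when confidence is at or above the threshold."""
    called_out = [is_bird for score, is_bird in clips if score >= threshold]
    true_positives = sum(called_out)
    total_vocalizations = sum(is_bird for _, is_bird in clips)
    # With no predictions at all there are no false positives to count.
    precision = true_positives / len(called_out) if called_out else 1.0
    recall = true_positives / total_vocalizations
    return precision, recall


# A lower (assumed "beginner") threshold versus a higher (assumed "strict") one.
beginner_precision, beginner_recall = precision_recall(scored_clips, 0.45)
strict_precision, strict_recall = precision_recall(scored_clips, 0.75)

print(f"threshold 0.45: precision {beginner_precision:.2f}, recall {beginner_recall:.2f}")
print(f"threshold 0.75: precision {strict_precision:.2f}, recall {strict_recall:.2f}")
```

In this toy data, raising the threshold removes the wrong calls but also silences some correct ones, which is the trade-off behind tuning one version for beginners and, hypothetically, another for a stricter audience.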

Okay, let’s take a pause here to quickly talk about “experts in the field”, because while we are talking about a machine learning tool, there are very real birders doing an incredible amount of work to train Merlin Sound ID. 

Since 2020, the year the tool launched, close to 300 birders have helped annotate 225,000+ audio recordings, listening to them and drawing 2.4 million boxes on spectrograms to flag the precise moments when a particular bird species vocalizes in a recording. These annotated audio files are then used to train and test Merlin Sound ID. Van Horn refers to these birders as the heroes of Merlin Sound ID. No kidding. I love a good birdsong/call as much as the next person, but that’s truly wild dedication.

Q: Okay, so how does the machine learning aspect work for Merlin Sound ID?

GVH: “This form of machine learning is based on pattern recognition. Early on I asked a human (birding) expert how they recognized bird calls. They popped open Adobe Audition, loaded a file, rendered the spectrogram, and said ‘you see how Alder Flycatcher has this shape and Willow Flycatcher has this shape?’ I was like ‘oh my gosh that’s just a computer vision problem, like facial recognition’. So for Merlin Sound ID we want a lot of examples of these patterns in different contexts to show the tool so it can ‘see’ the pattern in the future when someone gets it on their phone.”

Q: That makes sense, but ummm how does the machine learn? 

GVH: “So an image on your computer is just a sequence of numbers represented as a matrix. Think back to your algebra days. There are these other little matrices called ‘weights’ that we use to multiply the image by. Sound ID is really a collection of these matrices. Each one of these we’re going to multiply doing matrix-matrix multiplication with our image and the output of those multiplications we can interpret as maybe a Northern Cardinal, a Tufted Titmouse or White-breasted Nuthatch. So the ‘learning’ happens through adjusting all the little values of those small matrices.”

Sheesh, my mind left as soon as the word “matrix” was uttered. But someone reading this gets it. 
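For that someone, here is a toy sketch of the idea. To be clear about what is assumed: the four-number “image”, the starting weights, and the simple nudge rule (a perceptron-style update) are all stand-ins chosen for brevity. Van Horn did not describe Merlin’s actual training procedure, and a real model multiplies through many matrices rather than the single one shown here.

```python
SPECIES = ["Northern Cardinal", "Tufted Titmouse", "White-breasted Nuthatch"]

# A tiny pretend spectrogram, flattened into a list of numbers.
image = [0.9, 0.1, 0.8, 0.2]

# One row of weights per species (hypothetical starting values).
weights = [
    [0.1, 0.4, 0.1, 0.4],
    [0.3, 0.3, 0.3, 0.3],
    [0.2, 0.2, 0.2, 0.2],
]


def scores(weights, image):
    """Matrix-vector multiplication: one score per species."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def predict(weights, image):
    """Pick the species whose row of weights gives the highest score."""
    s = scores(weights, image)
    return SPECIES[s.index(max(s))]


def learn(weights, image, correct_index, step=0.1):
    """If the guess is wrong, nudge weights toward the right species."""
    s = scores(weights, image)
    guess_index = s.index(max(s))
    if guess_index == correct_index:
        return
    for i, x in enumerate(image):
        weights[correct_index][i] += step * x  # raise the correct score
        weights[guess_index][i] -= step * x    # lower the wrong one


before = predict(weights, image)

# Pretend an annotator labeled this clip as a Northern Cardinal (index 0).
for _ in range(10):
    learn(weights, image, correct_index=0)

after = predict(weights, image)
print(f"before learning: {before}; after learning: {after}")
```

The “learning” is nothing more than the small adjustments inside `learn`: each wrong guess shifts the numbers a little until the multiplication produces the label the annotators provided.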

Q: So we spoke a bit about pattern recognition, and I was curious, what patterns have you noticed, or what predictions do you have, with regard to occasional Merlin Sound ID challenges to precision or recall?

GVH: “Birds that are ‘out of domain’: the data set is still biased to the East Coast in the U.S., which means it will do better on those birds. Mimicry. Similar-sounding songs: especially if the beginning of a song is similar, Sound ID will jump the gun and make a prediction based on the first couple of seconds. Also, if you walk or talk you can drop the recall because the other sounds will dominate the signal.”

Q: Why do you train Merlin Sound ID on nonbird audio recordings? 

GVH: “We maintain a huge resource of audio with no birds to capture what the world sounds like in the absence of birds. This helps us cut down on false positives on machine noise, amphibian noise, etc. Showing the tool what the world sounds like with no birds teaches it when it should not be predicting bird species.”

Q: Is there a way to report MisIDs to Sound ID, and more generally, how can we improve the tool?

GVH: “There’s no current way in Merlin. That said, I strongly encourage birders to take their recordings, save them, especially if they’re at least 10 seconds, and add them to their eBird checklists. Leave a comment to the annotators, tell us about the background species, tell us that Sound ID did not detect the goldfinch or misID’d it as a different species. It really helps the annotators quickly get up to speed on what is happening and what we need to improve. Our big future goal for Merlin is to allow for media upload and contribution through Merlin itself, so that audio and photo data come back to us to show us where we are working well and not working well.”

Ryan Nakano is the Director of Communications for Golden Gate Bird Alliance.