Save 99% power for audio processing thanks to a new analog approach
Dolphin Design is tackling a major challenge: making voice user interfaces (VUIs) truly ultra-low-power to enhance the end-user experience.
In our rapidly evolving human-to-machine interface landscape, voice interaction has become an integral part of daily life, transforming the way we engage with technology. Devices we can talk to (voice command) are now a common part of our homes, workplaces, and even our pockets, ushering in a new era of convenience and efficiency.
However, this surge in voice interaction brings a significant challenge: the efficient acquisition, pre-processing (feature extraction), and artificial intelligence (AI) processing (inference) of audio.
Conventional methods of audio acquisition and feature extraction have proven to be energy-intensive, obstructing the implementation of truly low-power, always-on voice listening capabilities, particularly in resource-constrained devices like smart glasses, hearables, and remote controls. These traditional signal acquisition and processing approaches involve analog-to-digital signal conversion and subsequent computational operations, demanding a substantial power budget.
Standard solutions are reaching their power limits
The current landscape of audio processing relies heavily on legacy Audio Analog-to-Digital Converters (ADCs) designed for human-to-human interfaces, not optimized for the demands of human-to-machine interactions. Conventional audio ADCs typically draw at least around 200 µA, whether they sit on the SoC or inside the digital microphone. While these ADCs are designed for high-quality audio playback, they are not well-suited for the requirements of low-power, AI-driven voice processing.
When audio is utilized for AI applications, the high-fidelity Pulse Code Modulation (PCM) signal required for human listening becomes unnecessary. In AI applications, the audio data does not need to be audible to humans; rather, it serves as raw input for feature extraction and the machine-learning algorithms that follow.
The conventional approach often involves running software-based Voice Activity Detection (VAD) on a Digital Signal Processor (DSP) with an always-on, power-hungry ADC, coupled with a resource-intensive Microcontroller Unit (MCU) subsystem. This approach results in suboptimal power efficiency, as the ADC must remain active at all times to detect keywords or speech activity. The constant operation of these power-hungry components places a significant and persistent drain on the system’s power resources.
Furthermore, within this conventional framework, the Fast Fourier Transform (FFT) computation circuit, a critical component for feature extraction, stands out as a major power consumer. Transforming PCM data into Mel Frequency Cepstrum Coefficients (MFCC) typically draws at least around 400 µA and is often in operation for extended periods.
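The conventional digital path described above can be sketched in a few lines. This is an illustrative simulation only, not Dolphin Design's implementation; the parameters (16 kHz sample rate, 25 ms frames, 40 mel bands, 13 coefficients) are common textbook choices, not figures from the article.

```python
# Sketch of the conventional digital feature-extraction chain:
# PCM frames -> windowed FFT power spectrum -> mel filterbank -> log -> DCT.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        lo, c, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, c):
            fb[i, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fb[i, k] = (hi - k) / max(hi - c, 1)
    return fb

def mfcc(pcm, sr=16000, n_fft=512, hop=160, n_filters=40, n_ceps=13):
    window = np.hamming(n_fft)
    fb = mel_filterbank(n_filters, n_fft, sr)
    # DCT-II matrix: turns log-mel energies into cepstral coefficients.
    dct = np.cos(np.pi / n_filters *
                 (np.arange(n_filters) + 0.5)[None, :] *
                 np.arange(n_ceps)[:, None])
    frames = []
    for start in range(0, len(pcm) - n_fft + 1, hop):
        spec = np.abs(np.fft.rfft(pcm[start:start + n_fft] * window)) ** 2
        logmel = np.log(fb @ spec + 1e-10)
        frames.append(dct @ logmel)
    return np.array(frames)  # shape: (n_frames, n_ceps)

# One second of noise at 16 kHz -> 97 frames of 13 coefficients each.
pcm = np.random.default_rng(0).standard_normal(16000)
features = mfcc(pcm)
print(features.shape)  # (97, 13)
```

Every frame here requires a full FFT plus matrix products, which is exactly the always-on digital workload the article attributes the ~400 µA figure to.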
Analog Front-End Feature Extraction: Transforming Voice Processing
The emergence of Analog Front-End (AFE) feature extraction represents a pivotal shift, offering innovative and efficient solutions to the challenges posed by the conventional approach.
This AFE consists of two key components: an analog filter bank and an analog envelope detector bank. The role of the filter bank is to dissect the input audio signal into its fundamental frequency components, effectively “listening” to the voice in a way that mimics the human auditory system. Once the frequency components are separated, the envelope detector bank steps in, detecting the envelope, which represents the average power of each frequency component. This ingenious process results in the creation of a feature vector. Each element of this vector corresponds to the short-term average power within a specific frequency sub-band. As these feature vectors are stacked over time, a spectrogram of the input signal emerges, serving as the basis for further processing in the neural network backend.
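The filter-bank-plus-envelope-detector idea above can be simulated digitally. The sketch below is a rough software analogue, assuming an arbitrary five-band split and second-order Butterworth filters; the real AFE implements this in analog circuitry with its own band edges.

```python
# Digital simulation of the AFE concept: a bank of band-pass filters splits
# the signal into sub-bands, and an envelope detector (full-wave rectify,
# then low-pass) measures the short-term average power of each band.
import numpy as np
from scipy.signal import butter, lfilter

def afe_features(x, sr=16000, hop=160,
                 bands=((100, 300), (300, 700), (700, 1500),
                        (1500, 3000), (3000, 6000))):
    channels = []
    for lo, hi in bands:
        # Band-pass filter: isolates one frequency sub-band.
        b, a = butter(2, [lo / (sr / 2), hi / (sr / 2)], btype="band")
        sub = lfilter(b, a, x)
        # Envelope detector: rectify, then smooth with a 50 Hz low-pass.
        be, ae = butter(2, 50 / (sr / 2))
        env = lfilter(be, ae, np.abs(sub))
        # Keep one envelope sample per hop: the feature-vector rate.
        channels.append(env[::hop])
    # Stacking the per-band envelopes over time yields a coarse spectrogram.
    return np.stack(channels, axis=1)  # shape: (n_frames, n_bands)

x = np.random.default_rng(1).standard_normal(16000)  # 1 s of noise
spec = afe_features(x)
print(spec.shape)  # (100, 5)
```

Note the contrast with the FFT pipeline: there is no transform at all, only filtering and averaging, which is why the analog equivalent can run continuously at very low power.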
This AFE emulates the functioning of the human ear, mirroring the cochlea’s ability to decode and process auditory information. It encodes the extracted features into multi-channel, parallel, asynchronous event streams. These streams mimic the spikes generated in the cochlea and carried along the auditory nerve fibers to the primary auditory cortex. This human-inspired approach offers unparalleled efficiency for audio processing.
The journey of AFEs in audio and voice AI processing has been one of rapid advancement, with various labs and universities dedicating years to research and development. The roots of AFE for feature extraction date back over three decades, with early efforts focusing on the utilization of switched-capacitor (SC) circuits for parallel spectrum analysis. The neuromorphic concept of silicon cochlea design originated in Carver Mead’s lab in the late 1980s, setting the stage for the groundbreaking solution Dolphin Design has released.
Dolphin Design has emerged as a pioneer in industrializing this promising AFE solution, making it accessible beyond the confines of research labs. Their groundbreaking achievement lies in being the first IP provider to implement an analog front-end for feature extraction, challenging the traditional boundaries of voice and audio processing.
By replacing the ADC, the VAD, and the DSP with a single mixed-signal IP, SoC providers can drastically reduce system power consumption, from approximately 1000 µW to only 7 µW. The AFE is designed to operate at remarkably low power for edge AI SoCs: it can run from just a 32 kHz clock in an always-on domain while consuming only 7 µW, a feat that promises to reshape the power-efficiency landscape.
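A quick back-of-the-envelope check of the headline figure, using only the two power numbers quoted above:

```python
# Power savings from replacing a ~1000 µW conventional chain (ADC + DSP
# VAD + MCU subsystem) with a ~7 µW always-on analog front-end.
conventional_uw = 1000.0
afe_uw = 7.0
saving = 1.0 - afe_uw / conventional_uw
print(f"{saving:.1%} power reduction")  # 99.3% power reduction
```

This is where the "save 99% power" claim in the title comes from.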
Using Dolphin Design’s WhisperExtractor for feature extraction in audio and voice AI processing brings several additional benefits:
- Reduced Complexity: AFEs simplify software design and implementation by extracting features from audio signals before digitization. This reduction in complexity streamlines software algorithms used in classification, recognition, and other critical tasks.
- Smaller Preroll Buffer: The lower data density of Mel Frequency Cepstrum Coefficients (MFCC) compared with raw PCM data enables a smaller SRAM in the always-on domain when a preroll is necessary. This leads to optimized data storage and faster processing.
- Integrated VAD: The WhisperExtractor AFE also includes an autonomous voice activity detector capable of waking up the system when a voice is detected.
- Performance: Real-time feature extraction improves the responsiveness of edge AI applications, enhancing their overall performance.
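The preroll-buffer point above can be made concrete with a size comparison. The figures below (16 kHz 16-bit PCM versus 100 feature frames per second of 13 16-bit coefficients) are illustrative assumptions, not WhisperExtractor's actual frame format.

```python
# SRAM needed to buffer 1 s of preroll: raw PCM vs. compact feature frames.
sr, sample_bytes = 16000, 2                 # 16 kHz, 16-bit samples
pcm_bytes = sr * sample_bytes               # 32000 bytes of raw audio

frame_rate, n_ceps, coef_bytes = 100, 13, 2 # assumed feature-frame format
feat_bytes = frame_rate * n_ceps * coef_bytes  # 2600 bytes of features

print(pcm_bytes, feat_bytes, pcm_bytes // feat_bytes)  # 32000 2600 12
```

Even with these conservative assumptions the feature representation is over an order of magnitude smaller, which is what allows the always-on SRAM to shrink.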
What is the impact on accuracy? Advanced testing with standard open-source datasets in English and Chinese shows that WhisperExtractor maintains good inference accuracy while reducing power consumption by up to 99%.
For this test, a standard CNN model without optimization was used.
In conclusion, the advent of Analog Front-End (AFE) feature extraction marks a significant step forward in voice and audio processing. With its human-inspired approach and practical benefits, it promises to revolutionize the way we capture and process voice, offering a pathway to a more efficient, responsive, and versatile future. WhisperExtractor is not a lab curiosity: Dolphin Design has turned it into a real industrial solution.
Opening New Horizons: The Promise of WhisperExtractor
Gone are the days when edge AI’s audio processing capabilities were limited to voice activity detection (VAD). Today, speech recognition can run directly at the edge. The horizon of voice technology is continually expanding, with the near-future prospect of running Large Language Models (LLMs) directly at the edge. To unlock these capabilities, it is imperative to drastically reduce the power consumption of the initial layers of the audio acquisition and processing chain.
Applications that operate within stringent power constraints, such as AR glasses, smart remotes, and event detection, have long struggled to meet consumer expectations for battery life. WhisperExtractor emerges as a transformative solution, reducing power consumption for voice user interfaces by 99%. By breaking through these constraints, it ushers in a new era of power efficiency, allowing these applications to run for days.
WhisperExtractor’s mixed-signal architecture is a versatile and adaptable solution designed to serve a wide spectrum of applications. Whether you’re building smart home devices, wearables, IoT gadgets, or any other voice-enabled technology, WhisperExtractor can seamlessly integrate with and elevate your product. Its adaptability and compatibility with various machine-learning models and accelerators ensure that developers can harness its potential for a wide range of voice-processing needs. WhisperExtractor’s analog front-end can feed various types of inference processing, including Analog Neural Networks (ANNs), Spiking Neural Networks (SNNs), and more traditional Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
Key Takeaways: WhisperExtractor - Revolutionizing Voice Technology
In conclusion, Dolphin Design’s innovation of an Analog Front-End capable of feature extraction represents a groundbreaking leap forward in voice and audio processing, redefining how we approach voice and audio AI. WhisperExtractor, with its innovative mixed-signal architecture, addresses the power-consumption challenges of conventional methods, promising a paradigm shift for low-power, always-on voice user interaction in applications like AR glasses, smart remotes, hearables, and event detection.
About the author
Marketing manager at Dolphin Design