xiaozhi/xiaozhi-esp32

Fork 1

mirror of https://github.com/78/xiaozhi-esp32.git synced 2026-02-11 14:43:47 +00:00

Files

History

Xiaoxia 9215a04a7e Delay init success sound playback and remove gif playback delay (#1748 )

* refactor: Remove hardcoded loop delay for GIF playback in LcdDisplay class

* chore: Update esp-ml307 and uart-eth-modem component versions in idf_component.yml

- Bump esp-ml307 version from ~3.6.3 to ~3.6.4
- Update uart-eth-modem version from ~0.3.1 to ~0.3.2

* feat: Add PrintPmLocks method to SystemInfo class

- Introduced PrintPmLocks method to display power management locks using esp_pm_dump_locks.
- Updated system_info.h to declare the new method.

* refactor: Streamline audio codec initialization and enablement

- Removed redundant channel enable checks from AudioCodec::Start.
- Added channel enablement in CreateDuplexChannels for various audio codecs.
- Implemented EnableInput and EnableOutput methods in NoAudioCodec for better control over input/output states.

* refactor: Delay audio success sound playback until after activation completion

- Moved the success sound playback to a scheduled task to ensure it occurs after the activation process is complete.
- This change improves the responsiveness of the application during activation events.

* refactor: Update camera integration from EspVideo to Esp32Camera

- Replaced EspVideo with Esp32Camera for improved camera configuration and initialization.
- Streamlined camera setup by utilizing a new configuration structure for better clarity and maintainability.
- Updated README.md to remove outdated camera sensor configuration instructions.

* refactor: Update audio demuxing process in AudioService

- Replaced the existing demuxer instance with a local unique pointer in the PlaySound method for better memory management.
- Moved the OnDemuxerFinished callback setup into the PlaySound method to ensure it is correctly associated with the new demuxer instance.
- Removed the member variable demuxer_ from AudioService to streamline the class structure.

2026-02-08 22:09:45 +08:00

codecs

Delay init success sound playback and remove gif playback delay (#1748 )

2026-02-08 22:09:45 +08:00

demuxer

增加流式ogg解封装支持 (#1705 )

2026-02-05 00:12:40 +08:00

processors

Enhance audio processing and wake word detection (#1739 )

2026-02-04 14:28:21 +08:00

wake_words

Enhance audio processing and wake word detection (#1739 )

2026-02-04 14:28:21 +08:00

audio_codec.cc

Delay init success sound playback and remove gif playback delay (#1748 )

2026-02-08 22:09:45 +08:00

audio_codec.h

Add SetInputGain(float gain) to AudioCodec (#1252 )

2025-10-02 09:55:45 +08:00

audio_processor.h

Switch to 2.0 branch (#1152 )

2025-09-04 15:41:28 +08:00

audio_service.cc

Delay init success sound playback and remove gif playback delay (#1748 )

2026-02-08 22:09:45 +08:00

audio_service.h

Delay init success sound playback and remove gif playback delay (#1748 )

2026-02-08 22:09:45 +08:00

README.md

v1.8.0: Audio 代码重构与低功耗优化 (#943 )

2025-07-19 22:45:22 +08:00

wake_word.h

Switch to 2.0 branch (#1152 )

2025-09-04 15:41:28 +08:00

README.md

Audio Service Architecture

The audio service is a core component responsible for managing all audio-related functionalities, including capturing audio from the microphone, processing it, encoding/decoding, and playing back audio through the speaker. It is designed to be modular and efficient, running its main operations in dedicated FreeRTOS tasks to ensure real-time performance.

Key Components

AudioService: The central orchestrator. It initializes and manages all other audio components, tasks, and data queues.
AudioCodec: A hardware abstraction layer (HAL) for the physical audio codec chip. It handles the raw I2S communication for audio input and output.
AudioProcessor: Performs real-time audio processing on the microphone input stream. This typically includes Acoustic Echo Cancellation (AEC), noise suppression, and Voice Activity Detection (VAD). AfeAudioProcessor is the default implementation, utilizing the ESP-ADF Audio Front-End.
WakeWord: Detects keywords (e.g., "你好，小智", "Hi, ESP") from the audio stream. It runs independently from the main audio processor until a wake word is detected.
OpusEncoderWrapper / OpusDecoderWrapper: Manages the encoding of PCM audio to the Opus format and decoding Opus packets back to PCM. Opus is used for its high compression and low latency, making it ideal for voice streaming.
OpusResampler: A utility to convert audio streams between different sample rates (e.g., resampling from the codec's native sample rate to the required 16kHz for processing).

Threading Model

The service operates on three primary tasks to handle the different stages of the audio pipeline concurrently:

AudioInputTask: Solely responsible for reading raw PCM data from the AudioCodec. It then feeds this data to either the WakeWord engine or the AudioProcessor based on the current state.
AudioOutputTask: Responsible for playing audio. It retrieves decoded PCM data from the audio_playback_queue_ and sends it to the AudioCodec to be played on the speaker.
OpusCodecTask: A worker task that handles both encoding and decoding. It fetches raw audio from audio_encode_queue_, encodes it into Opus packets, and places them in the audio_send_queue_. Concurrently, it fetches Opus packets from audio_decode_queue_, decodes them into PCM, and places the result in the audio_playback_queue_.

Data Flow

There are two primary data flows: audio input (uplink) and audio output (downlink).

1. Audio Input (Uplink) Flow

This flow captures audio from the microphone, processes it, encodes it, and prepares it for sending to a server.

graph TD
    subgraph Device
        Mic[("Microphone")] -->|I2S| Codec(AudioCodec)
        
        subgraph AudioInputTask
            Codec -->|Raw PCM| Read(ReadAudioData)
            Read -->|16kHz PCM| Processor(AudioProcessor)
        end

        subgraph OpusCodecTask
            Processor -->|Clean PCM| EncodeQueue(audio_encode_queue_)
            EncodeQueue --> Encoder(OpusEncoder)
            Encoder -->|Opus Packet| SendQueue(audio_send_queue_)
        end

        SendQueue --> |"PopPacketFromSendQueue()"| App(Application Layer)
    end
    
    App -->|Network| Server((Cloud Server))

The AudioInputTask continuously reads raw PCM data from the AudioCodec.
This data is fed into an AudioProcessor for cleaning (AEC, VAD).
The processed PCM data is pushed into the audio_encode_queue_.
The OpusCodecTask picks up the PCM data, encodes it into Opus format, and pushes the resulting packet to the audio_send_queue_.
The application can then retrieve these Opus packets and send them over the network.

2. Audio Output (Downlink) Flow

This flow receives encoded audio data, decodes it, and plays it on the speaker.

graph TD
    Server((Cloud Server)) -->|Network| App(Application Layer)

    subgraph Device
        App -->|"PushPacketToDecodeQueue()"| DecodeQueue(audio_decode_queue_)

        subgraph OpusCodecTask
            DecodeQueue -->|Opus Packet| Decoder(OpusDecoder)
            Decoder -->|PCM| PlaybackQueue(audio_playback_queue_)
        end

        subgraph AudioOutputTask
            PlaybackQueue -->|PCM| Codec(AudioCodec)
        end

        Codec -->|I2S| Speaker[("Speaker")]
    end

The application receives Opus packets from the network and pushes them into the audio_decode_queue_.
The OpusCodecTask retrieves these packets, decodes them back into PCM data, and pushes the data to the audio_playback_queue_.
The AudioOutputTask takes the PCM data from the queue and sends it to the AudioCodec for playback.

Power Management

To conserve energy, the audio codec's input (ADC) and output (DAC) channels are automatically disabled after a period of inactivity (AUDIO_POWER_TIMEOUT_MS). A timer (audio_power_timer_) periodically checks for activity and manages the power state. The channels are automatically re-enabled when new audio needs to be captured or played.