Compare commits

...

13 Commits

Author SHA1 Message Date
Terrence 2be6217b1f v0.3.1 2024-10-03 06:41:16 +08:00
Terrence 879f1cc21e reconstruct application 2024-10-03 06:39:22 +08:00
Terrence e59be04394 update to 4MB partition 2024-10-01 15:58:03 +08:00
Terrence d26e8d25ff support ML307, new version 0.3.0 2024-10-01 14:16:12 +08:00
Terrence 8e9be5abc7 add websocket protocol 2024-09-26 16:19:54 +08:00
Terrence 7fd72aa8e2 add more wake word packets 2024-09-26 16:19:06 +08:00
Terrence 0396b4a91c fix bugs 2024-09-25 03:44:28 +08:00
Terrence 53b08843d4 add vad to detection and communication 2024-09-17 11:26:07 +08:00
Terrence 797f9c2515 start AP if WiFi station fails to connect 2024-09-15 14:03:11 +08:00
Terrence cebe41c2d0 update opus encoder version 2024-09-14 15:00:48 +08:00
Terrence e46016b3fc add testing 2024-09-14 14:58:03 +08:00
Terrence 140ed56ee9 add notes on RGB LED behavior 2024-09-12 21:48:47 +08:00
Terrence 1093bce089 add usage to readme 2024-09-12 19:53:14 +08:00
31 changed files with 2177 additions and 618 deletions

CMakeLists.txt (modified)

@@ -4,7 +4,7 @@
# CMakeLists in this exact order for cmake to work correctly
cmake_minimum_required(VERSION 3.16)
set(PROJECT_VER "0.2.0")
set(PROJECT_VER "0.3.1")
include($ENV{IDF_PATH}/tools/cmake/project.cmake)
project(xiaozhi)

README.md (modified)

@@ -1,6 +1,94 @@
# Hello, Xiaozhi
# Xiaozhi AI Chatbot
Build your own AI chat companion with ESP32 + SenseVoice + Qwen72B
BiliBili video introduction: [Build your own AI chat companion with ESP32 + SenseVoice + Qwen72B](https://www.bilibili.com/video/BV11msTenEH3/?share_source=copy_web&vd_source=ee1aafe19d6e60cf22e60a93881faeba)
This is 虾哥's first hardware project.
## Project Goals
This project is developed on Espressif's ESP-IDF.
It is an open-source project intended mainly for educational purposes. We hope it helps more people get started with AI hardware development and learn how to bring today's rapidly evolving large language models onto real hardware devices. Whether you are a student interested in AI or a developer exploring new technologies, this project offers valuable hands-on experience.
Everyone is welcome to take part in developing and improving the project. If you have any ideas or suggestions, please open an issue or join the group chat.
QQ group for learning and discussion: 946599635
## Implemented Features
- Wi-Fi provisioning
- Wake-up and interruption via the BOOT button
- Offline voice wake-up (Espressif solution)
- Streaming voice chat (WebSocket protocol)
- Speech recognition in 5 languages: Mandarin, Cantonese, English, Japanese and Korean (SenseVoice solution)
- Voiceprint recognition, i.e. identifying who is calling the AI's name ([3D Speaker project](https://github.com/modelscope/3D-Speaker))
- Large-model TTS (Volcengine solution; Alibaba Cloud integration in progress)
- Configurable prompt and voice (custom characters)
- Free access to Qwen2.5 72B and Doubao models (limited by capacity and quota; rate limits may apply as usage grows)
- Self-summary after each conversation round to build a memory
- Optional LCD display showing signal strength (Chinese subtitles may be shown later)
- Support for the ML307 Cat.1 4G module (optional)
## Hardware
To make collaboration easier, all hardware materials currently live in a Feishu document:
[The Xiaozhi AI Chatbot Encyclopedia](https://ccnphfhqs21z.feishu.cn/wiki/F5krwD16viZoF0kKkvDcrZNYnhb?from=from_copylink)
The second-version wiring diagram:
![Second-version wiring diagram](docs/wiring2.jpg)
## Firmware
### Flashing Without a Development Environment
If this is your first time, we recommend skipping the development environment setup and flashing a prebuilt firmware image instead.
Click [here](https://github.com/78/xiaozhi-esp32/releases) to download the latest firmware.
The firmware connects to a test server kindly provided by the author. It is currently free to use; please do not use it for commercial purposes.
### Setting Up the Development Environment
- Cursor or VSCode
- Install the ESP-IDF extension and select SDK version 5.3 or above
- Ubuntu is preferable to Windows: builds are faster and there are no driver headaches
### Configuring the Project and Building the Firmware
- Only the ESP32-S3 is supported for now (at least 8 MB Flash and 2 MB PSRAM). Note that the default configuration assumes 8 MB PSRAM; if your board has 2 MB PSRAM you must change the configuration, otherwise the PSRAM will not be detected.
- Set the OTA Version URL to `https://api.tenclass.net/xiaozhi/ota/`
- Set the WebSocket URL to `wss://api.tenclass.net/xiaozhi/v1/`
- Set the WebSocket Access Token to `test-token`
- If your INMP441 and MAX98357 wiring differs from the defaults, adjust the GPIO configuration
- Once everything is configured, build the firmware (the sketch below shows where these values end up in the code)
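The three server settings above are ordinary Kconfig options. The hedged sketch below is condensed from main/Application.cc in this changeset (the helper name `ApplyServerSettings` is invented for illustration) and shows where each value is consumed, so a misconfigured URL or token can be traced by searching for the matching `CONFIG_` symbol.

```cpp
// Illustrative sketch only (condensed from main/Application.cc in this changeset):
// where the menuconfig values are consumed. CONFIG_* symbols come from Kconfig.
void ApplyServerSettings(FirmwareUpgrade& firmware_upgrade, WebSocket* ws_client) {
    // OTA Version URL: used for the version check and firmware download
    firmware_upgrade.SetCheckVersionUrl(CONFIG_OTA_VERSION_URL);
    firmware_upgrade.SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());

    // WebSocket Access Token (sent as a Bearer token) and WebSocket URL
    std::string token = "Bearer " + std::string(CONFIG_WEBSOCKET_ACCESS_TOKEN);
    ws_client->SetHeader("Authorization", token.c_str());
    ws_client->SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());
    ws_client->Connect(CONFIG_WEBSOCKET_URL);
}
```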
## Configuring Wi-Fi (skip for the 4G version)
Wire the board as described above and flash the firmware. After power-up, the RGB LED on the board blinks blue (on some boards the RGB LED's switch pad must be soldered before it will light) and the device enters provisioning mode.
Turn on Wi-Fi on your phone, connect to the device hotspot `Xiaozhi-xxxx`, then open `http://192.168.4.1` in a browser to reach the provisioning page.
Select your router's Wi-Fi network, enter the password and click Connect. The device reboots automatically after 3 seconds and then connects to your router.
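This behavior comes from the "start AP if WiFi station fails to connect" commit in this comparison. The sketch below is condensed (not verbatim) from `Application::Start()` in this changeset: station mode is tried first, and the configuration access point is only brought up when that fails.

```cpp
// Condensed from Application::Start() in this changeset (not verbatim).
auto& wifi_station = WifiStation::GetInstance();
wifi_station.Start();
if (!wifi_station.IsConnected()) {
    auto& builtin_led = BuiltinLed::GetInstance();
    builtin_led.SetBlue();
    builtin_led.Blink(1000, 500);              // slow blue blink = provisioning mode
    auto& wifi_ap = WifiConfigurationAp::GetInstance();
    wifi_ap.SetSsidPrefix("Xiaozhi");          // the hotspot shows up as Xiaozhi-xxxx
    wifi_ap.Start();                           // serves the page at http://192.168.4.1
    return;                                    // stay in provisioning mode until reboot
}
```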
## Testing Whether the Device Is Connected
Once the device has connected to the router, the green LED blinks once. Now say "你好,小智" (Hello, Xiaozhi): the device first lights up blue (connecting to the server), then green while it plays the voice reply.
If the blue LED never lights up, the microphone has a problem; check the wiring.
If the green LED never lights up, or the blue LED stays on, the device has not reached the server; check the Wi-Fi connection.
If the device is connected to Wi-Fi but there is no sound, check that the speaker is wired correctly.
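The LED colors above map directly to chat states in the firmware; a condensed view of `Application::SetChatState()` from this changeset:

```cpp
// Condensed from Application::SetChatState(): LED indication per chat state.
switch (chat_state_) {
    case kChatStateIdle:             builtin_led.TurnOff();                                   break;
    case kChatStateConnecting:       builtin_led.SetBlue();  builtin_led.TurnOn();            break;
    case kChatStateListening:        builtin_led.SetRed();   builtin_led.TurnOn();            break;
    case kChatStateSpeaking:         builtin_led.SetGreen(); builtin_led.TurnOn();            break;
    case kChatStateWakeWordDetected: builtin_led.SetBlue();  builtin_led.TurnOn();            break;
    case kChatStateUpgrading:        builtin_led.SetGreen(); builtin_led.StartContinuousBlink(100); break;
}
```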
On firmware newer than v0.2.1 you can also press the button wired to GPIO 1 (active low) to run a recording test.
## Configuring the Device
If the steps above succeed, the device announces its device ID. Go to the [Xiaozhi test server control panel](https://xiaozhi.tenclass.net/) and add the device.
For detailed usage instructions and notes about the test server, see the [Xiaozhi test server help page](https://xiaozhi.tenclass.net/help).

docs/wiring.jpg (new binary file, 121 KiB; not shown)
docs/wiring2.jpg (new binary file, 72 KiB; not shown)

main/Application.cc (modified)

@@ -1,452 +1,448 @@
#include "Application.h"
#include "BuiltinLed.h"
#include "WifiStation.h"
#include <BuiltinLed.h>
#include <TlsTransport.h>
#include <Ml307SslTransport.h>
#include <WifiConfigurationAp.h>
#include <WifiStation.h>
#include <SystemInfo.h>
#include <cstring>
#include "esp_log.h"
#include "model_path.h"
#include "SystemInfo.h"
#include "cJSON.h"
#include <esp_log.h>
#include <cJSON.h>
#include <driver/gpio.h>
#include "Application.h"
#define TAG "Application"
Application::Application() {
Application::Application()
: button_((gpio_num_t)CONFIG_BOOT_BUTTON_GPIO)
#ifdef CONFIG_USE_ML307
, ml307_at_modem_(CONFIG_ML307_TX_PIN, CONFIG_ML307_RX_PIN, 4096),
http_(ml307_at_modem_),
firmware_upgrade_(http_)
#else
, http_(),
firmware_upgrade_(http_)
#endif
#ifdef CONFIG_USE_DISPLAY
, display_(CONFIG_DISPLAY_SDA_PIN, CONFIG_DISPLAY_SCL_PIN)
#endif
{
event_group_ = xEventGroupCreate();
audio_encode_queue_ = xQueueCreate(100, sizeof(iovec));
audio_decode_queue_ = xQueueCreate(100, sizeof(AudioPacket*));
srmodel_list_t *models = esp_srmodel_init("model");
for (int i = 0; i < models->num; i++) {
ESP_LOGI(TAG, "Model %d: %s", i, models->model_name[i]);
if (strstr(models->model_name[i], ESP_WN_PREFIX) != NULL) {
wakenet_model_ = models->model_name[i];
} else if (strstr(models->model_name[i], ESP_NSNET_PREFIX) != NULL) {
nsnet_model_ = models->model_name[i];
}
}
opus_encoder_.Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 1);
opus_decoder_ = opus_decoder_create(opus_decode_sample_rate_, 1, NULL);
if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE) {
opus_resampler_.Configure(opus_decode_sample_rate_, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
}
firmware_upgrade_.SetCheckVersionUrl(CONFIG_OTA_VERSION_URL);
firmware_upgrade_.SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());
firmware_upgrade_.SetPostData(SystemInfo::GetJsonString());
}
Application::~Application() {
if (afe_detection_data_ != nullptr) {
esp_afe_sr_v1.destroy(afe_detection_data_);
}
if (afe_communication_data_ != nullptr) {
esp_afe_vc_v1.destroy(afe_communication_data_);
}
if (wake_word_encode_task_stack_ != nullptr) {
free(wake_word_encode_task_stack_);
}
for (auto& pcm : wake_word_pcm_) {
free(pcm.iov_base);
}
for (auto& opus : wake_word_opus_) {
free(opus.iov_base);
}
if (opus_decoder_ != nullptr) {
opus_decoder_destroy(opus_decoder_);
}
if (audio_encode_task_stack_ != nullptr) {
free(audio_encode_task_stack_);
}
if (audio_decode_task_stack_ != nullptr) {
free(audio_decode_task_stack_);
}
vQueueDelete(audio_decode_queue_);
vQueueDelete(audio_encode_queue_);
vEventGroupDelete(event_group_);
}
void Application::CheckNewVersion() {
// Check if there is a new firmware version available
firmware_upgrade_.CheckVersion();
if (firmware_upgrade_.HasNewVersion()) {
// Wait for the chat state to be idle
while (chat_state_ != kChatStateIdle) {
vTaskDelay(100);
}
SetChatState(kChatStateUpgrading);
firmware_upgrade_.StartUpgrade([this](int progress, size_t speed) {
#ifdef CONFIG_USE_DISPLAY
char buffer[64];
snprintf(buffer, sizeof(buffer), "Upgrading...\n %d%% %zuKB/s", progress, speed / 1024);
display_.SetText(buffer);
#endif
});
// If the upgrade succeeds, the device reboots and never reaches here
ESP_LOGI(TAG, "Firmware upgrade failed...");
SetChatState(kChatStateIdle);
} else {
firmware_upgrade_.MarkCurrentVersionValid();
}
}
#ifdef CONFIG_USE_DISPLAY
#ifdef CONFIG_USE_ML307
static std::string csq_to_string(int csq) {
if (csq == -1) {
return "No network";
} else if (csq >= 0 && csq <= 9) {
return "Very bad";
} else if (csq >= 10 && csq <= 14) {
return "Bad";
} else if (csq >= 15 && csq <= 19) {
return "Fair";
} else if (csq >= 20 && csq <= 24) {
return "Good";
} else if (csq >= 25 && csq <= 31) {
return "Very good";
}
return "Invalid";
}
#else
static std::string rssi_to_string(int rssi) {
if (rssi >= -55) {
return "Very good";
} else if (rssi >= -65) {
return "Good";
} else if (rssi >= -75) {
return "Fair";
} else if (rssi >= -85) {
return "Poor";
} else {
return "No network";
}
}
#endif
void Application::UpdateDisplay() {
while (true) {
if (chat_state_ == kChatStateIdle) {
#ifdef CONFIG_USE_ML307
std::string network_name = ml307_at_modem_.GetCarrierName();
int signal_quality = ml307_at_modem_.GetCsq();
if (signal_quality == -1) {
network_name = "No network";
} else {
ESP_LOGI(TAG, "%s CSQ: %d", network_name.c_str(), signal_quality);
display_.SetText(network_name + "\n" + csq_to_string(signal_quality) + " (" + std::to_string(signal_quality) + ")");
}
#else
auto& wifi_station = WifiStation::GetInstance();
int8_t rssi = wifi_station.GetRssi();
display_.SetText(wifi_station.GetSsid() + "\n" + rssi_to_string(rssi) + " (" + std::to_string(rssi) + ")");
#endif
}
vTaskDelay(pdMS_TO_TICKS(10 * 1000));
}
}
#endif
void Application::Start() {
auto& builtin_led = BuiltinLed::GetInstance();
#ifdef CONFIG_USE_ML307
builtin_led.SetBlue();
builtin_led.StartContinuousBlink(100);
ml307_at_modem_.SetDebug(false);
ml307_at_modem_.SetBaudRate(921600);
// Print the ML307 modem information
std::string module_name = ml307_at_modem_.GetModuleName();
ESP_LOGI(TAG, "ML307 Module: %s", module_name.c_str());
#ifdef CONFIG_USE_DISPLAY
display_.SetText(std::string("Wait for network\n") + module_name);
#endif
ml307_at_modem_.ResetConnections();
ml307_at_modem_.WaitForNetworkReady();
ESP_LOGI(TAG, "ML307 IMEI: %s", ml307_at_modem_.GetImei().c_str());
ESP_LOGI(TAG, "ML307 ICCID: %s", ml307_at_modem_.GetIccid().c_str());
#else
// Try to connect to WiFi, if failed, launch the WiFi configuration AP
auto& wifi_station = WifiStation::GetInstance();
#ifdef CONFIG_USE_DISPLAY
display_.SetText(std::string("Connect to WiFi\n") + wifi_station.GetSsid());
#endif
builtin_led.SetBlue();
builtin_led.StartContinuousBlink(100);
wifi_station.Start();
if (!wifi_station.IsConnected()) {
builtin_led.SetBlue();
builtin_led.Blink(1000, 500);
auto& wifi_ap = WifiConfigurationAp::GetInstance();
wifi_ap.SetSsidPrefix("Xiaozhi");
#ifdef CONFIG_USE_DISPLAY
display_.SetText(wifi_ap.GetSsid() + "\n" + wifi_ap.GetWebServerUrl());
#endif
wifi_ap.Start();
return;
}
#endif
audio_device_.OnInputData([this](const int16_t* data, int size) {
#ifdef CONFIG_USE_AFE_SR
if (audio_processor_.IsRunning()) {
audio_processor_.Input(data, size);
}
if (wake_word_detect_.IsDetectionRunning()) {
wake_word_detect_.Feed(data, size);
}
#else
std::vector<int16_t> pcm(data, data + size);
Schedule([this, pcm = std::move(pcm)]() {
if (chat_state_ == kChatStateListening) {
std::lock_guard<std::mutex> lock(mutex_);
audio_encode_queue_.emplace_back(std::move(pcm));
cv_.notify_all();
}
});
#endif
});
// Initialize the audio device
audio_device_.Start(CONFIG_AUDIO_INPUT_SAMPLE_RATE, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);
audio_device_.OnStateChanged([this]() {
if (audio_device_.playing()) {
SetChatState(kChatStateSpeaking);
} else {
// Check if communication is still running
if (xEventGroupGetBits(event_group_) & COMMUNICATION_RUNNING) {
SetChatState(kChatStateListening);
} else {
SetChatState(kChatStateIdle);
}
}
});
// OPUS encoder / decoder use a lot of stack memory
const size_t opus_stack_size = 4096 * 8;
audio_encode_task_stack_ = (StackType_t*)malloc(opus_stack_size);
xTaskCreateStatic([](void* arg) {
audio_encode_task_ = xTaskCreateStatic([](void* arg) {
Application* app = (Application*)arg;
app->AudioEncodeTask();
vTaskDelete(NULL);
}, "opus_encode", opus_stack_size, this, 1, audio_encode_task_stack_, &audio_encode_task_buffer_);
audio_decode_task_stack_ = (StackType_t*)malloc(opus_stack_size);
xTaskCreateStatic([](void* arg) {
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->AudioDecodeTask();
}, "opus_decode", opus_stack_size, this, 1, audio_decode_task_stack_, &audio_decode_task_buffer_);
app->AudioPlayTask();
vTaskDelete(NULL);
}, "play_audio", 4096 * 2, this, 5, NULL);
auto& builtin_led = BuiltinLed::GetInstance();
// Blink the LED to indicate the device is connecting
builtin_led.SetBlue();
builtin_led.BlinkOnce();
WifiStation::GetInstance().Start();
// Check if there is a new firmware version available
firmware_upgrade_.CheckVersion();
if (firmware_upgrade_.HasNewVersion()) {
builtin_led.TurnOn();
firmware_upgrade_.StartUpgrade();
// If the upgrade succeeds, the device reboots and never reaches here
ESP_LOGI(TAG, "Firmware upgrade failed...");
builtin_led.TurnOff();
} else {
firmware_upgrade_.MarkValid();
}
#ifdef CONFIG_USE_AFE_SR
wake_word_detect_.OnVadStateChange([this](bool speaking) {
Schedule([this, speaking]() {
auto& builtin_led = BuiltinLed::GetInstance();
if (chat_state_ == kChatStateListening) {
if (speaking) {
builtin_led.SetRed(32);
} else {
builtin_led.SetRed(8);
}
builtin_led.TurnOn();
}
});
});
StartCommunication();
StartDetection();
wake_word_detect_.OnWakeWordDetected([this]() {
Schedule([this]() {
if (chat_state_ == kChatStateIdle) {
// Encode the wake word data and start websocket client at the same time
// They both consume a lot of time (700ms), so we can do them in parallel
wake_word_detect_.EncodeWakeWordData();
SetChatState(kChatStateConnecting);
if (ws_client_ == nullptr) {
StartWebSocketClient();
}
if (ws_client_ && ws_client_->IsConnected()) {
auto encoded = wake_word_detect_.GetWakeWordStream();
// Send the wake word data to the server
ws_client_->Send(encoded.data(), encoded.size(), true);
opus_encoder_.ResetState();
// Send a ready message to tell the server that the wake word data has been sent
SetChatState(kChatStateWakeWordDetected);
// If connected, the hello message is already sent, so we can start communication
audio_processor_.Start();
ESP_LOGI(TAG, "Audio processor started");
} else {
SetChatState(kChatStateIdle);
}
} else if (chat_state_ == kChatStateSpeaking) {
break_speaking_ = true;
}
// Resume detection
wake_word_detect_.StartDetection();
});
});
wake_word_detect_.StartDetection();
audio_processor_.OnOutput([this](std::vector<int16_t>&& data) {
Schedule([this, data = std::move(data)]() {
if (chat_state_ == kChatStateListening) {
std::lock_guard<std::mutex> lock(mutex_);
audio_encode_queue_.emplace_back(std::move(data));
cv_.notify_all();
}
});
});
#endif
// Blink the LED to indicate the device is running
builtin_led.SetGreen();
builtin_led.BlinkOnce();
xEventGroupSetBits(event_group_, DETECTION_RUNNING);
button_.OnClick([this]() {
Schedule([this]() {
if (chat_state_ == kChatStateIdle) {
SetChatState(kChatStateConnecting);
StartWebSocketClient();
if (ws_client_ && ws_client_->IsConnected()) {
opus_encoder_.ResetState();
#ifdef CONFIG_USE_AFE_SR
audio_processor_.Start();
#endif
SetChatState(kChatStateListening);
ESP_LOGI(TAG, "Communication started");
} else {
SetChatState(kChatStateIdle);
}
} else if (chat_state_ == kChatStateSpeaking) {
break_speaking_ = true;
} else if (chat_state_ == kChatStateListening) {
if (ws_client_ && ws_client_->IsConnected()) {
ws_client_->Close();
}
}
});
});
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->MainLoop();
vTaskDelete(NULL);
}, "main_loop", 4096 * 2, this, 5, NULL);
// Launch a task to check for new firmware version
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->CheckNewVersion();
vTaskDelete(NULL);
}, "check_new_version", 4096 * 2, this, 1, NULL);
#ifdef CONFIG_USE_DISPLAY
// Launch a task to update the display
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->UpdateDisplay();
vTaskDelete(NULL);
}, "update_display", 4096, this, 1, NULL);
#endif
}
void Application::Schedule(std::function<void()> callback) {
std::lock_guard<std::mutex> lock(mutex_);
main_tasks_.push_back(callback);
cv_.notify_all();
}
// The Main Loop controls the chat state and websocket connection
// If other tasks need to access the websocket or chat state,
// they should use Schedule to call this function
void Application::MainLoop() {
while (true) {
std::unique_lock<std::mutex> lock(mutex_);
cv_.wait(lock, [this]() {
return !main_tasks_.empty();
});
auto task = std::move(main_tasks_.front());
main_tasks_.pop_front();
lock.unlock();
task();
}
}
void Application::SetChatState(ChatState state) {
auto& builtin_led = BuiltinLed::GetInstance();
const char* state_str[] = {
"idle",
"connecting",
"listening",
"speaking",
"wake_word_detected",
"testing",
"upgrading",
"unknown"
};
chat_state_ = state;
ESP_LOGI(TAG, "STATE: %s", state_str[chat_state_]);
auto& builtin_led = BuiltinLed::GetInstance();
switch (chat_state_) {
case kChatStateIdle:
ESP_LOGI(TAG, "Chat state: idle");
builtin_led.TurnOff();
break;
case kChatStateConnecting:
ESP_LOGI(TAG, "Chat state: connecting");
builtin_led.SetBlue();
builtin_led.TurnOn();
break;
case kChatStateListening:
ESP_LOGI(TAG, "Chat state: listening");
builtin_led.SetRed();
builtin_led.TurnOn();
break;
case kChatStateSpeaking:
ESP_LOGI(TAG, "Chat state: speaking");
builtin_led.SetGreen();
builtin_led.TurnOn();
break;
case kChatStateWakeWordDetected:
ESP_LOGI(TAG, "Chat state: wake word detected");
builtin_led.SetBlue();
builtin_led.TurnOn();
break;
case kChatStateUpgrading:
builtin_led.SetGreen();
builtin_led.StartContinuousBlink(100);
break;
}
const char* state_str[] = { "idle", "connecting", "listening", "speaking", "wake_word_detected", "unknown" };
std::lock_guard<std::recursive_mutex> lock(mutex_);
if (ws_client_ && ws_client_->IsConnected()) {
cJSON* root = cJSON_CreateObject();
cJSON_AddStringToObject(root, "type", "state");
cJSON_AddStringToObject(root, "state", state_str[chat_state_]);
char* json = cJSON_PrintUnformatted(root);
std::lock_guard<std::mutex> lock(mutex_);
ws_client_->Send(json);
cJSON_Delete(root);
free(json);
}
}
void Application::StartCommunication() {
afe_config_t afe_config = {
.aec_init = false,
.se_init = true,
.vad_init = false,
.wakenet_init = false,
.voice_communication_init = true,
.voice_communication_agc_init = true,
.voice_communication_agc_gain = 10,
.vad_mode = VAD_MODE_3,
.wakenet_model_name = NULL,
.wakenet_model_name_2 = NULL,
.wakenet_mode = DET_MODE_90,
.afe_mode = SR_MODE_HIGH_PERF,
.afe_perferred_core = 0,
.afe_perferred_priority = 5,
.afe_ringbuf_size = 50,
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
.afe_linear_gain = 1.0,
.agc_mode = AFE_MN_PEAK_AGC_MODE_2,
.pcm_config = {
.total_ch_num = 1,
.mic_num = 1,
.ref_num = 0,
.sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE,
},
.debug_init = false,
.debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
.afe_ns_mode = NS_MODE_SSP,
.afe_ns_model_name = NULL,
.fixed_first_channel = true,
};
afe_communication_data_ = esp_afe_vc_v1.create_from_config(&afe_config);
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->AudioCommunicationTask();
}, "audio_communication", 4096 * 2, this, 5, NULL);
}
void Application::StartDetection() {
afe_config_t afe_config = {
.aec_init = false,
.se_init = true,
.vad_init = false,
.wakenet_init = true,
.voice_communication_init = false,
.voice_communication_agc_init = false,
.voice_communication_agc_gain = 10,
.vad_mode = VAD_MODE_3,
.wakenet_model_name = wakenet_model_,
.wakenet_model_name_2 = NULL,
.wakenet_mode = DET_MODE_90,
.afe_mode = SR_MODE_HIGH_PERF,
.afe_perferred_core = 0,
.afe_perferred_priority = 5,
.afe_ringbuf_size = 50,
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
.afe_linear_gain = 1.0,
.agc_mode = AFE_MN_PEAK_AGC_MODE_2,
.pcm_config = {
.total_ch_num = 1,
.mic_num = 1,
.ref_num = 0,
.sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE
},
.debug_init = false,
.debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
.afe_ns_mode = NS_MODE_SSP,
.afe_ns_model_name = NULL,
.fixed_first_channel = true,
};
afe_detection_data_ = esp_afe_sr_v1.create_from_config(&afe_config);
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->AudioFeedTask();
}, "audio_feed", 4096 * 2, this, 5, NULL);
xTaskCreate([](void* arg) {
Application* app = (Application*)arg;
app->AudioDetectionTask();
}, "audio_detection", 4096 * 2, this, 5, NULL);
}
void Application::AudioFeedTask() {
int chunk_size = esp_afe_vc_v1.get_feed_chunksize(afe_detection_data_);
int16_t buffer[chunk_size];
ESP_LOGI(TAG, "Audio feed task started, chunk size: %d", chunk_size);
while (true) {
audio_device_.Read(buffer, chunk_size);
auto event_bits = xEventGroupGetBits(event_group_);
if (event_bits & DETECTION_RUNNING) {
esp_afe_sr_v1.feed(afe_detection_data_, buffer);
} else if (event_bits & COMMUNICATION_RUNNING) {
esp_afe_vc_v1.feed(afe_communication_data_, buffer);
}
}
vTaskDelete(NULL);
}
void Application::StoreWakeWordData(uint8_t* data, size_t size) {
// Store the audio data in wake_word_pcm_
auto iov = (iovec){
.iov_base = heap_caps_malloc(size, MALLOC_CAP_SPIRAM),
.iov_len = size
};
memcpy(iov.iov_base, data, size);
wake_word_pcm_.push_back(iov);
// remove the oldest packet if the size is larger than 50, about 2 seconds
if (wake_word_pcm_.size() > 50) {
heap_caps_free(wake_word_pcm_.front().iov_base);
wake_word_pcm_.pop_front();
}
}
void Application::EncodeWakeWordData() {
wake_word_opus_.clear();
if (wake_word_encode_task_stack_ == nullptr) {
wake_word_encode_task_stack_ = (StackType_t*)malloc(4096 * 8);
}
wake_word_encode_task_ = xTaskCreateStatic([](void* arg) {
Application* app = (Application*)arg;
auto start_time = esp_timer_get_time();
// encode detect packets
OpusEncoder* encoder = new OpusEncoder();
encoder->Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 1, 60);
encoder->SetComplexity(2);
for (auto& pcm: app->wake_word_pcm_) {
encoder->Encode(pcm, [app](const iovec opus) {
iovec iov = {
.iov_base = heap_caps_malloc(opus.iov_len, MALLOC_CAP_SPIRAM),
.iov_len = opus.iov_len
};
memcpy(iov.iov_base, opus.iov_base, opus.iov_len);
app->wake_word_opus_.push_back(iov);
});
heap_caps_free(pcm.iov_base);
}
app->wake_word_pcm_.clear();
auto end_time = esp_timer_get_time();
ESP_LOGI(TAG, "Encode wake word data opus packets: %d in %lld ms", app->wake_word_opus_.size(), (end_time - start_time) / 1000);
xEventGroupSetBits(app->event_group_, DETECT_PACKETS_ENCODED);
delete encoder;
vTaskDelete(NULL);
}, "encode_detect_packets", 4096 * 8, this, 1, wake_word_encode_task_stack_, &wake_word_encode_task_buffer_);
}
void Application::SendWakeWordData() {
for (auto& opus: wake_word_opus_) {
ws_client_->Send(opus.iov_base, opus.iov_len, true);
heap_caps_free(opus.iov_base);
}
wake_word_opus_.clear();
}
void Application::AudioDetectionTask() {
auto chunk_size = esp_afe_sr_v1.get_fetch_chunksize(afe_detection_data_);
ESP_LOGI(TAG, "Audio detection task started, chunk size: %d", chunk_size);
while (true) {
xEventGroupWaitBits(event_group_, DETECTION_RUNNING, pdFALSE, pdTRUE, portMAX_DELAY);
auto res = esp_afe_sr_v1.fetch(afe_detection_data_);
if (res == nullptr || res->ret_value == ESP_FAIL) {
ESP_LOGE(TAG, "Error in fetch");
if (res != nullptr) {
ESP_LOGI(TAG, "Error code: %d", res->ret_value);
}
continue;
}
// Store the wake word data for voice recognition, like who is speaking
StoreWakeWordData((uint8_t*)res->data, res->data_size);
if (res->wakeup_state == WAKENET_DETECTED) {
xEventGroupClearBits(event_group_, DETECTION_RUNNING);
SetChatState(kChatStateConnecting);
// Encode the wake word data and start websocket client at the same time
// They both consume a lot of time (700ms), so we can do them in parallel
EncodeWakeWordData();
StartWebSocketClient();
// Here the websocket is done, and we also wait for the wake word data to be encoded
xEventGroupWaitBits(event_group_, DETECT_PACKETS_ENCODED, pdTRUE, pdTRUE, portMAX_DELAY);
std::lock_guard<std::recursive_mutex> lock(mutex_);
if (ws_client_ && ws_client_->IsConnected()) {
// Send the wake word data to the server
SendWakeWordData();
// Send a ready message to tell the server that the wake word data has been sent
SetChatState(kChatStateWakeWordDetected);
opus_encoder_.ResetState();
// If connected, the hello message is already sent, so we can start communication
xEventGroupSetBits(event_group_, COMMUNICATION_RUNNING);
ESP_LOGI(TAG, "Start communication after wake word detected");
} else {
SetChatState(kChatStateIdle);
xEventGroupSetBits(event_group_, DETECTION_RUNNING);
}
}
}
}
void Application::AudioCommunicationTask() {
int chunk_size = esp_afe_vc_v1.get_fetch_chunksize(afe_communication_data_);
ESP_LOGI(TAG, "Audio communication task started, chunk size: %d", chunk_size);
while (true) {
xEventGroupWaitBits(event_group_, COMMUNICATION_RUNNING, pdFALSE, pdTRUE, portMAX_DELAY);
auto res = esp_afe_vc_v1.fetch(afe_communication_data_);
if (res == nullptr || res->ret_value == ESP_FAIL) {
ESP_LOGE(TAG, "Error in fetch");
if (res != nullptr) {
ESP_LOGI(TAG, "Error code: %d", res->ret_value);
}
continue;
}
// Check if the websocket client is disconnected by the server
{
std::lock_guard<std::recursive_mutex> lock(mutex_);
if (ws_client_ == nullptr || !ws_client_->IsConnected()) {
if (ws_client_ != nullptr) {
delete ws_client_;
ws_client_ = nullptr;
}
if (audio_device_.playing()) {
audio_device_.Break();
}
SetChatState(kChatStateIdle);
xEventGroupSetBits(event_group_, DETECTION_RUNNING);
xEventGroupClearBits(event_group_, COMMUNICATION_RUNNING);
continue;
}
}
if (chat_state_ == kChatStateListening) {
// Send audio data to server
iovec data = {
.iov_base = malloc(res->data_size),
.iov_len = (size_t)res->data_size
};
memcpy(data.iov_base, res->data, res->data_size);
xQueueSend(audio_encode_queue_, &data, portMAX_DELAY);
}
}
BinaryProtocol* Application::AllocateBinaryProtocol(const uint8_t* payload, size_t payload_size) {
auto last_timestamp = 0;
auto protocol = (BinaryProtocol*)heap_caps_malloc(sizeof(BinaryProtocol) + payload_size, MALLOC_CAP_SPIRAM);
protocol->version = htons(PROTOCOL_VERSION);
protocol->type = htons(0);
protocol->reserved = 0;
protocol->timestamp = htonl(last_timestamp);
protocol->payload_size = htonl(payload_size);
assert(sizeof(BinaryProtocol) == 16);
memcpy(protocol->payload, payload, payload_size);
return protocol;
}
void Application::AudioEncodeTask() {
ESP_LOGI(TAG, "Audio encode task started");
while (true) {
iovec pcm;
xQueueReceive(audio_encode_queue_, &pcm, portMAX_DELAY);
// Encode audio data
opus_encoder_.Encode(pcm, [this](const iovec opus) {
std::lock_guard<std::recursive_mutex> lock(mutex_);
if (ws_client_ && ws_client_->IsConnected()) {
ws_client_->Send(opus.iov_base, opus.iov_len, true);
}
std::unique_lock<std::mutex> lock(mutex_);
cv_.wait(lock, [this]() {
return !audio_encode_queue_.empty() || !audio_decode_queue_.empty();
});
free(pcm.iov_base);
}
}
if (!audio_encode_queue_.empty()) {
auto pcm = std::move(audio_encode_queue_.front());
audio_encode_queue_.pop_front();
lock.unlock();
void Application::AudioDecodeTask() {
while (true) {
AudioPacket* packet;
xQueueReceive(audio_decode_queue_, &packet, portMAX_DELAY);
// Encode audio data
opus_encoder_.Encode(pcm, [this](const uint8_t* opus, size_t opus_size) {
auto protocol = AllocateBinaryProtocol(opus, opus_size);
Schedule([this, protocol, opus_size]() {
if (ws_client_ && ws_client_->IsConnected()) {
ws_client_->Send(protocol, sizeof(BinaryProtocol) + opus_size, true);
}
heap_caps_free(protocol);
});
});
} else if (!audio_decode_queue_.empty()) {
auto packet = std::move(audio_decode_queue_.front());
audio_decode_queue_.pop_front();
lock.unlock();
if (packet->type == kAudioPacketTypeData) {
int frame_size = opus_decode_sample_rate_ / 1000 * opus_duration_ms_;
packet->pcm.resize(frame_size);
@@ -458,14 +454,79 @@ void Application::AudioDecodeTask() {
}
if (opus_decode_sample_rate_ != CONFIG_AUDIO_OUTPUT_SAMPLE_RATE) {
int target_size = frame_size * CONFIG_AUDIO_OUTPUT_SAMPLE_RATE / opus_decode_sample_rate_;
int target_size = opus_resampler_.GetOutputSamples(frame_size);
std::vector<int16_t> resampled(target_size);
opus_resampler_.Process(packet->pcm.data(), frame_size, resampled.data(), target_size);
opus_resampler_.Process(packet->pcm.data(), frame_size, resampled.data());
packet->pcm = std::move(resampled);
}
std::lock_guard<std::mutex> lock(mutex_);
audio_play_queue_.push_back(packet);
cv_.notify_all();
}
}
}
void Application::HandleAudioPacket(AudioPacket* packet) {
switch (packet->type)
{
case kAudioPacketTypeData: {
if (skip_to_end_) {
break;
}
audio_device_.QueueAudioPacket(packet);
// This will block until the audio device has finished playing the audio
audio_device_.OutputData(packet->pcm);
if (break_speaking_) {
break_speaking_ = false;
skip_to_end_ = true;
// Play a silence and skip to the end
int frame_size = opus_decode_sample_rate_ / 1000 * opus_duration_ms_;
std::vector<int16_t> silence(frame_size);
bzero(silence.data(), silence.size() * sizeof(int16_t));
audio_device_.OutputData(silence);
}
break;
}
case kAudioPacketTypeStart:
Schedule([this]() {
SetChatState(kChatStateSpeaking);
});
break;
case kAudioPacketTypeStop:
skip_to_end_ = false;
Schedule([this]() {
SetChatState(kChatStateListening);
});
break;
case kAudioPacketTypeSentenceStart:
ESP_LOGI(TAG, "<< %s", packet->text.c_str());
break;
case kAudioPacketTypeSentenceEnd:
break;
default:
ESP_LOGI(TAG, "Unknown packet type: %d", packet->type);
break;
}
delete packet;
}
void Application::AudioPlayTask() {
ESP_LOGI(TAG, "Audio play task started");
while (true) {
std::unique_lock<std::mutex> lock(mutex_);
cv_.wait(lock, [this]() {
return !audio_play_queue_.empty();
});
auto packet = std::move(audio_play_queue_.front());
audio_play_queue_.pop_front();
lock.unlock();
HandleAudioPacket(packet);
}
}
@@ -484,13 +545,19 @@ void Application::SetDecodeSampleRate(int sample_rate) {
void Application::StartWebSocketClient() {
if (ws_client_ != nullptr) {
ESP_LOGW(TAG, "WebSocket client already exists");
delete ws_client_;
}
std::string token = "Bearer " + std::string(CONFIG_WEBSOCKET_ACCESS_TOKEN);
ws_client_ = new WebSocketClient();
#ifdef CONFIG_USE_ML307
ws_client_ = new WebSocket(new Ml307SslTransport(ml307_at_modem_, 0));
#else
ws_client_ = new WebSocket(new TlsTransport());
#endif
ws_client_->SetHeader("Authorization", token.c_str());
ws_client_->SetHeader("Device-Id", SystemInfo::GetMacAddress().c_str());
ws_client_->SetHeader("Protocol-Version", std::to_string(PROTOCOL_VERSION).c_str());
ws_client_->OnConnected([this]() {
ESP_LOGI(TAG, "Websocket connected");
@@ -498,8 +565,7 @@ void Application::StartWebSocketClient() {
// Send hello message to describe the client
// keys: message type, version, wakeup_model, audio_params (format, sample_rate, channels)
std::string message = "{";
message += "\"type\":\"hello\", \"version\":\"1.0\",";
message += "\"wakeup_model\":\"" + std::string(wakenet_model_) + "\",";
message += "\"type\":\"hello\",";
message += "\"audio_params\":{";
message += "\"format\":\"opus\", \"sample_rate\":" + std::to_string(CONFIG_AUDIO_INPUT_SAMPLE_RATE) + ", \"channels\":1";
message += "}}";
@@ -507,21 +573,26 @@ void Application::StartWebSocketClient() {
});
ws_client_->OnData([this](const char* data, size_t len, bool binary) {
auto packet = new AudioPacket();
if (binary) {
auto header = (AudioDataHeader*)data;
packet->type = kAudioPacketTypeData;
packet->timestamp = ntohl(header->timestamp);
auto protocol = (BinaryProtocol*)data;
auto payload_size = ntohl(header->payload_size);
auto packet = new AudioPacket();
packet->type = kAudioPacketTypeData;
packet->timestamp = ntohl(protocol->timestamp);
auto payload_size = ntohl(protocol->payload_size);
packet->opus.resize(payload_size);
memcpy(packet->opus.data(), data + sizeof(AudioDataHeader), payload_size);
memcpy(packet->opus.data(), protocol->payload, payload_size);
std::lock_guard<std::mutex> lock(mutex_);
audio_decode_queue_.push_back(packet);
cv_.notify_all();
} else {
// Parse JSON data
auto root = cJSON_Parse(data);
auto type = cJSON_GetObjectItem(root, "type");
if (type != NULL) {
if (strcmp(type->valuestring, "tts") == 0) {
auto packet = new AudioPacket();
auto state = cJSON_GetObjectItem(root, "state");
if (strcmp(state->valuestring, "start") == 0) {
packet->type = kAudioPacketTypeStart;
@@ -537,19 +608,35 @@ void Application::StartWebSocketClient() {
packet->type = kAudioPacketTypeSentenceStart;
packet->text = cJSON_GetObjectItem(root, "text")->valuestring;
}
std::lock_guard<std::mutex> lock(mutex_);
audio_decode_queue_.push_back(packet);
cv_.notify_all();
} else if (strcmp(type->valuestring, "stt") == 0) {
auto text = cJSON_GetObjectItem(root, "text");
if (text != NULL) {
ESP_LOGI(TAG, ">> %s", text->valuestring);
}
}
}
cJSON_Delete(root);
}
xQueueSend(audio_decode_queue_, &packet, portMAX_DELAY);
});
ws_client_->OnError([this](int error) {
ESP_LOGE(TAG, "Websocket error: %d", error);
});
ws_client_->OnClosed([this]() {
ESP_LOGI(TAG, "Websocket closed");
ws_client_->OnDisconnected([this]() {
ESP_LOGI(TAG, "Websocket disconnected");
Schedule([this]() {
#ifdef CONFIG_USE_AFE_SR
audio_processor_.Stop();
#endif
delete ws_client_;
ws_client_ = nullptr;
SetChatState(kChatStateIdle);
});
});
if (!ws_client_->Connect(CONFIG_WEBSOCKET_URL)) {

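The reconstructed Application.cc above funnels all chat-state and WebSocket access through `Schedule()` and `MainLoop()`: other tasks post a closure instead of touching shared state directly. A minimal, self-contained sketch of that pattern (the `Dispatcher` class name is illustrative, not from the codebase):

```cpp
// Minimal sketch of the Schedule()/MainLoop() pattern used in Application.cc.
// The "Dispatcher" name is illustrative; the real methods live on Application.
#include <condition_variable>
#include <functional>
#include <list>
#include <mutex>

class Dispatcher {
public:
    // Called from any task: queue a closure for the main loop to run.
    void Schedule(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push_back(std::move(task));
        cv_.notify_all();
    }

    // Runs in its own FreeRTOS task; executes queued closures one at a time,
    // so chat state and the WebSocket client are only touched from here.
    void MainLoop() {
        while (true) {
            std::unique_lock<std::mutex> lock(mutex_);
            cv_.wait(lock, [this] { return !tasks_.empty(); });
            auto task = std::move(tasks_.front());
            tasks_.pop_front();
            lock.unlock();              // run the closure without holding the lock
            task();
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::list<std::function<void()>> tasks_;
};
```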
main/Application.h (modified)

@@ -2,24 +2,60 @@
#define _APPLICATION_H_
#include "AudioDevice.h"
#include "OpusEncoder.h"
#include "OpusResampler.h"
#include "WebSocketClient.h"
#include "FirmwareUpgrade.h"
#include <OpusEncoder.h>
#include <OpusResampler.h>
#include <WebSocket.h>
#include <Ml307AtModem.h>
#include <Ml307Http.h>
#include <EspHttp.h>
#include "opus.h"
#include "resampler_structs.h"
#include "freertos/event_groups.h"
#include "freertos/queue.h"
#include "freertos/task.h"
#include "esp_afe_sr_models.h"
#include "esp_nsn_models.h"
#include <opus.h>
#include <resampler_structs.h>
#include <freertos/event_groups.h>
#include <freertos/task.h>
#include <mutex>
#include <list>
#include <condition_variable>
#include "Display.h"
#include "FirmwareUpgrade.h"
#ifdef CONFIG_USE_AFE_SR
#include "WakeWordDetect.h"
#include "AudioProcessor.h"
#endif
#include "Button.h"
#define DETECTION_RUNNING 1
#define COMMUNICATION_RUNNING 2
#define DETECT_PACKETS_ENCODED 4
#define PROTOCOL_VERSION 2
struct BinaryProtocol {
uint16_t version;
uint16_t type;
uint32_t reserved;
uint32_t timestamp;
uint32_t payload_size;
uint8_t payload[];
} __attribute__((packed));
enum AudioPacketType {
kAudioPacketTypeUnkonwn = 0,
kAudioPacketTypeStart,
kAudioPacketTypeStop,
kAudioPacketTypeData,
kAudioPacketTypeSentenceStart,
kAudioPacketTypeSentenceEnd
};
struct AudioPacket {
AudioPacketType type = kAudioPacketTypeUnkonwn;
std::string text;
std::vector<uint8_t> opus;
std::vector<int16_t> pcm;
uint32_t timestamp;
};
enum ChatState {
@@ -27,7 +63,8 @@ enum ChatState {
kChatStateConnecting,
kChatStateListening,
kChatStateSpeaking,
kChatStateWakeWordDetected
kChatStateWakeWordDetected,
kChatStateUpgrading
};
class Application {
@@ -47,28 +84,38 @@ private:
Application();
~Application();
Button button_;
AudioDevice audio_device_;
#ifdef CONFIG_USE_AFE_SR
WakeWordDetect wake_word_detect_;
AudioProcessor audio_processor_;
#endif
#ifdef CONFIG_USE_ML307
Ml307AtModem ml307_at_modem_;
Ml307Http http_;
#else
EspHttp http_;
#endif
FirmwareUpgrade firmware_upgrade_;
std::recursive_mutex mutex_;
WebSocketClient* ws_client_ = nullptr;
esp_afe_sr_data_t* afe_detection_data_ = nullptr;
esp_afe_sr_data_t* afe_communication_data_ = nullptr;
#ifdef CONFIG_USE_DISPLAY
Display display_;
#endif
std::mutex mutex_;
std::condition_variable_any cv_;
std::list<std::function<void()>> main_tasks_;
WebSocket* ws_client_ = nullptr;
EventGroupHandle_t event_group_;
char* wakenet_model_ = NULL;
char* nsnet_model_ = NULL;
volatile ChatState chat_state_ = kChatStateIdle;
volatile bool break_speaking_ = false;
bool skip_to_end_ = false;
// Audio encode / decode
TaskHandle_t audio_feed_task_ = nullptr;
TaskHandle_t audio_encode_task_ = nullptr;
StaticTask_t audio_encode_task_buffer_;
StackType_t* audio_encode_task_stack_ = nullptr;
QueueHandle_t audio_encode_queue_ = nullptr;
TaskHandle_t audio_decode_task_ = nullptr;
StaticTask_t audio_decode_task_buffer_;
StackType_t* audio_decode_task_stack_ = nullptr;
QueueHandle_t audio_decode_queue_ = nullptr;
std::list<std::vector<int16_t>> audio_encode_queue_;
std::list<AudioPacket*> audio_decode_queue_;
std::list<AudioPacket*> audio_play_queue_;
OpusEncoder opus_encoder_;
OpusDecoder* opus_decoder_ = nullptr;
@@ -77,26 +124,22 @@ private:
int opus_decode_sample_rate_ = CONFIG_AUDIO_OUTPUT_SAMPLE_RATE;
OpusResampler opus_resampler_;
TaskHandle_t wake_word_encode_task_ = nullptr;
StaticTask_t wake_word_encode_task_buffer_;
StackType_t* wake_word_encode_task_stack_ = nullptr;
std::list<iovec> wake_word_pcm_;
std::vector<iovec> wake_word_opus_;
TaskHandle_t check_new_version_task_ = nullptr;
StaticTask_t check_new_version_task_buffer_;
StackType_t* check_new_version_task_stack_ = nullptr;
void MainLoop();
void Schedule(std::function<void()> callback);
BinaryProtocol* AllocateBinaryProtocol(const uint8_t* payload, size_t payload_size);
void SetDecodeSampleRate(int sample_rate);
void SetChatState(ChatState state);
void StartDetection();
void StartCommunication();
void StartWebSocketClient();
void StoreWakeWordData(uint8_t* data, size_t size);
void EncodeWakeWordData();
void SendWakeWordData();
void CheckNewVersion();
void UpdateDisplay();
void AudioFeedTask();
void AudioDetectionTask();
void AudioCommunicationTask();
void AudioEncodeTask();
void AudioDecodeTask();
void AudioPlayTask();
void HandleAudioPacket(AudioPacket* packet);
};
#endif // _APPLICATION_H_
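The new `BinaryProtocol` header above replaces the old `AudioDataHeader`; every multi-byte field travels in network byte order (compare `AllocateBinaryProtocol` and the `OnData` handler in Application.cc). A hedged sketch of unpacking a received frame, assuming the `BinaryProtocol` struct defined above:

```cpp
// Sketch only: unpack a BinaryProtocol frame received over the WebSocket.
// Field layout follows the struct above; byte order mirrors the ntohs/ntohl
// calls in Application.cc. arpa/inet.h is provided by lwIP on ESP-IDF.
#include <arpa/inet.h>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ParsedFrame {
    uint16_t version;
    uint16_t type;
    uint32_t timestamp;
    std::vector<uint8_t> payload;
};

static bool ParseBinaryProtocol(const uint8_t* data, size_t len, ParsedFrame& out) {
    if (len < sizeof(BinaryProtocol)) {
        return false;                               // not even a full 16-byte header
    }
    auto header = reinterpret_cast<const BinaryProtocol*>(data);
    out.version   = ntohs(header->version);
    out.type      = ntohs(header->type);
    out.timestamp = ntohl(header->timestamp);
    uint32_t payload_size = ntohl(header->payload_size);
    if (len < sizeof(BinaryProtocol) + payload_size) {
        return false;                               // truncated payload
    }
    out.payload.assign(header->payload, header->payload + payload_size);
    return true;
}
```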

main/AudioDevice.cc (modified)

@@ -1,18 +1,15 @@
#include "AudioDevice.h"
#include "esp_log.h"
#include <esp_log.h>
#include <cstring>
#define TAG "AudioDevice"
AudioDevice::AudioDevice() {
audio_play_queue_ = xQueueCreate(100, sizeof(AudioPacket*));
}
AudioDevice::~AudioDevice() {
vQueueDelete(audio_play_queue_);
if (audio_play_task_ != nullptr) {
vTaskDelete(audio_play_task_);
if (audio_input_task_ != nullptr) {
vTaskDelete(audio_input_task_);
}
if (rx_handle_ != nullptr) {
ESP_ERROR_CHECK(i2s_channel_disable(rx_handle_));
@@ -37,8 +34,8 @@ void AudioDevice::Start(int input_sample_rate, int output_sample_rate) {
xTaskCreate([](void* arg) {
auto audio_device = (AudioDevice*)arg;
audio_device->AudioPlayTask();
}, "audio_play", 4096 * 4, this, 5, &audio_play_task_);
audio_device->InputTask();
}, "audio_input", 4096 * 2, this, 5, &audio_input_task_);
}
void AudioDevice::CreateDuplexChannels() {
@@ -76,10 +73,10 @@ void AudioDevice::CreateDuplexChannels() {
},
.gpio_cfg = {
.mclk = I2S_GPIO_UNUSED,
.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_BCLK,
.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_WS,
.dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DOUT,
.din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DIN,
.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_BCLK,
.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_WS,
.dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_DOUT,
.din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_DIN,
.invert_flags = {
.mclk_inv = false,
.bclk_inv = false,
@@ -127,9 +124,9 @@ void AudioDevice::CreateSimplexChannels() {
},
.gpio_cfg = {
.mclk = I2S_GPIO_UNUSED,
.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_BCLK,
.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_WS,
.dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DOUT,
.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_BCLK,
.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_WS,
.dout = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_SPK_GPIO_DOUT,
.din = I2S_GPIO_UNUSED,
.invert_flags = {
.mclk_inv = false,
@@ -147,7 +144,7 @@ void AudioDevice::CreateSimplexChannels() {
std_cfg.gpio_cfg.bclk = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_BCLK;
std_cfg.gpio_cfg.ws = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_WS;
std_cfg.gpio_cfg.dout = I2S_GPIO_UNUSED;
std_cfg.gpio_cfg.din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_GPIO_DIN;
std_cfg.gpio_cfg.din = (gpio_num_t)CONFIG_AUDIO_DEVICE_I2S_MIC_GPIO_DIN;
ESP_ERROR_CHECK(i2s_channel_init_std_mode(rx_handle_, &std_cfg));
ESP_LOGI(TAG, "Simplex channels created");
}
@@ -180,57 +177,22 @@ int AudioDevice::Read(int16_t* dest, int samples) {
return samples;
}
void AudioDevice::QueueAudioPacket(AudioPacket* packet) {
xQueueSend(audio_play_queue_, &packet, portMAX_DELAY);
void AudioDevice::OnInputData(std::function<void(const int16_t*, int)> callback) {
on_input_data_ = callback;
}
void AudioDevice::AudioPlayTask() {
while (true) {
AudioPacket* packet;
xQueueReceive(audio_play_queue_, &packet, portMAX_DELAY);
void AudioDevice::OutputData(std::vector<int16_t>& data) {
Write(data.data(), data.size());
}
switch (packet->type)
{
case kAudioPacketTypeStart:
playing_ = true;
breaked_ = false;
if (on_state_changed_) {
on_state_changed_();
}
break;
case kAudioPacketTypeStop:
playing_ = false;
if (on_state_changed_) {
on_state_changed_();
}
break;
case kAudioPacketTypeSentenceStart:
ESP_LOGI(TAG, "Playing sentence: %s", packet->text.c_str());
break;
case kAudioPacketTypeSentenceEnd:
if (breaked_) { // Clear the queue
AudioPacket* p;
while (xQueueReceive(audio_play_queue_, &p, 0) == pdTRUE) {
delete p;
}
breaked_ = false;
playing_ = false;
}
break;
case kAudioPacketTypeData:
Write(packet->pcm.data(), packet->pcm.size());
break;
default:
ESP_LOGE(TAG, "Unknown audio packet type: %d", packet->type);
void AudioDevice::InputTask() {
int duration = 30;
int input_frame_size = input_sample_rate_ / 1000 * duration;
int16_t input_buffer[input_frame_size];
while (true) {
int samples = Read(input_buffer, input_frame_size);
if (samples > 0) {
on_input_data_(input_buffer, samples);
}
delete packet;
}
}
void AudioDevice::OnStateChanged(std::function<void()> callback) {
on_state_changed_ = callback;
}
void AudioDevice::Break() {
breaked_ = true;
}

main/AudioDevice.h (modified)

@@ -1,76 +1,44 @@
#ifndef _AUDIO_DEVICE_H
#define _AUDIO_DEVICE_H
#include "opus.h"
#include "freertos/FreeRTOS.h"
#include "freertos/queue.h"
#include "freertos/event_groups.h"
#include "driver/i2s_std.h"
#include <freertos/FreeRTOS.h>
#include <freertos/event_groups.h>
#include <driver/i2s_std.h>
#include <vector>
#include <string>
#include <functional>
enum AudioPacketType {
kAudioPacketTypeUnkonwn = 0,
kAudioPacketTypeStart,
kAudioPacketTypeStop,
kAudioPacketTypeData,
kAudioPacketTypeSentenceStart,
kAudioPacketTypeSentenceEnd
};
struct AudioPacket {
AudioPacketType type = kAudioPacketTypeUnkonwn;
std::string text;
std::vector<uint8_t> opus;
std::vector<int16_t> pcm;
uint32_t timestamp;
};
struct AudioDataHeader {
uint32_t version;
uint32_t reserved;
uint32_t timestamp;
uint32_t payload_size;
} __attribute__((packed));
class AudioDevice {
public:
AudioDevice();
~AudioDevice();
void Start(int input_sample_rate, int output_sample_rate);
int Read(int16_t* dest, int samples);
void Write(const int16_t* data, int samples);
void QueueAudioPacket(AudioPacket* packet);
void OnStateChanged(std::function<void()> callback);
void Break();
void OnInputData(std::function<void(const int16_t*, int)> callback);
void OutputData(std::vector<int16_t>& data);
int input_sample_rate() const { return input_sample_rate_; }
int output_sample_rate() const { return output_sample_rate_; }
bool duplex() const { return duplex_; }
bool playing() const { return playing_; }
private:
bool playing_ = false;
bool breaked_ = false;
bool duplex_ = false;
int input_sample_rate_ = 0;
int output_sample_rate_ = 0;
i2s_chan_handle_t tx_handle_ = nullptr;
i2s_chan_handle_t rx_handle_ = nullptr;
QueueHandle_t audio_play_queue_ = nullptr;
TaskHandle_t audio_play_task_ = nullptr;
TaskHandle_t audio_input_task_ = nullptr;
EventGroupHandle_t event_group_;
std::function<void()> on_state_changed_;
std::function<void(const int16_t*, int)> on_input_data_;
void CreateDuplexChannels();
void CreateSimplexChannels();
void AudioPlayTask();
void InputTask();
int Read(int16_t* dest, int samples);
void Write(const int16_t* data, int samples);
};
#endif // _AUDIO_DEVICE_H
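With this change AudioDevice no longer owns a playback queue: `Read()`/`Write()` become private, microphone data is delivered through `OnInputData()`, and playback goes through the blocking `OutputData()`. A minimal usage sketch, following how Application.cc wires it up (the `CONFIG_AUDIO_*` macros are the project's Kconfig values):

```cpp
// Minimal usage sketch of the reworked AudioDevice (callback-driven input).
AudioDevice audio_device;

// Register the input callback before Start(); it is invoked from the internal
// audio_input task with roughly 30 ms PCM frames (see InputTask above).
audio_device.OnInputData([](const int16_t* data, int samples) {
    // Forward the frame to wake word detection / the audio processor here.
});

audio_device.Start(CONFIG_AUDIO_INPUT_SAMPLE_RATE, CONFIG_AUDIO_OUTPUT_SAMPLE_RATE);

// Playback: OutputData() writes the samples straight to I2S and blocks until done.
std::vector<int16_t> pcm(480, 0);   // e.g. 30 ms of silence at 16 kHz
audio_device.OutputData(pcm);
```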

main/AudioProcessor.cc (new file, 106 lines)

@@ -0,0 +1,106 @@
#include "AudioProcessor.h"
#include <esp_log.h>
#define PROCESSOR_RUNNING 0x01
static const char* TAG = "AudioProcessor";
AudioProcessor::AudioProcessor()
: afe_communication_data_(nullptr) {
event_group_ = xEventGroupCreate();
afe_config_t afe_config = {
.aec_init = false,
.se_init = true,
.vad_init = false,
.wakenet_init = false,
.voice_communication_init = true,
.voice_communication_agc_init = true,
.voice_communication_agc_gain = 10,
.vad_mode = VAD_MODE_3,
.wakenet_model_name = NULL,
.wakenet_model_name_2 = NULL,
.wakenet_mode = DET_MODE_90,
.afe_mode = SR_MODE_HIGH_PERF,
.afe_perferred_core = 0,
.afe_perferred_priority = 5,
.afe_ringbuf_size = 50,
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
.afe_linear_gain = 1.0,
.agc_mode = AFE_MN_PEAK_AGC_MODE_2,
.pcm_config = {
.total_ch_num = 1,
.mic_num = 1,
.ref_num = 0,
.sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE,
},
.debug_init = false,
.debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
.afe_ns_mode = NS_MODE_SSP,
.afe_ns_model_name = NULL,
.fixed_first_channel = true,
};
afe_communication_data_ = esp_afe_vc_v1.create_from_config(&afe_config);
xTaskCreate([](void* arg) {
auto this_ = (AudioProcessor*)arg;
this_->AudioProcessorTask();
vTaskDelete(NULL);
}, "audio_communication", 4096 * 2, this, 5, NULL);
}
AudioProcessor::~AudioProcessor() {
if (afe_communication_data_ != nullptr) {
esp_afe_vc_v1.destroy(afe_communication_data_);
}
vEventGroupDelete(event_group_);
}
void AudioProcessor::Input(const int16_t* data, int size) {
input_buffer_.insert(input_buffer_.end(), data, data + size);
auto chunk_size = esp_afe_vc_v1.get_feed_chunksize(afe_communication_data_);
while (input_buffer_.size() >= chunk_size) {
auto chunk = input_buffer_.data();
esp_afe_vc_v1.feed(afe_communication_data_, chunk);
input_buffer_.erase(input_buffer_.begin(), input_buffer_.begin() + chunk_size);
}
}
void AudioProcessor::Start() {
xEventGroupSetBits(event_group_, PROCESSOR_RUNNING);
}
void AudioProcessor::Stop() {
xEventGroupClearBits(event_group_, PROCESSOR_RUNNING);
}
bool AudioProcessor::IsRunning() {
return xEventGroupGetBits(event_group_) & PROCESSOR_RUNNING;
}
void AudioProcessor::OnOutput(std::function<void(std::vector<int16_t>&& data)> callback) {
output_callback_ = callback;
}
void AudioProcessor::AudioProcessorTask() {
int chunk_size = esp_afe_vc_v1.get_fetch_chunksize(afe_communication_data_);
ESP_LOGI(TAG, "Audio communication task started, chunk size: %d", chunk_size);
while (true) {
xEventGroupWaitBits(event_group_, PROCESSOR_RUNNING, pdFALSE, pdTRUE, portMAX_DELAY);
auto res = esp_afe_vc_v1.fetch(afe_communication_data_);
if (res == nullptr || res->ret_value == ESP_FAIL) {
if (res != nullptr) {
ESP_LOGI(TAG, "Error code: %d", res->ret_value);
}
continue;
}
if (output_callback_) {
output_callback_(std::vector<int16_t>(res->data, res->data + res->data_size / sizeof(int16_t)));
}
}
}

main/AudioProcessor.h (new file, 33 lines)

@@ -0,0 +1,33 @@
#ifndef AUDIO_PROCESSOR_H
#define AUDIO_PROCESSOR_H
#include <esp_afe_sr_models.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>
#include <freertos/event_groups.h>
#include <string>
#include <vector>
#include <functional>
class AudioProcessor {
public:
AudioProcessor();
~AudioProcessor();
void Input(const int16_t* data, int size);
void Start();
void Stop();
bool IsRunning();
void OnOutput(std::function<void(std::vector<int16_t>&& data)> callback);
private:
EventGroupHandle_t event_group_ = nullptr;
esp_afe_sr_data_t* afe_communication_data_ = nullptr;
std::vector<int16_t> input_buffer_;
std::function<void(std::vector<int16_t>&& data)> output_callback_;
void AudioProcessorTask();
};
#endif
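A short usage sketch for the new AudioProcessor, following how Application.cc drives it: raw microphone PCM goes in through `Input()`, AFE-processed frames come back through the `OnOutput()` callback, and `Start()`/`Stop()` gate the internal fetch task. The `audio_device` below is the AudioDevice from this changeset.

```cpp
// Usage sketch for AudioProcessor, mirroring the wiring in Application.cc.
AudioProcessor audio_processor;

// Processed (voice-communication AFE) frames arrive on this callback, invoked
// from the internal audio_communication task.
audio_processor.OnOutput([](std::vector<int16_t>&& data) {
    // Queue the frame for Opus encoding and sending to the server here.
});

// Feed raw microphone PCM; it is buffered internally until a full AFE chunk
// is available. Only feed while the processor is running.
audio_device.OnInputData([&](const int16_t* data, int samples) {
    if (audio_processor.IsRunning()) {
        audio_processor.Input(data, samples);
    }
});

audio_processor.Start();   // begin fetching processed audio when a chat starts
// ... conversation ...
audio_processor.Stop();    // stop fetching when the connection closes
```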

main/Button.cc (new file, 67 lines)

@@ -0,0 +1,67 @@
#include "Button.h"
#include <esp_log.h>
static const char* TAG = "Button";
Button::Button(gpio_num_t gpio_num) : gpio_num_(gpio_num) {
button_config_t button_config = {
.type = BUTTON_TYPE_GPIO,
.long_press_time = 3000,
.short_press_time = 100,
.gpio_button_config = {
.gpio_num = gpio_num,
.active_level = 0
}
};
button_handle_ = iot_button_create(&button_config);
if (button_handle_ == NULL) {
ESP_LOGE(TAG, "Failed to create button handle");
return;
}
}
Button::~Button() {
if (button_handle_ != NULL) {
iot_button_delete(button_handle_);
}
}
void Button::OnPress(std::function<void()> callback) {
on_press_ = callback;
iot_button_register_cb(button_handle_, BUTTON_PRESS_DOWN, [](void* handle, void* usr_data) {
Button* button = static_cast<Button*>(usr_data);
if (button->on_press_) {
button->on_press_();
}
}, this);
}
void Button::OnLongPress(std::function<void()> callback) {
on_long_press_ = callback;
iot_button_register_cb(button_handle_, BUTTON_LONG_PRESS_START, [](void* handle, void* usr_data) {
Button* button = static_cast<Button*>(usr_data);
if (button->on_long_press_) {
button->on_long_press_();
}
}, this);
}
void Button::OnClick(std::function<void()> callback) {
on_click_ = callback;
iot_button_register_cb(button_handle_, BUTTON_SINGLE_CLICK, [](void* handle, void* usr_data) {
Button* button = static_cast<Button*>(usr_data);
if (button->on_click_) {
button->on_click_();
}
}, this);
}
void Button::OnDoubleClick(std::function<void()> callback) {
on_double_click_ = callback;
iot_button_register_cb(button_handle_, BUTTON_DOUBLE_CLICK, [](void* handle, void* usr_data) {
Button* button = static_cast<Button*>(usr_data);
if (button->on_double_click_) {
button->on_double_click_();
}
}, this);
}

main/Button.h (new file, 28 lines)

@@ -0,0 +1,28 @@
#ifndef BUTTON_H_
#define BUTTON_H_
#include <driver/gpio.h>
#include <iot_button.h>
#include <functional>
class Button {
public:
Button(gpio_num_t gpio_num);
~Button();
void OnPress(std::function<void()> callback);
void OnLongPress(std::function<void()> callback);
void OnClick(std::function<void()> callback);
void OnDoubleClick(std::function<void()> callback);
private:
gpio_num_t gpio_num_;
button_handle_t button_handle_;
std::function<void()> on_press_;
std::function<void()> on_long_press_;
std::function<void()> on_click_;
std::function<void()> on_double_click_;
};
#endif // BUTTON_H_
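Button is a thin wrapper around the iot_button component; a minimal usage sketch, following how Application.cc binds the BOOT button (`CONFIG_BOOT_BUTTON_GPIO` is the project's Kconfig option):

```cpp
// Minimal usage sketch of the Button wrapper (iot_button underneath).
Button button((gpio_num_t)CONFIG_BOOT_BUTTON_GPIO);   // configured as active-low

button.OnClick([]() {
    // Single click: Application.cc uses this to start, interrupt or stop a chat.
});

button.OnLongPress([]() {
    // Long press fires after 3000 ms (see button_config in Button.cc).
});
```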

main/CMakeLists.txt (modified)

@@ -1,9 +1,17 @@
set(SOURCES "AudioDevice.cc"
"FirmwareUpgrade.cc"
"SystemInfo.cc"
"SystemReset.cc"
"Application.cc"
"Display.cc"
"Button.cc"
"main.cc"
)
if(CONFIG_USE_AFE_SR)
list(APPEND SOURCES "AudioProcessor.cc" "WakeWordDetect.cc")
endif()
idf_component_register(SRCS ${SOURCES}
INCLUDE_DIRS "."
)
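AudioProcessor.cc and WakeWordDetect.cc are only compiled when `CONFIG_USE_AFE_SR` is enabled, and the C++ side mirrors that with preprocessor guards; a condensed illustration based on Application.h/.cc in this changeset:

```cpp
// Condensed from Application.h / Application.cc in this changeset: the AFE
// members and every line that touches them are guarded by CONFIG_USE_AFE_SR,
// matching the conditional source list in main/CMakeLists.txt above.
#ifdef CONFIG_USE_AFE_SR
#include "WakeWordDetect.h"
#include "AudioProcessor.h"
#endif

class Application {
    // ...
#ifdef CONFIG_USE_AFE_SR
    WakeWordDetect wake_word_detect_;
    AudioProcessor audio_processor_;
#endif
};
```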

main/Display.cc (new file, 139 lines)

@@ -0,0 +1,139 @@
#include "Display.h"
#include <esp_log.h>
#include <esp_err.h>
#include <esp_lcd_panel_ops.h>
#include <esp_lcd_panel_vendor.h>
#include <esp_lvgl_port.h>
#include <string>
#include <cstdlib>
#define TAG "Display"
#ifdef CONFIG_USE_DISPLAY
Display::Display(int sda_pin, int scl_pin) : sda_pin_(sda_pin), scl_pin_(scl_pin) {
ESP_LOGI(TAG, "Display Pins: %d, %d", sda_pin_, scl_pin_);
i2c_master_bus_config_t bus_config = {
.i2c_port = I2C_NUM_0,
.sda_io_num = (gpio_num_t)sda_pin_,
.scl_io_num = (gpio_num_t)scl_pin_,
.clk_source = I2C_CLK_SRC_DEFAULT,
.glitch_ignore_cnt = 7,
.intr_priority = 1,
.trans_queue_depth = 0,
.flags = {
.enable_internal_pullup = 1,
},
};
ESP_ERROR_CHECK(i2c_new_master_bus(&bus_config, &i2c_bus_));
// SSD1306 config
esp_lcd_panel_io_i2c_config_t io_config = {
.dev_addr = 0x3C,
.on_color_trans_done = nullptr,
.user_ctx = nullptr,
.control_phase_bytes = 1,
.dc_bit_offset = 6,
.lcd_cmd_bits = 8,
.lcd_param_bits = 8,
.flags = {
.dc_low_on_data = 0,
.disable_control_phase = 0,
},
.scl_speed_hz = 400 * 1000,
};
ESP_ERROR_CHECK(esp_lcd_new_panel_io_i2c_v2(i2c_bus_, &io_config, &panel_io_));
ESP_LOGI(TAG, "Install SSD1306 driver");
esp_lcd_panel_dev_config_t panel_config = {};
panel_config.reset_gpio_num = -1;
panel_config.bits_per_pixel = 1;
esp_lcd_panel_ssd1306_config_t ssd1306_config = {
.height = CONFIG_DISPLAY_HEIGHT
};
panel_config.vendor_config = &ssd1306_config;
ESP_ERROR_CHECK(esp_lcd_new_panel_ssd1306(panel_io_, &panel_config, &panel_));
ESP_LOGI(TAG, "SSD1306 driver installed");
// Reset the display
ESP_ERROR_CHECK(esp_lcd_panel_reset(panel_));
if (esp_lcd_panel_init(panel_) != ESP_OK) {
ESP_LOGE(TAG, "Failed to initialize display");
return;
}
ESP_LOGI(TAG, "Initialize LVGL");
lvgl_port_cfg_t port_cfg = ESP_LVGL_PORT_INIT_CONFIG();
lvgl_port_init(&port_cfg);
const lvgl_port_display_cfg_t display_cfg = {
.io_handle = panel_io_,
.panel_handle = panel_,
.buffer_size = 128 * CONFIG_DISPLAY_HEIGHT,
.double_buffer = true,
.hres = 128,
.vres = CONFIG_DISPLAY_HEIGHT,
.monochrome = true,
.rotation = {
.swap_xy = 0,
.mirror_x = 0,
.mirror_y = 0,
},
.flags = {
.buff_dma = 0,
.buff_spiram = 0,
},
};
disp_ = lvgl_port_add_disp(&display_cfg);
lv_disp_set_rotation(disp_, LV_DISP_ROT_180);
// Set the display to on
ESP_LOGI(TAG, "Turning display on");
ESP_ERROR_CHECK(esp_lcd_panel_disp_on_off(panel_, true));
ESP_LOGI(TAG, "Display Loading...");
if (lvgl_port_lock(0)) {
label_ = lv_label_create(lv_disp_get_scr_act(disp_));
lv_label_set_text(label_, "Initializing...");
lv_obj_set_width(label_, disp_->driver->hor_res);
lv_obj_set_height(label_, disp_->driver->ver_res);
lv_obj_set_style_text_line_space(label_, 0, 0);
lv_obj_set_style_pad_all(label_, 0, 0);
lv_obj_set_style_outline_pad(label_, 0, 0);
lvgl_port_unlock();
}
}
Display::~Display() {
if (label_ != nullptr) {
lvgl_port_lock(0);
lv_obj_del(label_);
lvgl_port_unlock();
}
if (disp_ != nullptr) {
lvgl_port_deinit();
esp_lcd_panel_del(panel_);
esp_lcd_panel_io_del(panel_io_);
i2c_master_bus_reset(i2c_bus_);
}
}
void Display::SetText(const std::string &text) {
if (label_ != nullptr) {
text_ = text;
lvgl_port_lock(0);
// Change the text of the label
lv_label_set_text(label_, text_.c_str());
lvgl_port_unlock();
}
}
#endif

main/Display.h (new file, 32 lines)

@@ -0,0 +1,32 @@
#ifndef DISPLAY_H
#define DISPLAY_H
#include <driver/i2c_master.h>
#include <esp_lcd_panel_io.h>
#include <esp_lcd_panel_ops.h>
#include <lvgl.h>
#include <string>
class Display {
public:
Display(int sda_pin, int scl_pin);
~Display();
void SetText(const std::string &text);
private:
int sda_pin_;
int scl_pin_;
i2c_master_bus_handle_t i2c_bus_ = nullptr;
esp_lcd_panel_io_handle_t panel_io_ = nullptr;
esp_lcd_panel_handle_t panel_ = nullptr;
lv_disp_t *disp_ = nullptr;
lv_obj_t *label_ = nullptr;
std::string text_;
};
#endif
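Display drives an SSD1306 OLED over I2C and renders a single LVGL label; usage is just the constructor with the I2C pins plus `SetText()`, as Application.cc does for status lines (`CONFIG_DISPLAY_*` are the project's Kconfig options, and the sample string is illustrative):

```cpp
// Usage sketch of the Display class (SSD1306 over I2C, one LVGL label).
#ifdef CONFIG_USE_DISPLAY
Display display(CONFIG_DISPLAY_SDA_PIN, CONFIG_DISPLAY_SCL_PIN);

// SetText() replaces the whole label; Application.cc writes multi-line status
// strings such as the SSID plus signal quality.
display.SetText("MyWiFi\nVery good (-54)");
#endif
```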

main/FirmwareUpgrade.cc (new file, 259 lines)

@@ -0,0 +1,259 @@
#include "FirmwareUpgrade.h"
#include "SystemInfo.h"
#include <cJSON.h>
#include <esp_log.h>
#include <esp_partition.h>
#include <esp_http_client.h>
#include <esp_ota_ops.h>
#include <esp_app_format.h>
#include <vector>
#include <sstream>
#include <algorithm>
#define TAG "FirmwareUpgrade"
FirmwareUpgrade::FirmwareUpgrade(Http& http) : http_(http) {
}
FirmwareUpgrade::~FirmwareUpgrade() {
}
void FirmwareUpgrade::SetCheckVersionUrl(std::string check_version_url) {
check_version_url_ = check_version_url;
}
void FirmwareUpgrade::SetPostData(const std::string& post_data) {
post_data_ = post_data;
}
void FirmwareUpgrade::SetHeader(const std::string& key, const std::string& value) {
headers_[key] = value;
}
void FirmwareUpgrade::CheckVersion() {
std::string current_version = esp_app_get_description()->version;
ESP_LOGI(TAG, "Current version: %s", current_version.c_str());
if (check_version_url_.length() < 10) {
ESP_LOGE(TAG, "Check version URL is not properly set");
return;
}
for (const auto& header : headers_) {
http_.SetHeader(header.first, header.second);
}
if (post_data_.empty()) {
http_.Open("GET", check_version_url_);
} else {
http_.SetHeader("Content-Type", "application/json");
http_.SetContent(post_data_);
http_.Open("POST", check_version_url_);
}
auto response = http_.GetBody();
http_.Close();
// Response: { "firmware": { "version": "1.0.0", "url": "http://" } }
// Parse the JSON response and check if the version is newer
// If it is, set has_new_version_ to true and store the new version and URL
cJSON *root = cJSON_Parse(response.c_str());
if (root == NULL) {
ESP_LOGE(TAG, "Failed to parse JSON response");
return;
}
cJSON *firmware = cJSON_GetObjectItem(root, "firmware");
if (firmware == NULL) {
ESP_LOGE(TAG, "Failed to get firmware object");
cJSON_Delete(root);
return;
}
cJSON *version = cJSON_GetObjectItem(firmware, "version");
if (version == NULL) {
ESP_LOGE(TAG, "Failed to get version object");
cJSON_Delete(root);
return;
}
cJSON *url = cJSON_GetObjectItem(firmware, "url");
if (url == NULL) {
ESP_LOGE(TAG, "Failed to get url object");
cJSON_Delete(root);
return;
}
firmware_version_ = version->valuestring;
firmware_url_ = url->valuestring;
cJSON_Delete(root);
// Check if the version is newer, for example, 0.1.0 is newer than 0.0.1
has_new_version_ = IsNewVersionAvailable(current_version, firmware_version_);
if (has_new_version_) {
ESP_LOGI(TAG, "New version available: %s", firmware_version_.c_str());
} else {
ESP_LOGI(TAG, "Current is the latest version");
}
}
void FirmwareUpgrade::MarkCurrentVersionValid() {
auto partition = esp_ota_get_running_partition();
if (strcmp(partition->label, "factory") == 0) {
ESP_LOGI(TAG, "Running from factory partition, skipping");
return;
}
ESP_LOGI(TAG, "Running partition: %s", partition->label);
esp_ota_img_states_t state;
if (esp_ota_get_state_partition(partition, &state) != ESP_OK) {
ESP_LOGE(TAG, "Failed to get state of partition");
return;
}
if (state == ESP_OTA_IMG_PENDING_VERIFY) {
ESP_LOGI(TAG, "Marking firmware as valid");
esp_ota_mark_app_valid_cancel_rollback();
}
}
void FirmwareUpgrade::Upgrade(const std::string& firmware_url) {
ESP_LOGI(TAG, "Upgrading firmware from %s", firmware_url.c_str());
esp_ota_handle_t update_handle = 0;
auto update_partition = esp_ota_get_next_update_partition(NULL);
if (update_partition == NULL) {
ESP_LOGE(TAG, "Failed to get update partition");
return;
}
ESP_LOGI(TAG, "Writing to partition %s at offset 0x%lx", update_partition->label, update_partition->address);
bool image_header_checked = false;
std::string image_header;
if (!http_.Open("GET", firmware_url)) {
ESP_LOGE(TAG, "Failed to open HTTP connection");
return;
}
size_t content_length = http_.GetBodyLength();
if (content_length == 0) {
ESP_LOGE(TAG, "Failed to get content length");
http_.Close();
return;
}
char buffer[4096];
size_t total_read = 0, recent_read = 0;
auto last_calc_time = esp_timer_get_time();
while (true) {
int ret = http_.Read(buffer, sizeof(buffer));
if (ret < 0) {
ESP_LOGE(TAG, "Failed to read HTTP data: %s", esp_err_to_name(ret));
http_.Close();
return;
}
// Calculate speed and progress every second
recent_read += ret;
total_read += ret;
if (esp_timer_get_time() - last_calc_time >= 1000000 || ret == 0) {
size_t progress = total_read * 100 / content_length;
ESP_LOGI(TAG, "Progress: %zu%% (%zu/%zu), Speed: %zuB/s", progress, total_read, content_length, recent_read);
if (upgrade_callback_) {
upgrade_callback_(progress, recent_read);
}
last_calc_time = esp_timer_get_time();
recent_read = 0;
}
if (ret == 0) {
break;
}
if (!image_header_checked) {
image_header.append(buffer, ret);
if (image_header.size() >= sizeof(esp_image_header_t) + sizeof(esp_image_segment_header_t) + sizeof(esp_app_desc_t)) {
esp_app_desc_t new_app_info;
memcpy(&new_app_info, image_header.data() + sizeof(esp_image_header_t) + sizeof(esp_image_segment_header_t), sizeof(esp_app_desc_t));
ESP_LOGI(TAG, "New firmware version: %s", new_app_info.version);
auto current_version = esp_app_get_description()->version;
if (memcmp(new_app_info.version, current_version, sizeof(new_app_info.version)) == 0) {
ESP_LOGE(TAG, "Firmware version is the same, skipping upgrade");
http_.Close();
return;
}
if (esp_ota_begin(update_partition, OTA_WITH_SEQUENTIAL_WRITES, &update_handle)) {
esp_ota_abort(update_handle);
http_.Close();
ESP_LOGE(TAG, "Failed to begin OTA");
return;
}
image_header_checked = true;
}
}
auto err = esp_ota_write(update_handle, buffer, ret);
if (err != ESP_OK) {
ESP_LOGE(TAG, "Failed to write OTA data: %s", esp_err_to_name(err));
esp_ota_abort(update_handle);
http_.Close();
return;
}
}
http_.Close();
esp_err_t err = esp_ota_end(update_handle);
if (err != ESP_OK) {
if (err == ESP_ERR_OTA_VALIDATE_FAILED) {
ESP_LOGE(TAG, "Image validation failed, image is corrupted");
} else {
ESP_LOGE(TAG, "Failed to end OTA: %s", esp_err_to_name(err));
}
return;
}
err = esp_ota_set_boot_partition(update_partition);
if (err != ESP_OK) {
ESP_LOGE(TAG, "Failed to set boot partition: %s", esp_err_to_name(err));
return;
}
ESP_LOGI(TAG, "Firmware upgrade successful, rebooting in 3 seconds...");
vTaskDelay(pdMS_TO_TICKS(3000));
esp_restart();
}
void FirmwareUpgrade::StartUpgrade(std::function<void(int progress, size_t speed)> callback) {
upgrade_callback_ = callback;
Upgrade(firmware_url_);
}
std::vector<int> FirmwareUpgrade::ParseVersion(const std::string& version) {
std::vector<int> versionNumbers;
std::stringstream ss(version);
std::string segment;
while (std::getline(ss, segment, '.')) {
versionNumbers.push_back(std::stoi(segment));
}
return versionNumbers;
}
bool FirmwareUpgrade::IsNewVersionAvailable(const std::string& currentVersion, const std::string& newVersion) {
std::vector<int> current = ParseVersion(currentVersion);
std::vector<int> newer = ParseVersion(newVersion);
for (size_t i = 0; i < std::min(current.size(), newer.size()); ++i) {
if (newer[i] > current[i]) {
return true;
} else if (newer[i] < current[i]) {
return false;
}
}
return newer.size() > current.size();
}
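Taken together, FirmwareUpgrade is meant to be driven in a check-then-upgrade sequence. A minimal sketch of that flow, assuming the application supplies an `Http` implementation and that `CONFIG_OTA_VERSION_URL` comes from the Kconfig option shown later in this changeset (the `CheckAndUpgrade` wrapper is illustrative):

```cpp
// Illustrative flow only; error handling and the concrete Http object are
// application-specific.
#include <esp_log.h>
#include "FirmwareUpgrade.h"
#include "SystemInfo.h"

void CheckAndUpgrade(Http& http) {
    FirmwareUpgrade upgrade(http);
    upgrade.MarkCurrentVersionValid();                // cancel rollback once the current image has booted fine
    upgrade.SetCheckVersionUrl(CONFIG_OTA_VERSION_URL);
    upgrade.SetPostData(SystemInfo::GetJsonString()); // report board info when checking the version
    upgrade.CheckVersion();
    if (upgrade.HasNewVersion()) {
        upgrade.StartUpgrade([](int progress, size_t speed) {
            ESP_LOGI("main", "OTA progress: %d%% (%zu B/s)", progress, speed);
        });
    }
}
```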

38
main/FirmwareUpgrade.h Normal file
View File

@@ -0,0 +1,38 @@
#ifndef _FIRMWARE_UPGRADE_H
#define _FIRMWARE_UPGRADE_H
#include <functional>
#include <string>
#include <map>
#include <Http.h>
class FirmwareUpgrade {
public:
FirmwareUpgrade(Http& http);
~FirmwareUpgrade();
void SetCheckVersionUrl(std::string check_version_url);
void SetPostData(const std::string& post_data);
void SetHeader(const std::string& key, const std::string& value);
void CheckVersion();
bool HasNewVersion() { return has_new_version_; }
void StartUpgrade(std::function<void(int progress, size_t speed)> callback);
void MarkCurrentVersionValid();
private:
Http& http_;
std::string check_version_url_;
bool has_new_version_ = false;
std::string firmware_version_;
std::string firmware_url_;
std::string post_data_;
std::map<std::string, std::string> headers_;
void Upgrade(const std::string& firmware_url);
std::function<void(int progress, size_t speed)> upgrade_callback_;
std::vector<int> ParseVersion(const std::string& version);
bool IsNewVersionAvailable(const std::string& currentVersion, const std::string& newVersion);
};
#endif // _FIRMWARE_UPGRADE_H

View File

@@ -1,14 +1,20 @@
menu "Xiaozhi Assistant"
config OTA_VERSION_URL
string "OTA Version URL"
default "https://api.tenclass.net/xiaozhi/ota/"
help
The application will access this URL to check for updates.
config WEBSOCKET_URL
string "Websocket URL"
default "wss://"
default "wss://api.tenclass.net/xiaozhi/v1/"
help
Communication with the server through websocket after wake up.
config WEBSOCKET_ACCESS_TOKEN
string "Websocket Access Token"
default ""
default "test-token"
help
Access token for websocket communication.
@@ -24,29 +30,29 @@ config AUDIO_OUTPUT_SAMPLE_RATE
help
Audio output sample rate.
config AUDIO_DEVICE_I2S_GPIO_BCLK
int "I2S GPIO BCLK"
default 5
help
GPIO number of the I2S BCLK.
config AUDIO_DEVICE_I2S_GPIO_WS
config AUDIO_DEVICE_I2S_MIC_GPIO_WS
int "I2S GPIO WS"
default 4
help
GPIO number of the I2S WS.
config AUDIO_DEVICE_I2S_GPIO_DOUT
int "I2S GPIO DOUT"
config AUDIO_DEVICE_I2S_MIC_GPIO_BCLK
int "I2S GPIO BCLK"
default 5
help
GPIO number of the I2S BCLK.
config AUDIO_DEVICE_I2S_MIC_GPIO_DIN
int "I2S GPIO DIN"
default 6
help
GPIO number of the I2S DOUT.
config AUDIO_DEVICE_I2S_GPIO_DIN
int "I2S GPIO DIN"
default 3
help
GPIO number of the I2S DIN.
config AUDIO_DEVICE_I2S_SPK_GPIO_DOUT
int "I2S GPIO DOUT"
default 7
help
GPIO number of the I2S DOUT.
config AUDIO_DEVICE_I2S_SIMPLEX
bool "I2S Simplex"
@@ -54,18 +60,77 @@ config AUDIO_DEVICE_I2S_SIMPLEX
help
Enable I2S Simplex mode.
config AUDIO_DEVICE_I2S_MIC_GPIO_BCLK
int "I2S MIC GPIO BCLK"
default 11
config AUDIO_DEVICE_I2S_SPK_GPIO_BCLK
int "I2S SPK GPIO BCLK"
default 15
depends on AUDIO_DEVICE_I2S_SIMPLEX
help
GPIO number of the I2S MIC BCLK.
config AUDIO_DEVICE_I2S_MIC_GPIO_WS
int "I2S MIC GPIO WS"
default 10
config AUDIO_DEVICE_I2S_SPK_GPIO_WS
int "I2S SPK GPIO WS"
default 16
depends on AUDIO_DEVICE_I2S_SIMPLEX
help
GPIO number of the I2S MIC WS.
config BOOT_BUTTON_GPIO
int "Boot Button GPIO"
default 0
help
GPIO number of the boot button.
config USE_AFE_SR
bool "Use Espressif AFE SR"
default y
help
Use AFE SR for wake word detection.
config USE_ML307
bool "Use ML307"
default n
help
Use ML307 as the modem.
config ML307_RX_PIN
int "ML307 RX Pin"
default 11
depends on USE_ML307
help
GPIO number of the ML307 RX.
config ML307_TX_PIN
int "ML307 TX Pin"
default 12
depends on USE_ML307
help
GPIO number of the ML307 TX.
config USE_DISPLAY
bool "Use Display"
default n
help
Use Display.
config DISPLAY_HEIGHT
int "Display Height"
default 32
depends on USE_DISPLAY
help
Display height in pixels.
config DISPLAY_SDA_PIN
int "Display SDA Pin"
default 41
depends on USE_DISPLAY
help
GPIO number of the Display SDA.
config DISPLAY_SCL_PIN
int "Display SCL Pin"
default 42
depends on USE_DISPLAY
help
GPIO number of the Display SCL.
endmenu
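These options become `CONFIG_*` macros at build time. A small, purely illustrative sketch of reading them from firmware code:

```cpp
// Sketch only: the CONFIG_* macros below are generated from the Kconfig
// entries above; none of this code appears in the diff itself.
#include <esp_log.h>

void LogBoardConfig() {
    ESP_LOGI("config", "boot button on GPIO %d", CONFIG_BOOT_BUTTON_GPIO);
#ifdef CONFIG_USE_DISPLAY
    ESP_LOGI("config", "display: SDA %d, SCL %d, height %d",
             CONFIG_DISPLAY_SDA_PIN, CONFIG_DISPLAY_SCL_PIN, CONFIG_DISPLAY_HEIGHT);
#endif
#ifdef CONFIG_USE_ML307
    ESP_LOGI("config", "ML307 modem: TX %d, RX %d", CONFIG_ML307_TX_PIN, CONFIG_ML307_RX_PIN);
#endif
}
```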

218
main/SystemInfo.cc Normal file
View File

@@ -0,0 +1,218 @@
#include "SystemInfo.h"
#include <freertos/task.h>
#include <esp_log.h>
#include <esp_flash.h>
#include <esp_mac.h>
#include <esp_chip_info.h>
#include <esp_system.h>
#include <esp_partition.h>
#include <esp_app_desc.h>
#include <esp_ota_ops.h>
#define TAG "SystemInfo"
size_t SystemInfo::GetFlashSize() {
uint32_t flash_size;
if (esp_flash_get_size(NULL, &flash_size) != ESP_OK) {
ESP_LOGE(TAG, "Failed to get flash size");
return 0;
}
return (size_t)flash_size;
}
size_t SystemInfo::GetMinimumFreeHeapSize() {
return esp_get_minimum_free_heap_size();
}
size_t SystemInfo::GetFreeHeapSize() {
return esp_get_free_heap_size();
}
std::string SystemInfo::GetMacAddress() {
uint8_t mac[6];
esp_read_mac(mac, ESP_MAC_WIFI_STA);
char mac_str[18];
snprintf(mac_str, sizeof(mac_str), "%02x:%02x:%02x:%02x:%02x:%02x", mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
return std::string(mac_str);
}
std::string SystemInfo::GetChipModelName() {
return std::string(CONFIG_IDF_TARGET);
}
std::string SystemInfo::GetJsonString() {
/*
{
"flash_size": 4194304,
"psram_size": 0,
"minimum_free_heap_size": 123456,
"mac_address": "00:00:00:00:00:00",
"chip_model_name": "esp32s3",
"chip_info": {
"model": 1,
"cores": 2,
"revision": 0,
"features": 0
},
"application": {
"name": "my-app",
"version": "1.0.0",
"compile_time": "2021-01-01T00:00:00Z"
"idf_version": "4.2-dev"
"elf_sha256": ""
},
"partition_table": [
"app": {
"label": "app",
"type": 1,
"subtype": 2,
"address": 0x10000,
"size": 0x100000
}
],
"ota": {
"label": "ota_0"
}
}
*/
std::string json = "{";
json += "\"flash_size\":" + std::to_string(GetFlashSize()) + ",";
json += "\"minimum_free_heap_size\":" + std::to_string(GetMinimumFreeHeapSize()) + ",";
json += "\"mac_address\":\"" + GetMacAddress() + "\",";
json += "\"chip_model_name\":\"" + GetChipModelName() + "\",";
json += "\"chip_info\":{";
esp_chip_info_t chip_info;
esp_chip_info(&chip_info);
json += "\"model\":" + std::to_string(chip_info.model) + ",";
json += "\"cores\":" + std::to_string(chip_info.cores) + ",";
json += "\"revision\":" + std::to_string(chip_info.revision) + ",";
json += "\"features\":" + std::to_string(chip_info.features);
json += "},";
json += "\"application\":{";
auto app_desc = esp_app_get_description();
json += "\"name\":\"" + std::string(app_desc->project_name) + "\",";
json += "\"version\":\"" + std::string(app_desc->version) + "\",";
json += "\"compile_time\":\"" + std::string(app_desc->date) + "T" + std::string(app_desc->time) + "Z\",";
json += "\"idf_version\":\"" + std::string(app_desc->idf_ver) + "\",";
char sha256_str[65];
for (int i = 0; i < 32; i++) {
snprintf(sha256_str + i * 2, sizeof(sha256_str) - i * 2, "%02x", app_desc->app_elf_sha256[i]);
}
json += "\"elf_sha256\":\"" + std::string(sha256_str) + "\"";
json += "},";
json += "\"partition_table\": [";
esp_partition_iterator_t it = esp_partition_find(ESP_PARTITION_TYPE_ANY, ESP_PARTITION_SUBTYPE_ANY, NULL);
while (it) {
const esp_partition_t *partition = esp_partition_get(it);
json += "{";
json += "\"label\":\"" + std::string(partition->label) + "\",";
json += "\"type\":" + std::to_string(partition->type) + ",";
json += "\"subtype\":" + std::to_string(partition->subtype) + ",";
json += "\"address\":" + std::to_string(partition->address) + ",";
json += "\"size\":" + std::to_string(partition->size);
json += "},";
it = esp_partition_next(it);
}
json.pop_back(); // Remove the last comma
json += "],";
json += "\"ota\":{";
auto ota_partition = esp_ota_get_running_partition();
json += "\"label\":\"" + std::string(ota_partition->label) + "\"";
json += "}";
// Close the JSON object
json += "}";
return json;
}
esp_err_t SystemInfo::PrintRealTimeStats(TickType_t xTicksToWait) {
#define ARRAY_SIZE_OFFSET 5
TaskStatus_t *start_array = NULL, *end_array = NULL;
UBaseType_t start_array_size, end_array_size;
configRUN_TIME_COUNTER_TYPE start_run_time, end_run_time;
esp_err_t ret;
uint32_t total_elapsed_time;
//Allocate array to store current task states
start_array_size = uxTaskGetNumberOfTasks() + ARRAY_SIZE_OFFSET;
start_array = (TaskStatus_t*)malloc(sizeof(TaskStatus_t) * start_array_size);
if (start_array == NULL) {
ret = ESP_ERR_NO_MEM;
goto exit;
}
//Get current task states
start_array_size = uxTaskGetSystemState(start_array, start_array_size, &start_run_time);
if (start_array_size == 0) {
ret = ESP_ERR_INVALID_SIZE;
goto exit;
}
vTaskDelay(xTicksToWait);
//Allocate array to store tasks states post delay
end_array_size = uxTaskGetNumberOfTasks() + ARRAY_SIZE_OFFSET;
end_array = (TaskStatus_t*)malloc(sizeof(TaskStatus_t) * end_array_size);
if (end_array == NULL) {
ret = ESP_ERR_NO_MEM;
goto exit;
}
//Get post delay task states
end_array_size = uxTaskGetSystemState(end_array, end_array_size, &end_run_time);
if (end_array_size == 0) {
ret = ESP_ERR_INVALID_SIZE;
goto exit;
}
//Calculate total_elapsed_time in units of run time stats clock period.
total_elapsed_time = (end_run_time - start_run_time);
if (total_elapsed_time == 0) {
ret = ESP_ERR_INVALID_STATE;
goto exit;
}
printf("| Task | Run Time | Percentage\n");
//Match each task in start_array to those in the end_array
for (int i = 0; i < start_array_size; i++) {
int k = -1;
for (int j = 0; j < end_array_size; j++) {
if (start_array[i].xHandle == end_array[j].xHandle) {
k = j;
//Mark that the tasks have been matched by overwriting their handles
start_array[i].xHandle = NULL;
end_array[j].xHandle = NULL;
break;
}
}
//Check if matching task found
if (k >= 0) {
uint32_t task_elapsed_time = end_array[k].ulRunTimeCounter - start_array[i].ulRunTimeCounter;
uint32_t percentage_time = (task_elapsed_time * 100UL) / (total_elapsed_time * CONFIG_FREERTOS_NUMBER_OF_CORES);
printf("| %-16s | %8lu | %4lu%%\n", start_array[i].pcTaskName, task_elapsed_time, percentage_time);
}
}
//Print unmatched tasks
for (int i = 0; i < start_array_size; i++) {
if (start_array[i].xHandle != NULL) {
printf("| %s | Deleted\n", start_array[i].pcTaskName);
}
}
for (int i = 0; i < end_array_size; i++) {
if (end_array[i].xHandle != NULL) {
printf("| %s | Created\n", end_array[i].pcTaskName);
}
}
ret = ESP_OK;
exit: //Common return path
free(start_array);
free(end_array);
return ret;
}
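A short sketch of how SystemInfo can back a periodic diagnostics loop, similar to what main.cc below already logs (creating the task that runs it is left to the application):

```cpp
// Sketch only: periodic heap/CPU diagnostics built on SystemInfo.
#include <esp_log.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>
#include "SystemInfo.h"

void DiagnosticsLoop() {
    while (true) {
        vTaskDelay(pdMS_TO_TICKS(10000));
        ESP_LOGI("diag", "free heap: %zu, minimum free heap: %zu",
                 SystemInfo::GetFreeHeapSize(), SystemInfo::GetMinimumFreeHeapSize());
        // Uncomment to dump per-task CPU usage over a one-second window:
        // SystemInfo::PrintRealTimeStats(pdMS_TO_TICKS(1000));
    }
}
```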

20
main/SystemInfo.h Normal file
View File

@@ -0,0 +1,20 @@
#ifndef _SYSTEM_INFO_H_
#define _SYSTEM_INFO_H_
#include <string>
#include <esp_err.h>
#include <freertos/FreeRTOS.h>
class SystemInfo {
public:
static size_t GetFlashSize();
static size_t GetMinimumFreeHeapSize();
static size_t GetFreeHeapSize();
static std::string GetMacAddress();
static std::string GetChipModelName();
static std::string GetJsonString();
static esp_err_t PrintRealTimeStats(TickType_t xTicksToWait);
};
#endif // _SYSTEM_INFO_H_

View File

@@ -1,10 +1,10 @@
#include "SystemReset.h"
#include "esp_log.h"
#include "nvs_flash.h"
#include "driver/gpio.h"
#include "esp_partition.h"
#include "esp_system.h"
#include "freertos/FreeRTOS.h"
#include <esp_log.h>
#include <nvs_flash.h>
#include <driver/gpio.h>
#include <esp_partition.h>
#include <esp_system.h>
#include <freertos/FreeRTOS.h>
#define TAG "SystemReset"

203
main/WakeWordDetect.cc Normal file
View File

@@ -0,0 +1,203 @@
#include <esp_log.h>
#include <model_path.h>
#include "WakeWordDetect.h"
#include "Application.h"
#define DETECTION_RUNNING_EVENT 1
#define WAKE_WORD_ENCODED_EVENT 2
static const char* TAG = "WakeWordDetect";
WakeWordDetect::WakeWordDetect()
: afe_detection_data_(nullptr),
wake_word_pcm_(),
wake_word_opus_() {
event_group_ = xEventGroupCreate();
srmodel_list_t *models = esp_srmodel_init("model");
for (int i = 0; i < models->num; i++) {
ESP_LOGI(TAG, "Model %d: %s", i, models->model_name[i]);
if (strstr(models->model_name[i], ESP_WN_PREFIX) != NULL) {
wakenet_model_ = models->model_name[i];
}
}
afe_config_t afe_config = {
.aec_init = false,
.se_init = true,
.vad_init = true,
.wakenet_init = true,
.voice_communication_init = false,
.voice_communication_agc_init = false,
.voice_communication_agc_gain = 10,
.vad_mode = VAD_MODE_3,
.wakenet_model_name = wakenet_model_,
.wakenet_model_name_2 = NULL,
.wakenet_mode = DET_MODE_90,
.afe_mode = SR_MODE_HIGH_PERF,
.afe_perferred_core = 0,
.afe_perferred_priority = 5,
.afe_ringbuf_size = 50,
.memory_alloc_mode = AFE_MEMORY_ALLOC_MORE_PSRAM,
.afe_linear_gain = 1.0,
.agc_mode = AFE_MN_PEAK_AGC_MODE_2,
.pcm_config = {
.total_ch_num = 1,
.mic_num = 1,
.ref_num = 0,
.sample_rate = CONFIG_AUDIO_INPUT_SAMPLE_RATE
},
.debug_init = false,
.debug_hook = {{ AFE_DEBUG_HOOK_MASE_TASK_IN, NULL }, { AFE_DEBUG_HOOK_FETCH_TASK_IN, NULL }},
.afe_ns_mode = NS_MODE_SSP,
.afe_ns_model_name = NULL,
.fixed_first_channel = true,
};
afe_detection_data_ = esp_afe_sr_v1.create_from_config(&afe_config);
xTaskCreate([](void* arg) {
auto this_ = (WakeWordDetect*)arg;
this_->AudioDetectionTask();
vTaskDelete(NULL);
}, "audio_detection", 4096 * 2, this, 5, NULL);
}
WakeWordDetect::~WakeWordDetect() {
if (afe_detection_data_ != nullptr) {
esp_afe_sr_v1.destroy(afe_detection_data_);
}
if (wake_word_encode_task_stack_ != nullptr) {
free(wake_word_encode_task_stack_);
}
vEventGroupDelete(event_group_);
}
void WakeWordDetect::OnWakeWordDetected(std::function<void()> callback) {
wake_word_detected_callback_ = callback;
}
void WakeWordDetect::OnVadStateChange(std::function<void(bool speaking)> callback) {
vad_state_change_callback_ = callback;
}
void WakeWordDetect::StartDetection() {
xEventGroupSetBits(event_group_, DETECTION_RUNNING_EVENT);
}
void WakeWordDetect::StopDetection() {
xEventGroupClearBits(event_group_, DETECTION_RUNNING_EVENT);
}
bool WakeWordDetect::IsDetectionRunning() {
return xEventGroupGetBits(event_group_) & DETECTION_RUNNING_EVENT;
}
void WakeWordDetect::Feed(const int16_t* data, int size) {
input_buffer_.insert(input_buffer_.end(), data, data + size);
auto chunk_size = esp_afe_sr_v1.get_feed_chunksize(afe_detection_data_);
while (input_buffer_.size() >= chunk_size) {
esp_afe_sr_v1.feed(afe_detection_data_, input_buffer_.data());
input_buffer_.erase(input_buffer_.begin(), input_buffer_.begin() + chunk_size);
}
}
void WakeWordDetect::AudioDetectionTask() {
auto chunk_size = esp_afe_sr_v1.get_fetch_chunksize(afe_detection_data_);
ESP_LOGI(TAG, "Audio detection task started, chunk size: %d", chunk_size);
while (true) {
xEventGroupWaitBits(event_group_, DETECTION_RUNNING_EVENT, pdFALSE, pdTRUE, portMAX_DELAY);
auto res = esp_afe_sr_v1.fetch(afe_detection_data_);
if (res == nullptr || res->ret_value == ESP_FAIL) {
if (res != nullptr) {
ESP_LOGI(TAG, "Error code: %d", res->ret_value);
}
continue;
}
// Store the wake word data for voice recognition, like who is speaking
StoreWakeWordData((uint16_t*)res->data, res->data_size / sizeof(uint16_t));
// VAD state change
if (vad_state_change_callback_) {
if (res->vad_state == AFE_VAD_SPEECH && !is_speaking_) {
is_speaking_ = true;
vad_state_change_callback_(true);
} else if (res->vad_state == AFE_VAD_SILENCE && is_speaking_) {
is_speaking_ = false;
vad_state_change_callback_(false);
}
}
if (res->wakeup_state == WAKENET_DETECTED) {
ESP_LOGI(TAG, "Wake word detected");
StopDetection();
if (wake_word_detected_callback_) {
wake_word_detected_callback_();
}
}
}
}
void WakeWordDetect::StoreWakeWordData(uint16_t* data, size_t samples) {
// store audio data to wake_word_pcm_
std::vector<int16_t> pcm(data, data + samples);
wake_word_pcm_.emplace_back(std::move(pcm));
// keep about 2 seconds of data, detect duration is 32ms (sample_rate == 16000, chunksize == 512)
while (wake_word_pcm_.size() > 2000 / 32) {
wake_word_pcm_.pop_front();
}
}
void WakeWordDetect::EncodeWakeWordData() {
if (wake_word_encode_task_stack_ == nullptr) {
wake_word_encode_task_stack_ = (StackType_t*)malloc(4096 * 8);
}
wake_word_encode_task_ = xTaskCreateStatic([](void* arg) {
auto this_ = (WakeWordDetect*)arg;
auto start_time = esp_timer_get_time();
// encode detect packets
OpusEncoder* encoder = new OpusEncoder();
encoder->Configure(CONFIG_AUDIO_INPUT_SAMPLE_RATE, 1, 60);
encoder->SetComplexity(0);
this_->wake_word_opus_.resize(4096 * 4);
size_t offset = 0;
for (auto& pcm: this_->wake_word_pcm_) {
encoder->Encode(pcm, [this_, &offset](const uint8_t* opus, size_t opus_size) {
size_t protocol_size = sizeof(BinaryProtocol) + opus_size;
if (offset + protocol_size < this_->wake_word_opus_.size()) {
auto protocol = (BinaryProtocol*)(&this_->wake_word_opus_[offset]);
protocol->version = htons(PROTOCOL_VERSION);
protocol->type = htons(0);
protocol->reserved = 0;
protocol->timestamp = 0;
protocol->payload_size = htonl(opus_size);
memcpy(protocol->payload, opus, opus_size);
offset += protocol_size;
}
});
}
this_->wake_word_pcm_.clear();
this_->wake_word_opus_.resize(offset);
auto end_time = esp_timer_get_time();
ESP_LOGI(TAG, "Encode wake word opus: %zu bytes in %lld ms", this_->wake_word_opus_.size(), (end_time - start_time) / 1000);
xEventGroupSetBits(this_->event_group_, WAKE_WORD_ENCODED_EVENT);
delete encoder;
vTaskDelete(NULL);
}, "encode_detect_packets", 4096 * 8, this, 1, wake_word_encode_task_stack_, &wake_word_encode_task_buffer_);
}
const std::string&& WakeWordDetect::GetWakeWordStream() {
xEventGroupWaitBits(event_group_, WAKE_WORD_ENCODED_EVENT, pdTRUE, pdTRUE, portMAX_DELAY);
return std::move(wake_word_opus_);
}
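The class is callback-driven: PCM is pushed in with `Feed()`, and results come back through the two callbacks. A wiring sketch, where `ReadMicData()` and `SendToServer()` are hypothetical placeholders for the application's audio capture and network code:

```cpp
// Wiring sketch only; capture and upload helpers are placeholders.
#include <esp_log.h>
#include <string>
#include <vector>
#include "WakeWordDetect.h"

void ReadMicData(std::vector<int16_t>& frame);   // placeholder: fill with 16 kHz PCM
void SendToServer(const std::string& data);      // placeholder: upload to the server

static WakeWordDetect detect;

void SetupWakeWord() {
    detect.OnVadStateChange([](bool speaking) {
        ESP_LOGI("app", "VAD: %s", speaking ? "speech" : "silence");
    });
    detect.OnWakeWordDetected([]() {
        detect.EncodeWakeWordData();               // encode the buffered wake word audio
        auto stream = detect.GetWakeWordStream();  // blocks until encoding finishes
        SendToServer(stream);                      // e.g. for speaker verification
        detect.StartDetection();                   // resume detection afterwards
    });
    detect.StartDetection();
}

void AudioLoop() {
    std::vector<int16_t> frame(512);               // 32 ms at 16 kHz
    while (true) {
        ReadMicData(frame);
        detect.Feed(frame.data(), frame.size());
    }
}
```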

50
main/WakeWordDetect.h Normal file
View File

@@ -0,0 +1,50 @@
#ifndef WAKE_WORD_DETECT_H
#define WAKE_WORD_DETECT_H
#include <esp_afe_sr_models.h>
#include <esp_nsn_models.h>
#include <freertos/FreeRTOS.h>
#include <freertos/task.h>
#include <freertos/event_groups.h>
#include <list>
#include <string>
#include <vector>
#include <functional>
class WakeWordDetect {
public:
WakeWordDetect();
~WakeWordDetect();
void Feed(const int16_t* data, int size);
void OnWakeWordDetected(std::function<void()> callback);
void OnVadStateChange(std::function<void(bool speaking)> callback);
void StartDetection();
void StopDetection();
bool IsDetectionRunning();
void EncodeWakeWordData();
const std::string&& GetWakeWordStream();
private:
esp_afe_sr_data_t* afe_detection_data_ = nullptr;
char* wakenet_model_ = NULL;
std::vector<int16_t> input_buffer_;
EventGroupHandle_t event_group_;
std::function<void()> wake_word_detected_callback_;
std::function<void(bool speaking)> vad_state_change_callback_;
bool is_speaking_ = false;
TaskHandle_t wake_word_encode_task_ = nullptr;
StaticTask_t wake_word_encode_task_buffer_;
StackType_t* wake_word_encode_task_stack_ = nullptr;
std::list<std::vector<int16_t>> wake_word_pcm_;
std::string wake_word_opus_;
void StoreWakeWordData(uint16_t* data, size_t size);
void AudioDetectionTask();
};
#endif

View File

@@ -1,24 +1,14 @@
## IDF Component Manager Manifest File
dependencies:
78/esp-builtin-led: "^1.0.0"
78/esp-wifi-connect: "^1.0.0"
78/esp-ota: "^1.0.0"
78/esp-websocket: "^1.0.0"
78/esp-opus-encoder: "^1.0.0"
78/esp-builtin-led: "^1.0.2"
78/esp-wifi-connect: "^1.1.0"
78/esp-opus-encoder: "^1.0.2"
78/esp-ml307: "^1.1.1"
espressif/esp-sr: "^1.9.0"
espressif/button: "^3.3.1"
lvgl/lvgl: "^8.4.0"
esp_lvgl_port: "^1.4.0"
## Required IDF version
idf:
version: ">=5.3"
# # Put list of dependencies here
# # For components maintained by Espressif:
# component: "~1.0.0"
# # For 3rd party components:
# username/component: ">=1.0.0,<2.0.0"
# username2/component2:
# version: "~1.0.0"
# # For transient dependencies `public` flag can be set.
# # `public` flag doesn't have an effect dependencies of the `main` component.
# # All dependencies of `main` are public by default.
# public: true
description: "An AI voice assistant for ESP32"
url: "https://github.com/78/xiaozhi-esp32"

View File

@@ -1,19 +1,15 @@
#include <cstdio>
#include <esp_log.h>
#include <esp_err.h>
#include <nvs.h>
#include <nvs_flash.h>
#include <driver/gpio.h>
#include <esp_event.h>
#include "esp_log.h"
#include "esp_err.h"
#include "nvs.h"
#include "nvs_flash.h"
#include "driver/gpio.h"
#include "WifiConfigurationAp.h"
#include "Application.h"
#include "SystemInfo.h"
#include "SystemReset.h"
#include "BuiltinLed.h"
#define TAG "main"
#define STATS_TICKS pdMS_TO_TICKS(1000)
extern "C" void app_main(void)
{
@@ -32,29 +28,15 @@ extern "C" void app_main(void)
}
ESP_ERROR_CHECK(ret);
// Get the WiFi configuration
nvs_handle_t nvs_handle;
ret = nvs_open("wifi", NVS_READONLY, &nvs_handle);
// If the WiFi configuration is not found, launch the WiFi configuration AP
if (ret != ESP_OK) {
auto& builtin_led = BuiltinLed::GetInstance();
builtin_led.SetBlue();
builtin_led.Blink(1000, 500);
WifiConfigurationAp::GetInstance().Start("Xiaozhi");
return;
}
nvs_close(nvs_handle);
// Otherwise, launch the application
Application::GetInstance().Start();
// Dump CPU usage every 10 second
while (true) {
vTaskDelay(10000 / portTICK_PERIOD_MS);
// SystemInfo::PrintRealTimeStats(STATS_TICKS);
int free_sram = heap_caps_get_minimum_free_size(MALLOC_CAP_INTERNAL);
ESP_LOGI(TAG, "Free heap size: %u minimal internal: %u", SystemInfo::GetFreeHeapSize(), free_sram);
// SystemInfo::PrintRealTimeStats(pdMS_TO_TICKS(1000));
int free_sram = heap_caps_get_free_size(MALLOC_CAP_INTERNAL);
int min_free_sram = heap_caps_get_minimum_free_size(MALLOC_CAP_INTERNAL);
ESP_LOGI(TAG, "Free internal: %u minimal internal: %u", free_sram, min_free_sram);
}
}

View File

@@ -3,7 +3,7 @@
nvs, data, nvs, 0x9000, 0x4000,
otadata, data, ota, 0xd000, 0x2000,
phy_init, data, phy, 0xf000, 0x1000,
model, data, spiffs, 0x100000, 1M,
factory, app, factory, 0x200000, 2M,
ota_0, app, ota_0, 0x400000, 2M,
ota_1, app, ota_1, 0x600000, 2M,
model, data, spiffs, 0x10000, 0xF0000,
factory, app, factory, 0x200000, 4M,
ota_0, app, ota_0, 0x600000, 4M,
ota_1, app, ota_1, 0xA00000, 4M,

7
partitions_4M.csv Normal file
View File

@@ -0,0 +1,7 @@
# ESP-IDF Partition Table
# Name, Type, SubType, Offset, Size, Flags
nvs, data, nvs, 0x9000, 0x4000,
otadata, data, ota, 0xd000, 0x2000,
phy_init, data, phy, 0xf000, 0x1000,
model, data, spiffs, 0x10000, 0xF0000,
factory, app, factory, 0x100000, 3M,

View File

@@ -3,22 +3,11 @@ CONFIG_BOOTLOADER_LOG_LEVEL_NONE=y
CONFIG_BOOTLOADER_SKIP_VALIDATE_ALWAYS=y
CONFIG_BOOTLOADER_APP_ROLLBACK_ENABLE=y
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_SPIRAM=y
CONFIG_SPIRAM_MODE_OCT=y
CONFIG_SPIRAM_SPEED_80M=y
CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=4096
CONFIG_SPIRAM_TRY_ALLOCATE_WIFI_LWIP=y
CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=32768
CONFIG_SPIRAM_MEMTEST=n
CONFIG_HTTPD_MAX_REQ_HDR_LEN=2048
CONFIG_HTTPD_MAX_URI_LEN=2048
CONFIG_PARTITION_TABLE_CUSTOM=y
CONFIG_PARTITION_TABLE_CUSTOM_FILENAME="partitions.csv"
CONFIG_PARTITION_TABLE_FILENAME="partitions.csv"
CONFIG_PARTITION_TABLE_OFFSET=0x8000
CONFIG_USE_WAKENET=y

View File

@@ -0,0 +1,6 @@
CONFIG_ESPTOOLPY_FLASHSIZE_16MB=y
CONFIG_PARTITION_TABLE_CUSTOM=y
CONFIG_PARTITION_TABLE_CUSTOM_FILENAME="partitions_4M.csv"
CONFIG_PARTITION_TABLE_OFFSET=0x8000

View File

@@ -2,6 +2,17 @@
CONFIG_ESPTOOLPY_FLASHSIZE_16MB=y
CONFIG_ESPTOOLPY_FLASHMODE_QIO=y
CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ_240=y
CONFIG_SPIRAM=y
CONFIG_SPIRAM_MODE_OCT=y
CONFIG_SPIRAM_SPEED_80M=y
CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL=4096
CONFIG_SPIRAM_TRY_ALLOCATE_WIFI_LWIP=y
CONFIG_SPIRAM_MALLOC_RESERVE_INTERNAL=32768
CONFIG_SPIRAM_MEMTEST=n
CONFIG_MBEDTLS_EXTERNAL_MEM_ALLOC=y
CONFIG_ESP32S3_INSTRUCTION_CACHE_32KB=y
CONFIG_ESP32S3_DATA_CACHE_64KB=y
CONFIG_ESP32S3_DATA_CACHE_LINE_64B=y

160
websocket.md Normal file
View File

@@ -0,0 +1,160 @@
# AI Voice Interaction Protocol
## 1. Connection Setup and Authentication
When the client connects to the server over WebSocket, it must include the following HTTP headers:
- `Authorization`: Bearer token, in the form "Bearer <access_token>"
- `Device-Id`: the device's MAC address
- `Protocol-Version`: protocol version number, currently 2
WebSocket URL: `wss://api.tenclass.net/xiaozhi/v1`
## 2. Binary Data
Binary data sent by the client uses a fixed-header protocol, as follows:
```cpp
struct BinaryProtocol {
uint16_t version; // binary protocol version, currently 2
uint16_t type; // message type: 0 = audio stream data, 1 = JSON
uint32_t reserved; // reserved field
uint32_t timestamp; // timestamp, reserved for echo cancellation; can also be used to order packets over unreliable UDP transport
uint32_t payload_size; // payload size
uint8_t payload[]; // audio data (Opus, or whatever audio format was negotiated), or wrapped JSON
} __attribute__((packed));
```
Note: all multi-byte integer fields use network byte order (big-endian).
Currently both binary data and JSON travel over the same WebSocket connection. In a future real-time conversation mode, binary audio data may be carried over UDP instead; the hello message can be extended to negotiate this.
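As a concrete illustration of the header layout and byte-order rule above, a client might wrap one Opus packet like this (sketch only; how the resulting frame is sent depends on the WebSocket client in use):

```cpp
// Sketch: serialize one Opus packet into a BinaryProtocol frame.
#include <arpa/inet.h>   // htons/htonl
#include <cstdint>
#include <cstring>
#include <string>

std::string PackAudioFrame(const uint8_t* opus, size_t opus_size) {
    std::string frame(sizeof(BinaryProtocol) + opus_size, 0);
    auto* p = reinterpret_cast<BinaryProtocol*>(&frame[0]);
    p->version = htons(2);            // binary protocol version
    p->type = htons(0);               // 0: audio stream data
    p->reserved = 0;
    p->timestamp = 0;                 // reserved (AEC / UDP ordering)
    p->payload_size = htonl(opus_size);
    memcpy(p->payload, opus, opus_size);
    return frame;                     // send as a binary WebSocket message
}
```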
## 3. Audio Data Transmission
- Client to server: Opus-encoded audio data sent with the binary protocol
- Server to client: Opus-encoded audio data sent with the binary protocol, in the same format the client uses
Audio packets with a payload_size of 0 may appear as sentence-boundary markers; they can be ignored, but must not be treated as errors.
## 4. Handshake Message
After the connection is established, the client sends a JSON "hello" message to initialize the server-side audio decoder.
There is no need to wait for a server response; audio data can be sent immediately afterwards.
```json
{
"type": "hello",
"response_mode": "auto",
"audio_params": {
"format": "opus",
"sample_rate": 16000,
"channels": 1
}
}
```
The response mode `response_mode` can be either `auto` or `manual`:
`auto`: automatic response mode; the server runs VAD on the audio in real time and decides on its own when to start responding.
`manual`: manual response mode; the server may respond when the client's state changes from `listening` to `idle`.
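A sketch of composing the hello message with cJSON, which the firmware already uses elsewhere; sending the resulting string as a text frame is left to the WebSocket client:

```cpp
// Sketch only: build the hello message from section 4 with cJSON.
#include <cJSON.h>
#include <string>

std::string BuildHelloMessage(const char* response_mode /* "auto" or "manual" */) {
    cJSON* root = cJSON_CreateObject();
    cJSON_AddStringToObject(root, "type", "hello");
    cJSON_AddStringToObject(root, "response_mode", response_mode);
    cJSON* audio = cJSON_AddObjectToObject(root, "audio_params");
    cJSON_AddStringToObject(audio, "format", "opus");
    cJSON_AddNumberToObject(audio, "sample_rate", 16000);
    cJSON_AddNumberToObject(audio, "channels", 1);
    char* printed = cJSON_PrintUnformatted(root);
    std::string json(printed);
    cJSON_free(printed);
    cJSON_Delete(root);
    return json;   // send as a text WebSocket message right after connecting
}
```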
## 5. State Updates
The client sends a JSON message whenever its state changes:
```json
{
"type": "state",
"state": "<新状态>"
}
```
Possible state values sent by the client include: `idle`, `wake_word_detected`, `listening`, `speaking`
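A state update is a two-field JSON message; a minimal cJSON sketch:

```cpp
// Sketch: format a state update ("idle", "wake_word_detected", "listening", "speaking").
#include <cJSON.h>
#include <string>

std::string BuildStateMessage(const char* state) {
    cJSON* root = cJSON_CreateObject();
    cJSON_AddStringToObject(root, "type", "state");
    cJSON_AddStringToObject(root, "state", state);
    char* printed = cJSON_PrintUnformatted(root);
    std::string json(printed);
    cJSON_free(printed);
    cJSON_Delete(root);
    return json;
}
```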
Examples:
1. Push-to-talk (`response_mode` is `manual`):
- When the talk button is pressed and the server is not yet connected, the client connects to the server while encoding and buffering the current audio. Once connected, the client sets its state to `listening` and sends the buffered audio right after the hello message.
- When the talk button is pressed and the server is already connected, the client sets its state to `listening` and streams audio.
- When the talk button is released, the state changes to `idle` and the server starts recognition.
- When the server starts responding, it pushes `stt` and `tts` messages.
- When the client starts playing audio, it sets its state to `speaking`.
- When the client finishes playing audio, it sets its state to `idle`.
- Pressing the talk button while in the `speaking` state immediately stops the current playback and changes the state to `listening`.
2. Voice wake-up, turn-by-turn conversation (`response_mode` is `auto`):
- The client connects to the server, sends the hello message, sends the wake word audio, then sends the state `wake_word_detected`; the server starts responding.
- When the client starts playing audio, it sets its state to `speaking` and does not send audio data.
- When the client finishes playing audio, it sets its state to `listening` and sends audio data.
- When the server's VAD decides it is time to respond, it pushes `stt` and `tts` messages.
- When the client receives `tts`.`start`, it starts playback and sets its state to `speaking`.
- When the client receives `tts`.`stop`, it stops playback and sets its state to `listening`.
3. Voice wake-up, real-time conversation (`response_mode` is `real_time`):
- The client connects to the server, sends the hello message, sends the wake word audio, then sends the state `wake_word_detected`; the server starts responding.
- When the client starts playing audio, it sets its state to `speaking`.
- When the client finishes playing audio, it sets its state to `listening`.
- The client sends audio data in both the `speaking` and `listening` states.
- When the server's VAD decides it is time to respond, it pushes `stt` and `tts` messages.
- When the client receives `stt`, it sets its state to `listening`; if audio is currently playing, playback stops after the current sentence finishes.
- When the client receives `tts`.`start`, it starts playback and sets its state to `speaking`.
- When the client receives `tts`.`stop`, it stops playback and sets its state to `listening`.
## 6. Server-to-Client Messages
### 6.1 Speech Recognition Result (STT)
```json
{
"type": "stt",
"text": "<识别出的文本>"
}
```
### 6.2 Text-to-Speech (TTS)
TTS start:
```json
{
"type": "tts",
"state": "start",
"sample_rate": 24000
}
```
Sentence start:
```json
{
"type": "tts",
"state": "sentence_start",
"text": "你在干什么呀?"
}
```
Sentence end:
```json
{
"type": "tts",
"state": "sentence_end"
}
```
TTS stop:
```json
{
"type": "tts",
"state": "stop"
}
```
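On the client these messages can be dispatched by their `type` and `state` fields. A cJSON-based sketch; the `On*` handlers stand in for the application's own logic:

```cpp
// Sketch only: route server JSON messages; the On* handlers are placeholders.
#include <cJSON.h>
#include <cstring>
#include <string>

void OnSttText(const char* text);   // placeholder
void OnTtsStart();                  // placeholder
void OnTtsStop();                   // placeholder

void HandleServerMessage(const std::string& payload) {
    cJSON* root = cJSON_Parse(payload.c_str());
    if (root == nullptr) {
        return;                     // ignore malformed messages
    }
    cJSON* type = cJSON_GetObjectItem(root, "type");
    if (cJSON_IsString(type)) {
        if (strcmp(type->valuestring, "stt") == 0) {
            cJSON* text = cJSON_GetObjectItem(root, "text");
            if (cJSON_IsString(text)) {
                OnSttText(text->valuestring);
            }
        } else if (strcmp(type->valuestring, "tts") == 0) {
            cJSON* state = cJSON_GetObjectItem(root, "state");
            if (cJSON_IsString(state)) {
                if (strcmp(state->valuestring, "start") == 0) {
                    OnTtsStart();   // client switches to "speaking"
                } else if (strcmp(state->valuestring, "stop") == 0) {
                    OnTtsStop();    // client returns to "listening" or "idle"
                }
            }
        }
    }
    cJSON_Delete(root);
}
```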
## 7. Connection Management
- When the client detects that the WebSocket connection has been lost, it should stop audio playback and reset itself to the idle state
- After a disconnect, the client reconnects on demand (for example, on a button press or voice wake-up), as shown in the sketch below
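A sketch of the disconnect handling just described; every function here is a placeholder for application logic:

```cpp
// Sketch only: all functions below are placeholders.
bool IsWebSocketConnected();
void ConnectWebSocket();
void StopAudioPlayback();
void SetDeviceState(const char* state);

void OnWebSocketDisconnected() {
    StopAudioPlayback();            // drop any TTS audio still queued
    SetDeviceState("idle");         // reset the local state machine
}

void OnWakeWordOrButtonPress() {
    if (!IsWebSocketConnected()) {
        ConnectWebSocket();         // reconnect only when actually needed
    }
    SetDeviceState("listening");
}
```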
This document summarizes the main aspects of the WebSocket communication protocol.