Keeping a Mid-Sized Model Resident on the Orin AGX 64G

The goal: a private deployment of a mid-sized model, kept resident to back an agent for everyday conversation, and it can't draw too much power, since the electricity bill it burns is my own…

0 Network Preparation

This device was flashed, with Gemini's help, from an Ubuntu guest on an x86 PVE host. The process was too perilous to describe, the rights and wrongs too tangled to relate. In any case, to guard against any accidents, only a minimal install was performed: nothing but the bare system.

Step one is getting the command-line proxy configured:

```shell
# Set the apt proxy
sudo vim /etc/apt/apt.conf.d/proxy.conf
Acquire::http::Proxy "http://127.0.0.1:7897/";
Acquire::https::Proxy "http://127.0.0.1:7897/";

vim ~/.zshrc
# Turn the system proxy on
function proxy_on() {
    export http_proxy="http://127.0.0.1:7897"
    export https_proxy=$http_proxy
    export ftp_proxy=$http_proxy
    export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com"
    export HTTP_PROXY=$http_proxy
    export HTTPS_PROXY=$http_proxy
    export FTP_PROXY=$http_proxy
    echo -e "Proxy environment variable set."
}

# Turn the system proxy off
function proxy_off() {
    unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY
    echo -e "Proxy environment variable removed."
}

# Set the git proxy
git config --global http.proxy http://127.0.0.1:7897
git config --global https.proxy http://127.0.0.1:7897

# For the stubborn holdouts
git clone https://github.com/rofl0r/proxychains-ng.git && cd proxychains-ng
./configure --prefix=/usr --sysconfdir=/etc
make
sudo make install
sudo make install-config

sudo vim /etc/proxychains.conf
socks5 127.0.0.1 7897
```
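A quick sanity check that the zsh helper actually exports what we expect; the end-to-end curl needs a live proxy client, so it is left commented:

```shell
# Re-declare proxy_on exactly as in ~/.zshrc and confirm the variable lands.
proxy_on() {
    export http_proxy="http://127.0.0.1:7897"
    export https_proxy=$http_proxy
}
proxy_on
echo "$http_proxy"   # prints: http://127.0.0.1:7897

# End-to-end check once the proxy client is running:
#   curl -sI https://github.com | head -n 1
```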

1 NVIDIA Tools

Use jtop to monitor the device:

```shell
# The minimal install doesn't even ship pip
sudo apt update
sudo apt install python3-pip
# jetson-stats reads low-level hardware sensors, so it can't live in a uv venv;
# install it system-wide, then reboot
sudo -H pip3 install -U jetson-stats
```

At this point jtop shows:

  1. Jetpack NOT DETECTED, meaning the system really is bare, which suits containerized deployment
  2. NV Power[3]: MODE_50W, meaning the power budget is not yet at maximum
  3. Jetson Clocks: inactive, meaning the fan and core clocks are not locked at their highest state

Unlock full performance:

```shell
sudo nvpmodel -m 0
sudo jetson_clocks
```

jtop then shows:

  1. NV Power[0]: MAXN
  2. Jetson Clocks: running

2 Docker

This bare-bones system skipped Docker too:

```shell
# remember to proxy_on first
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```

Let the current user run Docker without sudo, then log out of SSH and back in:

```shell
sudo usermod -aG docker $USER
```

Install the NVIDIA Container Toolkit:

```shell
sudo apt update
sudo apt install -y nvidia-container-toolkit
```

Configure Docker to use the NVIDIA runtime by default and restart the service:

```shell
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify:

```shell
docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
```

Configure a proxy for Docker:

Because NVIDIA's customized system strips out the TUN module, only an environment-variable system proxy is available, so it's best to configure Docker's proxy by hand:

```shell
# Write the drop-in config
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7897"
Environment="HTTPS_PROXY=http://127.0.0.1:7897"
Environment="NO_PROXY=localhost,127.0.0.1,::1"
EOF
# Restart the service
sudo systemctl daemon-reload
sudo systemctl restart docker
```
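To confirm the drop-in took effect, `sudo systemctl show --property=Environment docker` should echo the three variables back. The file format itself can be rehearsed without root by writing the same drop-in to a scratch directory:

```shell
# Write the same drop-in into a temp dir and count the Environment= lines.
tmp=$(mktemp -d)
tee "$tmp/http-proxy.conf" >/dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7897"
Environment="HTTPS_PROXY=http://127.0.0.1:7897"
Environment="NO_PROXY=localhost,127.0.0.1,::1"
EOF
grep -c '^Environment=' "$tmp/http-proxy.conf"   # prints: 3

# On the real host, verify with:
#   sudo systemctl show --property=Environment docker
```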

3 Testing llama.cpp

Grab a small model for testing:

```shell
# Create the model directory
mkdir -p ~/ai-models/gguf
cd ~/ai-models/gguf

# wget a GGUF model straight from the huggingface mirror
wget https://hf-mirror.com/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
```
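A GGUF file begins with the 4-byte ASCII magic `GGUF`, so a one-liner catches truncated downloads or saved HTML error pages before a load attempt wastes minutes. The `check_gguf` helper below is mine, not part of any tool:

```shell
check_gguf() {
    # a valid GGUF file begins with the ASCII magic "GGUF"
    if [ "$(head -c 4 "$1")" = "GGUF" ]; then
        echo "$1: looks like GGUF"
    else
        echo "$1: NOT a GGUF file"
    fi
}

# Simulate a failed download that actually saved an HTML error page:
printf '<html>error</html>' > /tmp/bad.gguf
check_gguf /tmp/bad.gguf    # prints: /tmp/bad.gguf: NOT a GGUF file
```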

Build a dedicated ARM64 CUDA image:

```shell
cd ~
# Clone the latest source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Building now fails: Docker BuildKit's network behavior is peculiar and an
# ordinary proxy can't cover it, the default Tegra kernel has no TUN, and I had
# no usable proxy router at the time
sudo docker build -t llama-cpp-cuda -f .devops/cuda.Dockerfile .
[+] Building 30.1s (3/3) FINISHED docker:default
=> [internal] load build definition from cuda.Dockerfile 0.0s
=> => transferring dockerfile: 2.74kB 0.0s
=> ERROR [internal] load metadata for docker.io/nvidia/cuda:12.8.1-runtime-ubuntu24.04 30.0s
=> CANCELED [internal] load metadata for docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04 30.0s
------
> [internal] load metadata for docker.io/nvidia/cuda:12.8.1-runtime-ubuntu24.04:
------
cuda.Dockerfile:41
--------------------
39 |
40 | ## Base image
41 | >>> FROM ${BASE_CUDA_RUN_CONTAINER} AS base
42 |
43 | RUN apt-get update \
--------------------
ERROR: failed to build: failed to solve: DeadlineExceeded: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Anvidia%2Fcuda%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f10d:183:face:b00c:0:25de]:443: i/o timeout
```

There are a few pitfalls here:

  1. Network: Docker BuildKit does its own network requests and DNS lookups, so the system proxy can't help the downloads. The practical fix is to note the image names in the error output, pull them manually, then build.

  2. CUDA version: the Dockerfile above defaults to CUDA 12.8.1 images, while jtop shows this machine runs L4T 36.5.0, flashed with JetPack 6.2.2. Searching catalog.ngc.nvidia.com for l4t-cuda turns up nvcr.io/nvidia/l4t-cuda:12.6.11-runtime and nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04. Switch the build base to the Tegra-specific images; otherwise, when the resulting image runs an MoE model, cuBLAS errors appear as soon as batch.n_token > 32:

    ```dockerfile
    ARG UBUNTU_VERSION=22.04
    # This needs to generally match the container host's environment.
    ARG CUDA_VERSION=12.6.11
    # Target the CUDA build image
    ARG BASE_CUDA_DEV_CONTAINER=nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04

    ARG BASE_CUDA_RUN_CONTAINER=nvcr.io/nvidia/l4t-cuda:12.6.11-runtime
    ```
  3. Moving to 22.04 raises a new problem: llama.cpp targets 24.04 by default, so .devops/cuda.Dockerfile uses gcc-14/g++-14, while 22.04 ships gcc-11 by default (gcc-12 is available and the better pick). Edit the build file to use gcc-12/g++-12 so the toolchain matches the distro:

    ```dockerfile
    RUN apt-get update && \
        apt-get install -y gcc-12 g++-12 build-essential cmake python3 python3-pip git libssl-dev libgomp1

    ENV CC=gcc-12 CXX=g++-12 CUDAHOSTCXX=g++-12
    ```
  4. The build also wastes time on pointless compilation:

    • llama.cpp's Docker build compiles PTX kernels for every NVIDIA architecture by default, but I only need the AGX Orin, whose embedded architecture is sm_87; pinning it with -DCMAKE_CUDA_ARCHITECTURES=87 saves about 90% of the compile time;
    • -DGGML_CPU_ALL_VARIANTS=ON makes CMake compile advanced CPU kernels including SVE (Scalable Vector Extension); this flag is what demanded gcc-12/g++-12 in the first place. But Orin's Cortex-A78AE cores are ARMv8.2-A and don't support the most aggressive SVE instructions, so it can be switched off;
  5. Finally, add two more compile switches to dodge the architectural traps of unified-memory devices like Tegra. Without them, a model such as Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf runs fine without mmproj, but attaching mmproj and feeding it an image fails with a cuMemAddressReserve error.

    • -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON: schedule for a unified-memory architecture;
    • -DGGML_CUDA_NO_VMM=ON: don't pre-reserve one large contiguous block of memory.
    ```dockerfile
    # original
    23 │ RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
    24 │     export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
    25 │ fi && \
    26 │ cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    27 │ cmake --build build --config Release -j$(nproc)
    # replace with
    28 │ RUN cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=OFF -DLLAMA_BUILD_TESTS=OFF -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    29 │ cmake --build build --config Release -j$(nproc)
    ```

The correct installation procedure (if you have your own proxy router, the manual pulls can be skipped):

```shell
# Pull the images manually:
docker pull nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04
docker pull nvcr.io/nvidia/l4t-cuda:12.6.11-runtime

cd ~/llama.cpp

# Bypass BuildKit; takes about ten minutes
DOCKER_BUILDKIT=0 sudo docker build --network host \
    --build-arg HTTP_PROXY=http://127.0.0.1:7897 \
    --build-arg HTTPS_PROXY=http://127.0.0.1:7897 \
    --build-arg NO_PROXY="localhost,127.0.0.1,ports.ubuntu.com,archive.ubuntu.com,security.ubuntu.com" \
    -t llama-cpp-tegra -f .devops/cuda.Dockerfile .
```

Test the freshly built image:

```shell
sudo docker run -d \
    --name llama-server \
    --runtime nvidia \
    --gpus all \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -v ~/ai-models/gguf:/models \
    -p 8080:8080 \
    llama-cpp-tegra \
    -m /models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```
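Besides the web page, the server answers over plain HTTP. A minimal smoke test, assuming the container above is up; the JSON body is validated locally first so a typo in it doesn't masquerade as a server error:

```shell
# Build the request body and make sure it is valid JSON before sending it.
payload='{"messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":16}'
echo "$payload" | python3 -m json.tool >/dev/null && echo "payload OK"
# prints: payload OK

# With the container running, llama-server exposes a health probe and an
# OpenAI-compatible chat endpoint:
#   curl -s http://localhost:8080/health
#   curl -s http://localhost:8080/v1/chat/completions \
#       -H 'Content-Type: application/json' -d "$payload"
```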

Open localhost:8080 and you'll see llama.cpp's chat page.

4 Deploying a Mid-Sized Model with llama.cpp

A model recently caught my eye: a 27B dense model distilled from Qwen3.5 with Opus instructions, Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF. Since the AGX Orin has 64G of unified memory, I could splurge and try the Q8_0.gguf (about 27 GB).

```shell
sudo docker run -d \
    --name llama-server \
    --runtime nvidia \
    --gpus all \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -v ~/ai-models/gguf:/models \
    -p 8080:8080 \
    llama-cpp-tegra \
    -m /models/Qwen3.5-27B.Q4_K_M.gguf \
    -c 32768 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```

Memory is ample, but generation speed is dire. The device's unified memory bandwidth is only 204 GB/s, so against a roughly 27 GB model the physical ceiling on decode speed is 204 ÷ 27 ≈ 7.6 tokens/s. After bus-scheduling overhead and the KV cache take their cut, real throughput lands at 50–70% of that ceiling, and indeed I observed about 4 tokens/s. For a Chinese reader of even moderately quick pace, that is unbearable, so it has to be swapped for the Q4_K_M.
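The back-of-envelope ceiling above (decode is memory-bandwidth bound: every token must stream the full weights, so tokens/s ≈ bandwidth ÷ model size) is quick to script:

```shell
# Rough decode-speed ceiling for a bandwidth-bound model:
#   tokens/s ≈ memory bandwidth (GB/s) / model size (GB)
bandwidth_gbps=204   # Orin AGX 64G unified memory bandwidth
model_gb=27          # ~27 GB of Q8_0 weights
awk -v bw="$bandwidth_gbps" -v sz="$model_gb" \
    'BEGIN { printf "ceiling: %.1f tokens/s\n", bw / sz }'
# prints: ceiling: 7.6 tokens/s
```

Swapping in a ~16 GB Q4_K_M file for `model_gb` shows why the smaller quant roughly doubles the ceiling.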

Meanwhile Google has just released gemma-4, a natively multimodal model; worth a try:

```shell
# Before loading a big model, flush the page cache
sudo sync && sudo sysctl -w vm.drop_caches=3

sudo docker run -d \
    --name llama-cpp-gemma-4-26B-A4B-it-q8_0 \
    --runtime nvidia \
    --gpus all \
    --ulimit memlock=-1:-1 \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -e GGML_CUDA_NO_PINNED=1 \
    -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    -e GGML_CUDA_FORCE_MMQ=1 \
    -v ~/ai-models/gguf:/models \
    -p 8081:8080 \
    llama-cpp-tegra \
    -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
    --mmproj /models/gemma-4-26B-A4B-it-mmproj-BF16.gguf \
    -c 131072 \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```
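Before (or after) loading, it's worth confirming how much of the 64G unified pool is actually free; since GPU and CPU share the same memory on Tegra, plain `free` already tells the story:

```shell
# The "available" column of free's Mem: row is what a new model can claim.
free -g | awk '/^Mem:/ {print "available GiB:", $7}'
```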

Then test Qwen:

```shell
sudo docker run -d \
    --name llama-cpp-Qwen3.5-35B-A3B-Q8_K_XL \
    --runtime nvidia \
    --gpus all \
    --ulimit memlock=-1:-1 \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -e GGML_CUDA_NO_PINNED=1 \
    -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    -e GGML_CUDA_FORCE_MMQ=1 \
    -v ~/ai-models/gguf:/models \
    -p 8081:8080 \
    llama-cpp-tegra \
    -m /models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf \
    --mmproj /models/Qwen3.5-35B-A3B-UD-Q8_K_XL-mmproj-BF16.gguf \
    -c 65536 \
    -fa on \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```

5 LiteLLM Model Aggregation Gateway

To wring everything out of the hardware, each Orin AGX 64G has its memory fully committed, so a third device gets a model-aggregation gateway: something that, like the OpenAI API, lets the client pick a model and dispatches the request to the matching endpoint. Gemini recommended LiteLLM's Docker deployment. It's just a forwarding proxy that needs to stay up long-term, so it lives on the Rock 5B.

Write the gateway config, vim ~/litellm/config.yaml:

```yaml
model_list:
  - model_name: gemma-4-26B-A4B-it-UD-Q8_K_XL
    litellm_params:
      model: openai/gemma-4-26B-A4B-it-UD-Q8_K_XL
      api_base: http://192.168.1.138:8081/v1
      api_key: sk-1234

  - model_name: Qwen3.5-35B-A3B-Q8_K_XL
    litellm_params:
      model: openai/Qwen3.5-35B-A3B-Q8_K_XL
      api_base: http://192.168.1.139:8081/v1
      api_key: sk-1234
```

Then just load the config:

```shell
# Pull first if the network is flaky
docker pull docker.litellm.ai/berriai/litellm:main-latest

docker run -d \
    -v $(pwd)/config.yaml:/app/config.yaml \
    -p 4001:4000 \
    --name litellm-proxy \
    docker.litellm.ai/berriai/litellm:main-latest \
    --config /app/config.yaml \
    --detailed_debug
```

Test the model list:

```shell
curl http://localhost:4001/v1/models
{"data":[{"id":"gemma-4-26B-A4B-it-UD-Q8_K_XL","object":"model","created":1677610602,"owned_by":"openai"},{"id":"Qwen3.5-35B-A3B-Q8_K_XL","object":"model","created":1677610602,"owned_by":"openai"}],"object":"list"}
```

Test gemma-4's response:

```shell
curl -X POST "http://localhost:4001/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "gemma-4-26B-A4B-it-UD-Q8_K_XL",
        "messages": [
            {"role": "user", "content": "你好,请用一句话介绍一下你自己。"}
        ]
    }'
{"id":"chatcmpl-5RXTGa6ZQ5aW57JGlsDzxOylaWptEVhE","created":1775988908,"model":"gemma-4-26B-A4B-it-UD-Q8_K_XL","object":"chat.completion","system_fingerprint":"b8764-8fe78a7ca","choices":[{"finish_reason":"stop","index":0,"message":{"content":"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。","role":"assistant","reasoning_content":"* Language: Chinese (\"你好,请用一句话介绍一下你自己。\")\n * Content: \"Hello, please introduce yourself in one sentence.\"\n\n * Identity: Gemma 4.\n * Developer: Google DeepMind.\n * Nature: Large Language Model (LLM) with open weights.\n * Capabilities: Text and image processing (audio for 2B/4B), text-only output.\n * Knowledge Cutoff: January 2025.\n\n * *Draft 1 (Literal):* 我是 Gemma 4,由 Google DeepMind 开发的大型语言模型。 (I am Gemma 4, a large language model developed by Google DeepMind.)\n * *Draft 2 (More descriptive):* 我是 Gemma 4,由 Google DeepMind 开发的开源权重的多模态大语言模型。 (I am Gemma 4, an open-weights multimodal LLM developed by Google DeepMind.)\n * *Refining for \"one sentence\" and \"identity\":* The prompt asks for an introduction in one sentence.\n\n * \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\" (I am Gemma 4, an open-weights large language model developed by Google DeepMind.)\n\n * Identify as Gemma 4? Yes.\n * Mention Google DeepMind? Yes.\n * One sentence? Yes.\n\n \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\"","provider_specific_fields":{"refusal":null,"reasoning_content":"* Language: Chinese (\"你好,请用一句话介绍一下你自己。\")\n * Content: \"Hello, please introduce yourself in one sentence.\"\n\n * Identity: Gemma 4.\n * Developer: Google DeepMind.\n * Nature: Large Language Model (LLM) with open weights.\n * Capabilities: Text and image processing (audio for 2B/4B), text-only output.\n * Knowledge Cutoff: January 2025.\n\n * *Draft 1 (Literal):* 我是 Gemma 4,由 Google DeepMind 开发的大型语言模型。 (I am Gemma 4, a large language model developed by Google DeepMind.)\n * *Draft 2 (More descriptive):* 我是 Gemma 4,由 Google DeepMind 开发的开源权重的多模态大语言模型。 (I am Gemma 4, an open-weights multimodal LLM developed by Google DeepMind.)\n * *Refining for \"one sentence\" and \"identity\":* The prompt asks for an introduction in one sentence.\n\n * \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\" (I am Gemma 4, an open-weights large language model developed by Google DeepMind.)\n\n * Identify as Gemma 4? Yes.\n * Mention Google DeepMind? Yes.\n * One sentence? Yes.\n\n \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\""}},"provider_specific_fields":{}}],"usage":{"completion_tokens":353,"prompt_tokens":25,"total_tokens":378,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":25,"prompt_ms":254.532,"prompt_per_token_ms":10.181280000000001,"prompt_per_second":98.21947731522953,"predicted_n":353,"predicted_ms":16617.949,"predicted_per_token_ms":47.076342776203965,"predicted_per_second":21.242091909176036}}
```

Test qwen3.5's response:

```shell
curl -X POST "http://localhost:4001/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "Qwen3.5-35B-A3B-Q8_K_XL",
        "messages": [
            {"role": "user", "content": "Hello, please write a very short python script to print hello world."}
        ]
    }'
{"id":"chatcmpl-gCLU58UkI3h51hgucpWwucmTJ1yX7qox","created":1775988956,"model":"Qwen3.5-35B-A3B-Q8_K_XL","object":"chat.completion","system_fingerprint":"b8762-073bb2c20","choices":[{"finish_reason":"stop","index":0,"message":{"content":"```python\nprint(\"hello world\")\n```","role":"assistant","reasoning_content":"Thinking Process:\n\n1. **Analyze the Request:**\n * Task: Write a very short Python script.\n * Functionality: Print \"hello world\".\n * Constraint: \"very short\".\n\n2. **Determine the Solution:**\n * The standard way to print in Python is `print(\"hello world\")`.\n * This is already one line and very short.\n\n3. **Draft the Code:**\n ```python\n print(\"hello world\")\n ```\n\n4. **Review against Constraints:**\n * Is it Python? Yes.\n * Is it very short? Yes (one line).\n * Does it print \"hello world\"? Yes.\n\n5. **Final Output Generation:** (Just the code block).\n","provider_specific_fields":{"refusal":null,"reasoning_content":"Thinking Process:\n\n1. **Analyze the Request:**\n * Task: Write a very short Python script.\n * Functionality: Print \"hello world\".\n * Constraint: \"very short\".\n\n2. **Determine the Solution:**\n * The standard way to print in Python is `print(\"hello world\")`.\n * This is already one line and very short.\n\n3. **Draft the Code:**\n ```python\n print(\"hello world\")\n ```\n\n4. **Review against Constraints:**\n * Is it Python? Yes.\n * Is it very short? Yes (one line).\n * Does it print \"hello world\"? Yes.\n\n5. **Final Output Generation:** (Just the code block).\n"}},"provider_specific_fields":{}}],"usage":{"completion_tokens":188,"prompt_tokens":24,"total_tokens":212,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":24,"prompt_ms":290.856,"prompt_per_token_ms":12.119,"prompt_per_second":82.5150589982672,"predicted_n":188,"predicted_ms":10157.401,"predicted_per_token_ms":54.02872872340426,"predicted_per_second":18.508671657247756}}
```

6 Deploying Hermes

I don't care for the lobster project. Partly it's temperament (I don't chase crowds), and partly that project was already showing signs of management collapse months ago, with grotesquely skewed issue and PR counts. hermes-agent, which I ran across these past few days, looks worth a try.

Pre-Requisites

firecrawl is a web-page scraping and cleaning tool. Built on Playwright, it's decidedly heavyweight, with real CPU and memory pressure; an RK3588 most likely can't take it, and the CIX P1 is powered off at the office, so for now it goes on my PC. The script follows the Hermes Agent Full Setup Tutorial: How to Setup Your First AI Agent (Gemma 4).

```shell
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl

cat > .env << 'EOF'
PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false
BULL_AUTH_KEY=somePASSword
EOF

sed -i 's|# image: ghcr.io/firecrawl/firecrawl|image: ghcr.io/firecrawl/firecrawl|' ./docker-compose.yaml
sed -i 's| build: apps/api| # build: apps/api|' docker-compose.yaml
sed -i 's|# image: ghcr.io/firecrawl/playwright-service:latest|image: ghcr.io/firecrawl/playwright-service:latest|' docker-compose.yaml
sed -i 's| build: apps/playwright-service-ts| # build: apps/playwright-service-ts|' docker-compose.yaml

docker compose up -d
```

Install Hermes

For now it's installed in WSL2 on my PC; when I find time it should move to the o6n, since the service needs to stay up around the clock and I begrudge the electricity.

Things to note during installation:

  • When configuring the model API, enter the LiteLLM URL above, http://192.168.1.120:4001/v1; the models can then be selected as 1,2, like this:

    ```
    API base URL [e.g. https://api.example.com/v1]: http://192.168.1.120:4001/v1
    API key [optional]:
    Verified endpoint via http://192.168.1.120:4001/v1/models (2 model(s) visible)
    Available models:
    1. gemma-4-26B-A4B-it-UD-Q8_K_XL
    2. Qwen3.5-35B-A3B-Q8_K_XL
    Select model [1-2] or type name: 1,2
    Context length in tokens [leave blank for auto-detect]:
    Default model set to: 1,2 (via http://192.168.1.120:4001/v1)
    💾 Saved to custom providers as "192.168.1.120:4001" (edit in config.yaml)
    ```
  • It supports WeChat: tick it under Messaging Platforms, scan the QR code, and a ClawBot shows up in WeChat.
  • When configuring Firecrawl, since I put hermes in the same docker compose as firecrawl, just press Enter to accept the default http://localhost:3002.

Appendix: flashing the Orin AGX 64G from PVE:

Getting the Orin into Recovery mode depends on whether it is powered off or already running.

  1. If powered off: press and hold button ② (Force Recovery), then connect the power cable. The white indicator lights up, but Recovery mode is black-screen, so a monitor attached to the Orin won't show anything.
  2. If already running: press and hold button ②, then press button ③ (Reset); release ③ first, then release ②.

Once in flashing mode, a recovery-state device (0955:7023) should appear on the Ubuntu host:

```shell
agx-flash@agx-flash:~$ lsusb
Bus 002 Device 002: ID 0627:0001 Adomax Technology Co., Ltd QEMU USB Tablet
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 009 Device 003: ID 0955:7023 NVIDIA Corp. APX
Bus 009 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
```

Just in case, set up a proxy:

```shell
export http_proxy="http://192.168.1.125:7897"
export https_proxy="http://192.168.1.125:7897"
export HTTP_PROXY="http://192.168.1.125:7897"
export HTTPS_PROXY="http://192.168.1.125:7897"
```

Install no components at all; flash only Linux, and install whatever's needed once the system is up:

```shell
sdkmanager --cli --action install --login-type devzone --product Jetson --target-os Linux --version 6.2.2 --target JETSON_AGX_ORIN_TARGETS --select 'Jetson Linux' --flash --license accept
```

When flashing finishes it asks, "SDK Manager is about to install SDK components on your Jetson AGX, To install SDK components on your Jetson AGX Orin modules:…". Since we're only flashing the OS, choose 2. Skip.
