Keeping a Mid-Sized Model Resident on the Orin AGX 64G

The goal: a private deployment of a mid-sized model, kept resident to back an agent for everyday conversation, and it can't draw too much power, since the electricity bill it burns is my own…

0 Network Preparation

This device was flashed, with Gemini's help, from an Ubuntu guest on an x86 PVE host. The process was too perilous to describe, the rights and wrongs too tangled to relate. In any case, to guard against any accidents, only a minimal install was performed: nothing but the bare system.

Step one is getting the command-line proxy configured:

```shell
# Set the apt proxy
sudo vim /etc/apt/apt.conf.d/proxy.conf
Acquire::http::Proxy "http://127.0.0.1:7897/";
Acquire::https::Proxy "http://127.0.0.1:7897/";

vim ~/.zshrc
# Turn the system proxy on
function proxy_on() {
    export http_proxy="http://127.0.0.1:7897"
    export https_proxy=$http_proxy
    export ftp_proxy=$http_proxy
    export no_proxy="localhost,127.0.0.1,localaddress,.localdomain.com"
    export HTTP_PROXY=$http_proxy
    export HTTPS_PROXY=$http_proxy
    export FTP_PROXY=$http_proxy
    echo -e "Proxy environment variable set."
}

# Turn the system proxy off
function proxy_off() {
    unset http_proxy https_proxy ftp_proxy no_proxy HTTP_PROXY HTTPS_PROXY FTP_PROXY
    echo -e "Proxy environment variable removed."
}

# Set the git proxy
git config --global http.proxy http://127.0.0.1:7897
git config --global https.proxy http://127.0.0.1:7897

# For the stubborn holdouts
git clone https://github.com/rofl0r/proxychains-ng.git && cd proxychains-ng
./configure --prefix=/usr --sysconfdir=/etc
make
sudo make install
sudo make install-config

sudo vim /etc/proxychains.conf
socks5 127.0.0.1 7897
```
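A quick sanity check that the zsh helper actually exports what we expect; the end-to-end curl needs a live proxy client, so it is left commented:

```shell
# Re-declare proxy_on exactly as in ~/.zshrc and confirm the variable lands.
proxy_on() {
    export http_proxy="http://127.0.0.1:7897"
    export https_proxy=$http_proxy
}
proxy_on
echo "$http_proxy"   # prints: http://127.0.0.1:7897

# End-to-end check once the proxy client is running:
#   curl -sI https://github.com | head -n 1
```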

1 NVIDIA Tools

Use jtop to monitor the device:

```shell
# The minimal install doesn't even ship pip
sudo apt update
sudo apt install python3-pip
# jetson-stats reads low-level hardware sensors, so it can't live in a uv venv;
# install it system-wide, then reboot
sudo -H pip3 install -U jetson-stats
```

At this point jtop shows:

  1. Jetpack NOT DETECTED, meaning the system really is bare, which suits containerized deployment
  2. NV Power[3]: MODE_50W, meaning the power budget is not yet at maximum
  3. Jetson Clocks: inactive, meaning the fan and core clocks are not locked at their highest state

Unlock full performance:

```shell
sudo nvpmodel -m 0
sudo jetson_clocks
```

jtop then shows:

  1. NV Power[0]: MAXN
  2. Jetson Clocks: running

2 Docker

This bare-bones system skipped Docker too:

```shell
# remember to proxy_on first
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
```

Let the current user run Docker without sudo, then log out of SSH and back in:

```shell
sudo usermod -aG docker $USER
```

Install the NVIDIA Container Toolkit:

```shell
sudo apt update
sudo apt install -y nvidia-container-toolkit
```

Configure Docker to use the NVIDIA runtime by default and restart the service:

```shell
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

Verify:

```shell
docker info | grep Runtimes
Runtimes: io.containerd.runc.v2 nvidia runc
```

Configure a proxy for Docker:

Because NVIDIA's customized system strips out the TUN module, only an environment-variable system proxy is available, so it's best to configure Docker's proxy by hand:

```shell
# Write the drop-in config
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo tee /etc/systemd/system/docker.service.d/http-proxy.conf <<EOF
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7897"
Environment="HTTPS_PROXY=http://127.0.0.1:7897"
Environment="NO_PROXY=localhost,127.0.0.1,::1"
EOF
# Restart the service
sudo systemctl daemon-reload
sudo systemctl restart docker
```
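To confirm the drop-in took effect, `sudo systemctl show --property=Environment docker` should echo the three variables back. The file format itself can be rehearsed without root by writing the same drop-in to a scratch directory:

```shell
# Write the same drop-in into a temp dir and count the Environment= lines.
tmp=$(mktemp -d)
tee "$tmp/http-proxy.conf" >/dev/null <<'EOF'
[Service]
Environment="HTTP_PROXY=http://127.0.0.1:7897"
Environment="HTTPS_PROXY=http://127.0.0.1:7897"
Environment="NO_PROXY=localhost,127.0.0.1,::1"
EOF
grep -c '^Environment=' "$tmp/http-proxy.conf"   # prints: 3

# On the real host, verify with:
#   sudo systemctl show --property=Environment docker
```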

3 Testing llama.cpp

Grab a small model for testing:

```shell
# Create the model directory
mkdir -p ~/ai-models/gguf
cd ~/ai-models/gguf

# wget a GGUF model straight from the huggingface mirror
wget https://hf-mirror.com/QuantFactory/Meta-Llama-3-8B-Instruct-GGUF/resolve/main/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
```
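A GGUF file begins with the 4-byte ASCII magic `GGUF`, so a one-liner catches truncated downloads or saved HTML error pages before a load attempt wastes minutes. The `check_gguf` helper below is mine, not part of any tool:

```shell
check_gguf() {
    # a valid GGUF file begins with the ASCII magic "GGUF"
    if [ "$(head -c 4 "$1")" = "GGUF" ]; then
        echo "$1: looks like GGUF"
    else
        echo "$1: NOT a GGUF file"
    fi
}

# Simulate a failed download that actually saved an HTML error page:
printf '<html>error</html>' > /tmp/bad.gguf
check_gguf /tmp/bad.gguf    # prints: /tmp/bad.gguf: NOT a GGUF file
```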

Build a dedicated ARM64 CUDA image:

```shell
cd ~
# Clone the latest source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Building now fails: Docker BuildKit's network behavior is peculiar and an
# ordinary proxy can't cover it, the default Tegra kernel has no TUN, and I had
# no usable proxy router at the time
sudo docker build -t llama-cpp-cuda -f .devops/cuda.Dockerfile .
[+] Building 30.1s (3/3) FINISHED docker:default
=> [internal] load build definition from cuda.Dockerfile 0.0s
=> => transferring dockerfile: 2.74kB 0.0s
=> ERROR [internal] load metadata for docker.io/nvidia/cuda:12.8.1-runtime-ubuntu24.04 30.0s
=> CANCELED [internal] load metadata for docker.io/nvidia/cuda:12.8.1-devel-ubuntu24.04 30.0s
------
> [internal] load metadata for docker.io/nvidia/cuda:12.8.1-runtime-ubuntu24.04:
------
cuda.Dockerfile:41
--------------------
39 |
40 | ## Base image
41 | >>> FROM ${BASE_CUDA_RUN_CONTAINER} AS base
42 |
43 | RUN apt-get update \
--------------------
ERROR: failed to build: failed to solve: DeadlineExceeded: failed to fetch anonymous token: Get "https://auth.docker.io/token?scope=repository%3Anvidia%2Fcuda%3Apull&service=registry.docker.io": dial tcp [2a03:2880:f10d:183:face:b00c:0:25de]:443: i/o timeout
```

There are a few pitfalls here:

  1. Network: Docker BuildKit does its own network requests and DNS lookups, so the system proxy can't help the downloads. The practical fix is to note the image names in the error output, pull them manually, then build.

  2. CUDA version: the Dockerfile above defaults to CUDA 12.8.1 images, while jtop shows this machine runs L4T 36.5.0, flashed with JetPack 6.2.2. Searching catalog.ngc.nvidia.com for l4t-cuda turns up nvcr.io/nvidia/l4t-cuda:12.6.11-runtime and nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04. Switch the build base to the Tegra-specific images; otherwise, when the resulting image runs an MoE model, cuBLAS errors appear as soon as batch.n_token > 32:

    ```dockerfile
    ARG UBUNTU_VERSION=22.04
    # This needs to generally match the container host's environment.
    ARG CUDA_VERSION=12.6.11
    # Target the CUDA build image
    ARG BASE_CUDA_DEV_CONTAINER=nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04

    ARG BASE_CUDA_RUN_CONTAINER=nvcr.io/nvidia/l4t-cuda:12.6.11-runtime
    ```
  3. Moving to 22.04 raises a new problem: llama.cpp targets 24.04 by default, so .devops/cuda.Dockerfile uses gcc-14/g++-14, while 22.04 ships gcc-11 by default (gcc-12 is available and the better pick). Edit the build file to use gcc-12/g++-12 so the toolchain matches the distro:

    ```dockerfile
    RUN apt-get update && \
        apt-get install -y gcc-12 g++-12 build-essential cmake python3 python3-pip git libssl-dev libgomp1

    ENV CC=gcc-12 CXX=g++-12 CUDAHOSTCXX=g++-12
    ```
  4. The build also wastes time on pointless compilation:

    • llama.cpp's Docker build compiles PTX kernels for every NVIDIA architecture by default, but I only need the AGX Orin, whose embedded architecture is sm_87; pinning it with -DCMAKE_CUDA_ARCHITECTURES=87 saves about 90% of the compile time;
    • -DGGML_CPU_ALL_VARIANTS=ON makes CMake compile advanced CPU kernels including SVE (Scalable Vector Extension); this flag is what demanded gcc-12/g++-12 in the first place. But Orin's Cortex-A78AE cores are ARMv8.2-A and don't support the most aggressive SVE instructions, so it can be switched off;
  5. Finally, add two more compile switches to dodge the architectural traps of unified-memory devices like Tegra. Without them, a model such as Qwen3.5-35B-A3B-Uncensored-HauhauCS-Aggressive-Q8_0.gguf runs fine without mmproj, but attaching mmproj and feeding it an image fails with a cuMemAddressReserve error.

    • -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON: schedule for a unified-memory architecture;
    • -DGGML_CUDA_NO_VMM=ON: don't pre-reserve one large contiguous block of memory.
    ```dockerfile
    # original
    23 │ RUN if [ "${CUDA_DOCKER_ARCH}" != "default" ]; then \
    24 │     export CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=${CUDA_DOCKER_ARCH}"; \
    25 │ fi && \
    26 │ cmake -B build -DGGML_NATIVE=OFF -DGGML_CUDA=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON -DLLAMA_BUILD_TESTS=OFF ${CMAKE_ARGS} -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    27 │ cmake --build build --config Release -j$(nproc)
    # replace with
    28 │ RUN cmake -B build -DGGML_NATIVE=ON -DGGML_CUDA=ON -DGGML_CUDA_NO_VMM=ON -DGGML_CUDA_ENABLE_UNIFIED_MEMORY=ON -DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=OFF -DLLAMA_BUILD_TESTS=OFF -DCMAKE_CUDA_ARCHITECTURES=87 -DCMAKE_EXE_LINKER_FLAGS=-Wl,--allow-shlib-undefined . && \
    29 │ cmake --build build --config Release -j$(nproc)
    ```

The correct installation procedure (if you have your own proxy router, the manual pulls can be skipped):

```shell
# Pull the images manually:
docker pull nvcr.io/nvidia/12.6.11-devel:12.6.11-devel-aarch64-ubuntu22.04
docker pull nvcr.io/nvidia/l4t-cuda:12.6.11-runtime

cd ~/llama.cpp

# Bypass BuildKit; takes about ten minutes
DOCKER_BUILDKIT=0 sudo docker build --network host \
    --build-arg HTTP_PROXY=http://127.0.0.1:7897 \
    --build-arg HTTPS_PROXY=http://127.0.0.1:7897 \
    --build-arg NO_PROXY="localhost,127.0.0.1,ports.ubuntu.com,archive.ubuntu.com,security.ubuntu.com" \
    -t llama-cpp-tegra -f .devops/cuda.Dockerfile .
```

Test the freshly built image:

```shell
sudo docker run -d \
    --name llama-server \
    --runtime nvidia \
    --gpus all \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -v ~/ai-models/gguf:/models \
    -p 8080:8080 \
    llama-cpp-tegra \
    -m /models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf \
    -c 8192 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```
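Besides the web page, the server answers over plain HTTP. A minimal smoke test, assuming the container above is up; the JSON body is validated locally first so a typo in it doesn't masquerade as a server error:

```shell
# Build the request body and make sure it is valid JSON before sending it.
payload='{"messages":[{"role":"user","content":"Say hi in one word."}],"max_tokens":16}'
echo "$payload" | python3 -m json.tool >/dev/null && echo "payload OK"
# prints: payload OK

# With the container running, llama-server exposes a health probe and an
# OpenAI-compatible chat endpoint:
#   curl -s http://localhost:8080/health
#   curl -s http://localhost:8080/v1/chat/completions \
#       -H 'Content-Type: application/json' -d "$payload"
```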

Open localhost:8080 and you'll see llama.cpp's chat page.

4 Deploying a Mid-Sized Model with llama.cpp

A model recently caught my eye: a 27B dense model distilled from Qwen3.5 with Opus instructions, Jackrong/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-GGUF. Since the AGX Orin has 64G of unified memory, I could splurge and try the Q8_0.gguf (about 27 GB).

```shell
sudo docker run -d \
    --name llama-server \
    --runtime nvidia \
    --gpus all \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -v ~/ai-models/gguf:/models \
    -p 8080:8080 \
    llama-cpp-tegra \
    -m /models/Qwen3.5-27B.Q4_K_M.gguf \
    -c 32768 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```

Memory is ample, but generation speed is dire. The device's unified memory bandwidth is only 204 GB/s, so against a roughly 27 GB model the physical ceiling on decode speed is 204 ÷ 27 ≈ 7.6 tokens/s. After bus-scheduling overhead and the KV cache take their cut, real throughput lands at 50–70% of that ceiling, and indeed I observed about 4 tokens/s. For a Chinese reader of even moderately quick pace, that is unbearable, so it has to be swapped for the Q4_K_M.
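The back-of-envelope ceiling above (decode is memory-bandwidth bound: every token must stream the full weights, so tokens/s ≈ bandwidth ÷ model size) is quick to script:

```shell
# Rough decode-speed ceiling for a bandwidth-bound model:
#   tokens/s ≈ memory bandwidth (GB/s) / model size (GB)
bandwidth_gbps=204   # Orin AGX 64G unified memory bandwidth
model_gb=27          # ~27 GB of Q8_0 weights
awk -v bw="$bandwidth_gbps" -v sz="$model_gb" \
    'BEGIN { printf "ceiling: %.1f tokens/s\n", bw / sz }'
# prints: ceiling: 7.6 tokens/s
```

Swapping in a ~16 GB Q4_K_M file for `model_gb` shows why the smaller quant roughly doubles the ceiling.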

Meanwhile Google has just released gemma-4, a natively multimodal model; worth a try:

```shell
# Before loading a big model, flush the page cache
sudo sync && sudo sysctl -w vm.drop_caches=3

sudo docker run -d \
    --name llama-cpp-gemma-4-26B-A4B-it-q8_0 \
    --runtime nvidia \
    --gpus all \
    --ulimit memlock=-1:-1 \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -e GGML_CUDA_NO_PINNED=1 \
    -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    -e GGML_CUDA_FORCE_MMQ=1 \
    -v ~/ai-models/gguf:/models \
    -p 8081:8080 \
    llama-cpp-tegra \
    -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf \
    --mmproj /models/gemma-4-26B-A4B-it-mmproj-BF16.gguf \
    -c 131072 \
    -fa on \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```
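Before (or after) loading, it's worth confirming how much of the 64G unified pool is actually free; since GPU and CPU share the same memory on Tegra, plain `free` already tells the story:

```shell
# The "available" column of free's Mem: row is what a new model can claim.
free -g | awk '/^Mem:/ {print "available GiB:", $7}'
```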

Then test Qwen:

```shell
sudo docker run -d \
    --name llama-cpp-Qwen3.5-35B-A3B-Q8_K_XL \
    --runtime nvidia \
    --gpus all \
    --ulimit memlock=-1:-1 \
    -e NVIDIA_DISABLE_REQUIRE=1 \
    -e GGML_CUDA_NO_PINNED=1 \
    -e GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 \
    -e GGML_CUDA_FORCE_MMQ=1 \
    -v ~/ai-models/gguf:/models \
    -p 8081:8080 \
    llama-cpp-tegra \
    -m /models/Qwen3.5-35B-A3B-UD-Q8_K_XL.gguf \
    --mmproj /models/Qwen3.5-35B-A3B-UD-Q8_K_XL-mmproj-BF16.gguf \
    -c 65536 \
    -fa on \
    -np 1 \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 999
```

5 LiteLLM Model Aggregation Gateway

To wring everything out of the hardware, each Orin AGX 64G has its memory fully committed, so a third device gets a model-aggregation gateway: something that, like the OpenAI API, lets the client pick a model and dispatches the request to the matching endpoint. Gemini recommended LiteLLM's Docker deployment. It's just a forwarding proxy that needs to stay up long-term, so it lives on the Rock 5B.

Write the gateway config, vim ~/litellm/config.yaml:

```yaml
model_list:
  - model_name: gemma-4-26B-A4B-it-UD-Q8_K_XL
    litellm_params:
      model: openai/gemma-4-26B-A4B-it-UD-Q8_K_XL
      api_base: http://192.168.1.138:8081/v1
      api_key: sk-1234

  - model_name: Qwen3.5-35B-A3B-Q8_K_XL
    litellm_params:
      model: openai/Qwen3.5-35B-A3B-Q8_K_XL
      api_base: http://192.168.1.139:8081/v1
      api_key: sk-1234
```

Then just load the config:

```shell
# Pull first if the network is flaky
docker pull docker.litellm.ai/berriai/litellm:main-latest

docker run -d \
    -v $(pwd)/config.yaml:/app/config.yaml \
    -p 4001:4000 \
    --name litellm-proxy \
    docker.litellm.ai/berriai/litellm:main-latest \
    --config /app/config.yaml \
    --detailed_debug
```

Test the model list:

```shell
curl http://localhost:4001/v1/models
{"data":[{"id":"gemma-4-26B-A4B-it-UD-Q8_K_XL","object":"model","created":1677610602,"owned_by":"openai"},{"id":"Qwen3.5-35B-A3B-Q8_K_XL","object":"model","created":1677610602,"owned_by":"openai"}],"object":"list"}
```

Test gemma-4's response:

```shell
curl -X POST "http://localhost:4001/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "gemma-4-26B-A4B-it-UD-Q8_K_XL",
        "messages": [
            {"role": "user", "content": "你好,请用一句话介绍一下你自己。"}
        ]
    }'
{"id":"chatcmpl-5RXTGa6ZQ5aW57JGlsDzxOylaWptEVhE","created":1775988908,"model":"gemma-4-26B-A4B-it-UD-Q8_K_XL","object":"chat.completion","system_fingerprint":"b8764-8fe78a7ca","choices":[{"finish_reason":"stop","index":0,"message":{"content":"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。","role":"assistant","reasoning_content":"* Language: Chinese (\"你好,请用一句话介绍一下你自己。\")\n * Content: \"Hello, please introduce yourself in one sentence.\"\n\n * Identity: Gemma 4.\n * Developer: Google DeepMind.\n * Nature: Large Language Model (LLM) with open weights.\n * Capabilities: Text and image processing (audio for 2B/4B), text-only output.\n * Knowledge Cutoff: January 2025.\n\n * *Draft 1 (Literal):* 我是 Gemma 4,由 Google DeepMind 开发的大型语言模型。 (I am Gemma 4, a large language model developed by Google DeepMind.)\n * *Draft 2 (More descriptive):* 我是 Gemma 4,由 Google DeepMind 开发的开源权重的多模态大语言模型。 (I am Gemma 4, an open-weights multimodal LLM developed by Google DeepMind.)\n * *Refining for \"one sentence\" and \"identity\":* The prompt asks for an introduction in one sentence.\n\n * \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\" (I am Gemma 4, an open-weights large language model developed by Google DeepMind.)\n\n * Identify as Gemma 4? Yes.\n * Mention Google DeepMind? Yes.\n * One sentence? Yes.\n\n \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\"","provider_specific_fields":{"refusal":null,"reasoning_content":"* Language: Chinese (\"你好,请用一句话介绍一下你自己。\")\n * Content: \"Hello, please introduce yourself in one sentence.\"\n\n * Identity: Gemma 4.\n * Developer: Google DeepMind.\n * Nature: Large Language Model (LLM) with open weights.\n * Capabilities: Text and image processing (audio for 2B/4B), text-only output.\n * Knowledge Cutoff: January 2025.\n\n * *Draft 1 (Literal):* 我是 Gemma 4,由 Google DeepMind 开发的大型语言模型。 (I am Gemma 4, a large language model developed by Google DeepMind.)\n * *Draft 2 (More descriptive):* 我是 Gemma 4,由 Google DeepMind 开发的开源权重的多模态大语言模型。 (I am Gemma 4, an open-weights multimodal LLM developed by Google DeepMind.)\n * *Refining for \"one sentence\" and \"identity\":* The prompt asks for an introduction in one sentence.\n\n * \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\" (I am Gemma 4, an open-weights large language model developed by Google DeepMind.)\n\n * Identify as Gemma 4? Yes.\n * Mention Google DeepMind? Yes.\n * One sentence? Yes.\n\n \"我是 Gemma 4,是由 Google DeepMind 开发的开源权重大型语言模型。\""}},"provider_specific_fields":{}}],"usage":{"completion_tokens":353,"prompt_tokens":25,"total_tokens":378,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":25,"prompt_ms":254.532,"prompt_per_token_ms":10.181280000000001,"prompt_per_second":98.21947731522953,"predicted_n":353,"predicted_ms":16617.949,"predicted_per_token_ms":47.076342776203965,"predicted_per_second":21.242091909176036}}
```

Test qwen3.5's response:

```shell
curl -X POST "http://localhost:4001/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer sk-1234" \
    -d '{
        "model": "Qwen3.5-35B-A3B-Q8_K_XL",
        "messages": [
            {"role": "user", "content": "Hello, please write a very short python script to print hello world."}
        ]
    }'
{"id":"chatcmpl-gCLU58UkI3h51hgucpWwucmTJ1yX7qox","created":1775988956,"model":"Qwen3.5-35B-A3B-Q8_K_XL","object":"chat.completion","system_fingerprint":"b8762-073bb2c20","choices":[{"finish_reason":"stop","index":0,"message":{"content":"```python\nprint(\"hello world\")\n```","role":"assistant","reasoning_content":"Thinking Process:\n\n1. **Analyze the Request:**\n * Task: Write a very short Python script.\n * Functionality: Print \"hello world\".\n * Constraint: \"very short\".\n\n2. **Determine the Solution:**\n * The standard way to print in Python is `print(\"hello world\")`.\n * This is already one line and very short.\n\n3. **Draft the Code:**\n ```python\n print(\"hello world\")\n ```\n\n4. **Review against Constraints:**\n * Is it Python? Yes.\n * Is it very short? Yes (one line).\n * Does it print \"hello world\"? Yes.\n\n5. **Final Output Generation:** (Just the code block).\n","provider_specific_fields":{"refusal":null,"reasoning_content":"Thinking Process:\n\n1. **Analyze the Request:**\n * Task: Write a very short Python script.\n * Functionality: Print \"hello world\".\n * Constraint: \"very short\".\n\n2. **Determine the Solution:**\n * The standard way to print in Python is `print(\"hello world\")`.\n * This is already one line and very short.\n\n3. **Draft the Code:**\n ```python\n print(\"hello world\")\n ```\n\n4. **Review against Constraints:**\n * Is it Python? Yes.\n * Is it very short? Yes (one line).\n * Does it print \"hello world\"? Yes.\n\n5. **Final Output Generation:** (Just the code block).\n"}},"provider_specific_fields":{}}],"usage":{"completion_tokens":188,"prompt_tokens":24,"total_tokens":212,"prompt_tokens_details":{"cached_tokens":0}},"timings":{"cache_n":0,"prompt_n":24,"prompt_ms":290.856,"prompt_per_token_ms":12.119,"prompt_per_second":82.5150589982672,"predicted_n":188,"predicted_ms":10157.401,"predicted_per_token_ms":54.02872872340426,"predicted_per_second":18.508671657247756}}
```

6 Deploying Hermes

I don't care for the lobster project. Partly it's temperament (I don't chase crowds), and partly that project was already showing signs of management collapse months ago, with grotesquely skewed issue and PR counts. hermes-agent, which I ran across these past few days, looks worth a try.

Pre-Requisites

firecrawl is a web-page scraping and cleaning tool. Built on Playwright, it's decidedly heavyweight, with real CPU and memory pressure; an RK3588 most likely can't take it, and the CIX P1 is powered off at the office, so for now it goes on my PC. The script follows the Hermes Agent Full Setup Tutorial: How to Setup Your First AI Agent (Gemma 4).

```shell
git clone https://github.com/firecrawl/firecrawl.git
cd firecrawl

cat > .env << 'EOF'
PORT=3002
HOST=0.0.0.0
USE_DB_AUTHENTICATION=false
BULL_AUTH_KEY=somePASSword
EOF

sed -i 's|# image: ghcr.io/firecrawl/firecrawl|image: ghcr.io/firecrawl/firecrawl|' ./docker-compose.yaml
sed -i 's| build: apps/api| # build: apps/api|' docker-compose.yaml
sed -i 's|# image: ghcr.io/firecrawl/playwright-service:latest|image: ghcr.io/firecrawl/playwright-service:latest|' docker-compose.yaml
sed -i 's| build: apps/playwright-service-ts| # build: apps/playwright-service-ts|' docker-compose.yaml

docker compose up -d
```

Install Hermes

For now it's installed in WSL2 on my PC; when I find time it should move to the o6n, since the service needs to stay up around the clock and I begrudge the electricity.

Things to note during installation:

  • When configuring the model API, enter the LiteLLM URL above, http://192.168.1.120:4001/v1; the models can then be selected as 1,2, like this:

    ```
    API base URL [e.g. https://api.example.com/v1]: http://192.168.1.120:4001/v1
    API key [optional]:
    Verified endpoint via http://192.168.1.120:4001/v1/models (2 model(s) visible)
    Available models:
    1. gemma-4-26B-A4B-it-UD-Q8_K_XL
    2. Qwen3.5-35B-A3B-Q8_K_XL
    Select model [1-2] or type name: 1,2
    Context length in tokens [leave blank for auto-detect]:
    Default model set to: 1,2 (via http://192.168.1.120:4001/v1)
    💾 Saved to custom providers as "192.168.1.120:4001" (edit in config.yaml)
    ```
  • It supports WeChat: tick it under Messaging Platforms, scan the QR code, and a ClawBot shows up in WeChat.
  • When configuring Firecrawl, since I put hermes in the same docker compose as firecrawl, just press Enter to accept the default http://localhost:3002.

Appendix: flashing the Orin AGX 64G from PVE:

Getting the Orin into Recovery mode depends on whether it is powered off or already running.

  1. If powered off: press and hold button ② (Force Recovery), then connect the power cable. The white indicator lights up, but Recovery mode is black-screen, so a monitor attached to the Orin won't show anything.
  2. If already running: press and hold button ②, then press button ③ (Reset); release ③ first, then release ②.

Once in flashing mode, a recovery-state device (0955:7023) should appear on the Ubuntu host:

```shell
agx-flash@agx-flash:~$ lsusb
Bus 002 Device 002: ID 0627:0001 Adomax Technology Co., Ltd QEMU USB Tablet
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 008 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 007 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 006 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 009 Device 003: ID 0955:7023 NVIDIA Corp. APX
Bus 009 Device 002: ID 2109:3431 VIA Labs, Inc. Hub
Bus 009 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 010 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
```

Just in case, set up a proxy:

```shell
export http_proxy="http://192.168.1.125:7897"
export https_proxy="http://192.168.1.125:7897"
export HTTP_PROXY="http://192.168.1.125:7897"
export HTTPS_PROXY="http://192.168.1.125:7897"
```

Install no components at all; flash only Linux, and install whatever's needed once the system is up:

```shell
sdkmanager --cli --action install --login-type devzone --product Jetson --target-os Linux --version 6.2.2 --target JETSON_AGX_ORIN_TARGETS --select 'Jetson Linux' --flash --license accept
```

When flashing finishes it asks, "SDK Manager is about to install SDK components on your Jetson AGX, To install SDK components on your Jetson AGX Orin modules:…". Since we're only flashing the OS, choose 2. Skip.
