[{"content":"一个线上事故 你维护一个 Kubernetes 自定义 controller，负责 watch 集群中的 CRD 资源并做 reconcile。\n某天凌晨，你被告警叫醒：\ncontroller Pod 内存从 200MB 涨到 2GB P99 延迟从 50ms 飙到 5s 最终 OOMKilled，Pod 反复重启 你 kubectl logs 一看，没有明显错误。CPU 使用率也不高。但内存一直在涨，请求越来越慢。\n到底发生了什么？ 要回答这个问题，我们需要理解 Go 程序运行时内部到底在干什么。这就要从 GMP 调度模型说起。\n先搞清楚一个问题：为什么需要 GMP？ 你写 Go 代码时会这样启动一个并发任务：\ngo func() { // 做一些事 }() 一行 go 创建了一个 goroutine。你可以创建几万、几十万个。\n但操作系统不认识 goroutine。 操作系统只认识线程（thread）。而线程很重——每个默认占 1-8MB 内存，创建和切换都需要内核参与。你不可能创建几十万个线程。\n所以问题就是：几十万个 goroutine，怎么在有限的几个线程上跑起来？\n答案就是 Go runtime 内置的调度器——GMP 模型。它不是一个库，不是一个框架，是 Go 语言运行时的核心组件，每个 Go 程序启动时就在运行。\n用大楼比喻理解\u0026quot;内核参与\u0026quot; 把操作系统想象成一栋大楼：内核态是大楼的中控室，只有管理员（内核）能进；用户态是普通办公室，应用程序在这里办公。\n线程的创建，就像在大楼里新开一个工位：你得打电话给中控室（内核），管理员要登记工位信息、分配办公桌椅和门禁卡（分配内核栈、创建 task_struct、设置寄存器上下文）。这套流程走下来，开销不小。\n线程的切换，本质是决定谁能坐到工位上干活。工位（CPU 核心）就那么几个，但有几百个人排着队要用。中控室管理员定了规矩——每人坐 10 分钟就得换人（时间片用完），或者你去等快递了（I/O 阻塞）就先让别人坐。每次换人，管理员要记下你做到哪一页了（保存寄存器），再把下一个人的进度翻出来（恢复寄存器）。这个\u0026quot;记进度、换人、翻进度\u0026quot;的过程就是上下文切换，每次都要经过中控室（内核态），开销显著。\n而 goroutine 的切换，相当于同一间办公室里几个同事自己商量换谁来用电脑，根本不用打电话给管理员——完全在用户态完成，快得多。\n用排班制度理解 GMP 继续这个大楼比喻，GMP 三个角色可以这样对应：\nM（线程） = 办公室里的一张工位（真正能干活的物理资源） G（Goroutine） = 员工（要完成的任务/人） P（Processor） = 排班经理，手里有一个排班表（本地队列），负责安排哪个员工去哪个工位干活 以前（线程模型）是一人一个工位，100 个人就要 100 个工位，太浪费。现在排班经理说：\u0026ldquo;工位就这么几个，你们排队，谁的活到了谁上去干，干完或者等东西（比如等 IO）就先下来让别人上。\u0026ldquo;而且换人过程排班经理自己就能搞定（用户态调度），不用打电话给中控室（内核）。\nGMP 三个角色 G — Goroutine 每次你写 go func()，就创建了一个 G。它包含：\n一个函数的代码和参数 自己的栈（初始只有 2KB，对比线程的 1-8MB） 当前执行到哪一行（程序计数器） 状态标记（正在运行？等待中？可以运行？） G 就是一份\u0026quot;待处理的任务\u0026rdquo;。\nM — Machine（OS 线程） M 对应一个操作系统线程，是真正能执行代码的东西。\nM 的数量是动态的，Go runtime 按需创建，有空闲的就复用，上限默认 10000。\nP — Processor（逻辑处理器） P 是 Go runtime 自己发明的抽象层，其他语言没有这个概念。\nP 不是 CPU，不是线程，它是一个数据结构，持有：\n一个本地 goroutine 队列（最多 256 个 G） 执行 goroutine 所需的运行环境 P 的数量 = GOMAXPROCS，默认等于 CPU 核数。你可以手动修改：\nruntime.GOMAXPROCS(4) // 不管机器几个核，P 就是 4 个 核心规则 M 必须绑定一个 P 才能执行 
G。\n用前面的大楼比喻来说：员工（G）排在排班经理（P）的排班表上，排班经理把员工安排到工位（M）上干活。工位必须有排班经理管着才能运转。如果一个工位上的员工去等快递被卡住了（系统调用阻塞），排班经理可以换一个工位继续安排其他员工。\ngraph LR G1[G - 员工] --\u003e P[P - 排班经理本地队列] G2[G - 员工] --\u003e P G3[G - 员工] --\u003e P P --\u003e|安排到工位| M[M - 工位OS 线程] 两级队列 graph TD subgraph local[\"本地队列 (无锁访问)\"] P0[P0] --- G1[G1] \u0026 G2[G2] \u0026 G3[G3] P1[P1] --- G4[G4] \u0026 G5[G5] P2[P2] --- G6[G6] end subgraph global[\"全局队列 (需要加锁)\"] GQ[全局队列] --- G7[G7] \u0026 G8[G8] \u0026 G9[G9] \u0026 G10[G10] end P0 -.-\u003e|本地满了溢出| GQ 为什么分两级？本地队列不用加锁。P 从自己的队列取 G 是最快的操作。全局队列是溢出区：本地队列满了（256 个），新 G 就放到全局队列。\n调度机制 调度循环 M 绑定 P 后，不断循环：\ngraph LR A[取一个 G] --\u003e B[执行 G] --\u003e C[G 完成或让出] --\u003e A Work Stealing（工作窃取） 当一个 P 的本地队列空了：\n先查全局队列（需加锁，取一批 G） 再去偷别的 P 的任务（随机选一个 P，偷走它本地队列的一半） 检查网络轮询器（netpoller）有没有就绪的 G 都没有 → M 休眠 为什么偷一半？偷太少还得再偷，偷太多别人又没活干。一半是个经验上的折中，并非经过严格证明的最优解。\nsysmon 监控线程 sysmon 是一个特殊的 M，不需要绑定 P，独立运行，是整个调度器的\u0026quot;看门狗\u0026quot;：\n运行超过 10ms 的 G → 标记为需要抢占 M 在系统调用中阻塞太久 → 把 P 抢走给其他 M 定期检查网络轮询器 → 唤醒等待 I/O 的 G 需要时触发 GC 抢占式调度 Go 1.14 之前（协作式）：G 只在函数调用时检查是否需要让出。一个 for {} 死循环会卡死整个 P。\nGo 1.14 之后（基于信号的异步抢占）：\nsysmon 发现某个 G 运行超过 10ms 向该 G 所在的 M 发送 SIGURG 信号 信号处理函数注入抢占点 G 保存现场，让出 P 这解决了死循环卡死的问题。\n冷知识：为什么是 SIGURG？SIGURG（Signal Urgent）本来是 Unix 中用于 TCP 带外数据的信号，实际中几乎没有程序使用。Go 团队选它正是因为没人用，不会和用户程序自己注册的信号处理冲突。如果用 SIGUSR1 之类常见信号，很可能和业务代码打架。\nGoroutine 栈管理 这是 goroutine 能创建几百万个的核心原因：\n初始栈 2KB（OS 线程 1-8MB） 函数调用时检查栈空间，不够就分配 2 倍大小的新栈，把旧内容拷贝过去（连续栈） GC 时检查，如果栈使用不到 1/4，就缩小 100 万个 goroutine × 2KB = 2GB。100 万个线程 × 1MB = 1TB。两者相差 500 倍。\n回到那个事故 现在我们有足够的知识来调查了。\n第一步：抓 pprof 数据 Controller 暴露了 pprof 端口（import _ \u0026quot;net/http/pprof\u0026quot;），在 OOM 之前抓数据：\n# 端口转发 kubectl port-forward deploy/my-controller 6060:6060 # 抓 goroutine 数据 curl http://localhost:6060/debug/pprof/goroutine?debug=2 \u0026gt; goroutine.txt # 抓内存数据 go tool pprof http://localhost:6060/debug/pprof/heap 第二步：读懂 goroutine dump 打开 goroutine.txt：\ngoroutine profile: total 183742 18 万个 goroutine。 一个正常的 K8s controller 通常几十到几百个（informer、worker、runtime 
后台线程加起来）。18 万是几百倍的异常。\n继续看具体内容：\ngoroutine 58923 [chan send, 47 minutes]: main.(*Reconciler).reconcile.func1(...) /app/controller.go:89 goroutine 58924 [chan send, 47 minutes]: main.(*Reconciler).reconcile.func1(...) /app/controller.go:89 ... (重复几万个) 第三步：用 GMP 知识解读 pprof pprof 的 goroutine dump 格式是固定的：\ngoroutine \u0026lt;ID\u0026gt; [\u0026lt;状态\u0026gt;, \u0026lt;等待时长\u0026gt;]: \u0026lt;调用栈\u0026gt;\n方括号里的状态是 Go runtime 自动标记的，对应 G 在 GMP 中的状态：\npprof 状态 GMP 含义 running G 绑定在 P 上执行 runnable G 在队列中排队 chan receive G 在等 \u0026lt;-ch，挂在 channel 的 recvq 上 chan send G 在等 ch \u0026lt;-，挂在 channel 的 sendq 上 syscall M 在做系统调用，P 可能已被抢走 semacquire G 在等 Mutex / WaitGroup IO wait G 在等网络 I/O select G 阻塞在 select 语句 所以 [chan send, 47 minutes] 的含义是：\nchan send → 这个 goroutine 执行到一个 ch \u0026lt;- 操作，channel 没有接收方，阻塞了 47 minutes → 已经等了 47 分钟，没人从这个 channel 接收数据 结合 GMP 知识：\n观察 GMP 层面 18 万个 goroutine 18 万个 G 对象，每个至少 2KB 栈 → 光栈就至少约 360MB 状态 chan send G 处于 _Gwaiting，从 P 的队列移出，不占 CPU 47 分钟没被唤醒 没人从 channel 接收数据，这些 G 永远不会醒 CPU 不高但延迟飙升 G 虽然不占 P，但 GC 需要扫描所有 G 的栈。G 越多 GC 越慢，STW 时间增加 第四步：定位代码 调用栈指向 controller.go:89，打开看：\nfunc (r *Reconciler) reconcile(ctx context.Context, obj *MyResource) error { ch := make(chan result) // 无缓冲 channel go func() { res, err := r.client.Get(ctx, obj.Name) // 调外部 API ch \u0026lt;- result{res, err} // 发送结果 }() select { case res := \u0026lt;-ch: // 正常：收到结果 return r.process(res) case \u0026lt;-ctx.Done(): // 超时：直接返回 return ctx.Err() // ← 问题在这里 } } 看到了吗？当 ctx 超时时，select 走了 ctx.Done() 分支直接返回。但 go func 里的那个 goroutine 还活着：它在调完外部 API 后，试图往 ch 发送数据。\n问题是 ch 是无缓冲 channel，发送必须有人接收。但 reconcile 函数已经返回了，没人再读 ch。于是这个 goroutine 永远卡在 ch \u0026lt;- result{res, err} 这一行，这正是 dump 里 chan send 状态的来源。\n每次超时就泄漏一个 goroutine。集群中有大量资源频繁 reconcile，几万个 goroutine 就这样堆积起来。\n第五步：修复 func (r *Reconciler) reconcile(ctx context.Context, obj *MyResource) error { ch := make(chan result, 1) // ← 改成缓冲为 1 的 channel go func() { res, err := r.client.Get(ctx, obj.Name) select { case ch \u0026lt;- result{res, err}: // 尝试发送 default: // ← 
没人收就丢弃，goroutine 正常退出 } }() select { case res := \u0026lt;-ch: return r.process(res) case \u0026lt;-ctx.Done(): return ctx.Err() } } 两个改动：\nmake(chan result, 1) — 缓冲为 1，即使没人接收，发送也不会阻塞 select + default — 双保险，如果 channel 满了也不会卡住 总结：GMP 知识的实战价值 GMP 不是面试八股文，它是你排查 Go 并发问题时的\u0026quot;X 光片\u0026quot;。\n线上问题 你需要的 GMP 知识 goroutine 泄漏导致 OOM G 的状态机：_Gwaiting 的 G 不占 CPU 但占内存，channel 持有 G 引用导致 GC 无法回收 GOMAXPROCS 设错导致性能差 P 的数量决定并行度，容器中要手动设置避免读到宿主机核数 大量系统调用拖慢服务 M 阻塞时 P 被抢走给新 M，但 M 数量会膨胀，要限制并发数 死循环卡死程序 Go 1.14 前协作式抢占的局限性，sysmon + SIGURG 信号抢占 GC 停顿导致延迟毛刺 G 越多 GC 扫描越慢，减少 goroutine 数量 = 减少 GC 压力 不懂 GMP 的人看到 pprof 输出只能看到数字，懂 GMP 的人看到的是每个 goroutine 在调度器里的生死状态。\n这是「Go 并发原理实战」系列的第一篇。下一篇我们从一个 channel 死锁的案例出发，聊 channel 的底层原理。\n补充阅读 GMP 中的\u0026quot;魔法数字\u0026quot;：工程经验值而非数学最优解 你可能会好奇：为什么本地队列是 256？为什么偷一半？为什么抢占阈值是 10ms？初始栈为什么是 2KB？\n这些数字都是工程经验值，而非数学最优解。Work stealing 中\u0026quot;偷一半\u0026quot;的策略最早来自 MIT 的 Cilk 项目，直觉上是个平衡点，但并没有严格的数学证明说它在所有场景下最优。256、10ms、2KB 也都是 Go 团队在大量真实 workload 上跑 benchmark 调出来的。\n这在系统软件中很常见——Linux 内核的时间片长度、调度权重等参数也是类似的经验调参。真实 workload 太复杂多变，数学建模很难覆盖所有场景，最终靠 \u0026ldquo;benchmark + 实践反馈\u0026rdquo; 收敛到一个\u0026quot;够好\u0026quot;的值。\n为什么 Linux 线程栈不能像 goroutine 那样小？ goroutine 初始栈只有 2KB，而 Linux 线程栈默认 1-8MB。既然 Go 能做到这么小，为什么 Linux 不学？\n核心原因是线程栈必须在内核态也能用，而 goroutine 栈只在用户态用。\n1. 内核栈不能动态扩展。 当线程陷入内核态（syscall、中断），内核要用这个线程的内核栈。内核代码路径不允许\u0026quot;栈不够了暂停一下去扩容\u0026quot;——中断处理、锁持有期间如果栈溢出就是内核 panic。所以必须一次性分配够。\n2. 用户态栈扩容需要语言层面配合。 goroutine 能动态扩栈，是因为 Go 编译器在每个函数入口插入了栈检查代码，不够就拷贝到更大的新栈。但 C/C++ 编译出的代码没有这个机制，Linux 也不可能要求所有用户态程序都配合。而且拷贝栈意味着所有栈上指针都要修正——Go 有 GC 知道哪些是指针，C 语言做不到。\n3. 
2KB 对 C 程序真的不够。 C 程序的栈上经常有大数组、大结构体，一个 char buf[4096] 就超了。Go 的栈小是因为大对象分配在堆上，语言层面配合了这个设计。\n用大楼比喻来说：线程栈就像工位自带的固定储物柜，中控室管理员（内核）也要往里放东西，所以必须一开始就造够大，不能用的时候再扩建。goroutine 栈是员工自带的可伸缩文件夹，反正只有自己用，不够了换个大的就行，排班经理（Go runtime）还帮你搬文件。\n","permalink":"https://xuezhaojun.github.io/collections/go-concurrency/go-goroutine-gmp/","summary":"不讲干巴巴的八股文。从一个真实的 Kubernetes controller 内存暴涨 + 延迟飙高的线上事故出发，用 pprof 数据倒推 GMP 调度模型的每个概念，让你理解为什么要学这些东西。","title":"从一次 K8s Controller OOM 聊起：彻底搞懂 Go GMP 调度模型"},{"content":"这是「Go 并发原理实战」系列的第二篇。第一篇：从一次 K8s Controller OOM 聊起——彻底搞懂 Go GMP 调度模型\n一个线上事故 你的团队维护一个 Kubernetes admission webhook，负责在 Pod 创建时做合规检查：镜像是否来自可信仓库、是否设了资源限制、是否有安全上下文。\n架构很简单：webhook 收到请求后，把 Pod spec 同时发给三个检查器（镜像检查、资源检查、安全检查），等所有结果回来再汇总返回。经典的 fan-out/fan-in 模式。\n某天业务方反馈：高峰期大量 Pod 创建失败，报 webhook 超时（30s）。但你检查单个检查器的逻辑，每个都在毫秒级完成。三个加起来也不可能要 30 秒。\n问题出在哪？出在 channel 上。\n先搞清楚：Channel 到底是什么？ 上一篇我们讲了 GMP——goroutine 怎么被调度到线程上执行。但 goroutine 之间怎么通信？怎么传递数据？怎么协调\u0026quot;你先做完我再做\u0026quot;？\n答案就是 channel。\n用传送带比喻 还是用上一篇的大楼比喻。大楼里有很多员工（goroutine）在不同办公室干活。他们之间怎么传递文件？\nChannel 就是两个办公室之间的传送带。\n无缓冲 channel = 没有传送带，只有一个窗口。发送方必须亲手把文件递过去，接收方必须同时伸手接。两个人必须同时在窗口，否则先到的那个就得等着。这是一次\u0026quot;面对面交接\u0026quot;。 有缓冲 channel = 窗口下面有个小柜子（缓冲区）。发送方可以把文件放进柜子就走，不用等对方。但柜子满了，发送方也得等。 ch := make(chan int) // 无缓冲：窗口交接，必须同步 ch := make(chan int, 5) // 有缓冲：柜子能放 5 份文件 Channel 的底层结构 传送带的比喻帮你建立直觉，但面试官会问底层。Channel 在 Go runtime 中是一个叫 hchan 的结构体：\ntype hchan struct { qcount uint // 柜子里现在有几份文件 dataqsiz uint // 柜子最多能放几份（缓冲区大小） buf unsafe.Pointer // 柜子本身（环形缓冲区） elemsize uint16 // 每份文件多大 closed uint32 // 传送带是否已关闭 sendx uint // 下一份文件放在柜子的哪个格子 recvx uint // 下一次从柜子的哪个格子取 recvq waitq // 在窗口等着取文件的人的排队队列 sendq waitq // 在窗口等着放文件的人的排队队列 lock mutex // 锁，同一时间只能有一个人操作传送带 } 一句话总结：channel = 一个带锁的环形队列 + 两个等待队列。\n环形队列（buf）就是柜子，大小固定。两个等待队列（sendq、recvq）就是窗口两边排队的人。锁保证同一时间只有一个 goroutine 在操作 channel。\n注意：等待队列的底层是链表，没有长度限制，不会满——来一个阻塞的 goroutine 就挂一个节点。这意味着真正的风险不是\u0026quot;等待队列满了\u0026quot;，而是大量 goroutine 阻塞在等待队列上永远不被唤醒，造成 goroutine 泄漏（参见 GMP 篇中的 K8s controller OOM 
案例）。\n为什么是环形队列？ 因为高效。不需要移动元素，只需要移动两个指针（sendx 和 recvx）：\nbuf: [ 空 | 空 | 数据A | 数据B | 数据C | 空 | 空 | 空 ] ↑ ↑ recvx sendx (下次从这取) (下次往这放) 取一个数据：从 recvx 位置取，recvx 往右移一格。放一个数据：往 sendx 位置放，sendx 往右移一格。到末尾了就绕回开头，所以叫\u0026quot;环形\u0026quot;。\n环形队列的大小在 make(chan T, n) 时就固定了，运行时不能扩容。这是有意为之的设计：channel 的核心目的是同步和通信，不是当容器用。固定大小迫使你思考\u0026quot;满了怎么办\u0026quot;，这就是背压（backpressure）机制：生产者太快就让它停下来等消费者，避免无限堆积数据最终 OOM。如果你需要动态大小的队列，应该用 slice 或第三方的无界队列，而不是 channel。\nChannel 的方向：限制只发或只收 创建 channel 时默认是双向的（既能发也能收），但在函数参数中可以限制方向，编译期就能防止误用：\nchan int // 双向 channel，既能发也能收 chan\u0026lt;- int // 只发 channel（箭头指向 chan，数据流入） \u0026lt;-chan int // 只收 channel（箭头从 chan 出来，数据流出） 记忆方法：看箭头方向。chan\u0026lt;- int 箭头朝 channel 里指，数据只能往里送；\u0026lt;-chan int 箭头从 channel 出来，数据只能往外取。\n实际使用时，通常创建双向 channel，传给函数时通过参数类型限制方向：\nch := make(chan int) // 双向 // 编译器自动将双向 channel 转为单向 go producer(ch) // 传入后只能发 go consumer(ch) // 传入后只能收 func producer(ch chan\u0026lt;- int) { // 只能往 ch 发数据，试图 \u0026lt;-ch 编译报错 ch \u0026lt;- 42 } func consumer(ch \u0026lt;-chan int) { // 只能从 ch 收数据，试图 ch\u0026lt;- 编译报错 v := \u0026lt;-ch } 这是 Go 类型系统的一个精妙设计：用编译期约束代替运行时错误。如果一个函数只应该发送，就把参数声明为 chan\u0026lt;-，有人不小心在里面写了接收代码，编译直接不过。\n发送和接收的完整流程 这是面试高频题：往 channel 发送数据的时候，底层发生了什么？\n发送流程（ch \u0026lt;- value） Go runtime 按优先级检查三种情况：\n情况 1：窗口对面有人在等着收（recvq 非空）\n这是最快的路径。有人已经在窗口等文件了，直接把数据从发送方的栈拷贝到接收方的栈上，然后唤醒接收方。数据不经过柜子（buf）。\n为什么不放柜子里再让对方取？因为多一次拷贝。直接发比\u0026quot;放进去再取出来\u0026quot;快。\n情况 2：柜子有空位（qcount \u0026lt; dataqsiz）\n没人等着收，但柜子没满。把数据放进 buf[sendx]，移动指针，走了。\n情况 3：柜子满了（或者根本没柜子，即无缓冲 channel）\n发送方被包装成一个 sudog（等待者描述符），挂到 sendq 队列上。然后调用 gopark()。上一篇讲过，这会把当前 goroutine 的状态从 _Grunning 改为 _Gwaiting，让出 P，goroutine 休眠。\n等到有人从 channel 接收数据时，发送方才会被唤醒。\n接收流程（value := \u0026lt;-ch） 完全镜像对称：\n情况 1：有人等着发（sendq 非空）\n无缓冲 channel：直接从发送方栈拷贝数据到接收方栈 有缓冲 channel：从 buf[recvx] 取数据，然后把 sendq 里等着的发送方的数据放入 buf（因为 buf 一定是满的，才会有人在 sendq 里等） 情况 2：柜子里有数据\n从 buf[recvx] 取走，移动指针。\n情况 3：柜子空了\n挂到 recvq，gopark() 休眠。等有人发送数据时被唤醒。\n回到那个事故 有了这些知识，我们来看 webhook 的代码：\nfunc (w *Webhook) validate(ctx 
context.Context, pod *corev1.Pod) (bool, error) { ch := make(chan checkResult) // 无缓冲 channel // Fan-out：同时启动三个检查 go func() { ch \u0026lt;- w.checkImage(pod) }() go func() { ch \u0026lt;- w.checkResources(pod) }() go func() { ch \u0026lt;- w.checkSecurity(pod) }() // Fan-in：收集三个结果 for i := 0; i \u0026lt; 3; i++ { result := \u0026lt;-ch if !result.allowed { return false, result.reason } } return true, nil } 看起来没问题对吧？三个 goroutine 并发检查，主 goroutine 收集三个结果。但这里有一个隐蔽的 bug。\nBug 在哪？ 当第一个或第二个结果就不合规时，return false 直接返回了。此时只收了 1 个或 2 个结果，但有 3 个 goroutine 在往 channel 发送。\n场景：第一个结果就是 not allowed 主 goroutine： \u0026lt;-ch → 收到 checkImage 的结果，not allowed → return false （函数返回，ch 不再有人读） 剩下两个 goroutine： checkResources 完成 → ch \u0026lt;- result → 无缓冲 channel，没人收 → 阻塞 checkSecurity 完成 → ch \u0026lt;- result → 无缓冲 channel，没人收 → 阻塞 又是 goroutine 泄漏。 和上一篇 GMP 文章中的 controller 案例一模一样的根因——无缓冲 channel，发送方阻塞，永远不会被唤醒。\n为什么平时没事，高峰期才出问题？ 低峰期大部分 Pod 都合规，三个结果都是 allowed，循环完整跑完，三个 goroutine 都正常退出。\n高峰期很多 Pod 不合规（比如用了未授权镜像），触发提前 return，goroutine 开始泄漏。webhook 进程的 goroutine 数量持续增长，GC 压力增大（上一篇讲过：GC 需要扫描所有 goroutine 的栈），响应变慢，最终超时。\n用 pprof 确认 kubectl port-forward deploy/webhook 6060:6060 curl http://localhost:6060/debug/pprof/goroutine?debug=2 \u0026gt; goroutine.txt goroutine profile: total 52381 goroutine 12847 [chan send, 23 minutes]: main.(*Webhook).validate.func2(...) ← checkResources 的 goroutine /app/webhook.go:45 goroutine 12848 [chan send, 23 minutes]: main.(*Webhook).validate.func3(...) 
← checkSecurity 的 goroutine /app/webhook.go:46 chan send — 结合上面的知识：这些 goroutine 卡在 ch \u0026lt;- result，状态是 _Gwaiting，挂在 channel 的 sendq 链表上。sendq 持有 goroutine 的引用，GC 无法回收。\n修复 func (w *Webhook) validate(ctx context.Context, pod *corev1.Pod) (bool, error) { ch := make(chan checkResult, 3) // ← 缓冲 = goroutine 数量 go func() { ch \u0026lt;- w.checkImage(pod) }() go func() { ch \u0026lt;- w.checkResources(pod) }() go func() { ch \u0026lt;- w.checkSecurity(pod) }() for i := 0; i \u0026lt; 3; i++ { result := \u0026lt;-ch if !result.allowed { return false, result.reason // 即使提前返回，剩余 goroutine 发送到缓冲 channel // 不会阻塞，正常退出 // channel 没有引用后会被 GC 回收 } } return true, nil } 核心改动：make(chan checkResult, 3) — 缓冲区大小 = 生产者数量。即使没人接收，发送也不阻塞，goroutine 正常退出，channel 随后被 GC 回收。\n经验法则：fan-out 模式中，channel 的缓冲大小至少等于 goroutine 数量，除非你能保证一定会读完所有结果。\n无缓冲 vs 有缓冲：什么时候用哪个？ 这个事故的根因是用错了 channel 类型。那什么时候用哪种？\n无缓冲 channel：同步握手 ch := make(chan int) 发送和接收必须同时就绪。本质是两个 goroutine 的一次同步握手——\u0026ldquo;我把数据交给你，确认你拿到了我再走\u0026rdquo;。\n适用场景：\n信号通知：done := make(chan struct{})，一方完成后 close(done) 通知另一方 确保交付：你需要确认对方拿到了数据才能继续 请求-响应：一个 goroutine 发请求，等另一个返回结果 有缓冲 channel：异步邮箱 ch := make(chan int, 100) 发送方只要柜子没满就不阻塞。解耦生产者和消费者的速率。\n适用场景：\n削峰：生产者短时间内产出很多数据，消费者慢慢处理 worker pool：用 channel 当任务队列，多个 worker 消费 限制并发：sem := make(chan struct{}, 10) 当信号量用，最多同时 10 个 fan-out/fan-in：像上面 webhook 的例子，缓冲 = 生产者数量 选择口诀 需要\u0026quot;面对面交接\u0026quot; → 无缓冲 需要\u0026quot;放进去就走\u0026quot; → 有缓冲\nSelect 多路复用 webhook 的代码其实还有一个问题：没有超时控制。如果某个检查器卡住了，主 goroutine 会永远等在 \u0026lt;-ch 上。加上 select 和 context：\n// 创建一个 5 秒超时的 context，超时后 ctx.Done() 的 channel 会被关闭 ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second) defer cancel() for i := 0; i \u0026lt; 3; i++ { select { case result := \u0026lt;-ch: // 正常收到数据就处理 if !result.allowed { return false, result.reason } case \u0026lt;-ctx.Done(): // 5 秒超时了，走这个分支，不会永远卡住 return false, fmt.Errorf(\u0026#34;validation timeout: %w\u0026#34;, ctx.Err()) } } Select 的底层实现 select 是 Go 的多路复用器——同时监听多个 
channel，谁先就绪就执行谁。底层是 selectgo 函数：\n按 channel 地址排序所有 case（多个 channel 加锁时统一顺序，防死锁） 随机打乱轮询顺序（保证公平，不是总选第一个） 遍历所有 case，检查哪个能立即完成 多个就绪 → 随机选一个（所以 select 的行为是不确定的） 都没就绪且没 default → 把当前 goroutine 挂到所有 case 的 channel 的等待队列上，休眠 被任一 channel 唤醒后 → 从其他所有 channel 的等待队列中移除自己 第 5 步和第 6 步是关键：一个 goroutine 可以同时排在多个 channel 的等待队列上。被唤醒后要做清理，从其他队列中移除，否则同一个 goroutine 会被多次唤醒。\nselect + default 的用途 select { case ch \u0026lt;- data: // 发送成功 default: // channel 满了或没人收，走这里（不阻塞） } 加了 default 就变成非阻塞操作。上一步的第 5 步变成\u0026quot;直接走 default\u0026quot;，不休眠。常用于\u0026quot;试一下，不行就算了\u0026quot;的场景。\nChannel 关闭：规则与陷阱 关闭后的行为 close(ch) 后： ✅ 接收：继续接收，先取完缓冲区剩余数据，然后返回零值 + false ❌ 发送：panic: send on closed channel ❌ 再次关闭：panic: close of closed channel 用 close 做广播通知 这是 Go 中最优雅的模式之一。上面说无缓冲 channel 发送只能通知一个接收者，但 close 可以同时唤醒所有等待者：\nquit := make(chan struct{}) // 启动 100 个 worker for i := 0; i \u0026lt; 100; i++ { go func() { for { select { case \u0026lt;-quit: // quit 关闭后，所有 worker 都能收到 return case task := \u0026lt;-taskCh: process(task) } } }() } // 需要停止时 close(quit) // 一次 close，100 个 worker 全部退出 为什么 close 能做广播？因为 close 会遍历 recvq 中所有等待的 goroutine，逐个唤醒。而普通发送只唤醒 recvq 中的第一个。\nclose 的本质：close 不是往 channel 里发一个特殊值，而是把 hchan 结构体的 closed 字段设为 1。一旦关闭，所有对它的接收操作都会立即返回该类型的零值（int → 0，string → \u0026quot;\u0026quot;，struct{} → struct{}{}），永远不会阻塞。\n从接收方的视角看，效果确实就像收到了一个零值。要区分\u0026quot;真的收到了数据 0\u0026quot;还是\u0026quot;channel 关了\u0026quot;，用双返回值：\nv, ok := \u0026lt;-ch // ok == true → 正常数据（哪怕值恰好是 0） // ok == false → channel 已关闭，v 是零值 这也是为什么广播通知推荐用 chan struct{} —— 你根本不关心收到的值，只关心\u0026quot;channel 关了\u0026quot;这个信号。struct{} 占 0 字节，纯粹当信号用，是 Go 社区公认的 best practice。\n关闭的核心原则 只有发送方关闭 channel，永远不要在接收方关闭。\n因为发送方知道\u0026quot;没有更多数据了\u0026quot;，接收方不知道。如果接收方关闭了 channel，发送方继续发送就会 panic。\n多个发送方的情况下，用一个单独的协调者来关闭，或者用 context 取消来代替 close：\n// ❌ 危险：多个发送方，谁来关闭？ func producer(ch chan\u0026lt;- int) { defer close(ch) // 多个 producer 都 defer close → panic // ... 
} // ✅ 安全：用 context 通知停止，不关闭数据 channel func producer(ctx context.Context, ch chan\u0026lt;- int) { for { select { case \u0026lt;-ctx.Done(): return // 收到取消信号，直接返回 case ch \u0026lt;- data: } } } for range channel for v := range ch { fmt.Println(v) } // 等价于： for { v, ok := \u0026lt;-ch if !ok { break } // channel 关闭且缓冲区空 fmt.Println(v) } for range 会一直读直到 channel 被关闭。如果没人关闭 channel，for range 会永远阻塞——又是一种 goroutine 泄漏。\n面试常问 Q1：channel 是线程安全的吗？ 是的。hchan 内部有 mutex，每次 send/recv 都会加锁。所以你不需要额外加锁来保护 channel 操作。\nQ2：无缓冲 channel 发送和接收，数据拷贝了几次？ 无缓冲：1 次。直接从发送方的栈拷贝到接收方的栈（sendDirect/recvDirect），不经过 buf。 有缓冲：2 次。发送方栈 → buf（第 1 次），buf → 接收方栈（第 2 次）。所以无缓冲 channel 在\u0026quot;恰好有人等着收\u0026quot;的场景下反而更快——少一次内存拷贝。\nQ3：为什么不建议用 channel 传递大结构体？ 因为 channel 的每次 send/recv 都会做数据拷贝（typedmemmove）。传大结构体就拷贝整个结构体。传指针只拷贝 8 字节。\nQ4：向一个 nil channel 发送/接收会怎样？ 永远阻塞。不会 panic，就是永远等着。这个特性有时候有用——在 select 中把某个 case 的 channel 设为 nil 可以\u0026quot;关闭\u0026quot;这个 case。\n总结 概念 一句话 Channel 本质 带锁的环形队列 + 两个 goroutine 等待队列 无缓冲 同步握手，两方必须同时就绪 有缓冲 异步邮箱，发送方放进去就走 发送到满 channel goroutine 挂到 sendq，状态变 _Gwaiting close 唤醒所有等待者（广播），之后发送会 panic select 随机轮询 + 可同时挂到多个 channel 的等待队列 最常见的坑 无缓冲 channel + 提前 return = goroutine 泄漏 这篇文章的事故和上一篇（GMP）的事故根因完全相同——goroutine 阻塞在 channel 上无法退出。两篇连起来看，你会发现 GMP 和 channel 的知识是一体的：channel 的阻塞行为（挂到 sendq/recvq）直接影响 goroutine 在 GMP 调度器中的状态（_Gwaiting），进而影响内存和 GC。\n下一篇：从一次 K8s 级联超时聊起——彻底搞懂 Go Context 的传播机制。\n","permalink":"https://xuezhaojun.github.io/collections/go-concurrency/go-channel/","summary":"一个 K8s admission webhook 在高峰期频繁超时，但单个请求处理逻辑明明很快。问题出在 channel 的使用方式上。从这个事故出发，拆解 channel 的底层结构、发送接收流程、select 实现，以及那些年我们踩过的 channel 坑。","title":"从一次 K8s Webhook 超时聊起：彻底搞懂 Go Channel 底层原理"},{"content":"Where Does an Agent Actually Live? When people say \u0026ldquo;build an AI Agent,\u0026rdquo; what usually comes to mind is: a framework, a protocol, an orchestrator, a vector database, a plugin system. 
Before any real work begins, there\u0026rsquo;s already a mountain of infrastructure.\nBut what if the answer could be far simpler?\nA Git repository + an AI coding assistant (Claude Code, Cursor, etc.) = a complete Agent.\nNo new frameworks. No new protocols. Directories, files, Git — the most familiar tools every engineer already uses. The repository is the Agent. Its identity, skills, knowledge, and workflows all exist as files, version-controlled with Git, and reviewed through the same code review process as any other code.\nRepository Structure Here\u0026rsquo;s what an actual Agent repository looks like:\nmy-team-agent/ ├── CLAUDE.md # Agent identity and behavioral rules ├── skills/ # 20+ reusable task definitions │ ├── bug-analyze/ │ ├── workspace-clone/ │ ├── jira-triage/ │ └── ... ├── workflows/ # Scheduled or triggered multi-step processes │ ├── daily-bug-triage.md │ ├── daily-standup-prep.md │ └── weekly-pr-report.md ├── solutions/ # Lessons learned — known issues and solutions ├── repos/ # 20+ related projects (read-only references) ├── team-members/ # Team roster and component ownership ├── docs/ # Reference documents, loaded on demand ├── build/ │ └── Dockerfile # Reproducible execution environment └── deploy/ # Kubernetes deployment manifests The CLAUDE.md at the root is the entry point. When an AI assistant opens this repository, it reads this file first to understand who it is, what it can do, and where its knowledge lives. All subsequent interactions are built on top of this context.\nThe key point here: everything is auditable. Who added a skill? When was a workflow changed? Why was a solution deprecated? git log and git blame can answer all these questions. Team members review Agent changes exactly the same way they review code — through Pull Requests.\nCross-Repository Visibility In real enterprise environments, almost no task exists within a single repository. 
A bug might surface at the API gateway layer, but the root cause is in the authentication service, and the fix requires changes to a shared SDK. If the Agent can only see one repository at a time, it simply cannot perform end-to-end problem analysis.\nThe repos/ directory solves this. It contains shallow clones of 20+ related projects that the Agent can read but won\u0026rsquo;t modify:\n# repos/repos.yaml categories: core: # Core projects the team directly owns - api-server - auth-service - sdk-go - cli-tool dependencies: # Upstream dependencies with custom patches - network-proxy - grpc-fork build: # Build and CI configuration - ci-config - release-pipeline cross-team: # Cross-team shared components - monitoring-stack - notification-service With this global view, the Agent can trace call chains across repositories, understand dependency relationships, and analyze how changes in one project affect others. It sees the same code landscape that a senior engineer carries in their head — except it can grep through all of it in seconds.\nGit Worktree: Parallel Task Isolation When tasks are executed end-to-end — from analysis to coding to testing to creating a PR — each task takes a long time. You can\u0026rsquo;t execute them sequentially, sitting around waiting for one to finish before starting the next. But if two tasks modify the same repository simultaneously, they\u0026rsquo;ll conflict with each other.\nGit Worktree is the native solution. Starting from a bare clone, you can create multiple working directories, each on its own branch:\n# Task 1: Fix a bug in the auth service git worktree add ../workspace/auth-fix-1234 bugfix/auth-1234 # Task 2: Add a new feature to the same auth service git worktree add ../workspace/auth-feature-5678 feature/oauth-5678 Each task operates in its own directory and branch, completely independent. 
After the task is done and the PR is merged, the worktree is automatically cleaned up.\nThis pattern is already built into the Agent\u0026rsquo;s workspace management skill. When the Agent receives a task, it automatically creates an isolated worktree, works inside it, and cleans up when finished. Multiple tasks on the same repository can truly run in parallel.\nTeam and Project Knowledge Being able to read code isn\u0026rsquo;t enough. To complete end-to-end tasks, the Agent also needs to understand the organization and processes behind the code.\nImagine this scenario: the Agent analyzes a bug, determines it belongs to the import controller component, writes a fix, creates a PR, and then\u0026hellip; needs to notify the responsible person on Slack. Who owns the import controller? What\u0026rsquo;s their Slack ID? Which channel should the message go to?\nThe team-members/ directory answers these questions:\n## Core Team | Name | GitHub | Email | Components | |------------|-------------|--------------------|-----------------------------------| | Alice Chen | alice-c | alice@example.com | api-server, sdk-go | | Bob Park | bob-park | bob@example.com | auth-service, cluster-proxy | | Carol Wu | carol-wu | carol@example.com | import-controller, lifecycle-mgr | Beyond team structure, the Agent also knows:\nProject management processes — how to create Jira tickets, which fields are required, how ticket statuses flow Release strategies — which branch maps to which product version, what the backport rules are Communication channels — where to post bug analysis reports, which channel gets the weekly summary In short, this is everything you\u0026rsquo;d tell a new hire during onboarding — not just \u0026ldquo;how to write code,\u0026rdquo; but \u0026ldquo;how we collaborate.\u0026rdquo;\nProgressive Knowledge Loading AI models have a hard limit on their Context Window — the amount of text they can process at once. 
If you stuff all documentation, all skill definitions, and the entire team roster in at startup, the Agent drowns in irrelevant information and actually performs worse.\nThe solution is Progressive Disclosure: only load the knowledge the current task requires.\nEach skill file explicitly declares its document dependencies:\n## Reference Loading | Document | When to Load | |-----------------------|---------------------------------------| | team-members.md | Always (for assignment lookups) | | docs/jira.md | When creating or updating Jira tickets| | docs/build-release.md | When the bug involves release branches| | repos/repos.yaml | When identifying which repo to analyze| When the user says \u0026ldquo;show me today\u0026rsquo;s new bugs,\u0026rdquo; the Agent loads the bug handling workflow, team member mappings, and repository list — but skips the release strategy and CI configuration docs. When the task is \u0026ldquo;backport this fix to release-2.8,\u0026rdquo; the Agent loads the branching strategy and version mapping — but skips the Jira API documentation.\nDifferent tasks, different knowledge sets. The Context Window stays focused on what truly matters.\nDockerfile: Reproducible Execution Environment We hit a classic problem early on: the same workflow runs perfectly on one engineer\u0026rsquo;s machine but breaks on another\u0026rsquo;s. 
The reason is always environment inconsistency — different operating systems, different tool versions, a missing CLI tool.\nAgent execution depends on multiple layers:\nOS and Shell — macOS and Linux differ in path handling, command arguments, and even sed syntax Tool versions — kubectl, helm, jq, yq, language runtimes — any version mismatch can cause hard-to-debug errors Agent runtime — different AI coding tools (Claude Code, Cursor, OpenCode) have different capabilities and tool-calling behaviors The Dockerfile locks everything down:\nFROM debian:bookworm-slim # Core tools RUN apt-get update \u0026amp;\u0026amp; apt-get install -y \\ git jq curl wget openssl unzip make gcc g++ # Go toolchain (pinned version) ENV GO_VERSION=1.24.4 RUN curl -fsSL https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz \\ | tar -C /usr/local -xz # CLI tools (pinned versions) RUN curl -LO \u0026#34;https://dl.k8s.io/release/v1.31.4/bin/linux/amd64/kubectl\u0026#34; \u0026amp;\u0026amp; ... RUN curl -fsSL https://get.helm.sh/helm-v3.17.0-linux-amd64.tar.gz \\ | tar -xz \u0026amp;\u0026amp; ... RUN curl -LO https://github.com/mikefarah/yq/releases/download/v4.45.1/yq_linux_amd64 \\ \u0026amp;\u0026amp; ... Once the Dockerfile is checked into the repository, any engineer can build the exact same environment. No more \u0026ldquo;works on my machine\u0026rdquo; problems. The Agent runs inside a container with a fully deterministic toolchain, independent of the host machine.\nFrom Local Tool to Always-On Agent Everything above works great as a local solution — clone the Agent repository, start the AI assistant, and get to work. But the natural next step is: deploy the Agent to the cloud so it runs 24/7.\nWith the Dockerfile standardizing the environment, you can build a container image and deploy it to a Kubernetes cluster. We built KubeOpenCode, an AI Agent management platform on Kubernetes, to do exactly this. 
Once deployed, the Agent becomes a persistent service:\nAlways online — no need to keep a terminal open on your laptop Auto-scaling — scales down when idle to save resources, wakes up automatically when tasks arrive Live knowledge updates — the Agent repository syncs from Git every few minutes; when a team member merges a new skill, the running Agent gains the new capability without a restart Scheduled workflows — cron-triggered tasks that don\u0026rsquo;t require manual initiation # Example: Automatically triage new bugs at 9 AM on weekdays apiVersion: kubeopencode.io/v1alpha1 kind: CronTask metadata: name: daily-bug-triage spec: schedule: \u0026#34;0 9 * * 1-5\u0026#34; # Monday-Friday 9:00 taskTemplate: spec: description: | Run the daily bug triage workflow. Collect new bugs, analyze relevance, assign owners, post summary to Slack. With scheduled tasks, the Agent shifts from passive response to proactive work. It triages new bugs and assigns owners before the team\u0026rsquo;s standup. It auto-generates a PR activity report every Friday. It cleans up expired bot PRs every Monday. What the team sees in the morning is no longer a pile of unsorted backlog, but an already-categorized, pre-analyzed task list.\nCredentials (API tokens, Git authentication) are centrally managed in the cluster\u0026rsquo;s Kubernetes Secrets — no more passing around .env files or having everyone configure things individually.\nThe Core Formula The entire pattern boils down to a simple formula:\nCode Repository + AI Coding Assistant = Agent The repository provides identity, skills, knowledge, and workflows. The AI assistant provides reasoning, execution, and natural language interaction. 
Together, they form an Agent with these properties:\nVersion-controlled — every change is recorded, reviewable, and reversible Auditable — git log shows the complete history of the Agent\u0026rsquo;s evolution Collaborative — team members contribute skills and knowledge through PRs Portable — works with any AI coding assistant that can read files Reproducible — Dockerfile ensures consistent behavior across environments Scalable — deploy to Kubernetes for 24/7 operation and scheduled tasks No vendor lock-in to a specific Agent framework. No proprietary plugin formats. No complex orchestration layer. Just files, Git, and the engineering practices you use every day.\nGetting Started If you want to try this pattern:\nCreate a repository with a CLAUDE.md (or similar entry file) describing the Agent\u0026rsquo;s role and rules Add a skills/ directory and write one or two task definitions — start with work the team does repeatedly Add a docs/ directory with reference materials the Agent needs — team contacts, project specs, API docs Point your AI assistant at this repository and start assigning it tasks No Kubernetes, no Docker, no infrastructure needed to get started. A local repository plus an AI coding tool is all it takes. The infrastructure layer (Dockerfile, cloud deployment, scheduled tasks) can be added incrementally as needs grow.\nThe best part? When a teammate asks \u0026ldquo;what can this Agent do?\u0026rdquo; — just send them the repo link. All the answers are in the files.\n","permalink":"https://xuezhaojun.github.io/posts/repo-as-agent-en/","summary":"What if an AI Agent\u0026rsquo;s identity, skills, and knowledge were all just files in a Git repository? 
This post introduces a pattern we\u0026rsquo;ve used in practice — no new frameworks, just directories, files, and Git to build production-grade AI Agents.","title":"Repo-as-Agent: Building AI Agents with Git Repositories"},{"content":"Agent 的实体到底是什么？ 当人们说\u0026quot;构建一个 AI Agent\u0026quot;的时候，脑海里浮现的通常是：一个框架、一个协议、一个编排器、一个向量数据库、一套插件系统。还没开始干活，基础设施就已经一大堆了。\n但如果答案可以非常朴素呢？\n一个 Git 仓库 + 一个 AI 编程助手（Claude Code、Cursor 等）= 一个完整的 Agent。\n不需要新框架，不需要新协议。目录、文件、Git——每个工程师最熟悉的基础工具。仓库本身就是 Agent。它的身份、技能、知识、工作流，全部以文件的形式存在，用 Git 做版本管理，和代码一样接受 code review。\n仓库结构 实际的 Agent 仓库长这样：\nmy-team-agent/ ├── CLAUDE.md # Agent 的身份与行为规则 ├── skills/ # 20+ 可复用的任务定义 │ ├── bug-analyze/ │ ├── workspace-clone/ │ ├── jira-triage/ │ └── ... ├── workflows/ # 定时或触发式的多步骤流程 │ ├── daily-bug-triage.md │ ├── daily-standup-prep.md │ └── weekly-pr-report.md ├── solutions/ # \u0026#34;错题本\u0026#34;——已知问题与解决方案 ├── repos/ # 20+ 关联项目（只读引用） ├── team-members/ # 团队花名册与组件归属 ├── docs/ # 参考文档，按需加载 ├── build/ │ └── Dockerfile # 可复现的执行环境 └── deploy/ # Kubernetes 部署清单 根目录下的 CLAUDE.md 是入口。AI 助手打开这个仓库时，首先读取这个文件，从中了解自己是谁、能做什么、知识在哪里。后续所有交互都建立在这个上下文之上。\n这里面最重要的一点：一切都可审计。谁添加了一个技能？什么时候改了工作流？为什么废弃了某个解决方案？git log 和 git blame 能回答所有这些问题。团队成员审核 Agent 的变更，和审核代码一模一样——通过 Pull Request。\n跨仓库视野 在真实的企业环境里，几乎没有任务只存在于单个仓库中。一个 bug 可能在 API 网关层暴露，但根因在认证服务里，修复又需要改动共享 SDK。如果 Agent 一次只能看到一个仓库，它根本做不了端到端的问题分析。\nrepos/ 目录解决了这个问题。它包含 20+ 个相关项目的浅克隆，Agent 可以查阅但不会修改：\n# repos/repos.yaml categories: core: # 团队直接负责的核心项目 - api-server - auth-service - sdk-go - cli-tool dependencies: # 有定制修改的上游依赖 - network-proxy - grpc-fork build: # 构建与 CI 配置 - ci-config - release-pipeline cross-team: # 跨团队协作的组件 - monitoring-stack - notification-service 有了这个全局视野，Agent 可以跨仓库追踪调用链、理解依赖关系、分析一个项目的变更对其他项目的影响。它看到的代码全貌，等于一个资深工程师脑中对所有项目的业务全景——只不过它可以在几秒内 grep 搜索所有代码。\nGit Worktree：并行任务隔离 当任务是端到端执行的——从分析到编码到测试到创建 PR——每个任务的执行时间都很长。你不可能顺序执行，坐在那等一个任务跑完再给下一个。但如果两个任务同时修改同一个仓库，又会互相冲突。\nGit Worktree 是原生的解决方案。从一个 bare clone 出发，可以创建多个工作目录，每个目录在自己的分支上：\n# 任务 1：修复 auth 服务的一个 bug git worktree add 
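../workspace/auth-fix-1234 bugfix/auth-1234 # Task 2: add a new feature to the same auth service git worktree add ../workspace/auth-feature-5678 feature/oauth-5678 Each task operates in its own directory and on its own branch, without affecting the others. After the task is done and the PR is merged, the worktree is cleaned up automatically.\nThis pattern is built into the Agent\u0026rsquo;s workspace management skill. When the Agent receives a task, it automatically creates an isolated worktree, works inside it, and cleans it up when done. Multiple tasks on the same repository can truly run in parallel.\nTeam and Project Knowledge Being able to read code is not enough. To complete end-to-end tasks, the Agent also needs to understand the organization and processes behind the code.\nImagine this scenario: the Agent analyzes a bug, determines that it belongs to the import controller component, writes the fix, creates a PR, and then... needs to notify the right owner on Slack. Who owns the import controller? What is their Slack ID? Which channel should the message go to?\nThe team-members/ directory answers these questions:\n## Core Team | Name | GitHub | Email | Components | |------------|-------------|--------------------|-----------------------------------| | Alice Chen | alice-c | alice@example.com | api-server, sdk-go | | Bob Park | bob-park | bob@example.com | auth-service, cluster-proxy | | Carol Wu | carol-wu | carol@example.com | import-controller, lifecycle-mgr | Beyond team structure, the Agent also knows:\nProject management process: how to create Jira tickets, which fields are required, and how ticket status transitions work Release strategy: which branch maps to which product version, and what the backport rules are Communication channels: which group receives bug analysis reports and which channel receives weekly reports Put simply, this is what you tell a new hire during onboarding: not just \u0026ldquo;how to write the code\u0026rdquo; but also \u0026ldquo;how we work together\u0026rdquo;.\nProgressive Knowledge Loading AI models have a hard limit called the context window: there is an upper bound on how much text they can process at once. If every document, skill definition, and team roster is stuffed in at startup, the Agent is flooded with irrelevant information and performs worse.\nThe solution is progressive disclosure: load only the knowledge the current task needs.\nEach skill file explicitly declares its document dependencies:\n## Reference Loading | Document | When to Load | |-----------------------|---------------------------------------| | team-members.md | Always (for assignment lookups) | | docs/jira.md | When creating or updating Jira tickets| | docs/build-release.md | When the bug involves release branches| | repos/repos.yaml | When identifying which repo to analyze| When the user says \u0026ldquo;show me what new bugs came in today\u0026rdquo;, the Agent loads the bug handling workflow, the team member mapping, and the repository list, but skips the release strategy and CI configuration docs. When the task is \u0026ldquo;backport this fix to release-2.8\u0026rdquo;, the Agent loads the branching strategy and the version mapping table, but skips the Jira API docs.\nDifferent tasks, different knowledge sets. The context window stays focused on what actually matters.\nDockerfile: Reproducible Execution Environment We hit a classic problem early on: the same workflow runs fine on one engineer\u0026rsquo;s machine and fails on another\u0026rsquo;s. The cause is always an inconsistent environment: a different operating system, different tool versions, or a missing command-line tool.\nThe Agent\u0026rsquo;s execution depends on several layers:\nOperating system and shell: macOS and Linux differ in path handling, command arguments, and even 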
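sed syntax Tool versions: kubectl, helm, jq, yq, and language runtimes; a version mismatch in any of them can cause errors that are hard to diagnose Agent runtime: different AI coding tools (Claude Code, Cursor, OpenCode) have different capabilities and tool-calling behavior The Dockerfile pins all of this:\nFROM debian:bookworm-slim # Core tools RUN apt-get update \u0026amp;\u0026amp; apt-get install -y \\ git jq curl wget openssl unzip make gcc g++ # Go toolchain (pinned version) ENV GO_VERSION=1.24.4 RUN curl -fsSL https://go.dev/dl/go${GO_VERSION}.linux-amd64.tar.gz \\ | tar -C /usr/local -xz # CLI tools (pinned versions) RUN curl -LO \u0026#34;https://dl.k8s.io/release/v1.31.4/bin/linux/amd64/kubectl\u0026#34; \u0026amp;\u0026amp; ... RUN curl -fsSL https://get.helm.sh/helm-v3.17.0-linux-amd64.tar.gz \\ | tar -xz \u0026amp;\u0026amp; ... RUN curl -LO https://github.com/mikefarah/yq/releases/download/v4.45.1/yq_linux_amd64 \\ \u0026amp;\u0026amp; ... Once the Dockerfile is checked into the repository, any engineer can build exactly the same environment. No more \u0026ldquo;works on my machine\u0026rdquo; problems. The Agent runs in a container with a fully determined toolchain, independent of the host.\nFrom Local Tool to Always-On Agent Everything above already works well as a local setup: clone the Agent repository, start the AI assistant, and get to work. But the natural next step is to deploy the Agent to the cloud and keep it online 24/7.\nWith the Dockerfile standardizing the environment, you can build a container image and deploy it to a Kubernetes cluster. We built KubeOpenCode on Kubernetes, an AI Agent management platform designed for exactly this. Once deployed, the Agent becomes a persistent service:\nAlways online: no need to keep a terminal open on a laptop Automatic standby: scales down when idle to save resources and wakes up when a task arrives Hot knowledge updates: the Agent repository syncs from Git every few minutes; when a teammate merges a new skill, the running Agent picks up the new capability without a restart Scheduled workflows: cron-triggered tasks that need no manual start # Example: triage new bugs automatically at 9:00 every weekday morning apiVersion: kubeopencode.io/v1alpha1 kind: CronTask metadata: name: daily-bug-triage spec: schedule: \u0026#34;0 9 * * 1-5\u0026#34; # Monday to Friday, 9:00 taskTemplate: spec: description: | Run the daily bug triage workflow. Collect new bugs, analyze relevance, assign owners, post summary to Slack. 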
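With scheduled tasks, the Agent shifts from reacting passively to working proactively. It triages and assigns new bugs before the team standup. Every Friday it generates a PR activity report. Every Monday it cleans up stale bot PRs. What the team sees in the morning is no longer an unsorted backlog but a task list that has already been categorized and analyzed.\nCredentials (API tokens, Git authentication) are managed centrally in Kubernetes Secrets in the cluster, so there is no need to pass .env files around or have everyone configure them individually.\nThe Core Formula The whole pattern comes down to a simple formula:\nCode repository + AI coding assistant = Agent The repository provides identity, skills, knowledge, and workflows. The AI assistant provides reasoning, execution, and natural-language interaction. Together, they form an Agent with these properties:\nVersion-controlled: every change is recorded, reviewable, and reversible Auditable: git log shows the complete history of the Agent\u0026rsquo;s evolution Collaborative: team members contribute skills and knowledge through PRs Portable: works with any AI coding assistant that can read files Reproducible: the Dockerfile ensures consistent behavior across environments Scalable: deploy to Kubernetes for 24/7 operation and scheduled tasks No vendor lock-in to a specific Agent framework. No proprietary plugin formats. No complex orchestration layer. Just files, Git, and the engineering practices you use every day.\nGetting Started If you want to try this pattern:\nCreate a repository with a CLAUDE.md (or similar entry file) describing the Agent\u0026rsquo;s role and rules Add a skills/ directory and write one or two task definitions, starting with work the team does repeatedly Add a docs/ directory with the reference materials the Agent needs, such as team contacts, project specs, and API docs Point your AI assistant at this repository and start assigning it tasks No Kubernetes, no Docker, no infrastructure needed to get started. A local repository plus an AI coding tool is all it takes. The infrastructure layer (Dockerfile, cloud deployment, scheduled tasks) can be added incrementally as needs grow.\nThe best part? When a teammate asks \u0026ldquo;what can this Agent do?\u0026rdquo;, just send them the repo link. All the answers are in the files.\n","permalink":"https://xuezhaojun.github.io/posts/repo-as-agent/","summary":"What if an AI Agent\u0026rsquo;s identity, skills, and knowledge were all just files in a Git repository? This post introduces a pattern we\u0026rsquo;ve used in practice: no new frameworks, just directories, files, and Git to build production-grade AI Agents.","title":"Repo-as-Agent: Building AI Agents with Git Repositories"},{"content":"Cluster Management Cluster-Proxy core owner (106 merged PRs)\nUnder OCM\u0026rsquo;s Hub-Spoke architecture, uses a reverse tunnel to let users access target Services on the Agent side (including the KubeAPI Server) directly from the Hub side, solving the core problem that in Pull Mode the Hub cannot initiate connections to managed clusters Core ACM components such as Console, Application, and Observability all depend on this component as the base network layer for Hub→Spoke access Import Controller core developer (183 merged PRs)\nThe key bridging component from open-source OCM to enterprise-grade ACM: provides auto-import onboarding and cross-cloud, multi-vendor cluster integration (AWS/Azure/GCP/private cloud), making ACM usable in real production environments OCM Core Maintainer (1400+ merged PRs, 270+ PR reviews), covering core modules including Registration, Workload, Placement, Add-on Framework, and SDK. Full-stack Go development, with 5 years of practice in the CRD + Controller / Operator pattern.\nRegistration module Approver: responsible for cluster registration and identity authentication (CSR issuance, automatic certificate rotation, Lease heartbeat monitoring), and for code approval and quality control of the module Switch Hub: implemented online migration of clusters between Hubs, supporting horizontal scaling of Global Hub (Hub of Hub) beyond the per-Hub cluster limit AI + K8s Platform Capabilities KubeOpenCode: a K8s-native AI Agent platform, built entirely solo\nK8s-native design: Agent, Task, and CronTask are all CRDs, and the lifecycle and scheduling of AI workloads are managed through the K8s API Architecture design + frontend and backend development + documentation site + internal and external promotion, 99% AI-assisted development A 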
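Distinguished Engineer proactively pushed for its inclusion as one of Red Hat\u0026rsquo;s internal AI incubation projects Repo-as-Agent methodology: proposed and implemented independently; Git repo = the Agent itself (identity, skills, knowledge, and workflows all under version control), 28 reusable skills covering 20+ repositories\nProduction use cases:\nAutomated handling of Tekton image changes: integrated with konflux-build-catalog (workflow L247), with detection → analysis → modification → PR → merge fully automated CronTask scheduled jobs: daily analysis of Scrum status, daily analysis of new bugs, weekly handling of bot PRs Initiative Agile adoption: earned the PSM certification on own initiative and drove adoption of agile development processes in the team\nEfficiency: created the centralized konflux-build-catalog approach, eliminating hundreds of duplicate Tekton update PRs per week across 60-70 repos; the approach was adopted by the whole org\nCost optimization: proactively reviewed the AWS test cluster configuration for the whole group (storage type io1→gp3, a 96% reduction; instance type m5→t3; clusters tiered by test scenario into HA and Lite, with HA for high-availability scenarios and Lite for routine testing), cutting the monthly bill from $5,000 to $2,000 and saving $36K per year\nMaintainability: led the merge of Registration/Work/Placement and other repositories into a mono repo, unifying dependency management and CI/CD and reducing cross-repository maintenance overhead\nDocs and community: led the restructuring of the OCM community documentation site (PR #429, +1,856 / -11,626 lines): migrated to the Google Docsy theme (the standard choice for CNCF projects such as K8s, Istio, and gRPC), removed the unmaintained Chinese docs (in 48 of 55 files only the title was in Chinese), cut the file count by 46%, and lowered the barrier to community participation\nEarly Career 爱美购: Software Engineer (2019 – 2020), cross-border e-commerce platform development, Shenzhen\n共济科技: Software Engineer (2017 – 2019), data center infrastructure management software, Shenzhen\n","permalink":"https://xuezhaojun.github.io/resume/","summary":"Detailed Red Hat project experience","title":"Red Hat Project Experience"},{"content":"Contact Email: xuezhaokeepgoing@gmail.com Phone: +86-15626173020 GitHub: xuezhaojun Technical Focus Kubernetes / Multi-Cluster Management / Go — ACM | OCM Agentic Engineering — KubeOpenCode Experience Red Hat — Senior Software Engineer (2021–2026)\nEducation HKUST — Master of Information Technology\nSYSU — Bachelor of Software Engineering\nCertifications CKA PSM II ","permalink":"https://xuezhaojun.github.io/about/","summary":"About me","title":"Zhao Xue"},{"content":"Contact Email: xuezhaokeepgoing@gmail.com Phone: +86-15626173020 GitHub: xuezhaojun Technical Focus Kubernetes / Multi-Cluster Management / Go: ACM | OCM AI Agent Engineering: KubeOpenCode Experience Red Hat: Senior Software Engineer (2021–2026)\nEducation Hong Kong University of Science and Technology (HKUST): Master of Information Technology\nSun Yat-sen University (SYSU): Bachelor of Software Engineering\nCertifications CKA PSM II ","permalink":"https://xuezhaojun.github.io/about-zh/","summary":"About me","title":"薛昭"}]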