Troubleshooting a Memory Leak in the rum collect Service

Since taking over the rum collect service we have gone through several rounds of optimization, covering performance, high availability, and more, with solid results: startup time, for instance, dropped from 60 seconds to 3 seconds. But one stubborn memory leak remained unsolved. After several handovers it landed with me, and because I was using the tooling the wrong way I failed to find the cause for a long time, so I am writing this post as a record and a self-review.

Environment and Symptoms

The rum collect service is an ingestion service deployed on the TKEX-CSIG cluster. It receives the massive volume of data reported by SDKs, processes it, and forwards it to various platforms. The service depends on Redis for project information, rate limiting, allowlists, and so on, and downstream it feeds platforms such as 智研 and TCCM.

TKEX-CSIG comes with node-exporter monitoring, and the container metrics show memory leaking slowly, at roughly 0.5 GB per day.

First Attempt

Based on past experience, out of 10 memory leaks, 8 are goroutine leaks, 1 is CGO, and at most 1 is a genuine heap leak. So I enabled the pprof router on the collect service and set about capturing profiles:
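The commands looked roughly like this, assuming the standard net/http/pprof endpoints (the 120-second window matches the Duration fields in the output below):

❯ curl -o '2025-04-28 heap.prof' 'http://localhost:6060/debug/pprof/heap?seconds=120'
❯ curl -o '2025-04-28 goroutine.prof' 'http://localhost:6060/debug/pprof/goroutine?seconds=120'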

The 2025-04-28 capture showed nothing obviously abnormal:

❯ go tool pprof  '2025-04-28 heap.prof'
File: collect_svr
Type: inuse_space
Time: 2025-04-28 15:19:25 CST
Duration: 120.43s, Total samples = 185.07MB
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 68.29MB, 36.90% of 185.07MB total
Dropped 45 nodes (cum <= 0.93MB)
Showing top 10 nodes out of 207
      flat  flat%   sum%        cum   cum%
   42.01MB 22.70% 22.70%    42.01MB 22.70%  internal/profile.(*Profile).postDecode
       6MB  3.24% 25.94%    10.50MB  5.67%  internal/profile.glob..func2
    5.21MB  2.81% 28.76%     5.21MB  2.81%  bytes.makeSlice
       5MB  2.70% 31.46%        5MB  2.70%  git.code.oa.com/TAM/tam_collect_svr/internal/data.(*projectRepo).sync
    3.50MB  1.89% 33.35%     3.50MB  1.89%  fmt.Sprintf
    2.50MB  1.35% 34.70%     4.50MB  2.43%  git.code.oa.com/TAM/tam_collect_svr/internal/biz/domain.(*metricsCommon).GetMetricsLabels
    2.50MB  1.35% 36.05%     2.50MB  1.35%  internal/profile.decodeInt64s
      -2MB  1.08% 34.97%       -2MB  1.08%  net/textproto.(*Reader).ReadMIMEHeader
       2MB  1.08% 36.05%        2MB  1.08%  internal/profile.glob..func19
    1.57MB  0.85% 36.90%     1.57MB  0.85%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*MetricList).Marshal
❯ go tool pprof  '2025-04-28 goroutine.prof'
File: collect_svr
Type: goroutine
Time: 2025-04-28 15:43:35 CST
Duration: 120.04s, Total samples = 29
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 8, 27.59% of 29 total
Showing top 10 nodes out of 105
      flat  flat%   sum%        cum   cum%
         8 27.59% 27.59%          8 27.59%  runtime.gopark
         1  3.45% 31.03%          1  3.45%  git.code.oa.com/TAM/tam_collect_svr/internal/data.IsTsmFields
        -1  3.45% 27.59%         -1  3.45%  reflect.Value.Convert
         1  3.45% 31.03%          1  3.45%  runtime.growWork_faststr
        -1  3.45% 27.59%         -1  3.45%  runtime.mapaccess1_faststr
         1  3.45% 31.03%          1  3.45%  runtime.newobject
        -1  3.45% 27.59%         -1  3.45%  runtime.slicebytetostring
         1  3.45% 31.03%          1  3.45%  sort.insertionSort
        -1  3.45% 27.59%         -1  3.45%  syscall.Syscall6
         0     0% 27.59%         -2  6.90%  bufio.(*Reader).Peek

Apparently the problem was hiding deep. So I waited a week, captured again on 2025-05-08, and compared:

❯ go tool pprof  '2025-05-08 heap.prof'
File: collect_svr
Type: inuse_space
Time: 2025-05-08 15:49:40 CST
Duration: 120.54s, Total samples = 253.20MB
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 79.72MB, 31.48% of 253.20MB total
Dropped 21 nodes (cum <= 1.27MB)
Showing top 10 nodes out of 232
      flat  flat%   sum%        cum   cum%
   53.01MB 20.94% 20.94%    53.01MB 20.94%  internal/profile.(*Profile).postDecode
   10.91MB  4.31% 25.25%    13.41MB  5.30%  internal/profile.glob..func2
    5.48MB  2.16% 27.41%     5.48MB  2.16%  bytes.makeSlice
       4MB  1.58% 28.99%        4MB  1.58%  git.code.oa.com/TAM/tam_collect_svr/internal/data.(*metricsTCCMExporter).getTagsFromSampleLabel
    3.88MB  1.53% 30.53%     3.88MB  1.53%  runtime/pprof.(*profMap).lookup
    3.50MB  1.38% 31.91%     3.50MB  1.38%  strconv.appendQuotedWith
      -3MB  1.18% 30.72%    -1.50MB  0.59%  git.code.oa.com/cm-metrics/tccm-client-go/internal/metric.(*CustomMetric).AddTags
   -2.51MB  0.99% 29.73%       -1MB  0.39%  git.code.oa.com/TAM/tam_collect_svr/internal/biz/domain.(*LogMessage).GetRecordFieldMaps
    2.50MB  0.99% 30.72%        6MB  2.37%  fmt.Sprintf
    1.93MB  0.76% 31.48%     1.93MB  0.76%  google.golang.org/protobuf/internal/encoding/json.(*Decoder).parseString
❯ go tool pprof  '2025-05-08 goroutine.prof'
File: collect_svr
Type: goroutine
Time: 2025-05-08 15:49:55 CST
Duration: 120.08s, Total samples = 28
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -4, 14.29% of 28 total
Showing top 10 nodes out of 87
      flat  flat%   sum%        cum   cum%
        -3 10.71% 10.71%         -3 10.71%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*Metric).Size
        -1  3.57% 14.29%         -1  3.57%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*Tag).MarshalToSizedBuffer
         1  3.57% 10.71%          1  3.57%  github.com/stretchr/stew/slice.Contains
        -1  3.57% 14.29%         -1  3.57%  regexp.(*machine).add
         1  3.57% 10.71%          1  3.57%  runtime.mapaccess1_faststr
        -1  3.57% 14.29%         -1  3.57%  runtime.newobject
         0     0% 14.29%         -1  3.57%  bufio.(*Reader).Peek
         0     0% 14.29%         -1  3.57%  bufio.(*Reader).ReadLine
         0     0% 14.29%         -1  3.57%  bufio.(*Reader).ReadSlice
         0     0% 14.29%         -2  7.14%  bufio.(*Reader).fill

Hmm. There is no obvious inuse space that "persists and keeps growing", and the goroutine count shows no clear growth either.

Let's add a baseline comparison:

❯ go tool pprof --base '2025-04-28 heap.prof' '2025-05-08 heap.prof'
File: collect_svr
Type: inuse_space
Time: 2025-04-28 15:19:25 CST
Duration: 240.97s, Total samples = 318.43MB
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 14.29MB, 4.49% of 318.43MB total
Dropped 46 nodes (cum <= 1.59MB)
Showing top 10 nodes out of 260
      flat  flat%   sum%        cum   cum%
      11MB  3.46%  3.46%       11MB  3.46%  internal/profile.(*Profile).postDecode
      -5MB  1.57%  1.88%       -5MB  1.57%  git.code.oa.com/TAM/tam_collect_svr/internal/data.(*projectRepo).sync
    4.91MB  1.54%  3.43%     2.91MB  0.91%  internal/profile.glob..func2
   -4.01MB  1.26%  2.17%    -3.51MB  1.10%  git.code.oa.com/TAM/tam_collect_svr/internal/biz/domain.(*LogMessage).GetRecordFieldMaps
    3.88MB  1.22%  3.39%     3.88MB  1.22%  runtime/pprof.(*profMap).lookup
   -3.50MB  1.10%  2.29%     0.50MB  0.16%  git.code.oa.com/cm-metrics/tccm-client-go/internal/metric.(*CustomMetric).AddTags
    3.50MB  1.10%  3.39%     3.50MB  1.10%  strconv.appendQuotedWith
    3.13MB  0.98%  4.37%     3.13MB  0.98%  google.golang.org/protobuf/internal/encoding/json.(*Decoder).parseString
       3MB  0.94%  5.31%        3MB  0.94%  git.code.oa.com/TAM/tam_collect_svr/internal/data.(*metricsTCCMExporter).getTagsFromSampleLabel
   -2.63MB  0.83%  4.49%    -2.63MB  0.83%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*MetricList).Marshal
❯ go tool pprof --base '2025-04-28 goroutine.prof' '2025-05-08 goroutine.prof'
File: collect_svr
Type: goroutine
Time: 2025-04-28 15:43:35 CST
Duration: 240.12s, Total samples = 41
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for -13, 31.71% of 41 total
Showing top 10 nodes out of 129
      flat  flat%   sum%        cum   cum%
        -8 19.51% 19.51%         -8 19.51%  runtime.gopark
        -3  7.32% 26.83%         -3  7.32%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*Metric).Size
         2  4.88% 21.95%          2  4.88%  runtime.mapaccess1_faststr
        -2  4.88% 26.83%         -2  4.88%  runtime.newobject
        -1  2.44% 29.27%         -1  2.44%  git.code.oa.com/TAM/tam_collect_svr/internal/data.IsTsmFields
        -1  2.44% 31.71%         -1  2.44%  git.code.oa.com/cm-metrics/sdk_metric_pb.(*Tag).MarshalToSizedBuffer
         1  2.44% 29.27%          1  2.44%  github.com/stretchr/stew/slice.Contains
         1  2.44% 26.83%          1  2.44%  reflect.Value.Convert
        -1  2.44% 29.27%         -1  2.44%  regexp.(*machine).add
        -1  2.44% 31.71%         -1  2.44%  runtime.growWork_faststr

Still no anomaly. Here is the flame graph of heap inuse space:

And the goroutine call graph:

Again, nothing obviously wrong.

I started to suspect the tooling itself, so I turned to flamegraph.com for a more visual look at what was driving the memory growth:

Going through the fast-growing suspects one by one, they all turn out to be related to logging, JSON, or even net/http; these all scale with the request volume at capture time and prove nothing.

The comparison can be viewed via this link.

At this point the investigation hit a wall.

Bringing in Trace Profiling

I captured a runtime trace in the same fashion:
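For example, with the same endpoint listed in the best-practices section at the end of this post:

❯ curl -o trace.prof 'http://localhost:6060/debug/pprof/trace?seconds=8'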

Then I compared the two traces with the third-party trace UI gotraceui:

On the left is the trace captured on 2025-04-28, on the right the one from 2025-05-08. The traces show a large number of goroutines blocked in git.code.oa.com/trpc-go/trpc-config-rainbow.(*KV).Watch.func1 and internal/pkg/config.(*appConfigLoaderRainbowImpl).Watch.func2, and their number had grown enormously.

That looked like the culprit. Reading the code showed that the config module starts two goroutines every minute to pull 七彩石 (Rainbow) configuration, and for some yet-unknown reason they get stuck in the chan receive state.

After enabling block profiling (see below for how to configure it), I found the corresponding blocking stack:

4670768 124 @ 0x40d172 0x17a39cc 0xcc3d02 0x18eb4f7 0x18eb3e5 0x47cd01
#       0x40d171        runtime.chanrecv1+0x11                                                                                  /data/home/oliverdding/.local/opt/go/current/src/runtime/chan.go:442
#       0x17a39cb       git.code.oa.com/trpc-go/trpc-config-rainbow.(*KV).Watch+0x70b                                           /data/home/oliverdding/Developer/tam_collect_svr/vendor/git.code.oa.com/trpc-go/trpc-config-rainbow/kv_config.go:84
#       0xcc3d01        git.code.oa.com/TAM/tam_collect_svr/internal/pkg/config.(*appConfigLoaderRainbowImpl).Watch+0x221       /data/home/oliverdding/Developer/tam_collect_svr/internal/pkg/config/rainbow.go:61
#       0x18eb4f6       git.code.oa.com/TAM/tam_collect_svr/cmd/interface.loadConfig.func1.2+0xb6                               /data/home/oliverdding/Developer/tam_collect_svr/cmd/interface/config.go:55
#       0x18eb3e4       git.code.oa.com/TAM/tam_collect_svr/cmd/interface.loadConfig.func1+0x1a4                                /data/home/oliverdding/Developer/tam_collect_svr/cmd/interface/config.go:65

The code in question is in the trpc-config-rainbow library. From the surrounding context it is clear that collect calls the Watch method once every minute to try to fetch the latest configuration.

func (k *KV) Watch(ctx context.Context, key string, opts ...config.Option) (<-chan config.Response, error) {
	if err := k.stream.Check(); err != nil {
		return nil, fmt.Errorf("kv watch %s: stream check err: %w", key, err)
	}

	req := &WatchReq{
		key:  key,
		recv: make(chan Response, channelSize),
		done: make(chan struct{}),
	}

	res, err := k.Get(ctx, key, opts...)
	if err != nil {
		return nil, fmt.Errorf("kv watch: get key %s err: %w", key, err)
	}

	req.recv <- Response{
		key:   key,
		value: res.Value(),
		event: config.EventTypePut,
		meta:  res.MetaData(),
	}

	k.stream.AddWatcher(req)
	c := k.watch(key, req.recv)

	<-req.done
	return c, nil
}

At first glance nothing ever sends on or closes the req.done channel. Tracing upward, req is handed to k.stream.AddWatcher(req); could that path touch the channel?

func (w *WatchStream) loop() {
	notify := make(chan struct{}, channelSize)
	buf := make([]Response, 0)
	for {
		select {
		case resp := <-w.cbChan:
			buf = append(buf, resp)
			prepareNotify(notify)

		case <-notify:
			buf = notifying(buf, w.outc)

		case req := <-w.submit:
			w.outc = append(w.outc, req.Receive())
			close(req.Done())
			prepareNotify(notify)
		}
	}

}

A submitted req is appended to the outc slice and its done channel is then closed, so Watch should not stay blocked here. Yet this loop itself shows massive blocking:

733270955104 147 @ 0x4593b7 0x17a9586 0x17a919c 0x47cd01
#	0x4593b6	runtime.selectgo+0x7f6							/data/home/oliverdding/.local/opt/go/current/src/runtime/select.go:508
#	0x17a9585	git.code.oa.com/trpc-go/trpc-config-rainbow.(*WatchStream).loop+0x165	/data/home/oliverdding/Developer/tam_collect_svr/vendor/git.code.oa.com/trpc-go/trpc-config-rainbow/stream.go:106
#	0x17a919b	git.code.oa.com/trpc-go/trpc-config-rainbow.(*WatchStream).Run+0x3b	/data/home/oliverdding/Developer/tam_collect_svr/vendor/git.code.oa.com/trpc-go/trpc-config-rainbow/stream.go:80

Reading on to figure out when data arrives on w.cbChan, I found that a nested anonymous function acts as the callback: whenever a configuration pull succeeds it is invoked with the latest configuration file. Debugging also showed that this function runs as an infinite loop, which settles the question: Watch is meant to be called exactly once, as a watcher that delivers subsequent changes; calling it repeatedly misuses the library and creates watcher goroutines without bound.
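In other words, the fix is to call Watch a single time and keep draining the channel it returns. A minimal sketch, not the project's actual code, assuming the Watch signature shown above; watchAppConfig, watchKey, and applyConfig are placeholder names:

func watchAppConfig(ctx context.Context, kv *KV, watchKey string) error {
	// One Watch call registers one watcher; the returned channel delivers
	// every subsequent configuration change.
	ch, err := kv.Watch(ctx, watchKey)
	if err != nil {
		return err
	}
	go func() {
		for resp := range ch { // resp is a config.Response
			applyConfig(resp.Value()) // hand the latest value to the application
		}
	}()
	return nil
}

// What collect was doing instead: calling Watch from a one-minute timer,
// which registers a fresh watcher and strands more goroutines on every tick.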

Reflections on the Troubleshooting Approach

Out of habit I had always passed the seconds parameter to capture a delta profile when analyzing memory leaks. For fast leaks, say where the amount leaked tracks the request volume, this does surface the problem. But for slow leaks that accumulate over minutes or hours it shows nothing, because a delta profile only covers a few minutes at most.

So when planning a pprof investigation, first capture the current state without the seconds parameter, and only then decide whether a delta profile is needed to study the trend. Sticking to my old habit is exactly the tool misuse that cost me here.
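For example, take a plain snapshot first (the same goroutine endpoint as before, just without seconds):

❯ curl -o goroutine.no_time.0.prof 'http://localhost:6060/debug/pprof/goroutine'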

❯ go tool pprof goroutine.no_time.0.prof
File: collect_svr
Type: goroutine
Time: 2025-05-12 11:36:06 CST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 30110, 100% of 30120 total
Dropped 204 nodes (cum <= 150)
      flat  flat%   sum%        cum   cum%
     30110   100%   100%      30110   100%  runtime.gopark
         0     0%   100%      14975 49.72%  git.code.oa.com/TAM/tam_collect_svr/internal/pkg/config.(*appConfigLoaderRainbowImpl).Watch.func1
         0     0%   100%      14975 49.72%  git.code.oa.com/trpc-go/trpc-config-rainbow.(*KV).Watch.func1
         0     0%   100%      29989 99.57%  runtime.chanrecv
         0     0%   100%      29980 99.54%  runtime.chanrecv2

❯ go tool pprof --base goroutine.no_time.0.prof goroutine.no_time.1.prof
File: collect_svr
Type: goroutine
Time: 2025-05-12 11:36:06 CST
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 35, 39.33% of 89 total
Showing top 10 nodes out of 113
      flat  flat%   sum%        cum   cum%
        41 46.07% 46.07%         41 46.07%  runtime.gopark
        -1  1.12% 44.94%         -1  1.12%  bytes.(*Buffer).WriteString
        -1  1.12% 43.82%         -1  1.12%  go.uber.org/zap.(*Logger).check
        -1  1.12% 42.70%         -1  1.12%  internal/reflectlite.Swapper.func9
        -1  1.12% 41.57%         -1  1.12%  runtime.assertE2I2
        -1  1.12% 40.45%         -1  1.12%  strconv.appendEscapedRune
        -1  1.12% 39.33%         -1  1.12%  syscall.Syscall
         0     0% 39.33%         -8  8.99%  bufio.(*Reader).Peek
         0     0% 39.33%          2  2.25%  bufio.(*Reader).ReadLine
         0     0% 39.33%          2  2.25%  bufio.(*Reader).ReadSlice

In addition, reading the pprof documentation turned up two better ways to investigate.

The debug Parameter

Passing the debug parameter makes the endpoint return the current state as plain text, which is handy for aggregate information.

For example, debug=1 returns the goroutine counts grouped by identical stack:

❯ curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'                 
goroutine profile: total 30103
14974 @ 0x43a856 0x4085ec 0x408058 0x96233a 0x46ac81
#       0x962339        git.code.oa.com/TAM/tam_collect_svr/internal/pkg/config.(*appConfigLoaderRainbowImpl).Watch.func1+0x119 /go/src/tam_collect_svr/internal/pkg/config/rainbow.go:72

14974 @ 0x43a856 0x4085ec 0x408058 0xff5ffa 0x46ac81
#       0xff5ff9        git.code.oa.com/trpc-go/trpc-config-rainbow.(*KV).Watch.func1+0xd9      /go/src/tam_collect_svr/vendor/git.code.oa.com/trpc-go/trpc-config-rainbow/kv_config.go:69

At a glance, 14974 goroutines are parked at each of these two configuration-watch frames, accounting for nearly all of the 30103 goroutines.

And debug=2 gives a detailed dump of every goroutine and its state; the following one-liner aggregates the counts by state:

❯ curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 | grep -E '^goroutine [0-9]+ \[' | sed -E 's/^goroutine [0-9]+ \[([^]]+)\].*$/\1/' | awk -F',' '{print $1}' | sort | uniq -c | sort -nr
  30044 chan receive
    419 runnable
     46 select
     24 semacquire
     15 IO wait
     12 sleep
      4 sync.Cond.Wait
      2 syscall
      1 running

This matches the code: the configuration-update goroutines are all blocked in the chan receive state.

Block Profiling

After calling runtime.SetBlockProfileRate you can capture the block profile to analyze where the program is currently blocking; this, too, reveals the goroutines stuck in a blocked state.
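A minimal sketch of turning it on (the rate of 1 is my choice here and records every blocking event; the profile is then fetched with the go tool pprof command listed in the best practices below):

import "runtime"

// enableBlockProfiling turns on block profiling. A rate of 1 records every
// blocking event, the most verbose setting; pass 0 to switch it off again
// once the investigation is done.
func enableBlockProfiling() {
	runtime.SetBlockProfileRate(1)
}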

Best Practices for Memory Leak Investigation

  1. Get the current goroutine state of the process
    • Grouped by stack: curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
    • Grouped by goroutine state: curl -s http://localhost:6060/debug/pprof/goroutine?debug=2 | grep -E '^goroutine [0-9]+ \[' | sed -E 's/^goroutine [0-9]+ \[([^]]+)\].*$/\1/' | awk -F',' '{print $1}' | sort | uniq -c | sort -nr
  2. Analyze the heap trend
    • Either add the seconds parameter and analyze the change over that window directly
    • Or take snapshots at two points in time and compare them with go tool pprof --base
  3. If more information is still needed
    • Capture a trace with curl -o trace.prof 'http://localhost:6060/debug/pprof/trace?seconds=8' to analyze what the goroutines are doing
    • Call runtime.SetBlockProfileRate, then run go tool pprof http://localhost:6060/debug/pprof/block to inspect blocking