Common Causes and Diagnosis of 100% CPU in Java Programs

Last week, a team ramped a new feature of one business flow to 1% of traffic. Less than a day after the rollout, the developers noticed that a downstream application in that flow kept losing servers: the CPU was pegged at 100% and the hosts could no longer handle new requests. As a quick mitigation, a colleague first took two heap dumps and two screenshots of per-thread CPU usage over an interval, then asked the developers to turn the new feature off while the downstream servers were fully restarted. That resolved the incident.

Although the incident was mitigated, the root cause was not found at the time. When we checked the logs afterwards, we discovered that during that window some machines had not shipped their logs to the central log server at all, and some local logs had been overwritten. From the CPU usage screenshots, many threads were looking up values in a cache at the time, and a few threads were inside HashMap.get(), which looked very much like the classic HashMap infinite-loop problem; yet some of the other CPU-hungry threads were not in HashMap at all. Reading through those thread stacks finally pinned the problem down to the following code in the business logic:

TreeNode curNode = node.getParentNode(); // node: the hypothetical current node; start the search from its parent
Integer curVal = null;
while (null != curNode && null == curVal) {
    curVal = getSomeValueFromConfigCache(curNode);
    // BUG: curNode is never advanced to its parent inside the loop
}

A close look shows the problem. These nodes form a tree: if the lookup on the current node misses, the code is supposed to fall back to the parent. But the "walk up to the parent" step was never moved inside the while loop, so with the code as written, whenever the lookup returns null the while loop spins forever. The fix is trivial (a corrected sketch follows), but the price of the bug was steep: in testing, perhaps 99% of the data resolved on the current node, yet there was the 1% that had to walk up to a parent node to find a value.
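A minimal corrected sketch, reusing the hypothetical TreeNode and getSomeValueFromConfigCache() helper from the snippet above; the only real change is that the walk up to the parent now happens inside the loop, so the search terminates at the root even when every lookup misses:

TreeNode curNode = node.getParentNode();
Integer curVal = null;
while (null != curNode && null == curVal) {
    curVal = getSomeValueFromConfigCache(curNode);
    if (null == curVal) {
        curNode = curNode.getParentNode(); // walk up one level per miss; null at the root ends the loop
    }
}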

CPU-100% problems like this commonly fall into the following categories:

  1. Insufficient JVM memory or a memory leak triggers the GC-overhead problem, and the GC work itself drives the CPU to 100%;
  2. An infinite loop, such as the classic infinite loop caused by a HashMap being used concurrently by multiple threads, or the loop in the example above;
  3. Some extremely CPU-intensive operation keeps running. In one past case, a regular expression was used to check whether inputs matched a rule, but some inputs were tens of kilobytes or longer and the regex was badly written, so whenever the CPU hit such an input it got pegged.

For the first type, the GC overhead ratio tells the story: the CPU is consumed entirely by GC, and the GC overhead approaches 100%. The next step is to take a heap dump and analyze the verbose GC log together with the heap dump.
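As an illustration of the first category, here is a minimal, self-contained sketch (the class name is made up): an unbounded static list keeps the heap nearly full, so the JVM spends almost all of its CPU in back-to-back full GCs before finally throwing an OOM error.

import java.util.ArrayList;
import java.util.List;

public class GcOverheadDemo {
    static final List<byte[]> LEAK = new ArrayList<>();

    public static void main(String[] args) {
        // run with a small heap, e.g. -Xmx64m -verbose:gc, to reproduce quickly;
        // shortly before the OOM, GC overhead approaches 100% of CPU time
        while (true) {
            LEAK.add(new byte[1024 * 1024]); // 1M per iteration, never released
        }
    }
}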

For the second type, sample the CPU usage several times, for example the percentage of CPU each thread consumes within a 1-second interval. Since the CPU is already at 100%, it is easy to see which threads are burning it; then analyze their stack traces. Typically, the stack frames outside the infinite loop are identical in every sample, while the frames inside the loop vary a little between samples. By combining the local variables and parameters in the heap dump with the call relationships in the stack, you can locate the loop. For the HashMap case, a few thread dumps are usually enough to confirm it, and the heap dump then reveals the circular entry chain.

The third type is handled much like the second: first check whether any stack frames look inherently CPU-expensive; if so, go read the corresponding source code, and finally validate the hypothesis against the data in the heap dump. The sketch below shows how a badly written regex produces exactly this symptom.
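To make the third category concrete, here is a hedged, self-contained sketch (the class name and pattern are made up): a regex with nested quantifiers backtracks catastrophically on inputs that almost match.

import java.util.regex.Pattern;

public class RegexCpuDemo {
    public static void main(String[] args) {
        // nested quantifiers such as (a+)+ force the engine to retry every
        // possible split of the input once the overall match fails
        Pattern bad = Pattern.compile("(a+)+$");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 40; i++) {
            sb.append('a');
        }
        sb.append('!'); // the trailing '!' guarantees the match fails
        // on this ~40-char input, matches() can pin a CPU core for a very long
        // time; a multi-kilobyte input would effectively never return
        System.out.println(bad.matcher(sb).matches());
    }
}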

OpenSSL Basics

  1. Purpose/role: a building block of PKI;

  2. Architecture/components: keys, certificates, CAs, CA chains;

  3. OpenSSL contains a cryptographic library and an SSL/TLS toolkit;

  4. An implementation of SSL/TLS, and the de facto standard;

  5. Uses the Privacy-Enhanced Mail (PEM) format for keys and certificates;

  6. Background: open source, widely used, high performance;

  7. Ships with a command-line tool, which is handy for debugging:

$ openssl version
OpenSSL 1.0.2g  1 Mar 2016
$ openssl version -a
OpenSSL 1.0.2g  1 Mar 2016
built on: reproducible build, date unspecified
platform: debian-amd64
options:  bn(64,64) rc4(16x,int) des(idx,cisc,16,int) blowfish(idx)
compiler: cc -I. -I.. -I../include  -fPIC -DOPENSSL_PIC ... -DECP_NISTZ256_ASM
OPENSSLDIR: "/usr/lib/ssl"
$ ls -la /usr/lib/ssl  // the OPENSSLDIR configuration directory
lrwxrwxrwx  1 root root   14 Jun 20  2018 certs -> /etc/ssl/certs
drwxr-xr-x  2 root root 4096 Jan 10 10:55 misc  // assorted helper scripts
lrwxrwxrwx  1 root root   20 Dec  4 13:07 openssl.cnf -> /etc/ssl/openssl.cnf
lrwxrwxrwx  1 root root   16 Jun 20  2018 private -> /etc/ssl/private

$ openssl help

$ man ciphers
$ openssl genrsa -aes128 -out fd.key 2048  // generate an AES-128-encrypted private key
$ openssl rsa -text -in fd.key  // inspect the key file
$ openssl rsa -in fd.key -pubout -out fd-public.key  // derive the public key from the private key

The Linux find Command

  1. find ~/ -name a.txt
  2. find . -type f -name tech //find a file named tech
  3. find . -type d -name tech //find a directory named tech
  4. find . -iname tech //ignore case: matches TECH, tech, Tech, etc.
  5. find /var -name "*.log"
  6. find . -type f -perm 0777 -print //find all files whose permission is exactly 777
  7. find / -type f ! -perm 777 //find all files without permission 777
  8. find / -perm /u=r //find files with the owner-read bit set
  9. find / -perm /a=x //find files with any execute bit set
  10. find / -name foo.bar -print 2>/dev/null //send "Permission denied" errors to /dev/null
  11. find . -maxdepth 2 -name "*.bar" -print //only search 2 directories deep (quote the glob; -maxdepth goes before other tests)
  12. find ./dir1 ./dir2 -name foo.bar -print //search 2 directories
  13. find /some/directory -type l -print //find symbolic links
    type:
b    block (buffered) special 
c    character (unbuffered) special 
d    directory 
p    named pipe (FIFO) 
f     regular file 
l     symbolic link 
s    socket 

There are, however, other expressions you can use as follows:

-amin n - The file was last accessed n minutes ago
-anewer - The file was last accessed more recently than it was modified
-atime n - The file was last accessed n days ago
-cmin n - The file was last changed n minutes ago
-cnewer - The file was last changed more recently than the file was modified
-ctime n - The file was last changed n days ago
-empty - The file is empty
-executable - The file is executable
-false - Always false
-fstype type - The file is on the specified file system
-gid n - The file belongs to group with the ID n
-group groupname - The file belongs to the named group
-ilname pattern - Search for a symbolic link but ignore case
-iname pattern - Search for a file but ignore case
-inum n - Search for a file with the specified inode number
-ipath path - Search for a path but ignore case
-iregex expression - Search for files matching a regular expression, ignoring case
-links n - Search for a file with the specified number of links
-lname name - Search for a symbolic link
-mmin n - File's data was last modified n minutes ago
-mtime n - File's data was last modified n days ago
-name name - Search for a file with the specified name
-newer name - Search for a file edited more recently than the file given
-nogroup - Search for a file with no group id
-nouser - Search for a file with no user attached to it
-path path - Search for a path
-readable - Find files which are readable
-regex pattern - Search for files matching a regular expression
-type type - Search for a particular type
-uid uid - The file's numeric user ID matches uid
-user name - The file is owned by the specified user
-writable - Search for files that can be written to

SRE Case Study: Triaging a Non-Heap JVM Out of Memory Issue

Original post: https://www.ebayinc.com/stories/blogs/tech/sre-case-study-triage-a-non-heap-jvm-out-of-memory-issue/

Most Java virtual machine out of memory issues happen on the heap, but this time proved to be a little different.
A Java virtual machine (JVM) has an auto memory management feature, so Java developers don’t need to care about object reclaiming. But they should still be concerned about memory, as it isn’t unlimited, and we do see out of memory errors sometimes. For out of memory issues, there are generally two possible reasons: 1) the memory settings for the JVM are too small, and 2) applications have memory leaks. For the first type, it is easy to fix with more memory; just change some JVM memory setting parameters. For the second type, we need to figure out where the leak is and fix it in code. Today I am going to share a JVM memory leak case that is a little different.

Symptoms
At the beginning, we received "GC overhead exceeded" and CPU usage alerts for some hosts. GC overhead was around 60%~70%, and the CPU was busy with GC. It appeared to be a memory issue.
Figure 1. GC overhead alert

Action
Not all the servers for that application had this issue, just some of them, which meant it took time to fill up the memory: anywhere from one or two hours to a few days. To mitigate the issue on site, we first took a heap dump and then restarted the affected servers for temporary recovery.

Analysis
For GC overhead issues, we analyze the verbose GC log, the heap dump, and the source code.

1. Analyze the verbose GC log

The app enables the verbose GC log, which is very useful for analyzing memory issues. From the following screenshot, we can see that there is plenty of free memory in both the young and old generations, yet full GC keeps running more and more frequently.

[GCViewer screenshot]

This is a little strange: most of the time we see both the young and old generations used up, so that the JVM doesn't have enough heap to allocate new objects. This issue is not caused by a shortage of young/old generation memory, so where is it?

We all know that a full permanent generation or an explicit System.gc() call can also trigger a full GC. Next, we check these two possibilities:

  1. If the full GC is triggered by an explicit System.gc() call, we will see the “system” keyword in the GC log, but we don't see it this time.

  2. If it is triggered by a full permanent generation, we can easily identify it in the raw GC log. From the following raw GC log, we can see that the permanent generation has plenty of free memory.

Verbose GC log snippet:

2018-09-13T20:23:29.058-0700: 2518960.051: [GC2018-09-13T20:23:29.059-0700: 2518960.051: [ParNew Desired survivor size 41943040 bytes, new threshold 6 (max 6) - age 1: 3787848 bytes, 3787848 total - age 2: 2359600 bytes,6147448 total : 662280K->7096K(737280K), 0.0319710 secs] 1224670K->569486K(2170880K), 0.0324480 secs] [Times: user=0.08 sys=0.00, real=0.03 secs]

2018-09-13T20:23:44.824-0700: 2518975.816: [Full GC2018-09-13T20:23:44.824-0700: 2518975.817: [CMS: 562390K->563346K(1433600K), 2.9864680 secs] 795326K->563346K(2170880K), [CMS Perm : 271273K->271054K(524288K)], 2.9869590 secs] [Times: user=2.97 sys=0.00, real=2.99 secs]

2018-09-13T20:23:58.130-0700: 2518989.123: [Full GC2018-09-13T20:23:58.131-0700: 2518989.123: [CMS: 563346K->561519K(1433600K), 2.8341560 secs] 867721K->561519K(2170880K), [CMS Perm : 271080K->271054K(524288K)], 2.8345980 secs] [Times: user=2.84 sys=0.00, real=2.83 secs]

2018-09-13T20:24:01.902-0700: 2518992.894: [Full GC2018-09-13T20:24:01.902-0700: 2518992.895: [CMS: 561519K->560375K(1433600K), 2.6886910 secs] 589208K->560375K(2170880K), [CMS Perm : 271055K->271055K(524288K)], 2.6891280 secs] [Times: user=2.69 sys=0.00, real=2.69 secs]

Therefore, these two possibilities have been ruled out.

In the past, we encountered a complicated case with similar symptoms: the young and old generations each had about 700M of free space after full GC, and there was no issue with the permanent generation or explicit System.gc() calls, yet the JVM kept doing full GCs. The cause was a java.util.Vector on the heap that already used about 400M and was trying to grow. As the JDK code is written, a Vector doubles its capacity each time it expands, so it needed an extra 800M of memory. The JVM couldn't find such a large free block, so it resorted to continuous full GC.

This time, we didn't see this kind of big collection instance.

2. Check the application log, and find the issue

We started to analyze the heap dump, but in the meantime we found a very useful error message in the application log: java.lang.OutOfMemoryError: Direct buffer memory. This error points out where the issue is.

OOM error in the log:

INFO | jvm 1| 2018/09/15 03:43:13 | Caused by: java.lang.OutOfMemoryError: Direct buffer memory

INFO | jvm 1| 2018/09/15 03:43:13 | at java.nio.Bits.reserveMemory(Bits.java:658)

INFO | jvm 1| 2018/09/15 03:43:13 | at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)

INFO | jvm 1| 2018/09/15 03:43:13 | at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)

The direct buffer memory is the OS's native memory used by the JVM process; it lives outside the JVM heap. Java NIO uses it to write data quickly to the network or disk, with no need to copy between the JVM heap and native memory. A Java application can set the JVM parameter -XX:MaxDirectMemorySize to limit the direct buffer memory size. If no such parameter is set, the JVM can use all the available native memory. In our case, the JVM was started with -XX:MaxDirectMemorySize=1024M, which means the application capped the direct buffer at 1G. Based on the above log, this 1G of native memory was used up, and then the OOM error was thrown.
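A minimal sketch of how this limit behaves (the class name is made up; run with a small cap such as -XX:MaxDirectMemorySize=64M to reproduce quickly):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class DirectOomDemo {
    public static void main(String[] args) {
        List<ByteBuffer> hold = new ArrayList<>();
        while (true) {
            // each allocation counts against MaxDirectMemorySize, not the heap;
            // once the cap is reached, allocateDirect throws
            // java.lang.OutOfMemoryError: Direct buffer memory
            hold.add(ByteBuffer.allocateDirect(16 * 1024 * 1024)); // 16M each
            System.out.println("allocated " + hold.size() * 16 + "M of direct memory");
        }
    }
}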

3. Find the direct memory issue in the heap dump

Although direct buffer memory lives outside the heap, the JVM still keeps track of it. Each time the JVM requests direct buffer memory, a java.nio.DirectBuffer instance is created on the heap to represent it. This instance holds the native memory address, the size of the memory block, and so on. Since the DirectBuffer instance's life cycle is managed by the JVM, it can be collected by the GC thread once nothing references it, and the associated native memory is released when the GC collects the DirectBuffer instance.
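A small sketch of that life cycle (the class name is made up): the heap-side object is tiny, while the native allocation is large, and the native part is freed when GC collects the heap-side instance.

import java.nio.ByteBuffer;

public class DirectBufferCleanerDemo {
    public static void main(String[] args) throws InterruptedException {
        // the DirectByteBuffer object on the heap is a few dozen bytes;
        // the 128M it represents lives in native memory
        ByteBuffer buf = ByteBuffer.allocateDirect(128 * 1024 * 1024);
        buf = null;         // drop the only reference
        System.gc();        // collecting the instance runs its cleaner,
        Thread.sleep(1000); // which returns the 128M of native memory to the OS
    }
}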

Why does this app need more than 1G of direct buffer memory? Why doesn't it release the memory during full GC? Now that we have the heap dump, can we find any clues in it? As just mentioned, the DirectBuffer objects on the heap carry information about the direct buffer memory.

From the application error log, the JVM was trying to create a new DirectByteBuffer instance, so let's check DirectByteBuffer first. With OQL, we see lots of DirectByteBuffer instances in the heap, and no other DirectBuffer subtypes, such as DirectCharBuffer.

We can confirm how much native memory these DirectByteBuffers are using with this OQL query:

SELECT x, x.capacity FROM java.nio.DirectByteBuffer x WHERE ((x.capacity > 1024 * 1024) and (x.cleaner != null)) // here we only care about objects whose capacity is bigger than 1M

The capacity field of a DirectByteBuffer records how much memory was requested for that instance. The filter x.cleaner != null skips sliced DirectByteBuffer instances, which are just views over other DirectByteBuffer instances. There are also many DirectByteBuffer objects whose capacity is below 1M; we skip them as well. This is the result:
[Heap analysis screenshot]

In this result, there are 25 instances each holding more than 1M of native memory. The biggest holds 179M (188124977/1024/1024) and the second 124M (130804508/1024/1024). Together, these top 25 instances account for almost 1G, which is why the total 1G of direct buffer memory was used up.

4. Why are these DirectByteBuffers not collected by GC?

If these DirectByteBuffer instances are collected by GC, then direct buffer native memory can also be released. Why can't these DirectByteBuffer instances be collected by the GC thread?

We further checked the reference chains. They clearly show that some thread-local BufferCaches are holding references to the DirectByteBuffers, and that these thread-local objects belong to daemon threads, such as the Tomcat daemon threads. That's why they can't be collected, as shown in the following reference chain screenshot:
[Reference chain screenshot]

Who put these DirectByteBuffers into the thread-local BufferCaches, and why are they never removed?

Following the reference chain, we looked into the source code of the sun.nio.ch.Util class. There you can see the thread-local BufferCache and the method getTemporaryDirectBuffer(int), which puts DirectByteBuffer objects into the BufferCache. getTemporaryDirectBuffer is called by several methods in the JDK's NIO classes, and the BufferCache reuses a cached DirectByteBuffer as long as the thread does not request a bigger one. In short, the JDK NIO classes use these thread-local DirectByteBuffer instances but never release them while the thread is alive.
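A hedged demonstration of that behavior (the class name, file path, and sizes are made up): a single long-lived thread funnels ever-larger heap buffers through a FileChannel, and the JDK quietly caches a temporary DirectByteBuffer sized to the largest request in that thread's BufferCache, where it stays until the thread dies.

import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class BufferCacheDemo {
    public static void main(String[] args) throws Exception {
        try (FileChannel ch = FileChannel.open(Paths.get("/tmp/bufcache.bin"),
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            for (int mb = 1; mb <= 64; mb *= 2) {
                // writing a heap buffer makes the JDK copy it into a thread-local
                // temporary direct buffer (sun.nio.ch.Util.getTemporaryDirectBuffer)
                ch.write(ByteBuffer.wrap(new byte[mb * 1024 * 1024]));
            }
        }
        // the main thread is now pinning roughly 64M of direct buffer memory in
        // its BufferCache, and it stays pinned for the life of the thread
    }
}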

From the above analysis, the issue lies in the JDK's code, and it was indeed identified as a JDK issue. In the JDK 8u102 Update Release Notes, a new system property, jdk.nio.maxCachedBufferSize, was added to address it (set on the JVM command line, e.g. -Djdk.nio.maxCachedBufferSize=262144, to stop the per-thread cache from retaining buffers above that size). But the notes also say this parameter only fixes part of the issue, not all cases.

The fix
Most of the time, your application won't have this issue, either because your threads are short-lived, so the BufferCache and DirectByteBuffer instances are collected by the GC thread and the direct buffer native memory is released to the OS, or because each request needs only a little direct buffer memory, which the JVM simply reuses. You see this issue when multiple long-lived threads each request more and more direct buffer memory until they reach the max direct buffer limit or all the memory is used up.

In our case, the app allocates direct buffer native memory for uploaded files, and Tomcat's daemon threads handle these requests. Some uploaded files are very big, more than 100M, and the app runs 40 daemon threads for Tomcat, so it eventually hits the 1G direct buffer upper limit.

To fix it, the app should split large byte arrays into small chunks before handing them to the NIO utilities. This can be done in the application logic, as the sketch below illustrates.
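A minimal sketch of that fix (the class name, chunk size, and method are illustrative): instead of handing a 100M+ array to the channel in one call, write it in small slices, so the per-thread temporary direct buffer never needs to grow beyond the chunk size.

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

public class ChunkedWriter {
    private static final int CHUNK = 256 * 1024; // 256K per write

    static void writeInChunks(WritableByteChannel ch, byte[] data) throws IOException {
        for (int off = 0; off < data.length; off += CHUNK) {
            ByteBuffer slice = ByteBuffer.wrap(data, off, Math.min(CHUNK, data.length - off));
            while (slice.hasRemaining()) {
                // the JDK's cached direct buffer only needs CHUNK bytes now
                ch.write(slice);
            }
        }
    }
}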

Summary
Most out of memory issues appear on the heap, but they can also happen in the direct buffer. When the direct buffer native memory is used up, even though it is not on the heap, a heap dump can still help us find the root cause.

Which Ports Is a Process Listening On?

To let Node.js take full advantage of multi-core CPUs, it is common to run one master process with multiple workers, each worker being its own Node.js event loop. Here is how to check which ports such a process has opened.

eric@eric1:~$ sudo netstat --all --program | grep '8481'
tcp        0      0 localhost:6666          *:*                     LISTEN      8481/pm2: Daemon
tcp        0      0 localhost:ircd          *:*                     LISTEN      8481/pm2: Daemon
tcp6       0      0 [::]:http-alt           [::]:*                  LISTEN      8481/pm2: Daemon
tcp6       0      0 [::]:8082               [::]:*                  LISTEN      8481/pm2: Daemon
tcp6       0      0 [::]:10100              [::]:*                  LISTEN      8481/pm2: Daemon
tcp6       0      0 [::]:10101              [::]:*                  LISTEN      8481/pm2: Daemon
tcp6       0      0 [::]:10102              [::]:*                  LISTEN      8481/pm2: Daemon
unix  3      [ ]         STREAM     CONNECTED     573099484 8481/pm2: Daemon
unix  3      [ ]         STREAM     CONNECTED     573099481 8481/pm2: Daemon
unix  3      [ ]         STREAM     CONNECTED     573099491 8481/pm2: Daemon

eric@eric1:~$ sudo lsof -i -P |grep 8481
pm2:       8481 rebot    3u  IPv6 573099604      0t0  TCP *:10100 (LISTEN)
pm2:       8481 rebot   12u  IPv4 573098742      0t0  TCP localhost:6666 (LISTEN)
pm2:       8481 rebot   13u  IPv4 573098743      0t0  TCP localhost:6667 (LISTEN)
pm2:       8481 rebot   17u  IPv6 573099599      0t0  TCP *:8082 (LISTEN)
pm2:       8481 rebot   18u  IPv6 573099600      0t0  TCP *:8080 (LISTEN)
pm2:       8481 rebot   20u  IPv6 573099610      0t0  TCP *:10101 (LISTEN)
pm2:       8481 rebot   22u  IPv6 573099619      0t0  TCP *:10102 (LISTEN)

Inspecting the threads of a process and how they relate:

pstree -a -p -H 8481  // process tree with PIDs, highlighting PID 8481
pstree -a -l -p -s 8481  // also show the parent chain of PID 8481
top -H -p 8481  // live per-thread CPU usage for PID 8481
ps -Lf -p 8481  // list the threads (LWPs) of PID 8481
ps -eLf  // list the threads of every process
htop -p 8481  // interactive view restricted to PID 8481