\chapter{Reading Crash Dumps}
\label{chap:crash-dumps}
Whenever an Erlang node crashes, it will generate a crash dump\footnote{Provided it isn't killed by the OS for violating ulimits while dumping, and doesn't segfault.}.
The format is mostly documented in Erlang's official documentation\footnote{\href{http://www.erlang.org/doc/apps/erts/crash\_dump.html}{http://www.erlang.org/doc/apps/erts/crash\_dump.html}}, and anyone willing to dig deeper into it will likely be able to figure out what each piece of data means by looking at that documentation. Some entries are hard to understand without also understanding the parts of the VM they refer to, but that might be too complex for this document.
By default, the crash dump is named \filename{erl\_crash.dump} and written to the directory the Erlang process was running from. This behaviour (and the file name) can be overridden by setting the \command{ERL\_CRASH\_DUMP} environment variable\footnote{Heroku's Routing and Telemetry teams use the \otpapp{\href{https://github.com/heroku/heroku\_crashdumps}{heroku\_crashdumps}} app to set the path and name of the crash dumps. It can be added to a project to name the dumps by boot time and put them in a pre-set location.}.
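For example, the variable can be exported before booting the node so that dumps end up in a known place. The following is a minimal sketch, where the path and node name are made up for illustration:
\begin{VerbatimText}
## write crash dumps to a dedicated directory instead of the
## current working directory (example path only)
$ export ERL_CRASH_DUMP=/var/log/myapp/erl_crash.dump
$ erl -name myapp@127.0.0.1
\end{VerbatimText}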
\section{General View}
\label{sec:crashdump-general-view}
Reading the crash dump will be useful to figure out possible reasons for a node to die \emph{a posteriori}. One way to get a quick look at things is to use recon's \app{erl\_crashdump\_analyzer.sh}\footnote{\href{https://github.com/ferd/recon/blob/master/script/erl\_crashdump\_analyzer.sh}{https://github.com/ferd/recon/blob/master/script/erl\_crashdump\_analyzer.sh}} and run it on a crash dump:
\begin{VerbatimRaw}
$ ./recon/script/erl_crashdump_analyzer.sh erl_crash.dump
analyzing erl_crash.dump, generated on: Thu Apr 17 18:34:53 2014
Slogan: eheap_alloc: Cannot allocate 2733560184 bytes of memory
(of type "old_heap").
Memory:
===
processes: 2912 Mb
processes_used: 2912 Mb
system: 8167 Mb
atom: 0 Mb
atom_used: 0 Mb
binary: 3243 Mb
code: 11 Mb
ets: 4755 Mb
---
total: 11079 Mb
Different message queue lengths (5 largest different):
===
1 5010932
2 159
5 158
49 157
4 156
Error logger queue length:
===
0
File descriptors open:
===
UDP: 0
TCP: 19951
Files: 2
---
Total: 19953
Number of processes:
===
36496
Processes Heap+Stack memory sizes (words) used in the VM (5 largest
different):
===
1 284745853
1 5157867
1 4298223
2 196650
12 121536
Processes OldHeap memory sizes (words) used in the VM (5 largest
different):
===
3 318187
9 196650
14 121536
64 75113
15 46422
Process States when crashing (sum):
===
1 Garbing
74 Scheduled
36421 Waiting
\end{VerbatimRaw}
This data dump won't point a problem out directly to your face, but it gives good clues as to where to look. For example, the node here ran out of memory with 11079 MB used out of the 15 GB available (I know this because that's the max instance size we were using!). This can be a symptom of:
\begin{itemize*}
\item memory fragmentation;
\item memory leaks in C code or drivers;
\item lots of memory that got garbage-collected before the crash dump was generated\footnote{A notable case is reference-counted binary memory, which sits in a global heap but ends up being garbage-collected before the crash dump is generated. Binary memory can therefore be underreported. See Chapter \ref{chap:memory-leaks} for more details.}.
\end{itemize*}
More generally, look for anything surprising about memory there, and correlate it with the number of processes and the size of mailboxes; one may explain the other.
In this particular dump, one process had 5 million messages in its mailbox. That's telling. Either it doesn't match on all the messages it can get, or it is getting overloaded. There are also dozens of processes with hundreds of messages queued up, which can point towards overload or contention. It's hard to give general advice for a generic crash dump, but there are still a few pointers to help figure things out.
\section{Full Mailboxes}
\label{sec:crash-full-mailboxes}
For loaded mailboxes, looking at the largest counters is the best place to start. If there is one large mailbox, go investigate the process in the crash dump and figure out whether it isn't matching on some messages or whether it is overloaded. If you have a similar node running, you can log onto it and inspect the process live. If you find that many mailboxes are loaded, you may want to use recon's \app{queue\_fun.awk} to figure out what function the processes were running at the time of the crash:
\begin{VerbatimText}
$ awk -v threshold=10000 -f queue_fun.awk /path/to/erl_crash.dump
MESSAGE QUEUE LENGTH: CURRENT FUNCTION
======================================
10641: io:wait_io_mon_reply/2
12646: io:wait_io_mon_reply/2
32991: io:wait_io_mon_reply/2
2183837: io:wait_io_mon_reply/2
730790: io:wait_io_mon_reply/2
80194: io:wait_io_mon_reply/2
...
\end{VerbatimText}
This script runs over the crash dump and outputs the current function of every process with at least 10000 messages in its mailbox. In this particular run, it showed that the entire node was locking up waiting on IO for \function{io:format/2} calls, for example.
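To find out which processes actually own the largest mailboxes, the queue lengths can be mapped back to pids with a quick one-liner over the same dump. This is a rough sketch that assumes the documented dump format, where each process section starts with \expression{=proc:} and contains a \expression{Message queue length:} line; the 100000 threshold is arbitrary:
\begin{VerbatimText}
$ awk '/^=proc:/ { pid = substr($0, 7) }
       /^Message queue length:/ && $4 > 100000 { print pid, $4 }' \
  erl_crash.dump
\end{VerbatimText}
The pids obtained this way can then be looked up directly in the dump to see their current function, links, and memory usage.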
\section{Too Many (or too few) Processes}
The process count is mostly useful when you know your node's usual average count\footnote{See subsection \ref{subsec:global-procs} for details}, in order to figure out whether it's abnormal or not.
A count that is higher than normal may reveal a specific leak or overload, depending on the application.
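If you only have the dump at hand, the count can also be re-derived directly from it; a minimal sketch, assuming the documented format where each process section starts with \expression{=proc:}:
\begin{VerbatimText}
## one =proc: line per process in the dump
$ grep -c '^=proc:' erl_crash.dump
36496
\end{VerbatimText}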
If the process count is extremely low compared to usual, see if the node terminated with a slogan like:
\begin{Verbatim}
Kernel pid terminated (application_controller)
({application_terminated, <AppName>, shutdown})
\end{Verbatim}
In such a case, the issue is that a specific application (\expression{<AppName>}) has reached its maximal restart frequency within its supervisors, and that prompted the node to shut down. Error logs that led to the cascading failure should be combed over to figure things out.
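The slogan is recorded near the top of the crash dump, so checking for this kind of shutdown doesn't require the full analyzer; a minimal sketch:
\begin{VerbatimText}
## the Slogan: line holds the reason the node gave for dying
$ grep '^Slogan:' erl_crash.dump
\end{VerbatimText}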
\section{Too Many Ports}
As with the process count, the port count is simple and mostly useful when you know your usual values\footnote{See subsection \ref{subsec:global-ports} for details}.
A high count may be the result of overload, Denial of Service attacks, or plain old resource leaks. Looking at the type of port leaked (TCP, UDP, or files) can also help reveal whether there was contention on specific resources, or whether the code using them is just wrong.
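The analyzer output shown earlier already breaks ports down by driver, but the raw counts can also be pulled from the dump directly. A rough sketch, assuming port sections start with \expression{=port:} and that the usual driver names (\expression{tcp\_inet}, \expression{udp\_inet}, \expression{efile}) appear in their entries:
\begin{VerbatimText}
## total number of ports in the dump
$ grep -c '^=port:' erl_crash.dump
## rough breakdown by driver type
$ grep -Eo 'tcp_inet|udp_inet|efile' erl_crash.dump | sort | uniq -c
\end{VerbatimText}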
\section{Can't Allocate Memory}
These are by far the most common types of crashes you are likely to see. There's so much to cover, that Chapter \ref{chap:memory-leaks} is dedicated to understanding them and doing the required debugging on live systems.
In any case, the crash dump will help figure out what the problem was after the fact. The process mailboxes and individual heaps are usually good indicators of issues. If you're running out of memory without any mailbox being outrageously large, look at the processes' heap and stack sizes as returned by the recon script.
If there are large outliers at the top, you know that a small set of processes may be eating up most of your node's memory. If they're all more or less equal, see if the overall amount of memory reported sounds like a lot.
If it looks more or less reasonable, head towards the ``Memory'' section of the dump and check whether one type (ETS or Binary, for example) seems to be fairly large. It may point towards resource leaks you hadn't expected.
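For a quick look without the full analyzer, the per-process sizes and the global memory counters can be pulled straight out of the dump. A rough sketch, assuming the documented format where process entries contain a \expression{Stack+heap:} line (sizes are in words) and the global counters live under the \expression{=memory} section:
\begin{VerbatimText}
## five largest process Stack+heap sizes, in words
$ grep '^Stack+heap:' erl_crash.dump | awk '{print $2}' \
  | sort -rn | head -5
## the global memory counters summarized by the analyzer
$ grep -A 9 '^=memory' erl_crash.dump
\end{VerbatimText}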
\section{Exercises}
\subsection*{Review Questions}
\begin{enumerate}
\item How can you choose where a crash dump will be generated?
\item What are common avenues to explore if the crash dump shows that the node ran out of memory?
\item What should you look for if the process count is suspiciously low?
\item If you find the node died with a process having a lot of memory, what could you do to find out which one it was?
\end{enumerate}
\subsection*{Hands-On}
Using the analysis of a crash dump in Section \ref{sec:crashdump-general-view}:
\begin{enumerate}
\item What are specific outliers that could point to an issue?
\item Does it look like repeated errors are the issue? If not, what could it be?
\end{enumerate}