Useful System Tricks

You can’t debug production systems without a clever choice of tools. In this article, I’d like to talk about the ones I’ve found most useful.

 strace

From the man pages:

In the simplest case strace runs the specified command until it exits. It intercepts and records the system calls which are called by a process and the signals which are received by a process.

Given a process ID, strace can tell you exactly what a process is doing at the kernel level. Think of it as a cheap tracing system. I was once able to identify a deadlock merely by watching a process this way.

This is useful if you either a) don’t have distributed tracing to show what your application is currently doing or b) don’t have enough logging.

A simple example with Chrome:

$ ps -eaf 
UID        PID  PPID  C STIME TTY          TIME CMD
akshat    8174  3105  0 23:09 ?        00:00:00 /opt/google/chrome/chrome --type ...

$ strace -p 8174
Process 8174 attached
restart_syscall(<... resuming interrupted call ...>
8252, 248543873}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ffe1f6b28c8, FUTEX_WAKE_PRIVATE, 1) = 0
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
futex(0x7ffe1f6b28f4, FUTEX_WAIT_BITSET_PRIVATE, 1, {8252, 278776591}, ffffffff) = -1 ETIMEDOUT (Connection timed out)
futex(0x7ffe1f6b28c8, FUTEX_WAKE_PRIVATE, 1) = 0
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
futex(0x7ffe1f6b28f4, FUTEX_WAIT_BITSET_PRIVATE, 1, {8282, 768291785}, ffffffff

Woah! Looks like Chrome decided to wake up from a futex-induced sleep, and either ran a few health-checks or decided to fork off some nice background threads, judging by the way it’s invoking gettid.
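
If the raw stream is too noisy to eyeball, a rough tally per system call can help. Below is a minimal sketch, assuming the trace has one call per line in the format shown above; syscall_counts is a hypothetical helper name, not part of strace. (strace's own -c flag produces a similar summary natively.)

```shell
# syscall_counts: read strace output on stdin and print "count name"
# for each system call, busiest first.
syscall_counts() {
  awk '
    # Only consider lines that begin with a call name followed by "(".
    match($0, /^[a-z_][a-z0-9_]*\(/) {
      name = substr($0, RSTART, RLENGTH - 1)
      count[name]++
    }
    END { for (name in count) print count[name], name }
  ' | sort -k1,1nr -k2,2
}

# Demo on a few lines taken from the trace above.
syscall_counts <<'EOF'
futex(0x7ffe1f6b28c8, FUTEX_WAKE_PRIVATE, 1) = 0
gettid()                                = 1
gettid()                                = 1
gettid()                                = 1
futex(0x7ffe1f6b28c8, FUTEX_WAKE_PRIVATE, 1) = 0
EOF
# prints:
# 3 gettid
# 2 futex
```

In practice you might save a trace with `strace -p 8174 -o trace.log`, interrupt it after a while, and run `syscall_counts < trace.log`. Lines that do not start with a call name, such as the tail of a resumed call, are simply ignored.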

Similarly robust tools include gdb, which is a proper debugger and can actually reference what line is running in the attached program (assuming it’s a C or C++ program built with debug symbols); and jdb, which does the same for Java. strace, however, is language-independent, and comes preinstalled on several Linux distributions.

 lsof

This nifty command lets you examine open file descriptors for a process. Since everything in Linux is a file, this means you can see all the TCP connections, UDP connections, the standard output, what it’s piping stuff to, etcetera, etcetera.

lsof can do more than just attach to a process, though:

# all connections across all processes
$ lsof -i TCP 
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
chrome  3087 akshat  100u  IPv4 209657      0t0  TCP 10.0.0.19:45022->ec2-54-225-100-50.compute-1.amazonaws.com:https (ESTABLISHED)
chrome  3087 akshat  110u  IPv4  15204      0t0  TCP 10.0.0.19:56475->10.0.0.10:8009 (ESTABLISHED)

# all connections on port 56475
$ lsof -i TCP:56475
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
chrome  3087 akshat  110u  IPv4  15204      0t0  TCP 10.0.0.19:56475->10.0.0.10:8009 (ESTABLISHED)

# all file descriptors for process 3087
$ lsof -p 3087
...
chrome  3087 akshat  252u      REG                8,1      4709 36840295 /home/akshat/.cache/google-chrome/Default/Cache/ad851e438fd4ed16_0
chrome  3087 akshat  254u      REG                8,1      4530 36842310 /home/akshat/.cache/google-chrome/Default/Cache/7d6553d55555cffd_0
chrome  3087 akshat  256u      REG                8,1      4738 36839830 /home/akshat/.cache/google-chrome/Default/Cache/0481a7cb3e3163ce_0
chrome  3087 akshat  257u      REG                8,1   1081344 34736868 /home/akshat/.config/google-chrome/Default/Local Storage/https_danielmiessler.com_0.localstorage
chrome  3087 akshat  100u  IPv4 209657      0t0  TCP 10.0.0.19:45022->ec2-54-225-100-50.compute-1.amazonaws.com:https (ESTABLISHED)
chrome  3087 akshat  110u  IPv4  15204      0t0  TCP 10.0.0.19:56475->10.0.0.10:8009 (ESTABLISHED)
...

# all TCP connections for process 3087 only
$ lsof -p 3087 -i TCP -a
chrome  3087 akshat  100u  IPv4 209657      0t0  TCP 10.0.0.19:45022->ec2-54-225-100-50.compute-1.amazonaws.com:https (ESTABLISHED)
chrome  3087 akshat  110u  IPv4  15204      0t0  TCP 10.0.0.19:56475->10.0.0.10:8009 (ESTABLISHED)

This is a very brief demonstration of lsof’s powers. Combined with other tools, it can lead to a very quick analysis of what sort of objects are attached to the process and what sort of connections it’s making.
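
As one example of combining lsof with other tools, here is a small sketch that tallies a process's open descriptors by the TYPE column. It assumes the default column layout shown in the listings above (TYPE is the fifth field); fd_types is a hypothetical helper name.

```shell
# fd_types: read lsof output on stdin and print "count TYPE",
# most common type first.
fd_types() {
  awk '
    $1 == "COMMAND" && $2 == "PID" { next }   # skip the header row
    NF >= 5 { count[$5]++ }                   # TYPE is the fifth column
    END { for (type in count) print count[type], type }
  ' | sort -k1,1nr -k2,2
}

# Demo on lines taken from the listings above.
fd_types <<'EOF'
COMMAND  PID   USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
chrome  3087 akshat  100u  IPv4 209657      0t0  TCP 10.0.0.19:45022->ec2-54-225-100-50.compute-1.amazonaws.com:https (ESTABLISHED)
chrome  3087 akshat  110u  IPv4  15204      0t0  TCP 10.0.0.19:56475->10.0.0.10:8009 (ESTABLISHED)
chrome  3087 akshat  252u      REG                8,1      4709 36840295 /home/akshat/.cache/google-chrome/Default/Cache/ad851e438fd4ed16_0
EOF
# prints:
# 2 IPv4
# 1 REG
```

On a live system this would be used as `lsof -p 3087 | fd_types`. A sudden jump in one type (sockets, say) is often a quick hint at a descriptor leak.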

 pdcp and pdsh

This is the poor man’s Ansible/Chef/Puppet.

Essentially, if you want to ensure a command is run in parallel on multiple machines, pdsh will gladly SSH to each of these machines simultaneously and run that command in one broad stroke for you.

pdcp will likewise do an scp on all boxes for you in parallel.

In the example below, I have redacted the IPs to keep it general:

# run in non-interactive mode for two boxes
$ pdsh -w <first-IP> -w <second-IP> echo hostname
<first-IP>: hostname
<second-IP>: hostname

# run in interactive mode for a group of boxes called `workers`
$ pdsh -g workers
pdsh> hostname
<first-IP>: <hostname>
<second-IP>: <hostname>

# copy a config file to the same location on every box
$ pdcp -g workers /etc/hbase-site.xml /etc/remote/hbase-site.xml
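
To make the idea concrete, here is a rough plain-shell approximation of what pdsh does: start one remote command per host in the background, prefix each output line with the host, and wait for all of them. This is only a sketch of the concept, not how pdsh is implemented; fan_out, REMOTE_SHELL and fake_remote are hypothetical names introduced for illustration.

```shell
# fan_out CMD HOST...: run CMD on every host concurrently and prefix
# each output line with the host it came from.
fan_out() {
  cmd=$1
  shift
  for host in "$@"; do
    # REMOTE_SHELL is a hypothetical knob; it defaults to plain ssh.
    ${REMOTE_SHELL:-ssh} "$host" "$cmd" </dev/null 2>&1 | sed "s/^/$host: /" &
  done
  wait   # block until every background job has finished
}

# Stand-in for ssh so the sketch runs without any remote boxes.
fake_remote() { echo "ran '$2' on $1"; }

# Output order is not deterministic across hosts, hence the sort.
REMOTE_SHELL=fake_remote fan_out hostname box1 box2 | sort
```

With real machines you would drop the REMOTE_SHELL override and call `fan_out hostname <first-IP> <second-IP>`. pdsh adds what this sketch lacks, such as host groups, connection limits and timeouts.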

 htop

A superior alternative to top so you can view overall system behaviour in real-time:

[Image: htop.png, a screenshot of htop]

 sar

A very simple command-line monitoring system:

# collect system metrics three times at 10-second intervals
$ sar 10 3
Linux 3.13.0-65-generic (Centaur)   Thursday 16 March 2017  _x86_64_    (8 CPU)

12:23:41  PDT     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:23:51  PDT     all      1.04      0.01      0.54      0.20      0.00     98.20
12:24:01  PDT     all      1.45      0.00      0.74      0.03      0.00     97.79
12:24:11  PDT     all      9.33      2.49      2.04      0.28      0.00     85.87
Average:        all      3.94      0.83      1.11      0.17      0.00     93.96

This is, of course, not sar’s only trick. When its data collector is enabled, sar records system metrics every 10 minutes and keeps them on disk, letting you look back over your system’s history:

# see CPU activity for the whole day so far
$ sar -u 

...
12:23:31  PDT     all      0.38      0.00      0.13      0.13      0.00     99.37
12:33:31  PDT     all      1.13      0.00      0.25      0.63      0.00     97.99
12:43:31  PDT     all      0.88      0.00      1.01      0.00      0.00     98.12
Average:        all      0.79      0.00      0.46      0.25      0.00     98.49

This comes in handy if you need to trace spikes in production usage back to the source.
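
Scanning a day of samples by eye gets tedious, so here is a minimal sketch that keeps only the busy ones. It assumes %idle is the last column, as in the output above, and that numbers use a dot as the decimal separator; busy_samples is a hypothetical helper name.

```shell
# busy_samples THRESHOLD: read `sar -u` output on stdin and print the
# timestamp and %idle of every sample whose %idle is below THRESHOLD.
busy_samples() {
  awk -v max_idle="$1" '
    $1 == "Average:" { next }                 # skip the summary row
    # Header and banner lines do not end in a number, so they are skipped.
    $NF ~ /^[0-9]+(\.[0-9]+)?$/ && $NF + 0 < max_idle + 0 {
      print $1, "idle=" $NF
    }
  '
}

# Demo on the samples shown earlier.
busy_samples 90 <<'EOF'
12:23:41  PDT     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:23:51  PDT     all      1.04      0.01      0.54      0.20      0.00     98.20
12:24:01  PDT     all      1.45      0.00      0.74      0.03      0.00     97.79
12:24:11  PDT     all      9.33      2.49      2.04      0.28      0.00     85.87
Average:        all      3.94      0.83      1.11      0.17      0.00     93.96
EOF
# prints:
# 12:24:11 idle=85.87
```

Against live data this becomes `sar -u | busy_samples 90`, which gives you a short list of timestamps to line up against deploys, cron jobs or traffic graphs.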

 Shellcheck

Writing shell scripts can be painful and error-prone, particularly if the only way to test them is by sending production traffic their way. Unless you have a robust unit-testing or static analysis tool at hand, you might inadvertently introduce bugs that break your entire process.

Shellcheck is such a tool: a static analyzer for shell scripts that is itself written in Haskell. You need neither Haskell nor GHC to use it, though: simply try out the online checker.
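
To give a flavour of the bugs it catches, here is a small sketch of the classic unquoted-variable mistake, which Shellcheck reports as SC2086. The count_args helper and the file name are made up for illustration.

```shell
# count_args: print how many arguments it received.
count_args() { echo "$#"; }

file="my report.txt"

# Unquoted: the shell splits the value on the space, so the command
# sees two arguments ("my" and "report.txt"). Shellcheck flags this line.
unquoted=$(count_args $file)

# Quoted: the value is passed through as a single argument.
quoted=$(count_args "$file")

echo "unquoted=$unquoted quoted=$quoted"
# prints: unquoted=2 quoted=1
```

Swap count_args for rm or mv and the unquoted version quietly operates on the wrong files, which is exactly the kind of bug you do not want to discover in production.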

 