Monday, 19 September 2016

Using sysdig for Linux system monitoring & troubleshooting


Sysdig is a multi-faceted system monitoring & troubleshooting tool that is touted as "tcpdump + strace + lsof + much more". One of its many powerful features is that, along with analyzing the live state of a system, it allows trace data to be saved to a file for later exploration & inspection.
Sysdig's functionality is enhanced when it is used in combination with small scripts called chisels. Chisels let us use sysdig to diagnose a specific area of system functionality, such as network usage or bottlenecks.

Installing sysdig:

The sysdig rpm & its dependencies are available from the atomic repository for CentOS. sysdig requires dkms to be installed as a prerequisite. I downloaded the dkms & sysdig rpms & installed them as follows:

[root@centdb DB]# rpm -ivh dkms-2.2.0.3-21.el6.art.noarch.rpm
warning: dkms-2.2.0.3-21.el6.art.noarch.rpm: Header V3 RSA/SHA1 Signature, key ID 4520afa9: NOKEY
Preparing...                ########################################### [100%]
   1:dkms                   ########################################### [100%]
[root@centdb DB]# rpm -ivh sysdig-0.1.100-13.el6.art.x86_64.rpm
warning: sysdig-0.1.100-13.el6.art.x86_64.rpm: Header V3 RSA/SHA1 Signature, key ID 4520afa9: NOKEY
Preparing...                ########################################### [100%]
   1:sysdig                 ########################################### [100%]

Creating symlink /var/lib/dkms/sysdig/0.1.100/source ->
                 /usr/src/sysdig-0.1.100

DKMS: add completed.

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area...
make KERNELRELEASE=2.6.32-573.el6.x86_64 -C /lib/modules/2.6.32-573.el6.x86_64/build M=/var/lib/dkms/sysdig/0.1.100/build.....
cleaning build area...

DKMS: build completed.

sysdig-probe:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/2.6.32-573.el6.x86_64/extra/
Adding any weak-modules

depmod....

DKMS: install completed.


Running sysdig without any arguments dumps every event occurring on the system, & this raw information is difficult to analyze.

[root@centdb DB]# sysdig

7 19:51:47.877122494 0 sysdig (19330) > switch next=2724(netdata) pgft_maj=20 pgft_min=1737 vm_size=36324 vm_rss=4220 vm_swap=0
8 19:51:47.877126062 0 netdata (2724) < nanosleep res=0
9 19:51:47.877139073 0 netdata (2724) > nanosleep interval=20000000(0.02s)
10 19:51:47.877142506 0 netdata (2724) > switch next=0 pgft_maj=2 pgft_min=53 vm_size=671084 vm_rss=21028 vm_swap=0
11 19:51:47.877170219 0 <NA> (0) > switch next=2524(oracle) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
12 19:51:47.877171972 0 oracle (2524) < nanosleep res=0
13 19:51:47.877181147 0 oracle (2524) > nanosleep interval=10000000(0.01s)
14 19:51:47.877183038 0 oracle (2524) > switch next=0 pgft_maj=1 pgft_min=3938 vm_size=620608 vm_rss=14572 vm_swap=0
15 19:51:47.888620415 0 <NA> (0) > switch next=2524(oracle) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
16 19:51:47.888626880 0 oracle (2524) < nanosleep res=0
17 19:51:47.888682462 0 oracle (2524) > nanosleep interval=10000000(0.01s)
18 19:51:47.888690219 0 oracle (2524) > switch next=0 pgft_maj=1 pgft_min=3938 vm_size=620608 vm_rss=14572 vm_swap=0
19 19:51:47.899400592 0 <NA> (0) > switch next=1624(vmtoolsd) pgft_maj=0 pgft_min=0 vm_size=0 vm_rss=0 vm_swap=0
20 19:51:47.899412594 0 vmtoolsd (1624) < poll res=0 fds=
21 19:51:47.899428631 0 vmtoolsd (1624) > times
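
One way to get a first overview of this raw stream is to count events per process. The following is only a sketch: the function name count_events_by_process is made up for this example, and it assumes the default line layout shown above (event number, time, CPU, process name, PID, direction, event type, arguments) and process names without spaces.

```shell
#!/usr/bin/env bash
# Count events per process name in raw sysdig output read from stdin.
# Assumed layout: <evt num> <time> <cpu> <process> (<pid>) <dir> <type> <args>
# Assumption: process names contain no spaces, so field 4 is the name.
count_events_by_process() {
    awk 'NF >= 7 { count[$4]++ }
         END { for (p in count) printf "%d %s\n", count[p], p }' |
        sort -k1,1nr -k2,2   # highest count first, then by name
}

# Sample lines copied from the output above.
sample='7 19:51:47.877122494 0 sysdig (19330) > switch next=2724(netdata) pgft_maj=20 pgft_min=1737 vm_size=36324 vm_rss=4220 vm_swap=0
8 19:51:47.877126062 0 netdata (2724) < nanosleep res=0
9 19:51:47.877139073 0 netdata (2724) > nanosleep interval=20000000(0.02s)
12 19:51:47.877171972 0 oracle (2524) < nanosleep res=0'

printf '%s\n' "$sample" | count_events_by_process
```

On a live system the same function can be fed directly, for example with "sysdig -n 1000 | count_events_by_process", where "-n" stops the capture after the given number of events.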


To make more sense of the information provided by sysdig, we use chisels, which report on specific aspects of system operation. To list the available chisels grouped by category, type:

[root@centdb DB]# sysdig -cl | more

Category: CPU Usage
-------------------
spectrogram         Visualize OS latency in real time.
subsecoffset        Visualize subsecond offset execution time.
topcontainers_cpu   Top containers by CPU usage
topprocs_cpu        Top processes by CPU usage

Category: Errors
----------------
topcontainers_error Top containers by number of errors
topfiles_errors     Top files by number of errors
topprocs_errors     top processes by number of errors
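
If the chisel list is long, it can help to pull out just the chisel names of one category. The sketch below does that with awk; the function name chisels_in_category is hypothetical, and it assumes the "Category: ..." layout shown above, with chisel names starting in the first column.

```shell
#!/usr/bin/env bash
# Print the chisel names listed under one category of "sysdig -cl" output.
# Assumes the layout shown above; indented lines are treated as wrapped
# descriptions and skipped.
chisels_in_category() {
    awk -v want="$1" '
        /^Category: / {
            current = substr($0, 11)            # text after "Category: "
            sub(/[ \t\r]+$/, "", current)       # trim trailing whitespace
            next
        }
        /^-+$/ || NF == 0 { next }              # skip underlines and blank lines
        current == want && /^[^ \t]/ { print $1 }'
}

# Sample listing, abridged from the output above.
listing='Category: CPU Usage
-------------------
spectrogram         Visualize OS latency in real time.
topprocs_cpu        Top processes by CPU usage

Category: Errors
----------------
topfiles_errors     Top files by number of errors
topprocs_errors     top processes by number of errors'

printf '%s\n' "$listing" | chisels_in_category "Errors"
```

On a live system this would be used as: sysdig -cl | chisels_in_category "Errors". A single chisel can then be inspected in more detail with "sysdig -i <chisel name>".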


To execute a specific chisel, run sysdig with the "-c" option followed by the name of the chisel. For example, to view the slowest system calls on the server, type:

[root@centdb DB]# sysdig -c bottlenecks

^C167858) 0.000000000 rs:main (1953) > futex addr=7FED2BFDBA04 op=128(FUTEX_PRIVATE_FLAG) val=365
1197980) 120.014706578 rs:main (1953) < futex res=0
167851) 0.000000000 rsyslogd (1955) > select
1197973) 120.014636013 rsyslogd (1955) < select res=1
339895) 0.000000000 vmware-vmblock- (1607) > read fd=3(<f>/dev/fuse) size=135168
1369942) 119.757423686 vmware-vmblock- (1607) < read res=40 data=(...............................jR......
340055) 0.000000000 vmware-vmblock- (1609) > read fd=3(<f>/dev/fuse) size=135168
1370098) 119.756062546 vmware-vmblock- (1609) < read res=40 data=(...............................kR......
342198) 0.000000000 auditd (1919) > futex addr=7FBD50847254 op=128(FUTEX_PRIVATE_FLAG) val=133
1197586) 116.007123841 auditd (1919) < futex res=0
159517) 0.000000000 at-spi-registry (2974) > poll fds=5:p1 7:p3 9:u1 10:u3 11:u3 12:u3 14:u3 15:u3 16:u3 17:u3 18:u3 19:u3 20:u3 21:u3 22:u3 23:u3 26:u3 27:u3 28:u3 29:u3 13:u1 timeout=-1


In the above output, columns 2, 3 & 4 show the execution time, the process name & the PID of the process respectively.
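
The column layout described above makes the output easy to post-process. As a rough sketch, the hypothetical function slow_calls below keeps only the exit events (marked "<") whose execution time is at least a given number of seconds. It assumes the layout of the bottlenecks output shown above & process names without spaces.

```shell
#!/usr/bin/env bash
# Print exit events ("<") from bottlenecks-style output whose execution time
# (column 2) is at least the threshold given as the first argument (seconds).
# Output columns: time, process name, PID, event type.
slow_calls() {
    awk -v min="${1:-0}" '
        $5 == "<" && ($2 + 0) >= (min + 0) {
            pid = $4
            gsub(/[()]/, "", pid)               # "(1953)" -> "1953"
            printf "%.3f %s %s %s\n", $2, $3, pid, $6
        }'
}

# Sample lines abridged from the output above.
sample='167858) 0.000000000 rs:main (1953) > futex addr=7FED2BFDBA04 op=128(FUTEX_PRIVATE_FLAG) val=365
1197980) 120.014706578 rs:main (1953) < futex res=0
339895) 0.000000000 vmware-vmblock- (1607) > read fd=3(<f>/dev/fuse) size=135168
1369942) 119.757423686 vmware-vmblock- (1607) < read res=40
1197586) 116.007123841 auditd (1919) < futex res=0'

printf '%s\n' "$sample" | slow_calls 119
```

With a threshold of 119 seconds, the auditd line (about 116 seconds) is dropped & only the rs:main and vmware-vmblock- entries remain.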

Monitoring user activities:

If we want to monitor user activities on the system like commands run & files updated we use the spy_users chisel.

First we'll collect a sysdig trace with a few extra options:

sysdig -s 8092 -z -w /tmp/sysdig/`hostname`.trace.gz

"-s 8092" tells sysdig to capture up to 8092 bytes of each I/O buffer.
"-z" (used with "-w") enables compression for the trace file.
"-w <trace-file>" saves the sysdig trace to the specified file.
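
The capture command above expects the target directory to exist already. The following sketch wraps it in two small helper functions. The names trace_path & capture_trace are invented for this example, and the optional event limit relies on sysdig's "-n" option, which stops the capture after the given number of events.

```shell
#!/usr/bin/env bash
# Build the trace file name used above: <dir>/<host>.trace.gz
# Both arguments are optional; the host defaults to the output of hostname.
trace_path() {
    local dir="${1:-/tmp/sysdig}"
    local host="${2:-$(hostname)}"
    printf '%s/%s.trace.gz\n' "$dir" "$host"
}

# Create the directory if needed, then start the capture.
# Second argument (optional): stop after this many events (-n).
capture_trace() {
    local dir="${1:-/tmp/sysdig}" max_events="${2:-}"
    mkdir -p "$dir" || return 1
    if [ -n "$max_events" ]; then
        sysdig -s 8092 -z -n "$max_events" -w "$(trace_path "$dir")"
    else
        sysdig -s 8092 -z -w "$(trace_path "$dir")"   # stop with Ctrl-C
    fi
}

# Show the file name that would be used on the host from this article.
trace_path /tmp/sysdig centdb
```

Calling "capture_trace /tmp/sysdig 500000" as root would then record at most 500000 events into the compressed trace file.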

Once we've collected data & stopped the trace, we can view the activities of users on the system by typing:


[root@centdb ~]# sysdig -r /tmp/sysdig/centdb.trace.gz -c spy_users | more
61374 21:01:53 netdata) /bin/date +%s * 1000 + %-N / 1000000
61374 21:01:53 netdata) /bin/date +%s * 1000 + %-N / 1000000
61374 21:01:54 netdata) /bin/date +%s * 1000 + %-N / 1000000
61374 21:01:55 netdata) /bin/date +%s * 1000 + %-N / 1000000
61374 21:01:56 netdata) /bin/date +%s * 1000 + %-N / 1000000
61374 21:01:57 netdata) /bin/date +%s * 1000 + %-N / 1000000
22132 21:01:57 test) uname -a
22132 21:01:58 test) date

To show the activity of a particular user only, add a filter on the user name:

[root@centdb ~]# sysdig -r /tmp/sysdig/centdb.trace.gz -c spy_users "user.name=test"
22132 21:01:57 test) uname -a
22132 21:01:58 test) date
22132 21:02:01 test) sudo su -

After the user switches to root, the commands run as root, but the PID 22132 in the first column does not change, as shown below:

[root@centdb ~]# sysdig -r /tmp/sysdig/centdb.trace.gz -c spy_users | grep 22132
22132 21:01:57 test) uname -a
22132 21:01:58 test) date
22132 21:02:01 test) sudo su -
   22132 21:02:01 root) su -
      22132 21:02:01 root) -bash
22132 21:02:01 root) id -un
22132 21:02:01 root) /bin/hostname
22132 21:02:02 root) tty -s
22132 21:02:02 root) tput colors
22132 21:02:02 root) dircolors --sh /etc/DIR_COLORS
         22132 21:02:02 root) grep -qi ^COLOR.*none /etc/DIR_COLORS
         22132 21:02:02 root) grep -qs ^PRELINKING=yes /etc/sysconfig/prelink
22132 21:02:02 root) /sbin/consoletype stdout
22132 21:02:02 root) /usr/bin/id -u
         22132 21:02:06 root) whoami
         22132 21:02:10 root) id -a
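
A plain grep for 22132 would also match that number if it happened to appear inside a command line. As a hedged alternative, the sketch below matches only on the first column of the spy_users output & prints the time, user & command for one ID. The function name session_commands is hypothetical, and the output layout shown above is assumed.

```shell
#!/usr/bin/env bash
# List the commands recorded for one ID (first column of spy_users output).
# Matching on the first field avoids hits inside the command text itself.
session_commands() {
    awk -v id="$1" '
        $1 == id {
            user = $3
            sub(/\)$/, "", user)                # "root)" -> "root"
            cmd = $0
            # Drop indentation and the first three fields (ID, time, user).
            sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", cmd)
            printf "%s %s: %s\n", $2, user, cmd
        }'
}

# Sample lines copied from the output above.
sample='61374 21:01:57 netdata) /bin/date +%s * 1000 + %-N / 1000000
22132 21:01:58 test) date
22132 21:02:01 test) sudo su -
   22132 21:02:01 root) su -
         22132 21:02:06 root) whoami'

printf '%s\n' "$sample" | session_commands 22132
```

Against the saved trace this would be run as: sysdig -r /tmp/sysdig/centdb.trace.gz -c spy_users | session_commands 22132. The switch from test to root then shows up in the user column while the ID stays the same.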

This article was a basic introduction to sysdig usage. You can check out more examples & documentation here.
